Dimension Reduction
James Zhang
MIS 6356 BA with R

Exploring the data
Statistical summary of data: common metrics
  • Average
  • Median
  • Minimum
  • Maximum
  • Standard deviation
  • Counts & percentages

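A minimal sketch of these summary metrics in base R. The values are copied from the calories and mfr columns of the cereal excerpt in these slides; the object names (`stats`, `counts`) are illustrative assumptions, not part of the original deck.

```r
# Values copied from the 13-row cereal excerpt in these slides
calories <- c(70, 120, 70, 50, 110, 110, 110, 130, 90, 90, 120, 110, 120)
mfr <- c("N", "Q", "K", "K", "R", "G", "K", "G", "R", "P", "Q", "G", "G")

# Common metrics for a numerical variable
stats <- c(
  average = mean(calories),
  median  = median(calories),
  minimum = min(calories),
  maximum = max(calories),
  std_dev = sd(calories)
)
print(round(stats, 2))

# Counts and percentages for a categorical variable
counts <- table(mfr)
print(counts)
print(round(100 * prop.table(counts), 1))
```

For a whole data frame, `summary()` reports most of these metrics for every column in one call.
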
Reducing Categories
  • A single categorical variable with m categories is typically transformed into m or m-1 dummy variables (handled automatically by most R modeling functions)
  • Each dummy variable takes the values 0 or 1: 0 = “no” for the category, 1 = “yes”
  • Problem: Can end up with too many variables
  • Solution: Reduce by combining categories that are close to each other
  • Use pivot tables to assess outcome variable sensitivity to the dummies

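The m versus m-1 dummy coding can be sketched with `model.matrix()`. The `zone` factor and its levels are hypothetical toy data, chosen only to make the column counts visible.

```r
# Hypothetical categorical variable with m = 3 categories
df <- data.frame(zone = factor(c("A", "B", "C", "A", "B")))

# m dummies: dropping the intercept gives every category its own 0/1 column
dummies_m <- model.matrix(~ zone - 1, data = df)

# m - 1 dummies: with the intercept kept, the first level ("A") becomes
# the reference category and gets no column of its own
dummies_m1 <- model.matrix(~ zone, data = df)[, -1, drop = FALSE]

print(dummies_m)
print(dummies_m1)
```

Modeling functions such as `lm()` build the m-1 coding internally when they are given a factor, which is why the dummies rarely need to be created by hand.
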
Combining Categories
Many zoning categories are the same or similar with respect to CATMEDV

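One way to check which categories behave alike and then merge them is sketched here. The data are hypothetical: `zone` stands in for the zoning variable and `outcome` for a 0/1 outcome such as CATMEDV; none of the values come from the course data.

```r
# Hypothetical data: a 4-level category and a binary outcome
df <- data.frame(
  zone    = factor(c("A", "A", "B", "B", "C", "C", "D", "D")),
  outcome = c(0, 0, 0, 0, 1, 1, 1, 0)
)

# Pivot table: mean outcome per category
pivot <- tapply(df$outcome, df$zone, mean)
print(pivot)

# Categories A and B have the same mean outcome, so merge them.
# Assigning the same name to two factor levels collapses them into one.
df$zone_reduced <- df$zone
merge_these <- levels(df$zone_reduced) %in% c("A", "B")
levels(df$zone_reduced)[merge_these] <- "A_B"

print(table(df$zone_reduced))
```

After the merge the variable has three levels instead of four, so it needs one fewer dummy variable.
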
Principal Components Analysis
  • Goal: Reduce a set of numerical variables.
  • The idea: Remove the overlap of information between these variables. [“Information” is measured by the sum of the variances of the variables.]
  • Final product: A smaller number of numerical variables that contain most of the information

Principal Components Analysis
How does PCA do this?
  • Create new variables that are linear combinations of the original variables (i.e., they are weighted averages of the original variables).
  • These new variables are uncorrelated (no information overlap), and only a few of them contain most of the original information.
  • The new variables are called principal components.

Example: Breakfast Cereals (excerpt)
name                       mfr  type  calories  protein  rating
100%_Bran                  N    C           70        4      68
100%_Natural_Bran          Q    C          120        3      34
All-Bran                   K    C           70        4      59
All-Bran_with_Extra_Fiber  K    C           50        4      94
Almond_Delight             R    C          110        2      34
Apple_Cinnamon_Cheerios    G    C          110        2      30
Apple_Jacks                K    C          110        2      33
Basic_4                    G    C          130        3      37
Bran_Chex                  R    C           90        2      49
Bran_Flakes                P    C           90        3      53
Cap'n'Crunch               Q    C          120        1      18
Cheerios                   G    C          110        6      51
Cinnamon_Toast_Crunch      G    C          120        1      20

Description of Variables
  • name: name of cereal
  • mfr: manufacturer
  • type: cold or hot
  • calories: calories per serving
  • protein: grams
  • fat: grams
  • sodium: mg
  • fiber: grams
  • carbo: grams of complex carbohydrates
  • sugars: grams
  • potass: mg
  • vitamins: % FDA rec
  • shelf: display shelf
  • weight: oz. in 1 serving
  • cups: in one serving
  • rating: consumer reports

Consider calories & ratings
Covariance matrix:
            calories   ratings
calories      379.63   -189.68
ratings      -189.68    197.32
  • Total variance (= “information”) is the sum of the individual variances: 379.63 + 197.32 ≈ 577
  • Calories accounts for 379.63/577 = 66%
  • If we want to use just calories, we lose 34% of the variation

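The variance-share arithmetic can be reproduced from the covariance matrix quoted in these slides. This is a small sketch; the object names (`S`, `total_var`, `share`) are illustrative.

```r
# Covariance matrix of calories and ratings, as quoted in the slides
S <- matrix(
  c(379.63, -189.68,
    -189.68, 197.32),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("calories", "ratings"), c("calories", "ratings"))
)

# Total variance ("information") = sum of the diagonal entries
total_var <- sum(diag(S))

# Share of the total variance carried by each original variable
share <- diag(S) / total_var
print(total_var)
print(round(share, 2))
```

Only the diagonal enters this calculation. The off-diagonal covariance is the overlap between the two variables, which is what PCA goes on to exploit.
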
Using linear combinations to redistribute the variability in a more polarized way
  • Z1 and Z2 are two new variables
  • Both are linear combinations of Rating and Calories
  • Z1 has the highest variation (spread of values)
  • Z2 has the lowest variation

PCA output for these 2 variables
> pcs <- prcomp(data.frame(cereals.df$calories, cereals.df$rating))
> summary(pcs)
Importance of components:
                           PC1     PC2
Standard deviation     22.3165  8.8844
Proportion of Variance  0.8632  0.1368
Cumulative Proportion   0.8632  1.0000
> pcs$rotation
                           PC1       PC2
cereals.df.calories  0.8470535 0.5315077
cereals.df.rating   -0.5315077 0.8470535
  • The rotation matrix holds the weights used to project the (mean-centered) original data onto Z1 & Z2, e.g. (0.847, -0.532) are the weights for Z1. The overall sign of a component is arbitrary, so (-0.847, 0.532) describes the same direction.
  • 86% of the total variance is accounted for by component 1

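The weights and variance shares can be approximated without the raw data by taking the eigen decomposition of the covariance matrix quoted earlier in these slides. This sketch is not the call the slides use: it substitutes `eigen()` for `prcomp()`, and because the quoted matrix is rounded (and may come from a slightly different set of rows), the results should land close to, but not exactly on, the `prcomp` output.

```r
# Covariance matrix of calories and rating, as quoted in the slides
S <- matrix(c(379.63, -189.68,
              -189.68, 197.32), nrow = 2, byrow = TRUE)

eig <- eigen(S, symmetric = TRUE)

# Eigenvalues are the variances of Z1 and Z2, largest first
pc_var   <- eig$values
prop_var <- pc_var / sum(pc_var)

# Each eigenvector column holds the weights of one component;
# the sign of a column is arbitrary
weights_z1 <- eig$vectors[, 1]

print(round(sqrt(pc_var), 2))   # standard deviations of the components
print(round(prop_var, 3))       # proportion of variance per component
print(round(weights_z1, 3))     # weights on (calories, rating) for Z1
```

The sum of the eigenvalues equals the sum of the original variances, so the components redistribute the total "information" rather than adding to it or removing any.
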
Principal Component Scores for the First Five Records
            PC1         PC2
[1,] -44.921528   2.1971833
[2,]  15.725265  -0.3824165
[3,] -40.149935  -5.4072123
[4,] -75.310772  12.9991256
[5,]   7.041508  -5.3576857

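Scores are the mean-centered data multiplied by the weights. The sketch below checks this against `prcomp` using only the 13 rows of the cereal excerpt in these slides (with ratings as rounded there), so its numbers will not match the full-data scores listed above.

```r
# calories and rating from the 13-row excerpt in these slides
calories <- c(70, 120, 70, 50, 110, 110, 110, 130, 90, 90, 120, 110, 120)
rating   <- c(68, 34, 59, 94, 34, 30, 33, 37, 49, 53, 18, 51, 20)
X <- cbind(calories, rating)

pcs <- prcomp(X)

# Manual scores: subtract the column means, then apply the weights
X_centered    <- scale(X, center = TRUE, scale = FALSE)
manual_scores <- X_centered %*% pcs$rotation

# prcomp stores the same scores in pcs$x
print(head(pcs$x, 5))
print(max(abs(manual_scores - pcs$x)))
```

Because the data are centered first, every score column has mean zero. A record with a negative Z1 score sits below the average along the first component's direction.
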
PCA for the 13 Numerical Variables in the Cereals Data
> pcs <- prcomp(na.omit(cereals.df[,-c(1:3)]))
> summary(pcs)
  • The first two components account for 93% of the total variance, so using 2-3 components in further modeling would probably be sufficient

The Weightings for the First Five Components
Generalization
  • X1, X2, X3, …, Xp: original p variables
  • Z1, Z2, Z3, …, Zp: weighted averages of original variables
  • All pairs of Z variables have 0 correlation
  • Order Z’s by variance (Z1 largest, Zp smallest)
  • Usually the first few Z variables contain most of the information, and so the rest can be dropped.

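These properties can be checked numerically. The sketch uses three columns (calories, protein, rating) from the 13-row cereal excerpt in these slides, so p = 3. The object names are illustrative.

```r
# Three numerical columns from the cereal excerpt (p = 3)
calories <- c(70, 120, 70, 50, 110, 110, 110, 130, 90, 90, 120, 110, 120)
protein  <- c(4, 3, 4, 4, 2, 2, 2, 3, 2, 3, 1, 6, 1)
rating   <- c(68, 34, 59, 94, 34, 30, 33, 37, 49, 53, 18, 51, 20)
X <- cbind(calories, protein, rating)

pcs <- prcomp(X)
Z   <- pcs$x   # the new variables Z1, Z2, Z3

# All pairs of Z variables have (numerically) zero correlation
print(round(cor(Z), 10))

# Variances are ordered from largest (Z1) to smallest (Zp)
z_var <- pcs$sdev^2
print(z_var)

# Cumulative share of the total variance, used to decide how many Z's to keep
cum_share <- cumsum(z_var) / sum(z_var)
print(round(cum_share, 4))
```

A common way to choose how many components to keep is to read off the smallest number whose cumulative share passes a chosen threshold, for example 90%.
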
Normalizing data
  • In these results, sodium dominates the first PC
  • This is just because of the way it is measured (mg): its scale is greater than almost all other variables
  • Hence its variance will be a dominant component of the total variance
  • Normalize each variable to remove the scale effect: divide by std. deviation (may subtract mean first)
  • Normalization (= standardization) is usually performed in PCA; otherwise measurement units affect results
Normalize the variables:
> pcs.cor <- prcomp(na.omit(cereals.df[,-c(1:3)]), scale. = TRUE)

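The scale effect can be seen on any data set whose variables have very different variances. The sketch below uses R's built-in `USArrests` data purely as a stand-in for the cereals file, which is not included here. In `USArrests` the Assault column plays the role that sodium plays in the slides.

```r
# Stand-in data: USArrests ships with base R (datasets package)
raw_var <- sapply(USArrests, var)
print(round(raw_var, 1))        # Assault has by far the largest variance

# Without normalization, the first PC is dominated by the large-scale variable
pcs_raw <- prcomp(USArrests)
dominant_raw <- names(which.max(abs(pcs_raw$rotation[, 1])))
print(round(pcs_raw$rotation[, 1], 3))

# With normalization, each variable contributes variance 1
pcs_cor <- prcomp(USArrests, scale. = TRUE)
print(round(pcs_cor$rotation[, 1], 3))

# Normalized PCA is the same as PCA on the correlation matrix
eig_cor <- eigen(cor(USArrests), symmetric = TRUE)$values
print(all.equal(pcs_cor$sdev^2, eig_cor))
```

After normalization the total variance equals the number of variables, so each component's variance can be read as "how many original variables' worth" of information it carries.
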
PCA Output Using all 13 Normalized Numerical Variables
Weightings for the First Five Normalized Components
PCA in Classification/Prediction
  • Apply PCA to training data
  • Decide how many PC’s to use
  • Use variable weights in those PC’s with validation/new data
  • This creates a new reduced set of predictors in validation/new data

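The steps above can be sketched with `prcomp` and its `predict` method. The built-in `USArrests` data, the 35/15 row split, and the choice of two components are all assumptions made for illustration; they are not taken from the course material.

```r
# Stand-in data and an arbitrary training/validation split
train <- USArrests[1:35, ]
valid <- USArrests[36:50, ]

# 1. Apply PCA to the training data only
pcs <- prcomp(train, scale. = TRUE)

# 2. Decide how many PCs to use (two here, as an assumption)
k <- 2
print(summary(pcs)$importance[, 1:k])

# 3. Apply the training weights (and training centering/scaling)
#    to the validation data
valid_scores <- predict(pcs, newdata = valid)[, 1:k]

# The same projection written out by hand
manual <- scale(valid, center = pcs$center, scale = pcs$scale) %*%
  pcs$rotation[, 1:k]

print(head(valid_scores, 3))
```

The validation rows never influence the weights, the means, or the standard deviations, which keeps the reduced predictors free of information leaked from the validation set.
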
Summary
  • Data reduction is useful for compressing the information in the data into a smaller subset
  • Categorical variables can be reduced by combining similar categories
  • Principal components analysis transforms an original set of numerical data into a smaller set of weighted averages of the original data that contain most of the original information in fewer variables.