PCA Primer


Colm Howlin
October 21, 2018

Principal Component Analysis - A Primer

lab.realizeitlearning.com/resource/2018/10/22/Principal-Component-Analysis-A-Primer

This is a resource post that gives a very brief overview of Principal Component Analysis (PCA). We create these resource posts to provide a basic introduction to some techniques and algorithms that we use in our research. Rather than clogging up the original article with lengthy explanations, we've put them into posts like this so that you can refer to them if needed. The idea is not to give all the details but just enough to understand why we used the approach and what benefits it gives us.

Principal Component Analysis is a dimension reduction technique. It takes a set of variables, looks for relationships between them, and uses those relationships to produce, where possible, a smaller set of components that we can use in our analysis. In the following I’m going to give some basic detail on how it achieves this, and what these components are, using a simple example.

A Simple Example

Imagine we have a collection of some objects - it doesn’t matter what they are. We have a weighing scale and a measuring tape, and we can measure only two variables about each object - its weight and its length. In the following graph, I’ve plotted the weight versus length for our collection of objects.

Figure: A sample set of data showing the relationship between the length and weight of some objects.

One thing we notice immediately is that the two variables appear to be highly correlated - all the points fall approximately along a straight line. For the sample points, the correlation is 0.95. Since the two variables are so highly correlated, when we know the value of one variable, we can make an excellent guess at the value of the other one. So we don’t really have two different pieces of information about each object. If, for example, we know the weight of an object, we can get a reasonable estimate of its length from the graph (or from building a linear regression model). So actually measuring the length of the object only gives a small bit of extra information - the error in our estimate, which should be small given the two variables are so highly correlated.

We could now decide to use only one of these variables in our analysis, as including both would tell us very little extra - we can translate anything we learn from one variable to the other. This is all relatively straightforward in this simple example, but as we include more variables in our data we have to consider far more relationships between variables and sets of variables. This is where PCA comes in. It helps us achieve the same reduction in a more rigorous way. Let us continue to use our simple data set to see how PCA works.

Extracting the Components

Now that we have our data plotted, imagine drawing a new set of axes on top of this data and measuring the distance from each point to the new set of axes, just like in the following graph. We end up with each object measured on these two new variables, which we could use in place of our original two variables. We haven’t lost any information, but it would be challenging to interpret our analysis in real-world terms using these two new random axes.
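The two ideas above, a high correlation between length and weight and re-measuring every object against an arbitrary new pair of axes, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data is synthetic (the post's actual sample points are not available), and the helper names `make_sample`, `correlation`, and `rotate`, as well as the 25-degree angle, are hypothetical choices made for this sketch.

```python
import math
import random


def make_sample(n=200, seed=0):
    """Synthetic (length, weight) pairs. The numbers are made up for
    illustration and are not the data set used in the post."""
    rng = random.Random(seed)
    lengths = [rng.gauss(10.0, 1.7) for _ in range(n)]
    # Weight follows length closely, plus a little independent noise.
    weights = [length + rng.gauss(0.0, 0.5) for length in lengths]
    return lengths, weights


def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def rotate(xs, ys, angle):
    """Coordinates of each point measured against a pair of
    perpendicular axes rotated by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    us = [c * x + s * y for x, y in zip(xs, ys)]
    vs = [-s * x + c * y for x, y in zip(xs, ys)]
    return us, vs


lengths, weights = make_sample()
r = correlation(lengths, weights)
print(f"correlation between length and weight: {r:.3f}")

# Measure every object against an arbitrary (non-PCA) set of axes.
angle = math.radians(25)
u, v = rotate(lengths, weights, angle)

# Rotating back by the opposite angle recovers the original measurements,
# which is the sense in which no information has been lost.
back_x, back_y = rotate(u, v, -angle)
max_error = max(abs(a - b) for a, b in zip(back_x + back_y, lengths + weights))
print(f"largest round-trip error: {max_error:.2e}")
```

The new coordinates `u` and `v` describe the objects just as completely as length and weight do, but an axis chosen at an arbitrary angle has no obvious real-world meaning, which is the interpretability problem the text describes.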

Figure: The sample data can be measured against any set of new axes that we want.

PCA creates a new set of axes, but it doesn’t just draw any random set. It finds the best set subject to some given criteria. There is a lot of detail here that I’m glossing over, but understanding this should be sufficient for the rest of what we do. These new axes are the new components we mentioned earlier. We have two variables in our sample data, so we end up with two components. If we had 13 variables, PCA would generate 13 components.

So how do these new components help us? In the following graph, I’ve plotted our sample data points on the two components given to us by PCA. Notice how the data is still all spread out along a line, but that line is now pretty much horizontal.

Figure: The sample data plotted using the new components.

Reducing the Dimensions

Let’s look at the variance of the data points on our original variables and on our new components to see how things have changed. The variances of the length and weight of the objects are 2.89 and 2.92 respectively. We can think of the variance of a variable as how much information that variable contains. For example, if the variance is small, then all the points are close to the mean value, and if you know the mean, you have an excellent approximation for all the data points. If the variance is large, then the mean alone is not a very accurate approximation of the individual data points.
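As a rough sketch of this variance comparison, the following Python computes the two principal components for a two-variable data set and reports the variance on the original variables and on the components. It assumes the usual PCA criterion, which the text above does not spell out: the first axis points in the direction of greatest variance and the second is perpendicular to it. For two variables that direction has a closed form, so no linear algebra library is needed. The data is synthetic, so the printed variances will not match the 2.89 and 2.92 quoted above, and the names `make_sample`, `variance`, `covariance`, and `pca_2d` are hypothetical helpers for this illustration.

```python
import math
import random


def make_sample(n=200, seed=0):
    """Synthetic (length, weight) pairs; illustrative values only."""
    rng = random.Random(seed)
    lengths = [rng.gauss(10.0, 1.7) for _ in range(n)]
    weights = [length + rng.gauss(0.0, 0.5) for length in lengths]
    return lengths, weights


def covariance(xs, ys):
    """Sample covariance of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)


def variance(xs):
    """Sample variance, i.e. the covariance of a list with itself."""
    return covariance(xs, xs)


def pca_2d(xs, ys):
    """Scores of each point on the two principal components.

    For the 2x2 covariance matrix [[a, b], [b, c]], the direction of
    greatest variance makes the angle 0.5 * atan2(2b, a - c) with the
    x axis. The second component is perpendicular to the first.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a, c, b = variance(xs), variance(ys), covariance(xs, ys)
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    co, si = math.cos(theta), math.sin(theta)
    # Centre the data, then measure it against the rotated axes.
    pc1 = [co * (x - mx) + si * (y - my) for x, y in zip(xs, ys)]
    pc2 = [-si * (x - mx) + co * (y - my) for x, y in zip(xs, ys)]
    return pc1, pc2


lengths, weights = make_sample()
pc1, pc2 = pca_2d(lengths, weights)

total_original = variance(lengths) + variance(weights)
total_components = variance(pc1) + variance(pc2)
share_pc1 = variance(pc1) / total_components

print(f"variance of length, weight: {variance(lengths):.2f}, {variance(weights):.2f}")
print(f"variance of PC1, PC2:       {variance(pc1):.2f}, {variance(pc2):.2f}")
print(f"share of total variance carried by PC1: {share_pc1:.1%}")
```

Two variables go in and two components come out, and the total variance is the same before and after. What changes is how it is divided: on strongly correlated data like this, nearly all of the variance ends up on the first component, which is why the points lie along an almost horizontal line when plotted on the components.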
