Principal Component Analysis
Step by Step Walk Through
Paul Biliniski
Purpose
• Find patterns in data with many dimensions
• Reduces the number of dimensions, analysis becomes easier
• The eigen-decomposition step ONLY WORKS ON SQUARE MATRICES, which is why PCA operates on the (always square) covariance matrix
Mathematical Concepts
• Measures of Spread in 1 Dimension
• Standard Deviation – spread of data from mean
• Variance
• Measure of Spread in 2 Dimensions
• Covariance – how two data sets vary together, to see if they change
at similar rates; the sign is important
• Covariance Matrix
• Matrix of all of the covariances between each pair of data sets
• Eigenvector
• A vector whose direction is unchanged by a matrix transformation;
the matrix only scales it
• Eigenvalue
• Amount by which the eigenvector is scaled
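The eigenvector/eigenvalue definitions above can be checked numerically. A minimal sketch with a small hypothetical matrix (not the walkthrough's data): multiplying an eigenvector by the matrix only scales it.

```python
# Eigenvector check: A times v should equal the eigenvalue times v.
# Example matrix is hypothetical, not the walkthrough's data.
A = [[2.0, 1.0],
     [1.0, 2.0]]
v = [1.0, 1.0]          # an eigenvector of A
eigenvalue = 3.0        # its eigenvalue

Av = [A[0][0] * v[0] + A[0][1] * v[1],
      A[1][0] * v[0] + A[1][1] * v[1]]
scaled = [eigenvalue * c for c in v]
# Av equals scaled: the matrix stretches v but does not rotate it.
```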
Step 1: Data, Subtract Means
• Find the Mean of each component of the data set
• Subtract that mean from each of the components
Height (CM)   OFC (CM)
161.1         56.1
179.8         57.5
186.3         60.1
163.9         56.6
190           59.8
179.9         58
177.9         59.3
195           59.9
• Mean of Height: 173.9366667 Mean of OFC: 57.59333333
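In code, the centering step might look like this (a minimal pure-Python sketch using the eight rows shown above; the slide's stated means differ slightly, suggesting they were computed over a larger sample):

```python
# Step 1: centre each variable by subtracting its mean.
heights = [161.1, 179.8, 186.3, 163.9, 190.0, 179.9, 177.9, 195.0]
ofc = [56.1, 57.5, 60.1, 56.6, 59.8, 58.0, 59.3, 59.9]

def centre(xs):
    """Subtract the mean from every value in the list."""
    m = sum(xs) / len(xs)
    return [x - m for x in xs]

h_centred = centre(heights)
o_centred = centre(ofc)
# Each centred list now sums to (numerically) zero.
```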
Step 1: Data Graphed
• Clearly, there is a linear relationship
[Scatter plot of OFC (cm) against Height (cm) with a linear trend line; Height 150–200 cm, OFC 54–62 cm]
Step 1: Subtract Means
Height (CM)     OFC (CM)
-12.83666667    -1.665789474
5.863333333     -0.265789474
12.36333333     2.334210526
-10.03666667    -1.165789474
16.06333333     2.034210526
5.963333333     0.234210526
3.963333333     1.534210526
21.06333333     2.134210526
Step 2: Covariance Matrix
• Calculate the covariance matrix

[ 104.901023     15.36128736 ]
[ 15.36128736    2.791678161 ]

• The diagonal holds the variance of each data set; the off-diagonal
entries hold the covariance
• All positive values tell us that as data set 1 increases, data set 2
also increases. Verified by the graph
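A minimal sketch of this step in pure Python, using the sample covariance with the n – 1 denominator:

```python
# Step 2: build the 2x2 covariance matrix of two variables.
def cov(xs, ys):
    """Sample covariance (n - 1 denominator)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def covariance_matrix(xs, ys):
    # Diagonal: variances.  Off-diagonal: the (symmetric) covariance.
    return [[cov(xs, xs), cov(xs, ys)],
            [cov(xs, ys), cov(ys, ys)]]
```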
Step 3: Calculate Eigens
• Matrix A times an eigenvector x equals the eigenvalue (Λ) times x
• Ax = Λx
• Each eigenvalue is found with:
• det (A – ΛI) = 0
• That is, take the determinant of the original matrix with Λ
subtracted along the diagonal, and set it to zero
• For our 2x2 matrix, this gives a quadratic equation in Λ
• Use the quadratic formula to solve
Step 3: Calculate Eigens
• So for our situation, we use the 2x2 matrix of the covariances:

[ 104.901023     15.36128736 ]
[ 15.36128736    2.791678161 ]

• So the determinant for this is:

det [ 104.901023 – Λ    15.36128736     ] = 0
    [ 15.36128736       2.791678161 – Λ ]

• (104.901023 – Λ) * (2.791678161 – Λ) – 15.36128736^2 = 0
• So, the eigenvalues are 0.54 and 107.161!
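The quadratic above can be solved directly. A minimal sketch for any symmetric 2x2 matrix [[a, b], [b, d]]:

```python
import math

# Step 3a: eigenvalues of a symmetric 2x2 matrix [[a, b], [b, d]].
# det(A - L*I) = L^2 - (a + d)*L + (a*d - b*b) = 0, solved with the
# quadratic formula.
def eigenvalues_2x2(a, b, d):
    trace = a + d
    det = a * d - b * b
    disc = math.sqrt(trace * trace - 4.0 * det)
    return (trace + disc) / 2.0, (trace - disc) / 2.0

big, small = eigenvalues_2x2(104.901023, 15.36128736, 2.791678161)
# big is about 107.16 and small about 0.53, matching the slide's
# 107.161 and 0.54 up to rounding.
```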
Step 3: Calculate Eigens
• With the Eigens solved, we can now solve (A – ΛI)v = 0 for the
vectors in the null space; first just use one of the eigenvalues as
the Λ

[ 104.901023 – 0.54    15.36128736        ] [X]   [0]
[ 15.36128736          2.791678161 – 0.54 ] [Y] = [0]

[ 104.36          15.36128736 ] [X]   [0]
[ 15.36128736     2.25        ] [Y] = [0]

• Row reducing, the second row is (up to the rounding of Λ) a
multiple of the first, so the system collapses to a single equation:
• 104.36X + 15.36Y = 0, i.e. Y ≈ -6.79X
• Scaling so the second component is 1, the eigenvector for the
value 0.54 is
[ -0.146 ]
[ 1      ]
• Apply this same technique to the eigenvalue of 107.161, and get
[ 1     ]
[ 0.146 ]
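The null-space step can be sketched in code. For a symmetric 2x2 matrix, the first row of (A – ΛI)v = 0 gives (a – Λ)X + bY = 0, so setting X = 1 yields Y = (Λ – a)/b; the other eigenvector is perpendicular:

```python
# Step 3b: eigenvector of a symmetric 2x2 [[a, b], [b, d]] for a
# known eigenvalue lam.  From the first row of (A - lam*I) v = 0:
# (a - lam)*x + b*y = 0, so with x = 1, y = (lam - a) / b.
def eigenvector_2x2(a, b, lam):
    return [1.0, (lam - a) / b]

a, b, d = 104.901023, 15.36128736, 2.791678161
v_big = eigenvector_2x2(a, b, 107.1619)   # about [1, 0.147]
# The other eigenvector of a symmetric matrix is perpendicular:
v_small = [-v_big[1], v_big[0]]           # about [-0.147, 1]
```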
Step 3: Calculate Eigens
• Now let's see how our Eigens look on our graph of subtracted
means data… looks like a line of best fit, reasonable.
[Scatter plot of the mean-subtracted data (Series1) with both eigenvectors (Eigen1, Eigen2) and their trend lines overlaid; Transformed Height -20 to 25 vs Transformed OFC -4 to 4]
Step 3: Calculate Eigens
• The eigenvector with the highest eigenvalue is considered the
principal component of the data set. Set up a matrix whose rows are
your two eigenvectors, so you can transform the data.

[ 1       0.146 ]
[ -0.146  1     ]

• Note that row 1 is the eigenvector associated with our 107
eigenvalue, the bigger one
• Now, we get back to the data…
Step 4: Transform Data
• Using the 2x2 matrix of eigenvectors, multiply each data point
written as a 2-row, 1-column matrix (X value on top of Y value)

[ 1       0.146 ] [ -12.83666667 ]   [ -13.05469333 ]
[ -0.146  1     ] [ -1.493333333 ] = [ 0.38082      ]

• Continue this for EVERY set of points…
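A minimal sketch of the transform, using the eigenvector matrix from the previous slide (rows are the eigenvectors, principal component first):

```python
# Step 4: rotate each centred point into the eigenbasis.
W = [[1.0, 0.146],
     [-0.146, 1.0]]

def transform(point):
    """Multiply the 2x2 eigenvector matrix by a column vector."""
    x, y = point
    return [W[0][0] * x + W[0][1] * y,
            W[1][0] * x + W[1][1] * y]

# First centred point from the slide:
p = transform([-12.83666667, -1.493333333])
# p is about [-13.0547, 0.3808], matching the slide.
```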
Step 4: Transform Data
• This is what the data should look like after this eigenvector
multiplication step.
[Scatter plot of the transformed data; Transformed Height -20 to 25 vs Transformed OFC -1.5 to 1.5]
Step 5: Define Noise
• One of the axes is going to be the noise that we assume occurs as
a result of sampling. Choose one, in this case the Y values, which
correspond to the smaller eigenvalue.

[ 1       0.146 ] [ -12.83666667 ]   [ -13.05469333 ]      [ -13.05469333 ]
[ -0.146  1     ] [ -1.493333333 ] = [ 0.38082      ]  →   [ 0            ]
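In code, this step is just zeroing the second coordinate of each transformed point (a sketch, keeping the convention that the principal component is the first coordinate):

```python
# Step 5: treat the second (smaller-eigenvalue) coordinate as
# sampling noise and zero it, keeping only the principal component.
def drop_noise(transformed_point):
    return [transformed_point[0], 0.0]

p_clean = drop_noise([-13.05469333, 0.38082])
# p_clean == [-13.05469333, 0.0]
```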
Step 6: Getting back the Data
• Use the new noise-less data to plot your new points. Multiply the
data without noise by the transpose of the eigen matrix, which
undoes the rotation:

[ 1      -0.146 ] [ -13.05469333 ]   [ -13.05469333 ]
[ 0.146   1     ] [ 0            ] = [ -1.905985227 ]

• Repeat for all of your points. Now, add the means of each variable
back to the data:

[ -13.05469333 ]   [ 173.9366667 ]   [ 160.8819733 ]
[ -1.905985227 ] + [ 57.59333333 ] = [ 55.68734811 ]
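A sketch of the reconstruction step. The rows of the eigenvector matrix are close to orthonormal here, so its transpose approximately inverts the rotation:

```python
# Step 6: rotate a noise-free point back with the transpose of the
# eigenvector matrix, then add the means back.
Wt = [[1.0, -0.146],
      [0.146, 1.0]]

def reconstruct(point, means):
    x, y = point
    back = [Wt[0][0] * x + Wt[0][1] * y,
            Wt[1][0] * x + Wt[1][1] * y]
    return [back[0] + means[0], back[1] + means[1]]

q = reconstruct([-13.05469333, 0.0], [173.9366667, 57.59333333])
# q is about [160.88, 55.69], matching the slide.
```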
Step 7: Victory
• Plot your new data points. They now fall on a single line without
noise. This is the new axis against which you can plot another
component. Keep adding variables into each component until there is
no longer a linear relationship. That will show you which
components cause the variation in your data.
[Scatter plot of the reconstructed points falling on one line; HEIGHT (cm) 0 to 250 vs OFC (cm) 55 to 61]
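All of the steps above can be combined into one end-to-end sketch for 2-D data: centre, build the covariance matrix, take the larger eigenvalue and its unit eigenvector, project each point onto it, and reconstruct (this assumes a nonzero covariance, as in the walkthrough's data):

```python
import math

# End-to-end 2-D PCA denoising: centre, covariance, eigens, project
# onto the principal component, reconstruct, and add the means back.
def pca_denoise_2d(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cx = [x - mx for x in xs]
    cy = [y - my for y in ys]
    a = sum(v * v for v in cx) / (n - 1)               # var(x)
    d = sum(v * v for v in cy) / (n - 1)               # var(y)
    b = sum(u * v for u, v in zip(cx, cy)) / (n - 1)   # cov(x, y)
    # Larger eigenvalue of [[a, b], [b, d]] and its unit eigenvector
    # (assumes b != 0, i.e. the variables actually covary).
    lam = (a + d + math.sqrt((a + d) ** 2 - 4 * (a * d - b * b))) / 2
    ex, ey = 1.0, (lam - a) / b
    norm = math.hypot(ex, ey)
    ex, ey = ex / norm, ey / norm
    out = []
    for x, y in zip(cx, cy):
        s = x * ex + y * ey                    # principal score
        out.append((s * ex + mx, s * ey + my)) # back-project + means
    return out
```

Points that already lie exactly on a line come back unchanged, which is a quick sanity check on the whole pipeline.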