
Principal Components Analysis

AU STAT627

Emil Hvitfeldt

2021-06-07


More Unsupervised Learning

Another branch of unsupervised learning

  • Dimensionality Reduction

Take many dimensions and create fewer ones that represent as much of the original data as possible


Dimensionality Reduction

Why would you want to do this?

  • Allows for visualization of many dimensions in 2 dimensions
  • Can be used as a preprocessing step for models that can't handle many dimensions

What is Principal Components Analysis?

  • Motivation

Suppose you have many dimensions and want to visualize the relationships between them

If you wanted to plot them pairwise, you would have {p \choose 2} = \dfrac{p(p-1)}{2} plots to make

This adds up very fast!
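For example, with p = 10 variables that is already {10 \choose 2} = 45 plots, and with p = 40 it is 780.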


Many plots

[Figure: a grid of pairwise scatter plots]


What is Principal Components Analysis?

We want to find a low-dimensional representation of the high-dimensional data

Specifically, we want a 2-dimensional representation for plotting purposes

PCA is one technique that does just that

PCA finds a low-dimensional representation of the data set that contains as much of the variation as possible in as few columns as possible


Re-formulation

PCA constructs linear combinations of the original variables such that most of the variation is captured in the first new variable, then the second, then the third, and so on


PCA Construction

The first principal component of a set of features X_1, X_2, ..., X_p is the normalized linear combination of the features

Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + ... + \phi_{p1} X_p

that has the largest variance.

By normalized, we mean that \sum_{j=1}^p \phi_{j1}^2 = 1.

We refer to \phi_{11}, ..., \phi_{p1} as the loadings of the first principal component, and we think of them as the loading vector \phi_1.
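As a concrete illustration (a minimal Python sketch with scikit-learn on toy data, not part of the slides): the fitted loading vector satisfies the normalization constraint, and the scores are the centered data projected onto it.

```python
# Minimal sketch: first principal component with scikit-learn (toy data).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # n = 100 observations, p = 5 features

pca = PCA().fit(X)
phi_1 = pca.components_[0]           # loading vector phi_1

print(np.sum(phi_1**2))              # normalization constraint: equals 1
z_1 = (X - X.mean(axis=0)) @ phi_1   # scores z_{11}, ..., z_{n1}
print(np.allclose(z_1, pca.transform(X)[:, 0]))  # matches scikit-learn's scores
```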


PCA Construction

The loadings must be constrained; otherwise there is no solution, since arbitrarily large loadings would give arbitrarily large variance


How do we get these?

Assume we have an n \times p data set \mathbf{X}

Since we are only interested in the variance, we assume that the variables have been centered (each column has mean zero)

\underset{\phi_{11}, ..., \phi_{p1}}{\text{maximize}} \left\{ \dfrac{1}{n} \sum^n_{i=1} \left( \sum^p_{j=1} \phi_{j1}x_{ij} \right)^2 \right\} \quad \text{subject to} \quad \sum_{j=1}^p \phi_{j1}^2 = 1


How do we get these?

Since z_{i1} = \phi_{11} x_{i1} + \phi_{21} x_{i2} + ... + \phi_{p1}x_{ip}, we can rewrite the problem as

\underset{\phi_{11}, ..., \phi_{p1}}{\text{maximize}} \left\{ \dfrac{1}{n} \sum^n_{i=1} z_{i1}^2 \right\} \quad \text{subject to} \quad \sum_{j=1}^p \phi_{j1}^2 = 1

We are in essence maximizing the sample variance of the n values of z_{i1}.

We refer to z_{11}, ..., z_{n1} as the scores of the first principal component.


How do we solve that problem?

Luckily, this problem can be solved with techniques from linear algebra

More specifically, it can be solved with an eigendecomposition of the covariance matrix

One of the main strengths of PCA is that you don't need iterative optimization: you get exact results without approximation
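A minimal NumPy sketch of that route (toy data assumed, not from the slides): the first loading vector is the top eigenvector of the sample covariance matrix, and its eigenvalue is the variance of the first scores.

```python
# Minimal sketch: solving the maximization exactly with an eigendecomposition.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)                  # center the variables

S = (X.T @ X) / len(X)                  # sample covariance matrix (1/n convention)
eigenvalues, eigenvectors = np.linalg.eigh(S)  # eigh sorts eigenvalues ascending

phi_1 = eigenvectors[:, -1]             # top eigenvector = first loading vector
print(np.allclose(np.var(X @ phi_1), eigenvalues[-1]))  # variance = top eigenvalue
```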


Remaining principal components

Once the first principal component is calculated, we can calculate the second principal component

We find the second principal component Z_2 as a linear combination of X_1, ..., X_p that has the maximal variance out of the linear combinations that are uncorrelated with Z_1

This is the same as saying that the direction \phi_2 must be orthogonal to the direction \phi_1
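A quick check of this property (a Python sketch with scikit-learn on toy data, not from the slides): the loading vectors come out mutually orthogonal.

```python
# Minimal sketch: PCA loading vectors are orthonormal.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))

components = PCA().fit(X).components_   # rows are phi_1, phi_2, ..., phi_p
print(np.allclose(components @ components.T, np.eye(5)))  # orthonormal rows
```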


Remaining principal components

We can continue in this way to calculate all the principal components

Since we work through the principal components iteratively, we can calculate only as many as we want


The proportion of variance explained

The proportion of variance explained by the mth principal component is given by

\dfrac{\sum_{i=1}^n \left( \sum_{j=1}^p \phi_{jm}x_{ij} \right)^2}{\sum_{j=1}^p\sum_{i=1}^n x_{ij}^2}

Don't worry: the software you use to fit the PCA already calculates this for you
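For instance (a Python sketch with scikit-learn, toy data assumed), the proportion of variance explained is exposed directly:

```python
# Minimal sketch: proportion of variance explained per component.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))

pca = PCA().fit(X)
print(pca.explained_variance_ratio_)             # PVE for each component (sums to 1)
print(np.cumsum(pca.explained_variance_ratio_))  # cumulative PVE
```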


Visualizing PCA

Once we have the principal components there are a couple of things we can visualize


[Illustration: art by Allison Horst]

Plotting PC1 against PC2

[figure]

Plotting PC1 against PC3

[figure]

Plotting PC2 against PC3

[figure]

Loadings for PC1

[figure]

Loadings for all principal components

[figure]

Percent variance explained by each principal component

[figure]

Alternative interpretation

Interpretation as a rotation of the space


Scaling of variables

You must think about scaling the variables

Since we are maximizing a variance, the magnitude of the variables matters

If you don't have any prior knowledge of the data, it is advisable to put all the variables on the same scale (e.g., unit variance)
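A small demonstration (Python sketch with scikit-learn; the scales here are made up for illustration): without scaling, the variable with the largest units dominates the first loading vector.

```python
# Minimal sketch: the effect of scaling variables before PCA.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3)) * [1, 10, 1000]   # three wildly different scales

print(PCA().fit(X).components_[0])              # dominated by the third column
X_scaled = StandardScaler().fit_transform(X)    # center and scale to unit variance
print(PCA().fit(X_scaled).components_[0])       # loadings now comparable
```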


Uniqueness of Principal Components

The principal components you generate should be unique up to a sign-flip of the loadings


How is this a dimensionality reduction technique?

PCA is not a dimensionality reduction method by itself in the strictest sense

You get the reduction by keeping only some of the columns, as sketched below

  • Keep a fixed number of columns
  • Threshold by variance explained
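Both strategies in one sketch (Python with scikit-learn, toy data assumed); passing a float between 0 and 1 as n_components keeps enough components to reach that fraction of variance explained:

```python
# Minimal sketch: reducing dimension by keeping only some components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 10))

X_2d = PCA(n_components=2).fit_transform(X)     # keep a fixed number of columns
X_90 = PCA(n_components=0.9).fit_transform(X)   # threshold: explain 90% of variance
print(X_2d.shape, X_90.shape)
```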

Extensions

Think of this problem as

X\approx U V

where

  • X is an n \times p matrix
  • U is an n \times d matrix
  • V is a d \times p matrix
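One way to compute such a factorization (a NumPy sketch under the shapes above; the truncated SVD gives the best rank-d approximation in a least-squares sense):

```python
# Minimal sketch: a rank-d factorization X ≈ U V via the truncated SVD.
import numpy as np

rng = np.random.default_rng(6)
n, p, d = 100, 8, 2
X = rng.normal(size=(n, p))

left, sigma, right = np.linalg.svd(X, full_matrices=False)
U = left[:, :d] * sigma[:d]              # n x d
V = right[:d, :]                         # d x p
print(np.linalg.norm(X - U @ V))         # reconstruction error of the rank-d fit
```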

Extensions

What we want to do is minimize

\sum_{i=1}^n \sum_{j=1}^p \text{loss}\left( X_{ij}, (UV)_{ij} \right)

subject to some constraints


PCA

\sum_{i=1}^n \sum_{j=1}^p \left( X_{ij} - (UV)_{ij} \right)^2

with no constraints


Sparse PCA

\sum_{i=1}^n \sum_{j=1}^p \left( X_{ij}- (UV)_{ij} \right)^2

Under the constraint that each loading vector (each row of V) contains at most k non-zero entries

In other words, each principal component can contain at most k non-zero loadings
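scikit-learn's SparsePCA is one implementation to try (a hedged sketch: it encourages sparsity with an L1 penalty rather than the hard at-most-k constraint above, so it is a relaxation of this formulation):

```python
# Minimal sketch: sparse loadings via scikit-learn's SparsePCA (L1 penalty).
import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 8))

spca = SparsePCA(n_components=2, alpha=1.0, random_state=0).fit(X)
print(spca.components_)                  # many loadings are driven exactly to zero
```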


K-Means

\sum_{i=1}^n \sum_{j=1}^p \left( X_{ij}- (UV)_{ij} \right)^2

Under the constraint that each row of U contains exactly one non-zero entry, so each observation is assigned to a single cluster (the rows of V are the cluster centers)
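In code (a Python sketch with scikit-learn; the one-hot matrix U and the center matrix V are constructed here purely for illustration):

```python
# Minimal sketch: k-means in the X ≈ UV framing.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(8)
X = rng.normal(size=(100, 4))

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
U = np.eye(3)[km.labels_]                # n x 3: one-hot cluster assignments
V = km.cluster_centers_                  # 3 x p: cluster centers
print(np.allclose(U @ V, V[km.labels_])) # U V maps each point to its center
```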


Non-Negative Matrix Factorization

\sum_{i=1}^n \sum_{j=1}^p \left( X_{ij}- (UV)_{ij} \right)^2

Under the constraint that all the values of U and V are non-negative
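A corresponding sketch with scikit-learn's NMF (toy non-negative data assumed; scikit-learn names the factors W and H):

```python
# Minimal sketch: non-negative matrix factorization with scikit-learn.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(9)
X = rng.random((100, 8))                 # NMF requires non-negative data

nmf = NMF(n_components=2, random_state=0, max_iter=500)
U = nmf.fit_transform(X)                 # n x d, non-negative (W in scikit-learn)
V = nmf.components_                      # d x p, non-negative (H in scikit-learn)
print(U.min() >= 0 and V.min() >= 0)
```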


A Bluffer's Guide to Dimension Reduction - Leland McInnes

More about this: https://www.youtube.com/watch?v=9iol3Lk6kyU&t=6s
