class: center, middle, title-slide

# Principal Components Analysis

## AU STAT627

### Emil Hvitfeldt

### 2021-03-16

---

<div style = "position:fixed; visibility: hidden">
`$$\require{color}\definecolor{orange}{rgb}{1, 0.603921568627451, 0.301960784313725}$$`
`$$\require{color}\definecolor{blue}{rgb}{0.301960784313725, 0.580392156862745, 1}$$`
`$$\require{color}\definecolor{pink}{rgb}{0.976470588235294, 0.301960784313725, 1}$$`
</div>

<script type="text/x-mathjax-config">
MathJax.Hub.Config({
  TeX: {
    Macros: {
      orange: ["{\\color{orange}{#1}}", 1],
      blue: ["{\\color{blue}{#1}}", 1],
      pink: ["{\\color{pink}{#1}}", 1]
    },
    loader: {load: ['[tex]/color']},
    tex: {packages: {'[+]': ['color']}}
  }
});
</script>

<style>
.orange {color: #FF9A4D;}
.blue {color: #4D94FF;}
.pink {color: #F94DFF;}
</style>

# More Unsupervised Learning

Another branch of unsupervised learning

- Dimensionality Reduction

> take many dimensions and create fewer ones that represent as much of the original data as possible

---

# Dimensionality Reduction

Why would you want to do this?

- Allows for visualization of many dimensions in 2 dimensions
- Can be used as a preprocessing step for models that can't handle many dimensions

---

## What is Principal Components Analysis? - Motivation

Suppose you have many dimensions and want to visualize the relationships between them

If you wanted to do that pairwise, you would have `\({p \choose 2} = \dfrac{p(p-1)}{2}\)` plots to make

This adds up very fast!

---

# Many plots

<img src="index_files/figure-html/unnamed-chunk-2-1.png" width="700px" style="display: block; margin: auto;" />

---

## What is Principal Components Analysis?

We want to find a low-dimensional representation of the high-dimensional data

Specifically, we would want 2 dimensions for plotting purposes

PCA is one such technique that does just that

--

PCA finds a low-dimensional representation of the data set that contains as much of the variation as possible in as few columns as possible

---

# Re-formulation

PCA constructs linear combinations of the original variables such that most of the variation is captured in the first new variable, then the second, then the third, and so on

---

# PCA Construction

The .blue[first principal component] of a set of features `\(X_1, X_2, ..., X_p\)` is the normalized linear combination of the features

`$$Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + ... + \phi_{p1}X_p$$`

that has the largest variance

By normalized we mean that `\(\sum_{j=1}^p \phi_{j1}^2 = 1\)`.

We refer to `\(\phi_{11}, ..., \phi_{p1}\)` as the loadings of the first principal component, and think of them together as the loading vector `\(\phi_1\)`

---

# PCA Construction

The loadings have to be constrained; otherwise we don't get a solution, since arbitrarily large loadings would make the variance arbitrarily large

---

# How do we get these?

Assume we have an `\(n \times p\)` data set `\(\mathbf{X}\)`

Since we are only interested in the variance, we assume that the variables have been centered

`$$\underset{\phi_{11}, ..., \phi_{p1}}{\text{maximize}} \left\{ \dfrac{1}{n} \sum^n_{i=1} \left( \sum^p_{j=1} \phi_{j1}x_{ij} \right)^2 \right\} \quad \text{subject to} \quad \sum_{j=1}^p \phi_{j1}^2 = 1$$`

---

# How do we get these?

Since `\(z_{i1} = \phi_{11} x_{i1} + \phi_{21} x_{i2} + ... + \phi_{p1}x_{ip}\)`, we can write

`$$\underset{\phi_{11}, ..., \phi_{p1}}{\text{maximize}} \left\{ \dfrac{1}{n} \sum^n_{i=1} z_{i1}^2 \right\} \quad \text{subject to} \quad \sum_{j=1}^p \phi_{j1}^2 = 1$$`

We are in essence maximizing the sample variance of the `\(n\)` values of `\(z_{i1}\)`.

We refer to `\(z_{11}, ..., z_{n1}\)` as the scores of the first principal component.

---

# How do we solve that problem?

Luckily this can be solved using techniques from linear algebra

More specifically, it can be solved using an .orange[eigen decomposition]

One of the main strengths of PCA is that you don't need iterative optimization to get the results: they are exact, not approximations!
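---

# In R: solving it with an eigen decomposition

A minimal sketch of that idea, not code from the course materials: a small simulated matrix is centered, the loading vectors are read off as the eigenvectors of its covariance matrix, and the result is compared with `prcomp()`, which returns the same loadings up to a sign flip.

```r
set.seed(1234)
x <- matrix(rnorm(100 * 4), ncol = 4)        # toy data: n = 100, p = 4
x <- scale(x, center = TRUE, scale = FALSE)  # center each column

# The eigenvectors of the covariance matrix are the loading vectors
# phi_1, ..., phi_p, ordered so the first has the largest variance
eig <- eigen(cov(x))
loadings <- eig$vectors

# The scores z_{i1}, ..., z_{ip} are the centered data times the loadings
scores <- x %*% loadings

# prcomp() does the equivalent computation for us
pca <- prcomp(x, center = FALSE)
pca$rotation  # same columns as `loadings`, possibly with flipped signs
```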
---

# Remaining principal components

Once the first principal component is calculated, we can calculate the second principal component

We find the second principal component `\(Z_2\)` as the linear combination of `\(X_1, ..., X_p\)` that has the maximal variance out of all the linear combinations that are uncorrelated with `\(Z_1\)`

This is the same as saying that `\(\phi_2\)` has to be orthogonal to the direction `\(\phi_1\)`

---

# Remaining principal components

We can keep doing this to calculate all the principal components

Since we are working iteratively through the principal components, we can calculate only as many as we need

---

# The proportion of variance explained

The proportion of variance explained by the `\(m\)`th principal component is given by

$$ \dfrac{\sum_{i=1}^n \left( \sum_{j=1}^p \phi_{jm}x_{ij} \right)^2}{\sum_{j=1}^p\sum_{i=1}^n x_{ij}^2} $$

Don't worry, this is already calculated by the software you use to fit the PCA

---

# Visualizing PCA

Once we have the principal components there are a couple of things we can visualize

---

.center[
![:scale 90%](images/lter_penguins.png)
]

.footnote[Art by Allison Horst]

---

# Plotting PC1 against PC2

<img src="index_files/figure-html/unnamed-chunk-3-1.png" width="700px" style="display: block; margin: auto;" />

---

# Plotting PC1 against PC2

<img src="index_files/figure-html/unnamed-chunk-4-1.png" width="700px" style="display: block; margin: auto;" />

---

# Plotting PC1 against PC3

<img src="index_files/figure-html/unnamed-chunk-5-1.png" width="700px" style="display: block; margin: auto;" />

---

# Plotting PC1 against PC3

<img src="index_files/figure-html/unnamed-chunk-6-1.png" width="700px" style="display: block; margin: auto;" />

---

# Plotting PC2 against PC3

<img src="index_files/figure-html/unnamed-chunk-7-1.png" width="700px" style="display: block; margin: auto;" />

---

# Plotting PC2 against PC3

<img src="index_files/figure-html/unnamed-chunk-8-1.png" width="700px" style="display: block; margin: auto;" />

---

# Loadings for PC1

<img src="index_files/figure-html/unnamed-chunk-9-1.png" width="700px" style="display: block; margin: auto;" />

---

# Loadings for all principal components

<img src="index_files/figure-html/unnamed-chunk-10-1.png" width="700px" style="display: block; margin: auto;" />

---

### Percent variance explained by each PCA component

<img src="index_files/figure-html/unnamed-chunk-11-1.png" width="700px" style="display: block; margin: auto;" />

---

# Alternative interpretation

PCA can also be interpreted as a rotation of the original space

---

# Scaling of variables

You must think about scaling the variables

Since we are maximizing a variance, the magnitude of each variable will matter

If you don't have any prior knowledge of the data, it is advisable to put all the variables on the same scale
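---

# In R: scaling and variance explained

A minimal sketch, assuming the penguin measurements come from the `palmerpenguins` package (the artwork above belongs to that data set); `scale. = TRUE` puts every variable on the same scale before the loadings are found.

```r
library(palmerpenguins)

# four numeric measurements, with missing rows dropped
peng <- na.omit(penguins[, c("bill_length_mm", "bill_depth_mm",
                             "flipper_length_mm", "body_mass_g")])

# center = TRUE and scale. = TRUE give every variable mean 0 and variance 1
pca <- prcomp(peng, center = TRUE, scale. = TRUE)

summary(pca)   # proportion of variance explained by each component
pca$rotation   # loadings
head(pca$x)    # scores: PC1, PC2, ... for each penguin
```

Note that the argument name in `prcomp()` is `scale.` with a trailing dot.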
---

# Uniqueness of Principal Components

The principal components you generate are unique up to a sign flip of the loadings

---

# How is this a dimensionality reduction technique?

PCA is not a dimensionality reduction method by itself in the strictest sense

You get the reduction by only keeping some of the columns

- Keep a fixed number of columns
- Keep enough columns to pass a threshold of variance explained

---

# Extensions

Think of this problem as

`$$X \approx U V$$`

where

- `\(X\)` is an `\(n \times p\)` matrix
- `\(U\)` is an `\(n \times d\)` matrix
- `\(V\)` is a `\(d \times p\)` matrix

---

# Extensions

What we want to do is minimize

`$$\sum_{i=1}^n \sum_{j=1}^p \text{loss}\left( X_{ij}, (UV)_{ij} \right)$$`

subject to some constraints

---

# PCA

`$$\sum_{i=1}^n \sum_{j=1}^p \left( X_{ij} - (UV)_{ij} \right)^2$$`

with no constraints

---

# Sparse PCA

`$$\sum_{i=1}^n \sum_{j=1}^p \left( X_{ij} - (UV)_{ij} \right)^2$$`

Under the constraint that each row of `\(V\)` has at most `\(k\)` non-zero entries

In other words, each principal component can contain at most `\(k\)` loadings

---

# K-Means

`$$\sum_{i=1}^n \sum_{j=1}^p \left( X_{ij} - (UV)_{ij} \right)^2$$`

Under the constraint that each row of `\(U\)` is an indicator vector: exactly one entry is 1 and the rest are 0

In other words, each observation belongs to exactly one cluster, and the rows of `\(V\)` are the cluster centers

---

# Non-Negative Matrix Factorization

`$$\sum_{i=1}^n \sum_{j=1}^p \left( X_{ij} - (UV)_{ij} \right)^2$$`

Under the constraint that all the values of `\(U\)` and `\(V\)` are non-negative

---

# A Bluffer's Guide to Dimension Reduction - Leland McInnes

More about this: https://www.youtube.com/watch?v=9iol3Lk6kyU&t=6s

---

# Final Project

Search for data!

Ideas:

- https://github.com/rfordatascience/tidytuesday
- https://www.data-is-plural.com/
- https://www.kaggle.com/