Explain the assumptions we are making when performing Principle Component Analysis (PCA). What happens when these assumptions are violated?
Answer the following questions regarding Principle Component Analysis.
You will in this exercise explore a data set using PCA. The data comes from the #tidytuesday project and is about Student Loan Payments.
Load in the data using the following script.
library(tidymodels)
loans <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-11-26/loans.csv") %>%
select(-agency_name, -added) %>%
drop_na()
Use the prcomp()
function to perform PCA on the loans data set. Set scale. = TRUE
to perform scaling. What results are contained in this object? (hint: use the names()
function)
Calculate the amount of variance explained by each principal component. (hint: look at ?broom:::tidy.prcomp
)
Use the tidy()
function to extract the loadings. Which variable contributed most to the first principle component? Second Component?
Use the augment()
function to get back the transformation and create a scatter plot of any two components of your choice.
In this exercise, you are tasked to predict the weight of an animal in a zoo, based on which words are used to describe it. The animals
data set can be downloaded here.
This data set contains 801 variables. The first variable weight
is the natural log of the mean weight of the animal. The remaining variables are named tf_*
which shows how many times the word *
appears in the description of the animal.
Use {tidymodels} to set up a workflow to train a PC regression. We can do this by specifying a linear regression model, and create a preprocessor recipe with {recipes} that applies PCA transformation on the predictors using step_pca()
. Use the threshold
argument in step_pca()
to only keep the principal components that explain 90% of the variance.
How well does this model perform on the testing data set?
For part (a) through (c) indicate which of the statements are correct. Justify your answers.
Suppose we estimate the regression coefficients in a linear regression model by minimizing
\[ \sum_{i=1}^n \left( y_i - \beta_0 - \sum^p_{j=1}\beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^p \beta_j^2 \]
for a particular value of \(\lambda\). For part (a) through (c) indicate which of the statements are correct. Justify your answers.
In this exercise, you are tasked to predict the weight of an animal in a zoo, based on which words are used to describe it. The animals
data set can be downloaded here.
This data set contains 801 variables. The first variable weight
is the natural log of the mean weight of the animal. The remaining variables are named tf_*
which shows how many times the word *
appears in the description of the animal.
Fit a lasso regression model to predict weight
based on all the other variables.
Use the tune package to perform hyperparameter tuning to select the best value of \(\lambda\). Use 10 bootstraps as the resamples
data set.
How well does this model perform on the testing data set?