Assignment 6

Exercise 1 (5 points)

Explain the assumptions we make when performing Principal Component Analysis (PCA). What happens when these assumptions are violated?
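As a starting point for thinking about one such assumption, the sketch below shows how strongly PCA reacts to the scale of the variables. It uses made-up toy data (the column names `small` and `large` are hypothetical) and is only an illustration, not a full answer to the exercise.

```r
# Toy data: two independent variables on very different scales (made-up numbers).
set.seed(1)
toy <- data.frame(
  small = rnorm(100, sd = 1),
  large = rnorm(100, sd = 1000)
)

# Without scaling, the high-variance column dominates the first component.
pca_raw <- prcomp(toy, scale. = FALSE)

# With scaling, both columns enter the analysis with unit variance.
pca_scaled <- prcomp(toy, scale. = TRUE)

# Absolute loadings of the first principal component in each case.
raw_pc1 <- abs(pca_raw$rotation[, "PC1"])
scaled_pc1 <- abs(pca_scaled$rotation[, "PC1"])

print(round(raw_pc1, 3))
print(round(scaled_pc1, 3))
```

Comparing the two printed vectors shows how much the direction of the first component depends on whether the variables were standardized first.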

Exercise 2 (5 points)

Answer the following questions regarding Principal Component Analysis.

Exercise 3 (10 points)

In this exercise, you will explore a data set using PCA. The data come from the #tidytuesday project and concern student loan payments.

Load in the data using the following script.

library(tidymodels)

# Read the loans data, drop the agency_name and added columns,
# and remove rows with missing values.
loans <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-11-26/loans.csv") %>%
  select(-agency_name, -added) %>%
  drop_na()

  1. Use the prcomp() function to perform PCA on the loans data set. Set scale. = TRUE to perform scaling. What results are contained in this object? (hint: use the names() function)

  2. Calculate the amount of variance explained by each principal component. (hint: look at ?broom:::tidy.prcomp)

  3. Use the tidy() function to extract the loadings. Which variable contributes most to the first principal component? To the second?

  4. Use the augment() function to obtain the transformed data (the principal component scores) and create a scatter plot of any two components of your choice.
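The steps above can be sketched on a different data set to show the shape of the calls involved. The example below uses the built-in USArrests data purely as a stand-in for loans; the object names (`pca_fit`, `var_explained`, `loadings`, `scores`, `pc_plot`) are assumptions chosen for illustration.

```r
library(broom)
library(ggplot2)

# USArrests is a built-in data set, used here only as a stand-in for loans.
pca_fit <- prcomp(USArrests, scale. = TRUE)

# List the components stored in the prcomp object.
print(names(pca_fit))

# Variance explained: one row per component, with the standard deviation,
# the proportion of variance, and the cumulative proportion.
var_explained <- tidy(pca_fit, matrix = "eigenvalues")
print(var_explained)

# Loadings: one row per (variable, component) pair.
loadings <- tidy(pca_fit, matrix = "loadings")
print(loadings)

# Scores: the original rows plus .fittedPC1, .fittedPC2, ... columns.
scores <- augment(pca_fit, data = USArrests)

# Scatter plot of the first two components.
pc_plot <- ggplot(scores, aes(.fittedPC1, .fittedPC2)) +
  geom_point() +
  labs(x = "PC1", y = "PC2")
print(pc_plot)
```

The same calls should carry over to the loans data once it has been loaded with the script above, with `loans` in place of `USArrests`.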

Exercise 4 (10 points)

In this exercise, you are tasked with predicting the weight of an animal in a zoo based on the words used to describe it. The animals data set can be downloaded here.

This data set contains 1001 variables. The first variable, weight, is the natural log of the mean weight of the animal. The remaining variables are named tf_*, each recording how many times the word * appears in the description of the animal.

Use {tidymodels} to set up a workflow to train a principal component (PC) regression. We can do this by specifying a linear regression model and creating a preprocessor recipe with {recipes} that applies a PCA transformation to the predictors using step_pca(). Use the threshold argument in step_pca() to keep only as many principal components as are needed to explain 90% of the variance.

How well does this model perform on the testing data set?
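A minimal sketch of such a workflow is shown below, under stated assumptions: the built-in mtcars data stands in for the animals data (with mpg playing the role of weight), the 75/25 split and the seed are arbitrary, and the step_normalize() call is an added assumption that the assignment text does not ask for. All object names are hypothetical.

```r
library(tidymodels)

set.seed(123)

# mtcars stands in for the animals data; mpg plays the role of weight.
car_split <- initial_split(mtcars, prop = 0.75)
car_train <- training(car_split)
car_test <- testing(car_split)

# Recipe: standardize the predictors (an assumption, not part of the
# assignment text), then keep the components explaining 90% of the variance.
pca_rec <- recipe(mpg ~ ., data = car_train) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors(), threshold = 0.9)

# Ordinary linear regression fitted on the retained components.
lm_spec <- linear_reg() %>%
  set_engine("lm")

pcr_wf <- workflow() %>%
  add_recipe(pca_rec) %>%
  add_model(lm_spec)

# Fit on the training data only; the PCA rotation is learned there as well.
pcr_fit <- fit(pcr_wf, data = car_train)

# Evaluate on the held-out testing data with RMSE.
test_metrics <- predict(pcr_fit, new_data = car_test) %>%
  bind_cols(car_test) %>%
  rmse(truth = mpg, estimate = .pred)
print(test_metrics)
```

Bundling the recipe and the model in one workflow means the PCA is estimated from the training data alone and then applied unchanged to the testing data, which avoids leaking information from the test set into the preprocessing.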