Assignment 4

Exercise 1 (10 points)

Explain the assumptions we make when performing Principal Component Analysis (PCA). What happens when these assumptions are violated?

Exercise 2 (10 points)

Answer the following questions regarding Principal Component Analysis.

Exercise 3 (10 points)

In this exercise, you will explore a data set using PCA. The data comes from the #tidytuesday project and describes student loan payments.

Load in the data using the following script.

library(tidymodels)
loans <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-11-26/loans.csv") %>%
  select(-agency_name, -added) %>%
  drop_na()

  1. Use the prcomp() function to perform PCA on the loans data set. Set scale. = TRUE to perform scaling. What results are contained in this object? (hint: use the names() function)

  2. Calculate the amount of variance explained by each principal component. (hint: look at ?broom:::tidy.prcomp)

  3. Use the tidy() function to extract the loadings. Which variable contributed most to the first principal component? To the second?

  4. Use the augment() function to get the transformed data (the principal component scores) and create a scatter plot of any two components of your choice.
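
The calls mentioned in the steps above can be sketched on a different data set. This is a minimal illustration, assuming the built-in USArrests data as a stand-in for loans; the object names pca_fit, variance_explained, loadings and scores are made up for this example.

```r
library(tidymodels)

# USArrests is a built-in data set, used here only as a stand-in for loans
pca_fit <- prcomp(USArrests, scale. = TRUE)

# Elements stored in the prcomp object
names(pca_fit)

# Standard deviation and proportion of variance for each component
variance_explained <- tidy(pca_fit, matrix = "eigenvalues")
variance_explained

# Loadings in long format: one row per variable and component
loadings <- tidy(pca_fit, matrix = "loadings")
loadings

# Scores appended to the original data as .fittedPC1, .fittedPC2, ...
scores <- augment(pca_fit, data = USArrests)

ggplot(scores, aes(.fittedPC1, .fittedPC2)) +
  geom_point()
```

The same pattern should carry over to loans, where the interpretation of the loadings is left to you.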

Exercise 4 (15 points)

In this exercise, you are tasked with predicting the weight of an animal in a zoo, based on which words are used to describe it. The animals data set can be downloaded here.

This data set contains 801 variables. The first variable, weight, is the natural log of the mean weight of the animal. The remaining variables are named tf_*, each counting how many times the word * appears in the description of the animal.

Use {tidymodels} to set up a workflow to train a principal component (PC) regression. We can do this by specifying a linear regression model and creating a preprocessor recipe with {recipes} that applies a PCA transformation to the predictors using step_pca(). Use the threshold argument in step_pca() to keep only the principal components needed to explain 90% of the variance.

How well does this model perform on the testing data set?
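
The general shape of such a workflow can be sketched as follows. This is an illustration under stated assumptions, not a solution: it uses the built-in mtcars data with mpg as the outcome in place of the animals data, the train/test split is created here, and the step_normalize() call is an added choice because PCA is sensitive to the scale of the predictors. All object names are hypothetical.

```r
library(tidymodels)

set.seed(123)

# mtcars stands in for the animals data; mpg plays the role of weight
car_split <- initial_split(mtcars, prop = 0.75)
car_train <- training(car_split)
car_test  <- testing(car_split)

# Normalize, then keep enough components to explain 90% of the variance
pcr_recipe <- recipe(mpg ~ ., data = car_train) %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors(), threshold = 0.9)

pcr_workflow <- workflow() %>%
  add_recipe(pcr_recipe) %>%
  add_model(linear_reg() %>% set_engine("lm"))

pcr_fit <- fit(pcr_workflow, data = car_train)

# RMSE on the held-out rows
pcr_rmse <- predict(pcr_fit, new_data = car_test) %>%
  bind_cols(car_test) %>%
  rmse(truth = mpg, estimate = .pred)
pcr_rmse
```

Putting the PCA step inside the recipe means the components are estimated on the training rows only and then applied unchanged to the testing rows.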

Exercise 5 (10 points)

For parts (a) through (c), indicate which of the statements is correct. Justify your answers.

  (a) The lasso, relative to least squares, is:
    • More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
    • More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
    • Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
    • Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
  (b) Repeat (a) for ridge regression relative to least squares.
  (c) Repeat (a) for non-linear methods relative to least squares.

Exercise 6 (10 points)

Suppose we estimate the regression coefficients in a linear regression model by minimizing

\[ \sum_{i=1}^n \left( y_i - \beta_0 - \sum^p_{j=1}\beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^p \beta_j^2 \]

for a particular value of \(\lambda\). For parts (a) through (e), indicate which of the statements is correct. Justify your answers.

  (a) As we increase \(\lambda\) from 0, the training RSS will:
    • Increase initially, and then eventually start decreasing in an inverted U shape.
    • Decrease initially, and then eventually start increasing in a U shape.
    • Steadily increase.
    • Steadily decrease.
    • Remain constant.
  (b) Repeat (a) for test RSS.
  (c) Repeat (a) for variance.
  (d) Repeat (a) for squared bias.
  (e) Repeat (a) for the irreducible error.

Exercise 7 (15 points)

In this exercise, you are tasked with predicting the weight of an animal in a zoo, based on which words are used to describe it. The animals data set can be downloaded here.

This data set contains 801 variables. The first variable, weight, is the natural log of the mean weight of the animal. The remaining variables are named tf_*, each counting how many times the word * appears in the description of the animal.

Fit a lasso regression model to predict weight based on all the other variables.
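
For reference, the lasso criterion in the notation of Exercise 6 replaces the squared penalty with an absolute-value penalty:

\[ \sum_{i=1}^n \left( y_i - \beta_0 - \sum^p_{j=1}\beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^p |\beta_j| \]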

Use the tune package to perform hyperparameter tuning to select the best value of \(\lambda\). Use 10 bootstrap resamples of the training data as the resampling set.

How well does this model perform on the testing data set?
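
The tuning steps can be sketched along these lines. This is an illustration under stated assumptions rather than a solution: mtcars stands in for the animals data, the glmnet package is assumed to be installed, the penalty range and grid size are arbitrary choices, and all object names are hypothetical.

```r
library(tidymodels)

set.seed(123)

# mtcars stands in for the animals data; mpg plays the role of weight
car_split <- initial_split(mtcars, prop = 0.75)
car_train <- training(car_split)

# mixture = 1 requests the lasso penalty; penalty is left to be tuned
lasso_spec <- linear_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

# Predictors are normalized so the penalty treats them on a common scale
lasso_recipe <- recipe(mpg ~ ., data = car_train) %>%
  step_normalize(all_predictors())

lasso_workflow <- workflow() %>%
  add_recipe(lasso_recipe) %>%
  add_model(lasso_spec)

# 10 bootstrap resamples of the training data
car_boots <- bootstraps(car_train, times = 10)

# Candidate penalties; range is on the log10 scale (an arbitrary choice here)
penalty_grid <- grid_regular(penalty(range = c(-3, 0)), levels = 20)

tuned <- tune_grid(lasso_workflow, resamples = car_boots, grid = penalty_grid)
best_penalty <- select_best(tuned, metric = "rmse")

# Refit on the full training set and evaluate once on the testing set
final_metrics <- lasso_workflow %>%
  finalize_workflow(best_penalty) %>%
  last_fit(car_split) %>%
  collect_metrics()
final_metrics
```

Here last_fit() touches the testing rows only once, after the penalty has been chosen from the bootstrap resamples.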