Assignment 2

Exercise 1 (15 points)

Suppose we collect data for a group of students in a statistics class with variables \(X_1\) = hours studied, \(X_2\) = undergrad GPA, and \(Y\) = receive an A. We fit a logistic regression and obtain the estimated coefficients \(\hat{\beta}_0=-6\), \(\hat{\beta}_1=0.05\), \(\hat{\beta}_2=1\).

  1. Estimate the probability that a student who studies for 40 hours and has an undergrad GPA of \(3.5\) gets an A in the class.
  2. How many hours would that student in part (a) need to study to have a 50% chance of getting an A in the class?
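
As a reminder (not part of the original problem statement), a fitted logistic regression turns the linear predictor into a probability through the logistic function. In the notation above, with \(\hat{p}\) denoting the estimated probability of receiving an A (the symbol \(\hat{p}\) is introduced here for illustration only):

```latex
\[
\hat{p}(X_1, X_2)
  = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2}}
         {1 + e^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2}},
\qquad
\log\frac{\hat{p}}{1 - \hat{p}}
  = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 .
\]
```

The second form is the log-odds; a probability of 0.5 corresponds to log-odds of 0.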

Exercise 2 (10 points)

Suppose that we take a data set, divide it into equally-sized training and test sets, and then try out two different classification procedures. First, we use logistic regression and get an error rate of 20% on the training data and 30% on the test data. Next, we use 1-nearest neighbors (i.e. \(K = 1\)) and get an average error rate (averaged over both test and training data sets) of 18%. Based on these results, which method should we prefer to use for classification of new observations? Why?

Exercise 3 (15 points)

In this exercise you will examine the differences between LDA and QDA.

  1. If the Bayes decision boundary is linear, do we expect LDA or QDA to perform better on the training set? On the test set?
  2. If the Bayes decision boundary is non-linear, do we expect LDA or QDA to perform better on the training set? On the test set?
  3. In general, as the sample size \(n\) increases, do we expect the test prediction accuracy of QDA relative to LDA to improve, decline, or be unchanged? Why?
  4. True or False: Even if the Bayes decision boundary for a given problem is linear, we will probably achieve a superior test error rate using QDA rather than LDA because QDA is flexible enough to model a linear decision boundary. Justify your answer.

Exercise 4 (20 points)

In this exercise, we will explore a data set about cars called auto which you can find here.

The data set contains 1 factor variable and 6 numeric variables. The factor variable mpg has two levels, high and low, indicating whether the car has high or low miles per gallon. In this exercise we will investigate whether a logistic regression classifier can predict if a car has high or low mpg from the other variables.

  1. Read in the data and create a test-train rsplit object of auto using initial_split(). Use default arguments for initial_split().

  2. Create the training and testing data set with training() and testing() respectively.

  3. Fit a logistic regression model using logistic_reg(). Use all 6 numeric variables as predictors (a formula shorthand is to write mpg ~ . where . means everything). Remember to fit the model using only the training data set.

  4. Inspect the model with summary() and tidy(). Which of the variables are significant?

  5. Predict values for the training data set and save them as training_pred.

  6. Use the following code to calculate the training accuracy

bind_cols(
  training_pred,
  auto_training
) %>%
  accuracy(truth = mpg, estimate = .pred_class)

(auto_training should be renamed to match your training data set if needed.)

  7. Predict values for the testing data set and use the above code to calculate the testing accuracy. Compare.
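
The steps above can be sketched end to end as follows. This is a minimal sketch, not a reference solution: it assumes the tidymodels package is installed, and it uses a small simulated data frame called toy (hypothetical, with made-up predictors x1 and x2) as a stand-in for auto, so the object names and the numbers it produces will not match the real data.

```r
library(tidymodels)

# Simulated stand-in for the auto data (an assumption, not the course data):
# two numeric predictors and a two-level factor response named mpg.
set.seed(123)
toy <- tibble(x1 = rnorm(200), x2 = rnorm(200)) %>%
  mutate(mpg = factor(if_else(x1 + x2 + rnorm(200) > 0, "high", "low")))

# Steps 1-2: split with default arguments, then extract the two data sets.
toy_split    <- initial_split(toy)
toy_training <- training(toy_split)
toy_testing  <- testing(toy_split)

# Step 3: logistic regression specification, fitted on the training data only.
lr_fit <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(mpg ~ ., data = toy_training)

# Step 4: inspect the underlying glm object and the tidy coefficient table.
summary(lr_fit$fit)
tidy(lr_fit)

# Step 5: class predictions are returned in a column named .pred_class.
training_pred <- predict(lr_fit, new_data = toy_training)

# Step 6: training accuracy, using the same pattern as the code above.
train_acc <- bind_cols(training_pred, toy_training) %>%
  accuracy(truth = mpg, estimate = .pred_class)
train_acc
```

For the testing accuracy, the same predict() and bind_cols() calls apply with toy_testing in place of toy_training.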

Exercise 5 (20 points)

This exercise should be answered using the Weekly data set, which is part of the ISLR package. If you don’t have it installed already, you can install it with

install.packages("ISLR")

To load the data set run the following code

library(ISLR)
data("Weekly")

This data set is similar in nature to the Smarket data from chapter 4’s lab; it contains 1089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.

  1. Produce some numerical and graphical summaries of the data. Do there appear to be any patterns?

  2. Use the whole data set to perform a logistic regression (with logistic_reg()) with Direction as the response and the five lag variables plus Volume as predictors. Use summary() to print the results (remember to call it on the underlying fit, i.e. summary(model_fit$fit)). Do any of the predictors appear to be statistically significant? If so, which ones?

  3. Use conf_mat() and accuracy() from the yardstick package to calculate the confusion matrix and the accuracy (overall fraction of correct predictions). Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.

  4. Split the data into a training and testing data set using the following code

    weekly_training <- Weekly %>% filter(Year <= 2008)
    weekly_testing <- Weekly %>% filter(Year > 2008)
     
  5. Now fit the logistic regression model using the training data, with Lag2 as the only predictor. Compute the confusion matrix and accuracy metric using the testing data set.

  6. Repeat (e) using LDA.

  7. Repeat (e) using QDA.

  8. Repeat (e) using KNN with K = 1.

  9. Which of these methods appears to provide the best results on the data?

  10. (Optional) Experiment with different combinations of predictors for each of the methods. Report the variables, method, and associated confusion matrix that appear to provide the best results on the held-out data. Note that you can also experiment with different values of K in KNN. (Running many models and evaluating each of them on the testing data set like this is not good practice. In later weeks we will look at ways to explore multiple models properly.)
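
One possible way to set up the model comparison is sketched below. It assumes the tidymodels, discrim, ISLR, MASS, and kknn packages are installed; the engine choices ("MASS" for LDA and QDA, "kknn" for KNN) are assumptions rather than course requirements, and specs and test_predictions() are hypothetical names introduced only for this illustration.

```r
library(tidymodels)
library(discrim)  # provides discrim_linear() and discrim_quad()
library(ISLR)
data("Weekly")

# Same split as in the exercise text.
weekly_training <- Weekly %>% filter(Year <= 2008)
weekly_testing  <- Weekly %>% filter(Year > 2008)

# One model specification per method; engine choices are assumptions.
specs <- list(
  logistic = logistic_reg() %>% set_engine("glm"),
  lda      = discrim_linear() %>% set_engine("MASS"),
  qda      = discrim_quad() %>% set_engine("MASS"),
  knn1     = nearest_neighbor(neighbors = 1) %>%
    set_mode("classification") %>%
    set_engine("kknn")
)

# Hypothetical helper: fit a specification on the training data with Lag2
# as the only predictor, then attach test-set predictions to the truth.
test_predictions <- function(spec) {
  model_fit <- fit(spec, Direction ~ Lag2, data = weekly_training)
  bind_cols(predict(model_fit, new_data = weekly_testing), weekly_testing)
}

# Example for a single method; the others follow the same pattern.
preds <- test_predictions(specs$lda)
conf_mat(preds, truth = Direction, estimate = .pred_class)
accuracy(preds, truth = Direction, estimate = .pred_class)
```

Keeping the specifications in a named list means the fitting and evaluation code is written once and reused across methods, which makes the comparison in part 9 less error-prone.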