Each exercise is worth 16 points.
Exercise 1
For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.
- The sample size n is extremely large, and the number of predictors p is small.
- The number of predictors p is extremely large, and the number of observations n is small.
- The relationship between the predictors and response is highly non-linear.
- The variance of the error terms is extremely high.
Exercise 2
Describe the difference between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a non-parametric approach)? What are its disadvantages?
Exercise 3
Carefully explain the difference between the KNN classifier and KNN regression methods. Name one downside of using KNN on very large data sets.
Exercise 4
Suppose we have a data set with five predictors: X1 = GPA, X2 = extracurricular activities (EA), X3 = Gender (1 for Female and 0 for Male), X4 = interaction between GPA and EA, and X5 = interaction between GPA and Gender. The response is starting salary after graduation (in thousands of dollars). Suppose we use least squares to fit the model and get β0 = 50, β1 = 20, β2 = 0.07, β3 = 35, β4 = 0.01, β5 = −10.
- Which answer is correct, and why?
  - For a fixed value of EA and GPA, males earn more on average than females.
  - For a fixed value of EA and GPA, females earn more on average than males.
  - For a fixed value of EA and GPA, males earn more on average than females provided that the GPA is high enough.
  - For a fixed value of EA and GPA, females earn more on average than males provided that the GPA is high enough.
- Predict the salary of a female with EA of 110 and a GPA of 4.0.
- True or false: Since the coefficient for the GPA/EA interaction term is very small, there is very little evidence of an interaction effect. Justify your answer.
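As an aid for checking hand calculations in this exercise, the fitted model can be written as a small function. This is only a sketch in Python: the function name predicted_salary and its argument names are illustrative and not part of the exercise, and the example inputs are hypothetical values chosen to differ from those asked about above.

```python
def predicted_salary(gpa: float, ea: float, female: int) -> float:
    """Fitted starting salary in thousands of dollars.

    Coefficients are the least squares estimates given in Exercise 4.
    `female` follows the exercise's coding: 1 for Female, 0 for Male.
    """
    b0, b1, b2, b3, b4, b5 = 50, 20, 0.07, 35, 0.01, -10
    return (
        b0
        + b1 * gpa            # GPA main effect
        + b2 * ea             # EA main effect
        + b3 * female         # Gender main effect
        + b4 * gpa * ea       # GPA x EA interaction
        + b5 * gpa * female   # GPA x Gender interaction
    )


# Hypothetical inputs, not the ones asked for in the exercise.
example = predicted_salary(gpa=3.0, ea=100, female=0)
print(round(example, 2))
```

Writing the interaction terms as explicit products makes it easier to see which coefficients are affected by a change in GPA or Gender.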
Exercise 5
This question should be answered using the biomass data set.
- Fit a multiple regression model to predict HHV using carbon, hydrogen and oxygen.
- Provide an interpretation of each coefficient in the model.
- Write out the model in equation form.
- For which of the predictors can you reject the null hypothesis H0: βj = 0?
- On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
- How well do the models in (a) and (e) fit the data? How big was the effect of removing the predictor?