Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction.
What are the advantages and disadvantages of a very flexible approach compared to a less flexible approach for regression or classification? If you were to draw the decision boundary for a very flexible classification model, how would it look? Under what circumstances might a more flexible approach be preferred to a less flexible approach?
Explain the differences between K-nearest neighbors and linear regression for a general regression task. Under what circumstances would a K-nearest neighbors approach perform better than a linear model? Performance here is measured using an appropriate performance metric calculated on the training data set.
Explain how the scaling of predictor variables does or does not affect the model fit for K-nearest neighbors, logistic regression, LDA, and QDA.
Suppose you are given a data set and told to perform a clustering analysis to determine how many clusters are present. Explain how you would go about doing that.
Suppose that we use some statistical learning method to make a prediction for the response Y for a particular value of the predictor X. Carefully describe how we might estimate the standard deviation of our prediction.
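As a hint, one approach that is often used here is the bootstrap: resample the observations with replacement, refit the method on each resample, predict at the same value of X each time, and take the standard deviation of those predictions. The sketch below illustrates the idea under stated assumptions only. The learning method is a simple least-squares line, the data are synthetic, and the names fit_line and bootstrap_prediction_sd are hypothetical helpers made up for this illustration.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares fit of y = a + b * x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def bootstrap_prediction_sd(xs, ys, x0, n_boot=500, seed=0):
    """Estimate the standard deviation of the prediction at x0 by bootstrap."""
    rng = random.Random(seed)
    n = len(xs)
    preds = []
    while len(preds) < n_boot:
        # Draw n observations with replacement from the original sample.
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        if len(set(bx)) < 2:
            continue  # degenerate resample: a line cannot be fitted
        a, b = fit_line(bx, by)
        preds.append(a + b * x0)
    # Spread of the refitted predictions approximates the prediction's SD.
    return statistics.stdev(preds)


# Synthetic data (an assumption for illustration): y = 2 + 0.5 x + noise.
data_rng = random.Random(1)
xs = [data_rng.uniform(0, 10) for _ in range(50)]
ys = [2 + 0.5 * x + data_rng.gauss(0, 1) for x in xs]

sd_at_5 = bootstrap_prediction_sd(xs, ys, x0=5.0)
print(f"bootstrap SD of the prediction at X = 5: {sd_at_5:.3f}")
```

The same loop applies to any other learning method: only the fitting and prediction steps inside the loop change.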
In this exercise, you will fit a classification model. You are given a data set with a response and 10 numeric predictors. Fit four models: two KNN models, one with K = 1 and one with K = 2, one LDA model, and one QDA model. The data have already been split for you and can be downloaded here: vowel_train and vowel_test. Use 10-fold cross-validation, with accuracy as the performance metric, to select one of the four models. Fit the selected model on the training data set, predict on the testing data set, calculate the testing accuracy, and construct a confusion matrix. Comment on your results.
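The model-selection step can be sketched as follows. This is a minimal illustration and not a solution: it uses synthetic two-class data instead of the vowel data, covers only the two KNN candidates, and relies on hypothetical helper names (knn_predict, cv_accuracy) written for this sketch. LDA and QDA would enter the same loop as additional prediction functions, in whatever software you use for the exercise.

```python
import random
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Classify x by majority vote among its k nearest training points."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    # On a tied vote, max returns the label that was counted first,
    # which is the label of the nearer neighbor.
    return max(votes, key=votes.get)


def cv_accuracy(X, y, predict, n_folds=10, seed=0):
    """Mean held-out accuracy of `predict` over n_folds folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        tr_X = [X[i] for i in idx if i not in held_out]
        tr_y = [y[i] for i in idx if i not in held_out]
        correct = sum(predict(tr_X, tr_y, X[i]) == y[i] for i in fold)
        scores.append(correct / len(fold))
    return sum(scores) / n_folds


# Synthetic data (an assumption for illustration): two Gaussian classes.
rng = random.Random(42)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(60)]
X += [[rng.gauss(4, 1), rng.gauss(4, 1)] for _ in range(60)]
y = [0] * 60 + [1] * 60

candidates = {
    "knn_k1": lambda tr_X, tr_y, x: knn_predict(tr_X, tr_y, x, 1),
    "knn_k2": lambda tr_X, tr_y, x: knn_predict(tr_X, tr_y, x, 2),
}
results = {name: cv_accuracy(X, y, f) for name, f in candidates.items()}
best = max(results, key=results.get)
print(results, "selected:", best)
```

Every candidate is scored on the same folds because cv_accuracy uses a fixed seed, so differences in accuracy reflect the models and not the split.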