What is a test set, and why would you want to use one? What should you take into account when deciding on the train-test ratio?
Suppose that we take a data set, divide it into equally-sized training and test sets, and then try out two different classification procedures. First, we use logistic regression and get an error rate of 20% on the training data and 30% on the test data. Next, we use 1-nearest neighbors (i.e. \(K = 1\)) and get an average error rate (averaged over both test and training data sets) of 18%. Based on these results, which method should we prefer to use for classification of new observations? Why?
In this exercise, we will explore a data set about cars called `auto`, which you can find here.
The data set contains 1 factor variable and 6 numeric variables. The factor variable `mpg` has two levels, `high` and `low`, indicating whether the car has a high or low miles per gallon. We will investigate whether a logistic regression classifier can predict if a car has high or low `mpg` from the other variables.
Read in the data and create a test-train `rsplit` object of `auto` using `initial_split()`. Use the default arguments for `initial_split()`.
Create the training and testing data sets with `training()` and `testing()`, respectively.
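The splitting steps above could look roughly like the sketch below. Because the link to the `auto` data is not reproduced here, it uses a simulated stand-in tibble; `sim_cars`, its column names, and the seed are assumptions made only for illustration.

```r
library(tidymodels)

# Hypothetical stand-in for the auto data: a two-level factor and two numeric columns.
set.seed(1234)
sim_cars <- tibble(
  weight = rnorm(200),
  horsepower = rnorm(200)
) %>%
  mutate(mpg = factor(if_else(weight + rnorm(200) > 0, "low", "high")))

# initial_split() with default arguments keeps 3/4 of the rows for training.
sim_split <- initial_split(sim_cars)

# Extract the two data sets from the rsplit object.
sim_training <- training(sim_split)
sim_testing <- testing(sim_split)
```

Setting a seed before splitting makes the random partition reproducible between runs.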
Fit a logistic regression model using `logistic_reg()`. Use all 6 numeric variables as predictors (a formula shorthand is to write `mpg ~ .`, where `.` means everything). Remember to fit the model using only the training data set.
Inspect the model with `summary()` and `tidy()`. Which of the variables are significant?
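As a rough illustration of fitting and inspecting a model, here is a minimal sketch on simulated stand-in data rather than on `auto`. The names `sim_training`, `lr_spec`, `lr_fit`, and `lr_coefs` are hypothetical, and the `glm` engine is stated explicitly as an assumption about the intended setup.

```r
library(tidymodels)

# Hypothetical stand-in training data (not the exercise's auto data).
set.seed(1234)
sim_training <- tibble(
  weight = rnorm(150),
  horsepower = rnorm(150)
) %>%
  mutate(mpg = factor(if_else(weight + rnorm(150) > 0, "low", "high")))

# Specify the model, then fit it on the training data only.
lr_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

lr_fit <- lr_spec %>%
  fit(mpg ~ ., data = sim_training)

# The parsnip fit stores the underlying glm object in $fit.
summary(lr_fit$fit)

# tidy() returns one row per coefficient, including a p.value column.
lr_coefs <- tidy(lr_fit)
lr_coefs
```

The `p.value` column of the tidied output is one place to read off which predictors appear significant.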
Predict values for the training data set and save them as `training_pred`.
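The prediction step might be sketched as follows, again on a simulated stand-in (`sim_training` and `lr_fit` are hypothetical names, not part of the exercise data). It is meant only to show the shape of what `predict()` returns for a parsnip classification fit.

```r
library(tidymodels)

# Hypothetical stand-in training data with a single numeric predictor.
set.seed(1234)
sim_training <- tibble(weight = rnorm(150)) %>%
  mutate(mpg = factor(if_else(weight + rnorm(150) > 0, "low", "high")))

lr_fit <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(mpg ~ ., data = sim_training)

# predict() on a parsnip fit takes new_data and returns a tibble with a
# .pred_class column, one row per row of new_data.
training_pred <- predict(lr_fit, new_data = sim_training)
```

Because the result has the same number of rows as the input, in the same order, it can be column-bound to the training data for scoring.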
Use the following code to calculate the training accuracy:

```r
bind_cols(
  training_pred,
  auto_training
) %>%
  accuracy(truth = mpg, estimate = .pred_class)
```

(`auto_training` should be renamed to match your training data set if needed.)