Quantities are defined here very broadly to be data or measurements
Some machine learning methods have a statistical underpinning
This allows us to quantify the uncertainty
Examples of "non-statistical machine learning methods" are
There is not a hard and fast distinction. Machine learning is about getting answers. Statistics is a great way to find answers.
The main goals are
For a response Y and p different predictors X_1, X_2, ..., X_p,
the relationship between them can be written as
Y = f(X) + \epsilon
with \epsilon being a random error term that is independent of X and has mean 0.
This formulation is VERY general.
There is no assumption that f provides any information about Y.
Our goal is to find f,
at least if f is different from the null model, or "monkey model".
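As a quick illustration, here is a minimal R sketch that simulates data from this formulation; the particular f, predictor range, and noise level are made up for the example.

```r
set.seed(1)

n <- 100
x <- runif(n, 0, 10)                   # a single predictor (p = 1 for simplicity)
f <- function(x) 2 + 3 * sin(x)        # a made-up "true" f, only for illustration
epsilon <- rnorm(n, mean = 0, sd = 1)  # random error: independent of x, mean 0

y <- f(x) + epsilon                    # the observed response Y = f(X) + epsilon
```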
Main thesis:
If we can find f then we can predict the value of Y for different values of X
This rests on a major assumption: the scenario in which we estimate f stays the same when we use it to predict.
Models trained on data from a recession may not apply to data in a depression
Models trained on low-income houses might not work on high-income houses
\hat{Y} = \hat{f}(X)
The error is how much f is different from \hat{f}
We split this into reducible and irreducible
We will generally not be able to completely predict anything from a limited number of features
Any error left when a perfect statistical model is trained is the irreducible error
Sub-optimal estimates \hat{f} of f introduce error that could have been reduced.
(this hinges on a more philosophical basis. Is the world fully deterministic?)
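A small simulated sketch of this decomposition (the true f and noise level are invented for the example): even the perfect model is left with the irreducible error, while a sub-optimal \hat{f} adds reducible error on top.

```r
set.seed(1)

n <- 10000
x <- runif(n, 0, 10)
f <- function(x) 2 + 3 * sin(x)   # made-up "true" f
y <- f(x) + rnorm(n, sd = 1)      # irreducible noise with variance 1

# Perfect model: only the irreducible error remains (about 1, the noise variance)
mean((y - f(x))^2)

# Sub-optimal estimate (here the "monkey model" that ignores x): adds reducible error
f_hat <- function(x) rep(mean(y), length(x))
mean((y - f_hat(x))^2)
```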
If we could predict with no error, we would technically have a mathematical formula rather than a statistical model.
The amount of sales tax on an item is generally not statistical: you might need a complicated model, but you should be able to eliminate all the error.
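For contrast, a tiny sketch of the sales-tax case (the flat rate is a made-up example value): the relationship is a formula, not a statistical model, so there is no error term to worry about.

```r
# Deterministic relationship: no epsilon, nothing left unexplained
sales_tax <- function(price, rate = 0.08) price * rate  # rate is an example value

sales_tax(c(10, 24.99, 100))
```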
Examples with error:
A bolt factory: estimate the weight of a bolt.
Machines are calibrated, but things like temperature, air quality, material quality, and stray particles will still play a (small) role.
Understanding how Y is related to X
We want to understand the exact form of the relationship.
"What effect will changing the price have on the rating of a product?"
This is inference. We are primarily interested in the effect, not the outcome
As we will see later, there is a trade-off between models that work well for prediction and easily explainable models
Hard-to-explain models can be good predictors but bad for inference.
Different fields place different weight on explainability/interpretability.
Most of what we will be working on is going to be supervised.
The learning we are doing is guided by a specific response Y that we are modeling.
Unsupervised learning, on the other hand, doesn't have an explicit goal or answer sheet.
"here is all our customer data, do they form groups?"
There are many ways to assess how well a model performs; the book covers mean squared error (MSE).
Many of these measures are related to how far the predictions are from the observations.
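A minimal sketch of the mean squared error, computed here for a simple lm() fit on a built-in dataset (the model and data are just placeholders):

```r
# Mean squared error: average squared distance between observation and prediction
mse <- function(truth, estimate) mean((truth - estimate)^2)

fit <- lm(mpg ~ wt, data = mtcars)    # any fitted model would do
mse(mtcars$mpg, predict(fit, mtcars))
```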
We have seen this before, so we are just brushing up.
Start with the simple case:
Y = \beta_0 + \beta_1 X + \epsilon
where X is a single predictor variable.
Notice this is
f(X) = \beta_0 + \beta_1 X
We need to find the values of the betas that make the line as close to the data as possible.
Consider the data on the right
It appears to have a possible linear trend
If we draw a simple horizontal line for y = 16
This would be \beta_0 = 16, \beta_1 = 0
If we take the squares of all the vertical distances (residuals) and sum them, we get
```
## # A tibble: 1 x 1
##     rss
##   <dbl>
## 1 1998.
```
If we minimize the RSS then we would get \beta_0 = 5.31537, \beta_1 = 0.04897
With a resulting RSS of
```
## # A tibble: 1 x 1
##     rss
##   <dbl>
## 1  216.
```
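The plotted data itself is not reproduced here, but the two RSS calculations can be sketched for any data frame with columns x and y; the simulated data below is only a stand-in, so the numbers will not match 1998 and 216.

```r
library(dplyr)

# Hypothetical stand-in for the plotted data
set.seed(1)
dat <- tibble(x = runif(50, 0, 300),
              y = 5.3 + 0.049 * x + rnorm(50, sd = 2))

# RSS for the flat line beta_0 = 16, beta_1 = 0
dat %>% summarise(rss = sum((y - 16)^2))

# RSS for the least squares line
fit <- lm(y ~ x, data = dat)
dat %>% summarise(rss = sum((y - fitted(fit))^2))
```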
Overlaying the true relationship in orange
Since we only observe a sample from the underlying distribution, we are not able to completely determine the true slope and intercept.
We are minimizing the residual sum of squares, RSS = \sum^n_{i=1}(y_i - \hat{y}_i)^2, which is minimized by
\begin{align} \hat{\beta}_1 &= \dfrac{\sum^n_{i=1}(x_i - \bar{x})(y_i - \bar{y})}{\sum^n_{i=1}(x_i - \bar{x})^2} \\ \hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} \end{align}
Where \bar{x} and \bar{y} are sample means of x and y
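A sketch that plugs these formulas in directly and checks the result against lm(), using a built-in dataset as the example:

```r
x <- mtcars$wt
y <- mtcars$mpg

# Closed-form least squares estimates
beta_1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta_0 <- mean(y) - beta_1 * mean(x)
c(beta_0, beta_1)

# The same estimates from lm()
coef(lm(mpg ~ wt, data = mtcars))
```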
Since this model is built on certain assumptions, we can calculate standard error estimates for each parameter estimate.
These standard errors can be used to determine if the estimates are significantly different from 0
There is an inverse relationship between the size of the effect and the number of observations needed to detect it.
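In R these standard errors are part of the usual summary() output; a minimal example on built-in data:

```r
fit <- lm(mpg ~ wt, data = mtcars)

# Estimate, standard error, t statistic, and p-value for each parameter
summary(fit)$coefficients
```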
How well does the model fit the data?
We want to know how well the model is performing.
Again, this is a measure of how far the predictions are from the actual observations.
Remember how the residual sum of squares (RSS) depended on the number of observations?
RSE = \sqrt{\dfrac{1}{n-2} RSS} = \sqrt{\dfrac{1}{n-2} \sum\limits^n_{i=1}(y_i - \hat{y}_i)^2}
The residual standard error takes care of this by normalizing by the degrees of freedom.
Interpretation:
RSE is the average amount that the response will deviate from the true regression line
RSE measures the lack of fit. Smaller values are better
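A small sketch computing the RSE from the formula above and checking it against summary(), again on built-in data:

```r
fit <- lm(mpg ~ wt, data = mtcars)

rss <- sum(residuals(fit)^2)
n <- nrow(mtcars)

sqrt(rss / (n - 2))    # RSE from the formula

summary(fit)$sigma     # what summary() reports as the residual standard error
```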
R^2 = 1 - \dfrac{RSS}{TSS}
where TSS = \sum(y_i - \bar{y})^2 is the total sum of squares
Interpretation:
R^2 is the proportion of variance explained
takes values between 0 and 1, higher being better
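And the corresponding sketch for R^2, computed from RSS and TSS and compared with summary():

```r
fit <- lm(mpg ~ wt, data = mtcars)

rss <- sum(residuals(fit)^2)
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

1 - rss / tss            # proportion of variance explained

summary(fit)$r.squared   # matches
```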
This is a simple extension,
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon
All the previous questions apply here with slightly different answers
When p=1 we have the question
Is there an association between Y and X
but when p > 1 the question becomes
Is at least one of the predictors X_1, X_2, ..., X_p useful in predicting the response
and
Which of the X's have an association with Y
Is at least one of the predictors X_1, X_2, ..., X_p useful in predicting the response
F = \dfrac{(TSS-RSS)/p}{RSS/(n-p-1)}
If the F-statistic is close to 1, then we suspect there is no relationship between the response and the predictors.
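A sketch of this F-statistic computed by hand for a three-predictor model on built-in data (the model is just an example), checked against the value summary() reports:

```r
fit <- lm(mpg ~ disp + drat + qsec, data = mtcars)

rss <- sum(residuals(fit)^2)
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
n <- nrow(mtcars)
p <- 3

((tss - rss) / p) / (rss / (n - p - 1))   # F-statistic from the formula

summary(fit)$fstatistic                   # value, numerator df, denominator df
```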
We will come back to this later
The linear model works well in a lot of cases.
But there are assumptions
If the assumptions are not met then the model will not be sound
If the error terms are correlated, we may have an unwarranted sense of confidence in our model
You should be careful when throwing out data that does not fit well into your model.
Depending on the domain, linear models perform badly if some of the observations are FAR away from the other points.
Function | Package | Code |
---|---|---|
`lda` | `MASS` | `predict(obj)` |
`glm` | `stats` | `predict(obj, type = "response")` |
`gbm` | `gbm` | `predict(obj, type = "response", n.trees)` |
`mda` | `mda` | `predict(obj, type = "posterior")` |
`rpart` | `rpart` | `predict(obj, type = "prob")` |
`Weka` | `RWeka` | `predict(obj, type = "probability")` |
`logitboost` | `LogitBoost` | `predict(obj, type = "raw", nIter)` |
blatantly stolen from Max Kuhn
The goal of parsnip is to provide a tidy, unified interface to models, so you don't have to remember each package's idiosyncrasies (like the predict() variations above).
```r
model_lm <- lm(mpg ~ disp + drat + qsec, data = mtcars)
```
```r
library(parsnip)

model_lm <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

model_lm
```

```
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm
```
```r
fit_lm <- model_lm %>%
  fit(mpg ~ disp + drat + qsec, data = mtcars)

fit_lm
```

```
## parsnip model object
## 
## Fit time: 1ms 
## 
## Call:
## stats::lm(formula = mpg ~ disp + drat + qsec, data = data)
## 
## Coefficients:
## (Intercept)         disp         drat         qsec  
##    11.52439     -0.03136      2.39184      0.40340
```
```r
predict(fit_lm, mtcars)
```

```
## # A tibble: 32 x 1
##    .pred
##    <dbl>
##  1  22.5
##  2  22.7
##  3  24.9
##  4  18.6
##  5  14.6
##  6  19.2
##  7  14.3
##  8  23.8
##  9  25.7
## 10  23.0
## # … with 22 more rows
```