07_Feature_Selection_Algorithms_Part1
University of Illinois, Urbana-Champaign
Course: 207 (Industrial Engineering)
Date: Apr 3, 2024
Unit 7: Feature Selection Algorithms – Part 1

Unit Overview
How can we efficiently look for a linear regression model that we can infer might provide the best predictions for new datasets?

Case Study: Predicting Chicago Airbnb Prices for New Datasets
In this unit we will continue to pursue our main research goal for the Chicago Airbnb listings by fitting more linear regression models.
Primary Goal: By incorporating interaction terms, we might build a linear regression model that has even better performance for new datasets (approximated by how well it does with the test dataset).

Unit Topics
1. Recapping research goal progress and machine learning concepts
2. Overfitting vs. underfitting a model
3. Pros and cons of overfitting vs. underfitting a model
4. Goal: Find a "parsimonious" model
5. Adjusted R^2: How to measure model parsimoniousness for linear regression models
6. Interpreting adjusted R^2 suggestions
7. Feature selection algorithms overview
8. Backwards elimination algorithm
9. Forward selection algorithm

Following Along in the e-Book
Check out Modules 9-00 to 9-05 in the e-Book if you'd like the "written"/"book" version of this unit lecture material: https://exploration.stat.illinois.edu/
1. RECAPPING RESEARCH GOAL PROGRESS AND MACHINE LEARNING CONCEPTS
Adding an explanatory variable to a model
• Training Dataset: If we were to add beds to this linear regression model, the training dataset R^2 is guaranteed to ___________________, ie. __________________.
  o In many cases it will _______________, ie. __________________.
• Test Dataset: However, if we were to add beds to this linear regression model, we saw in previous units that the test dataset R^2 is _____________, ie. _________________.
  o In this scenario, we say that we overfit the model by including beds.

Main Research Goal: Design a predictive model that will effectively predict the price of a Chicago Airbnb listing for newly listed Airbnbs, using the dataset that we were given.
2. OVERFITTING VS. UNDERFITTING A MODEL
Before exploring more linear regression candidate models for our Airbnb dataset, let's examine the artificial dataset below to learn more about:
1. what it means to overfit as well as underfit a model to a given training dataset, and
2. how overfitting vs. underfitting can happen.
Candidate Model 1: Simple Linear Regression Model
We fit the simple linear regression model plotted below with the training dataset.

ŷ = −0.06 + 1.003x

Candidate Model 2: Nonlinear Regression Model
We also fit the nonlinear regression model plotted below with the training dataset.
Question: Which model had the better training dataset fit?

Question: Which model had the better test dataset fit?

What we just saw here is an example of what we call overfitting a model to a given training dataset. That is, our nonlinear model was trained too well on the training dataset, such that the test dataset (or any other dataset that is assumed to have been randomly drawn from the same population) has dramatically worse performance. The excessive complexity of our nonlinear model ended up fitting "every nook and cranny" (ie. random fluctuations that existed in the training dataset) rather than meaningful trends that we would expect to roughly hold for any dataset drawn from the same population.
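This overfitting pattern can be reproduced on a small made-up dataset (a hypothetical sketch, not the unit's actual artificial data): fit both a straight line and a very flexible polynomial to the same noisy training points, then compare training vs. test R^2.

```python
import numpy as np

# Hypothetical data: the true trend is y = x, plus random noise.
rng = np.random.default_rng(42)
x = np.linspace(0, 1, 15)
y_train = x + rng.normal(0, 0.2, size=15)
y_test = x + rng.normal(0, 0.2, size=15)  # a fresh draw from the same population

def r2(y, yhat):
    """R^2 = 1 - SSE/SST."""
    sse = ((y - yhat) ** 2).sum()
    sst = ((y - y.mean()) ** 2).sum()
    return 1 - sse / sst

for degree in (1, 9):  # simple line vs. very flexible polynomial
    coefs = np.polyfit(x, y_train, degree)
    yhat = np.polyval(coefs, x)
    print(f"degree {degree}: train R^2 = {r2(y_train, yhat):.3f}, "
          f"test R^2 = {r2(y_test, yhat):.3f}")
```

The degree-9 polynomial always matches the training points at least as closely as the line (higher training R^2), but its test R^2 tends to be worse, because the extra flexibility was spent chasing the training noise rather than the y = x trend.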
OVERFITTING: Some ways a linear regression model can be too "complex" and overfit to the training dataset
If an explanatory variable is _________________ in an existing model (usually with other existing explanatory variables) and it ________________ bring enough "predictive power" to the model (ie. it _____________ increase the fit of the model enough) on its own, then the model may be overfit to the training dataset.

1. Including Irrelevant (ie. Unassociated) Explanatory Variables
Question: Suppose we were to fit two models with this dataset.

Candidate Model 1: height̂ = β̂0 + β̂1·right
Candidate Model 2: height̂ = β̂0 + β̂1·right + β̂2·random_variable + β̂3·(right · random_variable)

• Which model do you think would have the best fit for this given dataset?
• Which model do you think would have the best fit for ANOTHER dataset also randomly drawn from the same population?

2. Including Collinear Explanatory Variables

Linear Regression Model (predicts height with:)      Training R^2    Test R^2
• Right foot length and left foot length                 0.489          0.532
• Right foot length only                                 0.457          0.547
• Left foot length only                                  0.476          0.553
UNDERFITTING: Some ways a linear regression model can be too "simple" and underfit to the training dataset
If an explanatory variable is _________________ in an existing model (usually with other existing explanatory variables) and it ________________ bring enough "predictive power" to the model (ie. it ______________ increase the fit of the model enough) on its own, then the model may be underfit to the training dataset.

What is vague about this definition above?
3. PROS AND CONS OF OVERFITTING VS. UNDERFITTING A MODEL
Overfitting a Model: Including explanatory variables that don't bring enough predictive power.
Pros:
• The model, for the __________________ dataset, will have __________________ predictive power.
Cons:
• The model, for the __________________ dataset(s), may not have __________________ predictive power.
• The model may ________________ explanatory variables that have no _______________________________ with the response variable.

Underfitting a Model: Too few explanatory variables that bring enough predictive power.
Cons:
• The model, for the ______________________________ dataset(s), may not have __________________ predictive power.
• The model may ________________ explanatory variables that have _______________________________ with the response variable.
4. GOAL: FIND A "PARSIMONIOUS" MODEL

Definition
Because we do not want to underfit or overfit a model, our goal is to find the parsimonious model, which is a balance of the two. Specifically, a parsimonious model will find the ideal balance of:
• a __________________ number of explanatory variables to avoid _________________________, and
• a __________________ predictive power to avoid _________________________.
5. ADJUSTED R^2: HOW TO MEASURE MODEL PARSIMONIOUSNESS FOR LINEAR REGRESSION MODELS

What measure can we use to assess the parsimoniousness of a linear regression model?
We can use the adjusted R^2 of a linear regression model to measure its parsimoniousness. The equation for this metric is below, where p is the number of slopes in the linear regression model and n is the number of observations.

R^2_adj = 1 − (SSE/SST) · ((n − 1)/(n − p − 1))
Balancing Act when Adding an Explanatory Variable
Increasing the number of slopes in the model will cause
• Effect 1: the term ((n − 1)/(n − p − 1)) to ____________.
• Effect 2: the term (SSE/SST) to either ______________ or stay the same.

Which Effect on R^2_adj "Wins" when ADDING an Explanatory Variable?
• Adding an explanatory variable (ie. slope(s)) to the model that brings a large enough decrease in the training SSE of the model (ie. a large enough improvement of fit) will cause R^2_adj to _____________. This suggests the following about this explanatory variable.
  o ____________ overfit the model to the training dataset.
  o ____________ is fitting general trends that apply to the training, test, and new datasets.
  o By adding it we expect ____________ fit for the test and new datasets.
• Adding an explanatory variable (ie. slope(s)) to the model that brings a NOT large enough decrease in the training SSE of the model (ie. a NOT large enough improvement of fit) will cause R^2_adj to _____________. This suggests the following about this explanatory variable.
  o ____________ overfit the model to the training dataset.
  o ____________ is fitting general trends that apply to the training, test, and new datasets.
  o By adding it we expect ____________ fit for the test and new datasets.

Note how similar this looks to R^2 = 1 − (SSE/SST).
Remember that a linear regression model is trying to minimize the SSE (ie. get a better fit) of the training dataset. Giving your model more slopes gives it more of an opportunity to minimize training dataset SSE even further.
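The balancing act can be made concrete with a quick numeric sketch (the SSE/SST numbers below are made up for illustration): compute R^2_adj straight from the formula and compare what happens when an added slope barely reduces SSE versus when it reduces SSE substantially.

```python
def adjusted_r2(sse, sst, n, p):
    """R^2_adj = 1 - (SSE/SST) * ((n - 1) / (n - p - 1)),
    where n = number of observations and p = number of slopes."""
    return 1 - (sse / sst) * ((n - 1) / (n - p - 1))

# Hypothetical training fit: n = 101 observations, SST = 100.
base = adjusted_r2(sse=40.0, sst=100.0, n=101, p=4)       # current model
tiny_gain = adjusted_r2(sse=39.9, sst=100.0, n=101, p=5)  # add a weak slope
big_gain = adjusted_r2(sse=30.0, sst=100.0, n=101, p=5)   # add a strong slope

print(round(base, 4), round(tiny_gain, 4), round(big_gain, 4))
```

Adding the weak slope shrinks SSE by only 0.1, so Effect 1 (the growing (n − 1)/(n − p − 1) penalty) wins and R^2_adj drops below the current model's; the strong slope shrinks SSE enough that Effect 2 wins and R^2_adj rises.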
R^2_adj = 1 − (SSE/SST) · ((n − 1)/(n − p − 1))

Balancing Act when Deleting an Explanatory Variable
Decreasing the number of slopes in the model will cause
• Effect 1: the term ((n − 1)/(n − p − 1)) to ____________.
• Effect 2: the term (SSE/SST) to either ______________ or stay the same.

Which Effect on R^2_adj "Wins" when DELETING an Explanatory Variable?
• Deleting an explanatory variable (ie. slope(s)) from the model that brought a large enough increase in the training SSE of the model (ie. a large enough decline of fit) will cause R^2_adj to _____________. This suggests the following about this explanatory variable.
  o This explanatory variable ____________ fitting general trends that apply to the training, test, and new datasets.
  o By deleting it we ____________ underfitting the model.
  o By deleting it we expect ____________ fit for the test and new datasets.
• Deleting an explanatory variable (ie. slope(s)) from the model that DID NOT bring a large enough increase in the training SSE of the model (ie. NOT a large enough decline of fit) will cause R^2_adj to _____________. This suggests the following about this explanatory variable.
  o This explanatory variable ____________ fitting general trends that apply to the training, test, and new datasets, but is instead fitting random fluctuations in the training dataset.
  o By deleting it we ____________ underfitting the model.
  o By deleting it we expect ____________ fit for the test and new datasets.

Interpretation: The ________________ the R^2_adj, the more parsimonious the model is.

Interpretation WARNING! Note that the R^2_adj is often not put into words like the R^2. It just represents a measure of linear regression model parsimoniousness.
6. INTERPRETING ADJUSTED R^2 SUGGESTIONS

Recall that the neighborhood explanatory variable did not have a strong association with price. Could we be overfitting by including it in the current best model below? Let's see what the adjusted R^2 suggests.
Adjusted R^2 Interpretations
1. Which model is considered to be the more parsimonious model according to the adjusted R^2?
2. Thus, does the adjusted R^2 suggest that neighborhood brings "enough" predictive power to the model (in the presence of the other variables in the old model)?
3. Thus, does the adjusted R^2 suggest that we would be overfitting the model to the training dataset by including the neighborhood variable?
4. Thus, which model does the adjusted R^2 suggest will have better predictions for new datasets?

Corroborating Adjusted R^2 Interpretations with Another Technique
Let's consult another method that can evaluate similar questions by calculating the test R^2 of both models. Does the test R^2 agree with the interpretations the adjusted R^2 made?
NOTE: The suggestions/interpretations made by the test R^2 and the adjusted R^2 of models may not always agree! But if they do, it can help build more confidence in your claims.

Benefit of using the Adjusted R^2 over the Test R^2 to Make these Interpretations
7. FEATURE SELECTION ALGORITHMS OVERVIEW

Total Number of Possible Linear Regression Models (without interaction terms):
If we have P available explanatory variables to model in a linear regression model, how many possible linear regression models (without interaction terms) can we make and evaluate for whether they are the "most parsimonious"?

Feature Selection Algorithms Goals
Rather than calculate the R^2_adj of every single one of the ________ possible linear regression models, we can use what we call feature selection algorithms (ie. explanatory variable selection algorithms) to help us find a linear regression model that:
o is considered "parsimonious enough" (for instance, has a "high enough" adjusted R^2)
o in a relatively low amount of time.

Feature Selection Algorithms Overview
1. Backwards Elimination Algorithm with R^2_adj
2. Forward Selection Algorithm with R^2_adj
3. Regularized Linear Regression Methods
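As a sanity check on the model count mentioned above: each of the P available explanatory variables is either included in or excluded from a model, so there are 2^P possible subsets (counting the intercept-only model). This quick sketch shows how fast that count grows, which is why exhaustively computing R^2_adj for every candidate is impractical:

```python
def num_possible_models(P):
    """Each of P variables is in or out of the model, giving 2**P
    possible models (no interaction terms, intercept-only included)."""
    return 2 ** P

for P in (5, 10, 20, 30):
    print(P, num_possible_models(P))  # 30 variables already means over a billion models
```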
8. BACKWARDS ELIMINATION ALGORITHM
We can use what we call a backwards elimination algorithm to help us select a good combination of explanatory variables to include in a model that achieves some goal we are trying to reach. For instance, below is how we can use a backwards elimination algorithm that finds a model with _____________________; however, you can use this same structure to select a model that optimizes some other metric as well.

Backwards Elimination Algorithm
Goal: Find the linear regression model with the highest adjusted R^2.
Steps:
1. Fit a "current model" and find the adjusted R^2 of this model. In the beginning, your "current model" should include all possible explanatory variables you are considering.
2. For each explanatory variable in your "current model", do the following:
   a. Fit a "test model". Your "test model" should include every explanatory variable in the "current model" except for the explanatory variable you are considering.
   b. Find the adjusted R^2 of this "test model".
3. If NONE of the "test models" from step (2) had an adjusted R^2 that was HIGHER than the adjusted R^2 of the "current model", then STOP THE ALGORITHM and return the "current model" as your "final model".
4. Otherwise, choose the "test model" from step (2) that had the HIGHEST adjusted R^2 and set your new "current model" to be this "test model". Then go back to step (2).
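The steps above can be sketched in code. This is a minimal NumPy illustration on made-up data and column names (x1, x2, x3 are hypothetical), not the course's implementation: it fits each candidate model by ordinary least squares and keeps deleting whichever variable's removal most improves the adjusted R^2.

```python
import numpy as np

def adj_r2(X, y):
    """Fit OLS with an intercept and return the adjusted R^2."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    sse = ((y - A @ beta) ** 2).sum()
    sst = ((y - y.mean()) ** 2).sum()
    return 1 - (sse / sst) * ((n - 1) / (n - p - 1))

def backward_elimination(X, y, names):
    """Steps 1-4 above: start with all variables, drop while R^2_adj improves."""
    current = list(range(X.shape[1]))
    best = adj_r2(X[:, current], y)                       # step 1
    while len(current) > 1:
        tests = [(adj_r2(X[:, [j for j in current if j != i]], y), i)
                 for i in current]                        # step 2
        score, drop = max(tests)
        if score <= best:                                 # step 3: stop
            break
        best = score                                      # step 4: keep best test model
        current.remove(drop)
    return [names[i] for i in current], best

# Hypothetical demo data: only x1 truly predicts y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2 * X[:, 0] + rng.normal(size=200)
kept, score = backward_elimination(X, y, ["x1", "x2", "x3"])
print(kept, round(score, 3))
```

Since removing x1 would crater the adjusted R^2, it survives the eliminations, while the noise variables are candidates for removal at each iteration.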
Question: Should we stop the algorithm or continue on to the next iteration?
9. FORWARD SELECTION ALGORITHM
We can use what we call a forward selection algorithm to help us select a good combination of explanatory variables to include in a model that achieves some goal we are trying to reach. For instance, below is how we can use a forward selection algorithm that finds a model with _____________________; however, you can use this same structure to select a model that optimizes some other metric as well.

Forward Selection Algorithm
Goal: Find the linear regression model with the highest adjusted R^2.
Steps:
1. Fit a "current model" and find the adjusted R^2 of this model. In the beginning, your "current model" should include none of the possible explanatory variables you are considering.
2. For each explanatory variable NOT yet in your "current model", do the following:
   a. Fit a "test model". Your "test model" should include every explanatory variable in the "current model" in addition to the explanatory variable you are considering.
   b. Find the adjusted R^2 of this "test model".
3. If NONE of the "test models" from step (2) had an adjusted R^2 that was HIGHER than the adjusted R^2 of the "current model", then STOP THE ALGORITHM and return the "current model" as your "final model".
4. Otherwise, choose the "test model" from step (2) that had the HIGHEST adjusted R^2 and set your new "current model" to be this "test model". Then go back to step (2).
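The forward steps can be sketched the same way. Again this is a minimal NumPy illustration on made-up data and hypothetical column names, not the course's implementation: it starts from the intercept-only model (whose adjusted R^2 is 0, since SSE = SST) and keeps adding whichever remaining variable most improves the adjusted R^2.

```python
import numpy as np

def adj_r2(X, y):
    """Fit OLS with an intercept and return the adjusted R^2."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    sse = ((y - A @ beta) ** 2).sum()
    sst = ((y - y.mean()) ** 2).sum()
    return 1 - (sse / sst) * ((n - 1) / (n - p - 1))

def forward_selection(X, y, names):
    """Steps 1-4 above: start empty, add variables while R^2_adj improves."""
    remaining = list(range(X.shape[1]))
    current = []
    best = 0.0  # step 1: intercept-only model has SSE = SST, so R^2_adj = 0
    while remaining:
        tests = [(adj_r2(X[:, current + [i]], y), i)
                 for i in remaining]                      # step 2
        score, add = max(tests)
        if score <= best:                                 # step 3: stop
            break
        best = score                                      # step 4: keep best test model
        current.append(add)
        remaining.remove(add)
    return [names[i] for i in current], best

# Hypothetical demo data: only x1 truly predicts y.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 2 * X[:, 0] + rng.normal(size=200)
chosen, score = forward_selection(X, y, ["x1", "x2", "x3"])
print(chosen, round(score, 3))
```

Here x1 is added on the first iteration, since it alone drives the fit; the noise variables only enter if they happen to raise the adjusted R^2.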
Question: What should we do at the end of this iteration?
Question: What should we do at the end of this iteration?
Question: What should we do at the end of this iteration?
Question: What should we do at the end of this iteration?

Question: What should we do at the end of this iteration?
Final Model Conclusions from Backwards Elimination and Forward Selection Algorithms

Caveat: Forward selection algorithms and backwards elimination algorithms that try to find a model with a _______________ _____________________ are not guaranteed to find the model with the ________________ ________________ out of all possible models!

Additional Insights