07_Feature_Selection_Algorithms_Part1

School: University of Illinois, Urbana-Champaign
Course: 207
Subject: Industrial Engineering
Date: Apr 3, 2024
Pages: 25
Unit 7: Feature Selection Algorithms – Part 1

Unit Overview

How can we efficiently look for a linear regression model that we can infer might provide the best predictions for new datasets?

Case Study: Predicting Chicago Airbnb Prices for New Datasets

In this unit we will continue to pursue our main research goal for the Chicago Airbnb listings by fitting more linear regression models.

Primary Goal: By incorporating interaction terms, we might build a linear regression model that has even better performance for new datasets (approximated by how well it does with the test dataset).

Unit Topics

1. Recapping research goal progress and machine learning concepts
2. Overfitting vs. underfitting a model
3. Pros and cons of overfitting vs. underfitting a model
4. Goal: Find a "parsimonious" model
5. Adjusted R^2: How to measure model parsimoniousness for linear regression models
6. Interpreting adjusted R^2 suggestions
7. Feature selection algorithms overview
8. Backwards elimination algorithm
9. Forward selection algorithm

Following Along in the e-Book

Check out Modules 9-00 to 9-05 in the e-book if you'd like the "written"/"book" version of this unit lecture material: https://exploration.stat.illinois.edu/
1. Recapping Research Goal Progress and Machine Learning Concepts

Adding an explanatory variable to a model

Training Dataset: If we were to add beds to this linear regression model, the training dataset R^2 is guaranteed to ___________________, i.e., __________________.
  o In many cases it will _______________, i.e., __________________.

Test Dataset: However, if we were to add beds to this linear regression model, we saw in previous units that the test dataset R^2 is _____________, i.e., _________________.
  o In this scenario, we say that we overfit the model by including beds.

Main Research Goal: Design a predictive model that will effectively predict the price of a Chicago Airbnb listing for newly listed Airbnbs, using the dataset that we were given.
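The training-dataset guarantee discussed above can be checked numerically. The sketch below uses simulated data (not the Airbnb dataset, and the variable names are made up): it fits a model with and without an extra pure-noise explanatory variable. Because least squares can always assign the new column a coefficient of zero, the training R^2 can never decrease when a variable is added.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)            # a genuinely useful predictor
x2 = rng.normal(size=n)            # pure noise, unrelated to y
y = 2 + 3 * x1 + rng.normal(size=n)

def train_r2(columns, y):
    """Training R^2 of an OLS fit of y on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ beta) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1 - sse / sst

r2_small = train_r2([x1], y)
r2_big = train_r2([x1, x2], y)
print(r2_big >= r2_small)   # True: adding a variable never hurts the TRAINING fit
```

Whether the same addition helps on a test dataset is a separate question, which is exactly the overfitting concern this unit examines.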
2. Overfitting vs. Underfitting a Model

Before exploring more linear regression candidate models for our Airbnb dataset, let's examine the artificial dataset below to learn more about:
1. what it means to overfit as well as underfit a model to a given training dataset, and
2. how overfitting vs. underfitting can happen.
Candidate Model 1: Simple Linear Regression Model

We fit the simple linear regression model plotted below with the training dataset.

ŷ = 0.06 + 1.003x

Candidate Model 2: Nonlinear Regression Model

We also fit the nonlinear regression model plotted below with the training dataset.
Question: Which model had the better training dataset fit?

Question: Which model had the better test dataset fit?

What we just saw here is an example of what we call overfitting a model to a given training dataset. That is, our nonlinear model was trained too closely to the training dataset, such that performance on the test dataset (or any other dataset assumed to have been randomly drawn from the same population) is dramatically worse. The excessive complexity of our nonlinear model ended up fitting "every nook and cranny" (i.e., random fluctuations that existed in the training dataset) rather than meaningful trends that we would expect to roughly hold for any dataset drawn from the same population.
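The pattern above can be reproduced with a small simulation (a sketch with made-up data, not the dataset from the plots): the truth is roughly the simple line ŷ = 0.06 + 1.003x, and a high-degree polynomial stands in for the flexible nonlinear model.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 3, 30)
y = 0.06 + 1.003 * x + rng.normal(scale=0.4, size=30)   # roughly linear truth
x_tr, y_tr = x[::2], y[::2]      # training dataset
x_te, y_te = x[1::2], y[1::2]    # test dataset

def r2(coef, x, y):
    resid = y - np.polyval(coef, x)
    return 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

line = np.polyfit(x_tr, y_tr, deg=1)     # Candidate Model 1: simple line
wiggly = np.polyfit(x_tr, y_tr, deg=8)   # Candidate Model 2: very flexible

# The flexible model always fits the training data at least as well...
print(r2(wiggly, x_tr, y_tr) >= r2(line, x_tr, y_tr))   # True
# ...but its extra wiggles typically hurt on the test data (inspect to see):
print(round(r2(line, x_te, y_te), 3), round(r2(wiggly, x_te, y_te), 3))
```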
OVERFITTING: Some ways a linear regression model can be too "complex" and overfit to the training dataset

If an explanatory variable is _________________ in an existing model (usually with other existing explanatory variables) and it ________________ bring enough "predictive power" to the model (i.e., it _____________ increase the fit of the model enough) on its own, then the model may be overfit to the training dataset.

1. Including Irrelevant (i.e., Unassociated) Explanatory Variables

Question: Suppose we were to fit two models with this dataset.

Candidate Model 1: height̂ = β̂0 + β̂1 · right
Candidate Model 2: height̂ = β̂0 + β̂1 · right + β̂2 · random variable + β̂3 · right · random variable

Which model do you think would have the best fit for this given dataset?
Which model do you think would have the best fit for ANOTHER dataset also randomly drawn from the same population?

2. Including Collinear Explanatory Variables

Linear Regression Model                                  Training R^2    Test R^2
Predicts height: right foot length, left foot length     0.489           0.532
Predicts height: right foot length                       0.457           0.547
Predicts height: left foot length                        0.476           0.553
UNDERFITTING: Some ways a linear regression model can be too "simple" and underfit to the training dataset

If an explanatory variable is _________________ in an existing model (usually with other existing explanatory variables) and it ________________ bring enough "predictive power" to the model (i.e., it ______________ increase the fit of the model enough) on its own, then the model may be underfit to the training dataset.

What is vague about this definition above?
3. Pros and Cons of Overfitting vs. Underfitting a Model

Overfitting a Model: Including explanatory variables that don't bring enough predictive power.

Pros:
  o The model, for the __________________ dataset, will have __________________ predictive power.

Cons:
  o The model, for the __________________ dataset(s), may not have __________________ predictive power.
  o The model may ________________ explanatory variables that have no _______________________________ with the response variable.

Underfitting a Model: Too few explanatory variables that bring enough predictive power.

Cons:
  o The model, for the ______________________________ dataset(s), may not have __________________ predictive power.
  o The model may ________________ explanatory variables that have _______________________________ with the response variable.
4. Goal: Find a "Parsimonious" Model

Definition

Because we do not want to underfit or overfit a model, our goal is to find a parsimonious model, which strikes a balance between the two. Specifically, a parsimonious model will find the ideal balance of:
  o a __________________ number of explanatory variables to avoid _________________________, and
  o a __________________ predictive power to avoid _________________________.
5. Adjusted R^2: How to Measure Model Parsimoniousness for Linear Regression Models

What measure can we use to assess the parsimoniousness of a linear regression model? We can use the adjusted R^2 of a linear regression model to measure the parsimoniousness of that linear regression model. The equation for this metric is below, where p is the number of slopes in the linear regression model:

R^2_adj = 1 − (SSE / SST) · (n − 1) / (n − p − 1)

Note how similar this looks to R^2 = 1 − SSE / SST.

Balancing Act when Adding an Explanatory Variable

Increasing the number of slopes in the model will cause:
  o Effect 1: the term (n − 1) / (n − p − 1) to ____________.
  o Effect 2: the term SSE / SST to either ______________ or stay the same.

Remember that a linear regression model is trying to minimize the SSE (i.e., get a better fit) of the training dataset. Giving your model more slopes gives it more of an opportunity to minimize the training dataset SSE even further.

Which Effect on R^2_adj "Wins" when ADDING an Explanatory Variable?

Adding an explanatory variable (i.e., slope(s)) to the model that brings a large enough decrease in the training SSE of the model (i.e., a large enough improvement of fit) will cause R^2_adj to _____________. This suggests the following about this explanatory variable:
  o ____________ overfit the model to the training dataset.
  o ____________ is fitting general trends that apply to the training, test, and new datasets.
  o By adding it we expect ____________ fit for the test and new datasets.

Adding an explanatory variable (i.e., slope(s)) to the model that brings a NOT large enough decrease in the training SSE of the model (i.e., a NOT large enough improvement of fit) will cause R^2_adj to _____________. This suggests the following about this explanatory variable:
  o ____________ overfit the model to the training dataset.
  o ____________ is fitting general trends that apply to the training, test, and new datasets.
  o By adding it we expect ____________ fit for the test and new datasets.
R^2_adj = 1 − (SSE / SST) · (n − 1) / (n − p − 1)

Balancing Act when Deleting an Explanatory Variable

Decreasing the number of slopes in the model will cause:
  o Effect 1: the term (n − 1) / (n − p − 1) to ____________.
  o Effect 2: the term SSE / SST to either ______________ or stay the same.

Which Effect on R^2_adj "Wins" when DELETING an Explanatory Variable?

Deleting an explanatory variable (i.e., slope(s)) from the model that brought a large enough increase in the training SSE of the model (i.e., a large enough decline of fit) will cause R^2_adj to _____________. This suggests the following about this explanatory variable:
  o This explanatory variable ____________ fitting general trends that apply to the training, test, and new datasets.
  o By deleting it we ____________ underfitting the model.
  o By deleting it we expect ____________ fit to the test and new datasets.

Deleting an explanatory variable (i.e., slope(s)) from the model that DID NOT bring a large enough increase in the training SSE of the model (i.e., NOT a large enough decline of fit) will cause R^2_adj to _____________. This suggests the following about this explanatory variable:
  o This explanatory variable ____________ fitting general trends that apply to the training, test, and new datasets, but instead fitting random fluctuations in the training dataset.
  o By deleting it we ____________ underfitting the model.
  o By deleting it we expect ____________ fit to the test and new datasets.

Interpretation: The ________________ the R^2_adj, the more parsimonious the model is.

Interpretation WARNING! Note that the R^2_adj is often not put into words like the R^2 is. It just represents a measure of linear regression model parsimoniousness.
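The balancing act can be made concrete with a direct implementation of the formula. The numbers below are hypothetical (not from the Airbnb data), chosen so that an added slope's SSE reduction is too small to pay for the extra parameter:

```python
def adjusted_r2(sse, sst, n, p):
    """R^2_adj = 1 - (SSE/SST) * (n - 1)/(n - p - 1), with p = number of slopes."""
    return 1 - (sse / sst) * (n - 1) / (n - p - 1)

# Hypothetical training-fit numbers with n = 100 observations:
# Model A has 3 slopes with SSE = 40; Model B adds a weak 4th slope, SSE = 39.8.
r2adj_a = adjusted_r2(sse=40.0, sst=100.0, n=100, p=3)
r2adj_b = adjusted_r2(sse=39.8, sst=100.0, n=100, p=4)
print(round(r2adj_a, 4), round(r2adj_b, 4))   # 0.5875 0.5852
print(r2adj_a > r2adj_b)   # True: the small SSE drop did not "win"; R^2_adj fell
```

Here Effect 2 (SSE/SST shrinks slightly) loses to Effect 1 ((n − 1)/(n − p − 1) grows), so the adjusted R^2 suggests the fourth slope overfits.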
6. Interpreting Adjusted R^2 Suggestions

Recall that the neighborhood explanatory variable did not have a strong association with price. Could we be overfitting by including it in the current best model below? Let's see what the adjusted R^2 suggests.
Adjusted R^2 Interpretations

1. Which model is considered to be the more parsimonious model according to the adjusted R^2?
2. Thus, does the adjusted R^2 suggest that neighborhood brings "enough" predictive power to the model (in the presence of the other variables in the old model)?
3. Thus, does the adjusted R^2 suggest that we would be overfitting the model to the training dataset by including the neighborhood variable?
4. Thus, which model does the adjusted R^2 suggest will have better predictions for new datasets?

Corroborating Adjusted R^2 Interpretations with Another Technique

Let's consult another method that can evaluate similar questions by calculating the test R^2 of both models. Does the test R^2 agree with the interpretations the adjusted R^2 made?
NOTE: The suggestions/interpretations made by the test R^2 and the adjusted R^2 of models may not always agree! But if they do, it can help build more confidence in your claims.

Benefit of Using the Adjusted R^2 over the Test R^2 to Make These Interpretations
7. Feature Selection Algorithms Overview

Total Number of Possible Linear Regression Models (without interaction terms)

If we have P available explanatory variables to model in a linear regression model, how many possible linear regression models (without interaction terms) can we make and evaluate to decide which is the "most parsimonious"?

Feature Selection Algorithm Goals

Rather than calculate the R^2_adj of every single possible ________ linear regression models, we can use what we call feature selection algorithms (i.e., explanatory variable selection algorithms) to help us find a linear regression model that:
  o is considered "parsimonious enough" (for instance, has a "high enough" adjusted R^2)
  o in a relatively low amount of time.

Feature Selection Algorithms Overview

1. Backwards elimination algorithm with R^2_adj
2. Forward selection algorithm with R^2_adj
3. Regularized linear regression methods
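Assuming the count asked for above is 2^P (each of the P variables is either in or out of the model), exhaustively scoring every subset quickly becomes infeasible as P grows, which is the motivation for the algorithms below. A quick sketch of the count:

```python
from itertools import combinations

P = 10   # hypothetical number of available explanatory variables
# One model per subset of the P variables (the empty subset is the
# intercept-only model), so there are 2**P candidate models in total.
models = [combo for r in range(P + 1) for combo in combinations(range(P), r)]
print(len(models), 2 ** P)   # 1024 1024
```

At P = 10 this is still enumerable, but each additional variable doubles the count, so a realistic P makes exhaustive search impractical.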
8. Backwards Elimination Algorithm

We can use what we call a backwards elimination algorithm to help us select a good combination of explanatory variables to include in a model that achieves some goal we are trying to reach. For instance, below is how we can use a backwards elimination algorithm that finds a model with _____________________; however, you can use this same structure to select a model that optimizes some other metric as well.

Backwards Elimination Algorithm

Goal: Find the linear regression model with the highest adjusted R^2.

Steps:
1. Fit a "current model" and find the adjusted R^2 of this model. In the beginning, your "current model" should include all possible explanatory variables you are considering.
2. For each explanatory variable in your "current model", do the following:
   a. Fit a "test model". Your "test model" should include every explanatory variable in the "current model" except for the explanatory variable you are considering.
   b. Find the adjusted R^2 of this "test model".
3. If NONE of the "test models" from step (2) had an adjusted R^2 that was HIGHER than the adjusted R^2 of the "current model", then STOP THE ALGORITHM and return the "current model" as your "final model".
4. Otherwise, choose the "test model" from step (2) that had the HIGHEST adjusted R^2 and set your new "current model" to be this "test model". Then go back to step (2).
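The four steps above can be sketched directly. The scoring table below is hypothetical (made-up adjusted R^2 values, not the Airbnb results), and the variable names are placeholders:

```python
def backward_elimination(all_vars, adj_r2):
    """Greedy backward elimination: repeatedly drop the single variable whose
    removal yields the highest adjusted R^2; stop when no removal improves it."""
    current = frozenset(all_vars)                        # step 1: full model
    current_score = adj_r2(current)
    while current:
        # Step 2: one "test model" per variable, dropping just that variable.
        trials = [(adj_r2(current - {v}), current - {v}) for v in current]
        best_score, best_model = max(trials, key=lambda t: t[0])
        if best_score <= current_score:                  # step 3: stop
            break
        current, current_score = best_model, best_score  # step 4: iterate
    return current, current_score

# Hypothetical adjusted-R^2 values for every model the search can visit:
scores = {frozenset({"beds", "rooms", "nbhd"}): 0.60,
          frozenset({"beds", "rooms"}): 0.62,      # dropping nbhd helps
          frozenset({"beds", "nbhd"}): 0.55,
          frozenset({"rooms", "nbhd"}): 0.50,
          frozenset({"beds"}): 0.58,
          frozenset({"rooms"}): 0.45}
model, score = backward_elimination({"beds", "rooms", "nbhd"}, scores.__getitem__)
print(sorted(model), score)   # ['beds', 'rooms'] 0.62
```

Note that the search stops as soon as no single-variable removal raises the adjusted R^2, which is why it runs far fewer fits than scoring all 2^P models.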
Question: Should we stop the algorithm or continue on to the next iteration?
9. Forward Selection Algorithm

We can use what we call a forward selection algorithm to help us select a good combination of explanatory variables to include in a model that achieves some goal we are trying to reach. For instance, below is how we can use a forward selection algorithm that finds a model with _____________________; however, you can use this same structure to select a model that optimizes some other metric as well.

Forward Selection Algorithm

Goal: Find the linear regression model with the highest adjusted R^2.

Steps:
1. Fit a "current model" and find the adjusted R^2 of this model. In the beginning, your "current model" should include none of the possible explanatory variables you are considering.
2. For each explanatory variable you are considering that is NOT in your "current model", do the following:
   a. Fit a "test model". Your "test model" should include every explanatory variable in the "current model" in addition to the explanatory variable you are considering.
   b. Find the adjusted R^2 of this "test model".
3. If NONE of the "test models" from step (2) had an adjusted R^2 that was HIGHER than the adjusted R^2 of the "current model", then STOP THE ALGORITHM and return the "current model" as your "final model".
4. Otherwise, choose the "test model" from step (2) that had the HIGHEST adjusted R^2 and set your new "current model" to be this "test model". Then go back to step (2).
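These steps can be sketched the same way as backwards elimination, starting from the empty model and adding variables. The scoring table is again hypothetical (made-up adjusted R^2 values with placeholder names):

```python
def forward_selection(all_vars, adj_r2):
    """Greedy forward selection: repeatedly add the single variable whose
    inclusion yields the highest adjusted R^2; stop when no addition helps."""
    current, current_score = frozenset(), adj_r2(frozenset())  # step 1: empty
    remaining = set(all_vars)
    while remaining:
        # Step 2: one "test model" per variable NOT yet in the current model.
        trials = [(adj_r2(current | {v}), v) for v in remaining]
        best_score, best_var = max(trials, key=lambda t: t[0])
        if best_score <= current_score:                  # step 3: stop
            break
        current = current | {best_var}                   # step 4: iterate
        current_score = best_score
        remaining.discard(best_var)
    return current, current_score

# Hypothetical adjusted-R^2 values for every model the search can visit:
scores = {frozenset(): 0.00,
          frozenset({"beds"}): 0.58, frozenset({"rooms"}): 0.45,
          frozenset({"nbhd"}): 0.20,
          frozenset({"beds", "rooms"}): 0.62, frozenset({"beds", "nbhd"}): 0.59,
          frozenset({"beds", "rooms", "nbhd"}): 0.60}
model, score = forward_selection({"beds", "rooms", "nbhd"}, scores.__getitem__)
print(sorted(model), score)   # ['beds', 'rooms'] 0.62
```

In this toy table the search adds beds, then rooms, then stops because adding nbhd would lower the adjusted R^2.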
Question: What should we do at the end of this iteration?
Question: What should we do at the end of this iteration?
Question: What should we do at the end of this iteration?
Question: What should we do at the end of this iteration?

Question: What should we do at the end of this iteration?
Final Model Conclusions from Backwards Elimination and Forward Selection Algorithms

Caveat: Forward selection algorithms and backwards elimination algorithms that try to find a model with a _______________ _____________________ are not guaranteed to find the model with the ________________ ________________ out of all possible models!

Additional Insights
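The caveat above can be demonstrated with a contrived scoring table (hypothetical adjusted R^2 values, placeholder variable names a, b, c): two variables that are only strong together will be missed by a greedy forward search that commits to a different variable first.

```python
# Hypothetical adjusted-R^2 table: a and b are only strong *together*,
# so a greedy forward search that starts with c never finds them.
scores = {frozenset(): 0.00, frozenset("a"): 0.10, frozenset("b"): 0.05,
          frozenset("c"): 0.30, frozenset("ab"): 0.90, frozenset("ac"): 0.32,
          frozenset("bc"): 0.31, frozenset("abc"): 0.31}

# Exhaustive search over all 2^3 = 8 models finds the true optimum.
best = max(scores, key=scores.get)

# Greedy forward selection, as described in section 9.
current = frozenset()
while True:
    adds = [current | {v} for v in set("abc") - current]
    if not adds or max(scores[m] for m in adds) <= scores[current]:
        break
    current = max(adds, key=scores.get)

print(sorted(best), sorted(current))   # ['a', 'b'] vs ['a', 'c']
```

The greedy path (empty → {c} → {a, c}) stops at adjusted R^2 = 0.32 even though {a, b} scores 0.90, illustrating why these algorithms only promise a "parsimonious enough" model, not the best one.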