Lecture 14: Model Selection and Validation
Georgia Institute of Technology, Course 6254
Project organization Project proposals due March 14 (~1.5 weeks) I would like to make sure everyone has a team, so I want to add a new deadline… By TODAY please go to the link posted on Piazza (https://goo.gl/p5nTxb) and add your team’s details to the spreadsheet: team members tentative project title campus(es) where team members are located number of team members whether you are potentially open to adding more members Exam Details – Wed 3/7/18 • Coverage: HW #1-3 Also lectures through the lecture on the VC bound (from Feb 19). The midterm will not cover lecture material after Feb 19. The following are not on the exam: § Regression, Tikhonov Regularization, Bias and Variance of Regression Function Sets, LASSO, etc. A single sheet of notes (front and back) allowed 75 minute time limit (3:00 PM - 4:15 PM) No calculators allowed Sample questions are posted Given a set , find a function that minimizes More complex Less complex We must carefully limit “complexity” to avoid overfitting better chance of approximating the ideal classifier/function Approximation-generalization tradeoff better chance of generalizing to new data (out of sample) Approximation-generalization tradeoff “Complexity” of hypothesis set Error Out-of-sample error In-sample error generalization error
Approximation-generalization tradeoff, revisited
• [Figure: the same error-versus-complexity plot, with the out-of-sample error decomposed into a bias term (dominant for simple hypothesis sets) and a variance term (dominant for complex ones).]

Learning curve – a simple model
• [Figure: expected error versus the number of data points $n$; the in-sample and out-of-sample errors converge quickly, but to a relatively high bias level.]

Learning curve – a complex model
• [Figure: expected error versus $n$; the bias level is lower, but the in-sample and out-of-sample errors converge more slowly.]

Bias-variance decomposition: what is it good for?
• Practically, it is impossible to compute the bias and variance exactly, but we can estimate them empirically:
  – split the data into training and test sets
  – split the training data into many different subsets and estimate a classifier/regressor on each
  – compute the bias and variance using the results and the test set
• In reality, just like with the VC bound, the decomposition is more useful as a conceptual tool than as a practical technique.
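The empirical recipe above can be sketched in a few lines. This is my own illustration with a hypothetical cubic regressor and synthetic data; note that using test labels in place of the unknown true function means the "bias" term also absorbs label noise.

```python
# A rough sketch of the empirical bias/variance estimate described above:
# train the same model on many random subsets of the training data and use a
# held-out test set.
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: y = sin(pi x) + noise
x = rng.uniform(-1, 1, 500)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(500)
x_train, y_train, x_test, y_test = x[:300], y[:300], x[300:], y[300:]

degree, n_subsets, subset_size = 3, 200, 50
preds = np.empty((n_subsets, len(x_test)))
for b in range(n_subsets):
    idx = rng.choice(len(x_train), size=subset_size, replace=False)
    coeffs = np.polyfit(x_train[idx], y_train[idx], degree)
    preds[b] = np.polyval(coeffs, x_test)

avg_pred = preds.mean(axis=0)                      # estimate of E[h(x)]
variance = preds.var(axis=0).mean()                # spread across training subsets
bias_sq = np.mean((avg_pred - y_test) ** 2)        # note: also absorbs label noise
print(f"estimated variance {variance:.3f}, estimated bias^2 (+ noise) {bias_sq:.3f}")
```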
Developing a good learning model
• The bias-variance decomposition gives us a useful way to think about how to develop improved learning models.
• Reduce variance (without significantly increasing the bias):
  – limit model complexity (e.g., the polynomial order in regression)
  – regularization, which can be counterintuitive (e.g., Stein's paradox)
  – typically can be done through general-purpose techniques
• Reduce bias (without significantly increasing the variance):
  – exploit prior information to steer the model in the correct direction
  – typically application specific

Example: Tikhonov regularization
• Least squares is an unbiased estimator, but it can have high variance.
• Tikhonov regularization deliberately introduces bias into the estimator (shrinking it towards the origin).
• The slight increase in bias can buy us a huge decrease in the variance, especially when some variables are highly correlated.
• The trick is figuring out just how much bias to introduce…

Model selection
• In statistical learning, a model is a mathematical representation of a function, such as a classifier, a regression function, a density, etc.
• In many cases, we have one (or more) "free parameters" that are not automatically determined by the learning algorithm.
• Often, the value chosen for these free parameters has a significant impact on the algorithm's output.
• The problem of selecting values for these free parameters is called model selection.

Examples of free parameters
Method                    Parameter
polynomial regression     polynomial degree
ridge regression/LASSO    regularization parameter
robust regression         loss function parameter, regularization parameter
SVMs                      margin violation cost
kernel methods            kernel choice/parameters
regularized LR            regularization parameter
k-nearest neighbors       number of neighbors
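To make the Tikhonov example and the role of the free parameter concrete, here is a hedged sketch (my own example, not from the slides): with nearly collinear features, ordinary least squares has large coefficient variance, while ridge regression trades a little bias for a large variance reduction. The regularization strength alpha = 1 is an arbitrary choice; it is exactly the kind of free parameter that model selection must pick.

```python
# Compare coefficient mean and variance of OLS vs. ridge (Tikhonov) regression
# over many random data sets with highly correlated features.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(2)
true_w = np.array([1.0, 1.0])

def coef_spread(model, n_trials=500, n=30):
    coefs = []
    for _ in range(n_trials):
        x1 = rng.standard_normal(n)
        x2 = x1 + 0.05 * rng.standard_normal(n)     # nearly collinear feature
        X = np.column_stack([x1, x2])
        y = X @ true_w + 0.5 * rng.standard_normal(n)
        coefs.append(model.fit(X, y).coef_)
    coefs = np.array(coefs)
    return coefs.mean(axis=0), coefs.var(axis=0)

for name, model in [("OLS", LinearRegression()), ("ridge (alpha=1)", Ridge(alpha=1.0))]:
    mean_w, var_w = coef_spread(model)
    print(f"{name:15s} mean coef {mean_w.round(2)}, coef variance {var_w.round(3)}")
```

The ridge coefficients are pulled slightly toward the origin (bias) but fluctuate far less from data set to data set (variance).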
Model selection dilemma
• We need to select appropriate values for the free parameters, and all we have is the training data, so we must use the training data to select the parameters.
• However, these free parameters usually control the balance between underfitting and overfitting.
• They were left "free" precisely because we don't want to let the training data influence their selection, as this almost always leads to overfitting.
• For example, if we let the training data determine the degree in polynomial regression, we will just end up choosing the maximum degree and doing interpolation.

Big picture
• For much of this class, we have focused on trying to understand learning via decompositions of the form
  $R(h) \le \hat{R}_n(h) + (\text{complexity penalty})$,
  where the penalty is controlled by the VC dimension or by regularization.
• Validation takes another approach: after we have selected $h$, why not just try (a little harder) to estimate $R(h)$ directly?

Validation
• Suppose that in addition to our training data, we also have a validation set $V$ of $K$ points.
• Use the validation set to form the estimate
  $\hat{R}_V(h) = \frac{1}{K} \sum_{(x_i, y_i) \in V} \ell(h(x_i), y_i)$.
• Examples:
  – Classification: $\ell(h(x_i), y_i) = \mathbb{1}\{h(x_i) \ne y_i\}$
  – Regression: $\ell(h(x_i), y_i) = (h(x_i) - y_i)^2$

Accuracy of validation
• What can we say about the accuracy of $\hat{R}_V(h)$?
• In the case of classification, $\mathbb{1}\{h(x_i) \ne y_i\}$ is just a Bernoulli random variable, so Hoeffding gives
  $\Pr\big[\,|\hat{R}_V(h) - R(h)| > \epsilon\,\big] \le 2 e^{-2\epsilon^2 K}$.
• More generally, we always have $\mathbb{E}[\hat{R}_V(h)] = R(h)$, with variance that shrinks like $1/K$ for any bounded loss.
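A quick simulation consistent with the concentration result above (the true error value 0.3 and the sample sizes are assumed numbers, chosen only for illustration):

```python
# For a fixed classifier with true error rate R(h) = 0.3, the validation error
# on K independent points deviates from R(h) by roughly 1/sqrt(K).
import numpy as np

rng = np.random.default_rng(3)
true_risk = 0.3          # assumed true out-of-sample error of a fixed h
n_trials = 5000

for K in [10, 100, 1000]:
    # each validation point contributes an independent Bernoulli(R(h)) error
    val_errors = rng.binomial(K, true_risk, size=n_trials) / K
    dev = np.abs(val_errors - true_risk)
    print(f"K={K:4d}: mean |Rhat_V - R| = {dev.mean():.3f}  (1/sqrt(K) = {1/np.sqrt(K):.3f})")
```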
Accuracy of validation (continued)
• In either case, this shows us that $\hat{R}_V(h) = R(h) \pm O(1/\sqrt{K})$.
• Thus, we can get as accurate an estimate of $R(h)$ as we like using a validation set, as long as $K$ is large enough.
• Remember, $h$ is ultimately something we learned from training data. Where is this validation set coming from?

Validation vs training
• We are given a single data set of $n$ points. We carve out a validation (holdout) set of size $K$, train on the remaining $n - K$ points to get a hypothesis $h^-$, and compute the validation error $\hat{R}_V(h^-)$.
• Small $K$: a bad estimate.
• Large $K$: an accurate estimate, but of what?
• [Learning curve figure: expected error versus the number of training points; with a large $K$, training uses few points and sits at the bad end of the curve.] A large $K$ lets us say: "We are very confident that we have selected a terrible $h^-$."

Can we have our cake and eat it too?
• After we've used our validation set to estimate the error, re-train on the whole data set to get the final hypothesis $h$.
• Small $K$: a bad estimate of $R(h^-)$, but $h^-$ is close to $h$.
• Large $K$: a good estimate of $R(h^-)$, but $h^-$ may be far from $h$.
• Rule of thumb: set aside roughly 20% of the data for validation (i.e., $K \approx n/5$) and train on the remaining $n - K$ points.
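The hold-out recipe, including re-training on the full data set at the end, might look like the following scikit-learn sketch (synthetic data; the 20% split follows the rule of thumb above, and the classifier choice is arbitrary):

```python
# Estimate the error on a ~20% validation split, then re-train the chosen
# model on the whole data set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)       # h^- trained on n-K points
val_error = 1.0 - clf.score(X_val, y_val)                     # Rhat_V(h^-)
print(f"validation error estimate: {val_error:.3f}")

final_clf = LogisticRegression(max_iter=1000).fit(X, y)       # re-train h on all n points
```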
Validation vs testing
• We call this "validation", but how is it any different from simply "testing"?
• Typically, the validation error is used to make learning choices. If an estimate of $R(h)$ affects learning, i.e., it impacts which $h$ we choose, then it is no longer a test set; it becomes a validation set.
• What's the difference? A test set gives an unbiased estimate, while a validation set will have an (overly) optimistic bias (remember the coin tossing experiments?).

Example
• Suppose we have two hypotheses $h_1$ and $h_2$ with $R(h_1) = R(h_2) = 0.5$.
• Next, suppose that our error estimates for them, denoted $e_1$ and $e_2$, are independent and each equally likely to fall above or below 0.5.
• We pick the hypothesis whose estimate is smaller, and report $\min(e_1, e_2)$ as its error. It is easy to argue that $\mathbb{E}[\min(e_1, e_2)] < 0.5$.
• Why? 75% of the time at least one of the two estimates falls below 0.5, so the reported error is below the true error: an optimistic bias.

Using validation for model selection
• Suppose we have $M$ models. Train each on the training set (of size $n - K$) to obtain hypotheses $h_1^-, \dots, h_M^-$, evaluate each on the validation set (of size $K$), and pick the best:
  $m^* = \arg\min_m \hat{R}_V(h_m^-)$.

The bias
• We select the model $m^*$ using the validation set, so $\hat{R}_V(h_{m^*}^-)$ is an (optimistically) biased estimate of $R(h_{m^*}^-)$ (and of $R(h_{m^*})$).
• [Figure: expected error versus validation set size $K$; the validation estimate of the selected model sits below its true out-of-sample error.]
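The slides leave the distribution of the error estimates unspecified; assuming each is Uniform[0, 1] (so each has mean 0.5, matching the true error), a short simulation reproduces the optimistic bias:

```python
# Simulate the two-hypothesis selection-bias example: validation picks the
# hypothesis with the smaller error estimate and reports that estimate.
import numpy as np

rng = np.random.default_rng(4)
e1, e2 = rng.uniform(0, 1, (2, 100_000))      # error estimates for h1 and h2
picked = np.minimum(e1, e2)                   # reported error of the selected hypothesis

print(f"E[min(e1, e2)] ~ {picked.mean():.3f}  (true error is 0.500)")
print(f"P(min < 0.5)   ~ {(picked < 0.5).mean():.3f}  (should be about 0.75)")
```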
Quantifying the bias
• We've seen this before: for $M$ models, we use a data set of size $K$ to pick the model that does best out of the $M$ candidates.
• Back to Hoeffding! With a union bound over the $M$ candidates, with probability at least $1 - \delta$,
  $R(h_{m^*}^-) \le \hat{R}_V(h_{m^*}^-) + \sqrt{\frac{\ln(2M/\delta)}{2K}}$.
• Or, if the models correspond to a few continuous parameters, we can use the VC approach to argue a similar bound.

Data contamination
• We have now discussed three different kinds of estimates of the risk $R(h)$: the training error, the test error, and the validation error.
• These three estimates have different degrees of "contamination" that manifests itself as a (deceptively) optimistic bias:
  – Training set: totally contaminated.
  – Test set: totally clean (requires strict discipline).
  – Validation set: slightly contaminated.
• We will return in a bit to the issue of data "contamination".

Validation dilemma
• Back to our core dilemma in validation: we would like to argue that
  $\hat{R}_V(h^-) \approx R(h^-) \approx R(h)$.
• The first approximation needs $K$ to be large, while the second needs $K$ to be small. All we need to do is set $K$ so that it is simultaneously small and large. Can we do this? Yes!

Leave one out
• We need $K$ to be small, so let's set $K = 1$!
• Select a hypothesis $h_i^-$ using the data set with the single point $(x_i, y_i)$ removed, and compute the validation error $e_i = \ell(h_i^-(x_i), y_i)$.
• We set $K$ to be too small, so each $e_i$ on its own is a terrible estimate.
• Instead, repeat this for all $n$ possible choices of $i$ and average:
  $\hat{R}_{\mathrm{LOO}} = \frac{1}{n} \sum_{i=1}^{n} e_i$.
• This is called the leave-one-out cross validation error.
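A direct translation of the leave-one-out definition above (my own illustration, fitting a line to synthetic data):

```python
# Leave-one-out cross validation: for each i, fit on the other n-1 points,
# evaluate on the held-out point, and average the n resulting errors.
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, 30)
y = 2.0 * x + 0.3 * rng.standard_normal(30)       # hypothetical linear data

errors = []
for i in range(len(x)):
    mask = np.arange(len(x)) != i                 # leave point i out
    coeffs = np.polyfit(x[mask], y[mask], 1)      # h_i^- : fit a line on n-1 points
    e_i = (np.polyval(coeffs, x[i]) - y[i]) ** 2  # squared error on the held-out point
    errors.append(e_i)

print(f"leave-one-out CV error: {np.mean(errors):.3f}")
```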
Example
• Fitting a line to 3 data points: each of the 3 leave-one-out fits uses only 2 points, and the average of the 3 held-out errors gives the cross validation estimate.

Leave more out
• Leave-one-out: train $n$ times on $n - 1$ points each.
• $K$-fold cross validation: partition the data into $K$ folds and train $K$ times, each time on the $n - n/K$ points outside one fold, validating on the $n/K$ points inside it.
• Example: with 5 folds, iterate over all 5 choices of validation fold and average the 5 validation errors.
• Common choices are $K = 5$ or $K = 10$. (Note: on this slide $K$ is the number of folds and $n/K$ is the size of the validation set.)

Remarks
• For $K$-fold cross validation, the estimate depends on the particular choice of partition. It is common to form several estimates based on different random partitions and then average them.
• When using $K$-fold cross validation for classification, you should ensure that each of the $K$ folds contains training data from each class in the same proportion as in the full data set; this is called "stratified cross validation".
• Scikit-learn can do all of this for you for any of the built-in learning methods, as in the sketch below.
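As the remarks note, scikit-learn handles the fold construction, stratification, and averaging. A minimal sketch (synthetic, imbalanced classes; the estimator and settings are arbitrary choices):

```python
# Stratified 5-fold cross validation with scikit-learn: each fold preserves
# the class proportions of the full data set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, n_features=10, weights=[0.8, 0.2],
                           random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print(f"per-fold accuracy: {np.round(scores, 3)}")
print(f"5-fold CV error estimate: {1 - scores.mean():.3f}")
```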
The bootstrap
• What else can you do when your training set is really small? You really need as much training data as possible to get reasonable results.
• Fix $B$. For $b = 1, \dots, B$, let $\mathcal{D}_b$ be a set of size $n$ obtained by sampling with replacement from the full data set.
• Example: a bootstrap sample of $\{z_1, z_2, z_3\}$ might be $\{z_2, z_2, z_3\}$; some points are repeated and others are left out.

The bootstrap error estimate
• For each $b$, let $h_b$ be the model learned based on the data $\mathcal{D}_b$, and let $C_b$ be the set of points that do not appear in $\mathcal{D}_b$.
• The bootstrap error estimate averages, over the $B$ bootstrap samples, the error of each $h_b$ on its own left-out points:
  $\hat{R}_{\mathrm{boot}} = \frac{1}{B} \sum_{b=1}^{B} \frac{1}{|C_b|} \sum_{(x_i, y_i) \in C_b} \ell(h_b(x_i), y_i)$.

Bootstrap in practice
• Typically, $B$ must be large for the estimate to be accurate, so this can be rather computationally demanding.
• $\hat{R}_{\mathrm{boot}}$ tends to be pessimistic (each $h_b$ sees only about 63.2% of the distinct training points), so it is common to combine the training and bootstrap error estimates. A common choice is the "0.632 bootstrap estimate":
  $\hat{R}_{0.632} = 0.632\,\hat{R}_{\mathrm{boot}} + 0.368\,\hat{R}_{\mathrm{train}}$.
• The "balanced" bootstrap chooses the $\mathcal{D}_b$ such that each input-output pair appears exactly $B$ times across the $B$ bootstrap samples.
• The bootstrap can be used to estimate confidence intervals of basically anything.
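A rough sketch of the bootstrap error estimate and the 0.632 combination above (my own illustration; the classifier, B = 200, and the synthetic data are arbitrary choices):

```python
# Bootstrap error estimate: train on each bootstrap sample, evaluate on the
# points it left out, then blend with the training error (0.632 estimate).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(6)
X, y = make_classification(n_samples=60, n_features=5, random_state=0)  # small data set
n, B = len(y), 200

oob_errors = []
for _ in range(B):
    idx = rng.integers(0, n, size=n)                 # sample n points with replacement
    left_out = np.setdiff1d(np.arange(n), idx)       # points not in this bootstrap sample
    if len(left_out) == 0:
        continue
    clf = KNeighborsClassifier(n_neighbors=3).fit(X[idx], y[idx])
    oob_errors.append(np.mean(clf.predict(X[left_out]) != y[left_out]))

boot_err = np.mean(oob_errors)
train_err = np.mean(
    KNeighborsClassifier(n_neighbors=3).fit(X, y).predict(X) != y)
est_632 = 0.632 * boot_err + 0.368 * train_err
print(f"bootstrap error {boot_err:.3f}, training error {train_err:.3f}, "
      f"0.632 estimate {est_632:.3f}")
```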
Data snooping
• This is by far the most common trap that people fall into in practice. It leads to serious overfitting, it can be very subtle, and there are many ways to slip up.
• If a data set has affected any step in the learning process, its ability to assess the outcome has been compromised.

Example
• Suppose we plan to use an SVM with a quadratic kernel on our data set. What is the VC dimension of the hypothesis set in this case?

Reuse of the data set
• If you try one model after another on the same data set, you will eventually "succeed": if you torture the data long enough, it will confess.
• You need to think about the VC dimension/complexity of the total learning model. This may include models you only considered in your mind, and may even include models tried by others!

Remedies
• Avoid data snooping (strict discipline): test on new data that no one has seen before.
• Or, account for data snooping when assessing your results.

Puzzle: Time-series forecasting
• Suppose we wish to predict whether the price of a stock is going to go up or down tomorrow.
• Take the price history over a long period of time and normalize the time series to zero mean, unit variance.
• Form all possible input-output pairs, with input = the previous 20 days of stock prices and output = the price movement on the 21st day.
• Randomly split the data into training and testing data; train on the training data only, test on the testing data only.
• Based on the test data, it looks like we can consistently predict the price movement direction with accuracy ~52%. Are we going to be rich?
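The intended catch, I believe, is in the preprocessing: normalizing the entire series to zero mean and unit variance uses statistics from the test period, so the test data has already influenced the inputs before any "training only" step. A small sketch contrasting the snooped and clean pipelines (synthetic random-walk prices, purely illustrative):

```python
# Snooped preprocessing normalizes with statistics of the whole series
# (including the future); the clean version uses the training period only.
import numpy as np

rng = np.random.default_rng(7)
prices = 100 + np.cumsum(rng.standard_normal(2000))   # hypothetical price history

split = 1500                                           # chronological train/test split
train, test = prices[:split], prices[split:]

# Snooped: mean/std computed from the full series, including the test period.
snooped = (prices - prices.mean()) / prices.std()

# Clean: statistics come from the training period only, then applied to test.
mu, sigma = train.mean(), train.std()
clean_train = (train - mu) / sigma
clean_test = (test - mu) / sigma

print(f"mean/std from full series:  {prices.mean():.2f} / {prices.std():.2f}")
print(f"mean/std from training set: {mu:.2f} / {sigma:.2f}")
```

Random splitting of a time series (rather than splitting chronologically) is another way future information can leak into training.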
