MIE1626 Midterm Practice Problems with Solutions

School: University of Toronto
Course: MIE1626
Subject: Statistics
Date: Feb 20, 2024
Pages: 6

Practice Problems MIE1626

QUESTION 1 [2 marks]: In the expression Sales ≈ f(TV, Radio, Newspaper), "Sales" is the:
A) Response
B) Training Data
C) Independent Variable
D) Feature

QUESTION 2 [2 marks]: In a predictive modeling project using regression, you fit a linear model to your data set. Which of the following is most likely true if you fit a quadratic model to the data set?
A) Using the quadratic model will decrease your irreducible error.
B) Using the quadratic model will decrease the bias of your model.
C) Using the quadratic model will decrease the variance of your model.
D) Using the quadratic model will decrease your reducible error.

QUESTION 3 [2 marks]: One way of carrying out the bootstrap is to average equally over all possible bootstrap samples from the original data set (where two bootstrap data sets are considered different if they contain the same observations but in a different order). Unlike the usual implementation of the bootstrap, this method has the advantage of not introducing extra noise from random resampling. To carry out this implementation on a data set with n data points, how many bootstrap data sets would we need to average over?
A) 2^n
B) n^2
C) n^n
D) n!

QUESTION 4 [2 marks]: Which of the following statements about the classification methods logistic regression, linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and naive Bayes is most accurate?
A) Logistic regression is not a suitable method when the classes are well separated.
B) LDA is useful when n is large and for problems with more than 2 classes.
C) Assuming Gaussian distributions in each class, QDA is less flexible than naive Bayes.
D) Naive Bayes is most useful when the number of features and samples are roughly the same.
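As an aside, the count asked for in Question 3 can be checked by brute force for small n. This sketch (mine, not part of the original solutions) enumerates every ordered sample of size n drawn with replacement from n points:

```python
from itertools import product

# A bootstrap sample of size n drawn with replacement from n data points,
# where order matters, is a length-n tuple over n choices, so the number
# of distinct samples should be n**n.
for n in range(1, 5):
    samples = set(product(range(n), repeat=n))
    print(n, len(samples), n ** n)  # the last two columns match
```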
Answers for Q1-4:
Q1: A    Q2: B    Q3: C    Q4: A

QUESTION 5 [18 marks]: For predicting p, the probability of credit default, you have used a logistic regression model with the variables X1 = credit score and X2 = credit card balance. Using historical data with class labels, you have fitted the model

logit(p) ≈ β̂0 + β̂1 X1 + β̂2 X2

and obtained the estimated coefficients β̂0 = -50, β̂1 = -1, and β̂2 = 0.2.

Part (a) [1 mark]: Explain in plain English what the estimated intercept means and provide a numerical example of its role in the model.

Answer: The intercept determines the prediction for a sample where X1 and X2 are both 0. Accordingly, p = e^(-50) / (1 + e^(-50)) ≈ 1.9e-22, a very small number indicating a very small estimated probability of default for a person with a credit score of 0 and a balance of 0. So, in plain English, our model predicts that a person with a credit score of 0 and a balance of 0 is very unlikely to have a credit default.

Part (b) [2 marks]: Explain what β̂2 means and provide a numerical example of how it affects logit(p).

Answer: β̂2 is the estimated coefficient for credit card balance in the logistic regression model. A one-unit increase in balance increases logit(p) by β̂2 = 0.2, provided the other variable (credit score) remains unchanged. logit(p) is the logarithm of the odds p/(1-p), where p is the probability of default:

logit(p) = ln(p / (1 - p)) = β̂0 + β̂1 X1 + β̂2 X2

Part (c) [1 mark]: Provide a numerical example of how β̂2 affects the odds of default.

Odds = p / (1 - p) = e^(β̂0 + β̂1 x1 + β̂2 x2)
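The numeric claims in parts (a) through (e) can be reproduced with a short Python sketch (the function and variable names are mine, not from the course material):

```python
import math

# Fitted coefficients from the question
b0, b1, b2 = -50.0, -1.0, 0.2

def prob_default(score, balance):
    """p = e^z / (1 + e^z), where z = logit(p) = b0 + b1*score + b2*balance."""
    z = b0 + b1 * score + b2 * balance
    return math.exp(z) / (1.0 + math.exp(z))

# Part (a): with score = balance = 0, only the intercept remains
p_zero = prob_default(0, 0)              # ~1.9e-22, essentially zero

# Part (b): one extra unit of balance adds b2 = 0.2 to logit(p),
# Part (c): ...which multiplies the odds p/(1-p) by e^0.2 ~ 1.22
odds_factor = math.exp(b2)

# Part (d): Bob, with score 70 and balance 610, has z = 2, so p ~ 0.88
p_bob = prob_default(70, 610)

# Part (e): p = 0.5 exactly when z = 0, so solve b0 + b1*70 + b2*x = 0
balance_at_half = -(b0 + b1 * 70) / b2   # 600
```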
Following the example in the previous part, a one-unit increase in balance (while the other variable remains unchanged) multiplies the odds p/(1-p) by a factor of exp(β̂2) = e^0.2 ≈ 1.22, i.e. a 22% increase in the odds of credit default. Note that the odds of credit default, p/(1-p), are different from the probability of credit default, p.

Part (d) [1 mark]: Estimate the probability of credit default for Bob, who has a credit score of 70 and a credit card balance of 610.

p = e^(-50 - 1*70 + 0.2*610) / (1 + e^(-50 - 1*70 + 0.2*610)) = e^2 / (1 + e^2) ≈ 0.88

Part (e) [1 mark]: For a credit default risk of 50%, what should Bob's credit card balance be?

e^(-50 - 1*70 + 0.2x) / (1 + e^(-50 - 1*70 + 0.2x)) = 0.5
-> 2 e^(-50 - 1*70 + 0.2x) = 1 + e^(-50 - 1*70 + 0.2x)
-> e^(-50 - 1*70 + 0.2x) = 1
-> -50 - 1*70 + 0.2x = 0
-> x = 600

Part (f) [4 marks]: To use the logistic model as a classifier for detecting defaulters, we use the classification threshold p̂ = 0.5. Calculate the confusion matrix for the following test dataset and specify the values of TP, TN, FP, and FN.

X1 (credit score)   X2 (credit card balance)   Y (class label)
70                  610                        1
70                  700                        1
70                  800                        1
70                  500                        0
70                  400                        1
60                  600                        0
50                  600                        0
40                  600                        1
80                  600                        0
90                  600                        0

Answer:

X1   X2    Y (actual)   p̂       Ŷ (predicted)   Result
70   610   1            0.88     1               TP
70   700   1            >0.88    1               TP
70   800   1            >0.88    1               TP
70   500   0            <0.5     0               TN
70   400   1            <0.5     0               FN
60   600   0            >0.5     1               FP
50   600   0            >0.5     1               FP
40   600   1            >0.5     1               TP
80   600   0            <0.5     0               TN
90   600   0            <0.5     0               TN

TP = 4, FP = 2, TN = 3, FN = 1

Confusion matrix:

                      Actual 1   Actual 0
Predicted class 1     TP = 4     FP = 2
Predicted class 0     FN = 1     TN = 3

Part (g) [3 marks]: Calculate precision, recall, and accuracy for the test dataset.

Precision = TP / (TP + FP) = 4/6 ≈ 0.667 = 66.7%
Recall = TP / (TP + FN) = 4/5 = 0.8 = 80%
Accuracy = (TP + TN) / all = (4 + 3) / 10 = 0.7 = 70%

Part (h) [2 marks]: Draw an ROC plot and indicate the point that shows the performance of the classifier on the test dataset at the classification threshold p̂ = 0.5.
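The counts in part (f) and the metrics in part (g) can be verified by scoring every row of the test dataset; this is a check I wrote, not part of the original solutions:

```python
import math

b0, b1, b2 = -50.0, -1.0, 0.2

def predict(score, balance, threshold=0.5):
    """Classify as 1 (default) when the modeled probability meets the threshold."""
    z = b0 + b1 * score + b2 * balance
    p = 1.0 / (1.0 + math.exp(-z))
    return int(p >= threshold)

# Test dataset from the question: (credit score, balance, actual label)
data = [(70, 610, 1), (70, 700, 1), (70, 800, 1), (70, 500, 0), (70, 400, 1),
        (60, 600, 0), (50, 600, 0), (40, 600, 1), (80, 600, 0), (90, 600, 0)]

tp = fp = tn = fn = 0
for score, balance, y in data:
    yhat = predict(score, balance)
    if yhat == 1 and y == 1:
        tp += 1
    elif yhat == 1 and y == 0:
        fp += 1
    elif yhat == 0 and y == 0:
        tn += 1
    else:
        fn += 1

print(tp, fp, tn, fn)                 # 4 2 3 1

precision = tp / (tp + fp)            # 4/6
recall = tp / (tp + fn)               # 4/5
accuracy = (tp + tn) / len(data)      # 7/10
```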
False positive rate = fall-out = FP / (FP + TN) = 2/5 = 0.4 = 40%
True positive rate = recall = TP / (TP + FN) = 4/5 = 0.8 = 80%
The classifier therefore corresponds to the point (0.4, 0.8) on the ROC plot.

Part (i) [3 marks]: Your colleague says they have developed an alternative classifier producing the following test results using the same features, and that it has a better F1-score.

                      Actual 1   Actual 0
Predicted class 1     4          1
Predicted class 0     2          3

Based on the results presented to you, do you recommend using the alternative classifier instead of the logistic regression model (yes/no)? Justify your answer.

Answer: No, we do not recommend the alternative classifier. Looking at the columns of the presented confusion matrix, six observations have actual class 1 and four have actual class 0. Regardless of the claimed improvement in F1-score, this is inconsistent with the test dataset, which has 5 observations in each class. So the presented evaluation of the alternative model is incorrect (or based on an incomparable test dataset), and the model therefore cannot be recommended.
To be more specific, the alternative classifier yields the following counts, which are not possible on the same test dataset:

FN = 2, FP = 1, TN = 3, TP = 4

If we overlook this issue and calculate performance measures anyway, we see that, compared with the logistic regression model, the values of precision and recall are swapped while accuracy remains the same:

Precision = TP / (TP + FP) = 4/5 = 0.8 = 80%
Recall = TP / (TP + FN) = 4/6 ≈ 0.667 = 66.7%
Accuracy = (TP + TN) / all = (4 + 3) / 10 = 0.7 = 70%

Since the F1-score is the harmonic mean of precision and recall, swapping the two values leaves it unchanged. So the alternative classifier cannot be recommended even if we assume the test results are comparable and overlook the error in the presented confusion matrix.
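The claim that swapping precision and recall leaves the F1-score unchanged follows from the symmetry of the harmonic mean; a minimal check:

```python
def f1(precision, recall):
    """F1-score: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Logistic model:    precision 4/6, recall 4/5
# Alternative model: precision 4/5, recall 4/6 (values swapped)
f1_logistic = f1(4 / 6, 4 / 5)
f1_alternative = f1(4 / 5, 4 / 6)
# The harmonic mean is symmetric in its arguments, so both are ~0.727
```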