Worksheet 9 - Regression Continued

Lecture and Tutorial Learning Goals:

By the end of the week, you will be able to:

- Recognize situations where a simple regression analysis would be appropriate for making predictions.
- Explain the k-nearest neighbours (k-nn) regression algorithm and describe how it differs from k-nn classification.
- Interpret the output of a k-nn regression.
- In a dataset with two variables, perform k-nearest neighbour regression in R using tidymodels to predict the values for a test dataset.
- Using R, execute cross-validation to choose the number of neighbours.
- Using R, evaluate k-nn regression prediction accuracy using a test data set and an appropriate metric (e.g., root mean square prediction error).
- In a dataset with > 2 variables, perform k-nn regression in R using tidymodels to predict the values for a test dataset.
- In the context of k-nn regression, compare and contrast goodness of fit and prediction properties (namely RMSE vs RMSPE).
- Describe advantages and disadvantages of the k-nearest neighbour regression approach.
- Perform ordinary least squares regression in R using tidymodels to predict the values for a test dataset.
- Compare and contrast predictions obtained from k-nearest neighbour regression to those obtained using simple ordinary least squares regression from the same dataset.

This worksheet covers parts of the Regression II chapter of the online textbook. You should read this chapter before attempting the worksheet.

### Run this cell before continuing.
library(tidyverse)
library(repr)
library(tidymodels)
library(cowplot)
options(repr.matrix.max.rows = 6)
source("tests.R")
source("cleanup.R")

Warm-up Questions

Here are some warm-up questions on the topic of multiple regression to get you thinking before we jump into data analysis. The course readings should help you answer these.
Question 1.0 Multiple Choice: {points: 1}

In multivariate k-nn regression with one outcome/target variable and two predictor variables, the predictions take the form of what shape?

A. a flat plane
B. a wiggly/flexible plane
C. a straight line
D. a wiggly/flexible line
E. a 4D hyperplane
F. a 4D wiggly/flexible hyperplane

Save the letter of the answer you think is correct to a variable named answer1.0. Make sure you put quotations around the letter and pay attention to case.

### BEGIN SOLUTION
answer1.0 <- "B"
### END SOLUTION

test_1.0()

Question 1.1 Multiple Choice: {points: 1}

In simple linear regression with one outcome/target variable and one predictor variable, the predictions take the form of what shape?

A. a flat plane
B. a wiggly/flexible plane
C. a straight line
D. a wiggly/flexible line
E. a 4D hyperplane
F. a 4D wiggly/flexible hyperplane

Save the letter of the answer you think is correct to a variable named answer1.1. Make sure you put quotations around the letter and pay attention to case.

### BEGIN SOLUTION
answer1.1 <- "C"
### END SOLUTION

test_1.1()
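To see the contrast behind Questions 1.0 and 1.1 in a single predictor, here is a small illustrative sketch (not part of the worksheet; the toy data and all object names here are hypothetical): with one predictor, k-nn regression traces a wiggly/flexible line through the data, while simple linear regression is constrained to a straight line.

# Hypothetical toy data: a noisy non-linear trend.
set.seed(1)
toy <- tibble(x = runif(50, 0, 10),
              y = sin(x) + x / 2 + rnorm(50, sd = 0.3))

# k-nn regression fit: its predictions form a wiggly/flexible line.
knn_fit <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) %>%
    set_engine("kknn") %>%
    set_mode("regression") %>%
    fit(y ~ x, data = toy)

# Predict on a fine grid so the shape of the k-nn line is visible.
grid <- tibble(x = seq(0, 10, length.out = 200))
knn_preds <- bind_cols(grid, predict(knn_fit, grid))

ggplot(toy, aes(x = x, y = y)) +
    geom_point(alpha = 0.5) +
    geom_line(data = knn_preds, aes(x = x, y = .pred), color = "blue") +  # wiggly k-nn line
    geom_smooth(method = "lm", se = FALSE, color = "red")                 # straight OLS line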
Question 1.2 Multiple Choice: {points: 1}

In multiple linear regression with one outcome/target variable and two predictor variables, the predictions take the form of what shape?

A. a flat plane
B. a wiggly/flexible plane
C. a straight line
D. a wiggly/flexible line
E. a 4D hyperplane
F. a 4D wiggly/flexible hyperplane

Save the letter of the answer you think is correct to a variable named answer1.2. Make sure you put quotations around the letter and pay attention to case.

### BEGIN SOLUTION
answer1.2 <- "A"
### END SOLUTION

test_1.2()

Understanding Simple Linear Regression

Consider this small and simple dataset:

simple_data <- tibble(X = c(1, 2, 3, 6, 7, 7),
                      Y = c(1, 1, 3, 5, 7, 6))

options(repr.plot.width = 5, repr.plot.height = 5)
base <- ggplot(simple_data, aes(x = X, y = Y)) +
    geom_point(size = 2) +
    scale_x_continuous(limits = c(0, 7.5), breaks = seq(0, 8), minor_breaks = seq(0, 8, 0.25)) +
    scale_y_continuous(limits = c(0, 7.5), breaks = seq(0, 8), minor_breaks = seq(0, 8, 0.25)) +
    theme(text = element_text(size = 20))
base

Now consider these three potential lines we could fit for the same dataset:

options(repr.plot.height = 3.5, repr.plot.width = 10)
line_a <- base +
    ggtitle("Line A") +
    geom_abline(intercept = -0.897, slope = 0.9834, color = "blue") +
    theme(text = element_text(size = 20))
line_b <- base +
    ggtitle("Line B") +
    geom_abline(intercept = 0.1022, slope = 0.9804, color = "purple") +
    theme(text = element_text(size = 20))
line_c <- base +
    ggtitle("Line C") +
    geom_abline(intercept = -0.2347, slope = 0.9164, color = "green") +
    theme(text = element_text(size = 20))
plot_grid(line_a, line_b, line_c, ncol = 3)
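Questions 2.0-2.2 below ask you to estimate the average squared vertical distance for each line by reading values off the graphs. If you want to check your graph-based estimates afterwards, the exact values can be computed directly from simple_data; the helper function below (mse_for_line is a name introduced here for illustration, not part of the worksheet) is a minimal sketch:

# Sketch: exact mean squared vertical distance between the points in
# simple_data and a line with the given intercept and slope.
mse_for_line <- function(intercept, slope, data) {
    mean((data$Y - (intercept + slope * data$X))^2)
}

mse_for_line(-0.897, 0.9834, simple_data)   # Line A
mse_for_line(0.1022, 0.9804, simple_data)   # Line B
mse_for_line(-0.2347, 0.9164, simple_data)  # Line C: smallest of the three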
Question 2.0 {points: 1}

Use the graph below titled "Line A" to roughly calculate the average squared vertical distance between the points and the blue line. Read values off the graph to a precision of 0.25 (e.g. 1, 1.25, 1.5, 1.75, 2). Save your answer to a variable named answer2.0. We reprint the plot for you in a larger size to make it easier to estimate the locations on the graph.

# run this code
options(repr.plot.width = 9, repr.plot.height = 9)
line_a

### BEGIN SOLUTION
answer2.0 <- ((0 - 1)^2 + (1 - 1)^2 + (2 - 3)^2 + (5 - 5)^2 + (6 - 6)^2 + (6 - 7)^2) / 6
### END SOLUTION
answer2.0

test_2.0()

Question 2.1 {points: 1}

Use the graph titled "Line B" to roughly calculate the average squared vertical distance between the points and the purple line. Read values off the graph to a precision of 0.25 (e.g. 1, 1.25, 1.5, 1.75, 2). Save your answer to a variable named answer2.1. We reprint the plot for you in a larger size to make it easier to estimate the locations on the graph.

options(repr.plot.width = 9, repr.plot.height = 9)
line_b

### BEGIN SOLUTION
answer2.1 <- ((1 - 1)^2 + (2 - 1)^2 + (3 - 3)^2 + (6 - 5)^2 + (7 - 7)^2 + (6 - 7)^2) / 6
### END SOLUTION
answer2.1

test_2.1()

Question 2.2 {points: 1}

Use the graph titled "Line C" to roughly calculate the average squared vertical distance between the points and the green line. Read values off the graph to a precision of 0.25 (e.g. 1, 1.25, 1.5, 1.75, 2). Save your answer to a variable named answer2.2. We reprint the plot for you in a larger size to make it easier to estimate the locations on the graph.

options(repr.plot.width = 9, repr.plot.height = 9)
line_c

### BEGIN SOLUTION
answer2.2 <- ((0.75 - 1)^2 + (1.5 - 1)^2 + (2.5 - 3)^2 + (5.25 - 5)^2 + (6.25 - 7)^2 + (6.25 - 6)^2) / 6
### END SOLUTION
answer2.2

test_2.2()
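Question 2.3 below asks which of the three candidate lines ordinary least squares would choose. As a quick cross-check (not part of the graded solution), R's built-in lm() returns the exact least-squares coefficients for simple_data, which you can compare against the three candidate intercept/slope pairs:

# Sketch: the exact least-squares line for simple_data.
# coef() returns the fitted intercept and slope (roughly -0.25 and 0.94),
# which are closer to Line C's coefficients than to Line A's or Line B's.
coef(lm(Y ~ X, data = simple_data))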
Question 2.3 {points: 1}

Based on your calculations above, which line would linear regression by ordinary least squares choose given our small and simple dataset: Line A, B, or C? Assign the letter that corresponds to the line to a variable named answer2.3. Make sure you put quotations around the letter and pay attention to case.

### BEGIN SOLUTION
answer2.3 <- "C"
### END SOLUTION

test_2.3()

Marathon Training Revisited with Linear Regression!

Source: https://media.giphy.com/media/BDagLpxFIm3SM/giphy.gif

Remember our question from last week: what features predict whether athletes will perform better than others? Specifically, we are interested in whether the maximum distance run per week during training predicts a runner's marathon race time.
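As a preview of where this section is headed, here is a minimal sketch of fitting and evaluating a simple linear regression with tidymodels. It assumes a marathon data frame like last week's, with a predictor max (maximum distance run per week during training) and an outcome time_hrs (race time in hours); both column names are assumptions carried over from the previous worksheet rather than definitions made here.

# Minimal sketch (assumed data: `marathon` with columns `max` and `time_hrs`).
set.seed(2000)

# Split the data into training and test sets.
marathon_split <- initial_split(marathon, prop = 0.75, strata = time_hrs)
marathon_training <- training(marathon_split)
marathon_testing <- testing(marathon_split)

# Simple linear regression specification and workflow.
lm_spec <- linear_reg() %>%
    set_engine("lm") %>%
    set_mode("regression")

lm_recipe <- recipe(time_hrs ~ max, data = marathon_training)

lm_fit <- workflow() %>%
    add_recipe(lm_recipe) %>%
    add_model(lm_spec) %>%
    fit(data = marathon_training)

# RMSE on the training set measures goodness of fit; the same metric
# computed on the held-out test set is the RMSPE (prediction accuracy).
lm_fit %>%
    predict(marathon_testing) %>%
    bind_cols(marathon_testing) %>%
    metrics(truth = time_hrs, estimate = .pred) %>%
    filter(.metric == "rmse")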