Problem Set 9 (pdf, 4 pages) — STA130, Statistics, University of Toronto — Feb 20, 2024
# STA130H1S – Fall 2022 Problem Set 9

Amogh Shashidhar (1008817666) and STA130 Professors

## Instructions

Complete the exercises in this .Rmd file and submit your .Rmd and .pdf output through Quercus on Thursday, November 24th by 5:00 p.m. ET.

```r
library(tidyverse)
library(rpart)
library(partykit)
library(knitr)
```

## Part 1: Binary Classification Decision Trees

### Question 1: Gallup World Poll

Using data from the Gallup World Poll (and the World Happiness Report), we are interested in predicting which factors influence life expectancy around the world. These data are in the file `happiness2017.csv`.

```r
happiness2017 <- read_csv("happiness2017.csv")
```

(a) Begin by creating a new variable called `life_exp_category` which takes the value "Good" for countries with a life expectancy higher than 65 years, and "Poor" otherwise.

```r
# Keep all columns so the predictors are still available for parts (b)-(e).
life_exp_category <- happiness2017 %>%
  mutate(life_exp_category = case_when(
    life_exp > 65 ~ "Good",
    life_exp <= 65 ~ "Poor"
  ))
life_exp_category %>% select(country, life_exp, life_exp_category)
```

```
## # A tibble: 1,420 x 3
##    country     life_exp life_exp_category
##    <chr>          <dbl> <chr>
##  1 Afghanistan     47.6 Poor
##  2 Afghanistan     47.9 Poor
##  3 Afghanistan     48.2 Poor
##  4 Afghanistan     48.5 Poor
##  5 Afghanistan     48.7 Poor
##  6 Afghanistan     49.0 Poor
##  7 Afghanistan     49.3 Poor
##  8 Afghanistan     49.6 Poor
##  9 Afghanistan     49.9 Poor
## 10 Albania         67.2 Good
## # ... with 1,410 more rows
```
(b) Divide the data into training (80%) and testing (20%) datasets. Build a classification tree using the training data to predict which countries have Good vs Poor life expectancy, using only the `social_support` variable as a predictor.

```r
set.seed(666) # Use the last 3 digits of your student ID number for the random seed.

n <- nrow(life_exp_category)
n_train <- as.integer(n * 0.8)
n_test <- n - n_train

training_indices <- sample(1:n, size = n_train, replace = FALSE)
life_exp_category <- life_exp_category %>% rowid_to_column()
train <- life_exp_category %>% filter(rowid %in% training_indices)
test <- life_exp_category %>% filter(!(rowid %in% training_indices))

# Classification tree using only social_support as a predictor.
tree_b <- rpart(life_exp_category ~ social_support, data = train)
```

(c) Use the same training dataset created in (b) to build a second classification tree to predict which countries have good vs poor life expectancy, using `logGDP`, `social_support`, `freedom`, and `generosity` as potential predictors.

```r
tree <- rpart(life_exp_category ~ logGDP + social_support + freedom + generosity,
              data = train)
tree %>% as.party() %>% plot(type = "extended", tp_args = list(id = FALSE))
```

[Tree plot: the root splits on logGDP at 9.521; the logGDP ≥ 9.521 branch splits again on logGDP at 10.143 and then on generosity at -0.177. Terminal node sizes: n = 359, 188, 83, and 781, each shown with its proportion of Poor vs Good.]
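As a quick sanity check on the 80/20 split above, the Good/Poor balance of the two sets can be compared (a minimal sketch, assuming the `train` and `test` data frames created in part (b)):

```r
# Proportion of each life-expectancy category in each set; for a random
# 80/20 split the training and test proportions should be roughly similar.
train %>% count(life_exp_category) %>% mutate(prop = n / sum(n))
test %>% count(life_exp_category) %>% mutate(prop = n / sum(n))
```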
(d) Use the testing dataset you created in (b) to calculate the confusion matrix for the trees you built in (b) and (c). Report the sensitivity (true positive rate), specificity (true negative rate) and accuracy for each of the trees. Here you will treat "Good" life expectancy as the positive response and prediction.

```r
# Tree from part (b): predict on the test data and cross-tabulate
# predictions against the true categories.
tree_b_pred <- predict(tree_b, newdata = test, type = "class")
confusion_b <- table(predicted = tree_b_pred, actual = test$life_exp_category)
confusion_b

# Tree from part (c)
tree_c_pred <- predict(tree, newdata = test, type = "class")
confusion_c <- table(predicted = tree_c_pred, actual = test$life_exp_category)
confusion_c
```

With "Good" as the positive class, for each tree: sensitivity = (predicted Good among truly Good) / (total truly Good), specificity = (predicted Poor among truly Poor) / (total truly Poor), and accuracy = (correct predictions) / (total test observations).

(e) Fill in the following table using the tree you constructed in part (c). Does the fact that some of the values are missing (NA) prevent you from making predictions for the life expectancy category for these observations?

|       | logGDP | social_support | freedom | generosity | Predicted life expectancy category |
|-------|--------|----------------|---------|------------|------------------------------------|
| Obs 1 | 9.68   | 0.76           | NA      | -0.35      | (from `predict()`)                 |
| Obs 2 | 9.36   | NA             | 0.82    | -0.22      | (from `predict()`)                 |
| Obs 3 | 10.4   | 0.88           | 0.77    | 0.11       | (from `predict()`)                 |
| Obs 4 | 9.94   | 0.85           | 0.63    | 0.01       | (from `predict()`)                 |

No — the missing values do not prevent prediction: `rpart` trees handle NA predictor values using surrogate splits, so `predict()` still returns a category for Obs 1 and Obs 2.

Hint: make a `tibble()` of this data and then use it with the `predict()` function.

### Question 2: Confusion Matrices and Metrics (Accuracy, etc.)

Two classification trees were built to predict which individuals have a disease using different sets of potential predictors. We use each of these trees to predict disease status for 100 new individuals. Below are confusion matrices corresponding to these two classification trees.

Tree A:

|                    | Disease | No disease |
|--------------------|---------|------------|
| Predict disease    | 36      | 22         |
| Predict no disease | 2       | 40         |

Tree B:

|                    | Disease | No disease |
|--------------------|---------|------------|
| Predict disease    | 24      | 6          |
| Predict no disease | 14      | 56         |
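Following the hint in Question 1(e) above, the four observations can be assembled into a tibble and passed to `predict()` (a sketch assuming the part (c) tree is stored in `tree`):

```r
new_obs <- tibble(
  logGDP         = c(9.68, 9.36, 10.4, 9.94),
  social_support = c(0.76, NA,   0.88, 0.85),
  freedom        = c(NA,   0.82, 0.77, 0.63),
  generosity     = c(-0.35, -0.22, 0.11, 0.01)
)
# rpart uses surrogate splits for missing predictor values, so the NAs
# in Obs 1 and Obs 2 do not prevent a prediction.
predict(tree, newdata = new_obs, type = "class")
```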
(a) Calculate the accuracy, false-positive rate, and false negative rate for each classification tree. Here, a "positive" result means we predict an individual has the disease and a "negative" result means we predict they do not.

Tree A: accuracy = (36 + 40)/100 = 0.76; false-positive rate = 22/(22 + 40) ≈ 0.355; false-negative rate = 2/(2 + 36) ≈ 0.053.

Tree B: accuracy = (24 + 56)/100 = 0.80; false-positive rate = 6/(6 + 56) ≈ 0.097; false-negative rate = 14/(14 + 24) ≈ 0.368.

(b) Suppose the disease is very serious if untreated. Explain which classifier you would prefer to use.

If the disease is very serious when untreated, the most costly mistake is a false negative: telling a diseased individual they are healthy. Tree A has a much lower false-negative rate (≈ 0.053 vs ≈ 0.368 for Tree B), so despite its slightly lower accuracy and higher false-positive rate, I would prefer Tree A.

### Question 3: Geometric Interpretation of Prediction

Data was collected on 30 cancer patients to investigate the effectiveness (Yes/No) of a treatment. Two quantitative variables, x1 and x2 (both taking values between 0 and 1), are thought to be important predictors of effectiveness. Suppose that the rectangles labeled as nodes in the scatter plot below represent nodes of a classification tree.

[Scatter plot of x2 against x1, both axes from 0.00 to 1.00, partitioned into four rectangles labelled Node 1 through Node 4; points are coloured by treatment effectiveness (Yes/No).]

(a) The diagram above is the geometric interpretation of a classification tree to predict drug effectiveness based on two predictors, x1 and x2. What is the predicted class of each node?

| Node | Proportion of "Yes" values in each node | Prediction (assume we declare "effective" if more than 50% of the values are "Yes") |
|------|-----------------------------------------|-------------------------------------------------------------------------------------|
| 1    | 5                                       | Not Effective                                                                       |
| 2    | 3                                       | Effective                                                                           |
| 3    | 1                                       | Not Effective                                                                       |
| 4    | 2                                       | Not Effective                                                                       |
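The Question 2(a) calculations above can be double-checked in R (a minimal sketch; the matrices use the same layout as the tables, with rows = prediction and columns = true status):

```r
cm_a <- matrix(c(36, 2, 22, 40), nrow = 2,
               dimnames = list(c("Predict disease", "Predict no disease"),
                               c("Disease", "No disease")))
cm_b <- matrix(c(24, 14, 6, 56), nrow = 2, dimnames = dimnames(cm_a))

metrics <- function(m) {
  tp <- m[1, 1]; fp <- m[1, 2]   # true positives, false positives
  fn <- m[2, 1]; tn <- m[2, 2]   # false negatives, true negatives
  c(accuracy = (tp + tn) / sum(m),
    fpr = fp / (fp + tn),
    fnr = fn / (fn + tp))
}
metrics(cm_a)  # accuracy 0.76, fpr ~0.355, fnr ~0.053
metrics(cm_b)  # accuracy 0.80, fpr ~0.097, fnr ~0.368
```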