STA130H1S – Fall 2022
Problem Set 9
Amogh Shashidhar (1008817666) and STA130 Professors
Instructions
Complete the exercises in this .Rmd file and submit your .Rmd and .pdf output through Quercus on Thursday, November 24th by 5:00 p.m. ET.
library(tidyverse)
library(rpart)
library(partykit)
library(knitr)
Part 1: Binary Classification Decision Trees
Question 1: Gallup World Poll
Using data from the Gallup World Poll (and the World Happiness Report), we are interested in predicting which factors influence life expectancy around the world. These data are in the file happinessdata_2017.csv.
happiness2017 <- read_csv("happiness2017.csv")
(a) Begin by creating a new variable called life_exp_category which takes the value “Good” for countries with a life expectancy higher than 65 years, and “Poor” otherwise.
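# code your answer here
# Note: the new column is not given a name inside mutate(), so it takes the
# full case_when() expression as its name.
life_exp_category <- happiness2017 %>%
  select(country, life_exp) %>%
  mutate(case_when(Value = life_exp > 65 ~ "Good", life_exp <= 65 ~ "Poor"))
life_exp_category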
## # A tibble: 1,420 x 3
##    country     life_exp case_when(Value = life_exp > 65 ~ "Good", life_exp <= ~1
##    <chr>          <dbl> <chr>
##  1 Afghanistan     47.6 Poor
##  2 Afghanistan     47.9 Poor
##  3 Afghanistan     48.2 Poor
##  4 Afghanistan     48.5 Poor
##  5 Afghanistan     48.7 Poor
##  6 Afghanistan     49.0 Poor
##  7 Afghanistan     49.3 Poor
##  8 Afghanistan     49.6 Poor
##  9 Afghanistan     49.9 Poor
## 10 Albania         67.2 Good
## # ... with 1,410 more rows, and abbreviated variable name
## #   1: case_when(Value = life_exp > 65 ~ "Good", life_exp <= 65 ~ "Poor")
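As a point of comparison, here is a minimal sketch of one way to create the variable under the name the question asks for. It is not the submitted code: the data frame `toy` and its values are made up for illustration, and base R `ifelse()` stands in for `case_when()` so the snippet has no package dependencies.

```r
# Hypothetical toy data; the values are illustrative, not from the survey file.
toy <- data.frame(
  country  = c("A", "B", "C", "D"),
  life_exp = c(47.6, 65.0, 67.2, 80.1)
)

# "Good" only when life expectancy is strictly above 65, "Poor" otherwise.
toy$life_exp_category <- ifelse(toy$life_exp > 65, "Good", "Poor")

print(toy)
```

Because the column has an explicit name, later steps can refer to it as `life_exp_category` in a model formula instead of quoting the whole expression.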
(b) Divide the data into training (80%) and testing (20%) datasets. Build a classification tree using the training data to predict which countries have Good vs Poor life expectancy, using only the social_support variable as a predictor.
set.seed(666) # Use the last 3 digits of your student ID number for the random seed.
# code your answer here
n <- dim(life_exp_category)[1]
n_train <- as.integer(n * 0.8)
n_test <- n - n_train
training_indices <- sample(1:n, size = n_train, replace = FALSE)
life_exp_category <- life_exp_category %>% rowid_to_column()
train <- life_exp_category %>% filter(rowid %in% training_indices)
test <- life_exp_category %>% filter(!(rowid %in% training_indices))
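The split above stops before a tree is fitted. The following is a minimal sketch of the remaining step, assuming simulated stand-in data: the data frame `sim`, the seed, and the rule that generates the labels are all hypothetical, and only the column names follow the question.

```r
library(rpart)

set.seed(1)  # arbitrary seed for this illustration

# Simulated stand-in data (hypothetical): higher social support makes "Good" more likely.
n <- 500
sim <- data.frame(social_support = runif(n))
sim$life_exp_category <- factor(
  ifelse(sim$social_support + rnorm(n, sd = 0.15) > 0.6, "Good", "Poor")
)

# 80/20 split by row index, mirroring the approach above.
n_train <- as.integer(n * 0.8)
training_indices <- sample(seq_len(n), size = n_train, replace = FALSE)
train_sim <- sim[training_indices, ]
test_sim  <- sim[-training_indices, ]

# Classification tree with a single predictor, fitted on the training rows only.
tree_b <- rpart(life_exp_category ~ social_support,
                data = train_sim, method = "class")

print(tree_b)
```

Passing `data = train_sim` and using bare column names in the formula keeps the fit restricted to the training rows, which is what makes the held-out rows usable for evaluation afterwards.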
(c) Use the same training dataset created in (b) to build a second classification tree to predict which countries have good vs poor life expectancy, using logGDP, social_support, freedom, and generosity as potential predictors.
# code your answer here
tree <- rpart(life_exp_category$`case_when(Value = life_exp > 65 ~ "Good", life_exp <= 65 ~ "Poor")` ~ h
tree %>% as.party() %>% plot(type = "extended", tp_args = list(id = FALSE))
[Figure: classification tree plot. Node 1 splits on happiness2017$logGDP at 9.521 and node 2 splits on happiness2017$logGDP at 10.143; node 4 splits on happiness2017$generosity at -0.177. The four terminal nodes (n = 359, n = 188, n = 83, n = 781) each show a bar chart of the Poor and Good proportions on a 0 to 1 scale.]
(d) Use the testing dataset you created in (b) to calculate the confusion matrix for the trees you built in (b) and (c). Report the sensitivity (true positive rate), specificity (true negative rate) and accuracy for each of the trees. Here you will treat “Good” life expectancy as the positive response and prediction.
# code your answer here for the tree created in part (b)
tree_train_pred <- predict(tree, type = "class")
train_confusion_matrix <- table(`y-hat` = tree_train_pred)
train_confusion_matrix
## y-hat
## Good Poor
##  547  864
# code your answer here for the tree created in part (c)
train_confusion_matrix / sum(train_confusion_matrix)
## y-hat
##      Good      Poor
## 0.3876683 0.6123317
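Part (d) asks for sensitivity, specificity, and accuracy, which need a two-way table of predictions against true classes. The sketch below shows that calculation on ten made-up labels; the vectors `actual` and `predicted` are hypothetical and do not come from the happiness data.

```r
# Hypothetical labels for ten test observations (not from the happiness data).
actual <- factor(
  c("Good", "Good", "Good", "Good", "Poor", "Poor", "Poor", "Poor", "Poor", "Good"),
  levels = c("Good", "Poor")
)
predicted <- factor(
  c("Good", "Good", "Poor", "Good", "Poor", "Poor", "Good", "Poor", "Poor", "Good"),
  levels = c("Good", "Poor")
)

# Rows are predictions, columns are the true classes.
cm <- table(predicted = predicted, actual = actual)

tp <- cm["Good", "Good"]  # predicted Good, truly Good
fn <- cm["Poor", "Good"]  # predicted Poor, truly Good
fp <- cm["Good", "Poor"]  # predicted Good, truly Poor
tn <- cm["Poor", "Poor"]  # predicted Poor, truly Poor

sensitivity <- tp / (tp + fn)      # true positive rate, "Good" as positive
specificity <- tn / (tn + fp)      # true negative rate
accuracy    <- (tp + tn) / sum(cm)

print(cm)
print(c(sensitivity = sensitivity, specificity = specificity, accuracy = accuracy))
```

In the problem itself, `predicted` would typically come from `predict()` with `newdata` set to the testing data and `type = "class"`, and `actual` from the true category column of the testing data.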
(e) Fill in the following table using the tree you constructed in part (c). Does the fact that some
of the values are missing (NA) prevent you from making predictions for the life expectancy
category for these observations?
| | logGDP | social_support | freedom | generosity | Predicted life expectancy category |
|---|---|---|---|---|---|
| Obs 1 | 9.68 | 0.76 | NA | -0.35 | 547 |
| Obs 2 | 9.36 | NA | 0.82 | -0.22 | 864 |
| Obs 3 | 10.4 | 0.88 | 0.77 | 0.11 | 0.3876683 |
| Obs 4 | 9.94 | 0.85 | 0.63 | 0.01 | 0.6123317 |
Hint: make a tibble() of this data and then use it with the predict() function.
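The hint can be sketched as follows. This is an illustration under stated assumptions, not the answer to the table: the tree `tree_c` is fitted to simulated data (the data frame `sim` and its label rule are hypothetical), so its predictions say nothing about the real countries. A plain `data.frame()` is used in place of `tibble()` to avoid extra dependencies. What the sketch shows is that `predict()` on an rpart tree still returns a class for rows containing NA, because rpart falls back on surrogate splits or the majority direction at a split whose variable is missing.

```r
library(rpart)

set.seed(1)  # arbitrary seed for this illustration

# Simulated training data (hypothetical); only the column names follow the question.
n <- 400
sim <- data.frame(
  logGDP         = runif(n, 7, 11),
  social_support = runif(n, 0.3, 1),
  freedom        = runif(n, 0.3, 1),
  generosity     = runif(n, -0.4, 0.4)
)
sim$life_exp_category <- factor(
  ifelse(sim$logGDP + 2 * sim$social_support + rnorm(n, sd = 0.5) > 10.5,
         "Good", "Poor")
)

tree_c <- rpart(life_exp_category ~ logGDP + social_support + freedom + generosity,
                data = sim, method = "class")

# The four observations from the table, with NA kept as missing values.
new_obs <- data.frame(
  logGDP         = c(9.68, 9.36, 10.4, 9.94),
  social_support = c(0.76, NA, 0.88, 0.85),
  freedom        = c(NA, 0.82, 0.77, 0.63),
  generosity     = c(-0.35, -0.22, 0.11, 0.01)
)

pred <- predict(tree_c, newdata = new_obs, type = "class")
print(pred)
```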
Question 2: Confusion Matrices and Metrics (Accuracy, etc.)
Two classification trees were built to predict which individuals have a disease using different sets of potential
predictors. We use each of these trees to predict disease status for 100 new individuals. Below are confusion
matrices corresponding to these two classification trees.
Tree A

| | Disease | No disease |
|---|---|---|
| Predict disease | 36 | 22 |
| Predict no disease | 2 | 40 |

Tree B

| | Disease | No disease |
|---|---|---|
| Predict disease | 24 | 6 |
| Predict no disease | 14 | 56 |
(a) Calculate the accuracy, false-positive rate, and false negative rate for each classification
tree. Here, a “positive” result means we predict an individual has the disease and a “negative”
result means we predict they do not.
The accuracy of Tree A is (36 + 40)/100 = 0.76, its false-positive rate is 22/(22 + 40) ≈ 0.355, and its false-negative rate is 2/(2 + 36) ≈ 0.053. The accuracy of Tree B is (24 + 56)/100 = 0.80, its false-positive rate is 6/(6 + 56) ≈ 0.097, and its false-negative rate is 14/(14 + 24) ≈ 0.368.
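The three rates can be computed directly from the two confusion matrices in the question. The sketch below is one way to do it; the helper name `rates` and the object names are chosen for illustration.

```r
# Confusion matrices from the question: rows are predictions, columns are true status.
# matrix() fills column by column, so each pair below is one "actual" column.
tree_a <- matrix(c(36, 2, 22, 40), nrow = 2,
                 dimnames = list(predicted = c("disease", "no disease"),
                                 actual    = c("disease", "no disease")))
tree_b <- matrix(c(24, 14, 6, 56), nrow = 2, dimnames = dimnames(tree_a))

# Hypothetical helper: accuracy, false-positive rate, and false-negative rate.
rates <- function(cm) {
  tp <- cm["disease", "disease"]
  fp <- cm["disease", "no disease"]
  fn <- cm["no disease", "disease"]
  tn <- cm["no disease", "no disease"]
  c(accuracy = (tp + tn) / sum(cm),
    fpr      = fp / (fp + tn),   # share of disease-free people flagged as diseased
    fnr      = fn / (fn + tp))   # share of diseased people who are missed
}

rates_a <- rates(tree_a)
rates_b <- rates(tree_b)
print(round(rbind(A = rates_a, B = rates_b), 4))
```

The false-positive rate is taken over the 62 individuals without the disease and the false-negative rate over the 38 with it, so the denominators are column totals rather than the overall 100.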
(b) Suppose the disease is very serious if untreated. Explain which classifier you would prefer
to use.
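If the disease is very serious when untreated, I would prefer Tree A. A false negative means that a person with the disease is predicted to be healthy and goes untreated. Tree A misses only 2 of the 38 individuals with the disease (false-negative rate ≈ 0.053), whereas Tree B misses 14 of 38 (≈ 0.368). Tree A does produce more false positives (22 versus 6), but that cost is smaller than the cost of missing a serious disease.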
Question 3: Geometric Interpretation of Prediction
Data were collected on 30 cancer patients to investigate the effectiveness (Yes/No) of a treatment. Two quantitative variables, x1 and x2 (both taking values between 0 and 1), are thought to be important predictors of effectiveness. Suppose that the rectangles labeled as nodes in the scatter plot below represent nodes of a classification tree.
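[Figure: scatter plot of x1 and x2, both axes running from 0.00 to 1.00, with points marked by Effectiveness (Yes/No) and the plot area divided into four rectangles labeled Node 1, Node 2, Node 3, and Node 4.]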
(a) The diagram above is the geometric interpretation of a classification tree to predict drug
effectiveness based on two predictors, x1 and x2. What is the predicted class of each node?
| Node | Proportion of “Yes” values in each node | Prediction (assume we declare “effective” if more than 50% of the values are “Yes”) |
|---|---|---|
| 1 | 5 | Not Effective |
| 2 | 3 | Effective |
| 3 | 1 | Not Effective |
| 4 | 2 | Not Effective |