STAT3032_HW9_Section001_S2023_Solution.docx

School

University of Minnesota-Twin Cities

Course

3032

Subject

Statistics

Date

Feb 20, 2024

STAT 3032 Regression and Correlated Data

Homework 9

Please show your work on each problem for full credit. A correct answer, unsupported by the necessary explanation, R code or output will receive very little if any credit. Your work needs to be organized in a reasonably neat and coherent way, and submitted as a pdf file on Canvas. Please do not share this handout outside the class.

Problem 1

On April 15, 1912, during her maiden voyage, the ship Titanic sank after colliding with an iceberg, killing many passengers and crew. Here, we will use a subset of the data to analyze the survival rates for different groups of people. Please download the dataset TitanicPartial_v2.csv from Canvas and work through the following questions.

Variables used in this analysis:
Survival: survival status. 1 = survived, 0 = did not survive
Pclass: passenger class, 1 = First, 2 = Second, 3 = Third
Age: age in years.
SibSp: number of siblings/spouses aboard

(a) Explore the data. How many passengers are included in the dataset? How many of them survived and how many of them did not survive? Please explain how you obtain the answers.

> dat = read.csv('TitanicPartial_v2.csv')
> table(dat$Survival)

  0   1
424 290

424 passengers did not survive; 290 passengers survived; the total number of passengers is 714 (= 290 + 424).

Alternatively, you can use the following code to obtain the answers.

> nrow(dat)
[1] 714
> sum(dat$Survival)
[1] 290
> 714 - 290
[1] 424

424 passengers did not survive; 290 passengers survived; the total number of passengers is 714 (= 290 + 424).

(b) See below for the scatterplots of the survival status of the passengers in the different classes. Each dot represents a passenger. Based on the scatterplots only, which passenger class has the lowest odds for survival for those who are in the age group 40-50? Please explain your answer.

Hint: look at the relative number of survival and non-survival in each group. For a particular group of passengers,

The estimated odds of survival = (The proportion of survivors) / (The proportion of the deceased)

In the red box (age group 40-50), we can see that the proportion of survivors is lower in the third class than any other. So the third class has the lowest odds of survival.

(c) Fit the following regression model using a suitable generalized linear model and provide the model summary. Explain your choice of GLM
mod1: Survival ~ 1 + as.factor(Pclass) + Age

We will use a logistic regression model. The reason for this choice is that the response is binary (survive or not survive) and we are interested in probability/odds of survival.

> mod1 = glm(Survival ~ 1 + as.factor(Pclass) + Age, data=dat, family=binomial)
> summary(mod1)

Call:
glm(formula = Survival ~ 1 + as.factor(Pclass) + Age, family = binomial, data = dat)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.1524  -0.8466  -0.6083   1.0031   2.3929

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)
(Intercept)         2.296012   0.317629   7.229 4.88e-13 ***
as.factor(Pclass)2 -1.137533   0.237578  -4.788 1.68e-06 ***
as.factor(Pclass)3 -2.469561   0.240182 -10.282  < 2e-16 ***
Age                -0.041755   0.006736  -6.198 5.70e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 964.52  on 713  degrees of freedom
Residual deviance: 827.16  on 710  degrees of freedom
AIC: 835.16

Number of Fisher Scoring iterations: 4

(d) What happens if we don’t apply the as.factor( ) function to Pclass? Try fitting mod1 without as.factor( ). You can call this new model mod2. How do the summary outputs of mod1 and mod2 differ?

Hint: how many slope(s) is/are associated with Pclass when you don’t use as.factor( )?

> mod2 = glm(Survival ~ 1 + Pclass + Age, data=dat, family=binomial)
> summary(mod2)

Call:
glm(formula = Survival ~ 1 + Pclass + Age, family = binomial, data = dat)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.1712  -0.8550  -0.6136   1.0127   2.3883

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  3.585448   0.406761   8.815  < 2e-16 ***
Pclass      -1.243853   0.119060 -10.447  < 2e-16 ***
Age         -0.042006   0.006725  -6.246  4.2e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 964.52  on 713  degrees of freedom
Residual deviance: 827.43  on 711  degrees of freedom
AIC: 833.43

Number of Fisher Scoring iterations: 4

Only 1 slope is associated with Pclass when I don’t use as.factor( ), whereas in Part (c) Pclass has 2 slopes, one for each of its dummy variables. Without as.factor( ), R treats Pclass as a quantitative variable.

(e) Interpret the slope of Age in mod1 in context.

Solution 1: (Note: Solutions 2 and 3 are preferred but not required.)
Controlling for the passenger class, when the age increases by 1 year, the log of the odds of survival decreases by 0.0418.

> exp(-0.041755)-1
[1] -0.04089527

Solution 2:
Controlling for the passenger class, when the age increases by 1 year, the odds of survival decrease by 4.1%.

Solution 3:
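
As a check on the arithmetic in Solutions 1 and 2 (not a reconstruction of Solution 3), the percent change in the odds can be recomputed from the Age estimate printed in part (c). This is a minimal R sketch that assumes only that printed coefficient. The names b_age, odds_ratio, and pct_change are illustrative and do not appear in the original solution.

```r
# Age slope copied from summary(mod1) in part (c); it is on the log-odds scale.
b_age <- -0.041755

# Exponentiating a log-odds slope gives the multiplicative change in the odds
# of survival for a 1-year increase in age, holding passenger class fixed.
odds_ratio <- exp(b_age)

# Express the same change as a percentage of the current odds.
pct_change <- 100 * (odds_ratio - 1)

print(round(odds_ratio, 4))   # 0.9591
print(round(pct_change, 1))   # -4.1
```

Under these assumptions, the odds are multiplied by roughly 0.959 per additional year, which is the 4.1% decrease stated in Solution 2. The slope itself (-0.0418) is the change on the log-odds scale described in Solution 1.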