exam1at430sol

pdf

School

University of Illinois, Urbana Champaign

Course

425

Subject

Statistics

Date

Feb 20, 2024

Type

pdf

Pages

10

STAT 425 Exam 1 @ 4:30 pm
October 4, 2023, 4:30pm

Name: SOLUTIONS    Netid: _________________________

This is an 80 minute handwritten exam. There are 5 problems, each worth 10 points. Do not start working until your proctor tells you to start. Your head must be visible to your proctor on Zoom along with your screen and the work in front of you. You may use a calculator but not the computer or R or the internet for your work.

To work on the exam you can either:
1. Print out the exam and do your handwritten work on the exam itself; or
2. View the exam on your screen and do work on separate sheets of paper. Clearly label your work as to which problem number (1,2,3..) and part (a,b,c..) you are solving; or
3. Do work on a blank file or pdf of the exam on a tablet, and upload your work file from the tablet. In this case the proctor must be able to see your tablet.

Scanning and uploading your exam: After you finish, scan/photograph each page and upload into Moodle in the same way you upload assignment files.

You are allowed two one-sided 8.5 by 11 inch sheets of notes for yourself. Scan and upload after the exam.
Problem 1. (3 parts) Data will be collected in the form $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, where $x_i$ is the $i$th value of a fixed, nonrandom explanatory variable, and $y_i$ is the corresponding random response. Consider the model

$$y_i = \beta_0 + \beta_1 x_i + e_i, \quad i = 1, \ldots, n$$

for unknown parameters $\beta_0$ and $\beta_1$ and random errors $e_1, \ldots, e_n$ that are independent with mean zero and variances equal to an unknown constant $\sigma^2$. The ordinary least squares estimators for $\beta_0$ and $\beta_1$ are given by

$$\hat\beta_0 = \bar y - \bar x \hat\beta_1 \quad \text{and} \quad \hat\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{n} (x_i - \bar x)^2},$$

where $\bar x = \frac{1}{n} \sum_{i=1}^{n} x_i$ and $\bar y = \frac{1}{n} \sum_{i=1}^{n} y_i$.

(a) (3 pts) If $x_1 = 3.5$ and $x_2 = 5.0$, find $E(y_2) - E(y_1)$ in terms of the model parameters.

$$E(y_2) - E(y_1) = (\beta_0 + 5\beta_1) - (\beta_0 + 3.5\beta_1) = 1.5\beta_1$$

(b) (3 pts) Find an explicit expression for $E(\bar y)$ in terms of the model parameters and predictor variables.

$$E(\bar y) = E\left(\frac{1}{n} \sum_{i=1}^{n} y_i\right) = \frac{1}{n} \sum_{i=1}^{n} E(y_i) = \frac{1}{n} \sum_{i=1}^{n} (\beta_0 + \beta_1 x_i) = \beta_0 + \beta_1 \frac{1}{n} \sum_{i=1}^{n} x_i = \beta_0 + \beta_1 \bar x$$

(c) (4 pts) After the data are collected we find $n = 20$, $\sum_{i=1}^{20} (x_i - \bar x)^2 = 50$, $\sum_{i=1}^{20} (y_i - \bar y)^2 = 33$, and $\sum_{i=1}^{20} (y_i - \hat y_i)^2 = 9.6$, where $\hat y_1, \ldots, \hat y_{20}$ are the fitted values for LS regression of $y$ on $x$. Based on these results, show how to calculate the standard error for $\hat\beta_1$, plugging in all the relevant numbers. You do not have to complete the calculation.

$$se(\hat\beta_1) = \frac{\hat\sigma}{\sqrt{\sum_{i=1}^{20} (x_i - \bar x)^2}} = \sqrt{\frac{\sum_{i=1}^{20} (y_i - \hat y_i)^2 / (20 - 2)}{\sum_{i=1}^{20} (x_i - \bar x)^2}} = \sqrt{\frac{9.6 / 18}{50}}$$
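The exam does not ask for the final number in part (c), but the plug-in expression can be checked with a few lines of arithmetic. The sketch below uses only the summary values quoted in the problem; the variable names are illustrative.

```python
import math

n = 20       # sample size from part (c)
sxx = 50.0   # sum of (x_i - xbar)^2
rss = 9.6    # residual sum of squares, sum of (y_i - yhat_i)^2

# Unbiased estimate of sigma^2 uses n - 2 residual degrees of freedom
sigma2_hat = rss / (n - 2)

# Standard error of the slope estimate
se_beta1 = math.sqrt(sigma2_hat / sxx)
print(round(se_beta1, 4))
```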
Problem 2. (4 parts) Data on fuel consumption were collected for each of the 50 states and Washington D.C. for a total sample size of $n = 51$. The variables are gasoline Tax (cents/gallon), Fuel consumption per 1000 pop. over 16, Dlic (Licensed Drivers per 1000 population over 16), and logMiles ($\log_{10}$ miles of highway in the state). The following linear model was fit to the data.

$$\text{Fuel} = \beta_0 + \beta_1 \text{Tax} + \beta_2 \text{Dlic} + \beta_3 \text{logMiles} + \text{error}$$

The results of fitting a linear model of this form are summarized below.

    ##
    ## Call:
    ## lm(formula = Fuel ~ Tax + Dlic + logMiles, data = df)
    ##
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max
    ## -171.13  -48.91    5.34   41.90  193.25
    ##
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)
    ## (Intercept) -166.926    168.544   -0.99  0.32705
    ## Tax           -3.999      2.171   -1.84  0.07175
    ## Dlic           0.536      0.135    3.96  0.00025
    ## logMiles      79.445     21.972    3.62  0.00073
    ##
    ## Residual standard error: 69.4 on 47 degrees of freedom
    ## Multiple R-squared: 0.427, Adjusted R-squared: 0.391
    ## F-statistic: 11.7 on 3 and 47 DF, p-value: 7.66e-06

(a) (2 pts) Based on the results, what is the proportion of total variance explained by the model?

Multiple R-squared = 0.427

(b) (2 pts) Based on the fitted model, estimate the expected Fuel consumption per 1000 population for a state with the following profile:

    ##  Tax  Dlic logMiles
    ##   20 782.8      5.1

Set up the calculation with all relevant numbers. You do not need to complete the calculation.

-166.926 + (20)(-3.999) + (782.8)(0.536) + (5.1)(79.445)
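For readers who want the number behind the setup in part (b), here is a minimal sketch that evaluates the fitted equation at the given profile. The coefficients are copied from the summary table; the dictionary names are illustrative only.

```python
# Estimated coefficients from the model summary
coef = {"(Intercept)": -166.926, "Tax": -3.999, "Dlic": 0.536, "logMiles": 79.445}

# Predictor profile from part (b)
profile = {"Tax": 20.0, "Dlic": 782.8, "logMiles": 5.1}

# Fitted value = intercept + sum of coefficient * predictor value
fuel_hat = coef["(Intercept)"] + sum(coef[name] * value for name, value in profile.items())
print(round(fuel_hat, 2))
```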
(c) (3 pts) Consider the F-Statistic test results given on the last line of the summary. State the null hypothesis $H_0$ and alternative hypothesis $H_A$ for this test. Express the hypotheses in terms of the unknown parameters $\beta_0, \beta_1, \beta_2, \beta_3, \sigma^2$.

$H_0: \beta_1 = \beta_2 = \beta_3 = 0$
$H_A$: at least one of $\beta_1, \beta_2, \beta_3 \neq 0$

Equivalently, $H_A: \beta_1 \neq 0$ or $\beta_2 \neq 0$ or $\beta_3 \neq 0$.

(d) (3 pts) Based on the model summary and mathematical notation above, provide the t value and p-value for testing the null hypothesis $H_0: \beta_1 = 0$ against the alternative $H_a: \beta_1 \neq 0$. Also give the degrees of freedom for this test.

This is the coefficient t test for Tax. From the model summary we have

t value = $-1.84$, p-value = $0.07175$

The degrees of freedom = 47 = degrees of freedom for residual standard error.
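The quantities in parts (c) and (d) can be cross-checked against the other numbers in the summary table. The sketch below recomputes the Tax t value as estimate divided by standard error, the residual degrees of freedom as $n - p$, and the overall F statistic from R-squared. Because the inputs are rounded table values, the results only match the printed output to its displayed precision.

```python
n, p = 51, 4                           # observations; coefficients including intercept
tax_estimate, tax_se = -3.999, 2.171   # Tax row of the coefficient table
r_squared = 0.427                      # Multiple R-squared from the summary

t_value = tax_estimate / tax_se        # coefficient t statistic
resid_df = n - p                       # residual degrees of freedom

# Overall F statistic expressed through R^2 (numerator df = p - 1)
f_stat = (r_squared / (p - 1)) / ((1 - r_squared) / resid_df)

print(round(t_value, 2), resid_df, round(f_stat, 1))
```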
Problem 3. (3 parts) Consider a model of the form $y = X\beta + e$, where $X$ is an $n \times p$ full rank matrix (its columns are linearly independent), $y$ and $e$ are $n \times 1$, and $\beta$ is $p \times 1$. Assume $X$ is a fixed (non-random) matrix, $E(e) = 0$, and $\text{cov}(e) = \sigma^2 I$. The least squares estimator of $\beta$ is $\hat\beta = (X^T X)^{-1} X^T y$. The projection matrix or “hat” matrix is $H = X (X^T X)^{-1} X^T$.

(a) (3 pts) Show that $X\hat\beta = Hy$.

$$X\hat\beta = X (X^T X)^{-1} X^T y = Hy$$

(b) (4 pts) If $\hat y$ is the vector of least squares fitted values, show that $\text{Cov}(\hat y) = \sigma^2 H$.

Method 1:

$$\text{Cov}(\hat y) = \text{Cov}(X\hat\beta) = X \, \text{Cov}(\hat\beta) \, X^T = X \left(\sigma^2 (X^T X)^{-1}\right) X^T = \sigma^2 X (X^T X)^{-1} X^T = \sigma^2 H$$

Method 2:

$$\text{Cov}(\hat y) = \text{Cov}(Hy) = H \, \text{Cov}(y) \, H^T = H (\sigma^2 I) H \ \ (H \text{ is symmetric}) = \sigma^2 HH = \sigma^2 H \ \ (H \text{ is idempotent})$$

(c) (3 pts) Explain why $\text{var}(\hat y_i) = \sigma^2 h_i$ for $i = 1, 2, \ldots, n$, where $h_i$ is the $(i, i)$ diagonal element of $H$.

The diagonal elements of $\text{Cov}(\hat y) = \sigma^2 H$ are the variances of $\hat y_1, \ldots, \hat y_n$. These diagonal elements are $\sigma^2 h_1, \ldots, \sigma^2 h_n$.
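The properties of $H$ used in Method 2 (symmetry and $HH = H$) can be illustrated numerically. The sketch below is not part of the exam: it assumes the special case of simple linear regression, where the hat matrix has the closed form $h_{ij} = 1/n + (x_i - \bar x)(x_j - \bar x) / \sum_k (x_k - \bar x)^2$, and it uses made-up $x$ values.

```python
# Hypothetical predictor values, chosen only for illustration
x = [1.0, 2.0, 4.0, 7.0]
n = len(x)
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)

# Hat matrix for simple linear regression (intercept + one predictor)
H = [[1.0 / n + (xi - xbar) * (xj - xbar) / sxx for xj in x] for xi in x]

def matmul(A, B):
    """Plain-Python product of two square matrices of the same size."""
    size = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

HH = matmul(H, H)

# Largest entrywise gap between HH and H (idempotency check)
max_idempotent_gap = max(abs(HH[i][j] - H[i][j]) for i in range(n) for j in range(n))
# Largest entrywise gap between H and its transpose (symmetry check)
max_symmetry_gap = max(abs(H[i][j] - H[j][i]) for i in range(n) for j in range(n))
# Trace equals the number of columns of X, here p = 2
trace = sum(H[i][i] for i in range(n))

print(max_idempotent_gap, max_symmetry_gap, trace)
```

The diagonal entries `H[i][i]` are the leverages $h_i$ that appear in part (c).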
Problem 4. (4 parts) Data were collected on variables x1, x2, x3, x4, x5, and y. Two models were compared using the anova function. Here are the results:

    ## Analysis of Variance Table
    ##
    ## Model 1: y ~ x1 + x5
    ## Model 2: y ~ x1 + x2 + x3 + x4 + x5
    ##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)
    ## 1     47 755.48
    ## 2     44 641.26  3    114.21 2.6122 0.06315

(a) (2 pts) Find the value for residual sum of squares, $\|y - \hat y\|^2 = \sum_{i=1}^{n} (y_i - \hat y_i)^2$, for Model 2, and also find the residual degrees of freedom for this model.

$RSS_2 = 641.26$ with 44 degrees of freedom.

(b) (2 pts) Suppose the errors in Model 2 have expectations equal to zero, are uncorrelated, and have constant variance $\sigma^2$. Calculate an unbiased estimate $\hat\sigma^2$ of $\sigma^2$.

$$\hat\sigma^2 = \frac{RSS_2}{df_2} = \frac{641.26}{44} = 14.57$$

(c) (3 pts) State the null and alternative hypotheses being tested by the F statistic in the Analysis of Variance table given above.

Several ways to state the hypotheses:

$H_0$: The model including only x1 and x5 is adequate.
$H_A$: At least one of the variables x2, x3, x4 must also be included in the model.

$H_0: \beta_{x2} = \beta_{x3} = \beta_{x4} = 0$.
$H_A: \beta_{x2} \neq 0$ or $\beta_{x3} \neq 0$ or $\beta_{x4} \neq 0$.

$H_0$: Model y ~ x1 + x5
$H_A$: Model y ~ x1 + x2 + x3 + x4 + x5
(d) (3 pts) (i) Give the numerical values for the degrees of freedom of the F-test in the table. (ii) Is the null hypothesis rejected at level $\alpha = 0.05$? How do you know?

(i) Degrees of freedom (numerator, denominator) = (3, 44).

(ii) $H_0$ is not rejected at level 0.05 because the p-value, $p = 0.06315$, is greater than 0.05.
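The entries of the Analysis of Variance table in Problem 4 can be reproduced from the two RSS values and their degrees of freedom. The sketch below recomputes $\hat\sigma^2$ for Model 2 and the partial F statistic, then applies the decision rule using the p-value printed in the table. Because it works from rounded RSS values, the F statistic agrees with the table only to about three decimals.

```python
rss1, df1 = 755.48, 47      # Model 1: y ~ x1 + x5
rss2, df2 = 641.26, 44      # Model 2: y ~ x1 + x2 + x3 + x4 + x5
p_value_from_table = 0.06315
alpha = 0.05

sigma2_hat = rss2 / df2     # unbiased variance estimate under Model 2
num_df = df1 - df2          # number of coefficients tested (x2, x3, x4)

# Partial F statistic comparing the nested models
f_stat = ((rss1 - rss2) / num_df) / sigma2_hat

reject_h0 = p_value_from_table < alpha
print(round(sigma2_hat, 2), num_df, round(f_stat, 2), reject_h0)
```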
Problem 5. (4 parts) Data on fuel consumption were collected for each of the 50 states and Washington D.C. for a total sample size of $n = 51$. The variables are gasoline Tax (cents/gallon), Fuel consumption per 1000 pop. over 16, Dlic (Licensed Drivers per 1000 population over 16), and logMiles ($\log_{10}$ miles of highway in the state). The following linear model was fit to the data.

$$\text{Fuel} = \beta_0 + \beta_1 \text{Tax} + \beta_2 \text{Dlic} + \beta_3 \text{logMiles} + \text{error}$$

Below is a plot of standardized residuals versus diagonals of the “hat” matrix after fitting the above model by ordinary least squares. The points are labeled with the two letter state abbreviations.

[Figure: Standardized Residual (vertical axis, -3 to 3) versus Leverage (h_i) (horizontal axis, 0.00 to 0.50), with points labeled by state abbreviation.]

(a) (3 pts) We know that in general $\sum_{i=1}^{n} h_i = p$, where $p$ is the number of columns of the design matrix $X$ including the constant column for the intercept. Based on the information given, (i) find the average (sample mean) of the leverages in the data, and (ii) identify the states that have leverage more than twice the average value (identify by their labels).

(i) Average leverage = 4/51 = 0.0784

(ii) Twice the average = 8/51 = 0.157. According to the graph, the states above this level are VT, GA, HI, RI, AK, and DC.
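The leverage cutoff in part (a) follows directly from $\sum h_i = p$. A minimal sketch of the arithmetic, using the $n$ and $p$ of this model (the variable names are illustrative):

```python
n = 51   # 50 states plus Washington D.C.
p = 4    # intercept, Tax, Dlic, logMiles

average_leverage = p / n              # mean of the h_i, since they sum to p
high_leverage_cutoff = 2 * average_leverage

print(round(average_leverage, 4), round(high_leverage_cutoff, 3))
```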
(b) (3 pts) Based on the above plot, state which of the following would have the largest Cook’s Distance and explain why: AK, RI, WY.

AK has the largest Cook’s Distance because Cook’s Distance increases as a function of absolute standardized residual and leverage. Specifically:

AK has larger Cook’s Distance than RI because it has higher leverage and absolute standardized residual.

AK also has higher Cook’s Distance than WY because, while the AK absolute standardized residual is very similar to that of WY, the AK leverage is more than twice that of WY.

(c) (2 pts) The plot below shows quantiles of the standardized residuals plotted against quantiles of the standard normal distribution. What potential problem with the linear model assumptions is this plot meant to diagnose? Does the plot below suggest any problem with the model? Explain briefly.

[Figure: Normal Q-Q Plot of Sample Quantiles (vertical axis, -3 to 3) versus Theoretical Quantiles (horizontal axis, -2 to 2).]

This type of plot is for checking whether the error distribution is non-normal. The plot suggests the distribution might have heavy tails or outliers, with several values at each end deviating from the straight line trend of the inner quantiles.
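The reasoning in part (b) rests on how Cook’s Distance depends on the standardized residual $r_i$ and leverage $h_i$. The sketch below assumes the standard expression $D_i = \frac{r_i^2}{p} \cdot \frac{h_i}{1 - h_i}$, which the exam does not state explicitly. The residual and leverage values are hypothetical and are not read off the plot.

```python
def cooks_distance(std_resid, leverage, p):
    """Cook's Distance from a standardized residual and a leverage value.

    Assumes the usual formula D = (r^2 / p) * h / (1 - h), with 0 < h < 1.
    """
    if not 0.0 < leverage < 1.0:
        raise ValueError("leverage must lie strictly between 0 and 1")
    return (std_resid ** 2 / p) * leverage / (1.0 - leverage)

p = 4  # number of regression coefficients in the fuel model

# Hypothetical points with the same residual but different leverage
d_low_leverage = cooks_distance(2.0, 0.10, p)
d_high_leverage = cooks_distance(2.0, 0.25, p)

# Hypothetical points with the same leverage but different residual size
d_small_resid = cooks_distance(1.0, 0.25, p)

print(d_low_leverage, d_high_leverage, d_small_resid)
```

With the residual held fixed, the higher-leverage point gets the larger distance, and with leverage held fixed, the larger absolute residual does. That is the pattern used to rank AK above WY and RI.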
(d) (2 pts) Consider Box-Cox models of the form

$$g_\lambda(\text{Fuel}) = \beta_0 + \beta_1 \text{Tax} + \beta_2 \text{Dlic} + \beta_3 \text{logMiles} + \text{error}$$

where

$$g_\lambda(y) = \begin{cases} \dfrac{y^\lambda - 1}{\lambda}, & \text{if } \lambda \neq 0 \\ \log_e(y), & \text{if } \lambda = 0. \end{cases}$$

The plot below shows log-likelihood versus the value of $\lambda$, where for each $\lambda$ we obtain a set of coefficient estimates for the regression parameters specific to that value of $\lambda$ by the method of maximum likelihood. For each $\lambda$ the result is the same as if we performed ordinary least squares regression of $g_\lambda(\text{Fuel})$ on the predictor variables.

[Figure: log-Likelihood (vertical axis) versus $\lambda$ (horizontal axis, -1 to 4), with the 95% confidence interval for $\lambda$ marked.]

Consider the null hypothesis that $\lambda = 1$. Based on the results given, should we reject or fail to reject this hypothesis at level $\alpha = 0.05$? Why or why not?

The 95% confidence interval for $\lambda$ in the graph includes the value $\lambda = 1$ so we fail to reject the null hypothesis.
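The transformation $g_\lambda$ and the interval rule behind the plot can be sketched in code. The helper names below are hypothetical. The interval check assumes the usual likelihood-ratio construction, in which $\lambda$ lies in the 95% interval when its log-likelihood is within half the 0.95 quantile of a chi-square distribution with 1 degree of freedom (about 3.841) of the maximum. The exam itself only shows the resulting interval on the plot.

```python
import math

def box_cox(y, lam):
    """Box-Cox transform g_lambda(y) for a positive response y."""
    if y <= 0:
        raise ValueError("Box-Cox requires y > 0")
    if lam == 0:
        return math.log(y)          # natural log at lambda = 0
    return (y ** lam - 1.0) / lam

# Approximate 0.95 quantile of the chi-square distribution with 1 df
CHI2_1_95 = 3.841

def inside_95_interval(loglik_at_lambda, loglik_max):
    """Assumed likelihood-ratio rule for membership in the 95% interval."""
    return loglik_at_lambda > loglik_max - 0.5 * CHI2_1_95

# lambda = 1 only shifts the response by a constant
print(box_cox(5.0, 1))
# The lambda -> 0 limit matches the log transform
print(box_cox(2.0, 1e-8), math.log(2.0))
```

Since $\lambda = 1$ changes the response only by subtracting 1, failing to reject $\lambda = 1$ means the data give no strong evidence that a transformation of Fuel is needed.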