This report covers the diabetes dataset from Efron et al. (2003). Throughout this report, we explore the potential relationship between ten predictor variables (age, sex, BMI, average blood pressure, and six blood serum measurements: tc, ldl, hdl, tch, ltg, glu) and a quantitative measure of disease progression one year after baseline. The dataset contains 442 diabetes patients, or records. We apply several modeling methods and cross-validation techniques to these data, leveraging linear regression, ridge regression, and the lasso, and evaluating them with metrics such as the MSE, the mean prediction error, and standard errors on the test dataset.

For the positive coefficients, a one-unit increase in that variable, holding the others fixed, is associated with an increase in diabetes progression. At the 0.05 significance level, the linear regression model identifies five significant predictor variables: age, tc, ldl, tch, and glu. To validate the model assumptions, we can plot the residuals against the fitted values and check for any departure from random scatter. The residual plot shows no violations of this assumption, so we calculate the MSE of the model, which is 3111.265. Next, we leverage the best subset method to select the predictor variables that truly contribute to the model.
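The linear regression fit and test MSE described above can be sketched as follows (a minimal sketch in Python with scikit-learn, which ships this same diabetes dataset; the 75/25 split and random seed are illustrative choices, so the resulting MSE will not match the value reported above):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Load the 442-patient diabetes dataset: 10 predictors, progression target.
X, y = load_diabetes(return_X_y=True)

# Hold out a test set; the split ratio and seed here are illustrative.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit ordinary least squares and estimate the test MSE.
model = LinearRegression().fit(X_train, y_train)
mse = mean_squared_error(y_test, model.predict(X_test))
print(f"Test MSE: {mse:.3f}")
```

The signs of `model.coef_` indicate, for each predictor, whether a one-unit increase is associated with higher or lower disease progression, holding the other predictors fixed.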
Best Subset Model
With the best subset method, we use the lowest BIC to select the best model. We can plot the best subset results and pinpoint the number of variables to select: the plot below shows that the lowest BIC value occurs at 6 variables. We use the BIC metric because it penalizes models with many variables; that is, the more variables a model includes, the larger the penalty. We can then review the coefficients, standard errors, t-values, and p-values for the best subset model with these six significant variables. The MSE for the 6-variable best subset model is 3090.483, a slight decrease from our linear regression model.
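A best subset search over the ten predictors can be sketched by exhaustively fitting all candidate models and keeping the one with the lowest BIC (a minimal sketch in Python with NumPy and scikit-learn; the Gaussian-likelihood form of BIC used here, n·log(RSS/n) + k·log(n), is one common convention, so the selected subset need not match the six variables reported above):

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import load_diabetes

data = load_diabetes()
X, y = data.data, data.target
n, p = X.shape

def bic_of_subset(cols):
    """Fit OLS on the given predictor columns (plus intercept) and return BIC."""
    Xs = np.column_stack([np.ones(n), X[:, list(cols)]])
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    rss = np.sum((y - Xs @ beta) ** 2)
    k = Xs.shape[1]  # number of parameters, including the intercept
    return n * np.log(rss / n) + k * np.log(n)

# Exhaustive search over all non-empty subsets of the 10 predictors (2^10 - 1 fits).
best_cols, best_bic = None, np.inf
for size in range(1, p + 1):
    for cols in combinations(range(p), size):
        b = bic_of_subset(cols)
        if b < best_bic:
            best_cols, best_bic = cols, b

print("Selected variables:", [data.feature_names[i] for i in best_cols])
```

Because BIC charges log(n) per parameter, the winning subset is typically smaller than the full 10-variable model, which is the behavior the plot of BIC against subset size visualizes.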