MULTIPLE LINEAR REGRESSION PROJECT
IE 5318 – 004, Spring 2020

Project Group Members
Balakrishnan Ramasubramanian – 1001705923
Padmanaban Baskaran – 1001767716
Nishanth Sudhakar Sonica – 1001743407

Instructor
Dr. Chen Kan

1. PROJECT PROPOSAL

Calorie intake is an essential factor in maintaining good health and well-being. The daily recommended caloric intake for an average American can range from 1600 to 3000 calories per day. It is very important that we focus on our daily calorie intake in order to maintain a well-balanced diet. The dataset explores the number of calories consumed by a person while eating junk food. The number of calories in each junk meal is determined by various factors such as the saturated fat, protein, sugar and carbohydrates (in grams) present in the meal. This project would help determine the factors that add to the calories, and the relationship between our response and predictor variables, in order to serve more nutritious food to consumers.

Software used for processing data: SAS Statistical Software and Minitab

Variable Description

Variable          Description
Y (Response)      Calories
X1 (Predictor)    Saturated Fat (grams)
X2 (Predictor)    Protein (grams)
X3 (Predictor)    Sugars (grams)
X4 (Predictor)    Carbohydrates (grams)

Table 1.1 Corresponding Response and Predictor variables taken

Data Collection Process

From the data source, there exists a strong relationship between the response and the predictor variables taken. Varying the amounts of the predictor variables produces corresponding changes in the response variable. The data source contains 126 observations giving the number of calories consumed per junk meal and the factors contributing to the rise in calories.

The source of data: https://www.statcrunch.com/app/index.php?dataid=2515365

Multiple Linear Regression (MLR)

In multiple linear regression, we consider 4 predictor variables denoted X1, X2, X3 and X4. In our project, X1 denotes saturated fat, X2 denotes protein, X3 denotes sugars and X4 denotes carbohydrates. Calories is the response variable for the model.
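As a rough illustration of what fitting such a model involves, the sketch below estimates least-squares coefficients by solving the normal equations in plain Python. It is not the procedure used in this project, which relied on SAS and Minitab. The function name `fit_ols` and the toy data (two predictors, five rows) are hypothetical and are not taken from the calories dataset.

```python
def fit_ols(X, y):
    """Return least-squares coefficients [b0, b1, ..., bp].

    X is a list of rows of predictor values and y is the list of responses.
    Solves the normal equations (X'X) b = X'y by Gauss-Jordan elimination.
    """
    n = len(X)
    rows = [[1.0] + [float(v) for v in row] for row in X]  # add intercept column
    p = len(rows[0])

    # Augmented matrix [X'X | X'y]
    A = [
        [sum(rows[k][i] * rows[k][j] for k in range(n)) for j in range(p)]
        + [sum(rows[k][i] * y[k] for k in range(n))]
        for i in range(p)
    ]

    # Gauss-Jordan elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]

    return [A[i][p] / A[i][i] for i in range(p)]


# Toy data (hypothetical): responses generated exactly as y = 1 + 2*x1 + 3*x2
X_toy = [[1, 2], [2, 1], [3, 5], [4, 3], [5, 8]]
y_toy = [1 + 2 * x1 + 3 * x2 for x1, x2 in X_toy]
coefficients = fit_ols(X_toy, y_toy)
```

Because the toy responses contain no noise, the recovered coefficients should match 1, 2 and 3 up to floating-point error. With real data the normal equations can be poorly conditioned, which is one reason statistical packages use more stable decompositions.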
Scatter Plot Matrix

The scatter plot matrix is used to examine the relationship between each predictor variable and the response variable.

• In the scatter plot for Calories vs Saturated Fat, the points form an upward trend with no curvature.
• In the scatter plot for Calories vs Protein, the points form a linear upward trend with a minor outlier along the trend.
• In the scatter plot for Calories vs Sugars, the trend is steeply upward, with a few scattered points.
• In the scatter plot for Calories vs Carbohydrates, the points form an upward trend and appear funnel-shaped.

Fig 1.1 – Scatter Plot Matrix

Discussion on Paired Correlation

Fig 1.2 – Correlation Matrix for variables

Correlations between Response Variable and Predictor Variables

The correlation between Calories (Y) and Saturated fat (X1) = 0.90185
The correlation between Calories (Y) and Protein (X2) = 0.81908
The correlation between Calories (Y) and Sugars (X3) = 0.24806
The correlation between Calories (Y) and Carbohydrates (X4) = 0.53133

Correlations between Predictor Variables

The correlation between Saturated fat (X1) and Protein (X2) = 0.70397
The correlation between Saturated fat (X1) and Sugars (X3) = 0.30512
The correlation between Saturated fat (X1) and Carbohydrates (X4) = 0.42458
The correlation between Protein (X2) and Sugars (X3) = -0.11842
The correlation between Protein (X2) and Carbohydrates (X4) = 0.12760
The correlation between Sugars (X3) and Carbohydrates (X4) = 0.81346
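The pairwise values listed above are Pearson correlation coefficients. A minimal Python sketch of how such a matrix can be computed is given below. The report's own figures came from statistical software, so this is only an illustration. The helper names `pearson_r` and `correlation_matrix` and the five-row toy columns are hypothetical and do not reproduce the dataset.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_matrix(columns):
    """Return {name_a: {name_b: r}} for every pair of named columns."""
    return {
        a: {b: pearson_r(columns[a], columns[b]) for b in columns}
        for a in columns
    }


# Toy columns (hypothetical values, not the project data)
toy = {
    "calories": [250, 540, 300, 670, 410],
    "sat_fat": [3, 10, 5, 14, 8],
    "sugars": [12, 9, 30, 15, 7],
}
matrix = correlation_matrix(toy)
```

Each diagonal entry is 1 and the matrix is symmetric, which is why the report lists each pair only once.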

I. PRELIMINARY MULTIPLE LINEAR REGRESSION MODEL ANALYSIS

Fig 2.1 – ANOVA and Parameter estimation for preliminary model

The general regression model is given by

\[ Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \beta_4 X_{i4} + \varepsilon_i \]

where
\(i\) = observation index (i = 1, 2, …, 126)
\(\varepsilon_i\) = random error
\(Y_i\) = Calories, which is our response variable
\(X_{i1}, X_{i2}, X_{i3}, X_{i4}\) = the predictors Saturated fat, Protein, Sugars and Carbohydrates
\(\beta_0\) = intercept
\(\beta_1, \beta_2, \beta_3, \beta_4\) = parameters for the predictors Saturated fat, Protein, Sugars and Carbohydrates

The fitted regression equation is given by

\[ \hat{Y}_i = b_0 + b_1 X_{i1} + b_2 X_{i2} + b_3 X_{i3} + b_4 X_{i4} \]

From the parameter estimates in Fig 2.1, the fitted regression equation is

\[ \widehat{\text{Calories}} = 5.21859 + 16.63468\,(\text{Saturated Fat}) + 5.88101\,(\text{Protein}) - 2.92718\,(\text{Sugars}) + 5.63309\,(\text{Carbohydrates}) \]

Model Assumptions

For data analysis, fulfilment of specific assumptions is essential. The assumptions for the Multiple Linear Regression Model are:
• The linear model is appropriate
• The residuals have constant variance
• The residuals are normally distributed
• The residuals are uncorrelated
• There are no outliers

Linear Model is Appropriate
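The residuals examined in this check are the observed calories minus the fitted values from the equation reported in Fig 2.1. The Python sketch below evaluates that equation and a residual for one meal. The coefficients are the ones reported in the project; the function names, the example meal and its observed value of 520 calories are hypothetical and do not come from the dataset.

```python
# Parameter estimates reported in Fig 2.1
B0 = 5.21859
B_SAT_FAT = 16.63468
B_PROTEIN = 5.88101
B_SUGARS = -2.92718
B_CARBS = 5.63309


def predict_calories(sat_fat, protein, sugars, carbs):
    """Fitted calories for a meal; all inputs are in grams."""
    return (B0 + B_SAT_FAT * sat_fat + B_PROTEIN * protein
            + B_SUGARS * sugars + B_CARBS * carbs)


def residual(observed_calories, sat_fat, protein, sugars, carbs):
    """Observed minus fitted calories."""
    return observed_calories - predict_calories(sat_fat, protein, sugars, carbs)


# Hypothetical meal: 10 g saturated fat, 20 g protein, 5 g sugars, 40 g carbs
fitted = predict_calories(10, 20, 5, 40)
print(round(fitted, 2))                           # 499.87
print(round(residual(520, 10, 20, 5, 40), 2))     # 20.13
```

Plotting such residuals against each predictor in turn is what reveals curvature if the linear form is inadequate.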

Fig 2.2 – Residuals versus predictor variables plot preliminary model

To assess whether the linear model is appropriate, we examined the residuals versus predictor variables plots shown above. Each predictor is plotted against the residuals independently. In Fig 2.2, no curvature is visible in any of the residuals versus predictor variable plots. Hence, the linear model form is appropriate.

Residuals have Constant Variance

Fig 2.3 – Residuals versus predictor value plot

Since no funnel shape appears in the residuals versus predictor value plot displayed above, we can conclude that the model has constant variance. This can also be tested using the modified Levene test.

Modified Levene Test

Fig 2.4 – Modified Levene test from Minitab

F-Test

Hypothesis
H0: \(\sigma_1^2 = \sigma_2^2\), i.e. the variances of the two groups are equal
H1: \(\sigma_1^2 \neq \sigma_2^2\), i.e. the variances of the two groups are unequal

Decision Rule: Reject H0 if the p-value is less than the level of significance (p < α).

Test Statistic: From the table above, p-value = 0.0976. Assume level of significance α = 0.05 and n = 126. Hence p > α, and we fail to reject H0. We can conclude that the variance of the model is constant.

Two-Sided T-Test

H0: Means of the 2 groups are equal (constant variance)
H1: Means of the 2 groups are not equal (non-constant variance)

Decision Rule: Reject H0 if p < α; otherwise, fail to reject H0.

Test Statistic: Equal variance was concluded from the first result. Considering the p-value reported for the Satterthwaite (unequal variance) method, p-value = 0.3778. Hence p > α. As we fail to reject H0, the variance of the model is constant.

The Residuals are Normally Distributed

The error term is assumed to follow a normal distribution. This assumption is checked using a Normal Probability Plot, in which the residuals are plotted against the quantiles. From the Normal Probability Plot displayed below, it is clearly seen that the plot in Fig 2.4 follows a practically straight line. Hence normality is acceptable. A normality test is conducted to confirm the result.

Normality Test

H0: Normality is satisfied
H1: Normality is violated
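Returning to the constant-variance check, the two-group comparison behind a modified Levene test can be sketched in Python as follows. It assumes the common textbook form of the test: the residuals are split into two groups, each residual is replaced by its absolute deviation from its group median, and a pooled two-sample t statistic is computed on those deviations. The report's own figures came from statistical software and quote a Satterthwaite p-value, so this pooled version is an illustrative stand-in. The function name and the toy residuals are hypothetical.

```python
from math import sqrt
from statistics import mean, median


def modified_levene_t(group1, group2):
    """Pooled t statistic on absolute deviations from each group's median."""
    med1, med2 = median(group1), median(group2)
    d1 = [abs(e - med1) for e in group1]
    d2 = [abs(e - med2) for e in group2]
    n1, n2 = len(d1), len(d2)
    d1_bar, d2_bar = mean(d1), mean(d2)

    # Pooled variance of the absolute deviations
    ss = (sum((d - d1_bar) ** 2 for d in d1)
          + sum((d - d2_bar) ** 2 for d in d2))
    s_pooled = sqrt(ss / (n1 + n2 - 2))

    return (d1_bar - d2_bar) / (s_pooled * sqrt(1 / n1 + 1 / n2))


# Toy residual groups (hypothetical): the second group has a wider spread
low_group = [-2.0, -1.0, 0.0, 1.0, 2.0]
high_group = [-4.0, -2.0, 0.0, 2.0, 4.0]
t_star = modified_levene_t(low_group, high_group)
```

The statistic would then be compared with a t distribution on n1 + n2 - 2 degrees of freedom. A value near zero is consistent with constant variance, while a large absolute value suggests the spread differs between the groups.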
