MULTIPLE LINEAR REGRESSION PROJECT
IE 5318 – 004
Spring 2020
Project Group Members
Balakrishnan Ramasubramanian – 1001705923
Padmanaban Baskaran – 1001767716
Nishanth Sudhakar Sonica – 1001743407
Instructor
Dr. Chen Kan
1. PROJECT PROPOSAL
Caloric intake is an essential factor in maintaining good health and well-being.
The daily recommended caloric intake for an average American ranges from 1600 to 3000 calories per
day. It is very important that we monitor our daily calorie intake in order to maintain a well-balanced
diet. The dataset explores the number of calories consumed by a person when eating junk food.
The number of calories in each junk meal is determined by factors such as the saturated fat,
protein, sugar and carbohydrates (in grams) present in the meal. This project helps determine
which factors add to the calories, and the relationship between our response and predictor
variables, so that more nutritious food can be served to consumers.
Software used for processing data:
SAS Statistical Software and Minitab
Variable Description
Variables         Description
Y (Response)      Calories
X1 (Predictor)    Saturated Fat (grams)
X2 (Predictor)    Protein (grams)
X3 (Predictor)    Sugars (grams)
X4 (Predictor)    Carbohydrates (grams)
Table 1.1 Corresponding response and predictor variables
Data Collection Process
The data source suggests a strong relationship between the response and the predictor variables:
varying the amounts of the predictor variables produces corresponding changes in the response
variable. The data contain 126 observations giving the number of calories per junk meal and the
factors contributing to the calorie count.
Source of data: https://www.statcrunch.com/app/index.php?dataid=2515365
Multiple Linear Regression (MLR)
In multiple linear regression, we consider 4 predictor variables, denoted X1, X2, X3 and X4. In our
project, X1 denotes saturated fat, X2 denotes protein, X3 denotes sugars and X4 denotes carbohydrates.
Calories is the response variable for the model.
Scatter Plot Matrix
The scatter plot matrix is used to examine the relationship between each predictor and the response variable.
In the scatter plot for Calories vs Saturated fat, the points form an upward trend
with no curvature.
In the scatter plot for Calories vs Protein, the points form a linear upward trend
with a few outliers along the trend.
In the scatter plot for Calories vs Sugar, the trend is steeply upward, with only a few scattered
points.
In the scatter plot for Calories vs Carbs, the points form an upward trend and
appear funnel-shaped.
Fig 1.1 – Scatter Plot Matrix
Discussion on Paired Correlation
Fig 1.2 – Correlation Matrix for variables
Correlations between Response Variable and Predictor Variables
The correlation between Calories (Y) vs Saturated fat (X1) = 0.90185
The correlation between Calories (Y) vs Protein (X2) = 0.81908
The correlation between Calories (Y) vs Sugars (X3) = 0.24806
The correlation between Calories (Y) vs Carbohydrates (X4) = 0.53133
Correlations between Predictor Variables
The correlation between Saturated fat (X1) vs Protein (X2) = 0.70397
The correlation between Saturated fat (X1) vs Sugars (X3) = 0.30512
The correlation between Saturated fat (X1) vs Carbohydrates (X4) = 0.42458
The correlation between Protein (X2) vs Sugars (X3) = -0.11842
The correlation between Protein (X2) vs Carbohydrates (X4) = 0.12760
The correlation between Sugars (X3) vs Carbohydrates (X4) = 0.81346
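The pairwise correlations listed above are Pearson correlation coefficients. As a minimal sketch of how one such value could be computed outside SAS or Minitab, the Python function below implements the standard formula. The function name `pearson_r` and the sample values are hypothetical and chosen for illustration only; they are not rows from the project dataset.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of cross-products and squared deviations about the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values for illustration only, NOT taken from the project data.
saturated_fat = [2.0, 5.0, 8.0, 12.0, 15.0]
calories = [150.0, 260.0, 340.0, 520.0, 610.0]
r = pearson_r(saturated_fat, calories)
print(f"r(Calories, Saturated fat) = {r:.5f}")
```

Applying the same function to every pair of columns would reproduce a correlation matrix of the kind shown in Fig 1.2.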
I. PRELIMINARY MULTIPLE LINEAR REGRESSION MODEL ANALYSIS
Fig 2.1 – ANOVA and Parameter estimation for preliminary model
The general regression model is given by
$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \beta_4 X_{i4} + \varepsilon_i$$
Where,
$i$ = index of the observation ($i = 1, 2, \ldots, 126$)
$\varepsilon_i$ = random error
$Y_i$ = Calories, which is our response variable
$X_{i1}$, $X_{i2}$, $X_{i3}$ and $X_{i4}$ = the predictors Saturated fat, Protein, Sugars and Carbohydrates
$\beta_0$ = intercept (Calories)
$\beta_1$, $\beta_2$, $\beta_3$ and $\beta_4$ = parameters (coefficients) of the predictors Saturated fat, Protein, Sugars and Carbohydrates
The fitted regression equation is given by
$$\hat{Y}_i = b_0 + b_1 X_{i1} + b_2 X_{i2} + b_3 X_{i3} + b_4 X_{i4}$$
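The estimates b0, ..., b4 in a fitted equation of this form are ordinarily obtained by least squares. The project itself used SAS for this step; the sketch below is only an illustration of the idea in plain Python, solving the normal equations (X'X)b = X'y by Gauss-Jordan elimination. The function name `fit_ols` and the small two-predictor dataset are hypothetical, not the project data.

```python
def fit_ols(rows, y):
    """Least-squares estimates [b0, b1, ..., bp] for y = b0 + b1*x1 + ... + bp*xp.

    rows: one tuple of predictor values per observation; y: list of responses.
    """
    X = [[1.0] + [float(v) for v in row] for row in rows]  # prepend intercept column
    k = len(X[0])
    # Build the augmented matrix [X'X | X'y]
    A = []
    for i in range(k):
        xtx_row = [sum(r[i] * r[j] for r in X) for j in range(k)]
        xty = sum(r[i] * yi for r, yi in zip(X, y))
        A.append(xtx_row + [xty])
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot_row = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot_row][col]) < 1e-12:
            raise ValueError("X'X is singular; predictors are linearly dependent")
        A[col], A[pivot_row] = A[pivot_row], A[col]
        pivot = A[col][col]
        A[col] = [v / pivot for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]

# Hypothetical, noise-free data generated from y = 1 + 2*x1 + 3*x2.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
b = fit_ols(rows, y)
print([round(v, 6) for v in b])
```

With four predictor columns per row, the same function would return five estimates, matching the form of the fitted equation above.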
From Fig 2.1, the fitted regression equation based on the parameter estimates is given by
$$\widehat{\text{Calories}} = 5.21859 + 16.63468\,\text{Saturated Fat} + 5.88101\,\text{Protein} - 2.92718\,\text{Sugars} + 5.63309\,\text{Carbohydrates}$$
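To show how the fitted equation is used, the short Python sketch below evaluates it for a single meal. The coefficients are the parameter estimates reported above; the function name `predict_calories` and the example meal (10 g saturated fat, 20 g protein, 5 g sugars, 40 g carbohydrates) are hypothetical and not taken from the dataset.

```python
# Parameter estimates reported for the preliminary model (Fig 2.1)
B0 = 5.21859        # intercept
B_SAT_FAT = 16.63468
B_PROTEIN = 5.88101
B_SUGARS = -2.92718
B_CARBS = 5.63309

def predict_calories(sat_fat, protein, sugars, carbs):
    """Predicted calories for one meal; all inputs are in grams."""
    return (B0
            + B_SAT_FAT * sat_fat
            + B_PROTEIN * protein
            + B_SUGARS * sugars
            + B_CARBS * carbs)

# Hypothetical meal, for illustration only.
estimate = predict_calories(sat_fat=10, protein=20, sugars=5, carbs=40)
print(f"Predicted calories: {estimate:.2f}")
```

Each coefficient is the estimated change in calories for a one-gram increase in that predictor while the other three are held fixed.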
Model Assumptions
For the data analysis to be valid, specific assumptions must be fulfilled. The assumptions
of the Multiple Linear Regression Model are: the linear model is appropriate, the residuals have
constant variance, the residuals are normally distributed, the residuals are uncorrelated, and
there are no outliers.
Linear Model is Appropriate
Fig 2.2 – Residuals versus predictor variables plot preliminary model
To check that the linear model is appropriate, we examined the residuals vs predictor variables
plots shown above, in which the residuals are plotted against each predictor separately.
In Fig 2.2, none of the plots shows curvature.
Hence, the linear model form is appropriate.
Residuals have Constant Variance
Fig 2.3 - Residuals versus predicted value plot
Since no funnel shape appears in the residuals versus predicted value plot above,
we can conclude that the model has constant variance. This can also be tested using the modified
Levene test.
Modified Levene Test
Fig 2.4 – Modified Levene test from Minitab
F-Test Hypothesis
H0: σ1² = σ2², i.e. the variances of the two groups are equal
H1: σ1² ≠ σ2², i.e. the variances of the two groups are unequal
Decision Rule: Reject H0 if the p-value is less than the level of significance (p < α)
Test Statistics:
From the table above, p-value = 0.0976
Assume level of significance α = 0.05 and n = 126. Hence p > α.
Hence, we fail to reject H0. We can conclude that the variances of the two groups are equal, i.e. the variance of the model is constant.
Two-Sided T-Test
H0: Means of the 2 groups are equal (constant variance)
H1: Means of the 2 groups are not equal (non-constant variance)
Decision Rule: Reject H0 if p < α.
Test Statistics:
The F-test above concluded equal variances. The p-value reported here is the
Satterthwaite (unequal variance) p-value = 0.3778.
Since p > α, we fail to reject H0, so the variance of the model is constant.
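As a rough sketch of the calculation behind the modified Levene test, the Python function below takes the residuals already split into two groups (commonly by low and high values of the predicted response, which is an assumption here), computes each residual's absolute deviation from its group median, and returns the pooled two-sample t statistic on those deviations. Note that this is the pooled (equal variance) form, not the Satterthwaite form quoted above. The function name `modified_levene_t` and the residual values are hypothetical; the p-value would still need to come from a t table or statistical software.

```python
import math
import statistics

def modified_levene_t(group1, group2):
    """Pooled two-sample t statistic on absolute deviations from group medians.

    Returns (t_star, degrees_of_freedom).
    """
    med1 = statistics.median(group1)
    med2 = statistics.median(group2)
    d1 = [abs(e - med1) for e in group1]  # absolute deviations, group 1
    d2 = [abs(e - med2) for e in group2]  # absolute deviations, group 2
    n1, n2 = len(d1), len(d2)
    mean1 = sum(d1) / n1
    mean2 = sum(d2) / n2
    # Pooled variance of the absolute deviations
    pooled_var = (sum((d - mean1) ** 2 for d in d1)
                  + sum((d - mean2) ** 2 for d in d2)) / (n1 + n2 - 2)
    t_star = (mean1 - mean2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t_star, n1 + n2 - 2

# Hypothetical residuals for illustration only, NOT the project residuals.
low_group = [-1.0, 0.5, 1.2, -0.7, 0.3]
high_group = [-2.5, 3.1, 0.4, -1.8, 2.2]
t_star, df = modified_levene_t(low_group, high_group)
print(f"t* = {t_star:.4f} with {df} degrees of freedom")
```

A value of |t*| that is small relative to the t critical value for the given degrees of freedom is consistent with constant variance.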
The Residuals are Normally Distributed
The error terms are assumed to be normally distributed. This assumption is checked using a normal
probability plot, in which the residuals are plotted against the normal quantiles. In the normal
probability plot displayed below (Fig 2.4), the points follow a practically straight line. Hence,
normality is satisfied. A normality test is conducted to confirm the result.
Normality Test
H0: Normality is satisfied
H1: Normality is violated
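The preview does not show which normality test was applied. One common choice, assumed here purely for illustration, is the correlation between the ordered residuals and their expected normal scores, which is the numerical counterpart of the normal probability plot. The sketch below uses Blom's approximation for the expected scores; the function name `normal_scores_correlation` and the residual values are hypothetical.

```python
import math
from statistics import NormalDist

def normal_scores_correlation(residuals):
    """Correlation between ordered residuals and their expected normal scores."""
    n = len(residuals)
    ordered = sorted(residuals)
    # Blom's approximation to the expected k-th smallest standard normal value
    scores = [NormalDist().inv_cdf((k - 0.375) / (n + 0.25))
              for k in range(1, n + 1)]
    mean_e = sum(ordered) / n
    mean_z = sum(scores) / n
    sez = sum((e - mean_e) * (z - mean_z) for e, z in zip(ordered, scores))
    see = sum((e - mean_e) ** 2 for e in ordered)
    szz = sum((z - mean_z) ** 2 for z in scores)
    return sez / math.sqrt(see * szz)

# Hypothetical residuals for illustration only, NOT the project residuals.
residuals = [-3.1, -1.2, -0.4, 0.3, 0.9, 1.5, 2.0]
r_normal = normal_scores_correlation(residuals)
print(f"Correlation with normal scores: {r_normal:.4f}")
```

A correlation close to 1 supports H0; in practice the value is compared with a tabulated critical value for the chosen α and n, which is not reproduced here.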