Quiz 2 Notes (2)

School: Boston University
Course: BA222
Subject: Statistics
Date: Jan 9, 2024


Uploaded by CountCrown15475

Introduction to Regression Analysis

Linear Equation
Y = β0 + β1 x
- Y = Dependent Variable
- X = Independent Variable
- β0 = Intercept
- β1 = Slope

Interpretation
- β0 (intercept) is the average value of y when x = 0
- β1 (slope) is the average change in y when x increases by one unit

Linear Regressions → acknowledge the fact that the relation between x and y is statistical and not exact
- Error = all other factors related to y
- The new term is called the error term (ε)
- When we add it to the linear equation we call it a linear regression model: Y = β0 + β1 x + ε

Estimation - using the Pizza Sales csv
Going to estimate the values of β0 and β1 using the Ordinary Least Squares (OLS) methodology
- The parameters estimated using OLS minimise the mean square error (MSE)

Step 1:
- Load package and data

- Code:
    import statsmodels.formula.api as smf
- pz is the DataFrame that holds the Pizza Sales data

Step 2:
- Specify the regression model
- Use smf.ols()
- Code:
    model = smf.ols('pie_sales ~ price', data = pz)
- The variable on the left of ~ is the dependent variable (y)
- The variable on the right of ~ is the independent variable (x)

Step 3:
- Estimate the beta coefficients
- Use .fit()
- Code:
    regResults = model.fit()

Step 4:
- Getting the beta coefficients
- Can see the estimated beta coefficients using the .params attribute on the results of .fit()
- Code:
    betas = regResults.params
    betas

To combine Steps 2 and 3:
- Code:
    regResults = smf.ols('pie_sales ~ price', data = pz).fit()

Note:
- Column names made of two words keep the _ in the formula (e.g. pie_sales)

Fitted Values:
The fitted values (predicted values) ŷ are the values of y that are expected given the estimated beta coefficients (β0*, β1*) and a given value x*:
ŷ = β0* + β1* x*
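For a single x variable, the OLS estimates have a closed form, so the estimation steps and the fitted-value formula above can be cross-checked by hand. The sketch below is a minimal illustration in plain Python, not the course's method: the price and sales numbers are invented (they are not from the Pizza Sales csv), and the helper names ols_fit and predict are hypothetical.

```python
# Hypothetical price/sales pairs, invented for illustration only
# (not taken from the Pizza Sales csv).
prices = [5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
sales = [52.0, 47.0, 45.0, 39.0, 36.0, 30.0]


def ols_fit(x, y):
    """Return (b0, b1) that minimise the mean squared error of y = b0 + b1*x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: makes the line pass through the point of means.
    b0 = y_mean - b1 * x_mean
    return b0, b1


def predict(b0, b1, x):
    """Fitted values: y_hat = b0 + b1 * x for each x."""
    return [b0 + b1 * xi for xi in x]


def mean_squared_error(y, y_hat):
    """Average of the squared residuals (actual minus fitted)."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat)) / len(y)


b0, b1 = ols_fit(prices, sales)
fitted = predict(b0, b1, prices)
residuals = [yi - fi for yi, fi in zip(sales, fitted)]
mse = mean_squared_error(sales, fitted)

# Any other line should have a larger MSE; here the slope is nudged by 0.1.
mse_nudged = mean_squared_error(sales, predict(b0, b1 + 0.1, prices))

print("intercept:", round(b0, 3), "slope:", round(b1, 3))
print("MSE at OLS estimates:", round(mse, 3))
print("MSE with nudged slope:", round(mse_nudged, 3))
```

The slope comes out negative for this made-up data, which reads as "sales fall on average when price rises by one unit". Comparing mse with mse_nudged is a quick way to see what "minimise the MSE" means in practice.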

Getting the fitted values on Python (for the pizza data)
- Code:
    pz['sales_pred'] = regResults.fittedvalues
    pz[['pie_sales', 'sales_pred']]

Getting fitted values using custom x values:
- First, need to specify the values that you want in a DataFrame using the independent variable names
- Code:
    import pandas as pd
    x = [5, 6, 7, 8, 9, 10, 11, 12, 13, 15]
    customData = pd.DataFrame(data = {'price':x})
    customData['pred_sales'] = regResults.predict(customData)
    customData

To find the price level that maximises revenues:
- Revenues are sales times price
- Code:
    customData['revenue'] = customData['price'] * customData['pred_sales']
    customData.loc[customData['revenue'].idxmax()]  # row with the highest predicted revenue

Visualizing Results:
- Code:
    import matplotlib.pyplot as plt
    plt.scatter(x = customData.price, y = customData.revenue)
    plt.show()

Evaluating Regression Models
1. Do the values of the beta coefficients make sense?
2. How well does the model fit the data?
3. Are the estimated coefficients statistically different from zero?

Goodness of Fit - how well the model fits the data → how close the fitted values are to the actual values (how small the residuals are)

Visualising the Results
- Import the seaborn package
- Code:
    import seaborn as sb

- Use the regplot() function from the seaborn package (sb) and specify the x and y variables in the model:
- Code:
    sb.regplot(y = pz.pie_sales, x = pz.price)
    plt.show()

When visualising a linear fit you want to focus on:
- Sign
- Dispersion around the regression line
- Linearity
- Outliers

Methods to measure the Goodness of Fit:
1. R-Squared
2.

R-Squared → The goodness of fit of a univariate regression model can be represented using the R-Squared coefficient.
- Bounded by 0 and 1, so it will be a decimal somewhere in between the two
- Values closer to zero indicate a bad fit
- Values close to one indicate a good fit
- Assumes a linear relation
- Similar to correlation (in a univariate regression it equals the squared correlation between x and y)
- Code:
    residuals = regResults.resid
    SSE = (residuals ** 2).sum()  # sum of squared residuals
    sqMeanDeviation = (pz.pie_sales - pz.pie_sales.mean()) ** 2
    SST = sqMeanDeviation.sum()  # total sum of squares
    rSquared = 1 - SSE/SST
    rSquared

Interpretation of the R-Square
- R2 = 0.87

- An R2 of 0.87 means about 87% of the variation in the y variable can be explained with the regression model
- It is close to 1 so it is considered a good fit
- When choosing between two or more candidate x variables:
    - Choose the variable which has the highest R-squared (closest to 1)
    - This is the one that is closest to the desired value (1) so it will be the best fit
- Getting the R-Squared on Python
- Code:
    regResults.rsquared

Sampling Variation
Sampling: → a sample is a subset of the population
Statistical Inference → using a sample to learn about a population
- A sample is representative if each member of the population has the same probability of being selected in the sample
- A non-representative sample will be biased towards certain values
- This will distort the summary statistics calculated from it.
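The difference between a representative and a non-representative sample can be seen with a small simulation. This is only a sketch under stated assumptions: the population is a made-up list of the numbers 1 to 1000, the sample size of 100 is arbitrary, and the "biased" sample is built by deliberately keeping only the largest values.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1,000 made-up values (simply 1..1000).
population = list(range(1, 1001))
pop_mean = statistics.mean(population)

# Representative sample: every member has the same chance of being selected.
representative = random.sample(population, 100)
rep_mean = statistics.mean(representative)

# Non-representative sample: only the 100 largest values can be selected.
biased = sorted(population)[-100:]
biased_mean = statistics.mean(biased)

# Sampling variation: repeated representative samples give different means.
repeated_means = [
    statistics.mean(random.sample(population, 100)) for _ in range(5)
]

print("population mean:", pop_mean)
print("representative sample mean:", rep_mean)
print("biased sample mean:", biased_mean)
print("means of 5 repeated samples:", repeated_means)
```

The representative sample mean lands near the population mean but not exactly on it, and it changes from sample to sample. The biased sample mean sits far above the population mean, which is the distortion of summary statistics described above.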
