SA4_Econ140_Solution.pdf
School
University of California, Berkeley
Course
140
Subject
Economics
Date
Feb 20, 2024
Type
Pages
9
Uploaded by LieutenantWorld12843
Section Assignment - Econ 140 Spring 2023
WEEK 4: OLS Regression, Functional Forms, Bias in Regression
Exercises
Question 1: Omitted variable bias
Consider the following long regression:
$$y_i = \beta_0 + \beta_1 x_i + \beta_2 z_i + u_i$$
The following short regression:
$$y_i = \alpha_0 + \alpha_1 x_i + \epsilon_i$$
And the auxiliary regression:
$$z_i = \gamma_0 + \gamma_1 x_i + e_i$$
Prove that $\alpha_1 = \beta_1 + \beta_2 \gamma_1$ and interpret.
Solution:
Let's step back for a second and think about why we might be interested in estimating $\beta_1$. (To follow what comes next, it might help to imagine you are working with observational data on wages and education.)
Often we are trying to identify causal relations. We think of causality in ceteris paribus terms: what would happen to $y$ if we altered $x$ but kept everything else constant? When we include the other variables as controls in the regression, we can interpret the coefficient on $x$ as the relation between $y$ and $x$ holding the controls fixed, which is what we wanted.
But if we don't include them in the regression, we can't make that interpretation in general. What would the coefficient capture in that case?
Substitute the expression for $z_i$ from the auxiliary regression into the long regression:
$$\begin{aligned}
y_i &= \beta_0 + \beta_1 x_i + \beta_2 (\gamma_0 + \gamma_1 x_i + e_i) + u_i \\
    &= \beta_0 + \beta_1 x_i + \beta_2 \gamma_0 + \beta_2 \gamma_1 x_i + \beta_2 e_i + u_i \\
    &= (\beta_0 + \beta_2 \gamma_0) + (\beta_1 + \beta_2 \gamma_1) x_i + (\beta_2 e_i + u_i)
\end{aligned}$$
The composite error $\beta_2 e_i + u_i$ is uncorrelated with $x_i$, because $e_i$ and $u_i$ are each uncorrelated with $x_i$ by construction of the auxiliary and long regressions. The last line is therefore the short regression, so the coefficient on $x_i$ in the short regression ($\alpha_1$) is $\beta_1 + \beta_2 \gamma_1$. We call $\beta_2 \gamma_1$ the omitted variable bias (OVB).
$\alpha_1$ captures the “effect” of $x$ on $y$ (to be precise: this is not really a causal effect unless we make additional assumptions; rather, it is the statistical association) but also the “effect” of $z$ on $y$, scaled by the association between $z$ and $x$. This comes from the fact that when we observe higher values of $x$, the values of $z$ might be moving too (e.g., education usually correlates positively with parental wealth, so if we pick someone with more education it is likely that their parents were also wealthier). If we do not control for $z$, we will not be able to interpret the coefficient on $x$ ($\alpha_1$) in a causal way (e.g., do the higher wages of the person with more education come from the additional education or from the fact that they had richer parents?).
A way of thinking about this is that when we are trying to identify causality, we are thinking about the coefficient on $x$ from a regression that includes all the relevant predetermined variables (the “longest” regression, if I may).
When would it be non-problematic to omit variables – when is there no OVB? There are two cases:
• $\beta_2 = 0$: the omitted variable was not really relevant to begin with.
• $\gamma_1 = 0$: $x$ is uncorrelated with the omitted variable.
(There's also a third case: when the omitted variable is not predetermined and may itself be an outcome of $x$. These are usually called bad controls, mediators or mechanisms, but more on this later. Maybe.)
This is why, when trying to identify causality, we always have to ask whether there is anything relevant in the error term (and remember, anything not explicitly included in the regression is in the error term) that is correlated with one of the included variables.
This also shows why RCTs are great: if I assign $x$ randomly, then it is expected to be uncorrelated with all the predetermined variables!
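The omitted variable bias formula can also be checked numerically. The following is a minimal sketch on simulated data, not part of the original solution: the coefficient values (1.5, 1.0, 0.8), the sample size, and the helper names `cov`, `slope` and `long_regression` are all assumptions made for illustration. It estimates the short, long and auxiliary regressions and compares the short-regression slope with $\beta_1 + \beta_2 \gamma_1$.

```python
import random

def mean(values):
    return sum(values) / len(values)

def cov(a, b):
    """Sample covariance of two equal-length lists (n - 1 denominator)."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / (len(a) - 1)

def slope(x, y):
    """Bivariate OLS slope of y on x: Cov(x, y) / Var(x)."""
    return cov(x, y) / cov(x, x)

def long_regression(y, x, z):
    """OLS slopes of y on a constant, x and z, from the 2x2 normal equations."""
    sxx, szz, sxz = cov(x, x), cov(z, z), cov(x, z)
    sxy, szy = cov(x, y), cov(z, y)
    det = sxx * szz - sxz ** 2
    b_x = (szz * sxy - sxz * szy) / det
    b_z = (sxx * szy - sxz * sxy) / det
    return b_x, b_z

random.seed(1)  # fixed seed so the illustration is reproducible
n = 5_000

# Hypothetical data-generating process (all numbers are made up):
# z depends on x (gamma_1 = 0.8); y depends on x and z (beta_1 = 1.5, beta_2 = 1.0).
x = [random.gauss(0, 1) for _ in range(n)]
z = [1.0 + 0.8 * xi + random.gauss(0, 1) for xi in x]
y = [2.0 + 1.5 * xi + 1.0 * zi + random.gauss(0, 1) for xi, zi in zip(x, z)]

alpha_1 = slope(x, y)                       # short regression
gamma_1 = slope(x, z)                       # auxiliary regression
beta_1, beta_2 = long_regression(y, x, z)   # long regression

print("short-regression slope:", alpha_1)
print("beta_1 + beta_2 * gamma_1:", beta_1 + beta_2 * gamma_1)
```

With OLS estimates the two printed numbers agree up to floating-point error, because the decomposition holds exactly in the sample; both should be close to 1.5 + 1.0 × 0.8 = 2.3 in this simulated setup.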
Question 2: OLS and Measurement error
Recall that the OLS estimator in a bivariate regression of $Y$ on $X$ is equal to $\frac{\mathrm{Cov}(x_i, y_i)}{\mathrm{Var}(x_i)}$. Also note that $\mathrm{Cov}(a + bX, Y) = \mathrm{Cov}(a, Y) + \mathrm{Cov}(bX, Y) = b\,\mathrm{Cov}(X, Y)$. Recall that $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)$.
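As a quick sanity check, the two identities and the covariance-over-variance formula can be verified on simulated data. This is only an illustrative sketch: the data-generating numbers, the constants `a` and `b`, and the helper names are assumptions, not part of the assignment.

```python
import random

def mean(values):
    return sum(values) / len(values)

def cov(a, b):
    """Sample covariance of two equal-length lists (n - 1 denominator)."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / (len(a) - 1)

def var(a):
    """Sample variance, written as Cov(a, a)."""
    return cov(a, a)

random.seed(0)  # fixed seed so the illustration is reproducible
n = 2_000
X = [random.gauss(0, 1) for _ in range(n)]
Y = [0.5 * x + random.gauss(0, 1) for x in X]  # made-up relationship, slope 0.5

a, b = 3.0, 2.0  # arbitrary constants

cov_lhs = cov([a + b * x for x in X], Y)      # Cov(a + bX, Y)
cov_rhs = b * cov(X, Y)                       # b Cov(X, Y)

var_lhs = var([x + y for x, y in zip(X, Y)])  # Var(X + Y)
var_rhs = var(X) + var(Y) + 2 * cov(X, Y)     # Var(X) + Var(Y) + 2 Cov(X, Y)

ols_slope = cov(X, Y) / var(X)                # covariance-over-variance formula

print(cov_lhs, cov_rhs)
print(var_lhs, var_rhs)
print(ols_slope)
```

The first two pairs match up to floating-point error, since both identities hold exactly for sample moments as well; the slope should land near the assumed value of 0.5.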
You want to estimate the effect of life satisfaction $L^*$ on life expectancy $Y$, and you believe that the two variables are related as follows:
$$Y_i = \beta_0 + \beta_1 L^*_i + e_i \tag{1}$$
You manage to find out the life expectancy of a sample of 1,000 individuals. Unfortunately, you cannot observe their life satisfaction $L^*$, and so you run a survey, ask them how satisfied they are with their life, and record their answer as $L$. As it turns out, people are present-biased: when asked about their life satisfaction, they are influenced by random events that happened on that day – maybe they just learned something cool in their econometrics class (making them report higher life satisfaction), or their favorite sports team just lost (making them report lower life satisfaction).
Therefore, you think that the reported life satisfaction $L$ is equal to:
$$L_i = L^*_i + v_i \tag{2}$$
where $v_i$ is a random error term that is fully independent of $L^*_i$ and $Y_i$.
You think of running the following regression specification to estimate your model:
$$Y_i = \alpha_0 + \alpha_1 L_i + u_i \tag{3}$$
a) Can you think of other reasons why a variable may be mismeasured in the data?
Solution:
There are many potential reasons for measurement error. We generally classify them into random measurement error and non-random (or systematic) measurement error.
Examples of random measurement error are: physical constraints (e.g., a thermometer will never be 100% accurate), rounding (people do not report their precise salaries, but a round number), random noise (for some census data, the census bureau has started adding random numbers to preserve anonymity), and random errors (when I ask people about their SATs, some will just get it wrong, but on average people will report the correct number).
Non-random measurement error is also very common and causes bigger problems. Examples of this are: people systematically misreporting (for example, rich people are less likely to truthfully report their wealth, and autocratic countries systematically over-report their growth estimates), measurement difficulties (GDP in poorer countries is less precisely estimated than in richer countries), and many more.
b) Will you (on average) get the effect you want – $\beta_1$ – if you run this regression?
Hints: Use the covariance-over-variance formula for the OLS estimator. Plug in what you know about $L_i$ and $L^*_i$ from equation (2). Your final expression should be related to the OLS estimator for equation (1). You can use the fact that covariance is a linear operator and $\mathrm{Var}(A + B) = \mathrm{Var}(A) + \mathrm{Var}(B) + 2\,\mathrm{Cov}(A, B)$.
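Solution:
No, you will not (on average) get the effect you want, $\beta_1$, if you run this regression. The OLS estimator for $\alpha_1$ will be equal to:
$$\begin{aligned}
\hat{\alpha}_1 &= \frac{\mathrm{Cov}(Y_i, L_i)}{\mathrm{Var}(L_i)} = \frac{\mathrm{Cov}(Y_i, L^*_i + v_i)}{\mathrm{Var}(L^*_i + v_i)} \\
&= \frac{\mathrm{Cov}(Y_i, L^*_i) + \mathrm{Cov}(Y_i, v_i)}{\mathrm{Var}(L^*_i) + \mathrm{Var}(v_i) + 2\,\mathrm{Cov}(L^*_i, v_i)} && \text{(linearity of covariance)} \\
&= \frac{\mathrm{Cov}(Y_i, L^*_i)}{\mathrm{Var}(L^*_i) + \underbrace{\mathrm{Var}(v_i)}_{\geq 0}}
\end{aligned}$$
The last step uses $\mathrm{Cov}(Y_i, v_i) = 0$ and $\mathrm{Cov}(L^*_i, v_i) = 0$, which hold because $v_i$ is independent of $Y_i$ and $L^*_i$. Hence,
$$|\hat{\alpha}_1| \leq |\hat{\beta}_1| = \left| \frac{\mathrm{Cov}(Y_i, L^*_i)}{\mathrm{Var}(L^*_i)} \right|$$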
c) What does this tell you about the effect of measurement error on your regression?
Solution:
We see that measurement error leads to a systematic problem in the regression. Whenever we have measurement error ($\mathrm{Var}(v_i) > 0$), the estimated coefficient from the regression is closer to zero than the true coefficient. We call this attenuation bias. When we have a regression with this type of measurement error (random measurement error in the independent variable), we know that, on average, the true coefficient is at least as large in absolute value as the one we estimated.
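Attenuation bias is easy to see in a simulation. The sketch below is illustrative only: the true slope of 2, the intercept of 70, the unit variances and the larger sample size are assumptions chosen for convenience, not values from the exercise. It regresses $Y$ once on the true $L^*$ and once on the noisy report $L$, and compares the ratio of the slopes with $\mathrm{Var}(L^*) / (\mathrm{Var}(L^*) + \mathrm{Var}(v))$.

```python
import random

def mean(values):
    return sum(values) / len(values)

def cov(a, b):
    """Sample covariance of two equal-length lists (n - 1 denominator)."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / (len(a) - 1)

def slope(x, y):
    """Bivariate OLS slope of y on x: Cov(x, y) / Var(x)."""
    return cov(x, y) / cov(x, x)

random.seed(2)  # fixed seed so the illustration is reproducible
n = 5_000
beta_1 = 2.0    # assumed true effect of life satisfaction

L_star = [random.gauss(0, 1) for _ in range(n)]   # true satisfaction, variance 1
v = [random.gauss(0, 1) for _ in range(n)]        # reporting noise, variance 1
L = [ls + vi for ls, vi in zip(L_star, v)]        # reported satisfaction, equation (2)
Y = [70.0 + beta_1 * ls + random.gauss(0, 1) for ls in L_star]  # equation (1)

slope_true = slope(L_star, Y)   # infeasible regression on L*
slope_noisy = slope(L, Y)       # feasible regression on L, equation (3)

# Share of the variance of L that comes from the true signal
attenuation = cov(L_star, L_star) / (cov(L_star, L_star) + cov(v, v))

print("slope using L*:", slope_true)
print("slope using L: ", slope_noisy)
print("attenuation factor:", attenuation)
```

With equal signal and noise variances the attenuation factor is about one half, so the slope on the noisy measure comes out at roughly half the slope on the true variable.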
d) Creative question: Can you think of ways to reduce measurement error in this example?
Solution:
One could ask people a different question, for example asking them to disregard the last week. It is also possible to ask many questions related to life satisfaction and average the answers to get a more precise estimate of their actual life satisfaction.
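To illustrate why averaging several questions helps, here is a hedged simulation sketch. It assumes each survey item equals the true satisfaction plus independent noise with the same variance, which is a simplification; the number of items, the true slope of 2 and the helper `averaged_report` are hypothetical. Under that assumption, averaging $k$ items divides the noise variance by $k$, so the slope moves back toward the true value.

```python
import random

def mean(values):
    return sum(values) / len(values)

def cov(a, b):
    """Sample covariance of two equal-length lists (n - 1 denominator)."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / (len(a) - 1)

def slope(x, y):
    """Bivariate OLS slope of y on x: Cov(x, y) / Var(x)."""
    return cov(x, y) / cov(x, x)

def averaged_report(true_value, k):
    """Average of k survey answers, each equal to the truth plus independent noise."""
    return sum(true_value + random.gauss(0, 1) for _ in range(k)) / k

random.seed(3)  # fixed seed so the illustration is reproducible
n = 5_000
beta_1 = 2.0    # assumed true effect

L_star = [random.gauss(0, 1) for _ in range(n)]
Y = [70.0 + beta_1 * ls + random.gauss(0, 1) for ls in L_star]

slopes = {}
for k in (1, 4, 16):
    L_avg = [averaged_report(ls, k) for ls in L_star]
    slopes[k] = slope(L_avg, Y)
    # Theoretical slope with noise variance 1/k: beta_1 / (1 + 1/k)
    print(k, "items:", slopes[k], "theory:", beta_1 / (1 + 1 / k))
```

The estimated slope rises toward 2 as more items are averaged, though it stays below the true value as long as some noise remains.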
Question 3: Logs
Figure 1: Hint: You can use this table as a cheatsheet (Source: Wooldridge (2016))
a) You are interested in estimating the relationship between campaign spending and election results. You collect data and run a regression of voteA, the share (from 0 to 100) of total votes that candidate A receives, on shareA, the share (from 0 to 100) of total campaign spending corresponding to candidate A. The estimated equation is:
$$\widehat{\mathit{voteA}} = 26.81 + 0.464 \cdot \mathit{shareA}$$
Interpret the coefficient on shareA. Be accurate about the difference between "percents" and "percentage points".
Solution:
First, to clarify a bit, think of candidate A as being the candidate for party A, and of your dataset as covering many elections in which party A participated (e.g., your data could be the percentage of votes received by the Democratic candidate in all presidential elections).
The estimated coefficient tells us that, in the data, when candidate A's share of spending increases by 1 percentage point, candidate A receives on average 0.464 percentage points more of the total vote.
b) You want to know how much wages change with higher education and run a regression of log(wage), the natural log of monthly wages in US$, on educ, the years of education, on a sample of workers in the US. The estimated equation is:
$$\widehat{\log(\mathit{wage})} = 0.584 + 0.083 \cdot \mathit{educ}$$
Interpret the coefficient on educ.
Solution:
The coefficient on educ has a percentage interpretation when it is multiplied by 100. The predicted wage increases by approximately 8.3% for every additional year of education.
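The 8.3% figure is the usual approximation for a log-level model. As a small worked check (a sketch, with variable names chosen here for illustration), the exact percentage change implied by the estimated equation is $100 \cdot (e^{0.083} - 1)$, which is slightly larger:

```python
import math

coef_educ = 0.083  # estimated coefficient on educ from the equation above

approx_pct = 100 * coef_educ                  # rule-of-thumb reading: 100 * coefficient
exact_pct = 100 * (math.exp(coef_educ) - 1)   # exact change implied by the log model

print(f"approximate: {approx_pct:.2f}%")
print(f"exact:       {exact_pct:.2f}%")
```

The gap between the two is small here because the coefficient is small; for larger coefficients the approximation gets noticeably worse.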
c) A consulting firm hired you to study how firms' CEO wages are associated with the sales of the company. You collect a dataset of different firms in Argentina and run a regression of log(salary), the natural log of the CEO's salary, on log(sales), the natural log of the sales of the firm. You are asked to discuss the relationship between sales and CEO salaries in front of your boss.
$$\widehat{\log(\mathit{salary})} = 4.822 + 0.257 \log(\mathit{sales})$$
Solution:
Now we have an elasticity. The coefficient tells us that when sales go up by 1%, the CEO's salary increases on average by about 0.257% (this is not necessarily a causal relation; it just captures the relation seen in the data).
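The elasticity reading is a local approximation, so it is worth seeing how it behaves for larger changes in sales. The sketch below (the function name and the chosen percentage changes are illustrative assumptions) compares the rule of thumb, elasticity times the percent change, with the exact change implied by the log-log equation, $100 \cdot ((1 + g)^{0.257} - 1)$ for a proportional sales increase $g$.

```python
elasticity = 0.257  # estimated coefficient on log(sales) from the equation above

def pct_change_in_salary(pct_change_in_sales):
    """Exact percent change in predicted salary implied by the log-log model."""
    growth = 1 + pct_change_in_sales / 100
    return 100 * (growth ** elasticity - 1)

for pct in (1, 10, 50):
    approx = elasticity * pct            # elasticity rule of thumb
    exact = pct_change_in_salary(pct)    # exact implied change
    print(f"sales +{pct}%: approx {approx:.3f}%, exact {exact:.3f}%")
```

For a 1% change the two readings are nearly identical; for large changes the rule of thumb overstates the predicted salary increase.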
d) Unlike you, your econometrics professor is obsessed with the effect of class sizes on math test scores. She asks you to run a regression of math10 (percentage from 0 to 100 of total points attained in the math exam in class 10) on log(enroll), the natural logarithm of the class size. There are also two control variables included. You get the following result:
$$\widehat{\mathit{math10}} = -207.66 + 21.16 \log(\mathit{totcomp}) + 3.98 \log(\mathit{staff}) - 1.29 \log(\mathit{enroll})$$
Interpret the relationship between enrollment and the math scores.
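As a hedged sketch of how the level-log rule would apply here (an illustration under the standard approximation, not the original solution): with the outcome in levels and the regressor in logs, the coefficient divided by 100 gives the change in the outcome for a 1% change in the regressor, holding the other two controls fixed.

```latex
\begin{aligned}
\Delta \widehat{\mathit{math10}}
  &\approx \frac{-1.29}{100} \cdot \%\Delta \mathit{enroll}
  && \text{(level-log approximation)} \\
\text{1\% increase in enroll:}\quad
\Delta \widehat{\mathit{math10}}
  &\approx -0.0129 \text{ percentage points} \\
\text{10\% increase in enroll (exact):}\quad
\Delta \widehat{\mathit{math10}}
  &= -1.29 \cdot \ln(1.10) \approx -0.123 \text{ percentage points}
\end{aligned}
```

Because math10 is itself measured on a 0 to 100 scale, the change is stated in percentage points rather than percent, and it describes an association in the data rather than a causal effect.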