SA4_Econ140_Solution.pdf (Econ 140, University of California, Berkeley)
Section Assignment - Econ 140 Spring 2023
WEEK 4: OLS Regression, Functional Forms, Bias in Regression

Exercises

Question 1: Omitted variable bias

Consider the following long regression:

    y_i = β0 + β1 x_i + β2 z_i + u_i

the following short regression:

    y_i = α0 + α1 x_i + ε_i

and the auxiliary regression:

    z_i = γ0 + γ1 x_i + e_i

Prove that α1 = β1 + β2 γ1 and interpret.

Solution: Let's step back for a second and think about why we could be interested in estimating β1. (To understand what comes next, it might help to imagine you are working with observational data on wages and education.) Many times we are trying to identify causal relations. We think of causality using ceteris paribus: what would happen to y if we altered x but kept all the other things constant? When we include the other variables as controls in the regression, we can interpret the coefficient on x as the relation between y and x keeping the controls fixed, which is what we wanted to get. But if we don't include them in the regression, we can't make that interpretation in general. What would the coefficient capture in that case?

Substitute the expression for z_i from the auxiliary regression into the long regression:

    y_i = β0 + β1 x_i + β2 (γ0 + γ1 x_i + e_i) + u_i
        = β0 + β1 x_i + β2 γ0 + β2 γ1 x_i + β2 e_i + u_i
        = (β0 + β2 γ0) + (β1 + β2 γ1) x_i + (β2 e_i + u_i)

So the coefficient on x_i in the short regression (α1) is β1 + β2 γ1. We call β2 γ1 omitted variable bias (OVB). α1 captures the "effect" of x on y (to be precise: this is not really a causal effect unless we make additional assumptions; rather, it is the statistical association), but also the "effect" of z on y, scaled by the association between z and x.
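The algebra above can be confirmed with a quick simulation. A minimal sketch, where all the coefficient values (β1 = 2, β2 = 3, γ1 = 0.8, and so on) are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Made-up "true" coefficients, for illustration only
beta0, beta1, beta2 = 1.0, 2.0, 3.0   # long regression
gamma0, gamma1 = 0.5, 0.8             # auxiliary regression

x = rng.normal(size=n)
z = gamma0 + gamma1 * x + rng.normal(size=n)   # z co-moves with x
y = beta0 + beta1 * x + beta2 * z + rng.normal(size=n)

# Short-regression slope via the covariance-over-variance formula
alpha1_hat = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# OVB prediction: alpha1 = beta1 + beta2*gamma1 = 2 + 3*0.8 = 4.4
```

In a large sample, `alpha1_hat` lands near 4.4, not near the structural β1 = 2, exactly as the decomposition predicts.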
This comes from the fact that when we observe higher values of x, the values of z might be moving too (e.g., education usually correlates positively with parental wealth, so if we grab someone with more education it is likely that their parents were also wealthier). If we do not control for it, we will not be able to interpret the estimated coefficient in a causal way (e.g., do the higher wages for the person with more education come from the additional education or from the fact that they had richer parents?). A way of thinking about this is that when we are trying to identify causality, we are thinking about the coefficient on x from a regression that includes all the relevant predetermined variables (the "longest" regression, if I may).

When would it be non-problematic to omit variables, i.e., when is there no OVB? There are two cases:

- β2 = 0: the omitted variable was not really relevant to begin with.
- γ1 = 0: x is uncorrelated with the omitted variable.

(There's also a third case: when the omitted variable is not predetermined and may itself be an outcome of x. These are usually called bad controls, mediators, or mechanisms, but more on this later. Maybe.)

This is why, when trying to identify causality, we always have to think about whether there's anything relevant in the error term (and remember, anything not explicitly included in the regression is in the error term) that's correlated with one of the included variables. This also shows why RCTs are great: if I assign x randomly, then it is expected to be uncorrelated with all the predetermined variables!
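The RCT point can also be checked numerically. A sketch with made-up coefficients, in which x is assigned by a coin flip so that γ1 = 0 and the short regression has no OVB:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
beta0, beta1, beta2 = 1.0, 2.0, 3.0   # made-up coefficients

# z is a predetermined variable; x is assigned by coin flip,
# so Cov(x, z) = 0 and the auxiliary slope gamma1 is zero.
z = rng.normal(size=n)
x = rng.binomial(1, 0.5, size=n).astype(float)
y = beta0 + beta1 * x + beta2 * z + rng.normal(size=n)

# Even the short regression (omitting z) recovers beta1: no OVB
alpha1_hat = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
```

Here `alpha1_hat` is close to β1 = 2 despite z being omitted, because random assignment forces γ1 = 0.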
Question 2: OLS and measurement error

Recall that the OLS estimator in a bivariate regression of Y on X is equal to Cov(x_i, y_i) / Var(x_i). Also note that Cov(a + bX, Y) = Cov(a, Y) + Cov(bX, Y) = b Cov(X, Y), and recall that Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y).

You want to estimate the effect of life satisfaction L on life expectancy Y, and you believe that the two variables are related as follows:

    Y_i = β0 + β1 L_i + e_i    (1)

You manage to find out the life expectancy of a sample of 1,000 individuals. Unfortunately, you cannot observe their life satisfaction L, and so you run a survey, ask them how satisfied they are with their life, and record their answer as L̃. As it turns out, people are present-biased: when asked about their life satisfaction, they are influenced by random events that happened on that day. Maybe they just learned something cool in their econometrics class (making them report higher life satisfaction), or their favorite sports team just lost (making them report lower life satisfaction). Therefore, you think that the reported life satisfaction L̃ is equal to

    L̃_i = L_i + v_i,    (2)

where v_i is a random error term that is fully independent of L_i and Y_i. You think of running the following regression specification to estimate your model:

    Y_i = α0 + α1 L̃_i + u_i    (3)

a) Can you think of other reasons why a variable may be mismeasured in the data?

Solution: There are many potential reasons for measurement error.
We generally classify them into random measurement error and non-random (or systematic) measurement error.

Examples of random measurement error: physical constraints (e.g., a thermometer will never be 100% accurate), rounding (people do not report their precise salaries, but a round number), random noise (for some census data, the Census Bureau has started adding random numbers to preserve anonymity), and random errors (when I ask people about their SATs, some will just get it wrong, but on average people will report the correct number).

Non-random measurement error is also very common and poses bigger problems. Examples: people systematically misreporting (for example, rich people are less likely to truthfully report their wealth, and autocratic countries systematically over-report their growth estimates), measurement difficulties (GDP in poorer countries is less precisely estimated than in richer countries), and many more.

b) Will you (on average) get the effect you want, β1, if you run this regression? Hints: Use the covariance-over-variance formula for the OLS estimator. Plug in what you know about L̃_i and L_i from equation (2). Your final expression should be related to the OLS estimator for equation (1). You can use the fact that covariance is a linear operator and Var(A + B) = Var(A) + Var(B) + 2 Cov(A, B).

Solution: No, you will not (on average) get the effect you want, β1, if you run this regression. The OLS estimator for α1 will be equal to:

    α̂1 = Cov(y_i, L̃_i) / Var(L̃_i)
        = Cov(y_i, L_i + v_i) / Var(L_i + v_i)
        = [Cov(y_i, L_i) + Cov(y_i, v_i)] / [Var(L_i) + Var(v_i) + 2 Cov(L_i, v_i)]    (linearity of covariance)
        = Cov(y_i, L_i) / [Var(L_i) + Var(v_i)]    (Cov(y_i, v_i) = Cov(L_i, v_i) = 0 by independence)

Hence, |α̂1| ≤ |β̂1|, where β̂1 = Cov(y_i, L_i) / Var(L_i).

c) What does this tell you about the effect of measurement error on your regression?
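The final expression implies α̂1 ≈ β1 · Var(L) / (Var(L) + Var(v)). A quick simulation of this, using made-up values (β1 = 5, Var(L) = 4, Var(v) = 1, so the shrinkage factor is 0.8):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
beta0, beta1 = 50.0, 5.0   # made-up intercept and true slope
var_L, var_v = 4.0, 1.0    # attenuation factor: 4 / (4 + 1) = 0.8

L = rng.normal(0.0, np.sqrt(var_L), size=n)       # true satisfaction
v = rng.normal(0.0, np.sqrt(var_v), size=n)       # survey-day noise
L_tilde = L + v                                   # reported satisfaction
y = beta0 + beta1 * L + rng.normal(size=n)        # life expectancy

alpha1_hat = np.cov(L_tilde, y)[0, 1] / np.var(L_tilde, ddof=1)
# alpha1_hat should land near 5 * 0.8 = 4, not the true slope 5
```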
Solution: We see that measurement error leads to a systematic problem in the regression. Whenever we have measurement error (Var(v_i) > 0), the estimated coefficient from the regression is closer to zero than the true coefficient. We call this attenuation bias. When we have a regression with this type of measurement error (random measurement error in the independent variable), we know that the true coefficient will be at least as large, in absolute value, as the one we estimated.

d) Creative question: Can you think of ways to reduce measurement error in this example?

Solution: One could ask people a different question, for example asking them to disregard the last week. It is also possible to ask many questions related to life satisfaction and average over those questions to get a more precise estimate of their actual life satisfaction.
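The averaging idea in part (d) can be sketched numerically: with k independent reports per person, the noise variance in the average falls to Var(v)/k, so the attenuation factor improves. A simulation under made-up values (β1 = 5, Var(L) = 4, k = 5 questions with unit-variance noise each):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 200_000, 5          # k = 5 survey questions per person (made up)
beta1 = 5.0                # made-up true slope
L = rng.normal(0.0, 2.0, size=n)        # Var(L) = 4
y = beta1 * L + rng.normal(size=n)

# Each report carries independent unit-variance noise; averaging the
# k reports shrinks the noise variance to 1/k, so the attenuation
# factor improves from 4/(4+1) = 0.8 to 4/(4+0.2) ~ 0.95.
reports = L[:, None] + rng.normal(size=(n, k))
L_bar = reports.mean(axis=1)
slope = np.cov(L_bar, y)[0, 1] / np.var(L_bar, ddof=1)
```

The estimated `slope` moves from about 4.0 (single noisy report) to about 4.76, much closer to the true value of 5.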
Question 3: Logs

Figure 1: Hint: you can use this table of functional-form interpretations as a cheat sheet. (Source: Wooldridge (2016).)

a) You are interested in estimating the relationship between campaign spending and election results. You collect data and run a regression of voteA, the share (from 0 to 100) of total votes that candidate A receives, on shareA, the share of total campaign spending (from 0 to 100) corresponding to candidate A. The estimated equation is:

    voteA = 26.81 + 0.464 · shareA

Interpret the coefficient on shareA. Be accurate about the difference between "percents" and "percentage points".

Solution: First, to clarify a bit, think of candidate A as being the candidate for party A, and your dataset as containing many elections in which party A participated (e.g., your data could be the percentage of votes received by the Democratic candidate in all presidential elections). The estimated coefficient tells us that in the data, when the share of candidate A's spending increases by 1 percentage point, candidate A receives on average 0.464 percentage points more of the total vote.

b) You want to know how much wages change with higher education and run a regression of log(wage), the natural log of monthly wages in US$, on educ, years of education, on a sample of workers in the US. The estimated equation is:

    log(wage) = 0.584 + 0.083 · educ

Interpret the coefficient on educ.

Solution: The coefficient on educ has a percentage interpretation when it is multiplied by 100. The predicted wage increases by about 8.3% for every additional year of education.

c) A consulting firm hired you to study how firms' CEOs' salaries are associated with the sales of the company. You collect a dataset of different firms in Argentina and run a regression of log(salary), the natural log of the CEO's salary, on log(sales), the natural log of the sales of the firm. You are asked to discuss the relationship between sales and CEO salaries in front of your boss.
    log(salary) = 4.822 + 0.257 · log(sales)

Solution: Now we have an elasticity. The coefficient tells us that when sales go up by 1%, the average salary of the CEO increases by 0.257% (this is not necessarily a causal relation; it is just capturing the association seen in the data).

d) Unlike you, your econometrics professor is obsessed with the effect of class sizes on math test scores. She asks you to run a regression of math10 (percentage, from 0 to 100, of total points attained in the math exam in class 10) on log(enroll), the natural logarithm of the class size. There are also two control variables included. You get the following result:

    math10 = -207.66 + 21.16 · log(totcomp) + 3.98 · log(staff) - 1.29 · log(enroll)

Interpret the relationship between enrollment and the math scores.
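The interpretations in parts (b) through (d) can be checked with a little arithmetic. A sketch using only the coefficients from the fitted equations quoted above; the exact-change formulas are the standard complements to the usual log approximations:

```python
import math

# Part (b), log-level: log(wage) = 0.584 + 0.083*educ.
# 100*b approximates the % change in wage per extra year of
# education; the exact change is 100*(exp(b) - 1).
b_educ = 0.083
approx_pct = 100 * b_educ                     # ~8.3% (approximation)
exact_pct = 100 * (math.exp(b_educ) - 1)      # ~8.65% (exact)

# Part (c), log-log: the coefficient is an elasticity, so a 1% rise
# in sales goes with about a 0.257% rise in CEO salary.
b_sales = 0.257

# Part (d), level-log: math10 changes by roughly b/100 points when
# enrollment rises by 1%; for a 10% rise the exact change is
# b * log(1.10).
b_enroll = -1.29
points_per_1pct = b_enroll / 100              # ~ -0.0129 points
points_per_10pct = b_enroll * math.log(1.10)  # ~ -0.123 points
```

Note how small the enrollment effect is: even a 10% larger class is associated with only about an eighth of a point lower math score.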