Assignment 1 Solution: ECON-UA 266 - Intro to Econometrics

Sahar Parsa, Xiaotong Wu, Anne Schick, and Odhrain McCarthy

Spring 2024

IMPORTANT DISCLAIMER: The homework is NOT graded. The points only indicate the weight assigned to each question. The first assignment solution will be released on Friday, February 2, 2024 at 5 PM; you have until then to try to solve the homework on your own (or with the help of the TA during recitation). It covers the material from the first week of class. 1) You are encouraged to discuss the problems with others, but 2) you must write up your own results.

Question [100 points]

This question covers statistical inference: the relationship between a population parameter and the sample average. Suppose you flip a coin and you are interested in the probability of getting a head. Assuming the coin is not rigged, the random variable $X$, defined as 1 if the outcome is a head and 0 if it is a tail, follows a Bernoulli distribution that takes the value 1 (head) with probability $p = 0.5$.

1. [5 points] What is the probability of getting a tail? What is the expected value (= population mean) of the random variable $X$? What about the population variance? Write down the formula first and then calculate the value.

Solution: The probability of getting a tail is

$$1 - p = 1 - 0.5 = 0.5$$

The expected value is

$$E[X] = \mu_X = \Pr(X = 1) \times 1 + \Pr(X = 0) \times 0 = 0.5 \times 1 = 0.5 = p$$

The population variance is

$$\mathrm{Var}(X) = \sigma_X^2 = (1 - \mu_X)^2 \Pr(X = 1) + (0 - \mu_X)^2 \Pr(X = 0) = (1 - 0.5)^2 \times 0.5 + (0 - 0.5)^2 \times 0.5 = 0.5^2 = 0.25$$

2. [5 points] Generate a sample of $N = 10$ observations by randomly drawing 10 times from the coin experiment, $\{X_1, X_2, \ldots, X_N\}$, in R.
Solution: We use the function rbinom in R, which randomly draws values from a binomial distribution. A binomial distribution with one trial is a Bernoulli distribution. The function takes three arguments, rbinom(a, b, c), where a is the sample size, b is the number of trials (set to 1 here), and c is the probability of getting a value equal to 1.

    library(tidyverse)

    ## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
    ## v ggplot2 3.4.0     v purrr   1.0.1
    ## v tibble  3.1.8     v dplyr   1.0.10
    ## v tidyr   1.3.0     v stringr 1.5.0
    ## v readr   2.1.3     v forcats 0.5.2
    ## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
    ## x dplyr::filter() masks stats::filter()
    ## x dplyr::lag()    masks stats::lag()

    # Fix the seed so the draws can be replicated
    set.seed(123)
    # Draw 10 observations from a Bernoulli(0.5) and store them in a data frame
    sample <- rbinom(10, 1, 0.5) %>% as.data.frame()
    # Name the column S1, for Sample 1
    colnames(sample) <- c('S1')
    sample

    ##    S1
    ## 1   0
    ## 2   1
    ## 3   0
    ## 4   1
    ## 5   1
    ## 6   0
    ## 7   1
    ## 8   1
    ## 9   1
    ## 10  0

To replicate the values drawn in your sample, you need to define a seed at the beginning of your code, using the set.seed(123) command. In our sample, we obtain 6 heads (value 1) and 4 tails (value 0). We are calling on the package tidyverse, which allows us to manipulate and visualize datasets easily in R. We will define ggplot2 in the next question. We named the column S1 for Sample 1.

3. [5 points] Plot the histogram of the sample you generated in 2 using the package ggplot.

Solution:

    # Convert S1 to a factor and plot the share of each outcome
    ggplot(sample %>% mutate(S1 = as.factor(S1)), aes(x = S1)) +
      geom_bar(aes(y = (..count..) / sum(..count..))) +
      labs(title = "Histogram Sample 1", x = "Outcome", y = "Frequency")

    ## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
    ## i Please use `after_stat(count)` instead.
    ## This warning is displayed once every 8 hours.
    ## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
    ## generated.
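The deprecation warning above points to `after_stat(count)` as the replacement for the dot-dot notation. A minimal sketch of the same bar chart written that way, assuming ggplot2 3.4.0 or later and hard-coding the ten S1 values printed above so that the block runs on its own (the names `sample1` and `p` are illustrative and not part of the assignment):

```r
library(ggplot2)

# S1 values copied from the printed sample (hard-coded so the block is self-contained)
sample1 <- data.frame(S1 = c(0, 1, 0, 1, 1, 0, 1, 1, 1, 0))

# Same bar chart of relative frequencies, with after_stat(count) instead of ..count..
p <- ggplot(sample1, aes(x = as.factor(S1))) +
  geom_bar(aes(y = after_stat(count) / sum(after_stat(count)))) +
  labs(title = "Histogram Sample 1", x = "Outcome", y = "Frequency")

print(p)
```

The bar heights are the shares of tails and heads in the sample, so the plot should look the same as the one produced by the original call, only without the warning.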
[Figure: "Histogram Sample 1"; x-axis: Outcome (0, 1); y-axis: Frequency]

To create our figure, we use the function ggplot from the package ggplot2, which is part of the tidyverse package we loaded in the previous question. According to the ggplot2 page, “ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.” We refer you to https://ggplot2.tidyverse.org/ for more information.

Note the use of as.factor, which converts the integers into a factor variable. This prevents the values in the sample from being bunched together in the bar chart and helps with the visualization. We also plot the relative frequency instead of the count by using “aes(y = (..count..)/sum(..count..))” inside geom_bar.

4. [5 points] Estimate the sample average and the sample variance of $X$. Write down the formula first and then give the estimate. How does the sample average differ from the population average?

Solution: As $\{X_1, X_2, \ldots, X_{10}\}$ is our sample, we can construct the sample average using the formula

$$\bar{X} = \frac{\sum_{i=1}^{N} X_i}{N}$$

which is estimated to be equal to

$$\bar{x} = \frac{\sum_{i=1}^{10} x_i}{10} = \frac{0+1+0+1+1+0+1+1+1+0}{10} = \frac{6}{10} = 0.6$$

Notice the use of the upper-case letter to denote a random variable, and the lower-case letter to denote the value the random variable takes. Similarly, we can construct the sample variance as:

$$S_X^2 = \frac{\sum_{i=1}^{10} (X_i - \bar{X})^2}{10}$$
For our sample, the sample variance is equal to

$$s_x^2 = \frac{(0-0.6)^2 + (1-0.6)^2 + (0-0.6)^2 + (1-0.6)^2 + (1-0.6)^2 + (0-0.6)^2 + (1-0.6)^2 + (1-0.6)^2 + (1-0.6)^2 + (0-0.6)^2}{10}$$

$$= \frac{0.36 + 0.16 + 0.36 + 0.16 + 0.16 + 0.36 + 0.16 + 0.16 + 0.16 + 0.36}{10} = \frac{2.4}{10} = 0.24$$

As expected, the sample average and the sample variance are close to, but not equal to, the population average and the population variance of the Bernoulli random variable given in question 1.

5. [10 points] Generate another sample of size $N = 10$ and repeat 3 and 4. Do you observe a difference in the histogram, the sample average, and the sample variance between this new sample and the sample generated in 2? Explain.

Solution:

    # A different seed gives a different sample
    set.seed(234)
    # Store the second sample as a new column, S2
    sample$S2 <- rbinom(10, 1, 0.5)
    sample

    ##    S1 S2
    ## 1   0  1
    ## 2   1  1
    ## 3   0  0
    ## 4   1  1
    ## 5   1  0
    ## 6   0  1
    ## 7   1  1
    ## 8   1  1
    ## 9   1  1
    ## 10  0  0

    ggplot(sample %>% mutate(S2 = as.factor(S2)), aes(x = S2)) +
      geom_bar(aes(y = (..count..) / sum(..count..))) +
      labs(title = "Histogram Sample 2", x = "Outcome", y = "Frequency")
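As a check on the hand calculation for the first sample in question 4, the same numbers can be recomputed in R. The sketch below hard-codes the S1 values printed above and uses a hypothetical helper, `var_n`, for the divide-by-N formula used in this solution. Note that R's built-in `var()` divides by N - 1 instead, so it returns a slightly larger value.

```r
# S1 values copied from the printed sample
s1 <- c(0, 1, 0, 1, 1, 0, 1, 1, 1, 0)

# Hypothetical helper: sample variance dividing by N (the formula used in this solution)
var_n <- function(x) mean((x - mean(x))^2)

mean(s1)   # sample average: 6 / 10 = 0.6
var_n(s1)  # divide-by-N sample variance: 2.4 / 10 = 0.24
var(s1)    # built-in var() divides by N - 1: 2.4 / 9, about 0.267
```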
[Figure: "Histogram Sample 2"; x-axis: Outcome (0, 1); y-axis: Frequency]

Notice that to obtain a different sample, we need to change the value inside set.seed. In this sample, we obtain 7 heads and 3 tails. The sample average is:

$$\bar{x} = \frac{\sum_{i=1}^{10} x_i}{10} = \frac{1+1+0+1+0+1+1+1+1+0}{10} = \frac{7}{10} = 0.7$$

and the sample variance is:

$$s_x^2 = \frac{\sum_{i=1}^{10} (x_i - \bar{x})^2}{10} = \frac{(1-0.7)^2 + (1-0.7)^2 + (0-0.7)^2 + (1-0.7)^2 + (0-0.7)^2 + (1-0.7)^2 + (1-0.7)^2 + (1-0.7)^2 + (1-0.7)^2 + (0-0.7)^2}{10}$$

$$= \frac{0.09 + 0.09 + 0.49 + 0.09 + 0.49 + 0.09 + 0.09 + 0.09 + 0.09 + 0.49}{10} = \frac{2.1}{10} = 0.21$$

From the class notes, we know that different samples will generate different values for our statistics.

It is worth pausing to discuss the formula we are using for the sample variance. You might have seen two formulas for the sample variance. In the first formula, one divides the numerator by $N$, and in the other one by $N - 1$. These two formulas give nearly the same value when the sample size is very large. They will differ when the sample size is small. The main difference between the two formulas is that the sample variance measured by dividing by $N - 1$ is an unbiased estimator of the population variance, while the sample variance measured by dividing by $N$ is a biased estimator. Why do we need to divide by $N - 1$? Intuitively, when we are estimating the population variance $\sigma_X^2 = E[(X - E[X])^2]$, one needs to estimate two unknown objects: the unknown population mean $E[X]$ and the unknown population variance. As such, we lose one degree of freedom as we use our sample of size $N$ to estimate the unknown population mean.

In this question, we are just asking for the value of the sample average and the sample variance (as opposed to an unbiased estimator of the population variance). We will accept both formulas for the sample variance. In our answer, we used the formula dividing by $N$.

6. [10 points] Generate 100 samples of size $N = 10$ and for each sample calculate the sample average. Then plot the histogram/distribution of the sample averages.

Solution:

    set.seed(101)
    # Each of the 100 columns is one sample of 10 draws
    sample <- replicate(100, rbinom(10, 1, 0.5)) %>% as.data.frame()
    # Use only the numeric columns -- all the columns
    sample.numcols <- sample[, sapply(sample, is.numeric)]
    # Sample average of each column
    sampleaverage <- apply(sample.numcols, 2, function(x) mean(x, na.rm = TRUE)) %>%
      as.data.frame()
    colnames(sampleaverage) <- c('average')
    ggplot(sampleaverage, aes(x = average)) +
      geom_histogram(aes(y = ..ncount..)) +
      labs(title = "Histogram Sample Average", x = "Sample Average", y = "Frequency")

    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

[Figure: "Histogram Sample Average" (N = 10); x-axis: Sample Average, from 0.00 to 1.00; y-axis: Frequency, from 0.00 to 1.00]

7. [10 points] What is the standard deviation of the sample averages? What about the average value of the sample averages?
Solution:

Standard deviation:

    sd(sampleaverage$average)

    ## [1] 0.1671478

Sampling average:

    mean(sampleaverage$average)

    ## [1] 0.479

8. [10 points] Write down the formulas for the standard deviation and the mean of the sample average.

Solution: The population standard deviation of the sample average is:

$$\sigma_{\bar{X}} = \sqrt{\mathrm{Var}(\bar{X})} = \sqrt{\frac{\sigma_X^2}{N}} = \sqrt{0.25 / 10} \approx 0.158$$

The mean of the sample average is:

$$E[\bar{X}] = \mu_X = 0.5$$

9. [15 points] Repeat 6, 7 and 8 but with a sample of size $N = 100$.

Solution:

    set.seed(111)
    # Each of the 100 columns is now one sample of 100 draws
    sample <- replicate(100, rbinom(100, 1, 0.5)) %>% as.data.frame()
    # Use only the numeric columns -- all the columns
    sample.numcols <- sample[, sapply(sample, is.numeric)]
    # Sample average of each column
    sampleaverage <- apply(sample.numcols, 2, function(x) mean(x, na.rm = TRUE)) %>%
      as.data.frame()
    colnames(sampleaverage) <- c('average')
    ggplot(sampleaverage, aes(x = average)) +
      geom_histogram(aes(y = ..ncount..)) +
      labs(title = "Histogram Sample Average", x = "Sample Average", y = "Frequency")

    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
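The `stat_bin()` message above suggests choosing a bin width explicitly. One possible choice, offered here as an assumption and not as part of the official solution, is `binwidth = 0.01`: with N = 100 draws of 0 or 1, every sample average is a multiple of 0.01, so each bin then holds exactly one possible value. The sketch uses base-R `colMeans` in place of the `apply` call and `after_stat(ncount)` in place of `..ncount..`; the names `draws`, `averages`, and `p` are illustrative.

```r
library(ggplot2)

set.seed(111)
# 100 samples (columns), each with N = 100 Bernoulli(0.5) draws (rows)
draws <- replicate(100, rbinom(100, 1, 0.5))
# One sample average per column
averages <- data.frame(average = colMeans(draws))

# Histogram scaled so the tallest bar equals 1, with an explicit bin width
p <- ggplot(averages, aes(x = average)) +
  geom_histogram(aes(y = after_stat(ncount)), binwidth = 0.01) +
  labs(title = "Histogram Sample Average", x = "Sample Average", y = "Frequency")

print(p)
```

For the N = 10 case in question 6, the analogous choice would be `binwidth = 0.1`, since the sample averages are then multiples of 0.1.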
[Figure: "Histogram Sample Average" (N = 100); x-axis: Sample Average, from 0.40 to 0.55; y-axis: Frequency, from 0.00 to 1.00]

Standard deviation:

    sd(sampleaverage$average)

    ## [1] 0.0444454

Sampling average:

    mean(sampleaverage$average)

    ## [1] 0.5006

The population standard deviation of the sample average is:

$$\sigma_{\bar{X}} = \sqrt{\mathrm{Var}(\bar{X})} = \sqrt{\frac{\sigma_X^2}{N}} = \sqrt{0.25 / 100} = 0.05$$

The mean of the sample average is:

$$E[\bar{X}] = \mu_X = 0.5$$

10. [10 points] How do the distribution (look at the histogram), the standard deviation, and the mean value of the sample averages differ when we increase the sample size from $N = 10$ to $N = 100$? How do the values compare to the values obtained with the formulas derived in 8? Explain your results and the lesson learned. (Hint: The sampling distribution changes with the sample size.)
Solution: The average of the sample averages remains similar in both experiments (0.479 with $N = 10$ and 0.5006 with $N = 100$, both close to 0.5). But the standard deviation of the sample averages changed: it is smaller when the sample size is larger ($N = 100$), so there is less dispersion around the mean value. One can also see this pattern using the formula for the population standard deviation of the sample average, $\sqrt{\sigma_X^2 / N}$, which gives about 0.158 for $N = 10$ and 0.05 for $N = 100$, close to the simulated values of 0.167 and 0.044.

11. [15 points] Suppose that instead of using the sample average as the statistic to estimate the population mean of our random variable, we use another statistic. The other statistic is equal to 0.6, independently of the sample drawn. What is the sampling distribution of this second estimator? Does this second statistic have sampling uncertainty measured by the standard deviation of the estimator? Does this new statistic do better than the sample average in terms of the sampling uncertainty? Now, turning to the expected value (population mean) of the statistic, what is the deviation of the expected value from the true population parameter? Compare the sampling expected value of the sample average and this new statistic. Is the new statistic “better” than the sample average at estimating the true population parameter?

Solution:

(1) The statistic always takes the value $\hat{\mu}_X = 0.6$ independently of the sample, i.e., it takes the value 0.6 with probability 1.

(2) The sampling uncertainty (as captured by the standard deviation of the estimator) is zero. There is no variance. This is because the estimate is 0.6 independently of the sample.

(3) The expected value of the statistic is also 0.6, i.e., $E[\hat{\mu}_X] = 0.6$.

(4) Given that the true population mean parameter ($\mu_X$) is 0.5, the statistic will always be greater than the true population parameter by 0.1. This new statistic is a biased estimator of the population mean $\mu_X$.

(5) The sampling uncertainty of the sample average is greater than the sampling uncertainty of this new statistic.

(6) But the sample average is unbiased. The population mean of the sample average is $\mu_X = 0.5$. This is shown empirically with our simulation, where we saw that the sample averages had an average of approximately 0.5, and by the formula $E[\bar{X}] = \mu_X = 0.5$, which holds independently of $N$. On the other hand, the new statistic is biased. The sample average ($\bar{x} \approx 0.5$) is much closer to the true population parameter ($\mu_X = 0.5$) than this new statistic ($\hat{\mu}_X = 0.6$).

(7) In this exercise, we saw that different estimators have different sampling properties. An important question in statistics is which properties we should care about when choosing an estimator. Should the statistic have zero sampling uncertainty? Should it be unbiased? There are many more properties we could look at, but these two are the most important ones. In general, we care about the population mean of the sampling distribution. In particular, a good statistic is one that gets it right on average, i.e., is unbiased. Within the class of unbiased estimators, we should then prefer less sampling uncertainty. Given these criteria, the sample average performs better at estimating the population mean of $X$.
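The comparison in question 11 can also be illustrated with a small simulation. The following is a minimal sketch, not part of the original solution: the seed, the number of replications (`n_rep = 1000`), and all variable names are assumptions chosen for illustration.

```r
set.seed(2024)  # arbitrary seed chosen for this sketch

p_true <- 0.5   # population mean of the fair coin
n_obs  <- 10    # observations per sample
n_rep  <- 1000  # number of simulated samples (an assumption, not from the assignment)

# Estimator 1: the sample average of each simulated sample
avg_hat <- replicate(n_rep, mean(rbinom(n_obs, 1, p_true)))
# Estimator 2: the constant 0.6, whatever the sample looks like
const_hat <- rep(0.6, n_rep)

# Bias = average of the estimates minus the true parameter;
# sd = sampling uncertainty of each estimator across the simulated samples
summary_table <- data.frame(
  estimator = c("sample average", "constant 0.6"),
  bias      = c(mean(avg_hat) - p_true, mean(const_hat) - p_true),
  sd        = c(sd(avg_hat), sd(const_hat))
)
print(summary_table)

# Theoretical standard deviation of the sample average: sqrt(sigma^2 / N)
sqrt(p_true * (1 - p_true) / n_obs)
```

In a run of this sketch, the constant estimator should show a bias of 0.1 with no sampling uncertainty, while the sample average should show a bias near zero and a standard deviation near the theoretical value, which mirrors points (2) through (6) above.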