lab08-Revised-Class

.pdf

School

University of North Georgia, Dahlonega

Course

MATH-240

Subject

Statistics

Date

Apr 3, 2024


lab08-Revised-Class

March 24, 2024

1 Lab 8: Normal Distribution and Variance of Sample Means

Welcome to Lab 8! In today’s lab, we will learn about the variance of sample means as well as the normal distribution.

[1]: # Run this cell, but please don't change it.

     # These lines import the Numpy and Datascience modules.
     import numpy as np
     from datascience import *

     # These lines do some fancy plotting magic.
     import matplotlib
     %matplotlib inline
     import matplotlib.pyplot as plt
     plt.style.use('fivethirtyeight')
     import scipy.stats as stats
     import warnings
     warnings.simplefilter('ignore', FutureWarning)

2 1. Normal Distributions

When we visualize the distribution of a sample, we are often interested in the mean and the standard deviation of the sample (for the rest of this lab, we will abbreviate “standard deviation” as “SD”). These two summary statistics can give us a bird’s eye view of the distribution - by letting us know where the distribution sits on the number line and how spread out it is, respectively.

We want to check if the data is linearly related, so we should look at the data.

Question 1.1. The next cell loads the table births from lecture, which is a large random sample of US births and includes information about mother-child pairs. Plot the distribution of mother’s ages from the table. Don’t change the last line, which will plot the mean of the sample on the distribution itself.

[2]: births = Table.read_table('baby.csv')

     # Add your plot code here

     births.hist("Maternal Age")

     # Do not change these lines
     plt.ylim(-.002, .07)
     plt.scatter(np.mean(births.column("Maternal Age")), 0, color='red', s=50, zorder=4);

From the plot above, we can see that the mean is the center of gravity or balance point of the distribution. If you cut the distribution out of cardboard, and then placed your finger at the mean, the distribution would perfectly balance on your finger. Since the distribution above is right skewed (which means it has a long right tail), we know that the mean of the distribution is larger than the median, which is the “halfway” point of the data. Conversely, if the distribution had been left skewed, we know the mean would be smaller than the median.

Question 1.2. Run the following cell to compare the mean (red) and median (green) of the distribution of mothers’ ages.

[3]: # Do not change or delete any of these lines of code
     births.hist("Maternal Age")
     plt.ylim(-.002, .07)
     plt.scatter(np.mean(births.column("Maternal Age")), 0, color='red', s=50, zorder=4);

     plt.scatter(np.median(births.column("Maternal Age")), 0, color='green', s=50, zorder=5);

We are also interested in the standard deviation of mothers’ ages. The SD gives us a sense of how variable mothers’ ages are around the average mother’s age. If the SD is large, then the mothers’ ages should spread over a large range from the mean. If the SD is small, then the mothers’ ages should be tightly clustered around the average age.

The SD of an array is defined as the root mean square of deviations (differences) from average.

Fun fact! σ (Greek letter sigma) is used to represent the SD and μ (Greek letter mu) is used for the mean.

Question 1.3. Run the cell below to see the width of one SD (blue) from the sample mean (red) plotted on the histogram of maternal ages.

[4]: # calculate the mean and standard deviation of ages
     age_mean = np.mean(births.column("Maternal Age"))
     age_sd = np.std(births.column("Maternal Age"))

     # Do not change or delete any of the following lines of code
     births.hist("Maternal Age")
     plt.ylim(-.002, .07)

     plt.scatter(age_mean, 0, color='red', s=50, zorder=3);
     plt.scatter(age_mean + age_sd, 0, marker='^', color='blue', s=50, zorder=4);
     plt.scatter(age_mean - age_sd, 0, marker='^', color='blue', s=50, zorder=5);

In this histogram, the standard deviation is not easy to identify just by looking at the graph.

However, the distributions of some variables allow us to easily spot the standard deviation on the plot. For example, if a sample follows a normal distribution, the standard deviation is easily spotted at the point of inflection (the point where the curve begins to change the direction of its curvature) of the distribution.

Question 1.4. Fill in the following code to examine the distribution of maternal heights, which is roughly normally distributed. We’ll plot the standard deviation on the histogram, as before - notice where one standard deviation (blue) away from the mean (red) falls on the plot.

[5]: # calculate the mean and standard deviation of heights
     height_mean = np.mean(births.column("Maternal Height"))
     height_sd = np.std(births.column("Maternal Height"))

     # Do not change or delete any of the following lines of code
     births.hist("Maternal Height", bins=np.arange(55, 75, 1))
     plt.ylim(-0.003, 0.16)
     plt.scatter((height_mean), 0, color='red', s=50, zorder=3);

     plt.scatter(height_mean + height_sd, 0, marker='^', color='blue', s=50, zorder=3);
     plt.scatter(height_mean - height_sd, 0, marker='^', color='blue', s=50, zorder=3);

We don’t always know how a variable will be distributed, and making assumptions about whether or not a variable will follow a normal distribution is dangerous. However, the Central Limit Theorem identifies one kind of distribution that is always roughly normal: the distribution of the sums and means of many large random samples drawn with replacement from a single distribution (regardless of the distribution’s original shape) will be approximately normally distributed.

Remember that the Central Limit Theorem refers to the distribution of a statistic calculated from a distribution, not the distribution of the original sample or population. If this is confusing, ask a TA!

The next section will explore distributions of sample means, and you will see how the standard deviation of these distributions depends on sample sizes.

3 2. Variability of the Sample Mean

By the Central Limit Theorem, the probability distribution of the mean of a large random sample is roughly normal. The bell curve is centered at the population mean. Some of the sample means are higher and some are lower, but the deviations from the population mean are roughly symmetric on either side, as we have seen repeatedly. Formally, probability theory shows that the sample mean is an unbiased estimate of the population mean.

In our simulations, we also noticed that the means of larger samples tend to be more tightly clustered around the population mean than means of smaller samples. In this section, we will quantify the variability of the sample mean and develop a relation between the variability and the sample size.

Let’s take a look at the salaries of employees of the City of San Francisco in 2014. The mean salary reported by the city government was about $75,463.92.

Note: If you get stuck on any part of this lab, please refer to chapter 14 of the textbook.

Read in the table

1. Read in the file ‘Georgia_Salaries_2021.csv’ and name the table ‘georgia_salaries’
2. Display all of the column names in ‘Georgia_Salaries_2021.csv’
3. Create an array named salaries that contains all of the values in the column ‘SALARY’
4. Find the mean of salaries.

[6]: # insert your code here
     georgia_salaries = Table.read_table('Georgia_Salaries_2021.csv')
     georgia_salaries
     georgia_salaries.labels
     # Note: select returns a one-column table rather than an array;
     # cell [7] pulls the values out with .column("SALARY")
     salaries = georgia_salaries.select("SALARY")
     np.mean(salaries)

[6]: SALARY
     40883.1

[7]: salary_mean = np.mean(salaries.column("SALARY"))
     print('Mean salary of State of Georgia employees in 2022: ', round(salary_mean, 2))

     # Do not change or delete any of the following lines of code
     georgia_salaries.hist('SALARY', bins=np.arange(0, 300000+10000*2, 10000))
     # georgia_salaries.hist('SALARY',bins=np.arange(0, 300000+10000*2, 10000))

     Mean salary of State of Georgia employees in 2022: 40883.06
     /opt/conda/lib/python3.8/site-packages/datascience/tables.py:5206: UserWarning: FixedFormatter should only be used together with FixedLocator
       axis.set_xticklabels(ticks, rotation='vertical')
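The relation between sample size and the variability of the sample mean, which this section sets out to quantify, can be previewed with a small simulation. The sketch below is not part of the lab: it uses only NumPy, and the `population` array is a hypothetical right-skewed stand-in generated from a gamma distribution rather than the lab's salary data. The helper name `sd_of_sample_means` and the sample sizes are likewise assumptions chosen for illustration. It compares the empirical SD of many sample means against the population SD divided by the square root of the sample size.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical right-skewed population (an assumption, NOT the lab's salary data).
population = rng.gamma(shape=2.0, scale=20000.0, size=100_000)
pop_sd = np.std(population)

def sd_of_sample_means(sample_size, repetitions=2000):
    """Empirical SD of the means of many random samples drawn with replacement."""
    samples = rng.choice(population, size=(repetitions, sample_size), replace=True)
    return np.std(samples.mean(axis=1))

# For each sample size, store (empirical SD of sample means, pop_sd / sqrt(n)).
results = {}
for n in (100, 400, 1600):
    results[n] = (sd_of_sample_means(n), pop_sd / np.sqrt(n))
    print(n, round(results[n][0], 1), round(results[n][1], 1))
```

If the two printed columns track each other closely, that is consistent with the SD of the sample mean shrinking in proportion to the square root of the sample size, so quadrupling the sample size roughly halves the spread. Swapping in `salaries.column("SALARY")` for `population` should allow the same comparison on the lab's table, assuming the earlier cells have been run.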
