Homework 7 for ISEN 355 (System Simulation)
(C) 2023 David Eckman
Due Date: Upload to Canvas by 9:00pm (CDT) on Friday, March 24.

In [ ]: # Import some useful Python packages.
        import numpy as np
        import matplotlib.pyplot as plt
        import scipy.stats
        import pandas as pd

Problem 1. (50 points) The file canvas_total_times.csv contains the total time (in hours) each student has spent on the ISEN 355 Canvas page. The instructor, TA, and graders were excluded from the data.

In [ ]: # Import the dataset.
        mydata = pd.read_csv('canvas_total_times.csv', header=None)
        # Convert the data to a list.
        times = mydata[mydata.columns[0]].values.tolist()

(a) (2 points) Do you think it is reasonable to assume that the data are independent and identically distributed? Why or why not?

If we think of the distribution as reflecting the Canvas habits of the "typical" ISEN 355 student, then it is reasonable to assume that the data are identically distributed. It would be reasonable to assume the data are independent if we believe that each student visits the Canvas webpage on their own device and on their own time. There are, however, a number of factors that could make the Canvas times dependent. For example, all students must log on to Canvas to take the timed quizzes during the labs. In addition, some students complete the homework assignments in pairs, so if they are working off of only one computer, then Canvas is open on only one of their accounts; likewise for the group project assignments. Given that total times are in the tens of hours, and students may not have Canvas open most of the time while working on the homework and the project, these dependencies may be minimal.

(b) (3 points) Plot a histogram of the times using 10 bins. Label the axes of your plot. Comment on the shape of the data.

In [ ]: plt.hist(times, bins=10)
        plt.xlabel("Time Spent on Canvas (hours)")
        plt.ylabel("Frequency")
        plt.title("Histogram of Time Spent on Canvas")
        plt.show()
The histogram shows that most students are on Canvas for less than 30 hours, while only a few are on Canvas for more than 50 hours. The histogram has a right tail and does not look symmetric.

(c) (4 points) How many outliers do you see in the data? What is a possible explanation for these high values? In the rest of the problem, we will fit various distributions to the time data and assess their fit. What do you think we should do about the outliers? Explain your reasoning.

From the histogram, it appears that 3 students spent more than 50 hours on Canvas, thus there are 3 outliers. It is possible that these 3 students simply left Canvas open in one of their browser tabs. If this is indeed the case, then we might consider the observations to be erroneous, in which case it would be advisable to remove the outliers. If instead we were to discover that these students actually intentionally viewed the Canvas page for 50+ hours, then we should leave the outliers in the data.

In [ ]: # Although not necessary, we could use the following piece of code to get the number of outliers:
        outliers = [value for value in times if value > 50]
        num_outliers = len(outliers)
        print(num_outliers)

3

For the rest of the assignment, use ALL of the data in canvas_total_times.csv .

(d) (3 points) Use scipy.stats.rv_continuous.fit() to compute maximum likelihood estimators (MLEs) for the mean ( loc ) and standard deviation ( scale ) of a normal distribution fitted to the data. See the accompanying documentation to see how the function is called, as well as the documentation about scipy.stats.norm . The term rv_continuous is not part of the function call; instead, you would specifically use scipy.stats.norm.fit() . Report the MLEs. (Note: This .fit method is different from
the one we used in the previous homework to fit a Poisson distribution. You should not need to update SciPy to use this function.) (Note: For functions like .fit() that return multiple outputs, you will want to define multiple variables on the left-hand side of the equal sign. For example, a, b = myfunction(c) would be a way to separate two variables ( a and b ) returned by one call of the function myfunction() .)

In [ ]: loc_mle, scale_mle = scipy.stats.norm.fit(times)
        print(f"The MLEs of the fitted normal distribution are a mean of {round(loc_mle, 2)} and a standard deviation of {round(scale_mle, 2)}.")

The MLEs of the fitted normal distribution are a mean of 17.17 and a standard deviation of 10.7.

(e) (2 points) If you were to generate a single time from the fitted normal distribution, what is the probability it would be negative? Based on the mean and standard deviation of the fitted normal distribution, should you be concerned about using this input distribution? Explain your reasoning.

In [ ]: prob_negative = scipy.stats.norm.cdf(x=0, loc=loc_mle, scale=scale_mle)
        print(f"The probability of generating a negative value is {round(prob_negative, 3)}.")

The probability of generating a negative value is 0.054.

We should be concerned, because the chance of generating a negative value is about 1/20, which is quite high. If the simulation model needed to generate many Canvas times, it would very likely generate a negative value at some point, which is an impossible amount of time to be on Canvas.

(f) (3 points) Reproduce your histogram from part (b) with density=True . Superimpose the pdf of the fitted normal distribution over the interval [0, 70]. Use x = np.linspace(0, 70, 71) . Comment on the visual fit of the normal distribution.
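As a side note to part (e), the "very likely at some point" claim can be quantified: the probability of at least one negative value among n independent draws is 1 - (1 - p)^n. A minimal sketch, assuming the MLEs reported above (the replication counts n are illustrative):

```python
import scipy.stats

# MLEs of the fitted normal from part (d), assumed here for illustration.
loc_mle, scale_mle = 17.17, 10.7

# P(single draw < 0) under the fitted normal.
p_neg = scipy.stats.norm.cdf(0, loc=loc_mle, scale=scale_mle)

# P(at least one negative among n independent draws) = 1 - (1 - p)^n.
for n in [10, 100, 1000]:
    p_any = 1 - (1 - p_neg) ** n
    print(f"n = {n:4d}: P(at least one negative draw) = {p_any:.3f}")
```

Even at n = 100 draws, a negative value is nearly certain, which supports the concern about using this input distribution.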
In [ ]: plt.hist(times, bins=10, density=True)
        x = np.linspace(0, 70, 71)
        plt.plot(x, scipy.stats.norm.pdf(x, loc_mle, scale_mle))
        plt.xlabel("Time Spent on Canvas (hours)")
        plt.ylabel("Density")
        plt.title("Histogram / Normal PDF")
        plt.show()
The normal distribution does not fit the data well. Even if the outliers were excluded, the fit for the rest of the data looks poor. In particular, the normal distribution puts too much probability density on times <= 5 hours (including negative values!), and not enough on the range 5-15 hours, where most of the data is concentrated. The misalignment between the mode (peak) of the normal distribution and the tallest bars in the histogram is possibly explained by the outliers shifting the normal distribution to the right.

(g) (4 points) Use scipy.stats.rv_continuous.fit() to compute maximum likelihood estimators (MLEs) for the shape parameter ( a ), location parameter ( loc ), and scale parameter ( scale ) of a gamma distribution fitted to the data. See the documentation about scipy.stats.gamma . Report the MLEs.

In [ ]: a_mle, loc_mle, scale_mle = scipy.stats.gamma.fit(times)
        print(f"The MLEs of the fitted gamma distribution are a = {round(a_mle, 3)}, loc = {round(loc_mle, 3)}, and scale = {round(scale_mle, 3)}.")

The MLEs of the fitted gamma distribution are a = 1.567, loc = 4.484, and scale = 8.096.

(h) (4 points) You are about to conduct a Chi-Square goodness-of-fit test for the fitted gamma distribution using 10 equiprobable bins. In preparation, you will need to find the breakpoints of the bins, count the observed number of data points in each bin, and compute the expected number of data points in each bin.

1. To get the endpoints of the bins, use scipy.stats.gamma.ppf() to calculate quantiles of the fitted gamma distribution at q = np.linspace(0, 1, 11) .
2. To get the observed number in each bin, use np.histogram() (documentation found here). (Note: The function np.histogram() returns multiple outputs.)
3. To get the expected number in each bin, note that you have 10 equiprobable bins. The expected numbers need to sum up to the total number of observations.
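The three steps above, plus the resulting test statistic, can be sketched as follows. This is a sketch rather than the graded solution: the gamma MLEs are taken from part (g), and a synthetic sample stands in for the `times` list loaded earlier; the degrees of freedom subtract one for the constraint and three for the estimated parameters.

```python
import numpy as np
import scipy.stats

# Gamma MLEs from part (g), assumed here for illustration.
a_mle, loc_mle, scale_mle = 1.567, 4.484, 8.096

# Stand-in sample of size 100; in the homework, use the `times` list
# loaded from canvas_total_times.csv instead.
rng = np.random.default_rng(0)
times = scipy.stats.gamma.rvs(a_mle, loc=loc_mle, scale=scale_mle,
                              size=100, random_state=rng)
n, k = len(times), 10

# 1. Bin endpoints: quantiles of the fitted gamma at q = 0, 0.1, ..., 1.
#    The first endpoint is loc and the last is infinity.
q = np.linspace(0, 1, k + 1)
endpoints = scipy.stats.gamma.ppf(q, a_mle, loc=loc_mle, scale=scale_mle)

# 2. Observed counts per bin; np.histogram returns (counts, bin_edges).
observed, _ = np.histogram(times, bins=endpoints)

# 3. Expected counts: k equiprobable bins each expect n / k observations,
#    which sum to n as required.
expected = np.full(k, n / k)

# Chi-square statistic: sum over bins of (O - E)^2 / E, compared against
# a chi-square distribution with k - 1 - 3 degrees of freedom.
chi2_stat = np.sum((observed - expected) ** 2 / expected)
p_value = scipy.stats.chi2.sf(chi2_stat, df=k - 1 - 3)
print(f"chi-square statistic = {chi2_stat:.3f}, p-value = {p_value:.3f}")
```

Because the bins are equiprobable under the fitted gamma, every bin comfortably satisfies the usual rule of thumb of at least 5 expected observations.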