School: University of California, Los Angeles
Course: 10
Subject: Statistics
Date: Apr 3, 2024
STATS 10 Assignment 2
Lalonye Calhoun 006059433
Discussion 3A/B
Exercise 1
Work with lead and copper data obtained from the residents of Flint, Michigan from January-
February, 2017. Data are reported in PPB (parts per billion, or μg/L) from each residential
testing kit. Remember that “Pb” denotes lead, and “Cu” denotes copper. You can learn more
about the Flint water crisis at https://en.wikipedia.org/wiki/Flint_water_crisis.
a. Download the data from the course site and read it into R. Or use online data link:
read.csv("https://ucla.box.com/shared/static/e9xuft4h3p8fdi4ydoj2hhujee0vmopb.csv")
When you read in the data, name your object “flint”.
b. The EPA states a water source is especially dangerous if the lead level is 15 PPB or
greater. What proportion of the locations tested were found to have dangerous lead
levels?
-
0.04436229, or about 4.4% of the locations tested
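The proportion in part (b) can be computed as the mean of a logical vector. This is a minimal sketch, assuming the lead column is named Pb; the small data frame is made up for illustration and is not the Flint data.

```r
# Hypothetical stand-in for the Flint data (made-up values, not the real file)
flint_demo <- data.frame(Pb = c(0, 2, 15, 3, 40, 1, 0, 7))

# TRUE counts as 1 and FALSE as 0, so the mean of the comparison
# is the proportion of kits at or above the 15 PPB threshold
dangerous <- flint_demo$Pb >= 15
prop_dangerous <- mean(dangerous)
print(prop_dangerous)
```

Multiplying the result by 100 gives a percentage, so the proportion and the percentage should not be reported with the same number.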
c. Report the mean copper level for only test sites in the North region.
-
44.6424
d. Report the mean copper level for only test sites with dangerous lead levels (at least 15
PPB).
-
141.9631
e. Report the mean lead and copper levels.
-
Mean copper level: 54.581 PPB
-
Mean lead level: 3.383 PPB
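Parts (c) through (e) all come down to subsetting with a logical condition before taking a mean. The sketch below assumes columns named Pb, Cu and Region; those names and all values are hypothetical stand-ins, not the course data.

```r
# Hypothetical stand-in for the Flint data; column names are assumptions
flint_demo <- data.frame(
  Pb     = c(1, 20, 3, 16, 0, 2),
  Cu     = c(40, 150, 30, 130, 50, 20),
  Region = c("North", "South", "North", "North", "South", "South")
)

# Part (c): mean copper for rows in the North region only
mean_cu_north <- mean(flint_demo$Cu[flint_demo$Region == "North"])

# Part (d): mean copper for rows with dangerous lead levels (Pb >= 15)
mean_cu_dangerous <- mean(flint_demo$Cu[flint_demo$Pb >= 15])

# Part (e): overall means of lead and copper
overall_means <- colMeans(flint_demo[, c("Pb", "Cu")])
print(overall_means)
```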
f. Create a box plot with a good title for the lead levels.
-
g. Based on what you see in part (f), does the mean seem to be a good measure of center
for the data? Report a more useful statistic for this data
-
No. The median would be a better measure of center because the data are skewed, and the extreme values pull the mean away from the typical lead level.
Exercise 2
The data here represent life expectancies (Life) and per capita income (Income) in 1974 dollars
for 101 countries in the early 1970s. The source of these data is: Leinhardt and Wasserman
(1979), New York Times (September 28, 1975, p. E-3). They also appear in Regression
Analysis by Ashish Sen and Muni Srivastava. You can access these data in R using:
life <- read.table("https://ucla.box.com/shared/static/rqk4lc030pabv30wknx2ft9jy848ub9n.txt",
header = TRUE)
a. Construct a scatterplot of Life against Income. Note: Income should be on the
horizontal axis. How does income appear to affect life expectancy?
-
Life expectancy tends to increase with per capita income. Countries with higher incomes have
life expectancies around 70, while countries with low incomes have life expectancies closer to 50.
b. Construct the boxplot and histogram of Income. Are there any outliers?
-
Boxplot: There are several high outliers at incomes of roughly $3000 to $5000
-
Histogram: I don't see any outliers
c. Split the data set into two parts: One for which the Income is strictly below $1000, and
one for which the Income is at least $1000. Come up with your own names for these two
objects.
-
lowerthan1000 <- life[life$Income < 1000, ]
-
Above1000 <- life[life$Income >= 1000, ]
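Because the second group is defined as "at least $1000", the comparison needs >= so that a country with an income of exactly 1000 is not dropped from both pieces. The sketch below uses made-up incomes (object and column values are hypothetical) to check that the two pieces together cover every row.

```r
# Hypothetical incomes, including one exactly at the $1000 boundary
life_demo <- data.frame(Income = c(250, 999, 1000, 1001, 4500),
                        Life   = c(48, 55, 62, 66, 72))

below1000   <- life_demo[life_demo$Income < 1000, ]
atleast1000 <- life_demo[life_demo$Income >= 1000, ]

# Every row should land in exactly one of the two pieces
covers_all <- nrow(below1000) + nrow(atleast1000) == nrow(life_demo)
print(covers_all)
```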
d. Use the data for which the Income is below $1000. Plot Life against Income and
compute the correlation coefficient. Hint: use the function cor()
-
0.752886
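A correlation like the one reported for part (d) comes from subsetting first and then calling cor() on the two columns. The sketch below uses a made-up data frame, so its output will not match the value reported for the real data.

```r
# Hypothetical values standing in for the life expectancy data
life_demo <- data.frame(Income = c(100, 300, 500, 700, 900, 2000, 4000),
                        Life   = c(40, 48, 50, 58, 60, 70, 72))

# Keep only the rows with Income strictly below 1000
low <- life_demo[life_demo$Income < 1000, ]

# Scatterplot with Income on the horizontal axis
plot(low$Income, low$Life,
     xlab = "Per capita income (1974 dollars)",
     ylab = "Life expectancy (years)",
     main = "Life expectancy vs income, Income < 1000")

# Pearson correlation between the two variables in the subset
r_low <- cor(low$Income, low$Life)
print(r_low)
```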
Exercise 3
The Maas river data contain the concentration of lead and zinc in ppm at 155 locations at
the banks of the Maas river in the Netherlands. You can read the data in R as follows:
maas <-
read.table("https://ucla.box.com/shared/static/tv3cxooyp6y8fh6gb0qj2cxihj8klg1h.txt",
header = TRUE)
a. Compute the summary statistics for lead and zinc using the summary() function.
-
Lead:
Min.   1st Qu.   Median   Mean    3rd Qu.   Max.
37.0   72.5      123.0    153.4   207.0     654.0
-
Zinc:
Min.    1st Qu.   Median   Mean    3rd Qu.   Max.
113.0   198.0     326.0    469.7   674.5     1839.0
-
b. Plot two histograms: one of lead and one of log(lead).
Lead:
Log:
c. Plot log(lead) against log(zinc). What do you observe?
-
log(lead) and log(zinc) have a positive, roughly linear association: locations with higher zinc concentrations tend to have higher lead concentrations.
-
d. The level of risk for surface soil based on lead concentration in ppm is given on the
table below:
The following commands give different colors and sizes on a scatterplot for two variables, x and y:
mycolors <- c("green", "orange", "red") # can be changed to other colors
mylevels <- cut(y, c(0, 100, 1000, 10000)) # the levels, can be changed to other values
mysize <- 19 # the plotting symbol passed to pch, can be changed to other values
plot(x, y, col = mycolors[as.numeric(mylevels)], pch = mysize)
Use similar techniques to give different colors and sizes to the lead concentration at
these 155 locations.
-
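One way to apply the cut() technique to lead concentrations is sketched below. The risk table referred to in the exercise is not reproduced in this copy, so the cut points are placeholders only, and the coordinates and concentrations are made-up values rather than the Maas data.

```r
# Hypothetical lead concentrations (ppm) and coordinates; not the Maas data
lead <- c(45, 80, 160, 390, 520, 95)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2, 1, 4, 3, 6, 5)

# Placeholder cut points: substitute the values from the risk table
risk_breaks <- c(0, 150, 400, 1000)
mycolors <- c("green", "orange", "red")
mysizes <- c(0.8, 1.4, 2.0)  # cex values, one per risk level

# cut() assigns each concentration to an interval; as.numeric() turns
# the resulting factor into the index 1, 2 or 3
level_index <- as.numeric(cut(lead, risk_breaks))

plot(x, y, col = mycolors[level_index], cex = mysizes[level_index],
     pch = 19, xlab = "x coordinate", ylab = "y coordinate",
     main = "Lead concentration by risk level (illustrative)")
```

Indexing both the color vector and the size vector by the same level index keeps the color and the point size of each location in agreement.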
Exercise 4
The data for this exercise represent approximately the centers (given by longitude and latitude)
of each one of the City of Los Angeles neighborhoods. See also the Los Angeles Times project
on the City of Los Angeles neighborhoods at: http://projects.latimes.com/mapping-la/neighborhoods/. You can access these data at:
LA <- read.table("https://ucla.box.com/shared/static/d189x2gn5xfmcic0dmnhj2cw94jwvqpa.txt",
header=TRUE)
a. Plot the data point locations. Use good formatting for the axes and title. Then add the
outline of LA County by typing (map() comes from the maps package, so load it first with library(maps)):
map("county", "california", add = TRUE)
-
b. Do you see any relationship between income and school performance? Hint: Plot the
variable Schools against the variable Income and describe what you see. Ignore the data
points on the plot for which Schools = 0. Use what you learned about subsetting with
logical statements to first create the objects you need for the scatter plot. Then, create
the scatter plot. Alternate methods may only receive half credit.
-
The scatterplot shows a positive, roughly linear trend: neighborhoods with higher income tend
to have higher school performance. The association appears to be fairly weak.
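The subsetting step the question asks for can be sketched as follows. The data frame is a made-up stand-in with the two column names used in the exercise (Schools and Income); the values are hypothetical.

```r
# Hypothetical stand-in for the LA neighborhoods data
LA_demo <- data.frame(Income  = c(30000, 45000, 60000, 80000, 52000),
                      Schools = c(650, 0, 780, 850, 0))

# Logical subsetting: keep only neighborhoods with a nonzero Schools value
LA_sub <- LA_demo[LA_demo$Schools != 0, ]

# Scatterplot of school performance against income for the kept rows
plot(LA_sub$Income, LA_sub$Schools,
     xlab = "Income", ylab = "Schools",
     main = "School performance vs income (Schools > 0 only)")
```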
Exercise 5
In this exercise, you will work with a dataset containing information about customers of a
retail store.
The dataset includes the following variables:
a. Customer ID: unique identifier for each customer
b. Age: age of the customer in years
c. Gender: gender of the customer (M for male, F for female)
d. Income: annual income of the customer in dollars
e. Education: education level of the customer (high school, some college, college degree,
graduate degree)
f. Marital status: marital status of the customer (single, married, divorced, widowed)
g. Purchase amount: the total amount the customer spent at the store in the past year
Load the data into R:
customer_data <-
read.csv("https://ucla.box.com/shared/static/y2y8rcie7mjw2h5t92x9dfcp133tc90h.csv")
a. Are there any missing values in the dataset? If so, how many are there and which
variables have missing values?
-
There are 22 NAs in total
-
10 in Age
-
5 in Income
-
6 in Purchase amount
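Missing values can be counted per variable and in total with is.na(). The sketch below uses a small made-up data frame; the column names are assumptions about how the variables appear in the file.

```r
# Hypothetical stand-in for the customer data (made-up values)
customer_demo <- data.frame(
  Age             = c(25, NA, 40, 31),
  Income          = c(50000, 62000, NA, NA),
  Purchase_Amount = c(120, 80, 300, NA)
)

# is.na() gives a TRUE/FALSE matrix; column sums count the NAs per variable
na_by_column <- colSums(is.na(customer_demo))

# Summing the whole matrix gives the total number of missing values
total_na <- sum(is.na(customer_demo))

print(na_by_column)
print(total_na)
```

The per-variable counts should add up to the total, which is a quick consistency check on the reported numbers.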
b. What is the data type of each variable? Are there any variables that should be
converted to a different data type?
Customer ID: stored as numerical, but it is really categorical (an identifier), so it should be converted to a factor or character
Age: numerical
Gender: categorical
Income: numerical
Education: categorical
Marital status: Categorical
Purchase amount: numerical
c. Do any numerical variables have outliers or extreme values? If so, how would you
handle them? Provide your analysis in R for identifying outliers (e.g., visualization,
numerical summary statistics). This is an open-ended question, so please feel free to use
any appropriate methods to identify and deal with any outliers or extreme values in the
dataset.
-
There are no outliers. I checked by drawing a box plot of each numerical variable.
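A numerical check can back up the box plots. The sketch below applies the common 1.5 x IQR rule to a made-up vector of purchase amounts; the values are hypothetical and chosen so that one of them is flagged.

```r
# Hypothetical purchase amounts with one extreme value
purchases <- c(120, 135, 150, 160, 170, 185, 200, 950)

# Quartiles and interquartile range
q <- quantile(purchases, c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Fences at 1.5 IQRs beyond the quartiles
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# Values outside the fences are flagged as potential outliers
flagged <- purchases[purchases < lower_fence | purchases > upper_fence]
print(flagged)
```

If the same check returns an empty vector for each numerical variable, that supports the conclusion that there are no outliers.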
Part II
You may choose to type or write your answers electronically or scan your handwritten
solutions. Please ensure that you show all steps and explanations to receive full credit,
unless otherwise instructed.
Exercise 1
A study was done on a random sample of 900 college students. The researcher wants to find
out if gender would affect people’s body image. The two-way table below summarizes the
two variables.
a. In general, are students happy with their body weight? (Hint: Students that are happy
with their body weight responded "about right.")
-
Yes, in general. 600 out of 900 students (about 66.7%) are happy with their body weight.
b. If the researcher wants to compare the differences in body image between females and
males, what graph would best visualize the data for this purpose? Explain. (No need to draw
the actual plot.)
-
A side-by-side bar chart, because both variables are categorical and the bars for females and males can be compared directly within each response category.
c. Are female students more likely to feel they are about right than male students?
Explain with numerical evidence.
-
Yes, because the proportion of females who feel about right is 0.20 (20 percentage points) higher than the proportion of males.
d. For students who do not feel ‘about right’ with their body image, are there any
differences between the two gender groups? (Hint: are they more likely to feel they are
overweight or underweight? Do female students and male students feel the same way?)
-
More females feel overweight, while slightly more males feel underweight. The gap is much
larger for females: 130 females feel overweight and only 30 feel underweight. Males differ
by only 4 people, with 68 feeling overweight and 72 feeling underweight. So males who do not
feel about right are split almost evenly between the two responses, while females outside of
that category are much more likely to feel overweight.
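The comparison in part (d) is clearer as conditional proportions within each gender. The sketch below uses only the four counts quoted in the answer (130, 30, 68 and 72); the object names are illustrative.

```r
# Counts for students who do not feel "about right", from the answer above
counts <- matrix(c(130, 30,
                    68, 72),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Gender  = c("Female", "Male"),
                                 Feeling = c("Overweight", "Underweight")))

# Row proportions: within each gender, the share giving each response
row_props <- prop.table(counts, margin = 1)
print(round(row_props, 3))
```

Among females in this group, 130 of 160 feel overweight, while among males the split is 68 to 72 out of 140.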
Exercise 2
For each of the scatterplots shown, provide a written description that includes the
direction, form, and strength of the relationship, along with any outliers that do not fit the
general trend. In addition, explain what these characteristics mean in the context of the
data.
a. Data on 50 states taken from the U.S. Census shows how the median family income is
related to the population (25 years or older) with a college degree or higher.
-
The form is linear and the direction is positive: states with a higher median family income
tend to have a higher percentage of residents with a college degree. The association is
moderately strong, with a cluster of states around 50k in income and 15-20% holding a degree.
There is one clear outlier at about 60k in income with about 30% holding a degree, which is
well above what the overall trend would suggest for that income.
b. Consider the relationship between the average amount of fuel used (in liters) to drive a
fixed distance in a car (100 km), and the speed at which the car is driven (in km per hour).
-
The form is curved and does not become roughly linear until around 50 km/h. The
direction is positive: as speed increases, more fuel is used. There is
one outlier, which I am assuming is the car starting to move, with 20 liters of fuel at
around km/h. The association is strong, since beyond that point the data
increase in a nearly linear pattern.
c.
Exercise 3
A researcher collected data on the median starting salaries and the median mid-career
salaries for graduates at a selection of colleges. (Source: The Wall Street Journal, Salary
increase by salary type,
https://www.wsj.com/public/resources/documents/info-Salaries_for_Colleges_by_Type-sort.html). The data points and the fitted least squares regression line are displayed in
the graph below.
a. What is the explanatory variable and response variable?
-
The explanatory variable is the median starting salary and the response variable is the
median mid-career salary
b. Why do you think the median salary is used instead of the mean?
-
Because salary distributions are usually skewed, and the median is resistant to extreme
values, so it describes a typical salary better than the mean does.
c. Can the median mid-career salary be estimated given a median starting salary of 60 (in
thousands of dollars)? Please explain why or why not, and show your calculation and
explanation if possible.
-
d. Can the median mid-career salary be estimated given a median starting salary of 100
(in thousands of dollars)? Please explain why or why not, and show your calculation and
explanation if possible.
-
The same regression equation could be used, but a starting salary of 100 falls outside the
range of the data shown in the graph, so the estimate would be an extrapolation and should not be trusted.
Exercise 4
Assume that the relationship between the calories in a five-ounce serving and the %
alcohol content for a sample of wines is linear. Use the % alcohol as the explanatory
variable, and fit a least squares regression line.
a. Calculate slope and intercept of the regression line.
-
The intercept is -77.55 and slope is 19.92
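The slope and intercept can be checked in R with lm(), using the six wines listed in the exercise's data table. This is only a sketch: the data frame and column names are illustrative, and the printed values may differ slightly from hand calculations based on the rounded summary statistics.

```r
# The six wines from the exercise's data table
wine <- data.frame(
  calories = c(122, 119, 121, 123, 129, 236),
  alcohol  = c(10.6, 10.1, 10.1, 8.8, 11.1, 15.2)
)

# Least squares fit with % alcohol as the explanatory variable
fit <- lm(calories ~ alcohol, data = wine)
print(coef(fit))

# The same line from summary statistics:
# slope = r * sd(y) / sd(x), intercept = mean(y) - slope * mean(x)
b1 <- cor(wine$alcohol, wine$calories) * sd(wine$calories) / sd(wine$alcohol)
b0 <- mean(wine$calories) - b1 * mean(wine$alcohol)
print(c(intercept = b0, slope = b1))
```

Both approaches give the same line, so the formula version is a useful check when only the summary table is available.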
b. Report the equation of the regression line and interpret it in the context of the problem.
-
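Yhat = 19.92x - 77.55, where x is the % alcohol and Yhat is the predicted calories in a five-ounce serving. The slope is positive, so the direction is positive: each additional percentage point of alcohol is associated with about 19.92 more calories on average. The intercept of -77.55 would be the predicted calories at 0% alcohol, which has no practical meaning here.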
c. Find and interpret the value of the coefficient of determination.
-
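cor(df$calories, df$alcohol)^2
-
r = 0.9439221, so r^2 = 0.9439221^2, which is about 0.891. About 89% of the variation in calories is explained by the linear relationship with % alcohol.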
-
d. Suppose a new point was added to your data: a wine that is 20% alcohol that contains
0 calories. How will that affect the value of r and the slope of the regression line? (No
calculation needed)
Data table (Source: healthalicious.com)
Calories    % alcohol
122         10.6
119         10.1
121         10.1
123         8.8
129         11.1
236         15.2
Table of summary statistics
            Calories    % alcohol
Mean        141.67      11.03
Std. Dev.   46.34       2.32
r           0.95
-
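It will make r much weaker and the slope much flatter, because a wine with 20% alcohol and 0 calories lies far below the trend at the highest alcohol value. With only six other points, a single influential outlier like this can even turn r and the slope negative.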
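Although no calculation is required, the effect of the extra point is easy to see by refitting with and without it. The sketch below uses the six wines from the data table plus the hypothetical wine from part (d); the variable names are illustrative.

```r
# The six wines from the data table
calories <- c(122, 119, 121, 123, 129, 236)
alcohol  <- c(10.6, 10.1, 10.1, 8.8, 11.1, 15.2)

# Correlation and slope for the original data
r_before     <- cor(alcohol, calories)
slope_before <- unname(coef(lm(calories ~ alcohol))[2])

# Add the new wine: 20% alcohol, 0 calories
calories_new <- c(calories, 0)
alcohol_new  <- c(alcohol, 20)

# Correlation and slope after adding the point
r_after     <- cor(alcohol_new, calories_new)
slope_after <- unname(coef(lm(calories_new ~ alcohol_new))[2])

print(c(r_before = r_before, r_after = r_after))
print(c(slope_before = slope_before, slope_after = slope_after))
```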
Exercise 5
A doctor who believes strongly that antidepressants work better than "talk therapy" tests
depressed patients by treating half of them with antidepressants and the other half with talk
therapy. The doctor recruited 100 patients for the study. After six months’ treatment, the patients
will be evaluated on a scale of 1 to 5, with 5 indicating the greatest improvement. The doctor is
designing the study plan.
a. The doctor wants to put the most severe patients in the antidepressants group because he is
concerned about those patients’ conditions. Will this affect his ability to compare the
effectiveness of the antidepressants and the “talk therapy”? Explain.
-
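Yes, it will affect his study because it introduces bias. Severity becomes a confounding
variable: the two groups differ before treatment begins, so a difference in outcomes could be
due to severity rather than the treatment. The more severe patients may also notice a bigger
change simply because of how severe their cases are.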
b. The doctor asks you whether it is acceptable for him to know which treatment each patient
receives. Explain why this practice may affect his ability to compare the two groups.
-
It may be okay in principle, but it could lead him to act on his confirmation bias. He
already favors antidepressants over talk therapy, so knowing each patient's treatment could
influence how he evaluates them, which would compromise the comparison even more.
-
c. What improvements to the plan would you recommend?
-
I'd suggest a double-blind study so that the doctor's personal opinions and biases cannot
influence the results. I'd also suggest randomly assigning patients to the two treatments
regardless of severity (with their consent, if they feel they can try).
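Random assignment of the 100 patients to two groups of 50 can be sketched in R as follows. The patient IDs, treatment labels and seed are illustrative choices, not part of the study description.

```r
set.seed(10)  # arbitrary seed so the assignment can be reproduced

# Hypothetical patient IDs 1 through 100
patient_id <- 1:100

# Shuffle 50 labels of each treatment and deal them out to the patients,
# so severity plays no role in who receives which treatment
treatment <- sample(rep(c("antidepressant", "talk therapy"), each = 50))

assignment <- data.frame(patient_id, treatment)
print(table(assignment$treatment))
```

Shuffling a fixed set of labels guarantees exactly 50 patients per group, which independent coin flips for each patient would not.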
My work for part two: