STATS 10 Assignment 2 (1)

School: University of California, Los Angeles
Subject: Statistics
Date: Apr 3, 2024
Type: pdf
Uploaded by CommodoreCrow17890

STATS 10 Assignment 2
Lalonye Calhoun
006059433
Discussion 3A/B

Exercise 1

Work with lead and copper data obtained from the residents of Flint, Michigan from January-February, 2017. Data are reported in PPB (parts per billion, or μg/L) from each residential testing kit. Remember that “Pb” denotes lead, and “Cu” denotes copper. You can learn more about the Flint water crisis at https://en.wikipedia.org/wiki/Flint_water_crisis.

a. Download the data from the course site and read it into R. Or use the online data link:
read.csv("https://ucla.box.com/shared/static/e9xuft4h3p8fdi4ydoj2hhujee0vmopb.csv")
When you read in the data, name your object “flint”.

b. The EPA states a water source is especially dangerous if the lead level is 15 PPB or greater. What proportion of the locations tested were found to have dangerous lead levels?
- 0.04436229 (about 4.4% of the locations tested)

c. Report the mean copper level for only test sites in the North region.
- 44.6424

d. Report the mean copper level for only test sites with dangerous lead levels (at least 15 PPB).
- 141.9631

e. Report the mean lead and copper levels.
- Mean lead level: 3.383
- Mean copper level: 54.581

f. Create a box plot with a good title for the lead levels.
-

g. Based on what you see in part (f), does the mean seem to be a good measure of center for the data? Report a more useful statistic for this data.
- No. The median is a better measure of center because the lead data are skewed.

Exercise 2

The data here represent life expectancies (Life) and per capita income (Income) in 1974 dollars for 101 countries in the early 1970s. The source of these data is: Leinhardt and Wasserman (1979), New York Times (September 28, 1975, p. E-3). They also appear in Regression Analysis by Ashish Sen and Muni Srivastava. You can access these data in R using:
life <- read.table("https://ucla.box.com/shared/static/rqk4lc030pabv30wknx2ft9jy848ub9n.txt", header = TRUE)

a. Construct a scatterplot of Life against Income. Note: Income should be on the horizontal axis. How does income appear to affect life expectancy?

- Life expectancy tends to rise with income: in higher-income countries people are more likely to live past 70, while in the lowest-income countries life expectancy is closer to 50.

b. Construct the boxplot and histogram of Income. Are there any outliers?
- Boxplot: There are some outliers around 3000 to 5000.
- Histogram: I don't see any outliers.

c. Split the data set into two parts: one for which the Income is strictly below $1000, and one for which the Income is at least $1000. Come up with your own names for these two objects.
- lowerthan1000 = life[life$Income < 1000, ]
- Above1000 = life[life$Income >= 1000, ]

d. Use the data for which the Income is below $1000. Plot Life against Income and compute the correlation coefficient. Hint: use the function cor()
- 0.752886

Exercise 3

The Maas river data contain the concentration of lead and zinc in ppm at 155 locations at the banks of the Maas river in the Netherlands. You can read the data in R as follows:
maas <- read.table("https://ucla.box.com/shared/static/tv3cxooyp6y8fh6gb0qj2cxihj8klg1h.txt", header = TRUE)

a. Compute the summary statistics for lead and zinc using the summary() function.
- Lead:
  Min.   1st Qu.  Median  Mean   3rd Qu.  Max.
  37.0   72.5     123.0   153.4  207.0    654.0
- Zinc:
  Min.   1st Qu.  Median  Mean   3rd Qu.  Max.
  113.0  198.0    326.0   469.7  674.5    1839.0

b. Plot two histograms: one of lead and one of log(lead).
Lead:

Log:

c. Plot log(lead) against log(zinc). What do you observe?
- The relationship is linear and positive: log(lead) increases as log(zinc) increases, so the correlation coefficient is positive.

d. The level of risk for surface soil based on lead concentration in ppm is given on the table below:

The following commands give different colors and sizes on a scatterplot for two variables x, y:
mycolors <- c("green", "orange", "red")  # can be changed to other colors
mylevels <- cut(y, c(0, 100, 1000, 10000))  # the levels, can be changed to other values
mysize <- 19  # the point size, can be changed to other values
plot(x, y, col = mycolors[as.numeric(mylevels)], pch = mysize)
Use similar techniques to give different colors and sizes to the lead concentration at these 155 locations.
-

Exercise 4

The data for this exercise represent approximately the centers (given by longitude and latitude) of each one of the City of Los Angeles neighborhoods. See also the Los Angeles Times project on the City of Los Angeles neighborhoods at: http://projects.latimes.com/mapping-la/neighborhoods/. You can access these data at:
LA <- read.table("https://ucla.box.com/shared/static/d189x2gn5xfmcic0dmnhj2cw94jwvqpa.txt", header = TRUE)

a. Plot the data point locations. Use good formatting for the axes and title. Then add the outline of LA County by typing:
map("county", "california", add = TRUE)
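A minimal sketch of part (a), under stated assumptions: the column names Longitude and Latitude are guesses (the actual column names are not shown here), and a few made-up coordinates stand in for the downloaded file. The county outline is added only if the maps package, which provides map(), is installed.

```r
# Hypothetical stand-in for the LA data: column names are assumptions and
# the coordinates are invented for illustration only.
LA_toy <- data.frame(
  Longitude = c(-118.45, -118.24, -118.30, -118.41, -118.53),
  Latitude  = c(34.07, 34.05, 33.95, 34.20, 34.19)
)

pdf(NULL)  # null graphics device, so the sketch runs without a display

# Scatterplot of the locations with labeled axes and a descriptive title.
plot(LA_toy$Longitude, LA_toy$Latitude,
     xlab = "Longitude (degrees)", ylab = "Latitude (degrees)",
     main = "Approximate centers of Los Angeles neighborhoods",
     pch = 19, col = "steelblue")

# Overlay the county outlines only when the maps package is available.
has_maps <- requireNamespace("maps", quietly = TRUE)
if (has_maps) {
  maps::map("county", "california", add = TRUE)
}

invisible(dev.off())
```

With the real data, LA_toy would be replaced by the LA object read above, and the pdf(NULL) and dev.off() lines can be dropped when working interactively.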


b. Do you see any relationship between income and school performance? Hint: Plot the variable Schools against the variable Income and describe what you see. Ignore the data points on the plot for which Schools = 0. Use what you learned about subsetting with logical statements to first create the objects you need for the scatter plot. Then, create the scatter plot. Alternate methods may only receive half credit.
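The subsetting step described in the hint can be sketched as follows. This is an illustration on invented values, not the assignment's data: the column names Income and Schools come from the question, while LA_toy and LA_scored are hypothetical object names.

```r
# Hypothetical stand-in for the LA data; all values are invented.
LA_toy <- data.frame(
  Income  = c(30000, 45000, 60000, 80000, 120000, 52000),
  Schools = c(650, 0, 740, 800, 890, 0)
)

# Logical subsetting: keep only the rows where Schools is not 0.
LA_scored <- LA_toy[LA_toy$Schools != 0, ]

pdf(NULL)  # null graphics device, so the sketch runs without a display

# Schools on the vertical axis, Income on the horizontal axis.
plot(LA_scored$Income, LA_scored$Schools,
     xlab = "Income", ylab = "Schools",
     main = "School performance against income (Schools = 0 removed)",
     pch = 19)

invisible(dev.off())
```

The logical vector LA_toy$Schools != 0 is TRUE for the rows to keep, so placing it in the row position of the brackets drops the rows where Schools equals 0 before plotting.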

- The scatterplot shows a positive linear trend: as income increases, school performance tends to increase. The two variables appear to be somewhat associated.

Exercise 5

In this exercise, you will work with a dataset containing information about customers of a retail store. The dataset includes the following variables:
a. Customer ID: unique identifier for each customer
b. Age: age of the customer in years
c. Gender: gender of the customer (M for male, F for female)
d. Income: annual income of the customer in dollars
e. Education: education level of the customer (high school, some college, college degree, graduate degree)
f. Marital status: marital status of the customer (single, married, divorced, widowed)
g. Purchase amount: the total amount the customer spent at the store in the past year

Load the data into R:
customer_data <- read.csv("https://ucla.box.com/shared/static/y2y8rcie7mjw2h5t92x9dfcp133tc90h.csv")

a. Are there any missing values in the dataset? If so, how many are there and which variables have missing values?
- There are 22 NAs in total
- 10 NAs in Age
- 5 in Income
- 6 in Purchase amount

b. What is the data type of each variable? Are there any variables that should be converted to a different data type?
Customer ID: stored as numerical but is categorical
Age: numerical
Gender: categorical
Income: numerical
Education: categorical
Marital status: categorical
Purchase amount: numerical

c. Do any numerical variables have outliers or extreme values? If so, how would you handle them? Provide your analysis in R for identifying outliers (e.g., visualization, numerical summary statistics). This is an open-ended question, so please feel free to use any appropriate methods to identify and deal with any outliers or extreme values in the dataset.
- There are no outliers. I checked by drawing a box plot of each numerical variable.

Part II

You may choose to type or write your answers electronically or scan your handwritten solutions. Please ensure that you show all steps and explanations to receive full credit, unless otherwise instructed.

Exercise 1

A study was done on a random sample of 900 college students. The researcher wants to find out if gender would affect people’s body image. The two-way table below summarizes the two variables.

a. In general, are students happy with their body weight? (Hint: Students that are happy with their body weight responded "about right.")
- 600 out of 900 students (about 67%) are happy with their body weight.

b. The researcher wants to compare the differences in body image between females and males. What graph would best visualize the data for this purpose? Explain. (No need to draw the actual plot.)
- A bar chart, because it shows the two groups side by side so their responses can be compared directly.

c. Are female students more likely to feel they are about right than male students? Explain with numerical evidence.
- Yes, because .20% more females feel about right than males.

d. For students who do not feel ‘about right’ with their body image, are there any differences between the two gender groups? (Hint: are they more likely to feel they are overweight or underweight? Do female students and male students feel the same way?)
- More females feel overweight while more males feel underweight. The gap is much larger among females: 130 females feel overweight and only 30 feel underweight. Males differ by only 4 people, with 68 feeling overweight and 72 feeling underweight. So while males outside the "about right" category are about equally likely to feel overweight or underweight, females outside that category are more likely to feel overweight.

Exercise 2

For each of the scatterplots shown, provide a written description that includes the direction, form, and strength of the relationship, along with any outliers that do not fit the general trend. In addition, explain what these characteristics mean in the context of the data.

a. Data on 50 states taken from the U.S. Census shows how the median family income is related to the population (25 years or older) with a college degree or higher.
- The form is linear and the direction is positive, meaning that states with higher median family income tend to have a higher percentage of people with a degree. The association seems to be moderately strong, with some clustering around 50k of income between 15% and 20% with a BA. There is one obvious outlier at 60k of income, with about 30% holding a BA, which is higher than every other state and does not fit the general trend.

b. Consider the relationship between the average amount of fuel used (in liters) to drive a fixed distance in a car (100 km), and the speed at which the car is driven (in km per hour).
- The form is curved and does not become linear until around 50 km/h. From that point the direction is positive, meaning that as speed increases, more fuel is used. There is one outlier, which I am assuming is the start of the car moving, with 20 liters of fuel at around km/h. The association is strong and positive over the range where the data increase linearly.

c.


Exercise 3

A researcher collected data on the median starting salaries and the median mid-career salaries for graduates at a selection of colleges. (Source: The Wall Street Journal, Salary increase by salary type, https://www.wsj.com/public/resources/documents/info-Salaries_for_Colleges_by_Type-sort.html). The data points and the fitted least squares regression line are displayed in the graph below.

a. What is the explanatory variable and response variable?
- The explanatory variable is the median starting salary and the response variable is the median mid-career salary.

b. And why do you think the median salary is used instead of the mean?
- Because salary distributions are skewed, and the median is a better measure of center than the mean for skewed data.

c. Can the median mid-career salary be estimated given a median starting salary of 60 (in thousands of dollars)? Please explain why or why not, and show your calculation and explanation if possible.
-

d. Can the median mid-career salary be estimated given a median starting salary of 100 (in thousands of dollars)? Please explain why or why not, and show your calculation and explanation if possible.
- It could be computed using the same equation as above, but a starting salary of 100 falls outside the range shown in the graph, so the estimate would be an extrapolation and may not be reliable.

Exercise 4

Assume that the relationship between the calories in a five-ounce serving and the % alcohol content for a sample of wines is linear. Use the % alcohol as the explanatory variable, and fit a least squares regression line.

a. Calculate slope and intercept of the regression line.
- The intercept is -77.55 and the slope is 19.92.

b. Report the equation of the regression line and interpret it in the context of the problem.
- Yhat = 19.92x - 77.55, where x is the % alcohol and Yhat is the predicted calories. The slope is positive, so the direction is positive: each additional percentage point of alcohol is associated with about 19.92 more calories per five-ounce serving.

c. Find and interpret the value of the coefficient of determination.
- cor(df$calories, df$alcohol)
- r = 0.9439221, so the coefficient of determination is r^2 = 0.9439221^2, or about 0.891. About 89% of the variation in calories is explained by the linear relationship with % alcohol.

d. Suppose a new point was added to your data: a wine that is 20% alcohol that contains 0 calories. How will that affect the value of r and the slope of the regression line? (No calculation needed)
- The new point lies far below the trend at the highest alcohol level, so it will make the correlation weaker than it already is and pull the slope of the regression line down.

Data table (Source: healthalicious.com)
Calories  % alcohol
122       10.6
119       10.1
121       10.1
123       8.8
129       11.1
236       15.2

Table of summary statistics
           Calories  % alcohol
Mean       141.67    11.03
Std. Dev.  46.34     2.32
r          0.95

Exercise 5

A doctor who believes strongly that antidepressants work better than "talk therapy" tests depressed patients by treating half of them with antidepressants and the other half with talk therapy. The doctor recruited 100 patients for the study. After six months’ treatment, the patients will be evaluated on a scale of 1 to 5, with 5 indicating the greatest improvement. The doctor is designing the study plan.

a. The doctor wants to put the most severe patients in the antidepressants group because he is concerned about those patients’ conditions. Will this affect his ability to compare the effectiveness of the antidepressants and the “talk therapy”? Explain.
- Yes, it will affect his study because it introduces bias. The more severe patients may show a bigger change simply because of how severe their cases are, so severity would be confounded with the treatment.

b. The doctor asks you whether it is acceptable for him to know which treatment each patient receives. Explain why this practice may affect his ability to compare the two groups.
- It may seem harmless, but it could lead him to apply his confirmation bias when evaluating patients, since he already favors antidepressants over talk therapy, and this would compromise the study even more.

c. What improvements to the plan would you recommend?
- I'd suggest a double-blind study so the doctor's personal opinions and biases cannot influence the results. I'd also suggest randomly assigning patients to the two treatments regardless of severity (with their consent, if they feel they can try).
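For Exercise 4 above, the slope, intercept, correlation, and coefficient of determination can be checked in R. The sketch below is one way to do it, assuming the six rows shown in the data table are the complete data set; the values are copied as they appear in this copy, so results may differ slightly from the printed summary statistics. The object names wine, fit, and wine_plus are hypothetical. It also refits the line after adding the part (d) point (20% alcohol, 0 calories) to show how r and the slope respond.

```r
# Six rows copied from the Exercise 4 data table (assumed to be complete).
wine <- data.frame(
  calories = c(122, 119, 121, 123, 129, 236),
  alcohol  = c(10.6, 10.1, 10.1, 8.8, 11.1, 15.2)
)

# Least squares regression of calories on % alcohol.
fit <- lm(calories ~ alcohol, data = wine)
intercept <- unname(coef(fit)["(Intercept)"])
slope     <- unname(coef(fit)["alcohol"])

# Correlation coefficient and coefficient of determination (r squared).
r <- cor(wine$calories, wine$alcohol)
r_squared <- r^2

# Part (d): add the wine with 20% alcohol and 0 calories, then refit.
wine_plus  <- rbind(wine, data.frame(calories = 0, alcohol = 20))
fit_plus   <- lm(calories ~ alcohol, data = wine_plus)
slope_plus <- unname(coef(fit_plus)["alcohol"])
r_plus     <- cor(wine_plus$calories, wine_plus$alcohol)

# Compare the fits with and without the added point.
print(c(slope = slope, slope_plus = slope_plus))
print(c(r = r, r_squared = r_squared, r_plus = r_plus))
```

Comparing slope with slope_plus and r with r_plus shows the effect of the single added point without any hand calculation.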

My work for part two:
