Lab-2-Assignment

School

University of California, Los Angeles

Course

10

Subject

Arts Humanities

Date

Dec 6, 2023

Uploaded by MegaClover10792

Lab 2 Assignment
2023-10-17

Intro Content

    1 >= 1
    ## [1] TRUE
    2 == 3
    ## [1] FALSE
    10 == 2 * 5
    ## [1] TRUE
    "word1" != "word2"
    ## [1] TRUE
    c(1, 2, 10, 50, -4, 1/2) <= 10
    ## [1] TRUE TRUE TRUE FALSE TRUE TRUE
    NCbirths = read.csv("births2023-2.csv")

Exercises

1a) Download the data ‘recent-grads.csv’ from Bruinlearn and read it into R. When you read in the data, name your object “grads”. How many variables and observations does the data have? Hint: Try dim(grads) to find the answer.

There are 167 observations and 14 variables.

    grads = read.csv("recent-grads.csv")
    dim(grads)
    ## [1] 167 14

1b) The Bureau of Labor Statistics, U.S. Department of Labor reports the unemployment rate for the college graduates was 2.7 percent in September 2023. What proportion of the majors had lower unemployment rates than 2.7%?
    # The mean of a logical vector is the proportion of TRUE values
    mean(grads$Unemployment_rate < 0.027)
    ## [1] 0.07784431

About 7.8% of the majors had an unemployment rate below 2.7%.

1c) Create a bar chart for ‘Major_category’ variable in the data. What are the three majors with highest frequencies?

Engineering, Humanities & Liberal Arts, and Education are the major categories with the highest frequencies.

    library(mosaic)
    ## Registered S3 method overwritten by 'mosaic':
    ##   method                           from
    ##   fortify.SpatialPolygonsDataFrame ggplot2
    ##
    ## The 'mosaic' package masks several functions from core packages in order to add
    ## additional features. The original behavior of these functions should not be affected by this.
    ##
    ## Attaching package: 'mosaic'
    ## The following objects are masked from 'package:dplyr':
    ##
    ##     count, do, tally
    ## The following object is masked from 'package:Matrix':
    ##
    ##     mean
    ## The following object is masked from 'package:ggplot2':
    ##
    ##     stat
    ## The following objects are masked from 'package:stats':
    ##
    ##     binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
    ##     quantile, sd, t.test, var
    ## The following objects are masked from 'package:base':
    ##
    ##     max, mean, min, prod, range, sample, sum
    barchart(grads$Major_category)
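
The reading of the bar chart can be cross-checked numerically by counting and sorting the categories. The following is a minimal sketch, assuming a short made-up vector of labels in place of grads$Major_category; the values are hypothetical and are not taken from recent-grads.csv.

```r
# Hypothetical category labels standing in for grads$Major_category
major_category <- c("Engineering", "Engineering", "Engineering", "Engineering",
                    "Education", "Education", "Education",
                    "Arts", "Arts",
                    "Health")

# Count each category, then sort from most to least frequent
freq <- sort(table(major_category), decreasing = TRUE)
print(freq)

# Names of the three most frequent categories
top3 <- names(freq)[1:3]
print(top3)
```

With the real data, the same two calls applied to grads$Major_category would list the categories in the order the bars should be ranked.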
[Figure: bar chart of Major_category frequencies (axis: Freq, 0 to 25), one bar per category: Agriculture & Natural Resources, Arts, Biology & Life Science, Business, Communications & Journalism, Computers & Mathematics, Education, Engineering, Health, Humanities & Liberal Arts, Industrial Arts & Consumer Services, Interdisciplinary, Law & Public Policy, Physical Sciences, Psychology & Social Work, Social Science]

1d) Report the mean and standard deviation of the ‘Median’ earnings of the majors in ‘Humanities & Liberal Arts’ major category.

    # Keep only the Median earnings of majors in the Humanities & Liberal Arts category
    HLA_medians = grads$Median[grads$Major_category == "Humanities & Liberal Arts"]
    mean(HLA_medians)
    ## [1] 31913.33
    sd(HLA_medians)
    ## [1] 3393.032

1e) Report the mean and standard deviation of the ‘Median’ earnings of all majors that are NOT in ‘Humanities & Liberal Arts’ major category. How are they different from the results in d)?

The mean of the median earnings for majors outside Humanities & Liberal Arts is 7701.80 higher than the mean in d). The standard deviation is also larger than in d), by 5727.411. On average, then, majors outside Humanities & Liberal Arts have higher median earnings, and those earnings vary more from major to major.

    # Same subsetting as in d), with != selecting every other category
    Not_HLA_medians = grads$Median[grads$Major_category != "Humanities & Liberal Arts"]
    mean(Not_HLA_medians)
    ## [1] 39615.13
    sd(Not_HLA_medians)
    ## [1] 9120.443

1f) Create a box plot for the ‘Median’ earning of all observations in the data with a good title.

    boxplot(grads$Median)
    title("Boxplot of Median Salaries of Different Majors")

[Figure: boxplot titled "Boxplot of Median Salaries of Different Majors"]

1g) Based on what you see in part (f), describe the shape of the distribution. Does the mean seem to be a good measure of center for the data? Report a more useful statistic for this data.

The distribution appears to be unimodal and slightly skewed to the right. The mean is not a good measure of center here, because it is pulled toward the long right tail; it describes the center well only when a distribution is roughly symmetric. Since this distribution is skewed, the median is a better measure of center.

Section 2

    life_expectancy = read.csv("life_expectancy.csv")

2a) Construct a scatterplot of Life expectancy against GDP per capita. Note: Life expectancy should be on the vertical axis. How does GDP per capita appear to be associated with life expectancy?

The relationship is not linear, but higher GDP per capita is associated with longer life expectancy. The scatterplot shows no countries that combine a high life expectancy with a low GDP per capita.
    # First argument is the horizontal axis, second is the vertical axis
    plot(life_expectancy$GDP.per.capita, life_expectancy$Life.expectancy)

[Figure: scatterplot of life_expectancy$Life.expectancy (vertical axis) against life_expectancy$GDP.per.capita (horizontal axis)]

2b) Construct the boxplot and histogram of ‘GDP per capita.’ Describe the distribution based on shape, center and variability. Are there any outliers found in the boxplot?

The distribution is unimodal and skewed to the right. The median is the appropriate measure of center, and in the boxplot it sits close to the low end of the scale. Most of the data lie between 0 and 50000, so the IQR is small relative to the full spread of the values. Although no values look extreme in the histogram, the boxplot does flag some values as outliers.

    boxplot(life_expectancy$GDP.per.capita)
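
The outliers that a boxplot flags can also be listed directly with the usual 1.5 × IQR rule. This is a minimal sketch on made-up, right-skewed values; the vector gdp is hypothetical and does not come from life_expectancy.csv. Note that boxplot() computes its box from hinges, which can differ slightly from the quartiles returned by quantile().

```r
# Hypothetical right-skewed values standing in for GDP per capita
gdp <- c(1200, 2500, 3100, 4800, 5200, 7400, 9800, 15000, 60000, 120000)

# Quartiles and interquartile range
q <- unname(quantile(gdp, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Fences: values beyond these are treated as outliers
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

outliers <- gdp[gdp < lower_fence | gdp > upper_fence]
print(outliers)
```

In this toy vector only the two largest values fall beyond the upper fence, which is the typical pattern for a right-skewed variable.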
[Figure: boxplot of life_expectancy$GDP.per.capita]

    histogram(life_expectancy$GDP.per.capita)
[Figure: histogram of life_expectancy$GDP.per.capita (vertical axis: Density)]

2c) Report the center (typical value) of the ‘GDP per capita’ variable. Use the appropriate measures to find the center (typical value).

    median(life_expectancy$GDP.per.capita)
    ## [1] 4873

2d) Make a subset of the data for the year of 2018, and name the data as ‘life2018.’ Suppose that a country generally should have a GDP per capita greater than $16,000 to be considered a ‘developed’ nation. List the names of the countries (Entity) that are considered as developed nations in 2018.

    # Keep the rows for 2018, then list the entities above the threshold
    life2018 = life_expectancy[life_expectancy$Year == 2018, ]
    life2018$Entity[life2018$GDP.per.capita > 16000]
    ##  [1] "Argentina"            "Australia"            "Austria"
    ##  [4] "Azerbaijan"           "Bahrain"              "Belarus"
    ##  [7] "Belgium"              "Bulgaria"             "Canada"
    ## [10] "Chile"                "Croatia"              "Cyprus"
    ## [13] "Czechia"              "Denmark"              "Equatorial Guinea"
    ## [16] "Estonia"              "Finland"              "France"
    ## [19] "Gabon"                "Germany"              "Greece"
    ## [22] "Hong Kong"            "Hungary"              "Iceland"
    ## [25] "Iran"                 "Ireland"              "Israel"
    ## [28] "Italy"                "Japan"                "Kazakhstan"
    ## [31] "Kuwait"               "Latvia"               "Lithuania"
    ## [34] "Luxembourg"           "Malaysia"             "Malta"
    ## [37] "Mauritius"            "Mexico"               "Montenegro"
    ## [40] "Netherlands"          "New Zealand"          "Norway"
    ## [43] "Oman"                 "Panama"               "Poland"
    ## [46] "Portugal"             "Puerto Rico"          "Qatar"
    ## [49] "Romania"              "Russia"               "Saudi Arabia"
    ## [52] "Seychelles"           "Singapore"            "Slovakia"
    ## [55] "Slovenia"             "South Korea"          "Spain"
    ## [58] "Sweden"               "Switzerland"          "Taiwan"
    ## [61] "Thailand"             "Trinidad and Tobago"  "Turkey"
    ## [64] "Turkmenistan"         "United Arab Emirates" "United Kingdom"
    ## [67] "United States"        "Uruguay"

2e) Plot Life expectancy against GDP per capita of the developed nations in 2018. Also, compute the correlation coefficient. Describe the association of the two variables. Hint: use the function cor()

    developed_LE_2018 = life2018$Life.expectancy[life2018$GDP.per.capita > 16000]
    developed_GDP_2018 = life2018$GDP.per.capita[life2018$GDP.per.capita > 16000]
    plot(x = developed_GDP_2018, y = developed_LE_2018)

[Figure: scatterplot of developed_LE_2018 (vertical axis) against developed_GDP_2018 (horizontal axis)]

    cor(developed_LE_2018, developed_GDP_2018)
    ## [1] 0.4223323
The two variables have a weak, positive, and seemingly non-linear association (r ≈ 0.42).

Exercise 3

    maas <- read.table("http://www.stat.ucla.edu/~nchristo/statistics12/soil.txt", header = TRUE)

3a) Compute the summary statistics for lead and zinc using the summary() function.

    summary(maas$lead)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    ##    37.0    72.5   123.0   153.4   207.0   654.0
    summary(maas$zinc)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    ##   113.0   198.0   326.0   469.7   674.5  1839.0

3b) Plot two histograms: one of lead and one of zinc. Describe the shapes of the two distributions.

    histogram(maas$lead)

[Figure: histogram of maas$lead (vertical axis: Density)]
    histogram(maas$zinc)

[Figure: histogram of maas$zinc (vertical axis: Density)]

The two distributions are unimodal and skewed right.

3c) Plot two histograms: one of log(lead) and one of log(zinc). How are they different from the results in (b)?

    histogram(log(maas$lead))
[Figure: histogram of log(maas$lead) (vertical axis: Density)]

    histogram(log(maas$zinc))
[Figure: histogram of log(maas$zinc) (vertical axis: Density)]

The log(lead) distribution is unimodal and fairly symmetric. This differs from the result in b) because there is little skew in this distribution. The log(zinc) distribution differs from the result in b) because it is bimodal: it has two peaks rather than one.

3d) Plot log(lead) against log(zinc) and compute the correlation coefficient. Describe the association of the two variables.

    plot(log(maas$lead), log(maas$zinc))
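
The exercise also asks for the correlation coefficient of the two log-transformed variables. A minimal sketch of that call follows, using short made-up vectors in place of maas$lead and maas$zinc; the values are hypothetical, so the result says nothing about the actual soil data.

```r
# Hypothetical concentrations standing in for maas$lead and maas$zinc
lead <- c(40, 75, 120, 210, 400, 650)
zinc <- c(115, 200, 330, 670, 1100, 1800)

# Pearson correlation between the log-transformed variables
r <- cor(log(lead), log(zinc))
print(round(r, 3))
```

On the real data the call would be cor(log(maas$lead), log(maas$zinc)); a value close to 1 would support describing the association as strong, positive, and linear.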
[Figure: scatterplot of log(maas$zinc) (vertical axis) against log(maas$lead) (horizontal axis)]

These two variables seem to have a strong, positive, linear association.

3e) Use techniques similar to last lab to give different colors and sizes to the lead concentration at these 155 locations. You do not need to use the maps package to create a map of the area. Just plot the points without a map.

    # One color per lead level: low, medium, high
    lead_colors <- c("lavender", "slategray2", "plum2")
    # Bin the lead concentrations into three intervals
    lead_levels <- cut(maas$lead, c(0, 120, 400, 1000))
    # Point size scales with lead; color follows the level of each point
    plot(maas$x, maas$y, cex = maas$lead/170, col = lead_colors[as.numeric(lead_levels)], pch = 5)
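
The coloring step relies on cut() turning each concentration into a level number that then indexes the color vector. The sketch below shows that mapping on a handful of illustrative values chosen to sit on and around the break points; the vector lead here is hypothetical and stands in for maas$lead.

```r
# Illustrative lead values (hypothetical), including the break points 120 and 400
lead <- c(37, 120, 121, 400, 654)

lead_colors <- c("lavender", "slategray2", "plum2")

# By default cut() builds right-closed intervals: (0,120], (120,400], (400,1000]
lead_levels <- cut(lead, c(0, 120, 400, 1000))

# as.numeric() on a factor returns the level index (1, 2, or 3)
point_colors <- lead_colors[as.numeric(lead_levels)]
print(data.frame(lead, level = as.numeric(lead_levels), color = point_colors))
```

Because the intervals are closed on the right, a value exactly equal to a break point (120 or 400) falls in the lower of the two neighboring levels.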
[Figure: plot of maas$y (vertical axis) against maas$x (horizontal axis), with point size and color varying by lead concentration]

Exercise 4

    LA <- read.table("http://www.stat.ucla.edu/~nchristo/statistics12/la_data.txt", header = TRUE)

4a) Plot the data point locations. Use good formatting for the axes and title. Then add the outline of LA County by typing: map("county", "california", add = TRUE)

    find.package("maps")
    ## [1] "/home/pdpark@g.ucla.edu/R/x86_64-pc-linux-gnu-library/4.1/maps"
    plot(LA$Longitude, LA$Latitude, xlab = "Longitude", ylab = "latitude", main = "Neighborhoods of LA")
    # Load maps so that map() is available, then overlay the county outlines
    library(maps)
    map("county", "california", add = TRUE)
[Figure: plot titled "Neighborhoods of LA", with latitude (vertical axis) against Longitude (horizontal axis)]

4b) Do you see any relationship between income and school performance? Hint: Plot the variable Schools against the variable Income and describe what you see. Ignore the data points on the plot for which Schools = 0. Use what you learned about subsetting with logical statements to first create the objects you need for the scatter plot. Then, create the scatter plot.

    # Drop the rows where Schools is 0 before plotting
    clean_LA = LA[LA$Schools != 0, ]
    plot(clean_LA$Schools, clean_LA$Income)
[Figure: scatterplot of clean_LA$Income (vertical axis) against clean_LA$Schools (horizontal axis)]

The scatterplot shows a weak, positive relationship between income and school performance, with very few outliers.
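
The strength of this relationship could be quantified with cor() after the same subsetting step. The following is a minimal sketch on a small made-up data frame; LA_toy and its values are hypothetical and are not taken from la_data.txt.

```r
# Hypothetical data standing in for the LA data set
LA_toy <- data.frame(
  Schools = c(0, 650, 700, 0, 820, 900),
  Income  = c(40000, 35000, 52000, 61000, 90000, 150000)
)

# Drop rows where Schools is 0, as the exercise instructs
clean <- LA_toy[LA_toy$Schools != 0, ]

# Correlation between income and school performance
r <- cor(clean$Income, clean$Schools)
print(round(r, 3))
```

On the real data the equivalent call would be cor(clean_LA$Income, clean_LA$Schools), and a value near 0 would match the description of a weak relationship.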