Lab 2 Assignment
2023-10-17
Intro Content
1 >= 1
## [1] TRUE
2 == 3
## [1] FALSE
10 == 2 * 5
## [1] TRUE
"word1" != "word2"
## [1] TRUE
c(1, 2, 10, 50, -4, 1/2) <= 10
## [1]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
NCbirths = read.csv("births2023-2.csv")
Exercises
1a) Download the data ‘recent-grads.csv’ from Bruinlearn and read it into R. When you read in the data,
name your object “grads”. How many variables and observations does the data have? Hint: Try dim(grads)
to find the answer.
There are 167 observations and 14 variables.
grads = read.csv("recent-grads.csv")
dim(grads)
## [1] 167  14
1b) The Bureau of Labor Statistics, U.S. Department of Labor reports the unemployment rate for the college
graduates was 2.7 percent in September 2023. What proportion of the majors had lower unemployment rates
than 2.7%?
mean(grads$Unemployment_rate < 0.027)
## [1] 0.07784431
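The call above works because R treats TRUE as 1 and FALSE as 0, so the mean of a logical vector is the proportion of TRUE values. A minimal sketch of that idea, using a small made-up vector of rates (the name `rates` and its values are hypothetical, not taken from recent-grads.csv):

```r
# Hypothetical unemployment rates for six majors (illustration only).
rates <- c(0.010, 0.025, 0.030, 0.060, 0.027, 0.090)

below <- rates < 0.027      # logical vector: TRUE where the rate is under 2.7%
n_below <- sum(below)       # TRUE counts as 1, so this is the number of majors below
prop_below <- mean(below)   # number below divided by the total number of majors

n_below      # 2
prop_below   # 2/6, about 0.333
```

Note that the strict `<` excludes a rate of exactly 0.027. If the column contained missing values, `mean(..., na.rm = TRUE)` would be needed to avoid an NA result.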
1c) Create a bar chart for ‘Major_category’ variable in the data. What are the three majors with highest
frequencies?
Engineering, Humanities & Liberal Arts, and Education are the three major categories with the highest frequencies.
library(mosaic)
## Registered S3 method overwritten by 'mosaic':
##   method                           from
##   fortify.SpatialPolygonsDataFrame ggplot2
##
## The 'mosaic' package masks several functions from core packages in order to add
## additional features.  The original behavior of these functions should not be affected by this.
##
## Attaching package: 'mosaic'
## The following objects are masked from 'package:dplyr':
##
##     count, do, tally
## The following object is masked from 'package:Matrix':
##
##     mean
## The following object is masked from 'package:ggplot2':
##
##     stat
## The following objects are masked from 'package:stats':
##
##     binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
##     quantile, sd, t.test, var
## The following objects are masked from 'package:base':
##
##     max, mean, min, prod, range, sample, sum
barchart(grads$Major_category)
[Figure: bar chart of Major_category frequencies (axis: Freq, 0 to 25). Categories shown: Agriculture & Natural Resources, Arts, Biology & Life Science, Business, Communications & Journalism, Computers & Mathematics, Education, Engineering, Health, Humanities & Liberal Arts, Industrial Arts & Consumer Services, Interdisciplinary, Law & Public Policy, Physical Sciences, Psychology & Social Work, Social Science]
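A ranking read from a bar chart can also be checked against a frequency table. The sketch below uses a small made-up vector of categories (hypothetical, not the grads data) with base R's `table()` and `sort()`:

```r
# Hypothetical category labels (illustration only).
cats <- c("Engineering", "Education", "Health", "Engineering", "Arts",
          "Education", "Engineering", "Health", "Education", "Engineering")

counts <- sort(table(cats), decreasing = TRUE)  # frequencies, largest first
top3 <- names(counts)[1:3]                      # labels of the three most frequent

counts
top3   # "Engineering" "Education" "Health"
```

Under the same assumptions, `sort(table(grads$Major_category), decreasing = TRUE)` would list the actual category counts in descending order.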
1d) Report the mean and standard deviation of the ‘Median’ earnings of the majors in ‘Humanities & Liberal
Arts’ major category.
HLA_medians = grads$Median[grads$Major_category == "Humanities & Liberal Arts"]
mean(HLA_medians)
## [1] 31913.33
sd(HLA_medians)
## [1] 3393.032
1e) Report the mean and standard deviation of the ‘Median’ earnings of all majors that are NOT in ‘Humanities
& Liberal Arts’ major category. How are they different from the results in d)?
Compared to the results in d), the mean of the median earnings for all majors outside Humanities & Liberal
Arts is higher by 7701.8 (39615.13 vs. 31913.33). Their standard deviation is also larger, by 5727.411
(9120.443 vs. 3393.032), so median earnings vary more across these majors. On average, majors outside
Humanities & Liberal Arts earn more than Humanities & Liberal Arts majors.
Not_HLA_medians = grads$Median[grads$Major_category != "Humanities & Liberal Arts"]
mean(Not_HLA_medians)
## [1] 39615.13
sd(Not_HLA_medians)
## [1] 9120.443
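The two-group comparison in d) and e) can also be done in one pass by labeling each row and summarizing by label. This is only a sketch on a made-up data frame (the name `toy` and its values are hypothetical), using base R's `tapply()`:

```r
# Hypothetical data: five majors with their category and median earnings.
toy <- data.frame(
  Major_category = c("Humanities & Liberal Arts", "Humanities & Liberal Arts",
                     "Engineering", "Engineering", "Business"),
  Median = c(30000, 34000, 60000, 50000, 40000),
  stringsAsFactors = FALSE
)

# Label each row as inside or outside the category of interest.
group <- ifelse(toy$Major_category == "Humanities & Liberal Arts", "HLA", "Not HLA")

means <- tapply(toy$Median, group, mean)  # one mean per group
sds   <- tapply(toy$Median, group, sd)    # one standard deviation per group

means  # HLA: 32000, Not HLA: 50000
sds    # Not HLA: 10000
```

The separate `==` and `!=` subsets used in the assignment give the same numbers; the grouped form just keeps both results side by side.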
1f) Create a box plot for the ‘Median’ earning of all observations in the data with a good title.
boxplot(grads$Median)
title("Boxplot of Median Salaries of Different Majors")
[Figure: boxplot titled "Boxplot of Median Salaries of Different Majors"; axis ticks at 30000 and 50000]
1g) Based on what you see in part (f), describe the shape of the distribution. Does the mean seem to be a
good measure of center for the data? Report a more useful statistic for this data.
The distribution appears unimodal and slightly right-skewed. The mean is not a good measure of center here
because it is pulled toward the long right tail; it works best for roughly symmetric distributions. Since
this distribution is skewed, the median is a better measure of center.
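To see why skew matters for the choice of center, consider a small made-up set of earnings with one large value (hypothetical numbers, not from the grads data):

```r
# Hypothetical median earnings for six majors; one value is far above the rest.
earnings <- c(30000, 32000, 33000, 35000, 36000, 110000)

m_mean   <- mean(earnings)    # pulled upward by the single large value
m_median <- median(earnings)  # middle of the sorted values, unaffected by the tail

m_mean    # 46000
m_median  # 34000
```

Five of the six values lie below the mean, so the mean does not describe a typical value in this toy set, while the median does.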
Section 2
life_expectancy = read.csv("life_expectancy.csv")
2a) Construct a scatterplot of Life expectancy against GDP per capita. Note: Life expectancy should be on
the vertical axis. How does GDP per capita appear to be associated with life expectancy?
The relationship is not linear, but higher GDP per capita is associated with longer life expectancy.
No points in the scatterplot combine a high life expectancy with a low GDP per capita.
plot(life_expectancy$GDP.per.capita, life_expectancy$Life.expectancy)
[Figure: scatterplot of life_expectancy$Life.expectancy (vertical axis, ticks 20 to 80) against life_expectancy$GDP.per.capita (horizontal axis, ticks 0 to 150000)]
2b) Construct the boxplot and histogram of ‘GDP per capita.’ Describe the distribution based on shape,
center and variability. Are there any outliers found in the boxplot?
The distribution is unimodal and skewed right. The median is the appropriate measure of center, and on the
boxplot it sits only slightly above zero relative to the full range of the axis. Most of the data show little
variability: the IQR is fairly small and most values lie between 0 and 50000. The boxplot flags some high
values as outliers, although they do not look like extreme outliers in the histogram.
boxplot(life_expectancy$GDP.per.capita)
[Figure: boxplot of life_expectancy$GDP.per.capita; axis ticks from 0 to 150000]
histogram(life_expectancy$GDP.per.capita)
[Figure: density histogram of life_expectancy$GDP.per.capita; horizontal axis 0 to 150000, vertical axis Density from 0e+00 to 5e-05]
2c) Report the center (typical value) of ‘GDP per capita’ variables. Use the appropriate measures to find
the center (typical value).
median(life_expectancy$GDP.per.capita)
## [1] 4873
2d) Make a subset of the data for the year of 2018, and name the data as ‘life2018.’ Suppose that a country
generally should have a GDP per capita greater than $16,000 to be considered a ‘developed’ nation. List
the names of the countries (Entity) that are considered as developed nations in 2018.
life2018 = life_expectancy[life_expectancy$Year == 2018, ]
life2018$Entity[life2018$GDP.per.capita > 16000]
##  [1] "Argentina"            "Australia"            "Austria"
##  [4] "Azerbaijan"           "Bahrain"              "Belarus"
##  [7] "Belgium"              "Bulgaria"             "Canada"
## [10] "Chile"                "Croatia"              "Cyprus"
## [13] "Czechia"              "Denmark"              "Equatorial Guinea"
## [16] "Estonia"              "Finland"              "France"
## [19] "Gabon"                "Germany"              "Greece"
## [22] "Hong Kong"            "Hungary"              "Iceland"
## [25] "Iran"                 "Ireland"              "Israel"
## [28] "Italy"                "Japan"                "Kazakhstan"
## [31] "Kuwait"               "Latvia"               "Lithuania"
## [34] "Luxembourg"           "Malaysia"             "Malta"
## [37] "Mauritius"            "Mexico"               "Montenegro"
## [40] "Netherlands"          "New Zealand"          "Norway"
## [43] "Oman"                 "Panama"               "Poland"
## [46] "Portugal"             "Puerto Rico"          "Qatar"
## [49] "Romania"              "Russia"               "Saudi Arabia"
## [52] "Seychelles"           "Singapore"            "Slovakia"
## [55] "Slovenia"             "South Korea"          "Spain"
## [58] "Sweden"               "Switzerland"          "Taiwan"
## [61] "Thailand"             "Trinidad and Tobago"  "Turkey"
## [64] "Turkmenistan"         "United Arab Emirates" "United Kingdom"
## [67] "United States"        "Uruguay"
2e) Plot Life expectancy against GDP per capita of the developed nations in 2018.
Also, compute the
correlation coefficient. Describe the association of the two variables. Hint: use the function cor()
developed_LE_2018 = life2018$Life.expectancy[life2018$GDP.per.capita > 16000]
developed_GDP_2018 = life2018$GDP.per.capita[life2018$GDP.per.capita > 16000]
plot(x = developed_GDP_2018, y = developed_LE_2018)
[Figure: scatterplot of developed_LE_2018 (vertical axis, ticks 65 to 85) against developed_GDP_2018 (horizontal axis, ticks 20000 to 140000)]
cor(developed_LE_2018, developed_GDP_2018)
## [1] 0.4223323
The two variables have a weak, positive, and seemingly non-linear association (r ≈ 0.42).
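When an association looks curved, Pearson's correlation can understate it because it only measures linear association. The sketch below uses made-up values (hypothetical `gdp` and `life` vectors, not the life_expectancy data) in which life expectancy rises with the logarithm of GDP, and compares Pearson's r with Spearman's rank correlation through the `method` argument of `cor()`:

```r
# Hypothetical data: life expectancy rises by 3 years each time GDP doubles.
gdp  <- c(1, 2, 4, 8, 16, 32, 64, 128) * 1000
life <- 60 + 3 * log2(gdp / 1000)

pearson     <- cor(gdp, life)                       # linear association on the raw scale
spearman    <- cor(gdp, life, method = "spearman")  # association between the ranks
pearson_log <- cor(log(gdp), life)                  # linear association after a log transform

pearson      # about 0.85
spearman     # 1
pearson_log  # 1
```

In this toy case the relationship is perfectly monotone, so the rank correlation and the correlation on the log scale are both 1 even though the raw Pearson value is lower. Whether the real data behave similarly is not something this sketch shows.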
Exercise 3
maas <- read.table("http://www.stat.ucla.edu/~nchristo/statistics12/soil.txt", header = TRUE)
3a) Compute the summary statistics for lead and zinc using the summary() function.
summary(maas$lead)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    37.0    72.5   123.0   153.4   207.0   654.0
summary(maas$zinc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   113.0   198.0   326.0   469.7   674.5  1839.0
3b) Plot two histograms: one of lead and one of zinc. Describe the shapes of the two distributions.
histogram(maas$lead)
[Figure: density histogram of maas$lead; horizontal axis 0 to 600, vertical axis Density from 0.000 to 0.006]
histogram(maas$zinc)
[Figure: density histogram of maas$zinc; horizontal axis 0 to 1500, vertical axis Density from 0.0000 to 0.0020]
The two distributions are unimodal and skewed right.
3c) Plot two histograms: one of log(lead) and one of log(zinc). How are they different from the results in
(b)?
histogram(log(maas$lead))
[Figure: density histogram of log(maas$lead); horizontal axis ticks 4 to 6, vertical axis Density from 0.0 to 0.5]
histogram(log(maas$zinc))
[Figure: density histogram of log(maas$zinc); horizontal axis ticks 5 to 7, vertical axis Density from 0.0 to 0.6]
The log(lead) distribution is unimodal and fairly symmetric. This differs from the result in b) because
there is not much skew left in this distribution. The log(zinc) distribution also differs from the result
in b): it is bimodal, with two peaks rather than one.
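The effect of a log transform on right skew can be illustrated with a small made-up vector (hypothetical values, not the soil data) whose entries grow by a constant factor. One rough indicator of skew is the gap between the mean and the median:

```r
# Hypothetical concentrations, each ten times the previous one.
x <- c(1, 10, 100, 1000, 10000)

gap_raw <- mean(x) - median(x)   # large positive gap: long right tail
lx      <- log(x)                # natural log, as used in the assignment
gap_log <- mean(lx) - median(lx) # about zero: evenly spaced after the transform

gap_raw  # 2122.2
gap_log  # 0 up to floating-point rounding
```

The transform compresses large values more than small ones, which is why the long right tail shrinks. It does not guarantee a single peak, so a bimodal result after transforming is still possible.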
3d) Plot log(lead) against log(zinc) and compute the correlation coefficient. Describe the association of the
two variables.
plot(log(maas$lead), log(maas$zinc))
[Figure: scatterplot of log(maas$zinc) (vertical axis, ticks 5.0 to 7.5) against log(maas$lead) (horizontal axis, ticks 3.5 to 6.5)]
These two variables seem to have a strong, positive linear correlation.
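Part 3d) also asks for the correlation coefficient. A minimal sketch of that computation follows, run on a small stand-in data frame (the name `soil` and its values are hypothetical) so that it works without downloading the file; for the real data the analogous call would be `cor(log(maas$lead), log(maas$zinc))`, whose value is not reported here.

```r
# Hypothetical stand-in for the soil data (illustration only).
soil <- data.frame(
  lead = c(50, 100, 200, 400, 650),
  zinc = c(150, 250, 700, 1100, 1800)
)

# Pearson correlation between the log-transformed concentrations.
r_log <- cor(log(soil$lead), log(soil$zinc))
r_log
```

A value close to 1 would support describing the association as strong, positive, and linear on the log scale.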
3e) Use techniques similar to last lab to give different colors and sizes to the lead concentration at these 155
locations. You do not need to use the maps package to create a map of the area. Just plot the points without
a map.
lead_colors <- c("lavender", "slategray2", "plum2")
lead_levels <- cut(maas$lead, c(0, 120, 400, 1000))
plot(maas$x, maas$y, cex = maas$lead/170,
     col = lead_colors[as.numeric(lead_levels)], pch = 5)
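[Figure: plot of sample locations, maas$y (vertical axis, ticks 330000 and 332000) against maas$x (horizontal axis, ticks 178500 to 181500), with point size and color varying by lead concentration]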
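The coloring step relies on `cut()` turning each lead value into a bin, and on `as.numeric()` turning that bin into an index into the color vector. A short sketch with a few made-up lead values (the vector `lead` is hypothetical) shows the mapping, including what happens at the break points:

```r
# Hypothetical lead concentrations, chosen to sit on and around the breaks.
lead <- c(50, 120, 121, 400, 654)

lead_colors <- c("lavender", "slategray2", "plum2")
lead_levels <- cut(lead, c(0, 120, 400, 1000))  # bins are right-closed by default

idx <- as.numeric(lead_levels)    # bin number for each value
point_colors <- lead_colors[idx]  # color assigned to each value

idx           # 1 1 2 2 3
point_colors  # "lavender" "lavender" "slategray2" "slategray2" "plum2"
```

Because the intervals are closed on the right, a value of exactly 120 falls in the first bin and 400 in the second. Values outside the range of the breaks would become NA and receive no color.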
Exercise 4
LA <- read.table("http://www.stat.ucla.edu/~nchristo/statistics12/la_data.txt", header = TRUE)
4a) Plot the data point locations. Use good formatting for the axes and title. Then add the outline of LA
County by typing: map("county", "california", add = TRUE)
find.package("maps")
## [1] "/home/pdpark@g.ucla.edu/R/x86_64-pc-linux-gnu-library/4.1/maps"
plot(LA$Longitude, LA$Latitude, xlab = "Longitude", ylab = "latitude", main = "Neighborhoods of LA")
library(maps)
map("county", "california", add = TRUE)
[Figure: scatterplot titled "Neighborhoods of LA"; horizontal axis Longitude (ticks -118.6 to -118.2), vertical axis latitude (ticks 33.8 to 34.2)]
4b) Do you see any relationship between income and school performance? Hint: Plot the variable Schools
against the variable Income and describe what you see. Ignore the data points on the plot for which Schools
= 0. Use what you learned about subsetting with logical statements to first create the objects you need for
the scatter plot. Then, create the scatter plot.
clean_LA = LA[LA$Schools != 0, ]
plot(clean_LA$Schools, clean_LA$Income)
[Figure: scatterplot of clean_LA$Income (vertical axis, ticks 50000 and 150000) against clean_LA$Schools (horizontal axis, ticks 600 to 900)]
The scatterplot shows a weak, positive relationship between income and school performance. There
are very few outliers in this data set.
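The hint asks for Schools plotted against Income, which by the usual convention puts Schools on the vertical axis and Income on the horizontal axis. A minimal sketch of that arrangement on a made-up data frame (the name `toy_la` and its values are hypothetical, not the la_data.txt file), with the correlation added as a numeric check on the direction of the relationship:

```r
# Hypothetical neighborhoods; one has Schools = 0 and should be dropped.
toy_la <- data.frame(
  Income  = c(40000, 55000, 70000, 90000, 120000, 65000),
  Schools = c(650, 0, 720, 800, 880, 700)
)

# Keep only rows with a nonzero Schools value, using a logical condition.
clean <- toy_la[toy_la$Schools != 0, ]

# Draw the plot only in an interactive session: x = Income, y = Schools.
if (interactive()) {
  plot(clean$Income, clean$Schools, xlab = "Income", ylab = "Schools",
       main = "Schools vs. Income (toy data)")
}

r <- cor(clean$Income, clean$Schools)  # sign gives the direction of the association
nrow(clean)  # 5
r
```

Swapping the two arguments of `plot()` changes which variable appears on which axis but leaves the correlation unchanged.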