School: University of British Columbia
Course: DSCI100
Subject: Statistics
Date: Feb 20, 2024
Worksheet 3: Cleaning and Wrangling Data
Lecture and Tutorial Learning Goals:
After completing this week's lecture and tutorial work, you will be able to:
- distinguish vectors and data frames in R, and how they relate to each other
- define the term "tidy data"
- discuss the advantages and disadvantages of storing data in a tidy data format
- recall and use the following tidyverse functions and operators for their intended data wrangling tasks:
    - select
    - filter
    - |>
    - map
    - mutate
    - summarize
    - group_by
    - pivot_longer
    - separate
    - %in%
This worksheet covers parts of the Wrangling chapter of the online textbook. You
should read this chapter before attempting the worksheet.
### Run this cell before continuing.
library(tidyverse)
library(repr)
source("tests.R")
source("cleanup.R")
options(repr.matrix.max.rows = 6)
Question 0.0
Multiple Choice:
{points: 1}
Which statement below is incorrect about vectors and data frames in R?
A. the columns of data frames are vectors
B. data frames can have columns of different types (e.g., a column of numeric
data, and a column of character data)
C. vectors can have elements of different types (e.g., element one can be numeric,
and element 2 can be a character)
D. data frames are a special kind of list
Assign your answer to an object called answer0.0. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").
# Replace the fail() with your answer.
### BEGIN SOLUTION
answer0.0 <- "C"
### END SOLUTION
test_0.0()
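The relationship between vectors and data frames can be checked directly in base R. The sketch below uses made-up values (the names mixed and toy are hypothetical and not part of the worksheet) to show that a vector holds a single type, while a data frame is a list whose columns are vectors that may each have a different type.

```r
# Hypothetical toy values, not taken from the avocado data set.
mixed <- c(1, "a")   # mixing types in one vector forces a common type
class(mixed)         # "character": the number 1 was coerced to "1"

toy <- data.frame(
    price = c(1.25, 0.99),                 # numeric column
    type  = c("organic", "conventional"),  # character column
    stringsAsFactors = FALSE
)

is.list(toy)          # a data frame is a special kind of list
is.vector(toy$price)  # each column is a vector
sapply(toy, class)    # each column can have its own type
```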
Question 0.1
Multiple Choice:
{points: 1}
Which of the following does not characterize a tidy dataset?
A. each row is a single observation
B. each value should not be in a single cell
C. each column is a single variable
D. each value is a single cell
Assign your answer to an object called answer0.1. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").
# Replace the fail() with your answer.
### BEGIN SOLUTION
answer0.1 <- "B"
### END SOLUTION
test_0.1()
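As a small illustration of the tidy data idea, the sketch below starts from a hypothetical "wide" table in which the year is spread across column names, and uses pivot_longer to give each variable its own column. The table wide, its regions, and its prices are invented for this example.

```r
library(tidyverse)

# Hypothetical wide table: one column per year, so the "year" variable
# lives in the column names rather than in a column of its own.
wide <- tibble(
    region = c("A", "B"),
    `2015` = c(1.10, 1.30),
    `2016` = c(1.20, 1.25)
)

# After pivoting, each row is one observation (a region in a year),
# each column is one variable, and each cell holds one value.
tidy <- pivot_longer(wide,
                     cols = c(`2015`, `2016`),
                     names_to = "yr",
                     values_to = "average_price")
tidy
```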
Question 0.2
Multiple Choice:
{points: 1}
For which scenario would using group_by() + summarize() be appropriate?
A. To apply the same function to every row.
B. To apply the same function to every column.
C. To apply the same function to groups of rows.
D. To apply the same function to groups of columns.
Assign your answer to an object called answer0.2. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").
# Replace the fail() with your answer.
### BEGIN SOLUTION
answer0.2 <- "C"
### END SOLUTION
test_0.2()
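A minimal sketch of the group_by() + summarize() pattern, using a hypothetical four-row table (toy_sales and its values are made up): the rows are split into groups by type, and the same function, mean, is applied to each group of rows.

```r
library(tidyverse)

# Hypothetical data: two organic rows and two conventional rows.
toy_sales <- tibble(
    type = c("organic", "organic", "conventional", "conventional"),
    average_price = c(1.50, 1.70, 1.00, 1.20)
)

# One output row per group, holding that group's mean price.
price_by_type <- toy_sales |>
    group_by(type) |>
    summarize(mean_price = mean(average_price))
price_by_type
```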
Question 0.3
Multiple Choice:
{points: 1}
For which scenario would using one of the purrr map_* functions be appropriate?
A. To apply the same function to groups of rows.
B. To apply the same function to every column.
C. To apply the same function to groups of columns.
D. All of the above.
Assign your answer to an object called answer0.3. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").
# Replace the fail() with your answer.
### BEGIN SOLUTION
answer0.3 <- "B"
### END SOLUTION
test_0.3()
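For comparison with the grouped summary, here is a hedged sketch of map_dbl from purrr applied to a hypothetical table (toy_volumes and its numbers are invented). Because a data frame is a list of columns, map_dbl calls the function once per column and returns one number for each.

```r
library(tidyverse)

# Hypothetical volumes; the column names mirror the worksheet's data set,
# but the values are made up.
toy_volumes <- tibble(
    small_hass_volume = c(100, 200, 300),
    large_hass_volume = c(10, 20, 60)
)

# Apply max() to every column; the result is a named numeric vector.
col_max <- map_dbl(toy_volumes, max)
col_max
```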
1. Assessing avocado prices to inform restaurant menu planning
It is well known that millennials LOVE avocado toast (joking... well, mostly), and so many restaurants offer menu items that centre around this delicious food! Like many food items, avocado prices fluctuate. So a restaurant that wants to maximize profits on avocado-containing dishes might ask whether there are times when avocados are less expensive to purchase. If such times exist, this is when the restaurant should put avocado-containing dishes on the menu to maximize its profits on those dishes.
Source: https://www.averiecooks.com/egg-hole-avocado-toast/
To answer this question we will analyze a data set of avocado sales from multiple US markets. This data was downloaded from the Hass Avocado Board website in May of 2018 and compiled into a single CSV. Each row in the data set contains weekly sales data for a region. The data set spans the years 2015-2018.
Some relevant columns in the dataset:
- Date - The date in year-month-day format
- average_price - The average price of a single avocado
- type - conventional or organic
- yr - The year
- region - The city or region of the observation
- small_hass_volume - in pounds (lbs)
- large_hass_volume - in pounds (lbs)
- extra_l_hass_volume - in pounds (lbs)
- wk - integer number for the calendar week in the year (e.g., first week of January is 1, and last week of December is 52).
To answer our question of whether there are times in the year when avocados are typically less expensive (and thus we can make more profitable menu items with them at a restaurant) we will want to create a scatter plot of average_price (y-axis) versus Date (x-axis).
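The shape of such a plot can be sketched with ggplot2 on a few invented rows. The table toy_avocado and its values are hypothetical stand-ins; the real analysis would use the data read from the CSV file.

```r
library(tidyverse)

# Hypothetical weekly prices, only to show the structure of the plot.
toy_avocado <- tibble(
    Date = as.Date(c("2015-01-04", "2015-01-11", "2015-01-18")),
    average_price = c(1.22, 1.24, 1.17)
)

# Date on the x-axis, average_price on the y-axis, one point per row.
price_plot <- ggplot(toy_avocado, aes(x = Date, y = average_price)) +
    geom_point() +
    labs(x = "Date", y = "Average price of a single avocado")
price_plot
```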
Question 1.1
Multiple Choice:
{points: 1}
Which of the following is not included in the csv file?
A. Average price of a single avocado.
B. The farming practice (production with/without the use of chemicals).
C. Average price of a bag of avocados.
D. All options are included in the data set.
Assign your answer to an object called answer1.1. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").
# Replace the fail() with your answer.
### BEGIN SOLUTION
answer1.1 <- "C"
### END SOLUTION
test_1.1()
Question 1.2
Multiple Choice:
{points: 1}
The rows in the data frame represent:
A. daily avocado sales data for a region
B. weekly avocado sales data for a region
C. bi-weekly avocado sales data for a region
D. yearly avocado sales data for a region
Assign your answer to an object called answer1.2. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").
# Replace the fail() with your answer.
### BEGIN SOLUTION
answer1.2 <- "B"
### END SOLUTION
test_1.2()
Question 1.3
{points: 1}
The first step to plotting total volume against average price is to read the file avocado_prices.csv using the shortest relative path. The data file was given to you along with this worksheet, but you will have to look to see where it is in the worksheet_03 directory to correctly load it. When you do this, you should also preview the file to help you choose an appropriate read_* function to read the data.
Assign your answer to an object called avocado.
#... <- ...("...")
### BEGIN SOLUTION
avocado <- read_csv("data/avocado_prices.csv")
### END SOLUTION
avocado

test_1.3()
Question 1.4
Multiple Choice:
{points: 1}
Why are the 2nd to 5th columns col_double instead of col_integer?
A. They aren't "real" numbers.
B. They contain decimals.
C. They are numbers created using text/letters.
D. They are col_integer...
Assign your answer to an object called answer1.4. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").
# Make sure the correct answer is an uppercase letter.
# Surround your answer with quotation marks.
# Replace the fail() with your answer.
### BEGIN SOLUTION
answer1.4 <- "B"
### END SOLUTION
test_1.4()
Before we get started doing our analysis, let's learn about the pipe operator, |>, as it can be very helpful when doing data analysis in R!
Pipe Operators: |>
The pipe operator allows you to chain together different functions: it takes the output of one statement and makes it the input of the next statement. Having a chain of processing functions is known as a pipeline.
If we wanted to subset the avocado data to obtain just the average prices for organic avocados, we would first use the filter() function to keep only the rows where type is organic. Then we would use the select() function to get just the average price column.
Below we illustrate how to do this using the pipe operator, |>, instead of creating an intermediate object as we have in past worksheets:
Note: the indentation on the second line of the pipeline is not required, but is added for readability.
# run this cell
filter(avocado, type == "organic") |>
    select(average_price)
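To make the comparison with intermediate objects concrete, the sketch below runs both versions on a hypothetical three-row table (toy_avocado, organic_rows and the values are invented for illustration) and checks that they give the same result.

```r
library(tidyverse)

# Hypothetical stand-in for the avocado data.
toy_avocado <- tibble(
    type = c("organic", "conventional", "organic"),
    average_price = c(1.50, 1.00, 1.70)
)

# Version 1: store the filtered rows in an intermediate object.
organic_rows <- filter(toy_avocado, type == "organic")
with_intermediate <- select(organic_rows, average_price)

# Version 2: the same two steps chained with the pipe, no extra object.
with_pipe <- filter(toy_avocado, type == "organic") |>
    select(average_price)

# Compare the two results.
all.equal(with_intermediate, with_pipe)
```

The pipe passes the result of filter() as the first argument of select(), which is why select() is written without a data frame argument in the second version.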