worksheet_wrangling.pdf

School

University of British Columbia

Course

DSCI100

Subject

Statistics

Date

Feb 20, 2024


Worksheet 3: Cleaning and Wrangling Data

Lecture and Tutorial Learning Goals:

After completing this week's lecture and tutorial work, you will be able to:

- distinguish vectors and data frames in R, and explain how they relate to each other
- define the term "tidy data"
- discuss the advantages and disadvantages of storing data in a tidy data format
- recall and use the following tidyverse functions and operators for their intended data wrangling tasks:
    - select
    - filter
    - |>
    - map
    - mutate
    - summarize
    - group_by
    - pivot_longer
    - separate
    - %in%

This worksheet covers parts of the Wrangling chapter of the online textbook. You should read this chapter before attempting the worksheet.

### Run this cell before continuing.
library(tidyverse)
library(repr)
source("tests.R")
source("cleanup.R")
options(repr.matrix.max.rows = 6)

Question 0.0 Multiple Choice: {points: 1}

Which statement below is incorrect about vectors and data frames in R?

A. the columns of data frames are vectors
B. data frames can have columns of different types (e.g., a column of numeric data, and a column of character data)
C. vectors can have elements of different types (e.g., element one can be numeric, and element 2 can be a character)
D. data frames are a special kind of list

Assign your answer to an object called answer0.0. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

# Replace the fail() with your answer.
### BEGIN SOLUTION
answer0.0 <- "C"
### END SOLUTION

test_0.0()

Question 0.1 Multiple Choice: {points: 1}

Which of the following does not characterize a tidy dataset?

A. each row is a single observation
B. each value should not be in a single cell
C. each column is a single variable
D. each value is a single cell

Assign your answer to an object called answer0.1. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

# Replace the fail() with your answer.
### BEGIN SOLUTION
answer0.1 <- "B"
### END SOLUTION

test_0.1()

Question 0.2 Multiple Choice: {points: 1}

For which scenario would using group_by() + summarize() be appropriate?

A. To apply the same function to every row.
B. To apply the same function to every column.
C. To apply the same function to groups of rows.
D. To apply the same function to groups of columns.

Assign your answer to an object called answer0.2. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

# Replace the fail() with your answer.
### BEGIN SOLUTION
answer0.2 <- "C"
### END SOLUTION

test_0.2()

Question 0.3 Multiple Choice: {points: 1}

For which scenario would using one of the purrr map_* functions be appropriate?

A. To apply the same function to groups of rows.
B. To apply the same function to every column.
C. To apply the same function to groups of columns.
D. All of the above.

Assign your answer to an object called answer0.3. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

# Replace the fail() with your answer.
### BEGIN SOLUTION
answer0.3 <- "B"
### END SOLUTION

test_0.3()

1. Assessing avocado prices to inform restaurant menu planning

It is well known that millennials LOVE avocado toast (joking... well, mostly), and so many restaurants offer menu items that centre around this delicious food! Like many food items, avocado prices fluctuate. So a restaurant that wants to maximize profits on avocado-containing dishes might ask whether there are times of the year when avocados are less expensive to purchase. If such times exist, that is when the restaurant should put avocado-containing dishes on the menu to maximize its profits on those dishes.

Source: https://www.averiecooks.com/egg-hole-avocado-toast/

To answer this question we will analyze a data set of avocado sales from multiple US markets. This data was downloaded from the Hass Avocado Board website in May of 2018 and compiled into a single CSV. Each row in the data set contains weekly sales data for a region. The data set spans the years 2015-2018.

Some relevant columns in the dataset:

- Date - The date in year-month-day format
- average_price - The average price of a single avocado
- type - conventional or organic
- yr - The year
- region - The city or region of the observation
- small_hass_volume in pounds (lbs)
- large_hass_volume in pounds (lbs)
- extra_l_hass_volume in pounds (lbs)
- wk - integer number for the calendar week in the year (e.g., first week of January is 1, and last week of December is 52).

To answer our question of whether there are times in the year when avocados are typically less expensive (and thus we can make more profitable menu items with them at a restaurant) we will want to create a scatter plot of average_price (y-axis) versus Date (x-axis).

Question 1.1 Multiple Choice: {points: 1}

Which of the following is not included in the csv file?

A. Average price of a single avocado.
B. The farming practice (production with/without the use of chemicals).
C. Average price of a bag of avocados.
D. All options are included in the data set.

Assign your answer to an object called answer1.1. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

# Replace the fail() with your answer.
### BEGIN SOLUTION
answer1.1 <- "C"
### END SOLUTION

test_1.1()

Question 1.2 Multiple Choice: {points: 1}

The rows in the data frame represent:

A. daily avocado sales data for a region
B. weekly avocado sales data for a region
C. bi-weekly avocado sales data for a region
D. yearly avocado sales data for a region

Assign your answer to an object called answer1.2. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

# Replace the fail() with your answer.
### BEGIN SOLUTION
answer1.2 <- "B"
### END SOLUTION

test_1.2()

Question 1.3 {points: 1}

The first step to plotting total volume against average price is to read the file avocado_prices.csv using the shortest relative path. The data file was given to you along with this worksheet, but you will have to look to see where it is in the worksheet_03 directory to correctly load it. When you do this, you should also preview the file to help you choose an appropriate read_* function to read the data.

Assign your answer to an object called avocado.

#... <- ...("...")
### BEGIN SOLUTION
avocado <- read_csv("data/avocado_prices.csv")
### END SOLUTION
avocado

test_1.3()

Question 1.4 Multiple Choice: {points: 1}

Why are the 2nd to 5th columns col_double instead of col_integer?

A. They aren't "real" numbers.
B. They contain decimals.
C. They are numbers created using text/letters.
D. They are col_integer ...

Assign your answer to an object called answer1.4. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

# Make sure the correct answer is an uppercase letter.
# Surround your answer with quotation marks.
# Replace the fail() with your answer.
### BEGIN SOLUTION
answer1.4 <- "B"
### END SOLUTION

test_1.4()

Before we get started doing our analysis, let's learn about the pipe operator, |>, as it can be very helpful when doing data analysis in R!

Pipe Operators: |>

The pipe operator allows you to chain together different functions: it takes the output of one statement and makes it the input of the next statement. Having a chain of processing functions is known as a pipeline.

If we wanted to subset the avocado data to obtain just the average prices for organic avocados, we would first use the filter() function to keep only the rows where the type is organic. Then we would use the select() function to get just the average price column. Below we illustrate how to do this using the pipe operator, |>, instead of creating an intermediate object as we have in past worksheets:

Note: the indentation on the second line of the pipeline is not required, but added for readability.

# run this cell
filter(avocado, type == "organic") |>
    select(average_price)
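
To see that a pipeline and the intermediate-object approach give the same result, here is a minimal sketch on a small made-up data frame. The object toy_avocado and its values are hypothetical and are not taken from avocado_prices.csv; only the column names type, average_price and wk mirror the worksheet's data. The native pipe |> is assumed to be available (R 4.1 or later).

```r
library(dplyr)

# Hypothetical stand-in for the avocado data; the values are invented for illustration.
toy_avocado <- tibble(
  type = c("organic", "conventional", "organic", "conventional"),
  average_price = c(1.80, 1.10, 1.95, 1.05),
  wk = c(1L, 1L, 2L, 2L)
)

# Approach 1: store the filtered rows in an intermediate object, then select.
organic_rows <- filter(toy_avocado, type == "organic")
organic_prices_steps <- select(organic_rows, average_price)

# Approach 2: the same two steps written as a pipeline.
# The output of each step becomes the first argument of the next function.
organic_prices_piped <- toy_avocado |>
  filter(type == "organic") |>
  select(average_price)

organic_prices_piped
```

Both approaches keep the two organic rows and the single average_price column; the pipeline simply avoids naming the intermediate result.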
