Homework 1

School: Columbia University
Course: 5800
Subject: Electrical Engineering
Date: Apr 3, 2024

---
title: 'Homework 1: workflow and graphics practice'
author:
  - name:
date: '`r Sys.Date()`'
output: distill::distill_article
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  eval = TRUE,
  echo = TRUE,
  message = FALSE,
  error = FALSE,
  warning = FALSE
)
```

In our discussion of the Citi Bike case study, we started considering the effect of the pandemic on ridership and rebalancing, and how we might find some insight by looking at data related to other transportation systems in the city. In this homework, we will continue *exploratory data analysis* for this case study as a *prerequisite* for communicating with particular audiences for particular purposes.

<aside>
**Opportunity alert**: In this assignment, I provide most of the code and you fill in the blanks (using functions introduced in our class slides and demonstration code) to get it working. But take the time to understand what I've coded so you can code on your own in future work: this is a tutorial as much as an assignment.
</aside>

# Preliminary setup

If you have not already, install the `tidyverse` and `distill` R packages.

```{r}

```

Create a directory on your computer for your homework. Place this file in that directory. In RStudio, create a **project** in that same directory (in the RStudio menu: File, New Project...). Now, when you import the data, you will only need to specify the subdirectory as part of the file path. This preparatory step helps your work be **reproducible**.
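The installation step above can be sketched as follows. This is only one way to do it, not the assignment's required answer: `missing_packages` is a hypothetical helper written for this illustration, and the `interactive()` guard is an assumption that you would rather install from the console than while knitting.

```r
# Hypothetical helper (not part of any package): return the entries of
# `needed` that do not appear in `installed`.
missing_packages <- function(needed,
                             installed = rownames(installed.packages())) {
  setdiff(needed, installed)
}

# The two packages the assignment asks for.
to_install <- missing_packages(c("tidyverse", "distill"))

# Install only what is absent, and only in an interactive session,
# so that knitting the document never triggers a download.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

Checking before installing keeps the step safe to rerun: on a machine that already has both packages, `to_install` is empty and nothing happens.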

For this assignment, import data on New York City ridership that I included with this markdown file: `MTA_recent_ridership_data_20220127.csv`. Create a subdirectory called `data` in your project directory and place the `csv` file you downloaded into it. Rename the file `MTA_recent_ridership_data.csv`.

Inside the code chunk below, load the `tidyverse` library package (which includes `dplyr` and `ggplot2` functions):

```{r}
# enter code to load the tidyverse libraries
library("tidyverse")
```

# Question 1: importing and summarising

Import the data into a data frame named `d` and show a summary (hint: in your console, after you load the tidyverse library, you can type `?` before `read_csv` or `glimpse` to learn more about functions for this purpose).

Use the two functions below to import and summarize your data variables:

```{r}
# enter code to import and summarize your data frame variables here.
# Note: read.csv() is the base R reader; read_csv() from the hint is the readr equivalent.
d <- read.csv("data/MTA_recent_ridership_data.csv")
glimpse(d)
```

# Question 2: tidying

The column (variable) names will be difficult to work with as they are currently written. First, we will rename the variables so the data frame is easier to work with in code:

```{r}
new_names <- str_c(
  rep(c('subway', 'bus', 'lirr', 'mta', 'access_ride', 'bridge_tunnel'), each = 2),
  rep(c("total", "change"), times = 6),
  sep = '_'
)
colnames(d) <- c('date', new_names)
str(d)
```

Also, notice that some of the variables are of the wrong type. The variable `date`, for example, is a vector of type `chr`. Let's change this to a proper `date` type. All the variables holding a percentage are also of type `chr`. Finally, the now renamed variable `mta_total` is of type `chr`.

Below, explain why variable `mta_total` is of type `chr`:

> Write your answer here. The script changed only the name of the variable, not its type. The column now called `mta_total` was already of type `chr` when the data was imported, so it kept that type after renaming.

# Question 3: more tidying

Next, we'll clean the variables that hold percentages stored as type `chr`. We'll do this by removing the `%` and recasting the variables, all in one set of piped functions:

```{r}
d <- d %>%
  mutate(
    date = as_date(date, format = '%m/%d/%Y')
  ) %>%
  mutate(
    mta_total = as.numeric(mta_total)
  ) %>%
  mutate(
    across(
      where(is.character),
      ~str_replace_all(.x, pattern = '%', replacement = ''))
  ) %>%
  mutate(
    across(
      where(is.character),
      ~as.numeric(.x))
  )
str(d)
```

In R, missing data is represented as `NA`. Let's try to visualize whether we have missing data, say, as a so-called heatmap of the data frame. Finish the code I provided below, and answer the question that follows:

```{r}
a <- d %>%
  mutate(observation = row_number()) %>%
  pivot_longer(
