IST 687 HW3

School: Syracuse University
Course: IST 687
Subject: Computer Science
Date: Dec 6, 2023
Type: pdf (10 pages)
11/29/23, 8:10 PM — HW3.knit — file:///C:/Users/empet/OneDrive/Desktop/Intro to Data Science/Elyse_Peterson_HW3.knit.html

Intro to Data Science - HW 3

Copyright Jeffrey Stanton, Jeffrey Saltz, and Jasmina Tacheva

# Enter your name here: Elyse Peterson

Attribution statement: (choose only one and delete the rest)

# 1. I did this homework by myself, with help from the book and the professor.

Reminders of things to practice from last week:

- Make a data frame: data.frame()
- Row index of max/min: which.max(), which.min()
- Sort values or order rows: sort(), order()
- Descriptive statistics: mean(), sum(), max()
- Conditional statement: if (condition) "true stuff" else "false stuff"

This Week: Often, when you get a dataset, it is not in the format you want. You can (and should) use code to refine the dataset to become more useful. As Chapter 6 of Introduction to Data Science mentions, this is called "data munging." In this homework, you will read in a dataset from the web and work on it (in a data frame) to improve its usefulness.

Part 1: Use read_csv() to read a CSV file from the web into a data frame

A. Use R code to read directly from a URL on the web. Store the dataset in a new data frame called dfComps. The URL is: https://intro-datascience.s3.us-east-2.amazonaws.com/companies1.csv

Hint: use read_csv(), not read.csv(). read_csv() is from the tidyverse (readr) package. Check the help to compare them.

```r
library(readr)

urlToRead <- "https://intro-datascience.s3.us-east-2.amazonaws.com/companies1.csv"
dfcomps <- read_csv(url(urlToRead))
```
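The hint above suggests comparing read_csv() with base R's read.csv(); a minimal sketch of the practical difference, reusing the urlToRead variable defined above:

```r
# read_csv() (readr) returns a tibble and reports a column specification;
# character columns stay character vectors.
dfcomps <- readr::read_csv(urlToRead, show_col_types = FALSE)

# read.csv() (base R) returns a plain data.frame and guesses column
# types silently, with no parsing report.
dfbase <- read.csv(urlToRead)

class(dfcomps)  # includes "tbl_df" (a tibble)
class(dfbase)   # "data.frame"
```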
```r
## Rows: 47758 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): permalink, name, homepage_url, category_list, market, funding_tota...
## dbl  (2): funding_rounds, founded_year
##
## Use `spec()` to retrieve the full column specification for this data.
## Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(dfcomps)
## # A tibble: 6 × 18
##   permalink     name  homepage_url category_list market funding_total_usd status
##   <chr>         <chr> <chr>        <chr>         <chr>  <chr>             <chr>
## 1 /organizatio… #way… http://www.… |Entertainme… News   1 750 000         acqui…
## 2 /organizatio… &TV … http://enjo… |Games|       Games  4 000 000         opera…
## 3 /organizatio… 'Roc… http://www.… |Publishing|… Publi… 40 000            opera…
## 4 /organizatio… (In)… http://www.… |Electronics… Elect… 1 500 000         opera…
## 5 /organizatio… #NAM… http://plus… |Software|    Softw… 1 200 000         opera…
## 6 /organizatio… -R- … <NA>         |Entertainme… Games  10 000            opera…
## # ℹ 11 more variables: country_code <chr>, state_code <chr>, region <chr>,
## #   city <chr>, funding_rounds <dbl>, founded_at <chr>, founded_month <chr>,
## #   founded_quarter <chr>, founded_year <dbl>, first_funding_at <chr>,
## #   last_funding_at <chr>
```

Part 2: Create a new data frame that only contains companies with a homepage URL

E. Use subsetting to create a new dataframe that contains only the companies with homepage URLs (store that dataframe in urlComps).
```r
urlcomps <- subset(dfcomps, complete.cases(dfcomps$homepage_url))
head(urlcomps)
## # A tibble: 6 × 18
##   permalink     name  homepage_url category_list market funding_total_usd status
##   <chr>         <chr> <chr>        <chr>         <chr>  <chr>             <chr>
## 1 /organizatio… #way… http://www.… |Entertainme… News   1 750 000         acqui…
## 2 /organizatio… &TV … http://enjo… |Games|       Games  4 000 000         opera…
## 3 /organizatio… 'Roc… http://www.… |Publishing|… Publi… 40 000            opera…
## 4 /organizatio… (In)… http://www.… |Electronics… Elect… 1 500 000         opera…
## 5 /organizatio… #NAM… http://plus… |Software|    Softw… 1 200 000         opera…
## 6 /organizatio… .Clu… http://nic.… |Software|    Softw… 7 000 000         <NA>
## # ℹ 11 more variables: country_code <chr>, state_code <chr>, region <chr>,
## #   city <chr>, funding_rounds <dbl>, founded_at <chr>, founded_month <chr>,
## #   founded_quarter <chr>, founded_year <dbl>, first_funding_at <chr>,
## #   last_funding_at <chr>
```
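An equivalent tidyverse approach (a sketch, not the submitted answer): dplyr::filter() drops rows where the condition is FALSE or NA, so filtering on !is.na(homepage_url) keeps only companies that have a URL.

```r
library(dplyr)

# Keep only the rows whose homepage_url is not missing.
urlcomps <- dfcomps %>%
  filter(!is.na(homepage_url))
```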
D. How many companies are missing a homepage URL?

```r
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

count(dfcomps)
## # A tibble: 1 × 1
##       n
##   <int>
## 1 47758

count(urlcomps)
## # A tibble: 1 × 1
##       n
##   <int>
## 1 44435

count(dfcomps) - count(urlcomps)
##      n
## 1 3323
```

Part 3: Analyze the numeric variables in the dataframe

G. How many numeric variables does the dataframe have? You can figure that out by looking at the output of str(urlComps).

H. What is the average number of funding rounds for the companies in urlComps?

```r
str(urlcomps)
```
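A base-R alternative for the same count, as a sketch: is.na() marks the missing URLs and sum() counts the TRUE values.

```r
# Number of companies with no homepage URL.
sum(is.na(dfcomps$homepage_url))
```

This should reproduce the 3323 obtained from count(dfcomps) - count(urlcomps).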
```r
## tibble [44,435 × 18] (S3: tbl_df/tbl/data.frame)
##  $ permalink        : chr [1:44435] "/organization/waywire" "/organization/tv-communications" "/organization/rock-your-paper" "/organization/in-touch-network" ...
##  $ name             : chr [1:44435] "#waywire" "&TV Communications" "'Rock' Your Paper" "(In) Touch Network" ...
##  $ homepage_url     : chr [1:44435] "http://www.waywire.com" "http://enjoyandtv.com" "http://www.rockyourpaper.org" "http://www.InTouchNetwork.com" ...
##  $ category_list    : chr [1:44435] "|Entertainment|Politics|Social Media|News|" "|Games|" "|Publishing|Education|" "|Electronics|Guides|Coffee|Restaurants|Music|iPhone|Apps|Mobile|iOS|E-Commerce|" ...
##  $ market           : chr [1:44435] "News" "Games" "Publishing" "Electronics" ...
##  $ funding_total_usd: chr [1:44435] "1 750 000" "4 000 000" "40 000" "1 500 000" ...
##  $ status           : chr [1:44435] "acquired" "operating" "operating" "operating" ...
##  $ country_code     : chr [1:44435] "USA" "USA" "EST" "GBR" ...
##  $ state_code       : chr [1:44435] "NY" "CA" NA NA ...
##  $ region           : chr [1:44435] "New York City" "Los Angeles" "Tallinn" "London" ...
##  $ city             : chr [1:44435] "New York" "Los Angeles" "Tallinn" "London" ...
##  $ funding_rounds   : num [1:44435] 1 2 1 1 2 1 1 1 1 1 ...
##  $ founded_at       : chr [1:44435] "1/6/12" NA "26/10/2012" "1/4/11" ...
##  $ founded_month    : chr [1:44435] "2012-06" NA "2012-10" "2011-04" ...
##  $ founded_quarter  : chr [1:44435] "2012-Q2" NA "2012-Q4" "2011-Q2" ...
##  $ founded_year     : num [1:44435] 2012 NA 2012 2011 2012 ...
##  $ first_funding_at : chr [1:44435] "30/06/2012" "4/6/10" "9/8/12" "1/4/11" ...
##  $ last_funding_at  : chr [1:44435] "30/06/2012" "23/09/2010" "9/8/12" "1/4/11" ...
```

There are 2 numeric fields: funding_rounds and founded_year.

```r
mean(urlcomps$funding_rounds)
## [1] 1.725194
```

I. What year was the oldest company in the dataframe founded?
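Rather than counting numeric columns by eye in the str() output, they can be found programmatically; a sketch:

```r
# TRUE for each column that is numeric (double or integer).
numeric_cols <- sapply(urlcomps, is.numeric)

sum(numeric_cols)               # should be 2
names(urlcomps)[numeric_cols]   # should be "funding_rounds" and "founded_year"
```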
Hint: If you get a value of NA, most likely there are missing values in this variable, which prevent R from properly calculating the min and max values. You can tell basic math functions to ignore NAs with na.rm = TRUE. For example, mean(urlComps$founded_year, na.rm = TRUE) computes the average while skipping NAs (note that this question needs a different function than mean()).

```r
# mean(urlComps$founded_year, na.rm = TRUE)
min(na.omit(dfcomps$founded_year))
## [1] 1900
```

Part 4: Use string operations to clean the data

K. The permalink variable in urlComps contains the name of each company, but the names are currently preceded by the prefix "/organization/". We can use str_replace() in tidyverse or gsub() to clean the values of this variable.
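The excerpt ends before showing the code for K; a sketch of the cleanup it describes, using gsub() as the prompt suggests:

```r
# Strip the "/organization/" prefix from every permalink,
# leaving just the company name.
urlcomps$permalink <- gsub("/organization/", "", urlcomps$permalink)

# tidyverse equivalent:
# urlcomps$permalink <- stringr::str_replace(urlcomps$permalink, "/organization/", "")

head(urlcomps$permalink)
```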