Project3_Report

Introduction to Analytics
Module 3 - Project
by Vidhi Naik
ALY 6000 Introduction to Analytics
October 13, 2023

Introduction

Project 3 involves data analysis and visualization using R. This report outlines our exploration of a Goodreads dataset, which we cleaned and analyzed. It covers book ratings, publishing trends, and statistical insights, and presents our methodology, findings, and recommendations. The report was generated with R Markdown, which keeps the code, the remarks that explain it, the conclusions drawn from the analysis, and the recommendations for future exploration in one organized document. Some output has been excluded because of its size, to prevent excessive pagination.

Key Findings

• Data preparation: the dataset was cleaned and made suitable for analysis by standardizing column names, converting date strings to dates, and keeping only books published between 1990 and 2020 with page counts below 1200.
• Statistical analysis: descriptive statistics described the distribution of book ratings, and population statistics (mean, population variance, and population standard deviation) summarized book rating patterns.
• Publisher profiles: publishers were profiled by the number of books they published. Publishers with fewer than 250 books were excluded, and a Pareto chart showed how books were distributed among the remaining publishers.
• Visualizations: the project produced a histogram of the distribution of book ratings, a box plot of page counts, and a scatter plot of the relationship between number of pages and book rating.
• Sampling: three random samples of 100 books each were drawn and compared with the population statistics to assess the impact of sample size on parameter estimation.
• Recommendations: the recommendations in the report are drawn from the statistical patterns and trends identified in the dataset and are aimed at book industry stakeholders.
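The difference between population and sample statistics mentioned in the findings can be illustrated with a short sketch. This is not the report's own code: the `ratings` vector is simulated stand-in data, and `population_variance()` is a hypothetical helper. The helper is needed because R's built-in `var()` and `sd()` divide by n - 1 (the sample formulas) rather than by n.

```r
# Simulated stand-in for the ratings column (assumption: NOT the Goodreads data).
set.seed(42)
ratings <- round(runif(1000, min = 1, max = 5), 2)

# Hypothetical helper: population variance divides by n,
# whereas base R's var() divides by n - 1.
population_variance <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^2)
}

pop_mean <- mean(ratings)
pop_var  <- population_variance(ratings)
pop_sd   <- sqrt(pop_var)

# Draw three random samples of 100 values each, without replacement.
samples <- replicate(3, sample(ratings, size = 100), simplify = FALSE)

# Put population and sample statistics side by side for comparison.
comparison <- data.frame(
  source   = c("population", "sample 1", "sample 2", "sample 3"),
  mean     = c(pop_mean, sapply(samples, mean)),
  variance = c(pop_var, sapply(samples, var)),
  sd       = c(pop_sd, sapply(samples, sd))
)
print(comparison)
```

The sample rows use `var()` and `sd()` deliberately, since the n - 1 divisor is the usual choice when a sample is used to estimate a population parameter.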

Naik_Project 3 – Rmarkdown Report
VidhiNaik
2023-10-13

#VidhiNaik_Project3
#cat("\014") # clears console
#rm(list = ls()) # clears global environment
try(dev.off(dev.list()["RStudioGD"]), silent = TRUE) # clears plots
#try(p_unload(p_loaded(), character.only = TRUE), silent = TRUE) # clears packages
#options(scipen = 100) # disables scientific notation for entire R session
library(pacman)
p_load(tidyverse)
library(tidyverse)

#Q1. Download the file books.csv from Canvas and read the dataset into R.
books_data <- read.csv("/Users/vidhinaik/Desktop/MS Project Management Syllabus/Intro to Analytics ALY6000/Assignment 3/books.csv")
head(books_data)

#Q2. The janitor package contains helpful functions that perform basic maintenance of your data frame. Use the clean_names function to standardize the names in your data frame.
#install.packages("janitor")
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
##     chisq.test, fisher.test
books_data <- clean_names(books_data)
head(books_data)

#Q3. The lubridate package contains helpful functions to convert dates represented as strings to dates represented as dates. Convert the first_publish_date column to a type date using the mdy function.
#install.packages("lubridate")
library(lubridate)
books_data$first_publish_date <- mdy(books_data$first_publish_date)
## Warning: 1186 failed to parse.
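The "failed to parse" warning means that `mdy()` returned `NA` for strings it could not read as month-day-year dates. A minimal sketch of how such entries could be counted and inspected is shown below. The `raw_dates` vector is made-up illustration data, not values from books.csv, and `quiet = TRUE` is used only to silence the warning in the example.

```r
library(lubridate)

# Made-up date strings (assumption: not taken from books.csv).
raw_dates <- c("09/14/2008", "04/01/2003", "unknown", "")

# quiet = TRUE suppresses the "failed to parse" warning; failures still become NA.
parsed <- mdy(raw_dates, quiet = TRUE)

# Count the entries that could not be parsed and look at their original text.
n_failed <- sum(is.na(parsed))
failed_strings <- raw_dates[is.na(parsed)]

print(n_failed)
print(failed_strings)
```

Inspecting the failed strings this way shows whether they are blanks, partial dates, or dates in another format that would need a different parser.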

class(books_data$first_publish_date)
books_data

#Q4. Reduce your dataset to only include books published between 1990 and 2020 (inclusive).
books_subset1 <- books_data[books_data$first_publish_date >= as.Date("1990-01-01") & books_data$first_publish_date <= as.Date("2020-12-31"), ]
books_subset1$first_publish_date

#Q5. Remove the following columns from the data set: publish_date, edition, characters, price, genres, setting, and isbn.
#install.packages("dplyr")
library(dplyr)
books_subset1 <- books_subset1 %>%
  select(-publish_date, -edition, -characters, -price, -genres, -setting, -isbn)
books_subset1

#----------------------------Data Analysis------------------------------
#Q1. Use the glimpse function to produce a long view of the dataset.
glimpse(books_subset1)

#Q2. Use the summary function to produce a breakdown of the statistics of the dataset.
summary(books_subset1)

#Q3. Create a rating histogram with the following criteria.
#– The y-axis is labeled “Number of Books.”
#– The x-axis is labeled “Rating.”
#– The title of the graph “Histogram of Book Ratings.”
#– The graph is filled with the color “red.”
#– Set a binwidth of .25.
#– Use theme_bw().
library(ggplot2)
ggplot(data = books_subset1, aes(x = rating)) +
  geom_histogram(binwidth = 0.25, fill = "red") +
  labs(y = "Number of Books", x = "Rating", title = "Histogram of Book Ratings") +
  theme_bw()
## Warning: Removed 22495 rows containing non-finite values (`stat_bin()`).
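One behavior worth knowing when filtering with base R brackets: if the logical condition is `NA` for a row (for example because its date failed to parse), `df[condition, ]` returns a row filled with `NA` instead of dropping it. Such rows could be one source of the non-finite values reported in the histogram warning, although the excerpt does not confirm that. The sketch below uses a made-up `toy` data frame to show the effect and one possible way around it with `which()`.

```r
# Made-up data frame (assumption: not the books dataset); one date is missing.
toy <- data.frame(
  title = c("A", "B", "C", "D"),
  first_publish_date = as.Date(c("1985-06-01", "1995-03-15", NA, "2021-01-01"))
)

# Same style of condition as the date-range filter; it is NA where the date is NA.
in_range <- toy$first_publish_date >= as.Date("1990-01-01") &
  toy$first_publish_date <= as.Date("2020-12-31")

# Bracket subsetting keeps an all-NA row for every NA in the condition.
naive <- toy[in_range, ]

# which() returns only the positions where the condition is TRUE, so NA rows are dropped.
safe <- toy[which(in_range), ]

print(naive)
print(safe)
```

`dplyr::filter()` behaves like the `which()` version: it keeps only rows where the condition is `TRUE`.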

#Q4. Create a boxplot of the number pages per book in the dataset with the following requirements.
#– The boxplot is horizontal.
#– The x-axis is labeled “Pages.”
#– The title is “Box Plot of Page Counts.”
#– Fill the boxplot with the color magenta.
#– Use the theme theme_economist from the ggthemes package.
#install.packages("ggthemes")
library(ggthemes)
ggplot(data = books_subset1, aes(x = pages)) +
  geom_boxplot(fill = "magenta") +
  labs(x = "Pages", title = "Box Plot of Page Counts") +
  coord_cartesian(xlim = c(0, 1200)) +
  theme_economist()
## Warning: Removed 23294 rows containing non-finite values (`stat_boxplot()`).
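To make the box plot and its warning easier to interpret, the sketch below computes by hand the quantities a box plot displays: the quartiles, the 1.5 × IQR fences that bound the whiskers, and the points beyond them that are drawn as outliers. The `pages` vector is made-up illustration data, not the report's page counts. The `NA` entry plays the role of a row that ggplot2 would report as "non-finite" and drop before computing the statistics.

```r
# Made-up page counts (assumption: not the Goodreads data); NA marks a missing value.
pages <- c(120, 250, 300, 320, 410, 480, 1500, NA)

# Values that are NA, NaN, or infinite are what the warning calls "non-finite".
n_missing <- sum(!is.finite(pages))

# Quartiles computed on the remaining values.
q <- quantile(pages, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
iqr <- q[["75%"]] - q[["25%"]]

# Whiskers extend to the most extreme points inside these fences.
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr

# Points outside the fences are plotted individually as outliers.
finite_pages <- pages[is.finite(pages)]
outliers <- finite_pages[finite_pages < lower_fence | finite_pages > upper_fence]

print(n_missing)
print(q)
print(outliers)
```

Note that `coord_cartesian(xlim = ...)` in the plot only zooms the view. The quartiles are still computed from all finite values, including those beyond the visible range.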
