Vyas_Project3_Report.pdf
School: Northeastern University
Course: 6000
Subject: Computer Science
Date: Feb 20, 2024
Type: pdf
Pages: 16

Project 3 - Exploring Visualizations
ALY6000 71053 Introduction to Analytics SEC 27 Module 3
Prepared by: Anvita Vyas (NUID: 002962386)
For: Prof. Herath Gedara, Chinthaka Pathum Dinesh
Submission Date: 10 October 2023

Introduction

The third project of the course is an opportunity to visualize data with R. It works with a dataset of books and involves data cleaning followed by analysis through visualization. It culminates in a visual analysis of the dataset and of the relationships between its variables.

Overview

In this project, we analyze a dataset about books that was collected from Goodreads. The dataset includes details such as book titles, authors, ratings, page counts, and more. The objective of the assignment is to explore functions for data cleaning and exploratory data analysis and to create compelling visualizations. We also explore functions for basic statistics by computing population and sample statistics. The aim is to draw insights from the data through visualization and statistics.

Key Findings

1. Data Processing
a. This step covers loading the packages and the data. We loaded tidyverse with p_load(tidyverse) and read the CSV file with read.csv(). To work with dates and times we loaded lubridate with library(lubridate). To clean the data we loaded dplyr with library(dplyr) and janitor with p_load(janitor). For data visualization we loaded library(ggplot2), library(plotly), library(ggQC), and library(ggthemes).

2. Data Cleaning
a. The assignment focused heavily on data cleaning so that the data was easy to work with. We first standardized the column names with clean_names(). To parse and clean the dates we used the mdy() and year() functions.

3. Data Manipulation
a. We used the vector function c() to specify the data we required, together with the select(), filter(), arrange(), group_by(), and mutate() functions. These manipulations made the data easier to analyze and kept it focused on what was needed.

4. Statistical Analysis
a. The assignment uses descriptive statistics such as the mean, computed with mean(), which was introduced previously. It also introduced new ways to conduct statistical analysis, such as writing custom R functions with function() to compute the average, population variance, and population standard deviation of book ratings. We also explored sample statistics by drawing three random samples of 100 books from the dataset and computing the mean, variance, and standard deviation of each sample, which allowed a comparison with the population values.

Fig 1. Input to create custom functions
Fig 2. Output for creating own functions

5. Data Visualization
a. Several functions supported the visualization work. glimpse() shows the dataset in a transposed layout, with each column listed on its own row. ggplot() was used to draw scatter plots and histograms to the required specifications. These were useful tools for visualizing the data and gaining insights to make recommendations.
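
The custom statistics functions described under Statistical Analysis appear only as figures (Fig 1 and Fig 2), so the following is a minimal sketch of how they might look, not the code from the report. The function names pop_mean, pop_var, and pop_sd and the simulated ratings vector are assumptions made for illustration; in the project the values would come from the ratings column of the books data.

```r
# Hypothetical stand-in for the ratings column of the books data.
set.seed(42)
ratings <- round(runif(1000, min = 1, max = 5), 2)

# Population mean: sum of the values divided by the number of values.
pop_mean <- function(x) sum(x) / length(x)

# Population variance divides by N, unlike var(), which divides by N - 1.
pop_var <- function(x) sum((x - pop_mean(x))^2) / length(x)

# Population standard deviation is the square root of the population variance.
pop_sd <- function(x) sqrt(pop_var(x))

# Three random samples of 100 ratings each, drawn without replacement.
samples <- lapply(1:3, function(i) sample(ratings, size = 100))

# Sample statistics use the built-in mean(), var(), and sd().
sample_stats <- data.frame(
  sample   = 1:3,
  mean     = sapply(samples, mean),
  variance = sapply(samples, var),
  sd       = sapply(samples, sd)
)

print(c(mean = pop_mean(ratings), variance = pop_var(ratings), sd = pop_sd(ratings)))
print(sample_stats)
```

Comparing each row of sample_stats with the population values shows how closely a sample of 100 books tracks the full dataset.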

The key findings demonstrate how various operations and functions in R programming can help with data analysis to extract meaningful information from datasets.

Key Visualizations

Question 3 (Data Analysis)
Create a rating histogram with the following criteria:
– The y-axis is labeled “Number of Books.”
– The x-axis is labeled “Rating.”
– The title of the graph is “Histogram of Book Ratings.”
– The graph is filled with the color “red.”
– Set a binwidth of .25.
– Use theme_bw().

Fig 3. Input to create a histogram
Fig 4. Output, the histogram
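
The input for the histogram is available only as Fig 3, so here is a hedged sketch of one way to meet the criteria listed for Question 3. The data frame name books, the column name rating, and the simulated values are assumptions; the report's actual data frame is not shown.

```r
library(ggplot2)

# Hypothetical stand-in data; the real ratings come from the cleaned books data.
set.seed(1)
books <- data.frame(rating = round(runif(500, min = 1, max = 5), 2))

rating_hist <- ggplot(books, aes(x = rating)) +
  geom_histogram(binwidth = 0.25, fill = "red") +   # red bars, bin width of 0.25
  labs(
    title = "Histogram of Book Ratings",
    x = "Rating",
    y = "Number of Books"
  ) +
  theme_bw()

print(rating_hist)
```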

Question 4 (Data Analysis)
Create a boxplot of the number of pages per book in the dataset with the following requirements:
– The boxplot is horizontal.
– The x-axis is labeled “Pages.”
– The title is “Box Plot of Page Counts.”
– Fill the boxplot with the color magenta.
– Use the theme theme_economist from the ggthemes package.

Fig 5. Input to create a box plot
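
The box plot requirements for Question 4 could be met along the following lines. This is a sketch under assumptions, not the code in Fig 5: the data frame books, the column pages, and the simulated page counts are hypothetical. Mapping the variable to x makes geom_boxplot() draw the box horizontally in recent versions of ggplot2.

```r
library(ggplot2)
library(ggthemes)

# Hypothetical stand-in data; the real page counts come from the books data.
set.seed(2)
books <- data.frame(pages = round(rlnorm(500, meanlog = 5.7, sdlog = 0.5)))

page_box <- ggplot(books, aes(x = pages)) +
  geom_boxplot(fill = "magenta") +                  # variable on x gives a horizontal box
  labs(title = "Box Plot of Page Counts", x = "Pages") +
  theme_economist()                                 # theme from the ggthemes package

print(page_box)
```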
Fig 6. Output - the box plot

Question 6 (Data Analysis)
Using the data frame constructed in the prior problem, create a Pareto Chart with an ogive of cumulative counts formatted with the following additional criteria:
– The bars are filled with the color cyan.
– The x-axis label is “Publisher.”
– The y-axis label is “Number of Books.”
– The title is “Pareto and Ogive of Publisher Book Counts (1990 - 2020).”
– Use the theme theme_clean().
– Rotate the x-axis labels by 45 degrees (consider the ggeasy package).

Fig 7. Input to create a Pareto Chart
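
The Pareto chart code is shown only as Fig 7, and the data frame from the prior problem is not included here, so the sketch below makes several assumptions. The data frame publisher_counts, its columns, and the placeholder publisher names and counts are hypothetical. It also builds the chart from plain ggplot2 layers (bars plus a cumulative line) rather than a dedicated Pareto function, and rotates the labels with theme() instead of the ggeasy package suggested in the question.

```r
library(ggplot2)
library(ggthemes)

# Hypothetical stand-in for the publisher counts built in the prior problem.
publisher_counts <- data.frame(
  publisher  = c("Publisher A", "Publisher B", "Publisher C", "Publisher D"),
  book_count = c(120, 45, 80, 30)
)

# A Pareto chart orders the bars from largest to smallest count.
publisher_counts <- publisher_counts[order(-publisher_counts$book_count), ]
publisher_counts$publisher <- factor(
  publisher_counts$publisher,
  levels = publisher_counts$publisher
)

# The ogive is the running total of the ordered counts.
publisher_counts$cum_count <- cumsum(publisher_counts$book_count)

pareto <- ggplot(publisher_counts, aes(x = publisher)) +
  geom_col(aes(y = book_count), fill = "cyan") +
  geom_line(aes(y = cum_count, group = 1)) +
  geom_point(aes(y = cum_count)) +
  labs(
    title = "Pareto and Ogive of Publisher Book Counts (1990 - 2020)",
    x = "Publisher",
    y = "Number of Books"
  ) +
  theme_clean() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # rotate x labels 45 degrees

print(pareto)
```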