Project3_Report
School
Northeastern University
Course
6000
Subject
Computer Science
Date
Apr 3, 2024
Introduction to Analytics
Module 3 - Project
by
Vidhi Naik
ALY 6000 Introduction to Analytics
October 13, 2023
Introduction
Project 3 involves data analysis and visualization using R. This report outlines our exploration of a Goodreads dataset, which we cleaned and analyzed. It covers book ratings, publishing trends, and statistical insights, and presents our methodology, findings, and recommendations. The report was generated with R Markdown, which provides an organized presentation of the key findings, comments that explain the code, conclusions drawn from the analysis, and actionable recommendations for future exploration. Some output has been excluded from the report because of its size, to prevent excessive pagination.
Key Findings
• The data preparation stage cleaned and refined the dataset for analysis by standardizing column names, converting date formats, and keeping only books published between 1990 and 2020 with page counts below 1200.
• Descriptive statistics described the distribution of book ratings, and population statistics (mean, population variance, and population standard deviation) summarized book rating patterns.
• Publisher profiles were created based on the number of books published. Publishers with fewer than 250 books were excluded, and a Pareto chart displayed how books were distributed among the remaining publishers.
• The project generated key visualizations: a histogram of the distribution of book ratings, a box plot of page counts, and a scatter plot exploring the relationship between number of pages and book rating.
• We drew three random samples of 100 books each and compared their statistics with the population statistics to assess how sampling affects parameter estimation.
• The recommendations presented in the report are aimed at book industry stakeholders and draw on the statistical patterns and trends identified in the dataset.
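The population statistics and the three-sample comparison listed above can be sketched as follows. This is a minimal illustration on synthetic ratings, not the report's data: the vector `ratings` and the helpers `pop_var` and `pop_sd` are names assumed here for illustration. R's built-in `var()` and `sd()` divide by n - 1, so the population versions are computed by dividing by n instead.

```r
# Synthetic stand-in for the rating column (assumption, not the Goodreads data)
set.seed(42)
ratings <- round(runif(5000, min = 1, max = 5), 2)

# Population variance and standard deviation: divide by n, not n - 1
pop_var <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^2)
}
pop_sd <- function(x) sqrt(pop_var(x))

pop_mean <- mean(ratings)

# Three random samples of 100 ratings each, compared with the population mean
sample_means <- sapply(1:3, function(i) mean(sample(ratings, size = 100)))
comparison <- data.frame(
  sample = 1:3,
  sample_mean = sample_means,
  difference_from_population = sample_means - pop_mean
)
print(comparison)
```

Each sample mean will usually differ slightly from the population mean; the size of that difference is what the comparison is meant to show.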
Naik_Project 3 – Rmarkdown Report VidhiNaik
2023-10-13
#VidhiNaik_Project3
#cat("\014") # clears console
#rm(list = ls()) # clears global environment
#try(dev.off(dev.list()["RStudioGD"]), silent = TRUE) # clears plots
#try(p_unload(p_loaded(), character.only = TRUE), silent = TRUE) # clears packages
#options(scipen = 100) # disables scientific notation for entire R session
library(pacman)
p_load(tidyverse)
library(tidyverse)
#Q1. Download the file books.csv from Canvas and read the dataset into R.
books_data <- read.csv("/Users/vidhinaik/Desktop/MS Project Management Syllabus/Intro to Analytics ALY6000/Assignment 3/books.csv")
head(books_data)
#Q2. The janitor package contains helpful functions that perform basic maintenance of your data frame. Use the clean_names function to standardize the names in your data frame.
#install.packages("janitor")
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
##     chisq.test, fisher.test
books_data <- clean_names(books_data)
head(books_data)
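To show what `clean_names()` does to column headers, here is a small sketch on a toy data frame. The data frame `raw` and its column names are assumptions made up for illustration; they are not the actual headers of books.csv.

```r
library(janitor)

# Toy data frame with inconsistent header styles (hypothetical columns)
raw <- data.frame(
  "Book Id" = 1:2,
  "First Publish Date" = c("01/05/1999", "12/31/2020"),
  "numRatings" = c(10, 20),
  check.names = FALSE  # keep the messy names exactly as typed
)

# clean_names() converts headers to lower snake_case by default
cleaned <- clean_names(raw)
print(names(raw))
print(names(cleaned))
```

Standardized names make later steps such as `books_data$first_publish_date` and `select(-publish_date, ...)` possible without quoting or backticks.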
#Q3. The lubridate package contains helpful functions to convert dates stored as strings into Date values. Convert the first_publish_date column to type Date using the mdy function.
#install.packages("lubridate")
library(lubridate)
books_data$first_publish_date <- mdy(books_data$first_publish_date)
## Warning: 1186 failed to parse.
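The "failed to parse" warning means that `mdy()` returned `NA` for strings it could not read as month-day-year. A minimal sketch of how such failures could be counted and inspected is below; the vector `raw_dates` is made-up illustration data, and `quiet = TRUE` is used only to silence the warning in this example.

```r
library(lubridate)

# Hypothetical raw values: two valid dates, one non-date string, one empty string
raw_dates <- c("01/05/1999", "12/31/2020", "unknown", "")

# Unparseable strings become NA
parsed <- mdy(raw_dates, quiet = TRUE)

# Count values that were present in the input but failed to parse
n_failed <- sum(is.na(parsed) & !is.na(raw_dates))
print(n_failed)

# Inspect the original strings that failed
print(raw_dates[is.na(parsed)])
```

Keeping a copy of the original strings before overwriting the column makes this kind of inspection possible on the real dataset.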
class(books_data$first_publish_date)
books_data
#Q4. Reduce your dataset to only include books published between 1990 and 2020 (inclusive).
books_subset1 <- books_data[books_data$first_publish_date >= as.Date("1990-01-01") &
                            books_data$first_publish_date <= as.Date("2020-12-31"), ]
books_subset1$first_publish_date
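One detail of base R subsetting is worth noting here: when the logical index contains `NA` (for example, because a date failed to parse), the corresponding row is returned as a row of all `NA` values instead of being dropped. The sketch below shows this on a toy data frame and contrasts it with `dplyr::filter()`, which drops rows where the condition is `NA`. The data frame `toy` and the names `range_start` and `range_end` are assumptions for illustration; this is an alternative shown for comparison, not the report's method.

```r
library(dplyr)

# Toy data with one missing date (hypothetical values)
toy <- data.frame(
  title = c("A", "B", "C", "D", "E"),
  first_publish_date = as.Date(c("1989-06-01", "1995-03-15", NA,
                                 "2021-01-01", "2020-12-31"))
)
range_start <- as.Date("1990-01-01")
range_end <- as.Date("2020-12-31")

# Base R logical indexing: the NA condition yields an all-NA row
in_range <- toy$first_publish_date >= range_start &
  toy$first_publish_date <= range_end
base_subset <- toy[in_range, ]

# dplyr::filter() keeps only rows where the condition is TRUE
safe_subset <- toy %>%
  filter(first_publish_date >= range_start,
         first_publish_date <= range_end)

print(nrow(base_subset))
print(nrow(safe_subset))
```

Wrapping the base R condition in `which()` would have the same effect as `filter()` in this toy case.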
#Q5. Remove the following columns from the data set: publish_date, edition, characters, price, genres, setting, and isbn.
#install.packages("dplyr")
library(dplyr)
books_subset1 <- books_subset1 %>%
  select(-publish_date, -edition, -characters, -price, -genres, -setting, -isbn)
books_subset1
#----------------------------Data Analysis------------------------------
#Q1. Use the glimpse function to produce a long view of the dataset.
glimpse(books_subset1)
#Q2. Use the summary function to produce a breakdown of the statistics of the dataset.
summary(books_subset1)
#Q3. Create a rating histogram with the following criteria.
#– The y-axis is labeled “Number of Books.”
#– The x-axis is labeled “Rating.”
#– The title of the graph “Histogram of Book Ratings.”
#– The graph is filled with the color “red.”
#– Set a binwidth of .25.
#– Use theme_bw().
library(ggplot2)
ggplot(data = books_subset1, aes(x = rating)) +
  geom_histogram(binwidth = 0.25, fill = "red") +
  labs(y = "Number of Books", x = "Rating", title = "Histogram of Book Ratings") +
  theme_bw()
## Warning: Removed 22495 rows containing non-finite values (`stat_bin()`).
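The warning above reports rows whose plotted value is missing or otherwise non-finite (`NA`, `NaN`, `Inf`), which ggplot2 drops before binning. A minimal sketch of how that count could be checked before plotting is below; the vector `ratings` is made-up illustration data, not the report's column.

```r
# Hypothetical rating values including missing and non-finite entries
ratings <- c(4.2, 3.9, NA, 4.7, NaN, Inf, 3.1)

# is.finite() is FALSE for NA, NaN, Inf and -Inf
n_dropped <- sum(!is.finite(ratings))
n_plotted <- sum(is.finite(ratings))

print(n_dropped)
print(n_plotted)
```

Running the same check on the real rating column would show whether the dropped rows are genuine missing ratings or rows introduced by an earlier step.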
#Q4. Create a boxplot of the number of pages per book in the dataset with the following requirements.
#– The boxplot is horizontal.
#– The x-axis is labeled “Pages.”
#– The title is “Box Plot of Page Counts.”
#– Fill the boxplot with the color magenta.
#– Use the theme theme_economist from the ggthemes package.
#install.packages("ggthemes")
library(ggthemes)
ggplot(data = books_subset1, aes(x = pages)) +
  geom_boxplot(fill = "magenta") +
  labs(x = "Pages", title = "Box Plot of Page Counts") +
  coord_cartesian(xlim = c(0, 1200)) +
  theme_economist()
## Warning: Removed 23294 rows containing non-finite values (`stat_boxplot()`).