Project 3 - Exploring Visualizations
ALY6000 71053 Introduction to Analytics SEC 27
Module 3
Prepared by:
Anvita Vyas (NUID: 002962386)
For:
Prof. Herath Gedara, Chinthaka Pathum Dinesh
Submission Date:
10 October 2023
Introduction
The third project of the course presents an opportunity to visualize data through R programming. The project entails working with the books dataset and involves data cleaning and data analysis through visualization. It culminates in a comprehensive analysis of the dataset, giving us the chance to examine the data visually and to understand the different relationships between the variables.
Overview
In this project, we analyze a dataset about books collected from Goodreads. The dataset includes details such as book titles, authors, ratings, page counts, and more. The objective of this assignment is to explore functions for data cleaning and exploratory data analysis and to create compelling visualizations. We also explore functions that support basic statistics by computing population and sample statistics. The aim of the assignment is to draw insights from the data through visualization and through an understanding of these statistics.
Key Findings
1. Data Processing
a. This step covered data loading and cleaning. The first task was loading the required packages: the tidyverse was loaded with p_load(tidyverse), and the CSV file was read in with read.csv(). To work with dates and times we loaded library(lubridate). For data cleaning we loaded dplyr with library(dplyr) and janitor with p_load(janitor). To support data visualization we loaded library(ggplot2), library(plotly), library(ggQC), and library(ggthemes). A rough sketch of this setup is shown below.
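A minimal sketch of the setup, assuming the pacman package (which provides p_load()) is already installed; the file name books.csv is an assumption, since the report does not show it outside the figures.

library(pacman)     # provides p_load(), which installs a missing package and then attaches it
p_load(tidyverse)   # dplyr, ggplot2, readr, tibble, etc.
p_load(janitor)     # clean_names()
library(lubridate)  # mdy(), year()
library(dplyr)      # already attached by tidyverse; listed for completeness
library(ggplot2)    # also attached by tidyverse
library(plotly)     # interactive versions of ggplot2 charts
library(ggQC)       # stat_pareto() for the Pareto chart
library(ggthemes)   # theme_economist(), theme_clean()

books_raw <- read.csv("books.csv")   # file name is an assumption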
2. Data Cleaning
a. A large part of this assignment focused on cleaning the data so that it was easy to work with. Column names were standardized with clean_names(), and the publication dates were parsed and reduced to years with the mdy() and year() functions.
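A hedged sketch of the cleaning step follows. The data frame name books_raw and the date column first_publish_date (stored as month/day/year text) are assumptions, since the actual column names appear only in the report's figures.

books <- books_raw %>%
  clean_names() %>%                                      # standardize names to snake_case
  mutate(first_publish_date = mdy(first_publish_date),   # parse "10/10/2023"-style strings into dates
         publish_year = year(first_publish_date))        # extract the year for later filtering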
3. Data Manipulation
a. For data manipulation, we used the vector function c() to specify the data we required, together with the select(), filter(), arrange(), group_by(), and mutate() functions. This kept the data focused on what was needed and made the analysis clearer.
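A hedged sketch of the manipulation step; the column names title, rating, pages, publisher, and publish_year, and the publisher names passed to c(), are illustrative assumptions.

books_focus <- books %>%
  select(title, rating, pages, publisher, publish_year) %>%   # keep only the columns needed
  filter(publish_year >= 1990, !is.na(rating)) %>%             # drop old or unrated books
  arrange(desc(rating)) %>%                                    # best-rated books first
  group_by(publisher) %>%
  mutate(rating_vs_publisher = rating - mean(rating)) %>%      # rating relative to the publisher average
  ungroup()

# c() builds a plain vector, e.g. a hand-picked set of publishers to keep.
keep_publishers <- c("Penguin Books", "Vintage", "HarperCollins")
books_focus <- filter(books_focus, publisher %in% keep_publishers)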
4. Statistical Analysis
a. This assignment used descriptive statistics such as the mean, computed with mean(), which was introduced previously. It also introduced new ways to conduct statistical analysis, such as creating custom R functions with function() to compute the average, population variance, and population standard deviation of the book ratings. In addition, we explored sample statistics by drawing three random samples of 100 books from the dataset and computing the sample mean, variance, and standard deviation for each, which allowed a comparison with the population values. A sketch of these functions follows the figure captions below.
Fig 1. Input to create custom functions
Fig 2. Output for creating own functions
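Because the custom functions appear only as screenshots (Fig 1 and Fig 2), the following is a rough sketch of what they and the sampling step might look like; the function names and the rating column are assumptions.

# Population formulas divide by n, unlike the built-in var() and sd(),
# which use the sample (n - 1) denominator.
get_average      <- function(x) sum(x) / length(x)
get_pop_variance <- function(x) sum((x - get_average(x))^2) / length(x)
get_pop_sd       <- function(x) sqrt(get_pop_variance(x))

ratings <- books$rating
get_average(ratings); get_pop_variance(ratings); get_pop_sd(ratings)

# Three random samples of 100 books each, with sample statistics for comparison.
set.seed(123)                                            # arbitrary seed for reproducibility
samples <- replicate(3, sample(ratings, 100), simplify = FALSE)
sapply(samples, mean)   # sample means
sapply(samples, var)    # sample variances
sapply(samples, sd)     # sample standard deviations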
5. Data Visualization
a. Several functions were used to support data visualization during the assignment. The glimpse() function was used to view the dataset in a compact, transposed layout, and the ggplot() function was used to build scatter plots and histograms to the required specifications. These were great tools for visualizing the data and gaining insights to make recommendations.
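A small sketch of this step, pairing glimpse() with an example scatter plot; pages and rating are the same assumed column names used above.

glimpse(books)   # compact, transposed view of every column and its type

ggplot(books, aes(x = pages, y = rating)) +
  geom_point(alpha = 0.4) +                       # semi-transparent points to show overplotting
  labs(x = "Pages", y = "Rating", title = "Rating vs. Page Count") +
  theme_bw()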
The key findings demonstrate how various operations and functions in R programming can help
with data analysis to extract meaningful information from datasets.
Key Visualizations
Question 3 (Data Analysis)
Create a rating histogram with the following criteria:
– The y-axis is labeled “Number of Books.”
– The x-axis is labeled “Rating.”
– The title of the graph is “Histogram of Book Ratings.”
– The graph is filled with the color “red.”
– Set a binwidth of .25.
– Use theme_bw().
Fig 3. Input to create a histogram
Fig 4. Output: the histogram
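The histogram code itself is shown only as a screenshot (Fig 3); a sketch matching the stated criteria might look like this, again assuming the rating column name.

ggplot(books, aes(x = rating)) +
  geom_histogram(binwidth = 0.25, fill = "red") +   # binwidth of .25, red fill
  labs(x = "Rating", y = "Number of Books",
       title = "Histogram of Book Ratings") +
  theme_bw()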
Question 4 (Data Analysis)
Create a boxplot of the number of pages per book in the dataset with the following requirements:
– The boxplot is horizontal.
– The x-axis is labeled “Pages.”
– The title is “Box Plot of Page Counts.”
– Fill the boxplot with the color magenta.
– Use the theme theme_economist from the ggthemes package.
Fig 5. Input to create a box plot
Fig 6. Output: the box plot
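A sketch matching the boxplot requirements, assuming the pages column name; mapping pages to the x aesthetic yields a horizontal boxplot directly.

ggplot(books, aes(x = pages)) +
  geom_boxplot(fill = "magenta") +                  # horizontal box, magenta fill
  labs(x = "Pages", title = "Box Plot of Page Counts") +
  theme_economist()                                 # from the ggthemes package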
Question 6 (Data Analysis)
Using the data frame constructed in the prior problem, create a Pareto Chart with an ogive of
cumulative counts formatted with the following additional criteria:
– The bars are filled with the color cyan.
– The x-axis label is “Publisher.”
– The y-axis label is “Number of Books.”
– The title is “Pareto and Ogive of Publisher Book Counts (1990 - 2020).”
– Use the theme theme_clean().
– Rotate the x-axis labels by 45 degrees (consider the ggeasy package).
Fig 7. Input to create a Pareto Chart
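A sketch of the Pareto chart, assuming the prior problem produced a data frame publisher_counts with one row per publisher and a count column n (both names are assumptions); stat_pareto() from ggQC draws the descending bars together with the cumulative ogive.

ggplot(publisher_counts, aes(x = publisher, y = n)) +
  stat_pareto(bars.fill   = c("cyan", "cyan"),      # cyan bars
              point.color = "black",
              line.color  = "black") +              # ogive of cumulative counts
  labs(x = "Publisher", y = "Number of Books",
       title = "Pareto and Ogive of Publisher Book Counts (1990 - 2020)") +
  theme_clean() +                                   # from ggthemes
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # or ggeasy::easy_rotate_x_labels(45), as the question suggests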