Project 3 - Exploring Visualizations
ALY6000 71053 Introduction to Analytics SEC 27
Module 3
Prepared by:
Anvita Vyas (NUID: 002962386)
For:
Prof. Herath Gedara, Chinthaka Pathum Dinesh
Submission Date:
10 October 2023
Introduction
The third project of the course presents an opportunity to visualize data through R programming. The project entails working with the books dataset and involves data cleaning and data analysis through visualization. It culminates in a comprehensive analysis of the dataset, giving us the chance to examine the data visually and to understand the different relationships between the variables.
Overview
In this project, we analyze a dataset about books collected from Goodreads. The dataset includes details such as book titles, authors, ratings, page counts, and more. The objective of this assignment is to explore functions for data cleaning and exploratory data analysis and to create compelling visualizations. We also explore functions that support basic statistics by computing population and sample statistics. The aim of the assignment is to draw insights from the data through visualization and through an understanding of these statistics.
Key Findings
1. Data Processing
a. This step covered data loading and cleaning. The first task was loading the required packages: the tidyverse was loaded with p_load(tidyverse), and the CSV file was read in with read.csv(). To work with dates and times we loaded library(lubridate). For data cleaning we loaded dplyr with library(dplyr) and janitor with p_load(janitor). To support data visualization we loaded library(ggplot2), library(plotly), library(ggQC), and library(ggthemes). A rough sketch of this setup is shown below.
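A minimal sketch of the setup, assuming the pacman package (which provides p_load()) is already installed; the file name books.csv is an assumption, since the report does not show it outside the figures.

library(pacman)     # provides p_load(), which installs a missing package and then attaches it
p_load(tidyverse)   # dplyr, ggplot2, readr, tibble, etc.
p_load(janitor)     # clean_names()
library(lubridate)  # mdy(), year()
library(dplyr)      # already attached by tidyverse; listed for completeness
library(ggplot2)    # also attached by tidyverse
library(plotly)     # interactive versions of ggplot2 charts
library(ggQC)       # stat_pareto() for the Pareto chart
library(ggthemes)   # theme_economist(), theme_clean()

books_raw <- read.csv("books.csv")   # file name is an assumption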
2. Data Cleaning
a. A large part of this assignment focused on cleaning the data so that it was easy to work with. Column names were standardized with clean_names(), and the publication dates were parsed and reduced to years with the mdy() and year() functions.
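A hedged sketch of the cleaning step follows. The data frame name books_raw and the date column first_publish_date (stored as month/day/year text) are assumptions, since the actual column names appear only in the report's figures.

books <- books_raw %>%
  clean_names() %>%                                      # standardize names to snake_case
  mutate(first_publish_date = mdy(first_publish_date),   # parse "10/10/2023"-style strings into dates
         publish_year = year(first_publish_date))        # extract the year for later filtering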
3. Data Manipulation
a. For data manipulation, we used the vector function c() to specify the data we required, together with the select(), filter(), arrange(), group_by(), and mutate() functions. This kept the data focused on what was needed and made the analysis clearer.
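A hedged sketch of the manipulation step; the column names title, rating, pages, publisher, and publish_year, and the publisher names passed to c(), are illustrative assumptions.

books_focus <- books %>%
  select(title, rating, pages, publisher, publish_year) %>%   # keep only the columns needed
  filter(publish_year >= 1990, !is.na(rating)) %>%             # drop old or unrated books
  arrange(desc(rating)) %>%                                    # best-rated books first
  group_by(publisher) %>%
  mutate(rating_vs_publisher = rating - mean(rating)) %>%      # rating relative to the publisher average
  ungroup()

# c() builds a plain vector, e.g. a hand-picked set of publishers to keep.
keep_publishers <- c("Penguin Books", "Vintage", "HarperCollins")
books_focus <- filter(books_focus, publisher %in% keep_publishers)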
4. Statistical Analysis
a. This assignment used descriptive statistics such as the mean, computed with mean(), which was introduced previously. It also introduced new ways to conduct statistical analysis, such as creating custom R functions with function() to compute the average, population variance, and population standard deviation of the book ratings. In addition, we explored sample statistics by drawing three random samples of 100 books from the dataset and computing the sample mean, variance, and standard deviation for each, which allowed a comparison with the population values. A sketch of these functions follows the figure captions below.
Fig 1. Input to create custom functions
Fig 2. Output for creating own functions
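Because the custom functions appear only as screenshots (Fig 1 and Fig 2), the following is a rough sketch of what they and the sampling step might look like; the function names and the rating column are assumptions.

# Population formulas divide by n, unlike the built-in var() and sd(),
# which use the sample (n - 1) denominator.
get_average      <- function(x) sum(x) / length(x)
get_pop_variance <- function(x) sum((x - get_average(x))^2) / length(x)
get_pop_sd       <- function(x) sqrt(get_pop_variance(x))

ratings <- books$rating
get_average(ratings); get_pop_variance(ratings); get_pop_sd(ratings)

# Three random samples of 100 books each, with sample statistics for comparison.
set.seed(123)                                            # arbitrary seed for reproducibility
samples <- replicate(3, sample(ratings, 100), simplify = FALSE)
sapply(samples, mean)   # sample means
sapply(samples, var)    # sample variances
sapply(samples, sd)     # sample standard deviations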
5. Data Visualization
a. Several functions were used to support data visualization during the assignment. The glimpse() function was used to view the dataset in a compact, transposed layout, and the ggplot() function was used to build scatter plots and histograms to the required specifications. These were great tools for visualizing the data and gaining insights to make recommendations.
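A small sketch of this step, pairing glimpse() with an example scatter plot; pages and rating are the same assumed column names used above.

glimpse(books)   # compact, transposed view of every column and its type

ggplot(books, aes(x = pages, y = rating)) +
  geom_point(alpha = 0.4) +                       # semi-transparent points to show overplotting
  labs(x = "Pages", y = "Rating", title = "Rating vs. Page Count") +
  theme_bw()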
The key findings demonstrate how various operations and functions in R programming can help
with data analysis to extract meaningful information from datasets.
Key Visualizations
Question 3 (Data Analysis)
Create a rating histogram with the following criteria:
– The y-axis is labeled “Number of Books.”
– The x-axis is labeled “Rating.”
– The title of the graph is “Histogram of Book Ratings.”
– The graph is filled with the color “red.”
– Set a binwidth of .25.
– Use theme_bw().
Fig 3. Input to create a histogram
Fig 4. Output: the histogram
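The histogram code itself is shown only as a screenshot (Fig 3); a sketch matching the stated criteria might look like this, again assuming the rating column name.

ggplot(books, aes(x = rating)) +
  geom_histogram(binwidth = 0.25, fill = "red") +   # binwidth of .25, red fill
  labs(x = "Rating", y = "Number of Books",
       title = "Histogram of Book Ratings") +
  theme_bw()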
Question 4 (Data Analysis)
Create a boxplot of the number of pages per book in the dataset with the following requirements:
– The boxplot is horizontal.
– The x-axis is labeled “Pages.”
– The title is “Box Plot of Page Counts.”
– Fill the boxplot with the color magenta.
– Use the theme theme_economist from the ggthemes package.
Fig 5. Input to create a box plot
Fig 6. Output: the box plot
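A sketch matching the boxplot requirements, assuming the pages column name; mapping pages to the x aesthetic yields a horizontal boxplot directly.

ggplot(books, aes(x = pages)) +
  geom_boxplot(fill = "magenta") +                  # horizontal box, magenta fill
  labs(x = "Pages", title = "Box Plot of Page Counts") +
  theme_economist()                                 # from the ggthemes package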
Question 6 (Data Analysis)
Using the data frame constructed in the prior problem, create a Pareto Chart with an ogive of
cumulative counts formatted with the following additional criteria:
– The bars are filled with the color cyan.
– The x-axis label is “Publisher.”
– The y-axis label is “Number of Books.”
– The title is “Pareto and Ogive of Publisher Book Counts (1990 - 2020).”
– Use the theme theme_clean().
– Rotate the x-axis labels by 45 degrees (consider the ggeasy package).
Fig 7. Input to create a Pareto Chart
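A sketch of the Pareto chart, assuming the prior problem produced a data frame publisher_counts with one row per publisher and a count column n (both names are assumptions); stat_pareto() from ggQC draws the descending bars together with the cumulative ogive.

ggplot(publisher_counts, aes(x = publisher, y = n)) +
  stat_pareto(bars.fill   = c("cyan", "cyan"),      # cyan bars
              point.color = "black",
              line.color  = "black") +              # ogive of cumulative counts
  labs(x = "Publisher", y = "Number of Books",
       title = "Pareto and Ogive of Publisher Book Counts (1990 - 2020)") +
  theme_clean() +                                   # from ggthemes
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # or ggeasy::easy_rotate_x_labels(45), as the question suggests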