class3_hw_guideline.pdf
School
University of Rochester
Course
240
Subject
Industrial Engineering
Date
Feb 20, 2024
Uploaded by Qing1991
Data pipeline in a pinch
You are the lead data product manager at a hot new AI marketing startup called Sell-o-gram. The founder has a new idea for a clustering algorithm that identifies and targets profitable customers based on their previous purchase history. She wants a prototype of the algorithm completed within a sprint and has tasked your team with preparing the data for the data scientists. Unfortunately, all your engineers came down with malaria after a company retreat to a remote island, and you are the last person standing on your team. Your goal is to deliver the data as a CSV file to the data scientists within one week, so that they have time to work on the algorithm. No other engineers at the company are available to help. The company's technical fate lies in your hands.
Problem 1 (Load the dataset)
• Load the Pandas library and read the dataset (sales_data_sample.csv) in Pandas.
Problem 2 (Duplicates)
• Are there any duplicate rows in this dataset?
• If so, could you remove them? Could you verify that the duplicated rows have been removed?
Problem 3 (NaN Values)
• For each column, are there any NaN values?
• How do you plan to fill those NaN values? Explain your reasoning and approach. If you decide to leave the NaN values as they are, please explain your reasoning as well.
Problem 4 (Outliers)
• For each column, are there any outliers?
• If so, prove your point by creating visualizations of your choice.
• Is it reasonable to remove them? Provide your reasoning.
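One possible starting point for the outlier check is the 1.5 × IQR rule, sketched below on made-up numbers. The rule itself, the helper name iqr_outliers, and the toy values are assumptions for illustration; the assignment does not prescribe a particular method.

```python
import pandas as pd

# Toy values standing in for one numeric column of the dataset (made up).
df = pd.DataFrame({"QUANTITYORDERED": [20, 25, 30, 35, 40, 45, 50, 500]})


def iqr_outliers(series, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Mask for a single column.
mask = iqr_outliers(df["QUANTITYORDERED"])

# Number of flagged values for every numeric column.
outlier_counts = {
    col: int(iqr_outliers(df[col]).sum())
    for col in df.select_dtypes("number").columns
}
print(outlier_counts)
```

For the visual evidence the problem asks for, a box plot per numeric column (for example df.boxplot(column="QUANTITYORDERED"), which requires matplotlib) shows the same fences graphically. A flagged value is only a candidate: whether to remove it depends on whether it is a data error or a genuine large order.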
Problem 5 (Erroneous Data)
• After inspecting the columns "QUANTITYORDERED," "PRICEEACH," and "SALES," you noticed that for some rows, "QUANTITYORDERED" times "PRICEEACH" is not equal to "SALES." You contacted the data vendor and asked about this issue. They responded that, during processing of the data, they set a constraint: when the price is greater than 100, it is automatically set to 100 to prevent some large numbers. Keeping this information in mind, could you fix this data quality issue?
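Given the vendor's explanation, one way to repair the capped prices is to recompute "PRICEEACH" as "SALES" divided by "QUANTITYORDERED" for the rows where the price sits at 100 and the product does not match. The sketch below assumes that "SALES" and "QUANTITYORDERED" are trustworthy and that a one-cent tolerance is acceptable; the toy rows are made up.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for sales_data_sample.csv (values are made up).
df = pd.DataFrame({
    "QUANTITYORDERED": [30, 34, 41, 20],
    "PRICEEACH": [95.70, 81.35, 100.00, 100.00],
    "SALES": [2871.00, 2765.90, 5125.00, 2000.00],
})

# Rows where quantity * price does not reproduce SALES (within a small tolerance).
expected = df["QUANTITYORDERED"] * df["PRICEEACH"]
mismatch = ~np.isclose(expected, df["SALES"], atol=0.01)

# Only touch mismatching rows whose price sits exactly at the vendor's cap of 100.
capped = mismatch & np.isclose(df["PRICEEACH"], 100.0)

# Recover the uncapped price from SALES / QUANTITYORDERED.
df.loc[capped, "PRICEEACH"] = (
    df.loc[capped, "SALES"] / df.loc[capped, "QUANTITYORDERED"]
)

# Verify: any rows still inconsistent after the fix need a separate look.
still_off = ~np.isclose(
    df["QUANTITYORDERED"] * df["PRICEEACH"], df["SALES"], atol=0.01
)
print(df)
print("rows still inconsistent:", int(still_off.sum()))
```

The recovered price is left unrounded here, because rounding it to cents can reintroduce a small mismatch once it is multiplied by the quantity. A row priced at exactly 100 whose product already matches "SALES" is left alone.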
Problem 6 (Data Types and output result)
• For each column, determine the appropriate data type and rectify any incorrect ones.
• Export the final data frame to a CSV file.
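Problems 1, 2, 3 and 6 chain together into a load, de-duplicate, inspect, convert and export sequence. The sketch below runs that sequence on a small in-memory stand-in so that it is self-contained. The column names ORDERDATE and STATUS, the date format, and all values are assumptions; check df.columns and df.dtypes on the real sales_data_sample.csv before reusing any of it.

```python
import io

import pandas as pd

# In-memory stand-in for sales_data_sample.csv (made-up rows).
# ORDERDATE and STATUS are assumed column names for illustration.
raw = io.StringIO(
    "QUANTITYORDERED,PRICEEACH,SALES,ORDERDATE,STATUS\n"
    "30,95.70,2871.00,2/24/2003 0:00,Shipped\n"
    "30,95.70,2871.00,2/24/2003 0:00,Shipped\n"
    "34,81.35,2765.90,5/7/2003 0:00,\n"
)

# Problem 1: load the data (for the real file, pass its path instead of `raw`).
df = pd.read_csv(raw)

# Problem 2: count fully duplicated rows, drop them, and verify none remain.
n_dupes = int(df.duplicated().sum())
df = df.drop_duplicates().reset_index(drop=True)
assert not df.duplicated().any()

# Problem 3: NaN count per column, as a basis for deciding how to fill them.
nan_counts = df.isna().sum()

# Problem 6: fix data types, here a date stored as text and a label column.
df["ORDERDATE"] = pd.to_datetime(df["ORDERDATE"], format="%m/%d/%Y %H:%M")
df["STATUS"] = df["STATUS"].astype("category")

# Export; with no path given, to_csv returns the CSV text instead of writing a file.
csv_text = df.to_csv(index=False)
print(n_dupes, nan_counts.to_dict(), df.dtypes.to_dict(), sep="\n")
```

To write an actual file, pass a path, for example df.to_csv("clean_sales_data.csv", index=False), where the file name is hypothetical. If pd.read_csv raises a UnicodeDecodeError on the real file, passing an explicit encoding argument such as encoding="latin1" is one thing to try.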