class3_hw_guideline.pdf

School: University of Rochester
Course: 240
Subject: Industrial Engineering
Date: Feb 20, 2024
Uploaded by Qing1991

Data pipeline in a pinch

You are the lead data product manager at a hot new AI marketing startup called Sell-o-gram. The founder has a new idea for a clustering algorithm to identify and target profitable customers based on their previous purchase history. She wants a prototype of the algorithm completed in a sprint and has tasked your team with preparing the data for the data scientists. Unfortunately, all your engineers came down with malaria after a company retreat to a remote island, and you are the last person standing on your team. Your goal is to deliver the data as a CSV file to the data scientists within one week to give them time to work on the algorithm. No other engineers at the company are available to help. The company's technical fate lies in your hands.

Problem 1 (Load the dataset)
• Load the Pandas library and read the dataset (sales_data_sample.csv) into Pandas.

Problem 2 (Duplicates)
• Are there any duplicate rows in this dataset?
• If so, remove them, and verify that the duplicated rows have actually been removed.

Problem 3 (NaN Values)
• For each column, are there any NaN values?
• How do you plan to fill those NaN values? Explain your reasoning and approach. If you decide to leave the NaN values as they are, explain that reasoning as well.

Problem 4 (Outliers)
• For each column, are there any outliers?
• If so, prove your point by creating visualizations of your choice.
• Is it reasonable to remove them? Provide your reasoning.

Problem 5 (Erroneous Data)
• After inspecting the columns "QUANTITYORDERED," "PRICEEACH," and "SALES," you noticed that for some rows, "QUANTITYORDERED" times "PRICEEACH" is not equal to "SALES." You contacted the data vendors and asked about this issue. They responded, "During processing of the data, they set a constraint when price is greater than 100; it will be automatically set to 100 to prevent some large numbers". Keeping this information in mind, could you fix this data quality issue?

Problem 6 (Data Types and output result)
• For each column, determine the appropriate data type and rectify any incorrect ones.
• Export the final data frame to a CSV file.
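The checks in Problems 2 to 4 could be sketched as follows. This is only an illustration on a made-up six-row table, not the real sales_data_sample.csv: the ORDERNUMBER and STATE columns, their values, and the helper name iqr_outlier_mask are assumptions, and the 1.5 × IQR rule is one common outlier heuristic rather than the method the assignment prescribes.

```python
import pandas as pd


def iqr_outlier_mask(series, k=1.5):
    """Flag values lying more than k * IQR outside the quartiles (hypothetical helper)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Made-up stand-in for the real file; with the real data one would start from
# pd.read_csv("sales_data_sample.csv") instead.
toy = pd.DataFrame({
    "ORDERNUMBER": [1, 2, 2, 3, 4, 5],
    "QUANTITYORDERED": [30, 25, 25, 28, 32, 400],
    "STATE": ["NY", "CA", "CA", None, "NY", "CA"],
})

# Problem 2: count fully duplicated rows, drop them, then verify none remain.
n_dupes = int(toy.duplicated().sum())
deduped = toy.drop_duplicates().reset_index(drop=True)
remaining_dupes = int(deduped.duplicated().sum())

# Problem 3: number of missing values per column.
nan_counts = deduped.isna().sum()

# Problem 4: rows whose quantity falls outside the IQR fences.
outliers = deduped[iqr_outlier_mask(deduped["QUANTITYORDERED"])]

print(n_dupes, remaining_dupes)
print(nan_counts)
print(outliers)
```

For the visualization part of Problem 4, a box plot or histogram of each numeric column is a natural companion to the mask. Whether a flagged row should be dropped is a judgment call: a quantity of 400 might be a data entry error or a genuine bulk order.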
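One possible reading of the vendor's note in Problem 5 is that a PRICEEACH of exactly 100 may hide a larger true price, while SALES still reflects the real total. Under that assumption (it is an assumption, not something the vendor confirms), the original price can be recovered as SALES divided by QUANTITYORDERED for rows where the product does not match. The function name restore_capped_price and the three sample rows below are invented for illustration.

```python
import numpy as np
import pandas as pd


def restore_capped_price(df, cap=100.0):
    """Return a copy in which capped PRICEEACH values are recomputed from SALES.

    Assumes SALES is trustworthy and that only prices sitting exactly at the
    cap were altered by the vendor (hypothetical helper, not vendor code).
    """
    out = df.copy()
    expected = out["QUANTITYORDERED"] * out["PRICEEACH"]
    mismatch = ~np.isclose(expected, out["SALES"])   # quantity * price != sales
    capped = np.isclose(out["PRICEEACH"], cap)       # price sits at the cap
    fix = mismatch & capped & (out["QUANTITYORDERED"] > 0)
    out.loc[fix, "PRICEEACH"] = (
        out.loc[fix, "SALES"] / out.loc[fix, "QUANTITYORDERED"]
    )
    return out


# Made-up rows: the second one has a price that was capped at 100.
toy = pd.DataFrame({
    "QUANTITYORDERED": [30, 20, 10],
    "PRICEEACH": [95.70, 100.00, 100.00],
    "SALES": [2871.00, 2500.00, 1000.00],
})

fixed = restore_capped_price(toy)
print(fixed)
```

In this sketch only the second row changes, because 20 × 100 does not equal 2500. The third row also sits at 100 but is left alone since its product already matches SALES. Rows that mismatch without being at the cap are not touched and would need separate investigation.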
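For Problem 6, a minimal sketch of type conversion and export might look like the following. The ORDERDATE and STATUS columns, their values, and the date format string are assumptions made for the example; the real file may need different columns and formats. The CSV is written to an in-memory buffer here so the snippet runs without touching disk; passing a file path to to_csv would write an actual file.

```python
import io

import pandas as pd

# Made-up rows in which every column arrives as text.
toy = pd.DataFrame({
    "QUANTITYORDERED": ["30", "20"],
    "ORDERDATE": ["2/24/2003 0:00", "5/7/2003 0:00"],
    "STATUS": ["Shipped", "Shipped"],
})

typed = toy.assign(
    # Counts become integers.
    QUANTITYORDERED=pd.to_numeric(toy["QUANTITYORDERED"]),
    # Date strings become timestamps (format is an assumption).
    ORDERDATE=pd.to_datetime(toy["ORDERDATE"], format="%m/%d/%Y %H:%M"),
    # Low-cardinality labels become categoricals.
    STATUS=toy["STATUS"].astype("category"),
)

# index=False keeps the row index out of the exported file.
buffer = io.StringIO()
typed.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(typed.dtypes)
print(csv_text)
```

Note that CSV does not store data types, so the data scientists will still need to parse dates again when they load the file.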
