Assignment_4_Cleaning_Data_Set_Instructions
.docx
keyboard_arrow_up
School
Indiana University, Bloomington *
*We aren’t endorsed by this school
Course
BUSH352
Subject
Computer Science
Date
Feb 20, 2024
Type
docx
Pages
4
Uploaded by emilyguse09
Assignment 4: Cleaning a Data Set
Task: Identify and Correct Errors in a Data Set
From Canvas, download and save: BtownOrders(Jan2019)_assignment_data.xlsx and
Assignment4_Cleaning_Data_Set_Answersheet.docx
Scenario: You have done such a good job with forecasting and cleaning data from Btown Interactive that
they have asked you to do some further cleaning of their data. The sales group is suspicious that there might be errors in the data for January.
You will be working with a new set of 3,296 orders with 5,182 line items from January 2019. The data is in a file called “
BtownOrders(Jan2019)_assignment_data.xlsx
.” A “line item” is just an order for a specified number of a particular product – there can be multiple line items per order.
You’ll be looking for errors in the data in several places:
1.
Errors in the product names
.
2.
Errors in the promotional codes
.
3.
Errors in the total_product_price
.
You will find, document, and correct the errors in the Excel workbook.
Deliverables:
Complete the Assignment4_Cleaning_Data_Set_Answersheet.docx
associated with this assignment on Canvas. Submit the completed answer sheet with the “cleaned” Excel file on Canvas by the
due date.
Evaluation:
You will be graded based on the number of correct answers. There are 10 questions overall. Be sure to insert rows in the answer sheet where appropriate to answer the
questions completely. 1
Make sure you complete the in-class exercise “Locating Bad Data in Excel” before attempting this assignment! See the W6C2
in-class assignment on Canvas for guidance.
Part 1: Errors in Product Names
Verify that the product names (Column J) are correct using the master list in the Lookups
tab
and correct any errors. You can assume the information in the Lookups tab is always right. So
if there
is a mismatch, the error is in your data set. To do this, you will use the MATCH function (use W6C2 in-class exercise as a guide). Place your MATCH
calculation in column N of the “Btown Orders (Jan)”
worksheet. Title of the column “
ProdMatch
” (in cell N1) and start your MATCH formulas in cell N2.
HINT: Using the Sort and Filter features in Excel can also help you. You’ve used both of those features in the exercises we’ve done so far in this course.
ANOTHER HINT: Remember, there is a list of correct product names in the Lookups worksheet.
Using the Assignment4_Cleaning_Data_Set_Answersheet.docx
provided with this assignment, answer the following questions:
1)
How many line items (rows) had incorrect product names?
2)
List the products names with errors, listing the incorrect name, the corrected name, and how many rows of data had the error. The product names are only off by a couple of letters and should be easily identified in the Lookups list. Correct all invalid product names in the data. (Try sorting by product_name
. You only need to list each incorrect product name once.) Now fix the incorrect product names in the “
Btown Orders (Jan)
” worksheet. Use caution when sorting!!! Refer to item #4 from the Taber reading
.
HINT: Use “Find and Replace” to speed up fixing the errors. Part 2: Errors in Promotional Codes
Verify that the promotional codes (Column E) are correct using the master list in the Lookups tab and correct any errors. Use the MATCH function and place your function in Column O of the
“
Btown Orders (Jan)
” worksheet. Make the title of the column “
PromMatch
” (in cell O1) and start your MATCH formulas in cell O2.
Answer the following questions:
1)
How many line items (rows) had incorrect promotional codes? The value in the promo_code field should match the Lookups data sheet exactly. 2)
List the promotional codes with errors, listing the incorrect codes, the corrected 2
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
Related Questions
Task 3:
The Driver Relationship team wants to create some workshops and increase communication with the active drivers in InstantRide. Therefore, they requested a new database table to store the driver details of the drivers that have had at least one ride in the system. Create a new table, ACTIVE_DRIVERS, from the DRIVERS and TRAVELS tables which contains the following fields:
DRIVER_ID CHAR(5) (Primary key)
DRIVER_FIRST_NAME VARCHAR(20)
DRIVER_LAST_NAME VARCHAR(20)
DRIVER_DRIVING_LICENSE_ID VARCHAR(10)
DRIVER_DRIVING_LICENSE_CHECKED BOOL
DRIVER_RATING FLOAT
SQL Database Test:
Create a new table to store information on active drivers
Test Query: SELECT * FROM ACTIVE_DRIVERS
Expected Results:
DRIVER_ID
DRIVER_FIRST_NAME
DRIVER_LAST_NAME
DRIVER_DRIVING_LICENSE_ID
DRIVER_DRIVING_LICENSE_CHECKED
DRIVER_RATING
2001
Willie
Butler
1874501
1
4.4
2002
Justin
Howard
1953853
1
4.8
2003
Anthony
Walker
1735487
1
3.5
Task 4:
The Driver Relationship team wants to have quick…
arrow_forward
Task 3:
The Driver Relationship team wants to create some workshops and increase communication with the active drivers in InstantRide. Therefore, they requested a new database table to store the driver details of the drivers that have had at least one ride in the system. Create a new table, ACTIVE_DRIVERS, from the DRIVERS and TRAVELS tables which contains the following fields:
DRIVER_ID CHAR(5) (Primary key)
DRIVER_FIRST_NAME VARCHAR(20)
DRIVER_LAST_NAME VARCHAR(20)
DRIVER_DRIVING_LICENSE_ID VARCHAR(10)
DRIVER_DRIVING_LICENSE_CHECKED BOOL
DRIVER_RATING FLOAT
ANSWER IN MYSQL PLEASE
arrow_forward
Task 9:The StayWell property management team wants to add two additional properties and run some simulation tests relating to market coverage. Add the following properties to the PROPERTY table:
However, the team does not want this data in the database once they complete their operations. You need to create a transaction with the changes rolled back.
arrow_forward
MySql Workbench
CREATE TABLE students ( id INT PRIMARY KEY, first_name VARCHAR(50), last_name VARCHAR(50), age INT, major VARCHAR(50), faculty VARCHAR(50));
CREATE TABLE location ( id INT PRIMARY KEY, name VARCHAR(50), rooms INT);
CREATE TABLE faculty ( id INT PRIMARY KEY, name VARCHAR(50), department_id INT);
1. List last name of all students whose first name is longer than 4 letters in ascending order accordingto the last name. Duplicated rows should be removed from the output.2. Count the total number of rooms in Location.3. Find the number of students in each major.4. Find the number of employees in each department who get no commission or have salary less than5000.5. Find the maximum salary of employees in each department that the employee was hired 15 yearsbefore now. *hint: user TIMESTAMPDIFF(<unit type>,<Date_value 1>,<Date_value 2>), the unitcan be YEAR, MONTH, DAY, HOUR, etc...
arrow_forward
Use the iloc() function to extract the first 20 features of the dataframe har_train. Save this new dataframe to first_twenty. Next, using the seaborn library create a heatmap for the correlation matrix. First you have to create the correlation matrix from the pandas dataframe (save it in a dataframe called corr) and then plot it using seaborn with these customizations: Set the seaborn style to white. Generate a mask using np.triu(np.ones_like()) with the dtype as boolean to only show the lower triangle of the correlation matrix. Save it in a variable called mask. Set up the figure with matplotlib with figsize=(11,9). Use fig, ax = ... Generate a custom diverging colormap for the heatmap with the arguments (220, 10, as_cmap=True). Save it in a variable called cmap. Draw the heatmap with the mask and correct aspect ratio, using the arguments corr, mask=mask, cmap=cmap, vmax=.3, center=0, square=True, linewidths=.5, cbar_kws={"shrink": .5}). Finally, use fig.tight_layout() just before…
arrow_forward
Project 3 Details:This project is based on eight tables (AIRPORT, FLIGHT, DEPARTURES, PASSENGER,RESERVATION, EQUIP_TYPE, PILOTS, and TICKET) that contain data about the Belle Airlines. Downloadand use data in Project 3 zip file for this project.Some Background on Belle AirlinesBelle Airlines is a regional carrier that operates primarily in the southwestern United States. At thepresent time, Belle Airlines operates its own reservation information system. To simplify our analysis,we will assume that all reservations on Belle Airlines flights are placed through Belle Airlines employees.Flights are not booked through travel agents and Belle Airlines does not participate in industry-widereservations services. Each flight is assigned a unique flight number and has its own set of flightcharacteristics (ie. flight number, origin, destination, departure time, arrival time, meal code, base fare,mileage between origin and destination, and number of changes in time zone between the origin…
arrow_forward
Normalize the Index metadata connection and explain why you think it would be slower to use.
arrow_forward
The Driver Relationship team wants to create some workshops and increase communication with the active drivers in InstantRide. Therefore, they requested a new database table to store the driver details of the drivers that have had at least one ride in the system. Create a new table, ACTIVE_DRIVERS, from the DRIVERS and TRAVELS tables which contains the following fields:
DRIVER_ID CHAR(5) (Primary key)DRIVER_FIRST_NAME VARCHAR(20)DRIVER_LAST_NAME VARCHAR(20)DRIVER_DRIVING_LICENSE_ID VARCHAR(10)DRIVER_DRIVING_LICENSE_CHECKED BOOLDRIVER_RATING FLOAT
arrow_forward
The Driver Relationship team wants to create some workshops and increase communication with the active drivers in InstantRide. Therefore, they requested a new database table to store the driver details of the drivers that have had at least one ride in the system. Create a new table, ACTIVE_DRIVERS, from the DRIVERS and TRAVELS tables which contains the following fields:
DRIVER_ID CHAR(5) (Primary key)
DRIVER_FIRST_NAME VARCHAR(20)
DRIVER_LAST_NAME VARCHAR(20)
DRIVER_DRIVING_LICENSE_ID VARCHAR(10)
DRIVER_DRIVING_LICENSE_CHECKED BOOL
DRIVER_RATING FLOAT
The code is supposed to look something like this but I'm getting an error
CREATE TABLE ACTIVE_DRIVERS(
DRIVER_ID CHAR(5) NOT NULL,
DRIVER_FIRST_NAME VARCHAR(20) NULL,
DRIVER_LAST_NAME VARCHAR(20) NULL,
DRIVER_DRIVING_LICENSE_ID VARCHAR(10) NULL,
DRIVER_DRIVING_LICENSE_CHECKED BOOL NULL,
DRIVER_RATING FLOAT NULL
) as select D.DRIVER_ID, D.DRIVER_FIRST_NAME, D.DRIVER_LAST_NAME, D.DRIVER_DRIVING_LICENSE_ID,…
arrow_forward
Open the Missing Addresses query in Design view. Add a new column to determine if a customer does not have an address on file. If the customer’s Address is null, it should display Missing. If not, it should display nothing. Name the column AddressPresent. Add criteria of Missing to the column you just created, so only the customers missing an address display. Move the AddressPresent field so it appears between PhoneNumber and Address. Run the query. Ensure only customers with null Address fields display. Save and close the query.
arrow_forward
Help understanding / installing phpmyadmin and if i need wamp server aswell and where to copy and paste this code in order to create a database
arrow_forward
Create a new dataset called "life_exp_2" using subset() with the following variables: life expectancy,
polio vaccination, measles vaccination, obesity 9-15 year old's, basic water, alcohol consumption,
and current healthcare spending. Use View() and str) to check if you obtained all the relevant
variables in the new dataset.
arrow_forward
The Driver Relationship team wants to update the driving license id of an active driver:
Driver ID: 2003, New Driving License ID: 1735488
In addition, the team wants to do the update over the VIEW and also want to see the actual change in the DRIVERS table.
Task
Update a driver's record using the ACTIVE_DRIVERS VIEW.
arrow_forward
SEE MORE QUESTIONS
Recommended textbooks for you
Database System Concepts
Computer Science
ISBN:9780078022159
Author:Abraham Silberschatz Professor, Henry F. Korth, S. Sudarshan
Publisher:McGraw-Hill Education
Starting Out with Python (4th Edition)
Computer Science
ISBN:9780134444321
Author:Tony Gaddis
Publisher:PEARSON
Digital Fundamentals (11th Edition)
Computer Science
ISBN:9780132737968
Author:Thomas L. Floyd
Publisher:PEARSON
C How to Program (8th Edition)
Computer Science
ISBN:9780133976892
Author:Paul J. Deitel, Harvey Deitel
Publisher:PEARSON
Database Systems: Design, Implementation, & Manag...
Computer Science
ISBN:9781337627900
Author:Carlos Coronel, Steven Morris
Publisher:Cengage Learning
Programmable Logic Controllers
Computer Science
ISBN:9780073373843
Author:Frank D. Petruzella
Publisher:McGraw-Hill Education
Related Questions
- Task 3: The Driver Relationship team wants to create some workshops and increase communication with the active drivers in InstantRide. Therefore, they requested a new database table to store the driver details of the drivers that have had at least one ride in the system. Create a new table, ACTIVE_DRIVERS, from the DRIVERS and TRAVELS tables which contains the following fields: DRIVER_ID CHAR(5) (Primary key) DRIVER_FIRST_NAME VARCHAR(20) DRIVER_LAST_NAME VARCHAR(20) DRIVER_DRIVING_LICENSE_ID VARCHAR(10) DRIVER_DRIVING_LICENSE_CHECKED BOOL DRIVER_RATING FLOAT SQL Database Test: Create a new table to store information on active drivers Test Query: SELECT * FROM ACTIVE_DRIVERS Expected Results: DRIVER_ID DRIVER_FIRST_NAME DRIVER_LAST_NAME DRIVER_DRIVING_LICENSE_ID DRIVER_DRIVING_LICENSE_CHECKED DRIVER_RATING 2001 Willie Butler 1874501 1 4.4 2002 Justin Howard 1953853 1 4.8 2003 Anthony Walker 1735487 1 3.5 Task 4: The Driver Relationship team wants to have quick…arrow_forwardTask 3: The Driver Relationship team wants to create some workshops and increase communication with the active drivers in InstantRide. Therefore, they requested a new database table to store the driver details of the drivers that have had at least one ride in the system. Create a new table, ACTIVE_DRIVERS, from the DRIVERS and TRAVELS tables which contains the following fields: DRIVER_ID CHAR(5) (Primary key) DRIVER_FIRST_NAME VARCHAR(20) DRIVER_LAST_NAME VARCHAR(20) DRIVER_DRIVING_LICENSE_ID VARCHAR(10) DRIVER_DRIVING_LICENSE_CHECKED BOOL DRIVER_RATING FLOAT ANSWER IN MYSQL PLEASEarrow_forwardTask 9:The StayWell property management team wants to add two additional properties and run some simulation tests relating to market coverage. Add the following properties to the PROPERTY table: However, the team does not want this data in the database once they complete their operations. You need to create a transaction with the changes rolled back.arrow_forward
- MySql Workbench CREATE TABLE students ( id INT PRIMARY KEY, first_name VARCHAR(50), last_name VARCHAR(50), age INT, major VARCHAR(50), faculty VARCHAR(50)); CREATE TABLE location ( id INT PRIMARY KEY, name VARCHAR(50), rooms INT); CREATE TABLE faculty ( id INT PRIMARY KEY, name VARCHAR(50), department_id INT); 1. List last name of all students whose first name is longer than 4 letters in ascending order accordingto the last name. Duplicated rows should be removed from the output.2. Count the total number of rooms in Location.3. Find the number of students in each major.4. Find the number of employees in each department who get no commission or have salary less than5000.5. Find the maximum salary of employees in each department that the employee was hired 15 yearsbefore now. *hint: user TIMESTAMPDIFF(<unit type>,<Date_value 1>,<Date_value 2>), the unitcan be YEAR, MONTH, DAY, HOUR, etc...arrow_forwardUse the iloc() function to extract the first 20 features of the dataframe har_train. Save this new dataframe to first_twenty. Next, using the seaborn library create a heatmap for the correlation matrix. First you have to create the correlation matrix from the pandas dataframe (save it in a dataframe called corr) and then plot it using seaborn with these customizations: Set the seaborn style to white. Generate a mask using np.triu(np.ones_like()) with the dtype as boolean to only show the lower triangle of the correlation matrix. Save it in a variable called mask. Set up the figure with matplotlib with figsize=(11,9). Use fig, ax = ... Generate a custom diverging colormap for the heatmap with the arguments (220, 10, as_cmap=True). Save it in a variable called cmap. Draw the heatmap with the mask and correct aspect ratio, using the arguments corr, mask=mask, cmap=cmap, vmax=.3, center=0, square=True, linewidths=.5, cbar_kws={"shrink": .5}). Finally, use fig.tight_layout() just before…arrow_forwardProject 3 Details:This project is based on eight tables (AIRPORT, FLIGHT, DEPARTURES, PASSENGER,RESERVATION, EQUIP_TYPE, PILOTS, and TICKET) that contain data about the Belle Airlines. Downloadand use data in Project 3 zip file for this project.Some Background on Belle AirlinesBelle Airlines is a regional carrier that operates primarily in the southwestern United States. At thepresent time, Belle Airlines operates its own reservation information system. To simplify our analysis,we will assume that all reservations on Belle Airlines flights are placed through Belle Airlines employees.Flights are not booked through travel agents and Belle Airlines does not participate in industry-widereservations services. Each flight is assigned a unique flight number and has its own set of flightcharacteristics (ie. flight number, origin, destination, departure time, arrival time, meal code, base fare,mileage between origin and destination, and number of changes in time zone between the origin…arrow_forward
- Normalize the Index metadata connection and explain why you think it would be slower to use.arrow_forwardThe Driver Relationship team wants to create some workshops and increase communication with the active drivers in InstantRide. Therefore, they requested a new database table to store the driver details of the drivers that have had at least one ride in the system. Create a new table, ACTIVE_DRIVERS, from the DRIVERS and TRAVELS tables which contains the following fields: DRIVER_ID CHAR(5) (Primary key)DRIVER_FIRST_NAME VARCHAR(20)DRIVER_LAST_NAME VARCHAR(20)DRIVER_DRIVING_LICENSE_ID VARCHAR(10)DRIVER_DRIVING_LICENSE_CHECKED BOOLDRIVER_RATING FLOATarrow_forwardThe Driver Relationship team wants to create some workshops and increase communication with the active drivers in InstantRide. Therefore, they requested a new database table to store the driver details of the drivers that have had at least one ride in the system. Create a new table, ACTIVE_DRIVERS, from the DRIVERS and TRAVELS tables which contains the following fields: DRIVER_ID CHAR(5) (Primary key) DRIVER_FIRST_NAME VARCHAR(20) DRIVER_LAST_NAME VARCHAR(20) DRIVER_DRIVING_LICENSE_ID VARCHAR(10) DRIVER_DRIVING_LICENSE_CHECKED BOOL DRIVER_RATING FLOAT The code is supposed to look something like this but I'm getting an error CREATE TABLE ACTIVE_DRIVERS( DRIVER_ID CHAR(5) NOT NULL, DRIVER_FIRST_NAME VARCHAR(20) NULL, DRIVER_LAST_NAME VARCHAR(20) NULL, DRIVER_DRIVING_LICENSE_ID VARCHAR(10) NULL, DRIVER_DRIVING_LICENSE_CHECKED BOOL NULL, DRIVER_RATING FLOAT NULL ) as select D.DRIVER_ID, D.DRIVER_FIRST_NAME, D.DRIVER_LAST_NAME, D.DRIVER_DRIVING_LICENSE_ID,…arrow_forward
- Open the Missing Addresses query in Design view. Add a new column to determine if a customer does not have an address on file. If the customer’s Address is null, it should display Missing. If not, it should display nothing. Name the column AddressPresent. Add criteria of Missing to the column you just created, so only the customers missing an address display. Move the AddressPresent field so it appears between PhoneNumber and Address. Run the query. Ensure only customers with null Address fields display. Save and close the query.arrow_forwardHelp understanding / installing phpmyadmin and if i need wamp server aswell and where to copy and paste this code in order to create a databasearrow_forwardCreate a new dataset called "life_exp_2" using subset() with the following variables: life expectancy, polio vaccination, measles vaccination, obesity 9-15 year old's, basic water, alcohol consumption, and current healthcare spending. Use View() and str) to check if you obtained all the relevant variables in the new dataset.arrow_forward
arrow_back_ios
SEE MORE QUESTIONS
arrow_forward_ios
Recommended textbooks for you
- Database System ConceptsComputer ScienceISBN:9780078022159Author:Abraham Silberschatz Professor, Henry F. Korth, S. SudarshanPublisher:McGraw-Hill EducationStarting Out with Python (4th Edition)Computer ScienceISBN:9780134444321Author:Tony GaddisPublisher:PEARSONDigital Fundamentals (11th Edition)Computer ScienceISBN:9780132737968Author:Thomas L. FloydPublisher:PEARSON
- C How to Program (8th Edition)Computer ScienceISBN:9780133976892Author:Paul J. Deitel, Harvey DeitelPublisher:PEARSONDatabase Systems: Design, Implementation, & Manag...Computer ScienceISBN:9781337627900Author:Carlos Coronel, Steven MorrisPublisher:Cengage LearningProgrammable Logic ControllersComputer ScienceISBN:9780073373843Author:Frank D. PetruzellaPublisher:McGraw-Hill Education
Database System Concepts
Computer Science
ISBN:9780078022159
Author:Abraham Silberschatz Professor, Henry F. Korth, S. Sudarshan
Publisher:McGraw-Hill Education
Starting Out with Python (4th Edition)
Computer Science
ISBN:9780134444321
Author:Tony Gaddis
Publisher:PEARSON
Digital Fundamentals (11th Edition)
Computer Science
ISBN:9780132737968
Author:Thomas L. Floyd
Publisher:PEARSON
C How to Program (8th Edition)
Computer Science
ISBN:9780133976892
Author:Paul J. Deitel, Harvey Deitel
Publisher:PEARSON
Database Systems: Design, Implementation, & Manag...
Computer Science
ISBN:9781337627900
Author:Carlos Coronel, Steven Morris
Publisher:Cengage Learning
Programmable Logic Controllers
Computer Science
ISBN:9780073373843
Author:Frank D. Petruzella
Publisher:McGraw-Hill Education