Assignment 01

School

Humber College

Course

4000

Subject

Statistics

Date

Feb 20, 2024

Pages

3

Assignment 01 (05%)

I. About the Data

The Boston Housing dataset contains data collected by the US Census Service concerning housing in the Boston, Massachusetts area. It was obtained from the StatLib archive (http://lib.stat.cmu.edu/datasets/boston). The dataset has 167 cases. The data were originally published in: Harrison Jr., David, and Daniel L. Rubinfeld. "Hedonic housing prices and the demand for clean air." Journal of Environmental Economics and Management 5.1 (1978): 81-102.

The BostonHousing.xlsx dataset has 11 attributes. The dataset comes with different imperfections (missing values and outliers). As described earlier, most algorithms will not process records with these imperfections.

II. Requirements

A. Review the techniques for handling such imperfections, with data, examples, and references.

B. Use the provided data file in the following tasks:
   1. For every predictor except PTRATIO, perform the necessary "Handling Missing Data" operations on the missing values and highlight them.
   2. Find possible outliers in the PTRATIO predictor. The possible causes of outliers are:
      (a) A non-numeric value was typed.
      (b) The decimal place was shifted by a data entry error.
      (c) A genuine outlier.
      Highlight the cells with outlier cases and state the possible cause, indicating a, b, or c.

C. Use the provided data file in the following tasks:
   1. Substitute the missing data with NaN (not a number).
   2. Write and provide Python code to implement:
      Omission
      Imputation

D. Compute the mean, median, min, max, and standard deviation for each of the quantitative variables.

E. Plot a histogram for each of the quantitative variables. Based on the histograms and summary statistics, answer the following questions:
   i. Which variables have the largest variability?
   ii. Which variables appear skewed?
   iii. Are there any values that seem extreme?

F. Plot a side-by-side box plot comparing any two variables. Explain what this plot shows.

G. Compute the correlation table for the quantitative variables. In addition, generate a matrix plot (heatmap) for these variables.
   i. Which pair of variables is most strongly correlated?
   ii. How can we reduce the number of variables based on these correlations?
   iii. How would the correlations change if we normalized the data first?

III. Deliverables

A report (max. 10 pages). Feel free to choose the report format.
All the Python code used to develop the models (provide all the developed .pdf and .ipynb files).

IV. Instructions

This assignment is to be completed in groups.
The assignment is due by the next lecture; submit what you have before next week's lecture.
Late submissions will NOT be marked (no excuses).
The solutions must be submitted via Blackboard through the assignment's link.
Follow the accepted file formats: Word or editable PDF (no images), and the .ipynb file.
Any feedback or issue regarding the assignment grades should be raised in a clear email within a week after grading (please use Blackboard email).
A zero-tolerance policy regarding plagiarism and cheating is in effect.
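
To make the terms in Requirements B.2, C, and D concrete, here is a minimal sketch in Python with pandas. It runs on a small made-up table, not on BostonHousing.xlsx: the values, the choice of columns (RM, AGE, PTRATIO), median imputation, and the 1.5 x IQR outlier rule are all assumptions chosen for illustration, not methods prescribed by the assignment.

```python
import pandas as pd

# Toy stand-in for the real file; every value here is invented for illustration.
df = pd.DataFrame({
    "RM": [6.5, "", 7.1, 5.9, 6.2],            # blank cell = missing
    "AGE": [65.2, 78.9, "?", 45.8, 54.2],      # "?" = non-numeric entry
    "PTRATIO": [17.0, 17.8, 17.8, 187.0, 18.7],
})

# C.1: turn anything that is not a number into NaN.
clean = df.apply(pd.to_numeric, errors="coerce")

# C.2 Omission: drop every row that contains at least one NaN.
omitted = clean.dropna()

# C.2 Imputation: fill each NaN with its column median (one possible choice).
imputed = clean.fillna(clean.median())

# D: summary statistics for each quantitative variable.
summary = imputed.agg(["mean", "median", "min", "max", "std"])

# B.2: flag PTRATIO values outside 1.5 * IQR of the quartiles (assumed rule).
q1, q3 = clean["PTRATIO"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (clean["PTRATIO"] < q1 - 1.5 * iqr) | (clean["PTRATIO"] > q3 + 1.5 * iqr)
outliers = clean.loc[is_outlier, "PTRATIO"]

print(summary)
print(outliers)
```

In this toy table the flagged PTRATIO value, 187.0, sits next to an 18.7, which is the pattern a shifted decimal place (cause b) would produce. On the real data, the same steps would start from the frame returned by pd.read_excel on the provided file.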