MBAN Assignment Haider

School: York University
Course: MBAN6110
Subject: Computer Science
Date: Apr 3, 2024

MBAN 6500X -- Assignment 2

Task

This is an individual assignment that requires you to participate in a machine learning competition on Kaggle. Specifically, you will participate in the competition Titanic: Machine Learning from Disaster, where the task is to predict survival based on passenger information. You have to register a Kaggle account and follow the instructions under Overview on the competition site. Beyond the Overview, I recommend that you closely study a couple of notebooks under the Notebooks tab. For example, "Titanic: 81.1% Leader board Score Guaranteed" and "A Data Science Framework: To Achieve 99% Accuracy" provide good examples of exploratory analysis and feature engineering and are thus worth your time. Use them as tutorials. You have to fit and compare three different learning algorithms, including both a linear and a non-linear learner.

Submission

The assignment is due on February 21 at 7:00 pm. You have to do three things.

1) Select the best among the models and submit its predictions on Kaggle's test set to Kaggle.
2) Submit your Kaggle username by putting it in the Canvas comment when you submit the Python file.
3) Submit to Canvas a standard Python file (i.e. PY) containing the complete code (see Grading) that trains all three models and, for each model, prints the accuracy, F1-score and AUC.

Grading

The submitted Python file should contain the following standard steps of a data science project:

1. Load data.
2. Exploratory data analysis and pre-processing. Use the insights you gain from the exploratory analysis to guide the pre-processing.
   - Data cleaning. Identification and treatment of missing values and outliers.
   - Feature engineering.
   - Make three plots describing different aspects of the data set. The first should be a histogram showing survival as a function of age (i.e. two histograms, one for survivors and the other for non-survivors, with age on the x-axis and count on the y-axis). The second should be a bar plot of the number of survivors for each passenger class (Pclass). The third should be a matrix showing the pairwise correlations between features.
   - Print a basic data description (at a minimum the number of examples, the number of features, and the number of examples in each class). The data description should be printed under the header Data description (see example below).
   - Print (and include in the plots) descriptive statistics (e.g. means, medians, standard deviations). The descriptive statistics should be printed under the header Descriptive statistics (see example below). Only print descriptive statistics for four features.
3. Partition data (not Kaggle's test set) into train, validation and test sets. This test set will be different from Kaggle's test set.
4. Fit models on the training set (this can include a hyper-parameter search) and select the best based on validation set performance.
5. Print the results of all three models on the test set from (4). This should include accuracy, F1-score and AUC. The results should be printed under the header Performance (see example below).
6. Save the predictions of the best model on Kaggle's test set to submission.csv.

Example printing

The values, feature names and learner names are examples and should be replaced.

    Data description
    ----------------
    number of examples : 123
    number of features : 456
    number of examples per class : 7 (survived), 8 (didn't survive)

    Descriptive statistics
    ----------------------
            feature 1  feature 2  feature 3  feature 4
    mean    1.1        1.2        1.3        1.4
    median  2.1        2.2        2.3        2.4
    std     3.1        3.2        3.3        3.4

    Performance
    -----------
                    accuracy  F1-score  AUC
    Linear model    x1        y1        z1
    Random forest   x2        y2        z2
    Boosting model  x3        y3        z3

While the first part of this submission could be completed by simply copying an existing notebook, the second part cannot. Your code will be marked based on its originality and the extent to which it reflects an understanding of the task. Extensive copying will be considered plagiarism and Turnitin will be used for its detection. For this assignment, learning and understanding are more important than prediction accuracy.

Good luck!
Hjalmar
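To make the partitioning in step 3 and the three metrics in step 5 concrete, here is a minimal sketch in plain Python with no third-party packages. It is an illustration under stated assumptions, not a reference solution: the helper names (stratified_split, accuracy, f1, auc, performance_table), the 60/20/20 split fractions, the 0.5 decision threshold and the toy labels and scores are all hypothetical and do not come from the assignment or from the Titanic data. In practice, scikit-learn's train_test_split, accuracy_score, f1_score and roc_auc_score offer equivalent functionality.

```python
import random


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return (train, val, test) index lists, keeping class ratios roughly equal.

    The 60/20/20 default is an assumption; the assignment does not fix the sizes.
    """
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train, val, test = [], [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_train = round(len(idx) * fractions[0])
        n_val = round(len(idx) * fractions[1])
        train += idx[:n_train]
        val += idx[n_train:n_train + n_val]
        test += idx[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN), with 1 as the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def performance_table(results):
    """Format {name: (accuracy, f1, auc)} under the Performance header."""
    lines = ["Performance", "-----------",
             f"{'':<16}{'accuracy':>10}{'F1-score':>10}{'AUC':>10}"]
    for name, (acc, f1_value, auc_value) in results.items():
        lines.append(f"{name:<16}{acc:>10.3f}{f1_value:>10.3f}{auc_value:>10.3f}")
    return "\n".join(lines)


# Toy labels and predicted probabilities (hypothetical, not Titanic data).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1, 0.8, 0.3]
y_pred = [int(s >= 0.5) for s in scores]  # 0.5 threshold is an assumption

train_idx, val_idx, test_idx = stratified_split(y_true)
print(len(train_idx), len(val_idx), len(test_idx))
print(performance_table(
    {"Toy scores": (accuracy(y_true, y_pred), f1(y_true, y_pred), auc(y_true, scores))}
))
```

Note that accuracy and F1 are computed from hard class predictions, while AUC is computed from the continuous scores, so each model needs to expose both. Splitting per class keeps the survived/not-survived ratio similar across the three partitions, which matters when the classes are imbalanced.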
