
Method: Python
Dataset: Census Income
Source: https://archive.ics.uci.edu/ml/datasets/Adult
Objective
The objective of this task is to implement a Decision Tree (DT) classification method from scratch to predict whether a person's income exceeds $50K/yr based on census data. This is therefore a binary classification problem. The training and test sets are pre-defined in the data set (i.e., in "adult.data" and "adult.test", respectively).
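A minimal sketch of reading the two files with only the standard library is shown here. It assumes the usual layout of the UCI files: 15 comma-separated fields per record, "?" for missing values, a non-data first line in "adult.test", and a trailing period on the labels in "adult.test". The column names and the function name `parse_records` are illustrative choices, not part of the assignment.

```python
import csv

# Column names assumed from the dataset description; adjust if your copy differs.
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]
NUMERIC = {"age", "fnlwgt", "education-num",
           "capital-gain", "capital-loss", "hours-per-week"}


def parse_records(lines):
    """Parse an iterable of text lines into a list of dicts.

    Lines that do not have exactly 15 fields (blank lines, or the
    comment line assumed to open adult.test) are skipped.
    """
    records = []
    for row in csv.reader(lines, skipinitialspace=True):
        if len(row) != len(COLUMNS):
            continue
        record = {}
        for name, value in zip(COLUMNS, row):
            value = value.strip()
            if value == "?":
                record[name] = None          # missing value
            elif name in NUMERIC:
                record[name] = int(value)
            elif name == "income":
                # Labels in adult.test are assumed to end with a period.
                record[name] = value.rstrip(".")
            else:
                record[name] = value
        records.append(record)
    return records
```

With the files downloaded locally, usage could look like `with open("adult.data") as f: train = parse_records(f)`, and likewise for "adult.test".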
Requirements
(1) Implement two DT models by choosing any two (2) split criteria from Information Gain, Gain Ratio, Gini Index and Variance. Note that you can use either binary splits or multi-way splits.
(2) Use (approximately) 2/3 of the records in "adult.data" for training, and the remaining 1/3 of the records in "adult.data" for post-pruning.
(3) Report the accuracy of each model.
(4) All DT models must be self-implemented. You CANNOT use any machine learning library in this task.
(5) It is recommended that your implementation include a "tree induction function", a "classification function" and a "post-pruning function".
(6) You may (but are not required to) use any suitable pre-processing method. You may also (but are not required to) use any reasonable early stopping criteria (pre-pruning parameters such as number of splits, minimum data set size, and split threshold) to improve the training speed. If you do so, explain your reasons.
(7) Present a clear and accurate explanation of your implementation and results.
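As an illustration of the split criteria named in requirement (1), the sketch below computes Information Gain, Gain Ratio and the weighted Gini Index for a multi-way split on one categorical attribute. It uses only the standard library, in line with requirement (4). The function names are hypothetical, and handling of numeric attributes (e.g., threshold-based binary splits) is left out.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of a list of hashable items."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def partition(values, labels):
    """Group class labels by the attribute value of each record."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    return groups


def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after the split."""
    n = len(labels)
    groups = partition(values, labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def gain_ratio(values, labels):
    """Information gain divided by the entropy of the attribute itself."""
    split_info = entropy(values)
    if split_info == 0:
        return 0.0  # attribute takes a single value; the split is useless
    return information_gain(values, labels) / split_info


def gini_index(values, labels):
    """Weighted Gini impurity after the split (lower is better)."""
    n = len(labels)
    groups = partition(values, labels)
    return sum(len(g) / n * gini(g) for g in groups.values())
```

A tree induction function would evaluate one of these scores for every candidate attribute at a node, pick the attribute with the highest gain (or lowest Gini Index), and recurse on each resulting group.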







