import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# These lines load the tests.
import otter
grader = otter.Notebook()

Lab 8: Decision Trees

This lab is an introduction to decision trees, random forests, and how to use them to evaluate variable importance. We will use the following dataset, which contains various human physiological measurements along with whether or not each individual was diagnosed with some form of kidney disease. "classification" will be our variable of interest, i.e. our target. Please refer to the "metadata.txt" document for the meanings of each variable.

kd = pd.read_csv("./kidney_disease.csv")
kd.head()

    age    bp     sg   al   su    bgr    bu   sc    sod  pot  hemo   pcv  \
0  48.0  80.0  1.020  1.0  0.0  121.0  36.0  1.2    NaN  NaN  15.4  44.0
1   7.0  50.0  1.020  4.0  0.0    NaN  18.0  0.8    NaN  NaN  11.3  38.0
2  62.0  80.0  1.010  2.0  3.0  423.0  53.0  1.8    NaN  NaN   9.6  31.0
3  48.0  70.0  1.005  4.0  0.0  117.0  56.0  3.8  111.0  2.5  11.2  32.0
4  51.0  80.0  1.010  2.0  0.0  106.0  26.0  1.4    NaN  NaN  11.6  35.0

       wc   rc  htn  ane classification
0  7800.0  5.2  yes   no             ckd
1  6000.0  NaN   no   no             ckd
2  7500.0  NaN   no  yes             ckd
3  6700.0  3.9  yes  yes             ckd
4  7300.0  4.6   no   no             ckd

Let's extract "classification" as our target and drop it from the rest of the data. We'll call the other dataset "features". Note that we first drop NAs from the data, cutting our effective sample size roughly in half. Many models in sklearn don't handle NAs by default, so removing them first is a common initial step.

# Remove any rows that have NAs in the data
kd.dropna(inplace=True)
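As a quick sanity check on the claim that dropping NAs roughly halves the data, you can count how many rows contain at least one missing value. A minimal sketch, meant to be run before the dropna() call above (otherwise the missing rows are already gone):

# Number of rows with at least one NA, versus total rows
n_missing = kd.isna().any(axis=1).sum()
print(f"{n_missing} of {len(kd)} rows have at least one missing value")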
# Create our target
target = kd["classification"]

# Create features as all data except classification
features = kd.drop("classification", axis=1)
features.head()

     age    bp     sg   al   su    bgr     bu   sc    sod  pot  hemo   pcv  \
3   48.0  70.0  1.005  4.0  0.0  117.0   56.0  3.8  111.0  2.5  11.2  32.0
5   60.0  90.0  1.015  3.0  0.0   74.0   25.0  1.1  142.0  3.2  12.2  39.0
9   53.0  90.0  1.020  2.0  0.0   70.0  107.0  7.2  114.0  3.7   9.5  29.0
11  63.0  70.0  1.010  3.0  0.0  380.0   60.0  2.7  131.0  4.2  10.8  32.0
12  68.0  70.0  1.015  3.0  1.0  208.0   72.0  2.1  138.0  5.8   9.7  28.0

         wc   rc  htn  ane
3    6700.0  3.9  yes  yes
5    7800.0  4.4  yes   no
9   12100.0  3.7  yes  yes
11   4500.0  3.8  yes   no
12  12200.0  3.4  yes   no

Question 1

There are two categorical variables among our features. What are they? Your answer should be the column names of the variables as an array.

cat_variables = np.array(['htn', 'ane'])

grader.check("q1")

q1 results: All test cases passed!
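Rather than spotting the categorical columns by eye, you can ask pandas which columns are non-numeric. A minimal sketch, assuming the features DataFrame defined above and that the yes/no columns are stored with the pandas object dtype:

# Columns with dtype "object" hold the string-valued (categorical) data
cat_cols = features.select_dtypes(include="object").columns.to_numpy()
print(cat_cols)  # should list 'htn' and 'ane'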
Question 2

Convert these two variables into their dummy equivalents using pd.get_dummies(), then drop the original variables from the features. There should therefore be 4 new columns in the data populated by 0s and 1s. If your new dummy values come out as Trues and Falses, look at the documentation for astype() to fix this.

features["htn_yes"] = pd.get_dummies(features['htn'], dtype=int).iloc[:, 1]
features["htn_no"] = pd.get_dummies(features['htn'], dtype=int).iloc[:, 0]
features["ane_yes"] = pd.get_dummies(features['ane'], dtype=int).iloc[:, 1]
features["ane_no"] = pd.get_dummies(features['ane'], dtype=int).iloc[:, 0]
features.drop(["ane", "htn"], axis=1, inplace=True)
features.head()

     age    bp     sg   al   su    bgr     bu   sc    sod  pot  hemo   pcv  \
3   48.0  70.0  1.005  4.0  0.0  117.0   56.0  3.8  111.0  2.5  11.2  32.0
5   60.0  90.0  1.015  3.0  0.0   74.0   25.0  1.1  142.0  3.2  12.2  39.0
9   53.0  90.0  1.020  2.0  0.0   70.0  107.0  7.2  114.0  3.7   9.5  29.0
11  63.0  70.0  1.010  3.0  0.0  380.0   60.0  2.7  131.0  4.2  10.8  32.0
12  68.0  70.0  1.015  3.0  1.0  208.0   72.0  2.1  138.0  5.8   9.7  28.0

         wc   rc  htn_yes  htn_no  ane_yes  ane_no
3    6700.0  3.9        1       0        1       0
5    7800.0  4.4        1       0        0       1
9   12100.0  3.7        1       0        1       0
11   4500.0  3.8        1       0        0       1
12  12200.0  3.4        1       0        0       1

grader.check("q2")

q2 results: All test cases passed!

Question 3

Split your data into train and test sets. Use a test size of 20%.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, shuffle=True)

grader.check("q3")

q3 results: All test cases passed!
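Because the split above is shuffled with no fixed seed, the exact rows in the test set (and therefore the accuracy reported below) can change from run to run. One optional refinement, sketched here and not part of the graded answer, is to pin the random seed and stratify on the target so both classes appear in the test set in roughly their original proportions:

# Hypothetical reproducible split: random_state pins the shuffle,
# stratify keeps the class ratio similar in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, shuffle=True,
    random_state=0, stratify=target)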
Question 4

Fit a decision tree to your training data using all predictors, predicting classification. Then predict classes on your test set and calculate the accuracy.

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)

accuracy = sum(clf.predict(X_test) == y_test) / len(y_test)
accuracy

1.0

grader.check("q4")

q4 results: All test cases passed!

Question 5

Visualize your tree using the code below. How deep a tree did you need to reach total node purity, and which variables were used? Remember that node purity is determined by the Gini index:

G = \sum_{k=1}^{K} p_{mk} (1 - p_{mk})

When G = 0, there are only observations of a single class in that branch of the tree, and we cannot improve that region. Our job is done when every branch reaches 0 (though beware overfitting, as discussed in lecture).

from sklearn import tree

tree.plot_tree(clf)
plt.show()
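To make the formula concrete, the Gini index of a single node can be computed directly from the class proportions at that node. A minimal sketch (the helper name and the example counts are made up for illustration):

import numpy as np

def gini(class_counts):
    """Gini index G = sum_k p_k * (1 - p_k) for one node."""
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()          # convert counts to class proportions p_k
    return float(np.sum(p * (1 - p)))

print(gini([10, 0]))  # pure node -> 0.0
print(gini([5, 5]))   # evenly mixed two-class node -> 0.5

At each split, the tree greedily picks the feature and threshold that most reduce this impurity in the child nodes, which is why a plotted tree bottoms out in leaves with gini = 0.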