import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# These lines load the tests.
import otter
grader = otter.Notebook()
Lab 8: Decision Trees
This lab is an introduction to decision trees, random forests, and how to use them to evaluate variable importance.
We will use the following dataset, which represents various human physiological measurements and whether or not an individual was diagnosed with some form of kidney disease. "classification" will be our variable of interest, i.e. our target. Please refer to the "metadata.txt" document for the meanings of each variable.
kd = pd.read_csv("./kidney_disease.csv")
kd.head()
    age    bp     sg   al   su    bgr    bu   sc    sod  pot  hemo   pcv  \
0  48.0  80.0  1.020  1.0  0.0  121.0  36.0  1.2    NaN  NaN  15.4  44.0
1   7.0  50.0  1.020  4.0  0.0    NaN  18.0  0.8    NaN  NaN  11.3  38.0
2  62.0  80.0  1.010  2.0  3.0  423.0  53.0  1.8    NaN  NaN   9.6  31.0
3  48.0  70.0  1.005  4.0  0.0  117.0  56.0  3.8  111.0  2.5  11.2  32.0
4  51.0  80.0  1.010  2.0  0.0  106.0  26.0  1.4    NaN  NaN  11.6  35.0

       wc   rc  htn ane classification
0  7800.0  5.2  yes  no            ckd
1  6000.0  NaN   no  no            ckd
2  7500.0  NaN   no yes            ckd
3  6700.0  3.9  yes yes            ckd
4  7300.0  4.6   no  no            ckd

Let's extract "classification" as our target and drop it from the rest of the data. We'll call the other dataset "features". Note that we first drop NAs from the data, cutting our effective sample size by half. Many models in sklearn don't handle NAs by default, so removing them first is a common initial step.
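As an optional sanity check on that claim, you could compare the row counts before and after dropping missing values; the exact numbers depend on the CSV, so treat this as an illustrative sketch rather than part of the graded solution.

# How many rows survive if we drop every row containing a NaN?
print(len(kd), "rows before,", len(kd.dropna()), "rows after dropping NAs")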
# Remove any rows that have NAs in the data
kd.dropna(inplace=True)

# Create our target
target = kd["classification"]

# Create features as all data except classification
features = kd.drop("classification", axis=1)
features.head()
     age    bp     sg   al   su    bgr     bu   sc    sod  pot  hemo   pcv  \
3   48.0  70.0  1.005  4.0  0.0  117.0   56.0  3.8  111.0  2.5  11.2  32.0
5   60.0  90.0  1.015  3.0  0.0   74.0   25.0  1.1  142.0  3.2  12.2  39.0
9   53.0  90.0  1.020  2.0  0.0   70.0  107.0  7.2  114.0  3.7   9.5  29.0
11  63.0  70.0  1.010  3.0  0.0  380.0   60.0  2.7  131.0  4.2  10.8  32.0
12  68.0  70.0  1.015  3.0  1.0  208.0   72.0  2.1  138.0  5.8   9.7  28.0

         wc   rc  htn ane
3    6700.0  3.9  yes yes
5    7800.0  4.4  yes  no
9   12100.0  3.7  yes yes
11   4500.0  3.8  yes  no
12  12200.0  3.4  yes  no

Question 1
There are two categorical variables among our features. What are they? Your answers should be the column names of the variables as an array.
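One optional way to confirm which columns are categorical is to inspect the dtypes; in this cleaned version of the data only the yes/no columns are stored as strings (pandas "object" dtype). A quick check, not required by the autograder:

# Columns whose values are stored as strings rather than numbers
print(features.select_dtypes(include="object").columns.tolist())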
cat_variables = np.array(['htn', 'ane'])

grader.check("q1")
q1 results: All test cases passed!
Question 2
Convert these two variables into their dummy equivalents using pd.get_dummies() and then drop the original variables from the features. There should therefore be 4 new columns in the data populated by 0s or 1s. If your new dummy values come out as Trues and Falses, look at the documentation for astype() to fix this.
features["htn_yes"] = pd.get_dummies(features['htn'], dtype=int).iloc[:, 1]
features["htn_no"] = pd.get_dummies(features['htn'], dtype=int).iloc[:, 0]
features["ane_yes"] = pd.get_dummies(features['ane'], dtype=int).iloc[:, 1]
features["ane_no"] = pd.get_dummies(features['ane'], dtype=int).iloc[:, 0]
features.drop(["ane", "htn"], axis=1, inplace=True)
features.head()
     age    bp     sg   al   su    bgr     bu   sc    sod  pot  hemo   pcv  \
3   48.0  70.0  1.005  4.0  0.0  117.0   56.0  3.8  111.0  2.5  11.2  32.0
5   60.0  90.0  1.015  3.0  0.0   74.0   25.0  1.1  142.0  3.2  12.2  39.0
9   53.0  90.0  1.020  2.0  0.0   70.0  107.0  7.2  114.0  3.7   9.5  29.0
11  63.0  70.0  1.010  3.0  0.0  380.0   60.0  2.7  131.0  4.2  10.8  32.0
12  68.0  70.0  1.015  3.0  1.0  208.0   72.0  2.1  138.0  5.8   9.7  28.0

         wc   rc  htn_yes  htn_no  ane_yes  ane_no
3    6700.0  3.9        1       0        1       0
5    7800.0  4.4        1       0        0       1
9   12100.0  3.7        1       0        1       0
11   4500.0  3.8        1       0        0       1
12  12200.0  3.4        1       0        0       1

grader.check("q2")
q2 results: All test cases passed!
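As an aside, pd.get_dummies() can also encode several columns in one call and drop the originals automatically. The sketch below assumes a hypothetical DataFrame features_raw that still contains the original 'htn' and 'ane' string columns; it should yield the same four htn_*/ane_* indicator columns as the manual approach above, though the column order may differ.

# One-call alternative: encode both categorical columns and drop the originals
features_encoded = pd.get_dummies(features_raw, columns=["htn", "ane"], dtype=int)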
Question 3
Split your data into train and test sets. Use a test size of 20%.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, shuffle=True)

grader.check("q3")
q3 results: All test cases passed!
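Because shuffle=True produces a random split, the exact test accuracy reported below can change from run to run; passing a random_state makes the split reproducible. A minimal sketch (the seed value 42 is arbitrary):

# Fix the seed so the same rows land in the train/test sets on every run
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, shuffle=True, random_state=42)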
Question 4
Fit a decision tree to your training data using all predictors and predicting classification. Then predict classes on your test set and calculate the accuracy.
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
accuracy = sum(clf.predict(X_test) == y_test) / len(y_test)
accuracy

1.0

grader.check("q4")
q4 results: All test cases passed!
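Equivalent, slightly more idiomatic ways to get the same number are sklearn's accuracy_score helper or the classifier's built-in score method; this is just an alternative to the manual sum/len computation above.

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, clf.predict(X_test)))  # same value as the manual calculation
print(clf.score(X_test, y_test))                    # score() reports mean accuracy for classifiers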
Question 5
Visualize your tree using the code below. How deep of a tree did you need to reach total node purity, and which variables were used?
Remember that node purity is determined by the Gini index:
$G = \sum_{k=1}^{K} p_{mk}(1 - p_{mk})$
When G = 0, that means there are only observations of a single class in the branch of the tree and we cannot improve that region. Our job is done when all branches are equal to 0 (though beware overfitting as discussed in lecture).
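Here $p_{mk}$ denotes the proportion of observations in node $m$ that belong to class $k$. As a small made-up example (not taken from this dataset): a node holding 3 "ckd" and 1 "notckd" observations has $G = 0.75(1 - 0.75) + 0.25(1 - 0.25) = 0.375$, while a node holding only "ckd" observations has $G = 1(1 - 1) = 0$ and is therefore pure.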
from sklearn import tree

tree.plot_tree(clf)
plt.show()
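If the default rendering is too cramped to read, plot_tree() also accepts feature_names, class_names, and filled arguments that label each split; a possible variant (the figure size is an arbitrary choice) is:

plt.figure(figsize=(20, 10))  # enlarge the figure so the node text is legible
tree.plot_tree(clf, feature_names=list(features.columns), class_names=list(clf.classes_), filled=True)
plt.show()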