Decision Tree Classification Algorithm
- Training
  1. Find the most informative combination of tree node, feature, and split value (a sketch of this search is shown after the list).
  2. Do the split if max_depth is not reached.
  3. Iterate over 1-2.
- Inference (prediction)
  - Follow the decision rules.
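To make step 1 concrete, here is a minimal sketch of the greedy search for the most informative split of a single numeric feature, using the entropy-based information gain defined in the next sections. The helper names entropy and best_split are ours for illustration, not part of scikit-learn.

import numpy as np

def entropy(labels):
    """Empirical entropy of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def best_split(x, y):
    """Exhaustive search over thresholds of one feature x for the split
    of labels y with the largest information gain."""
    parent = entropy(y)
    best_gain, best_threshold = 0.0, None
    for threshold in np.unique(x)[:-1]:
        left, right = y[x <= threshold], y[x > threshold]
        children = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        gain = parent - children
        if gain > best_gain:
            best_gain, best_threshold = gain, threshold
    return best_threshold, best_gain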
Decision Tree Example
Let’s consider a simple classification problem: there are 20 balls of blue and yellow colours. Each ball is located at an integer point from 0 to 20 (excluded). We want to guess the ball colour y given its position (integer coordinate x).
Probabilities (Sample Means)
Before the first split (class probabilities)
\[ P(y=\text{BLUE}) = \frac{9}{20} = 0.45, \quad P(y=\text{YELLOW}) = \frac{11}{20} = 0.55. \]
After the first split (probabilities conditional on the coordinate)
\[ P(y=\text{BLUE}|X\leq 12) = \frac{8}{13} \approx 0.62, \quad P(y=\text{BLUE}|X> 12) = \frac{1}{7} \approx 0.14. \]
\[ P(y=\text{YELLOW}|X\leq 12) = \frac{5}{13} \approx 0.38, \quad P(y=\text{YELLOW}|X > 12) = \frac{6}{7} \approx 0.86. \]
Information Criterion
Entropy
\[ H(p) = - \sum_{i=1}^{K} p_i \log p_i \]
Before the first split
\[H = - 0.45 \log 0.45 - 0.55 \log 0.55 \approx 0.69 \]
After the first split
\[H_{\text{left}} = - 0.62 \log 0.62 - 0.38 \log 0.38 \approx 0.66\]
\[H_{\text{right}} = - 0.14 \log 0.14 - 0.86 \log 0.86 \approx 0.40\]
\[H_{\text{total}} = \frac{13}{20} \cdot 0.66 + \frac{7}{20} \cdot 0.40 \approx 0.57\]
Information Gain
\[ IG = H(\text{parent}) - \sum_{\text{child}} \frac{N_{\text{child}}}{N_{\text{parent}}} H(\text{child}) \]
\[IG = 0.69 - 0.57 = 0.12\]
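The same arithmetic can be checked numerically from the class counts in the example (9/11 in the parent, 8/5 and 1/6 in the children); a small sketch, where the helper entropy_from_counts exists only for this check:

import numpy as np

def entropy_from_counts(counts):
    """Entropy of a node given its per-class sample counts."""
    p = np.asarray(counts) / np.sum(counts)
    return -np.sum(p * np.log(p))

h_parent = entropy_from_counts([9, 11])  # ~0.69
h_left = entropy_from_counts([8, 5])     # ~0.66
h_right = entropy_from_counts([1, 6])    # ~0.41
h_children = 13 / 20 * h_left + 7 / 20 * h_right
print(h_parent - h_children)             # ~0.11; the 0.12 above comes from rounding intermediate values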
Toy Example
First, load the dataset as usual.
from sklearn.datasets import load_iris

iris = load_iris()
feats = iris.data
labels = iris.target
Let’s import the algorithm and train it.
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit(feats, labels)
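The training accuracy of an unpruned tree is trivially close to 1.0, so a cross-validated score is a more honest check; a quick sketch:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy of a fresh tree on the iris data.
cross_val_score(DecisionTreeClassifier(), feats, labels, cv=5).mean()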
Let’s take a look at the structure of the decision tree.
import matplotlib.pyplot as plt
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.tree import plot_tree
fig, ax = plt.subplots(figsize=(14, 5), dpi=150)
plot_tree(clf, ax=ax, filled=True, proportion=False)
plt.show()
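If the rendered figure is hard to read, the same rules can be dumped as plain text, for example with export_text:

from sklearn.tree import export_text

# Text view of the learned decision rules.
print(export_text(clf, feature_names=iris.feature_names))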
import numpy as np

fig, axs = plt.subplots(nrows=2, ncols=3, figsize=(12, 6), dpi=150, layout='constrained')

pairs = [[0, 1], [0, 2], [0, 3], [1, 2], [1, 3], [2, 3]]
for pairidx, (ax, pair) in enumerate(zip(axs.flatten(), pairs)):
    # Train model on a single pair of features.
    X = iris.data[:, pair]
    y = iris.target
    clf = DecisionTreeClassifier()
    clf.fit(X, y)

    # Plot the decision boundary.
    DecisionBoundaryDisplay.from_estimator(
        clf,
        X,
        cmap=plt.cm.RdYlBu,
        response_method='predict',
        ax=ax,
        xlabel=iris.feature_names[pair[0]],
        ylabel=iris.feature_names[pair[1]],
    )

    # Plot the training points.
    for i, color in enumerate('ryb'):
        idx = np.where(y == i)
        ax.scatter(
            X[idx, 0],
            X[idx, 1],
            c=color,
            label=iris.target_names[i],
            cmap=plt.cm.RdYlBu,
            edgecolor="black",
            s=15,
        )

plt.legend(loc='lower right', borderpad=0, handletextpad=0)
plt.show()
Forest Cover Type
Read in the data as a pandas.DataFrame. Download the data as CSV files from the UCI dataset collection, then unzip it. There is a corresponding Kaggle competition.
!wget -cq --show-progress https://archive.ics.uci.edu/static/public/31/covertype.zip
!unzip -o covertype.zip -d covertype
import pandas as pd
import numpy as np
colnames = [
    'Elevation', 'Aspect', 'Slope', 'Horizontal_Distance_To_Hydrology',
    'Vertical_Distance_To_Hydrology', 'Horizontal_Distance_To_Roadways',
    'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm',
    'Horizontal_Distance_To_Fire_Points',
]
colnames += [f'Wilderness_Area{i}' for i in range(4)]
colnames += [f'Soil_Type{i}' for i in range(40)]
colnames += ['Cover_Type']
df = pd.read_csv('covertype/covtype.data.gz', compression='gzip', names=colnames)
df.head()
df.shape
df_train = df.iloc[:15120]
df_test = df.iloc[15120:]
assert len(df_train) == 15120
assert len(df_test) == 565892
Use the top rows as the train set.
df = df_train
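It is also worth glancing at how the seven cover types are distributed in this training part; for example:

# Number of rows per cover type in the training part.
df.Cover_Type.value_counts()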
Split Data
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
X_train, X_test, y_train, y_test = train_test_split(df.drop('Cover_Type', axis=1),
                                                    df.Cover_Type,
                                                    train_size=.80,
                                                    random_state=42)
params_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': np.arange(3, 30),
    'min_samples_split': np.arange(10, 30, 5),
}

clf = DecisionTreeClassifier()
cv = KFold(n_splits=5, shuffle=True, random_state=322)
Adjust the number of jobs (the number of trees trained in parallel during the grid search) to your environment.
N_JOBS = -1
gs = GridSearchCV(clf, param_grid=params_grid, cv=cv, n_jobs=N_JOBS, verbose=1)
gs.fit(X_train, y_train)
gs.best_estimator_
gs.best_params_
gs.best_score_
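Beyond the single best combination, the whole search can be inspected as a table; for example:

# Top parameter combinations ranked by mean validation accuracy.
results = pd.DataFrame(gs.cv_results_)
cols = ['param_criterion', 'param_max_depth', 'param_min_samples_split',
        'mean_test_score', 'std_test_score']
results.sort_values('rank_test_score')[cols].head()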
Model Evaluation
from sklearn.metrics import accuracy_score
y_pred = gs.predict(X_test)
accuracy_score(y_test, y_pred)
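Accuracy hides per-class behaviour, so a per-class breakdown is also useful; for example:

from sklearn.metrics import classification_report

# Precision, recall and F1 for each of the seven cover types.
print(classification_report(y_test, y_pred))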
Let’s plot ROC.
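With seven cover types, one option is one-vs-rest ROC curves built from the predicted probabilities; a minimal sketch, assuming the fitted gs and the matplotlib import from above:

from sklearn.metrics import RocCurveDisplay
from sklearn.preprocessing import label_binarize

# One-vs-rest ROC curve per cover type, built from predicted probabilities.
y_score = gs.predict_proba(X_test)
classes = gs.best_estimator_.classes_
y_test_bin = label_binarize(y_test, classes=classes)

fig, ax = plt.subplots(figsize=(7, 5), dpi=150)
for i, cls in enumerate(classes):
    RocCurveDisplay.from_predictions(
        y_test_bin[:, i], y_score[:, i], name=f'Cover_Type {cls}', ax=ax)
ax.plot([0, 1], [0, 1], linestyle='--', color='grey')  # chance level
plt.show()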
Public Test
Fit the model with the best parameters to the whole available dataset (train + validation parts).
gs.best_estimator_.fit(df.drop('Cover_Type', axis=1), df.Cover_Type)
Now load the public test set,
test = ...
then make predictions,
y_pred_leaderboard = gs.predict(test)
and write the predictions as a CSV file to the local filesystem.
predictions = pd.DataFrame(data=y_pred_leaderboard,
                           index=test.index,
                           columns=['Cover_Type'])
predictions.to_csv('decision_tree.csv')
!head -n 10 decision_tree.csv
Finally, we can submit the predictions to the Kaggle competition, but first let’s encode the hyper-parameters of the best model as JSON.
from json import dumps
# default=int converts numpy integers from the parameter grid into plain ints.
comment = dumps(gs.best_params_, default=int)
comment
!echo '${comment}'
If you are logged in to Kaggle, you can submit the predictions (competition).
!kaggle competitions submit \
-c forest-cover-type-prediction \
-f decision_tree.csv \
-m '${comment}'
References
- All parameters of a DecisionTreeClassifier explained.