Decision Tree Classification Algorithm
- Training
  - Find the most informative combination of tree node, feature, and split value.
  - Do the split if `max_depth` is not reached.
  - Iterate over steps 1-2 (a minimal sketch of this greedy loop follows below).
- Inference (prediction)
  - Follow the decision rules.
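The split search and the recursion are only outlined above, so here is a minimal sketch of the greedy procedure, purely as an illustration (entropy impurity, exhaustive threshold search); it is not the scikit-learn implementation.

```python
import numpy as np

def entropy(y):
    # Shannon entropy of the label array y.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def best_split(X, y):
    # Exhaustive search for the (feature, threshold) pair that minimizes
    # the weighted entropy of the two children.
    best = None
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            left = X[:, feature] <= threshold
            if left.all() or not left.any():
                continue  # the split must produce two non-empty children
            h = left.mean() * entropy(y[left]) + (~left).mean() * entropy(y[~left])
            if best is None or h < best[0]:
                best = (h, feature, threshold)
    return best  # (weighted entropy, feature, threshold) or None

def majority(y):
    # Majority class of a node; used as the leaf prediction.
    values, counts = np.unique(y, return_counts=True)
    return values[np.argmax(counts)]

def grow(X, y, depth=0, max_depth=3):
    # Stop when the node is pure or max_depth is reached.
    if depth == max_depth or len(np.unique(y)) == 1:
        return majority(y)
    split = best_split(X, y)
    if split is None:
        return majority(y)
    _, feature, threshold = split
    mask = X[:, feature] <= threshold
    return {'feature': feature, 'threshold': threshold,
            'left': grow(X[mask], y[mask], depth + 1, max_depth),
            'right': grow(X[~mask], y[~mask], depth + 1, max_depth)}
```

Real implementations add further stopping criteria (e.g. a minimum number of samples per split, minimum impurity decrease) and support other impurity measures such as Gini.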
Decision Tree Example
Let’s consider a simple classification task: there are 20 balls, each either blue or yellow. Each ball sits at an integer point from 0 to 20 (exclusive). We want to guess the ball colour y given its position (integer coordinate x).

Probabilities (Sample Means)
Before the first split (the class probabilities)
\[ P(y=\text{BLUE}) = \frac{9}{20} = 0.45, \quad P(y=\text{YELLOW}) = \frac{11}{20} = 0.55. \]
After the first split (probabilities conditional on the coordinate)
\[ P(y=\text{BLUE}|X\leq 12) = \frac{8}{13} \approx 0.62, \quad P(y=\text{BLUE}|X> 12) = \frac{1}{7} \approx 0.14. \]
\[ P(y=\text{YELLOW}|X\leq 12) = \frac{5}{13} \approx 0.38, \quad P(y=\text{YELLOW}|X > 12) = \frac{6}{7} \approx 0.86. \]
Information Criterion
Entropy
\[ H(p) = - \sum_{i=1}^{K} p_i \log(p_i) \]
Before the first split
\[H = - 0.45 \log 0.45 - 0.55 \log 0.55 \approx 0.69 \]
After the first split
\[H_{\text{left}} = - 0.62 \log 0.62 - 0.38 \log 0.38 \approx 0.66\]
\[H_{\text{right}} = - 0.14 \log 0.14 - 0.86 \log 0.86 \approx 0.40\]
\[H_{\text{total}} = \frac{13}{20} \cdot 0.66 + \frac{7}{20} \cdot 0.40 \approx 0.57\]
Information Gain
\[ IG = H(\text{parent}) - \sum_{\text{child}} \frac{N_{\text{child}}}{N_{\text{parent}}} H(\text{child}) \]
\[IG = 0.69 - 0.57 = 0.12\]
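As a sanity check, the same numbers can be reproduced from the class counts alone; a small numpy snippet (the exact values differ slightly from the hand calculation above, which rounds intermediate probabilities):

```python
import numpy as np

def entropy_from_counts(counts):
    # Shannon entropy computed from class counts.
    p = np.asarray(counts) / np.sum(counts)
    return -np.sum(p * np.log(p))

h_parent = entropy_from_counts([9, 11])                            # ~0.69
h_left, h_right = entropy_from_counts([8, 5]), entropy_from_counts([1, 6])  # ~0.67 and ~0.41
h_split = 13 / 20 * h_left + 7 / 20 * h_right                      # ~0.58
print(h_parent, h_split, h_parent - h_split)                       # information gain ~0.11
```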
Toy Example
First, load the dataset as usual.

```python
from sklearn.datasets import load_iris

iris = load_iris()
feats = iris.data
labels = iris.target
```

Let’s import an algorithm and train its parameters.

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
clf.fit(feats, labels)
```

Let’s take a look at the structure of the decision tree.
```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

fig, ax = plt.subplots(figsize=(14, 5), dpi=150)
plot_tree(clf, ax=ax, filled=True, proportion=False)
plt.show()
```

Now plot the decision boundaries for each pair of features.

```python
import numpy as np
from sklearn.inspection import DecisionBoundaryDisplay

fig, axs = plt.subplots(nrows=2, ncols=3, figsize=(12, 6), dpi=150, layout='constrained')
pairs = [[0, 1], [0, 2], [0, 3], [1, 2], [1, 3], [2, 3]]
for ax, pair in zip(axs.flatten(), pairs):
    # Train a model on the selected pair of features.
    X = iris.data[:, pair]
    y = iris.target
    clf = DecisionTreeClassifier()
    clf.fit(X, y)
    # Plot the decision boundary.
    DecisionBoundaryDisplay.from_estimator(
        clf,
        X,
        cmap=plt.cm.RdYlBu,
        response_method='predict',
        ax=ax,
        xlabel=iris.feature_names[pair[0]],
        ylabel=iris.feature_names[pair[1]],
    )
    # Plot the training points.
    for i, color in enumerate('ryb'):
        idx = np.where(y == i)
        ax.scatter(
            X[idx, 0],
            X[idx, 1],
            c=color,
            label=iris.target_names[i],
            edgecolor='black',
            s=15,
        )
plt.legend(loc='lower right', borderpad=0, handletextpad=0)
plt.show()
```

Forest Cover Type
Read the data into a pandas.DataFrame. Download the data as CSV files from the UCI dataset collection, then unzip it. There is a corresponding Kaggle competition.
```python
!wget -cq --show-progress https://archive.ics.uci.edu/static/public/31/covertype.zip
!unzip -o covertype.zip -d covertype
```

```python
import pandas as pd
import numpy as np

colnames = [
    'Elevation', 'Aspect', 'Slope', 'Horizontal_Distance_To_Hydrology',
    'Vertical_Distance_To_Hydrology', 'Horizontal_Distance_To_Roadways',
    'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm',
    'Horizontal_Distance_To_Fire_Points',
]
colnames += [f'Wilderness_Area{i}' for i in range(4)]
colnames += [f'Soil_Type{i}' for i in range(40)]
colnames += ['Cover_Type']

df = pd.read_csv('covertype/covtype.data.gz', compression='gzip', names=colnames)
df.head()
```

```python
df.shape
```

```python
df_train = df.iloc[:15120]
df_test = df.iloc[15120:]

assert len(df_train) == 15120
assert len(df_test) == 565892
```

Use the top rows as the training set.

```python
df = df_train
```

Split Data
```python
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(
    df.drop('Cover_Type', axis=1), df.Cover_Type,
    train_size=0.80, random_state=42)
```

```python
params_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': np.arange(3, 30),
    'min_samples_split': np.arange(10, 30, 5),
}
```

```python
clf = DecisionTreeClassifier()
cv = KFold(n_splits=5, shuffle=True, random_state=322)
```

Adjust the number of jobs (the number of trees trained in parallel) to your environment.

```python
N_JOBS = -1

gs = GridSearchCV(clf, param_grid=params_grid, cv=cv, n_jobs=N_JOBS, verbose=1)
gs.fit(X_train, y_train)
```

```python
gs.best_estimator_
```

```python
gs.best_params_
```

```python
gs.best_score_
```

Model Evaluation
```python
from sklearn.metrics import accuracy_score

y_pred = gs.predict(X_test)
accuracy_score(y_test, y_pred)
```

Let’s plot ROC curves.
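The ROC code itself is not shown here, so below is one possible one-vs-rest sketch, assuming the fitted `gs` from the grid search above (DecisionTreeClassifier exposes predict_proba); treat it as an illustration rather than the original solution.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve
from sklearn.preprocessing import label_binarize

# Class probabilities from the best model found by the grid search.
y_score = gs.predict_proba(X_test)
classes = gs.best_estimator_.classes_
# Binarize the multiclass labels: one column per cover type.
y_test_bin = label_binarize(y_test, classes=classes)

fig, ax = plt.subplots(figsize=(6, 5), dpi=150)
for i, cls in enumerate(classes):
    fpr, tpr, _ = roc_curve(y_test_bin[:, i], y_score[:, i])
    ax.plot(fpr, tpr, label=f'class {cls} (AUC = {auc(fpr, tpr):.2f})')
ax.plot([0, 1], [0, 1], linestyle='--', color='grey')  # chance level
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.legend(loc='lower right', fontsize=8)
plt.show()
```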
Public Test
Fit the model with the best parameters to the whole available dataset (train + validation parts).
```python
gs.best_estimator_.fit(df.drop('Cover_Type', axis=1), df.Cover_Type)
```

Now load the public test set,

```python
test = ...
```

then make predictions,

```python
y_pred_leaderboard = gs.predict(test)
```

and write the predictions as a CSV file to the local filesystem.

```python
predictions = pd.DataFrame(data=y_pred_leaderboard,
                           index=test.index,
                           columns=['Cover_Type'])
predictions.to_csv('decision_tree.csv')
```

```python
!head -n 10 decision_tree.csv
```

Finally, we can submit the predictions to the Kaggle competition, but first let’s encode the hyper-parameters of the best model as JSON.
```python
from json import dumps

# default=int converts numpy integers coming from the search grid.
comment = dumps(gs.best_params_, default=int)
comment
```

```python
!echo '${comment}'
```

If you are logged in to Kaggle, you can submit predictions (competition).

```python
!kaggle competitions submit \
    -c forest-cover-type-prediction \
    -f decision_tree.csv \
    -m '${comment}'
```

References
- All parameters of a DecisionTreeClassifier explained.