Decision Tree Classification Algorithm

Decision Tree Example

Let’s consider a simple classification problem: there are 20 balls, each either blue or yellow. Each ball sits at an integer point from 0 to 20 (excluded). We want to predict the ball’s colour y given its position (the integer coordinate x).
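The exact arrangement of the balls is not important; for concreteness, here is a minimal sketch (an assumption, not the original data) that builds one arrangement consistent with the counts used below: 8 blue and 5 yellow balls at x ≤ 12, and 1 blue and 6 yellow balls at x > 12.

import numpy as np

# Hypothetical arrangement of the 20 balls (any arrangement with the same
# per-region counts would do): positions 0..19, colours as strings.
x = np.arange(20)
y = np.array(['BLUE'] * 8 + ['YELLOW'] * 5 + ['BLUE'] * 1 + ['YELLOW'] * 6)
print((y[x <= 12] == 'BLUE').sum(), (y[x > 12] == 'BLUE').sum())  # 8 1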

Visualization of the decision tree

Probabilities (Sample Means)

Before the first split (aka class probabilities)

\[ P(y=\text{BLUE}) = \frac{9}{20} = 0.45, \quad P(y=\text{YELLOW}) = \frac{11}{20} = 0.55. \]

After the first split (aka probabilities conditional on the coordinate)

\[ P(y=\text{BLUE}|X\leq 12) = \frac{8}{13} \approx 0.62, \quad P(y=\text{BLUE}|X> 12) = \frac{1}{7} \approx 0.14. \]

\[ P(y=\text{YELLOW}|X\leq 12) = \frac{5}{13} \approx 0.38, \quad P(y=\text{YELLOW}|X > 12) = \frac{6}{7} \approx 0.86. \]

Information Criterion

Entropy

\[ H(p) = - \sum_{i=1}^{K} p_i \log(p_i) \]

Before the first split

\[H = - 0.45 \log 0.45 - 0.55 \log 0.55 \approx 0.69 \]

After the first split

\[H_{\text{left}} = - 0.62 \log 0.62 - 0.38 \log 0.38 \approx 0.66\]

\[H_{\text{right}} = - 0.14 \log 0.14 - 0.86 \log 0.86 \approx 0.40\]

\[H_{\text{total}} = \frac{13}{20} \cdot 0.66 + \frac{7}{20} \cdot 0.40 \approx 0.57\]

Information Gain

\[ IG = H(\text{parent}) - \sum_{\text{child}} \frac{N_{\text{child}}}{N_{\text{parent}}} H(\text{child}) \]

\[IG \approx 0.69 - 0.57 = 0.12\]
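To double-check the arithmetic, here is a short sketch that recomputes the entropies and the information gain directly from the class counts (natural logarithm, as above); it prints a value around 0.11 because it does not round the intermediate entropies.

import numpy as np

def entropy(counts):
    # Shannon entropy (natural log) of a class-count vector.
    p = np.asarray(counts, dtype=float)
    p /= p.sum()
    return -(p * np.log(p)).sum()

h_parent = entropy([9, 11])                         # ~0.69
h_left, h_right = entropy([8, 5]), entropy([1, 6])  # ~0.67 and ~0.41
h_after = 13 / 20 * h_left + 7 / 20 * h_right       # ~0.58
print(h_parent - h_after)                           # information gain ~0.11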

Toy Example

First, load the dataset as usual.

from sklearn.datasets import load_iris
iris = load_iris()
feats = iris.data
labels = iris.target

Let’s import the algorithm and fit it to the data.

from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit(feats, labels)

Let’s take a look at the structure of the decision tree.

import matplotlib.pyplot as plt
import numpy as np
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.tree import plot_tree
fig, ax = plt.subplots(figsize=(14, 5), dpi=150)
plot_tree(clf, ax=ax, filled=True, proportion=False)
plt.show()

Now let’s plot the decision boundaries of trees trained on each pair of features.

fig, axs = plt.subplots(nrows=2, ncols=3, figsize=(12, 6), dpi=150, layout='constrained')

pairs = [[0, 1], [0, 2], [0, 3], [1, 2], [1, 3], [2, 3]]
for ax, pair in zip(axs.flatten(), pairs):
    # Train model.
    X = iris.data[:, pair]
    y = iris.target
    clf = DecisionTreeClassifier()
    clf.fit(X, y)

    # Plot the decision boundary
    DecisionBoundaryDisplay.from_estimator(
        clf,
        X,
        cmap=plt.cm.RdYlBu,
        response_method='predict',
        ax=ax,
        xlabel=iris.feature_names[pair[0]],
        ylabel=iris.feature_names[pair[1]],
    )

    # Plot the training points
    for i, color in enumerate('ryb'):
        idx = np.where(y == i)
        ax.scatter(
            X[idx, 0],
            X[idx, 1],
            c=color,
            label=iris.target_names[i],
            cmap=plt.cm.RdYlBu,
            edgecolor="black",
            s=15,
        )

plt.legend(loc='lower right', borderpad=0, handletextpad=0)
plt.show()
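The tree structure can also be dumped as plain-text rules. A small sketch (not in the original; max_depth=3 is an assumption to keep the printout short), using sklearn.tree.export_text:

from sklearn.tree import export_text

# Refit a shallow tree on all four features and print its rules as text.
clf = DecisionTreeClassifier(max_depth=3)
clf.fit(iris.data, iris.target)
print(export_text(clf, feature_names=list(iris.feature_names)))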

Forest Cover Type

Read the data into a pandas.DataFrame. Download the archive from the UCI dataset collection and unzip it. There is also a corresponding Kaggle competition.

!wget -cq --show-progress https://archive.ics.uci.edu/static/public/31/covertype.zip
!unzip -o covertype.zip -d covertype
import pandas as pd
import numpy as np
colnames = [
    'Elevation', 'Aspect', 'Slope', 'Horizontal_Distance_To_Hydrology',
    'Vertical_Distance_To_Hydrology', 'Horizontal_Distance_To_Roadways',
    'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm',
    'Horizontal_Distance_To_Fire_Points'
]
colnames += [f'Wilderness_Area{i}' for i in range(4)]
colnames += [f'Soil_Type{i}' for i in range(40)]
colnames += ['Cover_Type']
df = pd.read_csv('covertype/covtype.data.gz', compression='gzip', names=colnames)
df.head()
df.shape
df_train = df.iloc[:15120]
df_test = df.iloc[15120:]
assert len(df_train) == 15120
assert len(df_test) == 565892

Use the top rows as the training set.

df = df_train
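It can be worth checking the class balance of the target before modelling; a minimal sketch (Cover_Type takes the integer values 1–7):

# Class distribution of the target in the training part.
df.Cover_Type.value_counts().sort_index()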

Split Data

from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
X_train, X_test, y_train, y_test = train_test_split(df.drop('Cover_Type', axis=1),
                                                    df.Cover_Type, train_size=.80, random_state=42)
params_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': np.arange(3, 30),
    'min_samples_split': np.arange(10, 30, 5)
}
clf = DecisionTreeClassifier()
cv = KFold(n_splits=5, shuffle=True, random_state=322)

Adjust the number of jobs (the number of candidate models fitted in parallel) to your environment.

N_JOBS = -1
gs = GridSearchCV(clf, param_grid=params_grid, cv=cv, n_jobs=N_JOBS, verbose=1)
gs.fit(X_train, y_train)
gs.best_estimator_
gs.best_params_
gs.best_score_
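Optionally, the full grid of cross-validation results can be inspected as a DataFrame; a small sketch:

# Top-ranked parameter combinations from the grid search.
results = pd.DataFrame(gs.cv_results_)
results.sort_values('rank_test_score').head()[
    ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]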

Model Evaluation

from sklearn.metrics import accuracy_score
y_pred = gs.predict(X_test)
accuracy_score(y_test, y_pred)
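Accuracy alone can hide per-class behaviour; a minimal sketch of a per-class breakdown:

# Precision/recall/F1 per Cover_Type class.
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))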

Let’s plot ROC curves. Since Cover_Type has seven classes, we look at one-vs-rest curves, one per class.
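A minimal sketch (an assumption, not from the original notebook), using predicted class probabilities from the fitted grid search:

# One-vs-rest ROC curves for the multiclass target.
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay
from sklearn.preprocessing import label_binarize

classes = gs.best_estimator_.classes_
y_score = gs.predict_proba(X_test)               # shape (n_samples, n_classes)
y_test_bin = label_binarize(y_test, classes=classes)

fig, ax = plt.subplots(figsize=(7, 6), dpi=150)
for i, cls in enumerate(classes):
    RocCurveDisplay.from_predictions(
        y_test_bin[:, i], y_score[:, i], name=f'Cover_Type {cls}', ax=ax)
plt.show()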

Public Test

Fit the model with the best parameters to the whole available dataset (train + validation parts).

gs.best_estimator_.fit(df.drop('Cover_Type', axis=1), df.Cover_Type)

Now load the public test set,

test = ...

then make predictions,

y_pred_leaderboard = gs.predict(test)

and write predictions as a CSV file to local filesystem.

predictions = pd.DataFrame(data=y_pred_leaderboard,
                           index=test.index, 
                           columns=['Cover_Type'])
predictions.to_csv('decision_tree.csv')
!head -n 10 decision_tree.csv

Finally, we can submit the predictions to the Kaggle competition, but first let’s encode the hyper-parameters of the best model as JSON to use as the submission message.

from json import dumps
comment = dumps(gs.best_params_, default=int)  # default=int converts numpy integers
comment
!echo '{comment}'

If you are logged in to Kaggle, you can submit the predictions to the competition.

!kaggle competitions submit \
    -c forest-cover-type-prediction \
    -f decision_tree.csv \
    -m '{comment}'

References

  • All parameters of a DecisionTreeClassifier explained.