3 Linear Models: AUROC

An example of an ROC curve

An unbiased estimate of the area under the receiver operating characteristic curve (AUROC) is

\[ AUROC(a) = \frac{1}{|\mathcal{D}_0| |\mathcal{D}_1|} \sum_{x_0 \in \mathcal{D}_0} \sum_{x_1 \in \mathcal{D}_1} I[a(x_0) < a(x_1)], \]

where \(a\) is an algorithm that assigns a score to each example, \(I[\cdot]\) is the indicator function, and \(\mathcal{D}_0\) and \(\mathcal{D}_1\) are the sets of negative (0) and positive (1) examples. It is useful since

Dataset

Let’s generate a synthetic 2D dataset for a classification problem.

import matplotlib.pyplot as plt
import numpy as np
# Fixed seed so that the generated points are reproducible.
rs = np.random.RandomState(42)
n_points = 200
# Two Gaussian clusters with unit variance, centred at (1, 1) and (-1, -1).
clusters = [
    rs.normal(loc=(1, 1), size=(n_points // 2, 2)),
    rs.normal(loc=(-1, -1), size=(n_points // 2, 2)),
]
coords = np.vstack(clusters)
# The first half of the points (cluster 0) gets label 0, the second half label 1.
labels = np.zeros(n_points, dtype=int)
labels[n_points // 2:] += 1
for i, cluster in enumerate(clusters):
    plt.scatter(cluster[:, 0], cluster[:, 1], label=f'cluster {i}')
plt.grid(True)
plt.legend()
plt.show()
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(coords, labels)
# Column 0 holds P(label = 0), column 1 holds P(label = 1).
probas = clf.predict_proba(coords)

Confusion Matrix

The confusion matrix for a binary classifier is

\[ C = \begin{bmatrix} TP & FP \\ FN & TN \end{bmatrix}. \]

Then let’s define the false-positive rate (FPR) and the true-positive rate (TPR) as follows: \[ FPR = \frac{FP}{FP + TN}, \] \[ TPR = \frac{TP}{TP + FN}. \]

The receiver operating characteristic (ROC) curve is the relation between TPR and FPR

\[ TPR= TPR(FPR). \]

from sklearn.metrics import confusion_matrix
threshold = 0.5
preds = (probas[:, 1] >= threshold).astype(int)
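# sklearn puts true labels in rows and predictions in columns, so for binary
# labels the matrix is [[TN, FP], [FN, TP]] (a different layout from C above).
confmat = confusion_matrix(labels, preds)
tn, fp, fn, tp = confmat.ravel()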
fpr = fp / (fp + tn)
fpr
tpr = tp / (tp + fn)
tpr
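To make the two definitions concrete, here is a minimal sketch that counts the four confusion-matrix entries directly with boolean masks instead of calling the library. The helper name `rates` and the toy arrays are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def rates(y_true, y_pred):
    """Return (FPR, TPR) for binary 0/1 labels and predictions (hypothetical helper)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_true & y_pred)    # positives predicted as positive
    fp = np.sum(~y_true & y_pred)   # negatives predicted as positive
    fn = np.sum(y_true & ~y_pred)   # positives predicted as negative
    tn = np.sum(~y_true & ~y_pred)  # negatives predicted as negative
    return fp / (fp + tn), tp / (tp + fn)

# Toy data: three negatives (one misclassified), two positives (one missed).
toy_true = [0, 0, 0, 1, 1]
toy_pred = [0, 1, 0, 1, 0]
toy_fpr, toy_tpr = rates(toy_true, toy_pred)
print(toy_fpr, toy_tpr)
```

On the toy data one of three negatives is flagged as positive and one of two positives is found, so FPR is 1/3 and TPR is 1/2.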

ROC and Area under ROC

We need to vary the threshold from 0 to 1, compute the confusion matrix and the TPR/FPR at each value, then plot the ROC curve and calculate the AUROC.

# thresholds = (0, 0.5, 1)
thresholds = np.linspace(0, 1, 101)

fprs = np.empty(len(thresholds))
tprs = np.empty(len(thresholds))
for i, threshold in enumerate(thresholds):
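    # Thresholding P(label = 0) from above is the same as requiring
    # P(label = 1) >= 1 - threshold, so FPR grows from 0 to 1 along the sweep.
    preds = (probas[:, 0] <= threshold).astype(int)
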
    confmat = confusion_matrix(labels, preds)
    tn, fp, fn, tp  = confmat.ravel()

    fprs[i] = fp / (fp + tn)
    tprs[i] = tp / (tp + fn)
plt.step(fprs, tprs, '-')
plt.grid(True)
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.show()

If we numerically integrate tprs over fprs, the value of that integral is the area under the curve (i.e. AUROC).

np.trapz(tprs, fprs)
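The integration above uses the trapezoidal rule: each pair of neighbouring points contributes the width of the FPR step times the average of the two TPR values. A minimal sketch of that sum, written by hand on toy values (the helper name `trapezoid_area` and the toy arrays are assumptions made for illustration):

```python
import numpy as np

def trapezoid_area(x, y):
    """Trapezoidal-rule area under the points (x, y); hypothetical helper."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    widths = np.diff(x)                # FPR step between neighbouring points
    mean_heights = (y[1:] + y[:-1]) / 2  # average TPR over each step
    return float(np.sum(widths * mean_heights))

# Toy ROC points, ordered by increasing FPR.
toy_fprs = np.array([0.0, 0.0, 0.5, 1.0])
toy_tprs = np.array([0.0, 0.5, 1.0, 1.0])
toy_area = trapezoid_area(toy_fprs, toy_tprs)
print(toy_area)
```

The sign of the result depends on the order of the points: if FPR decreases along the array, the widths are negative and so is the area.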

Great! Let’s compare with the library implementation.

from sklearn.metrics import roc_curve, roc_auc_score
roc_auc_score(labels, probas[:, 1])
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(15, 4))

fpr, tpr, th = roc_curve(labels, probas[:, 1])
ax = axs[0]
ax.plot(fpr, tpr, '.-', label='library')
ax.legend()
ax.grid(True)
ax.set_xlabel('FPR')
ax.set_ylabel('TPR')

ax = axs[1]
ax.step(fprs, tprs, '.-', label='ours')
ax.legend()
ax.grid(True)
ax.set_xlabel('FPR')
ax.set_ylabel('TPR')

plt.show()
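As a further cross-check, the pairwise estimate from the beginning of this section can be computed directly: compare every negative score with every positive score and average the indicator. The sketch below is one possible implementation under that definition; the helper name `pairwise_auroc` and the toy data are assumptions, and the quadratic memory use makes it suitable only for small datasets.

```python
import numpy as np

def pairwise_auroc(scores, y):
    """Fraction of (negative, positive) pairs ranked correctly; hypothetical helper."""
    scores = np.asarray(scores, dtype=float)
    y = np.asarray(y)
    neg = scores[y == 0]
    pos = scores[y == 1]
    # Broadcasting builds a |D_0| x |D_1| table of I[a(x_0) < a(x_1)].
    correct = neg[:, None] < pos[None, :]
    return float(correct.mean())

# Toy scores: one of the four negative/positive pairs is ranked incorrectly.
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
toy_auroc = pairwise_auroc(toy_scores, toy_labels)
print(toy_auroc)
```

With the strict inequality, tied scores count as incorrectly ranked. On the section's data, `pairwise_auroc(probas[:, 1], labels)` should agree with `roc_auc_score(labels, probas[:, 1])` as long as no positive and negative example share exactly the same score.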