import matplotlib.pyplot as plt
import numpy as np

3 Linear Models: AUROC

An unbiased estimate of the area under the receiver operating characteristic curve (AUROC) is
\[ AUROC(a) = \frac{1}{|\mathcal{D}_0| |\mathcal{D}_1|} \sum_{x_0 \in \mathcal{D}_0} \sum_{x_1 \in \mathcal{D}_1} I[a(x_0) < a(x_1)], \]
where \(a\) is a scoring algorithm, \(I[\cdot]\) is the indicator function, and \(\mathcal{D}_0\) and \(\mathcal{D}_1\) are the sets of negative (0) and positive (1) examples. AUROC is useful for several reasons:
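- Threshold independence: it summarizes performance over all decision thresholds instead of a single one.
- Robustness to class imbalance: TPR and FPR are each computed within one class, so the value does not depend on the class proportions.
- Model comparison: it is a single scalar, so models can be ranked by it directly.
- Interpretability: it is the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one.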
- …
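The pairwise estimator above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming labels are encoded as 0 and 1; the function name `pairwise_auroc` and the toy scores are made up for this example. It follows the formula literally, so tied scores contribute nothing, and it costs \(|\mathcal{D}_0| |\mathcal{D}_1|\) comparisons.

```python
def pairwise_auroc(scores, labels):
    """Fraction of (negative, positive) pairs that the scores rank correctly."""
    neg = [s for s, y in zip(scores, labels) if y == 0]
    pos = [s for s, y in zip(scores, labels) if y == 1]
    if not neg or not pos:
        raise ValueError("both classes must be present")
    # Strict inequality, as in the formula: a tie counts as an incorrect pair.
    correct = sum(1 for s0 in neg for s1 in pos if s0 < s1)
    return correct / (len(neg) * len(pos))


# Hypothetical toy data: the positive scored 0.35 falls below the negative scored 0.4.
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
toy_auroc = pairwise_auroc(toy_scores, toy_labels)
print(toy_auroc)  # 0.75: three of the four (negative, positive) pairs are ordered correctly
```

The quadratic loop is fine for small samples; for larger ones a rank-based computation gives the same value more cheaply.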
Dataset
Let’s generate a synthetic 2D dataset for a binary classification problem.
rs = np.random.RandomState(42)

n_points = 200
# Two Gaussian clusters with unit variance, centered at (1, 1) and (-1, -1).
clusters = [
    rs.normal(loc=(1, 1), size=(n_points // 2, 2)),
    rs.normal(loc=(-1, -1), size=(n_points // 2, 2)),
]
coords = np.vstack(clusters)
# The first cluster is the negative class (0), the second the positive class (1).
labels = np.zeros(n_points, dtype=int)
labels[n_points // 2:] += 1

for i, cluster in enumerate(clusters):
    plt.scatter(cluster[:, 0], cluster[:, 1], label=f'cluster {i}')
plt.grid(True)
plt.legend()
plt.show()

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(coords, labels)
# Column 0 holds P(class 0) and column 1 holds P(class 1) for each point.
probas = clf.predict_proba(coords)

Confusion Matrix
The confusion matrix for a binary classifier is
\[ C = \begin{bmatrix} TP & FP \\ FN & TN \end{bmatrix}. \]
From it we define the false-positive rate (FPR) and the true-positive rate (TPR) as
\[ FPR = \frac{FP}{FP + TN}, \]
\[ TPR = \frac{TP}{TP + FN}. \]
The receiver operating characteristic (ROC) curve is the relation between TPR and FPR traced out as the decision threshold varies,
\[ TPR = TPR(FPR). \]
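The rate definitions above reduce to counting four kinds of outcomes. The following is a minimal sketch in plain Python, assuming labels and predictions are encoded as 0 and 1 and taking FPR = FP / (FP + TN) and TPR = TP / (TP + FN). The helper names `confusion_counts` and `rates` and the toy labels are made up for illustration and are not part of scikit-learn.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 1 and p == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def rates(y_true, y_pred):
    """Return (FPR, TPR) for binary labels and binary predictions."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    fpr = fp / (fp + tn)  # share of actual negatives flagged as positive
    tpr = tp / (tp + fn)  # share of actual positives that were found
    return fpr, tpr


# Hypothetical toy data: one negative is flagged and one positive is missed.
toy_true = [0, 0, 0, 0, 1, 1, 1, 1]
toy_pred = [0, 0, 0, 1, 0, 1, 1, 1]
toy_fpr, toy_tpr = rates(toy_true, toy_pred)
print(toy_fpr, toy_tpr)  # 0.25 0.75
```

Each rate is normalized by the size of one true class only, which is why neither changes when the class proportions change.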
from sklearn.metrics import confusion_matrix

threshold = 0.5
preds = (probas[:, 1] >= threshold).astype(int)

# scikit-learn orders the matrix as [[TN, FP], [FN, TP]] for labels (0, 1).
confmat = confusion_matrix(labels, preds)
tn, fp, fn, tp = confmat.ravel()

fpr = fp / (fp + tn)
fpr

tpr = tp / (tp + fn)
tpr

ROC and Area under ROC
We now sweep the threshold from 0 to 1, compute the confusion matrix and the resulting TPR and FPR at each value, plot the ROC curve, and calculate AUROC.
# thresholds = (0, 0.5, 1)
thresholds = np.linspace(0, 1, 101)
fprs = np.empty(len(thresholds))
tprs = np.empty(len(thresholds))
for i, threshold in enumerate(thresholds):
    # P(class 0) <= threshold is the same as P(class 1) >= 1 - threshold,
    # so FPR and TPR are non-decreasing as the threshold grows.
    preds = (probas[:, 0] <= threshold).astype(int)
    confmat = confusion_matrix(labels, preds)
    tn, fp, fn, tp = confmat.ravel()
    fprs[i] = fp / (fp + tn)
    tprs[i] = tp / (tp + fn)

plt.step(fprs, tprs, '-')
plt.grid(True)
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.show()

If we numerically integrate TPR over FPR, the value of that integral is the area under the curve (i.e. AUROC).

# In NumPy 2.0 and later this function is named np.trapezoid; np.trapz is deprecated there.
np.trapz(tprs, fprs)

Great! Let’s compare with the library implementation.
from sklearn.metrics import roc_curve, roc_auc_score

roc_auc_score(labels, probas[:, 1])

fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(15, 4))
fpr, tpr, th = roc_curve(labels, probas[:, 1])
ax = axs[0]
ax.plot(fpr, tpr, '.-', label='library')
ax.legend()
ax.grid(True)
ax.set_xlabel('FPR')
ax.set_ylabel('TPR')
ax = axs[1]
ax.step(fprs, tprs, '.-', label='ours')
ax.legend()
ax.grid(True)
ax.set_xlabel('FPR')
ax.set_ylabel('TPR')
plt.show()
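The numerical integration step used earlier can also be written out by hand, which makes the trapezoidal rule explicit. The following is a minimal sketch in plain Python, assuming 0/1 labels and placing one threshold at each distinct score. The helper names `roc_points` and `trapezoid_area` and the toy scores are made up for illustration; this is not how scikit-learn computes the curve internally.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the threshold score by score."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def trapezoid_area(points):
    """Integrate y over x with the trapezoidal rule on consecutive points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical toy data with distinct scores.
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
toy_points = roc_points(toy_scores, toy_labels)
toy_area = trapezoid_area(toy_points)
print(toy_points)  # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(toy_area)    # 0.75
```

With distinct scores, the area matches the fraction of correctly ordered (negative, positive) pairs: three out of four for this toy data.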