import matplotlib.pyplot as plt
import numpy as np
3 Linear Models: AUROC
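An unbiased estimate of the area under the receiver operating characteristic curve (AUROC) is
\[ AUROC(a) = \frac{1}{|\mathcal{D}_0| |\mathcal{D}_1|} \sum_{x_0 \in \mathcal{D}_0} \sum_{x_1 \in \mathcal{D}_1} I[a(x_0) < a(x_1)], \]
where \(a\) is an algorithm that assigns a score to each example, and \(\mathcal{D}_0\) and \(\mathcal{D}_1\) are the sets of negative (0) and positive (1) examples. In other words, it is the fraction of negative-positive pairs in which the positive example receives the higher score. AUROC is a useful metric for several reasons: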
- Threshold Independence
- Robustness to Class Imbalance
- Model Comparison
- Interpretability
- …
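The double sum in the estimator above can be sketched directly in plain Python. This is a minimal illustration rather than an efficient implementation: the function name pairwise_auroc and the toy scores are made up here, and the sketch assumes that both classes are present.

```python
def pairwise_auroc(scores, labels):
    """Fraction of (negative, positive) pairs where the positive scores higher."""
    neg = [s for s, y in zip(scores, labels) if y == 0]
    pos = [s for s, y in zip(scores, labels) if y == 1]
    if not neg or not pos:
        raise ValueError("need at least one negative and one positive example")
    # Double sum over all pairs; the indicator uses a strict inequality,
    # so tied scores contribute 0, as in the formula.
    wins = sum(1 for s0 in neg for s1 in pos if s0 < s1)
    return wins / (len(neg) * len(pos))


# Toy data (made up for illustration).
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
print(pairwise_auroc(toy_scores, toy_labels))  # 0.75
```

Three of the four negative-positive pairs are ordered correctly, hence 0.75. The cost is quadratic in the dataset size, and library implementations usually count tied pairs as one half, so results can differ slightly when scores tie.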
Dataset
Let’s generate a synthetic 2D dataset for a binary classification problem.
rs = np.random.RandomState(42)
n_points = 200
clusters = [
    rs.normal(loc=(1, 1), size=(n_points // 2, 2)),
    rs.normal(loc=(-1, -1), size=(n_points // 2, 2)),
]
coords = np.vstack(clusters)
labels = np.zeros(n_points, dtype=int)
labels[n_points // 2:] += 1
for i, cluster in enumerate(clusters):
    plt.scatter(clusters[i][:, 0], clusters[i][:, 1], label=f'cluster {i}')
plt.grid(True)
plt.legend()
plt.show()
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(coords, labels)
probas = clf.predict_proba(coords)
Confusion Matrix
The confusion matrix for a binary classifier is
\[ C = \begin{bmatrix} TP & FP \\ FN & TN \end{bmatrix}. \]
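Here TP, FP, FN, and TN count true positives, false positives, false negatives, and true negatives. Let’s define the false-positive rate (FPR) and the true-positive rate (TPR) as
\[ FPR = \frac{FP}{FP + TN}, \]
\[ TPR = \frac{TP}{TP + FN}. \]
The receiver operating characteristic (ROC) curve is the relation between TPR and FPR traced out as the decision threshold varies: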
\[ TPR= TPR(FPR). \]
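As a small worked example, the two rates can be counted by hand for a fixed threshold: FPR is the share of negatives predicted positive, FP / (FP + TN), and TPR is the share of positives predicted positive, TP / (TP + FN). The sketch below is dependency-free. The helper name rates_at_threshold and the toy data are hypothetical, and it assumes both classes are present (otherwise a denominator is zero).

```python
def rates_at_threshold(scores, labels, threshold):
    """Return (FPR, TPR) when predicting positive for score >= threshold."""
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        pred = 1 if score >= threshold else 0
        if pred == 1 and label == 1:
            tp += 1
        elif pred == 1 and label == 0:
            fp += 1
        elif pred == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    fpr = fp / (fp + tn)  # share of negatives flagged as positive
    tpr = tp / (tp + fn)  # share of positives that were found
    return fpr, tpr


# Toy data (made up for illustration): two negatives, two positives.
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
for t in (0.3, 0.5):
    print(t, rates_at_threshold(toy_scores, toy_labels, t))
```

Lowering the threshold from 0.5 to 0.3 moves the point from (0.0, 0.5) to (0.5, 1.0): more positives are found at the price of more false alarms, which is the trade-off the ROC curve traces.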
from sklearn.metrics import confusion_matrix
threshold = 0.5
preds = (probas[:, 1] >= threshold).astype(int)
confmat = confusion_matrix(labels, preds)
# For binary labels, sklearn flattens the matrix in the order tn, fp, fn, tp.
tn, fp, fn, tp = confmat.ravel()
fpr = fp / (fp + tn)
fpr
tpr = tp / (tp + fn)
tpr
ROC and Area under ROC
To build the ROC curve, we sweep the threshold from 0 to 1 and, for each value, compute the confusion matrix and the resulting TPR and FPR. Then we plot the ROC curve and calculate AUROC.
# thresholds = (0, 0.5, 1)
thresholds = np.linspace(0, 1, 101)
fprs = np.empty(len(thresholds))
tprs = np.empty(len(thresholds))
for i, threshold in enumerate(thresholds):
    # probas[:, 0] <= threshold is the same rule as probas[:, 1] >= 1 - threshold,
    # so FPR and TPR grow as the threshold increases.
    preds = (probas[:, 0] <= threshold).astype(int)
    confmat = confusion_matrix(labels, preds)
    tn, fp, fn, tp = confmat.ravel()
    fprs[i] = fp / (fp + tn)
    tprs[i] = tp / (tp + fn)
plt.step(fprs, tprs, '-')
plt.grid(True)
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.show()
If we numerically integrate tprs over fprs, the value of that integral is the area under the curve (i.e., AUROC).
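As a sketch of what that numerical integration does, assume the points are ordered by increasing FPR and write \(FPR_i\) and \(TPR_i\) for the \(i\)-th entries of the fprs and tprs arrays, with \(n\) points in total (this indexing is introduced here for illustration). The trapezoidal rule then approximates the area as
\[ AUROC \approx \sum_{i=1}^{n-1} (FPR_{i+1} - FPR_i) \, \frac{TPR_i + TPR_{i+1}}{2}. \]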
np.trapz(tprs, fprs)
Great! Let’s compare with the library implementation.
from sklearn.metrics import roc_curve, roc_auc_score
roc_auc_score(labels, probas[:, 1])
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(15, 4))

fpr, tpr, th = roc_curve(labels, probas[:, 1])
ax = axs[0]
ax.plot(fpr, tpr, '.-', label='library')
ax.legend()
ax.grid(True)
ax.set_xlabel('FPR')
ax.set_ylabel('TPR')

ax = axs[1]
ax.step(fprs, tprs, '.-', label='ours')
ax.legend()
ax.grid(True)
ax.set_xlabel('FPR')
ax.set_ylabel('TPR')

plt.show()