# PR Curve

Yao Yao on March 15, 2018

## 1. Axes

• The x-axis is $\text{Recall} = \frac{TP}{TP+FN} = \frac{TP}{P}$ (i.e. the $\text{True Positive Rate}$)
  • Yes, exactly the same as the y-axis of the ROC curve
• The y-axis is $\text{Precision} = \frac{TP}{TP+FP}$ (a.k.a. the $\text{Positive Predictive Value}$)
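As a quick sanity check on the two formulas, here is a tiny numeric example (the confusion-matrix counts are made up for illustration):

```python
# hypothetical confusion-matrix counts, for illustration only
TP, FP, FN = 8, 2, 4

recall = TP / (TP + FN)     # fraction of actual positives recovered: 8/12 ≈ 0.667
precision = TP / (TP + FP)  # fraction of predicted positives that are correct: 8/10 = 0.8
```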

## 2. Patterns in the Output Sequences

```python
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

# "bc" is not defined in the original snippet; presumably the breast cancer dataset
bc = load_breast_cancer()

feat = pd.DataFrame(bc['data'])
feat.columns = bc['feature_names']
label = pd.Series(bc['target'])

# contrary to usual python slices, both the start and the stop are included in .loc!
train_X = feat.loc[0:99, ]
train_y = label.loc[0:99]
test_X = feat.loc[100:, ]
test_y = label.loc[100:]

lr_config = dict(penalty='l2', C=1.0, class_weight=None, random_state=1337,
                 solver='liblinear', max_iter=100, verbose=0, warm_start=False, n_jobs=1)
lr = LogisticRegression(**lr_config)
lr.fit(train_X, train_y)

proba_test_y = lr.predict_proba(test_X)[:, 1]

precision, recall, threshold = precision_recall_curve(test_y, proba_test_y, pos_label=1)
```


```python
>>> print(len(precision), len(recall), len(threshold))
376 376 375
```
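The same length relationship (precision and recall are one element longer than thresholds) can be reproduced on the tiny example from the scikit-learn docs:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# precision and recall have one more element than thresholds,
# because of the artificially appended (recall=0, precision=1) point
print(len(precision), len(recall), len(thresholds))  # 4 4 3
print(precision[-1], recall[-1])                     # 1.0 0.0
```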


```python
def precision_recall_curve(y_true, probas_pred, pos_label=None, sample_weight=None):
    fps, tps, thresholds = _binary_clf_curve(y_true, probas_pred,
                                             pos_label=pos_label,
                                             sample_weight=sample_weight)

    precision = tps / (tps + fps)
    recall = tps / tps[-1]

    # stop when full recall attained
    # and reverse the outputs so recall is decreasing
    last_ind = tps.searchsorted(tps[-1])
    sl = slice(last_ind, None, -1)
    return np.r_[precision[sl], 1], np.r_[recall[sl], 0], thresholds[sl]
```

• The point (recall=0, precision=1) is appended artificially, and no corresponding threshold value is appended for it
• You can simply read np.r_ as a concat here
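For example:

```python
import numpy as np

# np.r_ concatenates along the first axis; a scalar is treated as a length-1 array
combined = np.r_[np.array([0.75, 0.8]), 1]
print(combined.tolist())  # [0.75, 0.8, 1.0]
```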

• Suppose len(recall) == len(precision) == n; then recall[0:n-2], precision[0:n-2] and thresholds[0:n-2] correspond one-to-one
• thresholds is increasing, from 0 to very close to 1
• recall decreases monotonically as thresholds increases, which was already discussed in ROC Curve: my interpretation
• precision does not decrease monotonically as thresholds increases, but the overall trend is downward

• When threshold=1, nothing is predicted positive, so $TP$ and $FP$ are both 0, and $\text{Precision} =\frac{TP}{TP+FP} = \frac{0}{0}$ is undefined

```python
thresholds = np.r_[thresholds, 1]

df = pd.DataFrame({"Precision": precision, "Recall": recall, "Decision Threshold": thresholds})
```

```r
> library(ggplot2)
> colnames(auprc_df)
'Decision.Threshold' 'Precision' 'Recall'
> p <- ggplot(data=auprc_df, mapping=aes(x=Recall, y=Precision)) +
    geom_path(size=0.3) + geom_point(size=0.4, color=I("blue"))
> p
```

```r
> p <- ggplot(data=auprc_df[40:60,], mapping=aes(x=Recall, y=Precision)) +
    geom_path(size=0.3) + geom_point(size=0.4, color=I("blue"))
> p + geom_text(aes(label=sprintf("%0.3f", round(Decision.Threshold, digits = 3))), hjust=-0.2, vjust=-0.4, size=2.2)
```


## 3. Why Precision Is Not Monotonic

• $t_j > t_{j+1}$
• $\lbrace h_{t_j}(x) = 1 \rbrace \subseteq \lbrace h_{t_{j+1}}(x) = 1 \rbrace$
  • As the threshold decreases, the number of cases predicted positive can only grow, never shrink
  • In particular, when $t = 0$, $\vert \lbrace h_{t}(x) = 1 \rbrace \vert = n$
• For the sequence $\lbrace \text{Precision}_j \rbrace$:
  • $FP \rvert_{t_j} \leq FP \rvert_{t_{j+1}}$
    • $\lbrace h_{t_{j+1}}(x) = 1 \rbrace \setminus \lbrace h_{t_j}(x) = 1 \rbrace$ are the newly predicted positive cases, among which there are zero or more new False Positives
  • $TP \rvert_{t_j} \leq TP \rvert_{t_{j+1}}$
    • Likewise, among the newly predicted positive cases there are zero or more new True Positives
  • But nothing guarantees $\frac{TP \rvert_{t_j}}{TP\rvert_{t_j} + FP \rvert_{t_j}} \leq \frac{TP \rvert_{t_{j+1}}}{TP \rvert_{t_{j+1}} + FP \rvert_{t_{j+1}}}$
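A tiny counterexample makes this concrete. Suppose the samples, sorted by descending score, have labels [1, 0, 1, 1, 0] (a made-up ranking); sweeping the threshold down one sample at a time:

```python
import numpy as np

# labels sorted by descending classifier score (hypothetical example)
y_sorted = np.array([1, 0, 1, 1, 0])

tp = np.cumsum(y_sorted)      # [1 1 2 3 3]
fp = np.cumsum(1 - y_sorted)  # [0 1 1 1 2]
precision = tp / (tp + fp)    # ≈ [1.0, 0.5, 0.667, 0.75, 0.6] -> dips, climbs back, dips again
```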

## 4. Baseline

Similarly, if you predict a random assortment of 0’s and 1’s, let’s say 90% 1’s.

• You predict 90% of $P$ as Positive; these are all True Positives, so $TP = 0.9 \times P$
• You predict 90% of $N$ as Positive; these are all False Positives, so $FP = 0.9 \times N$
• You predict 10% of $P$ as Negative; these are all False Negatives, so $FN = 0.1 \times P$
• $\therefore \text{Recall} = \frac{TP}{P} = 0.9, \text{Precision} = \frac{TP}{TP+FP} = \frac{0.9P}{0.9P + 0.9N} = \frac{P}{P+N} = \text{class balance}$
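A quick simulation confirms both numbers (the 30% class balance and the fixed seed are arbitrary choices here):

```python
import numpy as np

rng = np.random.default_rng(42)

P, N = 300, 700                 # class balance = P / (P + N) = 0.3
y = np.array([1] * P + [0] * N)
pred = rng.random(P + N) < 0.9  # predict positive 90% of the time, at random

tp = np.sum(pred & (y == 1))
fp = np.sum(pred & (y == 0))

recall = tp / P                 # ~0.9
precision = tp / (tp + fp)      # ~P / (P + N) = 0.3
```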

## 5. PRC is sensitive to class balance

• Balanced dataset: $P=1000, N=1000$; classifier $A$ predicts at threshold $t$: $TP=500, FP=160$
• Imbalanced dataset: $P=1000, N=10000$; classifier $B$ predicts at the same threshold $t$: $TP=500, FP=1600$

If a similar relationship holds at every threshold, i.e. $TP_A = TP_B, FP_A = \frac{1}{10} FP_B$, then the ROC curves of these two classifiers are exactly identical.

• $\text{Recall}_A = \text{Recall}_B = \frac{500}{1000} = 0.5$
• $\text{Precision}_A = \frac{500}{500+160} = 0.7576, \, \text{baseline}_A = \frac{1000}{1000+1000} = 0.5$
• $\text{Precision}_B = \frac{500}{500+1600} = 0.2381, \, \text{baseline}_B = \frac{1000}{1000+10000} = 0.091$
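The arithmetic, spelled out:

```python
TP, P = 500, 1000
FP_A = 160    # balanced dataset:   N = 1000
FP_B = 1600   # imbalanced dataset: N = 10000

recall = TP / P            # 0.5, identical for A and B
prec_A = TP / (TP + FP_A)  # 500/660  ≈ 0.7576
prec_B = TP / (TP + FP_B)  # 500/2100 ≈ 0.2381
```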

## 6. How scikit-learn Computes AUPRC

### 6.1 No interpolation

• Linear interpolation: compute the area of trapezoids, $\text{A} = \frac{P_{n} + P_{n-1}}{2} \times (R_{n} - R_{n-1})$
• Non-linear interpolation: Introduction to the precision-recall plot argues that linear interpolation is too optimistic and overestimates the area, so this method yields something like the area under the red curve. See that article for the details; we won't expand on it here
• No interpolation: compute the area of the rectangles under the curve (the area under the green dashed line) (since Precision trends downward, this assumes $P_{n-1} > P_{n}$)

average_precision_score() uses no interpolation:

$\text{AP} = \sum_n (R_n - R_{n-1}) * P_n$
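This formula is easy to check against average_precision_score() directly (toy data from the scikit-learn docs). Note the returned recall is decreasing, so the recall differences pick up a minus sign:

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, _ = precision_recall_curve(y_true, scores)

# AP = sum_n (R_n - R_{n-1}) * P_n, computed on the (decreasing) returned arrays
ap = -np.sum(np.diff(recall) * precision[:-1])

assert np.isclose(ap, average_precision_score(y_true, scores))  # both ≈ 0.8333
```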

### 6.2 The average parameter: micro / macro / None / samples / weighted

Internally, average_precision_score() first determines what kind of y it was given with sklearn.utils.multiclass.type_of_target(), whose possible return values include:

• continuous: y is an array-like of floats that are not all integers, and is 1d or a column vector.
• continuous-multioutput: y is a 2d array of floats that are not all integers, and both dimensions are of size > 1.
• binary: y contains <= 2 discrete values and is 1d or a column vector.
• multiclass: y contains more than two discrete values, is not a sequence of sequences, and is 1d or a column vector.
• multiclass-multioutput: y is a 2d array that contains more than two discrete values, is not a sequence of sequences, and both dimensions are of size > 1.
• multilabel-indicator: y is a label indicator matrix, an array of two dimensions with at least two columns, and at most 2 unique values.
• unknown: y is array-like but none of the above, such as a 3d array, sequence of sequences, or an array of non-sequence objects.
```python
>>> import numpy as np
>>> from sklearn.utils.multiclass import type_of_target
>>> type_of_target([0.1, 0.6])
'continuous'
>>> type_of_target([1, -1, -1, 1])
'binary'
>>> type_of_target(['a', 'b', 'a'])
'binary'
>>> type_of_target([1.0, 2.0])
'binary'
>>> type_of_target([1, 0, 2])
'multiclass'
>>> type_of_target([1.0, 0.0, 3.0])
'multiclass'
>>> type_of_target(['a', 'b', 'c'])
'multiclass'
>>> type_of_target(np.array([[1, 2], [3, 1]]))
'multiclass-multioutput'
>>> type_of_target([[1, 2]])
'multiclass-multioutput'
>>> type_of_target(np.array([[1.5, 2.0], [3.0, 1.6]]))
'continuous-multioutput'
>>> type_of_target(np.array([[0, 1], [1, 1]]))
'multilabel-indicator'
```


average_precision_score() accepts 1-d or 2-d input:

```python
"""
y_true : array, shape = [n_samples] or [n_samples, n_classes]
y_score : array, shape = [n_samples] or [n_samples, n_classes]
"""
```

while precision_recall_curve() only accepts 1-d input:

```python
"""
y_true : array, shape = [n_samples]
"""
```


```python
y_true = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 1]])
y_score = np.array([[0.5, 0.5], [0.6, 0.4], [0.7, 0.3], [0.8, 0.2], [0.9, 0.1]])
sample_weight = np.array([1, 1, 2, 2, 2])
```


#### 6.2.1 average = "micro"

```python
not_average_axis = 1
score_weight = sample_weight
average_weight = None

if average == "micro":
    if score_weight is not None:
        score_weight = np.repeat(score_weight, y_true.shape[1])  # y_true.shape[1] == 2
    y_true = y_true.ravel()
    y_score = y_score.ravel()

"""
y_true == np.array([1, 0, 1, 0, 0, 1, 0, 1, 0, 1])
y_score == np.array([0.5, 0.5, 0.6, 0.4, 0.7, 0.3, 0.8, 0.2, 0.9, 0.1])
score_weight == np.array([1, 1, 1, 1, 2, 2, 2, 2, 2, 2])
"""
```

```python
if y_true.ndim == 1:
    y_true = y_true.reshape((-1, 1))

if y_score.ndim == 1:
    y_score = y_score.reshape((-1, 1))

"""
y_true == np.array([[1], [0], [1], [0], [0], [1], [0], [1], [0], [1]])
y_score == np.array([[0.5], [0.5], [0.6], [0.4], [0.7], [0.3], [0.8], [0.2], [0.9], [0.1]])
"""
```

```python
n_classes = y_score.shape[not_average_axis]  # 1
score = np.zeros((n_classes,))  # array([0.])
for c in range(n_classes):
    y_true_c = y_true.take([c], axis=not_average_axis).ravel()   # np.array([1, 0, 1, 0, 0, 1, 0, 1, 0, 1])
    y_score_c = y_score.take([c], axis=not_average_axis).ravel() # np.array([0.5, 0.5, 0.6, 0.4, 0.7, 0.3, 0.8, 0.2, 0.9, 0.1])
    score[c] = binary_metric(y_true_c, y_score_c, sample_weight=score_weight)

"""
score == np.array([0.36111])
"""

# Average the results
if average is not None:
    return np.average(score, weights=average_weight)  # return 0.36111
else:
    return score
```

• The effect of average = "micro" is to expand the original 5 labels and 5 scores into 10 labels and 10 scores
• In the final step, since score holds a single value, np.average changes nothing
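This flattening can be checked against the public API itself: the "micro" result equals the plain binary AP on the raveled arrays with each weight repeated once per column:

```python
import numpy as np
from sklearn.metrics import average_precision_score

y_true = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 1]])
y_score = np.array([[0.5, 0.5], [0.6, 0.4], [0.7, 0.3], [0.8, 0.2], [0.9, 0.1]])
w = np.array([1, 1, 2, 2, 2])

micro = average_precision_score(y_true, y_score, average="micro", sample_weight=w)

# flatten manually: 10 labels, 10 scores, each weight repeated once per column
flat = average_precision_score(y_true.ravel(), y_score.ravel(),
                               sample_weight=np.repeat(w, 2))

assert np.isclose(micro, flat)
```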

The role of the sample_weight parameter inside precision_recall_curve() is to change how $TP$ is computed. The code is in _binary_clf_curve:

```python
# make y_true a boolean vector
y_true = (y_true == pos_label)

# ......

tps = stable_cumsum(y_true * weight)[threshold_idxs]
```


• y_true == [1, 1, 1, 0]; weight = None $\Rightarrow TP = 1 + 1 + 1 = 3$
• y_true == [1, 1, 1, 0]; weight = [1, 1, 2, 2] $\Rightarrow TP = 1\times1 + 1\times1 + 1\times2 = 4$
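The two bullet points above, as code:

```python
import numpy as np

y_true = np.array([1, 1, 1, 0])  # already sorted by descending score
weight = np.array([1, 1, 2, 2])

tp_unweighted = np.cumsum(y_true)[-1]         # 1 + 1 + 1 = 3
tp_weighted = np.cumsum(y_true * weight)[-1]  # 1*1 + 1*1 + 1*2 = 4
```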

#### 6.2.2 average = "macro" / average = None

```python
not_average_axis = 1
score_weight = sample_weight
average_weight = None

if y_true.ndim == 1:  # ndim == 2, so False
    y_true = y_true.reshape((-1, 1))

if y_score.ndim == 1:  # ndim == 2, so False
    y_score = y_score.reshape((-1, 1))

n_classes = y_score.shape[not_average_axis]  # 2
score = np.zeros((n_classes,))  # array([0., 0.])
for c in range(n_classes):
    y_true_c = y_true.take([c], axis=not_average_axis).ravel()
    y_score_c = y_score.take([c], axis=not_average_axis).ravel()
    score[c] = binary_metric(y_true_c, y_score_c, sample_weight=score_weight)

"""
score == np.array([0.19642857, 0.63888889])
"""

# Average the results
if average is not None:
    return np.average(score, weights=average_weight)  # return 0.41766
else:
    return score
```

• average = "macro" computes one AUPRC for y_true[:, 0] and one for y_true[:, 1], then averages them with average_weight
• But when average = "macro", average_weight is None, so this is equivalent to return sum(score) / len(score)
• average = None simply returns score == np.array([0.19642857, 0.63888889]) without averaging; note that it is an array, not a scalar
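Again this is verifiable through the public API: "macro" is the plain arithmetic mean of the per-column scores returned by average=None:

```python
import numpy as np
from sklearn.metrics import average_precision_score

y_true = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 1]])
y_score = np.array([[0.5, 0.5], [0.6, 0.4], [0.7, 0.3], [0.8, 0.2], [0.9, 0.1]])
w = np.array([1, 1, 2, 2, 2])

macro = average_precision_score(y_true, y_score, average="macro", sample_weight=w)
per_class = average_precision_score(y_true, y_score, average=None, sample_weight=w)

assert np.isclose(macro, per_class.mean())  # unweighted mean of the 2 per-column AUPRC
```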

#### 6.2.3 average = "weighted"

```python
not_average_axis = 1
score_weight = sample_weight
average_weight = None

elif average == 'weighted':
    if score_weight is not None:
        average_weight = np.sum(np.multiply(y_true, np.reshape(score_weight, (-1, 1))), axis=0)
    else:
        average_weight = np.sum(y_true, axis=0)  # np.array([2, 3])
    if average_weight.sum() == 0:  # sum(y_true) == 0, i.e. no positives at all, so precision is identically 0
        return 0

"""
average_weight == np.array([2, 6])

The total weight of the Positives is 2 and the total weight of the Negatives is 6
"""

if y_true.ndim == 1:  # ndim == 2, so False
    y_true = y_true.reshape((-1, 1))

if y_score.ndim == 1:  # ndim == 2, so False
    y_score = y_score.reshape((-1, 1))

n_classes = y_score.shape[not_average_axis]  # 2
score = np.zeros((n_classes,))  # array([0., 0.])
for c in range(n_classes):
    y_true_c = y_true.take([c], axis=not_average_axis).ravel()
    y_score_c = y_score.take([c], axis=not_average_axis).ravel()
    score[c] = binary_metric(y_true_c, y_score_c, sample_weight=score_weight)

"""
score == np.array([0.19642857, 0.63888889])
"""

# Average the results
if average is not None:
    return np.average(score, weights=average_weight)  # return 0.528274
else:
    return score
```

• average = "weighted" follows the same basic logic as average = "macro": compute one AUPRC each for y_true[:, 0] and y_true[:, 1], then average them with average_weight
• But when average = "weighted", average_weight always has a value; it can never be None
  • When score_weight is None, average_weight == [P, N]
  • When score_weight is not None, average_weight == [sum(P's weights), sum(N's weights)]
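The same kind of check works here: "weighted" is the weighted mean of the per-column scores, with the weighted supports as weights:

```python
import numpy as np
from sklearn.metrics import average_precision_score

y_true = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 1]])
y_score = np.array([[0.5, 0.5], [0.6, 0.4], [0.7, 0.3], [0.8, 0.2], [0.9, 0.1]])
w = np.array([1, 1, 2, 2, 2])

weighted = average_precision_score(y_true, y_score, average="weighted", sample_weight=w)
per_class = average_precision_score(y_true, y_score, average=None, sample_weight=w)

support = (y_true * w[:, None]).sum(axis=0)  # weighted supports per column: [2, 6]
assert np.isclose(weighted, np.average(per_class, weights=support))
```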

#### 6.2.4 average = "samples"

```python
not_average_axis = 1
score_weight = sample_weight
average_weight = None

elif average == 'samples':
    # swap average_weight <-> score_weight
    average_weight = score_weight  # np.array([1, 1, 2, 2, 2])
    score_weight = None
    not_average_axis = 0

if y_true.ndim == 1:  # ndim == 2, so False
    y_true = y_true.reshape((-1, 1))

if y_score.ndim == 1:  # ndim == 2, so False
    y_score = y_score.reshape((-1, 1))

n_classes = y_score.shape[not_average_axis]  # 5
score = np.zeros((n_classes,))  # np.array([0., 0., 0., 0., 0.])
for c in range(n_classes):
    y_true_c = y_true.take([c], axis=not_average_axis).ravel()   # y_true_c == y_true[c, :]
    y_score_c = y_score.take([c], axis=not_average_axis).ravel() # y_score_c == y_score[c, :]
    score[c] = binary_metric(y_true_c, y_score_c, sample_weight=score_weight)  # sample_weight is None

"""
score == np.array([0.5, 1, 0.5, 0.5, 0.5])
"""

# Average the results
if average is not None:
    return np.average(score, weights=average_weight)  # return 0.5625
else:
    return score
```

• The effect of average = "samples" is to split the original 5 labels and 5 scores into 5 groups, each with 2 labels and 2 scores
• It then computes 5 AUPRC values with sample_weight = None
• Finally it averages them with weights = sample_weight (the original value)
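In code, "samples" matches one AUPRC per row (computed without weights), averaged with the original sample_weight:

```python
import numpy as np
from sklearn.metrics import average_precision_score

y_true = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 1]])
y_score = np.array([[0.5, 0.5], [0.6, 0.4], [0.7, 0.3], [0.8, 0.2], [0.9, 0.1]])
w = np.array([1, 1, 2, 2, 2])

samples = average_precision_score(y_true, y_score, average="samples", sample_weight=w)

# one AUPRC per row, computed WITHOUT sample_weight ...
per_row = np.array([average_precision_score(y_true[i], y_score[i])
                    for i in range(len(w))])

# ... then averaged WITH the original sample_weight
assert np.isclose(samples, np.average(per_row, weights=w))  # 0.5625
```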

#### 6.2.5 Summary

Given:

```python
y_true = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 1]])
y_score = np.array([[0.5, 0.5], [0.6, 0.4], [0.7, 0.3], [0.8, 0.2], [0.9, 0.1]])
sample_weight = np.array([1, 1, 2, 2, 2])
```

• average = "micro"
  • expand into 10 labels, 10 scores and 10 weights
  • calculate 1 AUPRC with sample_weight
  • return this AUPRC
• average = "macro"
  • split into 2 groups:
    • 5 labels, 5 scores and 5 weights for Positives
    • 5 labels, 5 scores and 5 weights for Negatives
  • calculate 2 AUPRC with sample_weight
  • return the arithmetic mean of these 2 AUPRC
• average = None
  • split into 2 groups:
    • 5 labels, 5 scores and 5 weights for Positives
    • 5 labels, 5 scores and 5 weights for Negatives
  • calculate 2 AUPRC with sample_weight
  • return these 2 AUPRC
• average = "weighted"
  • split into 2 groups:
    • 5 labels, 5 scores and 5 weights for Positives
    • 5 labels, 5 scores and 5 weights for Negatives
  • calculate 2 AUPRC with sample_weight
  • if sample_weight is None, weights are supports, i.e. the numbers of true labels in each group
  • if sample_weight is not None, weights are sample_weighted supports, i.e. the total weights of true labels in each group
  • return the weighted mean of these 2 AUPRC
• average = "samples"
  • treat each prediction as a group:
    • 2 labels, 2 scores and 2 weights
    • totally 5 groups
  • calculate 5 AUPRC without sample_weight
  • return the sample_weighted mean of these 5 AUPRC