# ML Terminology

Yao Yao on August 23, 2016

## Confusion matrix $\subset$ Contingency table

The most common confusion matrix is the 2 $\times$ 2 table of TP / FP / TN / FN counts, but it is not restricted to 2 $\times$ 2: it can just as well be an $n \times n$ table of Actual $A_1,\dots,A_n$ vs. Predicted $P_1,\dots,P_n$.

A contingency table is the general-purpose statistical table for two variables (the header cell is split by a diagonal; columns hold the values of variable 1, rows the values of variable 2), and comes with the usual notions of marginal totals and a grand total.
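The $n \times n$ case can be sketched in a few lines of stdlib Python; the labels and counts below are made-up illustration data:

```python
# Build an n x n confusion matrix (rows = actual, columns = predicted)
# from two label lists, using only the standard library.
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

actual    = ["cat", "cat", "dog", "dog", "bird", "bird"]
predicted = ["cat", "dog", "dog", "dog", "bird", "cat"]

m = confusion_matrix(actual, predicted, ["cat", "dog", "bird"])
# rows: actual cat / dog / bird; columns: predicted cat / dog / bird
# m == [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
```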

## Null Accuracy = 1 - Null Error Rate

Null Error Rate is how often you would be wrong if you always predicted the majority class, i.e. $\frac{N_{total} - N_{major}}{N_{total}} = 1 - \frac{N_{major}}{N_{total}}$.

Therefore, the coined term Null Accuracy $= \frac{N_{major}}{N_{total}}$.
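A minimal sketch of both quantities, assuming the labels arrive as a plain Python list:

```python
# Null accuracy = frequency of the majority class; null error rate = 1 - that.
from collections import Counter

def null_accuracy(labels):
    n_major = Counter(labels).most_common(1)[0][1]  # size of the majority class
    return n_major / len(labels)

y = [1, 1, 1, 1, 0, 0, 0, 1, 1, 0]   # made-up labels: 6 ones vs 4 zeros
acc = null_accuracy(y)               # 0.6
err = 1 - acc                        # 0.4, the null error rate
```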

## Out-of-bag (OOB) Error

OOB error, a.k.a. OOB estimate, is the mean prediction error on each training sample $x_i$, using only the trees that did not have $x_i$ in their bootstrap sample.

OOB error is a valid estimate of the test error for the bagged model. With the number of trees (which is also the number of bootstrap samples), $B$, sufficiently large, OOB error is virtually equivalent to leave-one-out cross-validation error.

The OOB approach for estimating the test error is particularly convenient when performing bagging on large data sets for which cross-validation would be computationally onerous ([ˈɒnərəs], burdensome).
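The bookkeeping behind the definition can be sketched as follows; this is only the OOB mask logic, not a full random forest, and the sample and tree counts are arbitrary:

```python
# For each bootstrap sample, record which training points were left out
# ("out of bag"); a point's OOB prediction would use only the trees whose
# bootstrap sample never contained it.
import random

def oob_masks(n_samples, n_trees, seed=0):
    rng = random.Random(seed)
    masks = []
    for _ in range(n_trees):
        in_bag = set(rng.randrange(n_samples) for _ in range(n_samples))
        masks.append([i not in in_bag for i in range(n_samples)])
    return masks

masks = oob_masks(n_samples=10, n_trees=200)
# Each sample is OOB for roughly (1 - 1/n)^n ≈ e^{-1} ≈ 36.8% of the trees.
oob_fraction = sum(masks[t][0] for t in range(200)) / 200
```

In scikit-learn, passing `oob_score=True` to `RandomForestClassifier` exposes the resulting estimate as the fitted model's `oob_score_` attribute.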

## ROC is insensitive to class distribution changes

• If you duplicate every positive example, and the predictions for the copies stay the same, then TP and FN both double, so the true positive rate $\frac{TP}{TP + FN}$ is unchanged; the false positive rate is untouched because the negatives were never modified.
• If you duplicate every negative example, and the predictions for the copies stay the same, then TN and FP both double, so the false positive rate $\frac{FP}{TN + FP}$ is unchanged; the true positive rate is untouched because the positives were never modified.
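The two bullet points can be checked numerically; the toy labels and predictions below are made up:

```python
# Duplicating every positive example (together with its prediction)
# leaves both TPR and FPR unchanged.
def rates(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), fp / (fp + tn)

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
tpr1, fpr1 = rates(y_true, y_pred)

# Append a copy of each positive example with its original prediction.
dup = [(t, p) for t, p in zip(y_true, y_pred) if t == 1]
y_true2 = y_true + [t for t, _ in dup]
y_pred2 = y_pred + [p for _, p in dup]
tpr2, fpr2 = rates(y_true2, y_pred2)
# (tpr1, fpr1) == (tpr2, fpr2)
```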

## Oversampling vs Undersampling

Either is used to adjust the class distribution of a data set (i.e. the ratio between the different classes/categories represented).

• Oversampling: you duplicate observations of the minority class to obtain a balanced dataset.
  • Risk: over-fitting
• Undersampling: you drop observations of the majority class to obtain a balanced dataset.
  • Risk: wasting data
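A stdlib-only sketch of both strategies on made-up data; in practice, libraries such as imbalanced-learn provide `RandomOverSampler` / `RandomUnderSampler` for this:

```python
# Rebalance (x, label) pairs: "over" duplicates minority-class rows up to
# the majority size, "under" drops majority-class rows down to the minority size.
import random
from collections import Counter

def rebalance(data, labels, mode="over", seed=0):
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(data, labels):
        by_class.setdefault(y, []).append(x)
    sizes = [len(v) for v in by_class.values()]
    target = max(sizes) if mode == "over" else min(sizes)
    out = []
    for y, xs in by_class.items():
        if len(xs) >= target:
            chosen = rng.sample(xs, target)   # undersampling: drop rows
        else:                                 # oversampling: duplicate rows
            chosen = xs + [rng.choice(xs) for _ in range(target - len(xs))]
        out.extend((x, y) for x in chosen)
    return out

X = [0, 1, 2, 3, 4, 5, 6, 7]
y = [0, 0, 0, 0, 0, 0, 1, 1]                         # 6-vs-2 imbalance
Counter(lab for _, lab in rebalance(X, y, "over"))   # both classes at 6
Counter(lab for _, lab in rebalance(X, y, "under"))  # both classes at 2
```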

## Subsampling

Like bootstrap sampling, subsampling is a resampling strategy. It differs from bootstrap sampling in two ways:

1. The resample size is smaller than the sample size.
2. Resampling is done without replacement.
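The contrast fits in two lines; the data and sizes here are arbitrary:

```python
# Bootstrap sample: same size as the data, drawn WITH replacement.
# Subsample: smaller than the data, drawn WITHOUT replacement.
import random

rng = random.Random(0)
data = list(range(100))

bootstrap = [rng.choice(data) for _ in range(len(data))]  # size 100, duplicates likely
subsample = rng.sample(data, 50)                          # size 50, no duplicates
```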

## Cross Validation vs Grid Search

Cross-validation is used for estimating the performance of ONE set of parameters on unseen data.

Grid-search evaluates a model with varying parameters to find the best possible combination of these.

Cross validation itself will not write the "for-loop over candidates, then pick the optimal one" framework for you, but something will: that is exactly what grid search is.

```python
cv_params = {
    'num_boost_round': 100,
    'eta': 0.05,
    'max_depth': 6,
    'subsample': 0.9,
    'colsample_bytree': 0.9
}  # one fixed set of parameters
```

```python
gs_param = {
    'num_boost_round': [100, 250, 500],
    'eta': [0.05, 0.1, 0.3],
    'max_depth': [6, 9, 12],
    'subsample': [0.9, 1.0],
    'colsample_bytree': [0.9, 1.0]
}
```

From `gs_param`, grid search automatically derives $3 \times 3 \times 3 \times 2 \times 2 = 108$ parameter sets and runs cross validation 108 times (you can also configure it not to use cross validation) to pick the optimal set.
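The expansion step itself is just a Cartesian product; re-creating `gs_param` here for self-containment, `itertools.product` derives the 108 candidate parameter sets that grid search would then cross-validate one by one:

```python
# Expand a parameter grid into the list of all concrete parameter sets.
from itertools import product

gs_param = {
    'num_boost_round': [100, 250, 500],
    'eta': [0.05, 0.1, 0.3],
    'max_depth': [6, 9, 12],
    'subsample': [0.9, 1.0],
    'colsample_bytree': [0.9, 1.0]
}

keys = list(gs_param)
combos = [dict(zip(keys, values)) for values in product(*gs_param.values())]
len(combos)  # 108 parameter sets, each a candidate for one cross-validation run
```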

## Bias vs Variance

…the bias represents the difference between the expected value of the estimator and the unknown value of the true generalization error… the variance reflects the variability of the estimator around its expected value due to the sampling of the data $D$ on which it is evaluated.
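The two definitions can be illustrated with a toy estimator, the sample mean of a Uniform(0, 1) dataset, whose true expected value is known to be 0.5; the dataset size and repetition count are arbitrary:

```python
# Estimate the bias and variance of the sample mean by resampling many
# datasets D and evaluating the estimator on each one.
import random
import statistics

rng = random.Random(0)
true_mean = 0.5
estimates = []
for _ in range(2000):
    D = [rng.random() for _ in range(30)]     # one sampled dataset D
    estimates.append(statistics.mean(D))      # the estimator evaluated on D

bias = statistics.mean(estimates) - true_mean  # near 0: the sample mean is unbiased
variance = statistics.pvariance(estimates)     # spread around its expected value
```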

Another nice post on these two concepts is *Understanding the Bias-Variance Tradeoff*.