Terminology Recap: Sampling / Sample / Sample Space / Experiment / Statistical Model / Statistic / Estimator / Empirical Distribution / Resampling / CV / Jackknife / Bootstrap / Bagging / Likelihood / Estimation and Machine Learning

Yao Yao on February 26, 2019

TOC:

1. Sampling from a Probability Distribution

I have just described how to go from a random variable to its distribution function, but we can go the other way. Namely, given a distribution ($F_{X}$), we can sample a random variable from it, by which we mean we choose a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ and a function $X:\Omega \to \mathbb{R}$ satisfying $\mathbb{P}(X \leq a)=F_{X}(a)$ for all $a \in \mathbb{R}$. It is not obvious that such a random variable must exist! But in fact one does provided you have a valid distribution function.

… sampling is the selection of a subset (a statistical sample) of individuals from within a statistical population to estimate characteristics of the whole population. Statisticians attempt for the samples to represent the population in question. Two advantages of sampling are lower cost and faster data collection than measuring the entire population.

• 假设真实问题领域有一个 probability space $(\Omega, \mathcal{F}, \mathbb{P})$
• 统计学上这个 outcome 的集合 $\Omega$ 在这里称为 population
• 我们 somehow 从一个 distribution (搞不好还是从 sample 中估计出来的) 反向推出一个 probability space $(\Omega’, \mathcal{F}’, \mathbb{P}’)$ (通过 sampling)
• 这个 $\Omega’ \subseteq \Omega$ 在这里称为一个 statistical sample
• 搞不好这里还有个 “先有鸡还是先有蛋的问题”：你是先拿出一个 sample 来 estimate 一个 distribution，还是先拿出一个 distribution 反推出一个 sample？

2. Sample

• 一个 random variable $X$ (来自 sampling)
• 一个 $\Omega’ \subseteq \Omega$ (来自 sampling)
• 假设 $\vert \Omega’ \vert = m$，那么这个 $m$ 个 outcomes 又称为 $m$ 个 sample values
• 所以一个 sample 也可以理解为 $m$ 个 sample values
• 你要是理解成这 $m$ 个 sample values 是来自 $m$ 个 random variables，从而算是来自 $m$ 个 samples，我觉得你是把问题复杂化了。Don’t do this.

3. Sample Space / Experiment

In probability theory, an experiment or trial is any procedure that can be infinitely repeated and has a well-defined set of possible outcomes, known as the sample space.

• 然后这个 $\Omega’’$ 就叫 sample space，至于为什么这么叫，我倒是也想问问最初这么命名的人 ()
• 发现了没？sample space 和 sampling 在定义上是没有直接关联的！老子信了你的邪！ (当然你硬要从 sample 的定义上把这两者联系起来也是行得通的，但是，有帮助到你理解吗？)
• sample space $\Omega’’$ 不一定是 population $\Omega$，但有可能是
• sample space 小于 population 的例子
• 比如 population 全国人口的身高
• 然后 experiment 是在北京市抽查 100 人的身高，然后恰巧全国最高的人不在北京，你在北京怎么抽也不可能抽到这个最大的身高值
• sample space 等于 population 的例子：
• experiment 是 “roll a dice once”

4. Statistical Model

4.1 (预备知识) Mathematical Model

Mathematical model 是一个抽象概念：Wikipedia: Mathematical model:

A mathematical model is a description of a system using mathematical concepts and language.

Mathematical models are usually composed of relationships and variables.

• Relationships can be described by operators, such as algebraic operators, functions, differential operators, etc.
• Variables are abstractions of system parameters of interest, that can be quantified.

• Linear vs. nonlinear
• 判断标准是：operator 是否是 linear 的
• Static vs. dynamic
• A dynamic model accounts for time-dependent changes in the state of the system。
• 一般会用到 differential equations (微分方程) 或者 difference equations (差分方程)
• 注：差分方程其实就是 递推关系 (recurrence relation)，类似 $a_{n+1} = f(a_n)$ 这种
• A static (or steady-state) model calculates the system in equilibrium, and thus is time-invariant.
• Explicit vs. implicit
• Explicit 指 “all of the input parameters of the overall model are known, and the output parameters can be calculated by a finite series of computations”
• Implicit 指 “the output parameters are unknown, and the corresponding inputs must be solved for by an iterative procedure”
• Newton’s method 即是这类 iterative procedure 的一种
• Discrete vs. continuous
• Deterministic vs. probabilistic (stochastic)
• A deterministic model is one in which every set of variable states is uniquely determined by parameters in the model and by sets of previous states of these variables; therefore, a deterministic model always performs the same way for a given set of initial conditions.
• In a stochastic model, randomness is present, and variable states are not described by unique values, but rather by probability distributions.
• 下面所说的 probability model 和 statistical model 都属于 stochastic models 的大类
• Deductive vs. inductive
• A deductive model is a logical structure based on a theory.
• An inductive model arises from empirical findings and generalization from them.

4.2 (预备知识) Probability Model

• 注：$\Omega$ 是 sample space

4.3 Statistical Model

Let the observed outcome of a statistical experiment be a sample of $m$ i.i.d. random variables $X_1, \dots, X_m$ in some measurable space $E$ (usually $E \subseteq \mathbb{R}$) and denote by $\mathbb{P}^\ast$ their common distribution. A statistical model associated to that statistical experiment is a pair $(E, \mathcal{P})$, where

• $E$ is the sampel space,
• $\mathcal{P} = \lbrace \mathbb{P}_{\theta} \mid \theta \in \Theta \rbrace$ is a family of probabilty measures (i.e. distributions) on $E$, and
• $\Theta$ is the parameter set.

$\mathcal{P}$ 举个例子：”不同参数的 Gaussian distributions”：$\mathcal{P} = \lbrace \mathbb{P}_{\mathcal{N}(\sigma, \mu^2)} \mid (\sigma, \mu) \in \mathbb{R}^2 \rbrace$.

4.3.1 Specification

A statistical model is said to be well-specified if $\exists \theta^\ast \in \Theta$ such that $\mathbb{P}_{\theta^\ast} = \mathbb{P}^\ast$.

A statistical model is said to be misspecified if $\not \exists \theta^\ast \in \Theta$ such that $\mathbb{P}_{\theta^\ast} = \mathbb{P}^\ast$.

Well-specified means that the class of distribution $\mathcal{C}$ you are assuming for your modeling actually contains the unknown probability distribution $p$ from where the sample is drawn.

Misspecified means, on the other hand, that $\mathcal{C}$ does not contain $p$. You made a modeling assumption, and it is not perfect: for instance, you assume your sample is Gaussian, but (maybe due to noise, or just inherently) it is not actually originating from any Gaussian distribution.

This particular $\theta^\ast$ is called the true parameter, and is unknown: The aim of the statistical experiment is to estimate $\theta^\ast$, or check its properties when they have a special meaning (比如 $\theta^\ast > 2$ 时 $\mathbb{P}_{\theta^\ast} = \mathbb{P}^\ast$ 有什么特点)

4.3.2 Identifiability

We say that statistical model $(E, \mathcal{P})$ is identifiable if the mapping $\theta \mapsto \mathbb{P}_{\theta}$ is injective, i.e.

$\theta = \theta' \Rightarrow \mathbb{P}_{\theta} = \mathbb{P}_{\theta'}$

4.3.3 Dimension

• The model is said to be parametric if it has a finite dimension
• The model is said to be nonparametric if it has a infinite dimension

5. Statistic / Estimator

Given an observed sample $X_1, \dots, X_m$ and a statistical model $(E, \mathcal{P})$, we define:

• A statistic is any measurable function of the sample
• An estimator of $\theta^\ast$, denoted by $\hat \Theta_m$, is any statistic whose expression does not depend on $\theta^\ast$
• 比如 $\hat \Theta_m = f(X_1, \dots, X_m)$
• An estimate of $\theta^\ast$ is denoted by $\hat \theta_m$
• 比如 $\hat \theta_m = f(x_1, \dots, x_m)$
• 这里又涉及到类似 “一个 random variable $X=3$ 是什么意思?” 的问题。我们在根据 $\hat \Theta_m$ 计算 $\hat \theta_m$ 的时候的确就是把 $X_i$ 替换成 $x_i$ 而已，但这并不是 $X_i = x_i$ 的意思，也不是说要用 $X_i(x_i)$ 这个值去算，而是模拟了一个 $X_i(\overset{?}{\cdot}) = x_i$ 的过程，至于这个 $\overset{?}{\cdot}$ 是多少这里我们并不关心

• estimand $\theta^\ast$ = the parameter of interest (to be estimated)
• estimator $\hat \Theta_m$ = a function (of estimating)
• estimate $\hat \theta_m$ = a value (of estimating)

• 因为 statistic 没有一个名词去表示它的具体值，所以 statistic 有时表示函数 (or random variable)，有时表示一个具体值
• estimator 一定是函数 (or random variable)，但有时有的人就是要叫它 estimate，而且就是不愿意用大写字母表示 ()，所以你看到 “an estimate $\hat \theta_m$” 这样的描述，你需要根据上下文来判断它到底说的是 “an estimator $\hat \Theta_m$” 还是 “an estimate $\hat \theta_m$” ()

5.1 Bias

The bias of an estimator $\hat \Theta_m$ is defined as:

$\operatorname{bias}(\hat \Theta_m) = \mathbb{E}[\hat \Theta_m] - \theta^\ast$

An estimator $\hat \Theta_m$ is said to be unbiased if $\operatorname{bias}(\hat \Theta_m) = 0$, which implies that $\mathbb{E}[\hat \Theta_m] = \theta^\ast$.

An estimator $\hat \Theta_m$ is said to be asymptotically unbiased if $\underset{m \to \infty}{\lim} \operatorname{bias}(\hat \Theta_m) = 0$, which implies that $\underset{m \to \infty}{\lim} \mathbb{E}[\hat \Theta_m] = \theta^\ast$.

5.2 Variance / Standard Deviation / Standard Error (to its own Mean)

$\operatorname{Var}(X) = \mathbb{E} \big[(X - \mathbb{E}[X])^2\big]$

Standard deviation 为：

$\sigma(X) = \sqrt{\operatorname{Var}(X)}$

Standard error (to its own mean) 为：

$\operatorname{SEM}(X) = \sqrt{\frac{\operatorname{Var}(X)}{m}} = \frac{\sigma(X)}{\sqrt{m}}$
• $m$ is the size of the sample (number of observations).

• $\operatorname{bias}(\hat \Theta_m)$ 有涉及 $\theta^\ast$，所以它研究的是 “$\Theta_m$ 与 $\theta^\ast$ 之间的关系
• 而 $\operatorname{Var}(\hat \Theta_m)$, $\sigma(\hat \Theta_m)$, $\operatorname{SEM}(\hat \Theta_m)$ 和 $\theta^\ast$ 无关，所以它们表示的是 “$\Theta_m$ 本身的性质

5.3 Error (to its Estimand) / Mean Square Error (to its Estimand) / (Quadratic) Risk

The error of an estimator $\hat \Theta_m$ to its estimand $\theta^\ast$ is defined as:

$e(\hat \Theta_m) = \hat \Theta_m - \theta^\ast$

The mean square error of an estimator $\hat \Theta_m$ to its estimand $\theta^\ast$ is defined as:

$\operatorname{MSE}(\hat \Theta_m) = \mathbb{E} \big[ e(\hat \Theta_m)^2 \big] = \mathbb{E} \big[ (\hat \Theta_m - \theta^\ast)^2 \big]$

The (quadratic) risk of an estimator $\hat \Theta_m$ is defined as:

$\operatorname{risk}(\hat \Theta_m) = \mathbb{E} \big[\vert \hat \Theta_m - \theta^\ast \vert^2 \big]$
• 明显，$\operatorname{risk}(\hat \Theta_m) = \operatorname{MSE}(\hat \Theta_m)$

If $\theta^\ast \in \mathbb{R}$ (一维参数), then:

$\operatorname{risk} = \operatorname{MSE} = \operatorname{bias}^2 + \operatorname{Var}$

5.4 Consistency

Let $X_1, \dots, X_m$ be a sample sharing a common distribution $\mathbb{P}$ from a statistical model $(E, \mathcal{P})$ (which implies $\mathbb{P} \in \mathcal{P}$), and $\hat \Theta_m = f(X_1, \dots, X_m)$ be an estimator of $\theta^\ast$.

Estimator $\hat \Theta_m$ is said to be (weakly) consistent if it converges to $\theta^\ast$ in probability, i.e.:

$\underset{m \to \infty}{p\lim} \hat \Theta_m = \theta^\ast$

which means $\forall \epsilon > 0$:

$\underset{m \to \infty}{\lim} \mathbb{P}(\lvert \hat \Theta_m - \theta^\ast \rvert > \epsilon) = 0$
• 我们也可以把 convergence in probability 写作 $\hat \Theta_m \overset{\mathbb{P}}{\rightarrow} \theta^\ast$

Estimator $\hat \Theta_m$ is said to be mean square consistent if $\underset{m \to \infty}{\lim} \operatorname{MSE}(\hat \Theta_m) = 0$

Estimator $\hat \Theta_m$ is said to be $L_2$ consistent if $\lVert \hat \Theta_m - \theta^\ast \rVert_2 \overset{\mathbb{P}}{\rightarrow} 0$

• 这么一来，weak consistency 也可以看做是 $L_1$ consistency

Estimator $\hat \Theta_m$ is said to be strongly consistent if it converges to $\theta^\ast$ almost surely, i.e. $\forall \epsilon > 0$:

$\mathbb{P} \big ( \underset{m \to \infty}{\lim} \lvert \hat \Theta_m - \theta^\ast \rvert \leq \epsilon \big ) = 1$
• 我们也可以把 almost sure convergence 写作 $\hat \Theta_m \overset{\text{a.s.}}{\rightarrow} \theta^\ast$

• $L_1$ consistency $\Leftrightarrow$ weak consistency
• mean square consistency $\Rightarrow$ weak consistency
• $L_2$ consistency $\Rightarrow$ weak consistency
• strong consistency $\Rightarrow$ weak consistency
• 但目前 mean square consistency、$L_2$ consistency 和 strong consistency 这三者的强弱关系我还不清楚

5.5 Efficiency (of Unbiased Estimators)

$\operatorname{Var}(\hat \Theta_m)$ 有 Cramér–Rao bound (CRB):

$\operatorname{Var}(\hat \Theta_m) \geq \frac{1}{I(\theta^\ast)}$

where the Fisher Information $I(\theta^\ast)$ is defined by:

$I(\theta^\ast) = \mathbb{E} \left[ \left( \frac{\partial \log f(x;\theta^\ast)}{\partial \theta^\ast} \right)^2 \right] = - \mathbb{E} \left[ \frac {\partial^2 \log f(x;\theta^\ast)}{\partial {\theta^\ast}^2} \right]$
• 严格来说，Fisher information 是衡量 model 的一个指标，所以它应该把 $\theta^\ast$ 当做一个变量 $\theta$ 对待，所以 $I(\theta)$ 应该是一个函数，当 $\theta = \theta^\ast$ 时，$I(\theta^\ast)$ 取一个具体值
• 更多内容参考 Chapter 8 - Fisher Information

The efficiency of $\hat \Theta_m$ is defined as:

$e(\hat \Theta_m) = \frac{1}{\operatorname{Var}(\hat \Theta_m) I(\theta^\ast)} \leq 1$

6. Empirical Distribution / Resampling

6.1 EDF $\hat{F}_{\mathbb{P}_{X}}$

Let $X_1, \dots, X_m$ be i.i.d. real random variables with the common distribution function $F_{\mathbb{P}_{X}}$. Then the empirical distribution function (EDF) is defined as:

$\hat{F}_{m}(a) = \hat{F}_{\mathbb{P}_{X}}(a) = \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}_{X_{i} \leq a}$

where $\mathbf{1}_{A}$ is the indicator of event $A$.

• 注意 $F_{\mathbb{P}_{X}}(a) = \mathbb{P}(X \leq a)$

For a fixed $a$, the indicator $\mathbf{1}_{X_{i} \leq a}$ is a Bernoulli random variable with parameter $p = F_{\mathbb{P}_{X}}(a)$; hence $m\hat{F}_{m}(a)$ is a binomial random variable with mean $mF_{\mathbb{P}_{X}}(a)$ and variance $m F_{\mathbb{P}_{X}}(a) \big( 1 − F_{\mathbb{P}_{X}}(a) \big)$. This implies that $\hat{F}_{m}(a)$ is an unbiased estimator for $F_{\mathbb{P}_{X}}(a)$.

6.2 What does an Empirical Distribution represent?

The empirical distribution is the distribution you’d get if you sampled from your sample instead of the whole population.

whuber’s answer, StackExchange: What does an empirical distribution represent? 提到了我们为什么要用 empirical distribution：

The intuition is that if your observations are representative of the original population (that is, of the set of tickets in the original box), then you can study the EDF to learn how to make inferences about the contents of a box based on a sample of it.

6.3 Resampling / CV

• 从头到尾只用一个 training dataset 会不会有失偏颇？
• 万一这一个 training dataset 的 empirical distribution 和 true distribution 差得很远怎么办？

• resampling：获取不同的 samples
• CV：获取不同的 training datasets

6.3.1 Jackknife Resampling / Leave-one-out CV

The Jackknife samples are computed by leaving out one observation $x_i$ from $\mathbf{x} = (x_1, x_2, \dots, x_n)$ at a time:

$\mathbf{x}_{(i)} = (x_1, x_2, \dots, x_{i−1}, x_{i+1}, \dots, x_n)$

6.3.2 $k$-fold CV

$k$-folds CV can be seen as a (very rigid) way of resampling the data without replacement.

6.3.3 Bootstrap Resampling (Bootstrapping)

A bootstrap sample $\mathbf{x}^{\ast} = (x_{1}^{\ast}, x_{2}^{\ast}, \cdots, x_{n}^{\ast})$ is obtained by randomly sampling $n$ times, with replacement, from the original data points $\mathbf{x}=(x_1, x_2, \cdots, x_n)$.

Suppose all $x_i$ are different. The probability that a particular $x_i$ is left out the resample $\mathbf{x}^{\ast} = (x_{1}^{\ast}, x_{2}^{\ast}, \cdots, x_{n}^{\ast})$ is:

$\mathbb{P}(x_{j}^{\ast} \neq x_i, 1 \leq j \leq n) = (1 - \frac{1}{n})^n$

since $\mathbb{P}(x_{j}^{\ast} = x_i) = \frac{1}{n}$.

When $n \to \infty$, $(1 - \frac{1}{n})^n$ converges to $e^{-1} \approx 0.37$.

6.3.4 Bootstrap Aggregating (Bagging)

Bootstrap resampling 没有直接和某种 CV 对应，但是作为一种 resampling，它可以直接出现在 ensemble framework 中，比如 bagging。

bagging 其实很简单：

• Bootstrap resample $1^{\text{st}}$ time $\Rightarrow \mathbf{x}_{1}^{\ast} \Rightarrow$ classifier $C_1$
• Bootstrap resample $2^{\text{nd}}$ time $\Rightarrow \mathbf{x}_{2}^{\ast} \Rightarrow$ classifier $C_2$
• $\cdots$
• Bootstrap resample $n^{\text{th}}$ time $\Rightarrow \mathbf{x}_{n}^{\ast} \Rightarrow$ classifier $C_n$
• Aggregate $\Rightarrow$ final classifier $C = f(C_1, C_2, \dots, C_n)$

bagging 有助于减小 variance。

7. The Likelihood Function / Sufficient Statistics

7.1 Sufficient Statistics

Suppose that $X_1, \dots, X_m \sim p(x; \theta^\ast)$. Statistic $T$ is sufficient for $\theta^\ast$ if the conditional distribution of $X_1, \dots, X_m \vert T$ does not depend on $\theta^\ast$. Thus, $p(x_1, \dots, x_m \vert t; \theta) = p(x_1, \dots, x_m \vert t)$.

$T$ is a Minimal Sufficient Statistic (MSS) if:

1. $T$ is sufficient, and
2. $\forall$ other sufficient statistic $U$, $\exists$ some function $g$ such that $T=g(U)$
• 亦即任意的 sufficient statistic $U$ 都能变形 (或者认为是 reduce) 成 MSS $T$

What does sufficiency mean?

If $T$ is sufficient, then $T$ contains all the information you need from the data to compute the likelihood function. It does not contain all the information in the data.

• 它其实指的是 “all the information needed to compute any estimate of the parameter
• 换言之，如果有一个 statistic $T$ containing all the information，那么我们就可以不用 $f(X_1, \dots, X_m)$ 的形式、而是用 $g(T(X_1, \dots, X_m))$ 的形式去 estimate
• 那么这里说 sufficient statistic $T$ 并没有 contains all the information，也就是说 $T(X_1, \dots, X_m)$ 无法取代 $X_1, \dots, X_m$ when computing estimates

Let $C = \lbrace c_1, \dots , c_N \rbrace$ be a finite set of constants. Let $\theta^\ast = \frac{1}{N} \sum_{i=1}^{N} c_i$. Suppose we want to estimate $\theta^\ast$. We proceed as follows. Let $S_1, \dots , S_n \sim \operatorname{Bern}(\pi)$ where $\pi$ is known. If $S_i = 1$ you can see $c_i$; otherwise, you cannot. (This is an example of survey sampling.) The likelihood function is:

$\prod_i \pi^{S_i} (1 - \pi)^{S_i}$

The likelihood function contains no information at all. But we can still estimate $\theta^\ast$ (without this likelihood function):

$\hat \Theta = \frac{1}{N \pi} \sum_{i=1}^{N} c_i S_i$

and $\mathbb{E}[\hat \Theta] = \theta^\ast$.

7.2 The Likelihood Function

Let $X^m = (X_1, \dots, X_m)$ have joint PDF $f(x^m; \theta) = f(x_1, \dots, x_m; \theta)$ where $\theta \in \Theta$. The likelihood function $L: \Theta \to [0, \infty)$ is dedined by:

$L(\theta) \equiv L(\theta; x^m) = f(x^m; \theta)$

where $x^m$ is fixed and $\theta$ varies in $\Theta$.

Let $X^m = (X_1, \dots, X_m)$ have joint PMF $p(x^m; \theta) = p(x_1, \dots, x_m; \theta)$ where $\theta \in \Theta$. The likelihood function $L: \Theta \to [0, 1]$ is dedined by:

$L(\theta) \equiv L(\theta; x^m) = p(x^m; \theta)$

where $x^m$ is fixed and $\theta$ varies in $\Theta$.

• In informal contexts, likelihood is often used as a synonym for probability.
• In statistics, the two terms have different meanings.
• Probability is used to describe the plausibility of some data, given a value for the parameter.
• Likelihood is used to describe the plausibility of a value for the parameter, given some data.
• 非常哲学的 “一体两面” 的思想

7.3 The likelihood Function is a Minimal Sufficient Statistic

1. statistic 定义本身，因为 statistic 既可以指 function 又可以指 value
2. likelihood function 是一个 parametric statistic (varied by $\theta$)
• 如果我们定义 likelihood $L(\theta) \equiv L(\theta; x^m)$，那么它是一个 parametric statistic value
• 如果我们把 likelihood 看做 $L(\theta; X^m)$，那么它是一个 parametric statistic function

likelihood function 是 sufficient 这一点比较直观，那么下一步我们需要证明它是一个 MSS。这需要 Factorization Theorem 的辅助：

$f(x_1, \dots , x_m;\theta) = \phi [ t(x_1, \dots , x_m);\theta ] h(x_1, \dots , x_m)$

where:

• function $\phi$ depends on the data $x_1, \dots, x_m$ ONLY through statistic $t(x_1, \dots , x_m)$, and
• function $h$ does not depend on $\theta$
\begin{align} & \because L(\theta) \equiv f(x_1, \dots , x_m;\theta) \newline & \therefore \forall \text{sufficient statistic } T, \exists \phi, g \text{ such that } L(\theta) = \phi [ t(x_1, \dots , x_m);\theta ] h(x_1, \dots , x_m) \newline & \therefore L(\theta) \text{ is minimal sufficient} \end{align}

• If $T$ is sufficient, then $T$ contains all the information you need from the data to compute the likelihood function.
• 这就很好理解了，因为 sufficient statistic $T$ 一定能 reduce 成 MSS $L(\theta)$
• … It does not contain all the information in the data.
• 换言之，$L(\theta)$ 也没有 contains all the information
• 但有的教材认为 $L(\theta)$ (roughly) contains all the information，这是个不严谨的 (我觉得可以直接认为是不正确的) 说法
• 即便如此，还是存在很多的场合 where parameter 是可以用 $L(\theta)$ 去 estimate 的

8. Connections between Estimation and Machine Learning

8.1 Unsupervised / Supervised Learning

$\hat{\mathbb{P}}_{X} = \hat{\mathbb{P}}_m \overset{?}{\to} \mathbb{P}^\ast$

$\hat{\mathbb{P}}_{Y \mid X} = \hat{\mathbb{P}}_m \overset{?}{\to} \mathbb{P}^\ast$

8.2 Training Error / Test Error / Underfitting / Overfitting

• $X_1^\text{train}, \dots, X_m^\text{train}$ are i.i.d (by distribution $\mathbb{P}_\text{data}$)
• $X_1^\text{test}, \dots, X_{m’}^\text{test}$ are i.i.d (by distribution $\mathbb{P}_\text{data}$)

$\operatorname{MSE}(\hat{\theta}_{X^\text{train}}) = \operatorname{MSE}(\hat{\theta}_{X^\text{test}}) = \operatorname{MSE}(\theta^\dagger)$
• 注意我们这里用的是 $\operatorname{MSE}$ of an estimate 而不是 of an estimator

\begin{aligned} \text{training error} &= \operatorname{MSE} \big( \hat{\Theta}_{X^\text{train}}(x^\text{train}) \big) = \operatorname{MSE}(\hat{\theta}_{X^\text{train}}) \\ \text{test error} &= \operatorname{MSE} \big( \hat{\Theta}_{X^\text{train}}(x^\text{test}) \big) \neq \operatorname{MSE}(\hat{\theta}_{X^\text{test}}) \end{aligned}
• 这个 $\hat{\Theta}_{X^\text{train}}(x^\text{test})$ 的矛盾就是 generalization error 的根源
• 你可能会注意到这里有个 dimension 的问题：若 $\hat{\Theta}_{X^\text{train}} = f(X_1^\text{train}, \dots, X_m^\text{train})$ 的话，你 $X^\text{test} = (X_1^\text{test}, \dots, X_{m’}^\text{test})$ 有 $m’$ 项，如何代入 $f$ 计算？秘诀就是我们可以规定 $f$ 是单个 random vector 的函数，而不是固定数量的 random variables 的函数
• 然后你计算的过程决定了：$\hat{\Theta}_{X^\text{train}}$ 只会 minimize $\operatorname{MSE}$ on $x^\text{train}$，所以一般有 $\text{training error} \leq \text{test error}$ (proof skeleton)
• Underfitting: $\text{training error}$ high, $\text{test error}$ high
• Overfitting: $\text{training error}$ low, $\text{test error}$ high
• Good fit: $\text{training error}$ low, $\text{test error}$ slightly higher
• Unknown fit: $\text{training error}$ high, $\text{test error}$ low
• 出现 overfitting 的一个原因可能是，我们 minimize 的 $\operatorname{MSE}$ 并不是 $\hat{\Theta}_{X^\text{train}}$ 和 $\theta^\ast$ 之间的 error，因为 $\theta^\ast$ 我们并不知道，所以我觉得在实际计算时，我们用的是 empirical distribution 的参数 $\theta^\ast_{X^\text{train}} := \theta^\mathbf{1}_{X^\text{train}}$，所以最终的关系可能是：
$\hat{\Theta}_{X^\text{train}} \hat= \theta^{\mathbf{1}}_{X^\text{train}} \overset{?}{\rightarrow} \theta^\ast$
• 所以一定程度以后，越逼近 $\theta^{\mathbf{1}}_{X^\text{train}}$ 可能离 $\theta^\ast$ 越远

$\operatorname{risk} = \operatorname{MSE} = \operatorname{bias}^2 + \operatorname{Var}$

• conservative (less complex) models 一般 low-variance-high-bias，有 underfitting 倾向
• flexible (more complex) models 一般 high-variance-low-bias，有 overfitting 倾向

8.5 Estimators used in Machine Learning

1. Frequentist Statistics
• MLE, Maximum Likelihood Estimation
• 等价于 minimizing KL divergence $D_{KL}(\hat{\mathbb{P}}_{\text{data}} \Vert \mathbb{P}_{\text{model}})$
• 等价于 minimizing cross-entropy $H(\hat{\mathbb{P}}_{\text{data}}, \mathbb{P}_{\text{model}})$
• When $\mathbb{P}_{\text{model}}$ is Gaussian，等价于 minimizing $\operatorname{MSE}$
• 亦即 $\operatorname{MSE}$ is the cross-entropy between the empirical distribution and a Gaussian model.
2. Bayersian Statistics
• MAP, Maximum A Posteriori (Estimation)

8.5.1 MLE

• True but unknown data generating distribution $\mathbb{P}_\text{data}(\mathbf{x})$ (Estimand, i.e. $\mathbb{P}^\ast$, 对应 $\theta^\ast$)
• 但是我们在 estimation 的过程中根本就不会用到这个 distribution，这里列出来只是为了对应 $\hat{\mathbb{P}}_\text{data}$
• Distribution of the statistic model $\mathbb{P}_\text{model}(\mathbf{x}; \theta)$ (Estimator)
• Training data $\mathbb{X} = \lbrace x^{(1)}, \dots, x^{(m)} \rbrace$ (assumed to be draw independently from $\mathbb{P}_\text{data}$)
• Empirical distribution of the training data $\hat{\mathbb{P}}_\text{data}(\mathbf{x})$ (Approximation of the Estimand)
\begin{aligned} \theta_{\text{MLE}} &= \underset{\theta}{\arg \max } \, \mathbb{P}_{\text{model}}(\mathbb{X}; \theta) \newline &= \underset{\theta}{\arg \max } \prod_{i=1}^{m} \mathbb{P}_{\text{model}} (x^{(i)}; \theta) \,\,\,\, \text{(i.i.d.)} \iff \underset{\theta}{\arg \max } \prod_{i=1}^{m} L_{\text{model}} (\theta; x^{(i)}) \newline &= \underset{\theta}{\arg \max } \sum_{i=1}^{m} \log \mathbb{P}_{\text{model}} (x^{(i)} ; \theta) \,\,\,\, \text{(顺便解决了 underflow 的问题)} \newline &= \underset{\theta}{\arg \max } \, \frac{1}{m} \sum_{i=1}^{m} \log \mathbb{P}_{\text{model}} (x^{(i)} ; \theta) \newline &= \underset{\theta}{\arg \max } \, \mathbb{E}_{\mathbf x \sim \hat{\mathbb{P}}_\text{data}} \left[ \log \mathbb{P}_{\text{model}} (\mathbf x; \theta) \right] \newline &= \underset{\theta}{\arg \min } \, - \mathbb{E}_{\mathbf x \sim \hat{\mathbb{P}}_\text{data}} \left[ \log \mathbb{P}_{\text{model}} (\mathbf x; \theta) \right] \newline &= \underset{\theta}{\arg \min } \, H(\hat{\mathbb{P}}_\text{data}, \mathbb{P}_{\text{model}}(\theta)) \,\,\,\, \text{(cross-entropy)} \newline &= \underset{\theta}{\arg \min } \, -H(\hat{\mathbb{P}}_\text{data}) + H(\hat{\mathbb{P}}_\text{data}, \mathbb{P}_{\text{model}}(\theta)) \,\,\,\, \text{(} \hat{\mathbb{P}}_\text{data} \text{ 的 entropy 与 } \theta \text{ 无关)} \newline &= \underset{\theta}{\arg \min } \, D_{KL}(\hat{\mathbb{P}}_{\text{data}} \Vert \mathbb{P}_{\text{model}}(\theta)) \end{aligned}

8.5.2 MAP

Bayesians represent their assumption of $\theta$ using the prior probability distribution, $\mathbb{P}(\theta)$ (sometimes referred to as simply “the prior”), thus:

• Prior $\Rightarrow$ $\mathbb{P}(\theta)$
• Likelihood $\Rightarrow$ $\mathbb{P}(\mathbf{D} \vert \theta)$
• Posterior $\Rightarrow$ $\mathbb{P}(\theta \vert \mathbf{D})$
\begin{aligned} \because & \mathbb{P}(\theta \vert \mathbf{D}) = \frac{\mathbb{P}(\mathbf{D} \vert \theta) \, \mathbb{P}(\theta)}{\mathbb{P}(\mathbf{D})} \newline \therefore &\mathbb{P}(\theta \vert \mathbf{D}) \propto \mathbb{P}(\mathbf{D} \vert \theta) \, \mathbb{P}(\theta) \newline \therefore & \underset{\theta}{\arg \max } \, \mathbb{P}(\theta \vert \mathbf{D}) = \underset{\theta}{\arg \max } \, \mathbb{P}(\mathbf{D} \vert \theta) \, \mathbb{P}(\theta) \end{aligned}

\begin{aligned} \theta_{\text{MAP}} &= \underset{\theta}{\arg \max } \, \mathbb{P}(\theta \vert \mathbf{D}) \newline &= \underset{\theta}{\arg \max } \, \mathbb{P}(\mathbf{D} \vert \theta) \, \mathbb{P}(\theta) \newline &= \underset{\theta}{\arg \max } \, \log \mathbb{P}(\mathbf{D} \vert \theta) + \log \mathbb{P}(\theta) \end{aligned}
• MAP with a Gaussian prior on the weights thus corresponds to weight decay.
• 反过来，many regularized estimation strategies, such as maximum likelihood learning regularized with weight decay, can be interpreted as making the MAP approximation to Bayesian inference.
• This view applies when the regularization consists of adding an extra term to the objective function that corresponds to $\log \mathbb{P}(\theta)$.
• 相比 MLE 有 increase bias, decrease variance 的功效
• 你联系 weight decay 就好理解了

8.5.3 MLE vs MAP

\begin{aligned} \theta_{\text{MLE}} &= \underset{\theta}{\arg \max } \, \mathbb{P}(\mathbf{D}; \theta) \newline \theta_{\text{MAP}} &= \underset{\theta}{\arg \max } \, \mathbb{P}(\mathbf{D} \vert \theta) \, \mathbb{P}(\theta) \end{aligned}

• 再强调一下，这两个 distribution 在 probabilistic 上的意义明显不同，但不妨碍它们可以有同样的表达式