# Terminology Recap: Generative Models / Discriminative Models / Frequentist Machine Learning / Bayesian Machine Learning / Supervised Learning / Unsupervised Learning / Linear Regression / Naive Bayes Classifier

Yao Yao on April 11, 2019

## 1. Generative vs Discriminative

• Generative Model $\Rightarrow$ tries to learn $\mathbb{P}_{\mathbf{Y}, \mathbf{X}}(y, \mathbf{x})$
• Either factorization works:
• $\mathbb{P}_{\mathbf{Y}, \mathbf{X}}(y, \mathbf{x}) = \mathbb{P}_{\mathbf{Y} \vert \mathbf{X}}(y \vert \mathbf{x}) \, \mathbb{P}_{\mathbf{X}}(\mathbf{x})$
• $\mathbb{P}_{\mathbf{Y}, \mathbf{X}}(y, \mathbf{x}) = \mathbb{P}_{\mathbf{X} \vert \mathbf{Y}}(\mathbf{x} \vert y) \, \mathbb{P}_{\mathbf{Y}}(y)$
• Explicitly models the distribution of both the features and the corresponding labels (classes)
• Aims to explain the generation of all data
• Example techniques:
• Naive Bayes Classifier
• Hidden Markov Models (HMM)
• Gaussian Mixture Models (GMM)
• Multinomial Mixture Models
• Discriminative Model $\Rightarrow$ tries to learn $\mathbb{P}_{\mathbf{Y} \vert \mathbf{X}}(y \vert \mathbf{x})$
• Aims only to predict the target given the observed features
• Example techniques:
• $K$ nearest neighbors
• logistic regression
• linear regression
• Yes, linear regression is in fact a discriminative model
• In my view, this is where the name "discriminative" becomes misleading
• Conditional Random Fields (CRFs)
• Logistic Regression is the simplest CRF
• SVMs
• perceptrons
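
The two routes can be contrasted on a toy sketch (made-up binary data, illustrative function names): the discriminative route estimates $\mathbb{P}_{\mathbf{Y} \vert \mathbf{X}}(y \vert \mathbf{x})$ directly from conditional counts, while the generative route estimates $\mathbb{P}_{\mathbf{Y}}(y)$ and $\mathbb{P}_{\mathbf{X} \vert \mathbf{Y}}(\mathbf{x} \vert y)$ and recovers the conditional via Bayes' rule:

```python
# Toy dataset of (x, y) pairs with a single binary feature x and binary label y.
data = [(0, 0), (0, 0), (0, 1), (1, 0), (1, 1), (1, 1), (1, 1), (0, 0)]

# Discriminative route: estimate P(y | x) directly from conditional counts.
def p_y_given_x_discriminative(y, x):
    labels = [yy for xx, yy in data if xx == x]
    return labels.count(y) / len(labels)

# Generative route: estimate P(y) and P(x | y) separately, then invert with
# Bayes' rule: P(y | x) = P(x | y) P(y) / sum over y' of P(x | y') P(y').
def p_y_given_x_generative(y, x):
    def joint(yy):
        ys = [lab for _, lab in data]
        p_y = ys.count(yy) / len(ys)
        xs = [xx for xx, lab in data if lab == yy]
        p_x_given_y = xs.count(x) / len(xs)
        return p_x_given_y * p_y
    return joint(y) / (joint(0) + joint(1))

# On fully observed discrete data, the two routes recover the same conditional.
for x in (0, 1):
    for y in (0, 1):
        assert abs(p_y_given_x_discriminative(y, x) - p_y_given_x_generative(y, x)) < 1e-12
```

They agree exactly here because counting is sufficient for discrete data; the practical difference appears once each route has to commit to a parametric model.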

## 2. Frequentist vs Bayesian

Section 5.6, Bayesian Statistics, of Deep Learning says:

> As discussed in section 5.4.1, the frequentist perspective is that the true parameter value $\theta^\ast$ is fixed but unknown, while the point estimate $\hat{\Theta}_m$ is a random variable on account of it being a function of the dataset (which is seen as random).
>
> The Bayesian perspective on statistics is quite different. The Bayesian uses probability to reflect degrees of certainty of states of knowledge. The dataset is directly observed and so is not random. On the other hand, the true parameter $\theta^\ast$ is unknown or uncertain and thus is represented as a random variable.

• Frequentist $\Rightarrow$ the true, unknown parameter $\theta^\ast$ is a value
• So Frequentist machine learning $\Rightarrow \mathbb{P}_{\mathbf{D}}(\mathbf{d}; \theta)$
• $\theta$ is not modeled probabilistically
• Bayesian $\Rightarrow$ the true, unknown parameter $\theta^\ast$ is a random variable
• So Bayesian machine learning $\Rightarrow \mathbb{P}_{\mathbf{D}, \Theta}(\mathbf{d}, \theta)$
• $\theta$ is modeled probabilistically
• This $\theta$ can also serve as a latent variable

• Frequentist $\Rightarrow$
• Write down the likelihood $\mathbb{P}_{\mathbf{D}}(\mathbf{D}; \theta) = \prod_{i=1}^{m}\mathbb{P}_{D_i}(\mathbf{d}_i; \theta)$
• Compute a point estimate $\hat{\theta}$ of $\theta^\ast$, so that $\mathbb{P}_{\mathbf{D}}(\mathbf{d}; \hat{\theta})$ approximates the true $\mathbb{P}_{\mathbf{D}}(\mathbf{d}; \theta^\ast)$
• Predict on test data, e.g. $\hat{\mathbf{d}}_{m+1} = \mathbb{E}[D_{m+1}; \hat{\theta}]$, the expectation under the fitted model $\mathbb{P}_{D_{m+1}}(\mathbf{d}_{m+1}; \hat{\theta})$
• Bayesian $\Rightarrow$
• Apply Bayes' rule: $\mathbb{P}_{\Theta \vert \mathbf{D}}(\theta \vert \mathbf{D}) = \frac{\mathbb{P}_{\mathbf{D} \vert \Theta}(\mathbf{D} \vert \theta) \, \mathbb{P}_{\Theta}(\theta)}{\mathbb{P}_{\mathbf{D}}(\mathbf{D})}$
• Or use the proportional form $\mathbb{P}_{\Theta \vert \mathbf{D}}(\theta \vert \mathbf{D}) \propto \mathbb{P}_{\mathbf{D} \vert \Theta}(\mathbf{D} \vert \theta) \, \mathbb{P}_{\Theta}(\theta)$ for MAP estimation
• Note: MAP may look rather Frequentist, since it also yields a point estimate, but the defining feature of the Bayesian approach is the Bayes inversion itself
• Predict on test data: $\mathbb{P}(D_{m+1} \vert \mathbf{D}) = \int \mathbb{P}(D_{m+1} \vert \theta) \, \mathbb{P}(\theta \vert \mathbf{D}) \, d\theta$
• This yields a full distribution for $D_{m+1} \vert \mathbf{D}$, not just a point prediction

• This $\mathbf{D}$ can stand for either $\mathbf{D} = (\mathbf{Y}, \mathbf{X})$ or $\mathbf{D} = \mathbf{Y} \vert \mathbf{X}$, depending entirely on your needs
• In other words: whether you do Frequentist or Bayesian machine learning, the form of $\mathbf{D}$ determines whether you end up with a Generative or a Discriminative model
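
The two workflows can be sketched on coin-flip data (all numbers and the uniform Beta prior are illustrative assumptions; the conjugate Beta-Bernoulli pair keeps everything in closed form):

```python
# Coin-flip data: m observations, k heads.
m, k = 10, 7

# Frequentist: theta is a fixed unknown; the MLE is a point estimate,
# and prediction plugs that single value back into the model.
theta_mle = k / m
pred_freq = theta_mle            # P(next flip = heads; theta_hat)

# Bayesian: theta is a random variable with a Beta(alpha, beta) prior.
alpha, beta = 1.0, 1.0           # uniform prior (an assumption)
# Posterior: Beta(alpha + k, beta + m - k), by conjugacy.
post_a, post_b = alpha + k, beta + m - k
# Posterior predictive P(d_{m+1} = heads | D) integrates theta out:
# it equals the posterior mean of theta for this conjugate pair.
pred_bayes = post_a / (post_a + post_b)

print(pred_freq, pred_bayes)     # 0.7 vs 8/12: the prior pulls toward 0.5
```

Note how the Bayesian prediction averages over the whole posterior instead of committing to one $\hat{\theta}$, which is exactly the integral in the prediction bullet above.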

## 3. Generative vs Discriminative, Frequentist vs Bayesian

| | Frequentist | Bayesian |
| --- | --- | --- |
| Discriminative | $\mathbb{P}_{\mathbf{Y} \vert \mathbf{X}}(y \vert \mathbf{x}; \theta)$ | $\mathbb{P}_{(\mathbf{Y} \vert \mathbf{X}), \Theta}\big( (y \vert \mathbf{x}), \theta \big)$ |
| Generative | $\mathbb{P}_{\mathbf{Y}, \mathbf{X}}(y, \mathbf{x}; \theta)$ | $\mathbb{P}_{\mathbf{Y}, \mathbf{X}, \Theta}(y, \mathbf{x}, \theta)$ |
• Items to the right of the semicolon (;) are not modeled probabilistically
• Notation note:
• $\mathbb{P}_{\mathbf{Y} \vert \mathbf{X}, \Theta}$ denotes "the distribution of $\mathbf{Y}$, conditioned on $\mathbf{X}$ and $\Theta$"
• $\mathbb{P}_{(\mathbf{Y} \vert \mathbf{X}), \Theta}$ denotes "the joint distribution of $\mathbf{Y} \vert \mathbf{X}$ and $\Theta$"

## 4. Unsupervised vs Supervised

• clustering can be seen as learning $\mathbb{P}_{\mathbf{C} \vert \mathbf{X}}(c \vert \mathbf{x})$
• PCA can be seen as learning $\mathbb{P}_{\mathbf{X}' \vert \mathbf{X}}(\mathbf{x}' \vert \mathbf{x})$
• embedding can be seen as learning $\mathbb{P}_{\mathbf{E} \vert \mathbf{X}}(\mathbf{e} \vert \mathbf{x})$
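
As a toy sketch of the "clustering learns $\mathbb{P}_{\mathbf{C} \vert \mathbf{X}}(c \vert \mathbf{x})$" view (synthetic data, illustrative seeds): k-means produces a hard assignment, i.e. a degenerate conditional that puts all mass on the nearest centroid.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated 2-D blobs, 50 points each.
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

# Minimal k-means with k = 2; one initial centroid per blob for a stable demo.
centroids = X[[0, 50]].copy()
for _ in range(20):
    # Assignment step: c = argmin over clusters of ||x - centroid||,
    # i.e. a hard P(c | x) concentrated on the nearest centroid.
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
    c = d.argmin(axis=1)
    # Update step: move each centroid to the mean of its assigned points.
    centroids = np.array([X[c == j].mean(axis=0) for j in range(2)])

print(np.sort(centroids[:, 0]))  # one centroid near x = 0, one near x = 5
```

A soft-assignment variant (e.g. a Gaussian Mixture Model fit by EM) would make the conditional $\mathbb{P}_{\mathbf{C} \vert \mathbf{X}}$ genuinely probabilistic rather than degenerate.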

## 5. Frequentist Discriminative Example: Linear Regression

• MLE is equivalent to minimizing the KL divergence $D_{KL}(\hat{\mathbb{P}}_{\text{data}} \Vert \mathbb{P}_{\text{model}})$
• MLE is equivalent to minimizing the cross-entropy $H(\hat{\mathbb{P}}_{\text{data}}, \mathbb{P}_{\text{model}})$
• When $\mathbb{P}_{\text{model}}$ is Gaussian, this is equivalent to minimizing $\operatorname{MSE}$
• That is, $\operatorname{MSE}$ is the cross-entropy between the empirical distribution and a Gaussian model.

Linear regression makes no explicit assumption about Gaussian distributions, yet you will notice we usually fit it by minimizing $\operatorname{MSE}$. Given the statement above that "$\operatorname{MSE}$ is the cross-entropy between the empirical distribution and a Gaussian model", where exactly does the Gaussian model show up in linear regression?

We can imagine that with an infinitely large training set, we might see several training examples with the same input value $\mathbf{x}$ but different values of $y$. Suppose all outputs corresponding to a single input $\mathbf{x}_1$ form a random variable $Y_1$, of which your training label $y_1$ is just one realization; we then assume $Y_1 \vert \mathbf{x}_1 \sim \mathcal{N}(?, ?)$ (this is our $\mathbb{P}_\text{model}$ at $\mathbf{x}_1$). Writing $Y_1 = \mathbf{w} \mathbf{x}_1 + \epsilon$ with noise $\epsilon \sim \mathcal{N}(0, \sigma^2)$:

$$\begin{aligned} \mathbb{E}[Y_1] &= \mathbb{E}[\mathbf{w} \mathbf{x}_1] + \mathbb{E}[\epsilon] = \mathbf{w} \mathbf{x}_1 + 0 = \mathbf{w} \mathbf{x}_1 \newline \operatorname{Var}(Y_1) &= \operatorname{Var}(\mathbf{w} \mathbf{x}_1) + \operatorname{Var}(\epsilon) = 0 + \sigma^2 = \sigma^2 \end{aligned}$$

• You can also collect all the $Y_i$ together, in which case $\mathbb{P}_\text{model}$ is a multivariate Gaussian: $\mathbf{Y} \vert \mathbf{X} \sim \mathcal{N}(\mathbf{w} \mathbf{X}, \sigma^2 I)$
$$\begin{aligned} \mathbf{w}_{\text{ML}} &= \underset{\mathbf{w}}{\arg \max } \prod_{i=1}^{m} \mathbb{P}_{Y_i \vert \mathbf{x}_i} (y_i) \newline &= \underset{\mathbf{w}}{\arg \max } \sum_{i=1}^{m} \ln \mathbb{P}_{Y_i \vert \mathbf{x}_i} (y_i) \newline &= \underset{\mathbf{w}}{\arg \max } \sum_{i=1}^{m} \ln \frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{-\frac{(y_i- \mathbf{w} \mathbf{x}_i)^{2}}{2 \sigma^{2}}} \newline &= \underset{\mathbf{w}}{\arg \max } -\frac{m}{2} \ln 2 \pi - \frac{m}{2} \ln \sigma^{2} - \frac{\sum_{i=1}^{m} (y_i- \mathbf{w} \mathbf{x}_i)^{2}}{2 \sigma^{2}} \newline &= \underset{\mathbf{w}}{\arg \min } \sum_{i=1}^{m} (y_i- \mathbf{w} \mathbf{x}_i)^{2} \newline &= \underset{\mathbf{w}}{\arg \min } \operatorname{MSE}(\mathbf{Y} \vert \mathbf{X}) \end{aligned}$$

• For a new test data point $\mathbf{x}_{m+1}$, the prediction is simply $\hat{y}_{m+1} = \mathbf{w}_{\text{ML}} \mathbf{x}_{m+1}$. But where does this formula come from?
• Since our assumption is $Y_{m+1} = \mathbf{w} \mathbf{x}_{m+1} + \epsilon$, I think the prediction is best understood as $\hat{y}_{m+1} = \mathbb{E}[Y_{m+1}] = \mathbf{w}_{\text{ML}} \mathbf{x}_{m+1}$
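
The equivalence derived above can be checked numerically. A sketch with synthetic data, a single weight, and no intercept (`w_true`, `sigma`, the seed, and the grid range are all illustrative choices): the least-squares solution and a grid-search minimizer of the Gaussian negative log-likelihood should coincide.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 200
x = rng.uniform(-2, 2, m)
w_true, sigma = 1.5, 0.5
y = w_true * x + rng.normal(0, sigma, m)   # assumed model: y = w x + eps

# Least squares (minimize MSE): closed form for a single weight.
w_ls = (x @ y) / (x @ x)

# Gaussian negative log-likelihood of the same data as a function of w.
def nll(w):
    r = y - w * x
    return 0.5 * m * np.log(2 * np.pi * sigma**2) + (r @ r) / (2 * sigma**2)

# Grid-search the NLL minimizer; it matches the MSE minimizer, since the
# only w-dependent term in the NLL is the sum of squared residuals.
grid = np.linspace(0, 3, 30001)
w_ml = grid[np.argmin([nll(w) for w in grid])]

assert abs(w_ml - w_ls) < 1e-3
```

Note that $\sigma$ affects the value of the likelihood but not its maximizer in $\mathbf{w}$, which is why the derivation can discard every $\sigma$-dependent term.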

## 6. Bayesian Generative Example: Naive Bayes Classifier

Given features $\mathbf{x} = (x_1, \dots, x_n)$, the naive conditional-independence assumption $\mathbb{P}(\mathbf{X} \vert \mathbf{Y}) = \prod_{i=1}^{n} \mathbb{P}(x_i \vert \mathbf{Y})$ gives:

$$\begin{aligned} \mathbb{P}(\mathbf{Y} \vert \mathbf{X}) &\propto \mathbb{P}(\mathbf{X} \vert \mathbf{Y}) \, \mathbb{P}(\mathbf{Y}) \newline &\propto \big( \prod_{i=1}^{n} \mathbb{P}(x_i \vert \mathbf{Y}) \big) \, \mathbb{P}(\mathbf{Y}) \end{aligned}$$

The same Bayes inversion applies to parameter estimation (MAP). Note that taking the log is a monotone transformation, not another proportionality, so the second line is an equality up to a constant:

$$\begin{aligned} \mathbb{P}(\theta \vert \mathbf{X}) &\propto \big( \prod_{i=1}^{n} \mathbb{P}(x_i \vert \theta) \big) \, \mathbb{P}(\theta) \newline \Rightarrow \log \mathbb{P}(\theta \vert \mathbf{X}) &= \sum_{i=1}^{n} \log \mathbb{P}(x_i \vert \theta) + \log \mathbb{P}(\theta) + \text{const} \end{aligned}$$
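
The MAP recipe above can be sketched for a Gaussian mean with a Gaussian prior (all numbers, the known-variance assumption, and the conjugate closed form are illustrative): maximize the sum of log-likelihood and log-prior, and compare against the closed-form answer.

```python
import numpy as np

rng = np.random.default_rng(2)
# Data: n draws from N(theta_true, sigma^2), with sigma assumed known.
theta_true, sigma, n = 2.0, 1.0, 20
x = rng.normal(theta_true, sigma, n)

# Gaussian prior on theta: N(mu0, tau^2).
mu0, tau = 0.0, 1.0

# log P(theta | X) = sum_i log P(x_i | theta) + log P(theta) + const
def log_posterior(theta):
    loglik = -0.5 * np.sum((x - theta) ** 2) / sigma**2
    logprior = -0.5 * (theta - mu0) ** 2 / tau**2
    return loglik + logprior

# Conjugate closed form: the MAP (= posterior mean here) shrinks the
# sample mean toward the prior mean mu0.
theta_map = (tau**2 * x.sum() + sigma**2 * mu0) / (n * tau**2 + sigma**2)

# Check the closed form against a grid search over the log-posterior.
grid = np.linspace(-1, 4, 50001)
assert abs(grid[np.argmax([log_posterior(t) for t in grid])] - theta_map) < 1e-3
```

Dropping the `logprior` term recovers the MLE, which is one way to see the earlier point: MAP looks Frequentist as an optimization, but the object being optimized is the Bayes-inverted posterior.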