# Structured Output Support Vector Machines

Yao Yao on December 11, 2014

## 1. Intro

Structured output SVMs Extends SVMs to handle arbitrary output spaces, particularly ones with non-trivial structure (e.g. space of poses, textual translations, sentences in a grammar, etc.).

• $w^T x$ 叫 score
• $\hat{y}(x;w)$ 其实就是 $h(x)$
• $sign(x)$ 就是 $g(x)$，用 $sign$ 还标准些
• 没有使用 $b$，直接 $\hat{y}(x;w) = w^T x$
• $\phi$ 称为 feature map
• With a feature map, the nature of the input $x$ is irrelevant (image, video, audio, …).
• 优化问题里使用了 hinge loss $L$
• 0-1 classification 的情况下，$L^{(i)}(w) = \max \lbrace 0, 1 - y^{(i)}(w^T x^{(i)}) \rbrace$
• 如果是 Support Vector Regression（直接用 $w^T x$ 来预测 $y$ 的值），则 $L^{(i)}(w) = \left \vert y^{(i)} - w^T x^{(i)} \right \vert$，因为是 1 阶的，也称 $L^1$ error
• hinge loss is a convex function
• 然后它的优化问题是 minimize $E(w) = \frac{\lambda}{2} \left \vert w \right \vert ^2 + \frac{1}{n}\sum_{i=1}{n}{L^{(i)}(w)}$

## 2. SSVM

\begin{align} f(x,y) = w^T \Psi(x,y) \tag{1} \label{eq1} \end{align}

$\Psi$ 称为 joint feature map

\begin{align} & \underset{y \in \mathcal{Y}}{\arg\max} & w^T \Psi(x,y) \tag{2} \label{eq2} \end{align}

Standard SVMs can be easily interpreted as a structured SVMs:

• output space: $y \in \mathcal{Y} = \lbrace -1, 1 \rbrace$
• joint feature map: $\Psi(x,y) = \frac{y}{2} x$
• inference: $\hat{y}(x;w) = \underset{y \in \lbrace -1, 1 \rbrace}{\arg\max} \, \frac{y}{2} w^T x = sign(w^T x)$

## 3. The surrogate loss

The key in the success of the structured SVMs is the existence of good surrogates. The aim is to make minimising $L^{(i)}(w)$ have the same effect as minimising $\Delta(y^{(i)}, \hat{y}(x^{(i)};w))$, 具体的术语是要求 $L^{(i)}(w)$ 是 $\Delta(y^{(i)}, \hat{y}(x^{(i)};w))$ 的 tight binding approximation。但是 tight 的要求比较严格，所以一般有个能用的 surrogate 就可以了。常用的 surrogate 有：

• Margin rescaling surrogate: $L^{(i)}(w) = \underset{y \in \mathcal{Y}}{\sup} \, \Delta(y, \hat{y}) + [w^T \Psi(x^{(i)},y) - w^T \Psi(x^{(i)},y^{(i)})]$
• $\sup$ 表示 “最小上界”
• Slack rescaling surrogate: $L^{(i)}(w) = \underset{y \in \mathcal{Y}}{\sup} \, \Delta(y, \hat{y}) [1 + w^T \Psi(x^{(i)},y) - w^T \Psi(x^{(i)},y^{(i)})]$