# Support Vector Machines and Kernels

Yao Yao on December 5, 2014

## 1. Intro

To start with, we will be considering a linear classifier for a binary classification problem with labels $y \in \{-1, 1\}$ and features $x$. We will use parameters $w$, $b$ instead of $\theta$, and write our classifier as

\begin{align} h(x) = g(w^T x + b) \tag{1} \label{eq1} \end{align}

Here, $g(z) = 1$ if $z \geq 0$, and $g(z) = -1$ otherwise.
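As a quick illustration, here is a minimal NumPy sketch of this classifier (the values of `w` and `b` are made up):

```python
import numpy as np

def g(z):
    """Threshold function: +1 if z >= 0, -1 otherwise."""
    return np.where(z >= 0, 1, -1)

def h(x, w, b):
    """Linear classifier h(x) = g(w^T x + b)."""
    return g(np.dot(w, x) + b)

w = np.array([2.0, -1.0])  # made-up parameters
b = -0.5

print(h(np.array([1.0, 0.5]), w, b))  # w^T x + b = 1.0  -> 1
print(h(np.array([0.0, 1.0]), w, b))  # w^T x + b = -1.5 -> -1
```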

The quantity inside $g$,

\begin{align} f(x) = w^T x + b \tag{2} \label{eq2} \end{align}

is called the discriminant function.

The equation $f(x) = 0$ then defines our hyperplane; I will skip the intuition behind it.

The functional and geometric margins of a point are related through the norm of $w$:

\begin{align} \text{functional margin} = \text{geometric margin} \times \left \vert w \right \vert \tag{3} \label{eq3} \end{align}

\begin{align} w^T x = \left \langle w,x \right \rangle, \qquad w^T w = \left \langle w,w \right \rangle = \left \vert w \right \vert^2 \end{align}
• $w^T x = \left \langle w,x \right \rangle$ is called the inner product (or dot product)
• $\left \vert w \right \vert$ is called the norm of the vector
• $\frac{w}{\left \vert w \right \vert }$ is called the unit-length vector (its norm is 1)
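A small NumPy sketch of these quantities, together with the margin relation $(\ref{eq3})$ (the vectors, bias, and label below are made up):

```python
import numpy as np

w = np.array([3.0, 4.0])
b = 1.0
x = np.array([1.0, 2.0])
y = 1  # label of x

inner = np.dot(w, x)        # <w, x> = 11.0
norm_w = np.linalg.norm(w)  # |w| = 5.0
unit_w = w / norm_w         # unit-length vector; np.linalg.norm(unit_w) == 1.0

functional_margin = y * (np.dot(w, x) + b)     # 12.0
geometric_margin = functional_margin / norm_w  # 2.4
assert np.isclose(functional_margin, geometric_margin * norm_w)  # eq. (3)
```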

## 2. The Non-Linear Case

There is a straightforward way of turning a linear classifier non-linear, or making it applicable to non-vectorial data. It consists of mapping our data to some vector space, which we will refer to as the feature space, using a function $\phi$. The discriminant function then is

\begin{align} f(x) = w^T \phi(x) + b \tag{4} \label{eq4} \end{align}

Note that $f(x)$ is linear in the feature space defined by the mapping $\phi$; viewed in the original input space, however, it is a nonlinear function of $x$ whenever $\phi$ is nonlinear.
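As a concrete (assumed) example, take the quadratic feature map $\phi(x) = (x_1^2, \sqrt{2} x_1 x_2, x_2^2)$ on $\mathbb{R}^2$: the discriminant below is linear in the feature space but traces a circle in the input space:

```python
import numpy as np

def phi(x):
    """Quadratic feature map R^2 -> R^3 (an illustrative choice)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

w = np.array([1.0, 0.0, 1.0])  # linear weights in feature space
b = -1.0

def f(x):
    """f(x) = w^T phi(x) + b; in input space this is x1^2 + x2^2 - 1."""
    return np.dot(w, phi(x)) + b

print(f(np.array([1.0, 0.0])))  # 0.0  -> on the (circular) boundary
print(f(np.array([0.0, 0.0])))  # -1.0 -> inside the circle
```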

Kernel methods avoid this complexity by skipping the step of explicitly mapping the data to a high-dimensional feature space.

## 3. Enter Lagrange Duality

\begin{align} & \underset{w, b}{\text{minimize}} & & \frac{1}{2} \left \vert w \right \vert ^2 \newline & \text{subject to} & & y^{(i)}(w^T x^{(i)} + b) \geq 1, & i = 1, \cdots, m \tag{5} \label{eq5} \end{align}

\begin{align} & \underset{w, b}{\text{minimize}} & & \frac{1}{2} \left \vert w \right \vert ^2 \newline & \text{subject to} & & - y^{(i)}(w^T x^{(i)} + b) + 1 \leq 0, & i = 1, \cdots, m \tag{OPT} \label{eqopt} \end{align}

Yes, there it is: $(\ref{eqopt})$! We have $g_i(w) = - y^{(i)}(w^T x^{(i)} + b) + 1$, and there are no equality constraints $h_i(w)$, so in the standard Lagrangian $L(x, \alpha, \beta)$ we replace $x$ with $(w, b)$ and drop $\beta$, which gives:

\begin{align} L(w, b, \alpha) = \frac{1}{2} \left \vert w \right \vert ^2 - \sum_{i=1}^{m}{\alpha_{i} [y^{(i)}(w^T x^{(i)} + b) - 1]} \tag{6} \label{eq6} \end{align}

Let’s find the dual form of the problem. To do so, we need to first minimize $L(w, b, \alpha)$ with respect to $w$ and $b$ (for fixed $\alpha$), to get $\theta_{D}$, which we’ll do by setting the derivatives of $L$ with respect to $w$ and $b$ to zero. We have:

\begin{align} \nabla_{w}{L(w, b, \alpha)} &= w - \sum_{i=1}^{m}{\alpha_i y^{(i)} x^{(i)}} = 0 \newline w &= \sum_{i=1}^{m}{\alpha_i y^{(i)} x^{(i)}} \end{align}

\begin{align} \frac{\partial}{\partial b}{L(w, b, \alpha)} = - \sum_{i=1}^{m}{\alpha_i y^{(i)}} = 0 \end{align}

Substituting this expression for $w$ back into the discriminant function, we get:

\begin{align} f(x) & = w^T x + b \newline & = \left ( \sum_{i=1}^{m}{\alpha_i y^{(i)} x^{(i)}} \right )^T x + b \newline & = \sum_{i=1}^{m}{\alpha_i y^{(i)} \left \langle x^{(i)},x \right \rangle } + b \tag{7} \label{eq7} \end{align}

Hence, if we’ve found the $\alpha_i$’s, in order to make a prediction we have to calculate a quantity that depends only on the inner product between $x$ and the points in the training set. Moreover, by the KKT complementary slackness condition, the $\alpha_i$’s will all be 0 except for the support vectors. Thus, many of the terms in the sum above will be 0, and we really need to find only the inner products between $x$ and the support vectors (of which there is often only a small number) in order to calculate $(\ref{eq7})$ and make our prediction.
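A minimal sketch of the prediction rule $(\ref{eq7})$; the training points, labels, and $\alpha_i$’s below are made up, with the zero entries standing in for non-support vectors:

```python
import numpy as np

# Toy training set with a made-up dual solution
X = np.array([[1.0, 1.0], [2.0, 2.0], [0.0, 0.0], [-1.0, -1.0]])
y = np.array([1, 1, -1, -1])
alpha = np.array([0.5, 0.0, 0.5, 0.0])  # only x^(1) and x^(3) are support vectors
b = 0.0

def f(x):
    """f(x) = sum_i alpha_i y^(i) <x^(i), x> + b -- eq. (7)."""
    return np.sum(alpha * y * (X @ x)) + b

print(np.sign(f(np.array([1.5, 1.5]))))  # 1.0 -> predict +1
```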

## 4. Kernels

Applying the feature map $\phi$ from Section 2, the prediction rule $(\ref{eq7})$ becomes

\begin{align} f(x) & = w^T \phi(x) + b \newline & = \sum_{i=1}^{m}{\alpha_i y^{(i)} \left \langle \phi(x^{(i)}),\phi(x) \right \rangle} + b \tag{8} \label{eq8} \end{align}

Only inner products of mapped points appear, so we can define the kernel

\begin{align} K(x, x') = \left \langle \phi(x),\phi(x') \right \rangle \tag{9} \label{eq9} \end{align}

and compute it directly, without ever constructing $\phi(x)$ explicitly.
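To see the kernel trick at work, the sketch below checks (for an assumed homogeneous 2nd-degree polynomial kernel on $\mathbb{R}^2$) that $K(x, x') = \left \langle x,x' \right \rangle^2$ agrees with the explicit inner product under $\phi(x) = (x_1^2, \sqrt{2} x_1 x_2, x_2^2)$:

```python
import numpy as np

def K(x, xp):
    """Homogeneous 2nd-degree polynomial kernel: <x, x'>^2."""
    return np.dot(x, xp) ** 2

def phi(x):
    """The explicit feature map that K computes implicitly."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, xp = np.array([1.0, 2.0]), np.array([3.0, 4.0])
assert np.isclose(K(x, xp), np.dot(phi(x), phi(xp)))  # both equal 121.0
```

The kernel costs one 2-dimensional dot product, while the explicit route requires building the feature vectors first; for high-degree polynomials, or for the Gaussian kernel whose feature space is infinite-dimensional, the explicit route is impractical or impossible.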

## 5. Kernel Examples

### 5.1 Kernels for Vectorial Data

Popular choices for $K$ in the SVM literature are:

• linear kernel: $K(x, x') = \left \langle x,x' \right \rangle$
  • equivalent to using no $\phi$ at all, i.e. $\phi(x) = x$
• $d$th-degree polynomial kernel:
  • homogeneous: $K(x, x') = \left \langle x,x' \right \rangle^d$
  • inhomogeneous: $K(x, x') = (1 + \left \langle x,x' \right \rangle)^d$
• Gaussian kernel: $K(x, x') = \exp(-\frac{\left \vert x-x' \right \vert ^2}{2 \sigma^2})$
• Radial basis kernel: $K(x, x') = \exp(-\gamma \left \vert x-x' \right \vert ^2)$
• Neural network kernel: $K(x, x') = \tanh(k_1 \left \langle x,x' \right \rangle + k_2)$
  • $\tanh$ is the hyperbolic tangent:
  • $\sinh(x) = \frac{e^x - e^{-x}}{2}$
  • $\cosh(x) = \frac{e^x + e^{-x}}{2}$
  • $\tanh(x) = \frac{\sinh(x)}{\cosh(x)} = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
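For reference, here is a sketch of the kernels above as NumPy one-liners; the default values for $d$, $\sigma$, $\gamma$, $k_1$, $k_2$ are arbitrary placeholders, not recommendations:

```python
import numpy as np

def linear(x, xp):
    return np.dot(x, xp)

def poly_homogeneous(x, xp, d=3):
    return np.dot(x, xp) ** d

def poly_inhomogeneous(x, xp, d=3):
    return (1 + np.dot(x, xp)) ** d

def gaussian(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

def rbf(x, xp, gamma=0.5):
    return np.exp(-gamma * np.sum((x - xp) ** 2))

def neural_network(x, xp, k1=1.0, k2=0.0):
    return np.tanh(k1 * np.dot(x, xp) + k2)
```

Note that the Gaussian and radial basis kernels are the same function under the substitution $\gamma = \frac{1}{2 \sigma^2}$.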

### 5.2 Kernels for Sequences

This is covered on page 12 of Support Vector Machines and Kernels for Computational Biology, so I will only sketch it briefly here.

#### 5.2.1 Kernels Describing $\ell$-mer Content

• Consider all dimers in ACGT order: $x_1$ is the count of AA, $x_2$ the count of AC, ..., $x_{16}$ the count of TT
• To separate introns from exons, the features can be designed as: $x_1$ counts intronic AA, ..., $x_{16}$ counts intronic TT, $x_{17}$ counts exonic AA, ..., $x_{32}$ counts exonic TT
• For example, for an intronic sequence ACT, only the intronic AC and intronic CT entries are 1 and every other entry is 0. Such a vector is called the spectrum of the sequence (see the sketch below)
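A minimal sketch of the spectrum computation (the helper name `spectrum` is my own choice, not from the paper; $\ell$-mers are enumerated in the ACGT order described above):

```python
from itertools import product

def spectrum(seq, ell=2, alphabet="ACGT"):
    """Count every ell-mer of seq, in ACGT enumeration order.
    For ell=2 this is the 16-dim dimer spectrum: AA, AC, ..., TT."""
    kmers = ["".join(p) for p in product(alphabet, repeat=ell)]
    counts = {k: 0 for k in kmers}
    for i in range(len(seq) - ell + 1):
        counts[seq[i:i + ell]] += 1
    return [counts[k] for k in kmers]

# "ACT" contains only the dimers AC and CT:
print(spectrum("ACT"))
# [0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```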

The spectrum kernel is then just the inner product of two spectra:

\begin{align} K_{\ell}^{spectrum}(x, x') = \left \langle \Phi_{\ell}^{spectrum}(x) ,\Phi_{\ell}^{spectrum}(x') \right \rangle \tag{10} \label{eq10} \end{align}

Since the spectrum kernel allows no mismatches, when $\ell$ is sufficiently long the chance of observing common occurrences becomes small and the kernel will no longer perform well. This problem is alleviated if we use the mixed spectrum kernel:

\begin{align} K_{\ell}^{mixedspectrum}(x, x') = \sum_{d=1}^{\ell}{\beta_d K_{d}^{spectrum}(x, x')} \tag{11} \label{eq11} \end{align}

where $\beta_d$ is a weighting for the different substring lengths.
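Reusing the `spectrum` helper from the sketch above, $(\ref{eq10})$ and $(\ref{eq11})$ might be implemented as follows (the uniform default $\beta_d = 1$ is my assumption, not a recommendation from the paper):

```python
import numpy as np

def spectrum_kernel(x, xp, ell):
    """K_ell^spectrum(x, x') = <Phi(x), Phi(x')> -- eq. (10)."""
    return float(np.dot(spectrum(x, ell), spectrum(xp, ell)))

def mixed_spectrum_kernel(x, xp, ell, beta=None):
    """K_ell^mixedspectrum = sum_d beta_d * K_d^spectrum -- eq. (11)."""
    beta = beta if beta is not None else [1.0] * ell
    return sum(beta[d - 1] * spectrum_kernel(x, xp, d)
               for d in range(1, ell + 1))

# ACGT and ACGA share the dimers AC and CG, so the 2-spectrum kernel is 2
print(spectrum_kernel("ACGT", "ACGA", 2))        # 2.0
print(mixed_spectrum_kernel("ACGT", "ACGA", 2))  # 4.0 (d=1) + 2.0 (d=2) = 6.0
```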

#### 5.2.2 Kernels Using Positional Information

Analogous to Position Weight Matrices (PWMs), the idea is to analyze sequences of fixed length $L$ and consider substrings starting at each position $l = 1 ,\cdots,L$ separately, as implemented by the so-called weighted degree (WD) kernel:

\begin{align} K_{\ell}^{weighteddegree}(x, x') = \sum_{l=1}^{L}{\sum_{d=1}^{\ell}{\beta_d K_{d}^{spectrum}(x_{[l:l+d]}, x'_{[l:l+d]})}} \tag{12} \label{eq12} \end{align}

where $x_{[l:l+d]}$ is the substring of length $d$ at position $l$. A suggested setting for $\beta_d$ is $\beta_d = \frac{2(\ell-d+1)}{\ell^2 + \ell}$.
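Finally, a sketch of the WD kernel $(\ref{eq12})$ with the suggested $\beta_d$, again reusing `spectrum_kernel` (Python's 0-based slicing stands in for the text's positions $l = 1, \cdots, L$; enumerating all $4^d$ $d$-mers per comparison is for clarity, not efficiency):

```python
def beta(d, ell):
    """Suggested weighting beta_d = 2(ell - d + 1) / (ell^2 + ell)."""
    return 2 * (ell - d + 1) / (ell ** 2 + ell)

def wd_kernel(x, xp, ell):
    """Weighted degree kernel -- eq. (12). Assumes len(x) == len(xp) == L."""
    L = len(x)
    total = 0.0
    for l in range(L):                   # each starting position
        for d in range(1, ell + 1):      # each substring length
            if l + d > L:                # substring would run past the end
                break
            # on length-d substrings the d-spectrum kernel is 1 iff they match
            total += beta(d, ell) * spectrum_kernel(x[l:l + d], xp[l:l + d], d)
    return total

# ACGT vs ACGA: matching substrings at d=1 (A, C, G) and d=2 (AC, CG)
print(wd_kernel("ACGT", "ACGA", 2))  # 3*(2/3) + 2*(1/3) = 2.667
```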