# ISL: Unsupervised Learning

Yao Yao on October 9, 2014

## 0. Overview

In this chapter, we will focus on two particular types of unsupervised learning:

• principal components analysis: a tool used for data visualization or data pre-processing before supervised techniques are applied
• clustering: a broad class of methods for discovering unknown subgroups in data

## 1. The Challenge of Unsupervised Learning

P374: nothing very technical here. The one sentence I slightly care about: "Unsupervised learning is often performed as part of an exploratory data analysis."

## 2. Principal Components Analysis

Principal component analysis (PCA) refers to the process by which principal components are computed, and the subsequent use of these components in understanding the data.

### 2.1 What Are Principal Components?

The $1^{st}$ principal component of a set of features $X_1,X_2,\cdots,X_p$ is the normalized linear combination of the features

$$Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \cdots + \phi_{p1} X_p \tag{2.1}$$

that has the largest variance. By normalized, we mean that $\sum_{j=1}^{p}{\phi_{j1}^2} = 1$. We refer to the elements $\phi_{11}, \phi_{21}, \cdots, \phi_{p1}$ as the loadings of the $1^{st}$ principal component.

\begin{align} X & = \begin{pmatrix} X_1 & X_2 & \cdots & X_p \end{pmatrix} \newline & = \begin{pmatrix} -- (x^{(1)})^T -- \newline -- (x^{(2)})^T -- \newline \vdots \newline -- (x^{(n)})^T -- \end{pmatrix} \newline & = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \newline x_{21} & x_{22} & \cdots & x_{2p} \newline \vdots & \vdots & \ddots & \vdots \newline x_{n1} & x_{n2} & \cdots & x_{np} \newline \end{pmatrix} \end{align}

\begin{align} z_{i1} & = x_i \cdot \phi_1 \newline & = \phi_{11} x_{i1} + \phi_{21} x_{i2} + \cdots + \phi_{p1} x_{ip} \tag{2.2} \end{align}

We refer to $z_{i1}$ as the score of the $1^{st}$ principal component for the $i^{th}$ observation.

See P376 for the formal statement of the optimization problem.

There is a nice geometric interpretation for the $1^{st}$ principal component. The loading vector $\phi_1 = (\phi_{11}, \phi_{21}, \cdots, \phi_{p1})^T$ defines a direction in feature space along which the data vary the most. If we project the $n$ data points $x_1, \cdots, x_n$ onto this direction, the projected values are the principal component scores $z_{11}, \cdots, z_{n1}$ themselves.
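To make this concrete, here is a minimal NumPy sketch (my own, in Python rather than the book's R, on simulated data): the first loading vector comes out as the top right-singular vector of the centered data matrix, and the scores are the projections onto it.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))          # n = 50 observations, p = 4 features
Xc = X - X.mean(axis=0)               # center each column (prcomp does this too)

# SVD of the centered data: rows of Vt are the loading vectors phi_1, phi_2, ...
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
phi1 = Vt[0]                          # first loading vector, unit length
z1 = Xc @ phi1                        # scores z_11, ..., z_n1 = projections onto phi1

# phi1 is normalized: the squared loadings sum to 1
assert np.isclose(phi1 @ phi1, 1.0)

# no other unit direction we try yields a larger sample variance than phi1
v = rng.normal(size=4)
v /= np.linalg.norm(v)
assert np.var(Xc @ v) <= np.var(z1) + 1e-12
```

The "largest variance" property is exactly what the SVD delivers: among all unit vectors, the top right-singular vector maximizes the variance of the projected data.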

The second principal component $Z_2$ is the linear combination of $X_1,X_2,\cdots,X_p$ that has maximal variance out of all linear combinations that are uncorrelated with $Z_1$.

It turns out that constraining $Z_2$ to be uncorrelated with $Z_1$ is equivalent to constraining the direction $\phi_2$ to be orthogonal (perpendicular) to the direction $\phi_1$.
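A quick numerical check of this equivalence (again a NumPy sketch of mine, simulated data): the second loading vector from the SVD is orthogonal to the first, and the two score vectors are uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(1)
Xc = rng.normal(size=(50, 4))
Xc -= Xc.mean(axis=0)                 # centered data matrix

_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
phi1, phi2 = Vt[0], Vt[1]             # first two loading vectors
z1, z2 = Xc @ phi1, Xc @ phi2         # first two score vectors

assert np.isclose(phi1 @ phi2, 0.0)           # directions are orthogonal
assert abs(np.cov(z1, z2)[0, 1]) < 1e-8       # scores are uncorrelated
```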

The USArrests example on P377 is worth studying carefully.

• "The principal component score vectors have length $n = 50$" tells you there are 50 observations
• judging from the description, these are clearly the 50 states, one row per state
• "and the principal component loading vectors have length $p = 4$" tells you there are 4 features
• namely: Assault, Murder, Rape, UrbanPop

How to read the biplot:

• the axes on the bottom and left measure the scores
• $\phi_1 = (\phi_{11}, \phi_{21}, \phi_{31}, \phi_{41})$
• $\phi_2 = (\phi_{12}, \phi_{22}, \phi_{32}, \phi_{42})$
• the 4 feature arrows are $(\phi_{11}, \phi_{12})$, $(\phi_{21}, \phi_{22})$, $(\phi_{31}, \phi_{32})$, $(\phi_{41}, \phi_{42})$
• Assault, Murder, and Rape all have fairly large $1^{st}$ loadings, which says the $1^{st}$ PC roughly corresponds to a measure of overall rates of serious crimes.
• This also indicates that the crime-related variables are correlated with each other — states with high murder rates tend to have high assault and rape rates.
• UrbanPop has a large $2^{nd}$ loading, which says the $2^{nd}$ PC roughly corresponds to the level of urbanization of the state.
• For example, if Kentucky's $1^{st}$ score is $z_{k1}$ and its $2^{nd}$ score is $z_{k2}$, then the text "Kentucky" appears at the point $(z_{k1}, z_{k2})$.
• Think about it: a score is just the value of a new post-PCA feature, so the magnitude of a score means the same kind of thing as the magnitude of an $X_i$.
• For example, California's $1^{st}$ and $2^{nd}$ scores are both large; combining this with the meanings of the $1^{st}$ and $2^{nd}$ PCs, California has both a high crime rate and a high level of urbanization.
• Indiana, with scores close to 0 on both PCs, has approximately average levels of both crime and urbanization.
• Vermont has both a low crime rate and a low level of urbanization.

### 2.2 Another Interpretation of Principal Components

In the previous section, we described the PC loading vectors as the directions in feature space along which the data vary the most (i.e. have the highest variance), and the PC scores as projections along these directions.

An alternative interpretation is: PCs provide low-dimensional linear surfaces that are closest to the observations.

The $1^{st}$ PC loading vector has a very special property: it is the line in $p$-dimensional space that is closest to the $n$ observations (using average squared Euclidean distance as a measure of closeness). The appeal of this interpretation is clear: we seek a single dimension of the data that lies as close as possible to all of the data points, since such a line will likely provide a good summary of the data.
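This closest-line property can be checked numerically (a NumPy sketch of mine; `avg_sq_dist` is my own helper name): no other direction through the origin comes closer, in average squared Euclidean distance, to the centered data than the first PC direction.

```python
import numpy as np

rng = np.random.default_rng(2)
Xc = rng.normal(size=(100, 3)) @ np.diag([3.0, 1.0, 0.5])   # anisotropic cloud
Xc -= Xc.mean(axis=0)

_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
phi1 = Vt[0]                          # first PC direction

def avg_sq_dist(V, v):
    """Average squared Euclidean distance from rows of V to the line through 0 along unit vector v."""
    proj = np.outer(V @ v, v)         # projection of each point onto the line
    return np.mean(np.sum((V - proj) ** 2, axis=1))

# the first PC line is at least as close to the data as any random direction we try
for _ in range(20):
    v = rng.normal(size=3)
    v /= np.linalg.norm(v)
    assert avg_sq_dist(Xc, phi1) <= avg_sq_dist(Xc, v) + 1e-12
```

Minimizing the distance to the line is the same as maximizing the variance of the projections, which is why the two interpretations coincide.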

P380 gives an $M$-dimensional approximation:

$$x_{ij} \approx \sum_{m=1}^{M}{z_{im} \phi_{jm}} \tag{2.3} \label{eq2.3}$$

When $M = \min(n-1, p)$, the representation is exact:

$$x_{ij} = \sum_{m=1}^{M}{z_{im} \phi_{jm}} \tag{2.4}$$

• in Ng's notation, with `[U, S, V] = svd(Sigma)`, each loading vector is one column of `U`
• the score $z_{i1}$ is the first element of $z^{(i)}$
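The approximation can be verified numerically (a NumPy sketch of mine, simulated data): the rank-$M$ reconstruction is only approximate for small $M$ and becomes exact at $M = \min(n-1, p)$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 4
Xc = rng.normal(size=(n, p))
Xc -= Xc.mean(axis=0)                 # centered data matrix

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = U * S                             # n x p matrix of scores, column m is z_m
Phi = Vt.T                            # p x p matrix of loadings, column m is phi_m

def approx(M):
    """Rank-M reconstruction: x_ij ~= sum_{m=1}^{M} z_im * phi_jm."""
    return Z[:, :M] @ Phi[:, :M].T

assert not np.allclose(approx(1), Xc)          # M = 1 is only an approximation
assert np.allclose(approx(min(n - 1, p)), Xc)  # M = min(n-1, p) reproduces Xc exactly
```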

### 2.3 More on PCA

#### Scaling the Variables

P381 compares the scaled and unscaled results.

P382 mentions a situation where scaling may be unnecessary: the variables are measured in the same units.
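A small sketch of why scaling matters (my own, with two simulated features playing the roles of Murder and Assault): without scaling, the high-variance feature dominates the first loading vector almost completely.

```python
import numpy as np

rng = np.random.default_rng(4)
small = rng.normal(scale=1.0, size=100)   # low-variance feature (think Murder)
big = rng.normal(scale=80.0, size=100)    # high-variance feature (think Assault)
X = np.column_stack([small, big])
Xc = X - X.mean(axis=0)

# unscaled PCA: the first loading vector is essentially the high-variance feature
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
assert abs(Vt[0, 1]) > 0.99

# scaled PCA (each column divided by its standard deviation): no such dominance
_, _, Vt_s = np.linalg.svd(Xc / Xc.std(axis=0), full_matrices=False)
assert abs(Vt_s[0, 1]) < 0.99
```

This is exactly the behavior `prcomp(USArrests, scale=TRUE)` guards against in the lab below.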

#### Uniqueness of the Principal Components

A sign flip has no effect on a loading vector or a score vector; vectors with opposite signs can be regarded as the same (the only thing it affects is the interpretation of the PCs and the scores).

P382-383

#### Deciding How Many Principal Components to Use

In general, an $n \times p$ data matrix $X$ has $\min(n - 1, p)$ distinct PCs.
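The $\min(n - 1, p)$ count comes from centering: subtracting the column means makes the rows of the centered matrix sum to zero, which removes one degree of freedom. A quick check (NumPy sketch of mine, simulated data with $n < p$):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 5, 10                          # fewer observations than features
Xc = rng.normal(size=(n, p))
Xc -= Xc.mean(axis=0)                 # centering makes the rows sum to the zero vector

# the rank (= number of PCs with nonzero variance) is min(n - 1, p) = 4, not 5
assert np.linalg.matrix_rank(Xc) == min(n - 1, p)
```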

P384 explains this well.

## 3. Clustering Methods

Clustering refers to a very broad set of techniques for finding homogeneous ([ˌhɒməˈdʒi:niəs]) subgroups, or clusters, in a data set. When we cluster the observations of a data set, we seek to partition them into distinct groups so that the observations within each group are quite similar to each other, while observations in different groups are quite different from each other. Of course, to make this concrete, we must define what it means for two or more observations to be similar or different. Indeed, this is often a domain-specific consideration that must be made based on knowledge of the data being studied.

In this section we focus on perhaps the two best-known clustering approaches:

• $K$-means clustering
• $K$ is specified in advance
• hierarchical clustering
• we do not know in advance how many clusters we want
• we end up with a tree-like visual representation of the observations, called a dendrogram ([‘dendrəgræm], a tree diagram)

• cluster observations on the basis of the features in order to identify subgroups among the observations
• cluster features on the basis of the observations in order to discover subgroups among the features

### 3.1 K-Means Clustering

P386-390. Ng's course already explained this clearly, so only a brief summary here:

• the objective is $\min \sum{(\text{within-cluster variation})}$
• the algorithm is "repeatedly update the centroids"
• because it may converge to a local optimum, the centroids should be randomly initialized and the algorithm run multiple times
• choosing the value of $K$ also takes multiple trial runs
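The loop above can be sketched as follows (my own minimal NumPy implementation of Lloyd's algorithm, not the book's code; the data mirrors the simulated two-cluster example used in Lab 2 below):

```python
import numpy as np

def kmeans(X, K, n_init=20, iters=100, seed=0):
    """Lloyd's algorithm: repeat over random initializations, keep the lowest within-cluster SS."""
    rng = np.random.default_rng(seed)
    best = (np.inf, None)
    for _ in range(n_init):
        centroids = X[rng.choice(len(X), K, replace=False)]   # random init from data points
        for _ in range(iters):
            # assign each observation to its nearest centroid
            d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            labels = d.argmin(axis=1)
            # recompute each centroid as the mean of its cluster (keep old one if empty)
            new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                            else centroids[k] for k in range(K)])
            if np.allclose(new, centroids):
                break
            centroids = new
        tot_withinss = d.min(axis=1).sum()
        if tot_withinss < best[0]:
            best = (tot_withinss, labels)
    return best[1]

# two well-separated simulated clusters, as in the lab below
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
X[:25] += [3, -4]
labels = kmeans(X, 2)
assert len(set(labels[:25])) == 1 and len(set(labels[25:])) == 1   # clusters recovered
```

The `n_init` loop corresponds to the `nstart` argument of R's `kmeans()`: run from several random starts and report only the best result.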

### 3.2 Hierarchical Clustering

In this section, we describe bottom-up or agglomerative ([ə’glɒmərətɪv], tending to agglomerate) clustering. This is the most common type of hierarchical clustering, and refers to the fact that a dendrogram (generally depicted as an upside-down tree) is built starting from the leaves and combining clusters up to the trunk.

#### Interpreting a Dendrogram

• Each leaf of the dendrogram represents one observation.
• As we move up the tree, some leaves begin to fuse into branches.
• earlier fusions occur lower in the tree
• later fusions occur nearer to the top of the tree
• For any two observations, we can look for the point in the tree where branches containing those two observations are first fused. The height of this fusion, as measured on the vertical axis, indicates how different the two observations are.
• Thus, observations that fuse at the very bottom of the tree are quite similar to each other
• whereas observations that fuse close to the top of the tree will tend to be quite different.
• We cannot draw conclusions about the similarity of two observations based on their proximity along the horizontal axis.

What P393 says: make a horizontal cut across the dendrogram; the number of branches left in the upper part is the number of big clusters, and cutting at different heights gives different results. This highlights a very attractive aspect of hierarchical clustering: one single dendrogram can be used to obtain any number of clusters.

However, often the choice of where to cut the dendrogram is not so clear.

#### The Hierarchical Clustering Algorithm

• first fix a metric: the dissimilarity measure between each pair of observations
• Most often, Euclidean distance is used
• treat the $n$ observations as $n$ clusters and fuse the pair with the smallest dissimilarity
• recompute the dissimilarities among the resulting $(n-1)$ clusters, and again fuse the closest pair
• repeating this process $(n-1)$ times completes the clustering
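The steps above can be sketched as follows (my own NumPy implementation of complete linkage; `hclust_complete` is a made-up name, and cutting the dendrogram is emulated by stopping once $K$ clusters remain):

```python
import numpy as np

def hclust_complete(X, K):
    """Agglomerative clustering with complete linkage; stop when K clusters remain."""
    clusters = [[i] for i in range(len(X))]
    # pairwise Euclidean distances between all observations
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    while len(clusters) > K:
        # find the pair of clusters with the smallest complete-linkage dissimilarity
        best, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                link = d[np.ix_(clusters[a], clusters[b])].max()  # largest pairwise dissimilarity
                if link < best:
                    best, pair = link, (a, b)
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]   # fuse the two closest clusters
        del clusters[b]
    return clusters

# two tight groups far apart: cutting at K = 2 recovers them
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
groups = sorted(sorted(c) for c in hclust_complete(X, 2))
assert groups == [[0, 1, 2], [3, 4, 5]]
```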

• Complete: Maximal intercluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the largest of these dissimilarities.
• Single: Minimal intercluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the smallest of these dissimilarities. Single linkage can result in extended, trailing clusters in which single observations are fused one-at-a-time.
• Average: Mean intercluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the average of these dissimilarities.
• Centroid: Dissimilarity between the centroid for cluster A (a mean vector of length p) and the centroid for cluster B. Centroid linkage can result in undesirable inversions.
• Average, complete, and single linkage are most popular among statisticians.
• Average and complete linkage are generally preferred over single linkage, as they tend to yield more balanced dendrograms.
• Centroid linkage is often used in genomics, but suffers from a major drawback in that an inversion can occur, whereby two clusters are fused at a height below (no need to read this "below" upside down: if you flip the tree so the root is at the bottom and the leaves at the top, an inversion means the fusion point sits even higher than the leaves) either of the individual clusters in the dendrogram.

#### Choice of Dissimilarity Measure

P396-399 explain this in detail.

It turns out that Euclidean distance and correlation-based distance are almost equivalent: if each observation has been scaled to have mean 0 and standard deviation 1, and if we let $r_{ij}$ denote the correlation between the $i^{th}$ and $j^{th}$ observations, then the quantity $1 - r_{ij}$ is proportional to the squared Euclidean distance between the $i^{th}$ and $j^{th}$ observations.
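A numerical check (NumPy sketch of mine): when each observation (row) is scaled to mean 0 and population standard deviation 1, the constant of proportionality works out to exactly $2p$.

```python
import numpy as np

rng = np.random.default_rng(6)
p = 8
X = rng.normal(size=(5, p))
# scale each *observation* (row) to have mean 0 and standard deviation 1
X = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

# for every pair i, j: squared Euclidean distance = 2p * (1 - r_ij)
for i in range(5):
    for j in range(5):
        r_ij = np.corrcoef(X[i], X[j])[0, 1]
        sq_dist = ((X[i] - X[j]) ** 2).sum()
        assert np.isclose(sq_dist, 2 * p * (1 - r_ij))
```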

### 3.3 Practical Issues in Clustering

#### Small Decisions with Big Consequences

• Should the observations or features first be standardized in some way?
• i.e. centered to have mean zero and scaled to have standard deviation one
• In the case of hierarchical clustering,
• What dissimilarity measure should be used?
• What type of linkage should be used?
• Where should we cut the dendrogram in order to obtain clusters?
• In the case of K-means clustering, how many clusters should we look for in the data?

In practice, we try several different choices, and look for the one with the most useful or interpretable solution. With these methods, there is no single right answer—any solution that exposes some interesting aspects of the data should be considered.

#### Validating the Clusters Obtained

There has been no consensus on a single best approach. (P400)

#### A Tempered Approach to Interpreting the Results of Clustering

P401

These results should not be taken as the absolute truth about a data set. Rather, they should constitute a starting point for the development of a scientific hypothesis and further study, preferably on an independent data set.

## 4. Lab 1: Principal Components Analysis

In this lab, we perform PCA on the USArrests data set, which is part of the base R package. The rows of the data set contain the 50 states, in alphabetical order.

> states = row.names(USArrests)
> states


The columns of the data set contain the four variables.

> names(USArrests)
[1] "Murder" "Assault" "UrbanPop" "Rape"


We first briefly examine the data.

> apply(USArrests, 2, mean) ## column-wise mean
Murder	Assault	UrbanPop	Rape
7.79	170.76	65.54		21.23
## vastly different means

> apply(USArrests, 2, var) ## column-wise variance
Murder	Assault	UrbanPop	Rape
19.0	6945.2	209.5		87.7
## vastly different variances


We now perform PCA using the prcomp() function.

> pr.out = prcomp(USArrests, scale=TRUE)


By default, the prcomp() function centers the variables to have mean zero. By using the option scale=TRUE, we scale the variables to have standard deviation one. The output from prcomp() contains a number of useful quantities.

> names(pr.out)
[1] "sdev" "rotation" "center" "scale" "x"


The center and scale components correspond to the means and standard deviations of the variables that were used for scaling prior to implementing PCA.

> pr.out$center
Murder	Assault	UrbanPop	Rape
7.79	170.76	65.54		21.23
> pr.out$scale
Murder	Assault	UrbanPop	Rape
4.36	83.34	14.47		9.37


The rotation matrix provides the PC loadings; each column of pr.out$rotation contains the corresponding PC loading vector. When we matrix-multiply the $X$ matrix by pr.out$rotation, it gives us the coordinates of the data in the rotated coordinate system. These coordinates are the PC scores.

> pr.out$rotation
             PC1     PC2     PC3     PC4
Murder    -0.536   0.418  -0.341   0.649
Assault   -0.583   0.188  -0.268  -0.743
UrbanPop  -0.278  -0.873  -0.378   0.134
Rape      -0.543  -0.167   0.818   0.089


We see that there are 4 distinct PCs. This is to be expected because there are in general $\min(n - 1, p)$ informative PCs in a data set with $n$ observations and $p$ variables.

Using the prcomp() function, we do not need to explicitly multiply the data by the PC loading vectors in order to obtain the PC score vectors. Rather the $50 \times 4$ matrix x has as its columns the PC score vectors. That is, the $k^{th}$ column is the $k^{th}$ PC score vector.

> dim(pr.out$x)
[1] 50 4


We can plot the first two PCs as follows:

> biplot(pr.out, scale=0)


The scale=0 argument to biplot() ensures that the arrows are scaled to represent the loadings; other values for scale give slightly different biplots with different interpretations.

Notice that this figure is a mirror image of Figure 10.1. Recall that the principal components are only unique up to a sign change, so we can reproduce Figure 10.1 by making a few small changes:

> pr.out$rotation = -pr.out$rotation
> pr.out$x = -pr.out$x
> biplot(pr.out, scale=0)


The prcomp() function also outputs the standard deviation of each principal component.

> pr.out$sdev
[1] 1.575 0.995 0.597 0.416


The variance explained by each principal component is obtained by squaring these:

> pr.var = pr.out$sdev^2
> pr.var
[1] 2.480 0.990 0.357 0.173


To compute the proportion of variance explained by each PC, we simply divide the variance explained by each PC by the total variance explained by all 4 PCs:

> pve = pr.var/sum(pr.var)
> pve
[1] 0.6201 0.2474 0.0891 0.0434


We can plot the PVE explained by each component, as well as the cumulative PVE, as follows:

> plot(pve, xlab="Principal Component", ylab="Proportion of Variance Explained", ylim=c(0,1), type="b")
> plot(cumsum(pve), xlab="Principal Component", ylab="Cumulative Proportion of Variance Explained", ylim=c(0,1), type="b")


The result is shown in Figure 10.4 (i.e. scree plot). Note that the function cumsum() computes the cumulative sum of the elements of a numeric vector.

## 5. Lab 2: Clustering

### 5.1 K-Means Clustering

The function kmeans() performs K-means clustering in R. We begin with a simple simulated example in which there truly are two clusters in the data: the first 25 observations have a mean shift relative to the next 25 observations.

> set.seed(2)
> x = matrix(rnorm(50*2), ncol=2)
> x[1:25,1] = x[1:25,1]+3
> x[1:25,2] = x[1:25,2]-4


We now perform K-means clustering with $K = 2$.

> km.out = kmeans(x, 2, nstart=20) ## 20 random centroid initializations, i.e. run 20 times and keep the best result
> km.out


The cluster assignments of the 50 observations are contained in km.out$cluster.

> km.out$cluster
[1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1
[30] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1


We can plot the data, with each observation colored according to its cluster assignment.

> plot(x, col=(km.out$cluster+1), main="K-Means Clustering Results with K=2", xlab="", ylab="", pch=20, cex=2)


If a value of nstart greater than 1 is used, then K-means clustering will be performed using multiple random assignments in Step 1 of Algorithm 10.1, and the kmeans() function will report only the best results. Here we compare using nstart=1 to nstart=20.

> set.seed(3)
> km.out = kmeans(x, 3, nstart=1)
> km.out$tot.withinss
[1] 104.3319
> km.out = kmeans(x, 3, nstart=20)
> km.out$tot.withinss


Note that km.out$tot.withinss is the total within-cluster sum of squares, which we seek to minimize by performing K-means clustering.

## 6. Lab 3: NCI60 Data Example

### 6.1 PCA on the NCI60 Data

> plot(pr.out$x[,c(1,3)], col=Cols(nci.labs), pch=19, xlab="Z1", ylab="Z3")


On the whole, cell lines corresponding to a single cancer type do tend to have similar values on the first few principal component score vectors. This indicates that cell lines from the same cancer type tend to have pretty similar gene expression levels.

We can obtain a summary of the proportion of variance explained (PVE) of the first few principal components using the summary() method for a prcomp object.

> summary(pr.out)


Using the plot() function, we can also plot the variance explained by the first few principal components.

> plot(pr.out)


Or we can plot the scree plot manually by:

> pve = 100*pr.out$sdev^2 / sum(pr.out$sdev^2)
> par(mfrow=c(1,2))
> plot(pve, type="o", ylab="PVE", xlab="Principal Component", col="blue")
> plot(cumsum(pve), type="o", ylab="Cumulative PVE", xlab="Principal Component", col="brown3")


Note that the elements of pve can also be computed directly from summary(pr.out)$importance[2,], and the elements of cumsum(pve) are given by summary(pr.out)$importance[3,].

### 6.2 Clustering the Observations of the NCI60 Data

To begin, we standardize the variables to have mean zero and standard deviation one. As mentioned earlier, this step is optional and should be performed only if we want each gene to be on the same scale.

> sd.data = scale(nci.data)


We now perform hierarchical clustering of the observations using complete, single, and average linkage. Euclidean distance is used as the dissimilarity measure.

> par(mfrow=c(1,3))
> data.dist = dist(sd.data)
> plot(hclust(data.dist), labels=nci.labs, main="Complete Linkage", xlab="", sub="", ylab="")
> plot(hclust(data.dist, method="average"), labels=nci.labs, main="Average Linkage", xlab="", sub="", ylab="")
> plot(hclust(data.dist, method="single"), labels=nci.labs, main="Single Linkage", xlab="", sub="", ylab="")


We see that the choice of linkage certainly does affect the results obtained.
Typically, single linkage will tend to yield trailing clusters: very large clusters onto which individual observations attach one-by-one. On the other hand, complete and average linkage tend to yield more balanced, attractive clusters.

We can cut the dendrogram at the height that will yield a particular number of clusters, say four:

> hc.out = hclust(dist(sd.data))
> hc.out
> hc.clusters = cutree(hc.out, 4)
> table(hc.clusters, nci.labs)


We can plot the cut on the dendrogram that produces these four clusters:

> par(mfrow=c(1,1))
> plot(hc.out, labels=nci.labs)
> abline(h=139, col="red")


The argument h=139 plots a horizontal line at height 139 on the dendrogram; this is the height that results in four distinct clusters.

We claimed earlier in Section 10.3.2 that K-means clustering and hierarchical clustering with the dendrogram cut to obtain the same number of clusters can yield very different results. How do these NCI60 hierarchical clustering results compare to what we get if we perform K-means clustering with $K = 4$?

> set.seed(2)
> km.out = kmeans(sd.data, 4, nstart=20)
> km.clusters = km.out$cluster
> table(km.clusters, hc.clusters)


Rather than performing hierarchical clustering on the entire data matrix, we can simply perform hierarchical clustering on the first few principal component score vectors, as follows:

> hc.out = hclust(dist(pr.out\$x[,1:5]))
> plot(hc.out, labels=nci.labs, main="Hier. Clust. on First Five Score Vectors")
> table(cutree(hc.out,4), nci.labs)