# R apply family

Yao Yao on June 27, 2014 • Published in category R

## 1. lapply: Apply a Function to Each Element of a List or Vector

lapply(X, func, ...) can be understood as:

List<T> result = ...;

for (T xn : X) {
    result.add(func(xn, ...));
}

return result;


If X is not a list, it will be coerced to a list using as.list.

if (!is.vector(X) || is.object(X))
    X <- as.list(X)


lapply always returns a list, regardless of the class of the input.

Anonymous functions are common with the apply family. For example, lapply(x, function(x) x[,1]) takes the first column of every matrix in a list of matrices.
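As a small sketch of the points above (the list `mats` and its contents are made up for illustration), the following applies an anonymous function to a list of matrices and shows that the result is a list even when the input is a plain vector:

```r
# Hypothetical input: a list of two small matrices
mats <- list(
  m1 = matrix(1:4, nrow = 2),   # columns: (1, 2) and (3, 4)
  m2 = matrix(5:10, nrow = 2)   # columns: (5, 6), (7, 8) and (9, 10)
)

# Take the first column of every matrix
first_cols <- lapply(mats, function(m) m[, 1])

# A plain vector is coerced to a list first, and a list comes back
squares <- lapply(1:3, function(i) i^2)
```

The names of the input list (`m1`, `m2`) carry over to the result, which is often handy for later lookups.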

## 2. sapply: Simplify the Result of lapply

The simplification rule is:

• If the result is a list where every element has length 1, a vector is returned.
• If the result is a list where every element is a vector of the same length (> 1), a matrix is returned.
• If it can’t figure things out, a list is returned.
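The three rules can be seen side by side in a short sketch (the inputs are arbitrary and chosen only to trigger each case):

```r
# Every result has length 1 -> simplified to a vector
lens <- sapply(list(a = 1:3, b = 1:5), length)

# Every result has the same length (> 1) -> simplified to a matrix,
# with one column per input element
rngs <- sapply(list(a = 1:3, b = 1:5), range)

# Results have different lengths -> no simplification, a list is returned
seqs <- sapply(1:3, seq_len)
```

Because the shape of the result depends on the data, it is worth checking with is.matrix or is.list before relying on it.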

> tests <- lapply(scores, t.test) ## with sapply the result would be simplified to a matrix, which is awkward to work with
> sapply(tests, function(t) t$conf.int) ## the function simply extracts t$conf.int from each test result


> sapply(batches, class)
   batch   clinic    dosage shrinkage
"factor" "factor" "integer" "numeric"


### 2.1 sapply example: Removing low-correlation variables from a set of predictors

Suppose that resp is a response variable (a vector) and pred is a data frame of predictor variables. Suppose further that we have too many predictors and therefore want to select the top 10 as measured by correlation with the response.

The first step is to calculate the correlation between each predictor and response. In R, that’s a one-liner:

> cors <- sapply(pred, cor, y=resp)


Any arguments beyond the second one in sapply are passed to cor, so the function call will be cor(pred[[i]],y=resp), which calculates the correlation between the given column and resp.

The result cors is a vector of correlations, one for each column. We use the rank function to find the positions of the correlations that have the largest magnitude:

> mask <- (rank(-abs(cors)) <= 10)


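rank gives the position each element would take if the vector were sorted in ascending order, and returns those positions as a vector. For example:

> rank(c(4,6,5))
[1] 1 3 2  ## 4 ranks first, 6 ranks third, 5 ranks second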


Using mask, we can select just those columns from the data frame:

> best.pred <- pred[,mask]


At this point, we can regress resp against best.pred, knowing that we have chosen the predictors with the highest correlations:

> lm(resp ~ ., data = best.pred)
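The whole selection can be tried end to end on simulated data. This is only a sketch: the data, the number of predictors, and the coefficients are invented for illustration. It passes the selected predictors through the data argument of lm, which is one way to regress against every column of a data frame.

```r
set.seed(1)                                             # reproducible simulation
n <- 100
pred <- as.data.frame(matrix(rnorm(n * 20), nrow = n))  # 20 made-up predictors V1..V20
resp <- pred$V1 + 2 * pred$V2 - pred$V3 + rnorm(n)      # response driven by three of them

cors <- sapply(pred, cor, y = resp)   # one correlation per column
mask <- rank(-abs(cors)) <= 10        # TRUE for the 10 largest magnitudes
best.pred <- pred[, mask]             # keep only those columns

# `.` stands for "all columns of the data argument"
fit <- lm(resp ~ ., data = best.pred)
```

Negating abs(cors) before ranking is what turns "largest magnitude" into "smallest rank", so the comparison with 10 keeps the strongest predictors.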


### 2.2 vapply: Safer sapply

vapply is similar to sapply, but has a pre-specified type of return value, so it can be safer (and sometimes faster) to use.
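A brief sketch of what "pre-specified" means in practice (the inputs are arbitrary): the FUN.VALUE argument is a template for a single result, and vapply stops with an error when a result does not match it.

```r
x <- list(a = 1:3, b = 4:6)

# FUN.VALUE declares that every result must be a single numeric value
means <- vapply(x, mean, FUN.VALUE = numeric(1))

# On empty input sapply falls back to an empty list,
# while vapply still returns a vector of the declared type
empty_s <- sapply(list(), mean)
empty_v <- vapply(list(), mean, FUN.VALUE = numeric(1))

# range returns two values, which does not match numeric(1), so vapply fails
bad <- tryCatch(vapply(x, range, FUN.VALUE = numeric(1)),
                error = function(e) "error")
```

The empty-input case is the usual reason to prefer vapply inside functions: the return type no longer depends on the data.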

## 3. mapply: Apply a Function to Parallel Vectors or Lists (a Multivariate Version of sapply)

> l1 <- list(a = c(1:10), b = c(11:20))
> l2 <- list(c = c(21:30), d = c(31:40))
> mapply(sum, l1$a, l1$b, l2$c, l2$d)
[1]  64  68  72  76  80  84  88  92  96 100


sapply(l1$a, sum) sapply(l1$b, sum)
sapply(l2$c, sum) sapply(l2$d, sum)


for (int i = 1; i <= 10; ++i) {
    list.add(sum(l1$a[i], l1$b[i], l2$c[i], l2$d[i]));
}
return list;


mapply works on multiple vectors as well as on multiple lists:

> mapply(f, vec1, vec2, ..., vecN)
> mapply(f, list1, list2, ..., listN)
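Both call shapes can be sketched with small made-up inputs. The anonymous functions here are placeholders for the generic f above.

```r
# Element-wise over two vectors: f(1, 4), f(2, 5), f(3, 6)
sums <- mapply(function(x, y) x + y, 1:3, 4:6)

# Element-wise over two lists: each call receives one element from each list
tot <- mapply(function(u, v) sum(u) + sum(v),
              list(1:2, 3:4), list(5:6, 7:8))

# Results of different lengths are not simplified:
# rep(1, 3), rep(2, 2), rep(3, 1) come back as a list
reps <- mapply(rep, 1:3, 3:1)
```

Like sapply, mapply simplifies the result when it can, so the first two calls give vectors and the last one gives a list.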


## 4. apply: Apply a Function over Array Margins (e.g. to Every Row or to Every Column)

apply(X, MARGIN, FUN, ...)

> x <- array(rep(1, 24), c(2, 3, 4))
> x
, , 1

     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    1    1    1

, , 2

     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    1    1    1

, , 3

     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    1    1    1

, , 4

     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    1    1    1


> x <- matrix(rep(1, 6), nrow=2, ncol=3)
> x
     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    1    1    1
> apply(x, 1, sum)
[1] 3 3
> apply(x, 2, sum)
[1] 2 2 2
> apply(x, c(1, 2), sum)
     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    1    1    1


> x <- array(rep(1, 24), c(2, 3, 4))
> x
, , 1

     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    1    1    1

, , 2

     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    1    1    1

, , 3

     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    1    1    1

, , 4

     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    1    1    1

> apply(x, 1, sum)
[1] 12 12
> apply(x, 2, sum)
[1] 8 8 8
> apply(x, 3, sum)
[1] 6 6 6 6
> apply(x, c(1, 2), sum)
     [,1] [,2] [,3]
[1,]    4    4    4
[2,]    4    4    4
> apply(x, c(1, 3), sum)
     [,1] [,2] [,3] [,4]
[1,]    3    3    3    3
[2,]    3    3    3    3
> apply(x, c(2, 3), sum)
     [,1] [,2] [,3] [,4]
[1,]    2    2    2    2
[2,]    2    2    2    2
[3,]    2    2    2    2
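Because every entry above is 1, the margins are only distinguishable by the result length. A sketch with distinct values (the matrix `m` is made up for illustration) makes the direction of each margin easier to see:

```r
m <- matrix(1:6, nrow = 2)     # rows: (1, 3, 5) and (2, 4, 6)

row_sums <- apply(m, 1, sum)   # MARGIN = 1: one result per row
col_sums <- apply(m, 2, sum)   # MARGIN = 2: one result per column

# When FUN returns a vector, each result becomes a COLUMN of the output,
# so applying range over the 2 rows yields a 2 x 2 matrix
row_rng <- apply(m, 1, range)
```

The last call is a common surprise: applying over rows does not give one output row per input row, so t() is often needed afterwards.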


For sums and means of matrix dimensions, we have some shortcuts.

• rowSums = apply(x, 1, sum)
• rowMeans = apply(x, 1, mean)
• colSums = apply(x, 2, sum)
• colMeans = apply(x, 2, mean)

The shortcut functions are much faster because they are specially optimized.
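The equivalences in the list can be checked on a small random matrix. This sketch only verifies that the results agree; it does not measure speed, and the matrix is made up.

```r
set.seed(42)
m <- matrix(runif(12), nrow = 3)   # made-up 3 x 4 numeric matrix

same_row_sums  <- all.equal(rowSums(m),  apply(m, 1, sum))
same_row_means <- all.equal(rowMeans(m), apply(m, 1, mean))
same_col_sums  <- all.equal(colSums(m),  apply(m, 2, sum))
same_col_means <- all.equal(colMeans(m), apply(m, 2, mean))
```

all.equal is used instead of identical because the two code paths may differ in the last floating-point digits.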

## 5. tapply: Apply a Function over a Ragged Array (i.e. lapply after splitting a column)

tapply(X, INDEX, FUN = NULL, ..., simplify = TRUE)

• X: an atomic object, typically a vector.
• INDEX: list of one or more factors, each of same length as X. The elements are coerced to factors by as.factor.
• FUN: the function to be applied, or NULL. In the case of functions like +, %*%, etc., the function name must be backquoted or quoted. If FUN is NULL, tapply returns a vector which can be used to subscript the multi-way array tapply normally produces.
• simplify: If FALSE, tapply always returns an array of mode “list”. If TRUE (the default), then if FUN always returns a scalar, tapply returns an array with the mode of the scalar.

> X <- 1:9
> INDEX <- factor(c('a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c'))


• a: 1, 2, 3, 4
• b: 5, 6, 7
• c: 8, 9

The combination of a vector and a labelling factor is an example of what is sometimes called a ragged array, since the subclass sizes are possibly irregular. When the subclass sizes are all the same the indexing may be done implicitly and much more efficiently.

• a: ========
• b: ======
• c: ==

> tapply(X, INDEX, sum)
 a  b  c
10 18 17
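The effect of the function's return value on the result type, described under simplify above, can be sketched with the same X and INDEX:

```r
X <- 1:9
INDEX <- factor(c('a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c'))

group_means <- tapply(X, INDEX, mean)    # scalar per group -> named 1-d array
group_rngs  <- tapply(X, INDEX, range)   # vector per group -> array of mode "list"
group_sizes <- tapply(X, INDEX, length)  # sizes show the ragged shape: 4, 3, 2
```

Individual groups can then be picked out by level name, for example group_rngs[["a"]].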


## 6. split: Split a Vector (or list) or Data Frame into Groups by a Factor or List of Factors

> X <- 1:30
> INDEX <- gl(3, 10) ## Generate Levels: ten 1s, ten 2s, ten 3s; levels = 1, 2, 3
> split(X, INDEX)
$`1`
 [1]  1  2  3  4  5  6  7  8  9 10

$`2`
 [1] 11 12 13 14 15 16 17 18 19 20

$`3`
 [1] 21 22 23 24 25 26 27 28 29 30

Conceptually, tapply(X, INDEX, fun) == lapply(split(X, INDEX), fun), except that tapply also simplifies the result when it can:

> lapply(split(X, INDEX), sum)
$`1`
[1] 55

$`2`
[1] 155

$`3`
[1] 255

> X <- 1:10
> INDEX_1 <- as.factor(c(rep('a', 5), rep('b', 5)))
> INDEX_2 <- gl(5, 2)
> INDEX_1
 [1] a a a a a b b b b b
Levels: a b
> INDEX_2
 [1] 1 1 2 2 3 3 4 4 5 5
Levels: 1 2 3 4 5
> str(split(X, INDEX_1))
List of 2
 $ a: int [1:5] 1 2 3 4 5
 $ b: int [1:5] 6 7 8 9 10
> str(split(X, INDEX_2))
List of 5
 $ 1: int [1:2] 1 2
 $ 2: int [1:2] 3 4
 $ 3: int [1:2] 5 6
 $ 4: int [1:2] 7 8
 $ 5: int [1:2] 9 10
> str(split(X, list(INDEX_1, INDEX_2)))
List of 10
 $ a.1: int [1:2] 1 2
 $ b.1: int(0)
 $ a.2: int [1:2] 3 4
 $ b.2: int(0)
 $ a.3: int 5
 $ b.3: int 6
 $ a.4: int(0)
 $ b.4: int [1:2] 7 8
 $ a.5: int(0)
 $ b.5: int [1:2] 9 10
> str(split(X, list(INDEX_1, INDEX_2), drop = TRUE))
List of 6
 $ a.1: int [1:2] 1 2
 $ a.2: int [1:2] 3 4
 $ a.3: int 5
 $ b.3: int 6
 $ b.4: int [1:2] 7 8
 $ b.5: int [1:2] 9 10

Alternatively, you can use the unstack function:

> groups <- split(x, f)
> groups <- unstack(data.frame(x,f))

Both functions return a list of vectors, where each vector contains the elements for one group. The unstack function goes one step further: if all vectors have the same length, it converts the list into a data frame.

## 7. by: Apply a Function to Groups of Rows (i.e. lapply after splitting a data frame)

Splitting a column gives a list of vectors; splitting a data frame gives a list of data frames. So by(dfrm, factor, fun) first splits dfrm by factor and then runs fun over the resulting list of data frames with lapply. Much like tapply, it can be understood as:

by(dfrm, factor, fun) == lapply(split(dfrm, factor), fun)

The function must therefore accept a data frame as its argument. A common function that qualifies is summary, and the two are often used together, for example:

> by(trials, trials$sex, summary)


> models <- by(trials, trials$sex, function(df) lm(post~pre+dose1+dose2, data=df)) ## models is a list of linear models
> lapply(models, confint) ## print confidence intervals of each linear model
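Since the trials data frame is not shown here, the same pattern can be sketched on mtcars, a data set that ships with R. The choice of mtcars, of cyl as the grouping variable, and of the formula mpg ~ wt are assumptions made purely for illustration.

```r
# One linear model per cylinder group (4, 6 or 8 cylinders)
models <- by(mtcars, mtcars$cyl, function(df) lm(mpg ~ wt, data = df))

# One matrix of confidence intervals per model
cis <- lapply(models, confint)

# The same models built explicitly with split + lapply
models2 <- lapply(split(mtcars, mtcars$cyl),
                  function(df) lm(mpg ~ wt, data = df))
```

The fitted coefficients of models and models2 agree group by group, which is the by / split + lapply correspondence described in this section.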