# scipy: pdist indexing

Yao Yao on January 31, 2018

scipy.spatial.distance.pdist(X) gives the pair-wise distances of X, $\operatorname{dist}(X[i], X[i])$ not included. The distances are stored in a dense matrix D.

Question: how to get $\operatorname{dist}(X[i], X[j])$ without expanding this dense matrix by scipy.spatial.distance.squareform(D)?

A good answer from How does condensed distance matrix work? (pdist):

def square_to_condensed(i, j, n):
assert i != j, "no diagonal elements in condensed matrix"
if i < j:
i, j = j, i
return n*j - j*(j+1)/2 + i - 1 - j


Take $n$ == X.shape[0] == $4$ as an example. In the code, $i$ and $j$ are swapped if $i < j$, so only consider the bottom triangular region, i.e. $i > j$ always holds in the discussion below.

i\j 0 1 2 3
0 - D[0] D[1] D[2]
1 D[0] - D[3] D[4]
2 D[1] D[3] - D[5]
3 D[2] D[4] D[5] -

Suppose the indexing function is $\operatorname{f}$: $\operatorname{f}(i, j) = k$ when $\operatorname{dist}(X[i], X[j]) = D[k]$

len(D) = ${n \choose 2} = (n-1) + (n-2) + \dots + 1$.

• When $j = 0$, $\operatorname{f}(i, j) = i - j - 1$
• When $j = 1$, $\operatorname{f}(i, j) = (n-1) + (i - j - 1)$
• When $j = 2$, $\operatorname{f}(i, j) = (n-1) + (n-2) + (i - j - 1)$

So the pattern here is:

$\operatorname{f}(i, j) = \text{indices used in previous } (j-1) \text{ columns} + \text{starting index in } j^{\text{th}} \text{ column}$

Further more:

$\operatorname{f}(i, j) = {n \choose 2} - {n-j \choose 2} + (i - j - 1) = nj - \frac{j(j+1)}{2} + i - j - 1$