Date: Oct 21, 2019
If you search “rank numbers using numpy”, you will find various resources offering a mysterious solution:
Yes, that is it, and it works. But, why? As I could not find an answer to that, I will try to bring some clarity here.
Before we start, what is ranking? For the ascending case, ranking is assigning values to numbers such that each value denotes the index of its corresponding number in a would-be sorted list. For example, we want rank([1, 2, 0, 3])
to return [1, 2, 0, 3]
. In other words, we want the order of each number.
Quoting numpy.argsort
documentation, “[argsort] Returns the indices that would sort an array.”. For simplicity, let us assume that our array does not have duplicates.
>>> import numpy as np
>>> nums = np.array([3, 5, 0, 10])
>>> nums.argsort()
array([2, 0, 1, 3])
>>> nums.argsort().argsort()
array([1, 2, 0, 3])
The final result indeed corresponds to rank of each number.
Let as := nums.argsort() and 2as := as.argsort().
1) Define idx(i, array) to be the index of i in array e.g. idx(2, [0, 2, 1]) = 1.
2) From the definition of argsort
, we know that ∀i, j ∈ as, idx(i, as) > idx(j, as) ⇒ nums[i] > nums[j].
∀i, j ∈ 2as, i > j ⇒ 2as[idx(i, 2as)] > 2as[idx(j, 2as)] - follows from (1)
⇒ idx(idx(i, 2as), as) > idx(idx(j, 2as), as) - see the lemma below
⇒ nums[idx(i, 2as)] > nums[idx(j, 2as)] - follows from (2)
What this tells us is higher value in 2as implies higher value for the corresponding number when nums is indexed into. Combining this with the fact that argsort
outputs are a permutation of numbers from 0 to length of the array - 1, we conclude the proof.
This is the crux of the proof. Let’s consider the example above but this time make the indices more explicit.
nums -> (0, 3), (1, 5), (2, 0), (3, 10)
as -> (0, 2), (1, 0), (2, 1), (3, 3)
2as -> (0, 1), (1, 2), (2, 0), (3, 3)
Notice how the indices and values of as seem to swap as they map to 2as. For instance, (0, 2) in as is now (2, 0) in 2as, while the (index) value in 2as is pointing at the original value in as.
Prove: A.argsort()[i] = idx(i, A) when A = permutation({0, ..., len(A) − 1}). In other words, each value in the argsorted list corresponds to the index of the same value in the original list.
Given i ∈ A, we know that there are exactly i other numbers smaller than i in A. Then by definition of argsort
, A.argsort()[i]
holds the index of i i.e. idx(i, A).
While this is an interesting way to rank numbers, there are more efficient ways that sort once instead of twice (with additional linear time processing).
Another interesting question is what happens if we continue calling argsort
on the result itself. Continuing with our example above, we get:
>>> nums.argsort()
array([2, 0, 1, 3])
>>> nums.argsort().argsort()
array([1, 2, 0, 3])
>>> nums.argsort().argsort().argsort()
array([2, 0, 1, 3])
>>> nums.argsort().argsort().argsort().argsort()
array([1, 2, 0, 3])
A cycle? Yes, interestingly enough, we have already proven why. Our lemma above proves that as long as the argsort
input is a permutation of integers (in the array’s index domain), then the output will be a swap of the values and the indices; combining this with the fact that the prerequisite input property is an invariant under argsort
, we get the cyclic property.