Fast Zipf Sampling
As part of a small experiment that I am working on, I need to sample numbers according to a Zipf distribution. This is not very hard in itself, especially because the rand_distr crate provides an implementation of the Zipf distribution, which I could quickly plug into my experiment and everything worked fine.
The “problem” arose when I was trying to make my experiment run faster, and discovered that most of the time is spent not in any complex logic, but rather in calls to powf in the rand_distr crate – how can the bottleneck of my program be sampling the Zipf distribution?
Baseline
The Zipf distribution in the rand_distr crate is implemented via rejection sampling, adapted from Java code. Rejection sampling is a technique to sample values from a random distribution that is hard to sample by using an auxiliary distribution that is easy to sample and “encompasses” the original distribution. You can think of it as sampling from an “overestimation,” and then rejecting samples in the parts where the original distribution is overestimated. The chance of rejection depends on the exact discrepency between the two distributions; a higher discrepancy leads to more samples being rejected.
The Rust code for this looks like this (taken from rand_distr):
loop {
let inv_b = self.inv_cdf(rng.sample(StandardUniform));
let x = (inv_b + one).floor();
let mut ratio = x.powf(-self.s);
if x > one {
ratio = ratio * inv_b.powf(self.s)
};
let y = rng.sample(StandardUniform);
if y < ratio {
return x;
}
}
We can see the calls to powf here, but apart from that, nothing much is going on. The code already pre-computes a lot of the constants to avoid computing them in every sampling step, and the auxiliary distribution is already chosen to be tight to the Zipf distribution to avoid many rejections. So how can we speed this up?
Inverse transform sampling
Inverse transform sampling is another way to generate samples according to a specific target distribution. It works by taking a uniform sample \(u\) in \((0..1)\) and then finding the smallest \(x\) such that \(F(x) \ge u\), where \(F\) represents the cumulative distribution function (CDF). For discrete probability distributions, this is not too hard, since we can simply add the single point-wise probabilities to get the CDF. However, it might require many evaluations of the probability density function, which can be costly.
So if this method seems to be costly, and rejection sampling is fast, why would we bother looking at it? For a single sample, we shouldn’t. However, the big benefit here is that we can compute the cutoffs once, and then draw many samples with them:
fn cutoffs<F: Float>(n: F, s: F) -> Result<Vec<F>, Error> {
if !(s >= F::zero()) {
return Err(Error::STooSmall);
}
if !(n >= F::one()) {
return Err(Error::NTooSmall);
}
if n.is_infinite() && s <= F::one() {
return Err(Error::IllDefined);
}
let mut cutoffs = Vec::new();
let mut accum = F::zero();
let mut k = F::one();
while k <= n {
accum = accum + F::one() / k.powf(s);
cutoffs.push(accum);
k = k + F::one();
}
cutoffs.iter_mut().for_each(|c| *c = *c / accum);
Ok(cutoffs)
}
// Assume that this is a method on a struct where self.cutoffs was
// computed by cutoffs() above.
fn sample<R: Rng + ?Sized>(&self, rng: &mut R) -> F {
let sample = rng.sample(StandardUniform);
for (i, c) in self.cutoffs.iter().enumerate() {
if sample <= *c {
return F::from(i + 1).unwrap();
}
}
unreachable!()
}
We can see that the cutoffs function is inefficient in the sense that it needs to compute the probability density function for all values up to \(N\) (which I have already inlined). The trade-off is that sample is very efficient now: We only need one sample from the uniform distribution, and no powf or inv_cdf calls. And this trade-off becomes even better when you have a scenario in which \(N\) is small, as pre-computation then is fast as well!
Faster lookups
We can speed this process up even further by using the fact that the CDF is a monotonically increasing function. This means that cutoffs is a sorted list, and we can use a binary search to have a \(\mathcal{O}(\log N)\) lookup instead of the \(\mathcal{O}(N)\) from our naïve linear lookup:
fn sample<R: Rng + ?Sized>(&self, rng: &mut R) -> F {
let sample = rng.sample(StandardUniform);
let i = self.cutoffs.partition_point(|c| sample > *c);
F::from(i + 1).unwrap()
}
With this, we expect super fast samplings – which we get, for a bit. When we benchmark the code (see below), we actually see that the benefit from the binary search disappears for larger \(N\), and at some point, the binary search actually becomes slower than the linear search. What gives?
The answer lies in the Zipf distribution: The first items are the likeliest and make up a majority of the returned samples. The tail is long, but each of the probabilities there is very small. That means that a linear search will consider the samples with a high probability first, and often end up not searching very far (as it can return early once it found the cutoff point). The binary search however will always do the full \(\log N\) steps.
This brings us to yet another optimization, where we do two searches: First, we consider the “likely” samples. Only if we do not find the cutoff point there will we do a search over the whole list of cutoff points. The threshold for this can be optimized, and depends on the parameters of the distribution. However, even a simple fixed value should provide a good speedup for the start.
Benchmarks
Finally, let’s present the benchmark results. These were obtained on an AMD Ryzen 5 5625U:
For the constructor, we can see the linear increase for our fast implementation. This is expected, as all cutoff points are computed during initialization. This increase also exists for the parameters between 1 and 10, but the difference is overshadowed by the larger values. For the rejection sampling method from rand_distr, the constructor always finishes in the same amout of time – it only needs to compute a few values, but the amout of values it computes is not dependent on the parameters.
The situation for the sampling method looks a bit more complex. First of all, the rejection sampling has a runtime of pretty much constant 30 ns, even for large \(N\). This is the strength of the method, and what we expected (and 30 ns is fast!).
For our inverse transform sampler, we see that our sampling is faster. Again, this is what we expect – we trade a slower initialization for a faster sampling. The linear search performs quite well even for large \(N\), due to the reasons explained above (it often finds the cutoff in the first elements). The binary search variant is even faster than the linear search for smaller \(N\), but scales worse and eventually becomes less efficient than a linear search. The hybrid variant uses a fixed threshold of 10, and (for the tested parameters) is constantly the fastest.
Conclusion
To wrap up, what have we learned?
First, proper profiling – as always – is an important tool. The bottlenecks to your code might live in different areas than you might expect!
Second, what is fast in one case might not be fast in other cases. While the Zipf rejection-sampler is probably the fastest whenever you only need a single sample, certain scenarios (such as sampling a lot of values with small-ish \(N\)) can benefit from other algorithms.
Third, there is more to algorithm runtimes than average-case big-O. In our scenario, a linear search can definitely beat a binary search in speed.
And fourth, inverse transform sampling is a cool name.
The task of computing the optimal threshold for the hybrid approach is left as an exercise for the reader (or a bored future me).
The source code for this sampler and experiment is available on Codeberg.