How fast is rolling Karp-Rabin hashing?

How fast is rolling Karp-Rabin hashing?

Daniel Lemire's blog

A hash function maps values (e.g., strings) into a fixed number of strings, typically smaller than the original. It is useful to compare quickly two long strings, for example. Instead of comparing the strings, you may compare the hash values.

A simple hash function consists in repeatedly multiplying the hash value by some constant B (e.g., you may pick a number such as 31):

uint32_t hash = 0;
for (size_t i = 0; i < len; i++) {
  hash = hash * B + data[i];
}
return hash;

I credit Karp-Rabin for this type of hash functions. It is an example of recursive hashing: the hash function is computed by taking the hash value and updating it with the next character.

Given a long strings, you may want to hash all sequences of N characters. A naive approach might be as follows:

for(size_t i = 0; i < len-N; i++) {
  uint32_t hash = 0;
  for(size_t j = 0; j < N; j++) {
    hash = hash * B + data[i+j];
  }
  //...
}

You are visiting each character value N times. If N is large, it is inefficient.

You can do better using a rolling hash function: instead of recomputing the hash function anew each time, you just update it. It is possible to only access each character twice:

uint32_t BtoN = 1;
for(size_t i = 0; i < N; i++) { BtoN *= B; }

uint32_t hash = 0;
for(size_t i = 0; i < N; i++) {
  hash = hash * B + data[i];
}
// ...
for(size_t i = N; i < len; i++) {
  hash = hash * B + data[i] - BtoN * data[i-N];
  // ...
}

The most expensive component of this routine is the line with two character accesses and two multiplications.

I wrote a simple benchmark in C++ to see how fast a straight-forward implementation could be. I use a set window of size 8 for the purpose of this analysis, but the larger the window, the less competitive the naive implementation becomes.

Karp-Rabin is not the only type hash functions. Tabulated hashing is another approach that would be much more compute efficient.

However, I suspect we could greatly improve my naive implementation of the Karp-Rabin rolling hash function.

Further reading:

Possibly relevant software:

Generated by RSStT. The copyright belongs to the original author.

Source

Report Page