Recognizing string prefixes with SIMD instructions

Recognizing string prefixes with SIMD instructions

Daniel Lemire's blog

Suppose that I give you a long list of strings (e.g., “A”, “A6”, “AAAA”, “AFSDB”, “APL”, “CAA”, “CDS”, “CDNSKEY”, “CERT”, “CH”, “CNAME”, “CS”, “CSYNC”, “DHC”, etc.). I give you a pointer inside a much larger string and I ask you whether you are pointing at one of these strings, and if so, which one. To make things slightly more complicated, you want the token to be followed by a valid separator (e.g., a space, a semi-colon, etc.) and you want to ignore the case.

How might you solve this efficiently?

Comparing against each string might do well if you have few of them, but it is clearly a bad idea if you have many (say 70).

A reasonable approach is to do a binary search through the sorted list of strings. The C function bsearch is well suited. You need a comparison function as part of the implementation. You may use the C function strncasecmp to compare the strings while ignoring the case, and you add a check to make sure that you only return a match (value 0) when the input has a terminating character at the right position.

Then the linearithmic (O(n log n)) implementation looks like this…

std::string *lookup_symbol(const char *input) {
  return bsearch(input, strings.data(), strings.size(),
  sizeof(std::string), compare);
}

Simple enough.

Another approach is to use a trie. You implement a tree where the first level checks for the first character, the second level for the second character and so forth.

It gets a little bit lengthy, but you can use a script or a library to generate the code. You can use a series of switch/case like so…

switch (s[0]) {
  case 'A': case 'a':
  switch (s[1]) {
    case '6': return is_separator(s[2]) ? 1 : -1;
  case 'A': case 'a':
    switch (s[2]) {
     case 'A': case 'a':
       switch (s[3]) { 
         case 'A': case 'a':
          return is_separator(s[4]) ? 2 : -1;
       default:
         return -1;
...

The running time complexity depends on the data, but is at most the length of the longest string in your set. The trie is a tempting solution but it is extremely branchy: if the processor can predict the upcoming content, it should do well, but if the input is random, you might be unlikely and get poor performance.

With SIMD instructions, you can write a tight implementation that is effectively branchless: its execution does not depend on the input data.

The code works in this manner:

  1. We load unconditionally 16 bytes in a register.
  2. We first find the location of the first separator, if any. We can do this with vectorized classification. It is a significant cost.
  3. We set to zero all bytes in the register starting from this first separator. We also switch all characters in A-Z to a lower case.
  4. We use a hash function to map the processed bytes to a table containing our strings in 16-byte blocks. The hash function is designed in a such a way that if the input matches one of our set elements, then it should be mapped to an identical value. We can derive the hash function empirically (by trial and error). Computing the hash function is a significant cost.
  5. We compare the processed input with the loaded value from the hash function.

The C function written using Intel intrinsic functions is as follows:

bool sse_type(const char *type_string, uint8_t *type) {
  __m128i input = _mm_loadu_si128((const __m128i *)type_string);
  __m128i high_mask =
  _mm_setr_epi8(0x00, 0x00, 0x01, 0x02, 0x04, 0x08, 0x04, 0x08, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00);
  __m128i low_mask = _mm_setr_epi8(
0x02 | 0x08, 0x0e | 0x08, 0x0e | 0x08, 0x0e | 0x08, 0x0e | 0x08,
0x0e | 0x08, 0x0e | 0x08, 0x0e | 0x08, 0x0e | 0x08, 0x0e | 0x08, 0x0a,
0x04, 0x05, 0x04 | 0x01, 0x04, 0x04);
  __m128i high_shuffled = _mm_shuffle_epi8(
  high_mask, _mm_and_si128(_mm_srli_epi16(input, 4), _mm_set1_epi8(0x0f)));
  __m128i low_shuffled = _mm_shuffle_epi8(low_mask, input);
  __m128i intersect = _mm_and_si128(high_shuffled, low_shuffled);
  __m128i alnum = _mm_cmpgt_epi8(intersect, _mm_setzero_si128());
  input = _mm_and_si128(_mm_or_si128(input, _mm_set1_epi8(0x20)), alnum);
  uint16_t delim = ~(uint16_t)_mm_movemask_epi8(alnum);
  uint16_t len = (uint16_t)__builtin_ctz(delim | (1 << 15));
  __m128i zero_m = _mm_loadu_si128((const __m128i *)(zero_masks + 16 - len));
  input = _mm_andnot_si128(zero_m, input);
  uint8_t idx = hash((uint64_t)_mm_cvtsi128_si64(input));
*type = idx;
  __m128i compar = _mm_loadu_si128((const __m128i *)buffers[idx]);
  __m128i xorthem = _mm_xor_si128(compar, input);
  return _mm_test_all_zeros(xorthem, xorthem);
}

I wrote a benchmark where we repeatedly try to check for matching strings. I run it on an Ice Lake server, and the code is compiled with GCC 11 (targeting an old processor, Westmere).

In this particular test, the SIMD-based approach is three times faster than the trie despite the fact that it retires more instructions. The trie struggles with branch mispredictions. The binary search is disastrous in this case, being more than 10 times slower.

My code is available.

Credit: The problem was suggested to me by Jeroen Koekkoek from NLnet Labs. He sketched part of the solution. I want to thank GitHub user @aqrit for their comments.

Further reading: Is Prefix Of String In Table? by Nelson.

Generated by RSStT. The copyright belongs to the original author.

Source

Report Page