Converting ASCII strings to lower case at crazy speeds with …

Converting ASCII strings to lower case at crazy speeds with …

Daniel Lemire's blog

AMD Zen 4 and Zen 5, as well as server-side recent Intel processors, support an advanced set of instructions called AVX-512. They are powerful SIMD (Single Instruction, Multiple Data) instructions. Importantly, they allow ‘masked’ operations. That is, you can compute a mask and only do an operation on bytes indicated by the mask. Thus you can easily store only the first k bytes of a block of 64 bytes of memory as one instruction.

Tony Finch recently described how you can take an ASCII string of arbitrary length and convert them to lower case quickly using AVX-512. Finch’s results is that for both tiny and large strings, the AVX-512 approach is faster. In his work, Finch assumes that the length of the string is known up front. However, C strings are stored as a pointer to the beginning of the string with a null character (\0) indicating its end. Thus the string love is stored in memory as love\0.

Can we extend his work to C strings?

With AVX-512 is that you can load 64 bytes at a time, instead of loading individual bytes. In general, it is unsafe to read beyond the scope of allocated memory. It may crash your application if you are loading into a memory page that does not belong to your process. How do you know when to stop reading blocks of 64 bytes?

The trick is that it is always safe to do aligned loads. That is, if you load at an address that is divisible by 64 bytes, you will never cross a memory page because memory pages are always divisible by 64 on Intel and AMD systems.

To convert ASCII letters to lower case, we use the fact that the letters from A to Z in ASCII are in a continuous range as code point values (values stored in memory), and so are the letters from a to z. Thus if you can identify the upper case letters, it suffices to add a constant to them to make them lower case.

Finch wrote a function which converts 64 ASCII bytes to lower case when a block of 64 bytes (c) has been loaded:

static inline __m512i tolower64(__m512i c) {
  __m512i A = _mm512_set1_epi8('A');
  __m512i Z = _mm512_set1_epi8('Z');
  __m512i to_lower = _mm512_set1_epi8('a' - 'A');
  __mmask64 ge_A = _mm512_cmpge_epi8_mask(c, A);
  __mmask64 le_Z = _mm512_cmple_epi8_mask(c, Z);
  __mmask64 is_upper = _kand_mask64(ge_A, le_Z);
  return (_mm512_mask_add_epi8(c, is_upper, c, to_lower));
}

This function efficiently converts a 64-byte block of characters (represented as a __m512i vector) to lowercase using SIMD instructions. The variables A and Z are vectors filled with the characters ‘A’ and ‘Z’ respectively. The variable to_lower contains the difference between ‘a’ and ‘A’ which is 32. The variable ge_A is a mask where bits are set to 1 if the corresponding element is greater than or equal to ‘A’. The variable le_Z is a mask where bits are set to 1 if the corresponding element is less than or equal to ‘Z’. The variable is_upper combines the two masks to identify characters that are both greater than or equal to ‘A’ and less than or equal to ‘Z’, indicating uppercase letters. In the final step, we add the value to_lower only for the values identified by the mask is_upper. This effectively converts uppercase letters to lowercase.

Of course, we still need to use this function to process an actual string, not a block of 64 bytes. Let us first consider a naive function that does the same task, character by character:

size_t lower(char *srcorig) {
  char *p = srcorig;
  for (; *p; ++p) {
    *p = *p > 0x40 && *p < 0x5b ? *p | 0x20 : *p;
  }
  return p - srcorig;
}

This function uses the fact that instead of an addition, we can just do a bitwise OR to change the case of an ASCII letter. In this particular case, we do not null terminated the result but we return the length of the string.

Let us now consider a possible AVX-512 implementation.

size_t lower64(const char *srcorig, char *dstorig) {
  uintptr_t address = reinterpret_cast<uintptr_t>(srcorig);
  uintptr_t aligned_address = address / 64 * 64;     // round down
  uintptr_t notincluded = address - aligned_address; // [0,64)
  const char *src;
  if(notincluded) {
    src = reinterpret_cast<const char *>(aligned_address);
    __mmask64 init_mask = _cvtu64_mask64((~UINT64_C(0)) << notincluded);
    __m512i src_v = _mm512_maskz_loadu_epi8(init_mask, src);
    __mmask64 is_zero =
        _mm512_mask_cmpeq_epu8_mask(init_mask, src_v, _mm512_setzero_si512());
    __m512i dst_v = tolower64(src_v);
    if (is_zero) {
      __mmask64 zero_mask = (is_zero - 1) ^ is_zero;
      _mm512_mask_storeu_epi8(dstorig - notincluded, zero_mask, dst_v);
      return __tzcnt_u64(is_zero) + (src - srcorig);
    }
    _mm512_mask_storeu_epi8(dstorig - notincluded, init_mask, dst_v);
    src += 64;
    dstorig += 64 - notincluded;
  } else { // fast path
    src = reinterpret_cast<const char *>(srcorig);
    __m512i src_v = _mm512_loadu_epi8(src);
    __mmask64 is_zero =
        _mm512_cmpeq_epu8_mask(src_v, _mm512_setzero_si512());
    __m512i dst_v = tolower64(src_v);
    if (is_zero) {
      __mmask64 zero_mask = (is_zero - 1) ^ is_zero;
      _mm512_mask_storeu_epi8(dstorig - notincluded, zero_mask, dst_v);
      return __tzcnt_u64(is_zero);
    }
    _mm512_storeu_epi8(dstorig, dst_v);
    src += 64;
    dstorig += 64;
  }

  while (true) {
    __m512i src_v = _mm512_loadu_epi8(src);
    __m512i dst_v = tolower64(src_v);
    __mmask64 is_zero = _mm512_cmpeq_epu8_mask(src_v, _mm512_setzero_si512());
    if (is_zero) {
      __mmask64 zero_mask = (is_zero - 1) ^ is_zero;
      _mm512_mask_storeu_epi8(dstorig, zero_mask, dst_v);
      return __tzcnt_u64(is_zero) + (src - srcorig);
    }
    _mm512_storeu_epi8(dstorig, dst_v);
    src += 64;
    dstorig += 64;
  }
}

The code converts a string of characters to lowercase using AVX-512 instructions. It works in 64-byte chunks for efficiency. We have two pointers are parameters, srcorig is a pointer to the original source string, dstorig is a pointer to the destination buffer for the lowercase string. Initially we calculates the alignment offset of srcorig to a 64-byte boundary. We initialize pointers and masks based on the alignment offset. We have a fast path for the case where the string is already aligned on a 64-byte boundary. Initially, we load a 64-byte chunk into a __m512i vector, possibly reading prior to the beginning of the string. We converts the chunk to lowercase using tolower64.  We also check if an element is null, if that is the case, we will store and return a string of length smaller than 64 bytes. In the main loop, we process process 64-byte chunks in a loop until a null character is encountered. That is, we load a 64-byte chunk into an __m512i vector, we convert the chunk to lowercase using tolower64. We check if the loaded chunk contains a null character and ends the process if that is the case, calculating and returning the number of processed characters. If not, we store the converted chunk to the destination buffer.

The gotcha with this approach is that you will read before the beginning of the string if it is not already aligned on a 64-byte boundary and some tools might warn you. However, the code remains safe. You just have to tell your tool that the warnings should be omitted.

How fast is the AVX-512 code? I am using an Intel Ice Lake processor and LLVM 16. In my benchmark, I use fixed strings of various size. My benchmark repeatedly processes the same string which omits the branch mispredictions that would occur in practice, so the real speed might be lower.

Thus, as you can see, the AVX-512 can be 20 times faster than the conventional approach on small strings while remaining competitive on tiny strings.

To my knowledge, on the AVX-512 instruction set allows this magical performance. It is significant advantage for recent AMD and Intel processors. Sadly, Intel no longer include AVX-512 in its non-server processors.

Credit: I chatted with Robert Clausecker on these issues about a year ago.

Generated by RSStT. The copyright belongs to the original author.

Source

Report Page