Transcoding Latin 1 strings to UTF-8 strings at 12 GB/s using AVX-512

Daniel Lemire's blog

Though most strings online today follow the Unicode standard (e.g., using UTF-8), the Latin 1 standard is still in widespread use inside some systems (such as browsers). It captures the first 256 characters of the Unicode standard and represents each of them as a single byte.

In a previous blog post, we examined how we could convert UTF-8 strings to Latin 1. I want to examine the reverse problem: the transcoding of Latin 1 strings to UTF-8. We receive a byte array, and each byte is one character. There is no need for any validation. However, some characters (the ASCII characters) are mapped to one byte, whereas all others must be mapped to two bytes. For example, the Latin 1 byte 0xE9 ('é') becomes the two UTF-8 bytes 0xC3 0xA9.

You can code it in C using a routine such as this one:

unsigned char byte = data[pos];
if ((byte & 0x80) == 0) { // if ASCII
  // will generate one UTF-8 byte
  *utf8_output++ = (char)(byte);
  pos++;
} else {
  // will generate two UTF-8 bytes
  *utf8_output++ = (char)((byte >> 6) | 0b11000000);
  *utf8_output++ = (char)((byte & 0b111111) | 0b10000000);
  pos++;
}
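
For reference, here is the same loop wrapped into a complete function. This is a minimal sketch: the name latin1_to_utf8 and the assumption that the output buffer can hold two bytes per input byte (the worst case) are mine, not from the original code.

#include <stddef.h>

// Sketch: the branchy routine above as a self-contained function.
// Assumes utf8_output has room for up to 2 * len bytes.
size_t latin1_to_utf8(const unsigned char *data, size_t len,
                      char *utf8_output) {
  char *start = utf8_output;
  for (size_t pos = 0; pos < len; pos++) {
    unsigned char byte = data[pos];
    if ((byte & 0x80) == 0) { // ASCII: one UTF-8 byte
      *utf8_output++ = (char)(byte);
    } else { // non-ASCII: two UTF-8 bytes
      *utf8_output++ = (char)((byte >> 6) | 0b11000000);
      *utf8_output++ = (char)((byte & 0b111111) | 0b10000000);
    }
  }
  return (size_t)(utf8_output - start); // bytes written
}

Transcoding the two Latin 1 bytes 0x61 0xE9 (“aé”) then yields the three UTF-8 bytes 0x61 0xC3 0xA9.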

Can we do better using the new AVX-512 instructions available on recent Intel (e.g., Ice Lake) and AMD (e.g., Zen 4) processors?

We can try to do it as follows:

  1. Load 32 bytes into a register.
  2. Identify all the non-ASCII bytes (they have the most significant bit set to 1).
  3. Identify the bytes that have their two most significant bits set by comparing with 192.
  4. Widen (zero-extend) all bytes to 16-bit words, filling a 64-byte register.
  5. Add 0xc200 to all 16-bit words, as the most significant byte in a 16-bit word must be either 0xc2 or 0xc3.
  6. Add 0x00c0 to the 16-bit words where the corresponding byte had its two most significant bits set; the carry moves a bit up so that the most significant byte becomes 0xc3 (see the scalar sketch after this list).
  7. Flip the order of the bytes within each 16-bit word, since the UTF-8 lead byte must come first in memory while the register is little endian.
  8. Remove one byte (‘compress’) where we had an ASCII byte.
  9. Store the result (can be between 32 bytes and 64 bytes).
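
Steps 5 and 6 do double duty: adding 0xc200 makes the high byte the 0xc2 lead byte, and the low byte of an input in the range 0x80-0xBF is already a valid continuation byte; for input bytes at 192 and above, adding 0x00c0 subtracts 0x40 from the low byte (making it a continuation byte) while the carry bumps the lead byte to 0xc3. Here is a scalar sketch of the same computation on a single byte (encode_word is a hypothetical helper, for illustration only):

#include <stdint.h>
#include <stdio.h>

// Models steps 5 and 6 for one Latin 1 byte in the non-ASCII range.
static uint16_t encode_word(uint8_t b) {
  uint16_t w = (uint16_t)(0xc200 + b); // step 5: high byte becomes 0xc2
  if (b >= 192) {
    w += 0x00c0; // step 6: the carry bumps the high byte to 0xc3
  }
  return w; // high byte: UTF-8 lead byte, low byte: continuation byte
}

int main(void) {
  uint16_t w = encode_word(0xE9); // 'é' in Latin 1
  printf("%02x %02x\n", (unsigned)(w >> 8), (unsigned)(w & 0xff)); // c3 a9
  return 0;
}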

Using Intel intrinsics, the result might look as follows:

__m256i input = _mm256_loadu_si256((__m256i *)(buf + pos));
// one mask bit per byte, set when the byte is non-ASCII (step 2)
__mmask32 nonascii = _mm256_movepi8_mask(input);
int output_size = 32 + _popcnt32(nonascii);
// compression mask (step 8): keep the payload byte of every word,
// and the lead byte only where the input byte was non-ASCII
uint64_t compmask = ~_pdep_u64(~nonascii, 0x5555555555555555);
// bytes with their two most significant bits set (step 3)
__mmask32 sixth =
    _mm256_cmpge_epu8_mask(input, _mm256_set1_epi8(192));
__m512i input16 = _mm512_cvtepu8_epi16(input); // step 4
input16 = _mm512_add_epi16(input16, _mm512_set1_epi16(0xc200)); // step 5
__m512i output16 =
    _mm512_mask_add_epi16(input16, sixth, input16,
                          _mm512_set1_epi16(0x00c0)); // step 6
output16 = _mm512_shuffle_epi8(output16, byteflip); // step 7
__m512i output = _mm512_maskz_compress_epi8(compmask, output16); // step 8
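
Two pieces are left out of the snippet: the byteflip shuffle constant and the final store (step 9). A plausible completion (my guess, not the original code) follows; the 16 shuffle indices repeat in every 128-bit lane because vpshufb shuffles within lanes, and a masked store is one way to write only the output_size valid bytes:

// byteflip swaps the two bytes of each 16-bit word (step 7)
const __m512i byteflip = _mm512_setr_epi64(
    0x0607040502030001, 0x0e0f0c0d0a0b0809,
    0x0607040502030001, 0x0e0f0c0d0a0b0809,
    0x0607040502030001, 0x0e0f0c0d0a0b0809,
    0x0607040502030001, 0x0e0f0c0d0a0b0809);

// step 9: write between 32 and 64 bytes, then advance
_mm512_mask_storeu_epi8(utf8_output, _bzhi_u64(~0ULL, output_size), output);
utf8_output += output_size;
pos += 32;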

I use GCC 11 on an Ice Lake server. My source code is available. For benchmarking, I use a French version of the Mars Wikipedia entry.

On a 3.2 GHz processor, the AVX-512 routine reaches 12 GB/s. It is nearly seven times faster than the conventional routine, and it uses six times fewer instructions.

My approach is less efficient than the best approach for the reverse operation (UTF-8 to Latin 1). I am disappointed for two reasons:

  1. It is usually easier to transform data when the input has a regular form (as Latin 1 does) than when it does not, so transcoding from Latin 1 should be the easier direction.
  2. There is no need to validate the Latin 1 input.

Thus it is likely that my approach is far from optimal.
