Detect control characters, quotes and backslashes efficientl…

Detect control characters, quotes and backslashes efficientl…

Daniel Lemire's blog

When trying to write fast functions operating over many bytes, we sometimes use ‘SWAR’. SWAR stands for SIMD Within A Register, and it is a technique to perform parallel computations on multiple data elements packed into a single word. It treats a register (e.g., 64-bit) as a vector of smaller units (e.g., 8-bit bytes) and processes them simultaneously without explicit SIMD instructions.

For example, if you have 8 bytes in a 64-bit word x, computing (x – 0x2020202020202020) subtracts 32 (0x20 in hex) from each byte. It works well if each byte value is greater than or equal to 32. Otherwise, the operation ‘overflows’: if the least significant byte is smaller than 32, then its most significant bit is set, and you cannot rely on the other more significant byte results.

It works well if all bytes are ‘ASCII’ (they have their most significant bit set). When that is the case, you can check if any byte value is less than 32 by computing (x – 0x2020202020202020) and checking if any most significant bit is set, by masking the result: (x – 0x2020202020202020) & 0x8080808080808080.

Similarly,  you can check whether the byte value 34 (0x22) appears simply by first doing an XOR which sets any such byte value to zero, (x ^ 0x2222222222222222) and then subtracting each byte value by 1: (x ^ 0x2222222222222222) – 0x0101010101010101. It then suffices to masking with 0x80…80 again to see if any  byte value was set to 34, as long as all input byte values were ASCII.

Thankfully, it is easy to select only the byte values that are ASCII: (0x8080808080808080 & ~x) will do.

In JSON and other languages, you need to escape within strings all character values smaller than 32 (control characters in ASCII), equal to 34 (‘”‘ in ASCII) or 92 (“\” in ASCII). So you want to quickly check if a byte is smaller than 32, equal to 34 or 92. With SWAR, you can do it with about ten operations:
bool has_json_escapable_byte(uint64_t x) {
    uint64_t is_ascii = 0x8080808080808080ULL & ~x;
    uint64_t lt32 =
        (x - 0x2020202020202020ULL);
    uint64_t sub34 = x ^ 0x2222222222222222ULL;
    uint64_t eq34 = (sub34 - 0x0101010101010101ULL);
    uint64_t sub92 = x ^ 0x5C5C5C5C5C5C5C5CULL;
    uint64_t eq92 = (sub92 - 0x0101010101010101ULL);
    return ((lt32 | eq34 | eq92) & is_ascii) != 0;
}

That’s slightly more than one operation per input byte.

You might be able to do better.

Generated by RSStT. The copyright belongs to the original author.

Source

Report Page