Scan HTML faster with SIMD instructions: .NET/C# Edition

Daniel Lemire's blog

Recently, the two major Web engines (WebKit and Chromium) adopted fast SIMD routines to scan HTML content. The key insight is to use vectorized classification (Langdale and Lemire, 2019): you load blocks of characters and identify the characters you seek with just a few instructions. In particular, we rely on SIMD instructions: special instructions, available on practically all modern processors, that can process 16 bytes or more at once.

The problem that WebKit and Chromium solve is to jump to the next relevant character: one of <, &, \r and \0. Thus we must quickly identify whether one of these characters appears in a block. On my Apple MacBook, a fast SIMD-based approach can scan an HTML page at about 7 GB/s, with code written in C/C++.

But what about C#? The recent C# runtime (.NET 8) supports fast SIMD instructions.
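The vectorized routine below relies on the hardware intrinsics that ship with the runtime. For reference, and assuming an ARM (NEON) target, the relevant namespaces and the usual capability check look like this; this is a small sketch of mine, not part of the original listing:

using System.Runtime.Intrinsics;      // Vector128<T>
using System.Runtime.Intrinsics.Arm;  // AdvSimd, AdvSimd.Arm64

// In production code, one would typically guard the vectorized path:
// if (AdvSimd.Arm64.IsSupported) { /* SIMD version */ } else { /* scalar version */ }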

Let us first consider a simple version of the function:

public unsafe static void NaiveAdvanceString(ref byte* start, 
                                             byte* end)
{
  // Advance start, byte by byte, until it points at '<', '&', '\r'
  // or '\0', or until it reaches end.
  while (start < end)
  {
    if (*start == '<' || *start == '&' 
         || *start == '\r' || *start == '\0')
    {
       return;
    }
    start++;
  }
}

This function visits each character, one by one, and compares it against the target characters. If a target character is found, we return.
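Here is a small usage sketch of my own (it is not part of the original post); the HTML string and variable names are purely illustrative:

// Inside an unsafe context:
byte[] html = System.Text.Encoding.UTF8.GetBytes("hello &amp; <b>world</b>");
fixed (byte* buffer = html)
{
    byte* cursor = buffer;
    byte* end = buffer + html.Length;
    NaiveAdvanceString(ref cursor, end);
    // cursor now points at the first '<', '&', '\r' or '\0'
    // (or at end if none is present).
    Console.WriteLine(cursor - buffer); // prints 6, the position of '&'
}

The SIMD version below has exactly the same contract, so the two functions are interchangeable.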

public unsafe static void SIMDAdvanceString(ref byte* start, 
                                          byte* end)
{
    // Lookup table indexed by the low nibble of each byte: it holds the
    // target character whose low nibble equals the index ('&' = 0x26,
    // '<' = 0x3c, '\r' = 0x0d, '\0' = 0x00) and zero everywhere else.
    Vector128<byte> low_nibble_mask = Vector128.Create(0, 0, 0, 0, 0, 
                  0, (byte)0x26, 0, 0, 0, 0, 0, (byte)0x3c, 
                 (byte)0xd, 0, 0);
    Vector128<byte> v0f = Vector128.Create((byte)0x0F);
    // Descending values so that MaxAcross reports the earliest match.
    Vector128<byte> bit_mask = Vector128.Create((byte)16, 15, 14, 13, 
                        12, 11, 10, 9, 8,
                        7, 6, 5, 4, 3, 2, 1);

    const int stride = 16;
    while (start + (stride - 1) < end)
    {
        Vector128<byte> data = AdvSimd.LoadVector128(start);
        // Classify all 16 bytes at once: look up each byte's low nibble
        // and compare the result against the byte itself.
        Vector128<byte> lowpart 
           = AdvSimd.Arm64.VectorTableLookup(low_nibble_mask, 
                                             AdvSimd.And(data, v0f));
        Vector128<byte> matchesones = AdvSimd.CompareEqual(lowpart, 
                                            data);
        if (matchesones != Vector128<byte>.Zero)
        {
            // Find the position of the first match and stop there.
            Vector128<byte> matches = AdvSimd.And(bit_mask, 
                                          matchesones);
            int m = AdvSimd.Arm64.MaxAcross(matches).ToScalar();
            start += 16 - m;
            return;
        }
        start += stride;
    }

    // Scalar fallback for the final (fewer than 16) bytes.
    while (start < end)
    {
        if (*start == '<' || *start == '&' || *start == '\r' 
                || *start == '\0')
        {
            return;
        }
        start++;
    }
}

The function takes two pointers (ref byte* start and byte* end) that mark the beginning and end of the byte array. The main loop runs as long as start is at least 16 bytes away from end, which guarantees there is enough data for a full vector load. We load 16 bytes from the memory pointed to by start into the variable data. We then extract the low nibble of each byte, use it to index the vectorized lookup table, and compare the looked-up values against the original bytes: a byte matches exactly when it is one of the four target characters. The code then checks whether any element of matchesones is non-zero. If there is a match, we locate the first one (out of 16 characters), advance start to it and return. If no match is found, we advance by 16 characters and repeat. We conclude with a scalar fallback loop that processes the leftover data (fewer than 16 bytes).
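To see why this classifies exactly the four target characters, consider a scalar rendition of the same trick (my own illustration, not part of the benchmark): the table entry at index c & 0x0F equals c only when c is one of the targets.

byte[] table = { 0, 0, 0, 0, 0, 0, 0x26, 0, 0, 0, 0, 0, 0x3c, 0x0d, 0, 0 };
bool IsTarget(byte c) => table[c & 0x0F] == c;
// IsTarget((byte)'<')  == true   // 0x3c, low nibble 12, table[12] == 0x3c
// IsTarget((byte)'&')  == true   // 0x26, low nibble 6,  table[6]  == 0x26
// IsTarget((byte)'\r') == true   // 0x0d, low nibble 13, table[13] == 0x0d
// IsTarget((byte)0)    == true   // 0x00, low nibble 0,  table[0]  == 0x00
// IsTarget((byte)'a')  == false  // 0x61, low nibble 1,  table[1]  == 0x00

The SIMD version performs these 16 table lookups and 16 comparisons with one instruction each.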

I wrote a benchmark (in C#) that you can run if you have an ARM-based processor. It would be trivial to extend the benchmark to x64 processors.
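For what it is worth, here is what such an x64 port might look like. This is my own sketch, not code from the original post: it assumes SSSE3 support and uses Ssse3.Shuffle (the PSHUFB instruction) in place of the ARM table lookup, with Sse2.MoveMask and a trailing-zero count to find the first match:

// Sketch only: assumes Ssse3.IsSupported; requires System.Runtime.Intrinsics,
// System.Runtime.Intrinsics.X86 and System.Numerics (for BitOperations).
public unsafe static void SIMDAdvanceStringX64(ref byte* start,
                                               byte* end)
{
    Vector128<byte> low_nibble_mask = Vector128.Create(0, 0, 0, 0, 0,
                  0, (byte)0x26, 0, 0, 0, 0, 0, (byte)0x3c,
                  (byte)0xd, 0, 0);
    Vector128<byte> v0f = Vector128.Create((byte)0x0F);
    const int stride = 16;
    while (start + (stride - 1) < end)
    {
        Vector128<byte> data = Sse2.LoadVector128(start);
        // PSHUFB plays the role of the ARM table lookup.
        Vector128<byte> lowpart = Ssse3.Shuffle(low_nibble_mask,
                                                Sse2.And(data, v0f));
        Vector128<byte> matches = Sse2.CompareEqual(lowpart, data);
        int mask = Sse2.MoveMask(matches);
        if (mask != 0)
        {
            // The lowest set bit gives the position of the first match.
            start += BitOperations.TrailingZeroCount(mask);
            return;
        }
        start += stride;
    }
    while (start < end)
    {
        if (*start == '<' || *start == '&' || *start == '\r'
                || *start == '\0')
        {
            return;
        }
        start++;
    }
}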

Incredibly, the SIMD-based function is 15 times faster than the conventional function in these tests, and the accelerated C# function is as fast as, if not faster than, the equivalent C/C++ code.
