Mixing ARM NEON with SVE code for fun and profit

Daniel Lemire's blog

Most mobile devices use 64-bit ARM processors. A growing number of servers (Amazon, Microsoft) also use 64-bit ARM processors.

These processors have special instructions called ARM NEON, which provide a form of parallelism called single instruction, multiple data (SIMD). For example, you can compare sixteen values with sixteen other values using one instruction.
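To make the sixteen-way comparison concrete, here is a minimal sketch. The helper name equal16 is hypothetical, and the scalar branch is there only so that the code builds on non-ARM machines.

```c
#include <stdint.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

// Hypothetical helper: returns 1 if the 16 bytes at a and b are all equal.
int equal16(const uint8_t *a, const uint8_t *b) {
#if defined(__aarch64__)
  uint8x16_t va = vld1q_u8(a);
  uint8x16_t vb = vld1q_u8(b);
  // A single comparison handles all sixteen byte pairs:
  // each lane becomes 0xFF when the pair is equal, 0x00 otherwise.
  uint8x16_t eq = vceqq_u8(va, vb);
  // The minimum over all lanes is 0xFF only if every pair matched.
  return vminvq_u8(eq) == 0xFF;
#else
  // Scalar fallback (assumption: only used when not targeting 64-bit ARM).
  for (int i = 0; i < 16; i++) {
    if (a[i] != b[i]) return 0;
  }
  return 1;
#endif
}
```

The comparison itself is the vceqq_u8 call; the reduction with vminvq_u8 is just one of several ways to turn the sixteen results into a single answer.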

Some of the most recent ARM processors also support even more advanced instructions called SVE or Scalable Vector Extension. ARM has extended SVE over time, with SVE 2 and then SVE 2.1.

While ARM NEON’s registers are fixed at 128 bits, SVE registers are a multiple of 128 bits. In practice, SVE registers are most often 128 bits, although there are exceptions. The Amazon Graviton 3 is based on the ARM Neoverse V1 core, which supports SVE with a vector length of 256 bits (32 bytes). The Graviton 4 is based on the ARM Neoverse V2 core, which supports SVE2 with a vector length of 128 bits (16 bytes), like NEON.
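If you want to check the register size on your own machine, the svcntb intrinsic from arm_sve.h reports the number of bytes in an SVE register. The following is a minimal sketch: the helper name sve_vector_bytes is hypothetical, and it returns 0 when the code is not compiled with SVE enabled (for example, without a flag such as -march=armv8-a+sve).

```c
#include <stdint.h>
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

// Hypothetical helper: size in bytes of an SVE register,
// or 0 when the build does not target SVE.
uint64_t sve_vector_bytes(void) {
#if defined(__ARM_FEATURE_SVE)
  return svcntb(); // number of 8-bit elements in an SVE vector
#else
  return 0; // assumption: 0 stands for "no SVE in this build"
#endif
}
```

Going by the figures above, one would expect 32 on a Graviton 3 and 16 on a Graviton 4.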

So if you are a low-level engineer, you are supposed to choose: either you use ARM NEON or SVE. In a comment on a recent article I wrote, Ashton Six observed that you can mix and match these instructions (NEON and SVE) because it is guaranteed that the first 128 bits of each SVE register are the corresponding NEON register. Ashton provided a demonstration using assembly code.

If you have a recent C/C++ compiler (e.g., GCC 14), it turns out that you can fairly easily switch back and forth between NEON and SVE. If you include the arm_neon_sve_bridge.h header, you get two functions:


  • svset_neonq: copies a NEON 128-bit vector (uint8x16_t, int32x4_t, etc.) into the first 128 bits of an SVE scalable vector (svuint8_t, svint32_t, etc.).
  • svget_neonq: extracts the first 128 bits of an SVE scalable vector and returns them as a NEON 128-bit vector.

These functions are ‘free’: they are likely compiled to no instruction whatsoever.
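As a rough illustration of the round trip, the sketch below loads data with NEON, moves it into SVE registers, does the arithmetic with an SVE instruction, and moves the result back to NEON for the store. The helper name add16_mixed is hypothetical, and I am assuming a compiler recent enough to ship arm_neon_sve_bridge.h; the scalar branch is only there for builds without SVE.

```c
#include <stdint.h>
#if defined(__ARM_FEATURE_SVE)
#include <arm_neon.h>
#include <arm_sve.h>
#include <arm_neon_sve_bridge.h>
#endif

// Hypothetical helper: out[i] = a[i] + b[i] (wrapping) for 16 bytes.
void add16_mixed(const uint8_t *a, const uint8_t *b, uint8_t *out) {
#if defined(__ARM_FEATURE_SVE)
  uint8x16_t na = vld1q_u8(a); // NEON loads
  uint8x16_t nb = vld1q_u8(b);
  // NEON -> SVE: only the first 128 bits are defined after this.
  svuint8_t sa = svset_neonq_u8(svundef_u8(), na);
  svuint8_t sb = svset_neonq_u8(svundef_u8(), nb);
  // Predicate covering the first 16 byte lanes.
  svbool_t first16 = svwhilelt_b8(0, 16);
  // SVE addition, restricted to the lanes we care about.
  svuint8_t sum = svadd_u8_x(first16, sa, sb);
  // SVE -> NEON, followed by a NEON store.
  vst1q_u8(out, svget_neonq_u8(sum));
#else
  // Scalar fallback (assumption: only used when SVE is not enabled).
  for (int i = 0; i < 16; i++) {
    out[i] = (uint8_t)(a[i] + b[i]);
  }
#endif
}
```

Passing svundef_u8() as the first argument tells the compiler that we do not care about the bits beyond the first 128.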

Let us illustrate with an example. In a recent post, I discussed how it is a bit complicated to check whether there is a non-zero byte in a NEON register. A competitive solution is as follows:

int veq_non_zero_max(uint8x16_t v) {
  return vmaxvq_u32(vreinterpretq_u32_u8(v)) != 0;
}

Effectively, we compute the maximum 32-bit integer in the register, considering it as four 32-bit integers. The function compiles down to three essential instructions: umaxv, fmov and a comparison (cmp).

Let us consider the following SVE alternative. It converts the input to an SVE vector, creates a mask for the first 16 positions, compares each element to zero to generate a predicate of non-zero positions, and finally tests whether any element is non-zero, returning 1 if so or 0 if all are zero. In effect, it is a vectorized “any non-zero” check.

int sve_non_zero_max(uint8x16_t nvec) {
  // svundef_u8() avoids reading an uninitialized variable
  svuint8_t vec = svset_neonq_u8(svundef_u8(), nvec);
  svbool_t mask = svwhilelt_b8(0, 16);
  svbool_t cmp = svcmpne_n_u8(mask, vec, 0);
  return svptest_any(mask, cmp);
}

Except for the initialization of mask, the function is made of two instructions: cmpne and cset. These two instructions may be fused into one on some ARM cores. Even though the code mixing NEON and SVE looks more complicated, it should be more efficient.

If you know that your target processor supports SVE (or SVE 2 or SVE 2.1), and you already have ARM NEON code, you could try adding bits of SVE to it.
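One possible way to do so, sketched below, is to keep the NEON version and select the SVE version at compile time when SVE is enabled. The wrapper name any_non_zero16 is hypothetical; the two ARM branches reuse the functions discussed above, and the scalar branch is an assumption added so that the code builds elsewhere.

```c
#include <stdint.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#include <arm_neon_sve_bridge.h>
#endif

// Hypothetical wrapper: returns 1 if any of the 16 bytes at p is non-zero.
int any_non_zero16(const uint8_t *p) {
#if defined(__ARM_FEATURE_SVE)
  // NEON load, then an SVE compare and predicate test.
  svuint8_t vec = svset_neonq_u8(svundef_u8(), vld1q_u8(p));
  svbool_t mask = svwhilelt_b8(0, 16);
  return svptest_any(mask, svcmpne_n_u8(mask, vec, 0));
#elif defined(__aarch64__)
  // NEON only: maximum of the four 32-bit lanes.
  return vmaxvq_u32(vreinterpretq_u32_u8(vld1q_u8(p))) != 0;
#else
  // Scalar fallback for other targets.
  for (int i = 0; i < 16; i++) {
    if (p[i] != 0) return 1;
  }
  return 0;
#endif
}
```

This only picks a version at build time. Whether the SVE branch is actually faster on a given core is something you would want to benchmark.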

 
