Checking whether an ARM NEON register is zero
Daniel Lemire's blog

Your phone probably runs on 64-bit ARM processors. These processors are ubiquitous: they power the Nintendo Switch, cloud servers at both Amazon AWS and Microsoft Azure, fast laptops, and so forth. ARM processors support special powerful instructions called ARM NEON. They provide a specific type of parallelism called Single Instruction, Multiple Data (SIMD). For example, you can add sixteen values to sixteen other values using one instruction.
Using special functions called intrinsics, you can program a C function that adds the values from two arrays and writes the result into a third:
#include <arm_neon.h>

void vector_int_add(int32_t *a, int32_t *b, int32_t *result, int len) {
  int i = 0;
  // Loop processing 4 elements at a time until fewer than 4 remain
  for (; i < len - 3; i += 4) {
    // Load 4 32-bit integers from 'a' into a NEON register
    int32x4_t vec_a = vld1q_s32(a + i);
    // Load 4 32-bit integers from 'b' into another NEON register
    int32x4_t vec_b = vld1q_s32(b + i);
    // Perform vector addition
    int32x4_t vec_res = vaddq_s32(vec_a, vec_b);
    // Store the result back to memory
    vst1q_s32(result + i, vec_res);
  }
  // Handle any remaining elements (when len is not divisible by 4)
  for (; i < len; i++) {
    result[i] = a[i] + b[i];
  }
}

In this code, the type int32x4_t represents four 32-bit integers. You can similarly represent sixteen 8-bit integers as int8x16_t and so forth.
The intrinsic functions vld1q_s32 and vst1q_s32 load and store 16 bytes of data. The intrinsic function vaddq_s32 does the sum.
A simple but deceptively tricky task is to determine whether a register contains only zeros. This comes up regularly in algorithms. Unfortunately, ARM NEON has no instruction for this purpose, unlike the Intel/AMD instruction sets (AVX2, AVX-512).
There are many different valid approaches with ARM NEON but they all require several instructions. To keep things simple, I will assume that you are receiving sixteen 8-bit integers (uint8x16_t). In practice, you can reinterpret sixteen bytes as whatever you like (e.g., go from uint8x16_t to int32x4_t) without cost.
My favorite approach is to use the fact that ARM NEON can compute the maximum or the minimum value in an SIMD register: the intrinsic vmaxvq_u32 (corresponding to the instruction umaxv) computes the maximum value across all elements of a vector and returns it as a scalar (not a vector). There are also other variants like vmaxvq_u8 depending on your data type, but vmaxvq_u32 tends to be the most performant approach. The code might look as follows:
int veq_non_zero_max(uint8x16_t v) {
  return vmaxvq_u32(vreinterpretq_u32_u8(v)) != 0;
}
It compiles down to three essential instructions: umaxv, fmov and a comparison (cmp). We need the fmov instruction or an equivalent to move the data from the SIMD register to a scalar register. The overall code is not great: umaxv has at least three cycles of latency, and so does fmov.
There is a more complicated but potentially more useful approach based on a narrowing shift. The vshrn_n_u16 intrinsic (corresponding to the shrn instruction) shifts each of the eight 16-bit lanes right by 4 bits and narrows the result to 8 bits. The result is a 64-bit value containing four bits from each byte of the original 16-byte register: the most significant four bits of each even-indexed byte and the least significant four bits of each odd-indexed byte. We might check whether the register is zero like so:
int veq_non_zero_narrow(uint8x16_t v) {
  return vget_lane_u64(vshrn_n_u16(vreinterpretq_u16_u8(v), 4), 0) != 0;
}

It compiles down to three instructions: shrn, fmov and a comparison. It is no faster. Beware that four bits of every byte are discarded: a byte such as 0x01 at an even index escapes detection. The test is reliable when each byte is either 0x00 or 0xFF, as with the output of a comparison instruction.
There is a faster approach, pointed out to me by Robert Clausecker. Instead of moving the data from the SIMD register to a scalar register, we may use the fact that the SIMD registers also serve as floating-point registers. Thus we can leave the data in place. It applies to both techniques presented so far (vmaxvq_u32 and vshrn_n_u16). Here is the version with vshrn_n_u16:
int veq_non_zero_float(uint8x16_t v) {
  uint8x8_t narrowed = vshrn_n_u16(vreinterpretq_u16_u8(v), 4);
  return (vdupd_lane_f64(vreinterpret_f64_u8(narrowed), 0) != 0.0);
}

This compiles to only two instructions: shrn and fcmp. It is thus much faster.
So why not use the floating-point approach?
- We have two distinct values 0.0 and -0.0 that are considered equal.
- The floating-point standard includes tiny values called subnormal values that may be considered as being equal to zero under some configurations.
- We may generate a signaling NaN value which might cause a signal to be emitted.
If you are careful, you can avoid all of these problems, but they make the floating-point trick hard to recommend as a general-purpose solution.
Newer 64-bit ARM processors support another family of instruction sets: SVE, the Scalable Vector Extension. It is not directly interoperable with ARM NEON as far as I know… But it does have a dedicated comparison instruction (cmpeq) generated by the following code:
#include <arm_sve.h>

bool check_all_zeros(svbool_t mask, svint8_t vec) {
  // true for each active lane that is equal to zero
  svbool_t cmp = svcmpeq_n_s8(mask, vec, 0);
  // all active lanes are zero if no active lane fails the comparison
  return !svptest_any(mask, svnot_b_z(mask, cmp));
}

SVE code is more sophisticated: it requires a mask indicating which values are 'active'. To set all values to active, you might need to generate an all-true mask, e.g., like so: "svbool_t mask = svptrue_b8()". However, SVE is somewhat more powerful: you can check that all active values are zeroes…
Unfortunately, SVE is not yet widespread.