Dynamic bit shuffle using AVX-512
Daniel Lemire's blogSuppose that you want to reorder, arbitrarily, the bits in a 64-bit word. This question was raised on Twitter by @experquisite. Formally, you might want to provide, for each of the 64 bit position, an original bit position you want to copy.
Hence, the following code would reverse the bit order in your 64-bit word:
uint64_t w = some value;
uint8_t indexes[64] = {63, 62, 61, 60, 59, 58, 57, 56, 55, 54, 53, 52, 51,
50, 49, 48, 47, 46, 45, 44, 43, 42, 41, 40, 39, 38,
37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25,
24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12,
11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0};
bit_shuffle(w, indexes); // returns a reversed version
A naive way to do it in C might be as follows:
uint64_t slow_bit_shuffle(uint64_t w, uint8_t indexes[64]) {
uint64_t out{};
for (size_t i = 0; i < 64; i++) {
bool bit_set = w & (uint64_t(1) << indexes[i]);
out |= (uint64_t(bit_set) << i);
}
return out;
}
This might be an acceptable implementation, but what if you want do it using few instructions? You can do it on recent Intel and AMD processors with support for AVX-512 instructions. You go from the general-purpose register to a mask register, to a 512-bit AVX-512 register, you apply a shuffle (vpermb), you go back to a mask register and finally back to a general-purpose register.
The code with Intel intrinsic functions looks as follows:
uint64_t bit_shuffle(uint64_t w, uint8_t indexes[64]) {
__mmask64 as_mask = _cvtu64_mask64(w);
__m512i as_vec_register =
_mm512_maskz_mov_epi8(as_mask, _mm512_set1_epi8(0xFF));
__m512i as_vec_register_shuf =
_mm512_permutexvar_epi8(_mm512_loadu_epi8(indexes), as_vec_register);
return _cvtmask64_u64(_mm512_movepi8_mask(as_vec_register_shuf));
}
It might compile to about six instructions:
kmovq k0, rdi vpmovm2b zmm0, k0 vmovdqu8 zmm1, ZMMWORD PTR [rsi] vpermb zmm0, zmm1, zmm0 vpmovb2m k1, zmm0 kmovq rax, k1
Loading your indexes is likely to have a long latency, so if you can buffer the load (_mm512_loadu_epi8(indexes)), you will reduce significantly the latency.
I have an implementation in C++.
Generated by RSStT. The copyright belongs to the original author.