Packing a string of digits into an integer quickly

Packing a string of digits into an integer quickly

Daniel Lemire's blog

Suppose that I give you a short string of digits, containing possibly spaces or other characters (e.g., "20141103 012910"). We would like to pack the digits into an integer (e.g., 0x20141103012910) so that the lexicographical order over the string matches the ordering of the integers.

We can use the fact that in ASCII, the digits have byte values 0x30, 0x31 and so forth. Thus as a sequence of bytes, the string "20141103 012910" is actually 0x32, 0x30, 0x31… so we must select the least significant 4 bits of each byte, discarding the rest. Intel and AMD processors have a convenient instruction which allows us to select any bits from a word to construct a new word (pext).

A problem remains: Intel and AMD processors are little endian, which means that if I load the string in memory, the first byte becomes the least significant, not the most significant. Thankfully, there is an instruction to reverse the order of the bytes.

In C, the desired function using Intel intrinsics might look like this:

#include <x86intrin.h> // Windows: &LTintrin.h>
#include <string.h>

// From "20141103 012910", we want to get
// 0x20141103012910
uint64_t extract_nibbles(const char* c) {
  uint64_t part1, part2;
  memcpy(&part1, c, sizeof(uint64_t));
  memcpy(&part2 , c + 7, sizeof(uint64_t));
  part1 = _bswap64(part1);
  part2 = _bswap64(part2);
  part1 = _pext_u64(part1, 0x0f0f0f0f0f0f0f0f);
  part2 = _pext_u64(part2, 0x0f000f0f0f0f0f0f);
  return (part1<<24) | (part2);
}

It compiles to relatively few instructions: only 4 non-load instructions. The memcpy calls in my code translate into 64-bit load instructions. The register loading instructions (movabs) are nearly free in practice.

movbe rax, QWORD PTR [rdi]
movbe rdx, QWORD PTR [rdi+7]
movabs rcx, 1085102592571150095
pext rax, rax, rcx
movabs rcx, 1080880467920490255
sal rax, 24
pext rdx, rdx, rcx
or rax, rdx

Prior to the AMD Zen 3 processors, pext had terrible performance on AMD processors. Recent Intel processors have performance on par with Intel, meaning that pext has a latency of about 3 cycles, and can run once per cycle. So it is about as expensive as a multiplication. Not counting the loads, the above function could nearly be complete in about 5 cycles, which is quite fast.

It would be interesting to research how to solve the same problem for ARM processors.

Generated by RSStT. The copyright belongs to the original author.

Source

Report Page