Compressing floating-point numbers quickly by converting the…

Daniel Lemire's blog

We sometimes have to work with a large quantity of floating-point numbers. This volume can be detrimental to performance. Thus we often want to compress these numbers. Large-language models routinely do so.

A sensible approach is to convert them to brain floating-point numbers (bfloat16). These are 16-bit numbers that cover the same wide range as 32-bit floats, although with only 8 bits of precision (two to three significant decimal digits). Compared to the common 64-bit floating-point numbers, it is a net saving of 4×. You can do considerably better than a 4× factor by using statistical analysis over your dataset. However, brain floats have the benefit of being standard and they are supported at the CPU level.
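To make the format concrete, here is a minimal sketch in C of how the 16 bits of a brain float are laid out: 1 sign bit, 8 exponent bits (the same as a 32-bit float) and 7 mantissa bits. The names `bf16_fields`, `bf16_split` and `bf16_value` are made up for this illustration; they are not part of any library.

```c
#include <math.h>
#include <stdint.h>

// Hypothetical helper type: the three fields of a 16-bit brain float.
typedef struct {
    unsigned sign;      // 1 bit
    unsigned exponent;  // 8 bits, biased by 127 (as in 32-bit floats)
    unsigned mantissa;  // 7 bits (32-bit floats have 23)
} bf16_fields;

// Split a raw 16-bit pattern into its fields.
static bf16_fields bf16_split(uint16_t bits) {
    bf16_fields f;
    f.sign = bits >> 15;
    f.exponent = (bits >> 7) & 0xFF;
    f.mantissa = bits & 0x7F;
    return f;
}

// Decode a raw 16-bit pattern into the value it represents.
static double bf16_value(uint16_t bits) {
    bf16_fields f = bf16_split(bits);
    double sign = f.sign ? -1.0 : 1.0;
    if (f.exponent == 0xFF) {
        // All-ones exponent: infinity or NaN
        return f.mantissa ? NAN : sign * INFINITY;
    }
    if (f.exponent == 0) {
        // Zero or subnormal: (mantissa / 128) * 2^-126
        return sign * ldexp((double)f.mantissa, -133);
    }
    // Normal number: (1 + mantissa / 128) * 2^(exponent - 127)
    return sign * ldexp(1.0 + f.mantissa / 128.0, (int)f.exponent - 127);
}
```

For instance, the bit pattern 0xC308 has sign 1, exponent 134 and mantissa 8, so it decodes to -(1 + 8/128) × 2^7 = -136.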

If you have a recent AMD processor (Zen 4 or better) or a recent Intel server processor (Sapphire Rapids), you have fast instructions to convert 32-bit floating-point numbers to and from 16-bit brain float numbers. Specifically, you have access to Single Instruction, Multiple Data (SIMD) instructions that can convert several numbers at once (with one instruction).

We have had instructions to go from 64-bit numbers to 32-bit numbers and back for some time. So we can go from 64-bit to 32-bit and then to 16-bit, and similarly in reverse.

Using Intel intrinsic functions, in C, you might compress 8 numbers with this code:

// Load 8 double-precision floats
__m512d src_vec = _mm512_loadu_pd(&src[i]);

// Narrow to 32-bit floats, then to 16-bit brain floats (both steps round to nearest)
__m128bh dst_vec = _mm256_cvtneps_pbh(_mm512_cvt_roundpd_ps(src_vec, 
      _MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC));

// Store the result
_mm_storeu_si128((__m128i*)&dst[i], *(__m128i*)&dst_vec);

Going the other way and converting eight 16-bit numbers to eight 64-bit numbers is similar:

// Load 8 brain floats (16 bits each)
__m128i src_vec = _mm_loadu_si128((__m128i*)&src[i]);

// Convert to double-precision floats
__m512d dst_vec = _mm512_cvtps_pd(_mm256_cvtpbh_ps(*(__m128bh*)&src_vec));

// Store the result
_mm512_storeu_pd(&dst[i], dst_vec);
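The two snippets above each handle one block of 8 numbers. A whole array could be processed along the following lines. This is a sketch under assumptions, not the benchmarked code: the function names `bf16_compress` and `bf16_decompress` and the scalar helpers are hypothetical. The SIMD path is only compiled when the compiler advertises AVX-512 BF16 support (for example with `-march=native` on a supporting machine), and a scalar loop handles the leftover elements, or everything on other processors.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512BF16__)
#include <immintrin.h>
#define HAVE_BF16_SIMD 1
#else
#define HAVE_BF16_SIMD 0
#endif

// Scalar fallback: double -> float -> brain float, rounding to nearest even.
// Caveat (to the best of my knowledge): the hardware instruction treats
// subnormal values as zero, which this helper does not imitate.
static uint16_t double_to_bf16(double x) {
    float f = (float)x;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    if ((bits & 0x7FFFFFFFu) > 0x7F800000u) {
        return (uint16_t)((bits >> 16) | 0x0040u);  // NaN stays a quiet NaN
    }
    bits += 0x7FFFu + ((bits >> 16) & 1u);  // round to nearest, ties to even
    return (uint16_t)(bits >> 16);
}

// Scalar fallback: a brain float is the upper half of a 32-bit float.
static double bf16_to_double(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return (double)f;
}

// Compress n doubles into n 16-bit brain floats.
void bf16_compress(const double *src, uint16_t *dst, size_t n) {
    size_t i = 0;
#if HAVE_BF16_SIMD
    for (; i + 8 <= n; i += 8) {
        __m512d v = _mm512_loadu_pd(&src[i]);
        __m256 f = _mm512_cvt_roundpd_ps(
            v, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
        __m128bh h = _mm256_cvtneps_pbh(f);
        memcpy(&dst[i], &h, sizeof h);  // store 8 x 16 bits
    }
#endif
    for (; i < n; i++) {
        dst[i] = double_to_bf16(src[i]);  // tail, or all elements without SIMD
    }
}

// Expand n 16-bit brain floats back into n doubles.
void bf16_decompress(const uint16_t *src, double *dst, size_t n) {
    size_t i = 0;
#if HAVE_BF16_SIMD
    for (; i + 8 <= n; i += 8) {
        __m128bh h;
        memcpy(&h, &src[i], sizeof h);  // load 8 x 16 bits
        _mm512_storeu_pd(&dst[i], _mm512_cvtps_pd(_mm256_cvtpbh_ps(h)));
    }
#endif
    for (; i < n; i++) {
        dst[i] = bf16_to_double(src[i]);
    }
}
```

Using `memcpy` for the 16-byte loads and stores avoids the pointer casts; compilers normally turn it into a single move.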

How precise is it? I tried it on a GeoJSON file representing the border of Canada: it is a collection of coordinates. The worst absolute error happens when the number -135.500305 is approximated by -136. Brain floats keep only 8 significant bits, so representable values are 1 apart between 128 and 256 in magnitude, and 0.5 apart between 64 and 128. For example, -65.613617 is represented as -65.5 and -66.282776 as -66.5, and the format will not distinguish between coordinates that round to the same value. Whether that’s acceptable depends on your application.
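The rounding can be reproduced with a small arithmetic model that keeps 8 significant bits and rounds ties to even. This is only a sketch: `round_to_bf16_precision` is a hypothetical helper, it assumes finite inputs within the normal range of the format, and it skips the intermediate 32-bit step, which should only matter in rare tie cases.

```c
#include <math.h>

// Hypothetical helper: round x to 8 significant bits (ties to even),
// which is the precision of a brain float.
// Assumes x is finite and within the normal range of the format.
static double round_to_bf16_precision(double x) {
    if (x == 0.0) return x;
    double a = x < 0 ? -x : x;       // work on the magnitude
    int e;
    frexp(a, &e);                    // a = m * 2^e with m in [0.5, 1)
    double ulp = ldexp(1.0, e - 8);  // gap between neighbouring brain floats near x
    double t = a / ulp;              // in [128, 256), exact since ulp is a power of two
    long q = (long)t;                // integer part
    double r = t - (double)q;        // fractional part
    if (r > 0.5 || (r == 0.5 && (q & 1))) {
        q++;                         // round to nearest, ties to even
    }
    double y = (double)q * ulp;
    return x < 0 ? -y : y;
}
```

With this model, -135.500305 rounds to -136 (an error just under 0.5, since the gap between neighbours is 1 in that range), while -65.613617 rounds to -65.5 and -66.282776 rounds to -66.5.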

Is it fast? I wrote a benchmark where I convert all the numbers in the coordinates of the Canadian border. I use a Zen 4 processor (AMD EPYC 9R14 @ 2.6GHz) with GCC 13.

It is likely that better code can get even better performance, but it is already quite fast.
