Base64 for compression

Daniel Lemire's blog

C and C++ compilers like GCC first take your code and produce assembly, typically a pure ASCII output (so just basic English characters). This assembly code is a low-level representation of the program, using mnemonic instructions specific to the target processor architecture. The compiler then passes this assembly code to an assembler, which translates it into machine code—binary instructions that the processor can execute directly.

A string in your source code may contain non-ASCII characters such as ‘é’, as in unsigned char a[] = "é";. Such characters are typically encoded in UTF-8, where ‘é’ takes two bytes, written \303\251 in octal notation. Because the assembly is ASCII, expressing those two bytes as an assembly string requires 8 characters ("\303\251"): each byte becomes a backslash followed by three octal digits. Thus, a single character in source code can expand significantly in the assembly output.
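To make the expansion concrete, here is a minimal sketch in C that estimates how many ASCII characters a byte string occupies once it is written with octal escapes. The function name escaped_length is hypothetical, and the escaping rules are a simplified assumption: real assemblers and compilers may use shorter escapes for some control characters such as newlines.

```c
#include <stddef.h>

/* Hypothetical helper (not part of GCC): estimates the number of ASCII
   characters needed to write n bytes inside a quoted assembly string.
   Simplifying assumptions: printable ASCII costs 1 character, the quote
   and backslash cost 2 (a backslash escape), and every other byte costs
   4 (a backslash followed by three octal digits). */
size_t escaped_length(const unsigned char *bytes, size_t n) {
    size_t total = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned char b = bytes[i];
        if (b == '"' || b == '\\') {
            total += 2;              /* \" or \\ */
        } else if (b >= 0x20 && b < 0x7F) {
            total += 1;              /* printable ASCII, copied as-is */
        } else {
            total += 4;              /* \ooo octal escape */
        }
    }
    return total;
}
```

Under these assumptions, the two bytes of ‘é’ cost 8 characters, while plain ASCII text such as "abc" costs one character per byte.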

As a related issue, recent versions of C and C++ (C23 and C++26) have an ‘#embed’ directive that allows you to directly embed an arbitrary file in your code (e.g., an image). Such data might be encoded inefficiently as assembly.

What could you do?

Base64 is an encoding method that converts binary data into a string of printable ASCII characters, using a set of 64 characters (uppercase and lowercase letters, digits, and symbols like + and /). It is commonly used to represent binary data, such as images or files, in text-based formats like JSON, XML, or emails (MIME).

Applied to binary data, base64 expands it: every 3 input bytes become 4 ASCII characters, an increase of about 33%. Interestingly, in some cases, base64 can be used for compression purposes. Older versions of GCC would compile

unsigned char a[] = "éééééééé";

to

.string "\303\251\303\251\303\251\303\251\303\251\303\251\303\251\303\251"

GCC 15 now supports base64 encoding of data during compilation, with a new “base64” pseudo-op. Our array now gets compiled to the much shorter string

.base64 "w6nDqcOpw6nDqcOpw6nDqQA="
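You can check this string yourself with a small base64 encoder. The following is a minimal sketch in C, not GCC's actual implementation. The name base64_encode is hypothetical, and I assume the standard padded alphabet from RFC 4648.

```c
#include <stddef.h>
#include <stdint.h>

/* Standard base64 alphabet (RFC 4648). */
static const char B64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Hypothetical helper: writes the padded base64 form of in[0..n) to out
   and returns the number of characters written, not counting the final
   '\0'. The caller must provide room for 4 * ((n + 2) / 3) + 1 bytes. */
size_t base64_encode(const unsigned char *in, size_t n, char *out) {
    size_t o = 0;
    for (size_t i = 0; i < n; i += 3) {
        size_t left = n - i;                 /* bytes left in the input */
        uint32_t v = (uint32_t)in[i] << 16;  /* pack up to 3 bytes */
        if (left > 1) v |= (uint32_t)in[i + 1] << 8;
        if (left > 2) v |= (uint32_t)in[i + 2];
        out[o++] = B64[(v >> 18) & 63];
        out[o++] = B64[(v >> 12) & 63];
        out[o++] = left > 1 ? B64[(v >> 6) & 63] : '=';
        out[o++] = left > 2 ? B64[v & 63] : '=';
    }
    out[o] = '\0';
    return o;
}
```

Calling base64_encode(a, sizeof a, out) on our array encodes 17 bytes: the 16 bytes of the eight ‘é’ characters plus the terminating null byte, which is why the string ends in "A=". The result is 24 characters, where the octal escapes needed 64.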
