Careful with Pair-of-Registers instructions on Apple Silicon

Daniel Lemire's blog

Egor Bogatov is an engineer working on C# compiler technology at Microsoft. He made an intriguing remark about a performance regression on Apple hardware following what appears to be an optimization. The .NET 9.0 runtime introduced an optimization that combines two loads (ldr) into a single load (ldp). It is a typical peephole optimization. Yet it made things much slower in some cases.

On 64-bit ARM, the ldr instruction loads a single value from memory into a register. It operates on one register at a time. Its assembly syntax is straightforward: ldr Rd, [Rn, #offset]. The ldp instruction (Load Pair of Registers) loads two consecutive values from memory into two registers with a single instruction. Its assembly syntax is similar, but there are two destination registers: ldp Rd1, Rd2, [Rn, #offset]. The ldp instruction loads two 32-bit words or two 64-bit words from memory and writes them to two registers.
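As a minimal sketch of the difference, the two functions below read the same pair of adjacent 32-bit words, once with two ldr instructions and once with a single ldp. The names pair32, load_with_ldr and load_with_ldp are hypothetical and chosen for illustration only. The assembly is used when compiling for AArch64; on other targets, a plain C fallback is an assumption added so that the code still compiles.

```c
#include <stdint.h>

typedef struct {
  uint32_t first;
  uint32_t second;
} pair32;

/* Two separate instructions, one 32-bit load each. */
static pair32 load_with_ldr(const uint32_t *ptr) {
  pair32 p;
#if defined(__aarch64__)
  __asm__ volatile("ldr %w0, [%2]\n"
                   "ldr %w1, [%2, #4]\n"
                   : "=&r"(p.first), "=&r"(p.second)
                   : "r"(ptr)
                   : "memory");
#else
  /* Fallback for non-ARM targets (assumption, for portability only). */
  p.first = ptr[0];
  p.second = ptr[1];
#endif
  return p;
}

/* One instruction that fills both registers from adjacent words. */
static pair32 load_with_ldp(const uint32_t *ptr) {
  pair32 p;
#if defined(__aarch64__)
  __asm__ volatile("ldp %w0, %w1, [%2]\n"
                   : "=&r"(p.first), "=&r"(p.second)
                   : "r"(ptr)
                   : "memory");
#else
  /* Fallback for non-ARM targets (assumption, for portability only). */
  p.first = ptr[0];
  p.second = ptr[1];
#endif
  return p;
}
```

Both functions should return the same pair for the same input; only the number of load instructions differs.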

Given a choice, it seems that you should prefer the ldp instruction. After all, it is a single instruction. But there is a catch on Apple silicon: if you are loading data from memory that was just written to, there might be a significant penalty for using ldp.

To illustrate, let us consider the case where we write and load two values repeatedly using two loads and two stores:

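// ptr points to two adjacent 32-bit integers
for (int i = 0; i < 1000000000; i++) {
  int tmp1, tmp2;
  // two single loads, then both values are stored back to the same addresses
  __asm__ volatile("ldr %w0, [%2]\n"
                   "ldr %w1, [%2, #4]\n"
                   "str %w0, [%2]\n"
                   "str %w1, [%2, #4]\n"
    : "=&r"(tmp1), "=&r"(tmp2) : "r"(ptr) : "memory");
}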

Next, let us consider an optimized approach where we combine the two loads into a single one:

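// ptr points to two adjacent 32-bit integers
for (int i = 0; i < 1000000000; i++) {
  int tmp1, tmp2;
  // one paired load, then both values are stored back to the same addresses
  __asm__ volatile("ldp %w0, %w1, [%2]\n"
                   "str %w0, [%2]\n"
                   "str %w1, [%2, #4]\n"
    : "=&r"(tmp1), "=&r"(tmp2) : "r"(ptr) : "memory");
}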

It would be surprising if this new version were slower, but it can be. The code for the benchmark is available. I benchmarked both versions on AWS using Amazon's Graviton 3 processors, and on an Apple M2. Your results will vary.
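For readers who want to try the comparison themselves, here is a minimal timing sketch. It is not the author's benchmark: the function names run_ldr_loop, run_ldp_loop and time_loop are hypothetical, timing relies on the standard clock() function, and the non-ARM fallback is only a stand-in so that the code compiles on other targets.

```c
#include <stdint.h>
#include <time.h>

/* Two ldr loads followed by two stores to the same addresses. */
static void run_ldr_loop(uint32_t *ptr, long iterations) {
  for (long i = 0; i < iterations; i++) {
    uint32_t tmp1, tmp2;
#if defined(__aarch64__)
    __asm__ volatile("ldr %w0, [%2]\n"
                     "ldr %w1, [%2, #4]\n"
                     "str %w0, [%2]\n"
                     "str %w1, [%2, #4]\n"
                     : "=&r"(tmp1), "=&r"(tmp2)
                     : "r"(ptr)
                     : "memory");
#else
    /* Stand-in for non-ARM targets (assumption); not the same experiment. */
    volatile uint32_t *p = ptr;
    tmp1 = p[0];
    tmp2 = p[1];
    p[0] = tmp1;
    p[1] = tmp2;
#endif
  }
}

/* One ldp load followed by two stores to the same addresses. */
static void run_ldp_loop(uint32_t *ptr, long iterations) {
  for (long i = 0; i < iterations; i++) {
    uint32_t tmp1, tmp2;
#if defined(__aarch64__)
    __asm__ volatile("ldp %w0, %w1, [%2]\n"
                     "str %w0, [%2]\n"
                     "str %w1, [%2, #4]\n"
                     : "=&r"(tmp1), "=&r"(tmp2)
                     : "r"(ptr)
                     : "memory");
#else
    /* Stand-in for non-ARM targets (assumption); not the same experiment. */
    volatile uint32_t *p = ptr;
    tmp1 = p[0];
    tmp2 = p[1];
    p[0] = tmp1;
    p[1] = tmp2;
#endif
  }
}

/* CPU time, in seconds, spent in one call of loop(ptr, iterations). */
static double time_loop(void (*loop)(uint32_t *, long), uint32_t *ptr,
                        long iterations) {
  clock_t start = clock();
  loop(ptr, iterations);
  return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_loop(run_ldr_loop, data, n) and time_loop(run_ldp_loop, data, n) on the same two-element array, then dividing each result by n, gives a rough time per iteration for each variant. Repeating the measurement several times and keeping the smallest value helps reduce noise.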

I have no particular insight as to why this might be the case, but my guess is that Apple Silicon has a store-to-load forwarding optimization that does not work with pair-of-registers loads and stores.
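One way to probe this guess, offered only as a sketch, is a control loop in which ldp reads from one location while the stores go to a different one, so that the loads never read freshly written memory. The function name run_ldp_separate_store_loop and its two-pointer signature are hypothetical, and the non-ARM fallback is an assumption added for portability. If the penalty comes from the interaction between recent stores and ldp, this variant should not show it, but that remains to be measured.

```c
#include <stdint.h>

/* ldp reads from src; the stores go to dst.
   src and dst are assumed not to overlap. */
static void run_ldp_separate_store_loop(const uint32_t *src, uint32_t *dst,
                                        long iterations) {
  for (long i = 0; i < iterations; i++) {
    uint32_t tmp1, tmp2;
#if defined(__aarch64__)
    __asm__ volatile("ldp %w0, %w1, [%2]\n"
                     "str %w0, [%3]\n"
                     "str %w1, [%3, #4]\n"
                     : "=&r"(tmp1), "=&r"(tmp2)
                     : "r"(src), "r"(dst)
                     : "memory");
#else
    /* Stand-in for non-ARM targets (assumption); not the same experiment. */
    const volatile uint32_t *s = src;
    volatile uint32_t *d = dst;
    tmp1 = s[0];
    tmp2 = s[1];
    d[0] = tmp1;
    d[1] = tmp2;
#endif
  }
}
```

Placing src and dst well apart in memory, for example in separately allocated buffers, seems prudent so that the stores are clearly unrelated to the loaded addresses.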

There is an Apple Silicon CPU Optimization Guide which might provide better insight.
