Careful with Pair-of-Registers instructions on Apple Silicon

Daniel Lemire's blog

Egor Bogatov is an engineer working on C# compiler technology at Microsoft. He had an intriguing remark about a performance regression on Apple hardware following what appears to be an optimization. The .NET 9.0 runtime introduced an optimization where two loads (ldr) can be combined into a single load (ldp). It is a typical peephole optimization. Yet it made things much slower in some cases.


Under ARM, the ldr instruction loads a single value from memory into a single register. Its assembly syntax is straightforward: ldr Rd, [Rn, #offset]. The ldp instruction (Load Pair of Registers) loads two consecutive values from memory into two registers with one instruction. Its assembly syntax is similar, but there are two destination registers: ldp Rd1, Rd2, [Rn, #offset]. The two values are either both 32-bit or both 64-bit.
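As a rough illustration of these semantics, here is a portable C model (not real ARM code) of what each instruction reads in the 32-bit case. The names model_ldr_w, model_ldp_w, and pair32 are made up for this sketch, and it ignores alignment rules and the limits on how offsets are encoded.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Model of `ldr Wd, [Xn, #offset]`: one 32-bit load from base + offset. */
static uint32_t model_ldr_w(const unsigned char *base, size_t offset) {
    uint32_t value;
    memcpy(&value, base + offset, sizeof value);
    return value;
}

/* Hypothetical type holding the two destination registers of ldp. */
typedef struct {
    uint32_t first;
    uint32_t second;
} pair32;

/* Model of `ldp Wd1, Wd2, [Xn, #offset]`: two adjacent 32-bit loads,
   the second one 4 bytes after the first. */
static pair32 model_ldp_w(const unsigned char *base, size_t offset) {
    pair32 p;
    p.first = model_ldr_w(base, offset);
    p.second = model_ldr_w(base, offset + 4);
    return p;
}
```

In this model, one call to model_ldp_w returns the same two values as two calls to model_ldr_w at offset and offset + 4, which is why merging the two loads looks like a safe peephole optimization.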

Given a choice, it seems that you should prefer the ldp instruction. After all, it is a single instruction. But there is a catch on Apple silicon: if you are loading data from memory that was just written to, there might be a significant penalty to ldp.

To illustrate, let us consider the case where we repeatedly load two values and write them back, using two loads and two stores:

for (int i = 0; i < 1000000000; i++) {
  int tmp1, tmp2;
  __asm__ volatile("ldr %w0, [%2]\n"
                   "ldr %w1, [%2, #4]\n"
                   "str %w0, [%2]\n"
                   "str %w1, [%2, #4]\n"
    : "=&r"(tmp1), "=&r"(tmp2) : "r"(ptr) : "memory");
}

Next, let us consider an optimized approach where we combine the two loads into a single one:

for (int i = 0; i < 1000000000; i++) {
  int tmp1, tmp2;
  __asm__ volatile("ldp %w0, %w1, [%2]\n"
                   "str %w0, [%2]\n"
                   "str %w1, [%2, #4]\n"
    : "=&r"(tmp1), "=&r"(tmp2) : "r"(ptr) : "memory");
}

It would be surprising if this new version were slower, but it can be. The code for the benchmark is available. I benchmarked both on AWS using Amazon's Graviton 3 processors, and on Apple M2. Your results will vary.
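For readers who want to try this themselves, the following is a minimal timing sketch. It is not the author's benchmark code: the names loop_two_ldr, loop_one_ldp, and seconds_for are hypothetical, and it uses the standard clock() function, which reports CPU time at a coarse resolution. On non-AArch64 targets it falls back to plain volatile accesses so that it still compiles, but that fallback does not exercise ldr or ldp at all.

```c
#include <stdint.h>
#include <time.h>

/* Two separate loads followed by two stores to the same addresses. */
static void loop_two_ldr(int *ptr, uint64_t iterations) {
    for (uint64_t i = 0; i < iterations; i++) {
#if defined(__aarch64__)
        int tmp1, tmp2;
        __asm__ volatile("ldr %w0, [%2]\n"
                         "ldr %w1, [%2, #4]\n"
                         "str %w0, [%2]\n"
                         "str %w1, [%2, #4]\n"
                         : "=&r"(tmp1), "=&r"(tmp2) : "r"(ptr) : "memory");
#else
        /* Fallback for other targets: same data flow, no ldr/ldp distinction. */
        volatile int *p = ptr;
        int tmp1 = p[0], tmp2 = p[1];
        p[0] = tmp1;
        p[1] = tmp2;
#endif
    }
}

/* One pair load followed by the same two stores. */
static void loop_one_ldp(int *ptr, uint64_t iterations) {
    for (uint64_t i = 0; i < iterations; i++) {
#if defined(__aarch64__)
        int tmp1, tmp2;
        __asm__ volatile("ldp %w0, %w1, [%2]\n"
                         "str %w0, [%2]\n"
                         "str %w1, [%2, #4]\n"
                         : "=&r"(tmp1), "=&r"(tmp2) : "r"(ptr) : "memory");
#else
        /* Fallback for other targets: same data flow, no ldr/ldp distinction. */
        volatile int *p = ptr;
        int tmp1 = p[0], tmp2 = p[1];
        p[0] = tmp1;
        p[1] = tmp2;
#endif
    }
}

typedef void (*loop_fn)(int *, uint64_t);

/* Returns the CPU time, in seconds, spent running fn on ptr. */
static double seconds_for(loop_fn fn, int *ptr, uint64_t iterations) {
    clock_t start = clock();
    fn(ptr, iterations);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

With an int buf[2], comparing seconds_for(loop_two_ldr, buf, 1000000000) against seconds_for(loop_one_ldp, buf, 1000000000) on each machine gives a rough idea of the gap. Repeating each measurement several times is advisable since a single run can be noisy.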

I have no particular insight as to why this might be, but my guess is that Apple Silicon has a store-to-load forwarding optimization that does not work with Pair-of-Registers loads and stores.
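One possible way to probe this guess is a control loop in which the pair load reads from memory that the loop never writes, so that no store-to-load forwarding should be involved. The sketch below is an assumed design, not part of the original benchmark: loop_ldp_no_forwarding, src, and dst are hypothetical names, the two buffers must not overlap, and no claim is made here about what it would measure. As before, the non-AArch64 branch exists only so that the code compiles elsewhere.

```c
#include <stdint.h>

/* Pair load from src, stores to a different location dst.
   src is never written inside the loop. */
static void loop_ldp_no_forwarding(const int *src, int *dst,
                                   uint64_t iterations) {
    for (uint64_t i = 0; i < iterations; i++) {
#if defined(__aarch64__)
        int tmp1, tmp2;
        __asm__ volatile("ldp %w0, %w1, [%2]\n"
                         "str %w0, [%3]\n"
                         "str %w1, [%3, #4]\n"
                         : "=&r"(tmp1), "=&r"(tmp2)
                         : "r"(src), "r"(dst)
                         : "memory");
#else
        /* Fallback for other targets: same data flow, no ldp involved. */
        const volatile int *s = src;
        volatile int *d = dst;
        int tmp1 = s[0], tmp2 = s[1];
        d[0] = tmp1;
        d[1] = tmp2;
#endif
    }
}
```

If the ldp penalty shrinks or disappears in such a control, that would be consistent with the forwarding hypothesis, though it would not prove it.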

There is an Apple Silicon CPU Optimization Guide which might provide better insight.
