How CCMP Reduces the Pressure on the Branch Predictor on aarch64
CYY's Own World (陈泱宇, Yangyu Chen)

Preface
When comparing branch MPKI (misses per kilo-instructions) on aarch64 against other architectures such as RISC-V (including RVA23) or x86-64 (without APX), we often observe that certain unpredictable branches are eliminated on aarch64.
This is because aarch64 offers numerous conditional-execution instructions that significantly reduce the number of branch instructions that need to be executed. On other architectures like RISC-V, even with the Bitmanip or Zicond extensions, eliminating the same branch typically requires inserting 3 or more ALU instructions before the branch to compute the combined condition. This increases the number of instructions that need to be executed and can even hurt performance when the branch is easy to predict.
This article delves into the design of conditional execution on aarch64 and its application to the 429.mcf workload of the SPECINT 2006 benchmark.
ARM Conditional Execution
Unlike some RISC architectures such as RISC-V or MIPS, which have no flags register, aarch64 keeps condition flags (the N, Z, C, V bits, held in PSTATE; the AArch32 analogue is the CPSR, the Current Program Status Register) in the processor state. Instructions like cmp set these flags, much as on x86, and branch instructions consume them: beq, for example, branches when the Z flag is set.
The aarch64 ISA also provides the ccmp instruction for evaluating cascaded conditions. Its format is CCMP <Wn|Xn>, #<imm>, #<nzcv>, <cond> (immediate form) or CCMP <Wn|Xn>, <Wm|Xm>, #<nzcv>, <cond> (register form).
The <cond> parameter is a condition such as eq or ne, which tests the current condition flags.
The #<nzcv> parameter is a 4-bit immediate that is written directly to the N, Z, C and V flags when the condition specified by <cond> is not met.
The register version of ccmp works as follows:
if (cond) {
    NZCV = Compare(<Wn>, <Wm>);
}
else {
    NZCV = #<nzcv>;
}

We can use cmp and ccmp together to build cascading conditions, much like chaining the && and || operators in boolean expressions.
Here is an example:
int foo(int a, int b, int c, int d) {
    if (a == b && c == d) return 1;
    else return 2;
}

Compiled on aarch64 with GCC 14.2.0 at -O3:
foo:
cmp w0, w1
ccmp w2, w3, 0, eq
mov w3, 2
cset w0, eq
sub w0, w3, w0
ret

But on x86-64 (without APX), we get:
foo:
cmp edi, esi
sete al
cmp edx, ecx
sete dl
movzx edx, dl
and edx, eax
mov eax, 2
sub eax, edx
ret

And on RISC-V (RVA23, i.e. with the B and Zicond extensions), the compiler even splits the basic block:
foo:
bne a0,a1,.L3
sub a2,a2,a3
snez a0,a2
addi a0,a0,1
ret
.L3:
li a0,2
ret

SPECINT 2006 429.mcf
CCMP helps the SPECINT 2006 429.mcf benchmark on aarch64 processors.
In mcf, we have a function like this:
int bea_is_dual_infeasible( arc_t *arc, cost_t red_cost )
{
return( (red_cost < 0 && arc->ident == AT_LOWER)
|| (red_cost > 0 && arc->ident == AT_UPPER) );
}

This function is then inlined into an if condition, but let's analyze the function itself.
On aarch64 with GCC 14.2.0 and -O3 -fverbose-asm -S, we see instructions like:
.L34:
// pbeampp.c:42: || (red_cost > 0 && arc->ident == AT_UPPER) );
ccmp w5, 2, 0, ne // _21,,,
beq .L35 //,
.L33:
// pbeampp.c:165: for( ; arc < stop_arcs; arc += nr_group )
add x0, x0, x8 // arc, arc, _34
// pbeampp.c:165: for( ; arc < stop_arcs; arc += nr_group )
cmp x2, x0 // stop_arcs, arc
bls .L32 //,

However, on RISC-V, things go differently:
.L35:
# pbeampp.c:42: || (red_cost > 0 && arc->ident == AT_UPPER) );
beq a4,zero,.L34 #, red_cost,,
# pbeampp.c:42: || (red_cost > 0 && arc->ident == AT_UPPER) );
beq a0,a3,.L36 #, _21, tmp268,
.L34:
# pbeampp.c:165: for( ; arc < stop_arcs; arc += nr_group )
add a5,a5,t1 # _34, arc, arc
# pbeampp.c:165: for( ; arc < stop_arcs; arc += nr_group )
bleu a2,a5,.L33 #, stop_arcs, arc,

However, when red_cost is hard to predict, these instructions can cause significantly more branch mispredictions, each of which flushes the entire CPU pipeline, even though the branch only serves to skip a condition that does not need to be checked.
A better RISC-V sequence might be (using Zicond, and assuming a3 is a free register):
.L35:
# pbeampp.c:42: || (red_cost > 0 && arc->ident == AT_UPPER) );
xor a3,a0,t6 #, tmp268, _21,
seqz a3,a3
# pbeampp.c:42: || (red_cost > 0 && arc->ident == AT_UPPER) );
czero.eqz a3,a3,a4 #,,red_cost,
bnez a3,.L36
.L34:
# pbeampp.c:165: for( ; arc < stop_arcs; arc += nr_group )
add a5,a5,t1 # _34, arc, arc
# pbeampp.c:165: for( ; arc < stop_arcs; arc += nr_group )
bleu a2,a5,.L33 #, stop_arcs, arc,

Performance Evaluation
To emulate the two-branch code pattern on an Apple M1 CPU, I patched the aarch64 assembly like this:
--- pbeampp.s
+++ pbeampp.s
@@ -269,6 +269,7 @@
.p2align 2,,3
.L34:
// pbeampp.c:42: || (red_cost > 0 && arc->ident == AT_UPPER) );
+ beq .L33
ccmp w5, 2, 0, ne // _21,,,
beq .L35 //,
.L33:

Then I ran the benchmark on the Apple M1 and saw:
Original version:
Performance counter stats for 'numactl --physcpubind 4-7 ./mcf /home/cyy/spec06/benchspec/CPU2006/429.mcf/data/ref/input/inp.in mcf.out':
2,024,810,367 apple_firestorm_pmu/branch_mispred_nonspec:u/ (100.00%)
63,058,938,380 apple_firestorm_pmu/inst_branch:u/ (100.00%)
2,024,807,836 apple_firestorm_pmu/branch_cond_mispred_nonspec:u/ (100.00%)
288,346,808,544 apple_firestorm_pmu/instructions:u/ # 0.81 insn per cycle (100.00%)
355,818,120,265 apple_firestorm_pmu/cycles:u/ (100.00%)
112.009282105 seconds time elapsed
111.437283000 seconds user
0.073728000 seconds sys

Patched two-branch version:
Performance counter stats for 'numactl --physcpubind 4-7 ./mcf /home/cyy/spec06/benchspec/CPU2006/429.mcf/data/ref/input/inp.in mcf.out':
3,449,189,303 apple_firestorm_pmu/branch_mispred_nonspec:u/ (100.00%)
67,591,263,076 apple_firestorm_pmu/inst_branch:u/ (100.00%)
3,449,186,323 apple_firestorm_pmu/branch_cond_mispred_nonspec:u/ (100.00%)
291,092,191,936 apple_firestorm_pmu/instructions:u/ # 0.76 insn per cycle (100.00%)
381,611,914,976 apple_firestorm_pmu/cycles:u/ (100.00%)
120.090912907 seconds time elapsed
119.539129000 seconds user
0.056814000 seconds sys

The patched two-branch version incurred about 70% more branch-mispredict events (2.02e9 → 3.45e9) and a roughly 7.3% longer running time (111.44 s → 119.54 s).
So, might RISC-V need an extension like a hypothetical Zccmp?
About other ISAs
Intel also introduced a CCMP instruction for x86 in the APX extensions announced in 2023. However, no shipping chip supports this extension yet. Nevertheless, we can already observe Intel's efforts to support CCMP in GCC and LLVM.
See also
The AArch64 processor (aka arm64), part 16: Conditional execution – The Old New Thing
Condition Codes 1: Condition Flags and Codes