How CCMP Reduces the Pressure on the Branch Predictor on aarch64
CYY's Own World (陈泱宇, Yangyu Chen)

Preface
When comparing branch MPKI (misses per kilo-instructions) on aarch64 against other architectures such as RISC-V (including RVA23) or x86-64 (without APX), we often observe that certain unpredictable branches are eliminated on aarch64.
This is because aarch64 offers numerous conditional-execution instructions that significantly reduce the number of branch instructions that need to be executed. On other architectures like RISC-V, even with the Bitmanip or Zicond extensions, eliminating the same branch typically requires inserting 3 or more ALU instructions before the branch to compute the combined condition. This increases the number of instructions that need to be executed and can even hurt performance when the branch is easy to predict.
This article delves into the design of conditional execution on aarch64 and its application to the 429.mcf workload of the SPECINT 2006 benchmark.
ARM Conditional Execution
Unlike some RISC architectures such as RISC-V or MIPS, which have no flags register, aarch64 keeps condition flags (the N, Z, C, V bits, held in PSTATE; the AArch32 analogue is the CPSR, the Current Program Status Register) in the processor state. Instructions like cmp set these flags, much as on x86, and branch instructions consume them: beq, for example, branches when the Z flag is set.
The aarch64 ISA also provides the ccmp instruction for evaluating cascaded conditions. Its format is CCMP <Wn|Xn>, #<imm>, #<nzcv>, <cond> (immediate form) or CCMP <Wn|Xn>, <Wm|Xm>, #<nzcv>, <cond> (register form).
The <cond> parameter is a condition such as eq or ne, which tests the current condition flags.
The #<nzcv> parameter is a 4-bit immediate that is written directly to the N, Z, C and V flags when the condition specified by <cond> is not met.
The register version of ccmp works as follows:
if (cond) {
    NZCV = Compare(<Wn>, <Wm>);
}
else {
    NZCV = #<nzcv>;
}

We can use cmp and ccmp together to build cascading conditions, much like chaining the && and || operators in boolean expressions.
Here is an example:
int foo(int a, int b, int c, int d) {
    if (a == b && c == d) return 1;
    else return 2;
}

Compiled on aarch64 with GCC 14.2.0 at -O3:
foo:
cmp w0, w1
ccmp w2, w3, 0, eq
mov w3, 2
cset w0, eq
sub w0, w3, w0
ret

But on x86-64 (without APX), we get:
foo:
cmp edi, esi
sete al
cmp edx, ecx
sete dl
movzx edx, dl
and edx, eax
mov eax, 2
sub eax, edx
ret

And on RISC-V (RVA23, i.e. with the B and Zicond extensions), the compiler even splits the basic block:
foo:
bne a0,a1,.L3
sub a2,a2,a3
snez a0,a2
addi a0,a0,1
ret
.L3:
li a0,2
ret

SPECINT 2006 429.mcf
CCMP helps the SPECINT 2006 429.mcf benchmark on aarch64 processors.
In mcf, we have a function like this:
int bea_is_dual_infeasible( arc_t *arc, cost_t red_cost )
{
return( (red_cost < 0 && arc->ident == AT_LOWER)
|| (red_cost > 0 && arc->ident == AT_UPPER) );
}

This function is then inlined into an if condition, but let's analyze the function itself.
On aarch64 with GCC 14.2.0 and -O3 -fverbose-asm -S, we see instructions like:
.L34:
// pbeampp.c:42: || (red_cost > 0 && arc->ident == AT_UPPER) );
ccmp w5, 2, 0, ne // _21,,,
beq .L35 //,
.L33:
// pbeampp.c:165: for( ; arc < stop_arcs; arc += nr_group )
add x0, x0, x8 // arc, arc, _34
// pbeampp.c:165: for( ; arc < stop_arcs; arc += nr_group )
cmp x2, x0 // stop_arcs, arc
bls .L32 //,

However, on RISC-V, things go differently:
.L35:
# pbeampp.c:42: || (red_cost > 0 && arc->ident == AT_UPPER) );
beq a4,zero,.L34 #, red_cost,,
# pbeampp.c:42: || (red_cost > 0 && arc->ident == AT_UPPER) );
beq a0,a3,.L36 #, _21, tmp268,
.L34:
# pbeampp.c:165: for( ; arc < stop_arcs; arc += nr_group )
add a5,a5,t1 # _34, arc, arc
# pbeampp.c:165: for( ; arc < stop_arcs; arc += nr_group )
bleu a2,a5,.L33 #, stop_arcs, arc,

However, when red_cost is hard to predict, these instructions can cause significantly more branch mispredictions, each of which flushes the entire CPU pipeline, even though the branch only serves to skip a condition that does not need to be checked.
A better RISC-V sequence might be (using Zicond, and assuming a3 is a free register):
.L35:
# pbeampp.c:42: || (red_cost > 0 && arc->ident == AT_UPPER) );
xor a3,a0,t6 #, tmp268, _21,
seqz a3,a3
# pbeampp.c:42: || (red_cost > 0 && arc->ident == AT_UPPER) );
czero.eqz a3,a3,a4 #,,red_cost,
bnez a3,.L36
.L34:
# pbeampp.c:165: for( ; arc < stop_arcs; arc += nr_group )
add a5,a5,t1 # _34, arc, arc
# pbeampp.c:165: for( ; arc < stop_arcs; arc += nr_group )
bleu a2,a5,.L33 #, stop_arcs, arc,

Performance Evaluation
To emulate the two-branch code pattern on an Apple M1 CPU, I patched the aarch64 assembly like this:
--- pbeampp.s
+++ pbeampp.s
@@ -269,6 +269,7 @@
.p2align 2,,3
.L34:
// pbeampp.c:42: || (red_cost > 0 && arc->ident == AT_UPPER) );
+ beq .L33
ccmp w5, 2, 0, ne // _21,,,
beq .L35 //,
.L33:

Then I ran the benchmark on the Apple M1 and saw:
Original version:
Performance counter stats for 'numactl --physcpubind 4-7 ./mcf /home/cyy/spec06/benchspec/CPU2006/429.mcf/data/ref/input/inp.in mcf.out':
2,024,810,367 apple_firestorm_pmu/branch_mispred_nonspec:u/ (100.00%)
63,058,938,380 apple_firestorm_pmu/inst_branch:u/ (100.00%)
2,024,807,836 apple_firestorm_pmu/branch_cond_mispred_nonspec:u/ (100.00%)
288,346,808,544 apple_firestorm_pmu/instructions:u/ # 0.81 insn per cycle (100.00%)
355,818,120,265 apple_firestorm_pmu/cycles:u/ (100.00%)
112.009282105 seconds time elapsed
111.437283000 seconds user
0.073728000 seconds sys

Patched two-branch version:
Performance counter stats for 'numactl --physcpubind 4-7 ./mcf /home/cyy/spec06/benchspec/CPU2006/429.mcf/data/ref/input/inp.in mcf.out':
3,449,189,303 apple_firestorm_pmu/branch_mispred_nonspec:u/ (100.00%)
67,591,263,076 apple_firestorm_pmu/inst_branch:u/ (100.00%)
3,449,186,323 apple_firestorm_pmu/branch_cond_mispred_nonspec:u/ (100.00%)
291,092,191,936 apple_firestorm_pmu/instructions:u/ # 0.76 insn per cycle (100.00%)
381,611,914,976 apple_firestorm_pmu/cycles:u/ (100.00%)
120.090912907 seconds time elapsed
119.539129000 seconds user
0.056814000 seconds sys

The patched two-branch version incurred about 70% more branch-mispredict events (2.02e9 → 3.45e9) and a roughly 7.3% longer running time (111.44 s → 119.54 s).
So, might RISC-V need an extension like a hypothetical Zccmp?
About other ISAs
Intel also introduced a CCMP instruction for x86 in the APX extensions announced in 2023. However, no shipping chip supports this extension yet. Nevertheless, we can already observe Intel's efforts to support CCMP in GCC and LLVM.
See also
The AArch64 processor (aka arm64), part 16: Conditional execution – The Old New Thing
Condition Codes 1: Condition Flags and Codes