I was skimming through Computer Architecture: A Quantitative Approach a few days ago, specifically the 6th edition’s chapter on the Intel Core i7 6700, when I noticed this part:

Macro-op fusion? This was the first I’d read of it. The textbook gives a good, brief overview of the topic, but I decided to spend a few days going through it in more detail. As a result, I’d like to share some of the details of macro-op and micro-op fusion in modern processors, and then test for them on my M4 Mac.
Brief overview of Instruction Decoding
All modern, high-performance, out-of-order CPUs internally translate ISA-defined instructions into internal, fixed-length operations referred to as micro-ops (uops). This happens for a couple of reasons.
For one, it simplifies the execution engine. On CISC ISAs like x86, the instruction set has grown to over 2,000 instructions with varying lengths (1 to 15 bytes). Building an execution engine that handles all of these natively would be impractical. Instead, the front-end decoder breaks each architectural instruction (“macro-op”) into one or more simpler, fixed-width micro-ops that the backend can schedule uniformly. For example, an x86 instruction like:
```asm
add [rax+rcx*4+8], rdx
```

looks like a single instruction, but it actually does three things: compute an address, load from memory, and add. The decoder cracks this into separate uops: a load uop, an ALU uop, and a store uop. Each can be independently scheduled and dispatched to different execution units.
The second reason is scheduling. By breaking complex instructions into independent uops, the out-of-order engine can overlap their execution with other work. That add [mem], reg above: the load uop can issue as soon as the address is ready, the ALU uop fires when the load completes, and the store uop writes back when the ALU finishes. Meanwhile other instructions can execute in the gaps. This wouldn’t be possible if the processor treated the whole thing as one monolithic operation.
These uops are what actually flow through the out-of-order pipeline: they are renamed, scheduled, dispatched to execution ports, and retired. The original macro-ops are just a front-end concern.
Modern CPUs also cache these translations. AMD’s Zen 5 processors (2024) have a 2x6-wide op cache that sits before the decode stage. On a cache hit, the processor bypasses decoding entirely, feeding uops straight to the rename/dispatch stage. Intel has a similar structure (the “uop cache” or DSB, introduced with Sandy Bridge). This matters because decode is typically a bottleneck: on x86, the decoders can handle only 4-6 instructions per cycle, while the op cache can deliver 6-8 uops per cycle.
ARM processors like Apple Silicon also decode into internal micro-ops, though the process is simpler since ARM instructions are fixed-width (4 bytes). Apple’s M4 performance cores have a massive decode width of up to 10 instructions per cycle.
Macro-op fusion
With that background, macro-op fusion refers to the ability to combine multiple macro-ops (i.e. regular architectural instructions) into a single uop during decode. Since certain instruction patterns appear so frequently, it’s worth adding hardware to recognize and merge them.
The canonical pattern is a comparison followed by a conditional branch. This shows up at the end of virtually every loop:
```asm
; x86
cmp rax, rbx   ; sets flags
jne .loop      ; reads flags, branches

; ARM
cmp x0, x1     ; sets flags
b.ne .loop     ; reads flags, branches
```

These are two separate instructions, but semantically they form a single decision: “compare and branch if not equal.” If the decoder can fuse them into one uop, that uop occupies a single slot through the entire pipeline, which effectively increases the pipeline’s throughput without physically widening it.
x86: History and current state
Macro-op fusion first appeared in Intel’s Core 2 (2006). Initially it was limited: only CMP or TEST followed by a conditional jump, and only with certain condition codes (e.g., JE, JNE, JL, JGE). Later microarchitectures expanded this:
| Generation | Fusion support |
|---|---|
| Core 2 (2006) | CMP/TEST + Jcc (limited conditions) |
| Nehalem (2008) | CMP/TEST + all Jcc conditions |
| Sandy Bridge (2011) | + ADD/SUB/AND/INC/DEC + Jcc |
| Haswell (2013) | + TEST with immediate operands |
| Modern (Zen 4/5, Golden Cove+) | CMP, TEST, ADD, SUB, AND, OR, XOR, INC, DEC + all Jcc |
AMD followed a similar trajectory, adding macro-op fusion in Bulldozer (2011) and progressively expanding it through the Zen generations.
The practical effect is significant. In a tight loop like:
```asm
.loop:
    dec ecx     ; ecx -= 1, sets flags
    jnz .loop   ; branch if ecx != 0
```

Without fusion, this is 2 uops per iteration. With fusion, it’s 1. On a 6-wide machine, that means the fused loop overhead takes 1/6th of the pipeline bandwidth instead of 2/6ths.
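This decrement-and-branch shape is what compilers emit for counted loops. As a rough illustration (the function name is mine), a countdown loop like the following commonly compiles at -O2 to a dec/jnz pair on x86, or a subs/b.ne pair on ARM64:

```c
#include <stdint.h>

// Sums 0..n-1 with a countdown loop; the loop-closing test is the
// compare-and-branch pattern that macro-op fusion targets.
uint32_t sum_n(uint32_t n) {
    uint32_t s = 0;
    while (n--)   // typically becomes dec/jnz (x86) or subs/b.ne (ARM64)
        s += n;
    return s;
}
```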
ARM: A slightly different story
ARM’s fixed-width instruction encoding makes the decode stage simpler, but macro-op fusion is still valuable. The same compare-and-branch pattern is just as common:
```asm
cmp x0, #0
b.eq .done
```

ARM also has single-instruction alternatives like CBZ (Compare and Branch if Zero) and TBZ (Test Bit and Branch if Zero) that accomplish the same thing without needing fusion. However, these cover a limited set of conditions, so the two-instruction CMP/CMN/TST + B.cond pattern remains common.
Apple’s CPU cores are particularly aggressive here. The M-series chips (M1 through M4) perform macro-op fusion in the decode stage, and Apple’s 10-wide P-core decoder has dedicated fusion logic. LLVM’s AArch64 backend has a MacroFusion pass that understands which pairs Apple’s cores can fuse, and it will reorder instructions to keep fusable pairs adjacent.
Testing for Macro-op fusion on Apple M4
Reading about fusion is one thing, but I wanted to actually measure it. I decided to test this on my M4 MacBook Air by doing the following: run a tight loop of instruction pairs that should fuse, compare their throughput against a baseline of instructions that shouldn’t, and see if the fused pairs are 2x faster.
Methodology
The test program runs each instruction pattern in a tight loop with 50x unrolling (to minimize loop overhead) across 500 million iterations, measuring wall-clock time via mach_absolute_time. Since Apple locked down the hardware performance counters (kpc/PMC) on M4 and macOS Sequoia, even for root, I estimate CPU cycles by calibrating the frequency first: a dependent chain of 1 billion ADD instructions (1-cycle latency each, IPC=1 by construction) gives me cycles/ns.
The test runs on both P-cores (via QOS_CLASS_USER_INTERACTIVE) and E-cores (via QOS_CLASS_BACKGROUND) to compare fusion behavior across core types.
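The calibration idea can be sketched in portable C. This stand-in uses clock_gettime instead of mach_absolute_time, and a volatile C loop instead of the real harness’s pure-register asm add chain, so it underestimates the true frequency; it only illustrates the cycles-per-nanosecond arithmetic:

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

// Run n serially dependent adds and divide by elapsed wall time.
// If each add really took exactly 1 cycle, the result would be cycles/ns (GHz).
double estimate_cycles_per_ns(uint64_t n) {
    volatile uint64_t x = 0;   // volatile keeps the loop from being optimized out
    uint64_t start = now_ns();
    for (uint64_t i = 0; i < n; i++)
        x = x + 1;             // serial dependency: one add per "cycle slot"
    uint64_t elapsed = now_ns() - start;
    (void)x;
    return elapsed ? (double)n / (double)elapsed : 0.0;
}
```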
P-core results
The P-core runs at ~4.4 GHz. The baseline is two independent ADD instructions per pair, which gives me 1.00 cycles/pair: two instructions consuming two dispatch slots as expected.
Fusion with xzr (zero register) operands:
| Instruction Pair | Cyc/pair | vs Baseline | Fuses? |
|---|---|---|---|
| ADD x2 (baseline) | 1.000 | 1.00x | — |
| CMP + B.NE | 0.506 | 0.51x | Yes |
| CMN + B.NE | 0.507 | 0.51x | Yes |
| TST + B.NE | 0.507 | 0.51x | Yes |
| ADDS + B.NE | 0.506 | 0.51x | Yes |
| SUBS + B.NE | 0.506 | 0.51x | Yes |
| ANDS + B.NE | 0.507 | 0.51x | Yes |
| BICS + B.NE | 0.507 | 0.51x | Yes |
Every flag-setting instruction fuses with every conditional branch at exactly half the baseline throughput. Two instructions, one dispatch slot.
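Concretely, one unrolled block of the measured loop looks roughly like this (a sketch of my harness; the label layout is illustrative). Since xzr always equals xzr, the branch is never taken:

```asm
cmp  xzr, xzr    ; always equal -> Z flag set
b.ne 1f          ; never taken; candidate for fusion with the cmp
cmp  xzr, xzr
b.ne 1f
; ... pair repeated 50x per loop iteration ...
1:
```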
Operand type doesn’t matter on the P-core:
| Variant | Cyc/pair |
|---|---|
| CMP xzr, xzr + B.NE | 0.506 |
| CMP x9, x10 + B.NE (register) | 0.506 |
| CMP x9, #0 + B.NE (immediate) | 0.507 |
| ADDS x11, x9, x10 + B.NE (reg, writes dest) | 0.506 |
| ADDS x11, x9, #0 + B.NE (imm, writes dest) | 0.506 |
| SUBS x11, x9, x10, lsl #0 + B.NE (shifted) | 0.506 |
All 0.506. The P-core’s fusion logic doesn’t care about encoding format, operand source, or whether the flag-setter also writes a destination register. If it sets flags and the next instruction is a conditional branch, it fuses.
What breaks fusion:
| Test | Cyc/pair | Fuses? |
|---|---|---|
| CMP + B.NE (adjacent) | 0.506 | Yes |
| CMP + NOP + B.NE | 0.505 | Yes (!) |
| CMP + ADD + B.NE | 0.992 | No |
| CMP + B.NE (taken) | 0.849 | Different bottleneck |
Two things to note:
First, inserting a NOP between CMP and B.NE does not break fusion on the P-core. This means the P-core eliminates NOPs at the rename stage before fusion happens, so the CMP and B.NE end up adjacent in the uop stream anyway.
Second, taken branches are slower (~0.85 cyc) regardless of fusion. This is a separate throughput constraint: the branch execution unit has a limited taken-branch bandwidth. It’s not that fusion fails; it’s that correctly-predicted taken branches still cost more than fall-through.
E-core results
The E-core runs at ~1.3 GHz with a narrower pipeline. The initial results here looked like ADDS/SUBS with a destination register “didn’t fuse”, but further investigation revealed a more nuanced picture involving dependency chains and instruction cracking.
Compare-only instructions fuse cleanly:
| Instruction Pair | Cyc/pair | vs Baseline | Fuses? |
|---|---|---|---|
| ADD x2 (baseline) | 0.982 | 1.00x | — |
| CMP + B.NE | 0.510 | 0.52x | Yes |
| CMN + B.NE | 0.494 | 0.50x | Yes |
| TST + B.NE | 0.538 | 0.55x | Yes |
| CMP reg,reg + B.NE | 0.536 | 0.55x | Yes |
| CMP imm + B.NE | 0.510 | 0.52x | Yes |
Flag-setting ALU ops with a destination register are slower, but why?
My initial tests showed ADDS x9 + B.NE at ~0.68 cyc/pair, between fused (0.5) and unfused (1.0). I initially attributed this to the E-core refusing to fuse instructions that write a destination register. But fusion is binary: a pair either fuses or it doesn’t. An intermediate value demands a better explanation.
An isolation test revealed two confounds in the original measurement:
- Dependency chains. ADDS x9, x9, #0 followed by another ADDS x9, x9, #0 creates a serial dependency: each iteration reads x9, which the previous iteration wrote. This serializes execution regardless of fusion. When I changed the destination to a different register (ADDS x11, x9, #0 + B.NE, no dependency chain), the P-core result dropped to 0.506, identical to fused CMP + B.NE. On the E-core, it dropped from 0.87 to 0.65.
- Instruction cracking. Even with the dependency chain removed, ADDS x11 + B.NE on the E-core still measured 0.65 cyc/pair, slower than CMP + B.NE at 0.49. The standalone throughput data explains why: ADDS with a real destination register is intrinsically slower on the E-core (0.54 cyc standalone) than CMP (0.34 cyc standalone). This suggests the E-core cracks ADDS into 2 internal uops (one for the GPR result, one for flags) when it writes a real register, while CMP is a single uop. The throughput difference in the paired test comes from this cracking overhead, not from a failure to fuse.
| E-core standalone throughput | Cyc/instr |
|---|---|
| CMP (flags only) | 0.34 |
| TST (flags only) | 0.34 |
| ADDS xzr (= CMP, no dest) | 0.38 |
| ADDS x9 (flags + dest, no dep chain) | 0.54 |
| ADD x9 (dest only, no flags) | 1.00 |
The P-core shows no such cracking. ADDS with or without a destination register has the same throughput when dependencies are removed.
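The two loop bodies differ only in the ADDS destination register (sketch; the branch target is illustrative):

```asm
; latency-bound: each ADDS must wait for the previous iteration's x9
adds x9, x9, #0      ; reads and writes x9 -> serial chain
b.ne 1f

; throughput-bound: destination differs from the sources, no chain
adds x11, x9, #0     ; x9 never changes across iterations
b.ne 1f
1:
```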
Experimental note on confounds: My first round of register-operand tests used uninitialized registers, which meant x9 and x10 could be unequal, making B.NE a taken branch while the xzr tests always produced not-taken branches. Taken branches are slower due to separate throughput constraints, so the results looked like “partial fusion” when they were partly measuring taken-branch overhead. Zeroing the registers and controlling for dependency chains was necessary to get clean data.
Taken vs not-taken
Both core types show a clear split based on branch direction:
P-core:
| Branch direction | Cyc/pair |
|---|---|
| CMP + B.NE (not taken) | 0.506 |
| CMP + B.EQ (not taken) | 0.509 |
| CMP + B.LT (not taken) | 0.506 |
| CMP + B.HI (not taken) | 0.506 |
| CMP + B.NE (taken) | 0.849 |
| CMP + B.GE (taken) | 0.846 |
| CMP + B.LS (taken) | 0.724 |
E-core:
| Branch direction | Cyc/pair |
|---|---|
| Not-taken (all conditions) | ~0.50 |
| Taken (all conditions) | ~1.05-1.12 |
Not-taken branches fuse and execute at maximum throughput. Taken branches incur additional cost even when correctly predicted. This is a fetch/redirect penalty, not a fusion failure. On the P-core, the penalty is moderate (~0.85 vs 0.50). On the E-core, taken branches are essentially unfused speed (~1.0+), suggesting the E-core may not fuse taken branches at all, or that its branch throughput limit is tighter.
Micro-op fusion
Micro-op fusion operates at a different level. Where macro-op fusion merges two architectural instructions into one uop, micro-op fusion deals with a single architectural instruction that internally decodes into multiple micro-ops.
x86: The original motivation
On x86, micro-op fusion was introduced by Intel with the Pentium M (2003). The motivating case is memory-operand instructions. Consider:
```asm
add eax, [rbx+rcx*4+16]
```

This single instruction does two things: load a value from memory, then add it to eax. Internally it decodes into 2 uops: a load uop and an ALU uop. Without micro-op fusion, these consume 2 slots in the reorder buffer (ROB), 2 slots during dispatch, and 2 slots in the retirement queue.
With micro-op fusion, the two uops are kept together as a single fused entry through the front-end: they share one ROB slot, one dispatch slot, and one retirement slot. They only split apart at the execution stage when they actually need different execution ports (load port vs ALU port). This effectively increases the ROB capacity and dispatch bandwidth without physically making them larger.
The rules for what can be micro-fused have evolved:
- Pentium M / Core: simple addressing modes (base, base+displacement)
- Sandy Bridge+: indexed addressing modes (base+index+displacement) can be micro-fused at decode but may be “unlaminated” before dispatch if the instruction has 3 or more inputs
- Haswell+: expanded unlamination rules
Unlamination
Unlamination is the reverse of micro-op fusion: a fused uop being split back into separate uops before dispatch. This happens on Intel CPUs when a micro-fused instruction has too many register inputs for the rename stage to handle in one slot. For example:
```asm
add rax, [rbx+rcx*4] ; 3 inputs: rax, rbx, rcx
```

This gets micro-fused at decode (1 fused uop in the uop cache), but the renamer splits it back into 2 uops because it needs 3 source registers. The fusion still helps by saving uop cache bandwidth, but the ROB/dispatch savings are lost.
Instructions with 2 or fewer register inputs stay fused all the way through:
```asm
add rax, [rbx+16] ; 2 inputs: rax, rbx (stays fused)
```

ARM / Apple Silicon: LDP and STP
ARM has its own version of this with load/store pair instructions. LDP (Load Pair) loads two 64-bit registers from consecutive memory locations in a single instruction:
```asm
ldp x0, x1, [sp] ; loads x0 from [sp], x1 from [sp+8]
```

Architecturally this is one instruction, but it performs two memory accesses and writes two registers. The hardware cracks it into two load micro-ops internally but, with micro-op fusion, keeps them fused through dispatch, occupying a single pipeline slot. They only split at execution to hit two separate load ports.
Testing micro-op fusion on Apple M4
I tested this by comparing the throughput of LDP/STP against equivalent pairs of LDR/STR instructions, all loading/storing from a hot L1 cache line:
P-core:
| Instruction | Cyc/pair | vs 2x Single |
|---|---|---|
| LDP (load pair) | 0.324 | 0.51x vs LDR x2 (0.641) |
| STP (store pair) | 0.465 | 0.50x vs STR x2 (0.933) |
E-core:
| Instruction | Cyc/pair | vs 2x Single |
|---|---|---|
| LDP (load pair) | 0.430 | 0.37x vs LDR x2 (1.157) |
| STP (store pair) | 1.054 | 0.55x vs STR x2 (1.907) |
LDP and STP consistently deliver ~2x the throughput of separate load/store pairs on both core types. The pair instruction goes through dispatch as a single fused uop, only splitting at the execution stage. The E-core shows an even more pronounced relative advantage because its narrower dispatch makes the saved slot proportionally more valuable.
This has direct implications for compiler output and hand-tuned code: LDP/STP should always be preferred over adjacent LDR/STR pairs. LLVM and GCC already do this aggressively in their AArch64 backends.
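As a quick illustration of that compiler behavior (a sketch; the type and function names are mine), a 16-byte struct copy like this is typically compiled to a single ldp/stp pair by clang and gcc on AArch64 at -O2:

```c
#include <stdint.h>

typedef struct { uint64_t lo, hi; } pair_t;

// On AArch64 at -O2 this copy usually becomes something like:
//   ldp x8, x9, [x1]
//   stp x8, x9, [x0]
void copy_pair(pair_t *dst, const pair_t *src) {
    *dst = *src;  // 16-byte copy of two adjacent 64-bit fields
}
```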
Other fusion and elimination on Apple M4
Beyond compare+branch fusion, Apple Silicon performs other instruction merging optimizations. Dougall Johnson’s reverse-engineering of the M1 (Firestorm/Icestorm) documented several of these. I tested whether they carry forward to the M4.
MOVZ + MOVK fusion
ARM64 can’t load a full 64-bit immediate in one instruction. Instead, constants are built up with MOVZ (load 16 bits, zero the rest) followed by one or more MOVK (patch in 16 bits, keep the rest):
```asm
movz x9, #0x1234          ; x9 = 0x0000000000001234
movk x9, #0x5678, lsl #16 ; x9 = 0x0000000056781234
```

This pattern is extremely common since every 32-bit+ constant needs it. If the CPU can fuse the pair into a single operation, it saves a dispatch slot every time a constant is materialized.
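For reference, a full 64-bit constant takes the maximum four-instruction sequence (the value here is chosen purely for illustration):

```asm
movz x9, #0x1234           ; bits  0-15 -> 0x0000000000001234
movk x9, #0x5678, lsl #16  ; bits 16-31 -> 0x0000000056781234
movk x9, #0x9abc, lsl #32  ; bits 32-47 -> 0x00009abc56781234
movk x9, #0xdef0, lsl #48  ; bits 48-63 -> 0xdef09abc56781234
```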
P-core results:
| Test | Cyc/blk | Cyc/instr |
|---|---|---|
| MOVZ standalone | 0.108 | 0.108 |
| MOVK standalone | 0.257 | 0.257 |
| MOV imm standalone | 0.109 | 0.109 |
| MOVZ + MOVK (same reg) | 0.216 | 0.108 |
| MOVZ + MOVK (diff reg) | 0.989 | 0.494 |
| MOVZ + NOP + MOVK | 0.313 | 0.104 |
| MOVZ + MOVK + MOVK (48-bit) | 0.330 | 0.110 |
MOVZ alone is essentially free (0.108 cyc, close to NOP at 0.101). It’s eliminated at rename, like a zero-idiom. The pair MOVZ + MOVK to the same register runs at 0.216 cyc/pair, which is 0.108 cyc/instr: the MOVZ is eliminated and the MOVK runs at its standalone throughput. This isn’t traditional fusion (merging two uops into one); it’s MOVZ elimination making the pair effectively a single instruction.
The three-instruction sequence MOVZ + MOVK + MOVK (for 48-bit constants) is 0.330 cyc = 0.110 cyc/instr, confirming the MOVZ is eliminated and both MOVKs execute independently. MOVZ + MOVK to different registers takes 0.989 cyc: when the registers differ, the two instructions don’t form a constant-building pair the renamer can recognize, so nothing is eliminated.
Interestingly, MOVZ + NOP + MOVK is 0.313 cyc. The NOP is also eliminated, so the MOVK still executes at roughly its standalone throughput. The P-core eliminates both NOPs and MOVZ at rename.
E-core: The same pattern holds but with more noise and higher base costs. MOVZ + MOVK (same reg) at 0.659 cyc/pair vs MOVK standalone at 0.509, consistent with MOVZ elimination.
ADRP + ADD (address materialization)
Loading a symbol’s address on ARM64 is a two-instruction sequence:
```asm
adrp x9, symbol@PAGE         ; x9 = page-aligned address of symbol
add  x9, x9, symbol@PAGEOFF  ; x9 += offset within page
```

The compiler emits this sequence for every global variable access (the linker fills in the actual addresses). Fusing this pair would save a dispatch slot on one of the most common instruction patterns in any binary.
P-core results:
| Test | Cyc/blk | Cyc/instr |
|---|---|---|
| NOP standalone | 0.101 | 0.101 |
| ADD standalone | 1.003 | 1.003 |
| ADRP standalone | 0.405 | 0.405 |
| ADR standalone | 0.406 | 0.406 |
| ADRP x2 | 0.800 | 0.400 |
| ADRP + ADD (same reg) | 0.321 | 0.160 |
| ADRP + ADD (diff reg) | 0.306 | 0.153 |
| ADRP + NOP + ADD | 0.310 | 0.103 |
ADRP standalone is 0.405 cyc, not as cheap as NOP (0.101) but cheaper than ADD (1.003). It’s not fully eliminated but may use a specialized fast path (PC-relative computation is simple). ADR has the same cost.
The ADRP + ADD pair at 0.32 cyc is faster than either instruction alone. Since ADRP + NOP + ADD is 0.31 cyc (nearly the same), this isn’t adjacency-dependent fusion. It’s just that both instructions are cheap and the pipeline can overlap them. ADRP doesn’t need an ALU port (it’s a PC-relative constant), so it doesn’t compete with the ADD for execution resources.
Notably, there’s no evidence of ADRP + ADD fusion on the M4 P-core in the macro-op fusion sense. The pair is fast because ADRP is cheap, not because they merge into one uop. This is consistent with Dougall Johnson’s M1 findings: “still not adrp + add fusion.”
E-core: ADRP standalone is 0.356 cyc (similar to NOP at 0.198, suggesting partial elimination). ADRP + ADD is 0.612 cyc, roughly ADRP + ADD standalone costs. No fusion here either.
ALU + CBZ/CBNZ fusion
Dougall Johnson documented that the M1’s P-core (Firestorm) can fuse non-flag-setting ALU instructions (ADD, SUB, AND, ORR, EOR, BIC, etc.) with CBZ/CBNZ when the destination register of the ALU op matches the operand of the compare-and-branch. This is a different fusion pattern from CMP+B.cond: here the CPU fuses an instruction that doesn’t set flags with a branch that does its own comparison.
P-core results:
| Test | Cyc/blk | vs CMP+B.NE (0.506) |
|---|---|---|
| CBZ alone (reference) | 0.668 | — |
| SUB x9 + CBNZ x9 (match) | 0.508 | Fuses |
| ADD x9 + CBZ x9 (match) | 0.794 | Doesn’t fuse cleanly |
| AND x9 + CBZ x9 (match) | 0.709 | Doesn’t fuse cleanly |
| ORR x9 + CBZ x9 (match) | 0.821 | Doesn’t fuse cleanly |
| EOR x9 + CBZ x9 (match) | 0.798 | Doesn’t fuse cleanly |
| BIC x9 + CBZ x9 (match) | 0.721 | Doesn’t fuse cleanly |
| ADD x9 + CBZ x10 (no match) | 0.817 | No |
SUB + CBNZ fuses perfectly at 0.508, confirming the pattern exists on M4. The other ALU ops show intermediate values (0.7-0.8) rather than the clean 0.5 of full fusion. These tests have the same dependency chain issue as before (the ALU writes x9, CBZ reads x9, creating a serial chain across iterations), which inflates the numbers. The fact that SUB+CBNZ still hits 0.508 despite this dependency suggests it has a special fast path, possibly the CBNZ can read the SUB’s result with zero latency when fused.
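The fusing SUB + CBNZ pair is exactly the shape of a countdown loop, which would explain why this particular combination gets a fast path (sketch; the label is illustrative):

```asm
; countdown loop: SUB writes x9, CBNZ tests the same register
.loop:
    sub  x9, x9, #1   ; x9 -= 1 (does not set flags)
    cbnz x9, .loop    ; branch while x9 != 0; dest match enables fusion
```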
Shift and extend break fusion
Dougall Johnson noted that branch fusion doesn’t work with implicit shift or extend operands. My M4 tests confirm this:
| Test | Cyc/blk | Fuses? |
|---|---|---|
| CMP reg + B.NE (no shift) | 0.505 | Yes |
| CMP lsl + B.NE (shifted) | 0.827 | No |
| ADDS lsl + B.NE (shifted) | 0.802 | No |
| SUBS uxtw + B.NE (extend) | 0.507 | Yes (!) |
CMP with a shifted register operand (cmp x9, x10, lsl #2) does not fuse. The shift makes the instruction too complex for the fuser. However, SUBS with uxtw extend still fuses on the P-core (0.507). This might be an M4 improvement over M1, or the specific extend type might matter.
Flag-reading instructions don’t fuse
Instructions that read flags (like ADCS, which reads the carry flag) cannot fuse with a following branch:
| Test | Cyc/blk |
|---|---|
| CMP + B.NE (reference) | 0.506 |
| ADCS + B.NE | 0.722 |
ADCS + B.NE at 0.722 doesn’t fuse. This makes sense: the fuser handles instructions that write flags followed by branches that read them. ADCS both reads and writes flags, creating a more complex dependency pattern that the fusion logic can’t handle.
AES instruction fusion
Apple Silicon fuses AES encrypt/decrypt rounds with their subsequent mix-columns step:
| Test | Cyc/blk |
|---|---|
| AESE alone | 2.10 |
| AESMC alone | 2.00 |
| AESE + AESMC (operands match) | 2.22 |
| AESE + AESMC (operands don’t match) | 2.25 |
| AESD + AESIMC (match) | 2.22 |
| AESE + EOR (match) | 2.17 |
The fused AESE + AESMC pair at 2.22 cyc is dramatically faster than the sum of standalone costs (2.10 + 2.00 = 4.10). The pair takes only slightly longer than AESE alone, meaning AESMC is essentially free when fused. This is a critical optimization for AES-GCM and AES-CBC performance. The fusion requires a specific operand pattern (“A,B ; A,A”, the output of AESE feeds directly into AESMC).
AESE + EOR also fuses at 2.17, important for XOR-based AES modes (CTR, GCM).
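The fusing operand shape looks like this: the AESMC must consume the AESE result in place, matching the "A,B ; A,A" pattern:

```asm
aese  v0.16b, v1.16b   ; one AES round: AddRoundKey, SubBytes, ShiftRows
aesmc v0.16b, v0.16b   ; MixColumns on AESE's own output -> fuses
```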
Patterns that don’t fuse (confirmed)
Consistent with Dougall Johnson’s M1 findings, these patterns show no fusion on M4:
| Test | Cyc/blk | Expected if fused |
|---|---|---|
| MUL + UMULH | 0.661 | ~0.33 (MUL alone) |
| ADRP + ADD | 0.321 | — (fast but not fused) |
| MOVZ + MOVK | 0.216 | — (MOVZ eliminated, not fused) |
MUL + UMULH at 0.661 is exactly 2x MUL alone (0.332). No fusion, they’re independent instructions that happen to share operands.
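For contrast, this is the full 128-bit product pattern that does not fuse:

```asm
mul   x2, x0, x1   ; low 64 bits of x0 * x1
umulh x3, x0, x1   ; high 64 bits, same sources -> two independent uops
```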
Summary: P-core vs E-core fusion on Apple M4
| Feature | P-core | E-core |
|---|---|---|
| CMP/CMN/TST + B.cond | Fuses | Fuses |
| ADDS/SUBS/ANDS/BICS (xzr dest) + B.cond | Fuses | Fuses |
| ADDS/SUBS/ANDS/BICS (real dest) + B.cond | Fuses | Slower (cracking overhead) |
| ALU (ADD/SUB/etc.) + CBZ/CBNZ (dest match) | SUB+CBNZ fuses | SUB+CBNZ fuses |
| Shifted operands + B.cond | No | No |
| Extend operands + B.cond | SUBS uxtw fuses | Unclear |
| ADCS (flag-reading) + B.cond | No | No |
| AESE + AESMC / AESD + AESIMC | Fuses | Fuses |
| AESE/AESD + EOR | Fuses | Fuses |
| Taken branches | Throughput penalty (~0.85 cyc) | Does not fuse (~1.0+ cyc) |
| NOP elimination | Yes | Partial |
| MOVZ elimination | Yes (0.108 cyc, ~free) | Yes |
| ADRP + ADD | No (but ADRP is cheap) | No |
| MUL + UMULH | No | No |
| LDP/STP micro-op fusion | ~2x vs LDR/STR | ~2x vs LDR/STR |
References
[1] Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 6th Edition
[2] Dougall Johnson, Apple Silicon CPU microarchitecture reverse engineering: https://dougallj.github.io/applecpu/firestorm.html
[3] Corsix, x86 macro-op fusion notes: https://www.corsix.org/content/x86-macro-op-fusion-notes
[4] Intel 64 and IA-32 Architectures Optimization Reference Manual
[5] EasyPerf, Micro-fusion in Intel CPUs: https://easyperf.net/blog/2018/02/15/MicroFusion-in-Intel-CPUs
[6] US Patent US20040199748A1, Micro-op fusion (Intel): https://patentimages.storage.googleapis.com/00/d5/90/553f1307abdc19/US20040199748A1.pdf
[7] LLVM AArch64 MacroFusion pass: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AArch64/AArch64MacroFusion.cpp
[8] Travis Downs, uarch-bench: https://github.com/travisdowns/uarch-bench