I was skimming through Computer Architecture: A Quantitative Approach a few days ago, specifically the 6th edition’s chapter on the Intel Core i7 6700, when I noticed this part:

Macro-op fusion? This was the first I’d read of it. The textbook gives a brief overview of the topic, but I decided to spend a few days going through it in more detail. As a result, I’d like to share some of the details on Macro-op and Micro-op fusion in modern processors, and then test for them on my M4 Mac.

Brief overview of Instruction Decoding

All modern, high-performance, out-of-order CPUs internally translate ISA-defined instructions into internal, fixed-length operations referred to as micro-ops (uops). This happens for a couple of reasons.

For one, it simplifies the execution engine. On CISC ISAs like x86, the instruction set has grown to over 2,000 instructions with varying lengths (1 to 15 bytes). Building an execution engine that handles all of these natively would be impractical. Instead, the front-end decoder breaks each architectural instruction (“macro-op”) into one or more simpler, fixed-width micro-ops that the backend can schedule uniformly. For example, an x86 instruction like:

add [rax+rcx*4+8], rdx

looks like a single instruction, but it actually does three things: compute an address, load from memory, and add. The decoder cracks this into separate uops: a load uop, an ALU uop, and a store uop. Each can be independently scheduled and dispatched to different execution units.

The second reason is scheduling. By breaking complex instructions into independent uops, the out-of-order engine can overlap their execution with other work. Take the add [mem], reg above: the load uop can issue as soon as the address is ready, the ALU uop fires when the load completes, and the store uop writes back when the ALU finishes. Meanwhile, other instructions can execute in the gaps. This wouldn’t be possible if the processor treated the whole thing as one monolithic operation.

These uops are what actually flow through the out-of-order pipeline: they are renamed, scheduled, dispatched to execution ports, and retired. The original macro-ops are just a front-end concern.

Modern CPUs also cache these translations. AMD’s Zen 5 processors (2024) have a 2x6-wide op cache that sits before the decode stage. On a cache hit, the processor bypasses decoding entirely, feeding uops straight to the rename/dispatch stage. Intel has a similar structure (the “uop cache” or DSB, introduced with Sandy Bridge). This matters because decode is typically a bottleneck. On x86, the decoders can typically only handle 4-6 instructions per cycle, but the op cache can deliver 6-8 uops per cycle.

ARM processors like Apple Silicon also decode into internal micro-ops, though the process is simpler since ARM instructions are fixed-width (4 bytes). Apple’s M4 performance cores have a massive decode width of up to 10 instructions per cycle.

Macro-op fusion

With that background, macro-op fusion refers to the ability to combine multiple macro-ops (i.e. regular architectural instructions) into a single uop during decode. Since certain instruction patterns appear so frequently, it’s worth adding hardware to recognize and merge them.

The canonical pattern is a comparison followed by a conditional branch. This shows up at the end of virtually every loop:

; x86
cmp  rax, rbx     ; sets flags
jne  .loop         ; reads flags, branches
 
; ARM
cmp  x0, x1        ; sets flags
b.ne .loop          ; reads flags, branches

These are two separate instructions, but semantically they form a single decision: “compare and branch if not equal.” If the decoder can fuse them into one uop, that uop occupies a single slot through the entire pipeline, effectively increasing the pipeline’s throughput without physically widening it.

x86: History and current state

Macro-op fusion first appeared in Intel’s Core 2 (2006). Initially it was limited: only CMP or TEST followed by a conditional jump, and only with certain condition codes (e.g., JE, JNE, JL, JGE). Later microarchitectures expanded this:

| Generation | Fusion support |
|---|---|
| Core 2 (2006) | CMP/TEST + Jcc (limited conditions) |
| Nehalem (2008) | CMP/TEST + all Jcc conditions |
| Sandy Bridge (2011) | + ADD/SUB/AND/INC/DEC + Jcc |
| Haswell (2013) | + TEST with immediate operands |
| Modern (Zen 4/5, Golden Cove+) | CMP, TEST, ADD, SUB, AND, OR, XOR, INC, DEC + all Jcc |

AMD followed a similar trajectory, adding macro-op fusion in Bulldozer (2011) and progressively expanding it through the Zen generations.

The practical effect is significant. In a tight loop like:

.loop:
    dec  ecx          ; ecx -= 1, sets flags
    jnz  .loop        ; branch if ecx != 0

Without fusion, this is 2 uops per iteration. With fusion, it’s 1. On a 6-wide machine, that means the fused loop overhead takes 1/6th of the pipeline bandwidth instead of 2/6ths.

ARM: A slightly different story

ARM’s fixed-width instruction encoding makes the decode stage simpler, but macro-op fusion is still valuable. The same compare-and-branch pattern is just as common:

cmp  x0, #0
b.eq .done

ARM also has single-instruction alternatives like CBZ (Compare and Branch if Zero) and TBZ (Test Bit and Branch if Zero) that accomplish the same thing without needing fusion. However, these cover a limited set of conditions, so the two-instruction CMP/CMN/TST + B.cond pattern remains common.

Apple’s CPU cores are particularly aggressive here. The M-series chips (M1 through M4) perform macro-op fusion in the decode stage, and Apple’s 10-wide P-core decoder has dedicated fusion logic. LLVM’s AArch64 backend has a MacroFusion pass that understands which pairs Apple’s cores can fuse, and it will reorder instructions to keep fusable pairs adjacent.

Testing for Macro-op fusion on Apple M4

Reading about fusion is one thing, but I wanted to actually measure it. I decided to test this on my M4 MacBook Air by doing the following: run a tight loop of instruction pairs that should fuse, compare their throughput against a baseline of instructions that shouldn’t, and see if the fused pairs are 2x faster.

Methodology

The test program runs each instruction pattern in a tight loop with 50x unrolling (to minimize loop overhead) across 500 million iterations, measuring wall-clock time via mach_absolute_time. Since Apple locked down the hardware performance counters (kpc/PMC) on M4 and macOS Sequoia, even for root, I estimate CPU cycles by calibrating the frequency first: a dependent chain of 1 billion ADD instructions (1-cycle latency each, IPC=1 by construction) gives me cycles/ns.

The test runs on both P-cores (via QOS_CLASS_USER_INTERACTIVE) and E-cores (via QOS_CLASS_BACKGROUND) to compare fusion behavior across core types.

P-core results

The P-core runs at ~4.4 GHz. The baseline is two independent ADD instructions per pair, which gives me 1.00 cycles/pair: two instructions consuming two dispatch slots as expected.

Fusion with xzr (zero register) operands:

| Instruction Pair | Cyc/pair | vs Baseline | Fuses? |
|---|---|---|---|
| ADD x2 (baseline) | 1.000 | 1.00x | |
| CMP + B.NE | 0.506 | 0.51x | Yes |
| CMN + B.NE | 0.507 | 0.51x | Yes |
| TST + B.NE | 0.507 | 0.51x | Yes |
| ADDS + B.NE | 0.506 | 0.51x | Yes |
| SUBS + B.NE | 0.506 | 0.51x | Yes |
| ANDS + B.NE | 0.507 | 0.51x | Yes |
| BICS + B.NE | 0.507 | 0.51x | Yes |

Every flag-setting instruction fuses with every conditional branch at exactly half the baseline throughput. Two instructions, one dispatch slot.

Operand type doesn’t matter on the P-core:

| Variant | Cyc/pair |
|---|---|
| CMP xzr, xzr + B.NE | 0.506 |
| CMP x9, x10 + B.NE (register) | 0.506 |
| CMP x9, #0 + B.NE (immediate) | 0.507 |
| ADDS x11, x9, x10 + B.NE (reg, writes dest) | 0.506 |
| ADDS x11, x9, #0 + B.NE (imm, writes dest) | 0.506 |
| SUBS x11, x9, x10, lsl #0 + B.NE (shifted) | 0.506 |

All 0.506. The P-core’s fusion logic doesn’t care about encoding format, operand source, or whether the flag-setter also writes a destination register. If it sets flags and the next instruction is a conditional branch, it fuses.

What breaks fusion:

| Test | Cyc/pair | Fuses? |
|---|---|---|
| CMP + B.NE (adjacent) | 0.506 | Yes |
| CMP + NOP + B.NE | 0.505 | Yes (!) |
| CMP + ADD + B.NE | 0.992 | No |
| CMP + B.NE (taken) | 0.849 | Different bottleneck |

Two things to note:

First, inserting a NOP between CMP and B.NE does not break fusion on the P-core. This means the P-core eliminates NOPs at the rename stage before fusion happens, so the CMP and B.NE end up adjacent in the uop stream anyway.

Second, taken branches are slower (~0.85 cyc) regardless of fusion. This is a separate throughput constraint: the branch execution unit has a limited taken-branch bandwidth. It’s not that fusion fails; it’s that correctly-predicted taken branches still cost more than fall-through.

E-core results

The E-core runs at ~1.3 GHz with a narrower pipeline. The initial results here looked like ADDS/SUBS with a destination register “didn’t fuse”, but further investigation revealed a more nuanced picture involving dependency chains and instruction cracking.

Compare-only instructions fuse cleanly:

| Instruction Pair | Cyc/pair | vs Baseline | Fuses? |
|---|---|---|---|
| ADD x2 (baseline) | 0.982 | 1.00x | |
| CMP + B.NE | 0.510 | 0.52x | Yes |
| CMN + B.NE | 0.494 | 0.50x | Yes |
| TST + B.NE | 0.538 | 0.55x | Yes |
| CMP reg,reg + B.NE | 0.536 | 0.55x | Yes |
| CMP imm + B.NE | 0.510 | 0.52x | Yes |

Flag-setting ALU ops with a destination register are slower, but why?

My initial tests showed ADDS x9 + B.NE at ~0.68 cyc/pair, between fused (0.5) and unfused (1.0). I initially attributed this to the E-core refusing to fuse instructions that write a destination register. But fusion is binary: a pair either fuses or it doesn’t. An intermediate value demands a better explanation.

An isolation test revealed two confounds in the original measurement:

  1. Dependency chains. ADDS x9, x9, #0 followed by another ADDS x9, x9, #0 creates a serial dependency: each iteration reads x9, which the previous iteration wrote. This serializes execution regardless of fusion. When I changed the destination to a different register (ADDS x11, x9, #0 + B.NE, no dependency chain), the P-core result dropped to 0.506, identical to fused CMP + B.NE. On the E-core, it dropped from 0.87 to 0.65.

  2. Instruction cracking. Even with the dependency chain removed, ADDS x11 + B.NE on the E-core still measured 0.65 cyc/pair, slower than CMP + B.NE at 0.49. The standalone throughput data explains why: ADDS with a real destination register is intrinsically slower on the E-core (0.54 cyc standalone) than CMP (0.34 cyc standalone). This suggests the E-core cracks ADDS into 2 internal uops (one for the GPR result, one for flags) when it writes a real register, while CMP is a single uop. The throughput difference in the paired test comes from this cracking overhead, not from a failure to fuse.

| E-core standalone throughput | Cyc/instr |
|---|---|
| CMP (flags only) | 0.34 |
| TST (flags only) | 0.34 |
| ADDS xzr (= CMP, no dest) | 0.38 |
| ADDS x9 (flags + dest, no dep chain) | 0.54 |
| ADD x9 (dest only, no flags) | 1.00 |

The P-core shows no such cracking. ADDS with or without a destination register has the same throughput when dependencies are removed.

Experimental note on confounds: My first round of register-operand tests used uninitialized registers, which meant x9 and x10 could be unequal, making B.NE a taken branch while the xzr tests always produced not-taken branches. Taken branches are slower due to separate throughput constraints, so the results looked like “partial fusion” when they were partly measuring taken-branch overhead. Zeroing the registers and controlling for dependency chains was necessary to get clean data.

Taken vs not-taken

Both core types show a clear split based on branch direction:

P-core:

| Branch direction | Cyc/pair |
|---|---|
| CMP + B.NE (not taken) | 0.506 |
| CMP + B.EQ (not taken) | 0.509 |
| CMP + B.LT (not taken) | 0.506 |
| CMP + B.HI (not taken) | 0.506 |
| CMP + B.NE (taken) | 0.849 |
| CMP + B.GE (taken) | 0.846 |
| CMP + B.LS (taken) | 0.724 |

E-core:

| Branch direction | Cyc/pair |
|---|---|
| Not-taken (all conditions) | ~0.50 |
| Taken (all conditions) | ~1.05-1.12 |

Not-taken branches fuse and execute at maximum throughput. Taken branches incur additional cost even when correctly predicted. This is a fetch/redirect penalty, not a fusion failure. On the P-core, the penalty is moderate (~0.85 vs 0.50). On the E-core, taken branches are essentially unfused speed (~1.0+), suggesting the E-core may not fuse taken branches at all, or that its branch throughput limit is tighter.

Micro-op fusion

Micro-op fusion operates at a different level. Where macro-op fusion merges two architectural instructions into one uop, micro-op fusion deals with a single architectural instruction that internally decodes into multiple micro-ops.

x86: The original motivation

On x86, micro-op fusion was introduced by Intel with the Pentium M (2003). The motivating case is memory-operand instructions. Consider:

add eax, [rbx+rcx*4+16]

This single instruction does two things: load a value from memory, then add it to eax. Internally it decodes into 2 uops: a load uop and an ALU uop. Without micro-op fusion, these consume 2 slots in the reorder buffer (ROB), 2 slots during dispatch, and 2 slots in the retirement queue.

With micro-op fusion, the two uops are kept together as a single fused entry through the front-end: they share one ROB slot, one dispatch slot, and one retirement slot. They only split apart at the execution stage when they actually need different execution ports (load port vs ALU port). This effectively increases the ROB capacity and dispatch bandwidth without physically making them larger.

The rules for what can be micro-fused have evolved:

  • Pentium M / Core: simple addressing modes (base, base+displacement)
  • Sandy Bridge+: indexed addressing modes (base+index+displacement) can be micro-fused at decode but may be “unlaminated” before dispatch if the instruction has 3 or more inputs
  • Haswell+: expanded unlamination rules

Unlamination

Unlamination is the reverse of micro-op fusion: a fused uop being split back into separate uops before dispatch. This happens on Intel CPUs when a micro-fused instruction has too many register inputs for the rename stage to handle in one slot. For example:

add rax, [rbx+rcx*4]   ; 3 inputs: rax, rbx, rcx

This gets micro-fused at decode (1 fused uop in the uop cache), but the renamer splits it back into 2 uops because it needs 3 source registers. The fusion still helps by saving uop cache bandwidth, but the ROB/dispatch savings are lost.

Instructions with 2 or fewer register inputs stay fused all the way through:

add rax, [rbx+16]      ; 2 inputs: rax, rbx (stays fused)

ARM / Apple Silicon: LDP and STP

ARM has its own version of this with load/store pair instructions. LDP (Load Pair) loads two 64-bit registers from consecutive memory locations in a single instruction:

ldp x0, x1, [sp]       ; loads x0 from [sp], x1 from [sp+8]

Architecturally this is one instruction, but it performs two memory accesses and writes two registers. The hardware cracks it into two load micro-ops internally, but with micro-op fusion, keeps them fused through dispatch, occupying a single pipeline slot. They only split at execution to hit two separate load ports.

Testing micro-op fusion on Apple M4

I tested this by comparing the throughput of LDP/STP against equivalent pairs of LDR/STR instructions, all loading/storing from a hot L1 cache line:

P-core:

| Instruction | Cyc/pair | vs 2x Single |
|---|---|---|
| LDP (load pair) | 0.324 | 0.51x vs LDR x2 (0.641) |
| STP (store pair) | 0.465 | 0.50x vs STR x2 (0.933) |

E-core:

| Instruction | Cyc/pair | vs 2x Single |
|---|---|---|
| LDP (load pair) | 0.430 | 0.37x vs LDR x2 (1.157) |
| STP (store pair) | 1.054 | 0.55x vs STR x2 (1.907) |

LDP and STP consistently deliver ~2x the throughput of separate load/store pairs on both core types. The pair instruction goes through dispatch as a single fused uop, only splitting at the execution stage. The E-core shows an even more pronounced relative advantage because its narrower dispatch makes the saved slot proportionally more valuable.

This has direct implications for compiler output and hand-tuned code: LDP/STP should always be preferred over adjacent LDR/STR pairs. LLVM and GCC already do this aggressively in their AArch64 backends.

Other fusion and elimination on Apple M4

Beyond compare+branch fusion, Apple Silicon performs other instruction merging optimizations. Dougall Johnson’s reverse-engineering of the M1 (Firestorm/Icestorm) documented several of these. I tested whether they carry forward to the M4.

MOVZ + MOVK fusion

ARM64 can’t load a full 64-bit immediate in one instruction. Instead, constants are built up with MOVZ (load 16 bits, zero the rest) followed by one or more MOVK (patch in 16 bits, keep the rest):

movz x9, #0x1234            ; x9 = 0x0000000000001234
movk x9, #0x5678, lsl #16   ; x9 = 0x0000000056781234

This pattern is extremely common since every 32-bit+ constant needs it. If the CPU can fuse the pair into a single operation, it saves a dispatch slot every time a constant is materialized.

P-core results:

| Test | Cyc/blk | Cyc/instr |
|---|---|---|
| MOVZ standalone | 0.108 | 0.108 |
| MOVK standalone | 0.257 | 0.257 |
| MOV imm standalone | 0.109 | 0.109 |
| MOVZ + MOVK (same reg) | 0.216 | 0.108 |
| MOVZ + MOVK (diff reg) | 0.989 | 0.494 |
| MOVZ + NOP + MOVK | 0.313 | 0.104 |
| MOVZ + MOVK + MOVK (48-bit) | 0.330 | 0.110 |

MOVZ alone is essentially free (0.108 cyc, close to NOP at 0.101). It’s eliminated at rename, like a zero-idiom. The pair MOVZ + MOVK to the same register runs at 0.216 cyc/pair, which is 0.108 cyc/instr: the MOVZ is eliminated and the MOVK runs at its standalone throughput. This isn’t traditional fusion (merging two uops into one); it’s MOVZ elimination making the pair effectively a single instruction.

The three-instruction sequence MOVZ + MOVK + MOVK (for 48-bit constants) is 0.330 cyc = 0.110 cyc/instr, confirming the MOVZ is eliminated and both MOVKs execute independently. MOVZ + MOVK to different registers takes 0.989 cyc: no elimination, because the MOVK patches a register the MOVZ never wrote, so there is no pair to combine.

Interestingly, MOVZ + NOP + MOVK is 0.313 cyc. The NOP is also eliminated, so the MOVK still executes at roughly its standalone throughput. The P-core eliminates both NOPs and MOVZ at rename.

E-core: The same pattern holds but with more noise and higher base costs. MOVZ + MOVK (same reg) at 0.659 cyc/pair vs MOVK standalone at 0.509, consistent with MOVZ elimination.

ADRP + ADD (address materialization)

Loading a symbol’s address on ARM64 is a two-instruction sequence:

adrp x9, symbol@PAGE       ; x9 = page-aligned address of symbol
add  x9, x9, symbol@PAGEOFF ; x9 += offset within page

This is emitted by the linker for every global variable access. Fusing this pair would save a dispatch slot on one of the most common instruction patterns in any binary.

P-core results:

| Test | Cyc/blk | Cyc/instr |
|---|---|---|
| NOP standalone | 0.101 | 0.101 |
| ADD standalone | 1.003 | 1.003 |
| ADRP standalone | 0.405 | 0.405 |
| ADR standalone | 0.406 | 0.406 |
| ADRP x2 | 0.800 | 0.400 |
| ADRP + ADD (same reg) | 0.321 | 0.160 |
| ADRP + ADD (diff reg) | 0.306 | 0.153 |
| ADRP + NOP + ADD | 0.310 | 0.103 |

ADRP standalone is 0.405 cyc, not as cheap as NOP (0.101) but cheaper than ADD (1.003). It’s not fully eliminated but may use a specialized fast path (PC-relative computation is simple). ADR has the same cost.

The ADRP + ADD pair at 0.32 cyc is faster than either instruction alone. Since ADRP + NOP + ADD is 0.31 cyc (nearly the same), this isn’t adjacency-dependent fusion. It’s just that both instructions are cheap and the pipeline can overlap them. ADRP doesn’t need an ALU port (it’s a PC-relative constant), so it doesn’t compete with the ADD for execution resources.

Notably, there’s no evidence of ADRP + ADD fusion on the M4 P-core in the macro-op fusion sense. The pair is fast because ADRP is cheap, not because they merge into one uop. This is consistent with Dougall Johnson’s M1 findings: “still not adrp + add fusion.”

E-core: ADRP standalone is 0.356 cyc (similar to NOP at 0.198, suggesting partial elimination). ADRP + ADD is 0.612 cyc, roughly ADRP + ADD standalone costs. No fusion here either.

ALU + CBZ/CBNZ fusion

Dougall Johnson documented that the M1’s P-core (Firestorm) can fuse non-flag-setting ALU instructions (ADD, SUB, AND, ORR, EOR, BIC, etc.) with CBZ/CBNZ when the destination register of the ALU op matches the operand of the compare-and-branch. This is a different fusion pattern from CMP+B.cond: here the CPU fuses an instruction that doesn’t set flags with a branch that does its own comparison.

P-core results:

| Test | Cyc/blk | vs CMP+B.NE (0.506) |
|---|---|---|
| CBZ alone (reference) | 0.668 | |
| SUB x9 + CBNZ x9 (match) | 0.508 | Fuses |
| ADD x9 + CBZ x9 (match) | 0.794 | Doesn't fuse cleanly |
| AND x9 + CBZ x9 (match) | 0.709 | Doesn't fuse cleanly |
| ORR x9 + CBZ x9 (match) | 0.821 | Doesn't fuse cleanly |
| EOR x9 + CBZ x9 (match) | 0.798 | Doesn't fuse cleanly |
| BIC x9 + CBZ x9 (match) | 0.721 | Doesn't fuse cleanly |
| ADD x9 + CBZ x10 (no match) | 0.817 | No |

SUB + CBNZ fuses perfectly at 0.508, confirming the pattern exists on M4. The other ALU ops show intermediate values (0.7-0.8) rather than the clean 0.5 of full fusion. These tests have the same dependency chain issue as before (the ALU writes x9, CBZ reads x9, creating a serial chain across iterations), which inflates the numbers. The fact that SUB+CBNZ still hits 0.508 despite this dependency suggests it has a special fast path, possibly the CBNZ can read the SUB’s result with zero latency when fused.

Shift and extend break fusion

Dougall Johnson noted that branch fusion doesn’t work with implicit shift or extend operands. My M4 tests confirm this:

| Test | Cyc/blk | Fuses? |
|---|---|---|
| CMP reg + B.NE (no shift) | 0.505 | Yes |
| CMP lsl + B.NE (shifted) | 0.827 | No |
| ADDS lsl + B.NE (shifted) | 0.802 | No |
| SUBS uxtw + B.NE (extend) | 0.507 | Yes (!) |

CMP with a shifted register operand (cmp x9, x10, lsl #2) does not fuse. The shift makes the instruction too complex for the fuser. However, SUBS with uxtw extend still fuses on the P-core (0.507). This might be an M4 improvement over M1, or the specific extend type might matter.

Flag-reading instructions don’t fuse

Instructions that read flags (like ADCS, which reads the carry flag) cannot fuse with a following branch:

| Test | Cyc/blk |
|---|---|
| CMP + B.NE (reference) | 0.506 |
| ADCS + B.NE | 0.722 |

ADCS + B.NE at 0.722 doesn’t fuse. This makes sense: the fuser handles instructions that write flags followed by branches that read them. ADCS both reads and writes flags, creating a more complex dependency pattern that the fusion logic can’t handle.

AES instruction fusion

Apple Silicon fuses AES encrypt/decrypt rounds with their subsequent mix-columns step:

| Test | Cyc/blk |
|---|---|
| AESE alone | 2.10 |
| AESMC alone | 2.00 |
| AESE + AESMC (operands match) | 2.22 |
| AESE + AESMC (operands don't match) | 2.25 |
| AESD + AESIMC (match) | 2.22 |
| AESE + EOR (match) | 2.17 |

The fused AESE + AESMC pair at 2.22 cyc is dramatically faster than the sum of standalone costs (2.10 + 2.00 = 4.10). The pair takes only slightly longer than AESE alone, meaning AESMC is essentially free when fused. This is a critical optimization for AES-GCM and AES-CBC performance. The fusion requires a specific operand pattern (“A,B ; A,A”, the output of AESE feeds directly into AESMC).

AESE + EOR also fuses at 2.17, important for XOR-based AES modes (CTR, GCM).

Patterns that don’t fuse (confirmed)

Consistent with Dougall Johnson’s M1 findings, these patterns show no fusion on M4:

| Test | Cyc/blk | Expected if fused |
|---|---|---|
| MUL + UMULH | 0.661 | ~0.33 (MUL alone) |
| ADRP + ADD | 0.321 | — (fast but not fused) |
| MOVZ + MOVK | 0.216 | — (MOVZ eliminated, not fused) |

MUL + UMULH at 0.661 is exactly 2x MUL alone (0.332). No fusion, they’re independent instructions that happen to share operands.

Summary: P-core vs E-core fusion on Apple M4

| Feature | P-core | E-core |
|---|---|---|
| CMP/CMN/TST + B.cond | Fuses | Fuses |
| ADDS/SUBS/ANDS/BICS (xzr dest) + B.cond | Fuses | Fuses |
| ADDS/SUBS/ANDS/BICS (real dest) + B.cond | Fuses | Slower (cracking overhead) |
| ALU (ADD/SUB/etc.) + CBZ/CBNZ (dest match) | SUB+CBNZ fuses | SUB+CBNZ fuses |
| Shifted operands + B.cond | No | No |
| Extend operands + B.cond | SUBS uxtw fuses | Unclear |
| ADCS (flag-reading) + B.cond | No | No |
| AESE + AESMC / AESD + AESIMC | Fuses | Fuses |
| AESE/AESD + EOR | Fuses | Fuses |
| Taken branches | Throughput penalty (~0.85 cyc) | Does not fuse (~1.0+ cyc) |
| NOP elimination | Yes | Partial |
| MOVZ elimination | Yes (0.108 cyc, ~free) | Yes |
| ADRP + ADD | No (but ADRP is cheap) | No |
| MUL + UMULH | No | No |
| LDP/STP micro-op fusion | ~2x vs LDR/STR | ~2x vs LDR/STR |

References

[1] Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 6th Edition

[2] Dougall Johnson, Apple Silicon CPU microarchitecture reverse engineering: https://dougallj.github.io/applecpu/firestorm.html

[3] Corsix, x86 macro-op fusion notes: https://www.corsix.org/content/x86-macro-op-fusion-notes

[4] Intel 64 and IA-32 Architectures Optimization Reference Manual

[5] EasyPerf, Micro-fusion in Intel CPUs: https://easyperf.net/blog/2018/02/15/MicroFusion-in-Intel-CPUs

[6] US Patent US20040199748A1, Micro-op fusion (Intel): https://patentimages.storage.googleapis.com/00/d5/90/553f1307abdc19/US20040199748A1.pdf

[7] LLVM AArch64 MacroFusion pass: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AArch64/AArch64MacroFusion.cpp

[8] Travis Downs, uarch-bench: https://github.com/travisdowns/uarch-bench
