I was skimming through Computer Architecture: A Quantitative Approach a few days ago, specifically the 6th edition’s chapter on the Intel Core i7 6700, when I noticed this part:

Macro-op fusion? This was the first I’d read of it. The textbook gives a good, brief overview of the topic, but I decided to spend a few days going through it in more detail. As a result, I’d like to share some of the details of macro-op and micro-op fusion in modern processors, and then test for them on my M4 Mac.
Brief overview of Instruction Decoding
All modern, high-performance, out-of-order CPUs internally translate ISA-defined instructions into internal, fixed-length operations referred to as micro-ops (uops). This happens for a couple of reasons.
For one, it simplifies the execution engine. On CISC ISAs like x86, the instruction set has grown to over 2,000 instructions with varying lengths (1 to 15 bytes). Building an execution engine that handles all of these natively would be impractical. Instead, the front-end decoder breaks each architectural instruction (“macro-op”) into one or more simpler, fixed-width micro-ops that the backend can schedule uniformly. For example, an x86 instruction like:
```asm
add [rax+rcx*4+8], rdx
```

looks like a single instruction, but it actually does three things: compute an address, load from memory, and add. The decoder cracks this into separate uops: a load uop, an ALU uop, and a store uop. Each can be independently scheduled and dispatched to different execution units.
The second reason is scheduling. By breaking complex instructions into independent uops, the out-of-order engine can overlap their execution with other work. That add [mem], reg above: the load uop can issue as soon as the address is ready, the ALU uop fires when the load completes, and the store uop writes back when the ALU finishes. Meanwhile other instructions can execute in the gaps. This wouldn’t be possible if the processor treated the whole thing as one monolithic operation.
These uops are what actually flow through the out-of-order pipeline: they are renamed, scheduled, dispatched to execution ports, and retired. The original macro-ops are just a front-end concern.
Modern CPUs also cache these translations. AMD’s Zen 5 processors (2024) have a 2x6-wide op cache that sits before the decode stage. On a cache hit, the processor bypasses decoding entirely, feeding uops straight to the rename/dispatch stage. Intel has a similar structure (the “uop cache” or DSB, introduced with Sandy Bridge). This matters because decode is typically a bottleneck: on x86, the decoders can handle only 4-6 instructions per cycle, while the op cache can deliver 6-8 uops per cycle.
ARM processors like Apple Silicon also decode into internal micro-ops, though the process is simpler since ARM instructions are fixed-width (4 bytes). Apple’s M4 performance cores have a massive decode width of up to 10 instructions per cycle.
Macro-op fusion
With that background, macro-op fusion refers to the ability to combine multiple macro-ops (i.e. regular architectural instructions) into a single uop during decode. Since certain instruction patterns appear so frequently, it’s worth adding hardware to recognize and merge them.
The canonical pattern is a comparison followed by a conditional branch. This shows up at the end of virtually every loop:
```asm
; x86
cmp rax, rbx   ; sets flags
jne .loop      ; reads flags, branches

; ARM
cmp x0, x1     ; sets flags
b.ne .loop     ; reads flags, branches
```

These are two separate instructions, but semantically they form a single decision: “compare and branch if not equal.” If the decoder can fuse them into one uop, that uop occupies a single slot through the entire pipeline, which effectively increases the pipeline’s throughput without physically widening it.
x86: History and current state
Macro-op fusion first appeared in Intel’s Core 2 (2006). Initially it was limited: only CMP or TEST followed by a conditional jump, and only with certain condition codes (e.g., JE, JNE, JL, JGE). Later microarchitectures expanded this:
| Generation | Fusion support |
|---|---|
| Core 2 (2006) | CMP/TEST + Jcc (limited conditions) |
| Nehalem (2008) | CMP/TEST + all Jcc conditions |
| Sandy Bridge (2011) | + ADD/SUB/AND/INC/DEC + Jcc |
| Haswell (2013) | + TEST with immediate operands |
| Modern (Zen 4/5, Golden Cove+) | CMP, TEST, ADD, SUB, AND, OR, XOR, INC, DEC + all Jcc |
AMD followed a similar trajectory, adding macro-op fusion in Bulldozer (2011) and progressively expanding it through the Zen generations.
The practical effect is significant. In a tight loop like:
```asm
.loop:
    dec ecx     ; ecx -= 1, sets flags
    jnz .loop   ; branch if ecx != 0
```

Without fusion, this is 2 uops per iteration. With fusion, it’s 1. On a 6-wide machine, that means the fused loop overhead takes 1/6th of the pipeline bandwidth instead of 2/6ths.
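This decrement-and-branch shape is what compilers emit for counted loops. As a rough illustration (the function name is mine), a countdown loop like the following commonly compiles at -O2 to a dec/jnz pair on x86, or a subs/b.ne pair on ARM64:

```c
#include <stdint.h>

// Sums 0..n-1 with a countdown loop; the loop-closing test is the
// compare-and-branch pattern that macro-op fusion targets.
uint32_t sum_n(uint32_t n) {
    uint32_t s = 0;
    while (n--)   // typically becomes dec/jnz (x86) or subs/b.ne (ARM64)
        s += n;
    return s;
}
```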
ARM: A slightly different story
ARM’s fixed-width instruction encoding makes the decode stage simpler, but macro-op fusion is still valuable. The same compare-and-branch pattern is just as common:
```asm
cmp x0, #0
b.eq .done
```

ARM also has single-instruction alternatives like CBZ (Compare and Branch if Zero) and TBZ (Test Bit and Branch if Zero) that accomplish the same thing without needing fusion. However, these cover a limited set of conditions, so the two-instruction CMP/CMN/TST + B.cond pattern remains common.
Apple’s CPU cores are particularly aggressive here. The M-series chips (M1 through M4) perform macro-op fusion in the decode stage, and Apple’s 10-wide P-core decoder has dedicated fusion logic. LLVM’s AArch64 backend has a MacroFusion pass that understands which pairs Apple’s cores can fuse, and it will reorder instructions to keep fusable pairs adjacent.
Testing for Macro-op fusion on Apple M4
Reading about fusion is one thing, but I wanted to actually measure it. I decided to test this on my M4 MacBook Air by doing the following: run a tight loop of instruction pairs that should fuse, compare their throughput against a baseline of instructions that shouldn’t, and see if the fused pairs are 2x faster.
Methodology
The test program runs each instruction pattern in a tight loop with 50x unrolling (to minimize loop overhead) across 500 million iterations, measuring wall-clock time via mach_absolute_time. Since Apple locked down the hardware performance counters (kpc/PMC) on M4 and macOS Sequoia, even for root, I estimate CPU cycles by calibrating the frequency first: a dependent chain of 1 billion ADD instructions (1-cycle latency each, IPC=1 by construction) gives me cycles/ns.
The test runs on both P-cores (via QOS_CLASS_USER_INTERACTIVE) and E-cores (via QOS_CLASS_BACKGROUND) to compare fusion behavior across core types.
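The calibration idea can be sketched in portable C. This stand-in uses clock_gettime instead of mach_absolute_time, and a volatile C loop instead of the real harness’s pure-register asm add chain, so it underestimates the true frequency; it only illustrates the cycles-per-nanosecond arithmetic:

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

// Run n serially dependent adds and divide by elapsed wall time.
// If each add really took exactly 1 cycle, the result would be cycles/ns (GHz).
double estimate_cycles_per_ns(uint64_t n) {
    volatile uint64_t x = 0;   // volatile keeps the loop from being optimized out
    uint64_t start = now_ns();
    for (uint64_t i = 0; i < n; i++)
        x = x + 1;             // serial dependency: one add per "cycle slot"
    uint64_t elapsed = now_ns() - start;
    (void)x;
    return elapsed ? (double)n / (double)elapsed : 0.0;
}
```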
P-core results
The P-core runs at ~4.4 GHz. The baseline is two independent ADD instructions per pair, which gives me 1.00 cycles/pair: two instructions consuming two dispatch slots as expected.
Fusion with xzr (zero register) operands:
| Instruction Pair | Cyc/pair | vs Baseline | Fuses? |
|---|---|---|---|
| ADD x2 (baseline) | 1.000 | 1.00x | — |
| CMP + B.NE | 0.506 | 0.51x | Yes |
| CMN + B.NE | 0.507 | 0.51x | Yes |
| TST + B.NE | 0.507 | 0.51x | Yes |
| ADDS + B.NE | 0.506 | 0.51x | Yes |
| SUBS + B.NE | 0.506 | 0.51x | Yes |
| ANDS + B.NE | 0.507 | 0.51x | Yes |
| BICS + B.NE | 0.507 | 0.51x | Yes |
Every flag-setting instruction fuses with every conditional branch at exactly half the baseline throughput. Two instructions, one dispatch slot.
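Concretely, one unrolled block of the measured loop looks roughly like this (a sketch of my harness; the label layout is illustrative). Since xzr always equals xzr, the branch is never taken:

```asm
cmp  xzr, xzr    ; always equal -> Z flag set
b.ne 1f          ; never taken; candidate for fusion with the cmp
cmp  xzr, xzr
b.ne 1f
; ... pair repeated 50x per loop iteration ...
1:
```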
Operand type doesn’t matter on the P-core:
| Variant | Cyc/pair |
|---|---|
| CMP xzr, xzr + B.NE | 0.506 |
| CMP x9, x10 + B.NE (register) | 0.506 |
| CMP x9, #0 + B.NE (immediate) | 0.507 |
| ADDS x11, x9, x10 + B.NE (reg, writes dest) | 0.506 |
| ADDS x11, x9, #0 + B.NE (imm, writes dest) | 0.506 |
| SUBS x11, x9, x10, lsl #0 + B.NE (shifted) | 0.506 |
All 0.506. The P-core’s fusion logic doesn’t care about encoding format, operand source, or whether the flag-setter also writes a destination register. If it sets flags and the next instruction is a conditional branch, it fuses.
What breaks fusion:
| Test | Cyc/pair | Fuses? |
|---|---|---|
| CMP + B.NE (adjacent) | 0.506 | Yes |
| CMP + NOP + B.NE | 0.505 | Yes (!) |
| CMP + ADD + B.NE | 0.992 | No |
| CMP + B.NE (taken) | 0.849 | Different bottleneck |
Two things to note:
First, inserting a NOP between CMP and B.NE does not break fusion on the P-core. This means the P-core eliminates NOPs at the rename stage before fusion happens, so the CMP and B.NE end up adjacent in the uop stream anyway.
Second, taken branches are slower (~0.85 cyc) regardless of fusion. This is a separate throughput constraint: the branch execution unit has a limited taken-branch bandwidth. It’s not that fusion fails; it’s that correctly-predicted taken branches still cost more than fall-through.
E-core results
The E-core runs at ~1.3 GHz with a narrower pipeline. The initial results here looked like ADDS/SUBS with a destination register “didn’t fuse”, but further investigation revealed a more nuanced picture involving dependency chains and instruction cracking.
Compare-only instructions fuse cleanly:
| Instruction Pair | Cyc/pair | vs Baseline | Fuses? |
|---|---|---|---|
| ADD x2 (baseline) | 0.982 | 1.00x | — |
| CMP + B.NE | 0.510 | 0.52x | Yes |
| CMN + B.NE | 0.494 | 0.50x | Yes |
| TST + B.NE | 0.538 | 0.55x | Yes |
| CMP reg,reg + B.NE | 0.536 | 0.55x | Yes |
| CMP imm + B.NE | 0.510 | 0.52x | Yes |
Flag-setting ALU ops with a destination register are slower, but why?
My initial tests showed ADDS x9 + B.NE at ~0.68 cyc/pair, between fused (0.5) and unfused (1.0). I initially attributed this to the E-core refusing to fuse instructions that write a destination register. But fusion is binary: a pair either fuses or it doesn’t. An intermediate value demands a better explanation.
An isolation test revealed two confounds in the original measurement:
- Dependency chains. ADDS x9, x9, #0 followed by another ADDS x9, x9, #0 creates a serial dependency: each iteration reads x9, which the previous iteration wrote. This serializes execution regardless of fusion. When I changed the destination to a different register (ADDS x11, x9, #0 + B.NE, no dependency chain), the P-core result dropped to 0.506, identical to fused CMP + B.NE. On the E-core, it dropped from 0.87 to 0.65.
- Instruction cracking. Even with the dependency chain removed, ADDS x11 + B.NE on the E-core still measured 0.65 cyc/pair, slower than CMP + B.NE at 0.49. The standalone throughput data explains why: ADDS with a real destination register is intrinsically slower on the E-core (0.54 cyc standalone) than CMP (0.34 cyc standalone). This suggests the E-core cracks ADDS into 2 internal uops (one for the GPR result, one for flags) when it writes a real register, while CMP is a single uop. The throughput difference in the paired test comes from this cracking overhead, not from a failure to fuse.
| E-core standalone throughput | Cyc/instr |
|---|---|
| CMP (flags only) | 0.34 |
| TST (flags only) | 0.34 |
| ADDS xzr (= CMP, no dest) | 0.38 |
| ADDS x9 (flags + dest, no dep chain) | 0.54 |
| ADD x9 (dest only, no flags) | 1.00 |
The P-core shows no such cracking. ADDS with or without a destination register has the same throughput when dependencies are removed.
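The two loop bodies differ only in the ADDS destination register (sketch; the branch target is illustrative):

```asm
; latency-bound: each ADDS must wait for the previous iteration's x9
adds x9, x9, #0      ; reads and writes x9 -> serial chain
b.ne 1f

; throughput-bound: destination differs from the sources, no chain
adds x11, x9, #0     ; x9 never changes across iterations
b.ne 1f
1:
```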
Experimental note on confounds: My first round of register-operand tests used uninitialized registers, which meant x9 and x10 could be unequal, making B.NE a taken branch while the xzr tests always produced not-taken branches. Taken branches are slower due to separate throughput constraints, so the results looked like “partial fusion” when they were partly measuring taken-branch overhead. Zeroing the registers and controlling for dependency chains was necessary to get clean data.
Taken vs not-taken
Both core types show a clear split based on branch direction:
P-core:
| Branch direction | Cyc/pair |
|---|---|
| CMP + B.NE (not taken) | 0.506 |
| CMP + B.EQ (not taken) | 0.509 |
| CMP + B.LT (not taken) | 0.506 |
| CMP + B.HI (not taken) | 0.506 |
| CMP + B.NE (taken) | 0.849 |
| CMP + B.GE (taken) | 0.846 |
| CMP + B.LS (taken) | 0.724 |
E-core:
| Branch direction | Cyc/pair |
|---|---|
| Not-taken (all conditions) | ~0.50 |
| Taken (all conditions) | ~1.05-1.12 |
Not-taken branches fuse and execute at maximum throughput. Taken branches incur additional cost even when correctly predicted. This is a fetch/redirect penalty, not a fusion failure. On the P-core, the penalty is moderate (~0.85 vs 0.50). On the E-core, taken branches are essentially unfused speed (~1.0+), suggesting the E-core may not fuse taken branches at all, or that its branch throughput limit is tighter.
Micro-op fusion
Micro-op fusion operates at a different level. Where macro-op fusion merges two architectural instructions into one uop, micro-op fusion deals with a single architectural instruction that internally decodes into multiple micro-ops.
x86: The original motivation
On x86, micro-op fusion was introduced by Intel with the Pentium M (2003). The motivating case is memory-operand instructions. Consider:
```asm
add eax, [rbx+rcx*4+16]
```

This single instruction does two things: load a value from memory, then add it to eax. Internally it decodes into 2 uops: a load uop and an ALU uop. Without micro-op fusion, these consume 2 slots in the reorder buffer (ROB), 2 slots during dispatch, and 2 slots in the retirement queue.
With micro-op fusion, the two uops are kept together as a single fused entry through the front-end: they share one ROB slot, one dispatch slot, and one retirement slot. They only split apart at the execution stage when they actually need different execution ports (load port vs ALU port). This effectively increases the ROB capacity and dispatch bandwidth without physically making them larger.
The rules for what can be micro-fused have evolved:
- Pentium M / Core: simple addressing modes (base, base+displacement)
- Sandy Bridge+: indexed addressing modes (base+index+displacement) can be micro-fused at decode but may be “unlaminated” before dispatch if the instruction has 3 or more inputs
- Haswell+: expanded unlamination rules
Unlamination
Unlamination is the reverse of micro-op fusion: a fused uop being split back into separate uops before dispatch. This happens on Intel CPUs when a micro-fused instruction has too many register inputs for the rename stage to handle in one slot. For example:
```asm
add rax, [rbx+rcx*4] ; 3 inputs: rax, rbx, rcx
```

This gets micro-fused at decode (1 fused uop in the uop cache), but the renamer splits it back into 2 uops because it needs 3 source registers. The fusion still helps by saving uop cache bandwidth, but the ROB/dispatch savings are lost.
Instructions with 2 or fewer register inputs stay fused all the way through:
```asm
add rax, [rbx+16] ; 2 inputs: rax, rbx (stays fused)
```

ARM / Apple Silicon: LDP and STP
ARM has its own version of this with load/store pair instructions. LDP (Load Pair) loads two 64-bit registers from consecutive memory locations in a single instruction:
```asm
ldp x0, x1, [sp] ; loads x0 from [sp], x1 from [sp+8]
```

Architecturally this is one instruction, but it performs two memory accesses and writes two registers. The hardware cracks it into two load micro-ops internally but, with micro-op fusion, keeps them fused through dispatch, occupying a single pipeline slot. They only split at execution to hit two separate load ports.
Testing micro-op fusion on Apple M4
I tested this by comparing the throughput of LDP/STP against equivalent pairs of LDR/STR instructions, all loading/storing from a hot L1 cache line:
P-core:
| Instruction | Cyc/pair | vs 2x Single |
|---|---|---|
| LDP (load pair) | 0.324 | 0.51x vs LDR x2 (0.641) |
| STP (store pair) | 0.465 | 0.50x vs STR x2 (0.933) |
E-core:
| Instruction | Cyc/pair | vs 2x Single |
|---|---|---|
| LDP (load pair) | 0.430 | 0.37x vs LDR x2 (1.157) |
| STP (store pair) | 1.054 | 0.55x vs STR x2 (1.907) |
LDP and STP consistently deliver ~2x the throughput of separate load/store pairs on both core types. The pair instruction goes through dispatch as a single fused uop, only splitting at the execution stage. The E-core shows an even more pronounced relative advantage because its narrower dispatch makes the saved slot proportionally more valuable.
This has direct implications for compiler output and hand-tuned code: LDP/STP should always be preferred over adjacent LDR/STR pairs. LLVM and GCC already do this aggressively in their AArch64 backends.
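As a quick illustration of that compiler behavior (a sketch; the type and function names are mine), a 16-byte struct copy like this is typically compiled to a single ldp/stp pair by clang and gcc on AArch64 at -O2:

```c
#include <stdint.h>

typedef struct { uint64_t lo, hi; } pair_t;

// On AArch64 at -O2 this copy usually becomes something like:
//   ldp x8, x9, [x1]
//   stp x8, x9, [x0]
void copy_pair(pair_t *dst, const pair_t *src) {
    *dst = *src;  // 16-byte copy of two adjacent 64-bit fields
}
```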
Other fusion and elimination on Apple M4
Beyond compare+branch fusion, Apple Silicon performs other instruction merging optimizations. Dougall Johnson’s reverse-engineering of the M1 (Firestorm/Icestorm) documented several of these. I tested whether they carry forward to the M4.
MOVZ + MOVK fusion
ARM64 can’t load a full 64-bit immediate in one instruction. Instead, constants are built up with MOVZ (load 16 bits, zero the rest) followed by one or more MOVK (patch in 16 bits, keep the rest):
```asm
movz x9, #0x1234          ; x9 = 0x0000000000001234
movk x9, #0x5678, lsl #16 ; x9 = 0x0000000056781234
```

This pattern is extremely common since every 32-bit+ constant needs it. If the CPU can fuse the pair into a single operation, it saves a dispatch slot every time a constant is materialized.
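For reference, a full 64-bit constant takes the maximum four-instruction sequence (the value here is chosen purely for illustration):

```asm
movz x9, #0x1234           ; bits  0-15 -> 0x0000000000001234
movk x9, #0x5678, lsl #16  ; bits 16-31 -> 0x0000000056781234
movk x9, #0x9abc, lsl #32  ; bits 32-47 -> 0x00009abc56781234
movk x9, #0xdef0, lsl #48  ; bits 48-63 -> 0xdef09abc56781234
```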
P-core results:
| Test | Cyc/blk | Cyc/instr |
|---|---|---|
| MOVZ standalone | 0.108 | 0.108 |
| MOVK standalone | 0.257 | 0.257 |
| MOV imm standalone | 0.109 | 0.109 |
| MOVZ + MOVK (same reg) | 0.216 | 0.108 |
| MOVZ + MOVK (diff reg) | 0.989 | 0.494 |
| MOVZ + NOP + MOVK | 0.313 | 0.104 |
| MOVZ + MOVK + MOVK (48-bit) | 0.330 | 0.110 |
MOVZ alone is essentially free (0.108 cyc, close to NOP at 0.101). It’s eliminated at rename, like a zero-idiom. The pair MOVZ + MOVK to the same register runs at 0.216 cyc/pair, which is 0.108 cyc/instr: the MOVZ is eliminated and the MOVK runs at its standalone throughput. This isn’t traditional fusion (merging two uops into one); it’s MOVZ elimination making the pair effectively a single instruction.
The three-instruction sequence MOVZ + MOVK + MOVK (for 48-bit constants) is 0.330 cyc = 0.110 cyc/instr, confirming the MOVZ is eliminated and both MOVKs execute independently. MOVZ + MOVK to different registers takes 0.989 cyc: when the registers differ, the two instructions don’t form a constant-building pair the renamer can recognize, so nothing is eliminated.
Interestingly, MOVZ + NOP + MOVK is 0.313 cyc. The NOP is also eliminated, so the MOVK still executes at roughly its standalone throughput. The P-core eliminates both NOPs and MOVZ at rename.
E-core: The same pattern holds but with more noise and higher base costs. MOVZ + MOVK (same reg) at 0.659 cyc/pair vs MOVK standalone at 0.509, consistent with MOVZ elimination.
ADRP + ADD (address materialization)
Loading a symbol’s address on ARM64 is a two-instruction sequence:
```asm
adrp x9, symbol@PAGE         ; x9 = page-aligned address of symbol
add  x9, x9, symbol@PAGEOFF  ; x9 += offset within page
```

The compiler emits this sequence for every global variable access (the linker fills in the actual addresses). Fusing this pair would save a dispatch slot on one of the most common instruction patterns in any binary.
P-core results:
| Test | Cyc/blk | Cyc/instr |
|---|---|---|
| NOP standalone | 0.101 | 0.101 |
| ADD standalone | 1.003 | 1.003 |
| ADRP standalone | 0.405 | 0.405 |
| ADR standalone | 0.406 | 0.406 |
| ADRP x2 | 0.800 | 0.400 |
| ADRP + ADD (same reg) | 0.321 | 0.160 |
| ADRP + ADD (diff reg) | 0.306 | 0.153 |
| ADRP + NOP + ADD | 0.310 | 0.103 |
ADRP standalone is 0.405 cyc, not as cheap as NOP (0.101) but cheaper than ADD (1.003). It’s not fully eliminated but may use a specialized fast path (PC-relative computation is simple). ADR has the same cost.
The ADRP + ADD pair at 0.32 cyc is faster than either instruction alone. Since ADRP + NOP + ADD is 0.31 cyc (nearly the same), this isn’t adjacency-dependent fusion. It’s just that both instructions are cheap and the pipeline can overlap them. ADRP doesn’t need an ALU port (it’s a PC-relative constant), so it doesn’t compete with the ADD for execution resources.
Notably, there’s no evidence of ADRP + ADD fusion on the M4 P-core in the macro-op fusion sense. The pair is fast because ADRP is cheap, not because they merge into one uop. This is consistent with Dougall Johnson’s M1 findings: “still not adrp + add fusion.”
E-core: ADRP standalone is 0.356 cyc (similar to NOP at 0.198, suggesting partial elimination). ADRP + ADD is 0.612 cyc, roughly ADRP + ADD standalone costs. No fusion here either.
ALU + CBZ/CBNZ fusion
Dougall Johnson documented that the M1’s P-core (Firestorm) can fuse non-flag-setting ALU instructions (ADD, SUB, AND, ORR, EOR, BIC, etc.) with CBZ/CBNZ when the destination register of the ALU op matches the operand of the compare-and-branch. This is a different fusion pattern from CMP+B.cond: here the CPU fuses an instruction that doesn’t set flags with a branch that does its own comparison.
P-core results:
| Test | Cyc/blk | vs CMP+B.NE (0.506) |
|---|---|---|
| CBZ alone (reference) | 0.668 | — |
| SUB x9 + CBNZ x9 (match) | 0.508 | Fuses |
| ADD x9 + CBZ x9 (match) | 0.794 | Doesn’t fuse cleanly |
| AND x9 + CBZ x9 (match) | 0.709 | Doesn’t fuse cleanly |
| ORR x9 + CBZ x9 (match) | 0.821 | Doesn’t fuse cleanly |
| EOR x9 + CBZ x9 (match) | 0.798 | Doesn’t fuse cleanly |
| BIC x9 + CBZ x9 (match) | 0.721 | Doesn’t fuse cleanly |
| ADD x9 + CBZ x10 (no match) | 0.817 | No |
SUB + CBNZ fuses perfectly at 0.508, confirming the pattern exists on M4. The other ALU ops show intermediate values (0.7-0.8) rather than the clean 0.5 of full fusion. These tests have the same dependency chain issue as before (the ALU writes x9, CBZ reads x9, creating a serial chain across iterations), which inflates the numbers. The fact that SUB+CBNZ still hits 0.508 despite this dependency suggests it has a special fast path, possibly the CBNZ can read the SUB’s result with zero latency when fused.
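The fusing SUB + CBNZ pair is exactly the shape of a countdown loop, which would explain why this particular combination gets a fast path (sketch; the label is illustrative):

```asm
; countdown loop: SUB writes x9, CBNZ tests the same register
.loop:
    sub  x9, x9, #1   ; x9 -= 1 (does not set flags)
    cbnz x9, .loop    ; branch while x9 != 0; dest match enables fusion
```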
Shift and extend break fusion
Dougall Johnson noted that branch fusion doesn’t work with implicit shift or extend operands. My M4 tests confirm this:
| Test | Cyc/blk | Fuses? |
|---|---|---|
| CMP reg + B.NE (no shift) | 0.505 | Yes |
| CMP lsl + B.NE (shifted) | 0.827 | No |
| ADDS lsl + B.NE (shifted) | 0.802 | No |
| SUBS uxtw + B.NE (extend) | 0.507 | Yes (!) |
CMP with a shifted register operand (cmp x9, x10, lsl #2) does not fuse. The shift makes the instruction too complex for the fuser. However, SUBS with uxtw extend still fuses on the P-core (0.507). This might be an M4 improvement over M1, or the specific extend type might matter.
Flag-reading instructions don’t fuse
Instructions that read flags (like ADCS, which reads the carry flag) cannot fuse with a following branch:
| Test | Cyc/blk |
|---|---|
| CMP + B.NE (reference) | 0.506 |
| ADCS + B.NE | 0.722 |
ADCS + B.NE at 0.722 doesn’t fuse. This makes sense: the fuser handles instructions that write flags followed by branches that read them. ADCS both reads and writes flags, creating a more complex dependency pattern that the fusion logic can’t handle.
AES instruction fusion
Apple Silicon fuses AES encrypt/decrypt rounds with their subsequent mix-columns step:
| Test | Cyc/blk |
|---|---|
| AESE alone | 2.10 |
| AESMC alone | 2.00 |
| AESE + AESMC (operands match) | 2.22 |
| AESE + AESMC (operands don’t match) | 2.25 |
| AESD + AESIMC (match) | 2.22 |
| AESE + EOR (match) | 2.17 |
The fused AESE + AESMC pair at 2.22 cyc is dramatically faster than the sum of standalone costs (2.10 + 2.00 = 4.10). The pair takes only slightly longer than AESE alone, meaning AESMC is essentially free when fused. This is a critical optimization for AES-GCM and AES-CBC performance. The fusion requires a specific operand pattern (“A,B ; A,A”, the output of AESE feeds directly into AESMC).
AESE + EOR also fuses at 2.17, important for XOR-based AES modes (CTR, GCM).
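The fusing operand shape looks like this: the AESMC must consume the AESE result in place, matching the "A,B ; A,A" pattern:

```asm
aese  v0.16b, v1.16b   ; one AES round: AddRoundKey, SubBytes, ShiftRows
aesmc v0.16b, v0.16b   ; MixColumns on AESE's own output -> fuses
```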
Patterns that don’t fuse (confirmed)
Consistent with Dougall Johnson’s M1 findings, these patterns show no fusion on M4:
| Test | Cyc/blk | Expected if fused |
|---|---|---|
| MUL + UMULH | 0.661 | ~0.33 (MUL alone) |
| ADRP + ADD | 0.321 | — (fast but not fused) |
| MOVZ + MOVK | 0.216 | — (MOVZ eliminated, not fused) |
MUL + UMULH at 0.661 is exactly 2x MUL alone (0.332). No fusion, they’re independent instructions that happen to share operands.
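For contrast, this is the full 128-bit product pattern that does not fuse:

```asm
mul   x2, x0, x1   ; low 64 bits of x0 * x1
umulh x3, x0, x1   ; high 64 bits, same sources -> two independent uops
```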
Summary: P-core vs E-core fusion on Apple M4
| Feature | P-core | E-core |
|---|---|---|
| CMP/CMN/TST + B.cond | Fuses | Fuses |
| ADDS/SUBS/ANDS/BICS (xzr dest) + B.cond | Fuses | Fuses |
| ADDS/SUBS/ANDS/BICS (real dest) + B.cond | Fuses | Slower (cracking overhead) |
| ALU (ADD/SUB/etc.) + CBZ/CBNZ (dest match) | SUB+CBNZ fuses | SUB+CBNZ fuses |
| Shifted operands + B.cond | No | No |
| Extend operands + B.cond | SUBS uxtw fuses | Unclear |
| ADCS (flag-reading) + B.cond | No | No |
| AESE + AESMC / AESD + AESIMC | Fuses | Fuses |
| AESE/AESD + EOR | Fuses | Fuses |
| Taken branches | Throughput penalty (~0.85 cyc) | Does not fuse (~1.0+ cyc) |
| NOP elimination | Yes | Partial |
| MOVZ elimination | Yes (0.108 cyc, ~free) | Yes |
| ADRP + ADD | No (but ADRP is cheap) | No |
| MUL + UMULH | No | No |
| LDP/STP micro-op fusion | ~2x vs LDR/STR | ~2x vs LDR/STR |
References
[1] Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 6th Edition
[2] Dougall Johnson, Apple Silicon CPU microarchitecture reverse engineering: https://dougallj.github.io/applecpu/firestorm.html
[3] Corsix, x86 macro-op fusion notes: https://www.corsix.org/content/x86-macro-op-fusion-notes
[4] Intel 64 and IA-32 Architectures Optimization Reference Manual
[5] EasyPerf, Micro-fusion in Intel CPUs: https://easyperf.net/blog/2018/02/15/MicroFusion-in-Intel-CPUs
[6] US Patent US20040199748A1, Micro-op fusion (Intel): https://patentimages.storage.googleapis.com/00/d5/90/553f1307abdc19/US20040199748A1.pdf
[7] LLVM AArch64 MacroFusion pass: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AArch64/AArch64MacroFusion.cpp
[8] Travis Downs, uarch-bench: https://github.com/travisdowns/uarch-bench