Veljko Milutinovic
ILP:
Understanding the Essence
UNDERSTANDING THE ILP
Methods to exploit the ILP:
Superscalar processing (SS), superpipelined processing (SP), and VLIW
ILP: A measure of the average number of instructions
that an SS/SP/VLIW processor might be able to execute at the same time
Figure ILPU1:
Instruction timing in a RISC processor without memory limitations (source: [Johnson91])
Legend:
i—instruction, f—fetch, d—decode, e—execute.
SUPERSCALAR PROCESSING
Figure ILPU2:
Instruction timing in a SUPERSCALAR processor without memory limitations (source: [Johnson91])
Legend: Self-explanatory.
Instruction timing in a superscalar processor
ILP: A function of the number of dependencies
in relation to other instructions
MLP (machine-level parallelism): A measure of the ability of a superscalar processor
to take advantage of the available ILP.
MLP and ILP must be balanced!
Fundamental performance limitations of a superscalar processor
Figure ILPU3:
True data dependencies—effect of true data dependencies on instruction timing (source: [Johnson91])
Legend: Self-explanatory.
Figure ILPU4:
Procedural dependencies—effect of procedural dependencies on instruction timing (source: [Johnson91])
Legend: Self-explanatory.
Figure ILPU5:
Resource conflicts—effect of resource conflicts on instruction timing (source: [Johnson91])
Legend: Self-explanatory.
Figure ILPU6:
Resource conflicts—effect of resource conflicts on instruction timing (source: [Johnson91])
Legend: Self-explanatory.
Out-of-order instruction issue, execution, and completion (state change)
An example from [Johnson91]:
I1 requires 2 cycles to execute
I3 and I4 conflict for a functional unit
I5 depends on a datum generated by I4
I5 and I6 conflict for a functional unit
In-order issue with in-order completion
Decode   |  Execute     |  Writeback  |  Cycle
I1  I2   |  --  --  --  |  --  --     |    1
I3  I4   |  I1  I2  --  |  --  --     |    2
I3  I4   |  I1  --  --  |  --  --     |    3
--  I4   |  --  --  I3  |  I1  I2     |    4
I5  I6   |  --  --  I4  |  --  --     |    5
--  I6   |  --  I5  --  |  I3  I4     |    6
--  --   |  --  I6  --  |  --  --     |    7
--  --   |  --  --  --  |  I5  I6     |    8
Figure ILPU7:
Example II (source: [Johnson91])
Legend: II—In-order issue, In-order completion.
Although simple, this scheme is not used even in scalar microprocessors,
because long-latency instructions become a performance bottleneck.
In-order issue with out-of-order completion
r3 := r3 op r5 (1)
r4 := r3 + 1 (2)
r3 := r5 + 1 (3)
r7 := r3 op r4 (4)
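The dependencies in this fragment ((2) truly depends on (1); (3) has an output dependency on (1) and an antidependency on (2)) can be detected mechanically from register read/write sets. A minimal sketch (my illustration, not code from [Johnson91]; the helper name and set encoding are assumptions):

```python
def classify(writes_a, reads_a, writes_b, reads_b):
    """Return the hazard types a later instruction B has on an earlier A."""
    hazards = []
    if writes_a & reads_b:
        hazards.append("true (RAW)")    # B reads what A writes
    if reads_a & writes_b:
        hazards.append("anti (WAR)")    # B overwrites what A still reads
    if writes_a & writes_b:
        hazards.append("output (WAW)")  # B writes the same destination as A
    return hazards

# (writes, reads) sets for the four instructions above:
insns = [
    ({"r3"}, {"r3", "r5"}),   # (1) r3 := r3 op r5
    ({"r4"}, {"r3"}),         # (2) r4 := r3 + 1
    ({"r3"}, {"r5"}),         # (3) r3 := r5 + 1
    ({"r7"}, {"r3", "r4"}),   # (4) r7 := r3 op r4
]

for a in range(len(insns)):
    for b in range(a + 1, len(insns)):
        h = classify(*insns[a], *insns[b])
        if h:
            print(f"({a+1}) -> ({b+1}): {', '.join(h)}")
```

Note that a purely pairwise check is conservative: it also reports a true dependency (1) -> (4) on r3, although (3) redefines r3 in between.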
Decode   |  Execute     |  Writeback  |  Cycle
I1  I2   |  --  --  --  |  --  --     |    1
I3  I4   |  I1  I2  --  |  --  --     |    2
--  I4   |  I1  --  I3  |  I2  --     |    3
I5  I6   |  --  --  I4  |  I1  I3     |    4
--  I6   |  --  I5  --  |  I4  --     |    5
--  --   |  --  I6  --  |  I5  --     |    6
--  --   |  --  --  --  |  I6  --     |    7
Figure ILPU8:
Example IO (source: [Johnson91])
Legend: IO—In-order issue, Out-of-order completion.
First used in scalar microprocessors,
to improve performance of LD and FP instructions.
Instruction issue is stalled in three cases:
(a) a resource conflict for a functional unit;
(b) a true data dependency on a result not yet computed;
(c) an output dependency on a result not yet written.
Control hardware is responsible for stalling (complicated by exceptions).
Out-of-order issue with out-of-order completion
Decode   |  Window      |  Execute     |  Writeback  |  Cycle
I1  I2   |  --          |  --  --  --  |  --  --     |    1
I3  I4   |  I1, I2      |  I1  I2  --  |  --  --     |    2
I5  I6   |  I3, I4      |  I1  --  I3  |  I2  --     |    3
--  --   |  I4, I5, I6  |  --  I6  I4  |  I1  I3     |    4
--  --   |  I5          |  --  I5  --  |  I4  I6     |    5
--  --   |  --          |  --  --  --  |  I5  --     |    6
Figure ILPU9:
Example OO (source: [Johnson91])
Legend: OO—Out-of-order issue, Out-of-order completion.
Instruction issue does not stop at the instruction creating a conflict,
because one or more subsequent instructions might be executable.
Instruction fetch and decoding continue as long as there is room in the instruction window.
Instruction issue from the window regardless of the original order.
Out-of-order issue introduces an additional constraint (antidependency):
instruction N + 1 must not destroy an input of instruction N before N has read it.
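Register renaming is the standard way to remove this constraint: every result is written to a fresh physical register, so only true dependencies remain. A hedged sketch of a mapping-table renamer (illustrative, not the mechanism of any particular processor):

```python
def rename(insns, num_logical=8):
    """insns: list of (dest, [sources]) logical register indices."""
    mapping = {r: f"p{r}" for r in range(num_logical)}  # logical -> physical
    next_phys = num_logical
    renamed = []
    for dest, srcs in insns:
        phys_srcs = [mapping[s] for s in srcs]   # read current mappings
        new_dest = f"p{next_phys}"               # allocate a fresh register
        next_phys += 1
        mapping[dest] = new_dest                 # later readers see the new copy
        renamed.append((new_dest, phys_srcs))
    return renamed

# The earlier example: r3 := r3 op r5; r4 := r3 + 1; r3 := r5 + 1; r7 := r3 op r4
prog = [(3, [3, 5]), (4, [3]), (3, [5]), (7, [3, 4])]
for dest, srcs in rename(prog):
    print(dest, ":=", srcs)
```

Applied to the four-instruction example, instruction (3) now writes p10 instead of r3, which removes both its output dependency on (1) and its antidependency on (2).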
VLIW
Figure ILPU10:
Instruction timing of a VLIW processor (source: [Johnson91])
Legend: Self-explanatory.
Single instruction specifies more than one concurrent operation.
Fewer run-time instructions and more compile-time activity.
Difficult to make it binary-code compatible with existing processors.
Code size may increase if the application or the compiler is not well suited.
Good for special-purpose applications and compilation.
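The shift toward compile-time activity can be illustrated with a toy list scheduler that packs independent operations into long words (an assumed greedy scheme for illustration only; real VLIW compilers use far more sophisticated techniques such as trace scheduling):

```python
def pack(ops, deps, width=3):
    """ops: list of op ids; deps: set of (earlier, later) pairs; width: ops per word."""
    remaining = list(ops)
    placed = set()
    words = []
    while remaining:
        word = []
        for op in list(remaining):
            # an op is ready once all its predecessors sit in earlier words
            ready = all(p in placed for (p, q) in deps if q == op)
            if ready and len(word) < width:
                word.append(op)
        for op in word:
            remaining.remove(op)
        placed.update(word)          # ops become visible only to later words
        words.append(word)
    return words

# ops 1..6 with two assumed dependencies: 2 after 1, 5 after 4
print(pack([1, 2, 3, 4, 5, 6], {(1, 2), (4, 5)}))
```

Each inner list is one long instruction word; dependent operations land in later words, independent ones share a word.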
SUPERPIPELINED PROCESSING
Figure ILPU11:
Instruction timing in a superpipelined processor (source: [Johnson91])
Legend: Self-explanatory.
SP takes longer (1.5cp) to generate a result, compared to SS (1cp);
consequently, lower performance if more true data dependencies.
SP takes less time (0.5cp) for simpler operations, compared to SS (1cp);
consequently, higher performance since test results are known sooner.
SP has a longer clock period, due to substage latches;
consequently, lower performance.
SP has fewer resources duplicated, due to resource timesharing;
consequently, lower complexity.
Good if clock skew controllable and resource duplication expensive.
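The latency trade-off above can be put into a back-of-the-envelope model (my own toy single-pipeline model, with the base clock period normalized to 1.0; not from [Johnson91]):

```python
def chain_time(n, result_latency):
    # n dependent operations serialize on the result latency
    return n * result_latency

def independent_time(n, issue_interval, result_latency):
    # n independent operations: the last issues at (n-1)*interval, then completes
    return (n - 1) * issue_interval + result_latency

# SS: 1cp result latency, 1cp issue; SP: 1.5cp result latency, 0.5cp issue
print("dependent chain of 8:  SS", chain_time(8, 1.0), " SP", chain_time(8, 1.5))
print("independent 8:         SS", independent_time(8, 1.0, 1.0),
      " SP", independent_time(8, 0.5, 1.5))
```

With these assumed numbers, a dependent chain of 8 operations finishes later on SP (12.0 vs 8.0), while 8 independent operations finish sooner (5.0 vs 8.0), matching the two bullets above.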
HYBRID TECHNIQUES: SS, SP, and VLIW combined!
Theoretical limits of ILP exploitation:
(a) conditional branches;
(b) complexity grows quadratically with the issue width, and affects clock;
Practical limits of ILP exploitation:
(a) 8-way issue or equivalent in SS/SP (if hardware support only);
(b) N-op VLIW with trace scheduling (if compiler support included);
Delayed branching:
(a) The delay is longer, due to lookahead;
(b) The number of safe instructions is smaller, compared to the issue width;
Consequently, more sophisticated techniques are needed!
MIPS R10000
Fetching 4 instructions at a time from the I cache;
these instructions are predecoded before being put into the I cache,
to help determine the instruction type immediately upon the I cache fetch.
After the I cache fetch (512 entries), branches are predicted,
using a 512-entry prediction table within the I cache controller;
each table entry holds a 2-bit counter that encodes the history information.
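The per-entry 2-bit counter can be sketched as a saturating up/down counter; only the state machine below follows the description above, while the class interface is an illustrative assumption:

```python
class TwoBitCounter:
    # States 0..3: 0 and 1 predict not-taken, 2 and 3 predict taken.
    def __init__(self, state=1):
        self.state = state

    def predict(self):
        return self.state >= 2          # True => predict taken

    def update(self, taken):
        # saturate at the ends, so a single anomaly does not flip the prediction
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

c = TwoBitCounter()
for outcome in [True, True, False, True]:   # a short branch history
    print("predict taken:", c.predict(), "actual:", outcome)
    c.update(outcome)
```

The hysteresis is the point: after two taken outcomes, one not-taken outcome does not yet change the prediction.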
If a branch is predicted taken, it takes a cycle to redirect fetching;
instructions on the not-taken path are placed into the resume cache,
in case of incorrect prediction (enough space for 4 predictions).
Fetch is followed by decoding,
register operands renaming (logical-to-physical mapping table updated),
and dispatching into one of non-FIFO I queues (easier due to predecoding).
Three I queues (memory, integer, and FP) with at most four I slots/queue;
four functional units
(address adder, integer unit with shifter, integer unit with fam, and fp unit).
On-chip primary data cache (32 Kbytes with 32-byte lines, 2w/sa);
off-chip secondary data cache;
using a reorder buffer to maintain precise state at the time of exception.
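The reorder buffer's role can be sketched as follows: results retire (update the architectural state) strictly in program order, so at an exception everything older has completed and nothing younger has. The data layout below is an illustrative assumption:

```python
from collections import deque

def retire_in_order(rob):
    """rob: deque of dicts {'insn': id, 'done': bool}, oldest first.
    Retire finished instructions from the head only; return retired ids."""
    retired = []
    while rob and rob[0]["done"]:       # only the oldest entry may retire
        retired.append(rob.popleft()["insn"])
    return retired

rob = deque([{"insn": 1, "done": True},
             {"insn": 2, "done": False},   # I2 still executing
             {"insn": 3, "done": True}])   # I3 finished early (out of order)
print(retire_in_order(rob))
```

I3 has finished but must wait behind I2; if I2 raises an exception, the architectural state still reflects exactly the instructions up to I1.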
Figure ILPU12:
Organization of MIPS R10000 (source: [Smith95])
Legend:
FP—Floating-point.
DEC Alpha21164
A simple superscalar design going after a high clock rate;
four instructions are fetched from an 8-Kbyte I cache
and placed into one of two I buffers (four instructions each).
Issue proceeds from the buffers (in order, with no instruction bypassing at all);
one buffer must be empty before the other one can be used
(simpler design, but slower issue).
Branches are predicted using a 2-bit history table
associated with the I cache controller;
instruction issue stalls if a second branch is encountered before the first is resolved.
After fetch and decode, instructions are arranged
according to the functional unit they will use;
Issue follows after the operands are ready (from registers or via bypassing).
Four FUs: integer with shifter, integer with branch evaluation, FP adder, FP multiplier;
integer instructions update registers in order;
FP instructions update register files out of order; not all FP exceptions are precise.
On-chip primary caches: 8 Kbytes each (I + D, direct-mapped for 1cp cache access);
a 6-entry MAF (miss address file) holds miss addresses and target registers
for up to 21 missed loads (merging).
On-chip secondary cache: 96 Kbytes (shared, 3w/sa).
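The MAF's merging behavior can be sketched as follows (the 6-entry size and the miss-address/target-register contents follow the description above; the entry format, block granularity, and function name are assumptions):

```python
def maf_insert(maf, addr, target_reg, block=32, entries=6):
    """Record a missed load; merge it if its block already has an entry."""
    blk = addr // block
    for e in maf:
        if e["block"] == blk:            # merge with an outstanding miss
            e["targets"].append(target_reg)
            return True
    if len(maf) < entries:
        maf.append({"block": blk, "targets": [target_reg]})
        return True
    return False                          # MAF full: the load must stall

maf = []
for addr, reg in [(0x100, 1), (0x108, 2), (0x200, 3)]:
    maf_insert(maf, addr, reg)
print(len(maf), "entries for 3 missed loads")
```

Two loads to the same block share one entry, so three missed loads occupy only two MAF entries and trigger only two off-chip fetches.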
Figure ILPU13:
Organization of DEC Alpha 21164 (source: [Smith95])
Legend:
FP—Floating-point.
REFERENCES
[Johnson91] Johnson, M., Superscalar Microprocessor Design, Prentice-Hall, Englewood Cliffs, New Jersey, 1991.
[Smith95] Smith, J.E., Sohi, G., "The Microarchitecture of Superscalar Processors,"
Proceedings of the IEEE, Vol. 83, No. 12, December 1995, pp. 1609–1624.
Veljko Milutinovic
ILP:
State of the Art
OPTIMAL NUMBER OF FUs
FOR OUT-OF-ORDER
SUPERSCALAR PROCESSORS
Problem:
Number of FUs for maximal speed-up of modern SS microprocessors
(MC 88110, UltraSparc, DEC Alpha 21164,
PPC 604 ooo, and MIPS R10000 ooo).
Processor (ship date) | Degree | Int | Shift | Div | Mul | FPAdd | Conv | FPDiv | FPMul | DCachePort
PPC 604 (94)          |   4    |  2  |   2   |  1  |  1  |   1   |  1   |   1   |   1   |     1
MC 88110 (92)         |   2    |  2  |   1   |  1  |  1  |   1   |  1   |   1   |   1   |     1
UltraSparc (95)       |   4    |  2  |   2   |  2  |  2  |   2   |  2   |   2   |   2   |     1
DEC 21164 (95)        |   4    |  2  |   2   |  2  |  2  |   1   |  1   |   1   |   1   |     2
MIPS 10000 (95)       |   4    |  2  |   2   |  2  |  2  |   1   |  1   |   1   |   1   |     1
Figure ILPS1:
Basic characteristics of modern superscalar processors (source: [Jourdan95])
Legend: Self-explanatory.
Solution:
From 5 to 9 FUs are necessary to exploit the existing ILP
in modern microprocessor architectures.
If the degree of superscaling is 4 or more, several data cache ports are needed.
Conditions:
Real benchmark: Spec92
Lookahead window degree: 2 to 8
Effect of cache misses: n/a
Effect of ooo execution: assumed where not existing (unified issue buffer)
Reference:
[Jourdan95] Jourdan, S., Sainrat, P., Litaize, D.,
"Exploring Configurations of Functional Units
in an Out-of-Order Superscalar Processor,"
Proceedings of the ISCA-95, Santa Margherita Ligure, Italy,
May 1995, pp. 117–125.
INCREASING
THE CACHE PORT EFFICIENCY
FOR SUPERSCALAR PROCESSORS
Problem:
Figure ILPS4:
The load/store unit data path in modern SS microprocessors (source: [Wilson96])
Legend:
CAB—Cache Address Buffer
Multiple cache ports are necessary for better exploitation of ILP;
however, they are expensive.
Solution:
The bandwidth of a single port can be improved
by using additional buffering in the processor and wider ports in the caches.
Four hardware enhancements are proposed:
load all, load all wide, keep tags, and line buffer.
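As one illustration, the line-buffer idea keeps the most recently fetched cache line in a small buffer, so subsequent loads that hit it need no cache port. A hedged sketch (line size and interface are my assumptions, not [Wilson96]'s exact design):

```python
def simulate_port_accesses(addresses, line_size=32):
    """Count cache-port accesses when a one-line buffer serves repeat hits."""
    buffered_line = None
    port_accesses = 0
    for addr in addresses:
        line = addr // line_size
        if line != buffered_line:       # line-buffer miss: use a cache port
            port_accesses += 1
            buffered_line = line        # the fetched line fills the buffer
    return port_accesses

# Four loads, three falling in the same 32-byte line:
print(simulate_port_accesses([0x100, 0x104, 0x108, 0x200]))
```

Here four loads cost only two port accesses, which is the bandwidth gain the single port buys from the extra buffering.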
Conditions:
A 200 MHz R10000-like ISA with a 64-entry ROB and a 32-entry load/store buf;
Split primary caches (2w/sa, 8KB), unified secondary cache (2w/sa, 2MB), main memory;
Four miss status handling registers (MSHRs), to enable four on-going misses;
SimOS and Spec95.
Reference:
[Wilson96] Wilson, K. M., Olukotun, K., Rosenblum, M.,
"Increasing Cache Port Efficiency
for Dynamic Superscalar Microprocessors,"
Proceedings of the ISCA-96,
Philadelphia, Pennsylvania, May 1996, pp. 147–157.
REGION-BASED COMPILATION
Problem:
Traditionally, compilers have been built assuming functions as the units of optimization;
consequently, function boundaries hide valuable optimization opportunities.
Figure ILPS2:
Example of an undesirable function-based code partition (source: [Hank95])
Legend: Numbers refer to task units.
Solution:
Compiler is allowed to repartition the code into more desirable units,
in order to achieve more efficient code optimization.
Figure ILPS3:
Example of a region-based compiler: inlining and repartitioning into regions (source: [Hank95])
Legend: Self-explanatory.
Conditions:
Compilation units (CUs) selected by compiler, not by the software designer.
Optimal size of the CU, for better utilization of employed transformations.
Profiler support to determine regions for code closer to optimal.
Multiflow VLIW compiler with trace scheduling and register allocation.
Reference:
[Hank95] Hank, R. E., Hwu, W-m. W., Rau, B. R.,
"Region-Based Compilation: An Introduction and Motivation,"
Proceedings of the MICRO-28, Ann Arbor, Michigan,
November/December 1995, pp. 158–168.
Veljko Milutinovic
ILP:
IFACT
A High-Level Language Oriented ILP
Essence:
References:
[Milutinovic87a] Milutinovic, V., Lopez-Benitez, N., Hwang, K.,
"A GaAs-Based Microprocessor Architecture
for Real-Time Applications,"
IEEE Transactions on Computers,
June 1987, pp. 714–727.
[Milutinovic87b] Milutinovic, V.,
"A Simulation Study
of the Vertical-Migration
Microprocessor Architecture,"
IEEE Transactions on Software Engineering,
December 1987, pp. 1265–1277.