GaAs Part 4

CATALYTIC MIGRATION

from the

RISC ENVIRONMENT

POINT-OF-VIEW

Veljko Milutinović

UNIVERSITY of BELGRADE

This research was sponsored by NCR

DEFINITION: DIRECT MIGRATION

Migration of an entire hardware resource

into the system software.

EXAMPLES:

Pipeline interlock.

Branch delay control.

ESSENCE:

Examples that result in code speed-up

are very difficult to invent.

DELAYED CONTROL TRANSFER

DEFINITION: Catalytic Migration

Migration based on the utilization of a catalyst.

MIGRANT vs CATALYST

Figure 7.13. The catalytic migration concept. Symbols M, C, and P refer to the migrant, the catalyst, and the processor, respectively. The acceleration comes from extracting a migrant of relatively large VLSI area, and is obtained after adding a catalyst of significantly smaller VLSI area.

ESSENCE:

Examples that result in code speed-up

are much easier to invent.

METHODOLOGY:

Area estimation: Migrant

Area estimation: Catalyst

Real estate to invest: Difference

Investment strategy: R

Compile time algorithms

Analytical analysis

Simulation analysis

Implementational analysis

NOTE: Before the reinvestment,

the migration may result in slow-down.
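The area bookkeeping in the methodology can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the author's tool: the function name and the sample area figures are hypothetical, and the units are arbitrary as long as both estimates use the same one.

```python
def reinvestment_budget(migrant_area, catalyst_area):
    """Return the area freed for reinvestment: migrant estimate minus catalyst estimate.

    Both arguments must use the same (arbitrary) unit, e.g. transistor count.
    """
    if migrant_area < 0 or catalyst_area < 0:
        raise ValueError("areas must be non-negative")
    return migrant_area - catalyst_area


# Hypothetical figures, chosen only to illustrate the arithmetic.
budget = reinvestment_budget(migrant_area=10_000, catalyst_area=1_500)
print(budget)  # 8500
```

A result of zero or less would mean the catalyst costs at least as much as the migrant frees, so there is nothing to reinvest.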

(N-2)*W vs DMA

Figure 7.16. An example of the DW (double windows) type of catalytic migration, (a) before the migration; (b) after the migration. Symbol M refers to the main store. The symbol L-bit DMA refers to the direct memory access which transfers L bits in one clock cycle. Symbol NW refers to the register file with N partially overlapping windows (as in the UCB-RISC processor), while the symbol DW refers to the register file of the same type, only this time with two partially overlapping windows. The addition of the L-bit DMA mechanism, in parallel to the execution using one window, enables the simultaneous transfer between the main store and the window which is currently not in use. This enables one to keep the contents of the nonexistent N – 2 windows in the main store, which not only keeps the resulting code from slowing down, but actually speeds it up, because the transistors released through the omission of N – 2 windows can be reinvested more appropriately.
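Since the L-bit DMA moves L bits per clock cycle, the time to swap the idle window can be estimated with simple arithmetic. The sketch below is a simplification: it ignores the overlap between windows and any main-store latency, and the function name, parameter names, and sample sizes are hypothetical.

```python
def window_transfer_cycles(regs_per_window, bits_per_reg, dma_width_bits):
    """Clock cycles needed to move one register window over an L-bit DMA.

    Assumes the DMA transfers exactly dma_width_bits per clock and that the
    whole window is transferred (window overlap is ignored).
    """
    if dma_width_bits <= 0:
        raise ValueError("DMA width must be positive")
    total_bits = regs_per_window * bits_per_reg
    # Ceiling division using integers only.
    return -(-total_bits // dma_width_bits)


# Hypothetical sizes: 16 registers of 32 bits each, L = 128.
print(window_transfer_cycles(16, 32, 128))  # 4
```

The transfer stays hidden only while the routine running in the active window lasts at least this many cycles.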

Migrant: (N-2)*W

Catalyst: L-bit DMA

i: load r1, MA ∈ {MEM – 6}

i + 1: load r2, MA ∈ {MEM – 3}

Figure 7.14. An example of catalytic migration: type HW (hand walking): (a) before the migration; (b) after the migration. Symbols P and GRF refer to the processor and the general-purpose register file, respectively. Symbols RA and MA refer to the register address and the memory address in the load instruction. Symbol MEM – n refers to the main store which is n clocks away from the processor. Addition of another bus for the register address eliminates a relatively large number of nop instructions (which have to separate the interfering load instructions).
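To see why the two loads at i and i + 1 interfere, consider a toy timing model. It is an assumption made for illustration, not taken from the figure: one instruction issues per clock, a load issued at clock t to MEM – n returns its data at clock t + n, and before the migration the data must return in program order. The function name is hypothetical.

```python
def nops_before_migration(latencies):
    """Count the nops needed so that load data returns in program order.

    latencies[k] is the distance in clocks of the memory addressed by the
    k-th load. Assumed model: one issue per clock; a load issued at clock t
    returns its data at clock t + latency.
    """
    nops = 0
    issue = 0          # earliest clock at which the current load could issue
    last_return = -1   # clock at which the previous load's data returns
    for latency in latencies:
        # Delay the issue until the data would return strictly after
        # the data of the previous load.
        earliest = max(issue, last_return + 1 - latency)
        nops += earliest - issue
        last_return = earliest + latency
        issue = earliest + 1
    return nops


# The pair from the slide: MEM - 6 followed by MEM - 3.
print(nops_before_migration([6, 3]))  # 3
```

In this model the reverse order, MEM – 3 followed by MEM – 6, needs no nops, which shows that the cost comes from the interference and not from the loads themselves.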

Figure 7.15. An example of catalytic migration: type II (ignore instruction): (a) before the migration; (b) after the migration. Symbol t refers to time, and symbol UI refers to the useful instruction. This figure shows the case in which the code optimizer has successfully eliminated only two nop instructions, and has inserted the ignore instruction, immediately after the last useful instruction. The addition of the ignore instruction and the accompanying decoder logic eliminates a relatively large number of nop instructions, and speeds up the code, through a better utilization of the instruction cache.
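The code-size effect of the ignore instruction can be sketched as a small rewrite over a delay-slot sequence. The list-of-strings representation and the mnemonics `nop` and `ignore` are assumptions made for illustration, and the sketch does not model how the decoder logic skips the remaining slots.

```python
def insert_ignore(slots):
    """Replace the trailing run of nops in a delay-slot sequence with one 'ignore'.

    `slots` is a list of instruction mnemonics. Useful instructions come
    first, followed by the nops that the code optimizer could not eliminate.
    """
    end = len(slots)
    while end > 0 and slots[end - 1] == "nop":
        end -= 1
    trailing = len(slots) - end
    if trailing <= 1:
        # A single nop costs the same as a single ignore: no gain.
        return list(slots)
    return list(slots[:end]) + ["ignore"]


before = ["UI1", "UI2", "nop", "nop", "nop", "nop"]
after = insert_ignore(before)
print(len(before), len(after))  # 6 3
```

The shorter sequence occupies fewer instruction-cache entries, which is the source of the speed-up named in the caption.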

CODE INTERLEAVING

Figure 7.17. An example of the CI (code interleaving) catalytic migration: (a) before the migration; (b) after the migration. Symbols A and B refer to the parts of the code in two different routines that share no data dependencies. Symbols GRF and SGRF refer to the general-purpose register file (GRF), and the subset of the GRF (SGRF). The sequential code of routine A is used to fill in the slots in routine B, and vice versa. This is enabled by adding new registers (SGRF) and some additional control logic which is quite simple. The speed-up is achieved through the elimination of nop instructions, and the increased efficiency of the instruction cache (a consequence of the reduced code size).
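The slot-filling step of code interleaving can be sketched as follows. This is a minimal illustration, assuming that the two routines share no data dependencies (as the caption states) so that any instruction of one routine may occupy any nop slot of the other. The function name and instruction labels are hypothetical.

```python
def fill_slots(host, filler):
    """Fill the nop slots of `host` with instructions from `filler`, in order.

    Returns (merged_code, leftover_filler). Filler instructions keep their
    original relative order, so the filler routine stays sequential.
    """
    pending = list(filler)
    merged = []
    for instr in host:
        if instr == "nop" and pending:
            merged.append(pending.pop(0))
        else:
            merged.append(instr)
    return merged, pending


routine_b = ["b1", "nop", "nop", "b2", "nop"]
routine_a = ["a1", "a2"]
merged, leftover = fill_slots(routine_b, routine_a)
print(merged)    # ['b1', 'a1', 'a2', 'b2', 'nop']
print(leftover)  # []
```

Each filled slot removes one nop from the combined code, so the two routines together take fewer instruction words than they would separately.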

M: Code

C: SGRF

APPLICATION:

1. Technologies with small on-chip transistor count.

The larger the ratio of off-chip to on-chip delays,

the better it works.

2. Technologies with dissipation-related limitations.

The larger the dissipation costs,

the better it works.

EXAMPLES:

CLASSIFICATION:

EXAMPLES:

Figure 7.18. A methodological review of catalytic migration (intended for a detailed study of a new catalytic migration example). Symbols S and R refer to the speed-up and the initial register count. Symbol N refers to the number of generated ideas. The meaning of other symbols is as follows: MAE—migrant area estimate, CAE—catalyst area estimate, DFR—difference for reinvestment, RSD—reinvestment strategy developed, CTA—compile-time algorithm, AAC—analytical analysis of the complexity, AAP—analytical analysis of the performance, SAC—simulation analysis of the complexity, SAP—simulation analysis of the performance, SLL—summary of lessons learned.

RISCs FOR NN: Core + Accelerators

Figure 8.1. RISC architecture with on-chip accelerators. Accelerators are labeled ACC#1, ACC#2, …, and they are placed in parallel with the ALU. The rest of the diagram is the common RISC core. All symbols have standard meanings.

Figure 8.2. Basic problems encountered during the realization of a neural computer: (a) an electronic neuron; (b) an interconnection network for a neural network. Symbol D stands for the dendrites (inputs), symbol S stands for the synapses (resistors), symbol N stands for the neuron body (amplifier), and symbol A stands for the axon (output). The symbols , , , and stand for the input connections, and the symbols , , , and stand for the output connections.
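The electronic neuron of part (a) can be mirrored in software as a weighted sum followed by a saturating function. This is a sketch under assumptions: the synapses are modeled as plain multiplicative weights, and tanh stands in for the amplifier, whose actual transfer characteristic the caption does not specify.

```python
import math


def neuron_output(inputs, weights):
    """Weighted sum of the dendrite inputs, passed through a saturating amplifier.

    tanh is an assumed stand-in for the amplifier characteristic.
    """
    if len(inputs) != len(weights):
        raise ValueError("one weight is needed per input")
    activation = sum(x * w for x, w in zip(inputs, weights))
    return math.tanh(activation)


# Two inputs that cancel out give a zero output on the axon.
print(neuron_output([1.0, -1.0], [0.5, 0.5]))  # 0.0
```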

Figure 8.3. A system architecture with N-RISC processors as nodes. Symbol PE (processing element) represents one N-RISC, and refers to “hardware neuron.” Symbol PU (processing unit) represents the software routine for one neuron, and refers to “software neuron.” Symbol H refers to the host processor, symbol L refers to the 16-bit link, and symbol R refers to the routing algorithm based on the MP (message passing) method.

Figure 8.4. The architecture of an N-RISC processor. This figure shows two neighboring N-RISC processors, on the same ring. Symbols A, D, and M refer to the addresses, data, and memory, respectively. Symbols PLA (comm) and PLA (proc) refer to the PLA logic for the communication and processor subsystems, respectively. Symbol NLR refers to the register which defines the address of the neuron (name/layer register). Symbol refers to the only register in the N-RISC processor. Other symbols are standard.

Figure 8.5. Example of an accelerator for neural RISC: (a) a three-layer neural network; (b) its implementation based on the reference [Distante91]. The squares in Figure 8.5.a stand for input data sources, and the circles stand for the network nodes. Symbols W in Figure 8.5.b stand for weights, and symbols F stand for the firing triggers. Symbols PE refer to the processing elements. Symbols W have two indices associated with them, to define the connections of the element (for example, and so on). The exact values of the indices are left to the reader to determine, as an exercise. Likewise, the PE symbols have one index associated with them, to determine the node they belong to. The exact values of these indices were also left out, so the reader should determine them, too.

Figure 8.6. VLSI layout for the complete architecture of Figure 8.5. Symbol T refers to the delay unit, while symbols IN and OUT refer to the inputs and the outputs, respectively.

Figure 8.7. Timing for the complete architecture of Figure 8.5. Symbol t refers to time, symbol F refers to the moments of triggering, and symbol P refers to the ordinal number of the processing element.