Branching in pipelined machines:

Interlock mechanism:

hw (cisc-mostly) versus sw (risc-mostly)

Scoreboard branch: hw interlock

(clock slow-down)

ALU (arithmetic-logic-unit) suspend

RWB (register-write-unit) suspend

Delayed branch: sw interlock

source code:

i-1 ADD R7, imm32

i JUMP R1, R2>R3

i+1 MOVE R3, R4

i+2 SUB R5, R6

after code generation:

i-1 ADD R7, imm32

i JUMP R1+1, R2>R3

i+1 NOOP

i+2 MOVE R3, R4

i+3 SUB R5, R6

after code optimization:

i-1

i JUMP R1+1, R2>R3

i+1 ADD R7, imm32

i+2 MOVE R3, R4

i+3 SUB R5, R6

condition: THE MOVED INSTRUCTION

a. MUST BE EXECUTED (no matter if the

branch is taken or not), AND

b. HAS NO EFFECT ON THE CONDITION AND/OR

THE JUMP TARGET ADDRESS.
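The fill-in condition can be sketched as a small check over register read/write sets. This is a minimal illustration, not the compiler pass of any particular machine: the `Instr` record, the helper names, and the register roles assigned to MOVE and SUB are assumptions made for the example. Condition (a) holds by construction here, because the candidate is taken from just before the branch and so runs on both paths.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    text: str
    reads: frozenset = frozenset()
    writes: frozenset = frozenset()
    is_branch: bool = False


def can_move_into_slot(candidate, branch):
    """Condition (b): the candidate must not write anything the branch reads."""
    if candidate.is_branch:
        return False
    return not (candidate.writes & branch.reads)


def fill_delay_slot(code):
    """Replace the NOOP after a branch with the instruction preceding the branch."""
    out = list(code)
    for k in range(1, len(out) - 1):
        before, branch, slot = out[k - 1], out[k], out[k + 1]
        if (branch.is_branch and slot.text == "NOOP"
                and can_move_into_slot(before, branch)):
            out[k + 1] = before   # moved instruction now sits in the delay slot
            del out[k - 1]        # its old position (i-1) is vacated
            break                 # the sketch handles one branch only
    return out


# The "after code generation" sequence from the slide.
generated = [
    Instr("ADD R7, imm32", frozenset({"R7"}), frozenset({"R7"})),
    Instr("JUMP R1+1, R2>R3", frozenset({"R1", "R2", "R3"}), is_branch=True),
    Instr("NOOP"),
    Instr("MOVE R3, R4", frozenset({"R4"}), frozenset({"R3"})),  # roles assumed
    Instr("SUB R5, R6", frozenset({"R5", "R6"}), frozenset({"R5"})),  # roles assumed
]

for ins in fill_delay_slot(generated):
    print(ins.text)
```

If the ADD wrote R1, R2, or R3 instead of R7, the check would fail and the NOOP would stay in place.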

parameters:

a. PIPELINE FILL-IN DEPTH

(which is not the pipeline depth minus one!)

b. BRANCHING-RELATED STATISTICS

(branches executed versus branches taken)

c. BRANCH FILL-IN FUNCTION

(local versus global code optimization)

d. CLOCK SLOW DOWN FUNCTION

(in-the-critical-path versus off-the-critical-path)

e. TECHNOLOGY-RELATED STATISTICS

(on-chip versus off-chip delays)

f. CACHE IMPACT (hit versus miss penalty)

NUMERICAL EXAMPLE:

What is the equation for the condition that

hw and sw interlock have the same

benchmark execution time (not clock-count)?
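One possible way to set up such an equation, as a sketch under stated assumptions rather than the lecture's own answer. The symbols are introduced here for illustration only: N is the number of useful instructions, b the fraction of them that are branches, t the fraction of branches taken, d the pipeline fill-in depth, f the fraction of delay slots the optimizer fills with useful instructions, s the relative clock slow-down caused by the hw interlock, and T the clock period of the sw-interlock machine. The model assumes that the hw interlock loses d cycles only on taken branches, that the sw interlock executes one NOOP per unfilled slot, and that technology and cache effects (parameters e and f) are ignored.

```latex
\begin{align}
T_{\mathrm{exec}}^{\mathrm{hw}} &= N\,(1 + b\,t\,d)\,(1 + s)\,T \\
T_{\mathrm{exec}}^{\mathrm{sw}} &= N\,\bigl(1 + b\,d\,(1 - f)\bigr)\,T \\
T_{\mathrm{exec}}^{\mathrm{hw}} = T_{\mathrm{exec}}^{\mathrm{sw}}
  &\iff (1 + b\,t\,d)\,(1 + s) = 1 + b\,d\,(1 - f)
\end{align}
```

Under this model the break-even point depends on clock period as well as clock count, which is why the question asks for execution time.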

Loading in pipelined machines:

Interlock mechanism: hw versus sw

Scoreboard LOAD:

Suspend

Bypass

Delayed LOAD: sw interlock

source code:

i-1 MOVE R3,R4

i LOAD R7, memory

i+1 ADD R2, R1, R7

after code generation:

i-1 MOVE R3,R4

i LOAD R7, memory

i+1 NOOP

i+2 ADD R2, R1, R7

after code optimization:

i-1

i LOAD R7, memory

i+1 MOVE R3,R4

i+2 ADD R2, R1, R7

condition: mutual independence
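Mutual independence can be sketched as the absence of any register hazard between the LOAD and the instruction moved into its delay slot. The `Op` record, the helper names, and the assumption that MOVE writes its first operand are illustrative only; the memory address of the LOAD is treated as reading no registers because the slide does not specify it.

```python
from typing import NamedTuple


class Op(NamedTuple):
    text: str
    reads: frozenset
    writes: frozenset


def independent(a, b):
    """True when a and b share no RAW, WAR, or WAW register dependence."""
    raw = a.writes & b.reads
    war = a.reads & b.writes
    waw = a.writes & b.writes
    return not (raw or war or waw)


def fill_load_slot(code):
    """Replace the NOOP after a LOAD with the instruction preceding the LOAD."""
    out = list(code)
    for k in range(1, len(out) - 1):
        before, load, slot = out[k - 1], out[k], out[k + 1]
        if (load.text.startswith("LOAD") and slot.text == "NOOP"
                and independent(before, load)):
            out[k + 1] = before   # moved instruction fills the load delay slot
            del out[k - 1]        # position i-1 is vacated
            break                 # the sketch handles one load only
    return out


# The "after code generation" sequence from the slide.
generated = [
    Op("MOVE R3,R4", frozenset({"R4"}), frozenset({"R3"})),  # roles assumed
    Op("LOAD R7, memory", frozenset(), frozenset({"R7"})),
    Op("NOOP", frozenset(), frozenset()),
    Op("ADD R2, R1, R7", frozenset({"R1", "R7"}), frozenset({"R2"})),
]

for op in fill_load_slot(generated):
    print(op.text)
```

A MOVE that read or wrote R7 would fail the test and the NOOP would remain.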

parameters: technology related,

design + organization +

architecture related,

system software related,

and application related.

numerical example:

What is the equation ... ?

Where is the ISP' code to describe delayed branching and delayed loading?

Where are the two taken care of?

The complete "case":

! Instruction decode and execution is done here. The "case" statement performs

! the decode - note that the opcode bits are tested as one would expect.

! For each legal opcode, a unique action is specified.

! Only one action is performed, then the bottom of the "main" process is reached,

! and we return to the top of the process.

case op

0:	reg[dst] = reg[src1] + reg[src2]	! add (reg-reg)
1:	reg[dst] = reg[src1] + imm16 sxt 32	! add (reg-imm)
2:	reg[dst] = pc + imm16 sxt 32	! add (pc-imm)	!!
3:	reg[dst] = reg[src1] - reg[src2]	! sub (reg-reg)
4:	reg[dst] = reg[src1] - imm16 sxt 32	! sub (reg-imm)
5:	reg[dst] = pc - imm16 sxt 32	! sub (pc-imm)
6:	reg[dst] = reg[src1]	! mov (reg-reg)
7:	reg[dst] = imm16 sxt 32	! mov (reg-imm)
8:	reg[dst] = pc	! mov (pc-imm)
9:	reg[dst] = - reg[src1]	! negate
10:	reg[dst] = reg[src1] and reg[src2]	! and (reg-reg)
11:	reg[dst] = reg[src1] and imm16 sxt 32	! and (reg-imm)
12:	reg[dst] = reg[src1] or reg[src2]	! or (reg-reg)
13:	reg[dst] = reg[src1] or imm16 sxt 32	! or (reg-imm)
14:	reg[dst] = not reg[src1]	! not
15:	reg[dst] = reg[src1] *:arith (imm5 ext 32)	! shift left	!!
16:	reg[dst] = reg[src1] /:arith (imm5 ext 32)	! shift right	!!
17:	if reg[src1] eql reg[src2]	! set if equal
	reg[dst] = - 1
	else reg[dst] = 0
18:	if reg[src1] gtr reg[src2]	! set if greater
	reg[dst] = - 1
	else reg[dst] = 0
19:	if reg[src1] eql -1	! branch on true
	pc = reg[dst]
20:	pc = reg[dst]	! branch always
21:	(pastdst = dst;	! load
	pastval = memry[reg[src2]]
	)
22:	memry[reg[src2]] = reg[dst]	! store
23:	;

esac;
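Opcode 21 does not write reg[dst] directly; it latches the destination and the value in pastdst and pastval. The sketch below shows one plausible reading of how that models a delayed load: the latched value is committed only after the following instruction has executed, so the instruction in the delay slot still sees the old register contents. The commit point is an assumption, since the part of the "main" process that consumes pastdst and pastval is not shown here; the function name and the tuple encoding of instructions are also hypothetical, and only opcodes 0, 21, and 23 are modeled.

```python
def run(program, memory, reg):
    """Execute (op, dst, src1, src2) tuples with a one-instruction load delay."""
    reg = list(reg)
    pending = None  # (pastdst, pastval) latched by the previous load, if any
    for op, dst, src1, src2 in program:
        to_commit, pending = pending, None
        if op == 0:                      # add (reg-reg)
            reg[dst] = reg[src1] + reg[src2]
        elif op == 21:                   # load: latch only, no register write yet
            pending = (dst, memory[reg[src2]])
        elif op == 23:                   # noop
            pass
        if to_commit is not None:        # assumed commit point: after the slot
            pastdst, pastval = to_commit
            reg[pastdst] = pastval
    return reg


memory = {0: 5}
program = [
    (21, 7, 0, 0),  # load r7 <- memory[r0]
    (0, 2, 1, 7),   # delay slot: add r2 = r1 + r7 (old r7)
    (0, 3, 1, 7),   # add r3 = r1 + r7 (loaded r7)
]
result = run(program, memory, [0] * 16)
print(result[2], result[3], result[7])
```

With all registers initially zero, the add in the delay slot uses the stale r7 while the second add uses the loaded value, which is the behavior the delayed-load optimization must respect.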

The ".m" file:

- Instr Section

instr

I<32>$

- Format Section

format

op = I<32:24>,

dst = I<23:20>,

src1 = I<19:16>,

src2 = I<15:12>,

imm16 = I<15:0>,

imm5 = I<4:0>$

- Macro section

macro

r0 = 0&,

r1 = 1&,

...

r15 = 15&,

addr(d,s1,s2) = op=0; dst=d;

src1=s1; src2=s2$&,

noophalt = op=23$&$

- Begin-end section

begin

include ee666.test$

end
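The format and macro sections can be mirrored in a short sketch that packs and unpacks the instruction fields. It assumes the opcode occupies bits 31 to 24 of the 32-bit word (the format section writes the range as I<32:24>) and that imm16 spans bits 15 to 0; the function names are hypothetical and not part of the N.2 tools.

```python
def encode_rrr(op, dst, src1, src2):
    """Pack a register-register instruction, as the addr(d,s1,s2) macro does."""
    return (op << 24) | (dst << 20) | (src1 << 16) | (src2 << 12)


def decode(word):
    """Split a 32-bit instruction word into the fields of the format section."""
    return {
        "op": (word >> 24) & 0xFF,    # assumed to be bits 31:24
        "dst": (word >> 20) & 0xF,
        "src1": (word >> 16) & 0xF,
        "src2": (word >> 12) & 0xF,
        "imm16": word & 0xFFFF,       # overlaps src2, as in the format section
        "imm5": word & 0x1F,
    }


def sxt16(value):
    """Sign-extend a 16-bit immediate, matching 'imm16 sxt 32' in the ISP' code."""
    return value - 0x10000 if value & 0x8000 else value


word = encode_rrr(0, 8, 0, 2)      # addr(r8,r0,r2)
fields = decode(word)
print(hex(word), fields["op"], fields["dst"], fields["src1"], fields["src2"])
print(hex(encode_rrr(23, 0, 0, 0)))  # noophalt: op=23
print(sxt16(0xFFFF))
```

Because src2 and imm16 share bits, an instruction uses one or the other depending on its opcode.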

The ".i" file:

- Instr Section

instr

I<32>$

- Format Section

format

op = I<32:24>,

dst = I<23:20>,

src1 = I<19:16>,

src2 = I<15:12>,

imm16 = I<15:0>,

imm5 = I<4:0>$

- Space section

space

<0:4095>$

- Transfer section

transfer

{new}

- Mode section

mode

case op eql 7

imm16~address$

break$

esac,

default:

imm16~imm16$

break$

esac$

The ".t" file:

processor cpu = "ee666.sim";

time delay = 100ns;

initial memry = l.out;

The ".b" file:

Sample assembler language program that uses the instructions

for the RISC-like processor of the ee666 (Advanced Computer Systems),

Purdue University, Spring Semester 1987.

Filename: ee666.test

movi(r0,100)

subri(r1,10,100)

movr(r2,r1)

seq(r3,r1,r2)

movi(r4,11)

movi(r5,12)

movi(r6,13)

bt(r4,r3)

ba(r5)

movi(r1,10)

11: addri(r1,r1,1)

addri(r1,r1,1)

12: sgt(r7,r2,r1)

bt(r6,r7)

addr(r8,r0,r2)

subri(r9,r1,10)

st(r9,r8)

ba(r5)

addri(r2,r2,2)

13: subri(r8,r8,2)

ld(r8,r8)

movr(r10,r8)

addr(r10,r10,r8)

sla(r10,r10,2)

halt

Sample Fura RISC VMS Session:

set def [.N2]

copy VL$A:[N2.E666]*.* *.*

[N2]login

n2 -s script.txt ee666.e00

If you want to test your own CPU:

[N2]login

edit cpuname.isp

ic cpuname.isp

edit cpuname.m

edit program.m

micro cpuname.m

edit cpuname.i

inter cpuname.i

cater cpuname.a cpuname.n

edit cpuname.t

ec -b cpuname.t

n2 -s script.txt cpuname.e00
