Branching in pipelined machines:

 

Interlock mechanism:

hw (cisc-mostly) versus sw (risc-mostly)

Scoreboard branch: hw interlock

(clock slow-down)

ALU (arithmetic-logic-unit) suspend

RWB (register-write-unit) suspend

 

 

Delayed branch: sw interlock

source code:

i-1 ADD R7, imm32

i JUMP R1, R2>R3

i+1 MOVE R3, R4

i+2 SUB R5, R6

after code generation:

i-1 ADD R7, imm32

i JUMP R1+1, R2>R3

i+1 NOOP

i+2 MOVE R3, R4

i+3 SUB R5, R6

after code optimization:

i-1

i JUMP R1+1, R2>R3

i+1 ADD R7, imm32

i+2 MOVE R3, R4

i+3 SUB R5, R6

 

 

condition: THE MOVED INSTRUCTION

a. MUST BE EXECUTED (no matter if the

branch is taken or not), AND

b. HAS NO EFFECT ON THE CONDITION AND/OR

THE JUMP TARGET ADDRESS.
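
The two conditions can be sketched as a check that a compiler might run before moving an instruction into the branch delay slot. This is only an illustration: the `Instr` type, the `can_fill_branch_slot` name, and the register sets are assumptions. It assumes destination-first operand order and that `JUMP R1+1, R2>R3` reads R1, R2 and R3. Condition b is read here as "the moved instruction does not change the branch condition or the jump target".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record: text plus registers written and read."""
    text: str
    writes: frozenset = frozenset()
    reads: frozenset = frozenset()


def can_fill_branch_slot(candidate: Instr, branch: Instr) -> bool:
    """Return True if `candidate`, taken from BEFORE `branch`, may fill its delay slot.

    Condition a holds by construction: an instruction that precedes the
    branch executes on both paths, and so does the delay slot.
    Condition b: the candidate must not write any register that the
    branch reads for its condition or its target address.
    """
    return not (candidate.writes & branch.reads)


# Register sets below are assumptions about the example's operand order.
jump = Instr("JUMP R1+1, R2>R3", reads=frozenset({"R1", "R2", "R3"}))
add = Instr("ADD R7, imm32", writes=frozenset({"R7"}), reads=frozenset({"R7"}))
# Hypothetical instruction (not in the example) that would change the condition.
bad = Instr("ADD R2, imm32", writes=frozenset({"R2"}), reads=frozenset({"R2"}))

print(can_fill_branch_slot(add, jump))
print(can_fill_branch_slot(bad, jump))
```

Under these assumptions the ADD from slot i-1 qualifies, while an instruction that writes R2 would not, because the branch compares R2 with R3.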

parameters:

a. PIPELINE FILL-IN DEPTH

(which is not the pipeline depth minus one!)

b. BRANCHING-RELATED STATISTICS

(branches executed versus branches taken)

c. BRANCH FILL-IN FUNCTION

(local versus global code optimization)

d. CLOCK SLOW DOWN FUNCTION

(in-the-critical-path versus off-the-critical-path)

e. TECHNOLOGY-RELATED STATISTICS

(on-chip versus off-chip delays)

f. CACHE IMPACT (hit versus miss penalty)

NUMERICAL EXAMPLE:

What is the equation for the condition that

hw and sw interlock have the same

benchmark execution time (not clock-count)?
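
One possible way to set up such an equation, as a sketch under simplifying assumptions that are not part of the notes: the hw interlock stalls `d` cycles on every taken branch and stretches the clock by a factor `slowdown`; the sw interlock pays one NOOP for every delay slot the optimizer cannot fill. The symbols `n` (instruction count), `b` (branch fraction), `t` (taken fraction), `d` (fill-in depth), `fill` (fraction of slots filled) and `clock` are illustrative names only. Cache and off-chip effects (parameters e and f) are ignored.

```python
import math


def hw_time(n, b, t, d, clock, slowdown):
    """Execution time with hw interlock: stall d cycles per taken branch, slower clock."""
    return n * (1 + b * t * d) * clock * slowdown


def sw_time(n, b, d, fill, clock):
    """Execution time with sw interlock: each unfilled delay slot costs one NOOP cycle."""
    return n * (1 + b * d * (1 - fill)) * clock


def break_even_slowdown(b, t, d, fill):
    """Clock slow-down at which both schemes give the same execution time.

    Obtained by setting hw_time == sw_time and solving for slowdown.
    """
    return (1 + b * d * (1 - fill)) / (1 + b * t * d)


# Illustrative numbers only, not measurements.
s = break_even_slowdown(b=0.2, t=0.6, d=2, fill=0.7)
same = math.isclose(
    hw_time(1_000_000, 0.2, 0.6, 2, 100e-9, s),
    sw_time(1_000_000, 0.2, 2, 0.7, 100e-9),
)
print(s, same)
```

In this model the condition is slowdown * (1 + b*t*d) = 1 + b*d*(1 - fill). A different stall model changes the left-hand side but not the method.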

 

 

Loading in pipelined machines:

Interlock mechanism: hw versus sw

Scoreboard LOAD:

Suspend

Bypass

 

 

Delayed LOAD: sw interlock

 

source code:

i-1 MOVE R3,R4

i LOAD R7, memory

i+1 ADD R2, R1, R7

 

after code generation:

i-1 MOVE R3,R4

i LOAD R7, memory

i+1 NOOP

i+2 ADD R2, R1, R7

 

after code optimization:

i-1

i LOAD R7, memory

i+1 MOVE R3,R4

i+2 ADD R2, R1, R7

 

condition: mutual independence
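
Mutual independence can be sketched as the absence of any register hazard between the LOAD and the instruction that is moved past it. The function name and the register sets are assumptions for illustration. Destination-first operand order is assumed, and the address operand written as "memory" is treated as using no register.

```python
def independent(a_reads, a_writes, b_reads, b_writes):
    """Return True if two instructions may swap order.

    They may swap only if there is no read-after-write, write-after-read,
    or write-after-write overlap between their register sets.
    """
    return not (a_writes & b_reads or a_reads & b_writes or a_writes & b_writes)


# Register sets are assumptions about the example's operand order.
load_reads, load_writes = set(), {"R7"}          # LOAD R7, memory
move_reads, move_writes = {"R4"}, {"R3"}         # MOVE R3,R4
add_reads, add_writes = {"R1", "R7"}, {"R2"}     # ADD R2, R1, R7

print(independent(move_reads, move_writes, load_reads, load_writes))
print(independent(add_reads, add_writes, load_reads, load_writes))
```

Under these assumptions the MOVE is independent of the LOAD and can fill the slot, while the ADD reads R7 and must stay after the delay.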

parameters: technology related,

design + organization +

architecture related,

system software related,

and application related.

 

numerical example:

What is the equation ... ?

 

Where is the ISP' code to describe delayed branching and delayed loading?

Where are the two taken care of?

 

 

The complete "case":

 

! Instruction decode and execution is done here. The "case" statement performs

! the decode - note that the opcode bits are tested as one would expect.

! For each legal opcode, a unique action is specified.

! Only one action is performed, then the bottom of the "main" process is reached,

! and we return to the top of the process.

 

case op

0:

reg[dst] = reg[src1] + reg[src2]

! add (reg-reg)

 

1:

reg[dst] = reg[src1] + imm16 sxt 32

! add (reg-imm)

 

2:

reg[dst] = pc + imm16 sxt 32

! add (pc-imm)

!!

3:

reg[dst] = reg[src1] - reg[src2]

! sub (reg-reg)

 

4:

reg[dst] = reg[src1] - imm16 sxt 32

! sub (reg-imm)

 

5:

reg[dst] = pc - imm16 sxt 32

! sub (pc-imm)

 

6:

reg[dst] = reg[src1]

! mov (reg-reg)

 

7:

reg[dst] = imm16 sxt 32

! mov (reg-imm)

 

8:

reg[dst] = pc

! mov (pc)

 

9:

reg[dst] = - reg[src1]

! negate

 

10:

reg[dst] = reg[src1] and reg[src2]

! and (reg-reg)

 

11:

reg[dst] = reg[src1] and imm16 sxt 32

! and (reg-imm)

 

12:

reg[dst] = reg[src1] or reg[src2]

! or (reg-reg)

 

13:

reg[dst] = reg[src1] or imm16 sxt 32

! or (reg-imm)

 

14:

reg[dst] = not reg[src1]

! not

 

15:

reg[dst] = reg[src1] *:arith (imm5 ext 32)

! shift left

!!

16:

reg[dst] = reg[src1] /:arith (imm5 ext 32)

! shift right

!!

17:

if reg[src1] eql reg[src2] reg[dst] = - 1 else reg[dst] = 0

! set if equal

 

 

18:

if reg[src1] gtr reg[src2] reg[dst] = - 1 else reg[dst] = 0

! set if greater

 

 

19:

if reg[src1] eql -1 pc = reg[dst]

! branch on true

 

 

20:

pc = reg[dst]

! branch always

 

21:

(pastdst = dst; pastval = memry[reg[src2]])

! load

 

 

22:

memry[reg[src2]] = reg[dst]

! store

 

23:

;

 

 

esac;
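
The load case only records `pastdst` and `pastval`; the place where they are written into the register file is not shown in the listing. The sketch below is one way the delayed write could behave, assuming the pending value is committed after the following instruction has executed. The `Cpu` class and `step` method are hypothetical, only three opcodes are modeled, and 32-bit wrap-around is ignored.

```python
class Cpu:
    """Minimal model of the delayed load (opcodes 0, 6 and 21 only)."""

    def __init__(self, memory):
        self.reg = [0] * 16
        self.memry = dict(memory)
        self.pastdst = None  # destination of a load in flight, None if no load
        self.pastval = 0     # value fetched by that load

    def step(self, op, dst=0, src1=0, src2=0):
        # Remember the load issued by the previous instruction, if any.
        pending_dst, pending_val = self.pastdst, self.pastval
        self.pastdst = None

        if op == 0:       # add (reg-reg)
            self.reg[dst] = self.reg[src1] + self.reg[src2]
        elif op == 6:     # mov (reg-reg)
            self.reg[dst] = self.reg[src1]
        elif op == 21:    # load: record the result, do not write the register yet
            self.pastdst = dst
            self.pastval = self.memry[self.reg[src2]]

        # Assumed commit point: the previous load lands only now, so the
        # instruction in the delay slot still saw the old register value.
        if pending_dst is not None:
            self.reg[pending_dst] = pending_val


cpu = Cpu({100: 42})
cpu.reg[2] = 100
cpu.step(21, dst=7, src2=2)          # load r7 from memry[r2]
cpu.step(0, dst=3, src1=7, src2=7)   # delay slot: reads the old r7
cpu.step(0, dst=4, src1=7, src2=7)   # one instruction later: reads the loaded r7
print(cpu.reg[3], cpu.reg[4], cpu.reg[7])
```

A delayed branch could be modeled the same way, by recording the new `pc` and applying it one instruction later.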

 

 

The ".m" file:

 

- Instr Section

instr

I<32>$

 

- Format Section

format

op = I<32:24>,

dst = I<23:20>,

src1 = I<19:16>,

src2 = I<15:12>,

imm16 = I<15:0>,

imm5 = I<4:0>$

 

- Macro section

macro

r0 = 0&,

r1 = 1&,

...

r15 = 15&,

addr(d,s1,s2) = op=0; dst=d;

src1=s1; src2=s2$&,

noophalt = op=23$&$

 

 

- Begin-end section

begin

include ee666.test$

end

 

The ".i" file:

 

- Instr Section

instr

I<32>$

 

- Format Section

format

op = I<32:24>,

dst = I<23:20>,

src1 = I<19:16>,

src2 = I<15:12>,

imm16 = I<15:0>,

imm5 = I<4:0>$
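
The field layout can be illustrated with a small decoder. This is a sketch, not part of the tool chain: the `decode` and `sign_extend` names are made up, bits are assumed to be numbered 31 down to 0, and the opcode is assumed to occupy bits 31:24 (the format section writes it as I<32:24>). `imm16` is sign-extended as the "sxt 32" in the ISP' code suggests; `imm5` is left unsigned.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def decode(word):
    """Split a 32-bit instruction word into the fields of the format section."""
    return {
        "op": (word >> 24) & 0xFF,    # assumed bits 31:24
        "dst": (word >> 20) & 0xF,    # bits 23:20
        "src1": (word >> 16) & 0xF,   # bits 19:16
        "src2": (word >> 12) & 0xF,   # bits 15:12 (overlaps imm16)
        "imm16": sign_extend(word & 0xFFFF, 16),  # bits 15:0
        "imm5": word & 0x1F,          # bits 4:0
    }


# add (reg-imm): op=1, dst=r3, src1=r2, imm16=0xFFFF
fields = decode((1 << 24) | (3 << 20) | (2 << 16) | 0xFFFF)
print(fields)
```

Note that `src2`, `imm16` and `imm5` overlap; which one is meaningful depends on the opcode.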

 

- Space section

space

<0:4095>$

 

- Transfer section

transfer

{new}

 

- Mode section

mode

case op eql 7

imm16~address$

break$

esac,

default:

imm16~imm16$

break$

esac$

 

The ".t" file:

processor cpu = "ee666.sim";

time delay = 100ns;

initial memry = l.out;

 

The ".b" file:

Sample assembler language program that uses the instructions

for the RISC-like processor of the ee666 (Advanced Computer Systems),

Purdue University, Spring Semester 1987.

Filename: ee666.test

movi(r0,100)

subri(r1,10,100)

movr(r2,r1)

seq(r3,r1,r2)

movi(r4,11)

movi(r5,12)

movi(r6,13)

bt(r4,r3)

ba(r5)

movi(r1,10)

11:    addri(r1,r1,1)

addri(r1,r1,1)

12:     sgt(r7,r2,r1)

bt(r6,r7)

addr(r8,r0,r2)

subri(r9,r1,10)

st(r9,r8)

ba(r5)

addri(r2,r2,2)

13:    subri(r8,r8,2)

ld(r8,r8)

movr(r10,r8)

addr(r10,r10,r8)

sla(r10,r10,2)

halt

 

Sample Fura RISC VMS Session:

  1. set def [.N2]
  2. copy VL$A:[N2.E666]*.* *.*
  3. [N2]login
  4. n2 -s script.txt ee666.e00

If you want to test your own CPU:

  1. [N2]login
  2. edit cpuname.isp
  3. ic cpuname.isp
  4. edit cpuname.m
  5. edit program.m
  6. micro cpuname.m
  7. edit cpuname.i
  8. inter cpuname.i
  9. cater cpuname.a cpuname.n
  10. edit cpuname.t
  11. ec -b cpuname.t
  12. n2 -s script.txt cpuname.e00
