Branching in pipelined machines:
Interlock mechanism:
hw (cisc-mostly) versus sw (risc-mostly)
Scoreboard branch: hw interlock
(clock slow-down)
ALU (arithmetic-logic-unit) suspend
RWB (register-write-unit) suspend
Delayed branch: sw interlock
source code:
i-1 ADD R7, imm32
i JUMP R1, R2>R3
i+1 MOVE R3, R4
i+2 SUB R5, R6
after code generation:
i-1 ADD R7, imm32
i JUMP R1+1, R2>R3
i+1 NOOP
i+2 MOVE R3, R4
i+3 SUB R5, R6
after code optimization:
i-1
i JUMP R1+1, R2>R3
i+1 ADD R7, imm32
i+2 MOVE R3, R4
i+3 SUB R5, R6
condition:
THE MOVED INSTRUCTION
a. MUST BE EXECUTED (no matter if the
branch is taken or not), AND
b. MUST NOT AFFECT THE BRANCH CONDITION AND/OR
THE JUMP TARGET ADDRESS.
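The two compiler steps above (code generation, then code optimization) can be sketched in Python. This is only an illustration under stated assumptions: one delay slot per branch, a purely local optimizer that looks only at the instruction just before the branch, and hypothetical read/write register sets attached by hand to each instruction (the first operand of MOVE and SUB is assumed to be the destination). The names Instr, insert_delay_slots and fill_delay_slots are made up for this sketch, and the retargeting of the jump (R1 becoming R1+1) is not modeled.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    text: str
    reads: frozenset = frozenset()
    writes: frozenset = frozenset()
    is_branch: bool = False

NOOP = Instr("NOOP")

def insert_delay_slots(code):
    """Code generation: put a NOOP after every branch (one slot assumed)."""
    out = []
    for ins in code:
        out.append(ins)
        if ins.is_branch:
            out.append(NOOP)
    return out

def fill_delay_slots(code):
    """Local optimization: move the instruction preceding a branch into the
    NOOP slot when it does not write anything the branch reads."""
    out = list(code)
    k = 1
    while k + 1 < len(out):
        prev, br, slot = out[k - 1], out[k], out[k + 1]
        prev_is_slot = k >= 2 and out[k - 2].is_branch
        movable = (br.is_branch and slot is NOOP
                   and not prev.is_branch and prev is not NOOP
                   and not prev_is_slot
                   and not (prev.writes & br.reads))
        if movable:
            out[k - 1:k + 2] = [br, prev]  # branch moves up, prev fills the slot
        k += 1
    return out

# The example from the notes; register sets are assumptions of this sketch.
source = [
    Instr("ADD R7, imm32", frozenset({"R7"}), frozenset({"R7"})),
    Instr("JUMP R1, R2>R3", frozenset({"R1", "R2", "R3"}), is_branch=True),
    Instr("MOVE R3, R4", frozenset({"R4"}), frozenset({"R3"})),
    Instr("SUB R5, R6", frozenset({"R5", "R6"}), frozenset({"R5"})),
]
generated = insert_delay_slots(source)
optimized = fill_delay_slots(generated)
for ins in optimized:
    print(ins.text)
```

Condition (a) holds automatically here because the moved instruction comes from before the branch; condition (b) is the test on prev.writes and br.reads.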
parameters:
a. PIPELINE FILL-IN DEPTH
(which is not the pipeline depth minus one!)
b. BRANCHING-RELATED STATISTICS
(branches executed versus branches taken)
c. BRANCH FILL-IN FUNCTION
(local versus global code optimization)
d. CLOCK SLOW DOWN FUNCTION
(in-the-critical-path versus off-the-critical-path)
e. TECHNOLOGY-RELATED STATISTICS
(on-chip versus off-chip delays)
f. CACHE IMPACT (hit versus miss penalty)
NUMERICAL EXAMPLE:
What is the equation for the condition that
hw and sw interlock have the same
benchmark execution time (not clock-count)?
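One possible first-order way to set up that equation is sketched below. It is not the answer from the course, only a model built on assumptions: n instructions, a fraction b of them branches, a pipeline fill-in depth of d slots, a fraction fill of the slots filled with useful work by the optimizer (sw interlock), a fraction stall of the slots actually lost per branch under hw interlock, and a relative clock slow-down slow caused by the interlock hardware. Cache and on-chip/off-chip effects (parameters e and f) are ignored. All function and parameter names are hypothetical.

```python
import math

def sw_time(n, b, d, fill, t_clk):
    """Delayed branch: every unfilled slot costs one NOOP at the base clock."""
    return n * (1.0 + b * d * (1.0 - fill)) * t_clk

def hw_time(n, b, d, stall, slow, t_clk):
    """Scoreboard branch: stall cycles per branch, clock stretched by (1 + slow)."""
    return n * (1.0 + b * d * stall) * t_clk * (1.0 + slow)

def break_even_slowdown(b, d, fill, stall):
    """Clock slow-down at which hw_time == sw_time, i.e. the solution of
    (1 + slow) * (1 + b*d*stall) = 1 + b*d*(1 - fill)."""
    return (1.0 + b * d * (1.0 - fill)) / (1.0 + b * d * stall) - 1.0

# Illustrative numbers only, not measurements.
s = break_even_slowdown(b=0.2, d=1, fill=0.5, stall=0.25)
print(f"break-even clock slow-down: {s:.4f}")
```

In this model the hw interlock wins only if its actual clock slow-down stays below the value returned by break_even_slowdown; a negative value means the sw interlock is faster for any slow-down.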
Loading in pipelined machines:
Interlock mechanism: hw versus sw
Scoreboard LOAD:
Suspend
Bypass
Delayed LOAD: sw interlock
source code:
i-1 MOVE R3,R4
i LOAD R7, memory
i+1 ADD R2, R1, R7
after code generation:
i-1 MOVE R3,R4
i LOAD R7, memory
i+1 NOOP
i+2 ADD R2, R1, R7
after code optimization:
i-1
i LOAD R7, memory
i+1 MOVE R3,R4
i+2 ADD R2, R1, R7
condition:
mutual independence
parameters: technology related,
design + organization +
architecture related,
system software related,
and application related.
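The "mutual independence" condition for filling the load delay slot can be written as a small check on register sets. This is a sketch under assumptions: each instruction is described only by the registers it reads and writes, the first operand of MOVE is taken to be the destination, and the memory address of the LOAD is not tracked. The function name independent is hypothetical.

```python
def independent(a_reads, a_writes, b_reads, b_writes):
    """True if instructions a and b can be reordered freely:
    neither reads what the other writes, and they do not write the same register."""
    conflict = (a_writes & b_reads) or (b_writes & a_reads) or (a_writes & b_writes)
    return not conflict

# Register sets for the example in the notes (assumed, see lead-in).
load_reads, load_writes = set(), {"R7"}          # LOAD R7, memory
move_reads, move_writes = {"R4"}, {"R3"}         # MOVE R3,R4
add_reads, add_writes = {"R1", "R7"}, {"R2"}     # ADD R2, R1, R7

print(independent(move_reads, move_writes, load_reads, load_writes))
print(independent(add_reads, add_writes, load_reads, load_writes))
```

MOVE passes the check and may fill the slot after the LOAD; ADD reads R7 and therefore must stay after the slot.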
numerical example:
What is the equation ... ?
Where is the ISP' code to describe delayed branching and delayed loading?
Where are the two taken care of?
The complete "case":
! Instruction decode and execution is done here. The "case" statement performs
! the decode - note that the opcode bits are tested as one would expect.
! For each legal opcode, a unique action is specified.
! Only one action is performed, then the bottom of the "main" process is reached,
! and we return to the top of the process.
case op
  0:  reg[dst] = reg[src1] + reg[src2]            ! add (reg-reg)
  1:  reg[dst] = reg[src1] + imm16 sxt 32         ! add (reg-imm)
  2:  reg[dst] = pc + imm16 sxt 32                ! add (pc-imm)
  3:  reg[dst] = reg[src1] - reg[src2]            ! sub (reg-reg)
  4:  reg[dst] = reg[src1] - imm16 sxt 32         ! sub (reg-imm)
  5:  reg[dst] = pc - imm16 sxt 32                ! sub (pc-imm)
  6:  reg[dst] = reg[src1]                        ! mov (reg-reg)
  7:  reg[dst] = imm16 sxt 32                     ! mov (reg-imm)
  8:  reg[dst] = pc                               ! mov (pc-imm)
  9:  reg[dst] = - reg[src1]                      ! negate
  10: reg[dst] = reg[src1] and reg[src2]          ! and (reg-reg)
  11: reg[dst] = reg[src1] and imm16 sxt 32       ! and (reg-imm)
  12: reg[dst] = reg[src1] or reg[src2]           ! or (reg-reg)
  13: reg[dst] = reg[src1] or imm16 sxt 32        ! or (reg-imm)
  14: reg[dst] = not reg[src1]                    ! not
  15: reg[dst] = reg[src1] *:arith (imm5 ext 32)  ! shift left
  16: reg[dst] = reg[src1] /:arith (imm5 ext 32)  ! shift right
  17: if reg[src1] eql reg[src2]                  ! set if equal
          reg[dst] = - 1
      else reg[dst] = 0
  18: if reg[src1] gtr reg[src2]                  ! set if greater
          reg[dst] = - 1
      else reg[dst] = 0
  19: if reg[src1] eql -1                         ! branch on true
          pc = reg[dst]
  20: pc = reg[dst]                               ! branch always
  21: (pastdst = dst;                             ! load
       pastval = memry[reg[src2]]
      )
  22: memry[reg[src2]] = reg[dst]                 ! store
  23: ;
esac;
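In the "case" above, opcode 21 does not write the register file; it only records pastdst and pastval. The part of the "main" process that later copies pastval into reg[pastdst] is not shown in these notes. The Python sketch below illustrates one way such a delayed load could behave, assuming the pending value is committed after the instruction that follows the load has executed (one load delay slot). The function run, the tuple layout of an instruction, and the small opcode subset are assumptions of this sketch, not the course simulator.

```python
def run(program, memry, nregs=16):
    """Execute (op, dst, src1, src2, imm) tuples with a one-slot delayed load."""
    reg = [0] * nregs
    pending = None                       # (pastdst, pastval) set by the last load
    for op, dst, src1, src2, imm in program:
        to_commit, pending = pending, None
        if op == 1:                      # add (reg-imm)
            reg[dst] = reg[src1] + imm
        elif op == 6:                    # mov (reg-reg)
            reg[dst] = reg[src1]
        elif op == 21:                   # load: record only, do not write reg yet
            pending = (dst, memry[reg[src2]])
        elif op == 23:                   # noop
            pass
        else:
            raise ValueError(f"opcode {op} not modeled in this sketch")
        if to_commit is not None:        # value of the previous load arrives now
            pastdst, pastval = to_commit
            reg[pastdst] = pastval
    if pending is not None:              # flush a load issued as the last instruction
        reg[pending[0]] = pending[1]
    return reg

memry = {0: 42}
program = [
    (21, 7, 0, 0, 0),   # load r7 <- memry[reg[0]]
    (6, 1, 7, 0, 0),    # delay slot: mov r1 <- r7 still sees the old r7
    (6, 2, 7, 0, 0),    # mov r2 <- r7 sees the loaded value
]
reg = run(program, memry)
print(reg[1], reg[2], reg[7])
```

The instruction in the slot reads the old contents of r7, which is why the optimizer may only place an independent instruction there.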
The ".m" file:
- Instr Section
instr
I<32>$
- Format Section
format
op = I<32:24>,
dst = I<23:20>,
src1 = I<19:16>,
src2 = I<15:12>,
imm16 = I<15:0>,
imm5 = I<4:0>$
- Macro section
macro
r0 = 0&,
r1 = 1&,
...
r15 = 15&,
addr(d,s1,s2) = op=0; dst=d;
src1=s1; src2=s2$&,
noophalt = op=23$&$
- Begin-end section
begin
include ee666.test$
end
The ".i" file:
- Instr Section
instr
I<32>$
- Format Section
format
op = I<32:24>,
dst = I<23:20>,
src1 = I<19:16>,
src2 = I<15:12>,
imm16 = I<15:0>,
imm5 = I<4:0>$
- Space section
space
<0:4095>$
- Transfer section
transfer
{new}
- Mode section
mode
case op eql 7
imm16~address$
break$
esac,
default:
imm16~imm16$
break$
esac$
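The field layout in the format section can be checked with a short Python sketch that packs and unpacks an instruction word. It assumes a 32-bit word with the opcode in bits 31..24 (the listing writes op = I<32:24>; the exact bit-numbering convention of the tool is not verified here), and it mirrors "imm16 sxt 32" with a helper named sxt. The names encode, decode and sxt are hypothetical. Since src2 and imm16 overlap, a caller of encode is expected to pass one or the other.

```python
def sxt(value, bits):
    """Sign-extend the low `bits` bits of value to a Python int."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)

def encode(op, dst=0, src1=0, src2=0, imm16=0):
    """Pack fields into one 32-bit word (src2 and imm16 share bits 15..12)."""
    return ((op & 0xFF) << 24) | ((dst & 0xF) << 20) | ((src1 & 0xF) << 16) \
        | ((src2 & 0xF) << 12) | (imm16 & 0xFFFF)

def decode(word):
    """Unpack the fields named in the format section."""
    return {
        "op": (word >> 24) & 0xFF,
        "dst": (word >> 20) & 0xF,
        "src1": (word >> 16) & 0xF,
        "src2": (word >> 12) & 0xF,
        "imm16": word & 0xFFFF,
        "imm5": word & 0x1F,
    }

# add (reg-imm): r1 = r1 + (-1), opcode 1 in the "case" statement
word = encode(op=1, dst=1, src1=1, imm16=-1)
fields = decode(word)
print(hex(word), fields["op"], fields["dst"], sxt(fields["imm16"], 16))
```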
The ".t" file:
processor cpu = "ee666.sim";
time delay = 100ns;
initial memry = l.out;
The ".b" file:
Sample assembler language program that uses the instructions
for the RISC-like processor of the ee666 (Advanced Computer Systems),
Purdue University, Spring Semester 1987.
Filename: ee666.test
movi(r0,100)
subri(r1,10,100)
movr(r2,r1)
seq(r3,r1,r2)
movi(r4,11)
movi(r5,12)
movi(r6,13)
bt(r4,r3)
ba(r5)
movi(r1,10)
11: addri(r1,r1,1)
addri(r1,r1,1)
12: sgt(r7,r2,r1)
bt(r6,r7)
addr(r8,r0,r2)
subri(r9,r1,10)
st(r9,r8)
ba(r5)
addri(r2,r2,2)
13: subri(r8,r8,2)
ld(r8,r8)
movr(r10,r8)
addr(r10,r10,r8)
sla(r10,r10,2)
halt
Sample Fura RISC VMS Session:
If you want to test your own CPU: