Veljko Milutinovic

BPS:
Understanding the Essence

vm@etf.rs


UNDERSTANDING THE BPS

 

Multiple-path execution, effective but costly (IBM 370/Model 168 + IBM 3033):

doubling the resources at each branch encountered

before the first branch outcome is resolved;

 

Branch prediction, which relies on the correlation principle:

future outcome of a branch usually can be predicted

from previous outcomes of the same or a related branch;

 

Hardware BPS:

dynamic prediction based on BTB and/or BPB(BHT)
+ related hardware resources,

assuming that the outcome of a branch is stable over a period of time.

 

Software BPS:

static prediction based on pre-annotating techniques

(to point to the parallelism to be exploited)

and prearranging techniques

(to detect and eliminate the dependencies, to increase the level of ILP),

assuming the typical outcome based on the HLL program context.

 

Hybrid BPS:

Hardware and compiler cooperating together;

predicated (conditional) and speculative (renaming) instructions.

 

HARDWARE BPS

Each BTB entry includes (for a recently executed branch):

full branch instruction address, or only low-order bits (for access),

its outcome (only if outcome: taken, based on a chosen predictor),

and target address (only if outcome: taken);

BTB includes the target address

(which means a speed-up, since no branch delay if hit + correct prediction);

BPB excludes the target address

(which means a slowdown, but easier cooperation with complex predictors);

The PowerPC620 solution: both a BTB and a BPB.

A variation of BTB, considered for PowerPC620

(one/more target instruction(s) instead-of/in-addition-to target address):

(a) a BTB access can take longer than the fetch stage;

(b) branch folding is enabled (zero-cycle unconditional and sometimes zero-cycle conditional branches)

The CRISP solution: BranchFoldingBTB (Ditzel and McLellan)

Prediction of indirect branches

(destination address varies at run-time):

(a) A stack of return addresses (procedure returns account for about 90% of indirect branches)

(b) Return addresses pushed/popped at call/return

The DLX solution: ReturnBufferUpTo16 (Hennessy and Patterson)

Figure BPSU1: States of the two-bit predictor; avoiding the misprediction on the first iteration of the repeated loop
(source: [Hennessy96])

Legend:

Nodes—States of the scheme,

Arcs—State changes due to branches.
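The state machine of Figure BPSU1 can be sketched in Python as a saturating counter. This is a simulation sketch, not hardware; the loop workload below is illustrative:

```python
class TwoBitPredictor:
    """Two-bit saturating counter: states 0-1 predict not taken,
    states 2-3 predict taken (Figure BPSU1)."""

    def __init__(self, state=3):
        self.state = state            # start in "strongly taken"

    def predict(self):
        return self.state >= 2        # True = predict taken

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

# A loop branch taken 4 times and then falling through, repeated 3 times:
predictor = TwoBitPredictor()
hits = 0
for taken in ([True] * 4 + [False]) * 3:
    if predictor.predict() == taken:
        hits += 1
    predictor.update(taken)
```

On this workload the two-bit scheme scores 12/15; a one-bit predictor would score only 10/15, mispredicting both on loop exit and on loop re-entry, which is exactly the effect the figure illustrates.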

 

Figure BPSU2: Average branch-target buffer prediction accuracy;
a BTB with 4K entries is about the same in performance as an infinite BTB (source: [Johnson91])

Legend:

CPB—Percentage of Correctly Predicted Branches,

4WSA—4-Way Set-Associative,

DM—Direct Mapped,

NE—Number of Entries.

Note: it is not sufficient to find the branch instruction in the BTB;

in addition, the prediction must be correct.

 

Another solution includes some prediction info in the IC:

e.g., each entry contains the address of the probable successor entry.

 

Figure BPSU3: Instruction Cache Entry for Branch Prediction (source: [Johnson91])

Legend:

I—Cache entry.

Successor index field:

next cache entry predicted to be fetched;

first instruction in that entry predicted to be executed.

 

Branch entry index field:

location of a branch point within the current instruction cache entry;

instructions beyond branch point predicted not to be executed.

 

Two-level predictors:

predictors that use the behavior of other branches to make a prediction;

typically, they represent an improvement over two-bit predictors.

 

Figure BPSU4: A (2,2) Branch Predictor using a 2-bit global history to select one of the four 2-bit predictors
(source: [Hennessy96])

Legend:

GBH—Global Branch History,

PO—Prediction Outcome.
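A minimal Python sketch of an (M,N) predictor with M = 2 and N = 2, as in Figure BPSU4. The direct-mapped indexing by low-order PC bits and the table size are illustrative assumptions:

```python
class CorrelatingPredictor:
    """(M, N) predictor sketch: the last M global outcomes select one of
    2**M N-bit saturating counters kept per branch-address index."""

    def __init__(self, m=2, n=2, entries=16):
        self.m, self.n = m, n
        self.history = 0                       # M-bit global branch history
        self.max_count = (1 << n) - 1
        self.entries = entries
        # one row per (direct-mapped) branch index, 2**m counters per row
        self.table = [[self.max_count] * (1 << m) for _ in range(entries)]

    def predict(self, pc):
        counter = self.table[pc % self.entries][self.history]
        return counter >= (1 << (self.n - 1))  # upper half = predict taken

    def update(self, pc, taken):
        row = self.table[pc % self.entries]
        c = row[self.history]
        row[self.history] = min(self.max_count, c + 1) if taken else max(0, c - 1)
        # shift the outcome into the global history register
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)
```

Note that each branch keeps 2**M independent counters, one per recent global outcome pattern, which is how the behavior of other branches enters the prediction.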

 

 

HLL:

if (d == 0)
d = 1;

if (d == 1)

 

MLL (d assigned to r1):

bnez r1, l1 ; branch B1 (d ≠ 0)

addi r1, r0, #1 ; d = 0, so d ← 1

l1: subi r3, r1, #1

bnez r3, l2 ; branch B2 (d ≠ 1)

l2:

 

Figure BPSU5a: Example code (source: [Hennessy96])

Legend: Self-explanatory.

 

 

 

 

Dinit   D = 0?   B1          DbeforeB2   D = 1?   B2
0       Yes      Not taken   1           Yes      Not taken
1       No       Taken       1           Yes      Not taken
2       No       Taken       2           No       Taken

Figure BPSU5b: Example explanation—illustration of the advantage of a two-level predictor with one-bit history;
if B1 is not_taken, then B2 will also be not_taken, which can be utilized to achieve better prediction (source: [Hennessy96])

Legend:

Dinit—Initial D,

DbeforeB2—Value of D before B2.
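The correlation in the table can be checked directly; this Python sketch mirrors the MLL code of Figure BPSU5a, with the register contents replaced by the variable d:

```python
def branch_outcomes(d):
    """Return (B1 taken?, B2 taken?) for the code of Figure BPSU5a."""
    b1_taken = d != 0              # bnez r1, l1  (taken when d != 0)
    if not b1_taken:
        d = 1                      # addi r1, r0, #1
    b2_taken = (d - 1) != 0        # subi r3, r1, #1 ; bnez r3, l2
    return b1_taken, b2_taken

# Whenever B1 is not taken (d was 0, so d becomes 1), B2 cannot be taken:
for d in range(5):
    b1, b2 = branch_outcomes(d)
    assert b1 or not b2
```

A one-bit global history is enough to exploit this: after seeing B1 not taken, the predictor can always say "not taken" for B2.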

 

 

 

Conclusion:

an (M,N) predictor uses the behavior of the last M branches

to select one of the 2^M predictors,

each one being an N-bit predictor for a single branch.

 

Note:

counters indexed by a single value of the global predictor

may correspond to different branches;

also, the prediction may not correspond to the current branch,

but to another one with the same low-order address bits;

still the scheme works!

 

 

 

Figure BPSU6a: Schemes 2bC, GAs, and PAs (source: [Evers96])

Legend:

PHT—Pattern History Table;

BHSR—Branch History Shift Register.

 

Figure BPSU6b: Schemes gshare, pshare, GSg, and PSg (source: [Evers96])

Legend:

BHSR—Branch History Shift Register.
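For the gshare scheme of Figure BPSU6b, the PHT index is formed by XORing the global history with the branch address. A one-line Python sketch (the word-aligned `pc >> 2` shift is an assumption about the instruction encoding):

```python
def gshare_index(pc, history, m):
    """gshare: XOR the m-bit global history with m low-order bits of the
    branch address to pick the PHT entry; one shared table, with the XOR
    spreading different branches across it to reduce aliasing."""
    mask = (1 << m) - 1
    return ((pc >> 2) ^ history) & mask   # >> 2 assumes 4-byte instructions
```

pshare is analogous, except that the history comes from a per-address BHSR rather than a single global register.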

 

SOFTWARE BPS

 

Pre-annotating:

Done at coding-time, compiling-time, or profiling-time.

 

Software scheduling:

Done within basic blocks and across branches.

 

Global code motion:

Code is moved across basic block boundaries and compensated.

 

Trace scheduling:

Principal technique used in VLIW architectures;

essentially, global code motion

enhanced with techniques to detect parallelism across conditional branches,

assuming a special type of architecture (LIW or VLIW).

 

Loop unrolling:

A technique used on any architecture,

but best in conjunction with trace scheduling;

increasing the amount of the sequentially executable code.
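A sketch of loop unrolling by a factor of four, written in Python for readability (in practice the compiler unrolls the machine-level loop; the SAXPY-style kernel is illustrative):

```python
def saxpy_rolled(a, x, y):
    # one loop branch per element
    for i in range(len(x)):
        y[i] += a * x[i]

def saxpy_unrolled(a, x, y):
    n = len(x)
    i = 0
    while i + 4 <= n:          # unrolled body: 4 elements, 1 loop branch
        y[i]     += a * x[i]
        y[i + 1] += a * x[i + 1]
        y[i + 2] += a * x[i + 2]
        y[i + 3] += a * x[i + 3]
        i += 4
    while i < n:               # epilogue for the leftover iterations
        y[i] += a * x[i]
        i += 1
```

The unrolled body is a single straight-line block of independent operations, which is what gives the scheduler room to exploit ILP.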

 

Software pipelining (symbolic loop unrolling):

A technique to pipeline operations from different loop iterations;

each iteration of a software-pipelined loop

includes instructions from different iterations of the original loop

(software equivalent of Tomasulo's algorithm in hardware).
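A Python sketch of a software-pipelined loop: a prologue fills the pipeline, the kernel mixes operations from three different original iterations, and an epilogue drains it (the scaling kernel is illustrative):

```python
def pipelined_scale(src, dst, a):
    """dst[i] = a * src[i], restructured so each kernel iteration
    stores element i, computes element i+1, and loads element i+2."""
    n = len(src)
    if n < 2:                       # degenerate case: no pipeline needed
        for i in range(n):
            dst[i] = a * src[i]
        return
    loaded = src[0]                 # prologue: load iteration 0
    computed = a * loaded           # prologue: compute iteration 0
    loaded = src[1]                 # prologue: load iteration 1
    for i in range(n - 2):          # kernel: store i, compute i+1, load i+2
        dst[i] = computed
        computed = a * loaded
        loaded = src[i + 2]
    dst[n - 2] = computed           # epilogue: drain the pipeline
    dst[n - 1] = a * loaded
```

Within one kernel iteration the store, multiply, and load are independent, so they can issue together; the original loop carried a load-use dependence through every iteration.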

 

 

HYBRID BPS

 

Predicated instructions

Speculative instructions

PREDICATED INSTRUCTIONS

 

Instruction includes a condition which is evaluated during its execution;

if true - normal execution;

if false - noop execution.

The most common example in new architectures: predicated reg-to-reg move;

helps eliminate branches in some contexts (if-then with minimal then body).

 

Examples: computing the absolute value ABS, or the ix86 repetition REP.
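The semantics can be sketched in Python; `cmov` below models a predicated reg-to-reg move (the function names are illustrative), and `abs_branchless` shows the ABS example with the branch eliminated:

```python
def cmov(pred, new_value, old_value):
    """Predicated reg-to-reg move: the instruction always occupies an
    execution slot; the predicate decides whether it takes effect."""
    return new_value if pred else old_value

def abs_branchless(x):
    r = x
    r = cmov(x < 0, -x, r)   # executed unconditionally, committed conditionally
    return r
```

The if-then with a minimal then-body collapses into straight-line code, so the branch predictor never sees it.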

 

Usefulness of predicated instructions limited by:

  1. The canceled instructions do take execution cycles;
    moving such an instruction across a branch makes the code slower,
    unless the cycle would be idle anyway.
  2. The moment when the condition can be evaluated is important;
    the sooner - the better.
  3. The clock count of predicated instructions may be larger.
  4. Exception handling is a problem!

 

Recent examples of reg-to-reg move predicated: ALPHA+MIPS+PowerPC+SPARC;

Recent example with any reg-to-reg instruction predicated: HP PA (it nullifies the next instruction)

 

 

 

SPECULATIVE INSTRUCTIONS

 

Instruction is executed before the processor knows if it should execute:

  1. First, the branch is predicted;
  2. Second, the instruction is made speculative and moved up, before the branch;
  3. Third, scheduling is done for higher efficiency of speculation;
  4. Fourth, recovery is done if misprediction.

 

Two approaches:

  1. Compiler schedules a speculative instruction and hardware helps recover
    if it turns out that the speculation was wrong;
    speculation is done at compile time.
  2. Compiler does a straightforward code generation
    and the branch prediction hardware is responsible for the speculation;
    speculation is done at run time.

 

 

Exception handling is less critical (speculative vs. predicated)!

 

Different handling of different exceptions:

  1. Program errors, which cause the program to terminate;
  2. Page faults or similar, which cause the program to resume.

 

Three techniques are used to deal with exceptions:

  1. Exceptions for speculative instructions are handled by hardware and/or the OS,
    both for resumable exceptions
    and for terminating exceptions.
  2. A set of special status bits (poison bits) can be attached
    to each register written by a speculative instruction,
    at the time of exception;
    consequently, a fault will be generated
    if some instruction selects a "poisoned" register for read.
  3. A mechanism can be included (boosting: a type of hardware renaming)
    to move instructions past branches,
    to label each instruction during the period while it is speculative,
    and to guard the results of the labeled instruction,
    using a special renaming buffer.
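Technique 2 (poison bits) can be sketched as follows; the `RegisterFile` class and its methods are illustrative, not any machine's actual interface:

```python
class RegisterFile:
    """Poison-bit sketch: a speculative instruction that faults poisons its
    destination register instead of raising the exception; the fault
    surfaces only if some later instruction reads the poisoned register."""

    def __init__(self, n=32):
        self.value = [0] * n
        self.poison = [False] * n

    def spec_write(self, r, thunk):
        """Speculative write: defer any exception by setting the poison bit."""
        try:
            self.value[r] = thunk()
            self.poison[r] = False
        except Exception:
            self.poison[r] = True

    def read(self, r):
        if self.poison[r]:
            raise RuntimeError("deferred exception: register is poisoned")
        return self.value[r]
```

If the speculation turns out wrong, the poisoned register is simply never read and the spurious exception vanishes, which is exactly the desired behavior for a mis-speculated faulting instruction.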

 

Hardware-based speculation

is complex-to-design and transistor-count-consuming;

however, it offers numerous advantages over the software-based speculation

[Hennessy96].

 

The approach of CDC 6600: Scoreboarding (Thornton's algorithm)

The early approach to dynamic scheduling (no speculation).

 

The approach of the IBM 360/91: Renaming (Tomasulo's algorithm)

A version of scoreboarding

(for out-of-order execution),

with reservation stations for renaming

(to avoid an instruction destroying live values, i.e., WAW/WAR hazards);

an approach to let execution proceed in spite of hazards

(no speculation):

issue+execution+writeback.

 

Since the number of reservation stations > the number of registers,

Tomasulo's algorithm eliminates more hazards than a compiler can.

 

Figure BPSU7: A CPU based on Tomasulo's algorithm—TA (source: [Hennessy96])

Legend: Self-explanatory.

 

 

1. ld f6, 34(r2)

2. ld f2, 45(r3)

3. multd f0, f2, f4

4. subd f8, f6, f2

5. divd f10, f0, f6

6. addd f6, f8, f2

 

Figure BPSU8: An example code sequence; add:2cy, mlt:10cy, div:40cy (source: [Hennessy96])

Legend: Self-explanatory.

 

 

 

InstructionStatus

Instruction          Issue   Execute   WriteResult
ld    f6, 34(r2)       ✓        ✓          ✓
ld    f2, 45(r3)       ✓        ✓          ✓
multd f0, f2, f4       ✓        ✓
subd  f8, f6, f2       ✓        ✓          ✓
divd  f10, f0, f6      ✓
addd  f6, f8, f2       ✓        ✓          ✓

 

ReservationStations

Name    Busy   Op     Vj                   Vk                   Qj      Qk
Add1    No
Add2    No
Add3    No
Mult1   Yes    mult   Mem[45 + Regs[r3]]   Regs[f4]
Mult2   Yes    div                         Mem[34 + Regs[r2]]   Mult1

 

FPRegisterStatus

Field   f0      f2   f4   f6   f8   f10     f12   …   f30
Qi      Mult1                       Mult2

Figure BPSU9: Status of the code execution on a CPU using TA;
instruction ADDD is finished immediately after SUBD, in spite of the WAR (source: [Hennessy96])

Legend: Self-explanatory.

 

The approach of modern microprocessors

(PowerPC620, MIPS R10000, Intel P6, HP PA 8000, AMD K5):

Extending Tomasulo's algorithm hardware (for dynamic scheduling)

to support speculation

(non-undoable updates occur only after the instruction is no longer speculative):

issue(dispatch)+execution+writeback+commit

 

Reading from a "virtual" register written by a speculative instruction

provides an input which is not known if correct,

until after the instruction is no longer speculative,

which is when the real register/memory is updated (instruction commit).

Speculation allows execution out-of-order,

but commit must be in-order.

 

Instructions that have finished execution,

but have not yet committed,

are kept in a hardware buffer called reorder buffer (ROB).

Passing results among the not yet committed instructions goes via ROB,

because ROB provides virtual registers;

ROB is analogous to TRS (Tomasulo's reservation stations),

except that TRS provides final data and ROB provides speculated data.

 

Each ROB entry includes three fields:

(a) instruction type (branch, store, registerop);

(b) destination (branch:no, store:memory_address, registerop:register_number);

(c) value (keeping the results before commit);

The last part can be separate (the rename/extended register file of the PowerPC 620 and MIPS R10000).
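The in-order-commit discipline built on these ROB fields can be sketched in Python. This is a toy model: real ROBs also rename source operands, track stores and branches, and bound their size:

```python
from collections import deque

class ReorderBuffer:
    """ROB sketch: entries hold (type, destination, value); results are
    written to the ROB at completion, and the architectural register file
    is updated only at in-order commit, so a misprediction can simply
    discard the speculated tail."""

    def __init__(self):
        self.rob = deque()            # head = oldest (next to commit)
        self.regs = {}                # architectural register file

    def issue(self, itype, dest):
        entry = {"type": itype, "dest": dest, "value": None, "done": False}
        self.rob.append(entry)
        return entry                  # the entry acts as a virtual register

    def complete(self, entry, value):
        entry["value"], entry["done"] = value, True   # may be out of order

    def commit(self):
        # retire only from the head, and only finished instructions
        while self.rob and self.rob[0]["done"]:
            e = self.rob.popleft()
            if e["type"] == "registerop":
                self.regs[e["dest"]] = e["value"]

    def flush(self):
        self.rob.clear()              # misprediction: drop speculated entries
```

In the usage below, the younger instruction completes first but cannot commit until the older one finishes, mirroring the SUBD/MULTD situation of Figure BPSU12.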

 

Figure BPSU10: A CPU based on Tomasulo's algorithm extended to handle speculation—TAES (source: [Hennessy96])

Legend: Self-explanatory.

 

 

1. ld f6, 34(r2)

2. ld f2, 45(r3)

3. multd f0, f2, f4

4. subd f8, f6, f2

5. divd f10, f0, f6

6. addd f6, f8, f2

 

Figure BPSU11: An example code sequence; add:2cy, mlt:10cy, div:40cy (source: [Hennessy96])

Legend: Self-explanatory.

 

 

 

ReservationStations

Name    Busy   Op     Vj                   Vk                   Qj   Qk   Dest
Add1    No
Add2    No
Add3    No
Mult1   No     mult   Mem[45 + Regs[r3]]   Regs[f4]                      #3
Mult2   Yes    div                         Mem[34 + Regs[r2]]   #3       #5

 

ReorderBuffer

Entry   Busy   Instruction         State          Destination   Value
1       No     ld f6, 34(r2)       Commit         f6            Mem[34 + Regs[r2]]
2       No     ld f2, 45(r3)       Commit         f2            Mem[45 + Regs[r3]]
3       Yes    multd f0, f2, f4    Write result   f0            #2 × Regs[f4]
4       Yes    subd f8, f6, f2     Write result   f8            #1 - #2
5       Yes    divd f10, f0, f6    Execute        f10
6       Yes    addd f6, f8, f2     Write result   f6            #4 + #2

 

FPRegisterStatus

Field       f0    f2   f4   f6    f8    f10   f12   …    f30
Reorder #   3               6     4     5
Busy        Yes   No   No   Yes   Yes   Yes   No    …    No

Figure BPSU12: Status tables for code execution on a CPU using TAES; SUBD completes first, but commits after MULTD
(source: [Hennessy96])

Legend: Self-explanatory.

 

 

Figure BPSU13: Pipeline of the IBM PowerPC 620—organization (source: [Hennessy96])

Legend:

FU—Functional Unit.

  1. Fetch—Loads the decode queue with instructions from the cache and determines the address of the next instruction. A 256-entry two-way set-associative branch-target buffer is used as the first source for predicting the next fetch address. There is also a 2048-entry branch-prediction buffer used when the branch-target buffer does not hit but a branch is present in the instruction stream. Both the target and prediction buffers are updated, if necessary, when the instruction completes using the information from the BPU. In addition, there is a stack of return address registers used to predict subroutine returns.
  2. Instruction decode—Instructions are decoded and prepared for issue. All time-critical portions of decode are done here. The next four instructions are passed to the next pipeline stage.
  3. Instruction issue—Issues the instructions to the appropriate reservation station. Operands are read from the register file in this stage, either into the functional unit or into the reservation stations. A rename register is allocated to hold the result of the instruction and a reorder buffer entry is allocated to ensure in-order completion. In some speculative and dynamically scheduled machines, this process is called dispatch, rather than issue. We use the term issue, since the process corresponds to the issue process of the CDC 6600, the first dynamically scheduled machine.
  4. Execution—This stage proceeds when the operands are all available in a reservation station. One of six functional units executes the instruction. The simple integer units XSU0 and XSU1 have a one-stage execution pipeline. The MCFXU has a pipeline depth of between one and three, though integer divide is not pipelined and takes more clock cycles. The FPU has a three-stage pipeline, while the LSU has a two-stage pipeline. At the end of execution, the result is written into the appropriate result bus, and from there into any reservation stations that are waiting for the result as well as into the rename buffer allocated for this instruction. The completion unit is notified that the instruction has completed. If the instruction is a branch, and the branch was mispredicted, the instruction fetch unit and completion unit are notified, causing instruction fetch to restart at the corrected address and causing the completion unit to discard the speculated instructions and free the rename buffers holding speculated results. When an instruction moves to the functional unit, we say that it has initiated execution; some machines use the term issue for this transition. An instruction frees up the reservation station when it initiates execution, allowing another instruction to issue to that station. If the instruction is ready to execute when it first issues to the reservation station, it can initiate on the next clock cycle freeing up the reservation station: it acts simply as a latch between stages. When an instruction has finished execution and is ready to move to the next stage, we say it has completed execution.
  5. Commit—This occurs when all previous instructions have been committed. Up to four instructions may commit per cycle. The results in the rename buffer are written into the register file and the rename buffer is freed. Upon completion of a store instruction, the LSU is also notified, so that the corresponding store buffer may be sent to the cache. Some machines use the term instruction completion for this stage. In a small number of cases, an extra stage may be added for write backs that cannot complete during commit because of a shortage of write ports.

 

Figure BPSU14: Pipeline of the IBM PowerPC 620—explanation (source: [Hennessy96])

Legend: Self-explanatory.

 

REFERENCES

 

[Thornton64] Thornton, J.E.,
"Parallel Operation on the Control Data 6600,"
Proceedings of the Fall Joint Computer Conference, October 1964, pp. 33–40.

 

[Tomasulo67] Tomasulo, R.M.,
"An Efficient Algorithm for Exploiting Multiple Arithmetic Units,"
IBM Journal of Research and Development, January 1967, pp. 25–33.

 

[Johnson91] Johnson, M.,
Superscalar Microprocessor Design,
Prentice-Hall, Englewood Cliffs, New Jersey, 1991.

 

[Evers96] Evers, M., Chang, P.-Y., Patt, Y.,
"Using Hybrid Branch Predictors to Improve Branch Prediction Accuracy in the Presence of Context Switches,"
Proceedings of the International Symposium on Computer Architecture,
May 1996, Philadelphia, Pennsylvania, USA, pp. 3–11.

 

[Hennessy96] Hennessy, J.L., Patterson, D. A.,
Computer Architecture: A Quantitative Approach,
Morgan Kaufmann, San Francisco, California, 1996.

 

 

 

 

 

 

 

Veljko Milutinovic

BPS:
State of the Art

vm@etf.rs


BPS on CONTEXT SWITCH: REINITIALIZATION USING DEFAULT

 

Problem:

 

State of the art in BPS:

  1. TLA (two-level adaptive) BPS
  2. TCH (two-component hybrid) BPS

If context switching is involved, performance degrades seriously.

 

What is the BPS which provides the best performance,

if context-switch is present,

in conditions when, during a context switch,

all history information associated with the hybrid branch predictor is lost?

 

 

Solution:

 

A new BPS—the Multi Hybrid (MH) BPS

(better performance at the same complexity)

based on an array of special 2-bit up-down predictor selection counters (PSCs)

(each BTB entry is extended with one PSC).

 

 

Initial value of all PSC entries is 3,

and priority logic is used if several predictors are equal;

if at least one of the value-3 predictors was correct,

all incorrect predictors are decremented;

if none of the value-3 predictors was correct,

all correct predictors are incremented;

this guarantees that at least one PSC always equals 3;

for a BTB of 2K entries, complexity of the selection mechanism is 2K*2*C,

where C is the number of component predictors.
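The PSC selection and update rules above can be sketched directly in Python (the three-component configuration in the usage below is illustrative):

```python
def mh_select(pscs, priority):
    """Pick the first (highest-priority) component whose PSC equals 3;
    the update rule guarantees such a component always exists."""
    return next(i for i in priority if pscs[i] == 3)

def mh_update(pscs, correct):
    """correct[i] is True if component i predicted this branch correctly.
    2-bit up-down PSCs, initialized to 3 per BTB entry."""
    if any(c and pscs[i] == 3 for i, c in enumerate(correct)):
        for i, c in enumerate(correct):      # a top predictor was right:
            if not c:                        # demote the wrong ones
                pscs[i] = max(0, pscs[i] - 1)
    else:
        for i, c in enumerate(correct):      # no top predictor was right:
            if c:                            # promote the right ones
                pscs[i] = min(3, pscs[i] + 1)
    assert 3 in pscs                         # at least one PSC stays at 3
```

Note that a value-3 predictor that is correct is never decremented, and when no value-3 predictor is correct none of them is touched, so the invariant "some PSC equals 3" is preserved by construction.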

Figure BPSS1: Predictor Selection Mechanism (source: [Evers96])

Legend:

BTB—Branch Target Buffer,

PSC—Prediction Selection Counter,

P1—Predictor 1,

PN— Predictor N,

PE—Priority Encoding,

PO—Prediction Outcome.

 

 

 

HybPredSiz [kB]      ~11    ~18    ~33    ~64    ~116

CompCost [kB]:
SelectMech           2      2.5    3      3      3
2bC                  0.5    0.5    0.5    0.5    0.5
GAs                  -      2      2      4      8
Gshare               4      8      16     32     64
Pshare               4      5.25   7.5    20     36.25
loop                 -      -      4      4      4
AlwaysTaken          0      0      0      0      0

Figure BPSS2: Multi-Hybrid Configurations and Sub-Optimal Priority Ordering 95.22/95.65 (source: [Evers96])

Legend:

HybPredSiz—Hybrid Predictor Size,

CompCost—Component Cost,

SelectMech—Selection Mechanism.

 

Complex/large predictors with better accuracy for steady-state,

and simple/small predictors with shorter warm-up time for context-switch;

static predictors (always-taken and always-not-taken) have zero warm-up time;

also, profiling (small dynamic) predictors (PSg and PSg(algo));

always-not-taken and PSg/PSg(algo) not included due to their marginal impact.

 

 

Predictor: 2bC
Algorithm: A two-bit counter predictor
consisting of a 2K-entry array of two-bit counters.
Cost (bits): 2^12

Predictor: GAs(m, n)
Algorithm: A global variation of the Two-Level Adaptive Branch Predictor
consisting of a single m-bit global branch history
and n pattern history tables.
Cost (bits): m + 2^(m+1) * n

Predictor: PSg(m)
Algorithm: A modified version of the per-address variation
of the Two-Level Adaptive Branch Predictor
consisting of 2K m-bit branch history registers
and a single pattern history table
(each PHT entry uses one statically determined hint bit instead of a 2bC).
The version of PSg used in this study is PSg(algo).
Cost (bits): 2^11 * m + 2^m

Predictor: gshare(m)
Algorithm: A modified version of the global variation
of the Two-Level Adaptive Branch Predictor
consisting of a single m-bit global branch history
and a single pattern history table.
Cost (bits): m + 2^(m+1)

Predictor: pshare(m)
Algorithm: A modified version of the per-address variation
of the Two-Level Adaptive Branch Predictor
consisting of 2K m-bit branch history registers
and one pattern history table.
As in the gshare scheme, the branch history is XORed
with the branch address to select the appropriate PHT entry.
Cost (bits): 2^11 * m + 2^(m+1)

Predictor: loop(m)
Algorithm: An AVG predictor where the prediction of a loop's exit
is based on the iteration count of the previous run of this loop.
A 2K-entry array of two m-bit counters
is used to keep the iteration counts of loops. In this study, m = 8.
Cost (bits): 2^12 * m

Predictor: Always Taken
Cost (bits): 0

Predictor: Always Not Taken
Cost (bits): 0

Figure BPSS3: Single Scheme Predictors—Algorithms and Complexities (source: [Evers96])

Legend:

AlwaysTaken—Branch Always Taken (MIPS R10000),

AlwaysNotTaken— Branch Always Not Taken (Motorola 88110).

Note:
Misprediction (for various SPECint92 benchmarks) drops by 12% to 20%,
for predictor sizes of 18KB and 64KB, respectively.

Conditions:

The MH BP works best if branch histories are periodically flushed

due to the presence of context switches in the application.

References:

[Evers96] Evers, M., Chang, P.-Y., Patt, Y.,
"Using Hybrid Branch Predictors to Improve Branch Prediction Accuracy
in the Presence of Context Switches,"
Proceedings of the ISCA-96, Philadelphia, Pennsylvania, May 1996, pp. 3–11.

[Chang95] Chang, P., Banerjee, U.,
"Profile-guided Multiheuristic Branch Prediction,"
Proceedings of the International Conference on Parallel Processing, July 1995.

 

 

BPS on CONTEXT SWITCH:
RESTART USING RESUME

 

Problem:

User-only data give different results/rankings compared to user-kernel data.

 

Solution:

Figure BPSS4: A model of BPS (source: [Gloy96])

Legend: Self-explanatory.

 

 

Figure BPSS5: Explanation of four BPS approaches (source: [Gloy96])

Legend:

BranchAddr—Branch Address,

BHSR—Branch History Shift Register,

BHT—Branch History Table,

PO—Prediction Outcome.

 

It is misleading to use periodic flushing,

since it fails to capture the differences in the organization/size of schemes.

Conditions:

Results are based on IBS.

Reference:

[Gloy96] Gloy, N., Young, C., Chen, J.B., Smith, M.,
"An Analysis of Branch Prediction Schemes on System Workloads,"
Proceedings of the ISCA-96, Philadelphia, Pennsylvania, May 1996,
pp. 12–21.

BPS on CONTEXT SWITCH:
REVISITING THE PROBLEM
USING LONGER BENCHMARKS

Problem:

SPECint89 and SPECint92 are not long enough!

 

Figure BPSS6: A model of BPS (source: [Sechrest96])

Legend: Self-explanatory.

 

Solution:

 

B            DI             DCB (%[TI])           SCB     N
compress     83,947,354     11,739,532 (14.0%)    236     13
eqntott      1,395,165,044  342,595,193 (24.6%)   494     5
espresso     521,130,798    76,466,489 (14.7%)    1784    110
gcc          142,359,130    21,579,307 (15.2%)    9531    2020
xlisp        1,307,000,716  147,425,333 (11.3%)   489     48
sc           889,057,008    150,381,340 (16.9%)   1269    157
groff        104,943,750    11,901,481 (11.3%)    6333    459
gs           118,090,975    16,308,247 (13.8%)    12852   1160
mpeg_play    99,430,055     9,566,290 (9.6%)      5598    532
nroff        130,249,374    22,574,884 (17.3%)    5249    228
real_gcc     107,374,368    14,309,867 (13.3%)    17361   3214
sdet         42,051,812     5,514,439 (13.1%)     5310    508
verilog      47,055,243     6,212,381 (13.2%)     4636    850
video_play   52,508,059     5,759,231 (11.0%)     4606    757

Figure BPSS7: Benchmarks—SPEC versus IBS (source: [Sechrest96])

Legend:

B—Benchmarks,

DI—Dynamic Instructions,

DCB—Dynamic Conditional Branches (percentage of total instructions),

TI—Total Instructions,

SCB—Static Conditional Branches,

N—Number of Static Branches Constituting 90% of Total DCB.

Conditions:

Control for aliasing and inter-branch correlations is the key!

Reference:

[Sechrest96] Sechrest, S., Lee, C. C., Mudge, T.,
"Correlation and Aliasing in Dynamic Branch Predictors,"
Proceedings of the ISCA-96, Philadelphia, Pennsylvania,
May 1996, pp. 21–32.

 

 

 

 

 

 

Veljko Milutinovic

BPS:
IFACT

vm@etf.rs


A Hybrid/Adaptive
Application Oriented BPS

Essence:

References:

[Milutinovic96a] Milutinovic, V.,
"Some Solutions for Critical Problems
in Distributed Shared Memory,"
IEEE TCCA Newsletter, September 1996.

 

[Milutinovic96b] Milutinovic, V.,
"The Best Method for Presentation of Research Results,"
IEEE TCCA Newsletter, September 1996.

 

[Ekmecic97] Ekmecic, I.,
"The LOCO++ Approach to Task Allocation
in Heterogeneous Computer Systems,"
M.Sc. Thesis, University of Belgrade,
Belgrade, Serbia, Yugoslavia, 1997.

 

[Petrovic97] Petrovic, M.,
"The ASUP Approach to Multi-Hybrid Branch Prediction,"
M.Sc. Thesis, University of Belgrade,
Belgrade, Serbia, Yugoslavia, 1997.