Rajendra Kumar et. al. / (IJCSE) International Journal on Computer Science and Engineering
Vol. 2, No. 4, 2010, 1179-1183
ISSN: 0975-3397
1179
Control Flow Prediction through Multiblock Formation in Parallel Register Sharing Architecture

Rajendra Kumar
Dept. of Computer Science & Engineering, Vidya College of Engineering, Meerut (UP), India
rajendra04@gmail.com

Dr. P K Singh
Dept. of Computer Science & Engineering, MMM Engineering College, Gorakhpur (UP), India
topksingh@gmail.com
Abstract - In this paper we introduce control flow prediction (CFP) in a parallel register sharing architecture. The main idea is to go a step beyond common branch prediction and permit the hardware to use information about the control flow graph (CFG) of the program to make better navigation decisions at branches. The degree of ILP depends on the navigation bandwidth of the prediction mechanism, and it can be increased by improving control flow prediction. This enlarges the initiation size, permitting overlapped execution of multiple independent flows of control and allowing multiple branch instructions to be resolved simultaneously. These are intermediate steps toward increasing the size of the dynamic window, which can achieve a high degree of ILP exploitation.
Keywords: CFP; ISB; ILP; CFG; Basic Block
I. INTRODUCTION
ILP enables the execution of multiple instructions per cycle and is now essential to the performance of modern processors. ILP is greatly constrained by branch instructions, and branch prediction is therefore commonly employed together with speculative execution [2]. However, inevitable branch mispredictions compromise this remedy. Predication, on the other hand, exposes a high degree of ILP to the scheduler by converting control flow into equivalent predicated instructions that are guarded by Boolean source operands. If-conversion has been shown to be a promising method for exploiting ILP in the presence of control flow.
If-conversion turns the control dependences between branches and the remaining instructions into data dependences between predicate definitions and the predicated instructions of the program. As a result, transforming control flow becomes a traditional data flow optimization, and branch scheduling becomes a reordering of serial instructions. ILP can be increased by overlapping the execution of multiple program paths, and some predicate-specific optimizations may also be enabled to supplement traditional approaches.

The major questions regarding if-conversion [2], what to if-convert and when to if-convert, suggest that if-conversion should be performed early in the compilation stage. Doing so has the advantage of facilitating classical optimizations on the predicated instructions, whereas delaying if-conversion until schedule time leaves room for better selection based on code efficiency and target processor characteristics. Dynamic branch prediction is fundamentally restricted in establishing a dynamic window because it makes local decisions without any prior knowledge of the global control structure of the program. This lack of knowledge creates several problems: (1) predicting the branch and (2) establishing its identity, meaning that the branch must first be encountered by the parallel register sharing architecture [1]. With normal branch prediction, a prediction can be made only while the fetch unit fetches the branch instruction.
II. RELATED WORK
The fetch unit has a great role in prediction mechanism
[2] in parallel register sharing architecture but [13, 15]
proposes some recent prediction mechanism that do not
require the addresses of branch for prediction rather there is
requirement of identity of branch to be known so that the
predicted target address can be obtained using either BTB
[11] or by decoding branch instructions. There are so many
commercially available embedded processors that are capable
to extend the set of base instructions for a specific application
domain. A steady progress has been observed in tools and
methodology for automatic instruction set extension for
processors that can be configured. It has been seen that the
limited data bandwidth is available in the core processors.
This creates a serious performance deadlock. [8] represents a
very low cost architectural extension and a compilation
technique that creates data bandwidth problem. A novel
parallel global register binding is also presented in [8] with
the help of hash function algorithm. This leads to a nearly
optimal performance speedup of 2% of ideal speedup. A
compilation framework [5] is presented that allows a
compiler to maximize the benefits of prediction. A heuristic
[14] is derived through experiments on the Trimaran simulator [18] and shows how the weaknesses of traditional heuristics can be exploited. Optimal use of the loop cache is also explored to relieve unnecessary pressure. A technique to enhance the ability of dynamic ILP processors to exploit parallelism is introduced in [6]. A performance metric is presented in [14] to guide nested loop optimization; in this way the effect of ILP is combined with loop optimization.

The impact of ILP processors on the performance of shared memory multiprocessors, with and without the latency hiding optimization of software prefetching, is presented in [16]. One of the critical goals of code optimization for multiprocessor system-on-a-single-chip architectures [17] is to minimize the number of off-chip memory accesses, and [17] presents a strategy that reduces the number of off-chip references due to shared data.

Static techniques such as trace scheduling [4, 7], predicated execution [9], and superblock and hyperblock scheduling [3, 12] have been used to alleviate the impact of control dependences. [10] presents a study showing that ILP processors which perform branch prediction and speculative execution, but follow only a single flow of control, can extract a parallelism of only 7. The parallelism limit rises to 13 if the ILP processors make maximal use of control dependence information to execute instructions ahead of the branches on which they do not depend.
III. EXPLOITATION OF CONTROL FLOW GRAPH CHARACTERISTICS
The ISB architecture and the ISB structure for control flow prediction are presented in [1]. The information contained in the CFG of a program can be exploited by the ISB architecture, which performs parallelization with shared registers. After inspecting the control flow graph of a program, it is possible to infer that some of the basic blocks will be executed regardless of previous branch outcomes. Figure 1 shows a C language code fragment.

Fig. 1 A 'C' language code

Figure 2 shows the corresponding CFG, with the number of instructions in each basic block.

Fig 2. Control Flow Graph of fig 1.
The experiments are performed on the Trimaran simulator for a MIPS 2000 executable. Extending from node BB-1, the multiblocks BB-1 to BB-2, BB-1 to BB-3, BB-1 to BB-4, BB-1 to BB-5, and BB-1 to BB-8 can be formed, with BB-1 to BB-8 as the maximal multiblock, because they have a single target. BB-1 to BB-6, BB-1 to BB-7, and BB-1 to BB-9 cannot be counted as multiblocks because they have three targets. A CFG (whose nodes are basic blocks) can thus be transformed into an equivalent graph whose nodes are multiblocks. The multiblock information is sent to the ISB architecture, and informed decisions navigate the control flow graph. When a multiblock is entered, its exit point can be determined easily even though the exact path through it is unknown.
The execution of multiblocks may overlap, creating overlapped execution of multiple flows of control. The data dependences between instructions of different multiblocks and the parallel register sharing architecture together determine the kind of subgraph used in multiblock construction. There are several reasons for restricting the scope of multiblocks. For instance, if the architecture is capable of exploiting inter-multiblock parallelism, it can be better to combine dependent instructions into a single unit (multiblock). Each iteration of a data-independent loop can be considered a multiblock to permit one-iteration-per-cycle initiation.

[Figure 2 annotations: basic blocks BB-1 through BB-9 with per-block instruction counts (static instructions 1-17, one to four instructions per block); prediction accuracies of 61% are noted on two blocks and 97% on a returning block.]

Code of Fig. 1:

for (i = 0; i < input; i++){
    a1 = a[0]->ptand[i];
    b1 = b[0]->ptend[i];
    if (a1 == 2)
        a1 = 0;
    if (b1 == 2)
        b1 = 0;
    if (a1 != b1){
        if (a1 < b1)
            return -1;
        else
            return 1;
    }
}

Fig. 3 shows a loop whose iterations are dependent.
Fig. 3 Iteration dependent loop
As an advantage, an entire loop can be encapsulated in a multiblock. The code of Fig. 3 is a doubly nested loop whose inner loop traverses a linked list; its execution is both data and control dependent. If we define the entire inner loop to be a single multiblock, several activations of the inner loop can be started without waiting for the completion of the previous one. Flexibility in multiblock construction is increased by allowing many targets, and as a result larger multiblocks can be formed. However, as the number of targets increases, the dynamic prediction setup needs additional state information, and the accuracy of prediction decreases. Therefore multiblocks are allowed a maximum of two targets. As an exception, when a multiblock has three or more targets, all but one or two of them are rarely exercised at run time. The reduced CFG of figure 2 is given in figure 4.
Fig. 4 Reduced CFG of figure 2
The figure above shows a multiblock constructed from BB-1 to BB-8. It contains 16 static instructions, of which an average of 7.46 are executed dynamically. The first multiblock (BB-1 to BB-8) is called MB(1-8) and the second multiblock (only BB-9) is MB(9). In this reduced CFG only two predictions are required per iteration of the loop, compared with the four predictions per iteration that an ordinary branch prediction approach would require on the CFG of figure 2. The following control flow table (CFT) is used for control flow prediction:
Table 1 Control flow table

Address    Target 1   Target 2   Length
MB(1-8)    MB(9)      Return     16
MB(9)      MB(10)     MB(1-8)    4
The control flow prediction buffer (CFPB) is a temporary store of CFT entries. The CFT entries are appended with sufficient information to support dynamic prediction decisions. The CFPB is accessed once for every multiblock activation to obtain the size and targets of the multiblock. The following table shows the CFPB entries for the reduced CFG of figure 4.

Table 2 CFPB entries

Address    State of prediction   Target 1   Target 2   Length
MB(1-8)    Taken                 MB(9)      Return     16
MB(9)      Taken                 MB(10)     MB(1-8)    4
IV. EVALUATION ON ABSTRACT MACHINE
We first evaluate the strength of the control flow prediction concept on an abstract machine that maintains a dynamic window from which ILP is extracted. The dynamic window initiates the instructions and the machine executes them. The instructions chosen by the machine at any given time can come from various parts of the dynamic window, along different flows of control in the program.
For experimental purposes we used compress, gcc, xlisp, yacc and tex, coded in the C language. The following table shows the basic structure of the different programs, evaluated in terms of dynamic instructions, conditional and unconditional branch ratios, static code size, and CFT size.

Table 3 Basic structure for different programs

Program    Dynamic instr.   Conditional    Unconditional   Static      Static
name       (millions)       branch ratio   branch ratio    code size   CFT size
gcc        1000             0.156          0.042           172032      25653
compress   22.68            0.149          0.040           6144        88.5
tex        214.67           0.143          0.055           60416       9976
yacc       26.37            0.237          0.020           12288       1737
xlisp      500              0.157          0.091           21504       3637
V. OBSERVATIONS
The table below shows the variation in the number of branches traversed per cycle with control flow prediction. For example, for gcc we observed 1.47 branches traversed per cycle with control flow prediction, and for tex 1.16 branches per cycle.

Table 4 Branch traversal results

Results without control flow prediction
program    Initiation   Window      Branch prediction   Traversed branches
           mean size    mean size   accuracy            per cycle
gcc        5.02         72          91.11               N/A
compress   5.24         64          89.59               N/A
tex        5.02         169         95.87               N/A
yacc       3.87         103         95.84               N/A
xlisp      4.00         144         95.63               N/A

Results with control flow prediction
gcc        9.44         105         91.02               1.47
compress   8.40         86          89.71               1.33
tex        6.24         207         96.10               1.16
yacc       4.96         150         96.51               1.22
xlisp      5.11         1.57        95.34               1.16
Code of Fig. 3:

for (fpt = xlenv; fpt; fpt = cdr(fpt)) {
    for (ep = car(fpt); ep; ep = cdr(ep)) {
        if (sym == car(car(ep)))
            return (cdr(car(ep)));
    }
}
The number of branch predictions is reduced by control flow prediction, which traverses multiple branches in a single prediction. The effect on branch prediction accuracy is not uniform across the programs.
VI. CONCLUSION
Once a prediction decision is made, instructions from the predicted path are fetched until the next branch on that path is encountered. For two arbitrary consecutive branches it is sometimes impossible to determine the identity of the next branch in time to make a prediction in the very next cycle after a branch prediction completes. If a branch prediction cannot be made in every cycle, the prediction bandwidth and the number of instructions issued per cycle suffer. The prediction mechanism can perform one prediction per cycle as long as the next branch lies inside the block of fetched instructions in the instruction buffer. The number of instructions that can enter the dynamic window per cycle is another limitation: the best-case instructions per cycle is restricted to the number of instructions that can move into the dynamic window. If only one branch of the CFG can be traversed and initiated per cycle, the average initiation rate is restricted by the length of code between branches. The solution to this problem is a mechanism that traverses multiple branches at a time, which can be done by initiating a set of control flow graph nodes for execution. The problems of accuracy and of dynamic window size can be mitigated when some of the branches with low prediction accuracies belong to if-else structures.
REFERENCES
[1] Rajendra Kumar, P K Singh, “A Modern Parallel
Register Sharing Architecture for Code
Compilation”, IJCA, Volume 1, No. 16, pp. 108-
113, 2010
[2] David I. August, Wen-mei W. Hwu, Scott A. Mahlke, "The Partial Reverse If-Conversion Framework for Balancing Control Flow and Predication", International Journal of Parallel Programming, Volume 27, No. 5, pp. 381-423, 1999.
[3] P. Chang, S. Mahlke, W. Chen, N. Warter, W.
Hwu, “IMPACT: An Architectural Framework for
Multiple-Instruction-Issue Processors”,
Proceeding 18th Annual International Symposium
on Computer Architecture, May 1991.
[4] R. Colwell, R. Nix, J. O’Donnell, D. Papworth,
and P. Rodman, “A VLIW Architecture for a
Trace Scheduling Compiler”, IEEE Transactions
on Computers, vol. 37, pp. 967-979, Aug. 1988.
[5] David I. August, Wen-Mei W. Hwu, Scott A.
Mahlke, “The Partial Reverse If-Conversion
Framework for Balancing Control Flow and
Predication”, International Journal of Parallel
Programming Volume 27, Issue 5, pp. 381–423, 1999.
[6] Dionisios N. Pnevmatikatos, Manoj Franklin, Gurindar S. Sohi, "Control flow prediction for dynamic ILP processors", Proceedings of the 26th Annual International Symposium on Microarchitecture, pp. 153-163, 1993.
[7] J. Fisher, “Trace Scheduling: A Technique for Global
Microcode Compaction”, IEEE Transactions on
Computers, vol. C-30, July 1981.
[8] J. Cong, Guoling Han, Zhiru Zhang, "Architecture and compilation for data bandwidth improvement in configurable embedded processors", Proceedings of the 2005 IEEE/ACM International Conference on Computer-Aided Design, pp. 263-270, 2005.
[9] P. Y. T. Hsu and E. S. Davidson, “Highly Concurrent
Scalar Processing”, Proceeding 13th Annual
International Symposium on Computer Architecture,
June 1986.
[10] M. S. Lam and R. P. Wilson, "Limits of control flow on parallelism", Proceedings of the 19th Annual International Symposium on Computer Architecture, pp. 46-57, 1992.
[11] J. K. F. Lee and A. J. Smith, “Branch Prediction
Strategies and Branch Target Buffer Design”, IEEE
Computer, Volume 17, pp. 6-22, 1984.
[12] S. Mahlke, D. Lin, W. Chen, R. Hank, and R.
Bringmann, ‘‘Effective Compiler Support for
Predicated Execution Using the Hyperblock”, Proc. of
the 25th Annual Workshop on Microprogramming and
Microarchitecture, 1992.
[13] S. T. Pan, K. So, and J. T. Rahmeh, “Improving the
Accuracy of Dynamic Branch Prediction Using Branch
Correlation”, Proceeding Architectural Support for
Programming Languages and Operating Systems
(ASPLOS-V), 1992.
[14] Steve Carr, “Combining Optimization for Cache and
Instruction-Level Parallelism”, Proceedings of the
1996 Conference on Parallel Architectures and
Compilation Techniques, 1996
[15] T. Yeh and Y. Patt, “A Comparison of Dynamic Branch
Predictors that use Two Levels of Branch History”,
Proceeding 20th Annual International Symposium on
Computer Architecture, May 1993.
[16] Vijay S. Pai, Parthasarathy Ranganathan, Hazim Abdel-
Shafi, Sarita Adve, “The Impact of Exploiting
Instruction-Level Parallelism on Shared-Memory
Multiprocessors”, IEEE Transactions on Computers,
Volume 48 , Issue 2, Special issue on cache memory
and related problems, pp. 218 – 226, 1999.
[17] Guilin Chen, Mahmut Kandemir, “Compiler-Directed
Code Restructuring for Improving Performance of
MPSoCs”, IEEE Transactions on Parallel and
Distributed Systems, Volume. 19, No. 9, 2008
[18] www.trimaran.org
AUTHORS PROFILE
Rajendra Kumar - He is Assistant Professor and Head of the Computer Science & Engineering department at Vidya College of Engineering, Meerut. He is the author of four textbooks: Theory of Automata, Languages and Computation (Tata McGraw-Hill), Human Computer Interaction (Firewall Media), Information and Communication Technologies (University Science Press), and Modeling and Simulation Concepts (University Science Press). He has written distance learning books on Computer Graphics for MGU Kerala and MDU Rohtak. His current research area is Instruction Level Parallelism.

Dr. P K Singh - He is an Associate Professor of Computer Science & Engineering at MMM Engineering College, Gorakhpur. He graduated from MMM Engineering College, Gorakhpur, with a Bachelor's degree in Computer Science, received an M. Tech. in Computer Science and Technology from the University of Roorkee, and then obtained a doctorate in the area of Parallelizing Compilers. He teaches a number of Computer Science subjects, including Compiler Design, Automata Theory, Advanced Computer Architectures, Parallel Computing, Data Structures and Algorithms, Object Oriented Programming in C++, and Computer Graphics, but mostly he teaches Compiler Design and Parallel Computing.
... As the control flow prediction is increases, the size of initiation is increased that permit the overlapped execution of multiple independent flow of control. [9] presented Control Flow Prediction through Multiblock Formation in Parallel Register Sharing Architecture. ...
Conference Paper
Full-text available
Instruction Level Parallelism (ILP) is not the new idea. Unfortunately ILP architecture not well suited to for all conventional high level language compilers and compiles optimization technique. Instruction Level Parallelism is the technique that allows a sequence of instructions derived from a sequential program (without rewriting) to be parallelized for its execution on multiple pipelining functional units. As a result, the performance is increased while working with current softwares. At implicit level it initiates by modifying the compiler and at explicit level it is done by exploiting the parallelism available with the hardware. To achieve high degree of instruction level parallelism, it is necessary to analyze and evaluate the technique of speculative execution control dependence analysis and to follow multiple flows of control. The researchers are continuously discovering the ways to increase parallelism by an order of magnitude beyond the current approaches. In this paper we present impact of control flow support on highly parallel architecture with 2-core and 4-core. We also investigated the scope of parallelism explicitly and implicitly. For our experiments we used trimaran simulator. The benchmarks are tested on abstract machine models created through trimaran simulator.
... By this the size of initiation is increased that permit the overlapped execution of multiple independent flow of control. [10] presented Control Flow Prediction through Multiblock Formation in Parallel Register Sharing Architecture. ...
Conference Paper
Full-text available
In this paper, we present issues associated with hardware and compiler to exploit instruction level parallelism. In this reference the solutions related to balanced scheduling have been presented. The comparison of balanced scheduler and traditional scheduler has also been discussed. The balanced scheduling with three compiler optimizations is very helpful to increase ILP speedup with respect to loop unrolling, trace scheduling and cache locality analysis. Loop unrolling and trace scheduling increase ILP by giving the scheduler a large space of instructions from which to select. The cache locality analysis, in other way, utilizes the amount of ILP available more efficiently. By loop unrolling, the compiler can generate more ILP by duplication of iterations in multiple to the unrolling factor. The balanced scheduler can increase its advantages over the traditional scheduler in the cases when more ILP is available. We have shown how loop unrolling, trace scheduling and cache locality analysis in association with balanced scheduling can interlock the cycles by reducing them upto 5%. The same thing over the traditional scheduler can reduce cycles not less than 15%.
... The parallel register sharing architecture for code compilation is presented in [1]. [3] introduces control flow prediction (CFP) in parallel register sharing architecture. ...
Conference Paper
Full-text available
In this paper we present a novel heuristic for selection of hyperblock in If-conversion. The if-conversion has been applied to be promising method for exploitation of ILP in the presence of control flow. The if-conversion in the prediction is responsible for control dependency between the branches and remaining instructions creating data dependency between the predicate definition and predicated structures of the program. As a result, the transformation of control flow becomes optimized traditional data flow and branch scheduling becomes reordering of serial instructions. The degree of ILP can be increased by overlapping multiple program path executions. The main idea behind this concept is to use a step beyond the prediction of common branch and permitting the architecture to have the information about the CFG (Control Flow Graph) components of the program to have better branch decision for ILP. The navigation bandwidth of prediction mechanism depends upon the degree of ILP. It can be increased by increasing control flow prediction in procedural languages at compile time. By this the size of initiation is increased that allows the overlapped execution of multiple independent flow of control. The multiple branch instruction can also be allowed as intermediate steps in order to increase the size of dynamic window to achieve a high degree of ILP exploitation.
... By this the size of initiation is increased that permit the overlapped execution of multiple independent flow of control. [9] presented Control Flow Prediction through Multiblock Formation in Parallel Register Sharing Architecture. ...
Conference Paper
Full-text available
The instruction level parallelism (ILP) is not a new idea. It has been in practice since 1970 and became a much more significant force in computer design by 1980s. The researchers are continuously working on how to exploit ILP using aggressive techniques. To exploit ILP the role of compiler and Computer Architecture is very important. The compiler identifies the parallelism in the program and communicates it to the hardware (through dependences between operations). Compiler may re-order instructions to facilitate the task of hardware to extract the parallelism. The hardware determines at run-time when each operation is independent from others and perform scheduling, and there is no scanning of the sequential program to determine dependences. To achieve the high degree of ILP, it is necessary to execute the instruction at earliest possible time. The execution of instruction at earliest possible time is subject to availability of input operands and functional units. The compiler may additionally specify on which functional unit and in which cycle, an operation is executed. In this paper we present role of hardware and compiler to exploit instruction level parallelism.
Chapter
Full-text available
In conventional compilers, after the parsing of the source program, it is input to a semantic analyzer, which checks for semantic errors, such as the mismatching of types, etc. The semantic analyzer accesses the symbol table to perform semantic checking involving identifiers. After semantic checking, the compiler generates intermediate code, optimizes the intermediate code, and generates a target program. During parsing, syntax analyzers create a "Symbol table" (also called NT a "Name List Table") that keeps track of information concerning each identifier declared or defined in the source program. This information includes the name and type of each identifier. This information includes the name and type of each identifier, its class (variable, constant, procedure, etc.), nesting level of the block where declared, and other information more specific to the class. It is important that compilers must compile the program quickly and efficiently. In conventional compilers, the design of the semantic analyzer leads to inefficiencies of operation. Specifically, the semantic analyzer must perform a symbol table lookup each time it performs a semantic check involving an identifier.
Conference Paper
Full-text available
High speed scalar processing is an essential characteristic of high performance general purpose computer systems. Highly concurrent execution of scalar code is difficult due to data dependencies and conditional branches. This paper proposes an architectural concept called guarded instructions to reduce the penalty of conditional branches in deeply pipelined processors. A code generation heuristic, the decision tree scheduling technique, reorders instructions in a complex of basic blocks so as to make efficient use of guarded instructions. Performance evaluation of several benchmarks are presented, including a module from the UNIX kernel. Even with these difficult scalar code examples, a speedup of two is achievable by using conventional pipelined uniprocessors augmented by guard instructions, and a speedup of three or more can be achieved using processors with parallel instruction pipelines.
Article
Full-text available
Predicated execution is a promising architectural feature for exploiting instruction-level parallelism in the presence of control flow. Compiling for predicated execution involves converting program control flow into conditional, or predicated, instructions. This process is known as if-conversion. In order to apply if-conversion effectively, one must address two major issues: what should be if-converted and when the if-conversion should be performed. A compiler's use of predication as a representation is most effective when large amounts of code are if-converted and when if-conversion is performed early in the compilation procedure. On the other hand, efficient execution of code generated for a processor with predicated execution requires a delicate balance between control flow and predication. The appropriate balance is tightly coupled with scheduling decisions and detailed processor characteristics. This paper presents a compilation framework that allows the compiler to maximize the benefits of predication as a compiler representation while delaying the final balancing of control flow and predication to schedule time.
Article
Full-text available
The design of many-core-on-a-chip has allowed renewed anintense interest in parallel computing. On implementationpart, it has been seen that most of applications are not able touse enough parallelism in parallel register sharingarchitecture. The exploitation of potential performance ofsuperscalar processors has shown that processor is fed withsufficient instruction bandwidth. The fetcher and theInstruction Stream Buffer (ISB) are the key elements toachieve this target. Beyond the basic blocks, the instructionstream is not supported by currents ISBs. The split lineinstruction problem depreciates this situation for x86processors. With the implementation of Line WeightedBranch Target Buffer (LWBTB), the advance branchinformation and reassembling of cache lines can be predictedby the ISB. The ISB can fetch some more valid instructionsin a cycle through reassembling of original line containinginstructions for next basic block. If the cache line size ismore than 64 bytes, then there exist good chances to havetwo basic blocks in the recognized instruction line.The code generation for parallel register share architectureinvolves some issues that are not present in sequential codecompilation and is inherently complex. To resolve suchissues, a consistency contract between the code and themachine can be defined and a compiler is required topreserve the contract during the transformation of code. Inthis paper, we present a correctness framework to ensure theprotection of the contract and then we use code optimizationfor verification under parallel code.
Conference Paper
Full-text available
Many commercially available embedded processors are capable of extending their base instruction set for a specific domain of applications. While steady progress has been made in the tools and methodologies of automatic instruction set extension for configurable processors, recent study has shown that the limited data bandwidth available in the core processor (e.g., the number of simultaneous accesses to the register file) becomes a serious performance bottleneck. In this paper, we propose a new low-cost architectural extension and associated compilation techniques to address the data bandwidth problem. Specifically, we embed a single control bit in the instruction op-codes to selectively copy the execution results to a set of hash-mapped shadow registers in the write-back stage. This can efficiently reduce the communication overhead due to data transfers between the core processor and the custom logic. We also present a novel simultaneous global shadow register binding with a hash function generation algorithm to take full advantage of the extension. The application of our approach leads to a nearly-optimal performance speedup (within 2% of the ideal speedup).
Article
Very Long Instruction Word (VLIW) architectures were promised to deliver far more than the factor of two or three that current architectures achieve from overlapped execution. Using a new type of compiler which compacts ordinary sequential code into long instruction words, a VLIW machine was expected to provide from ten to thirty times the performance of a more conventional machine built of the same implementation technology. Multiflow Computer, Inc., has now built a VLIW called the TRACE™ along with its companion Trace Scheduling™ compacting compiler. This new machine has fulfilled the performance promises that were made. Using many fast functional units in parallel, this machine extends some of the basic Reduced-Instruction-Set precepts: the architecture is load/store, the microarchitecture is exposed to the compiler, there is no microcode, and there is almost no hardware devoted to synchronization, arbitration, or interlocking of any kind (the compiler has sole responsibility for runtime resource usage). This paper discusses the design of this machine and presents some initial performance results.
Conference Paper
In this paper we describe restartable atomic sequences, an optimistic mechanism for implementing simple atomic operations (such as Test-And-Set) on a uniprocessor. A thread that is suspended within a restartable atomic ...
Conference Paper
A VLIW (very long instruction word) architecture machine called the TRACE has been built along with its companion Trace Scheduling compacting compiler. This machine has three hardware configurations, capable of executing 7, 14, or 28 operations simultaneously. The 'seven-wide' achieves a performance improvement of a factor of five or six for a wide range of scientific code, compared to machines of higher cost and fast chip implementation technology (such as the VAX 8700). The TRACE extends some basic reduced-instruction-set computer (RISC) precepts: the architecture is load/store, the microarchitecture is exposed to the compiler, there is no microcode, and there is almost no hardware devoted to synchronization, arbitration, or interlocking of any kind (the compiler has sole responsibility for run-time resource usage). The authors discuss the design of this machine and present some initial performance results.
Article
In this study, "trace scheduling" is developed as a solution to the global compaction problem. Trace scheduling works on traces (or paths) through microprograms. Compacting is thus done with a broad overview of the program. Important operations are given priority, no matter what their source block was. This is in sharp contrast with earlier methods, which compact one block at a time and then attempt iterative improvement. It is argued that those methods suffer from the lack of an overview and make many undesirable compactions, often preventing desirable ones. Loops are handled using the reducible property of most flow graphs. The loop handling technique permits the operations to move around loops, as well as into loops, where appropriate.
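The first step of trace scheduling, selecting a likely path through the flow graph to compact as one straight-line region, can be sketched as follows. This is a minimal illustration under assumed block names and edge probabilities, not Fisher's full algorithm (which also handles bookkeeping code at trace boundaries).

```python
# Hypothetical sketch of trace selection: starting from the entry block,
# repeatedly follow the most probable successor edge, stopping at a
# back-edge (loop boundary) or when there are no successors.  The
# resulting trace is then scheduled as if it were one basic block.

def pick_trace(cfg, entry):
    """cfg maps block -> list of (successor, probability) edges."""
    trace = [entry]
    seen = {entry}
    block = entry
    while cfg.get(block):
        # Follow the successor the profile says is most likely.
        succ, _prob = max(cfg[block], key=lambda e: e[1])
        if succ in seen:   # back-edge: stop at the loop boundary
            break
        trace.append(succ)
        seen.add(succ)
        block = succ
    return trace

cfg = {
    "B0": [("B1", 0.9), ("B4", 0.1)],
    "B1": [("B2", 0.6), ("B3", 0.4)],
    "B2": [("B1", 1.0)],   # loop back to B1
    "B3": [],
    "B4": [],
}
print(pick_trace(cfg, "B0"))  # -> ['B0', 'B1', 'B2']
```

Operations on the chosen trace get scheduling priority over the off-trace blocks (here B3 and B4), which is exactly the "broad overview" the abstract contrasts with block-at-a-time compaction.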