Rajendra Kumar et. al. / (IJCSE) International Journal on Computer Science and Engineering
Vol. 2, No. 4, 2010, 1179-1183
ISSN: 0975-3397
1179
Control Flow Prediction through Multiblock Formation in Parallel Register Sharing Architecture

Rajendra Kumar
Dept. of Computer Science & Engineering, Vidya College of Engineering, Meerut (UP), India
rajendra04@gmail.com

Dr. P K Singh
Dept. of Computer Science & Engineering, MMM Engineering College, Gorakhpur (UP), India
topksingh@gmail.com
Abstract - In this paper we introduce control flow prediction (CFP) in a parallel register sharing architecture. The main idea is to go a step beyond common branch prediction and permit the hardware to use information about the control flow graph (CFG) of the program to make better navigation decisions at branches. The degree of ILP depends on the navigation bandwidth of the prediction mechanism, and it can be increased by improving control flow prediction. This enlarges the initiation size, permitting overlapped execution of multiple independent flows of control and allowing multiple branch instructions to be resolved simultaneously. These are intermediate steps toward increasing the size of the dynamic window, which can achieve a high degree of ILP exploitation.
Keywords: CFP; ISB; ILP; CFG; Basic Block
I. INTRODUCTION
ILP enables the execution of multiple instructions per cycle and is now essential to the performance of modern processors. ILP is greatly constrained by branch instructions, and branch prediction is therefore commonly employed together with speculative execution [2]. However, inevitable branch mispredictions compromise this remedy. Predication, on the other hand, exposes a high degree of ILP to the scheduler by converting control flow into equivalent predicated instructions that are guarded by Boolean source operands. If-conversion has been shown to be a promising method for exploiting ILP in the presence of control flow.
If-conversion turns the control dependences between branches and the remaining instructions into data dependences between predicate definitions and the predicated instructions of the program. As a result, transforming control flow becomes a traditional data flow optimization, and branch scheduling becomes a reordering of serial instructions. ILP can be increased by overlapping the execution of multiple program paths, and some predicate-specific optimizations may also be enabled to supplement traditional approaches.

The major questions regarding if-conversion [2], what to if-convert and when to if-convert, suggest that if-conversion should be performed early in the compilation stage. Doing so has the advantage of facilitating classical optimizations on the predicated instructions, whereas delaying if-conversion until schedule time leaves room for better selection based on code efficiency and target processor characteristics. Dynamic branch prediction is fundamentally restricted in establishing a dynamic window because it makes local decisions without any prior knowledge of the global control structure of the program. This lack of knowledge creates several problems: (1) predicting the branch and (2) establishing its identity, meaning that the branch must first be encountered by the parallel register sharing architecture [1]. With normal branch prediction, a prediction can be made only while the fetch unit fetches the branch instruction.
II. RELATED WORK
The fetch unit has a great role in prediction mechanism
[2] in parallel register sharing architecture but [13, 15]
proposes some recent prediction mechanism that do not
require the addresses of branch for prediction rather there is
requirement of identity of branch to be known so that the
predicted target address can be obtained using either BTB
[11] or by decoding branch instructions. There are so many
commercially available embedded processors that are capable
to extend the set of base instructions for a specific application
domain. A steady progress has been observed in tools and
methodology for automatic instruction set extension for
processors that can be configured. It has been seen that the
limited data bandwidth is available in the core processors.
This creates a serious performance deadlock. [8] represents a
very low cost architectural extension and a compilation
technique that creates data bandwidth problem. A novel
parallel global register binding is also presented in [8] with
the help of hash function algorithm. This leads to a nearly
optimal performance speedup of 2% of ideal speedup. A
compilation framework [5] is presented that allows a
compiler to maximize the benefits of prediction. A heuristic
[14] is derived through experiments on the Trimaran simulator [18] and shows how the weaknesses of traditional heuristics can be exploited. Optimal use of the loop cache is also explored to relieve unnecessary pressure. A technique to enhance the ability of dynamic ILP processors to exploit parallelism is introduced in [6]. A performance metric is presented in [14] to guide nested loop optimization; in this way the effect of ILP is combined with loop optimization.

The impact of ILP processors on the performance of shared memory multiprocessors, with and without the latency hiding optimization of software prefetching, is presented in [16]. One of the critical goals of code optimization for multiprocessor system-on-a-single-chip architectures [17] is to minimize the number of off-chip memory accesses, and [17] presents a strategy that reduces the number of off-chip references due to shared data.

Static techniques such as trace scheduling [4, 7], predicated execution [9], and superblock and hyperblock scheduling [3, 12] have been used to alleviate the impact of control dependences. [10] presents a study showing that ILP processors which perform branch prediction and speculative execution, but follow only a single flow of control, can extract a parallelism of only 7. The parallelism limit rises to 13 if the ILP processors make maximal use of control dependence information to execute instructions ahead of the branches on which they do not depend.
III. EXPLOITATION OF CONTROL FLOW GRAPH CHARACTERISTICS
The ISB architecture and the ISB structure for control flow prediction are presented in [1]. The information contained in the CFG of a program can be exploited by the ISB architecture, which performs parallelization with shared registers. After inspecting the control flow graph of a program, it is possible to infer that some of the basic blocks will be executed regardless of previous branch outcomes. Figure 1 shows a C language code fragment.

Fig. 1 A 'C' language code

Figure 2 shows the corresponding CFG, with the number of instructions in each basic block.

Fig 2. Control Flow Graph of fig 1.
The experiments are performed on the Trimaran simulator for a MIPS 2000 executable. Extending from node BB-1, the multiblocks BB-1 to BB-2, BB-1 to BB-3, BB-1 to BB-4, BB-1 to BB-5, and BB-1 to BB-8 can be formed, with BB-1 to BB-8 as the maximal multiblock, because they have a single target. BB-1 to BB-6, BB-1 to BB-7, and BB-1 to BB-9 cannot be counted as multiblocks because they have three targets. A CFG (whose nodes are basic blocks) can thus be transformed into an equivalent graph whose nodes are multiblocks. The multiblock information is sent to the ISB architecture, and informed decisions navigate the control flow graph. When a multiblock is entered, its exit point can be determined easily even though the exact path through it is unknown.
The execution of multiblocks may overlap, creating overlapped execution of multiple flows of control. The data dependences between instructions of different multiblocks and the parallel register sharing architecture together determine the kind of subgraph used in multiblock construction. There are several reasons for restricting the scope of multiblocks. For instance, if the architecture is capable of exploiting inter-multiblock parallelism, it can be better to combine dependent instructions into a single unit (multiblock). Each iteration of a data-independent loop can be considered a multiblock to permit one-iteration-per-cycle initiation.

[Figure 2 annotations: basic blocks BB-1 through BB-9 with per-block instruction counts (static instructions 1-17, one to four instructions per block); prediction accuracies of 61% are noted on two blocks and 97% on a returning block.]

Code of Fig. 1:

for (i = 0; i < input; i++){
    a1 = a[0]->ptand[i];
    b1 = b[0]->ptend[i];
    if (a1 == 2)
        a1 = 0;
    if (b1 == 2)
        b1 = 0;
    if (a1 != b1){
        if (a1 < b1)
            return -1;
        else
            return 1;
    }
}

Fig. 3 shows a loop whose iterations are dependent.
Fig. 3 Iteration dependent loop
As an advantage, an entire loop can be encapsulated in a multiblock. The code of Fig. 3 is a doubly nested loop whose inner loop traverses a linked list; its execution is both data and control dependent. If we define the entire inner loop to be a single multiblock, several activations of the inner loop can be started without waiting for the completion of the previous one. Flexibility in multiblock construction is increased by allowing many targets, and as a result larger multiblocks can be formed. However, as the number of targets increases, the dynamic prediction setup needs additional state information, and the accuracy of prediction decreases. Therefore multiblocks are allowed a maximum of two targets. As an exception, when a multiblock has three or more targets, all but one or two of them are rarely exercised at run time. The reduced CFG of figure 2 is given in figure 4.
Fig. 4 Reduced CFG of figure 2
The figure above shows a multiblock constructed from BB-1 to BB-8. It contains 16 static instructions, of which an average of 7.46 are executed dynamically. The first multiblock (BB-1 to BB-8) is called MB(1-8) and the second multiblock (only BB-9) is MB(9). In this reduced CFG only two predictions are required per iteration of the loop, compared with the four predictions per iteration that an ordinary branch prediction approach would require on the CFG of figure 2. The following control flow table (CFT) is used for control flow prediction:
Table 1 Control flow table

Address    Target 1   Target 2   Length
MB(1-8)    MB(9)      Return     16
MB(9)      MB(10)     MB(1-8)    4
The control flow prediction buffer (CFPB) is a temporary store of CFT entries. The CFT entries are appended with sufficient information to support dynamic prediction decisions. The CFPB is accessed once for every multiblock activation to obtain the size and targets of the multiblock. The following table shows the CFPB entries for the reduced CFG of figure 4.

Table 2 CFPB entries

Address    State of prediction   Target 1   Target 2   Length
MB(1-8)    Taken                 MB(9)      Return     16
MB(9)      Taken                 MB(10)     MB(1-8)    4
IV. EVALUATION ON ABSTRACT MACHINE
We first evaluate the strength of the control flow prediction concept on an abstract machine that maintains a dynamic window from which ILP is extracted. The dynamic window initiates the instructions and the machine executes them. The instructions chosen by the machine at any given time can come from various parts of the dynamic window, along different flows of control in the program.
For experimental purposes we used compress, gcc, xlisp, yacc and tex, coded in the C language. The following table shows the basic structure of the different programs, evaluated in terms of dynamic instructions, conditional and unconditional branch ratios, static code size, and CFT size.

Table 3 Basic structure for different programs

Program    Dynamic instr.   Conditional    Unconditional   Static      Static
name       (millions)       branch ratio   branch ratio    code size   CFT size
gcc        1000             0.156          0.042           172032      25653
compress   22.68            0.149          0.040           6144        88.5
tex        214.67           0.143          0.055           60416       9976
yacc       26.37            0.237          0.020           12288       1737
xlisp      500              0.157          0.091           21504       3637
V. OBSERVATIONS
The table below shows the variation in the number of branches traversed per cycle with control flow prediction. For example, for gcc we observed 1.47 branches traversed per cycle with control flow prediction, and for tex 1.16 branches per cycle.

Table 4 Branch traversal results

Results without control flow prediction
program    Initiation   Window      Branch prediction   Traversed branches
           mean size    mean size   accuracy            per cycle
gcc        5.02         72          91.11               N/A
compress   5.24         64          89.59               N/A
tex        5.02         169         95.87               N/A
yacc       3.87         103         95.84               N/A
xlisp      4.00         144         95.63               N/A

Results with control flow prediction
gcc        9.44         105         91.02               1.47
compress   8.40         86          89.71               1.33
tex        6.24         207         96.10               1.16
yacc       4.96         150         96.51               1.22
xlisp      5.11         1.57        95.34               1.16
Code of Fig. 3:

for (fpt = xlenv; fpt; fpt = cdr(fpt)) {
    for (ep = car(fpt); ep; ep = cdr(ep)) {
        if (sym == car(car(ep)))
            return (cdr(car(ep)));
    }
}
The number of branch predictions is reduced by control flow prediction, which traverses multiple branches in a single prediction. The effect on branch prediction accuracy is not uniform across the programs.
VI. CONCLUSION
Once a prediction decision is made, instructions from the predicted path are fetched until the next branch on that path is encountered. For two arbitrary consecutive branches it is sometimes impossible to determine the identity of the next branch in time to make a prediction in the very next cycle after a branch prediction completes. If a branch prediction cannot be made in every cycle, the prediction bandwidth and the number of instructions issued per cycle suffer. The prediction mechanism can perform one prediction per cycle as long as the next branch lies inside the block of fetched instructions in the instruction buffer. The number of instructions that can enter the dynamic window per cycle is another limitation: the best-case instructions per cycle is restricted to the number of instructions that can move into the dynamic window. If only one branch of the CFG can be traversed and initiated per cycle, the average initiation rate is restricted by the length of code between branches. The solution to this problem is a mechanism that traverses multiple branches at a time, which can be done by initiating a set of control flow graph nodes for execution. The problems of accuracy and of dynamic window size can be mitigated when some of the branches with low prediction accuracies belong to if-else structures.
REFERENCES
[1] Rajendra Kumar, P K Singh, “A Modern Parallel
Register Sharing Architecture for Code
Compilation”, IJCA, Volume 1, No. 16, pp. 108-
113, 2010
[2] David I. August, Wen-mei W. Hwu, Scott A. Mahlke, "The Partial Reverse If-Conversion Framework for Balancing Control Flow and Predication", International Journal of Parallel Programming, Volume 27, No. 5, pp. 381-423, 1999.
[3] P. Chang, S. Mahlke, W. Chen, N. Warter, W.
Hwu, “IMPACT: An Architectural Framework for
Multiple-Instruction-Issue Processors”,
Proceeding 18th Annual International Symposium
on Computer Architecture, May 1991.
[4] R. Colwell, R. Nix, J. O’Donnell, D. Papworth,
and P. Rodman, “A VLIW Architecture for a
Trace Scheduling Compiler”, IEEE Transactions
on Computers, vol. 37, pp. 967-979, Aug. 1988.
[5] David I. August, Wen-Mei W. Hwu, Scott A.
Mahlke, “The Partial Reverse If-Conversion
Framework for Balancing Control Flow and
Predication”, International Journal of Parallel
Programming Volume 27, Issue 5, pp. 381–423, 1999.
[6] Dionisios N. Pnevmatikatos, Manoj Franklin, Gurindar S. Sohi, "Control flow prediction for dynamic ILP processors", Proceedings of the 26th Annual International Symposium on Microarchitecture, pp. 153-163, 1993.
[7] J. Fisher, “Trace Scheduling: A Technique for Global
Microcode Compaction”, IEEE Transactions on
Computers, vol. C-30, July 1981.
[8] J. Cong, Guoling Han, Zhiru Zhang, "Architecture and compilation for data bandwidth improvement in configurable embedded processors", Proceedings of the 2005 IEEE/ACM International Conference on Computer-Aided Design, pp. 263-270, 2005.
[9] P. Y. T. Hsu and E. S. Davidson, “Highly Concurrent
Scalar Processing”, Proceeding 13th Annual
International Symposium on Computer Architecture,
June 1986.
[10] M. S. Lam and R. P. Wilson, "Limits of control flow on parallelism", Proceedings of the 19th Annual International Symposium on Computer Architecture, pp. 46-57, 1992.
[11] J. K. F. Lee and A. J. Smith, “Branch Prediction
Strategies and Branch Target Buffer Design”, IEEE
Computer, Volume 17, pp. 6-22, 1984.
[12] S. Mahlke, D. Lin, W. Chen, R. Hank, and R.
Bringmann, ‘‘Effective Compiler Support for
Predicated Execution Using the Hyperblock”, Proc. of
the 25th Annual Workshop on Microprogramming and
Microarchitecture, 1992.
[13] S. T. Pan, K. So, and J. T. Rahmeh, “Improving the
Accuracy of Dynamic Branch Prediction Using Branch
Correlation”, Proceeding Architectural Support for
Programming Languages and Operating Systems
(ASPLOS-V), 1992.
[14] Steve Carr, “Combining Optimization for Cache and
Instruction-Level Parallelism”, Proceedings of the
1996 Conference on Parallel Architectures and
Compilation Techniques, 1996
[15] T. Yeh and Y. Patt, “A Comparison of Dynamic Branch
Predictors that use Two Levels of Branch History”,
Proceeding 20th Annual International Symposium on
Computer Architecture, May 1993.
[16] Vijay S. Pai, Parthasarathy Ranganathan, Hazim Abdel-
Shafi, Sarita Adve, “The Impact of Exploiting
Instruction-Level Parallelism on Shared-Memory
Multiprocessors”, IEEE Transactions on Computers,
Volume 48 , Issue 2, Special issue on cache memory
and related problems, pp. 218 – 226, 1999.
[17] Guilin Chen, Mahmut Kandemir, “Compiler-Directed
Code Restructuring for Improving Performance of
MPSoCs”, IEEE Transactions on Parallel and
Distributed Systems, Volume. 19, No. 9, 2008
[18] www.trimaran.org
AUTHORS PROFILE
Rajendra Kumar - He is Assistant Professor and Head of the Computer Science & Engineering department at Vidya College of Engineering, Meerut. He is the author of four textbooks: Theory of Automata, Languages and Computation (Tata McGraw-Hill), Human Computer Interaction (Firewall Media), Information and Communication Technologies (University Science Press), and Modeling and Simulation Concepts (University Science Press). He has written distance learning books on Computer Graphics for MGU Kerala and MDU Rohtak. His current research area is Instruction Level Parallelism.

Dr. P K Singh - He is an Associate Professor of Computer Science & Engineering at MMM Engineering College, Gorakhpur. He graduated from MMM Engineering College, Gorakhpur, with a Bachelor's degree in Computer Science, received an M. Tech. in Computer Science and Technology from the University of Roorkee, and then obtained a doctorate in the area of Parallelizing Compilers. He teaches a number of Computer Science subjects, including Compiler Design, Automata Theory, Advanced Computer Architectures, Parallel Computing, Data Structures and Algorithms, Object Oriented Programming in C++, and Computer Graphics, but mostly he teaches Compiler Design and Parallel Computing.
... As the control flow prediction is increases, the size of initiation is increased that permit the overlapped execution of multiple independent flow of control. [9] presented Control Flow Prediction through Multiblock Formation in Parallel Register Sharing Architecture. ...
Conference Paper
Full-text available
Instruction Level Parallelism (ILP) is not the new idea. Unfortunately ILP architecture not well suited to for all conventional high level language compilers and compiles optimization technique. Instruction Level Parallelism is the technique that allows a sequence of instructions derived from a sequential program (without rewriting) to be parallelized for its execution on multiple pipelining functional units. As a result, the performance is increased while working with current softwares. At implicit level it initiates by modifying the compiler and at explicit level it is done by exploiting the parallelism available with the hardware. To achieve high degree of instruction level parallelism, it is necessary to analyze and evaluate the technique of speculative execution control dependence analysis and to follow multiple flows of control. The researchers are continuously discovering the ways to increase parallelism by an order of magnitude beyond the current approaches. In this paper we present impact of control flow support on highly parallel architecture with 2-core and 4-core. We also investigated the scope of parallelism explicitly and implicitly. For our experiments we used trimaran simulator. The benchmarks are tested on abstract machine models created through trimaran simulator.
... By this the size of initiation is increased that permit the overlapped execution of multiple independent flow of control. [10] presented Control Flow Prediction through Multiblock Formation in Parallel Register Sharing Architecture. ...
Conference Paper
Full-text available
In this paper, we present issues associated with hardware and compiler to exploit instruction level parallelism. In this reference the solutions related to balanced scheduling have been presented. The comparison of balanced scheduler and traditional scheduler has also been discussed. The balanced scheduling with three compiler optimizations is very helpful to increase ILP speedup with respect to loop unrolling, trace scheduling and cache locality analysis. Loop unrolling and trace scheduling increase ILP by giving the scheduler a large space of instructions from which to select. The cache locality analysis, in other way, utilizes the amount of ILP available more efficiently. By loop unrolling, the compiler can generate more ILP by duplication of iterations in multiple to the unrolling factor. The balanced scheduler can increase its advantages over the traditional scheduler in the cases when more ILP is available. We have shown how loop unrolling, trace scheduling and cache locality analysis in association with balanced scheduling can interlock the cycles by reducing them upto 5%. The same thing over the traditional scheduler can reduce cycles not less than 15%.
... The parallel register sharing architecture for code compilation is presented in [1]. [3] introduces control flow prediction (CFP) in parallel register sharing architecture. ...
Conference Paper
Full-text available
In this paper we present a novel heuristic for selection of hyperblock in If-conversion. The if-conversion has been applied to be promising method for exploitation of ILP in the presence of control flow. The if-conversion in the prediction is responsible for control dependency between the branches and remaining instructions creating data dependency between the predicate definition and predicated structures of the program. As a result, the transformation of control flow becomes optimized traditional data flow and branch scheduling becomes reordering of serial instructions. The degree of ILP can be increased by overlapping multiple program path executions. The main idea behind this concept is to use a step beyond the prediction of common branch and permitting the architecture to have the information about the CFG (Control Flow Graph) components of the program to have better branch decision for ILP. The navigation bandwidth of prediction mechanism depends upon the degree of ILP. It can be increased by increasing control flow prediction in procedural languages at compile time. By this the size of initiation is increased that allows the overlapped execution of multiple independent flow of control. The multiple branch instruction can also be allowed as intermediate steps in order to increase the size of dynamic window to achieve a high degree of ILP exploitation.
... By this the size of initiation is increased that permit the overlapped execution of multiple independent flow of control. [9] presented Control Flow Prediction through Multiblock Formation in Parallel Register Sharing Architecture. ...
Conference Paper
Full-text available
The instruction level parallelism (ILP) is not a new idea. It has been in practice since 1970 and became a much more significant force in computer design by 1980s. The researchers are continuously working on how to exploit ILP using aggressive techniques. To exploit ILP the role of compiler and Computer Architecture is very important. The compiler identifies the parallelism in the program and communicates it to the hardware (through dependences between operations). Compiler may re-order instructions to facilitate the task of hardware to extract the parallelism. The hardware determines at run-time when each operation is independent from others and perform scheduling, and there is no scanning of the sequential program to determine dependences. To achieve the high degree of ILP, it is necessary to execute the instruction at earliest possible time. The execution of instruction at earliest possible time is subject to availability of input operands and functional units. The compiler may additionally specify on which functional unit and in which cycle, an operation is executed. In this paper we present role of hardware and compiler to exploit instruction level parallelism.
Chapter
Full-text available
In conventional compilers, after the parsing of the source program, it is input to a semantic analyzer, which checks for semantic errors, such as the mismatching of types, etc. The semantic analyzer accesses the symbol table to perform semantic checking involving identifiers. After semantic checking, the compiler generates intermediate code, optimizes the intermediate code, and generates a target program. During parsing, syntax analyzers create a "Symbol table" (also called NT a "Name List Table") that keeps track of information concerning each identifier declared or defined in the source program. This information includes the name and type of each identifier. This information includes the name and type of each identifier, its class (variable, constant, procedure, etc.), nesting level of the block where declared, and other information more specific to the class. It is important that compilers must compile the program quickly and efficiently. In conventional compilers, the design of the semantic analyzer leads to inefficiencies of operation. Specifically, the semantic analyzer must perform a symbol table lookup each time it performs a semantic check involving an identifier.
Conference Paper
Full-text available
High speed scalar processing is an essential characteristic of high performance general purpose computer systems. Highly concurrent execution of scalar code is difficult due to data dependencies and conditional branches. This paper proposes an architectural concept called guarded instructions to reduce the penalty of conditional branches in deeply pipelined processors. A code generation heuristic, the decision tree scheduling technique, reorders instructions in a complex of basic blocks so as to make efficient use of guarded instructions. Performance evaluation of several benchmarks are presented, including a module from the UNIX kernel. Even with these difficult scalar code examples, a speedup of two is achievable by using conventional pipelined uniprocessors augmented by guard instructions, and a speedup of three or more can be achieved using processors with parallel instruction pipelines.
Article
Full-text available
Predicated execution is a promising architectural feature for exploiting instruction-level parallelism in the presence of control flow. Compiling for predicated execution involves converting program control flow into conditional, or predicated, instructions. This process is known as if-conversion. In order to apply if-conversion effectively, one must address two major issues: what should be if-converted and when the if-conversion should be performed. A compiler's use of predication as a representation is most effective when large amounts of code are if-converted and when if-conversion is performed early in the compilation procedure. On the other hand, efficient execution of code generated for a processor with predicated execution requires a delicate balance between control flow and predication. The appropriate balance is tightly coupled with scheduling decisions and detailed processor characteristics. This paper presents a compilation framework that allows the compiler to maximize the benefits of predication as a compiler representation while delaying the final balancing of control flow and predication to schedule time.
Article
Full-text available
The design of many-core-on-a-chip has allowed renewed anintense interest in parallel computing. On implementationpart, it has been seen that most of applications are not able touse enough parallelism in parallel register sharingarchitecture. The exploitation of potential performance ofsuperscalar processors has shown that processor is fed withsufficient instruction bandwidth. The fetcher and theInstruction Stream Buffer (ISB) are the key elements toachieve this target. Beyond the basic blocks, the instructionstream is not supported by currents ISBs. The split lineinstruction problem depreciates this situation for x86processors. With the implementation of Line WeightedBranch Target Buffer (LWBTB), the advance branchinformation and reassembling of cache lines can be predictedby the ISB. The ISB can fetch some more valid instructionsin a cycle through reassembling of original line containinginstructions for next basic block. If the cache line size ismore than 64 bytes, then there exist good chances to havetwo basic blocks in the recognized instruction line.The code generation for parallel register share architectureinvolves some issues that are not present in sequential codecompilation and is inherently complex. To resolve suchissues, a consistency contract between the code and themachine can be defined and a compiler is required topreserve the contract during the transformation of code. Inthis paper, we present a correctness framework to ensure theprotection of the contract and then we use code optimizationfor verification under parallel code.
Conference Paper
Full-text available
Many commercially available embedded processors are capable of extending their base instruction set for a specific domain of applications. While steady progress has been made in the tools and methodologies of automatic instruction set extension for configurable processors, recent study has shown that the limited data bandwidth available in the core processor (e.g., the number of simultaneous accesses to the register file) becomes a serious performance bottleneck. In this paper, we propose a new low-cost architectural extension and associated compilation techniques to address the data bandwidth problem. Specifically, we embed a single control bit in the instruction op-codes to selectively copy the execution results to a set of hash-mapped shadow registers in the write-back stage. This can efficiently reduce the communication overhead due to data transfers between the core processor and the custom logic. We also present a novel simultaneous global shadow register binding with a hash function generation algorithm to take full advantage of the extension. The application of our approach leads to a nearly-optimal performance speedup (within 2% of the ideal speedup).
Article
Very Long Instruction Word (VLIW) architectures were promised to deliver far more than the factor of two or three that current architectures achieve from overlapped execution. Using a new type of compiler which compacts ordinary sequential code into long instruction words, a VLIW machine was expected to provide from ten to thirty times the performance of a more conventional machine built of the same implementation technology. Multiflow Computer, Inc., has now built a VLIW called the TRACE™ along with its companion Trace Scheduling™ compacting compiler. This new machine has fulfilled the performance promises that were made. Using many fast functional units in parallel, this machine extends some of the basic Reduced-Instruction-Set precepts: the architecture is load/store, the microarchitecture is exposed to the compiler, there is no microcode, and there is almost no hardware devoted to synchronization, arbitration, or interlocking of any kind (the compiler has sole responsibility for runtime resource usage). This paper discusses the design of this machine and presents some initial performance results.
Conference Paper
In this paper we describe restartable atomic sequences, an optimistic mechanism for implementing simple atomic operations (such as Test-And-Set) on a uniprocessor. A thread that is suspended within a restartable atomic ...
Conference Paper
A VLIW (very long instruction word) architecture machine called the TRACE has been built along with its companion Trace Scheduling compacting compiler. This machine has three hardware configurations, capable of executing 7, 14, or 28 operations simultaneously. The 'seven-wide' achieves a performance improvement of a factor of five or six for a wide range of scientific code, compared to machines of higher cost and fast chip implementation technology (such as the VAX 8700). The TRACE extends some basic reduced-instruction-set computer (RISC) precepts: the architecture is load/store, the microarchitecture is exposed to the compiler, there is no microcode, and there is almost no hardware devoted to synchronization, arbitration, or interlocking of any kind (the compiler has sole responsibility for run-time resource usage). The authors discuss the design of this machine and present some initial performance results.
Article
In this study, "trace scheduling" is developed as a solution to the global compaction problem. Trace scheduling works on traces (or paths) through microprograms. Compacting is thus done with a broad overview of the program. Important operations are given priority, no matter what their source block was. This is in sharp contrast with earlier methods, which compact one block at a time and then attempt iterative improvement. It is argued that those methods suffer from the lack of an overview and make many undesirable compactions, often preventing desirable ones. Loops are handled using the reducible property of most flow graphs. The loop handling technique permits the operations to move around loops, as well as into loops, where appropriate.
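The first step of trace scheduling, selecting a likely path through the flow graph to compact as one straight-line region, can be sketched as follows. This is a minimal illustration under assumed block names and edge probabilities, not Fisher's full algorithm (which also handles bookkeeping code at trace boundaries).

```python
# Hypothetical sketch of trace selection: starting from the entry block,
# repeatedly follow the most probable successor edge, stopping at a
# back-edge (loop boundary) or when there are no successors.  The
# resulting trace is then scheduled as if it were one basic block.

def pick_trace(cfg, entry):
    """cfg maps block -> list of (successor, probability) edges."""
    trace = [entry]
    seen = {entry}
    block = entry
    while cfg.get(block):
        # Follow the successor the profile says is most likely.
        succ, _prob = max(cfg[block], key=lambda e: e[1])
        if succ in seen:   # back-edge: stop at the loop boundary
            break
        trace.append(succ)
        seen.add(succ)
        block = succ
    return trace

cfg = {
    "B0": [("B1", 0.9), ("B4", 0.1)],
    "B1": [("B2", 0.6), ("B3", 0.4)],
    "B2": [("B1", 1.0)],   # loop back to B1
    "B3": [],
    "B4": [],
}
print(pick_trace(cfg, "B0"))  # -> ['B0', 'B1', 'B2']
```

Operations on the chosen trace get scheduling priority over the off-trace blocks (here B3 and B4), which is exactly the "broad overview" the abstract contrasts with block-at-a-time compaction.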