MorphoSys: An Integrated Reconfigurable System for Data-Parallel
Computation-Intensive Applications
Hartej Singh, Ming-Hau Lee, Guangming Lu, Fadi J. Kurdahi, Nader Bagherzadeh,
University of California, Irvine, CA 92697
and
Eliseu M. C. Filho,
Federal University of Rio de Janeiro, Brazil
Abstract: This paper introduces MorphoSys, a reconfigurable computing system developed to investigate the
effectiveness of combining reconfigurable hardware with general-purpose processors for word-level,
computation-intensive applications. MorphoSys is a coarse-grain, integrated reconfigurable system-on-chip
targeted at high-throughput and data-parallel applications. It comprises a reconfigurable array of
processing cells, a modified RISC processor core and an efficient memory interface unit. This paper
describes the MorphoSys architecture, including the reconfigurable processing array, the control processor
and data and configuration memories. The suitability of MorphoSys for the target application domain is then
illustrated with examples such as video compression, data encryption and target recognition. Performance
evaluation of these applications indicates improvements of up to an order of magnitude on MorphoSys in
comparison with other systems.
Index Terms: Reconfigurable systems, reconfigurable cell array, Single Instruction Multiple Data, dynamic
reconfiguration, target recognition, template matching, multimedia applications, video compression, MPEG-
2, data encryption/decryption.
1. Introduction
Reconfigurable systems are computing systems that combine a reconfigurable hardware processing unit
with a software-programmable processor. These systems allow customization of the reconfigurable
processing unit, in order to meet the specific computational requirements of different applications.
Reconfigurable computing represents an intermediate approach between the extremes of Application Specific
Integrated Circuits (ASICs) and general-purpose processors. A reconfigurable system generally has wider
applicability than an ASIC. In addition, the combination of a reconfigurable component with a general-
purpose processor results in better performance (for many application classes) than the general-purpose
processor alone.
The significance of reconfigurable systems can be illustrated through the following example. Many
applications have a heterogeneous nature and comprise several sub-tasks with different characteristics. For
instance, a multimedia application may include a data-parallel task, a bit-level task, irregular computations,
high-precision word operations and a real-time component. For such complex applications with wide-ranging
sub-tasks, the ASIC approach would lead to an uneconomical die size or a large number of separate chips.
Also, most general-purpose processors would very likely not satisfy the performance constraints for the
entire application. However, a reconfigurable system (that combines a reconfigurable component with a
mainstream microprocessor) may be optimally reconfigured for each sub-task, meeting the application
constraints within the same chip. Moreover, such a system remains useful for more general-purpose applications as well.
This paper describes MorphoSys, a novel model for reconfigurable computing systems, targeted at
applications with inherent data-parallelism, high regularity, and high throughput requirements. Some
examples of these applications are video compression (discrete cosine transforms, motion estimation),
graphics and image processing, data encryption, and DSP transforms.
[Figure 1 components: Main Processor (e.g. advanced RISC) with instruction/data cache, Reconfigurable Processor Array, High Bandwidth Memory Interface, and System Bus within MorphoSys, connected to External Memory (e.g. SDRAM, RDRAM)]
Figure 1: An Integrated Architecture for Reconfigurable Processor Systems
The MorphoSys architecture, shown in Figure 1, comprises a reconfigurable processing unit, a general-
purpose (core) processor and a high bandwidth memory interface, all implemented as a single chip. Given
the nature of target applications, the reconfigurable component is organized in SIMD fashion as an array of
Reconfigurable Cells (RCs). Since most of the target applications possess word-level granularity, the
Reconfigurable Cells (RCs) are also coarse-grain. The core (RISC) processor controls the operation of the
Reconfigurable Cell array (RC Array). A specialized streaming buffer (and controller) handles data transfers
between external memory and the RC Array.
The intent of the MorphoSys implementation is to study the viability of this integrated reconfigurable
computing model to satisfy the increasing demand for low cost stream/frame data processing needed for
important application classes, like video and image processing, multimedia, digital signal processing and
data encryption.
Organization of paper: Section 2 provides brief explanations of some terms and concepts used
frequently in reconfigurable computing, along with a tabular review of previous contributions in this field.
Section 3 introduces the system model for MorphoSys, our prototype reconfigurable computing system.
Section 4 describes the architecture of MorphoSys Reconfigurable Cell Array and associated components.
Next, important differences between previous research work and the MorphoSys system are discussed in
Section 5. Section 6 describes the programming and simulation environment and mView, a graphical user
interface for programming and simulating MorphoSys. Section 7 illustrates the mapping of some applications
(video compression, automatic target recognition, and data encryption) to MorphoSys. Performance
estimates obtained from simulation of behavioral VHDL models are provided for these applications, and
compared with other systems. Finally, some conclusions of this research are mentioned in Section 8.
2. Taxonomy for Reconfigurable Systems and Previous Work
This section introduces a set of criteria that are frequently used to characterize the design of a
reconfigurable computing system. These criteria are granularity, depth of programmability,
reconfigurability, interface and computation model.
(a) Granularity: This refers to the data size for operations of the reconfigurable component (or
reconfigurable processing unit, RPU) of a system. An RPU is a logic block of configurable functionality,
having a framework of reconfigurable interconnect. In fine-grain systems, processing elements in the
RPU are typically logic gates, flip-flops and look-up tables. They operate at the bit level, implementing
a boolean function or a finite-state machine. On the other hand, in coarse-grain systems, the processing
elements in the RPU may contain complete functional units, like ALUs and/or multipliers that operate
upon multiple-bit words. A system that combines both the above types has mixed-grain granularity.
(b) Depth of Programmability: This pertains to the number of configuration programs (or contexts) stored
within the RPU. An RPU may have a single context or multiple contexts. For single-context systems,
only one context is resident in the RPU. Therefore, the RPU’s functionality is limited to the context
currently loaded. On the other hand, a multiple-context RPU has several contexts concurrently residing
in the system. This enables the execution of different tasks simply by changing the operating context
without having to reload the configuration program.
(c) Reconfigurability: An RPU may need to be frequently reconfigured for executing different applications.
Reconfiguration is the process of reloading configuration programs (context). This process is either
static (execution is interrupted) or dynamic (in parallel with execution). A single context RPU typically
has static reconfiguration. Dynamic reconfiguration is more relevant for a multi-context RPU. It
implies that such an RPU can execute some part of its configuration program, while the other part is
being changed. This feature significantly reduces the overhead for reconfiguration.
(d) Interface: A reconfigurable system has a remote interface if the system’s host processor is not on the
same chip/die as the RPU. A local interface implies that the host processor and the co-processor RPU
reside on the same chip, or that the RPU is unified into the datapath of the host processor.
(e) Computation model: Many reconfigurable systems follow the uniprocessor computation model.
However, there are several others that follow SIMD or MIMD computation models ([4], [7], [8] and
[11]). Some systems may also follow the VLIW model [2].
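The cost asymmetry between reloading a context and switching among resident contexts, central to criteria (b) and (c), can be sketched with a toy model; the cycle counts and class interface below are illustrative assumptions, not figures from any real system.

```python
class MultiContextRPU:
    """Toy model of a multiple-context RPU: several configuration
    programs reside on-chip, so changing tasks is a cheap index
    change rather than a full reload (illustrative costs only)."""

    RELOAD_CYCLES = 1000  # assumed cost of loading one context from memory
    SWITCH_CYCLES = 1     # assumed cost of selecting a resident context

    def __init__(self, depth):
        self.contexts = [None] * depth  # depth of programmability
        self.active = 0
        self.cycles = 0                 # accumulated configuration overhead

    def load_context(self, slot, program):
        self.contexts[slot] = program
        self.cycles += self.RELOAD_CYCLES

    def switch(self, slot):
        assert self.contexts[slot] is not None, "context not resident"
        self.active = slot
        self.cycles += self.SWITCH_CYCLES

# Alternate between two tasks: each context is loaded once, after which
# every task change costs one cycle instead of a full reload.
rpu = MultiContextRPU(depth=2)
rpu.load_context(0, "task_A")
rpu.load_context(1, "task_B")
for _ in range(10):
    rpu.switch(0)
    rpu.switch(1)
print(rpu.cycles)  # 2 reloads + 20 switches = 2020
```

A single-context RPU would pay the reload cost on every task change, which is exactly the overhead that multi-context, dynamically reconfigurable designs are meant to hide.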
Conventionally, the most common devices used for reconfigurable computing are field programmable
gate arrays (FPGAs) [1]. FPGAs allow designers to manipulate gate-level devices such as flip-flops, memory
and other logic gates. This makes FPGAs quite useful for complex bit-oriented computations. Examples of
reconfigurable systems using FPGAs are [9], [10], [27], and [29]. However, FPGAs have some
disadvantages, too. They are slower than ASICs, and have inefficient performance for coarse-grained (8 bits
or more) datapath operations. Hence, many researchers have proposed other models of reconfigurable
systems targeting different applications. PADDI [2], MATRIX [4], RaPiD [6], and REMARC [7] are some of
the coarse-grain prototype reconfigurable computing systems. Research prototypes with fine-grain
granularity (but not based on FPGAs) include DPGA [3] and Garp [5]. Table 1 summarizes the
characteristics of various reconfigurable systems according to the criteria introduced above.
Table 1: Classification of Reconfigurable Systems
System          | Granularity  | Programmability             | Reconfiguration | Interface | Comp. Model         | Application domain
Splash/Splash 2 | Fine-grain   | Multiple (for interconnect) | Static          | Remote    | Uniprocessor        | Complex bit-oriented computations
DECPeRLe-1      | Fine-grain   | Single                      | Static          | Remote    | Uniprocessor        | Same as above
Garp            | Fine-grain   | Multiple                    | Static          | Local     | Uniprocessor        | Bit-level image processing, cryptography
OneChip [27]    | Fine-grain   | Single                      | Static          | Local     | Uniprocessor        | Embedded controllers, application accelerators
Chimaera [28]   | Fine-grain   | Single                      | Static          | Local     | Uniprocessor        | Bit-level computations
DISC [29]       | Fine-grain   | Single                      | Dynamic         | Local     | Uniprocessor        | General-purpose
DPGA            | Fine-grain   | Multiple                    | Dynamic         | Remote    | Uniprocessor        | Bit-level computations
RaPiD           | Coarse-grain | Single                      | Mostly static   | Remote    | Pipelined processor | Systolic arrays; regular, computation-intensive
RAW             | Mixed-grain  | Single                      | Static          | Local     | MIMD                | General-purpose computing programs
MATRIX          | Coarse-grain | Multiple                    | Dynamic         | Undefined | MIMD                | Various
PADDI           | Coarse-grain | Multiple                    | Static          | Remote    | VLIW, SIMD          | DSP applications
REMARC          | Coarse-grain | Multiple                    | Static          | Local     | SIMD                | Data-parallel, word-size applications
MorphoSys       | Coarse-grain | Multiple                    | Dynamic         | Local     | SIMD                | Data-parallel, compute-intensive applications
2.1 Reconfigurable Systems versus SIMD Array Processors
Reconfigurable systems, in general, have built upon existing ideas and concepts in the field of computer
architecture. For example, several systems ([2], [7], [8], [11]) have arrays of processing units organized in an
SIMD or MIMD manner. Naturally, these systems have drawn on the research done for SIMD array
processors, such as Illiac-IV [32], and NASA Goodyear MPP [33]. There are many reasons for the re-
emergence of the SIMD approach (after the apparent failure of the erstwhile SIMD processors to gain
widespread usage). The recent years have seen the introduction of many computation-intensive, high-
throughput tasks as mainstream applications. These tasks are performed efficiently on SIMD architectures.
At the same time, device technology has made such astronomical progress that systems that cost billions of
dollars (and involved hundreds of boards) two decades ago can now be cheaply produced on a single chip.
However, there are still many differences between the erstwhile SIMD array processors and SIMD-based
reconfigurable systems. A basic difference is that reconfigurable systems are configurable in terms of
functionality and interconnect. Specifically, MorphoSys offers other significant enhancements when
contrasted with SIMD array processors. It has a multi-level interconnection network (as compared to the straight 2-D
mesh for Illiac-IV). It integrates the reconfigurable SIMD component with a RISC processor core (to
perform both parallel as well as sequential parts of an application). MorphoSys incorporates a streaming
buffer to provide efficient data transfers (no similar component in Illiac-IV), and a separate memory for
storing context data. Finally, MorphoSys utilizes sub-micron technology (0.35 micron) to implement a
system-on-chip (instead of many boards as in Illiac-IV).
3. MorphoSys: Components, Features and Program Flow
[Figure 2 components: TinyRISC core processor with instruction/data cache, RC Array (8x8), Context Memory (2x8x16x32), Frame Buffer (2x2x64x64), and DMA controller on the M1 chip, connected across the chip boundary to external SDRAM (main memory) via 16-, 32-, 64- and 256-bit control/data buses]
Figure 2: Block diagram of MorphoSys implementation (M1 chip)
Figure 2 shows the organization of the integrated MorphoSys reconfigurable computing system. It is
composed of an array of Reconfigurable Cells (RC Array) with configuration memory (Context Memory), a
control processor (TinyRISC), data buffer (Frame Buffer) and DMA controller.
The correspondence between this figure and the architecture in Figure 1 is as follows: the RC Array
with its Context Memory corresponds to the reconfigurable processor array (SIMD co-processor), TinyRISC
corresponds to the Main Processor, and the high-bandwidth memory interface is implemented through Frame
Buffer and DMA Controller. Typically, the core processor (TinyRISC) executes sequential tasks of the
application, while the reconfigurable component (RC Array) is dedicated to the exploitation of parallelism
available in an application’s algorithm.
3.1 System Components
Reconfigurable Cell Array: The main component of MorphoSys is the 8x8 RC (Reconfigurable Cell)
Array, shown in Figure 3. Each RC has an ALU-multiplier and a register file and is configured through a 32-
bit context word. The context words for the RC Array are stored in Context Memory.
Figure 3: MorphoSys 8 x 8 RC Array with 2-D Mesh and Complete Quadrant Connectivity
Host/Control processor: The controlling component of MorphoSys is a 32-bit processor, called
TinyRISC, based on the design of a RISC processor in [12]. TinyRISC handles general-purpose operations
and controls operation of the RC Array through special instructions added to its ISA. It also initiates all data
transfers to or from the Frame Buffer and configuration program load for the Context Memory.
Frame Buffer: An important component is the two-set Frame Buffer, which is analogous to a data cache.
It makes memory accesses transparent to the RC Array, by overlapping computation with data load and
store, alternately using the two sets. MorphoSys performance benefits tremendously from this data buffer. A
dedicated data buffer has been missing in most of the contemporary reconfigurable systems, with consequent
degradation of performance.
3.2 Features of MorphoSys
The RC Array is configured through context words. Each context word specifies an instruction opcode
for the RC. Context words are stored in the Context Memory. The RC Array follows the SIMD model of
computation. All RCs in the same row/column share the same context word. However, each RC operates on
different data. Sharing the context across a row/column is useful for data-parallel applications. In brief,
important features of MorphoSys are:
Coarse-grain: MorphoSys is designed to operate on 8- or 16-bit data, which ensures faster performance
for word-level operations as compared to FPGAs. MorphoSys is free from variable wire propagation delays
that are characteristic of FPGAs.
Dynamic reconfiguration: Context data may be loaded into a non-active part of Context Memory
without interrupting RC Array operation. Context loads/reloads are specified through TinyRISC and done by
DMA controller.
Considerable depth of programmability: The Context Memory can store up to 32 planes of
configuration. The user has the option of broadcasting contexts either across rows or across columns.
Tightly coupled interface with core processor / main memory: The control processor (TinyRISC) and
the reconfigurable component (RC Array) are resident on the same chip. The on-chip DMA controller
enables fast data transfers between main memory and Frame Buffer.
3.3 TinyRISC Instructions for MorphoSys
Several new instructions were introduced in the TinyRISC instruction set for effective control of the
MorphoSys RC Array execution. These instructions are summarized in Table 2. They perform the following
functions:
data transfer between main memory (SDRAM) and Frame Buffer,
loading of context words from main memory into Context Memory, and
control of execution of the RC Array.
There are two categories of these instructions: DMA instructions and RC instructions. The DMA
instruction fields specify load/store, memory address, number of bytes to be transferred and Frame Buffer or
Context Memory address. The RC instruction fields specify the context to be executed, Frame Buffer address
and broadcast mode (row/column, broadcast versus selective).
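The two instruction categories can be pictured as field groupings; the Python field names below are hypothetical labels for the fields the text lists, not the actual M1 bit-level encoding.

```python
from dataclasses import dataclass

@dataclass
class DMAInstruction:
    """Fields of a DMA instruction, as described in the text."""
    is_load: bool      # load versus store
    mem_address: int   # external memory (SDRAM) address
    num_bytes: int     # number of bytes to transfer
    dest_address: int  # Frame Buffer or Context Memory address

@dataclass
class RCInstruction:
    """Fields of an RC instruction, as described in the text."""
    context_index: int  # which context to execute
    fb_address: int     # Frame Buffer address for operands
    row_mode: bool      # row versus column mode
    selective: bool     # broadcast versus selective enabling

# Hypothetical instances: load 512 bytes to the Frame Buffer, then
# run context plane 3 in row-broadcast mode.
ld = DMAInstruction(is_load=True, mem_address=0x1000, num_bytes=512, dest_address=0)
run = RCInstruction(context_index=3, fb_address=0, row_mode=True, selective=False)
print(ld.num_bytes, run.context_index)  # 512 3
```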
Table 2: New TinyRISC Instructions for MorphoSys M1 Chip
Mnemonic                                     | Description of Operation
LDCTXT                                       | Load context words from main memory into Context Memory
LDFB, STFB                                   | Load/store data between main memory and Frame Buffer
CBCAST; SBCB; DBCBSC, DBCBSR; DBCBAC, DBCBAR | Initiate RC Array execution, specifying the context plane and broadcast mode (row/column, broadcast versus selective)
WFB, WFBI                                    |
RCRISC                                       |
3.4 MorphoSys System Operation
Next, we illustrate the typical operation of the MorphoSys system. TinyRISC handles the general-
purpose operations itself. Specific parts of applications, such as multimedia tasks, are mapped to the RC
Array. These parallel parts of applications are executed in the following steps:
(a) Load context words into Context Memory (CM) from external memory
TinyRISC executes the LDCTXT instruction (Table 2) and signals the DMA controller to perform this transfer.
(b) Load computation data into the first set of Frame Buffer (FB) from external memory
TinyRISC executes LDFB instruction (Table 2) and signals DMA controller to perform the transfer.
(c) Execute RC Array and concurrently load data into the second set of FB
TinyRISC issues one of CBCAST, SBCB or DBCB-- instructions (Table 2) each cycle to enable
execution of RC Array in row/column mode (using data from first set of FB) for as long as
computation requires. These instructions specify the particular context word (among multiple context
words in Context Memory) to be executed by the RCs. There are two modes of specifying the context:
column broadcast and row broadcast. For column (row) broadcast, all RCs in the same column (row)
are configured by the same context word. Within this time, TinyRISC also issues a single instruction
(LDFB) to load computation data into the second FB set or a LDCTXT instruction to reload the
Context Memory. Either of these is done concurrently with RC Array execution. This illustrates
overlap of computation with data transfers.
(d) Execute RC Array, concurrently store data from first FB set to memory, load new data into FB set
Once again, TinyRISC issues one of CBCAST, SBCB or DBCB-- instructions (Table 2) each cycle to
enable execution of RC Array in row/column mode for as long as computation requires. Within this
time, TinyRISC also issues instructions (STFB and LDFB) to store data from first FB set into main
memory and load new computation data into the first set of FB.
(e) Continue execution and data/context transfers until completion
Steps (c) and (d) are repeated until the application kernel concludes.
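Steps (a) through (e) amount to a classic double-buffering loop; the sketch below captures only the control flow, with illustrative function names rather than TinyRISC mnemonics.

```python
# While the RC Array computes on one Frame Buffer set, the DMA controller
# (conceptually) drains and refills the other set, as in steps (c) and (d).

def run_kernel(blocks, compute):
    """Process data blocks with double buffering; returns results in order."""
    fb = [None, None]      # the two Frame Buffer sets
    results = []
    fb[0] = blocks[0]      # step (b): preload the first set
    for i, block in enumerate(blocks):
        active, spare = i % 2, (i + 1) % 2
        # steps (c)/(d): load the spare set concurrently with computation
        if i + 1 < len(blocks):
            fb[spare] = blocks[i + 1]
        results.append(compute(fb[active]))
    return results

print(run_kernel([[1, 2], [3, 4], [5, 6]], sum))  # [3, 7, 11]
```

The point of the alternation is that the RC Array never waits on memory: data I/O for block i+1 is hidden behind the computation on block i.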
4. Design of MorphoSys Components
In this section, we describe the major components of MorphoSys: the Reconfigurable Cell, the Context
Memory, the Frame Buffer and the three-level interconnection network of the RC Array. We also briefly
mention some aspects of the ongoing physical implementation of MorphoSys components.
4.1 Architecture of Reconfigurable Cell
The Reconfigurable Cell (RC) array is the programmable core of MorphoSys. It consists of an 8x8 array
(Figure 3) of identical processing elements called Reconfigurable Cells (RCs). Each RC (Figure 4) is the
basic unit of reconfiguration. Its functional model is similar to the datapath of a conventional microprocessor.
[Figure 4 datapath: a 32-bit context word from Context Memory loads the Context Register, which controls the input multiplexers MUXA and MUXB, the ALU+MULT unit (ALU_OP), the shifter (ALU_SFT), the register file R0-R3, the output and feedback registers, and the WR_BUS/WR_Exp write signals]
Figure 4: Reconfigurable Cell Architecture
As Figure 4 shows, each RC comprises an ALU-multiplier, a shift unit, and two multiplexers at the
RC inputs. Each RC also has an output register, a feedback register, and a register file. A context word,
loaded from Context Memory and stored in the Context Register, defines the functionality of the RC.
ALU-Multiplier unit: This includes a 16x12 multiplier and a 16-bit ALU. The ALU adder, however,
has been designed for 28-bit inputs. This prevents loss of precision during multiply-accumulate operations,
since each multiplier output may be as large as 28 bits. Besides standard logic/arithmetic functions, the ALU
has other functions such as computation of absolute value of the difference of two operands and a single
cycle multiply-accumulate operation. There are a total of twenty-five ALU functions.
The two input multiplexers (Figure 4) select one of several inputs for the ALU, based on control bits
from the context word in the Context Register. These inputs are the outputs of the four nearest neighbor RCs,
outputs of other RCs in the same row and column (within the quadrant), express lanes (as explained in
Section 4.3), data bus, and register file ports. The register file is composed of four 16-bit registers. The
output register is 32 bits wide (to accommodate intermediate results of multiply-accumulate instructions). A
flag register indicates sign of input operand at port A of ALU. It is useful when the operation to be performed
depends upon the sign of the operand, as in the quantization step during image compression. The feedback
register allows reuse of previous operands.
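Two of the ALU functions mentioned above can be sketched behaviorally; the function names are ours, and the bit widths follow the text (16-bit operands, a 12-bit multiplier operand, 28-bit products).

```python
MASK16 = (1 << 16) - 1

def abs_diff(a, b):
    """Absolute value of the difference of two 16-bit operands,
    the ALU function used heavily in motion estimation."""
    return abs((a & MASK16) - (b & MASK16))

def mac(acc, a, const12):
    """Single-cycle multiply-accumulate: each 16x12 product fits in
    28 bits, so the wide ALU adder loses no precision before the
    32-bit output register."""
    product = (a & MASK16) * (const12 & 0xFFF)
    assert product < (1 << 28)  # 16-bit x 12-bit product bound
    return acc + product

print(abs_diff(3, 10))   # 7
print(mac(100, 2, 5))    # 110
```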
Context register: This 32-bit register contains the context word for configuring each RC. It is a part of
each RC, whereas the Context Memory is separate from the RC Array (Figure 2).
The different fields for the context word are defined in Figure 5. The field ALU_OP specifies the
function for the ALU-Multiplier unit. The fields MUX_A and MUX_B specify control bits for the input
multiplexers of the RC. Other fields determine the register (of the register file of the RC) in which the result
of an operation is to be stored (REG #), and the direction (RS_LS) and amount of shift (ALU_SFT) applied
at the ALU output. The context word also specifies whether a particular RC writes to its row/column express
lane (WR_Exp). The context word field WR_BUS specifies whether the RC output will be written on the
data bus to the Frame Buffer. The field named Constant is used to supply immediate operands to the ALU-
multiplier unit in each RC. This is useful for operations that involve constants (such as multiplication by a
constant over several computations) in which case this operand can be provided through the context word.
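Packing the fields listed above into one 32-bit word can be illustrated as follows; the field order and widths used here are assumptions for demonstration, with Figure 5 defining the actual layout.

```python
# Illustrative field layout (name, width in bits), most-significant first.
# Only the 12-bit Constant in the low bits is taken from the text; the
# rest of the ordering is assumed.
FIELDS = [
    ("WR_Bus", 1), ("WR_Exp", 1), ("REG", 2), ("RS_LS", 1),
    ("ALU_SFT", 4), ("ALU_OP", 4), ("MUX_A", 3), ("MUX_B", 4),
    ("Constant", 12),
]
assert sum(w for _, w in FIELDS) == 32

def encode(values):
    """Pack a dict of field values into a 32-bit context word."""
    word = 0
    for name, width in FIELDS:
        v = values[name]
        assert 0 <= v < (1 << width)
        word = (word << width) | v
    return word

def decode(word):
    """Unpack a 32-bit context word back into its fields."""
    values = {}
    for name, width in reversed(FIELDS):
        values[name] = word & ((1 << width) - 1)
        word >>= width
    return values

ctx = {"WR_Bus": 1, "WR_Exp": 0, "REG": 2, "RS_LS": 1, "ALU_SFT": 3,
       "ALU_OP": 9, "MUX_A": 4, "MUX_B": 7, "Constant": 255}
assert decode(encode(ctx)) == ctx
```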
[Figure 5: the 32-bit context word comprises the fields WR_Bus, WR_Exp, REG #, RS_LS, ALU_SFT, ALU_OP, MUX_A, MUX_B, and a 12-bit Constant (bits 11-0)]
Figure 5: RC context word definition
4.2 Context Memory
The Context Memory provides context words to the RC Array in each cycle of execution. These context
words configure the RC and are also the basis for programming the interconnection network. The context
word from the Context Memory is loaded into the Context Register in each RC.
Context Memory organization: The Context Memory is organized into two blocks (for row and column
contexts) which store the contexts for row-wise and column-wise operation of the RC Array respectively.
Each block has eight sets, with each set corresponding to a row/column of the RC Array. Each set can store
sixteen context words. The RC Array configuration plane (set of context words to program entire RC Array
for one cycle) comprises eight context words (one from each set) from a row or column block. Thus, sixteen
configuration planes may be stored in each block of the Context Memory, for a total of thirty-two
configuration planes.
Context broadcast: For MorphoSys, the major focus is on regular and data-parallel applications. Based
on this idea of regularity/parallelism, each context word is broadcast to a row (column) of RCs. Thus, all
eight RCs in a row (column) share the same context word, and perform the same operations. For example, for
DCT computation, eight 1-D DCTs need to be computed, across eight rows. This is achieved with just eight
context words to program the RC Array for each step of the computation and it takes 10 cycles (only 80
context words) to complete a 1-D DCT (refer Section 7.1.2). Broadcast of context enables storage of more
context planes in the Context Memory than the case in which a separate context word is used for each RC.
Dynamic reconfiguration: The Context Memory can be updated concurrently with RC Array execution.
There are 32 context planes and this depth facilitates dynamic (run-time) reloading of the contexts. Dynamic
reconfiguration allows reduction of effective reconfiguration time to zero.
Selective context enabling: This feature implies that it is possible to enable one specific row or column
for operation in the RC Array. One benefit of this feature is that it enables transfer of data to/from the RC
Array, using only one context plane. Otherwise eight context planes (out of the 32 available) would have
been required just to read or write data.
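Row-mode context broadcast can be sketched in a few lines; the function below is an illustration of the sharing scheme described above, not a model of the hardware.

```python
def broadcast_rows(row_contexts):
    """Expand 8 per-row context words into an 8x8 plane of RC contexts:
    every RC in a row shares its row's context word."""
    assert len(row_contexts) == 8
    return [[ctx] * 8 for ctx in row_contexts]

# One configuration plane now needs 8 context words, not 64.
plane = broadcast_rows(list(range(8)))
assert len(plane) == 8 and all(len(row) == 8 for row in plane)
assert plane[3] == [3] * 8  # every RC in row 3 shares context word 3
```

Column broadcast is symmetric, and selective context enabling corresponds to activating only one row (or column) of such a plane.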
4.3 Interconnection Network
The RC interconnection network comprises three hierarchical levels.
RC Array mesh: The underlying network throughout the array (Figure 3) is a 2-D mesh. It provides
nearest neighbor connectivity.
Intra-quadrant (complete row/column) connectivity: The second layer of connectivity is at the
quadrant level (a quadrant is a 4x4 RC group). The RC Array has four quadrants (Figure 3). Within each
quadrant, each cell can access the output of any other cell in its row/column (Figure 3).
Inter-quadrant (express lane) connectivity: At the global level, there are buses between adjacent
quadrants. These buses (express lanes) run across rows and columns. Figure 6 shows express lanes for one
row of the RC Array. These lanes provide data from any one cell (out of four) in a row (column) of a
quadrant to other cells in an adjacent quadrant but in the same row (column). Thus, up to four cells in a row
(column) can access the output value of one of four cells in the same row (column) of an adjacent quadrant.
Figure 6: Express lane connectivity (between cells in same row, but adjacent quadrants)
The express lanes greatly enhance global connectivity. Irregular communication patterns can be handled
quite efficiently. For example, an eight-point butterfly operation is accomplished in only three cycles.
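The three connectivity levels can be summarized as a reachability predicate over the 8x8 array; this is a simplification (the real express lanes expose only one of the four source cells at a time), written only to make the hierarchy concrete.

```python
def connected(r1, c1, r2, c2):
    """Can RC (r1, c1) read the output of RC (r2, c2) directly?"""
    same_quad = (r1 // 4 == r2 // 4) and (c1 // 4 == c2 // 4)
    # Level 1: 2-D mesh, nearest-neighbor connectivity
    if abs(r1 - r2) + abs(c1 - c2) == 1:
        return True
    # Level 2: complete row/column connectivity within a quadrant
    if same_quad and (r1 == r2 or c1 == c2):
        return True
    # Level 3: express lanes between quadrants, same row or column
    if not same_quad and (r1 == r2 or c1 == c2):
        return True
    return False

assert connected(0, 0, 0, 3)      # same row, same quadrant
assert connected(0, 0, 0, 7)      # same row, adjacent quadrant (express lane)
assert not connected(0, 0, 3, 3)  # neither a neighbor nor row/column aligned
```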
4.4 Frame Buffer
The Frame Buffer (FB) is an internal data memory logically organized into two sets, called Set 0 and
Set 1. The Frame Buffer has this two set organization in order to be able to provide overlap of computation
with data transfers. One of the two sets provides computation data to the RC Array (and also stores processed
data from the RC Array) while the other set stores processed data into the external memory through the
DMA controller and reloads data for the next round of computations. These operations proceed concurrently,
thus preventing the latency of data I/O from adversely affecting system performance. Each set has 128 rows
of 8 bytes each; the FB therefore holds a total of 128 × 16 bytes (2 KB).
4.5 Physical Implementation
MorphoSys M1 is being implemented using both custom and standard cell design methodologies for
0.35 micron, four metal layer CMOS (3.3V) technology. The main constraint for this implementation is a
clock period of 10 ns (100 MHz freq.). The total area of the chip is estimated to be 180 sq. mm. The layout
for the Reconfigurable Cell (20000 transistors, area 1.5 sq. mm) is now complete. It has been simulated at the
transistor level using an electrical simulator (HSPICE) with appropriate output loads due to fanout and
interconnect lengths to obtain accurate delay values. The multiplier (approx. 10000 transistors) delay is 4 ns
and the ALU (approx. 6500 transistors) delay is 3 ns. The critical path delay in an RC (which corresponds to a
single cycle multiply-accumulate operation) is less than 9 ns. Similarly, the TinyRISC, Frame Buffer,
Context Memory and DMA controller are also being designed to perform within the 10 ns clock constraint.
Preliminary estimates for area/delay are: TinyRISC (100,000 transistors, delay: 10 ns), Frame Buffer
(150,000 transistors, access time: 10 ns), and the Context Memory (100,000 transistors, access time: 5 ns).
The three-level interconnection network is made feasible by the four metal layer technology. Simulations of
the network indicate that with use of appropriate buffers at RC outputs, interconnect delays can be limited to
1 ns. Thus, it is reasonable to expect that M1 will perform at its anticipated clock rate of 100 MHz.
5. Comparison with Related Research
Since MorphoSys architecture falls into the category of coarse-grain reconfigurable systems, it is
meaningful to compare it with other coarse-grain systems (PADDI [2], MATRIX [4], RaPiD [6], REMARC
[7] and RAW [8]). Many of the designs mentioned above have not actually been implemented, whereas
MorphoSys has been developed down to physical layout level.
PADDI [2] has a distinct VLIW nature because each EXU uses a 53-bit instruction word (which may be
different for different EXUs), whereas MorphoSys exhibits SIMD functionality (each column/row performs
the same function). PADDI cannot be configured dynamically (its instruction decoders are SRAMs that are
loaded at setup time), while MorphoSys has dynamic reconfiguration. MorphoSys has a greater depth of
programmability (32 contexts) than PADDI (eight). The latter employs a crossbar-based interconnection network
whereas MorphoSys uses 2-D mesh and a hierarchical bus network. PADDI is a stand-alone system targeting
DSP applications, does not employ any streaming buffer to speed up I/O transfers and is not integrated with
any core processor. But the reconfigurable component in MorphoSys is integrated with a general-purpose
controller (as are REMARC, Garp, etc.) and includes the Frame Buffer for efficient data transfers.
The MATRIX [4] approach proposes the design of a basic functional unit (BFU) for a reconfigurable
system. This 8-bit BFU unifies the resources used for instruction storage with the resources needed for
computation and data storage, assumes a three-level interconnection network, and may be configured for
operation in VLIW or SIMD fashion. However, this approach is too generic, and may potentially increase
control complexity. Further, a complete system organization based on the BFU is not presented, whereas
MorphoSys is a well-defined system. This leaves many system-level issues, such as integration with a host
processor, external memory interface, I/O parameters, performance, and reconfiguration (static or
dynamic), open to conjecture.
The BFU design can be pipelined, but the hierarchical switch-based interconnection network (similar to
FPGA interconnect) has variable interconnect delay. This becomes a limiting factor for stream processing,
since two concurrent streams could have different processing times (due to different interconnect delays for
each). In contrast, the interconnection network of MorphoSys has uniform delays. Further, in the absence of
a complete system, MATRIX does not provide any comparison of performance on target applications with
other systems (as is done for MorphoSys and most other reconfigurable systems).
The RaPiD [6] design is organized as a linear array of reconfigurable processing units, which is not
appropriate for block-based applications (for example, 2-D signal processing tasks). This approach
exemplifies provision of datapath parallelism in the temporal domain, whereas MorphoSys provides
parallelism in the spatial domain. Due to its organization, the potential applications of RaPiD are those of a
systolic nature or applications that can be easily pipelined. Once again, a complete system implementation is
not described. At the very least, the issue of memory stalls (mentioned in passing by the authors) could be a
significant bottleneck; MorphoSys uses the combination of the Frame Buffer and the DMA controller to
address this issue. The performance analysis does mention some applications from the target domain of
MorphoSys (DCT, motion estimation, etc.), but estimated performance figures (in cycle counts) are not given.
The REMARC system [7] is similar in design to the MorphoSys architecture and targets the same class of
data-parallel and high-throughput applications. Like MorphoSys, it consists of 64 programmable units
(organized in a SIMD manner) that are tightly coupled to a RISC processor. REMARC also uses a modified
MIPS-like ISA for the RISC (as in the case of MorphoSys) to control the reconfigurable co-processor.
However, REMARC's reconfigurable component is configured statically, it lacks a direct interface to
external memory, and data transfers cannot be overlapped with computation. It does have the capability for
MIMD operation, but there is no evaluation of applications that could demonstrate the usefulness of this
mode for REMARC. Performance figures for application kernels (e.g. 2-D IDCT) show that REMARC is
significantly slower than MorphoSys (see Sections 7.1.2 and 7.1.3).
The RAW [8] design implements a highly parallel architecture as a Reconfigurable Architecture
Workstation (RAW). The architecture is organized in a MIMD manner with multiple instruction streams. It
has multiple RISC processors, each having fine-grain logic as the reconfigurable component. Some important
differences are: RAW has variable communication overhead between units; its target applications may be
irregular or general-purpose (instead of data-parallel, regular, data-intensive applications); and coarse-grain
functional reconfiguration is absent. Moreover, the architecture has a dispersed nature and does not exhibit
the close coupling that is present between the RC Array elements. This may have an adverse effect on
performance for high-throughput applications that involve many data exchanges.
As is evident, all the above systems vary greatly. The MorphoSys architecture puts together, in a
cohesive structure, the most prominent features of previous reconfigurable systems (coarse granularity,
SIMD organization, depth of programmability, multi-level configurable interconnection network, and
dynamic reconfiguration). It then adds some innovative features (control processor with modified ISA,
streaming buffer that allows overlap of computation with data transfer, row/column broadcast, selective
context enabling) while avoiding many of the pitfalls (single contexts, I/O bottlenecks, static
reconfiguration, remote interface) of previous systems. In this sense, MorphoSys is a unique implementation.
In summary, the important features of the MorphoSys architecture are:
Integrated model: Except for main memory, MorphoSys is a complete system-on-chip.
Innovative memory interface: In contrast to other prototype reconfigurable systems, MorphoSys employs
an efficient scheme for high data throughput using a two-set data buffer.
Multiple contexts on-chip: Having multiple contexts enables fast single-cycle reconfiguration.
On-chip general-purpose processor: This processor, which also serves as the system controller, allows
efficient execution of complex applications that include both serial and parallel tasks.
To the best of our knowledge, the MorphoSys architecture, as described earlier, is unique with respect to
other published reconfigurable systems.
6. Programming and Simulation Environment
6.1 Behavioral VHDL Model
The MorphoSys reconfigurable system has been specified in behavioral VHDL. The system components,
namely the 8x8 Reconfigurable Array, the 32-bit TinyRISC host processor, the Context Memory, the Frame
Buffer and the DMA controller, have been modeled for complete functionality. The unified model has been
simulated for various applications using the QuickVHDL simulation environment. These simulations utilize
several test-benches, real-world input data sets, a simple assembler-like parser for generating the
context/configuration instructions, and assembly code for TinyRISC. Figure 7 depicts the simulation
environment with its different components.
6.2 GUI for MorphoSys: mView
A graphical user interface, mView, takes user input for each application (specification of operations and
data sources/destinations for each RC) and generates assembly code for the MorphoSys RC Array. It is also
used for studying MorphoSys simulation behavior. This GUI, based on Tcl/Tk [13], displays graphical
information about the functions being executed at each RC, the active interconnections, the sources and
destinations of operands, usage of the data buses and the express lanes, and the values of RC outputs. It has
several built-in features that allow visualization of RC execution, interconnect usage patterns for different
applications, and single-step simulation runs with backward, forward and continuous execution. It operates in
one of two modes: programming mode or simulation mode.
Figure 7: Simulation Environment for MorphoSys, with mView display
In the programming mode, the user sets functions and interconnections for each row/column of the
RC Array corresponding to each context (row/column broadcasting) for the application. mView then
generates a context file that represents the user-specified application.
In the simulation mode, mView takes a context file, or a simulation output file as input. For either of
these, it provides a graphical display of the state of each RC as it executes the application represented by the
context/simulation file. mView is a valuable aid to the designer in mapping algorithms to the RC Array. Not
only does mView significantly reduce the programming time, but it also provides low-level information about
the actual execution of applications in the RC Array. This feature, coupled with its graphical nature, makes it
a convenient tool for verifying and debugging simulation runs.
6.3 Context Generation
For system simulation, each application has to be coded into context words and TinyRISC instructions.
For the former, an assembler-parser, mLoad, generates contexts from programs written in the RC instruction
set by the user or generated through mView. The next step is to determine the sequence of TinyRISC
instructions for appropriate operation of RC Array, timely data input and output, and to provide sample data.
Once a sequence is determined, and data procured, test-benches are used to simulate the system.
6.4 Code Generation for MorphoSys
An important aspect of our research is an ongoing effort to develop a programming environment for
automatic mapping and code generation for MorphoSys. A prototype compiler that compiles hybrid code for
MorphoSys M1 (from C source code, serial as well as parallel) has been developed using the SUIF compiler
environment [14]. The compilation is done after partitioning the code between the TinyRISC processor and
the RC Array. Currently, this partitioning is accomplished manually by inserting a particular prefix to
functions that are to be mapped to the RC Array. The compiler generates the instructions for TinyRISC
(including instructions to control the RC Array). Another issue under focus is the generation of assembled
context programs from applications (coded in C) for MorphoSys. At an advanced development stage,
MorphoSys would perform online profiling of applications and dynamically adjust the reconfiguration
profile for enhanced efficiency.
7. Mapping Applications to MorphoSys
In this section, we discuss the mapping of video compression, Automatic Target Recognition (ATR) and
data encryption/decryption algorithms to the MorphoSys architecture. Video compression has a high degree
of data-parallelism and tight real-time constraints. ATR is one of the most computation-intensive
applications. The International Data Encryption Algorithm (IDEA) [30] for data encryption is typical of
data-intensive applications. We also provide performance estimates for these applications based on VHDL
simulations. Pending the development of an automatic mapping tool, all these applications were mapped to
MorphoSys either using the mView GUI or manually.
7.1 Video Compression (MPEG)
Video compression is an integral part of many multimedia applications. In this context, MPEG
standards [15] for video compression are important for realization of digital video services, such as video
conferencing, video-on-demand, HDTV and digital TV.
As depicted in Figure 8, the functions required of a typical MPEG encoder are:
Preprocessing: for example, color conversion to YCbCr, prefiltering and subsampling.
Motion Estimation and Compensation: After preprocessing, motion estimation of image pixels is done to
remove temporal redundancies between successive frames (predictive coding of P-type and B-type frames).
Algorithms such as Full Search Block Matching (FSBM) may be used for motion estimation.
Transformation and Quantization: Each macroblock (typically consisting of 6 blocks of size 8x8 pixels)
is then transformed using the Discrete Cosine Transform (DCT). The resulting DCT coefficients are
quantized to enable compression.
Zigzag scan and VLC: The quantized coefficients are rearranged in a zigzag manner (in order of low to
high spatial frequency) and compressed using variable length encoding.
Inverse Quantization and Inverse Transformation: The quantized blocks of I and P type frames are
inverse quantized and transformed back by the Inverse Discrete Cosine Transform (IDCT). This
operation yields a copy of the picture, which is used for future predictive coding.
Figure 8: Block diagram of an MPEG Encoder
Next, we discuss two major functions (motion estimation using FSBM and transformation using
DCT) of the MPEG video encoder, as mapped to MorphoSys. Finally, we discuss the overall performance of
MorphoSys for the entire compression encoder sequence (Note: VLC operations are not mapped to
MorphoSys, but Section 7.1.3 shows that adequate time is available to execute VLC after finishing the other
computations involved in MPEG encoding).
7.1.1 Video Compression: Motion Estimation for MPEG
Motion estimation is widely adopted in video compression to identify redundancy between frames.
The most popular technique for motion estimation is the block-matching algorithm, because of its simple
hardware implementation [17]. Some standards also recommend this algorithm. Among the different block-
matching methods, Full Search Block Matching (FSBM) requires the most computation. However, FSBM
gives an optimal solution with low control overhead.
Typically, FSBM is formulated using the mean absolute difference (MAD) criterion as follows:

MAD(m, n) = Σ_{i=1..N} Σ_{j=1..N} | R(i, j) − S(i+m, j+n) |,  given −p ≤ m, n ≤ q,

where p and q are the maximum displacements, R(i, j) is the reference block of size N x N pixels at
coordinates (i, j), and S(i+m, j+n) is the candidate block within a search area of size (N+p+q)^2 pixels in the
previous frame. The displacement vector is represented by (m, n), and the motion vector is determined by
the least MAD(m, n) among all the (p+q+1)^2 possible displacements within the search area.
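To make the criterion concrete, the following sketch (a plain software reference, not the RC Array mapping; indices are simplified to 0-based offsets, so displacements run over 0..p+q rather than −p..q) computes the MAD and the full-search motion vector:

```python
def mad(ref, search, m, n, N=16):
    # Sum of absolute differences between the N x N reference block and
    # the candidate block at displacement (m, n) in the search area.
    return sum(abs(ref[i][j] - search[i + m][j + n])
               for i in range(N) for j in range(N))

def motion_vector(ref, search, p, q, N=16):
    # Full search: evaluate all (p + q + 1)^2 displacements and return
    # the one with the least MAD.
    candidates = ((m, n) for m in range(p + q + 1) for n in range(p + q + 1))
    return min(candidates, key=lambda mn: mad(ref, search, mn[0], mn[1], N))
```

With N = 16 and p = q = 8, this evaluates the (8+8+1)^2 = 289 candidate positions per reference block considered below.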
Figure 9 shows the configuration of the RC Array for FSBM computation. Initially, one reference block
and the search area associated with it are loaded into one set of the Frame Buffer. The RC Array starts the
matching process for the reference block resident in the Frame Buffer. During this computation, another
reference block and its associated search area are loaded into the other set of the Frame Buffer. In this
manner, data loading and computation are overlapped.
For each reference block, three consecutive candidate blocks are matched concurrently in the RC
Array. As depicted in Figure 9, each RC in the first, fourth, and seventh rows performs the computation

P_j = Σ_{i=1..16} | R(i, j) − S(i+m, j+n) |,

where P_j is the partial sum. Data from a row of the reference block is sent to the first row of the RC Array
and passed to the fourth row and seventh row through delay elements. The eight partial sums (P_j) generated
in these rows are then passed to the second, third, and eighth rows respectively to perform

MAD(m, n) = Σ_{j=1..16} P_j.
Subsequently, three MAD values corresponding to three candidate blocks are sent to TinyRISC for
comparison, and the RC Array starts block matching for the next three candidate blocks.
Figure 9: Configuration of RC Array for Full Search Block Matching
Computation cost: Based on the computation model shown above, and using N = 16 for a reference
block size of 16x16 pixels, it takes 36 clock cycles to finish the matching of three candidate blocks. 16 cycles
are needed for comparing MAD results after every three block comparisons and updating the motion vector
for the best match. There are 289 candidate blocks (102 iterations) in each search area, and VHDL simulation
results show that a total of 5304 cycles are required to match the search area. For an image frame size of
352x288 pixels at 30 frames per second (MPEG-2 main profile, low level), the processing time is
2.1 x 10^6 cycles. The computation time for MorphoSys (@ 100 MHz) is 21 ms. This is smaller than the
frame period of 33.33 ms. The context loading time is only 71 cycles, and since there are a large number of
computation cycles before the configuration is changed, this overhead is negligible.
Performance Analysis: MorphoSys performance is compared with three ASIC architectures
implemented in [17], [18], and [19] for matching one 8x8 reference block against its search area of 8 pixels
displacement. The result is shown in Figure 10. The ASIC architectures have the same processing power (in
terms of processing elements) as MorphoSys, though they employ customized hardware units such as parallel
adders to enhance performance. The number of processing cycles for MorphoSys is comparable to the cycles
required by the ASIC designs. Since MorphoSys is not an ASIC, its performance relative to these ASICs
is significant. In Section 7.1.3, it is shown that this performance level enables implementation of an MPEG-2
encoder on MorphoSys.
Using the same parameters as above, the Pentium MMX [20] takes almost 29,000 cycles for the same task.
When scaled for clock speed and the same technology (the fastest Pentium MMX fabricated in 0.35 micron
technology operates at 233 MHz), this amounts to more than a 10X difference in performance.
Figure 10: Performance Comparison for Motion Estimation (cycles): ASIC [17]: 581; ASIC [18]: 631;
ASIC [19]: 1159; MorphoSys M1 (64 RCs): 1020
7.1.2 Video Compression: Discrete Cosine Transform (DCT) for MPEG
The forward and inverse DCT are used in MPEG encoders and decoders. In the following analysis, we
consider an algorithm for a fast eight-point 1-D DCT [21]. This algorithm involves 16 multiplies and 26
adds, resulting in 256 multiplies and 416 adds for a 2-D implementation. The 1-D algorithm is first applied to
the rows (columns) of an input 8x8 image block, and then to the columns (rows). Eight row (column) DCTs
may be computed in parallel.
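The row-column decomposition can be illustrated as follows (a direct-form reference sketch; the fast 16-multiply algorithm of [21] computes the same 8-point transform with fewer operations):

```python
import math

def dct1d(x):
    # Direct-form 8-point 1-D DCT-II with orthonormal scaling.
    N = len(x)
    out = []
    for k in range(N):
        c = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
        out.append(c * sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                           for n in range(N)))
    return out

def dct2d(block):
    # Separability: 1-D DCT along each row, then along each column.
    rows = [dct1d(r) for r in block]
    cols = [dct1d([rows[i][j] for i in range(8)]) for j in range(8)]
    return [[cols[j][i] for j in range(8)] for i in range(8)]
```

On the RC Array, the eight row transforms (and later the eight column transforms) run in parallel, one per row (column) of cells.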
Mapping to RC Array: The standard block size for DCT in most image and video compression
standards is 8x8 pixels. Since the RC Array has the same size, each pixel of the image block is mapped
directly to one RC.
Sequence of steps:
Load input data: An 8x8 block is loaded from the Frame Buffer to the RC Array. The data bus between
the Frame Buffer and the RC Array allows concurrent loading of eight pixels, so an entire block is loaded in
8 cycles. The same number of cycles is required to write the processed data back to the Frame Buffer.
Row-column approach: Using the separability property, the 1-D DCT along rows is computed first. For the
row (column) mode of operation, the configuration context is broadcast along columns (rows). Different RCs
within a row (column) communicate using the three-level interconnection network to compute the 1-D DCT.
The coefficients needed for computation are provided as constants in the context words. When the 1-D DCT
along rows (columns) is completed, the 1-D DCT along columns (rows) is computed in a similar manner
(Figure 11).
Figure 11: Computation of 2-D DCT across rows/columns (without transposing); each 1-D DCT pass takes
10 cycles
Each sequence of 1-D DCTs [21] involves:
i. Butterfly computation: this requires three cycles, using the express lanes (inter-quad connectivity).
ii. Computation and re-arrangement: this takes six cycles, with an extra cycle for re-arrangement of the
computed results.
Computation cost: The cost of computing the 2-D DCT on an 8x8 block of the image is as follows: 6 cycles
for the butterfly operations, 12 cycles for both 1-D DCT computations, 3 cycles for re-arrangement/scaling
of data, and 16 cycles for data I/O (a total of 37 cycles). This estimate is verified by VHDL simulation.
Assuming the data blocks are present in the Frame Buffer (through overlapping of data load/store with
computation cycles), it takes 0.8 ms for MorphoSys @ 100 MHz to compute the DCT for the 2376 (396x6)
blocks of 8x8 pixels in one frame of a 352x288 image. The cost of computing the 2-D IDCT is the same
because the operations involved are similar. The context loading time is quite significant at 270 cycles.
However, the transformation of a large number of blocks (typically 2376) before a different configuration is
loaded minimizes this effect.
Performance analysis: MorphoSys requires 37 cycles to complete a 2-D DCT (or IDCT) on an 8x8 block
of pixel data. This is in contrast to 240 cycles required by the Pentium MMX [20]. Scaling for Pentium clock
speed (233 MHz for the same 0.35 micron technology as MorphoSys) would result in 103 effective cycles
for the Pentium. A dedicated superscalar multimedia processor, the V830R/AV [22], requires 201 clocks for
IDCT, and REMARC [7] takes 54 cycles. A DSP video processor, the TMS320C80 MVP [16], needs 320
cycles. The relative performance figures for MorphoSys and other implementations are given in Figure 12.
Figure 12: DCT/IDCT Performance Comparison (cycles): MorphoSys M1: 37; REMARC: 54; Pentium
MMX (scaled): 103; V830R/AV: 201; TMS320C80 MVP: 320
Notably, MorphoSys performance scales linearly with the array size. For a 256-element RC Array, the
number of operations possible per second would increase fourfold, with a corresponding effect on
throughput for the 2-D DCT and other algorithms. The performance figures (in GOPS) are summarized in
Figure 13 and are more than 50% of the peak values. The figures are scaled for future generations of
MorphoSys M1, conservatively assuming a constant clock of 100 MHz.
Figure 13: Performance for DCT/IDCT in Giga Operations per Second (GOPS): MorphoSys M1 with
64/128/256 RCs achieves 3.36/6.72/13.44 DCT GOPS, against peak values of 6.4/12.8/25.6 GOPS
Some other points are worth noting. First, all rows (columns) perform the same computations, hence
they can be configured by a common context (enabling broadcast of the context word), which leads to
savings in Context Memory space. Second, the RC Array provides the option of broadcasting context either
across rows or across columns. This allows computation of the second round of 1-D DCTs without
transposing the data. Elimination of the transpose operation saves a considerable number of cycles and is
important for high performance. The transpose operation generally consumes valuable cycle time. For
example, even a hand-optimized version of IDCT code for the Pentium MMX (which uses 64-bit registers)
needs at least 25 register-memory instructions to complete the transpose [20]. Processors such as the
TMS320 series [16] also expend some cycle time on transposing data.
Precision analysis for IDCT: Experiments were conducted for measuring the precision of
MorphoSys IDCT output values as specified in the IEEE Standard [23]. Considering that MorphoSys is not
an ASIC, and performs fixed-point operations, the results were impressive. The worst case pixel error was
satisfied and the Overall Mean Square Error (OMSE) was within 15% of the reference value. The majority of
pixel locations also satisfied the worst case reference values for mean error and mean square error.
Zigzag Scan: The zigzag scan function has also been implemented, even though MorphoSys is not
designed for applications that involve irregular access patterns. The selective context enabling feature of the
RC Array was used to generate an efficient mapping. The fact that an application quite different from the
targeted applications could still be mapped to the MorphoSys architecture provides evidence of the
flexibility of the MorphoSys model.
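For reference, the zigzag ordering itself is a fixed permutation, sketched below (a generic illustration of the scan pattern, not the RC Array mapping):

```python
def zigzag_order(n=8):
    # Enumerate (row, col) positions of an n x n block along anti-diagonals
    # in alternating directions, i.e., from low to high spatial frequency.
    order = []
    for s in range(2 * n - 1):                       # s = row + col
        diag = [(i, s - i) for i in range(n) if 0 <= s - i < n]
        if s % 2 == 0:
            diag.reverse()                           # even diagonals scan bottom-to-top
        order.extend(diag)
    return order
```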
7.1.3 Mapping MPEG-2 Video Encoder to MorphoSys
Because of the computation-intensive nature of motion estimation, only dedicated processors or ASICs
have hitherto been used to implement MPEG video encoders. Most reconfigurable systems, DSP processors
and multimedia processors (e.g. [16]) consider only MPEG decoding or a sub-task (e.g. IDCT). Our
mapping of the complete MPEG encoder to MorphoSys is perhaps the first time that a reconfigurable system
has been able to meet the high throughput requirements of the MPEG video encoder.
We mapped all the functions for MPEG-2 video encoder, except VLC encoding, to MorphoSys. We
assume that the Main profile (low level) is being used. The maximum resolution at this level is 352x288
pixels per frame at 30 frames per second. The group of pictures consists of a sequence of four frames in the
order IBBP (a typical choice for broadcasting applications). The number of cycles required for each task of
the MPEG encoder, for each macroblock type is listed in Table 3. Besides actual computation, the number of
cycles required for loading configuration and data from memory is also included in the calculations.
Table 3: Performance Figures of MorphoSys M1 (64 RCs) for I, P and B Macro-blocks

Macroblock type /       Motion Estimation            Motion Comp., DCT and Quant.
MPEG functions                                       (/ Inv. Quant., IDCT, inv. MC)
(in clock cycles)    Context  Mem Ld  Compute      Context    Mem Ld    Compute
I type macroblock       0        0        0        270/270    234/234   264/264
P type macroblock      71      334     5304        270/270    351/351   264/264
B type macroblock      71      597    10608        270/0      468/0     306/0
All macro-blocks in each P frame and B frame are first subjected to motion estimation. Then motion
compensation, DCT and quantization are performed on each macroblock in a frame. The processed
macroblocks are sent to frame storage in main memory. Finally, we perform inverse quantization, inverse
DCT and reverse motion prediction for each macroblock of I frames and P frames. Each frame has 396
macroblocks, and clock cycles required for encoding each frame type are depicted in Figure 14. It may be
noted that motion estimation takes up almost 90% of the computation time for P and B type frames.
Figure 14: MorphoSys performance for I, P and B frames in cycles, broken down into Motion Est.,
MC/DCT/Q, and Inv Q/IDCT/Inv MC (MPEG Video Encoder)
From the data in Figure 14, and assuming the IBBP frame sequence, the total encoding time is 117.3 ms.
This is 88% of the available time (134 ms). From empirical data values in [22], the remaining 12% of the
available time is sufficient to compute the VLC. Table 4 shows that the figures for the MorphoSys MPEG
encoder (without VLC) are one to two orders of magnitude less than the corresponding figures for
REMARC [7]. The algorithm (FSBM) for motion estimation, which is the major computation, is the same
for REMARC and MorphoSys.
Table 4: Comparison of MorphoSys MPEG Encoder with REMARC [7] MPEG Encoder

Frame Type    Total clock cycles for       Clock cycles for REMARC [7]
              MorphoSys M1 (64 RCs)        (64 nano-processors)
I frame              209,628                    52.9 x 10^6
P frame            2,378,987                    69.6 x 10^6
B frame            4,572,035                    81.5 x 10^6
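The 117.3 ms encoding time quoted above follows directly from the MorphoSys column of Table 4 (a sanity-check sketch, assuming the IBBP group of pictures and a 100 MHz clock):

```python
# Per-frame cycle counts for MorphoSys M1 (64 RCs), from Table 4.
i_frame, p_frame, b_frame = 209_628, 2_378_987, 4_572_035
group_cycles = i_frame + p_frame + 2 * b_frame   # one IBBP group of pictures
time_ms = group_cycles / 100e6 * 1e3             # ~117.3 ms at 100 MHz
available_ms = 4 / 30 * 1e3                      # four frames at 30 fps: ~133.3 ms
```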
7.2 Automatic Target Recognition (ATR)
Automatic Target Recognition (ATR) is the machine function of automatically detecting, classifying,
recognizing, and identifying an object. The ACS Surveillance challenge has been quantified as the ability to
search 40,000 square nautical miles per day with one-meter resolution [24]. The computation levels when
targets are partially obscured reach the hundreds-of-teraops range. There are many algorithmic choices
available for implementing an ATR system.
Figure 15: ATR Processing Model
The ATR processing model developed at Sandia National Laboratory is shown in Fig. 15 [25, 26].
This model was designed to detect partially obscured targets in Synthetic Aperture Radar (SAR) images
generated by the radar imager in real time. SAR images (8-bit pixels) are input to a focus-of-attention
processor to identify regions of interest (called chips). These chips are thresholded to generate binary images
and the binary images are then matched against binary target templates. Target templates appear in pairs of a
bright and a surround template. The bright template identifies locations where a strong radar return is
expected, while the surround template identifies locations where strong radar absorption is expected.
Based on [25], the sequence of steps is as follows: first, the 128 x 128 x 8-bit chip is sliced into eight
bitplanes to compute the shapesum, which is a weighted sum of the eight results obtained by correlating each
bitplane with the bright template. Once the shapesum is generated, the next step is correlating the actual
target templates with the chip. The correlation is performed on eight different binary images that are
generated by applying eight threshold values to the chip. The binary images are correlated with both the
bright and surround template pairs to generate eight pairs of correlation results, and the shapesum is used to
select one of the eight results. The selected pair of results is subsequently forwarded to the peak detector.
Both shapesum computation and target template matching, which are the most computation intensive
steps in the ATR processing model, require bit correlation. Figure 16 illustrates the operation of the bit
correlator implemented in MorphoSys M1. Each row of the 8x8 target template is packed as an 8-bit number
and loaded in the RC Array. All the candidate blocks in the chip are correlated with the target template.
Each column of the RC Array performs correlation of one target template with one candidate block, hence
eight templates are correlated concurrently in the RC Array.
R esu lt
8 -b its te m p la te d ata 1 6 -b its b in ary im a g e d a ta
A N D
O n es C o u n te r
Figure 16: Matching Process in Each RC
In order to perform bit-level correlation, two bytes (16 bits) of image data are input to each RC. In
the first step, the eight most significant bits of the image data are ANDed with the template data, and a
special adder tree (implemented as custom hardware in each RC) counts the number of ones in the ANDed
output to generate the correlation result. Then, the image data is shifted left by one bit and the process is
repeated to perform matching of the second block. After the image data has been shifted eight times, a new
16-bit word is loaded and the RC starts another correlation of eight consecutive candidate blocks. With this
processing model, it takes four clock cycles to correlate one 8x8 binary image with an 8x8 target template.
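The per-RC datapath can be sketched as follows (an illustrative software model of the AND-plus-ones-counter step, with hypothetical word-level packing of template and image rows):

```python
def row_correlation(template_row, image_row, shift):
    # Correlate one 8-bit template row against the 8 most significant bits
    # of a 16-bit image row after shifting left by `shift` bits (0..7),
    # mimicking the AND + ones-counter hardware in each RC.
    window = ((image_row << shift) >> 8) & 0xFF
    return bin(template_row & window).count("1")   # ones counter (popcount)

def block_correlation(template_rows, image_rows, shift):
    # Sum the per-row correlations for one 8x8 candidate block; the eight
    # shift values cover eight consecutive candidate positions.
    return sum(row_correlation(t, r, shift)
               for t, r in zip(template_rows, image_rows))
```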
Performance analysis: For this analysis, we choose the system parameters implemented in [25]. The ATR
systems from [25] and [26] are used for comparison. Two Xilinx XC4013 FPGAs (one dynamic FPGA for
most of the computation, and one static FPGA for control) are used in each system of [25], and the Splash 2
system (consisting of 16 Xilinx 4010 FPGAs) is used in [26]. The image size of the chip is 128x128 pixels,
and the template size is 8x8 bits. For 16 pairs of target templates, the processing time is approx. 30 ms for
MorphoSys (at 100 MHz) from the calculations below:
Computation time: {(121x121 offsets) x (4 cycles) x (8 bitplanes x 16 bright templates + 8
thresholds x 32 templates) / (8 templates concurrently)} x 10 ns + image/context load ~ 30 ms
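The bracketed expression evaluates as follows (a sanity-check sketch of the arithmetic only; the image/context load time is not modeled):

```python
# ATR processing-time estimate for 16 template pairs on MorphoSys M1.
offsets = 121 * 121                # valid template offsets in a 128x128 chip
correlations = 8 * 16 + 8 * 32     # shapesum bitplanes + thresholded images
cycles = offsets * 4 * correlations // 8   # 4 cycles each, 8 templates at once
time_ms = cycles * 10e-9 * 1e3     # 10 ns cycle: ~28 ms, ~30 ms with loads
```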
This processing time is about an order of magnitude lower than the 210 ms required for the FPGA
system in [25], and the 195 ms for the Splash 2 system [26]. Even though MorphoSys M1 is a coarse-grained
system, it achieves performance similar to FPGA-based systems (after accounting for clock rate scaling) for
these bit-level ATR operations. FPGAs are, however, limited in applicability to mostly bit-level operations,
and are inefficient for coarse-grain operations. These results are shown in Figure 17.
Figure 17: Performance Comparison of MorphoSys for ATR, time in ms: MorphoSys M1 (64 RCs): 30;
Splash 2 @ 19 MHz [26]: 195; Xilinx FPGAs @ 12.5 MHz [25]: 210
ATR System Specification: A quantified measure of the ATR problem [25] states that 100 chips
have to be processed each second for a given target. Each target has a pair of bright and surround templates
for every five-degree rotation (72 pairs for a full 360-degree rotation). Considering these requirements, nine
chips of MorphoSys M1 (64 RCs) would be needed to satisfy this specification, as compared to 90 sets of
the system described in [25] and 84 sets of the Splash 2 system [26].
7.3 Data Encryption/Decryption (IDEA Algorithm)
Data security is a key application domain. The International Data Encryption Algorithm (IDEA) [30]
is a typical example of this application class. IDEA involves processing of plaintext data (data to be
encrypted) in 64-bit blocks with a 128-bit encryption/decryption key. The algorithm performs eight iterations
of a core function. After the eighth iteration, a final transformation step produces a 64-bit ciphertext
(encrypted data) block. IDEA employs three operations: bitwise exclusive-OR, addition modulo 2^16, and
multiplication modulo 2^16 + 1. Encryption/decryption keys are generated externally and then loaded once into
the Frame Buffer of MorphoSys M1.
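These three operations can be sketched in software as follows (a minimal illustration of the arithmetic, not the RC Array mapping; note the IDEA convention that the all-zero word represents 2^16 in the modular multiplication):

```python
MOD_ADD = 1 << 16        # addition is modulo 2^16
MOD_MUL = (1 << 16) + 1  # multiplication is modulo 2^16 + 1 (a prime)

def xor16(a, b):
    """Bitwise exclusive-OR of two 16-bit words."""
    return a ^ b

def add16(a, b):
    """Addition modulo 2^16."""
    return (a + b) % MOD_ADD

def mul16(a, b):
    # IDEA convention: the 16-bit value 0 stands for 2^16, so every
    # operand is nonzero modulo 2^16 + 1 and the operation is invertible.
    a = a or MOD_ADD
    b = b or MOD_ADD
    return (a * b) % MOD_MUL % MOD_ADD
```

Mixing these three algebraically incompatible operations is the source of IDEA's confusion and diffusion, which is why all three must be supported efficiently by each cell cluster.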
When mapping IDEA to MorphoSys, some operations of IDEA’s core function can be performed in
parallel, while others are performed sequentially. The maximum number of operations that are performed in
parallel is four. To exploit this parallelism, clusters of four cells in the RC Array columns are allocated to
operate on each plaintext block. Thus the whole RC Array can operate on sixteen plaintext blocks in parallel.
Performance analysis: Since two 64-bit blocks are transferred simultaneously over the operand data
bus, it takes only eight cycles to load 16 plaintext blocks into the RC Array. Each of the eight iterations of the
core function takes seven clock cycles to execute within a cell cluster. The final transformation step requires
one additional cycle. Once the ciphertext blocks have been produced, eight cycles are needed to write them back
to the Frame Buffer before loading the next plaintext. In all, it takes 73 cycles to produce 16 ciphertext blocks.
Figure 18 depicts the relative performance of MorphoSys on the IDEA algorithm, compared with a
software implementation on a Pentium II processor (which requires 357 clock cycles to generate one ciphertext
block, as profiled with the Intel VTune tool). HiPCrypto
[31] is an ASIC that provides a dedicated hardware implementation of IDEA. A single HiPCrypto chip
produces seven ciphertext blocks every 56 cycles. IDEA mapped on MorphoSys is much faster than the
implementation on the Pentium II, even after scaling for the Pentium's 233 MHz clock speed (assuming 0.35
micron technology), which yields 153 effective clock cycles per block.
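The throughput figures plotted in Figure 18 follow directly from these cycle counts (a quick check; the 153-cycle value is the 357-cycle Pentium II count scaled by 100/233):

```python
# Ciphertext blocks per cycle for each implementation, from the
# cycle counts quoted in the text.
morphosys = 16 / 73                       # 16 blocks every 73 cycles
hipcrypto = 7 / 56                        # 7 blocks every 56 cycles
pentium_scaled = 1 / (357 * 100 / 233)    # ~153 effective cycles per block
# morphosys ~ 0.22, hipcrypto = 0.125, pentium_scaled ~ 0.007
```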
[Bar chart: throughput in ciphertext blocks per cycle of 0.007 for sIDEA (scaled), 0.125 for HiPCrypto, and 0.22 for MorphoSys.]
Figure 18: Performance Comparison for IDEA mapping on MorphoSys
8. Conclusions and Future Work
This paper has presented a new reconfigurable architecture, MorphoSys. Its performance has been
evaluated for many of the target applications with impressive results that validate this architectural model.
Work on the physical implementation of MorphoSys on a custom-designed chip is in progress.
Extensions for MorphoSys model: It may be noted that the MorphoSys architecture is not limited to
using a simple RISC processor as the main processor. TinyRISC is used in the current implementation only
to evaluate the design model. There are many options for the main processor. One would be to use an
advanced general-purpose processor in conjunction with TinyRISC (which would then function as an I/O
processor for the RC Array). Also, an advanced processor (with multi-threading) may be used as the main
processor to enable concurrent processing of application programs by the RC Array and the main processor.
Another potential focus is the RC Array. For this implementation, the array has been designed for data-
parallel, computation-intensive tasks. However, the design model allows other versions, too. For example, a
suitably designed RC Array may be used for a different application class, such as high-precision signal
processing, bit-level computations, control-intensive applications, or dynamic stream processing.
Based on this, we visualize that MorphoSys may be the precursor of a generation of systems that
integrate advanced general-purpose processors with a specialized reconfigurable component, in order to meet
the constraints of mainstream, high-throughput and computation-intensive applications.
9. Acknowledgments
This research is supported by the Defense Advanced Research Projects Agency (DARPA) of the
Department of Defense under contract F-33615-97-C-1126. We express thanks to Ms. Kerry Hill of Air
Force Research Laboratory for her constructive feedback, Prof. Tomas Lang and Prof. Walid Najjar for their
useful and incisive comments, and Robert Heaton (Obsidian Technology) for his contributions towards the
physical design of MorphoSys.
We acknowledge the contributions of Maneesha Bhate, Matthew Campbell, Sadik Can, Benjamin U-Tee
Cheah, Alexander Gascoigne, Nambao Van Le, Rafael Maestre, Robert Powell, Rei Shu, Lingling Sun, Cesar
Talledo, Eric Tan, Timothy Truong, and Tom Truong; all of whom have been associated with the
development of MorphoSys architecture and application mapping.
References:
1. S. Brown and J. Rose, “Architecture of FPGAs and CPLDs: A Tutorial,” IEEE Design and Test of
Computers, Vol. 13, No. 2, pp. 42-57, 1996
2. D. Chen and J. Rabaey, “Reconfigurable Multi-processor IC for Rapid Prototyping of Algorithmic-
Specific High-Speed Datapaths,” IEEE Journal of Solid-State Circuits, Vol. 27, No. 12, December 1992
3. E. Tau, D. Chen, I. Eslick, J. Brown and A. DeHon, “A First Generation DPGA Implementation,”
FPD’95, Canadian Workshop of Field-Programmable Devices, May 1995
4. E. Mirsky and A. DeHon, “MATRIX: A Reconfigurable Computing Architecture with Configurable
Instruction Distribution and Deployable Resources,” Proceedings of IEEE Symposium on FPGAs for
Custom Computing Machines, April 1996, pp.157-66
5. J. R. Hauser and J. Wawrzynek, “Garp: A MIPS Processor with a Reconfigurable Co-processor,” Proc.
of the IEEE Symposium on FPGAs for Custom Computing Machines, April 1997
6. C. Ebeling, D. Cronquist, and P. Franklin, “Configurable Computing: The Catalyst for High-
Performance Architectures,” Proceedings of IEEE International Conference on Application-specific
Systems, Architectures and Processors, July 1997, pp. 364-72
7. T. Miyamori and K. Olukotun, “A Quantitative Analysis of Reconfigurable Coprocessors for Multimedia
Applications,” Proc. of IEEE Sym. on Field-Programmable Custom Computing Machines, Apr 1998
8. J. Babb, M. Frank, V. Lee, E. Waingold, R. Barua, M. Taylor, J. Kim, S. Devabhaktuni, and A. Agrawal,
“The RAW Benchmark Suite: computation structures for general-purpose computing,” Proc. of IEEE
Symposium on Field-Programmable Custom Computing Machines, April 1997, pp. 134-43
9. M. Gokhale, W. Holmes, A. Kopser, S. Lucas, R. Minnich, D. Sweely, and D. Lopresti, “Building and
Using a Highly Parallel Programmable Logic Array,” IEEE Computer, pp. 81-89, January 1991
10. P. Bertin, D. Roncin, and J. Vuillemin, “Introduction to Programmable Active Memories,” in Systolic
Array Processors, Prentice Hall, 1989, pp. 300-309
11. A.K. Yeung and J.M. Rabaey, “A 2.4 GOPS Data-Driven Reconfigurable Multiprocessor IC for DSP,”
Proceedings of IEEE Solid-State Circuits Conference, February 1995, pp. 108-109, 346, 440
12. A. Abnous, C. Christensen, J. Gray, J. Lenell, A. Naylor and N. Bagherzadeh, “Design and
Implementation of TinyRISC microprocessor,” Microprocessors and Microsystems, Vol. 16, No. 4, 1992
13. B. B. Welch, Practical Programming in Tcl and Tk, 2nd edition, Prentice-Hall, 1997
14. SUIF Compiler system, The Stanford SUIF Compiler Group, http://suif.stanford.edu
15. ISO/IEC JTC1 CD 13818. Generic coding of moving pictures, 1994 (MPEG-2 standard)
16. F. Bonomini, F. De Marco-Zompit, G. A. Mian, A. Odorico, and D. Palumbo, “Implementing an MPEG2
Video Decoder Based on TMS320C80 MVP,” SPRA 332, Texas Instr., September 1996
17. C. Hsieh and T. Lin, “VLSI Architecture For Block-Matching Motion Estimation Algorithm,” IEEE
Trans. on Circuits and Systems for Video Tech., vol. 2, pp. 169-175, June 1992
18. S. H. Nam, J. S. Baek, T. Y. Lee and M. K. Lee, “A VLSI Design for Full Search Block Matching Motion
Estimation,” Proc. of IEEE ASIC Conference, Rochester, NY, September 1994, pp. 254-7
19. K-M Yang, M-T Sun and L. Wu, “A Family of VLSI Designs for Motion Compensation Block
Matching Algorithm,” IEEE Trans. on Circuits and Systems, V. 36, No. 10, October 1989, pp. 1317-25
20. Intel Application Notes for Pentium MMX, http://developer.intel.com/drg/mmx/appnotes/
21. W-H Chen, C. H. Smith and S. C. Fralick, “A Fast Computational Algorithm for the Discrete Cosine
Transform,” IEEE Trans. on Communications, vol. COM-25, No. 9, September 1977
22. T. Arai, I. Kuroda, K. Nadehara and K. Suzuki, “V830R/AV: Embedded Multimedia Superscalar RISC
Processor,” IEEE MICRO, March/April 1998, pp. 36-47
23. “IEEE Standard Specifications for the Implementation of 8x8 Inverse Discrete Cosine Transform,” Std.
1180-1990, IEEE, December 1990
24. Challenges for Adaptive Computing Systems, Defense and Advanced Research Projects Agency
(DARPA), www.darpa.mil/ito/research/acs/challenges.html
25. J. Villasenor, B. Schoner, K. Chia, C. Zapata, H. J. Kim, C. Jones, S. Lansing, and B. Mangione-Smith, “
Configurable Computing Solutions for Automatic Target Recognition,” Proceedings of IEEE Workshop
on FPGAs for Custom Computing Machines, April 1996
26. M. Rencher and B. L. Hutchings, “Automated Target Recognition on SPLASH 2,” Proceedings of IEEE
Symposium on FPGAs for Custom Computing Machines, April 1997
27. S. Hauck, T.W. Fry, M.M. Hosler, and J.P. Kao, “The Chimaera Reconfigurable Functional Unit,”
Proceedings of IEEE Symposium on Field-programmable Custom Computing Machines, April 1997
28. R.D. Wittig and P. Chow, “OneChip: An FPGA processor with Reconfigurable Logic,” Proceedings of
IEEE Symposium on FPGAs for Custom Computing Machines, April 1996
29. M.J. Wirthlin and B.L. Hutchings, “A Dynamic Instruction Set Computer,” Proceedings of IEEE
Symposium on Field-programmable Custom Computing Machines, April 1995
30. B. Schneier, Applied Cryptography, John Wiley, New York, NY, 1996.
31. S. Salomao, V. Alves, and E.C. Filho, “HiPCrypto: A High Performance VLSI Cryptographic Chip,”
Proceedings of IEEE ASIC conference, 1998, pp. 7-13
32. W. J. Bouknight, S. A. Denenberg, D. E. McIntyre, J. M. Randall, A. H. Sameh and D. L. Slotnick, “The
Illiac IV System,” Proc. IEEE, vol. 60, no. 4, April 1972, pp. 369-388
33. K. E. Batcher, “Design of a Massively Parallel Processor,” IEEE Trans. on Computers, C-29, September
1980, pp. 836-840