BASIC MATRIX OPERATIONS ON A DSP ARRAY ARCHITECTURE

LARS BENGTSSON, STEFAN LUND
Halmstad University, Centre for Computer Systems Architecture,
Box 823, S-301 18 Halmstad, Sweden
lars.bengtsson@ide.hh.se, stefan.lund@ide.hh.se
Abstract. Many processing tasks in signal and image processing can be reduced to a common set of basic matrix primitives, e.g. matrix/vector multiplication, matrix transposition and inversion, and solving systems of equations. A highly parallel array architecture for such applications is presented, and it is shown how some frequently used matrix operations are performed. The array, consisting of PEs interconnected as a 2D grid, executes instructions according to the SIMD (Single Instruction Multiple Data) parallel computing style. It is scalable, both in terms of problem size and when porting it to future down-scaled CMOS processes.

Keywords: Parallel DSP, Multi-channel signal processing, Matrix computations, Architecture and implementation.
1 Introduction
The major computational requirements for many real-time processing tasks in signal and image processing can be reduced to a common set of basic matrix primitives [1]. This set includes matrix/vector multiplication, matrix/matrix multiplication and addition, matrix inversion, solution of linear systems, eigensystem solution, matrix decomposition (LU, QR and singular value decomposition) and the generalized SVD algorithm.
This paper presents the way in which such matrix computations are performed on a parallel DSP array architecture. In addition to complex operations such as matrix inversion, matrix/vector multiplication and solving a system of linear equations, simpler operations such as matrix addition/subtraction and matrix transposition are presented.
2 The “REMAP-γ” DSP array architecture
Figure 1 shows the array architecture. Based on the SIMD parallel computing model, a central Control Unit (CU) generates and issues instructions (control, address and timing) to the processors (PEs) in the array. All PEs receive the same instruction and perform the same operation (a "minor" local control modification is possible). In its classical definition, the model is fully synchronous, and the timing and synchronization are provided by the CU broadcasting a central global clock signal. A status signal, "array status" (typically the OR sum of one status bit per PE), is read by the CU to monitor the state of the array.
The PEs, interconnected as a 2D array with nearest-neighbour connections, are optimized to perform MAC operations (fixed-point representation). Input data is received as a vector, one vector item per array column, at the top array border. After completed processing, an output vector is produced at the eastern array border.
Figure 1. The SIMD array architecture.
2.1 Processing Elements
The processing elements used in the architecture have a bit-serial data path including a (bit-parallel) register file. Each PE can address the registers in the file independently of the other PEs using an index register, X. The global CU supplies the base address (common to all PEs), and each PE adds its own 8-bit offset (held in the ACCUMULATOR) to this base, yielding the final address. This facilitates "local address modification", useful in table lookup operations (e.g. non-linear functions).
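A minimal sketch of this addressing scheme in Python, with a made-up lookup table and four PEs purely for illustration (register-file size and base address follow the text, everything else is invented for the example):

REGISTER_FILE_SIZE = 256
BASE = 0x10                                                   # base address broadcast by the global CU
local_memory = [[0] * REGISTER_FILE_SIZE for _ in range(4)]   # one register file per PE (4 PEs here)
for pe in range(4):
    for k in range(16):                                       # a small per-PE table, e.g. a non-linear function
        local_memory[pe][BASE + k] = k * k + pe
accumulators = [3, 7, 0, 15]                                  # data-dependent 8-bit offsets, one per PE
print([local_memory[pe][BASE + (acc & 0xFF)]                  # one SIMD lookup, a different entry per PE
       for pe, acc in enumerate(accumulators)])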
Figure 2 shows the PE data path architecture. The central path is a bit-serial ALU with two 1-bit input buses (A and B), a carry-in input and two 1-bit outputs (S and COUT). S is fed to an accumulator (shift register), to a 1-bit register (R) and to the index register. COUT (carry out) is fed to a 1-bit register (C) for carry-save. All data transported through the PE flows through this ALU. The data may be modified by the ALU or simply pass through unaffected, with the same time delay. A T-bit is used for "tagged" (selective) memory writes.
Figure 2. The PE data path architecture.
Each PE has a dedicated 16-bit, two's-complement serial/parallel multiplier that makes multiplication in essence as fast (per bit) as addition/subtraction. It is fed by 16-bit parallel data (from the Q register) and by serial input data (from the PS register), least significant bit first. The output data is generated serially, least significant bit first. Figure 3 shows the multiplier structure.
Figure 3. The 16-bit two's-complement serial/parallel multiplier.
Communication with the neighbouring PEs takes place through four wires (North, East, South, West) feeding the ALU input bus B; output goes through a multiplexer where one of three sources is selected. The first two sources are the ACCUMULATOR and PS registers; these are used e.g. in the NORTH, SOUTH, EAST and WEST instructions, where data are shifted one PE step in the array. The third source (the 1-bit register R) is used in those instructions where data are passed through the PE on a bit-by-bit basis. Examples of instructions that use this are RBROADC (Row BROADCast) and RMAC (Row Multiply-and-ACcumulate). These instructions are described later in the text.
Input to the array is read from the north side. On the northern PE border, the N(orth) inputs are connected to parallel/serial conversion registers. In this way, the SOUTH instruction will push input data (16 bits) into the array from the north side. The SOUTH instruction also simultaneously outputs the rightmost PE column to the output interfaces, so array input and output are performed concurrently.
Output from the array is also done by the EAST instruction, which shifts ("pops") the PE accumulator contents one PE step to the east. The rightmost PEs (on the eastern border) have their outputs connected to serial/parallel conversion registers, which are serially filled with 16-bit data. These registers are readable/writeable by the global CU while the array is working.
2.2 The instruction set
This section describes the REMAP-γ instruction set. Instructions from the global CU are issued at the word level. Most instructions operate on 16-bit words; however, some generate 32 bits (e.g. the RMAC and CMAC instructions). The CU sends these instructions at "low" speed (typically 400/16 MHz), and they are interpreted and executed by the local CUs at "high" speed (typically 400 MHz) at the bit level. Table 1 below shows the 34 instructions and their functions (the arithmetic instructions use two's-complement data values).
Table 1. The Instruction Set

ADD (SUB): Add (subtract) register PS to (from) the ACC (16 bits).
MAC: Multiply the Q register with the PS register and accumulate in ACC. Result in ACC (32 bits).
RBROADC (pe, REG): The PE in column 'pe' broadcasts its 'REG' (ACC high word, or PS) on each row.
CBROADC (pe, REG): The PE in row 'pe' broadcasts its 'REG' (ACC high word, or PS) on each column.
RMAC: Multiply-and-accumulate along each row. Result in ACC (32 bits) of the PEs in the rightmost column.
CMAC: Multiply-and-accumulate along each column. Result in ACC (32 bits) of the PEs in the lowest row.
RBROADC_DIAG (REG): The PEs on the diagonal broadcast their 'REG' (ACC high word, or PS) on each row.
CBROADC_DIAG (REG): The PEs on the diagonal broadcast their 'REG' (ACC high word, or PS) on each column.
RMIN/RMAX: Find the min./max. value (in ACC) across each row. Result in ACC of the rightmost column.
CMIN/CMAX: Find the min./max. value (in ACC) across each column. Result in ACC of the lowest row.
ADDX: Add ACC (bits 23-16) to register X. Result in X.
NORTH (REG), SOUTH (REG), EAST (REG), WEST (REG): Shift 'REG' (ACC32, ACC16 or PS) to the next PE in the array.
STORE (ACC_HIGH, ACC_LOW): Write the ACC (high word or low word) to local memory (address in X).
STORE_T (ACC_HIGH, ACC_LOW): Write the ACC to local memory if the T-bit is set.
R_SELECTF: Searches the rows of PEs for set T-bits. Clears all but the first one, starting from the rightmost column.
C_SELECTF: Searches the columns of PEs for set T-bits. Clears all but the first one, starting from the bottom row.
CLEAR: Clear ACC, C, R, T and the multiplier S- and C-bits.
INIT: Initialize the ROW and COL registers in each PE.
AND (OR): Logical AND (OR) of the PS and ACC registers (16 bits).
LOAD_X: Load the X registers with data from the instruction parameter field (8 bits).
LOAD_PS, LOAD_Q: Load the PS (Q) registers with data from the local memory.
SHIFTACC (n): Shift the ACC register n steps to the right.
SHIFTACC_LEFT (n): Shift the ACC register n steps to the left.
DIV_RESTORE: Used in the Divide procedure.
GRT: Sets the T-bit if the PS register is greater than (or equal to) the ACC register (high word).
LESS: Sets the T-bit if the PS register is less than the ACC register (high word).
2.2.1 Broadcast instructions
The broadcast instructions broadcast (using only the PE nearest-neighbour connections) the ACCUMULATOR or PS registers of column 'pe' (RBROADC) or row 'pe' (CBROADC) on each row and column, respectively. The 'pe' argument indicates which PE column (RBROADC) or PE row (CBROADC) is the source of the broadcast. The RBROADC_DIAG and CBROADC_DIAG instructions use the PEs on the main diagonal as the source. Each PE uses the 'pe' argument (supplied in the instruction field) to determine whether it is the source or one of the destinations. The source PEs start broadcasting immediately, and the destination PEs wait the appropriate number of clock cycles (determined by the difference between 'pe' and their own ROW or COL number) before starting to receive the 16-bit bit stream.
The maximum time for the broadcast instructions depends on the array size. If the array size is N*N, the time is (N-2)+16 cycles. For a typical size of 16*16 PEs, the time required is 30 cycles.
When the complete broadcast word has been collected in a PE, it raises its PE_READY signal, informing the global CU that the instruction has been completed in this PE.
The broadcast instructions use a "bit-level pipelined" approach to distribute data among the rows and columns. Using only short nearest-neighbour connections, scalability is maintained when porting the architecture to future down-scaled deep-submicron CMOS processes [2].
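The (N-2)+16 figure can be reproduced with a small timing model of the bit-serial pipeline, under the assumption that the nearest neighbour receives the source output directly and every further hop adds one cycle of R-register delay (a sketch of one consistent reading, not the actual hardware):

def broadcast_cycles(N, word_bits=16):
    # Count cycles until the farthest PE on a row has received the whole word.
    source_bits = list(range(word_bits))    # the 16-bit word, LSB first
    r = [None] * (N - 2)                    # R registers of the intermediate PEs
    received = []                           # bits collected by the farthest PE
    cycles = 0
    while len(received) < word_bits:
        cycles += 1
        if r:
            incoming = r[-1]                # bit leaving the last intermediate PE
            r[1:] = r[:-1]                  # bits ripple one PE per cycle
            r[0] = source_bits.pop(0) if source_bits else None
        else:
            incoming = source_bits.pop(0) if source_bits else None
        if incoming is not None:
            received.append(incoming)
    return cycles

for n in (4, 16, 32):
    print(n, broadcast_cycles(n), (n - 2) + 16)   # the model matches (N-2)+16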
2.2.2 The Multiply-and-ACcumulate instructions
The multiply-and-accumulate instructions implement MAC operations either locally (MAC) or for each row (RMAC) or each column (CMAC). The RMAC and CMAC instructions use the same "bit-level pipelined" approach as the broadcast instructions. The principle is the same as described in Figure 5, with the addition that the bits are produced by the multipliers and accumulated along the way by the ALUs. Figure 6 illustrates this scheme.
Figure 6. Illustration of the RMAC instruction.
The time required for these instructions depends on the array size. If the array size is N*N, the time required for an RMAC or CMAC operation is N-1+log2(N)+32 cycles. For an array size of 16*16 PEs, this yields 51 clock cycles. For a 32*32 array, 68 cycles are required.
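A small helper reproduces these cycle counts, assuming the constant 32 corresponds to the double-length (32-bit) result word:

from math import log2

def rmac_cycles(N, word_bits=16):
    # RMAC/CMAC time on an N*N array, as given in the text.
    return (N - 1) + int(log2(N)) + 2 * word_bits

print(rmac_cycles(16), rmac_cycles(32))   # 51 and 68, as stated above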
3 Basic matrix primitives
This section presents the mapping and performance of some selected basic matrix operations on the architecture.
3.1 Multiplying a vector with the transpose of a matrix
In some algorithms (e.g. artificial neural networks, ANN), it is necessary, after first doing the ordinary matrix-by-vector multiplication (R=WX), to also multiply the result vector R with the transpose of the same matrix (W^T). (EQ 1) illustrates these calculations.
r_i = \sum_j w_{ij} x_j , \qquad s_i = \sum_j w^{T}_{ij} r_j = \sum_j w_{ji} r_j     (EQ 1)
These operations are easily executed by the array using the RMAC and CMAC instructions, as shown in the following example. The ordinary matrix-vector multiplication (R=WX) is first calculated (assuming the X vector resides in the top row) using the RMAC instruction, after which the result vector R resides in the eastern PE column #N-1. The R vector is then broadcast on the rows and the CMAC instruction is used to create the result vector S.

CBROADC (0,PS)
RMAC                  /* perform R=WX */
                      /* the accumulator is moved to the PS register */
RBROADC (N-1,PS)      /* broadcast the R vector from column #N-1 */
CMAC                  /* MAC across the columns */

The S vector now resides in the southern (highest numbered) row and may be moved to the eastern border, if needed, using only two further instructions (CBROADC (N-1), RBROADC_DIAG). Normally, however, as is the case in the ANN training algorithm "back-propagation", this operation is one part of a long sequence and the result is not output but used further in the calculations.
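A quick numerical check of this two-step sequence, using numpy as a stand-in for the array:

import numpy as np

N = 4
rng = np.random.default_rng(0)
W = rng.standard_normal((N, N))
X = rng.standard_normal(N)

R = W @ X       # RMAC: each PE row accumulates W[i, j] * X[j]; result in the east column
S = W.T @ R     # CMAC: after broadcasting R on the rows, each column accumulates W[i, j] * R[i]

print(np.allclose(S, W.T @ (W @ X)))   # True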
3.2 Vector transpose
Transposing a vector is straightforward and uses only two instructions (assuming the vector resides in the top row). First, the vector is broadcast on the columns by the CBROADC instruction. Second, the RBROADC_DIAG instruction broadcasts the diagonal elements on the rows. The transposed vector now resides in the rightmost column.
The instruction sequence for vector transpose is:

SOUTH (PS)            /* input the vector */
CBROADC (0,PS)        /* broadcast it on the columns */
RBROADC_DIAG (PS)     /* the diagonal PEs do a row broadcast */
EAST (PS)             /* output the vector */
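A grid-level sketch of the same two-instruction sequence, modelling the PE array as a numpy matrix:

import numpy as np

N = 4
v = np.arange(1.0, N + 1)            # vector in the top PE row, one element per column

grid = np.tile(v, (N, 1))            # CBROADC (0, PS): grid[i, j] = v[j] for every row i
east_column = grid.diagonal().copy() # RBROADC_DIAG (PS): row i receives grid[i, i] = v[i]

print(east_column)                   # v now lies along the rows, i.e. in the east column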
3.3 Matrix addition and subtraction
Addition (or subtraction) is done in parallel using the ADD (SUB) instruction. If two matrices, A and B, are to be added, the instructions issued are (assuming that the A matrix elements reside in the ACCumulators):

LOAD_X (B)            /* load X register with the address of B */
LOAD_PS               /* load B elements into the PS registers */
ADD (or SUB)          /* add/subtract the A and B items */
STORE (ACC_HIGH)      /* store the result in B */
3.4 Matrix inversion
To invert a matrix A, the Gauss-Jordan elimination method may be used. This method is normally used to solve a linear system of equations. This section first describes how this solving can be done on the array. It is then shown how this can be extended to perform matrix inversion.
3.4.1 The Gauss-Jordan elimination method
This algorithm solves a system of linear equations with a variation of Gauss elimination. No back-substitution is necessary to complete the solution: all off-diagonal elements are eliminated, and the resulting matrix and vector require only one division per element to produce the solution vector.
An example linear system of equations with three variables is shown below.
\begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} B_1 \\ B_2 \\ B_3 \end{bmatrix}     (EQ 2)
The algorithm is as follows (showing the elimination of element A21):
• Calculate E = -A21 / A11
• For J=1 to 3, calculate A2J = A2J + A1J * E
• Calculate B2 = B2 + B1 * E
This proceeds, one column at a time, until all off-diagonal elements have been eliminated. Finally, to obtain the solution vector:
• For i=1 to 3, calculate Xi = Bi / Ai,i
The general algorithm (in its sequential form) can be expressed using the following pseudo-code:
FOR c IN 1 TO NO_COLS LOOP            /* for each column */
  FOR r IN 1 TO NO_ROWS LOOP          /* for each row */
    IF r=c CONTINUE;                  /* skip to the next row if r=c */
    Er = -Arc/Acc;
    FOR j IN 1 TO NO_COLS LOOP        /* for each column in A */
      Arj = Arj + Acj*Er;
    ENDLOOP;
    Br = Br + Bc*Er;
  ENDLOOP;
ENDLOOP;
FOR r IN 1 TO NO_ROWS LOOP            /* for each A row */
  Xr = Br / Ar,r
ENDLOOP
Parallelizing this sequential code yields the following pseudo-code:

FOR i IN 1 TO NO_COLS LOOP
  Broadcast row #i on the columns;
  Er = -Ari/Aii for each row r not equal to i; Ei = 0;
  Broadcast Er on the rows;
  Arc = Arc + Aic*Er;                 /* for every column c */
  Br = Br + Bi*Er;
ENDLOOP
No row pivoting (row swapping) is used here, and thus cases in which a diagonal element is zero (Aii=0 or Bi=0) will produce a wrong result. Searching for the maximum element and row swapping may be added to cope with this situation.
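A plain Python/numpy rendering of the sequential pseudo-code above (no pivoting, with the per-row division at the end):

import numpy as np

def gauss_jordan_solve(A, B):
    # Eliminate all off-diagonal elements column by column (no pivoting,
    # so the diagonal elements must stay non-zero), then divide per row.
    A = A.astype(float)
    B = B.astype(float)
    n = A.shape[0]
    for c in range(n):                       # for each column
        for r in range(n):                   # for each row
            if r == c:
                continue                     # skip the pivot row
            e = -A[r, c] / A[c, c]
            A[r, :] += A[c, :] * e           # adjust the whole row of A ...
            B[r] += B[c] * e                 # ... and the corresponding B element
    return B / np.diag(A)                    # x_i = B_i / A_ii

A = np.array([[4.0, 1.0, 2.0],
              [1.0, 3.0, 0.0],
              [2.0, 0.0, 5.0]])
B = np.array([1.0, 2.0, 3.0])
print(gauss_jordan_solve(A, B), np.linalg.solve(A, B))   # the two should agree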
3.4.2 Executing Gauss-Jordan on the array
This section shows the code segment needed to execute the algorithm. The coefficient matrix A and the result vector B are first loaded into the PE array by storing them in the PE local memories. The A matrix is stored in columns 0 to #NO_COLS-1, and the B vector is stored in the rightmost column (#NO_COLS). Also, a help matrix, H, is stored with an initial content of:
H = \begin{bmatrix} 0 & 0 & \cdots & 0 \\ 1 & 1 & \cdots & 1 \\ \vdots & \vdots & & \vdots \\ 1 & 1 & \cdots & 1 \end{bmatrix}     (EQ 3)
This help matrix is used to inhibit adjustment of row #i during the elimination process.
The elimination executed on the array is performed column by column. It starts from the leftmost column (column #0), eliminating all off-diagonal elements in this column, and proceeds towards the rightmost column.
The instructions broadcast from the global CU are:
FOR i IN 0 TO NO_COLS-1 LOOP    /* for each column */
  Load_acc (A_B)
  CBROADC (ACC,i)               /* TEMPc <= Aic, or TEMPc <= Bi */
  Store_acc_hw (TEMP)
  Divide (A_B,TEMP)
  Store_acc_lw (E)
  Multiply (H,E)                /* clear all E values on row i */
  RBROADC (ACC,i)               /* broadcast the E values on the rows */
  Store_acc_hw (E)              /* store Er <= Arc/TEMPc * Hrc */
  Multiply (TEMP,E)
  Store_acc_hw (SLASK)
  Load_acc (A_B)
  LOAD_X (SLASK)
  LOAD_PS
  SUB
  Store_acc_hw (A_B)            /* store Arc = Arc - TEMPc*Er, or Br = Br - TEMPc*Er */
  Load_acc (H)
  SOUTH (ACC16)                 /* shift the help matrix down one PE step;
                                   "ones" are shifted in to the top row */
  Store_acc_hw (H)
ENDLOOP
Load_acc (A_B)
RBROADC_DIAG (ACC)              /* get the diagonal elements */
Store_acc_hw (SLASK)
Divide (A_B, SLASK)             /* the output vector is now in ACC (low word) in col. 3 */
The procedure (macro) "Multiply" above contains the following instructions:

Multiply (K,L):
  CLEAR
  LOAD_X (K)
  LOAD_Q
  LOAD_X (L)
  LOAD_PS
  MAC
The macros "Load_acc ()", "Divide ()", "Store_acc_hw ()" and "Store_acc_lw ()" load the ACC, divide Q by M, and store the ACC high word (bits 31-16, "hw") or the ACC low word (bits 15-0, "lw"), respectively.
Solving a 31*31 equation system (using a 32*32 array) requires 51040 cycles. At a 100 MHz clock frequency this corresponds to 510 µs.
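A word-level numpy model of the loop above, with the help matrix H shifted down one row per iteration. It is a sketch: word lengths and rounding are ignored, and only the column-i values of E are computed, since only they are used by the row broadcast.

import numpy as np

def gauss_jordan_on_array(A, B):
    n = A.shape[0]
    AB = np.hstack([A.astype(float), B.reshape(-1, 1).astype(float)])   # the A_B plane
    H = np.ones_like(AB)
    H[0, :] = 0.0                                  # EQ 3: zeros in the first row
    for i in range(n):                             # for each column
        TEMP = np.tile(AB[i, :], (n, 1))           # CBROADC (ACC, i): row i down the columns
        E = AB[:, i] / AB[i, i] * H[:, i]          # Divide + Multiply (H, E), column i only
        E = np.tile(E.reshape(-1, 1), (1, n + 1))  # RBROADC (ACC, i): column i along the rows
        AB = AB - TEMP * E                         # SUB: A_rc = A_rc - TEMP_c * E_r
        H = np.roll(H, 1, axis=0); H[0, :] = 1.0   # SOUTH (ACC16): H down, ones in at the top
    return AB[:, n] / np.diag(AB[:, :n])           # final Divide with the diagonal elements

A = np.array([[4.0, 1.0, 2.0], [1.0, 3.0, 0.0], [2.0, 0.0, 5.0]])
B = np.array([1.0, 2.0, 3.0])
print(gauss_jordan_on_array(A, B), np.linalg.solve(A, B))   # the two should agree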
3.4.3 Matrix inversion using Gauss-Jordan elimination
To show that the Gauss-Jordan elimination method can perform matrix inversion, consider the following. First, use the relationship

A A^{-1} = I     (EQ 4)

where I is the unity matrix. Then, use the notation

A X = B     (EQ 5)

Identifying X and B in (EQ 4) yields

X = A^{-1}, \quad B = I     (EQ 6)

Thus, by using Gauss-Jordan elimination with matrices A and B, where B is initialized as the unity matrix I, we can calculate the inverse matrix A^{-1}.
As an example, to find the inverse of the 3*3 matrix A, we start with the following:

\begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{bmatrix} \begin{bmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \\ x_{31} & x_{32} & x_{33} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}     (EQ 7)

The A matrix is loaded into the first three columns of the array. The unity matrix is loaded into the last three columns. The elimination is now performed as in section 3.4.2, with the exception that three columns of the B matrix are handled instead of only one.
The number of clock cycles required to invert a 32*32 matrix using a 32*64 array is 53771 cycles. At a 100 MHz clock frequency this corresponds to 538 µs.
Mapping the algorithm this way means that the array is not quadratic. If this is not desirable, a quadratic array can be used if the B matrix is stored in the same PEs as the A matrix (in different memory positions). Of course, this slows down the algorithm execution, because the A and B elements can in this case not be calculated (adjusted) in parallel.
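The same elimination with B initialized as the unity matrix, sketched in numpy (no pivoting, as in section 3.4.1); all B columns are adjusted together, which is what the wider 32*64 mapping provides in hardware:

import numpy as np

def gauss_jordan_invert(A):
    A = A.astype(float)
    n = A.shape[0]
    B = np.eye(n)                         # B initialized as the unity matrix I
    for c in range(n):
        for r in range(n):
            if r == c:
                continue
            e = -A[r, c] / A[c, c]
            A[r, :] += A[c, :] * e
            B[r, :] += B[c, :] * e        # all B columns are adjusted in parallel
    return B / np.diag(A)[:, None]        # divide each row by its diagonal element

A = np.array([[4.0, 1.0, 2.0], [1.0, 3.0, 0.0], [2.0, 0.0, 5.0]])
print(np.allclose(gauss_jordan_invert(A), np.linalg.inv(A)))   # True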
3.5 Performing matrix-vector multiplication
Many signal processing algorithms can basically be formulated as a matrix-by-vector multiplication problem [3]. REMAP-γ can perform matrix/vector multiplication using three separate methods, each with different performance characteristics regarding throughput, latency and the amount of hardware needed. These are:
1. Using column broadcast and the RMAC instruction ("RMAC").
2. Using a systolic type of processing with the local MAC instruction and column broadcast ("pseudo-systolic 1").
3. Using a systolic type of processing with the local MAC and nearest-neighbour PE communication only ("pseudo-systolic 2").
3.5.1 Matrix/vector multiplication using column broadcast and the RMAC instruction
The matrix/vector multiplication is basically performed by issuing two instructions, CBROADC (0) and RMAC (assuming the matrix is already loaded in the Q registers of the PEs). First, the vector is input to row #0. Second, this vector is broadcast on the columns using the CBROADC (0) instruction. Third, the RMAC instruction performs the multiply-and-accumulate between the matrix and the vector elements. The result vector is found in the rightmost column. Figure 7 illustrates this with matrix A, input vector X and output vector O.
Figure 7. Matrix/vector multiplication using column broadcast and RMAC.
In this method a new matrix-by-vector multiplication is not started until the previous one has completed. The main benefit is the very low latency (see Table 2) for a specific vector, measured as the time delay between input and output.
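One way to account for the 146-cycle latency quoted in Table 2 for a 32*32 array, assuming a 16-bit input/output word and the cycle counts from sections 2.1 and 2.2 (the paper does not give this breakdown, so it is an interpretation):

from math import log2

N, W = 32, 16
south   = W                                # SOUTH: push the 16-bit input vector into row 0
cbroadc = (N - 2) + W                      # CBROADC (0): column broadcast, (N-2)+16 cycles
rmac    = (N - 1) + int(log2(N)) + 2 * W   # RMAC: N-1+log2(N)+32 cycles
east    = W                                # EAST: shift the 16-bit result out of the east edge
print(south + cbroadc + rmac + east)       # 16 + 46 + 68 + 16 = 146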
3.5.2 Matrix/vector multiplication using the local MAC instruction and column broadcast ("pseudo-systolic 1")
This method uses a systolic-like processing style where the input vectors are broadcast on the columns and the accumulated sums are shifted east, one PE step at a time, through the array. The local MAC instruction produces the products and accumulates the sum. The result vector appears at the eastern output after some latency.
The procedure is as follows: the input vector is shifted down (SOUTH) one PE step (this also inputs the next vector at the same time). The CBROADC_DIAG instruction is then used to broadcast the vectors on the respective columns, using the diagonal PE elements as sources. Next, the MAC instruction produces the products and accumulates the sums. Finally, the accumulated sums are shifted one PE step to the east (at the same time outputting a result vector).
Figure 8 shows this principle with matrix A and the input vector stream F, G, H, I, ...
In the systolic methods, where vectors are pipelined through the array, a new result is produced in each loop (after N initial loops). However, there is a "high" latency (equal to N loops) for a given vector.
Figure 8. "Pseudo-systolic 1" matrix/vector multiplication.
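A numpy sketch of this dataflow, with the matrix stationary and the accumulators moving east one column per loop (zero vectors are fed in at the end to flush the pipeline):

import numpy as np

def pseudo_systolic_1(A, vectors):
    N = A.shape[0]
    rows = np.zeros((N, N))            # rows[r]: vector currently held by PE row r
    acc = np.zeros((N, N))             # acc[r, c]: accumulator of PE (r, c)
    outputs = []
    for v in list(vectors) + [np.zeros(N)] * N:          # zero vectors flush the pipeline
        rows = np.roll(rows, 1, axis=0); rows[0] = v     # SOUTH: vectors move down, new vector in
        x = rows.diagonal()                              # CBROADC_DIAG: column c gets rows[c, c]
        acc = acc + A * x                                # MAC: acc[r, c] += A[r, c] * x[c]
        outputs.append(acc[:, -1].copy())                # EAST: the east column is output ...
        acc = np.roll(acc, 1, axis=1); acc[:, 0] = 0     # ... while the accumulators shift east
    return outputs

A = np.arange(1.0, 10.0).reshape(3, 3)
f, g = np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0, 6.0])
out = pseudo_systolic_1(A, [f, g])
print(out[2], A @ f)    # the first result appears after N = 3 loops
print(out[3], A @ g)    # after that, one new result per loop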
3.5.3 Using systolic processing and skew/deskew external registers ("pseudo-systolic 2")
This method uses systolic processing where the input vector is delayed (skewed) and the output vector is deskewed according to Figure 10. The local MAC instruction is used to create the local product and add it to the accumulated sum shifted in from the left PE neighbour. The column broadcast is not necessary here, as it is in the first two methods.
This method yields the highest throughput but has a high latency (although not as high as "pseudo-systolic 1"). It requires extra hardware to skew/deskew the input and output vectors (shown in Figure 9). Each of these registers has a size equal to the PS (or Q) register in the PE data path.
Figure 9. Skew and deskew registers needed in the "pseudo-systolic 2" matrix/vector multiplication case.
Figure 10 shows how this matrix/vector multiplication is performed by the array. The matrix is stored in the PEs' Q registers, i.e. one matrix element in each PE. In each step, the vectors are shifted one step south, and the accumulated sums are shifted one step to the east. A result is produced in each step, and the result vector is found by deskewing the result vectors from the east-edge PEs.
Figure 10. "Pseudo-systolic 2" matrix/vector multiplication with skew/deskew of input and output data.
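A numpy sketch of the skew/MAC/deskew dataflow described above; the skew and deskew sections are modelled as index arithmetic rather than as explicit registers:

import numpy as np

def pseudo_systolic_2(A, vectors):
    N = A.shape[0]
    n_vec = len(vectors)
    x = np.zeros((N, N))               # x[i, j]: vector element held by PE (i, j)
    s = np.zeros((N, N))               # s[i, j]: partial sum held by PE (i, j)
    east = []                          # east-edge values, one column vector per step
    for t in range(n_vec + 2 * N):     # enough steps to flush the pipeline
        # skew section: element j of vector k enters column j at step t = k + j
        top = np.array([vectors[t - j][j] if 0 <= t - j < n_vec else 0.0
                        for j in range(N)])
        x = np.roll(x, 1, axis=0); x[0] = top          # vector elements move south
        s = np.roll(s, 1, axis=1); s[:, 0] = 0.0       # partial sums move east
        s = s + A * x                                  # local MAC in every PE
        east.append(s[:, -1].copy())
    # deskew section: row i of the result for vector k appears at step k + i + N - 1
    return [np.array([east[k + i + N - 1][i] for i in range(N)])
            for k in range(n_vec)]

A = np.arange(1.0, 10.0).reshape(3, 3)
vecs = [np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0, 6.0])]
for r, v in zip(pseudo_systolic_2(A, vecs), vecs):
    print(r, A @ v)     # each deskewed result equals A @ v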
Table 2 shows a comparison of the performance of the three matrix/vector multiplication methods in terms of sustained throughput and latency. The clock frequency assumed is 100 MHz and the array size is 32*32.
Table 2. Matrix/vector performance on a 32*32 array (@ 100 MHz)

Method                  Sustained throughput (GOPS)    Latency (clock cycles)    Latency (µs)
"RMAC"                  1.4                            146                       1.46
"pseudo-systolic 1"     1.6                            4032                      40
"pseudo-systolic 2"     2.56                           2560                      25.6
As Table 2 reveals, the "pseudo-systolic 2" method is superior with respect to throughput (but requires extra skew and deskew registers), and the "RMAC" method is superior with respect to latency.
4 VLSI test implementation
Two test prototype chips, one with 16 PEs (4*4) and one with 64 PEs (8*8), have been designed using VHDL synthesis and standard cells. The technology used was the ES2 0.7-micron N-well CMOS double-layer-metal process. The physical layout was created with Cadence place & route tools, which resulted in an array area of 225 mm² (8*8 PEs) and a clock speed of 100 MHz.
The chip (4*4) block diagram is shown in Figure 11. As can be seen, the parallel-to-serial and serial-to-parallel conversion interfaces at the northern and eastern borders are included on-chip. These interfaces can, through the use of multiplexers, be bypassed on those PE chips that are not placed at the array borders when a multiple-chip array is constructed.
The input interface (µPI_IN) includes four parallel-in/serial-out shift registers. The output interface (µPI_OUT) includes four serial-in/parallel-out shift registers.
Figure 11. Test chip (4*4) block diagram.
The 8*8 test chip has the same type of block diagram but has 64 PEs, eight input registers and eight output registers.
Table 3 below summarizes the most important design parameters of the 8*8 test design.

Table 3. Summary of test chip (8*8) data

Chip parameter                                    Data
Technology                                        0.7 µm CMOS, double-layer metal
Clock frequency (MHz)                             100
Array area (mm²)                                  225 (15 x 15)
Number of cells (excluding register memory)       80450
Estimated power dissipation (W)                   < 12
Using scaling rules for CMOS [4] and [5], it is estimated that when using a state-of-the-art CMOS process (0.18 µm), approximately four times higher clock speed (400 MHz) and 16 (4²) times smaller area should be expected. Thus, a 32*32 array would fit on a single chip.
5 Conclusions
This paper has shown the mapping and performance of basic matrix operations on a novel parallel DSP array architecture. These matrix operations, basic to many signal processing algorithms, include matrix inversion, matrix/vector multiplication, solving systems of linear equations, matrix addition/subtraction and matrix transposition. Performance figures were given and compared in terms of throughput, latency and execution times. Data for a VLSI test chip in 0.7 µm CMOS was presented, together with estimates of performance and chip size using a state-of-the-art process.
6 References
[1] J.H. Moreno, T. Lang, Matrix Computations on Systolic-Type Arrays, Kluwer Academic Publishers, ISBN 0-7923-9237-X, 1992.
[2] L. Bengtsson, "REMAP-γ: A Scalable SIMD VLSI Architecture with Hierarchical Control", PhD thesis no. 320, School of Electrical and Computer Engineering, Chalmers University of Technology, Gothenburg, Sweden, 1997.
[3] H.T. Kung, C.E. Leiserson, "Algorithms for VLSI Processor Arrays", in Introduction to VLSI Systems, Mead & Conway, Addison-Wesley, 1980.
[4] R.H. Dennard, F.H. Gaensslen, H.N. Yu, V.L. Rideout, E. Bassous, A.R. LeBlanc, "Design of ion-implanted MOSFET's with very small physical dimensions", IEEE J. Solid-State Circuits, vol. SC-9, p. 256, 1974.
[5] K.C. Saraswat, F. Mohammadi, "Effect of Scaling of Interconnections on the Time Delay of VLSI Circuits", IEEE J. Solid-State Circuits, vol. SC-17, no. 2, pp. 275-280, April 1982.