BASIC MATRIX OPERATIONS ON A DSP ARRAY ARCHITECTURE

LARS BENGTSSON, STEFAN LUND
Halmstad University, Centre for Computer Systems Architecture,
Box 823, S-301 18 Halmstad, Sweden
lars.bengtsson@ide.hh.se, stefan.lund@ide.hh.se
Abstract. Many processing tasks in signal and image processing can be reduced to a common set of basic matrix primitives, e.g. matrix/vector multiplication, matrix transposition and inversion, and solving systems of equations. A highly parallel array architecture for such applications is presented, and it is shown how some frequently used matrix operations are performed. The array, consisting of PEs interconnected as a 2D grid, executes instructions according to the SIMD (Single Instruction Multiple Data) parallel computing style. It is scalable, both in terms of problem size and when porting it to future down-scaled CMOS processes.

Keywords: Parallel DSP, Multi-channel signal processing, Matrix computations, Architecture and implementation.
1 Introduction
The major computational requirements for many real-time processing tasks in signal and image processing can be reduced to a common set of basic matrix primitives [1]. This set includes matrix/vector multiplication, matrix/matrix multiplication and addition, matrix inversion, solution of linear systems, eigensystem solution, matrix decomposition (LU, QR and singular value decomposition) and the generalized SVD algorithm.
This paper presents the way in which such matrix computations are performed on a parallel DSP array architecture. In addition to complex operations such as matrix inversion, matrix/vector multiplication and solving a system of linear equations, simpler operations such as matrix addition/subtraction and matrix transposition are presented.
2 The “REMAP-γ” DSP array architecture
Figure 1 shows the array architecture. Based on the SIMD parallel computing model, a central Control Unit (CU) generates and issues instructions (control, address and timing) to the processors (PEs) in the array. All PEs receive the same instruction and perform the same operation (a "minor" local control modification is possible). In its classical definition, the model is fully synchronous, and the timing and synchronization are provided by the CU broadcasting a central global clock signal. A status signal, "array status" (typically the OR sum of one status bit per PE), is read by the CU to monitor the state of the array.
The PEs, interconnected as a 2D array with nearest-neighbour connections, are optimized to perform MAC operations (fixed-point representation). Input data is received as a vector, one vector item per array column, at the top array border. After completed processing, an output vector is produced at the eastern array border.
Figure 1. The SIMD array architecture.
2.1 Processing Elements
The processing elements used in the architecture have a bit-serial data path including a (bit-parallel) register file. Each PE can address the registers in the file independently of the other PEs using an index register, X. The global CU supplies the base address (common to all PEs), and each PE adds its own 8-bit offset (held in the ACCUMULATOR) to this base, yielding the final address. This facilitates "local address modification", useful in table lookup operations (e.g. non-linear functions).
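A minimal sketch of this addressing scheme in Python, with a made-up lookup table and four PEs purely for illustration (register-file size and base address follow the text, everything else is invented for the example):

REGISTER_FILE_SIZE = 256
BASE = 0x10                                                   # base address broadcast by the global CU
local_memory = [[0] * REGISTER_FILE_SIZE for _ in range(4)]   # one register file per PE (4 PEs here)
for pe in range(4):
    for k in range(16):                                       # a small per-PE table, e.g. a non-linear function
        local_memory[pe][BASE + k] = k * k + pe
accumulators = [3, 7, 0, 15]                                  # data-dependent 8-bit offsets, one per PE
print([local_memory[pe][BASE + (acc & 0xFF)]                  # one SIMD lookup, a different entry per PE
       for pe, acc in enumerate(accumulators)])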
Figure 2 shows the PE data path architecture. The central path is a bit-serial ALU with two 1-bit input buses (A and B), a carry-in input and two 1-bit outputs (S and COUT). S is fed to an accumulator (shift register), to a 1-bit register (R) and to the index register. COUT (carry out) is fed to a 1-bit register (C) for carry-save. All data transported through the PE flows through this ALU. The data may be modified by the ALU or simply pass through unaffected, with the same time delay. A T-bit is used for "tagged" (selective) memory writes.
Figure 2. The PE data path architecture.
Each PE has a dedicated 16-bit, two's-complement serial/parallel multiplier that makes multiplication in essence as fast (per bit) as addition/subtraction. It is fed by 16-bit parallel data (from the Q register) and by serial input data (from the PS register), least significant bit first. The output data is generated serially, least significant bit first. Figure 3 shows the multiplier structure.
Figure 3. The 16-bit two's-complement serial/parallel multiplier.
Communication with the neighbouring PEs takes place through four wires (North, East, South, West) feeding the ALU input bus B; output goes through a multiplexer where one of three sources is selected. The first two sources are the ACCUMULATOR and PS registers; these are used e.g. in the NORTH, SOUTH, EAST and WEST instructions, where data are shifted one PE step in the array. The third source (the 1-bit register R) is used in those instructions where data are passed through the PE on a bit-by-bit basis. Examples of instructions that use this are RBROADC (Row BROADCast) and RMAC (Row Multiply-and-ACcumulate). These instructions are described later in the text.
Input to the array is read from the north side. On the northern PE border, the N(orth) inputs are connected to parallel/serial conversion registers. In this way, the SOUTH instruction will push input data (16 bits) into the array from the north side. The SOUTH instruction also simultaneously outputs the rightmost PE column to the output interfaces, so array input and output are performed concurrently.
Output from the array is also done by the EAST instruction, which shifts ("pops") the PE accumulator contents one PE step to the east. The rightmost PEs (on the eastern border) have their outputs connected to serial/parallel conversion registers, which are serially filled with 16-bit data. These registers are readable/writeable by the global CU while the array is working.
2.2 The instruction set
This section describes the REMAP-γ instruction set. Instructions from the global CU are issued at the word level. Most instructions operate on 16-bit words; however, some generate 32 bits (e.g. the RMAC and CMAC instructions). The CU sends these instructions at "low" speed (typically 400/16 MHz), and they are interpreted and executed by the local CUs at "high" speed (typically 400 MHz) at the bit level. Table 1 below shows the 34 instructions and their functions (the arithmetic instructions use two's-complement data values).
Table 1. The Instruction Set

ADD (SUB): Add (subtract) register PS to (from) the ACC (16 bits).
MAC: Multiply the Q register with the PS register and accumulate in ACC. Result in ACC (32 bits).
RBROADC (pe, REG): The PE in column 'pe' broadcasts its 'REG' (ACC high word, or PS) on each row.
CBROADC (pe, REG): The PE in row 'pe' broadcasts its 'REG' (ACC high word, or PS) on each column.
RMAC: Multiply-and-accumulate along each row. Result in ACC (32 bits) of the PEs in the rightmost column.
CMAC: Multiply-and-accumulate along each column. Result in ACC (32 bits) of the PEs in the lowest row.
RBROADC_DIAG (REG): The PEs on the diagonal broadcast their 'REG' (ACC high word, or PS) on each row.
CBROADC_DIAG (REG): The PEs on the diagonal broadcast their 'REG' (ACC high word, or PS) on each column.
RMIN/RMAX: Find the min./max. value (in ACC) across each row. Result in ACC of the rightmost column.
CMIN/CMAX: Find the min./max. value (in ACC) across each column. Result in ACC of the lowest row.
ADDX: Add ACC (bits 23-16) to register X. Result in X.
NORTH (REG), SOUTH (REG), EAST (REG), WEST (REG): Shift 'REG' (ACC32, ACC16 or PS) to the next PE in the array.
STORE (ACC_HIGH, ACC_LOW): Write the ACC (high word or low word) to local memory (address in X).
STORE_T (ACC_HIGH, ACC_LOW): Write the ACC to local memory if the T-bit is set.
R_SELECTF: Searches the rows of PEs for set T-bits. Clears all but the first one, starting from the rightmost column.
C_SELECTF: Searches the columns of PEs for set T-bits. Clears all but the first one, starting from the bottom row.
CLEAR: Clear ACC, C, R, T and the multiplier S- and C-bits.
INIT: Initialize the ROW and COL registers in each PE.
AND (OR): Logical AND (OR) of the PS and ACC registers (16 bits).
LOAD_X: Load the X registers with data from the instruction parameter field (8 bits).
LOAD_PS, LOAD_Q: Load the PS (Q) registers with data from the local memory.
SHIFTACC (n): Shift the ACC register n steps to the right.
SHIFTACC_LEFT (n): Shift the ACC register n steps to the left.
DIV_RESTORE: Used in the Divide procedure.
GRT: Sets the T-bit if the PS register is greater than (or equal to) the ACC register (high word).
LESS: Sets the T-bit if the PS register is less than the ACC register (high word).
2.2.1 Broadcast instructions
The broadcast instructions broadcast (using only the PE nearest-neighbour connections) the ACCUMULATOR or PS registers of column 'pe' (RBROADC) or row 'pe' (CBROADC) on each row and column, respectively. The 'pe' argument indicates which PE column (RBROADC) or PE row (CBROADC) is the source of the broadcast. The RBROADC_DIAG and CBROADC_DIAG instructions use the PEs on the main diagonal as the source. Each PE uses the 'pe' argument (supplied in the instruction field) to determine whether it is the source or one of the destinations. The source PEs start broadcasting immediately, and the destination PEs wait the appropriate number of clock cycles (determined by the difference between 'pe' and their own ROW or COL number) before starting to receive the 16-bit bit stream.
The maximum time for the broadcast instructions depends on the array size. If the array size is N*N, the time is (N-2)+16 cycles. For a typical size of 16*16 PEs, the time required is 30 cycles.
When the complete broadcast word has been collected in a PE, it raises its PE_READY signal, informing the global CU that the instruction has been completed in this PE.
The broadcast instructions use a "bit-level pipelined" approach to distribute data among the rows and columns. Using only short nearest-neighbour connections, scalability is maintained when porting the architecture to future down-scaled deep-submicron CMOS processes [2].
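The (N-2)+16 figure can be reproduced with a small timing model of the bit-serial pipeline, under the assumption that the nearest neighbour receives the source output directly and every further hop adds one cycle of R-register delay (a sketch of one consistent reading, not the actual hardware):

def broadcast_cycles(N, word_bits=16):
    # Count cycles until the farthest PE on a row has received the whole word.
    source_bits = list(range(word_bits))    # the 16-bit word, LSB first
    r = [None] * (N - 2)                    # R registers of the intermediate PEs
    received = []                           # bits collected by the farthest PE
    cycles = 0
    while len(received) < word_bits:
        cycles += 1
        if r:
            incoming = r[-1]                # bit leaving the last intermediate PE
            r[1:] = r[:-1]                  # bits ripple one PE per cycle
            r[0] = source_bits.pop(0) if source_bits else None
        else:
            incoming = source_bits.pop(0) if source_bits else None
        if incoming is not None:
            received.append(incoming)
    return cycles

for n in (4, 16, 32):
    print(n, broadcast_cycles(n), (n - 2) + 16)   # the model matches (N-2)+16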
2.2.2 The Multiply-and-ACcumulate instructions
The multiply-and-accumulate instructions implement MAC operations either locally (MAC) or for each row (RMAC) or each column (CMAC). The RMAC and CMAC instructions use the same "bit-level pipelined" approach as the broadcast instructions. The principle is the same as described in Figure 5, with the addition that the bits are produced by the multipliers and accumulated along the way by the ALUs. Figure 6 illustrates this scheme.
Figure 6. Illustration of the RMAC instruction.
The time required for these instructions depends on the array size. If the array size is N*N, the time required for an RMAC or CMAC operation is N-1+log2(N)+32 cycles. For an array size of 16*16 PEs, this yields 51 clock cycles. For a 32*32 array, 68 cycles are required.
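A small helper reproduces these cycle counts, assuming the constant 32 corresponds to the double-length (32-bit) result word:

from math import log2

def rmac_cycles(N, word_bits=16):
    # RMAC/CMAC time on an N*N array, as given in the text.
    return (N - 1) + int(log2(N)) + 2 * word_bits

print(rmac_cycles(16), rmac_cycles(32))   # 51 and 68, as stated above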
3 Basic matrix primitives
This section presents the mapping and performance of some selected basic matrix operations on the architecture.
3.1 Multiplying a vector with the transpose of a matrix
In some algorithms (e.g. artificial neural networks, ANN), it is necessary, after first doing the ordinary matrix-by-vector multiplication (R=WX), to also multiply the result vector R with the transpose of the same matrix (W^T). (EQ 1) illustrates these calculations.
r_i = \sum_j w_{ij} x_j , \qquad s_i = \sum_j w^{T}_{ij} r_j = \sum_j w_{ji} r_j     (EQ 1)
These operations are easily executed by the array using the RMAC and CMAC instructions, as shown in the following example. The ordinary matrix-vector multiplication (R=WX) is first calculated (assuming the X vector resides in the top row) using the RMAC instruction, after which the result vector R resides in the eastern PE column #N-1. The R vector is then broadcast on the rows and the CMAC instruction is used to create the result vector S.

CBROADC (0,PS)
RMAC                  /* perform R=WX */
                      /* the accumulator is moved to the PS register */
RBROADC (N-1,PS)      /* broadcast the R vector from column #N-1 */
CMAC                  /* MAC across the columns */

The S vector now resides in the southern (highest numbered) row and may be moved to the eastern border, if needed, using only two further instructions (CBROADC (N-1), RBROADC_DIAG). Normally, however, as is the case in the ANN training algorithm "back-propagation", this operation is one part of a long sequence and the result is not output but used further in the calculations.
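A quick numerical check of this two-step sequence, using numpy as a stand-in for the array:

import numpy as np

N = 4
rng = np.random.default_rng(0)
W = rng.standard_normal((N, N))
X = rng.standard_normal(N)

R = W @ X       # RMAC: each PE row accumulates W[i, j] * X[j]; result in the east column
S = W.T @ R     # CMAC: after broadcasting R on the rows, each column accumulates W[i, j] * R[i]

print(np.allclose(S, W.T @ (W @ X)))   # True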
3.2 Vector transpose
Transposing a vector is straightforward and uses only two instructions (assuming the vector resides in the top row). First, the vector is broadcast on the columns by the CBROADC instruction. Second, the RBROADC_DIAG instruction broadcasts the diagonal elements on the rows. The transposed vector now resides in the rightmost column.
The instruction sequence for vector transpose is:

SOUTH (PS)            /* input the vector */
CBROADC (0,PS)        /* broadcast it on the columns */
RBROADC_DIAG (PS)     /* the diagonal PEs do a row broadcast */
EAST (PS)             /* output the vector */
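A grid-level sketch of the same two-instruction sequence, modelling the PE array as a numpy matrix:

import numpy as np

N = 4
v = np.arange(1.0, N + 1)            # vector in the top PE row, one element per column

grid = np.tile(v, (N, 1))            # CBROADC (0, PS): grid[i, j] = v[j] for every row i
east_column = grid.diagonal().copy() # RBROADC_DIAG (PS): row i receives grid[i, i] = v[i]

print(east_column)                   # v now lies along the rows, i.e. in the east column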
3.3 Matrix addition and subtraction
Addition (or subtraction) is done in parallel using the ADD (SUB) instruction. If two matrices, A and B, are to be added, the instructions issued are (assuming that the A matrix elements reside in the ACCumulators):

LOAD_X (B)            /* load X register with the address of B */
LOAD_PS               /* load B elements into the PS registers */
ADD (or SUB)          /* add/subtract the A and B items */
STORE (ACC_HIGH)      /* store the result in B */
3.4 Matrix inversion
To invert a matrix A, the Gauss-Jordan elimination method may be used. This method is normally used to solve a linear system of equations. This section first describes how this solving can be done on the array. It is then shown how this can be extended to perform matrix inversion.
3.4.1 The Gauss-Jordan elimination method
This algorithm solves a system of linear equations with a variation of Gauss elimination. No back-substitution is necessary to complete the solution: all off-diagonal elements are eliminated, and the resulting matrix and vector require only one division per element to produce the solution vector.
An example linear system of equations with three variables is shown below.
\begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} B_1 \\ B_2 \\ B_3 \end{bmatrix}     (EQ 2)
The algorithm is as follows (showing the elimination of element A21):
• Calculate E = -A21 / A11
• For J=1 to 3, calculate A2J = A2J + A1J * E
• Calculate B2 = B2 + B1 * E
This proceeds, one column at a time, until all off-diagonal elements have been eliminated. Finally, to obtain the solution vector:
• For i=1 to 3, calculate Xi = Bi / Ai,i
The general algorithm (in its sequential form) can be expressed using the following pseudo-code:
FOR c IN 1 TO NO_COLS LOOP            /* for each column */
  FOR r IN 1 TO NO_ROWS LOOP          /* for each row */
    IF r=c CONTINUE;                  /* skip to the next row if r=c */
    Er = -Arc/Acc;
    FOR j IN 1 TO NO_COLS LOOP        /* for each column in A */
      Arj = Arj + Acj*Er;
    ENDLOOP;
    Br = Br + Bc*Er;
  ENDLOOP;
ENDLOOP;
FOR r IN 1 TO NO_ROWS LOOP            /* for each A row */
  Xr = Br / Ar,r
ENDLOOP
Parallelizing this sequential code yields the following pseudo-code:

FOR i IN 1 TO NO_COLS LOOP
  Broadcast row #i on the columns;
  Er = -Ari/Aii for each row r not equal to i; Ei = 0;
  Broadcast Er on the rows;
  Arc = Arc + Aic*Er;                 /* for every column c */
  Br = Br + Bi*Er;
ENDLOOP
No row pivoting (row swapping) is used here, and thus cases in which a diagonal element is zero (Aii=0 or Bi=0) will produce a wrong result. Searching for the maximum element and row swapping may be added to cope with this situation.
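A plain Python/numpy rendering of the sequential pseudo-code above (no pivoting, with the per-row division at the end):

import numpy as np

def gauss_jordan_solve(A, B):
    # Eliminate all off-diagonal elements column by column (no pivoting,
    # so the diagonal elements must stay non-zero), then divide per row.
    A = A.astype(float)
    B = B.astype(float)
    n = A.shape[0]
    for c in range(n):                       # for each column
        for r in range(n):                   # for each row
            if r == c:
                continue                     # skip the pivot row
            e = -A[r, c] / A[c, c]
            A[r, :] += A[c, :] * e           # adjust the whole row of A ...
            B[r] += B[c] * e                 # ... and the corresponding B element
    return B / np.diag(A)                    # x_i = B_i / A_ii

A = np.array([[4.0, 1.0, 2.0],
              [1.0, 3.0, 0.0],
              [2.0, 0.0, 5.0]])
B = np.array([1.0, 2.0, 3.0])
print(gauss_jordan_solve(A, B), np.linalg.solve(A, B))   # the two should agree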
3.4.2 Executing Gauss-Jordan on the array
This section shows the code segment needed to execute the algorithm. The coefficient matrix A and the result vector B are first loaded into the PE array by storing them in the PE local memories. The A matrix is stored in columns 0 to #NO_COLS-1, and the B vector is stored in the rightmost column (#NO_COLS). Also, a help matrix, H, is stored with an initial content of:
H = \begin{bmatrix} 0 & 0 & \cdots & 0 \\ 1 & 1 & \cdots & 1 \\ \vdots & \vdots & & \vdots \\ 1 & 1 & \cdots & 1 \end{bmatrix}     (EQ 3)
This help matrix is used to inhibit adjustment of row #i during the elimination process.
The elimination executed on the array is performed column by column. It starts from the leftmost column (column #0), eliminating all off-diagonal elements in this column, and proceeds towards the rightmost column.
The instructions broadcast from the global CU are:
FOR i IN 0 TO NO_COLS-1 LOOP    /* for each column */
  Load_acc (A_B)
  CBROADC (ACC,i)               /* TEMPc <= Aic, or TEMPc <= Bi */
  Store_acc_hw (TEMP)
  Divide (A_B,TEMP)
  Store_acc_lw (E)
  Multiply (H,E)                /* clear all E values on row i */
  RBROADC (ACC,i)               /* broadcast the E values on the rows */
  Store_acc_hw (E)              /* store Er <= Arc/TEMPc * Hrc */
  Multiply (TEMP,E)
  Store_acc_hw (SLASK)
  Load_acc (A_B)
  LOAD_X (SLASK)
  LOAD_PS
  SUB
  Store_acc_hw (A_B)            /* store Arc = Arc - TEMPc*Er, or Br = Br - TEMPc*Er */
  Load_acc (H)
  SOUTH (ACC16)                 /* shift the help matrix down one PE step;
                                   "ones" are shifted in to the top row */
  Store_acc_hw (H)
ENDLOOP
Load_acc (A_B)
RBROADC_DIAG (ACC)              /* get the diagonal elements */
Store_acc_hw (SLASK)
Divide (A_B, SLASK)             /* the output vector is now in ACC (low word) in col. 3 */
The procedure (macro) "Multiply" above contains the following instructions:

Multiply (K,L):
  CLEAR
  LOAD_X (K)
  LOAD_Q
  LOAD_X (L)
  LOAD_PS
  MAC
The macros "Load_acc ()", "Divide ()", "Store_acc_hw ()" and "Store_acc_lw ()" load the ACC, divide Q by M, and store the ACC high word (bits 31-16, "hw") or the ACC low word (bits 15-0, "lw"), respectively.
Solving a 31*31 equation system (using a 32*32 array) requires 51040 cycles. At a 100 MHz clock frequency this corresponds to 510 µs.
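A word-level numpy model of the loop above, with the help matrix H shifted down one row per iteration. It is a sketch: word lengths and rounding are ignored, and only the column-i values of E are computed, since only they are used by the row broadcast.

import numpy as np

def gauss_jordan_on_array(A, B):
    n = A.shape[0]
    AB = np.hstack([A.astype(float), B.reshape(-1, 1).astype(float)])   # the A_B plane
    H = np.ones_like(AB)
    H[0, :] = 0.0                                  # EQ 3: zeros in the first row
    for i in range(n):                             # for each column
        TEMP = np.tile(AB[i, :], (n, 1))           # CBROADC (ACC, i): row i down the columns
        E = AB[:, i] / AB[i, i] * H[:, i]          # Divide + Multiply (H, E), column i only
        E = np.tile(E.reshape(-1, 1), (1, n + 1))  # RBROADC (ACC, i): column i along the rows
        AB = AB - TEMP * E                         # SUB: A_rc = A_rc - TEMP_c * E_r
        H = np.roll(H, 1, axis=0); H[0, :] = 1.0   # SOUTH (ACC16): H down, ones in at the top
    return AB[:, n] / np.diag(AB[:, :n])           # final Divide with the diagonal elements

A = np.array([[4.0, 1.0, 2.0], [1.0, 3.0, 0.0], [2.0, 0.0, 5.0]])
B = np.array([1.0, 2.0, 3.0])
print(gauss_jordan_on_array(A, B), np.linalg.solve(A, B))   # the two should agree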
3.4.3 Matrix inversion using Gauss-Jordan elimination
To show that the Gauss-Jordan elimination method can perform matrix inversion, consider the following. First, use the relationship

A A^{-1} = I     (EQ 4)

where I is the unity matrix. Then, use the notation

A X = B     (EQ 5)

Identifying X and B in (EQ 4) yields

X = A^{-1}, \quad B = I     (EQ 6)

Thus, by using Gauss-Jordan elimination with matrices A and B, where B is initialized as the unity matrix I, we can calculate the inverse matrix A^{-1}.
As an example, to find the inverse of the 3*3 matrix A, we start with the following:

\begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{bmatrix} \begin{bmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \\ x_{31} & x_{32} & x_{33} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}     (EQ 7)

The A matrix is loaded into the first three columns of the array. The unity matrix is loaded into the last three columns. The elimination is now performed as in section 3.4.2, with the exception that three columns of the B matrix are handled instead of only one.
The number of clock cycles required to invert a 32*32 matrix using a 32*64 array is 53771 cycles. At a 100 MHz clock frequency this corresponds to 538 µs.
Mapping the algorithm this way means that the array is not quadratic. If this is not desirable, a quadratic array can be used if the B matrix is stored in the same PEs as the A matrix (in different memory positions). Of course, this slows down the algorithm execution, because the A and B elements can in this case not be calculated (adjusted) in parallel.
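The same elimination with B initialized as the unity matrix, sketched in numpy (no pivoting, as in section 3.4.1); all B columns are adjusted together, which is what the wider 32*64 mapping provides in hardware:

import numpy as np

def gauss_jordan_invert(A):
    A = A.astype(float)
    n = A.shape[0]
    B = np.eye(n)                         # B initialized as the unity matrix I
    for c in range(n):
        for r in range(n):
            if r == c:
                continue
            e = -A[r, c] / A[c, c]
            A[r, :] += A[c, :] * e
            B[r, :] += B[c, :] * e        # all B columns are adjusted in parallel
    return B / np.diag(A)[:, None]        # divide each row by its diagonal element

A = np.array([[4.0, 1.0, 2.0], [1.0, 3.0, 0.0], [2.0, 0.0, 5.0]])
print(np.allclose(gauss_jordan_invert(A), np.linalg.inv(A)))   # True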
3.5 Performing matrix-vector multiplication
Many signal processing algorithms can basically be formulated as a matrix-by-vector multiplication problem [3]. REMAP-γ can perform matrix/vector multiplication using three separate methods, each with different performance characteristics regarding throughput, latency and the amount of hardware needed. These are:
1. Using column broadcast and the RMAC instruction ("RMAC").
2. Using a systolic type of processing with the local MAC instruction and column broadcast ("pseudo-systolic 1").
3. Using a systolic type of processing with the local MAC and nearest-neighbour PE communication only ("pseudo-systolic 2").
3.5.1 Matrix/vector multiplication using column broadcast and the RMAC instruction
The matrix/vector multiplication is basically performed by issuing two instructions, CBROADC (0) and RMAC (assuming the matrix is already loaded in the Q registers of the PEs). First, the vector is input to row #0. Second, this vector is broadcast on the columns using the CBROADC (0) instruction. Third, the RMAC instruction performs the multiply-and-accumulate between the matrix and the vector elements. The result vector is found in the rightmost column. Figure 7 illustrates this with matrix A, input vector X and output vector O.
Figure 7. Matrix/vector multiplication using column broadcast and RMAC.
In this method a new matrix-by-vector multiplication is not started until the previous one has completed. The main benefit is the very low latency (see Table 2) for a specific vector, measured as the time delay between input and output.
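One way to account for the 146-cycle latency quoted in Table 2 for a 32*32 array, assuming a 16-bit input/output word and the cycle counts from sections 2.1 and 2.2 (the paper does not give this breakdown, so it is an interpretation):

from math import log2

N, W = 32, 16
south   = W                                # SOUTH: push the 16-bit input vector into row 0
cbroadc = (N - 2) + W                      # CBROADC (0): column broadcast, (N-2)+16 cycles
rmac    = (N - 1) + int(log2(N)) + 2 * W   # RMAC: N-1+log2(N)+32 cycles
east    = W                                # EAST: shift the 16-bit result out of the east edge
print(south + cbroadc + rmac + east)       # 16 + 46 + 68 + 16 = 146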
3.5.2 Matrix/vector multiplication using the local MAC instruction and column broadcast ("pseudo-systolic 1")
This method uses a systolic-like processing style where the input vectors are broadcast on the columns and the accumulated sums are shifted east, one PE step at a time, through the array. The local MAC instruction produces the products and accumulates the sum. The result vector appears at the eastern output after some latency.
The procedure is as follows: the input vector is shifted down (SOUTH) one PE step (this also inputs the next vector at the same time). The CBROADC_DIAG instruction is then used to broadcast the vectors on the respective columns, using the diagonal PE elements as sources. Next, the MAC instruction produces the products and accumulates the sums. Finally, the accumulated sums are shifted one PE step to the east (at the same time outputting a result vector).
Figure 8 shows this principle with matrix A and the input vector stream F, G, H, I, ...
In the systolic methods, where vectors are pipelined through the array, a new result is produced in each loop (after N initial loops). However, there is a "high" latency (equal to N loops) for a given vector.
Figure 8. "Pseudo-systolic 1" matrix/vector multiplication.
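A numpy sketch of this dataflow, with the matrix stationary and the accumulators moving east one column per loop (zero vectors are fed in at the end to flush the pipeline):

import numpy as np

def pseudo_systolic_1(A, vectors):
    N = A.shape[0]
    rows = np.zeros((N, N))            # rows[r]: vector currently held by PE row r
    acc = np.zeros((N, N))             # acc[r, c]: accumulator of PE (r, c)
    outputs = []
    for v in list(vectors) + [np.zeros(N)] * N:          # zero vectors flush the pipeline
        rows = np.roll(rows, 1, axis=0); rows[0] = v     # SOUTH: vectors move down, new vector in
        x = rows.diagonal()                              # CBROADC_DIAG: column c gets rows[c, c]
        acc = acc + A * x                                # MAC: acc[r, c] += A[r, c] * x[c]
        outputs.append(acc[:, -1].copy())                # EAST: the east column is output ...
        acc = np.roll(acc, 1, axis=1); acc[:, 0] = 0     # ... while the accumulators shift east
    return outputs

A = np.arange(1.0, 10.0).reshape(3, 3)
f, g = np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0, 6.0])
out = pseudo_systolic_1(A, [f, g])
print(out[2], A @ f)    # the first result appears after N = 3 loops
print(out[3], A @ g)    # after that, one new result per loop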
3.5.3 Using systolic processing and skew/deskew external registers ("pseudo-systolic 2")
This method uses systolic processing where the input vector is delayed (skewed) and the output vector is deskewed according to Figure 10. The local MAC instruction is used to create the local product and add it to the accumulated sum shifted in from the left PE neighbour. The column broadcast is not necessary here, as it is in the first two methods.
This method yields the highest throughput but has a high latency (although not as high as "pseudo-systolic 1"). It requires extra hardware to skew/deskew the input and output vectors (shown in Figure 9). Each of these registers has a size equal to the PS (or Q) register in the PE data path.
Figure 9. Skew and deskew registers needed in the "pseudo-systolic 2" matrix/vector multiplication case.
Figure 10 shows how this matrix/vector multiplication is performed by the array. The matrix is stored in the PEs' Q registers, i.e. one matrix element in each PE. In each step, the vectors are shifted one step south, and the accumulated sums are shifted one step to the east. A result is produced in each step, and the result vector is found by deskewing the result vectors from the east-edge PEs.
Figure 10. "Pseudo-systolic 2" matrix/vector multiplication with skew/deskew of input and output data.
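A numpy sketch of the skew/MAC/deskew dataflow described above; the skew and deskew sections are modelled as index arithmetic rather than as explicit registers:

import numpy as np

def pseudo_systolic_2(A, vectors):
    N = A.shape[0]
    n_vec = len(vectors)
    x = np.zeros((N, N))               # x[i, j]: vector element held by PE (i, j)
    s = np.zeros((N, N))               # s[i, j]: partial sum held by PE (i, j)
    east = []                          # east-edge values, one column vector per step
    for t in range(n_vec + 2 * N):     # enough steps to flush the pipeline
        # skew section: element j of vector k enters column j at step t = k + j
        top = np.array([vectors[t - j][j] if 0 <= t - j < n_vec else 0.0
                        for j in range(N)])
        x = np.roll(x, 1, axis=0); x[0] = top          # vector elements move south
        s = np.roll(s, 1, axis=1); s[:, 0] = 0.0       # partial sums move east
        s = s + A * x                                  # local MAC in every PE
        east.append(s[:, -1].copy())
    # deskew section: row i of the result for vector k appears at step k + i + N - 1
    return [np.array([east[k + i + N - 1][i] for i in range(N)])
            for k in range(n_vec)]

A = np.arange(1.0, 10.0).reshape(3, 3)
vecs = [np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0, 6.0])]
for r, v in zip(pseudo_systolic_2(A, vecs), vecs):
    print(r, A @ v)     # each deskewed result equals A @ v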
Table 2 shows a comparison of the performance of the three matrix/vector multiplication methods in terms of sustained throughput and latency. The clock frequency assumed is 100 MHz and the array size is 32*32.
Table 2. Matrix/vector performance on a 32*32 array (@ 100 MHz)

Method                  Sustained throughput (GOPS)    Latency (clock cycles)    Latency (µs)
"RMAC"                  1.4                            146                       1.46
"pseudo-systolic 1"     1.6                            4032                      40
"pseudo-systolic 2"     2.56                           2560                      25.6
As Table 2 reveals, the "pseudo-systolic 2" method is superior with respect to throughput (but requires extra skew and deskew registers), and the "RMAC" method is superior with respect to latency.
4 VLSI test implementation
Two test prototype chips, one with 16 PEs (4*4) and one with 64 PEs (8*8), have been designed using VHDL synthesis and standard cells. The technology used was the ES2 0.7-micron N-well CMOS double-layer-metal process. The physical layout was created with Cadence place & route tools, which resulted in an array area of 225 mm² (8*8 PEs) and a clock speed of 100 MHz.
The chip (4*4) block diagram is shown in Figure 11. As can be seen, the parallel-to-serial and serial-to-parallel conversion interfaces at the northern and eastern borders are included on-chip. These interfaces can, through the use of multiplexers, be bypassed on those PE chips that are not placed at the array borders when a multiple-chip array is constructed.
The input interface (µPI_IN) includes four parallel-in/serial-out shift registers. The output interface (µPI_OUT) includes four serial-in/parallel-out shift registers.
Figure 11. Test chip (4*4) block diagram.
The 8*8 test chip has the same type of block diagram but has 64 PEs, eight input registers and eight output registers.
Table 3 below summarizes the most important design parameters of the 8*8 test design.

Table 3. Summary of test chip (8*8) data

Chip parameter                                    Data
Technology                                        0.7 µm CMOS, double-layer metal
Clock frequency (MHz)                             100
Array area (mm²)                                  225 (15 x 15)
Number of cells (excluding register memory)       80450
Estimated power dissipation (W)                   < 12
Using scaling rules for CMOS [4] and [5], it is estimated that when using a state-of-the-art CMOS process (0.18 µm), approximately four times higher clock speed (400 MHz) and 16 (4²) times smaller area should be expected. Thus, a 32*32 array would fit on a single chip.
5 Conclusions
This paper has shown the mapping and performance of basic matrix operations on a novel parallel DSP array architecture. These matrix operations, basic to many signal processing algorithms, include matrix inversion, matrix/vector multiplication, solving systems of linear equations, matrix addition/subtraction and matrix transposition. Performance figures were given and compared in terms of throughput, latency and execution times. Data for a VLSI test chip in 0.7 µm CMOS was presented, together with estimates of performance and chip size using a state-of-the-art process.
6 References
[1] J.H. Moreno, T. Lang, Matrix Computations on Systolic-Type Arrays, Kluwer Academic Publishers, ISBN 0-7923-9237-X, 1992.
[2] L. Bengtsson, "REMAP-γ: A Scalable SIMD VLSI Architecture with Hierarchical Control", PhD thesis no. 320, School of Electrical and Computer Engineering, Chalmers University of Technology, Gothenburg, Sweden, 1997.
[3] H.T. Kung, C.E. Leiserson, "Algorithms for VLSI Processor Arrays", in Introduction to VLSI Systems, Mead & Conway, Addison-Wesley, 1980.
[4] R.H. Dennard, F.H. Gaensslen, H.N. Yu, V.L. Rideout, E. Bassous, A.R. LeBlanc, "Design of ion-implanted MOSFET's with very small physical dimensions", IEEE J. Solid-State Circuits, vol. SC-9, p. 256, 1974.
[5] K.C. Saraswat, F. Mohammadi, "Effect of Scaling of Interconnections on the Time Delay of VLSI Circuits", IEEE J. Solid-State Circuits, vol. SC-17, no. 2, pp. 275-280, April 1982.