
Scheduling collective communications on wormhole fat cubes

November 2005


DOI:10.1109/CAHPC.2005.39

Source
IEEE Xplore

Conference: Computer Architecture and High Performance Computing, 2005. SBAC-PAD 2005. 17th International Symposium on

Authors:

Václav Dvorák

Brno University of Technology



Scheduling Collective Communications on Wormhole Fat Cubes

Vaclav Dvorak

Brno University of Technology

dvorak@fit.vutbr.cz

Abstract

Renewed interest in the hypercube interconnection network has recently concentrated on a more scalable and usually cheaper variant known as the fat cube. This paper generalizes the known results on the time complexity of collective communications on a hypercube to the wormhole fat cube. Examples of particular communication algorithms on the 2D fat cube topology with 8 processors are summarized and given in detail. The study shows that a large variety of fat cubes can provide lower cost, better scalability and manufacturability without compromising communication performance.

1. Introduction

One of the greatest challenges faced by designers of digital systems at present is optimizing the communication and interconnection between system components. As more and more processor cores and other large reusable components are integrated on a single silicon die (MPSoCs, Multiprocessor Systems-on-Chips, [1]), many traditional multiprocessing techniques are being modified or developed anew. The interconnection network, a fundamental component of every parallel system, and communication algorithms are no exception. Buses are being replaced by crossbars or by direct interconnection networks. Direct networks have basically converged on pipelined (wormhole) message transmission and source-based routing algorithms, so the major difference among them is in topology.

The well-known binary hypercube (HC) topology is characterized by $P = 2^d$ nodes, naturally organized in $d$ dimensions, where $d$ is also the node degree. The worst-case distance between two nodes (the network diameter $D$) is logarithmic, $D = d = \log P$. The HC topology is node and edge symmetric, which simplifies the design of parallel algorithms tremendously. Computation can start in any node and the source code remains the same. Communication can likewise start in any dimension. Optimal algorithms for collective communication operations exist in almost all communication models. This is why the HC topology is commonly considered the best topology from the algorithmic and communication point of view. The HC topology can also simulate almost any other topology efficiently. Its drawbacks are the non-constant (logarithmic) degree $d = \log P$, the consequently high number of communication channels, and only partial scalability, as the number of nodes $P$ is restricted to powers of 2.

Topologies derived from the binary HC, such as cube-connected cycles and wrap-around or ordinary butterflies [2], eliminate the drawback of non-constant node degree. They are constructed by expanding the HC vertices into cycles or linear arrays and have a small constant degree and a logarithmic diameter as before. The bisection width $2^d = P/d$ is slightly worse than the value $P$ for the hypercube, and so is the scalability, since the number of processors is $P = d\,2^d$, i.e. only 8, 24, 64, etc.

Another useful alternative is a much more scalable topology called a “fat cube” (FC). The vertices of the HC are again expanded, but now into sets of processors connected by the crossbar switch inside the router. Scalability is improved since a node can contain any number of processors, $P = m\,2^d$, $m = 1, 2, 3$, etc. The node degree grows more slowly than in the HC, $d = \log(P/m)$, and the bisection width can be adjusted by multiple links between nodes. Owing to these favorable features, the FC topology has recently been used, e.g., in the commercial DSM NUMA machine Origin 3000 (SGI). Fat nodes with 4 Opteron processors have also been used in a 3D-FC connection [3], and nodes with 8 CPUs are connected into a K-ring network in the Swiss-T1 cluster [4]. The FC topology is also expected to appear in future networking systems for MPSoCs, because mapping an FC into 2-D space is easier than in the case of the “thin” HC.
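The scalability argument of the last three paragraphs can be illustrated with a short Python sketch. It is not part of the original study; the function names are hypothetical, and the ranges $d \ge 1$ (hypercube, fat cube) and $d \ge 2$ (expanded topologies) are assumptions chosen to reproduce the sizes quoted in the text.

```python
def hypercube_sizes(max_d):
    """Processor counts P = 2**d of binary hypercubes, d = 1..max_d."""
    return [2 ** d for d in range(1, max_d + 1)]


def expanded_cycle_sizes(max_d):
    """Processor counts P = d * 2**d (cube-connected cycles, butterflies), d = 2..max_d."""
    return [d * 2 ** d for d in range(2, max_d + 1)]


def fat_cube_sizes(max_p):
    """All processor counts P = m * 2**d <= max_p with m >= 1 and d >= 1."""
    sizes = set()
    d = 1
    while 2 ** d <= max_p:
        for m in range(1, max_p // 2 ** d + 1):
            sizes.add(m * 2 ** d)
        d += 1
    return sorted(sizes)


print(hypercube_sizes(6))       # [2, 4, 8, 16, 32, 64]
print(expanded_cycle_sizes(4))  # [8, 24, 64]
print(fat_cube_sizes(16))       # [2, 4, 6, 8, 10, 12, 14, 16]
```

Under these assumptions every even processor count can be realized by some fat cube, whereas the hypercube and the expanded topologies offer only a sparse set of sizes.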

In the rest of the paper we look at the router architecture in Section 2 and present the details of the hardware cost calculation in Section 3. The complexity of collective communications is analyzed in Section 4. Section 5 is a case study involving important collective communications on the 8-processor FC. Finally, Section 6 concludes the paper.

Proceedings of the 17th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD’05)

2. Fat cube and router architecture

Let us recall the notation introduced above and establish some new notions related to the FC topology. Drawings of two instances of this topology are shown in Figure 1. We use the following parameters:

$d$ – dimensionality of the FC/HC

$D$ – network diameter

$m$ – number of processors per fat node, an integer greater than 1

$P$ – processor count, $P = m\,2^d$ (the FC), $P = 2^d$ (the HC)

$d'$ – dimensionality of the HC with the same number of processors as the FC, $d' = \log P = d + \lceil \log m \rceil$ (binary log is the default)

$f$ – multiplicity of external links

$L$ – the number of external links in an FC network, $L = f\,d\,2^{d-1}$. Each link consists of two channels in opposite directions.

Figure 1. Examples of fat cube networks.

a) P = 16, m = 4, d = 2, f = 2

b) P = 16, m = 2, d = 3, f = 1

The design of communication algorithms depends strongly on the model used to describe the parameters of the underlying communication hardware. These models have to address key characteristics of interconnection networks, such as the switching technique, channel type, message combining capability and the router model. The possible options in communication architecture are:

1. SF | WH | CS | VCT – store-and-forward, wormhole, circuit, and virtual-cut-through switching techniques

2. HD | FD | S – half-duplex, full-duplex, simplex links

3. NC | C – non-combining/combining model, capable (or not) of combining or extracting partial messages with negligible overhead

4. one-port (1) | d-port (d) – router model.

The router model for fat nodes deserves some explanation, because it is a generalization of the router model used with thin nodes. In the simplest case, processors are connected to the router by a single link as in Fig. 1. This so-called one-port model (“1”) allows each of the $m$ or fewer processors to send a message either outside to a remote processor or to a local processor inside the same node, Fig. 2. In the d-port model (“d”) each processor can send up to $d$ distinct messages simultaneously, either outside or locally. In fact, both models are special cases of the “k-port” model, where $k = 1$ or $d$. In the context of the traditional hypercube ($m = 1$ and $f = 1$) these models are known as the one-port and all-port models.

Figure 2. Router models for fat nodes (m = 2, d = 2, P = 8, f = 1).

1) one-port model   d) d-port model

3. Cost of a fat cube network

The cost of the interconnection network has two components: the external link cost $C_L$ and the router cost $C_R$. If we disregard manufacturability, the external link cost $C_L$ can be taken simply as the number of these links, $C_L = L = f\,d\,2^{d-1}$. The router cost, given mainly by the cost of an $a \times b$ crossbar switch with $a$ input ports and $b$ output ports, is commonly taken as $ab$.

Let us compare the fat cube cost $C$ and the hypercube cost $C'$ and find under which condition the fat cube network is cheaper. If both networks have the same number of processors $P = P'$, then

$$m\,2^d = 2^{d'} \quad \text{or} \quad d' = d + \log m .$$

The lower link cost of the fat cube, $C_L < C_L'$,

$$C_L = f\,d\,2^{d-1} < C_L' = d'\,2^{d'-1} = m\,d'\,2^{d-1} \qquad (1)$$


implies $fd < md'$, which holds true because the dimensionality of a fat cube with $P$ processors is always lower than that of a hypercube and because mostly $f \le m$.
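The link cost comparison of relation (1) can be checked numerically with the following sketch, evaluated for the two networks of Figure 1. The helper names are hypothetical, and the code assumes $d \ge 1$ and $d' = d + \lceil \log m \rceil$ as defined in Section 2.

```python
def ceil_log2(m):
    """Ceiling of log2(m) for a positive integer m, using integer arithmetic."""
    return (m - 1).bit_length()


def fat_cube_link_cost(f, d):
    """C_L = L = f * d * 2**(d - 1) external links of a d-dimensional fat cube (d >= 1)."""
    return f * d * 2 ** (d - 1)


def hypercube_link_cost(m, d):
    """C_L' = d' * 2**(d' - 1) for the hypercube of dimension d' = d + ceil(log2 m)."""
    d_prime = d + ceil_log2(m)
    return d_prime * 2 ** (d_prime - 1)


for name, m, d, f in [("Fig. 1a", 4, 2, 2), ("Fig. 1b", 2, 3, 1)]:
    print(name, fat_cube_link_cost(f, d), hypercube_link_cost(m, d))
# Fig. 1a 8 32
# Fig. 1b 12 32
```

Both 16-processor fat cubes need far fewer external links than the 32 links of the 4D hypercube.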

The cost $C_R$ of all routers together depends on the type of the port model. Table 1 compares the total router costs $C_R$ and $C_R'$, each based on the product of input and output port counts, $p_{in} \times p_{out}$. Of course, we are especially interested in fat cubes with some cost advantage, i.e. when $C_R < C_R'$. By making use of relation (1) we can transform the condition of a lower cost into inequalities involving the parameters $m$, $f$ and $d$:

1) $(m + df)^2 \le m\,(1 + d + \log m)^2$   (2)

d) $d^2 (m + f)^2 \le 4m\,(d + \log m)^2$   (3)

Table 2 shows some numerically obtained solutions of inequalities (2) and (3) for $f = 1$ and 2.
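Inequalities (2) and (3) are easy to evaluate for a given configuration. The sketch below compares the total router costs listed in Table 1 directly; it is only an illustration, the function names are hypothetical, and $m$ is assumed to be a power of 2 so that $\log m$ is an integer.

```python
def ceil_log2(m):
    """Ceiling of log2(m) for a positive integer m."""
    return (m - 1).bit_length()


def fat_cube_router_cost(m, f, d, d_port):
    """Total router cost C_R: 2**d routers, each with p_in = p_out ports."""
    ports = d * (m + f) if d_port else m + d * f
    return 2 ** d * ports ** 2


def hypercube_router_cost(d_prime, all_port):
    """Total router cost C_R' of the hypercube (m = f = 1) of dimension d'."""
    ports = 2 * d_prime if all_port else 1 + d_prime
    return 2 ** d_prime * ports ** 2


def fat_cube_is_cheaper(m, f, d, d_port):
    """True if C_R <= C_R', i.e. inequality (2) (1-port) or (3) (d-port) holds."""
    d_prime = d + ceil_log2(m)  # assumes m is a power of 2
    return fat_cube_router_cost(m, f, d, d_port) <= hypercube_router_cost(d_prime, d_port)


print(fat_cube_is_cheaper(4, 2, 2, d_port=False))  # True  (Fig. 1a, 1-port)
print(fat_cube_is_cheaper(2, 1, 3, d_port=False))  # True  (Fig. 1b, 1-port)
print(fat_cube_is_cheaper(2, 1, 16, d_port=True))  # True
print(fat_cube_is_cheaper(2, 1, 17, d_port=True))  # False
```

The last two calls show the kind of threshold in $d$ that Table 2 tabulates.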

For example, both 1-port fat cube networks in Fig. 1 are cheaper than 1-port hypercubes with the same number of processors $P$. Now the question is what impact, if any, this lower hardware cost will have on communication performance. We will therefore investigate the performance of collective communications on a fat cube in the next section.

Table 1. Total router cost in the fat cube ($C_R$) and hypercube ($C_R'$) topology

           FC                                          HC (m = f = 1)
Model      $p_{in}, p_{out}$    $C_R$                  $p_{in}, p_{out}$    $C_R'$
1-port     $m + df$             $2^d (m + df)^2$       $1 + d'$             $2^{d'} (1 + d')^2$
d-port     $d(m + f)$           $2^d d^2 (m + f)^2$    $2d'$                $2^{d'}\, 4d'^2$

Table 2. Conditions ensuring that a fat cube is cheaper than the hypercube

f = 1      1-port       d-port
m = 2      all $d$      $d \le 16$
m = 4      all $d$      $d \le 8$
m = 8      all $d$      $d \le 5$

f = 2      1-port       d-port
m = 2      $d = 1$      $d \le 2$
m = 4      all $d$      $d \le 6$
m = 8      all $d$      $d \le 3$

4. Complexity of collective communications on the WH fat cube

Collective communications (CCs) are frequently used in all parallel algorithms. If their overhead is excessive, performance degrades rapidly with the processor count. When we refer to “collective communications”, we assume communications involving all processors. Seven types of such collective communications are: OAB (One-to-All Broadcast), OAS (One-to-All Scatter), AOG (All-to-One Gather), AOR (All-to-One Reduce), AAB (All-to-All Broadcast), AAR (All-to-All Reduce) and AAS (All-to-All Scatter) [5]. Since the complexities of some communications are similar (AOG ~ OAS, AOR ~ OAB, AAR ~ AAB), we will focus only on the 4 basic types (OAB, OAS, AAB, AAS). Each communication may be investigated with all possible model options, which gives too many distinct cases to explore. Therefore only the most important of them will be analyzed.

In the rest of the paper we assume that communication in WH networks proceeds in synchronized steps. In one step of a CC, a set of simultaneous packet transfers takes place along completely disjoint paths between source-destination node pairs. The complexity of a collective communication will be determined in terms of the number of these communication steps, $\underline{\tau}_{CC}(G)$ for the lower bound and $\overline{\tau}_{CC}(G)$ for the upper bound; if the network graph $G$ (HC or FC) is clear from the context, we will omit its symbol. This figure of merit does not take into account the message length (non-uniform in combining models) or its variations from one step to another.

Before analyzing communications on a fat cube, let us review the lower bounds on the number of steps $\underline{\tau}_{CC}$ in a hypercube network, Table 3. The lower bounds for all CCs on the WH hypercube, except OAB, are reachable by known optimal algorithms. The double-tree algorithm for OAB, Fig. 3, is optimal only for $d \le 6$. Other known algorithms are nearly optimal (e.g. the algorithm by Ho-Kao [5]).

In the following subsections we want to generalize the above results to the fat cube topology, restricted to non-combining WH models with FD links. Our approach uses the known algorithms for CC among nodes of the WH hypercube; this inter-node communication is followed or overlapped by the local CC within the nodes on the router crossbar (intra-node communication).

Table 3. Lower bounds on the time complexity of CCs on a hypercube

CC     1-port             all-port
OAB    $\log P\ (= d)$    $\lceil \log_{d+1} P \rceil = \lceil d/\log(d+1) \rceil$
AAB    $P - 1$            $\lceil (P-1)/d \rceil$
OAS    $P - 1$            $\lceil (P-1)/d \rceil$
AAS    $P - 1$            $P/2$
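The bounds of Table 3 can be evaluated with a few lines of Python. This is a sketch with hypothetical helper names; it uses integer arithmetic throughout to avoid rounding problems in the logarithms.

```python
def ceil_div(a, b):
    """Ceiling of a / b for positive integers."""
    return -(-a // b)


def ceil_log(base, x):
    """Smallest integer n such that base**n >= x."""
    n, power = 0, 1
    while power < x:
        power *= base
        n += 1
    return n


def hypercube_lower_bounds(d, all_port):
    """Lower bounds on the number of steps for a d-dimensional WH hypercube (Table 3)."""
    P = 2 ** d
    if all_port:
        return {
            "OAB": ceil_log(d + 1, P),
            "AAB": ceil_div(P - 1, d),
            "OAS": ceil_div(P - 1, d),
            "AAS": P // 2,
        }
    return {"OAB": d, "AAB": P - 1, "OAS": P - 1, "AAS": P - 1}


print(hypercube_lower_bounds(3, all_port=False))
# {'OAB': 3, 'AAB': 7, 'OAS': 7, 'AAS': 7}
print(hypercube_lower_bounds(3, all_port=True))
# {'OAB': 2, 'AAB': 3, 'OAS': 3, 'AAS': 4}
```

These are bounds only; the number of steps of a particular algorithm (for example the spanning binomial tree for OAB) may be higher.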


Figure 3. The “double tree” algorithm on a 4D-hypercube. S, S´ are the 1st and 2nd roots.

4.1. One-to-all broadcast (OAB) on a WH fat cube

This CC is not influenced by the type of the links (HD/FD) or by message (non)combining. Since just one message propagates in the network, multiple links cannot help.

Theorem 1. The complexity of OAB on the k-port WH fat cube, measured by the number of communication steps, is

$$\tau_{OAB} = \lceil \log_{k+1}(P/m) \rceil + \lceil \log_{k+1} m \rceil .$$

This upper bound can be reached for all $m$, $k = 1$ or $d$, and $d = \log(P/m) \le 6$.

Proof.

1) $k = 1$. OAB implemented by recursive doubling in the spanning binomial broadcast tree [5] doubles the number of informed nodes in each of

$$D = d = \lceil \log_{k+1}(P/m) \rceil = \lceil \log 2^d \rceil$$

steps. The recursive doubling continues inside the nodes with the use of the crossbar. This intra-node communication may be overlapped with the inter-node communication except in the last node, so that an additional $\lceil \log m \rceil$ steps are needed, q.e.d.

d) $k = d$. By making use of the double tree algorithm, which performs two partial OABs based on partial spanning binomial trees rooted in nodes S and S´, Fig. 3, the required number of steps is $\lceil d/2 \rceil$. However, $\lceil d/2 \rceil = \lceil \log_{d+1} 2^d \rceil$ for $d \le 6$. The intra-node communication is done using all $d$ ports in $\lceil \log_{d+1} m \rceil$ steps, q.e.d.

Let us note that

$$\lceil \log_{k+1}(P/m) \rceil + \lceil \log_{k+1} m \rceil \le \lceil \log_{k+1} P \rceil + 1,$$

so that the FC is never worse than the HC in OAB by more than 1 step.

4.2. All-to-all broadcast (AAB) on a WH fat cube

Optimal AAB algorithms for a hypercube matching the lower bounds in Table 3 are based on a Hamiltonian cycle (1-port model) and on so-called time-arc-disjoint spanning trees, TADTs (all-port model). All processors can use such broadcast trees synchronously with no conflicts. The following Theorem 2 establishes the complexity of AAB on a fat cube, namely the upper bounds $\overline{\tau}_{AAB}$ in case that we first do AAB among nodes using TADTs and then AAB inside the nodes. As we will see later, due to a possible partial overlap of the inter- and intra-node communications in the d-port model, the lower bound $\underline{\tau}_{AAB}$ can be reached under certain conditions.

Theorem 2. The complexity of AAB on the k-port WH fat cube, measured by the number of communication steps, is

1) $\tau_{AAB} = \tau_{AAB}(\mathrm{HC}) = P - 1$

k) $\overline{\tau}_{AAB} = \lceil (P-m)/\min(fd, mk) \rceil + \lceil (m-1)/k \rceil\, P/m$

Proof.

1) We can use cyclic rotation of messages along the ring formed by the Hamiltonian cycle; the $m$ processors in every node are incorporated into that cycle. In the first step all $P$ processors just send their message along the cycle and in the following $P-2$ steps they keep receiving and re-sending other messages. Multiple links cannot make it faster, because processors are connected to the router with a simple link.

k) Using a generic TADT rooted in every node we perform AAB among nodes. Each node, if not a leaf, broadcasts “super-messages” consisting of $m$ distinct messages to other nodes. In each such “super-step”, $m$ messages stored in the $m$ node processors are transferred between adjacent nodes. There are $fd$ incoming links to a node from all dimensions and $mk$ input links to the node processors. Therefore the $m(2^d - 1) = P - m$ messages destined for one node will be received in not less than $\lceil (P-m)/\min(fd, mk) \rceil$ steps. At the end each processor will have $P/m$ distinct messages (including its own original message) to share with the other local processors. As the local AAB among $m$ processors can be done on the router crossbar as $m-1$ permutations, $k$ permutations at a time, the result is

$$\overline{\tau}_{AAB} = \lceil (P-m)/\min(fd, mk) \rceil + \lceil (m-1)/k \rceil\, P/m,$$

q.e.d.

Provided that $fd < mk$, then $mk - fd$ ports are free during inter-node communication and can be used for broadcasting messages within the node. As there are


$(P-m)/(fd)$ steps of inter-node communication, $(P-m)(mk-fd)/(fd)$ out of the total $(P/m)\,m(m-1)$ internal pair-wise communications can be hidden. The remaining

$$P(m-1) - (P-m)(mk-fd)/(fd)$$

pair-wise communications can be done, $mk$ of them at a time, on $mk$ ports. With the previous inter-node communication (and with careless handling of the ceiling function) it will require

$$\tau_{AAB} = \left\lceil \frac{P-m}{fd} + \frac{1}{mk}\left( P(m-1) - \frac{(P-m)(mk-fd)}{fd} \right) \right\rceil = \left\lceil \frac{P-1}{k} \right\rceil$$

steps. Therefore a clever overlapping of global and local communications could make an AAB algorithm as efficient as the optimal hypercube algorithm.

Contrary to OAB, combining is relevant to the complexity of AAB. There is a straightforward approach (Gather – Scatter) to combining AAB on the fat cube: one representative processor in each node gathers messages from all local peers and then AAB takes place among these representative processors with combined messages. At the end the representatives extract and distribute individual messages to local peers. We will not analyze the complexity in detail, but interestingly, combining AAB can sometimes be faster on the fat cube than on the hypercube [6].

4.3. One-to-all scatter (OAS) on a WH fat cube

This CC has a complexity similar to that of AAB in many models. Optimal OAS algorithms for a hypercube matching the lower bounds are based on a Hamiltonian cycle (1-port model) and again on time-arc-disjoint spanning trees, TADTs (d-port model). An optimal hypercube algorithm requires a broadcast tree with sub-trees of approximately equal size ($\pm 1$ node). TADTs do not fulfil this requirement and must be slightly modified. The construction of such trees is known and will not be repeated here. The generic TADT tree can be rooted in any source processor and messages are sent into its sub-trees in any order. The link type (HD or FD) does not influence $\tau_{OAS}$; what matters is the number of distinct messages that can be sent by the source processor in one step. In the fat cube topology we perform OAS among nodes first, then OAS inside nodes. Theorem 3 gives the related upper bounds $\overline{\tau}_{OAS}$; for $m = f = 1$ and $k = d$ we get the lower bounds for the all-port hypercube as a special case.

Theorem 3. The complexity of OAS on the k-port WH fat cube, measured by the number of communication steps, is

1) $\tau_{OAS} = \tau_{OAS}(\mathrm{HC}) = P - 1$

k) $\overline{\tau}_{OAS} = \lceil (P-m)/\min(fd, mk) \rceil + \lceil (m-1)/k \rceil$

Proof.

1) We can use the Hamiltonian cycle and send messages in any order to the $P-1$ remote processors. We cannot use more than $f = 1$ external link, because each processor has only one internal link and both the external and internal links are connected in the Hamiltonian cycle. Therefore in $P-1$ steps all processors will get their messages, q.e.d.

k) By making use of a modified TADT for the global OAS among nodes, super-messages from the source node to destination nodes will consist of $m$ messages. There are $fd$ outgoing links from a node in all dimensions, $mk$ output links from processors, and $P-m$ messages are to be sent to other nodes. This will therefore take not less than $\lceil (P-m)/\min(fd, mk) \rceil$ steps. The local OAS in the source node requires $\lceil (m-1)/k \rceil$ steps, because the source processor can emit $k$ messages at a time. Altogether

$$\overline{\tau}_{OAS} = \lceil (P-m)/\min(fd, mk) \rceil + \lceil (m-1)/k \rceil,$$

q.e.d. For the d-port fat cube with simple links ($f = 1$) this bound comes to $\overline{\tau}_{OAS} = \lceil (P-1)/d \rceil = \underline{\tau}_{OAS}$.

4.4. All-to-all scatter (AAS) on a WH fat cube

Let us recall that the optimal AAS algorithm for the 1-port hypercube matching the lower bound $\underline{\tau}_{AAS} = P-1$ (see Table 3) is very simple. AAS is decomposed into $P-1$ permutations; processors with the relative address $i$ directly exchange messages in step $i$, $i = 1, 2, \ldots, P-1$. However, the elegance of the hypercube topology shows in the all-port model, in which the $P-1$ steps are compacted into $P/2$ steps in such a way that all links are used in both directions in all steps! The smallest example is shown in Fig. 4. Theorem 4 establishes the complexity of AAS, namely the upper bounds $\overline{\tau}_{AAS}$ in case that we do AAS among nodes first and then inside the nodes. In some cases these bounds can be further improved by overlapping inter- and intra-node communications.

Fig. 4. AAS in 2 steps on the WH 2D-HC


Theorem 4. The complexity of AAS on the k-port WH fat cube, measured by the number of communication steps, is

1) $\overline{\tau}_{AAS} = (2^d - 1)\lceil m^2/f \rceil$

k) $\overline{\tau}_{AAS} = \lceil Pmd/[2\min(fd, mk)] \rceil + \lceil (m-1)/k \rceil$

Proof.

1) The direct exchange HC algorithm applied to the global AAS leads to $2^d - 1$ super-steps, with exchanges of $m^2$ messages between each of the $2^{d-1}$ pairs of nodes, $f$ messages at a time. One exchange super-step will thus take $\lceil m^2/f \rceil$ steps, $f \le m$, and the whole AAS will require $(2^d - 1)\lceil m^2/f \rceil$ steps, q.e.d.

k) We can visualize AAS as a superposition of m-to-P scatter communications by all nodes, in which each processor in the node sends $P-m$ distinct messages outside and $m-1$ messages inside the node. The block of $m^2$ messages (a super-message) from the source node ($m$ source CPUs in one node, each of them sending $m$ messages to a destination node) goes through intermediate nodes to the destination node and utilizes a number of links on the way. We can count the number of channels required to connect one source node to destination nodes at all levels of the broadcast tree as

$$\sum_{i=1}^{d} i \binom{d}{i} = d \sum_{i=1}^{d} \frac{(d-1)!}{(d-i)!\,(i-1)!} = d\,2^{d-1} \qquad (4)$$

The so-called communication work $CW(AAS)$ for all $2^d$ nodes is thus

$$CW(AAS) = 2^d \cdot d\,2^{d-1} \cdot m^2 = d\,2^{2d} m^2/2 \qquad (5)$$

as each link will be used $m^2$ times. On the other hand, the total number of channels available at one time is the lower of the total count of external channels $2L = f\,d\,2^d$ and the total number of output ports $2^d (mk)$, i.e. $2^d \min(fd, mk)$. Since all the external links are utilized in the direct exchange algorithm in both directions in all steps, it has to hold

$$\tau_{AAS} = \lceil CW(AAS)/[2^d \min(fd, mk)] \rceil = \lceil Pmd/[2\min(mk, fd)] \rceil .$$

The intra-node AAS among $m$ processors can be implemented on the router crossbar as $(m-1)$ permutations at a rate of $k$ permutations in one step, i.e. in $\lceil (m-1)/k \rceil$ steps. Together we get the desired result, q.e.d.

5. Examples of collective communication on the 8-processor, 2D-fat cube

In this section we have chosen to demonstrate communication algorithms on a small d-port fat cube with the following parameters: $d = m = 2$, $P = 8$, $f = 1$, non-combining nodes, full-duplex links and wormhole switching.

5.1. One-to-all broadcast

Whereas 3 OAB steps are always needed in the 8-processor hypercube using the spanning binomial tree (1+1+2+4 / 1+3+3+1 processors informed in 3 steps in the 1-port / d-port model), 2 steps suffice in the d-port fat cube topology, see Fig. 5. The intra-node OAB is fully overlapped with the 2 steps of the inter-node OAB.

Figure 5. OAB in 2 steps on the WH fat cube

5.2. All-to-all broadcast

Theorem 2 states that we are able to complete AAB in 3+4 steps of inter- and intra-node communication, but we can do much better by overlapping them. The optimal algorithm with a full overlap of the global and local AAB is shown in Figure 6, reaching the lower bound of Theorem 2 ($f = 1$, $k = d$):

$$\underline{\tau}_{AAB} = \lceil (P-1)/d \rceil = 4 \text{ steps} = \max(3, 4).$$

Figure 6. AAB in 4 steps on the WH fat cube

a) steps 1 and 2   b) steps 3 and 4


The path of every message from the source to the destination processors, divided into 4 steps, is described in Table 4.

Table 4. Four steps of the AAB communication schedule

           destination processors
message    step 1    step 2    step 3    step 4
1          2, 5      6, 8      7, 4      3
2          1, 3      4, 7      8, 6      5
3          4, 2      1, 6      5, 7      8
4          3, 8      7, 5      6, 1      2
5          6, 1      2, 4      3, 8      7
6          5, 7      8, 3      4, 2      1
7          8, 6      5, 2      1, 3      4
8          7, 4      3, 1      2, 5      6

5.3. One-to-all scatter

In our running example ($f = 1$, $k = d$) the upper bound given by Theorem 3 matches the ideal lower bound

$$\underline{\tau}_{OAS} = \lceil (P-1)/d \rceil \le \overline{\tau}_{OAS} = \lceil (P-m)/d \rceil + \lceil (m-1)/d \rceil = 4$$

steps, see Fig. 7. The source keeps sending messages into two sub-trees, three times 2 messages in any order, and then the local OAS inside the source node is done in 1 more step.

Figure 7. OAS in 4 steps on the WH fat cube

5.4. All-to-all scatter

According to Theorem 4, we should be able to complete AAS on our example fat cube in 9 steps. AAS among nodes is scheduled in 2 super-steps according to Fig. 4. Since there are $m^2 = 4$ messages in a super-message, there will be 4 steps in each super-step. AAS within nodes, which in our FC is only an exchange of messages between two processors, can be combined with any of the previous 8 steps because only one processor port is busy during inter-node communication. Pairs of processors exchanging messages in steps 1 to 8 are listed in Table 5; local AAS communications are shown in bold.

Table 5. Eight steps of the AAS schedule

1 03, 16, 25, 47

2 02, 17, 24, 46

3 06, 12, 20, 42

4 07, 13, 21, 43

5 04, 15

6 62, 73

7 05, 14,

8 63, 72, 01, 23, 45, 67

The performance of AAS is limited not by the number of ports, but rather by the bisection width of the fat cube: AAS on the d-port FC with double links would complete in only 4 steps.
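The effect of link multiplicity on AAS follows from the communication work argument of Section 4.4 and can be reproduced with a short sketch. The function names are hypothetical; the code evaluates the channel count of eq. (4), the communication work of eq. (5) and the resulting number of inter-node steps.

```python
from math import comb


def channels_per_source(d):
    """Left-hand side of eq. (4): sum over distances i of i * C(d, i)."""
    return sum(i * comb(d, i) for i in range(1, d + 1))


def communication_work(m, d):
    """CW(AAS) of eq. (5): 2**d source nodes, each channel on the way used m**2 times."""
    return 2 ** d * channels_per_source(d) * m ** 2


def aas_inter_node_steps(m, f, d, k):
    """ceil(CW(AAS) / (2**d * min(f*d, m*k))): steps of the inter-node AAS."""
    channels = 2 ** d * min(f * d, m * k)
    return -(-communication_work(m, d) // channels)


print(channels_per_source(2))                   # 4, i.e. d * 2**(d - 1)
print(communication_work(m=2, d=2))             # 64
print(aas_inter_node_steps(m=2, f=1, d=2, k=2)) # 8
print(aas_inter_node_steps(m=2, f=2, d=2, k=2)) # 4
```

With simple links the 8 external channels are the bottleneck; doubling the links halves the number of steps, in agreement with the statement above.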

6. Results and conclusions

A summary of CC complexities for various models of our sample fat cube and hypercube networks is given in Table 6. The table gives the optimized number of steps with possible overlap of global and local CCs. The communication performance of the FC is the same or better in OAB and almost the same in OAS and AAB. The AAS performance depends on the multiplicity of links.

Table 6. Complexity of CCs on the 8-processor hypercube and fat cubes

m, f, d    OAB    AAB    OAS    AAS    P = 8
1, 1, 3    3      7      7      7      1-port HC
1, 1, 3    3      3      3      4      all-port HC
2, 1, 2    3      7      7      12     1-port FC
2, 1, 2    2      4      4      8      d-port FC
2, 2, 2    2      4      3      4      d-port FC

Another, larger example concerns the 3D-FC with 4 CPUs per node, double links and $P = 32$ processors. Table 7 gives the complexity values obtained either from Table 3 or from Theorems 1 to 4.

The above results concern only two particular fat cube networks, but the theorems derived earlier are suitable for comparing other configurations as well. In general we can draw the following conclusions:


1. The performance of the 1-port HC and the 1-port FC with the same processor count $P$ is the same in all CCs except AAS.

2. In the 1-port FC the number of AAS steps is proportional to $1/f$, $f \le m$.

3. Partitioning OAB into the global and local parts does not reduce the performance; overlapping both parts even improves it.

4. Performance in OAS and AAB is poorer than in the HC topology, but it is similar if optimization through overlapping is used, or even better if multiple links are provided.

5. The poorer performance in AAS on the d-port FC is caused by the lower bisection width; the same performance as in the hypercube can be obtained when multiple links are used.

6. If the hardware cost is a limiting factor, a suitable fat cube can be found which is cheaper than the equivalent hypercube with the same number of processors, with little (if any) performance degradation.

7. The number of processors $P$ in the fat cube configuration is not limited to powers of 2; it can be a power of 2 multiplied by an integer $m$. This may be a more straightforward way of scaling than a partial hypercube.

Table 7. Complexity of CCs on the 32-processor hypercube and fat cubes

m, f, d    OAB    AAB    OAS    AAS    P = 32
1, 1, 5    5      31     31     31     1-port HC
1, 1, 5    5      7      7      16     all-port HC
4, 1, 3    5      31     31     112    1-port FC
4, 2, 3    4      21     7      34     2-port FC
4, 1, 3    3      18     11     65     d-port FC
4, 2, 3    3      13     6      33     d-port FC
4, 4, 3    3      11     4      17     d-port FC
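The fat cube rows of Table 7 follow directly from Theorems 1 to 4 and can be recomputed with the sketch below. The function names are hypothetical, and the values are the bounds without any overlap of inter- and intra-node communication, so they need not coincide with optimized schedules such as those in Table 6.

```python
def ceil_div(a, b):
    """Ceiling of a / b for positive integers."""
    return -(-a // b)


def ceil_log(base, x):
    """Smallest integer n such that base**n >= x."""
    n, power = 0, 1
    while power < x:
        power *= base
        n += 1
    return n


def fat_cube_steps(m, f, d, k):
    """Step counts of Theorems 1-4 for a k-port WH fat cube with P = m * 2**d."""
    P = m * 2 ** d
    oab = ceil_log(k + 1, P // m) + ceil_log(k + 1, m)       # Theorem 1
    if k == 1:                                               # cases 1) of Theorems 2-4
        return {"OAB": oab, "AAB": P - 1, "OAS": P - 1,
                "AAS": (2 ** d - 1) * ceil_div(m * m, f)}
    width = min(f * d, m * k)                                # usable channels per node
    return {
        "OAB": oab,
        "AAB": ceil_div(P - m, width) + ceil_div(m - 1, k) * (P // m),  # Theorem 2
        "OAS": ceil_div(P - m, width) + ceil_div(m - 1, k),             # Theorem 3
        "AAS": ceil_div(P * m * d, 2 * width) + ceil_div(m - 1, k),     # Theorem 4
    }


for m, f, d, k in [(4, 1, 3, 1), (4, 2, 3, 2), (4, 1, 3, 3), (4, 2, 3, 3), (4, 4, 3, 3)]:
    print((m, f, d), f"{k}-port", fat_cube_steps(m, f, d, k))
```

For example, `fat_cube_steps(4, 2, 3, 3)` returns 3, 13, 6 and 33 steps for OAB, AAB, OAS and AAS, the values of the corresponding d-port row.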

Future research should address other network topologies with fat nodes and links. Other communication patterns should also be studied, such as multicast and a-to-b broadcast or scatter. Combining node models are of interest as well; partial results for SF switching have been presented in [6]. The role of combining models for WH switching remains to be clarified. Research in these directions could help optimize communication architectures for application-specific multiprocessor systems on chip [7].

7. References

[1] A.A. Jerraya, W. Wolf, Multiprocessor Systems-on-Chips, Elsevier Inc., 2005, ISBN 0-12385-251-X.

[2] W. Dally, B. Towles, Principles and Practices of Interconnection Networks, The Morgan Kaufmann Series in Computer Architecture and Design, Morgan Kaufmann Publishers, 2004, ISBN 0-12200-751-4.

[3] C.N. Keltcher, et al., “The AMD Opteron Processor for Multiprocessor Servers”, IEEE Micro, March/April 2003, pp. 66-76.

[4] E. Gabrielyan, R.D. Hersch, “Efficient Liquid Schedule Search Strategies for Collective Communications”, Proc. of ICON 2004 - 12th IEEE International Conference on Networks, Singapore, Vol. 2, November 16-19, 2004, pp. 760-766.

[5] J. Duato, S. Yalamanchili, L. Ni, Interconnection Networks – An Engineering Approach, Morgan Kaufmann Publishers, 2003, ISBN 1-55860-852-4.

[6] V. Kutalek, Performance modeling and optimization of application-specific multi-processor systems, Ph.D. thesis, Faculty of Information Technology, Brno University of Technology, 2005.

[7] V. Dvorak, “Communication Architectures for Application-Specific Multiprocessor Systems (on a Chip)”, Proc. of the 11th International Conference on Software, Telecommunications and Computer Networks SoftCOM 2003, Split, HR, FESB, 2003, pp. 778-782.

Acknowledgement

This research has been carried out under the financial

support of the research grant “Network Architectures

of Embedded Systems Networks”, GA102/05/0467,

Grant Agency of Czech Republic, 2005-2007.


Performance of Collective Communications on Interconnection Networks with Fat nodes and Edges

Conference Paper

Full-text available

May 2006

The paper deals with scheduling collective communications in the minimum number of communication steps; it shows how to generalize the known results regarding time complexity of collective communications on common direct networks for the same networks with fat nodes and edges. Models of node architecture composed of several processor cores that share a router are discussed. Examples of communication algorithms on fat K-ring networks with 8 to 32 processors are summarized and given in detail. The results show that fat networks, depending on their configuration, can provide a range of communication performance at a lower cost.

3D implementation of heterogeneous topologies on MPSoC

Conference Paper

Jan 2017

Liquid Schedule Construction Algorithm: an Efficient Method for Coloring a Congestion Graph

Article

Emin Gabrielyan

The upper limit of a network's capacity is its liquid throughput. The liquid throughput corresponds to the flow of a liquid in an equivalent network of pipes. In coarse-grained networks, the aggregate throughput of an arbitrarily scheduled collective communication may be several times lower than the maximal potential throughput of the network. In wormhole and wavelength division optical networks, there is a significant loss of performance due to congestions between simultaneous transfers sharing a common communication resource. We propose to schedule the transfers of a traffic according to a schedule yielding the liquid throughput. Such a schedule, called liquid schedule, relies on the knowledge of the underlying network topology and ensures an optimal utilization of all bottleneck links. To build a liquid schedule, we partition the traffic into time frames comprising mutually non-congesting transfers keeping all bottleneck links busy during all time frames. The search for mutually non-congesting transfers utilizing all bottleneck links is of exponential complexity. We present an efficient algorithm which non-redundantly traverses the search space. We efficiently reduce the search space without affecting the solution space. The liquid schedules for small problems (up to hundred nodes) can be found in a fraction of seconds.

Collective Communication AAB for Regular and Irregular Topology Based on Prediction of Conflicts

Conference Paper

Jan 2006

Collective communications involving all processors are frequently used in the solution of demanding parallel problems, and their time complexity has a dramatic impact on performance. This paper deals with scheduling collective communications in multiprocessor networks that use the store-and-forward switching technique, aiming at the minimum number of communication steps. We designed a novel technique of communication conflict prediction, which significantly increases the success rate of optimal communication schedule

Evolutionary Design of OAB and AAB Communication Schedules for Interconnection Networks

Conference Paper

Apr 2006
Lect Notes Comput Sci

Since chip multiprocessors are quickly penetrating new application areas in network and media processing, their interconnection architectures become a subject of sophisticated optimization. One-to-All Broadcast (OAB) and All-to-All Broadcast (AAB) [2] group communications are frequently used in many parallel algorithms and if their overhead cost is excessive, performance degrades rapidly with a processor count. This paper deals with the design of a new application-specific standard genetic algorithm (SGA) and the use of Hybrid parallel Genetic Simulated Annealing (HGSA) to design optimal communication algorithms for an arbitrary topology of the interconnection network. Each of these algorithms is targeted for a different switching technique. The OAB and AAB communication schedules were designed mainly for an asymmetrical AMP [15] network and for the benchmark hypercube network [16] using Store-and-Forward (SF) and Wormhole (WH) switching.

Three topics in parallel communications

Article

Jan 2006

Emin Gabrielyan

The main objectives pursued by parallelism in communications are network capacity enhancement and fault-tolerance. Efficiently enhancing the capacity of a network by parallel communications is a non-trivial task. Some applications may also allow one to split the sources and destinations into multiple sources and destinations. An example is parallel Input/Output (I/O). Parallel I/O requires scalability, high throughput and good load balance. Low granularity enables good load balance but tends to reduce throughput. In this thesis we combine fine granularity with scalable high throughput. The network overhead can be reduced and the network throughput can be increased by aggregation of data into large messages. Parallel transmissions from multiple sources to multiple destinations traverse the network through many different paths which have numerous intersections in the network. In low latency high performance networks, serious congestions occur due to large indivisible messages competing for shared resources. We propose to optimally schedule parallel communications by taking into account the network topology. The developed liquid scheduling method optimally uses the potential transmission capacity of a network. Fault-tolerance is typically achieved by maintaining backup communication resources, which are kept idle as long as the primary resource is operational. A challenging idea, inspired by nature, is to simultaneously use all parallel resources. This idea is applied to fine-grained packetized communications. It also relies on erasure resilient codes for combating network failures.

Principles and Practices of Interconnection Networks

Article

Jan 2004

A deadlock-free routing algorithm can be generated for arbitrary interconnection networks using the concept of virtual channels. A necessary and sufficient condition for deadlock-free routing is the absence of cycles in a channel dependency graph. Given an arbitrary network and a routing function, the cycles of the channel dependency graph can be removed by splitting physical channels into groups of virtual channels. This method is used to develop deadlock-free routing algorithms for k-ary n-cubes, for cube-connected cycles, and for shuffle-exchange networks.

Multiprocessor Systems-on-Chips

Book

Jan 2005

Modern system-on-chip (SoC) design shows a clear trend toward integration of multiple processor cores on a single chip. Designing a multiprocessor system-on-chip (MPSOC) requires an understanding of the various design styles and techniques used in the multiprocessor. Understanding the application area of the MPSOC is also critical to making proper tradeoffs and design decisions. Multiprocessor Systems-on-Chips covers both design techniques and applications for MPSOCs. Design topics include multiprocessor architectures, processors, operating systems, compilers, methodologies, and synthesis algorithms, and application areas covered include telecommunications and multimedia. The majority of the chapters were collected from presentations made at the International Workshop on Application-Specific Multi-Processor SoC held over the past two years. The workshop assembled internationally recognized speakers on the range of topics relevant to MPSOCs. After having refined their material at the workshop, the speakers are now writing chapters and the editors are fashioning them into a unified book by making connections between chapters and developing common terminology.

Interconnection Networks: An Engineering Approach. Revised Printing

Article

Jan 2002

Interconnection networks: An engineering approach

Book

Jan 1997

Efficient Liquid Schedule Search Strategies for Collective Communications

Article

Jan 2004

The upper limit of a network's capacity is its liquid throughput. The liquid throughput corresponds to the flow of a liquid in an equivalent network of pipes. However, the aggregate throughput of a collective communication pattern (traffic) scheduled according to network topology unaware techniques may be several times lower than the maximal potential throughput of the network. In most of the cut-through, wormhole and wavelength division optical networks, there is a loss of performance due to congestions between simultaneous transfers sharing a common communication resource. We propose to schedule the transfers of a traffic according to a schedule yielding the liquid throughput. Such a schedule, called liquid schedule, relies on the knowledge of the underlying network topology and ensures an optimal utilization of all bottleneck links. To build a liquid schedule, we partition the traffic into time frames comprising mutually non-congesting transfers keeping all bottleneck links busy during all time frames. The search for mutually non-congesting transfers utilizing all bottleneck links is of exponential complexity. We present an efficient algorithm which non-redundantly traverses the search space and limits the search to only those sets of transfers, which are non-congesting and use all bottleneck links. © 2004 IEEE.

The AMD Opteron processor for multiprocessor servers

Article

Apr 2003

Representing AMD's entry into 64-bit computing, Opteron combines the backwards compatibility of the X86-64 architecture with a DDR memory controller and hypertransport links to deliver server-class performance. These features also make Opteron a flexible, modular, and easily connectable component for various multiprocessor configurations.

Principles and Practices of Interconnection Networks, The Morgan Kaufmann Series in Computer Architecture and Design

W Dally
B Towles

W. Dally, B. Towles, Principles and Practices of Interconnection Networks, The Morgan Kaufmann Series in Computer Architecture and Design, Morgan Kaufmann Publishers, 2004, ISBN: 0-12200-751-4.

Efficient Liquid Schedule Search Strategies for Collective Communications

E Gabrielyan
R D Hersch

E. Gabrielyan, R.D. Hersch, “Efficient Liquid Schedule Search Strategies for Collective Communications”, Proc. of ICON 2004 - 12th IEEE International Conference on Networks, Singapore, Vol. 2, November 16-19, 2004, pp. 760-766.

Performance modeling and optimization of application-specific multi-processor systems

Jan 2005

V Kutalek

Kutalek, V., Performance modeling and optimization of application-specific multi-processor systems, Ph.D. thesis, Faculty of Information Technology, Brno University of Technology, 2005.

Communication Architectures for Application-Specific Multiprocessor Systems (on a Chip)

Jan 2005

V Dvorak

Dvorak, V., Communication Architectures for Application-Specific Multiprocessor Systems (on a Chip).
