Scheduling Collective Communications on Wormhole Fat Cubes
Vaclav Dvorak
Brno University of Technology
dvorak@fit.vutbr.cz
Abstract
Renewed interest in the hypercube interconnection network has recently concentrated on its more scalable and mostly cheaper variant known as the fat cube. This paper generalizes the known results on the time complexity of collective communications on a hypercube to the wormhole fat cube. Examples of particular communication algorithms on the 2D fat cube topology with 8 processors are summarized and given in detail. The study shows that a large variety of fat cubes can provide lower cost, better scalability and manufacturability without compromising communication performance.
1. Introduction
One of the greatest challenges faced by designers of digital systems at present is optimizing the communication and interconnection between system components. As more and more processor cores and other large reusable components are integrated on a single silicon die (MPSoCs, Multiprocessor Systems-on-Chips, [1]), many traditional multiprocessing techniques are being modified or developed anew. The interconnection network, a fundamental component of every parallel system, and communication algorithms are no exceptions. Buses are being replaced by crossbars or by direct interconnection networks. Direct networks basically converge on the use of pipelined (wormhole) message transmission and source-based routing algorithms, and the major differences among them are in topology.
The well-known binary hypercube (HC) topology is characterized by $P = 2^d$ nodes, naturally organized in $d$ dimensions, where $d$ is also the node degree. The worst-case distance between two nodes (the network diameter $D$) is logarithmic, $D = d = \log P$. The HC topology is node and edge symmetric, which simplifies the design of parallel algorithms tremendously. Computation can start in any node and the source code remains the same. Communication can also start in any dimension. Optimal algorithms for collective communication operations exist in almost all communication models. This is why the HC topology is commonly considered the best topology from the algorithmic and communication point of view. The HC topology can also simulate almost any other topology efficiently. Its only drawbacks are the non-constant (logarithmic) degree $d = \log P$, with the consequently high number of communication channels, and only partial scalability, as the number of nodes $P$ is restricted to powers of 2.
Topologies derived from the binary HC, such as cube-connected cycles and wrap-around or ordinary butterflies [2], eliminate the drawback of non-constant node degree. They are constructed by expanding the HC vertices into cycles or linear arrays and have a small constant degree and the logarithmic diameter as before. The bisection width $2^d = P/d$ is slightly worse than the value $P$ for the hypercube, and so is the scalability, since the number of processors is $P = d\,2^d$, i.e. only 8, 24, 64, etc.
Another useful alternative is a much better scalable topology called a “fat cube” (FC). The vertices of the HC are again expanded, but now into sets of processors connected by the crossbar switch inside the router. Scalability is improved since a node can contain any number of processors, $P = m\,2^d$, $m = 1, 2, 3$, etc. The node degree grows more slowly than in the HC, $d = \log(P/m)$, and the bisection width can be adjusted by multiple links between nodes. Due to these favorable features, the FC topology has recently been used, e.g., in the commercial DSM NUMA machine Origin 3000 (SGI). Fat nodes with 4 Opteron processors have also been used in a 3D-FC connection [3], and nodes with 8 CPUs are connected into a K-ring network in the Swiss-T1 cluster [4]. The FC topology is also expected to appear in future networking systems for MPSoCs, because mapping an FC into 2-D space is easier than in the case of the “thin” HC.
In the rest of the paper we look at the router architecture in Section 2 and present the details of the hardware cost calculation in Section 3. The complexity of collective communications is analyzed in Section 4. Section 5 is a case study involving important collective communications on the 8-processor FC. Finally, Section 6 concludes the paper.
Proceedings of the 17th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD’05)
1550-6533/05 $20.00 © 2005 IEEE
2. Fat cube and router architecture
Let us recall the notation introduced above and establish some new notions related to the FC topology. Drawings of two instances of this topology are shown in Figure 1. We use the following parameters:
$d$ – dimensionality of the FC/HC
$D$ – network diameter
$m$ – number of processors per fat node, an integer greater than 1
$P$ – processor count, $P = m\,2^d$ (the FC), $P = 2^d$ (the HC)
$d'$ – dimensionality of the HC with the same number of processors as the FC, $d' = \log P = d + \lceil \log m \rceil$ (binary log is the default)
$f$ – multiplicity of external links
$L$ – the number of external links in a FC network, $L = f d\,2^{d-1}$. Each link consists of two channels in opposite directions.
Figure 1. Examples of fat cube networks.
a) P = 16, m = 4, d = 2, f = 2
b) P = 16, m = 2, d = 3, f = 1
The design of communication algorithms depends strongly on the model used to describe the parameters of the underlying communication hardware. These models have to address key characteristics of interconnection networks, such as the switching technique, channel type, message combining capability and the router model. The possible options in communication architecture are:
1. SF | WH | CS | VCT – store-and-forward, wormhole, circuit, and virtual-cut-through switching techniques
2. HD | FD | S – half-duplex, full-duplex, simplex links
3. NC | C – non-combining/combining model, capable (or not) of combining or extracting partial messages with negligible overhead
4. one-port (1) | d-port (d) – router model.
The router model for fat nodes deserves some explanation, because it is a certain generalization of the router model used in connection with thin nodes. In the simplest case, processors are connected to the router by a single link as in Fig. 1. This so-called one-port model (“1”) allows each of m or fewer processors to send a message either outside to a remote processor or to a local processor inside the same node, Fig. 2. In the d-port model (“d”) each processor can send up to d distinct messages simultaneously, either outside or locally. In fact, both models are special cases of the “k-port” model, where k = 1 or d. In the context of the traditional hypercube (m = 1 and f = 1) these models are known as the one-port and all-port models.
Figure 2. Router models for fat nodes (m = 2, d = 2, P = 8, f = 1).
1) one-port model d) d-port model
3. Cost of a fat cube network
The cost of the interconnection network has two components: the external link cost $C_L$ and the router cost $C_R$. If we disregard manufacturability, the external link cost $C_L$ can be taken simply as the number of these links, $C_L = L = f d\,2^{d-1}$. The router cost, given mainly by the cost of an $a \times b$ crossbar switch with $a$ input ports and $b$ output ports, is commonly taken as $ab$.
Let us compare the fat cube cost $C$ and the hypercube cost $C'$ and find under which conditions the fat cube network is cheaper. If both networks have the same number of processors, $P = P'$, then
$$P = m\,2^d = P' = 2^{d'} \quad \text{or} \quad d' = d + \log m .$$
The lower link cost of the fat cube, $C_L \le C_L'$,
$$C_L = f d\,2^{d-1} \le C_L' = d'\,2^{d'-1} = m d'\,2^{d-1} \qquad (1)$$
implies $fd \le md'$, which holds true because the dimensionality of a fat cube with $P$ processors is always lower than that of a hypercube and because mostly $f \le m$.
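The link-cost comparison of relation (1) can be tried out numerically. The following is only a minimal sketch: the function names are illustrative and not part of the paper, and the hypercube is assumed to have exactly $P = 2^{d'}$ nodes.

```python
def fat_cube_link_cost(d: int, f: int) -> int:
    """Number of external links L = f * d * 2**(d - 1) in a fat cube."""
    return f * d * 2 ** (d - 1)


def hypercube_link_cost(P: int) -> int:
    """Number of links d' * 2**(d' - 1) in a hypercube with P = 2**d' nodes."""
    d_prime = P.bit_length() - 1
    if P < 2 or 1 << d_prime != P:
        raise ValueError("P must be a power of two, at least 2")
    return d_prime * 2 ** (d_prime - 1)


# The two 16-processor fat cubes of Figure 1 against the 16-node hypercube.
for m, d, f in [(4, 2, 2), (2, 3, 1)]:
    P = m * 2 ** d
    print(f"m={m} d={d} f={f}: FC links = {fat_cube_link_cost(d, f)}, "
          f"HC links = {hypercube_link_cost(P)}")
```

For both Figure 1 configurations the fat cube needs fewer external links than the 16-node hypercube, in line with $fd \le md'$.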
The cost $C_R$ of all routers together depends on the type of the port model. Table 1 compares the total router costs $C_R$ and $C_R'$, each based on the product of input and output port counts, $p_{in} \times p_{out}$. Of course, we are especially interested in fat cubes with some cost advantage, i.e. when $C_R \le C_R'$. By making use of relation (1) we can transform the condition of a lower cost into inequalities involving the parameters m, f and d:
1) $(m + df)^2 \le m\,(1 + d + \log m)^2$ (2)
d) $d^2 (m + f)^2 \le 4m\,(d + \log m)^2$ (3)
Table 2 shows some numerically obtained solutions of inequalities (2)–(3) for f = 1 and 2.
For example, both 1-port fat cube networks in Fig. 1 are cheaper than 1-port hypercubes with the same number of processors P. Now the question is what the impact of this lower hardware cost, if any, will be on communication performance. We will therefore investigate the performance of collective communications on a fat cube in the next section.
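Inequalities (2) and (3) can also be checked directly from the per-router port counts. This is a hedged sketch rather than the procedure used for Table 2: it assumes the port counts $m + df$ (1-port) and $d(m + f)$ (d-port) per fat node, takes m as a power of two so that $d' = d + \log m$ is an integer, and uses hypothetical helper names.

```python
def log2_exact(n: int) -> int:
    """Binary logarithm of a power of two; raises otherwise."""
    e = n.bit_length() - 1
    if n < 1 or 1 << e != n:
        raise ValueError("expected a power of two")
    return e


def router_cost(m: int, d: int, f: int, port: str) -> int:
    """Total cost of all 2**d routers, taken as p_in * p_out per router.

    port == "1": one-port model, m + d*f ports per router.
    port == "d": d-port model, d*(m + f) ports per router.
    """
    ports = m + d * f if port == "1" else d * (m + f)
    return 2 ** d * ports ** 2


def fat_cube_routers_cheaper(m: int, d: int, f: int, port: str) -> bool:
    """True if the fat cube routers cost no more than those of the
    hypercube (m = f = 1) with the same processor count."""
    d_prime = d + log2_exact(m)
    return router_cost(m, d, f, port) <= router_cost(1, d_prime, 1, port)


# Both fat cubes of Figure 1 in the one-port model.
print(fat_cube_routers_cheaper(4, 2, 2, "1"))
print(fat_cube_routers_cheaper(2, 3, 1, "1"))
```

Sweeping d for fixed m and f with this predicate gives the largest dimension for which the fat cube remains cheaper.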
Table 1. Total router cost in fat cube ($C_R$) and hypercube ($C_R'$) topology
          FC                                        HC (m = f = 1)
Model     p_in, p_out    C_R                        p_in, p_out    C_R'
1-port    $m + df$       $2^d (m + df)^2$           $1 + d'$       $2^{d'} (1 + d')^2$
d-port    $d(m + f)$     $2^d d^2 (m + f)^2$        $2d'$          $2^{d'}\,4d'^2$
Table 2. Conditions ensuring that a fat cube is cheaper than the hypercube
f = 1     1-port    d-port
m = 2     ∀d        d ≤ 16
m = 4     ∀d        d ≤ 8
m = 8     ∀d        d ≤ 5
f = 2     1-port    d-port
m = 2     d = 1     d ≤ 2
m = 4     ∀d        d ≤ 6
m = 8     ∀d        d ≤ 3
4. Complexity of collective communications on the WH fat cube
Collective communications (CCs) are frequently used in all parallel algorithms. If their overhead is excessive, performance degrades rapidly with the processor count. When we refer to “collective communications”, we will assume communications involving all processors. Seven types of such collective communications are: OAB (One-to-All Broadcast), OAS (One-to-All Scatter), AOG (All-to-One Gather), AOR (All-to-One Reduce), AAB (All-to-All Broadcast), AAR (All-to-All Reduce) and AAS (All-to-All Scatter) [5]. Since the complexities of some communications are similar (AOG ~ OAS, AOR ~ OAB, AAR ~ AAB), we will focus only on 4 basic types (OAB, OAS, AAB, AAS). Each communication may be investigated with all possible model options, which gives too many distinct cases to explore. Therefore only the most important of them will be analyzed.
In the rest of the paper we assume that the communication in WH networks proceeds in synchronized steps. In one step of a CC, a set of simultaneous packet transfers takes place along completely disjoint paths between source-destination node pairs. The complexity of a collective communication will be determined in terms of the number of these communication steps, $\tau_{CC}(G)$ for the lower bound and $\bar\tau_{CC}(G)$ for the upper bound; if the network graph G (HC or FC) is clear from the context, we will omit its symbol. This figure of merit does not take into account the message length (non-uniform in combining models) or its variations from one step to another.
Before analyzing communications on a fat cube, let us review the lower bounds on the number of steps $\tau_{CC}$ in a hypercube network, Table 3. Lower bounds for all CCs on the WH hypercube, except OAB, are reachable by known optimal algorithms. The double-tree algorithm for OAB, Fig. 3, is optimal only for $d \le 6$. Other known algorithms are nearly optimal (e.g. the algorithm by Ho-Kao [5]).
In the following subsections we want to generalize the above results to the fat cube topology, restricted to non-combining WH models with FD links. Our approach will use the known algorithms for CC among nodes of the WH hypercube, and this inter-node communication will be followed or overlapped by the local CC within the nodes on the router crossbar (intra-node communication).
Table 3. CCs on a hypercube, lower bounds $\tau_{CC}$ on time complexity
CC     1-port          all-port
OAB    log P (= d)     $\lceil \log_{d+1} P \rceil = \lceil d/\log(d+1) \rceil$
AAB    P – 1           $\lceil (P-1)/d \rceil$
OAS    P – 1           $\lceil (P-1)/d \rceil$
AAS    P – 1           P/2
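The lower bounds of Table 3 are easy to tabulate for a given dimension. The sketch below is an illustration only: helper names are invented here, and the ceiling of the logarithm is computed with integer arithmetic to avoid floating-point rounding.

```python
def ceil_div(a: int, b: int) -> int:
    """Ceiling of a / b for positive integers."""
    return -(-a // b)


def ceil_log(base: int, x: int) -> int:
    """Smallest n such that base**n >= x."""
    n = 0
    while base ** n < x:
        n += 1
    return n


def hypercube_lower_bounds(d: int) -> dict:
    """Lower bounds on the number of steps of Table 3 for a d-dimensional hypercube."""
    P = 2 ** d
    return {
        "1-port": {"OAB": d, "AAB": P - 1, "OAS": P - 1, "AAS": P - 1},
        "all-port": {
            "OAB": ceil_log(d + 1, P),
            "AAB": ceil_div(P - 1, d),
            "OAS": ceil_div(P - 1, d),
            "AAS": P // 2,
        },
    }


for d in (3, 5):
    print(d, hypercube_lower_bounds(d))
```

Note that the all-port OAB entry is a lower bound only; as stated above, the double-tree algorithm reaches it for $d \le 6$.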
Figure 3. The “double tree” algorithm on a 4D hypercube. S, S′ are the 1st and 2nd roots.
4.1. One-to-all broadcast (OAB) on a WH fat cube
This CC is not influenced by the type of the links (HD/FD) or by message (non)combining. Since just one message propagates in the network, multiple links cannot help.
Theorem 1. The complexity of OAB on the k-port WH fat cube, measured by the number of communication steps, is
$$\bar\tau_{OAB} = \lceil \log_{k+1}(P/m) \rceil + \lceil \log_{k+1} m \rceil .$$
This upper bound can be reached for all m, k = 1 or d, and $d = \log(P/m) \le 6$.
Proof.
1) k = 1. OAB implemented by recursive doubling in the spanning binomial broadcast tree [5] doubles the number of informed nodes in each of
$$D = d = \lceil \log_{k+1}(P/m) \rceil = \lceil \log 2^d \rceil$$
steps. The recursive doubling continues inside the nodes with the use of the crossbar. This intra-node communication may be overlapped with inter-node communication except in the last node, so that additional $\lceil \log m \rceil$ steps are needed, q.e.d.
d) k = d. By making use of the double tree algorithm, which performs two partial OABs based on partial spanning binomial trees rooted in nodes S and S′, Fig. 3, the required number of steps is $\lceil d/2 \rceil$. However, $\lceil d/2 \rceil = \lceil \log_{d+1} 2^d \rceil$ for $d \le 6$. The intra-node communication is done using all d ports in $\lceil \log_{d+1} m \rceil$ steps, q.e.d.
Let us note that
$$\lceil \log_{k+1}(P/m) \rceil + \lceil \log_{k+1} m \rceil \le \lceil \log_{k+1} P \rceil + 1,$$
so that the FC is never worse than the HC in OAB by more than 1 step.
4.2. All-to-all broadcast (AAB) on a WH fat cube
Optimal AAB algorithms for a hypercube matching the lower bounds in Table 3 are based on a Hamiltonian cycle (1-port model) and on so-called time-arc-disjoint spanning trees, TADTs (all-port model). All processors can use such broadcast trees synchronously with no conflicts. Theorem 2 establishes the complexity of AAB on a fat cube, namely the upper bounds $\bar\tau_{AAB}$ in the case that we first do AAB among nodes using TADTs and then AAB inside the nodes. As we will see later, due to a possible partial overlap of the inter- and intra-node communications in the d-port model, the lower bound $\tau_{AAB}$ can be reached under certain conditions.
Theorem 2. The complexity of AAB on the k-port WH fat cube, measured by the number of communication steps, is
1) $\bar\tau_{AAB} = \tau_{AAB}(HC) = P - 1$
k) $\bar\tau_{AAB} = \lceil (P-m)/\min(fd, mk) \rceil + \lceil (m-1)/k \rceil \, P/m$
Proof.
1) We can use cyclic rotation of messages along the ring formed by the Hamiltonian cycle; the m processors in every node are incorporated into that cycle. In the first step all P processors just send their message along the cycle, and in the following P – 2 steps they keep receiving and re-sending other messages. Multiple links cannot make it faster, because processors are connected to the router by a single link.
k) Using a generic TADT rooted in every node, we perform AAB among nodes. Each node, if not a leaf, broadcasts “super-messages” consisting of m distinct messages to other nodes. In each such “super-step”, m messages stored in the m node processors are transferred between adjacent nodes. There are fd incoming links to a node from all dimensions and mk input links to the node processors. Therefore the $m(2^d - 1) = P - m$ messages destined for one node will be received in not less than $\lceil (P-m)/\min(fd, mk) \rceil$ steps. At the end each processor will have P/m distinct messages (including its own original message) to share with the other local processors. As the local AAB among the m processors can be done on the router crossbar as m – 1 permutations, k permutations at a time, the result is
$$\bar\tau_{AAB} = \lceil (P-m)/\min(fd, mk) \rceil + \lceil (m-1)/k \rceil \, P/m,$$
q.e.d.
Provided that fd < mk, then mk – fd ports are free during inter-node communication and can be used for broadcasting messages within the node. As there are $(P-m)/(fd)$ steps of inter-node communication, $(P-m)(mk-fd)/(fd)$ out of the total $(P/m)\,m(m-1)$ internal pair-wise communications can be hidden. The remaining
$$P(m-1) - (P-m)(mk-fd)/(fd)$$
pair-wise communications can be done, mk of them at a time, on mk ports. Together with the previous inter-node communication (and with careless handling of the ceiling function) it will require
$$\bar\tau_{AAB} = \left\lceil \frac{P-m}{fd} \right\rceil + \left\lceil \frac{P(m-1)}{mk} - \frac{P-m}{mk}\left(\frac{mk-fd}{fd}\right) \right\rceil = \left\lceil \frac{P-1}{k} \right\rceil = \tau_{AAB}$$
steps. Therefore a clever overlapping of global and local communications could make an AAB algorithm as efficient as the optimal hypercube algorithm.
Contrary to OAB, combining is relevant to the complexity of AAB. There is a straightforward approach (Gather Scatter) to combining AAB on the fat cube: one representative processor in each node gathers messages from all local peers, and then AAB takes place among these representative processors with combined messages. At the end the representatives extract and distribute individual messages to local peers. We will not analyze the complexity in detail, but interestingly, combining AAB can sometimes be faster on the fat cube than on the hypercube [6].
4.3. One-to-all scatter (OAS) on a WH fat cube
This CC has a similar complexity to AAB in many models. Optimal OAS algorithms for a hypercube matching the lower bounds are based on a Hamiltonian cycle (1-port model) and again on time-arc-disjoint spanning trees, TADTs (d-port model). An optimal hypercube algorithm requires a broadcast tree with sub-trees of approximately equal size (±1 node). TADTs do not fulfil this requirement and must be slightly modified. The construction of such trees is known and will not be repeated here. The generic TADT can be rooted in any source processor and messages are sent into its sub-trees in any order. The link type (HD or FD) does not influence $\tau_{OAS}$; rather, the number of distinct messages that can be sent by the source processor in one step is important. In the fat cube topology we perform OAS among nodes first, then OAS inside nodes. Theorem 3 gives the related upper bounds $\bar\tau_{OAS}$; for m = f = 1 and k = d we get the lower bounds for the all-port hypercube as a special case.
Theorem 3. The complexity of OAS on the k-port WH fat cube, measured by the number of communication steps, is
1) $\bar\tau_{OAS} = \tau_{OAS}(HC) = P - 1$
k) $\bar\tau_{OAS} = \lceil (P-m)/\min(fd, mk) \rceil + \lceil (m-1)/k \rceil$
Proof.
1) We can use the Hamiltonian cycle and send messages in any order to the P – 1 remote processors. We cannot use more than f = 1 external link, because each processor has only one internal link and both the external and internal links are connected in the Hamiltonian cycle. Therefore in P – 1 steps all processors will get their messages, q.e.d.
k) By making use of a modified TADT for the global OAS among nodes, super-messages from the source node to destination nodes will consist of m messages. There are fd outgoing links from a node in all dimensions and mk output links from processors, and P – m messages are to be sent to other nodes. This will therefore take not less than $\lceil (P-m)/\min(fd, mk) \rceil$ steps. The local OAS in the source node requires $\lceil (m-1)/k \rceil$ steps, because the source processor can emit k messages at a time. Altogether
$$\bar\tau_{OAS} = \lceil (P-m)/\min(fd, mk) \rceil + \lceil (m-1)/k \rceil ,$$
q.e.d. For the d-port fat cube with simple links (f = 1) this bound comes to $\bar\tau_{OAS} = \lceil (P-1)/d \rceil = \tau_{OAS}$.
4.4. AAS on a WH fat cube
Let us recall that the optimal AAS algorithm for the 1-port hypercube matching the lower bound $\tau_{AAS} = P - 1$ (see Table 3) is very simple. AAS is decomposed into P – 1 permutations; processors with the relative address i directly exchange messages in step i, i = 1, 2, …, P – 1. However, the elegance of the hypercube topology shows in the all-port model, in which the P – 1 steps are compacted into P/2 steps in such a way that all links are used in both directions in all steps! The smallest example is shown in Fig. 4. Theorem 4 establishes the complexity of AAS, namely the upper bounds $\bar\tau_{AAS}$ in the case that we do AAS among nodes first and then inside the nodes. In some cases these bounds can be further improved by overlapping inter- and intra-node communications.
Figure 4. AAS in 2 steps on the WH 2D-HC
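The direct-exchange decomposition of AAS into P – 1 permutations recalled above can be sketched in a few lines. Reading “relative address i” as the bitwise XOR of the two processor labels is an assumption made here (it is the usual hypercube convention), and the function name is illustrative only.

```python
def direct_exchange_schedule(d: int) -> list:
    """Steps 1 .. P-1 of the direct-exchange AAS on a 1-port hypercube.

    In step i every processor p exchanges messages with partner p XOR i,
    so each step is a set of P/2 disjoint processor pairs.
    """
    P = 1 << d
    schedule = []
    for i in range(1, P):
        pairs = [(p, p ^ i) for p in range(P) if p < (p ^ i)]
        schedule.append(pairs)
    return schedule


# 2D hypercube, P = 4: three steps with two exchanging pairs each.
for step, pairs in enumerate(direct_exchange_schedule(2), start=1):
    print(step, pairs)
```

Every unordered pair of processors appears in exactly one step, which is why P – 1 steps suffice in the 1-port model.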
Theorem 4. The complexity of AAS on the k-port WH fat cube, measured by the number of communication steps, is
1) $\bar\tau_{AAS} = (2^d - 1)\,\lceil m^2/f \rceil$
k) $\bar\tau_{AAS} = \lceil Pmd/[2\min(mk, fd)] \rceil + \lceil (m-1)/k \rceil$
Proof.
1) The direct exchange HC algorithm applied to the global AAS leads to $2^d - 1$ super-steps, with exchanges of $m^2$ messages between each of the $2^{d-1}$ pairs of nodes, f messages at a time. One exchange super-step will thus take $\lceil m^2/f \rceil$ steps, $f \le m$, and the whole AAS will require $(2^d - 1)\,\lceil m^2/f \rceil$ steps, q.e.d.
k) We can visualize AAS as a superposition of m-to-P scatter communications by all nodes, in which each processor in the node sends P – m distinct messages outside and m – 1 messages inside the node. The block of $m^2$ messages (a super-message) from the source node (m source CPUs in one node, each of them sending m messages to a destination node) goes through intermediate nodes to the destination node and utilizes a number of links on the way. We can count the number of channels x required to connect one source node to the destination nodes at all levels of the broadcast tree as
$$x = \sum_{i=1}^{d} i \binom{d}{i} = \sum_{i=1}^{d} \frac{d!}{(i-1)!\,(d-i)!} = d \sum_{i=1}^{d} \binom{d-1}{i-1} = d \sum_{j=0}^{d-1} \binom{d-1}{j} = d\,2^{d-1} . \qquad (4)$$
The so-called communication work CW(AAS) for all $2^d$ nodes is thus
$$CW(AAS) = 2^d x\, m^2 = d\,2^{2d} m^2/2 \qquad (5)$$
as each link will be used $m^2$ times. On the other hand, the total number of channels available at one time is the lower of the total count of external channels $2L = fd\,2^d$ and the total number of output ports $2^d mk$, i.e. $2^d \min(fd, mk)$. Since all the external links are utilized in the direct exchange algorithm in both directions in all steps, it has to hold that
$$\bar\tau_{AAS} = \lceil CW(AAS)/[2^d \min(fd, mk)] \rceil = \lceil Pmd/[2\min(mk, fd)] \rceil .$$
The intra-node AAS among m processors can be implemented on the router crossbar as (m – 1) permutations at a rate of k permutations in one step, i.e. in $\lceil (m-1)/k \rceil$ steps. Together we get the desired result, q.e.d.
5. Examples of collective communication on the 8-processor, 2D-fat cube
In this section we demonstrate communication algorithms on the small d-port fat cube with the following parameters: d = m = 2, P = 8, f = 1, non-combining nodes, full-duplex links and wormhole switching.
5.1. One-to-all broadcast
Whereas 3 OAB steps are always needed in the 8-processor hypercube using the spanning binomial tree (1+1+2+4 / 1+3+3+1 processors informed in 3 steps in the 1-port / d-port model), 2 steps will do in the d-port fat cube topology, see Fig. 5. The intra-node OAB is fully overlapped with the 2 steps of the inter-node OAB.
Figure 5. OAB in 2 steps on the WH fat cube
5.2. All-to-all broadcast
Theorem 2 states that we are able to complete AAB in 3 + 4 steps of inter- and intra-node communication, but we can do much better by overlapping them. The optimal algorithm with a full overlap of the global and local AAB is shown in Figure 6, reaching the lower bound of Theorem 2 (f = 1, k = d):
$$\tau_{AAB} = \lceil (P-1)/d \rceil = 4 \text{ steps} = \max(3, 4).$$
Figure 6. AAB in 4 steps on the WH fat cube
a) steps 1 and 2 b) steps 3 and 4
The path of every message from source to
destination processors, divided into 4 steps, is
described in Table 4.
Table 4. Four steps of the AAB communication schedule
           destination processors
message    step 1    step 2    step 3    step 4
1          2, 5      6, 8      7, 4      3
2          1, 3      4, 7      8, 6      5
3          4, 2      1, 6      5, 7      8
4          3, 8      7, 5      6, 1      2
5          6, 1      2, 4      3, 8      7
6          5, 7      8, 3      4, 2      1
7          8, 6      5, 2      1, 3      4
8          7, 4      3, 1      2, 5      6
5.3. One-to-all scatter
In our running example (f = 1, k = d) the upper bound given by Theorem 3 matches the ideal lower bound
$$\bar\tau_{OAS} = \lceil (P-m)/d \rceil + \lceil (m-1)/d \rceil = \lceil (P-1)/d \rceil = \tau_{OAS} = 4$$
steps, see Fig. 7. The source keeps sending messages into two sub-trees, three times 2 messages in any order, and then the local OAS inside the source node is done in 1 more step.
Figure 7. OAS in 4 steps on the WH fat cube
5.4. All-to-all scatter
According to Theorem 4, we should be able to complete AAS on our example fat cube in 9 steps. AAS among nodes is scheduled in 2 super-steps according to Fig. 4. Considering now $m^2 = 4$ messages in a super-message, there will be 4 steps in each super-step. AAS within nodes, in our FC only an exchange of messages between two processors, can be combined with any of the previous 8 steps, because only one processor port is busy during inter-node communication. Pairs of processors exchanging messages in steps 1 to 8 are listed in Table 5; local AAS communications are marked with an asterisk.
Table 5. Eight steps of the AAS schedule
step    processor pairs
1       0–3, 1–6, 2–5, 4–7
2       0–2, 1–7, 2–4, 4–6
3       0–6, 1–2, 2–0, 4–2
4       0–7, 1–3, 2–1, 4–3
5       0–4, 1–5
6       6–2, 7–3
7       0–5, 1–4
8       6–3, 7–2, 0–1*, 2–3*, 4–5*, 6–7*
The performance of AAS is limited not by the number of ports, but rather by the bisection width of the fat cube: AAS on the d-port FC with double links would complete in 4 steps only.
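The bisection argument can be checked against the communication-work count of equations (4) and (5). The sketch below only evaluates the inter-node part of the AAS bound and assumes the channel count $x = \sum_i i \binom{d}{i}$ per source node; the function names are illustrative.

```python
from math import comb


def channels_per_source(d: int) -> int:
    """Channels needed to reach all other nodes from one source node,
    summed over the levels i = 1 .. d of the broadcast tree (eq. 4)."""
    return sum(i * comb(d, i) for i in range(1, d + 1))


def aas_inter_node_steps(m: int, f: int, d: int, k: int) -> int:
    """Communication work CW(AAS) divided by the number of channels
    usable at one time, 2**d * min(f*d, m*k), rounded up."""
    work = 2 ** d * channels_per_source(d) * m * m
    available = 2 ** d * min(f * d, m * k)
    return -(-work // available)


# The 8-processor example (m = 2, d = 2, k = d) with single and double links.
print(aas_inter_node_steps(2, 1, 2, 2))
print(aas_inter_node_steps(2, 2, 2, 2))
```

With single links the external channels are the bottleneck; doubling them halves the inter-node step count, which is the effect described above.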
6. Results and conclusions
A summary of CC complexities for various models of our sample fat cube and hypercube networks is given in Table 6. The table gives the optimized number of steps with a possible overlap of global and local CCs. The communication performance of the FC is the same or better in OAB and almost the same in OAS and AAB. The AAS performance depends on the multiplicity of links.
Table 6. Complexity of CCs on the 8-processor hypercube and fat cubes (P = 8)
m, f, d    OAB   AAB   OAS   AAS   model
1, 1, 3    3     7     7     7     1-port HC
1, 1, 3    3     3     3     4     all-port HC
2, 1, 2    3     7     7     12    1-port FC
2, 1, 2    2     4     4     8     d-port FC
2, 2, 2    2     4     3     4     d-port FC
Another, larger example concerns the 3D-FC with 4 CPUs per node, double links and P = 32 processors. Table 7 gives the complexity values obtained either from Table 3 or from Theorems 1 to 4.
The above results concern only two particular fat cube networks, but the theorems derived earlier are suitable for comparing other configurations as well. In general, we can draw the following conclusions:
1. The performance of the 1-port HC and the 1-port FC with the same processor count P is the same in all CCs but AAS.
2. In the 1-port FC, the AAS step count is proportional to 1/f, $f \le m$.
3. Partitioning OAB into a global and a local part does not reduce the performance; overlapping both parts can even improve it.
4. Performance in OAS and AAB is poorer than in the HC topology, but it is similar if optimization through overlapping is used, or even better if multiple links are provided.
5. The poorer AAS performance on the d-port FC is caused by a lower bisection width; the same performance as in the hypercube can be obtained when multiple links are used.
6. If the hardware cost is a limiting factor, then a suitable fat cube can be found which is cheaper than the equivalent hypercube with the same number of processors and with not much (if any) performance degradation.
7. The number of processors P in the fat cube configuration is not limited to powers of 2; a power of 2 can be multiplied by an integer m. This may be a more straightforward way of scaling than a partial hypercube.
Table 7. Complexity of CCs on the 32-processor hypercube and fat cubes (P = 32)
m, f, d    OAB   AAB   OAS   AAS   model
1, 1, 5    5     31    31    31    1-port HC
1, 1, 5    5     7     7     16    all-port HC
4, 1, 3    5     31    31    112   1-port FC
4, 2, 3    4     21    7     34    2-port FC
4, 1, 3    3     18    11    65    d-port FC
4, 2, 3    3     13    6     33    d-port FC
4, 4, 3    3     11    4     17    d-port FC
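The fat cube rows of Table 7 can be reproduced by evaluating the bounds of Theorems 1 to 4. The following is a minimal sketch with hypothetical function names; it computes the non-overlapped upper bounds only, so for configurations where global and local communications are overlapped (as in Table 6) the tabulated values can be smaller.

```python
def ceil_div(a: int, b: int) -> int:
    """Ceiling of a / b for positive integers."""
    return -(-a // b)


def ceil_log(base: int, x: int) -> int:
    """Smallest n such that base**n >= x."""
    n = 0
    while base ** n < x:
        n += 1
    return n


def fat_cube_bounds(m: int, f: int, d: int, k: int) -> dict:
    """Upper bounds on the number of steps for a k-port WH fat cube
    with m processors per node, link multiplicity f and dimension d."""
    P = m * 2 ** d
    oab = ceil_log(k + 1, P // m) + ceil_log(k + 1, m)       # Theorem 1
    if k == 1:
        return {"OAB": oab, "AAB": P - 1, "OAS": P - 1,      # Theorems 2, 3, case 1)
                "AAS": (2 ** d - 1) * ceil_div(m * m, f)}    # Theorem 4, case 1)
    ports = min(f * d, m * k)
    inter = ceil_div(P - m, ports)
    return {
        "OAB": oab,
        "AAB": inter + ceil_div(m - 1, k) * (P // m),        # Theorem 2, case k)
        "OAS": inter + ceil_div(m - 1, k),                   # Theorem 3, case k)
        "AAS": ceil_div(P * m * d, 2 * ports) + ceil_div(m - 1, k),  # Theorem 4, case k)
    }


# (m, f, d, k) for the five fat cube rows of Table 7.
for cfg in [(4, 1, 3, 1), (4, 2, 3, 2), (4, 1, 3, 3), (4, 2, 3, 3), (4, 4, 3, 3)]:
    print(cfg, fat_cube_bounds(*cfg))
```

Passing other values of m, f, d and k gives a quick way to compare further configurations, as suggested in the conclusions.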
Future research should address other network topologies with fat nodes and links. Other communication patterns should also be studied, such as multicast and a-to-b broadcast or scatter. Combining node models are of interest as well; partial results for SF switching have been presented in [6]. The role of combining models for WH switching still needs to be clarified. Research in the above directions could help optimize communication architectures for application-specific multiprocessor systems on chip [7].
7. References
[1] A.A. Jerraya, W. Wolf, Multiprocessor Systems-on-Chips, Elsevier Inc., 2005, ISBN 0-12385-251-X.
[2] W. Dally, B. Towles, Principles and Practices of Interconnection Networks, The Morgan Kaufmann Series in Computer Architecture and Design, Morgan Kaufmann Publishers, 2004, ISBN 0-12200-751-4.
[3] C.N. Keltcher, et al., “The AMD Opteron Processor for Multiprocessor Servers”, IEEE Micro, March/April 2003, pp. 66–76.
[4] E. Gabrielyan, R.D. Hersch, “Efficient Liquid Schedule Search Strategies for Collective Communications”, Proc. of ICON 2004 - 12th IEEE International Conference on Networks, Singapore, Vol. 2, November 16-19, 2004, pp. 760-766.
[5] J. Duato, S. Yalamanchili, L. Ni, Interconnection Networks – An Engineering Approach, Morgan Kaufmann Publishers, 2003, ISBN 1-55860-852-4.
[6] V. Kutalek, Performance modeling and optimization of application-specific multi-processor systems, Ph.D. thesis, Faculty of Information Technology, Brno University of Technology, 2005.
[7] V. Dvorak, “Communication Architectures for Application-Specific Multiprocessor Systems (on a Chip)”, Proc. of the 11th International Conference on Software, Telecommunications and Computer Networks SoftCOM 2003, Split, HR, FESB, 2003, pp. 778-782.
Acknowledgement
This research has been carried out under the financial
support of the research grant “Network Architectures
of Embedded Systems Networks”, GA102/05/0467,
Grant Agency of Czech Republic, 2005-2007.
Proceedings of the 17th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD’05)
1550-6533/05 $20.00 © 2005 IEEE
... In the rest of the paper, we want to analyze the complexity of collective communications if fat nodes with more than one processor and/or fat (multiple) edges are introduced into the network. The results concerning the special case of fat cubes are available [2], but here we will approach the problem in a general scope. Section 2 deals with collective communications on several base network topologies, which serve as candidates for getting fatter nodes. ...
Conference Paper
Full-text available
The paper deals with scheduling collective communications in the minimum number of communication steps; it shows how to generalize the known results regarding time complexity of collective communications on common direct networks for the same networks with fat nodes and edges. Models of node architecture composed of several processor cores that share a router are discussed. Examples of communication algorithms on fat K-ring networks with 8 to 32 processors are summarized and given in detail. The results show that fat networks, depending on their configuration, can provide a range of communication performance at a lower cost.
Article
The upper limit of a network's capacity is its liquid throughput. The liquid throughput corresponds to the flow of a liquid in an equivalent network of pipes. In coarse-grained networks, the aggregate throughput of an arbitrarily scheduled collective communication may be several times lower than the maximal potential throughput of the network. In wormhole and wavelength division optical networks, there is a significant loss of performance due to congestions between simultaneous transfers sharing a common communication resource. We propose to schedule the transfers of a traffic according to a schedule yielding the liquid throughput. Such a schedule, called liquid schedule, relies on the knowledge of the underlying network topology and ensures an optimal utilization of all bottleneck links. To build a liquid schedule, we partition the traffic into time frames comprising mutually non-congesting transfers keeping all bottleneck links busy during all time frames. The search for mutually non-congesting transfers utilizing all bottleneck links is of exponential complexity. We present an efficient algorithm which non-redundantly traverses the search space. We efficiently reduce the search space without affecting the solution space. The liquid schedules for small problems (up to hundred nodes) can be found in a fraction of seconds.
Conference Paper
Collective communications involving all processors are frequently used in the solution of demanding parallel problems, and their time complexity has a dramatic impact on performance. This paper deals with scheduling collective communications, in the minimum number of communication steps, in multiprocessor networks using the store-and-forward switching technique. We designed a novel technique of communication conflict prediction, which significantly increases the success rate of optimal communication schedule
Conference Paper
Since chip multiprocessors are quickly penetrating new application areas in network and media processing, their interconnection architectures have become a subject of sophisticated optimization. One-to-All Broadcast (OAB) and All-to-All Broadcast (AAB) [2] group communications are frequently used in many parallel algorithms, and if their overhead cost is excessive, performance degrades rapidly as the processor count grows. This paper deals with the design of a new application-specific standard genetic algorithm (SGA) and the use of Hybrid parallel Genetic Simulated Annealing (HGSA) to design optimal communication algorithms for an arbitrary topology of the interconnection network. Each of these algorithms targets a different switching technique. The OAB and AAB communication schedules were designed mainly for an asymmetrical AMP [15] network and for the benchmark hypercube network [16] using Store-and-Forward (SF) and Wormhole (WH) switching.
Article
The main objectives pursued by parallelism in communications are network capacity enhancement and fault tolerance. Efficiently enhancing the capacity of a network by parallel communications is a non-trivial task. Some applications may also allow one to split the sources and destinations into multiple sources and destinations. An example is parallel Input/Output (I/O). Parallel I/O requires scalability, high throughput and good load balance. Low granularity enables good load balance but tends to reduce throughput. In this thesis we combine fine granularity with scalable high throughput. The network overhead can be reduced and the network throughput can be increased by aggregating data into large messages. Parallel transmissions from multiple sources to multiple destinations traverse the network through many different paths, which have numerous intersections in the network. In low-latency high-performance networks, serious congestion occurs due to large indivisible messages competing for shared resources. We propose to optimally schedule parallel communications by taking into account the network topology. The developed liquid scheduling method optimally uses the potential transmission capacity of a network. Fault tolerance is typically achieved by maintaining backup communication resources, which are kept idle as long as the primary resource is operational. A challenging idea, inspired by nature, is to simultaneously use all parallel resources. This idea is applied to fine-grained packetized communications. It also relies on erasure-resilient codes for combating network failures.
Article
A deadlock-free routing algorithm can be generated for arbitrary interconnection networks using the concept of virtual channels. A necessary and sufficient condition for deadlock-free routing is the absence of cycles in a channel dependency graph. Given an arbitrary network and a routing function, the cycles of the channel dependency graph can be removed by splitting physical channels into groups of virtual channels. This method is used to develop deadlock-free routing algorithms for k-ary n-cubes, for cube-connected cycles, and for shuffle-exchange networks.
Book
Modern system-on-chip (SoC) design shows a clear trend toward integration of multiple processor cores on a single chip. Designing a multiprocessor system-on-chip (MPSOC) requires an understanding of the various design styles and techniques used in the multiprocessor. Understanding the application area of the MPSOC is also critical to making proper tradeoffs and design decisions. Multiprocessor Systems-on-Chips covers both design techniques and applications for MPSOCs. Design topics include multiprocessor architectures, processors, operating systems, compilers, methodologies, and synthesis algorithms, and application areas covered include telecommunications and multimedia. The majority of the chapters were collected from presentations made at the International Workshop on Application-Specific Multi-Processor SoC held over the past two years. The workshop assembled internationally recognized speakers on the range of topics relevant to MPSOCs. After having refined their material at the workshop, the speakers are now writing chapters and the editors are fashioning them into a unified book by making connections between chapters and developing common terminology.
Article
The upper limit of a network's capacity is its liquid throughput. The liquid throughput corresponds to the flow of a liquid in an equivalent network of pipes. However, the aggregate throughput of a collective communication pattern (traffic) scheduled by techniques unaware of the network topology may be several times lower than the maximal potential throughput of the network. In most cut-through, wormhole and wavelength division optical networks, there is a loss of performance due to congestion between simultaneous transfers sharing a common communication resource. We propose to schedule the transfers of a traffic pattern according to a schedule yielding the liquid throughput. Such a schedule, called a liquid schedule, relies on knowledge of the underlying network topology and ensures optimal utilization of all bottleneck links. To build a liquid schedule, we partition the traffic into time frames comprising mutually non-congesting transfers that keep all bottleneck links busy during all time frames. The search for mutually non-congesting transfers utilizing all bottleneck links is of exponential complexity. We present an efficient algorithm that traverses the search space non-redundantly and limits the search to only those sets of transfers that are non-congesting and use all bottleneck links. © 2004 IEEE.
Article
Representing AMD's entry into 64-bit computing, Opteron combines the backwards compatibility of the x86-64 architecture with a DDR memory controller and HyperTransport links to deliver server-class performance. These features also make Opteron a flexible, modular, and easily connectable component for various multiprocessor configurations.
[3] W. Dally, B. Towles, Principles and Practices of Interconnection Networks, The Morgan Kaufmann Series in Computer Architecture and Design, Morgan Kaufmann Publishers, 2004, ISBN: 0-12200-751-4.
[4] E. Gabrielyan, R.D. Hersch, “Efficient Liquid Schedule Search Strategies for Collective Communications”, Proc. of ICON 2004 - 12th IEEE International Conference on Networks, Singapore, Vol. 2, November 16-19, 2004, pp. 760-766.
[6] V. Kutalek, Performance modeling and optimization of application-specific multi-processor systems, Ph.D. thesis, Faculty of Information Technology, Brno University of Technology, 2005.
[7] V. Dvorak, Communication Architectures for Application-Specific Multiprocessor Systems (on a Chip).