Enhanced Overloaded CDMA Interconnect (OCI)
Bus Architecture for on-Chip Communication
Khaled E. Ahmed, Mohammed M. Farag
Electrical Engineering Department, Faculty of Engineering, Alexandria University, Alexandria, Egypt
Email: k.e.elsayed@ieee.org, mmorsy@alexu.edu.eg
Abstract—On-chip interconnect is a major building block and
a main performance bottleneck in modern complex System-on-
Chips (SoCs). The bus topology and its derivatives are the most
deployed communication architectures in contemporary SoCs.
Space switching exemplified by cross bars and multiplexers, and
time sharing are the key enablers of various bus architectures.
The cross bar has quadratic complexity while resource sharing
significantly degrades the overall system’s performance. In this
work we motivate using Code Division Multiple Access (CDMA)
as a bus sharing strategy which offers many advantages over
other topologies. Our work seeks to complement the conventional
CDMA bus features by applying overloaded CDMA practices to
increase the bus utilization efficiency.
We propose the Difference-Overloaded CDMA Interconnect
(D-OCI) bus that leverages the balancing property of the Walsh
codes to increase the number of interconnected elements by
50%. Two implementations of the D-OCI bus optimized for both
speed and resource utilization are presented. The bus operation
is validated on a Xilinx Artix-7 AC701 FPGA kit and the bus
performance is evaluated and compared to other existing bus
topologies. We also present the synthesis results for the UMC-
0.13 μm design kit to give an idea of the maximum achievable bus
frequency on ASIC platforms. Moreover, we advance a proof-of-
concept HLS implementation of the D-OCI bus on a Xilinx Zynq-
7000 SoC and compare its performance, latency, and resource
utilization to the ARM AXI bus. The performance evaluation
demonstrates the superiority of the D-OCI bus.
Keywords: SoC, CDMA, Bus Architecture, On-Chip Interconnect,
CDMA Bus, Multiple Access Interference, Overloaded CDMA.
I. INTRODUCTION
System-on-Chips (SoCs) are getting more and more com-
plex as the feature size of the building transistors scales down.
More IP cores can fit on the same die which causes an
exponential increase in the interconnection complexity [1]. The
performance of individual IP cores used in SoCs is typically
optimized by the vendor leaving the task of implementing the
on-chip interconnection architecture to the system designer.
The task of implementing on-chip interconnects is not trivial
since the wiring density directly impacts the system’s perfor-
mance, resources, and power consumption. In some applica-
tions, on-chip interconnects can be the system’s performance
bottleneck which necessitates optimizing the interconnect log-
ical topology. Buses and Networks-on-Chips (NoCs) are the
most deployed topologies for on-chip interconnect in SoCs [2].
The straightforward approach to realize on-chip communication
is space switching, exemplified by crossbar switches,
where every IP core is physically connected to every
other element by a dedicated link, providing the best achievable
connectivity. The interconnect complexity of the crossbar
scales quadratically with the number of on-chip cores [3],
rendering it a feasible solution only for a small number of
cores. Another common approach to realize on-chip communication
is the bus topology, which prevails in contemporary
SoC designs. In the bus topology, Time Division Multiple
Access (TDMA) is adopted, where all cores are interconnected
to the same bus and bus access is time shared between
interconnected elements according to the bus arbitration rules.
As the number of on-chip components increases, the efficiency
of the TDMA bus decreases due to the bus contention and
increased sharing overheads on the bus [4]. Many SoC designs
attempt to overcome this problem by employing hierarchical
bus topologies at the expense of increasing the interconnect
complexity, overhead, and power consumption [5].
The Code Division Multiple Access (CDMA) bus architec-
ture has been proposed as an alternative to the TDMA-based
bus topology to overcome the bus contention problem [6].
Direct sequence CDMA (DS-CDMA) is a well-known ap-
proach for medium sharing in wireless communication systems
where the channel is shared by assigning orthogonal spreading
codes called signatures to all transmit-receive pairs sharing the
communication channel. Code orthogonality enables channel
sharing and is measured in terms of the cross-correlation
between spreading codes which equals zero for orthogonal
spreading codes. In a CDMA bus, data from each transmit
element is spread by XORing data with a unique spreading
code or signature. Data spread by different elements are
summed together and sent over the bus. All receiver elements
simultaneously access the bus and receive the spread data sum.
Despreading is achieved by applying correlation operations
to the received sum, where each receiver can extract its data
by correlating it with the unique signature assigned for each
transmit-receive pair. Other advantages of using CDMA for on-
chip interconnect include reduced power consumption, fixed
communication latency, and reduced system complexity [7].
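The spreading and despreading steps described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the helper names (walsh_codes, spread, bus_sum, despread, transfer) are hypothetical, the Sylvester construction is one assumed way to obtain Walsh codes, and the receiver converts the binary bus sum to bipolar form before correlating, which is an assumption made to keep the arithmetic explicit.

```python
def walsh_codes(n):
    """Binary Walsh codes of length n (n a power of two), Sylvester construction."""
    codes = [[0]]
    while len(codes) < n:
        codes = [c + c for c in codes] + [c + [1 - x for x in c] for c in codes]
    return codes


def cross_correlation(a, b):
    """Correlate two binary codes after mapping chip 0 -> +1 and chip 1 -> -1."""
    return sum((1 - 2 * x) * (1 - 2 * y) for x, y in zip(a, b))


def spread(bit, code):
    """XOR the data bit with every chip of the spreading code."""
    return [bit ^ chip for chip in code]


def bus_sum(streams):
    """Add the chips of all senders, chip slot by chip slot."""
    return [sum(chips) for chips in zip(*streams)]


def despread(bus, code, senders):
    """Recover one bit by correlating the bipolar bus sum with the code."""
    # senders - 2 * s is the bus value had every sender driven +1/-1 chips.
    corr = sum((senders - 2 * s) * (1 - 2 * c) for s, c in zip(bus, code))
    return 1 if corr < 0 else 0


N = 8
CODES = walsh_codes(N)[1:]  # assumption: skip the all-zero code


def transfer(bits):
    """Send one bit per code over the shared bus and decode all of them."""
    streams = [spread(b, c) for b, c in zip(bits, CODES)]
    bus = bus_sum(streams)
    return [despread(bus, c, len(streams)) for c in CODES[:len(streams)]]
```

In this model, distinct codes have zero cross-correlation, so each receiver sees only the contribution of its own transmitter even though all senders drive the bus in the same N cycles.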
Table I shows a brief comparison between the basic crossbar,
time-shared, and CDMA buses in terms of the wiring
complexity, bus throughput, and arbitration overheads [8], [9]
for M × M interconnected elements. The CDMA bus has less
wiring complexity than the crossbar and less arbitration overhead
than the TDMA bus, and thus provides a good compromise
between the two. Furthermore, the capacity of the CDMA bus
can be increased by increasing the number of usable spreading
codes, as this work suggests, which increases the bus
throughput compared to the time-shared bus.
The set of spreading codes used in a CDMA system must
be orthogonal to each other, and any extra codes added to
this set induce Multiple Access Interference (MAI), which
arises due to the non-zero cross-correlation between
non-orthogonal spreading codes. MAI can also appear due to the
auto-correlation between asynchronous orthogonal spreading
codes. The MAI problem sets a limit on the maximum number
of users in a CDMA communication system. Consequently, the
maximum number of IP cores that can simultaneously share
the CDMA bus interconnect in an SoC is limited by MAI. In most
spreading code families, the maximum number of synchronous
orthogonal codes of length N equals the code length itself.

TABLE I. CROSSBAR, TIME-SHARED, AND CDMA BUS COMPARISON
FOR M × M INTERCONNECTED ELEMENTS
(per M bits interconnection)

Topology    Wiring complexity    Throughput    Arbitration overhead
Crossbar    M^2                  M             1
TDMA        1                    1             M
CDMA        log2(M + 1)          1             1

2015 IEEE 23rd Annual Symposium on High-Performance Interconnects
978-1-4673-9160-3/15 $31.00 © 2015 IEEE
DOI 10.1109/HOTI.2015.12
State-of-the-art techniques in wireless communications
consider deploying non-orthogonal codes for data spreading
that can still be separated and identified on the receiver side
to increase the CDMA system capacity. These techniques are
known as overloaded CDMA and are currently employed in
synchronous CDMA wireless communication systems [10]. It
was proved that the number of spreading codes can be in-
creased by about 300% in noise-free communication channels
at the expense of employing more complex decoders like the
Maximum Likelihood (ML) decoder [10], [11]. Therefore, in
this work, we attempt to apply the overloaded CDMA practices
developed in wireless communications to on-chip interconnects
to significantly increase the bus capacity without incurring
additional overheads limiting the bus performance.
In our previous work, we presented the MAI-Overloaded
CDMA Interconnect (M-OCI) bus topology and presented
a systematic approach to generate the non-orthogonal MAI-
enabled spreading codes. The number of MAI-enabled codes
equals 25% of the orthogonal code set size, thus increasing the
bus capacity by 25% [12]. In this work, we propose a different
code family that increases the bus capacity by 50%. We present
the Difference-Overloaded CDMA Interconnect (D-OCI) bus
architecture and compare it to the M-OCI and ordinary CDMA
bus topologies. We also provide the implementation results of
the reference and pipelined architectures of the D-OCI bus
optimized for resource utilization and speed, respectively.
The remainder of this paper is organized as follows:
Related work and a brief background on the conventional
CDMA bus architecture are presented in Section II. The solution
fundamentals and D-OCI bus architecture are described in
Section III. Performance evaluation in terms of resources,
maximum bus frequency, power consumption, and bus throughput
is presented in Section IV. A high-level synthesis (HLS)
implementation of the D-OCI bus and its comparison to the
AXI bus on a Zynq SoC are presented in Section V. Conclusions
and future work are given in Section VI.
II. BACKGROUND
The classical CDMA bus topology relies on orthogonal
Walsh codes to enable bus sharing. Nikolic et al. propose a
full CDMA-based bus system in [13] to decrease the number
of parallel transfer lines of TDMA buses. Multilevel 2-bit
CDMA in [4] was used as an I/O reconfiguration scheme
which also demonstrated a reduction in the bus contention
over TDMA. CDMA and TDMA are combined in the CT-
Bus where data is communicated over both the time and code
domains [7]. CDMA also has been utilized to enable intra-
chip communication in NoC topologies. In [14] a CDMA
based NoC is compared to a PTP bidirectional ring based
NoC. The simulation results show that the CDMA NoC’s
fixed data transfer latency is equal to the best case latency of
the PTP of the same channel width. The fixed data transfer
latency of the CDMA NoC is attributed to the concurrent
sharing of the communication channel by network nodes. A
hierarchical CDMA star NoC topology is presented in [15],
it is compared to a pure mesh and a Fat tree topology, the
CDMA star NoC has fewer resources and routing complexity
than its rivals. In [16], a wireless CDMA NoC architecture was
demonstrated to have significantly lower energy dissipation
and higher bandwidth than a TDMA NoC.
Most related work addressing CDMA on-chip interconnects
investigates architectural and topological enhancements
and performance evaluation for the conventional DS-CDMA
communication scheme. In this work, we address a different
aspect of the CDMA technology for on-chip interconnects. We
investigate increasing the bus capacity by applying overloaded
CDMA to the existing on-chip CDMA bus topology. We make
use of the ordinary CDMA bus architecture presented by
Nikolic et al. in [17] with some modifications to develop the
overloaded CDMA bus. Therefore, we present a brief overview
of the ordinary CDMA bus topology in this section.
Figure 1 shows the block diagram of the conventional
CDMA bus. The system is composed of a number of XOR
encoders and accumulator-based decoders. In the encoder, an
N-chip binary orthogonal code, generated from the
Walsh spreading code family, is XORed with the data bit and
sent out serially, so that a single bit is spread over a
duration of N clock cycles. The number of transmit-receive
IP core pairs sharing the bus equals M, where M ≤ N. For
the ordinary CDMA bus topology using Walsh spreading codes,
M = N. Serial streams from all transmitting cores sharing the
bus are added together, and the sum is represented in binary
and sent to a decoding circuit feeding the receiving IP cores.
Fig. 1. SoC CDMA XOR encoder and accumulator decoder

The decoder is implemented as a wrapper that cross-correlates
the serialized channel sum with the signature code
assigned to the transmit-receive pair. As the spreading codes
are generated from the bipolar Walsh code family, decorrelation
(despreading) mainly involves two operations:
multiplication of the sum by ±1 and accumulation. The bus data is passed
to the zero accumulator when the current chip value equals
“0” and to the one accumulator when the chip value equals
“1”. The one and zero accumulator circuits accumulate
their inputs during the decoding cycle and are reset to zero
at the beginning of each decoding cycle. Consequently, each
accumulator adds N/2 different inputs during the decoding
cycle because the spreading signature codes are balanced. At
the end of the decoding cycle, if the zero accumulator content
is greater than the one accumulator content, the original data
bit is “1”; otherwise, the original data bit is “0”. The choice
of Walsh spreading codes is of particular importance for the
design of the overloaded CDMA codes presented in this paper
due to two properties: the balancing property, which causes a
constant difference between the two accumulators at the end
of the decoding cycle; and the even difference between the
bus values in a decoding pair, to be discussed in the next section.
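The accumulator-based decoding rule can be sketched as follows. This is an illustrative software model, not the authors' hardware design: the function names are hypothetical, the Walsh codes come from an assumed Sylvester construction, and the all-zero code is skipped so that every code in use is balanced.

```python
def walsh_codes(n):
    """Binary Walsh codes of length n (n a power of two), Sylvester construction."""
    codes = [[0]]
    while len(codes) < n:
        codes = [c + c for c in codes] + [c + [1 - x for x in c] for c in codes]
    return codes


def channel_sum(bits, codes):
    """XOR-spread each bit with its code and add the streams chip by chip."""
    streams = [[bit ^ chip for chip in code] for bit, code in zip(bits, codes)]
    return [sum(chips) for chips in zip(*streams)]


def accumulator_decode(bus, code):
    """Route each bus value to the zero or one accumulator, then compare."""
    zero_acc = 0
    one_acc = 0
    for value, chip in zip(bus, code):
        if chip == 0:
            zero_acc += value
        else:
            one_acc += value
    bit = 1 if zero_acc > one_acc else 0
    return bit, zero_acc - one_acc


N = 8
CODES = walsh_codes(N)[1:]  # assumption: the N - 1 balanced Walsh codes

if __name__ == "__main__":
    bus = channel_sum([1, 0, 0, 1, 0, 1, 1], CODES)
    for code in CODES:
        print(accumulator_decode(bus, code))
```

With all seven codes active in this model, the accumulator difference is +N/2 when the decoded bit is “1” and -N/2 when it is “0”, which is the constant difference attributed to the balancing property.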
In our previous work [12], we established a non-orthogonal
spreading code family with an AND gate encoder that exploits
the steady difference of ±N/2 between the two accumulators
to encode extra data and increase the bus capacity. The codes
were built such that their effects are mutually exclusive, thus
enabling errorless detection of the spreading codes. The MAI
code family mimics MAI in wireless communications, the
main difference being that the MAI here is controllable,
measurable, and carries encoded data. Unfortunately, the
MAI-code design limited the number of the overloaded
non-orthogonal codes to only 25% of the spreading code length N.
III. DIFFERENCE-OVERLOADED CDMA INTERCONNECT
(D-OCI) CODE DESIGN AND BUS ARCHITECTURE
Our main objective is increasing the number of IP cores
sharing the ordinary CDMA bus while keeping the system
complexity unchanged by using simple encoding and decoding
circuitry. To achieve this goal, we propose slight modifications
to the ordinary CDMA bus. Figure 4 shows the overloaded
CDMA bus architecture for a single-bit interconnect. The same
architecture is replicated for a multi-bit CDMA bus. M
transmit-receive IP core pairs share the CDMA bus, and spread data
from the transmit IP cores are summed together using an arithmetic
adder circuit having M binary inputs and an output of m-bit
width, where m = log2(M). Each transmit and receive IP
core is interfaced to an encoder or a decoder wrapper for data
spreading and despreading. The CDMA bus is only used by
the data signals while control signals are not interfaced by the
CDMA architecture. The destination address of data sent by
any transmit IP core is embedded in the signature code which
can eliminate the need for an address bus. The bus controller
is responsible for assigning spreading and despreading codes
and handshaking with the transmit and receive IP cores.
There is an interesting property of the Walsh code family
used in the conventional CDMA bus system. The difference
between any two consecutive channel sums on the bus produced
by any combination of spread data is always even. For the
accumulator decoder used here, this property forces the difference
between the zero accumulator input and the consecutive one
accumulator input to be always even. If all data sent is “0”, the
bus data at any cycle is either “0” or N/2, so if a code is flipped
(its encoded data = 1), then the bus data can be either “1” or
N/2 − 1. If the flipped code is used as a despreading code, then
the difference between the bus values when the despreading
code is “0” and “1” is even: ±(1 − (N/2 − 1)) = ±(2 − N/2).
Flipping any other code will not affect the even difference
since the codes are orthogonal: any other flipped code will
add either “1” or “0” to both accumulator inputs, so the Pair
Difference (PD) will remain even. In Figure 2, only three codes
are sufficient to illustrate how flipping an orthogonal code does
not affect the even difference in a decoding pair.
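The even-difference property can be checked with a short worked derivation. The notation below (S_j for the bus sum at chip j, d_i for the data bit of core i, and c_i[j] for chip j of its Walsh code) is introduced here only for illustration and is not taken from the original text.

```latex
\begin{align*}
S_j &= \sum_{i} \bigl(d_i \oplus c_i[j]\bigr) \\
\Delta S_j &= 1 - 2\,c_i[j] \in \{+1, -1\}
  && \text{(one data bit } d_i \text{ flipped from 0 to 1)} \\
\Delta S_j - \Delta S_{j'} &= -2\,\bigl(c_i[j] - c_i[j']\bigr) \in \{0, \pm 2\}
\end{align*}
```

Starting from the all-zero data case, where every bus value is 0 or N/2 (an even number for N ≥ 4), each flipped code changes the difference between any two bus values by an even amount, so the difference stays even.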
One can exploit this unique property to design a set of extra
non-orthogonal spreading codes and, consequently, increase
the bus capacity. We develop a set of non-orthogonal spreading
codes that alter the channel sum to produce an odd difference
between consecutive bus values at specific time slots. The
two cycles where the bus difference is computed are called
the decoding pair, and the proposed non-orthogonal codes are
called the Pair Difference Spreading (PD_S) codes. The new
codes cause MAI to appear on the bus, but it does not invalidate
the decoding operation as long as the added MAI does not
shift the accumulators’ difference by more than N/2.
Unlike orthogonal spreading codes, which are XORed with
the binary data bit, the PD_S codes are combined with the
binary data bit using an AND gate. The AND gate encoder works
as follows: if the sent data is “0”, it sends a stream of zeros that
does not disturb the even bus difference, and if the sent data is
“1”, it sends one of the PD_S codes. Therefore, the additional
PD_S code makes the bus difference between the two
cycles in a decoding pair either even or odd. The XOR encoder of
the ordinary CDMA bus cannot be used to encode the PD_S
codes because it only complements the spreading code chips,
so an XOR gate will cause the difference to be odd whether
the data is “0” or “1”. A hybrid encoder is developed for both
orthogonal and non-orthogonal spreading with an XOR gate,
an AND gate, and a multiplexer unit, as shown in Figure 4.
A. Pair Difference Spreading and Despreading Code Design
Before proceeding to the bus architecture, we will discuss
how to design the PD_S codes for an arbitrary balanced
orthogonal code family of length N. Figure 3 shows the encoding
and decoding of four PD_S codes overloaded to the set of three
codes shown in Figure 2. Let us consider a non-orthogonal
PD_S code whose first chip is “1” and whose remaining chips
are all “0” over the N clock cycles of the data encoding
and decoding cycle. Assume this code is assigned to an extra IP
core sharing the bus. When this core accesses the bus and sends
“1”, this code is sent and the single chip of “1” is the input to
either the one or the zero accumulator in each orthogonal decoder,
depending on the despreading code. This code contributes an MAI
value corresponding to only one chip, and the difference D
between the accumulators of an orthogonal decoder is:
$D = \pm\frac{N}{2} + 1$    (1)
The difference between the bus values in the decoding pair is:
$\widehat{PD}(k) = PD(k) + 1$    (2)
where k is the index of the decoding pair, PD(k) is the original
even pair difference, and $\widehat{PD}(k)$ is the pair difference after
adding the non-orthogonal PD_S code. If $\widehat{PD}(k)$ is even, then
the sent bit is “0”; if it is odd, then the sent bit is “1”.
Thus, the decoded bit at PD decoder k is $\widehat{PD}(k)$ modulo 2,
which can be implemented by XORing the LSBs of
the two bus values in the decoding pair. Since the orthogonal
codes are balanced, the numbers of ones and zeros in the
despreading code are both equal to N/2. Therefore, the
number of decoding pairs is N/2, which is also the maximum
number of non-orthogonal PD_S codes that can be added to
the bus, because the sign of the accumulator difference D might
change if the number of added chips exceeds N/2,
invalidating the detection of orthogonal spreading codes. Since k
is the index of the decoding pair, it ranges from 1 to N/2.
A shift register is needed to hold the first value of the bus pair
until the second value arrives in order to XOR the two values.

Fig. 2. The balancing property of Walsh codes: flipping any of the orthogonal codes does not affect the even difference in a decoding pair.

Fig. 3. Encoding and decoding of four PD_S codes overloaded to three
orthogonal codes.
To simplify the design of the decoder circuit, we can select
the Pair Difference Despreading (PD_D) code to be
{0,1,0,1,0,1,...}. Thus the first decoding pair is Bus(1) and
Bus(2), the second is Bus(3) and Bus(4), and so on. This
results in a simple shift register structure because the required
difference is between two successive decoder inputs. The first
PD decoder requires an N-bit shift register since the two values
to be subtracted arrive first on the bus and should be held until
the Nth decoding cycle. The second requires an (N − 2)-bit
shift register, and so forth. The last PD decoder requires only a
2-bit shift register. Hence the total number of 1-bit
shift registers needed is N^2/4 + N/2. Dividing this number by the total
number of PD decoders, N/2, yields N/2 + 1 registers per PD
decoder. The PD_D code can be any one of the codes in the
orthogonal code set since the despreading code must be both
orthogonal and balanced in order to yield the even difference
in a decoding pair. To minimize the width of registers per PD
decoder, the chips in every decoding pair must be adjacent to
eliminate the need to store bits between the two chips.
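The register count quoted above follows from summing the shift-register lengths N, N − 2, ..., 2 over the N/2 PD decoders. The worked sum below is only a sketch of that arithmetic, and the symbol R_total is introduced here for illustration.

```latex
\begin{align*}
R_{\text{total}} &= \sum_{k=1}^{N/2} 2k
  = \frac{N}{2}\left(\frac{N}{2} + 1\right)
  = \frac{N^2}{4} + \frac{N}{2}, \\
\frac{R_{\text{total}}}{N/2} &= \frac{N}{2} + 1 .
\end{align*}
```

For example, N = 8 gives 8 + 6 + 4 + 2 = 20 one-bit registers in total, or 5 per PD decoder.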
For the orthogonal signature decoders, the difference between
the two accumulators is no longer ±N/2 because of the
MAI caused by the non-orthogonal PD_S codes. However, a
comparator circuit can still detect data encoded by orthogonal
spreading codes by comparing the one and zero accumulator
contents, as long as the total MAI value contributed by the
non-orthogonal codes is less than N/2, which preserves the sign of
the difference and consequently allows orthogonal code
despreading. To clarify this, we present an example for a code
length N = 8, where the selected PD_D code is:
$PD_D = \{0,1,0,1,0,1,0,1\}$    (3)
which is the concatenation of four consecutive decoding pairs.
We can generate the PD_S codes using the designed despreading
code. Generally, $PD_S[k] = 2^l$, where l is the location of the
next “0” chip in the despreading code counted from the LSB
upwards. Therefore, for k = {1,2,3,4}, l = {7,5,3,1}, and
the PD_S codes are:
$PD_S[1] = 2^7 = \{1,0,0,0,0,0,0,0\}$
$PD_S[2] = 2^5 = \{0,0,1,0,0,0,0,0\}$
$PD_S[3] = 2^3 = \{0,0,0,0,1,0,0,0\}$
$PD_S[4] = 2^1 = \{0,0,0,0,0,0,1,0\}$
(4)
Thus, each PD_S code either adds only a single chip to a
decoding pair or does not, according to the data to be sent.
Generally speaking, there are N/2 cycles available to
encode PD_S bits and N/2 free cycles.
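The N = 8 example can be exercised end to end with a small Python model. It is a sketch under stated assumptions, not the authors' implementation: all function names are hypothetical, the orthogonal users are assumed to use the N − 1 balanced Walsh codes from a Sylvester construction, and the PD decoder assumes the despreading code {0,1,0,1,...} so that each decoding pair occupies two adjacent chips.

```python
def walsh_codes(n):
    """Binary Walsh codes of length n (n a power of two), Sylvester construction."""
    codes = [[0]]
    while len(codes) < n:
        codes = [c + c for c in codes] + [c + [1 - x for x in c] for c in codes]
    return codes


def pds_codes(pdd):
    """One single-chip spreading code per '0' chip of the despreading code."""
    n = len(pdd)
    return [[1 if j == pos else 0 for j in range(n)]
            for pos, chip in enumerate(pdd) if chip == 0]


def encode(orth_bits, orth_codes, pd_bits, pd_codes):
    """Hybrid encoding: XOR for orthogonal codes, AND for pair-difference codes."""
    streams = [[b ^ chip for chip in code] for b, code in zip(orth_bits, orth_codes)]
    streams += [[b & chip for chip in code] for b, code in zip(pd_bits, pd_codes)]
    return [sum(chips) for chips in zip(*streams)]  # bus adder output per chip


def decode_orthogonal(bus, code):
    """Compare the zero and one accumulators of an orthogonal decoder."""
    zero_acc = sum(s for s, c in zip(bus, code) if c == 0)
    one_acc = sum(s for s, c in zip(bus, code) if c == 1)
    return 1 if zero_acc > one_acc else 0


def decode_pd(bus, k):
    """PD decoder k (0-based): XOR the LSBs of the two bus values in pair k."""
    return (bus[2 * k] ^ bus[2 * k + 1]) & 1


N = 8
ORTH = walsh_codes(N)[1:]  # assumption: the N - 1 balanced Walsh codes
PDD = walsh_codes(N)[1]    # the despreading code {0,1,0,1,0,1,0,1}
PDS = pds_codes(PDD)       # four single-chip codes, as in Eq. (4)


def roundtrip(orth_bits, pd_bits):
    """Encode all users onto the bus and decode every receiver."""
    bus = encode(orth_bits, ORTH, pd_bits, PDS)
    return ([decode_orthogonal(bus, c) for c in ORTH],
            [decode_pd(bus, k) for k in range(N // 2)])
```

In this model, a call such as roundtrip([1, 0, 0, 1, 0, 1, 1], [1, 0, 1, 1]) returns the transmitted orthogonal and pair-difference bits unchanged, since each single-chip code flips the parity of exactly one decoding pair.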
B. Basic and Optimized Difference-Overloaded CDMA Inter-
connect (D-OCI) Decoder Architectures
The non-orthogonal PD decoder is only required to find
the difference between the bus values at the two bus
cycles of a decoding pair. As illustrated in Figure 4, the
transmit IP cores are interfaced to the encoder wrappers, and
the receiving memory/peripheral units (MPUs) are connected
to the decoder wrappers. We apply a static code allocation
scheme where each transmit-receive pair has a fixed signature
code, and the added N/2 decoders are connected to N/2 MPUs.
There are 1.5N encoders and 1.5N decoders; the decoders are
divided into N orthogonal decoders and N/2 PD decoders
that decode data for the N/2 additional MPUs, as explained previously.
Encoders are configured by applying specific spreading codes,
according to the intended communication link. If the intended
link uses an orthogonal spreading code, the XOR encoder is
selected; otherwise, the AND encoder is selected.
We implemented two variants of the bus, a reference archi-
tecture, and a pipelined architecture. The reference architecture
is a direct implementation of the spreading and despreading
circuitry without adding any non-functional registers. The
pipelined architecture is implemented to increase the bus
operating frequency and, consequently, throughput by adding
non-functional registers to reduce the bus critical path. Two
pipelining registers are inserted around the bus adder circuit
as shown in Figure 4. The encoded data register holds data
encoded by the orthogonal and PD encoders while the sum
register holds the adder output to be passed to the decoding
circuitry. Thus, the critical path inside the CDMA bus circuitry
is reduced to include the longest path in the three parts, which
is usually the adder circuit. This architecture can be pipelined
further by breaking the critical path in the adder circuit, but
at the expense of adding more pipelining registers. The bus
register is m bits wide, where m = log2(1.5N), for the
orthogonal decoders, but only 1 bit wide for the PD decoders since
only the LSB is required for PD decoding.
Fig. 4. Pipelined Difference-Overloaded CDMA bus system containing the hybrid encoder and both the orthogonal and the PD overloaded code decoders.
IV. PERFORMANCE EVALUATION
A. Overloaded CDMA Interconnect (OCI) Bus Evaluation
In this section, we present the evaluation results of the
overloaded CDMA bus. A system containing a number of IP
cores and peripheral devices was built with full capacity, i.e.
the number of IP cores is the maximum number offered by the
bus. All CDMA bus variants are implemented and validated
on an Artix-7 AC701 evaluation kit. Specifically, we compare
between the conventional CDMA bus, M-OCI bus, basic and
pipelined D-OCI bus variants for different spreading code
lengths (number of chips) N={8,16,32,64}. To establish a
fair comparison between different bus architectures connecting
a number of elements, all performance metrics are normalized
to the number of interconnected elements, i.e. all performance
metrics for a bus interconnecting M IP cores are divided by
M to evaluate bus performance per IP core. Evaluation results,
including resource utilization expressed in LUTs and Flip-
Flops per IP core, maximum bus frequency, dynamic power
consumption per IP core, and the bus bandwidth are shown in
Figure 5. To give an idea about ASIC implementation of the
CDMA bus, initial synthesis results of the bus using UMC-
0.13 μm ASIC cell library are illustrated in Figure 5.
As depicted in Figures 5(a) and 5(b), for a fixed spreading
code length N, the resource utilization per IP core is lower
than that of the ordinary CDMA bus by 25% for the M-OCI bus
and by 50% for the D-OCI bus. This
resource reduction per IP core is due to the significant increase
in bus capacity compared to the marginal overhead added by
the OCI circuitry. Also, for a fixed spreading code length N,
the D-OCI bus uses fewer resources per IP core than
the M-OCI bus due to its higher overloading percentage.
Increasing the spreading code length N increases the resource
utilization per IP core due to the increase in the bus complexity.
Specifically, with increasing N, the size of the bus adder and
accumulator decoder circuitry increases. It is also worth
noting in Figures 5(a) and 5(b) that the resource utilization
of the pipelined D-OCI bus is always larger than that of the basic
D-OCI bus due to the added non-architectural pipelining registers.
For all CDMA bus variants, the operating frequency is
limited by the critical path length, which includes the spreading
circuit, channel adder, and accumulator decoder components.
For CDMA buses of the same spreading code length
N, the orthogonal spreading and despreading circuits are identical,
the non-orthogonal data encoders and decoders run in
parallel with the orthogonal spreading circuitry with a shorter
critical path, and the input size of the adder circuit
is equal to the number of transmit IP cores M, which varies
with the CDMA bus type. Figure 5(c) illustrates that for
a fixed spreading code length N, the bus frequency of the
overloaded CDMA buses is lower than the basic CDMA bus
frequency due to the increase in the adder circuit size. The
pipelined design isolates the critical path at the CDMA bus
adder tree, which improves the maximum bus frequency at
the expense of the extra non-architectural registers and output
latency. The bus frequency decreases with increasing N for
both overloaded and ordinary CDMA buses due to the increasing
computational complexity of the adders, as shown in
Figure 5(c). The operating frequency of the UMC-0.13 μm
implementation of the CDMA bus is about 10x greater than
that of its FPGA implementation counterparts.
With increasing N, the drop in frequency is compensated by the increase in the bus bandwidth due to the capacity enhancement offered by the overloaded buses, as shown in Figure 5(d). The bus bandwidth is plotted for a single bit per IP core interconnected via the CDMA bus. For a fixed N, we can clearly see the bandwidth enhancement of the D-OCI bus over the M-OCI and conventional CDMA buses, and of the pipelined D-OCI bus over the basic D-OCI bus. Generally, the CDMA bus bandwidth BW is given by the following equation:
$$BW = N_{bits} \, f_b \, \frac{M}{N} \tag{5}$$
(a) Resources as combinational (hashed) and non-combinational (solid) in μm2/IP vs. spreading code length N
(b) Resources in LUTs (hashed) and FFs (solid) per IP vs. spreading code length N
(c) Maximum bus frequency in MHz vs. spreading code length N
(d) Bus bandwidth in Mbps vs. spreading code length N
(e) Power in mW/IP vs. spreading code length N
Fig. 5. Synthesis and implementation results of the overloaded CDMA bus for code length N = {8, 16, 32, 64}.
where $N_{bits}$ is the number of interconnected bits per IP core (data bus width), $f_b$ is the bus frequency, $M$ is the number of transmit-receive core pairs sharing the bus, and $N$ is the spreading code length. The M-OCI and D-OCI buses offer a significant bandwidth improvement over the ordinary CDMA bus because their overloading ratios are $M/N = 1.25$ and $M/N = 1.5$, respectively, compared to the basic CDMA bus ratio of $M/N = 1$.
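As a quick numerical check of Eq. (5), the following minimal Python sketch evaluates the bandwidth of a single-bit bus. It uses the D-OCI ASIC figures listed in Table II (M = 11, N = 8, bus frequency 1000 MHz) as inputs; the function name `cdma_bus_bandwidth` and its parameter names are illustrative choices and not part of the original design.

```python
def cdma_bus_bandwidth(n_bits: int, f_b_mhz: float, m: int, n: int) -> float:
    """Evaluate Eq. (5): BW = N_bits * f_b * M / N (Mbps when f_b is in MHz)."""
    if n <= 0:
        raise ValueError("spreading code length N must be positive")
    return n_bits * f_b_mhz * m / n


# D-OCI, one bit per IP core, ASIC numbers taken from Table II.
print(cdma_bus_bandwidth(n_bits=1, f_b_mhz=1000, m=11, n=8))  # 1375.0

# Conventional CDMA bus at the same frequency, with M/N = 1 as stated in the text.
print(cdma_bus_bandwidth(n_bits=1, f_b_mhz=1000, m=8, n=8))   # 1000.0
```

The first value agrees with the 1,375 Mbps reported for the ASIC D-OCI bus in Table II.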
As illustrated in Figure 5(e), for a fixed spreading code length N, the power dissipation per IP core is lower for the M-OCI and D-OCI buses because of the offered capacity enhancement. As N increases, the power dissipation per IP core increases for all CDMA buses because of the increased size and complexity of the bus components. These conclusions apply to both the ASIC and FPGA implementations of the bus. However, on the FPGA platform the routing overhead of the D-OCI bus raises its dynamic power consumption above that of the M-OCI bus. The ASIC synthesis (pre-place-and-route) results do not include routing information, so the D-OCI bus appears to consume less power than the M-OCI bus.
B. OCI Bus Comparison to Other Interconnect Topologies
To evaluate the CDMA interconnect performance relative to TDMA and SDMA, we implemented the basic architectures of the TDMA and SDMA buses illustrated in Figures 6 and 7, respectively. The TDMA bus is basically composed of multiplexer and demultiplexer circuits connected back-to-back, as shown in Figure 6. An arbiter module selects the modules to be connected in specific time slots according to access priorities and arbitration rules. Access time is divided among the elements sharing the bus, and the bus utilization cannot be increased beyond 1 because only M transmit-receive pairs can access the bus in M time slots. Although the arbitration overhead in TDMA buses is significant, we implement only the switching elements without the arbiter in order to assess the basic concept. The SDMA bus depicted in Figure 7 is mainly composed of M multiplexers, each with M inputs, which connect M × M elements without blocking communication for any element. The SDMA bus dedicates a physical link to every pair of interconnected elements, which provides uninterrupted communication at the expense of increased resource utilization. A new multiplexer is needed for every additional transmit-receive pair, and the complexity of the existing multiplexers increases because of the additional input/output pair.
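The difference between the two baseline topologies can be illustrated with a small behavioural model. The Python sketch below is not the RTL used in this work: the function names are hypothetical, and the TDMA model assumes a fixed round-robin slot order because the arbiter is omitted, as in the implementation described above.

```python
from typing import List, Optional


def sdma_transfer(inputs: List[int], select: List[int]) -> List[int]:
    """M multiplexers with M inputs each: output j forwards inputs[select[j]]."""
    return [inputs[src] for src in select]


def tdma_transfer(inputs: List[int], dest: List[int]) -> List[List[Optional[int]]]:
    """One shared link: in slot t only master t drives the bus toward dest[t].

    Returns, for every time slot, the value seen at each of the M outputs
    (None means the output is idle in that slot).
    """
    m = len(inputs)
    slots: List[List[Optional[int]]] = []
    for t in range(m):  # assumed round-robin order, no arbiter
        outputs: List[Optional[int]] = [None] * m
        outputs[dest[t]] = inputs[t]  # demultiplexer routes the single bus value
        slots.append(outputs)
    return slots


data = [1, 0, 1, 1]
# SDMA: all four transmit-receive pairs are served in a single time slot.
print(sdma_transfer(data, select=[3, 2, 1, 0]))
# TDMA: the same four transfers need four time slots, one pair per slot.
print(len(tdma_transfer(data, dest=[3, 2, 1, 0])))
```

The model makes the trade-off visible: the SDMA crossbar needs M selectors of width M, while the TDMA bus needs M slots to serve M pairs.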
Fig. 6. Basic TDMA bus topology
Fig. 7. Basic SDMA bus topology

The TDMA and SDMA buses of Figures 6 and 7 and the basic CDMA bus of Figure 1 are implemented on the Xilinx Artix-7 AC701 kit, and the synthesis results are illustrated in Figure 8. The resource utilization is expressed as the number of LUTs and FFs. As depicted in Figure 8(a), the resource utilization in the case of the TDMA bus is constant, M/M = 1. For the SDMA bus, the resource utilization is M^2/M = M, which follows the linear trend shown in Figure 8(a). The CDMA bus resource utilization is M log2(M)/M = log2(M), which results in a logarithmic utilization trend. The bandwidth of the SDMA bus, on the other hand, is M times the constant bandwidth of the TDMA and CDMA buses, as depicted by the log-scaled bandwidth comparison of Figure 8(c). Increasing wiring complexity reduces the bus frequency; thus the TDMA bus can achieve a higher bus frequency than the CDMA bus, which in turn has a higher bus frequency than the SDMA bus, as shown in Figure 8(b). The dynamic power consumption of the CDMA bus shown in Figure 8(d) is significantly higher than that of the SDMA and TDMA buses because of the larger number of deployed registers. Nevertheless, the power consumption of the CDMA bus approaches that of the SDMA bus as M increases, owing to the quadratic increase in the wiring complexity of the SDMA bus.
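The scaling trends quoted above can be tabulated with a few lines of Python. This is only a sketch of the asymptotic expressions in the text (unitless trends, not LUT or FF counts), and the function names are illustrative.

```python
import math


def normalized_resources(m: int) -> dict:
    """Resource trend divided by M for each topology, as stated in the text."""
    return {
        "TDMA": m / m,                 # constant
        "SDMA": m * m / m,             # linear in M
        "CDMA": m * math.log2(m) / m,  # logarithmic in M
    }


def relative_bandwidth(m: int) -> dict:
    """Bandwidth relative to the TDMA bus: SDMA is M-fold, CDMA equals TDMA."""
    return {"TDMA": 1, "SDMA": m, "CDMA": 1}


# Bus sizes used in Figure 8.
for m in (16, 32, 64, 128):
    print(m, normalized_resources(m), relative_bandwidth(m))
```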
The above analysis shows that the conventional CDMA bus has an area penalty relative to the TDMA bus but offers equal bandwidth. Likewise, the conventional CDMA bus has a bandwidth penalty relative to the SDMA bus but occupies a much smaller area. The OCI bus architecture mitigates both penalties by increasing the bus bandwidth and reducing the resource utilization per IP core. In other words, the OCI buses can increase the bus bandwidth by overloading the channel, which cannot be achieved with TDMA buses, at the expense of a bus complexity on the order of log2(M), which is much lower than that of the SDMA bus. Thus, the OCI bus is a good compromise for SoCs requiring higher bandwidth than TDMA bus topologies achieve but significantly less area than SDMA bus architectures. The implementation results reinforce the theoretical comparison provided in Table I.
The CDMA interconnect has a larger bandwidth-per-area ratio of 1/log2(N+1) compared to the SDMA interconnect ratio of 1/N. This ratio is significantly increased by the M-OCI and D-OCI buses to 1.25/log2(N+1) and 1.5/log2(N+1), respectively. The OCI buses have a lower bandwidth-to-resources ratio than the TDMA bus because the added complexity is significantly larger than the increase in bandwidth. An exclusive advantage of the OCI bus over the TDMA and SDMA buses is the capability of overutilizing the communication channel. In the TDMA bus, each time slot can carry at most 1 bit, and in the SDMA bus each PTP link can likewise carry no more than 1 bit. Conventional CDMA buses can carry at most N bits per N time slots, or one bit per time slot. The OCI bus, in contrast, can increase the channel utilization beyond one via channel overloading.
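To make the bandwidth-per-area comparison concrete, the sketch below evaluates the four ratios quoted in this paragraph for the code lengths studied in this work. The function name is an illustrative assumption, and the output is the normalized ratio from the text rather than a measured quantity.

```python
import math


def bandwidth_per_area(n: int) -> dict:
    """Bandwidth-per-area ratios quoted in the text for spreading code length N."""
    cdma = 1 / math.log2(n + 1)
    return {
        "SDMA": 1 / n,
        "CDMA": cdma,
        "M-OCI": 1.25 * cdma,
        "D-OCI": 1.5 * cdma,
    }


for n in (8, 16, 32, 64):
    ratios = bandwidth_per_area(n)
    print(n, {name: round(value, 3) for name, value in ratios.items()})
```

For every N in this range the ordering is D-OCI, then M-OCI, then conventional CDMA, then SDMA, and the gap to SDMA widens as N grows.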
Table II compares the OCI bus with other interconnect architectures presented in the literature in terms of resources, capacity, frequency, bandwidth, and bandwidth-to-resources ratio. It should be noted that in all the compared interconnects, a full system was implemented including bus arbiters. As shown in Table II, the D-OCI bus has the highest bandwidth-to-resources ratio among the compared interconnect topologies. We should indicate that OCI is still in its initial development phase, and various architectural and functional optimizations can still be applied to compete with the state-of-the-art HOT interconnects. For instance, in this work we presented only the OCI bus architecture to illustrate the bus overloading idea. However, the same concept can also be used to build an OCI NoC architecture, which can significantly increase the interconnect bandwidth.
V. HLS AND AUTOMATION OF THE D-OCI BUS IP CORE
In the last section, we presented only a simple implementation of the D-OCI bus interconnecting a set of elements generating test data, and we compared the D-OCI bus to other CDMA bus variants. However, such an evaluation is not sufficient to prove that the bus is competitive with the long-established SoC buses in deployment. We are currently automating the generation of the D-OCI bus IP core to facilitate bus deployment in modern SoCs. In this section, we present a direct comparison between an initial prototype of the D-OCI core and the ARM AXI bus deployed in the Xilinx Zynq-7000 SoC. The D-OCI IP core is implemented using the Xilinx Vivado HLS design flow, while the AXI bus IP core is provided by Xilinx. The HLS C-to-RTL flow allows quick, automated implementation and verification of the core, and the Vivado HLS tool also offers compiler directives for RTL optimization. The AXI bus is chosen for this comparison because of its widespread deployment in modern SoCs, the availability of a number of bus variants for different performance needs, and its extensive support by different vendors and CAD tools. Moreover, the AXI bus does not require the number of connected elements to be a power of 2, which facilitates its comparison to the D-OCI bus.
A SoC bus testbed shown in Figure 9 is built on a Zynq-
7000 SoC to evaluate the D-OCI bus. The testbed comprises
the Bus Under Test (BUT), M master and M slave IP cores,
an ARM processor, and a controller described as follows:
a) The BUT: The D-OCI bus is compared against the AXI
crossbar switch, namely the Shared Address Multiple Data
(SAMD) mode, and against the AXI TDMA switch, namely
the Shared Address Shared Data (SASD) mode.
b) Master IP cores: The M master IPs in this testbed are each capable of generating one data beat to be written to one slave.
(a) Resources as combinational (hashed) and non-combinational (solid) in LUT-FF
(b) Maximum bus frequency in MHz
(c) Log scaled bus bandwidth in Mbps
(d) Dynamic power dissipation in mW
Fig. 8. Synthesis and implementation results of TDMA, SDMA and CDMA bus topologies of M × M size for M = {16, 32, 64, 128}.
TABLE II. AREA, FREQUENCY, BANDWIDTH AND BANDWIDTH TO RESOURCES RATIO OF THE D-OCI BUS VERSUS OTHER INTERCONNECTS

| Bus Topology | Implementation Technology | Bus Capacity (Masters × Slaves) | Resources (Gate Count) | Frequency (MHz) | Bandwidth (Mbps) | Bandwidth to resources ratio (Mbps/Gate Count) |
|---|---|---|---|---|---|---|
| D-OCI | ASIC 0.13 μm | 11 × 11 | 1,268 | 1000 | 1,375 | 1.08 |
| CDMA router [15] | ASIC 0.18 μm | 7 × 7 | 47,754 | 94.2 | 21,100 | 0.44 |
| CDMA NoC [14] | ASIC 0.18 μm | 6 × 6 | 272,806 | Asynch | 7,410 | 0.027 |
| PTP NoC [14] | ASIC 0.18 μm | 6 × 6 | 177,007 | Asynch | 6,756 | 0.038 |
| D-OCI | Artix7 | 11 × 11 | 503 | 146 | 187 | 0.37 |
| CDMA wrapper [18] | Virtex5 | 4 × 4 | 2,064 | - | 587.6 | 0.28 |
| PTP MPEG [19] | Virtex2 | 7 × 7 | 43,248 | - | 4,714 | 0.108 |
| TDMA MPEG [19] | Virtex2 | 7 × 7 | 40,048 | - | 3,669 | 0.09 |
| NoC MPEG [19] | Virtex2 | 7 × 7 | 41,768 | - | 4,622 | 0.11 |
$ 
9%
8
8$$
;#
<1
$
:
8
!
$ 
9%
8
8$$
;#
<1
8
!
4
9%
4
9%
8
!
8
8$$
;#
<1
8
!
8
8$$
;#
<1
9=
:$

9%
< %



:$

9%

:$
$$

9%$$
Fig. 9. SoC testbed in for the D-OCI and AXI bus architectures
The data beat and the address of the slave are specified at compile time. When the BUT is the D-OCI bus, the master IPs are connected to it via a data bus and an address bus with valid/ready handshake signals. When the AXI bus is the BUT, the master IPs are connected to it through an AXI bus wrapper provided by Xilinx.
c) Slave IP cores: The slave IPs poll the bus for the data beat and assert a “transaction done” signal to validate that the data is received correctly. The correct data that each slave should receive is known at compile time. The slave IPs are connected to the D-OCI bus by a data bus and valid/ready handshake signals. The AXI bus wrapper is also used to connect the slaves to the AXI bus.
d) An Integrated Logic Analyser (ILA) and a counter: The counter is used to measure the latency of the BUT in clock cycles. The counter also has a one-hot “start” output that starts the entire system at a fixed count. The ILA probes the “transaction done” signals, which indicate the correctness of the data received at each slave, and a “done” signal for each slave, which indicates that the slave is no longer polling for data. The bus latency in clock cycles is the number of counter counts between the issuing of the “start” signal and the point at which all slaves assert the “done” signal.
e) An ARM processor: The processor is used to validate the bus operation, trigger the ILA probing, and start/stop or clear the counter to monitor the testbed performance.
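The latency measurement described in item d) can be sketched as follows. The helper name `write_latency` and the counter values are hypothetical and serve only to illustrate the rule; they are not measured data from the testbed.

```python
from typing import Dict


def write_latency(start_count: int, done_counts: Dict[str, int]) -> int:
    """Clock cycles from the "start" pulse until the last slave asserts "done"."""
    if not done_counts:
        raise ValueError("at least one slave is required")
    last_done = max(done_counts.values())
    if last_done < start_count:
        raise ValueError("a slave reported done before the start pulse")
    return last_done - start_count


# Hypothetical counter values at which each slave asserted "done".
samples = {"slave0": 105, "slave1": 109, "slave2": 112}
print(write_latency(start_count=100, done_counts=samples))  # 12
```

Taking the maximum over all slaves reflects the requirement that every slave has asserted “done” before the transaction is considered complete.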
A comparison between the D-OCI and AXI buses in terms of resource utilization and write latency in clock cycles for different bus sizes is given in Table III. The testbed runs at a 100 MHz clock frequency. The D-OCI IP core has only the write channel implemented, while the AXI bus has the read, write, and write response channels implemented. Two implementations of the D-OCI bus, with N = 8 and N = 16, are tested, which results in 11 × 11 and 23 × 23 bus sizes, respectively. The crossbar and TDMA implementations of the 11 × 11 and 16 × 16 AXI bus are tested, while the 23 × 23 AXI bus is not implemented because the Xilinx CAD tools can only generate AXI buses of up to 16 × 16. The timing diagram of the six tested buses shown in Figure 10 is obtained by ILA probing of their “start”, “done”, and “transaction done” signals. The implementation and simulation results can be analyzed as follows:
a) The 11 × 11 D-OCI bus utilizes 97% and 90% fewer resources than the 11 × 11 AXI crossbar and TDMA switch, respectively, while the 23 × 23 D-OCI bus utilizes 94% and 80% fewer resources than the 16 × 16 AXI crossbar and TDMA switch, respectively. The resource utilization ratio is calculated from the combined number of LUTs and FFs of the compared buses. The large resource utilization of the AXI crossbar can be attributed to three factors. Firstly, crossbars are space-switching elements that use a dedicated physical link for each interconnect. Secondly, the AXI bus has three working channels while the D-OCI bus has only one. Finally, the master and slave IP cores contain AXI bus wrappers, which increase the utilization of the masters and slaves, congest the design, and cause the placement and routing tool to duplicate AXI resources in order to meet timing constraints. The higher resource utilization of the AXI TDMA switch is mainly due to the decoding, arbitration, and control overheads.
b) The 11 × 11 D-OCI bus latency is 13 clock cycles, while the 11 × 11 AXI TDMA bus latency is 122 clock cycles (an 89% reduction). The AXI TDMA bus latency is significantly larger than that of the D-OCI bus because it serves only one write request at a time without pipelining the requests [20].
c) The 11 × 11 D-OCI bus latency of 13 cycles is also lower than the 11 × 11 AXI crossbar bus latency of 42 clock cycles, a reduction of about 70%. The 23 × 23 D-OCI bus latency of 22 cycles is lower than the 16 × 16 AXI crossbar bus latency of 61 cycles, a reduction of about 64%. This improvement in latency arises because addressing in the D-OCI bus is performed once at the beginning of the transaction by assigning different spreading codes to different masters, while in the AXI bus addresses are sent in a TDMA fashion because of the shared address channel [20].
d) Finally, the implementation results show that the achievable bandwidth of the D-OCI CDMA bus is significantly greater than that of both the AXI crossbar and TDMA bus configurations. This can be attributed to the fact that the arbitration and control units of the D-OCI bus are not fully implemented yet. It should be noted that in a burst transfer mode, which is not yet implemented in the D-OCI bus, the AXI crossbar should be about N times faster than the D-OCI bus, where N is the spreading code length. This speedup arises because in burst transfers the addressing phase is performed only once, after which in every N clock cycles the AXI crossbar can send N data beats while the D-OCI bus can send only 1 data beat from each IP core. This implies that the D-OCI IP is better suited to applications requiring single-data-beat transfers, such as those with a substantial number of random read and write requests.
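The percentages quoted in items a) to c) can be reproduced from the LUT, FF, and latency columns of Table III. The short Python sketch below does so; the dictionary keys and helper names are illustrative, and the 23 × 23 D-OCI bus is compared with the 16 × 16 AXI buses because no 23 × 23 AXI bus was generated.

```python
# (LUTs, FFs, write latency in clock cycles) copied from Table III.
TABLE_III = {
    "D-OCI 11x11": (177, 222, 13),
    "D-OCI 23x23": (487, 567, 22),
    "AXI crossbar 11x11": (8229, 5651, 42),
    "AXI crossbar 16x16": (11299, 7833, 61),
    "AXI TDMA 11x11": (2123, 1761, 122),
    "AXI TDMA 16x16": (2919, 2532, 177),
}

COMPARISONS = [
    ("D-OCI 11x11", "AXI crossbar 11x11"),
    ("D-OCI 11x11", "AXI TDMA 11x11"),
    ("D-OCI 23x23", "AXI crossbar 16x16"),
    ("D-OCI 23x23", "AXI TDMA 16x16"),
]


def reduction_percent(ours: float, baseline: float) -> float:
    """Percentage by which `ours` is smaller than `baseline`."""
    return 100.0 * (1.0 - ours / baseline)


def resources(name: str) -> int:
    """Combined LUT and FF count, the metric used in item a)."""
    luts, ffs, _ = TABLE_III[name]
    return luts + ffs


def latency(name: str) -> int:
    """Write latency in clock cycles."""
    return TABLE_III[name][2]


for ours, base in COMPARISONS:
    res = reduction_percent(resources(ours), resources(base))
    lat = reduction_percent(latency(ours), latency(base))
    print(f"{ours} vs {base}: resources -{res:.1f}%, latency -{lat:.1f}%")
```

The computed values agree with the figures quoted in the text to within about one percentage point.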
VI. CONCLUSIONS
In this work, we presented the enhanced OCI bus architec-
ture, namely D-OCI, which improves the conventional CDMA
bus capacity by 50%. The D-OCI topology can replace the
TDMA topology to implement on-chip interconnects in either
a bus or a NoC router. We exploited the balancing property of
the spreading code family employed in the classical CDMA
bus to increase the number of IP cores sharing the bus without
altering the simple accumulator decoder of the conventional
CDMA bus. A systematic generation procedure of the non-
orthogonal spreading and despreading codes is presented along
with two optimized, reference and pipelined, implementations
of the bus architecture. The D-OCI bus topology was imple-
mented and validated on an Artix-7 AC701 FPGA development
kit and the UMC 0.13μm ASIC technology.
We compared the D-OCI bus performance with the conventional CDMA bus and the M-OCI bus presented in our previous work. The reference D-OCI bus achieves 21% higher bandwidth than the conventional bus on the FPGA platform, while the pipelined D-OCI bus achieves 34% more bandwidth. The dynamic power consumption on the FPGA is reduced by 29% and 48% for the reference and pipelined D-OCI buses, respectively. Initial ASIC synthesis results show that the D-OCI bus utilizes less cell area, consumes less power, and can achieve a bus frequency of up to 1 GHz, which promotes the deployment of the D-OCI bus on ASIC platforms. We also compared the basic CDMA, SDMA, and TDMA bus implementations and showed that the CDMA bus is the only topology that can increase the bus utilization efficiency beyond one. We presented a proof-of-concept prototype of the D-OCI bus and compared it to the ARM AXI bus in its two modes of operation, crossbar and TDMA. The resource utilization and clock latency comparisons between the D-OCI and AXI buses demonstrate the significant improvement of the CDMA bus architecture over the TDMA and SDMA bus architectures.
Many directions for future work are inspired by this
research. We aim to develop a fully-functional prototype of the
OCI-bus IP core. Functional and architectural optimizations
will be investigated to improve the OCI bus performance.
Also, we will investigate increasing the OCI bus capacity by expanding the spreading code set and using low-complexity orthogonal decoders other than the accumulator decoder such as the ML decoder presented in [21]. We will investigate applying the OCI concepts to the NoC CDMA architecture to enhance the interconnect bandwidth and power consumption.

TABLE III. UTILIZATION AND WRITE LATENCY OF THE D-OCI IP VS AXI BUS

| Bus Topology | Bus Capacity (M × M) | LUTs | FFs | Latency (clock cycles) | Frequency (MHz) | Bandwidth (Gbps) |
|---|---|---|---|---|---|---|
| D-OCI, N = 8 | 11 × 11 | 177 | 222 | 13 | 109 | 2.951 |
| D-OCI, N = 16 | 23 × 23 | 487 | 567 | 22 | 113 | 3.78 |
| AXI SAMD-Crossbar | 11 × 11 | 8,229 | 5,651 | 42 | 104 | 0.871 |
| AXI SAMD-Crossbar | 16 × 16 | 11,299 | 7,833 | 61 | 93 | 0.78 |
| AXI SASD-TDMA | 11 × 11 | 2,123 | 1,761 | 122 | 107 | 0.309 |
| AXI SASD-TDMA | 16 × 16 | 2,919 | 2,532 | 177 | 105 | 0.304 |

Fig. 10. Write latency of the D-OCI bus vs. the AXI crossbar and AXI TDMA. The clock latency is measured from the instant when the bus starts until all slave IPs receive the data beats.
REFERENCES
[1] R. Ho, K.W. Mai, and M.A. Horowitz. The future of wires. Proceedings of the IEEE, 89(4):490–504, Apr 2001.
[2] Ling Wang, Jianye Hao, and Feixuan Wang. Bus-based and NoC infrastructure performance emulation and comparison. In Information Technology: New Generations, 2009. ITNG ’09. Sixth International Conference on, pages 855–858, April 2009.
[3] G. Passas, M. Katevenis, and D. Pnevmatikatos. The combined input-output queued (CIOQ) crossbar architecture for high-radix on-chip switches. Micro, IEEE, PP(99):1–1, 2014.
[4] Jongsun Kim, I. Verbauwhede, and M.-C.F. Chang. Design of an interconnect architecture and signaling technology for parallelism in communication. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 15(8):881–894, Aug 2007.
[5] M. Mitic, M. Stojcev, and Z. Stamenkovic. An overview of SoC buses. In Vojin G. Oklobdzija, editor, Digital Systems and Applications. CRC Press, 2007.
[6] R.H. Bell Jr., Chang Yong Kang, L. John, and E.E. Swartzlander. CDMA as a multiprocessor interconnect strategy. In Signals, Systems and Computers, 2001. Conference Record of the Thirty-Fifth Asilomar Conference on, volume 2, pages 1246–1250, Nov 2001.
[7] B.-C.C. Lai, P. Schaumont, and I. Verbauwhede. CT-bus: a heterogeneous CDMA/TDMA bus for future SoC. In Signals, Systems and Computers, 2004. Conference Record of the Thirty-Eighth Asilomar Conference on, volume 2, pages 1868–1872, Nov 2004.
[8] Sudeep Pasricha and Nikil Dutt. Chapter 2 - basic concepts of bus-based communication architectures. In Sudeep Pasricha and Nikil Dutt, editors, On-Chip Communication Architectures, Systems on Silicon, pages 17–41. Morgan Kaufmann, Burlington, 2008.
[9] Sudeep Pasricha and Nikil Dutt. Chapter 3 - on-chip communication architecture standards. In Sudeep Pasricha and Nikil Dutt, editors, On-Chip Communication Architectures, Systems on Silicon, pages 43–100. Morgan Kaufmann, Burlington, 2008.
[10] Seyed Amirhossein Hosseini, Omid Javidbakht, Pedram Pad, and Farrokh Marvasti. A review on synchronous CDMA systems: optimum overloaded codes, channel capacity, and power control. EURASIP Journal on Wireless Communications and Networking, (1):1–22, 2011.
[11] Kasra Alishahi, Shayan Dashmiz, Pedram Pad, Farrokh Marvasti, M. H. Shafinia, and M. S. Mansouri. The enigma of CDMA revisited. CoRR, abs/1005.0677, 2010.
[12] K.E. Ahmed and M.M. Farag. Overloaded CDMA bus topology for MPSoC interconnect. In ReConFigurable Computing and FPGAs (ReConFig), 2014 International Conference on, pages 1–7, Dec 2014.
[13] Tatjana Nikolic, Mile Stojcev, and Goran Djordjevic. CDMA bus-based on-chip interconnect infrastructure. Microelectronics Reliability, 49(4):448–459, 2009.
[14] Xin Wang, T. Ahonen, and J. Nurmi. Applying CDMA technique to network-on-chip. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 15(10):1091–1100, Oct 2007.
[15] Daewook Kim, Manho Kim, and Gerald E. Sobelman. Design of a high-performance scalable CDMA router for on-chip switched networks. Memory, 8:01100110, 2005.
[16] A. Vidapalapati, V. Vijayakumaran, A. Ganguly, and A. Kwasinski. NoC architectures with adaptive code division multiple access based wireless links. In Circuits and Systems (ISCAS), 2012 IEEE International Symposium on, pages 636–639, May 2012.
[17] T. Nikolic, G. Djordjevic, and M. Stojcev. Simultaneous data transfers over peripheral bus using CDMA technique. In Microelectronics, 2008. MIEL 2008. 26th International Conference on, pages 437–440, 2008.
[18] T. Nikolic, M. Stojcev, and Z. Stamenkovic. Wrapper design for a CDMA bus in SOC. In Design and Diagnostics of Electronic Circuits and Systems (DDECS), 2010 IEEE 13th International Symposium on, pages 243–248, April 2010.
[19] Hyung Gyu Lee, Naehyuck Chang, Umit Y. Ogras, and Radu Marculescu. On-chip communication architecture exploration: A quantitative evaluation of point-to-point, bus, and network-on-chip approaches. ACM Trans. Des. Autom. Electron. Syst., 12(3):23:1–23:20, May 2008.
[20] Xilinx. UG761 - AXI Reference Guide. 2012.
[21] M. Li. Fast code design for overloaded code-division multiplexing systems. Vehicular Technology, IEEE Transactions on, PP(99):1–1, 2015.
8787878787
... This coding technique is called new parallel (NPC-CODEC). Various Overloaded CDMA Interconnect (OCI) crossbar architectures are presented in [1] and [7]- [9]. In OCI, the Walsh spreading code set is extended with a set of non-orthogonal codes of the same size to double the conventional CDMA crossbar capacity and throughput, while reducing the area overhead. ...
... Recently, many NoC architectures have been inspired by our previous works of the overloaded and aggregated CDMA crossbars presented in [1], [10], and [9]. An experimental study on the effect of noise on CDMA crossbars is presented in [16]. ...
Article
Full-text available
Network-on-chips (NoCs) are the dominant interconnection technique in modern system-on-chips (SoCs). The medium access technique used in the physical layer of NoC routers profoundly impacts the performance and footprint of the router. Code division multiple access (CDMA) is a medium access technique widely deployed in various wireless communication systems, and it has recently been proposed as a prominent switching method for NoC routers. In a CDMA crossbar, several processing elements (PEs) can communicate simultaneously over a single communication channel by applying the direct-sequence spread spectrum technique to digital interconnects. However, existing CDMA switches are bit-wise architectures in which a binary data bit is spread and serially communicated on an exclusive digital channel while replicating this configuration to communicate multiple data bits, which increases the crossbar area and wiring density. In this work, we propose aggregated CDMA (ACDMA) to improve the area, throughput, and power consumption of existing CDMA NoC routers. ACDMA exploits the static nature and relative noise immunity of on-chip interconnects to aggregate transmission of multiple data bits into M-ary symbols on a single digital communication channel, which significantly reduces the wiring density and area overhead of the crossbar. Serial and parallel variants of the ACDMA crossbar offering a wide range of area-speed trade-offs are implemented in the ASIC $65~nm$ standard cell technology. The implementation results show that the throughput-per-area (TPA) of the serial and parallel ACDMA crossbars is improved by 96.3%, 18.2%, and 118.6%; and 400%, 255.3%, and 184.2% compared to the serial and parallel counterparts of the Walsh Basis (WB), Standard Basis (SB), and Overloaded CDMA Interconnect (OCI) crossbars, respectively. A 65-node ACDMA NoC router is fully realized and compared to the state-of-the-art CDMA and CONNECT NoC routers under multiple synthetic workloads. 
Communication reliability of the ACDMA NoC router subject to noise is investigated and a hybrid ARQ approach is proposed to improve the interconnect reliability.
... This strategy is used in Ref. [15] to double the coding capacity. They modified a four-bit CDMA encoding where a maximum of three transmitters and receivers could be present, which resulted in six transmitters and receivers. ...
... The three of these transmitters use the four-bit Walsh code and the other three use the TDMA method. The mathematical foundations of Ref. [15] is presented in Ref. [16]. In Ref. [17], the authors used a compression technique to increase the number of CDMA network ports. ...
Article
Full-text available
The Code‐division multiple access (CDMA) method is commonly used as the network infrastructure in multi‐core chips. One of its advantages is the simultaneous connection of all network components. Another advantage is the constant delay of this method. On the other hand, one drawback is that the number of transmitters is limited to the number of encoding bits. In this study, the authors used the combination of Walsh codes and their inverses, as well as the simultaneous application of the time‐division multiple access (TDMA) method, to increase the transmission capacity of this protocol more than four times the standard mode. In the proposed design, although the circuit area does not increase significantly, a fourfold increase in the throughput of the CDMA network is seen. Using the method proposed in this study, it will be possible to increase the capacity further.
... A multilevel 2 bit Code Division Multiple Access proposed in [8] was mainly used as an input and output redesign scheme and also reduces the bus contention over Time Division Multiple Access. In [2]a comparison of based Network on Chip and a Point To Point duplex ring based Network on Chip is done. The outcome after simulation shows that the Code Division Multiple Access Network on Chips irreversible data transfer latency is equal to the best case latency of the Point To Point of the identical channel width. ...
... The data from the channel (bus) is passed to zero or one accumulator on the basis of the chip value. If the chip value is zero, then it is passed to the zero accumulator and if the value is one , then it is passed to one accumulator [2]. At the beginning of each decoding cycle the accumulators is reset to zero. ...
... In particular, when the despreading chip is '1', the adder adds Si to the contents of the register but subtracts Si from the contents of the register when the despreading chip is '-1'. At the end of the decoding cycle, the accumulator register holds N dk according to (5), and because N = 2n and n is an integer, data dk is decoded by shifting the accumulator content by log2(N) bits. ...
Article
Full-text available
Code Division Multiple Access (CDMA) is proposed as the physical layer enabler of Network-On-Chip (NoC) interconnects for its prominent features such as fixed latency, guaranteed service, and reduced system complexity. CDMA interconnects have been adopted by the NoC community as it originates in wireless communications where each bit in a CDMA encoded data word is transmitted on a separate channel to avoid interference. However, the wireless interference problem can be efficiently mitigated in on-chip interconnects eliminating the need for replicating the CDMA channel. Moreover, wireless channels are sequential by nature which is not the case in on-chip interconnects where parallel buses are the default communication means. After CDMA was adopted by the NoC community, the same wireless CDMA scheme has been maintained where each data bit is encoded in a separate CDMA channel and the encoding/decoding logic is replicated for data packets. In this work, we present a novel CDMA encoding/decoding scheme called Aggregated CDMA (ACDMA) for NoC interconnects in which all packet bits are encoded in a single CDMA channel, consequently, eliminating the area and energy overheads resulted from replicating the channel encoding/decoding logic. The overhead of channel replication is mitigated which results in up to 60.5% area and 55% power savings with 89% improvement in throughput per area compared to the conventional CDMA crossbar. As a future work, we plan to build and evaluate a full ACDMA-based NoC under different workloads and routing protocols.
... Another advantage of LDS-CDMA is its ability to achieve overloading, that is when the number of users in a system exceeds the processing gain. While the number of users in an overloaded system requires reduced sparsity of spreading codewords, it was proven in [6] that the number of spreading codes can be increased by up to 300% in a noiseless environment. Overloaded spreading codes are generated in accordance to the Welch Bound Equality [7] in order to reduce ISI. ...
Conference Paper
Today's wireless communication networks transmit their signals based on the Orthogonal Multiple Access (OMA) principle. As the number of users increases, OMA-based approaches may fail to meet the stringent requirements emerging in the 5th Generation of wireless communications for very high spectral efficiency and massive connectivity. Non-Orthogonal Multiple Access (NOMA) emerges as a solution to improve upon spectral efficiency and user capacity without sacrificing system performance. This paper aims to demonstrate the validity of NOMA as an optimal choice for 5G by comparing it with OMA. Three Code-Domain NOMA (CD-NOMA) schemes are examined and compared with an established OMA technique, Orthogonal Frequency Division Multiplexing (OFDM). The chosen schemes for CD-NOMA are: Low Density Spreading CDMA (LDS-CDMA), Low Density Spreading OFDM (LDS-OFDM), and Sparse Coding Multiple Access (SCMA). The performance of each scheme is evaluated by computing its Bit Error Rate (BER) and Outage Probability (OP) and simulating them against different values of Signal-to-Noise Ratio (SNR) over an AWGN channel. It is observed in this paper that, while having varying performance levels, every NOMA scheme outperforms OFDM, thereby proving NOMA to be a prime candidate for implementation in future 5G communication technologies.
Article
CDMA-based on-chip interconnects have recently drawn attention due to their lower variance of data transfer latency compared with the shared bus. In this paper, we present a novel design of a CDMA bus arbitration unit, which allocates codewords to system nodes and resolves destination conflicts, that is, ensures that at any time all parallel transmissions are destined to different nodes. The arbitration unit is characterized by a distributed architecture in the form of a ring composed of multiple simple ring elements, each of which is attached to one system node. In comparison with conventional centralized arbiter designs, the proposed solution improves the scalability of CDMA bus-based multi-core SoCs, enables the CDMA bus to operate at higher frequencies, and allows the implementation of more elaborate arbitration algorithms, adapted to various CDMA bus variants, with low hardware complexity. Simulation results show that the proposed distributed arbitration unit represents an efficient solution for achieving high-throughput and low-latency data transmission in CDMA bus-based SoCs.
Book
Proceedings of the 2019 Emerging Technology (EMiT) Conference, 9-11 April, University of Huddersfield, U.K.
Chapter
One of the most important factors affecting the performance of systems-on-chip is the on-chip interconnect. The most suitable interconnection method capable of addressing several high-performance applications is the Network-on-Chip. A widely used method to implement on-chip crossbars is Code Division Multiple Access (CDMA). Overloaded CDMA is used to increase the capacity of the CDMA-based Network-on-Chip and to overcome the problems due to Multiple Access Interference (MAI). In this paper, to reduce the excess area needed to store the message data and to provide security, a parallel compare and compress technique is applied to the T-OCI crossbar. The Walsh spreading codes have a property that enables adding more non-orthogonal spreading codes, which increases the capacity of the CDMA bus. The codec is the key component of the PaCC architecture, which effectively balances area and performance. The Parallel Run Length Encoding (PRLE) scheme observes q bits in parallel. The results show that the PRLE-based T-OCI attains greater bus capacity, lower power consumption, and better area efficiency when compared with the conventional CDMA bus.
Article
On-chip interconnects are the performance bottleneck in modern system-on-chips. Code-division multiple access (CDMA) has been proposed to implement on-chip crossbars due to its fixed latency, reduced arbitration overhead, and higher bandwidth. In CDMA, medium sharing is enabled in the code space by assigning a limited number of N-chip length orthogonal spreading codes to the processing elements sharing the interconnect. In this paper, we advance overloaded CDMA interconnect (OCI) to enhance the capacity of CDMA network-on-chip (NoC) crossbars by increasing the number of usable spreading codes. Serial and parallel OCI architecture variants are presented to adhere to different area, delay, and power requirements. Compared with the conventional CDMA crossbar, on a Xilinx Artix-7 AC701 FPGA kit, the serial OCI crossbar achieves 100% higher bandwidth, 31% less resource utilization, and 45% power saving, while the parallel OCI crossbar achieves N times higher bandwidth compared with the serial OCI crossbar at the expense of increased area and power consumption. A 65-node OCI-based star NoC is implemented, evaluated, and compared with an equivalent space division multiple access based torus NoC for various synthetic traffic patterns. The evaluation results in terms of the resource utilization and throughput highlight the OCI as a promising technology to implement the physical layer of NoC routers.
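The code-space sharing described in this abstract can be illustrated with a minimal sketch of conventional (non-overloaded) CDMA using Walsh codes. It is only an illustration under stated assumptions: the function names `walsh_codes`, `encode`, and `decode` are hypothetical, bits are mapped to +1/-1 symbols, and the channel is modeled as a chip-wise integer sum. It is not the OCI crossbar described by the authors.

```python
def walsh_codes(n):
    """Return n Walsh codes of length n (n a power of two) as lists of +1/-1."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    codes = [[1]]
    while len(codes) < n:
        # Sylvester construction: H_2k = [[H_k, H_k], [H_k, -H_k]]
        codes = ([row + row for row in codes]
                 + [row + [-c for c in row] for row in codes])
    return codes


def encode(bits, codes):
    """Spread one bit per sender with its code and sum all senders chip by chip."""
    symbols = [1 if b else -1 for b in bits]  # assumed bipolar mapping
    n_chips = len(codes[0])
    return [sum(s * code[k] for s, code in zip(symbols, codes))
            for k in range(n_chips)]


def decode(channel, code):
    """Recover one sender's bit by correlating the channel sum with its code."""
    correlation = sum(x * c for x, c in zip(channel, code))
    return 1 if correlation > 0 else 0


codes = walsh_codes(4)
sent = [1, 0, 1, 1]  # one bit from each of four senders
channel = encode(sent, codes)
received = [decode(channel, code) for code in codes]
```

Because the codes are mutually orthogonal, each correlation equals plus or minus the code length, so every receiver recovers its sender's bit without interference from the others. Overloading adds further, non-orthogonal codes beyond these N, which requires a modified decoder that this sketch does not cover.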
Article
We consider the problem of designing binary antipodal code sequences (signatures) for overloaded code-division multiplexing (CDM) systems where the number of concurrent users/signals is greater than the code length. Our goal is to provide an overloaded code that can be constructed and decoded quickly and, more importantly, provides satisfactory recovery performance in conjunction with decoder design specifics. We first introduce a fast and practical method for constructing a code set by taking the Kronecker product of two smaller codes. Under such a construction, a fast two-stage maximum-likelihood (ML) detection scheme can dramatically reduce the computational complexity of the ML decoder and make CDM systems practically implementable. To improve the performance in terms of bit error rate, we propose hierarchical criteria for code design, which aim at reducing the cross-correlation of the code while maintaining the uniquely decodable (errorless) code property. Simulation studies illustrate that the proposed code design can provide satisfactory performance with low-complexity two-stage ML detection.
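The Kronecker-product construction mentioned in this abstract can be sketched as follows. This is a minimal illustration, assuming code matrices whose columns are the signatures; the helper `kron` and the small matrices `small_a` and `small_b` are hypothetical placeholders, not the constituent codes proposed by the authors, and no claim is made that the resulting set is uniquely decodable.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


# Sanity check on a known case: the 2x2 Hadamard matrix with itself
# yields the 4x4 Sylvester-Hadamard matrix.
h2 = [[1, 1],
      [1, -1]]
h4 = kron(h2, h2)

# Placeholder overloaded example: a 2-chip code carrying 3 signatures
# (columns) combined with a 2-chip, 2-signature code.
small_a = [[1, 1, 1],
           [1, -1, 1]]
small_b = [[1, 1],
           [1, -1]]
combined = kron(small_a, small_b)
n_chips, n_signatures = len(combined), len(combined[0])
```

The combined matrix has 4 rows (chips) and 6 columns (signatures), so the code length and the number of signatures both multiply. This is what lets a larger overloaded set be built from two smaller ones.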
Conference Paper
Intra-chip communication is a major bottleneck in modern multiprocessor system-on-chip (MPSoC) designs. The bus topology is the most common on-chip interconnect technology, and bus contention is one of the major issues in bus-based MPSoC designs. Code division multiple access (CDMA) has been proposed as a bus sharing strategy to overcome the bus contention problem. In CDMA, a limited number of orthogonal spreading codes can share the medium due to the Multiple Access Interference (MAI) problem. In wireless communications, overloaded CDMA has been considered to increase the system capacity by adding extra non-orthogonal spreading codes with specific characteristics. We propose a novel CDMA bus architecture leveraging the overloaded CDMA concepts to increase the maximum number of cores sharing the same CDMA bus in MPSoC by 25% at a marginal cost. The overloaded CDMA bus architecture is illustrated, resource- and speed-efficient decoding circuits are presented, and a prototype system is implemented and validated on a Virtex-7 FPGA VC707 evaluation kit. The overloaded and ordinary CDMA bus architectures are compared in terms of resource usage, power consumption, bus operating clock frequency, and bandwidth. Evaluation results show an improvement in resource utilization and power consumption per unit (IP core) and in the bus bandwidth by approximately 25% while preserving the access delay of the ordinary CDMA bus.
Article
This paper is a tutorial review on important issues related to code-division multiple-access (CDMA) systems such as channel capacity, power control, and optimum codes; specifically, we consider optimum overloaded codes that achieve errorless transmission in the absence of noise for the binary and nonbinary cases. A survey of lower and upper bounds for the sum channel capacity of such systems is given in the presence and absence of channel noise. The asymptotic results for the channel capacity are also investigated. The channel capacity, errorless transmission codes, and power estimation for near-far effects are also explored. The emphasis of this tutorial review is on the overloaded CDMA systems.
Chapter
Contemporary System-on-Chip (SoC) designs consist of numerous heterogeneous Intellectual Property (IP) components (cores) integrated onto a single die. Until recently, the design space exploration for SoCs has been mainly focused on the computational aspects of the problem. But the increasing number of cores is leading to rapidly growing on-chip communication bandwidth requirements. As a result, the on-chip communication architecture is becoming a serious bottleneck, i.e. a critical determinant of system-level metrics and power consumption. The research conducted in this paper is aimed at developing a CDMA interconnect intended for designing an efficient on-chip communication architecture for shared-bus organized SoCs. The main benefit of this technique is a reduction in the number of wires on the system bus, which varies from 25% up to 81%, while the main disadvantage is the increased latency of Read and Write processor cycles. The structure of a wrapper as interface logic between the shared bus and the IP connected to it is described. We implemented four different wrapper structures described in VHDL and confirmed their functionality in two different technologies: FPGA and ASIC. A master-slave wrapper pair occupies a moderate area, on average 2000 equivalent gates (in FPGA); considering a CPU cost of about 30000 gates, this is less than 8% hardware overhead per CPU. We also present experimental results which show that the benefits of CDMA coding relate both to decreasing the number of bus lines and to accomplishing simultaneous multiple master-slave connections at relatively low power consumption and high communication bandwidth. The obtained results show that the increased data transfer latencies introduced by CDMA data transfer are compensated by simultaneous master-slave transfers.
Chapter
Complex VLSI IC design has been revolutionized by the widespread adoption of the SoC paradigm. The benefits of the SoC approach are numerous, including improvements in system performance, cost, size, power dissipation, and design turnaround time. Many SoC designs consist of one or more IPs, designed for a single or narrow set of applications with highly characterizable communication. As the level of chip integration continues to advance at a fast pace, the desire for efficient interconnects rapidly increases. Currently, on-chip interconnection networks are mostly implemented using traditional interconnects like buses. The wide variety of buses used in SoC designs presents a major problem for reusable design. A number of companies and standards committees have attempted to standardize buses and interfaces, with mixed results. In this chapter we give an overview of the most popular on-chip bus-based interconnection networks such as AMBA, Avalon, CoreConnect, STBus, Wishbone, etc. The main characteristics of the considered buses with respect to topology, arbitration method, bus width, and types of data transfers are discussed. In addition, we point to some of the issues that SoC designers face in determining which bus architecture to use to provide flexible and high-bandwidth communication between IP cores.
Chapter
System-on-chip (SoC) designs typically have several different types of components such as processors, memories, custom hardware, peripherals, and external interface IP (intellectual property) blocks that need to communicate with each other. This chapter presents the prevailing standards for on-chip communication architectures. Standards are essential to promote IP reuse and reduce the design time of today's increasingly complex SoC designs. On-chip bus-based communication architecture standards define the interface signals for components, as well as bus logic components, such as arbiters, decoders, and bridges, that are required to implement the features of the proposed standard. This chapter highlights some of the popular on-chip bus architecture standards, such as ARM's AMBA 2.0 and 3.0, IBM's CoreConnect, STMicroelectronics' STBus, Sonics SMART Interconnect, OpenCores Wishbone, and Altera's Avalon. Finally, it covers popular off-chip buses and standards.
Chapter
Buses are one of the most widely used means of communicating between components in a system-on-a-chip (SoC) design. The simplicity and efficiency of transferring data on buses have ensured that they remain the preferred interconnection mechanism today. A bus connects possibly several components with a single shared channel. The shared channel can be physically implemented as a single wire (a serial bus) or as a set of wires, which makes up a parallel bus. This parallel bus is the typical implementation choice for buses in almost all widely used on-chip bus-based communication architectures. This chapter introduces the components and terminology used to describe the communication architectures and covers some of their major characteristics, such as bus signal types, physical structure, clocking, decoding, and arbitration. It also presents an overview of some of the basic data transfer modes that are used during data transfers and describes some of the more advanced transfer modes intended to improve bus utilization and throughput performance. Some commonly used bus topology structures are also discussed, as well as some of the issues in the physical implementation of bus wires.
Article
High-radix, single-chip routers have emerged as efficient building blocks for interconnection networks. A novel VLSI microarchitecture includes deep crossbar pipelining to cope with wire delay, a cross-scheduler architecture to reduce wiring complexity, and catalytic custom gate placement within standard electronic design automation flows. The authors use this architecture to promote combined I/O queuing (CIOQ) for high-radix on-chip switches, compare CIOQ with Swizzle Switch prototypes, and demonstrate high-radix crossbars' potential for system-on-chip interconnects.
Conference Paper
Multi-hop data transfer in conventional Networks-on-Chip (NoCs) results in lower rates of data transfer and higher energy dissipation. Long-range millimeter-wave wireless interconnects were envisioned to alleviate this problem. However, as the bandwidth of the wireless channels is limited, an efficient media access control (MAC) scheme is required to enhance the utilization of the available bandwidth. In this paper we show that, with multiple simultaneous accesses to the shared wireless medium using a traffic-adaptive Code Division Multiple Access (CDMA) scheme, the peak performance can be improved significantly while lowering energy dissipation in data transfer compared to the conventional wireline counterparts.
Article
Performance results and synthesized area overhead for a code division multiple access (CDMA) router intended for network-on-chip (NoC) applications are presented. Specific architectural block diagrams of the main components of the router are given and synthesis results are provided for 0.18 micron and 0.25 micron structured ASIC libraries. Post-synthesis VHDL simulations verify the functionality of the router and provide values for packet transmission latency and throughput as functions of the payload size. The router can be used to construct star+star and star+mesh network architectures which can be scaled to meet the needs of high-performance applications.