Fig 1 - uploaded by Mohamed El-Hadedy
NXNXM-bit Transpose Memory Architecture  

Source publication
Article
Full-text available
This paper presents a novel high-speed, area-efficient register-based transpose memory architecture enabled by operating on both edges of the clock. By using double-edge-triggered registers, the proposed architecture doubles the throughput and increases the maximum frequency by avoiding some of the combinational circuitry used in...

Contexts in source publication

Context 1
... As shown in Fig. 1, the transpose memory architecture consists of three primary components: the register file, the cell mapper, and the control unit. The NXNXM register file, shown in Fig. 2, operates on M-bit-long inputs. Each cell in the register file has a clock and an asynchronous reset signal, which synchronize operation and reset. ...
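The row-write/column-read behavior of such a register file can be sketched behaviorally. The class below is an illustrative model only — the names `TransposeMemory`, `write_row`, and `read_col` are our own, not from the paper — covering the M-bit cells and asynchronous reset, but not the cell mapper or the double-edge clocking:

```python
# Behavioral sketch of an N x N register file of M-bit words,
# written row-wise and read column-wise (illustrative, not the authors' RTL).
class TransposeMemory:
    def __init__(self, n, m):
        self.n = n
        self.mask = (1 << m) - 1             # models the M-bit cell width
        self.cells = [[0] * n for _ in range(n)]

    def reset(self):
        # Models the asynchronous reset: all cells cleared.
        self.cells = [[0] * self.n for _ in range(self.n)]

    def write_row(self, r, words):
        self.cells[r] = [w & self.mask for w in words]

    def read_col(self, c):
        # Column-wise readout yields the transposed view.
        return [self.cells[r][c] for r in range(self.n)]

tm = TransposeMemory(4, 12)
for r in range(4):
    tm.write_row(r, [r * 4 + c for c in range(4)])
cols = [tm.read_col(c) for c in range(4)]     # transpose of what was written
```

Reading column-by-column after writing row-by-row is what makes the structure a transpose memory; the cell mapper and control unit in Fig. 1 steer which cells are addressed each cycle.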
Context 2
... The proposed TRM and the other register-based designs in [16] and [17] are implemented on the same technology node with the same flow, to make a fair comparison. Fig. 11 shows that the proposed design (8X8X12) achieves about 63.1% area reduction compared to the design in [16], and 62.5% area reduction compared to the design in [17]. The area improvement leads to the reduction of both parasitic capacitance and leakage power, which contributes more than 10% of the total power. Fig. 12 shows ...
Context 3
... comparison. Fig. 11 shows that the proposed design (8X8X12) achieves about 63.1% area reduction compared to the design in [16], and 62.5% area reduction compared to the design in [17]. The area improvement leads to the reduction of both parasitic capacitance and leakage power, which contributes more than 10% of the total power. Fig. 12 shows the power breakdown of the three designs (8X8X12) at the same frequency of 444 MHz, which is the maximum clock frequency of [17]. The result shows that the proposed design consumes about 62.2% less total power compared to the design in [16], and 60% less total power compared to the design in [17]. The leakage power reduction of ...
Context 4
... The performance comparison is presented in Fig. 13, where the proposed design achieves about 3.95X and 2.1X throughput improvement over [16] and [17]. Due to the small area, the proposed design shows a huge performance/area improvement (about 12X and 6X over the other two designs). This gives the proposed design great potential for speed-critical applications that ...
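The performance/area figures can be sanity-checked from the quoted throughput gains and area reductions. The helper below is a back-of-the-envelope sketch using the rounded percentages from the text, so it lands near, not exactly on, the reported ~12X and ~6X (the paper presumably computes these from unrounded measurements):

```python
# Rough consistency check of the reported performance/area gains,
# using the rounded percentages quoted in the text (illustrative only).
def perf_per_area_gain(throughput_gain, area_reduction):
    remaining_area = 1.0 - area_reduction   # proposed area as a fraction of baseline
    return throughput_gain / remaining_area

gain_vs_16 = perf_per_area_gain(3.95, 0.631)  # vs. the design in [16]
gain_vs_17 = perf_per_area_gain(2.10, 0.625)  # vs. the design in [17]
```

With these rounded inputs the gains come out around 10.7X and 5.6X, in the same ballpark as the ~12X and ~6X stated in the text.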
Context 5
... As the resolution for different applications varies, it is important to see how the proposed design performs with large data sizes. In this section, a detailed analysis based on the 8X8 design is presented. Fig. 14 is the area comparison for different resolutions ranging from 2 to 32 bits. Similar to what has been shown in the FPGA implementations, the total area is a function of the resolution and increases almost linearly with it. It is worth mentioning that for ASIC implementations, the overall area increases more slowly than in ...
Context 6
... It has already been shown in Section 4 (Fig. 5) that the maximum frequency is almost unchanged as the resolution increases. Fig. 15 shows the maximum frequency trend for the ASIC implementations. The frequency scales down slowly; the reason is that as the resolution increases, so do the area and the parasitic capacitance. In addition, the critical path becomes longer as the resolution increases. These facts together will impact the performance ...
Context 7
... facts together will impact the performance and result in a slight reduction of the achievable maximum frequency. For FPGA fabrics, on the other hand, the number of LUTs each node drives doesn't change with the configuration, so the parasitic capacitance of each node is almost unchanged even across different resolutions. One observation that can be made from Fig. 15 is that the maximum frequency falls much more slowly than the resolution grows, which still guarantees good performance at larger resolutions. Fig. 16 presents the average power consumption for different resolutions; each power figure is obtained at the design's respective maximum frequency. For both ...
Context 8
... with the configuration, so the parasitic capacitance of each node is almost unchanged even across different resolutions. One observation that can be made from Fig. 15 is that the maximum frequency falls much more slowly than the resolution grows, which still guarantees good performance at larger resolutions. Fig. 16 presents the average power consumption for different resolutions; each power figure is obtained at the design's respective maximum frequency. Both leakage power and dynamic power increase linearly with the resolution bits. The power of the 8X8X32 design is almost comparable with that of the 8X8X12 designs implemented in [16] and [17]. ...
Context 9
... 4 bit and 12 bit. The matrix size scales up from 4X4 to 16X16. Here we define the scaling factor (SF_X) as the quality improvement/loss for metric X, normalized to the 4X4 design. For example, SF_power = 2 means the power consumption of the design is twice that of the baseline design (e.g. 4X4) with the same resolution. Fig. 17 and Fig. 18 show the scaling factors of all metrics for both 4 bit and 12 bit. Both cases show a similar trend. The area increases as expected, while the scaling factors for the total area (including the interconnect overhead) are less than 4 (2X2) from 4X4 to 8X8, and roughly 4 from 8X8 to 16X16; this indicates that the total ...
Context 10
... 12 bit. The matrix size scales up from 4X4 to 16X16. Here we define the scaling factor (SF_X) as the quality improvement/loss for metric X, normalized to the 4X4 design. For example, SF_power = 2 means the power consumption of the design is twice that of the baseline design (e.g. 4X4) with the same resolution. Fig. 17 and Fig. 18 show the scaling factors of all metrics for both 4 bit and 12 bit. Both cases show a similar trend. The area increases as expected, while the scaling factors for the total area (including the interconnect overhead) are less than 4 (2X2) from 4X4 to 8X8, and roughly 4 from 8X8 to 16X16; this indicates that the total area doesn't ...
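The SF_X definition can be expressed in a few lines. The numbers below are hypothetical placeholders, NOT measurements from the paper (those are in Figs. 17–20); they are chosen only to mirror the described pattern of a sub-4 area scaling factor from 4X4 to 8X8:

```python
# Scaling factor SF_X: each metric normalized to the 4x4 baseline design.
def scaling_factors(metrics, baseline="4x4"):
    base = metrics[baseline]
    return {size: {k: v / base[k] for k, v in vals.items()}
            for size, vals in metrics.items()}

# Hypothetical area/power values (arbitrary units), not taken from the paper:
metrics = {
    "4x4":   {"area": 1.0,  "power": 1.0},
    "8x8":   {"area": 3.6,  "power": 3.8},   # SF_area < 4 despite a 2x2 size step
    "16x16": {"area": 14.5, "power": 15.0},
}
sf = scaling_factors(metrics)
```

With this normalization, SF_X = 1 for the 4X4 baseline by construction, and a 2X2 step in matrix size that yields SF_area below 4 indicates sub-quadratic area growth at that step.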
Context 11
... increases almost quadratically with N. The maximum frequency decreases due to the increased parasitics and logic depth, but the reductions are still within 30%. The throughput improves due to the bigger input size. Power consumption also increases as the area increases. SF_Leakage is almost the same as SF_Total Area in all three cases. Fig. 19 and 20 compare the scaling factors for the 4 bit and 12 bit resolutions. Although the interconnect area increases faster (SF_Interconnect is bigger) for 12 bit, the total area scaling factor (SF_Total Area) remains almost the same. This suggests that the total area increase caused by expanding the matrix is almost independent of ...
Context 12
... by a register that processes the data every cycle to decrease the effect of the critical path. In this paper, we use the same 1D-DCT structure as in the prior work while modifying the end-stage register to operate every half cycle by applying the approach in Fig. 3b. The performance comparison between [17] and the proposed double-edge TRM in Fig. 21 shows that the double-edge TRM improves the performance of the 2D-DCT, with a 3.5X speedup and a 28% reduction in ...

Similar publications

Article
Full-text available
A hardware architecture for quadruple-precision floating-point division arithmetic with multi-precision support is presented. Division is an important yet far more complex arithmetic operation than addition and multiplication, and it demands a significant amount of hardware resources for a complete implementation. The proposed architecture also suppor...

Citations

... Once the row transposition is completed, the results stored in the register array are shifted out column-wise. Due to the large amount of shifting required for the transposition between the row and column operations (n clock cycles are consumed to obtain each transposed n/2 × n/2 memory block), this architecture is not suitable for high-speed staircase decoding [35]. ...
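The shift-in/shift-out behavior that the citing authors find too slow can be illustrated with a toy model. `ShiftRegisterTranspose` and its cycle accounting are our own simplification — one row in, or one column out, per cycle — not the cited RTL:

```python
# Toy model of a shift-register transpose: rows shift in over n cycles,
# then transposed columns shift out over n more cycles (illustrative only).
class ShiftRegisterTranspose:
    def __init__(self, n):
        self.n = n
        self.array = []        # stored rows, oldest first
        self.cycles = 0        # clock-cycle counter

    def shift_in_row(self, row):
        assert len(row) == self.n
        self.array.append(list(row))
        self.cycles += 1

    def shift_out_column(self):
        # One column leaves per cycle: the leftmost element of every row.
        col = [row.pop(0) for row in self.array]
        self.cycles += 1
        return col

n = 4
mat = [[r * n + c for c in range(n)] for r in range(n)]
t = ShiftRegisterTranspose(n)
for row in mat:
    t.shift_in_row(row)
out = [t.shift_out_column() for _ in range(n)]   # the transpose of mat
```

Even in this simplified model a full n x n block costs 2n cycles of pure shifting, which is the kind of serialization overhead that makes shift-based transposition unattractive for high-speed staircase decoding.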
Article
Full-text available
We focus on a hardware implementation of the concatenated forward error-correction (FEC) decoder defined in 400ZR implementation agreement to provide a throughput of 400 Gbps over fiber-optical communication links. We propose a soft-input hard-output low-complexity decoding algorithm for the inner Hamming code. We demonstrate that the algorithm leads to an efficient hardware design with low silicon area and power dissipation. We then propose a hardware implementation architecture of the outer staircase decoder. It features a highly optimized low-power implementation of Bose-Chaudhuri-Hocquenghem (BCH) component decoders and the staircase decoder memory that can be efficiently accessed by either vertical or horizontal component decoders. Finally, we analyze the hardware implementation of the entire 400ZR decoder and investigate the trade-off in terms of power, area, and speed, that results from the inner/outer decoder concatenation and is dictated by the bit-error rate at the output of the inner decoder.
... Second, access to the memory may be limited. For small data sizes, either registers [25] or small memories that can be read or written arbitrarily may be used [10]. By contrast, large memories such as synchronous dynamic random-access memory (SDRAM) [26] are not so easily addressable. ...
... Switching network for the permutations σ1 and σ3 in equation (25) for the particular case of p = 3. ...
Article
Full-text available
In this paper, we analyze how to calculate the matrix transposition in continuous flow by using a memory or group of memories. The proposed approach studies this problem under specific conditions such as square and non-square matrices, the use of limited-access memories, and the use of several memories in parallel. Contrary to previous approaches, which are based on specific cases or examples, the proposed approach derives the fundamental theory involved in the problem of matrix transposition in a continuous flow. This allows for obtaining the exact equations for the read and write addresses of the memories and other control signals in the circuits. Furthermore, the cases that involve non-square matrices, which have not been studied in detail in the literature, are analyzed in depth in this paper. Experimental results show that the proposed approach is capable of transposing matrices of 8192 × 8192 32-bit data received in series at a rate of 200 megasamples per second, which doubles the throughput of previous approaches.
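For the square-matrix case, the single-memory continuous-flow idea can be sketched with a standard in-place address-permutation scheme: each incoming sample overwrites the slot just vacated by a read, and because the transpose permutation is an involution, the read-address pattern simply alternates between the identity and the transpose permutation from frame to frame. This is an illustrative sketch under our own naming; the cited paper derives the general address equations, including the non-square and multi-memory cases:

```python
# Continuous-flow transpose of square N x N matrices through ONE memory
# of N*N words: read the previous frame's element, then immediately
# reuse that slot for the incoming sample (illustrative sketch).
def transpose_addr(i, n):
    # Row-major index of element (r, c) maps to (c, r): (i*n) mod (n*n - 1),
    # with the last element fixed.
    last = n * n - 1
    return last if i == last else (i * n) % last

def stream_transpose(frames, n):
    mem = [None] * (n * n)
    outputs, use_perm = [], False      # read pattern alternates each frame
    for frame in frames:
        out = []
        for i, sample in enumerate(frame):
            addr = transpose_addr(i, n) if use_perm else i
            out.append(mem[addr])      # read the previous frame's element...
            mem[addr] = sample         # ...then reuse the slot for the new one
        outputs.append(out)
        use_perm = not use_perm
    return outputs[1:]                 # the first pass only fills the memory

n = 3
frames = [list(range(9)), [10 * v for v in range(9)], [0] * 9]
outs = stream_transpose(frames, n)     # each output is the previous frame, transposed
```

One sample enters and one leaves every cycle with no second buffer, which is the property that lets such designs sustain a continuous input stream.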