Cheng C. Wang's research while affiliated with University of California, Los Angeles and other places


Publications (9)


Reconfigure your RTL with EFLX join the SoC revolution
  • Conference Paper

August 2016 · 10 Reads

Cheng C. Wang

A Multi-Granularity FPGA With Hierarchical Interconnects for Efficient and Flexible Mobile Computing

February 2014 · 266 Reads · 25 Citations

IEEE Journal of Solid-State Circuits

Following the rapid expansion of mobile computing in the past decade, mobile system-on-a-chip (SoC) designs have off-loaded most compute-intensive tasks to dedicated accelerators to improve energy efficiency. An increasing number of accelerators in power-limited SoCs results in large regions of “dark silicon.” Such accelerators lack flexibility, so any design change requires an SoC re-spin, significantly impacting cost and timeline. To address the need for efficiency and flexibility, this work presents a multi-granularity FPGA suitable for mobile computing. Occupying 20.5 mm² in 40 nm CMOS, the chip incorporates 2,760 fine-grained configurable logic blocks (CLBs) with 11,040 6-input look-up-tables (LUTs) for random logic, basic arithmetic, shift registers, and distributed memories; 42 medium-grained 48b DSP processors for MAC and SIMD operations; 16 reconfigurable block RAMs (32K×1b to 512×72b); and 2 coarse-grained kernels: a 64-8192-point fast Fourier transform (FFT) processor and a 16-core universal DSP (UDSP) for software-defined radio (SDR). Using a mix-radix hierarchical interconnect, the chip achieves a 4× interconnect area reduction over commercial FPGAs for comparable connectivity, reducing overall area and leakage by 2.5× and delivering 10-50% lower active power. With coarse-grained kernels, the chip's energy efficiency reaches within 4-5× of ASIC designs.


Wordlength Optimization

May 2012 · 20 Reads

This chapter discusses wordlength optimization, with emphasis on automated floating-point to fixed-point conversion. Reducing the number of bits without significant degradation in algorithm performance is an important step in the hardware implementation of DSP algorithms. Designers often tune bit widths manually, but this approach is time-consuming and yields sub-optimal results. This chapter discusses an automated optimization approach.


Fig. 1. SEM, diagram, and operating states of the MEM relay device.
Fig. 2. Schematic indicating the relevant components in the MEM relay Verilog-A model.
Fig. 3. Die photo with relay-based, propagate-generate-kill circuit shown in inset.
Fig. 4. MEM relay based inverter and measured VTC illustrating full-rail swing at the output and digital gain.
Fig. 5. Latch composed of MEM relay devices and waveforms showing its operation. Functionality of the latch illustrates that MEM relay logic stages are composable.


Demonstration of Integrated Micro-Electro-Mechanical Relay Circuits for VLSI Applications
  • Article
  • Full-text available

February 2011 · 1,398 Reads · 177 Citations

IEEE Journal of Solid-State Circuits

Cheng C. Wang · [...]

This work presents measured results from test chips containing circuits implemented with micro-electro-mechanical (MEM) relays. The relay circuits designed on these test chips illustrate a range of important functions necessary for the implementation of integrated VLSI systems and lend insight into circuit design techniques optimized for the physical properties of these devices. To explore these techniques a hybrid electro-mechanical model of the relays' electrical and mechanical characteristics has been developed, correlated to measurements, and then also applied to predict MEM relay performance if the technology were scaled to a 90 nm technology node. A theoretical, scaled, 32-bit MEM relay-based adder, with a single-bit functionality demonstrated by the measured circuits, is found to offer a factor of ten energy efficiency gain over an optimized CMOS adder for sub-20 MOPS throughputs at a moderate increase in area.


Figure 2: Framework for the automated wordlength optimization tool.
Figure 3: Actual versus computed MSE for an SVD U-Sigma design.

An automated fixed-point optimization tool in MATLAB XSG/SynDSP environment

January 2011 · 368 Reads · 14 Citations

ISRN Signal Processing

This paper presents an automated tool for floating-point to fixed-point conversion. The tool is based on previous work built in the MATLAB/Simulink environment with Xilinx System Generator support. The tool is now extended to include Synplify DSP blocksets in a way that is seamless from the user's point of view. In addition to FPGA area estimation, the tool now also includes ASIC area estimation for end-users who choose the ASIC flow. The tool minimizes hardware cost subject to mean-squared quantization error (MSE) constraints. To obtain more accurate ASIC area estimations with synthesized results, three performance levels are available, suitable for high-performance, typical, or low-power applications. The use of the tool is first illustrated on an FIR filter, achieving over 50% area savings for an MSE specification of 10⁻⁶ compared to an all-16-bit realization. More complex optimization results for chip-level designs are also demonstrated.
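The idea of minimizing hardware cost under an MSE constraint can be illustrated with a minimal Python sketch. This is not the tool described in the abstract: the greedy search, the function names (`quantize`, `fir`, `mse`, `optimize`), the test signal, the filter taps, and the use of total fractional bits as a stand-in for area are all assumptions made for illustration only.

```python
import random


def quantize(x, frac_bits):
    """Round x to a fixed-point grid with `frac_bits` fractional bits."""
    scale = 2 ** frac_bits
    return round(x * scale) / scale


def fir(signal, coeffs, frac_bits=None):
    """Direct-form FIR filter.

    If frac_bits = (coef_bits, out_bits) is given, coefficients and outputs
    are quantized; otherwise the floating-point reference is returned.
    """
    if frac_bits is not None:
        coef_bits, out_bits = frac_bits
        coeffs = [quantize(c, coef_bits) for c in coeffs]
    out = []
    for n in range(len(signal)):
        acc = sum(c * signal[n - k] for k, c in enumerate(coeffs) if n - k >= 0)
        out.append(quantize(acc, out_bits) if frac_bits is not None else acc)
    return out


def mse(a, b):
    """Mean-squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def optimize(signal, coeffs, mse_limit=1e-6, start_bits=16):
    """Greedily remove fractional bits while the MSE constraint still holds.

    Cost proxy (assumption): total fractional bits, not a real area estimate.
    """
    ref = fir(signal, coeffs)
    bits = [start_bits, start_bits]  # [coefficient bits, output bits]
    if mse(ref, fir(signal, coeffs, tuple(bits))) > mse_limit:
        raise ValueError("starting wordlengths already violate the MSE limit")
    improved = True
    while improved:
        improved = False
        for i in range(len(bits)):
            if bits[i] <= 1:
                continue
            trial = list(bits)
            trial[i] -= 1
            if mse(ref, fir(signal, coeffs, tuple(trial))) <= mse_limit:
                bits = trial
                improved = True
    return bits


random.seed(0)
signal = [random.uniform(-1.0, 1.0) for _ in range(256)]  # hypothetical test input
coeffs = [0.1, 0.4, 0.4, 0.1]                             # hypothetical 4-tap filter
bits = optimize(signal, coeffs)
print("fractional bits [coefficients, outputs]:", bits)
```

Starting from 16 fractional bits everywhere, the search keeps only as many bits as the MSE limit requires. A real flow would replace the bit-count proxy with a synthesis-calibrated area model.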


A 1.1 GOPS/mW FPGA chip with hierarchical interconnect fabric

January 2011 · 245 Reads · 4 Citations

A 2048 look-up-table FPGA with a radix-2 hierarchical interconnect network is realized in 3.94 mm² in 65 nm CMOS. It has an interconnect-to-logic area ratio of 1:1, which is a 3-4× reduction from modern FPGAs while allowing up to 100% resource utilization. As a proof of concept, it is designed with standard cells, achieving 16.4 GOPS/mm² at 370 MHz. Peak energy efficiency of 1.1 GOPS/mW is measured at 0.5 V.


Ultralow-Power Design in Near-Threshold Region

March 2010 · 4,680 Reads · 364 Citations

Proceedings of the IEEE

Operation in the subthreshold region is most often synonymous with minimum-energy operation. Yet, the penalty in performance is huge. In this paper, we explore how design in the moderate inversion region helps to recover some of that lost performance, while staying quite close to the minimum-energy point. An energy-delay modeling framework that extends over the weak, moderate, and strong inversion regions is developed. The impact of activity and design parameters such as supply voltage and transistor sizing on the energy and performance in this operational region is derived. The quantitative benefits of operating in the near-threshold region are established using some simple examples. The paper shows that a 20% increase in energy from the minimum-energy point gives back ten times in performance. Based on these observations, a pass-transistor based logic family that excels in this operational region is introduced. The logic family operates most of its logic in the above-threshold mode (using low-threshold transistors), yet containing leakage to only those in subthreshold. Operation below minimum-energy point of CMOS is demonstrated. In leakage-dominated ultralow-power designs, time-multiplexing will be shown to yield not only area, but also energy reduction due to lower leakage. Finally, the paper demonstrates the use of ultralow-power design techniques in chip synthesis.


Delay Estimation and Sizing of CMOS Logic Using Logical Effort With Slope Correction

September 2009 · 54 Reads · 17 Citations

IEEE Transactions on Circuits and Systems II: Express Briefs

This brief presents an improved logical-effort model to account for the slope mismatch between the input and output of a gate. The model has a simple formulation in which only one additional parameter is needed, making the analysis suitable for hand calculations. Using 65- and 90-nm complementary metal-oxide-semiconductor technologies, the model maintains less than 5% error in gate-delay estimations compared to Spectre simulations even under large variations between the input and output slopes. Using this model, a circuit optimization tool is written to optimize an adder synthesized with a 65-nm standard-cell library. The estimation error for the adder is also within the modeling accuracy of 5%, whereas the original logical-effort model and the synthesis timing libraries have errors of up to 40% and 20%, respectively.


Word-length Optimization for Synplify DSP Blockset with FPGA and ASIC Area-Estimation

January 2008 · 20 Reads · 2 Citations

This project report presents a major update to the original word-length optimization tool for floating-point to fixed-point conversion. The updated tool now supports designs in the Synplify DSP blockset; all commonly used blocks are supported. Changes from the current optimization flow are kept to a minimum, so users who are familiar with the original Xilinx System Generator flow can use the updated tool without additional knowledge. In addition to FPGA area estimation, the update also includes ASIC area estimation for end-users who choose the ASIC flow. To obtain more accurate area estimations with the synthesized results, three performance levels are available, suitable for high-performance, typical, or low-power applications.

Citations (7)


... latency and throughput) enhancement and optimization. Especially, embedding the FPGA on a system-on-chip (SoC) [9], [10] is one of the most attractive candidates for IoT applications [11]. Recently, there are several research activities on a standard-cell based FPGA as flexible and portable option for embedded FPGA fabrics [12]-[17]. ...

Reference:

Nonvolatile Field-Programmable Gate Array Using a Standard-Cell-Based Design Flow
A Multi-Granularity FPGA With Hierarchical Interconnects for Efficient and Flexible Mobile Computing
  • Citing Conference Paper
  • February 2014

IEEE Journal of Solid-State Circuits

... Affine analysis of ranges [14] can cancel correlated terms (e.g., A − A has a range [0, 0]), leading to a more compact design at the expense of more complex analysis. These techniques are used in high-level synthesis flows [15], which unfortunately require users to express their designs using non-zero-cost abstractions. The benefits of these optimizations can be offset by the difficulties encountered when porting existing RTL to a new flow and/or application. ...

An automated fixed-point optimization tool in MATLAB XSG/SynDSP environment

ISRN Signal Processing

... M2000 (Abound Logic) proposed an MSSN with local crossbars based on a Clos network [16], while Leopard Logic proposed a butterfly-based hierarchical network [17]. In [10,18] an MSSN based on butterfly topology is discussed with depopulation of the upper stages of the network and an isomorphic transformation to solve the radix-boundary problem [19], a limiting factor of MSSN in the field of FPGAs. Nevertheless, the area saving is balanced by the fact that this network is no longer proven to be RNB, although the authors indicate the availability of enough bandwidth based on Rent's rule. ...

A 1.1 GOPS/mW FPGA chip with hierarchical interconnect fabric

... They followed a binary search algorithm to quickly arrive at a coarse optimal point and start the fine optimization from the coarse optimal point, thus reducing complexity to vary linearly with N, the number of quantizers, even without signal grouping. In addition, this method is well suited to MATLAB-based design and is very handy for integration with automated tools like Accelchip [30] and SynplifyPro [31,32] for providing a complete automated flow from floating-point MATLAB design to FPGA hardware resources in minutes. The paper [29] identified that MATLAB-Simulink environment can be exploited only for fixed-point modeling and simulation of AMS circuits. ...

Word-length Optimization for Synplify DSP Blockset with FPGA and ASIC Area-Estimation
  • Citing Article
  • January 2008

... Finally, laterally actuated relays typically have a smaller footprint when compared to other relay structures. NEMS relays utilizing four and six-terminal configurations have also been proposed as logic gate building blocks to minimize the number of relays needed for any circuit path in logic circuits and thus the overall switching time for the desired logic function [11], [12], [13], [14]. The four terminal designs introduce an additional body bias terminal and two separate contact regions on the cantilever electrically isolated from the gate terminal which serves as the input, as can be seen in Figure 1 (c). ...

Demonstration of Integrated Micro-Electro-Mechanical Relay Circuits for VLSI Applications

IEEE Journal of Solid-State Circuits

... It can quickly perform the carry generation process which can be computed with the help of propagating and generating signals. The major issue behind this kind of adder is that the propagation delay of the adder will be increased if the bit width of the input operands increases [10]. ...

Ultralow-Power Design in Near-Threshold Region

Proceedings of the IEEE