Figure 5 - uploaded by Stamatis G. Kavvadias
FPGA prototype system block diagram.  


Source publication
Conference Paper
Full-text available
Per-core local (scratchpad) memories allow direct inter-core communication, with latency and energy advantages over coherent cache-based communication, especially as CMP architectures become more distributed. We have designed cache-integrated network interfaces (NIs), appropriate for scalable multicores, that combine the best of two worlds – the flex...

Contexts in source publication

Context 1
... block diagram of the FPGA system is presented in figure 5. There are four Xilinx MicroBlaze IP cores, each with 4 KB L1 instruction and data caches and a 64 KB L2 data cache, where our network interface mechanisms are integrated. ...
Context 2
... implemented two versions of locks and barriers on our hardware prototype. The mutex lock uses the hardware lock box of figure 5, whereas the second uses multiple reader queues (see Section 3). For barrier synchronization, we developed a barrier using the mutex lock implemented with the hardware lock box and a barrier using counters. ...
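Below is a minimal sketch, in C, of how software might drive the two lock flavours described in this context. The memory-mapped addresses, the convention that reading the lock box returns 1 on success, and the convention that reading an empty queue returns 0 are all assumptions made for illustration; they are not taken from the paper.

#include <stdint.h>

#define LOCKBOX_ADDR  ((volatile uint32_t *)0x40000000u)  /* hypothetical lock-box word            */
#define MRQ_HEAD_ADDR ((volatile uint32_t *)0x40000100u)  /* hypothetical MRQ head: read dequeues  */
#define MRQ_TAIL_ADDR ((volatile uint32_t *)0x40000104u)  /* hypothetical MRQ tail: write enqueues */

/* Hardware lock box: spin until the box grants the lock (assumed: read returns 1 to the winner). */
static void lockbox_acquire(void)
{
    while (*LOCKBOX_ADDR == 0)
        ;
}

static void lockbox_release(void)
{
    *LOCKBOX_ADDR = 0;                 /* assumed: a write releases the lock */
}

/* MRQ-based lock: the queue initially holds one token; dequeuing it means owning the lock. */
static void mrq_lock_acquire(void)
{
    while (*MRQ_HEAD_ADDR == 0)        /* assumed: an empty queue reads as 0 */
        ;
}

static void mrq_lock_release(void)
{
    *MRQ_TAIL_ADDR = 1;                /* put the token back for the next waiter */
}

A barrier can then be built either by protecting an arrival count with one of these locks, or directly with the hardware counters, as the context notes.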

Similar publications

Article
Full-text available
A novel transient voltage collapse (TVC) technique is presented to enable low-voltage operation in SRAM. By dynamically switching off the PMOS during write operations with a collapsed supply voltage below the data retention voltage, a minimum operating voltage (Vccmin) of 0.6V is demonstrated in a 32nm 12-Mb low-power (LP) SRAM. Data retention fail...
Article
Full-text available
The rapidly increasing usage of portable devices in modern life has led us to focus our attention on increasing the performance of SRAM circuits, especially for low-power applications. Basically, in a six-transistor (6T) SRAM cell either a read or a write operation can be performed at a time, whereas in a 7T SRAM cell using single-ended write operatio...

Citations

... Access to local and remote memories is done using the Remote Direct Memory Access (RDMA) protocol [40]. As the first application, a multicore system based on 8 custom MicroBlaze [41] processors per module, forming a 512-core cluster [42], was implemented. ...
Article
Full-text available
In recent years, the most powerful supercomputers have already reached megawatt power consumption levels, an important issue that challenges sustainability and shows the impossibility of maintaining this trend. To date, the prevalent approach to supercomputing is dominated by CPUs and GPUs. Given their fixed architectures with generic instruction sets, they have been favored with lots of tools and mature workflows, which has led to mass adoption and further growth. However, reconfigurable hardware such as FPGAs has repeatedly proven that it offers substantial advantages over this supercomputing approach concerning performance and power consumption. In this survey, we review the most relevant works that advanced the field of heterogeneous supercomputing using FPGAs, focusing on their architectural characteristics. Each work was divided into three main parts: network, hardware, and software tools. All implementations face challenges that involve all three parts. These dependencies result in compromises that designers must take into account. The advantages and limitations of each approach are discussed and compared in detail. The classification and study of the architectures illustrate the trade-offs of the solutions and help identify open problems and research lines.
... Many works have been done on the performance improvement of CPU and GPU architectures. Various ways of analyzing multicore CPU systems are available in the literature [17][18][19][20]. For instance, multiple parametric performance models are presented by the authors of [17], which aim at running applications of multiple classes. ...
... A low power-overhead scheme is presented, based on sharing a portion of the cache among different applications [18]. Moreover, focus is also given to L2 cache sharing, for which a novel architecture is proposed for configuring the sharing of synchronous dynamic access memory between multiple CPU functions [19]. Similarly, to understand the effects of partitioning on system performance, a scheme [20] is presented that is based on the interaction of cache partitioning and bandwidth partitioning. ...
Article
Full-text available
In this technological era, every person, authority, entrepreneur, business, and many things around us are connected to the internet, forming the Internet of Things (IoT). This generates a massive amount of diverse data at very high speed, termed big data. This data is very useful and can be used as an asset by businesses, organizations, and authorities to predict the future in various aspects. However, efficiently processing big data while making real-time decisions is quite a challenging task. Tools like Hadoop are used for processing big datasets; on the other hand, these tools do not perform well for real-time high-speed stream processing. Therefore, in this paper we propose an efficient, real-time big data stream processing approach that maps a Hadoop MapReduce-equivalent mechanism onto graphics processing units (GPUs). We integrate the parallel and distributed environment of the Hadoop ecosystem and a real-time stream processing tool, i.e., Spark, with GPUs to make the system powerful enough to handle the overwhelming amount of high-speed streaming data. We design a MapReduce-equivalent algorithm for GPUs for statistical parameter calculation by dividing overall big data files into fixed-size blocks. Finally, the system is evaluated with respect to efficiency (processing time and throughput) using (1) large city traffic video data, captured by static as well as moving vehicles' cameras, while identifying vehicles, and (2) large text-based files, such as Twitter data files, structured data, etc. Results show that the proposed system, working with Spark on top of GPUs in the parallel and distributed environment of the Hadoop ecosystem, is more efficient and real-time than the existing standalone CPU-based MapReduce implementation.
... Performance issues in systems based on multicore CPUs have been analyzed in the literature from different points of view. In most of the literature, the effects of a single component of the CPU (or the system) have been investigated, rather than complete characterizations: for example, [4], [5] and [6] consider the effects of L2 cache sharing; [7] considers the whole memory hierarchy; [8] focuses on multithreading support; [9] and [10] analyze the effects of internal scheduling; more abstract features, such as virtualization effects, are considered in [11], [12], [13] and [14]. As the general approach is founded on the definition or application of benchmarks that are run on real systems to tune analytical or simulative models, in this paper in vivo measurements will be used to validate the proposed models, to obtain a reliable basis on which more general performance considerations can be carried out (as, e.g., in [15]), and to try to obtain some general indications about the influence of multithreading and multicore on the overall performance of a complex system architecture. ...
Conference Paper
Multicore architectures are now available for a wide range of high performance applications, ranging from embedded systems to large scale servers deployed in cloud environments. Multicore architectures are usually subject to two conflicting goals: obtaining full utilization of the cores while achieving given performance objectives, such as throughput, response time or reduced energy consumption. Moreover, there is a strong interdependence between the software characteristics of the applications and the underlying CPU architecture. In this scenario, simulation and analytical techniques can provide solid tools to properly design the considered class of systems; however, properly characterizing the workload of multithreaded applications in multicore environments is not an easy task, and thus it is a hot research topic. In this paper we present several models, of increasing complexity, that can characterize multithreaded applications running on multicore architectures. (Proceedings of the 28th European Conference on Modelling and Simulation, ECMS.)
... To reduce latency, these mechanisms and registers need to be brought close to the processor, at the level of cache memory, as opposed to the level of main memory or the I/O bus. This section briefly explains how we achieve all of the above, describing the proposed communication (RDMA) and synchronization (counters, queues, notifications) mechanisms and some typical uses (the detailed scheme was presented in [20,21]). The proposed NI mechanisms are integrated into private (as opposed to shared) caches in order for processors to have parallel access to them. ...
Article
A multicore FPGA platform with cache-integrated network interfaces (NIs) has been developed, appropriate for scalable multicores, that combines the best of two worlds – the flexibility of caches (using implicit communication) and the efficiency of scratchpad memories (using explicit communication). Furthermore, the proposed scheme provides virtualized user-level RDMA capabilities and special hardware primitives (counters, queues) for the communication and synchronization of the cores. This paper presents how the proposed architecture can be utilized in the domain of network processing applications using the hardware synchronization mechanisms. Two representative network processing benchmarks are used; one for header processing and one for payload processing. The Multiple Reader Queue (MRQ) scheme is utilized in the case of header processing, while in the case of payload processing, where transfer of bulk data is required, the user-level RDMA scheme is utilized. These applications are mapped to and evaluated on an FPGA platform with up to 24 processors. The performance evaluation in the domain of network processing shows that the proposed scheme can offer low-latency communication and increased programming efficiency, while it also offloads the communication and synchronization work from the processor.
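As an informal illustration of the header-processing path described in this abstract, the sketch below hands packet descriptors to worker cores through a multiple-reader queue, so the hardware performs the job dispatching. The register addresses and the convention that a descriptor value of 0 means "queue empty" are assumptions added for illustration, not the paper's interface.

#include <stdint.h>

#define MRQ_PKT_HEAD ((volatile uint32_t *)0x40000200u)  /* hypothetical: read dequeues a descriptor  */
#define MRQ_PKT_TAIL ((volatile uint32_t *)0x40000204u)  /* hypothetical: write enqueues a descriptor */

/* Dispatcher core: enqueue the scratchpad address of a received packet header. */
static void dispatch_packet(uint32_t pkt_addr)
{
    *MRQ_PKT_TAIL = pkt_addr;
}

/* Worker core: dequeue the next descriptor and process its header. */
static uint32_t fetch_packet(void)
{
    uint32_t addr;
    while ((addr = *MRQ_PKT_HEAD) == 0)
        ;                              /* spin until a descriptor arrives (0 assumed to mean empty) */
    return addr;
}

For payload processing, the bulk data would instead be moved with a user-level RDMA copy, as in the transfer sketch given further below for the prototype's explicit communication.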
... At the same time, the DCN infrastructure changes in ways that can radically modify the landscape. Intelligent network interfaces (NIs) attached to (or coexisting with) processing cores, which can provide low-latency / high-bandwidth pathways to remote processes, are a long-sought goal - see Fig. 1(a) for an illustration, and refer to [3] for an example. ...
... Negative credits are also used in [11]. We ignore here the trivial case where f is the only active flow and thus receives all service. ...
Conference Paper
Full-text available
Network devices supporting above-100G links are needed today in order to scale communication bandwidth along with the processing capabilities of computing nodes in data centers and warehouse computers. In this paper, we propose a light-weight, fair scheduler for such ultra high-speed links, and an arbitrarily large number of requestors. We show that, in practice, our first algorithm, as well its predecessor, DRR, may result in bursty service even in the common case, where flow weights are approximately equal, and we identify applications where this can damage performance. Our second contribution is an enhancement that improves short-term fairness to deliver very smooth service when flow weights are approximately equal, whilst allocating bandwidth in a weighted fair manner.
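For context, the sketch below shows classic Deficit Round Robin (DRR), the predecessor the paper starts from, not the proposed scheduler itself; the flow count, data structures, and quantum are illustrative. It also makes the criticized behaviour visible: a backlogged flow can transmit up to a full quantum of bytes back-to-back in each round, which is exactly the burstiness the paper's enhancement smooths out when flow weights are approximately equal.

#include <stdint.h>
#include <stddef.h>

#define NFLOWS 4

struct pkt { uint32_t len; struct pkt *next; };

struct flow {
    struct pkt *head, *tail;   /* per-flow FIFO of queued packets          */
    uint32_t quantum;          /* credit granted per round (weight-scaled) */
    uint32_t deficit;          /* accumulated, unspent credit              */
};

static struct flow flows[NFLOWS];

static void transmit(struct pkt *p) { (void)p; /* hand the packet to the link (stub) */ }

/* One DRR round: visit every flow once, grant one quantum, and send while credit lasts. */
static void drr_round(void)
{
    for (int i = 0; i < NFLOWS; i++) {
        struct flow *f = &flows[i];
        if (f->head == NULL) { f->deficit = 0; continue; }   /* idle flows keep no credit */
        f->deficit += f->quantum;
        while (f->head != NULL && f->head->len <= f->deficit) {
            struct pkt *p = f->head;
            f->deficit -= p->len;                            /* spend credit on this packet */
            f->head = p->next;
            if (f->head == NULL) { f->tail = NULL; f->deficit = 0; }  /* queue drained: drop leftover credit */
            transmit(p);
        }
    }
}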
... The first hardware design project which uses the Formic board is a prototype of a non-cache-coherent manycore architecture, based on ideas of the SARC project [3], which was fully implemented in software simulation and partially implemented on a XUPV5 hardware platform [4]. Each board fits in its FPGA eight CPUs, their private L1 and L2 caches, eight GTP links and a full network-on-chip centered around a 22-port crossbar. ...
Article
Full-text available
Modeling emerging multicore architectures is challenging and imposes a tradeoff between simulation speed and accuracy. An effective practice that balances both targets well is to map the target architecture on FPGA platforms. We find that accurate prototyping of hundreds of cores on existing FPGA boards faces at least one of the following problems: (i) limited fast memory resources (SRAM) to model caches, (ii) insufficient inter-board connectivity for scaling the design or (iii) the board is too expensive. We address these shortcomings by designing a new FPGA board for multicore architecture prototyping, which explicitly targets scalability and cost-efficiency. Formic has a 35% bigger FPGA, three times more SRAM, four times more links and costs at most half as much when compared to the popular Xilinx XUPV5 prototyping platform. We build and test a 64-board system by developing a 512-core, MicroBlaze-based, non-coherent hardware prototype with DMA capabilities, with full network-on-chip in a 3D-mesh topology. We believe that Formic offers significant advantages over existing academic and commercial platforms that can facilitate hardware prototyping for future manycore architectures.
... This paper extends our previous work in [17]. Here, we elaborate on the architecture of cache-integrated network interfaces and the technique of event responses that enables their efficient implementation, and we also measure the logic overhead of NI integration inside a cache. ...
Article
Full-text available
Per-core scratchpad memories (or local stores) allow direct inter-core communication, with latency and energy advantages over coherent cache-based communication, especially as CMP architectures become more distributed. We have designed cache-integrated network interfaces, appropriate for scalable multicores, that combine the best of two worlds – the flexibility of caches and the efficiency of scratchpad memories: on-chip SRAM is configurably shared among caching, scratchpad, and virtualized network interface (NI) functions. This paper presents our architecture, which provides local and remote scratchpad access, to either individual words or multiword blocks through RDMA copy. Furthermore, we introduce event responses, as a technique that enables software configurable communication and synchronization primitives. We present three event response mechanisms that expose NI functionality to software, for multiword transfer initiation, completion notifications for software selected sets of arbitrary size transfers, and multi-party synchronization queues. We implemented these mechanisms in a four-core FPGA prototype, and measure the logic overhead over a cache-only design for basic NI functionality to be less than 20%. We also evaluate the on-chip communication performance on the prototype, as well as the performance of synchronization functions with simulation of CMPs with up to 128 cores. We demonstrate efficient synchronization, low-overhead communication, and amortized-overhead bulk transfers, which allow parallelization gains for fine-grain tasks, and efficient exploitation of the hardware bandwidth.
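The explicit-transfer path described above can be pictured with the following minimal sketch: an RDMA copy is posted to the NI, and a completion-notification counter is polled as a selective fence. The register layout, the rule that writing the size register launches the copy, and the rule that acknowledgements increment the counter are assumptions for illustration, not the prototype's documented interface.

#include <stdint.h>

#define NI_RDMA_SRC  ((volatile uint32_t *)0x40001000u)  /* hypothetical source address register          */
#define NI_RDMA_DST  ((volatile uint32_t *)0x40001004u)  /* hypothetical destination address register     */
#define NI_RDMA_SIZE ((volatile uint32_t *)0x40001008u)  /* hypothetical size register; write starts copy */
#define NI_CNT_VALUE ((volatile uint32_t *)0x40001010u)  /* hypothetical completion counter               */

/* Post a copy of `size` bytes from the local scratchpad to a remote scratchpad. */
static void rdma_copy(uint32_t src, uint32_t dst, uint32_t size)
{
    *NI_RDMA_SRC  = src;
    *NI_RDMA_DST  = dst;
    *NI_RDMA_SIZE = size;            /* assumed: this write launches the transfer */
}

/* Selective fence: wait until `n` transfers in the selected set have signalled the counter. */
static void wait_transfers(uint32_t n)
{
    while (*NI_CNT_VALUE < n)
        ;                            /* assumed: each completion acknowledgement increments the counter */
}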
... These models can be grouped into two categories: approaches to reduce the burden on the memory and approaches to increase the efficiency of existing memory systems. In the first category, cache-integrated network interfaces [31] and utility-based cache partitioning [32] have been proposed. In the second category, Liu et. ...
Conference Paper
Full-text available
Parallel programming has transcended from HPC into mainstream, enabled by a growing number of programming models, languages and methodologies, as well as the availability of multicore systems. However, performance analysis of parallel programs is still difficult, especially for large and complex programs, or applications developed using different programming models. This paper proposes a simple analytical model for studying the speedup of shared-memory programs on multicore systems. The proposed model derives the speedup and speedup loss from data dependency and memory overhead for various configurations of threads, cores and memory access policies in UMA and NUMA systems. The model is practical because it uses only generally available and non-intrusive inputs derived from the trace of the operating system run-queue and hardware events counters. Using six OpenMP HPC dwarfs from the NPB benchmark, our model differs from measurement results on average by 9% for UMA and 11% on NUMA. Our analysis shows that speedup loss is dominated by memory contention, especially for larger problem sizes. For the worst performing structured grid dwarf on UMA, memory contention accounts for up to 99% of the speedup loss. Based on this insight, we apply our model to determine the optimal number of cores that alleviates memory contention, maximizing speedup and reducing execution time.
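The paper's model is not reproduced on this page; purely as an illustration of the decomposition its abstract describes (a serial, data-dependency fraction plus a memory overhead that grows with core count), a generic Amdahl-style form would be

S(p) = \frac{T_1}{T_1\left(f_s + \frac{1 - f_s}{p}\right) + T_{\mathrm{mem}}(p)},
\qquad
\Delta S(p) = \frac{1}{f_s + \frac{1 - f_s}{p}} - S(p),

where T_1 is the single-core execution time, f_s the serial fraction due to data dependency, T_mem(p) the memory-contention overhead on p cores, and \Delta S(p) the speedup loss relative to the contention-free (Amdahl) speedup. The paper's finding that memory contention dominates corresponds to T_mem(p) being the main contributor to \Delta S(p) for larger problem sizes.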
... On multi-core processors with coherent caches, communication is consumer-initiated, thus requiring roundtrip messages to transfer data between cores, while synchronization is implemented with atomic operations, which trigger sequences of invalidations and subsequent data transfers from remote caches or memory. Hardware support for explicit communication, such as hardware LIFO queues (Carbon [1]), asynchronous direct messages [2], RDMAs [3], and hardware queues with asynchronous event-responses [4] provide viable solutions to these problems. This paper explores the design and implementation of OpenMP on a multi-core system that offers explicit communication primitives for fast on-chip data transfers. ...
... The OpenMP primitives include: (i) scheduling of parallel loops and asynchronous tasks [5], using either work-stealing [6] or work-sharing [7]; (ii) user-level synchronization including locks, barriers, and reductions; and (iii) data privatization in local memories [8]. We present the design and implementation of an OpenMP runtime system for an FPGA prototype of the SARC architecture [9], which features explicitly managed on-chip local memories and explicit on-chip communication primitives, including remote stores for producer-initiated short data transfers, RDMA operations for producer-initiated or consumer-initiated bulk data transfers, hardware event queues with automatically generated responses, and hardware counters [4]. Our OpenMP implementation on a four-core SARC multicore FPGA prototype achieves parallel task initiation in 30-35 processor clock cycles. ...
... Additionally, the omp_lock_t datatype and the omp atomic directive are used to perform atomic updates on shared variables. We implement all these operations using two alternatives: (i) using the hardware mutex and (ii) using MRQs to implement a lock [4]. The MRQ initially contains one token that all cores try to read. ...
Conference Paper
Full-text available
We present a runtime system that uses the explicit on-chip communication mechanisms of the SARC multi-core architecture to implement efficiently the OpenMP programming model and enable the exploitation of fine-grain parallelism in OpenMP programs. We explore the design space of implementation of OpenMP directives and runtime intrinsics, using a family of hardware primitives: remote stores, remote DMAs, hardware counters, and hardware event queues with automatic responses, to support static and dynamic scheduling and data transfers in local memories. Using an FPGA prototype with four cores, we achieve OpenMP task creation latencies of 30-35 processor clock cycles, initiation of parallel contexts in 50 cycles, and synchronization primitives in 65-210 cycles.
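To make the work-sharing path concrete, here is a rough sketch of dynamic loop scheduling over a hardware multiple-reader queue: the master core enqueues chunk descriptors and any idle core dequeues the next one without a software lock. The queue addresses, the chunk size, and the "zero means empty" convention are assumptions added for illustration, not the runtime's actual interface.

#include <stdint.h>

#define WS_Q_HEAD ((volatile uint32_t *)0x40000300u)  /* hypothetical: read dequeues a chunk id  */
#define WS_Q_TAIL ((volatile uint32_t *)0x40000304u)  /* hypothetical: write enqueues a chunk id */

#define CHUNK 64u

/* Master core: split the iteration space [0, n) into chunks; ids start at 1 so 0 can mean "empty". */
static void share_loop(uint32_t n)
{
    for (uint32_t c = 0; c * CHUNK < n; c++)
        *WS_Q_TAIL = c + 1;
}

/* Worker core: repeatedly grab a chunk and run the loop body over its iterations. */
static void worker_loop(uint32_t n, void (*body)(uint32_t i))
{
    uint32_t id;
    while ((id = *WS_Q_HEAD) != 0) {
        uint32_t begin = (id - 1) * CHUNK;
        uint32_t end   = begin + CHUNK < n ? begin + CHUNK : n;
        for (uint32_t i = begin; i < end; i++)
            body(i);
    }
}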
... In multi-chip multiprocessors this type of NI was exploited for the sender-side of the AP1000 [54] to transfer cache lines, and there are a few recent CMP designs that also take this approach. These include the CellBE [13], from IBM, Sony, and Toshiba, implementing eight synergistic processing elements (SPE) with private scratchpad memories inside a global address space accessible via coherent RDMA; Intel's Single-chip Cloud Computing (SCC) experimental 48-core chip [5] which provides 8 KB/core on-chip message passing buffers (MPB), accessible by all cores via loads and stores, as a conceptual shared buffer inside the system address space; and the cache-integrated NI for the SARC European IP project [55] presented in this thesis, implemented in a multicore FPGA-based prototype [56,57], which supports all the communication mechanisms that will be discussed in subsection 2.3.4 and a set of synchronization mechanisms presented in section 2.4. ...
Article
Full-text available
The physical constraints of transistor integration have made chip multiprocessors (CMPs) a necessity, and increasing the number of cores (CPUs) the best approach, so far, for exploiting more transistors. Already, the feasible number of cores per chip increases beyond our ability to utilize them for general purposes. Although many important application domains can easily benefit from the use of more cores, scaling single-application performance with multiprocessing, in general, presents a tough milestone for computer science. The use of per-core on-chip memories, managed in software with RDMA, adopted in the IBM Cell processor, has challenged the mainstream approach of using coherent caches for the on-chip memory hierarchy of CMPs. The two architectures have largely different implications for software and divide researchers over the most suitable approach to multicore exploitation. We demonstrate the combination of the two approaches, with cache integration of a network interface (NI) for explicit interprocessor communication, and flexible dynamic allocation of on-chip memory to hardware-managed (cache) and software-managed parts. The network interface architecture combines messages and RDMA-based transfers with remote load-store access to the software-managed memories, and allows multipath routing in the processor interconnection network. We propose the technique of event responses, which efficiently exploits the normal cache access flow for network interface functions, and prototype our combined approach in an FPGA-based multicore system, which shows reasonable logic overhead (less than 20%) in cache datapaths and controllers for the basic NI functionality. We also design and implement synchronization mechanisms in the network interface (counters and queues) that take advantage of event responses and exploit the cache tag and data arrays for synchronization state. We propose novel queues that efficiently support multiple readers, providing hardware lock and job-dispatching services, and counters that enable selective fences for explicit transfers and can be synthesized to implement barriers in the memory system. Evaluation of the cache-integrated NI on the hardware prototype demonstrates the flexibility of exploiting both cacheable and explicitly managed data, and potential advantages of NI transfer mechanism alternatives. Simulations of up to 128-core CMPs show that our synchronization primitives provide significant benefits for contended locks and barriers, and can improve task scheduling efficiency in the Cilk run-time system, especially for regular codes.
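As a final illustration of the counter-based synchronization the thesis describes, the sketch below builds a barrier from an NI counter that is assumed to fire a notification into each core's local memory when it reaches the number of participating cores and then reset itself. The addresses and this trigger behaviour are assumptions for illustration, not the thesis' documented interface.

#include <stdint.h>

#define NCORES      4u
#define BAR_CNT_ADD ((volatile uint32_t *)0x40000400u)  /* hypothetical: a write adds to the counter      */
#define BAR_NOTIFY  ((volatile uint32_t *)0x00001000u)  /* hypothetical notification word in local memory */

static uint32_t bar_epoch;           /* per-core count of completed barrier episodes (software state) */

static void barrier(void)
{
    uint32_t target = ++bar_epoch;
    *BAR_CNT_ADD = 1;                /* signal arrival; at NCORES the counter notifies every core (assumed) */
    while (*BAR_NOTIFY < target)
        ;                            /* wait for this episode's notification to land in local memory */
}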