Figure 5 - uploaded by Stamatis G. Kavvadias
FPGA prototype system block diagram.  


Source publication
Conference Paper
Full-text available
Per-core local (scratchpad) memories allow direct inter-core communication, with latency and energy advantages over coherent cache-based communication, especially as CMP architectures become more distributed. We have designed cache-integrated network interfaces (NIs), appropriate for scalable multicores, that combine the best of two worlds – the flex...

Contexts in source publication

Context 1
... block diagram of the FPGA system is presented in figure 5. There are four Xilinx MicroBlaze IP cores, each with 4 KB L1 instruction and data caches and a 64 KB L2 data cache, where our network interface mechanisms are integrated. ...
Context 2
... implemented two versions of locks and barriers on our hardware prototype. The mutex lock uses the hardware lock box of figure 5, whereas the second uses multiple reader queues (see Section 3). For barrier synchronization, we developed a barrier using the mutex lock implemented with the hardware lock box and a barrier using counters. ...
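Below is a minimal sketch, in C, of how software might drive the two lock flavours described in this context. The memory-mapped addresses, the convention that reading the lock box returns 1 on success, and the convention that reading an empty queue returns 0 are all assumptions made for illustration; they are not taken from the paper.

#include <stdint.h>

#define LOCKBOX_ADDR  ((volatile uint32_t *)0x40000000u)  /* hypothetical lock-box word            */
#define MRQ_HEAD_ADDR ((volatile uint32_t *)0x40000100u)  /* hypothetical MRQ head: read dequeues  */
#define MRQ_TAIL_ADDR ((volatile uint32_t *)0x40000104u)  /* hypothetical MRQ tail: write enqueues */

/* Hardware lock box: spin until the box grants the lock (assumed: read returns 1 to the winner). */
static void lockbox_acquire(void)
{
    while (*LOCKBOX_ADDR == 0)
        ;
}

static void lockbox_release(void)
{
    *LOCKBOX_ADDR = 0;                 /* assumed: a write releases the lock */
}

/* MRQ-based lock: the queue initially holds one token; dequeuing it means owning the lock. */
static void mrq_lock_acquire(void)
{
    while (*MRQ_HEAD_ADDR == 0)        /* assumed: an empty queue reads as 0 */
        ;
}

static void mrq_lock_release(void)
{
    *MRQ_TAIL_ADDR = 1;                /* put the token back for the next waiter */
}

A barrier can then be built either by protecting an arrival count with one of these locks, or directly with the hardware counters, as the context notes.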

Similar publications

Article
Full-text available
A novel transient voltage collapse (TVC) technique is presented to enable low-voltage operation in SRAM. By dynamically switching off the PMOS during write operations with a collapsed supply voltage below the data retention voltage, a minimum operating voltage (Vccmin) of 0.6V is demonstrated in a 32nm 12-Mb low-power (LP) SRAM. Data retention fail...
Article
Full-text available
The rapidly increasing usage of portable devices in modern life has led us to focus our attention on increasing the performance of SRAM circuits, especially for low-power applications. Basically, in a six-transistor (6T) SRAM cell either a read or a write operation can be performed at a time, whereas in a 7T SRAM cell using single-ended write operatio...

Citations

... Access to local and remote memories is done using the Remote Direct Memory Access (RDMA) protocol [40]. As the first application, a multicore system based on 8 custom MicroBlaze [41] processors per module, forming a 512-core cluster [42], was implemented. ...
Article
Full-text available
In recent years, the most powerful supercomputers have already reached megawatt power consumption levels, an important issue that challenges sustainability and shows the impossibility of maintaining this trend. To date, the prevalent approach to supercomputing is dominated by CPUs and GPUs. Given their fixed architectures with generic instruction sets, they have been favored with lots of tools and mature workflows, which has led to mass adoption and further growth. However, reconfigurable hardware such as FPGAs has repeatedly proven that it offers substantial advantages over this supercomputing approach concerning performance and power consumption. In this survey, we review the most relevant works that advanced the field of heterogeneous supercomputing using FPGAs, focusing on their architectural characteristics. Each work was divided into three main parts: network, hardware, and software tools. All implementations face challenges that involve all three parts. These dependencies result in compromises that designers must take into account. The advantages and limitations of each approach are discussed and compared in detail. The classification and study of the architectures illustrate the trade-offs of the solutions and help identify open problems and research lines.
... Many works have been done on the performance improvement of CPU and GPU architectures. Various ways of analyzing multicore CPU systems are available in the literature [17][18][19][20]. For instance, multiple parametric performance models are presented by the authors of [17], which aim at running applications of multiple classes. ...
... A low power-overhead scheme is presented, based on sharing a portion of the cache among different applications [18]. Moreover, focus is also given to L2 cache sharing, for which a novel architecture is proposed for configuring the sharing of synchronous dynamic access memory between multiple CPU functions [19]. Similarly, to understand the effects of partitioning on system performance, a scheme [20] is presented that is based on the interaction of cache partitioning and bandwidth partitioning. ...
Article
Full-text available
In this technological era, every person, authority, entrepreneur, business, and many things around us are connected to the internet, forming the Internet of Things (IoT). This generates a massive amount of diverse data at very high speed, termed big data. This data is very useful and can be used as an asset by businesses, organizations, and authorities to predict the future in various aspects. However, efficiently processing big data while making real-time decisions is quite a challenging task. Tools like Hadoop are used for processing big datasets; on the other hand, these tools do not perform well for real-time high-speed stream processing. Therefore, in this paper we propose an efficient, real-time big data stream processing approach that maps a Hadoop MapReduce-equivalent mechanism onto graphics processing units (GPUs). We integrate the parallel and distributed environment of the Hadoop ecosystem and a real-time stream processing tool, i.e., Spark, with GPUs to make the system powerful enough to handle the overwhelming amount of high-speed streaming data. We design a MapReduce-equivalent algorithm for GPUs for statistical parameter calculation by dividing overall big data files into fixed-size blocks. Finally, the system is evaluated with respect to efficiency (processing time and throughput) using (1) large city traffic video data, captured by static as well as moving vehicles' cameras, while identifying vehicles, and (2) large text-based files, such as Twitter data files, structured data, etc. Results show that the proposed system, working with Spark on top of GPUs in the parallel and distributed environment of the Hadoop ecosystem, is more efficient and real-time than the existing standalone CPU-based MapReduce implementation.
... Performance issues in systems based on multicore CPUs have been analyzed in the literature from different points of view. In most of the literature, the effects of a single component of the CPU (or the system) have been investigated, rather than complete characterizations: for example, [4], [5] and [6] consider the effects of L2 cache sharing; [7] considers the whole memory hierarchy; [8] focuses on multithreading support; [9] and [10] analyze the effects of internal scheduling; more abstract features, such as virtualization effects, are considered in [11], [12], [13] and [14]. As the general approach is founded on the definition or application of benchmarks that are run on real systems to tune analytical or simulative models, in this paper in vivo measurements will be used to validate the proposed models, to obtain a reliable basis on which more general performance considerations can be carried out (as, e.g., in [15]), and to try to obtain some general indications about the influence of multithreading and multicore on the overall performance of a complex system architecture. ...
Conference Paper
Multicore architectures are now available for a wide range of high performance applications, ranging from embedded systems to large scale servers deployed in cloud environments. Multicore architectures are usually subject to two conflicting goals: obtaining full utilization of the cores while achieving given performance objectives, such as throughput, response time or reduced energy consumption. Moreover, there is a strong interdependence between the software characteristics of the applications and the underlying CPU architecture. In this scenario, simulation and analytical techniques can provide solid tools to properly design the considered class of systems; however, properly characterizing the workload of multithreaded applications in multicore environments is not an easy task, and thus it is a hot research topic. In this paper we present several models, of increasing complexity, that can characterize multithreaded applications running on multicore architectures. (Proceedings of the 28th European Conference on Modelling and Simulation, ECMS.)
... To reduce latency, these mechanisms and registers need to be brought close to the processor, at the level of cache memory, as opposed to the level of main memory or the I/O bus. This section briefly explains how we achieve all of the above, describing the proposed communication (RDMA) and synchronization (counters, queues, notifications) mechanisms and some typical uses (the detailed scheme was presented in [20,21]). The proposed NI mechanisms are integrated into private (as opposed to shared) caches in order for processors to have parallel access to them. ...
Article
A multicore FPGA platform with cache-integrated network interfaces (NIs) has been developed, appropriate for scalable multicores, that combines the best of two worlds – the flexibility of caches (using implicit communication) and the efficiency of scratchpad memories (using explicit communication). Furthermore, the proposed scheme provides virtualized user-level RDMA capabilities and special hardware primitives (counters, queues) for the communication and synchronization of the cores. This paper presents how the proposed architecture can be utilized in the domain of network processing applications using the hardware synchronization mechanisms. Two representative network processing benchmarks are used; one for header processing and one for payload processing. The Multiple Reader Queue (MRQ) scheme is utilized in the case of header processing, while in the case of payload processing, where transfer of bulk data is required, the user-level RDMA scheme is utilized. These applications are mapped to and evaluated on an FPGA platform with up to 24 processors. The performance evaluation in the domain of network processing shows that the proposed scheme can offer low-latency communication and increased programming efficiency, while it also offloads the communication and synchronization work from the processor.
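As an informal illustration of the header-processing path described in this abstract, the sketch below hands packet descriptors to worker cores through a multiple-reader queue, so the hardware performs the job dispatching. The register addresses and the convention that a descriptor value of 0 means "queue empty" are assumptions added for illustration, not the paper's interface.

#include <stdint.h>

#define MRQ_PKT_HEAD ((volatile uint32_t *)0x40000200u)  /* hypothetical: read dequeues a descriptor  */
#define MRQ_PKT_TAIL ((volatile uint32_t *)0x40000204u)  /* hypothetical: write enqueues a descriptor */

/* Dispatcher core: enqueue the scratchpad address of a received packet header. */
static void dispatch_packet(uint32_t pkt_addr)
{
    *MRQ_PKT_TAIL = pkt_addr;
}

/* Worker core: dequeue the next descriptor and process its header. */
static uint32_t fetch_packet(void)
{
    uint32_t addr;
    while ((addr = *MRQ_PKT_HEAD) == 0)
        ;                              /* spin until a descriptor arrives (0 assumed to mean empty) */
    return addr;
}

For payload processing, the bulk data would instead be moved with a user-level RDMA copy, as in the transfer sketch given further below for the prototype's explicit communication.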
... At the same time, the DCN infrastructure changes in ways that can radically modify the landscape. Intelligent network interfaces (NIs) attached to (or coexisting with) processing cores, which can provide low-latency / high-bandwidth pathways to remote processes, are a long-sought goal - see Fig. 1(a) for an illustration, and refer to [3] for an example. ...
... Negative credits are also used in [11]. We ignore here the trivial case where f is the only active flow and thus receives all service. ...
Conference Paper
Full-text available
Network devices supporting above-100G links are needed today in order to scale communication bandwidth along with the processing capabilities of computing nodes in data centers and warehouse computers. In this paper, we propose a light-weight, fair scheduler for such ultra high-speed links, and an arbitrarily large number of requestors. We show that, in practice, our first algorithm, as well its predecessor, DRR, may result in bursty service even in the common case, where flow weights are approximately equal, and we identify applications where this can damage performance. Our second contribution is an enhancement that improves short-term fairness to deliver very smooth service when flow weights are approximately equal, whilst allocating bandwidth in a weighted fair manner.
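For context, the sketch below shows classic Deficit Round Robin (DRR), the predecessor the paper starts from, not the proposed scheduler itself; the flow count, data structures, and quantum are illustrative. It also makes the criticized behaviour visible: a backlogged flow can transmit up to a full quantum of bytes back-to-back in each round, which is exactly the burstiness the paper's enhancement smooths out when flow weights are approximately equal.

#include <stdint.h>
#include <stddef.h>

#define NFLOWS 4

struct pkt { uint32_t len; struct pkt *next; };

struct flow {
    struct pkt *head, *tail;   /* per-flow FIFO of queued packets          */
    uint32_t quantum;          /* credit granted per round (weight-scaled) */
    uint32_t deficit;          /* accumulated, unspent credit              */
};

static struct flow flows[NFLOWS];

static void transmit(struct pkt *p) { (void)p; /* hand the packet to the link (stub) */ }

/* One DRR round: visit every flow once, grant one quantum, and send while credit lasts. */
static void drr_round(void)
{
    for (int i = 0; i < NFLOWS; i++) {
        struct flow *f = &flows[i];
        if (f->head == NULL) { f->deficit = 0; continue; }   /* idle flows keep no credit */
        f->deficit += f->quantum;
        while (f->head != NULL && f->head->len <= f->deficit) {
            struct pkt *p = f->head;
            f->deficit -= p->len;                            /* spend credit on this packet */
            f->head = p->next;
            if (f->head == NULL) { f->tail = NULL; f->deficit = 0; }  /* queue drained: drop leftover credit */
            transmit(p);
        }
    }
}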
... The first hardware design project which uses the Formic board is a prototype of a non-cache-coherent manycore architecture, based on ideas of the SARC project [3], which was fully implemented in software simulation and partially implemented on a XUPV5 hardware platform [4]. Each board fits in its FPGA eight CPUs, their private L1 and L2 caches, eight GTP links and a full network-on-chip centered around a 22-port crossbar. ...
Article
Full-text available
Modeling emerging multicore architectures is challenging and imposes a tradeoff between simulation speed and accuracy. An effective practice that balances both targets well is to map the target architecture on FPGA platforms. We find that accurate prototyping of hundreds of cores on existing FPGA boards faces at least one of the following problems: (i) limited fast memory resources (SRAM) to model caches, (ii) insufficient inter-board connectivity for scaling the design or (iii) the board is too expensive. We address these shortcomings by designing a new FPGA board for multicore architecture prototyping, which explicitly targets scalability and cost-efficiency. Formic has a 35% bigger FPGA, three times more SRAM, four times more links and costs at most half as much when compared to the popular Xilinx XUPV5 prototyping platform. We build and test a 64-board system by developing a 512-core, MicroBlaze-based, non-coherent hardware prototype with DMA capabilities, with full network-on-chip in a 3D-mesh topology. We believe that Formic offers significant advantages over existing academic and commercial platforms that can facilitate hardware prototyping for future manycore architectures.
... This paper extends our previous work in [17]. Here, we elaborate on the architecture of cache-integrated network interfaces and the technique of event responses that enables their efficient implementation, and we also measure the logic overhead of NI integration inside a cache. ...
Article
Full-text available
Per-core scratchpad memories (or local stores) allow direct inter-core communication, with latency and energy advantages over coherent cache-based communication, especially as CMP architectures become more distributed. We have designed cache-integrated network interfaces, appropriate for scalable multicores, that combine the best of two worlds – the flexibility of caches and the efficiency of scratchpad memories: on-chip SRAM is configurably shared among caching, scratchpad, and virtualized network interface (NI) functions. This paper presents our architecture, which provides local and remote scratchpad access, to either individual words or multiword blocks through RDMA copy. Furthermore, we introduce event responses, as a technique that enables software configurable communication and synchronization primitives. We present three event response mechanisms that expose NI functionality to software, for multiword transfer initiation, completion notifications for software selected sets of arbitrary size transfers, and multi-party synchronization queues. We implemented these mechanisms in a four-core FPGA prototype, and measure the logic overhead over a cache-only design for basic NI functionality to be less than 20%. We also evaluate the on-chip communication performance on the prototype, as well as the performance of synchronization functions with simulation of CMPs with up to 128 cores. We demonstrate efficient synchronization, low-overhead communication, and amortized-overhead bulk transfers, which allow parallelization gains for fine-grain tasks, and efficient exploitation of the hardware bandwidth.
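The explicit-transfer path described above can be pictured with the following minimal sketch: an RDMA copy is posted to the NI, and a completion-notification counter is polled as a selective fence. The register layout, the rule that writing the size register launches the copy, and the rule that acknowledgements increment the counter are assumptions for illustration, not the prototype's documented interface.

#include <stdint.h>

#define NI_RDMA_SRC  ((volatile uint32_t *)0x40001000u)  /* hypothetical source address register          */
#define NI_RDMA_DST  ((volatile uint32_t *)0x40001004u)  /* hypothetical destination address register     */
#define NI_RDMA_SIZE ((volatile uint32_t *)0x40001008u)  /* hypothetical size register; write starts copy */
#define NI_CNT_VALUE ((volatile uint32_t *)0x40001010u)  /* hypothetical completion counter               */

/* Post a copy of `size` bytes from the local scratchpad to a remote scratchpad. */
static void rdma_copy(uint32_t src, uint32_t dst, uint32_t size)
{
    *NI_RDMA_SRC  = src;
    *NI_RDMA_DST  = dst;
    *NI_RDMA_SIZE = size;            /* assumed: this write launches the transfer */
}

/* Selective fence: wait until `n` transfers in the selected set have signalled the counter. */
static void wait_transfers(uint32_t n)
{
    while (*NI_CNT_VALUE < n)
        ;                            /* assumed: each completion acknowledgement increments the counter */
}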
... These models can be grouped into two categories: approaches to reduce the burden on the memory and approaches to increase the efficiency of existing memory systems. In the first category, cache-integrated network interfaces [31] and utility-based cache partitioning [32] have been proposed. In the second category, Liu et. ...
Conference Paper
Full-text available
Parallel programming has transcended from HPC into mainstream, enabled by a growing number of programming models, languages and methodologies, as well as the availability of multicore systems. However, performance analysis of parallel programs is still difficult, especially for large and complex programs, or applications developed using different programming models. This paper proposes a simple analytical model for studying the speedup of shared-memory programs on multicore systems. The proposed model derives the speedup and speedup loss from data dependency and memory overhead for various configurations of threads, cores and memory access policies in UMA and NUMA systems. The model is practical because it uses only generally available and non-intrusive inputs derived from the trace of the operating system run-queue and hardware events counters. Using six OpenMP HPC dwarfs from the NPB benchmark, our model differs from measurement results on average by 9% for UMA and 11% on NUMA. Our analysis shows that speedup loss is dominated by memory contention, especially for larger problem sizes. For the worst performing structured grid dwarf on UMA, memory contention accounts for up to 99% of the speedup loss. Based on this insight, we apply our model to determine the optimal number of cores that alleviates memory contention, maximizing speedup and reducing execution time.
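The paper's model is not reproduced on this page; purely as an illustration of the decomposition its abstract describes (a serial, data-dependency fraction plus a memory overhead that grows with core count), a generic Amdahl-style form would be

S(p) = \frac{T_1}{T_1\left(f_s + \frac{1 - f_s}{p}\right) + T_{\mathrm{mem}}(p)},
\qquad
\Delta S(p) = \frac{1}{f_s + \frac{1 - f_s}{p}} - S(p),

where T_1 is the single-core execution time, f_s the serial fraction due to data dependency, T_mem(p) the memory-contention overhead on p cores, and \Delta S(p) the speedup loss relative to the contention-free (Amdahl) speedup. The paper's finding that memory contention dominates corresponds to T_mem(p) being the main contributor to \Delta S(p) for larger problem sizes.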
... On multi-core processors with coherent caches, communication is consumer-initiated, thus requiring roundtrip messages to transfer data between cores, while synchronization is implemented with atomic operations, which trigger sequences of invalidations and subsequent data transfers from remote caches or memory. Hardware support for explicit communication, such as hardware LIFO queues (Carbon [1]), asynchronous direct messages [2], RDMAs [3], and hardware queues with asynchronous event-responses [4] provide viable solutions to these problems. This paper explores the design and implementation of OpenMP on a multi-core system that offers explicit communication primitives for fast on-chip data transfers. ...
... The OpenMP primitives include: (i) scheduling of parallel loops and asynchronous tasks [5], using either work-stealing [6] or work-sharing [7]; (ii) user-level synchronization including locks, barriers, and reductions; and (iii) data privatization in local memories [8]. We present the design and implementation of an OpenMP runtime system for an FPGA prototype of the SARC architecture [9], which features explicitly managed on-chip local memories and explicit on-chip communication primitives, including remote stores for producer-initiated short data transfers, RDMA operations for producer-initiated or consumer-initiated bulk data transfers, hardware event queues with automatically generated responses, and hardware counters [4]. Our OpenMP implementation on a four-core SARC multicore FPGA prototype achieves parallel task initiation in 30-35 processor clock cycles. ...
... Additionally, the omp_lock_t datatype and the omp atomic directive are used to perform atomic updates on shared variables. We implement all these operations using two alternatives: (i) using the hardware mutex and (ii) using MRQs to implement a lock [4]. The MRQ initially contains one token that all cores try to read. ...
Conference Paper
Full-text available
We present a runtime system that uses the explicit on-chip communication mechanisms of the SARC multi-core architecture to implement efficiently the OpenMP programming model and enable the exploitation of fine-grain parallelism in OpenMP programs. We explore the design space of implementation of OpenMP directives and runtime intrinsics, using a family of hardware primitives: remote stores, remote DMAs, hardware counters, and hardware event queues with automatic responses, to support static and dynamic scheduling and data transfers in local memories. Using an FPGA prototype with four cores, we achieve OpenMP task creation latencies of 30-35 processor clock cycles, initiation of parallel contexts in 50 cycles, and synchronization primitives in 65-210 cycles.
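To make the work-sharing path concrete, here is a rough sketch of dynamic loop scheduling over a hardware multiple-reader queue: the master core enqueues chunk descriptors and any idle core dequeues the next one without a software lock. The queue addresses, the chunk size, and the "zero means empty" convention are assumptions added for illustration, not the runtime's actual interface.

#include <stdint.h>

#define WS_Q_HEAD ((volatile uint32_t *)0x40000300u)  /* hypothetical: read dequeues a chunk id  */
#define WS_Q_TAIL ((volatile uint32_t *)0x40000304u)  /* hypothetical: write enqueues a chunk id */

#define CHUNK 64u

/* Master core: split the iteration space [0, n) into chunks; ids start at 1 so 0 can mean "empty". */
static void share_loop(uint32_t n)
{
    for (uint32_t c = 0; c * CHUNK < n; c++)
        *WS_Q_TAIL = c + 1;
}

/* Worker core: repeatedly grab a chunk and run the loop body over its iterations. */
static void worker_loop(uint32_t n, void (*body)(uint32_t i))
{
    uint32_t id;
    while ((id = *WS_Q_HEAD) != 0) {
        uint32_t begin = (id - 1) * CHUNK;
        uint32_t end   = begin + CHUNK < n ? begin + CHUNK : n;
        for (uint32_t i = begin; i < end; i++)
            body(i);
    }
}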
... In multi-chip multiprocessors this type of NI was exploited for the sender-side of the AP1000 [54] to transfer cache lines, and there are a few recent CMP designs that also take this approach. These include the CellBE [13], from IBM, Sony, and Toshiba, implementing eight synergistic processing elements (SPE) with private scratchpad memories inside a global address space accessible via coherent RDMA; Intel's Single-chip Cloud Computing (SCC) experimental 48-core chip [5] which provides 8 KB/core on-chip message passing buffers (MPB), accessible by all cores via loads and stores, as a conceptual shared buffer inside the system address space; and the cache-integrated NI for the SARC European IP project [55] presented in this thesis, implemented in a multicore FPGA-based prototype [56,57], which supports all the communication mechanisms that will be discussed in subsection 2.3.4 and a set of synchronization mechanisms presented in section 2.4. ...
Article
Full-text available
The physical constraints of transistor integration have made chip multiprocessors (CMPs) a necessity, and increasing the number of cores (CPUs) the best approach, so far, for exploiting more transistors. Already, the feasible number of cores per chip increases beyond our ability to utilize them for general purposes. Although many important application domains can easily benefit from the use of more cores, scaling single-application performance with multiprocessing, in general, presents a tough milestone for computer science. The use of per-core on-chip memories, managed in software with RDMA, adopted in the IBM Cell processor, has challenged the mainstream approach of using coherent caches for the on-chip memory hierarchy of CMPs. The two architectures have largely different implications for software and divide researchers over the most suitable approach to multicore exploitation. We demonstrate the combination of the two approaches, with cache integration of a network interface (NI) for explicit interprocessor communication, and flexible dynamic allocation of on-chip memory to hardware-managed (cache) and software-managed parts. The network interface architecture combines messages and RDMA-based transfers with remote load-store access to the software-managed memories, and allows multipath routing in the processor interconnection network. We propose the technique of event responses, which efficiently exploits the normal cache access flow for network interface functions, and prototype our combined approach in an FPGA-based multicore system, which shows reasonable logic overhead (less than 20%) in cache datapaths and controllers for the basic NI functionality. We also design and implement synchronization mechanisms in the network interface (counters and queues) that take advantage of event responses and exploit the cache tag and data arrays for synchronization state. We propose novel queues that efficiently support multiple readers, providing hardware lock and job-dispatching services, and counters that enable selective fences for explicit transfers and can be synthesized to implement barriers in the memory system. Evaluation of the cache-integrated NI on the hardware prototype demonstrates the flexibility of exploiting both cacheable and explicitly managed data, and potential advantages of NI transfer mechanism alternatives. Simulations of up to 128-core CMPs show that our synchronization primitives provide significant benefits for contended locks and barriers, and can improve task scheduling efficiency in the Cilk run-time system, especially for regular codes.
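As a final illustration of the counter-based synchronization the thesis describes, the sketch below builds a barrier from an NI counter that is assumed to fire a notification into each core's local memory when it reaches the number of participating cores and then reset itself. The addresses and this trigger behaviour are assumptions for illustration, not the thesis' documented interface.

#include <stdint.h>

#define NCORES      4u
#define BAR_CNT_ADD ((volatile uint32_t *)0x40000400u)  /* hypothetical: a write adds to the counter      */
#define BAR_NOTIFY  ((volatile uint32_t *)0x00001000u)  /* hypothetical notification word in local memory */

static uint32_t bar_epoch;           /* per-core count of completed barrier episodes (software state) */

static void barrier(void)
{
    uint32_t target = ++bar_epoch;
    *BAR_CNT_ADD = 1;                /* signal arrival; at NCORES the counter notifies every core (assumed) */
    while (*BAR_NOTIFY < target)
        ;                            /* wait for this episode's notification to land in local memory */
}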