High-level organization of a near-bank PIM architecture.


Source publication
Preprint
Several manufacturers have already started to commercialize near-bank Processing-In-Memory (PIM) architectures. Near-bank PIM architectures place simple cores close to DRAM banks and can yield significant performance and energy improvements in parallel applications by alleviating data access costs. Real PIM systems can provide high levels of parall...

Contexts in source publication

Context 1
... a result, memory-centric near-bank PIM systems constitute a better fit for the widely-used SpMV kernel, because they provide high levels of parallelism, large aggregate memory bandwidth and low memory access latency [45,53]. Figure 3 shows the baseline organization of a near-bank PIM system that we assume in this work. The PIM system (Figure 3) consists of a host CPU, standard DRAM memory modules, and PIM-enabled memory modules. ...
Context 2
... 3 shows the baseline organization of a near-bank PIM system that we assume in this work. The PIM system (Figure 3) consists of a host CPU, standard DRAM memory modules, and PIM-enabled memory modules. PIM-enabled modules are connected to the host CPU using one or more memory channels, and include multiple PIM chips. ...
Context 3
... modules are connected to the host CPU using one or more memory channels, and include multiple PIM chips. A PIM chip (Figure 3 right) tightly integrates a low-area PIM core with a DRAM memory bank. We assume that each PIM core can additionally include a small private instruction memory and a small data (scratchpad or cache) memory. ...
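The organization described in the excerpts above (host CPU, PIM-enabled modules, and PIM chips that each pair one core with one DRAM bank) can be pictured as a small data model. What follows is a minimal sketch under stated assumptions, not the paper's implementation: the class names `PIMChip`, `PIMModule`, and `PIMSystem`, the round-robin row placement, and the toy sizes are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PIMChip:
    """Hypothetical model: one PIM core coupled with one DRAM bank."""
    # The dict stands in for the local bank: row index -> list of (col, value).
    bank: dict = field(default_factory=dict)

    def spmv_rows(self, x):
        # The core only touches the rows stored in its own bank.
        # Assumption: x is passed in directly; a real system would first
        # have to place the needed part of x in the bank.
        return {r: sum(v * x[c] for c, v in row) for r, row in self.bank.items()}


@dataclass
class PIMModule:
    """Hypothetical model: a PIM-enabled memory module holding several chips."""
    chips: list


@dataclass
class PIMSystem:
    """Hypothetical model: the host side, which owns the PIM-enabled modules."""
    modules: list

    def all_chips(self):
        return [chip for module in self.modules for chip in module.chips]

    def load_matrix(self, rows):
        # Assumed placement policy: distribute rows round-robin across banks.
        chips = self.all_chips()
        for r, row in enumerate(rows):
            chips[r % len(chips)].bank[r] = row

    def spmv(self, x, n_rows):
        # The host gathers the partial results produced by every PIM core.
        y = [0.0] * n_rows
        for chip in self.all_chips():
            for r, value in chip.spmv_rows(x).items():
                y[r] = value
        return y


# Toy system: 2 modules with 2 chips each, and a 4x3 sparse matrix.
system = PIMSystem([PIMModule([PIMChip(), PIMChip()]) for _ in range(2)])
rows = [[(0, 2.0)], [(1, 3.0), (2, 1.0)], [], [(0, 1.0), (2, 4.0)]]
system.load_matrix(rows)
result = system.spmv([1.0, 2.0, 3.0], n_rows=4)
```

In this sketch each core works independently on its own bank, which is the property the excerpts credit for the high parallelism and aggregate bandwidth; the host only distributes data and collects results.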
Context 4
... compare the coarse-grained locking (lb-cg) and the fine-grained locking (lb-fg) approaches in BCOO format. Figure 30 shows the performance achieved by BCOO format for all the data types when balancing the blocks or the non-zero elements across 16 tasklets of one DPU. We evaluate all small matrices of Table 3, i.e., delaunay_n13 (D), wing_nodal (W), raefsky4 (R) and pkustk08 (P) matrices. ...
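The two ideas in this excerpt, balancing work across tasklets and choosing a lock granularity, can be illustrated with a short sketch. It rests on several assumptions that are not taken from the source: each stored entry is treated as a 1x1 "block" in plain COO form, "coarse-grained" is read as one lock for the whole output vector, "fine-grained" is read as one lock per output row, and Python threads stand in for tasklets. The helper names `balance` and `spmv_locked` are hypothetical.

```python
import threading


def balance(weights, n_parts):
    """Split items into n_parts contiguous chunks of near-equal total weight.

    Passing weights of all 1 balances the number of blocks; passing the
    non-zero count of each block balances the non-zero elements instead.
    """
    total = sum(weights)
    bounds, acc, part = [0], 0, 1
    for i, w in enumerate(weights):
        acc += w
        # Close the current chunk once it reaches its cumulative share.
        while part < n_parts and acc >= total * part / n_parts:
            bounds.append(i + 1)
            part += 1
    while len(bounds) < n_parts + 1:
        bounds.append(len(weights))
    return [(bounds[k], bounds[k + 1]) for k in range(n_parts)]


def spmv_locked(entries, x, n_rows, chunks, fine_grained):
    """Accumulate y = A @ x from (row, col, value) entries, one thread per chunk."""
    y = [0.0] * n_rows
    # Assumed reading: fine-grained = one lock per row, coarse = one global lock.
    locks = [threading.Lock() for _ in range(n_rows if fine_grained else 1)]

    def worker(lo, hi):
        for r, c, v in entries[lo:hi]:
            lock = locks[r] if fine_grained else locks[0]
            with lock:  # protect the shared output element
                y[r] += v * x[c]

    threads = [threading.Thread(target=worker, args=chunk) for chunk in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return y


# Six blocks with uneven non-zero counts, split across two workers.
nnz_per_block = [8, 1, 1, 1, 1, 4]
by_blocks = balance([1] * len(nnz_per_block), 2)
by_nnz = balance(nnz_per_block, 2)

entries = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0), (2, 0, 4.0)]
chunks = balance([1] * len(entries), 2)
y_coarse = spmv_locked(entries, [1.0, 10.0], 3, chunks, fine_grained=False)
y_fine = spmv_locked(entries, [1.0, 10.0], 3, chunks, fine_grained=True)
```

With the uneven block sizes above, splitting by block count gives the two workers 10 and 6 non-zeros, while splitting by non-zero count gives 8 each. Both locking variants produce the same result; they differ only in how much the workers serialize on shared output rows.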
