High-level organization of a near-bank PIM architecture.

Source publication

Fig. 1. (a) CSR-based representation of a sparse matrix. (b) CSR-based...

Fig. 2. (a) SpMV with a dense matrix representation, and (b) CSR, (c)...

Fig. 3. High-level organization of a near-bank PIM architecture.

Fig. 4. Execution of the SpMV kernel on a real PIM system.

Fig. 5. Data partitioning techniques of the SparseP package.

SparseP: Towards Efficient Sparse Matrix Vector Multiplication on Real Processing-In-Memory Systems

Preprint

Jan 2022

Several manufacturers have already started to commercialize near-bank Processing-In-Memory (PIM) architectures. Near-bank PIM architectures place simple cores close to DRAM banks and can yield significant performance and energy improvements in parallel applications by alleviating data access costs. Real PIM systems can provide high levels of parall...

Context 1

... a result, memory-centric near-bank PIM systems constitute a better fit for the widely-used SpMV kernel, because they provide high levels of parallelism, large aggregate memory bandwidth and low memory access latency [45,53]. Figure 3 shows the baseline organization of a near-bank PIM system that we assume in this work. The PIM system (Figure 3) consists of a host CPU, standard DRAM memory modules, and PIM-enabled memory modules. ...
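For readers unfamiliar with the kernel named in this context, the following is a minimal sketch of SpMV over a CSR-encoded matrix. It is a plain sequential illustration, not the SparseP implementation; the function name `spmv_csr` and the array names `row_ptr`, `col_idx`, and `values` are assumptions chosen for this example.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    row_ptr[i]..row_ptr[i+1] delimits the non-zeros of row i;
    col_idx[k] and values[k] give the column and value of non-zero k.
    (All names here are illustrative, not taken from the paper.)
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        acc = 0.0
        # Each non-zero triggers an indirect read of x, which is why
        # the kernel tends to be limited by memory access cost.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# Example matrix: [[1, 0, 2], [0, 3, 0], [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
y = spmv_csr(row_ptr, col_idx, values, [1.0, 1.0, 1.0])
```

The inner loop performs one multiply-add per loaded non-zero, so the amount of arithmetic per byte fetched is low, which is consistent with the context's argument for placing compute close to memory.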

Context 2

... 3 shows the baseline organization of a near-bank PIM system that we assume in this work. The PIM system (Figure 3) consists of a host CPU, standard DRAM memory modules, and PIM-enabled memory modules. PIM-enabled modules are connected to the host CPU using one or more memory channels, and include multiple PIM chips. ...

Context 3

... modules are connected to the host CPU using one or more memory channels, and include multiple PIM chips. A PIM chip (Figure 3 right) tightly integrates a low-area PIM core with a DRAM memory bank. We assume that each PIM core can additionally include a small private instruction memory and a small data (scratchpad or cache) memory. ...
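The hierarchy described in Contexts 1 to 3 can be summarized as a small data model. This is only a sketch of the organization as the text describes it; every class name, field name, and count below is hypothetical and does not correspond to any vendor API or to figures from the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PIMCore:
    # The contexts say a core "can additionally include" these memories.
    has_instruction_memory: bool = True
    has_data_memory: bool = True  # scratchpad or cache


@dataclass
class PIMChip:
    # One low-area PIM core tightly integrated with one DRAM bank.
    bank_id: int = 0
    core: PIMCore = field(default_factory=PIMCore)


@dataclass
class PIMModule:
    # A PIM-enabled memory module holding multiple PIM chips.
    chips: List[PIMChip] = field(default_factory=list)


@dataclass
class PIMSystem:
    # Host CPU is implicit; it reaches PIM modules over memory channels.
    pim_modules: List[PIMModule] = field(default_factory=list)
    standard_dram_modules: int = 1  # hypothetical count

    def total_pim_cores(self) -> int:
        # One core per chip, so counting chips counts cores.
        return sum(len(m.chips) for m in self.pim_modules)


def build_system(n_modules: int, chips_per_module: int) -> PIMSystem:
    """Build a system with identical modules (sizes are illustrative)."""
    modules = [
        PIMModule(chips=[PIMChip(bank_id=b) for b in range(chips_per_module)])
        for _ in range(n_modules)
    ]
    return PIMSystem(pim_modules=modules)


system = build_system(n_modules=2, chips_per_module=8)
```

Because each chip pairs exactly one core with one bank, the number of cores available for parallel work grows with the number of banks in the PIM-enabled modules.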

Context 4

... compare the coarse-grained locking (lb-cg) and the fine-grained locking (lb-fg) approaches in BCOO format. Figure 30 shows the performance achieved by BCOO format for all the data types when balancing the blocks or the non-zero elements across 16 tasklets of one DPU. We evaluate all small matrices of Table 3, i.e., delaunay_n13 (D), wing_nodal (W), raefsky4 (R) and pkustk08 (P) matrices. ...
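The idea of balancing non-zero elements across tasklets, mentioned in this context, can be sketched as a contiguous partitioning of work units (rows or blocks) by non-zero count. This is an illustrative greedy split under the assumption that units are assigned in order; the function name `balance_by_nnz` is hypothetical, and the sketch is not the scheme used by SparseP.

```python
def balance_by_nnz(nnz_per_unit, n_workers):
    """Split units into contiguous half-open ranges, one per worker,
    so that each range holds roughly total_nnz / n_workers non-zeros.

    nnz_per_unit[i] is the non-zero count of unit i (a row or a block).
    Returns a list of (start, end) index pairs.
    """
    total = sum(nnz_per_unit)
    ranges = []
    start = 0
    seen = 0  # non-zeros assigned so far
    for w in range(n_workers):
        end = start
        # Take units while the running total stays within this worker's
        # cumulative share; integer arithmetic avoids rounding issues.
        while (end < len(nnz_per_unit)
               and (seen + nnz_per_unit[end]) * n_workers <= total * (w + 1)):
            seen += nnz_per_unit[end]
            end += 1
        ranges.append((start, end))
        start = end
    return ranges


# Six blocks with uneven non-zero counts, split across two workers.
parts = balance_by_nnz([4, 1, 1, 2, 4, 4], n_workers=2)
```

In this example the first worker receives four blocks and the second receives two, yet both hold eight non-zeros. Balancing by block count instead would give three blocks each but six versus ten non-zeros.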

Figure 9: Comparison between HARU and state-of-the-art methods.

Figure 11: Pipelined execution of Algorithm 2.

Figure 12: sDTW hardware accelerator design for HARU.

Efficient real-time selective genome sequencing on resource-constrained devices

Article

Dec 2022

Background: Third-generation nanopore sequencers offer selective sequencing or "Read Until" that allows genomic reads to be analyzed in real time and abandoned halfway if not belonging to a genomic region of "interest." This selective sequencing opens the door to important applications such as rapid and low-cost genetic tests. The latency in analy...

Figure 3: Per job CPU usage before isolating tasks

Figure 4: Per job...

Adding multi-core support to the ALICE Grid Middleware

Article

Feb 2023

The major upgrade of the ALICE experiment for the LHC Run3 poses unique challenges and opportunities for new software development. In particular, the entirely new data taking and processing software of ALICE relies on process parallelism and large amounts of shared objects in memory. Thus from a single-core single thread workload in the past, the n...

Determination of the zero-order fringe and phase-shifting points....

Surface reconstruction by the GPU-CUDA architecture.

CUDA main program and the parallel kernel function of the unambiguous...

(a) Preimage of the T-mesh. (b) T-junction for the blending function...

Determining the preimage of the T-mesh. (a) Preimage of the T-mesh for...

Parallel unambiguous generalized phase-shifting and T-spline fitting algorithms for optical micro-structured surface 3D topography metrology

Article

Mar 2023

3D topography metrology of optical micro-structured surfaces is critical for controlled manufacturing and evaluation of optical properties. Coherence scanning interferometry technology has significant advantages for measuring optical micro-structured surfaces. However, current research faces difficulties in designing high-accuracy and efficient...

Real-Time Multi-object Detection and Tracking for Autonomous Robots in Uncontrolled Environments

Conference Paper

May 2024

In this paper, a new system is developed for autonomous robots to detect and track multiple objects in uncontrolled environments in real time, with the aim of reducing processing time and achieving lower error rates than current systems. To achieve this, a novel multi-object tracking algorithm is introduced, implemented and enhanced...

Speculative Inter-Thread Store-to-Load Forwarding in SMT Architectures

Article

Nov 2022

Applications running on out-of-order cores have benefited for decades from store-to-load forwarding, which accelerates communication of store values to loads of the same thread. Although threads running on a simultaneous multithreading (SMT) core could also access the load queues (LQ) and store queues (SQ) / store buffers (SB) of other threads to allow...