CompStor: An In-Storage Computation Platform for
Scalable Distributed Processing
Mahdi Torabzadehkashi
EECS Department
University of California, Irvine
Irvine, USA
Torabzam@uci.edu
Siavash Rezaei
ICS Department
University of California, Irvine
Irvine, USA
siavashr@uci.edu
Vladimir Alves
NGD Systems, Inc
Irvine, USA
vladimir.alves@ngdsystems.com
Nader Bagherzadeh
EECS Department
University of California, Irvine
Irvine, USA
nader@uci.edu
Abstract—The explosion of data-centric and data-dependent
applications requires new storage devices, interfaces, and software
stacks. Big data analytics solutions such as Hadoop, MapReduce
and Spark have addressed the performance challenge by using a
distributed architecture based on a new paradigm that relies on
moving computation closer to data. In this paper, we describe a
novel approach aimed at pushing the “move computation to data”
paradigm to its ultimate limit by enabling highly efficient and
flexible in-storage processing capability in solid state drives
(SSDs). We have designed CompStor, an FPGA-based SSD that
implements computational storage through a software stack
(devices, protocol, interface, software, and systems) and a
dedicated hardware for in-storage processing including a quad-
core ARM processor subsystem. The dedicated hardware
resources provide in-storage data analytics capability without
degrading the performance of common storage device functions
such as read, write and trim. Experimental results show up to 3X
energy saving for some applications in comparison to the host
CPU. To the best of our knowledge, the 24TB CompStor SSD is
the first one capable of supporting in-storage computation
running an operating system, enabling all types of applications
and Linux shell commands to be executed in-place with no
modification.
Keywords—near data processing, in-storage computing, in-situ processing, distributed processing, non-volatile memory, SSD, direct-attached storage
I. INTRODUCTION
By 2020, roughly 485 hyper-scale data centers will contain
47% of all servers in data centers worldwide with Big Data
applications driving the overall growth in stored data [1].
Moreover, the traffic within these storage units will quintuple by the same time. For a hyperscale storage architect, finding a way to avoid moving data from device to device, even within the same rack, will become a paramount need, as will reducing the power consumed to process and move this data. Webscale data centers have been actively developing new storage server architectures that favor a significant increase of capacity [2,3]. The advent of high-performance, high-capacity flash storage has changed the dynamics of the storage-compute relationship. Today, a handful of NVMe flash devices can easily saturate the PCIe bus complex of most servers. To address this significant bandwidth mismatch, a new paradigm is required that moves computing capabilities closer to data. This concept, which the authors refer to as in-situ processing, provides storage platforms with significant compute capabilities, reducing the computing demands on servers. In-situ processing takes advantage of distributed processing to:
• significantly reduce data transfers between SSD and host when the computation can take place in-situ;
• take advantage of the enormous aggregated bandwidth at the media interface and reduce data ingestion time by more than an order of magnitude;
• achieve very high performance for IO-intensive applications that can be parallelized.
When storage systems were built around hard disk drives, this balance favored the CPU, which often sat idle waiting for the storage system to respond, even if there were hundreds of disk drives all operating in parallel. With flash storage media, the reverse is true [4]. Fig. 1 highlights the bandwidth mismatch between the flash media and the host CPU in storage servers being designed for webscale data centers. For example, in the Open Compute storage server proposed in [2,3], the mismatch can be as high as 80x, which translates into a dramatic slow-down for data access and processing. In-situ processing can alleviate this problem since it removes the need to move data from the media to the host CPU.

Fig. 1. Bandwidth mismatch in high-capacity storage servers: each 8TB SSD provides roughly 16 channels x 533 MB/s (~8.5 GB/s) at the flash interface but delivers about 2.0 GB/s to the host, while a 64-drive server aggregates 1024 channels x 533 MB/s (~545 GB/s) at the media against 16 lanes of PCIe (~16 GB/s) at the host.
In addition, energy consumption is a major challenge in current datacenter architectures. In fact, the energy consumption and cost of moving data will only be accentuated as capacities increase [1]. Our experimental results, presented in Section IV, show that significant energy savings are achievable.
Moving applications closer to the data instead of moving data to the applications is a viable solution to this problem [5-8], and there have been significant efforts to make this idea applicable to storage systems.
In this paper, we present CompStor, an in-situ processing SSD framework encompassing in-storage processing SSD hardware and a software stack for communication between a host and a PCIe/NVMe-attached in-situ processing SSD. The current version of the software stack also supports communication between a single host and multiple SSDs, a feature that is exhaustively tested in the experimental results described later in this paper. CompStor addresses the limitations of prior work discussed in the next section and adds critical features and capabilities to storage systems.
Table I compares critical features of different state-of-the-art solutions, including CompStor, and clearly shows the motivations that drive the development of CompStor. The works listed in Table I are discussed in the next section.
We can summarize the main contributions of this paper as follows:
• Pushing the "move computation to data" paradigm to its ultimate limit by enabling highly efficient and flexible in-situ processing capability in solid-state drives, thanks to several architectural innovations.
• Design of an SSD controller with dedicated hardware and software resources providing in-situ data analytics capability without degrading the performance of common storage device functions such as read, write, erase and trim.
• An SSD controller architecture that supports the use of a Linux OS dedicated to in-situ processing. This OS provides a familiar execution environment for existing (big data) applications and makes porting effortless. In contrast to the previous works reviewed in the next section, CompStor can run executable code as well as shell scripts. In addition, one of the most useful features of CompStor is dynamic task loading, i.e. the ability to load tasks into a computational SSD at runtime. These features cannot be provided concurrently without an operating system running inside the SSD.
• A complete software stack to support scalability, consistency and system-level parallelism.
• A comparative analysis of energy consumption and performance using real-world use cases.
The rest of this paper is organized as follows. In Section II, we discuss some of the state-of-the-art related works. A detailed explanation of the hardware and software architecture of CompStor is presented in Section III. In Section IV, we introduce our 24TB in-situ processing SSD prototype and report experimental results regarding energy consumption and performance improvements. Finally, Section V concludes the paper.
TABLE I. COMPARISON OF IN-STORAGE COMPUTATION RELATED WORK
(criteria compared: prototype description, dynamic task loading, programming library, OS-level flexibility)
Jun [13]: FPGA-based SSD / FPGA accelerator
Abbani [23]: FPGA-based SSD / soft microprocessor
Kang [17]: off-the-shelf SATA SSD / 2 ARM cores (unknown)
Kim [15]: simulation model / ARM A9 (simulated)
Tiwari [16]: model / ARM A9 (modeled)
Gu [19]: off-the-shelf NVMe SSD / ARM R7
Gao [20]: simulation model / ARM A7 (modeled)
CompStor: 24TB NVMe SSD / quad-core ARM A53
II. RELATED WORK
In-storage processing is not a new concept [9,10].
However, in recent years several attempts have been made at
developing technologies that translate into performance
improvement and power savings in the data center [11-20].
Seagate introduced Kinetic hard disk drives as object-oriented storage devices a few years ago [24]. This type of storage is fundamentally different from traditional block-oriented storage. Instead of using read and write commands to access data blocks, Kinetic HDDs provide object-level data access via a RESTful API. In other words, the operating system does not have to deal with low-level file system block addresses; instead, it can simply read an object, such as a file or an image, by referring to its object identifier. In-situ processing is orthogonal to object-oriented storage, which means a storage device could support in-situ processing, object orientation, or both at the same time. The in-situ processing idea proposes in-storage data processing regardless of whether the data is stored as an object or as a series of data blocks.
Jun et al. proposed BlueDBM [12], a scalable architecture combining flash storage with FPGA-based in-storage processing. Pure FPGA accelerators provide power efficiency but lack flexibility for addressing different types of applications. Moreover, despite the existence of high-level synthesis tools, there are still many steps required to generate an RTL design and bitstream files. The extra time it takes to generate the RTL design makes it impractical to reconfigure the FPGA frequently to accommodate different applications, and FPGA-based accelerators are often specialized for specific applications. An extended version of BlueDBM [13] provides experimental results for more storage nodes and real applications.
In [23], a simple operating system is provided, composed of drive-resident utility programs running on top of a MicroBlaze soft microprocessor in an FPGA. This platform sacrifices FPGA efficiency because in-storage applications run as instruction-based executables on a soft microprocessor in order to avoid the difficulty of generating RTL code for each application.
Some recent works [14-16,19] exploit embedded processors in SSD controllers originally designed to execute SSD firmware for flash management purposes. They all suffer from the limited processing power of embedded processors that are not dedicated to user data processing tasks. Exploiting the flash management firmware processor will inevitably interfere with critical flash controlling tasks such as garbage collection and wear leveling, and impact user read/write performance. Smart SSD [17] introduces a kernel module on the host side to schedule and coordinate tasks between the host and SSDs; the module shares workloads between host and device. In [14], Do et al. used an off-the-shelf SSD with an extended version of Microsoft SQL Server and ran selection and aggregation queries. However, an arbitrary application needs significant modifications to become executable on the SSD's embedded processor, because most applications are heavily linked to libraries that exist in the Linux OS or can be added to it. In [15], a special-purpose architecture is proposed for the scan and join operations in databases, resulting in a very narrow scope of usage. In [16], the authors target a special category of applications, scientific simulation, where simulation results typically require a sequence of data analysis tasks; the data analysis part is executed on one of the SSD's embedded processors. As one would expect, an embedded processor targeted for SSD management does not provide the necessary performance for many big data applications, as alluded to earlier.
Biscuit [19] is the framework closest to our approach. Biscuit uses ARM Cortex-R7 embedded processors inside the storage device, shared with the SSD controller. This approach can degrade the performance of the storage device, and it restricts designers to a defined programming model in order to use in-storage processing.
In [20], Gao et al. proposed a heterogeneous architecture for the near-data processing unit, constructed from fine-grained configurable logic blocks (CLBs, similar to those in FPGAs) and coarse-grained functional units (similar to those in CGRAs). There are, however, significant difficulties in converting software applications to run on this platform.
Overall, the limitations of previous work can be summarized as follows:
• no dedicated hardware resources for in-situ processing that would guarantee unchanged performance of read and write commands;
• no support for an operating system running in-storage;
• no software stack that truly supports scalability of storage capacity and processing power, consistency, and system-level parallelism;
• low processing power, supporting only a limited domain of applications.
III. COMPSTOR ARCHITECTURE
In a conventional SSD, there are two processing subsystems. The front-end subsystem talks to the host and takes care of the PCIe/NVMe protocols, while the back-end subsystem handles flash management tasks such as garbage collection. These two subsystems cooperate to handle the host's read and write commands. In our design, we first implemented both subsystems in an FPGA and tested the resulting SSD with different benchmarks; as expected, it behaves like a conventional SSD. We then modified the architecture to support in-storage processing without interfering with the basic tasks of an SSD.
With a computational SSD, for the computation to take place, only a command and the resulting data need to be transferred over the storage interface, which greatly reduces interface traffic and significantly lowers the required power. In this section, we cover the hardware architecture and software stack implemented in the CompStor platform, which provide in-situ processing without degrading the performance of storage data access.
A. CompStor Hardware Architecture
Fig. 2 shows a host attached to N CompStors via a PCIe root complex and switch. The PCIe root complex together with the PCIe switch allows several PCIe endpoints to be connected to a single host. The figure also gives a high-level view of the CompStor hardware architecture. The SSD controller of this platform contains all the common subsystems necessary for the implementation of a very high capacity enterprise-grade SSD, such as embedded real-time processors, PCIe and NVMe controllers, a fast-release host data buffer, an advanced ECC engine, a programmable flash media interface, and encryption. These modules are not shown in Fig. 2 for the sake of simplicity.

Fig. 2. An in-situ processing system containing a host and several CompStors

In addition to these foundational elements, CompStor is equipped with a dedicated in-situ processing subsystem (ISPS) with full access to the flash media. Table II describes the CompStor processor subsystem specifications.

TABLE II. ISPS CHARACTERISTICS
64-bit quad-core ARM Cortex-A53 @ 1.5 GHz
32KB I-cache & D-cache
1MB L2 cache
8GB DDR4 @ 2133 MT/s

The choice of an application processor subsystem dedicated to in-storage computation, with fully isolated control and data paths, is critical to ensuring:
• the ability to port a Linux operating system;
• concurrent data processing and storage functionality without degradation of either one.
The ISPS has a direct connection to the flash management processor that provides flash media read/write accesses. We have modified the SSD controller hardware and software to provide a high-bandwidth, low-latency data path between the ISPS and the flash media interface. In other words, the ISPS can access the flash data more efficiently than the host CPU.
B. CompStor Software Stack
In CompStor, a host-side client controls the in-situ processing flow, so from a master-slave perspective, the client is the master while CompStor behaves as a slave. The client is responsible for performing a defined sequence of steps: sending an in-situ task to CompStor, waiting for the completion of the task, and receiving back the results of the execution. In this section, the software stack that takes the user through these steps is discussed.
A set of entities is shared between all components in the software stack. They are virtual entities traveling through the layers of the software stack to deliver information, and they may get encapsulated into other entities. Each layer may either process them or redirect them to the next layer. These entities and their uses are defined below (a minimal illustrative sketch of the corresponding data structures follows the list):
• Command: A data structure containing detailed information about an in-situ computation task, including the names of the input and output files, the Linux shell command/script or the application name, the arguments to be passed to the application, and access permissions. Linux OS support enables dynamic task loading.
• Response: A data structure containing information about the outcome of an in-storage computation task, such as the final status of the command and the time consumed to execute it inside CompStor.
• Minion: A virtual entity that travels from a client to a CompStor and delivers a command. It then waits until the in-situ processing is done in order to deliver the response back to the client. This entity is composed of a command and a response; the fields of the command are populated by the client, while the fields of the response are populated by CompStor. Fig. 3 depicts a minion containing a command and a response traveling between a client and a CompStor: on the way to the CompStor only the command is populated, and on the way back the response is populated as well.

Fig. 3. The minion virtual entity
• Query: A virtual entity that travels from a client to a CompStor to deliver an administrative message. Similar to a minion, it travels back to the client after delivering the message, but it cannot trigger an in-situ processing task. Instead, it can load an executable at runtime (dynamic task loading) or retrieve information about the current status of CompStor, such as the utilization of the ARM cores or their temperature. This information can be used for load balancing.
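To make these entities more concrete, the following is a minimal C++ sketch of how the command, response and minion data structures described above could look. All type and field names are illustrative assumptions derived from the descriptions in this section, not the actual CompStor definitions.

#include <cstdint>
#include <string>
#include <vector>

// Hypothetical layout of the shared entities; field names are assumptions
// based on the descriptions above, not the actual CompStor definitions.
struct Command {
    std::string executable;                // application name or Linux shell command/script
    std::vector<std::string> arguments;    // arguments passed to the application
    std::vector<std::string> input_files;  // input file names on the SSD
    std::vector<std::string> output_files; // output file names on the SSD
    uint32_t access_permissions = 0;       // access permissions for the task
};

struct Response {
    int exit_status = 0;            // final status of the command
    double execution_time_s = 0.0;  // time consumed to execute it inside CompStor
};

// A minion carries a command to a CompStor and brings the response back.
struct Minion {
    Command command;    // populated by the client
    Response response;  // populated by CompStor
};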
The software stack is spread over the host and CompStor, and each layer accomplishes specific tasks and serves the other layers. The commands, responses, minions, and queries are the only entities traveling from one layer to another. Fig. 4 depicts the software stack architecture and how the layers communicate with each other. These layers are defined as follows:
Off-loadable executable: A C/C++ application, a Linux shell command/script, or a combination of both. The application the user aims to run on the CompStor embedded Linux can be the same source code the user runs on the host OS, but it needs to be compiled with ARM compilers.
In-situ library: A C/C++ library that provides high-level APIs for the client and is statically linked to it. In contrast to some related works where an in-storage library is provided to rewrite the off-loadable application [5,6], the CompStor in-situ library is only intended to be used by the client, not by the off-loadable executable, which does not need any modification to be executed in CompStor.
Client: As mentioned before, a C/C++ application that controls the in-storage processing flow using the in-situ library (a minimal usage sketch is given after these definitions).
ISPS agent: A daemon running on CompStor which is responsible for receiving minions from clients and spawning in-storage processes based on the command inside the received minions. The daemon populates the response fields of the minion and sends it back to the client after task completion.
Flash access device driver: A Linux device driver inside the ISPS Linux OS that communicates with the SSD controller for flash read/write access. The flash access device driver abstracts the flash read/write accesses, so the off-loadable executable sees the flash memory as if it were running on the host CPU.
SSD controller software: The software responsible for flash management, garbage collection, and table-keeping tasks. This software handles host read/write commands and also provides efficient flash read/write access for the flash access device driver.
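As an illustration of how a client might drive this flow, the sketch below sends a single minion to one CompStor and waits for the response. The header name, the isp_* functions, and the device and file paths are hypothetical placeholders; the actual CompStor in-situ library API is not specified in this paper.

#include <cstdio>
#include "insitu_library.h"  // hypothetical header: assumed to declare the Minion
                             // structure sketched earlier and the isp_* calls

int main() {
    // Open a handle to one PCIe/NVMe-attached CompStor (device path is illustrative).
    isp_device_t dev = isp_open("/dev/nvme0n1");

    // Configure a minion: the command names the off-loadable executable or shell
    // command, its arguments, and the input/output files that reside on the SSD.
    Minion m;
    m.command.executable   = "grep";
    m.command.arguments    = {"-c", "storage"};
    m.command.input_files  = {"books/corpus.txt"};
    m.command.output_files = {"results/matches.txt"};

    // Send the minion and block until the ISPS agent returns it with the
    // response fields populated (steps 1 and 6 of Table III).
    isp_send_minion(dev, m);
    isp_wait_minion(dev, m);

    std::printf("status=%d, in-storage time=%.2f s\n",
                m.response.exit_status, m.response.execution_time_s);
    isp_close(dev);
    return 0;
}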
Fig. 4. Software stack: on the host, the client linked with the in-situ library; on the CompStor ISPS, the ISPS agent, the off-loaded executables, the flash access device driver, and the SSD controller software. Numbered arrows correspond to the steps in Table III.
TABLE III. LIFETIME OF A MINION
Step 1: The host-side client configures a minion and sends it to the ISPS agent using the in-situ library APIs.
Step 2: The ISPS agent extracts the command from the received minion and spawns the off-loadable executable or the Linux shell command/script specified by the command.
Step 3: At runtime, the executable accesses the flash storage through the device driver.
Step 4: The device driver sends read/write commands to the flash controller.
Step 5: At runtime, the ISPS agent keeps track of the status of the in-situ processing.
Step 6: At the end, the ISPS agent populates the response fields of the minion and sends the minion back to the client.
Whenever a client launches a minion, it triggers multiple messages passing between different software layers. Table III describes the lifetime of a minion, from the time it is configured in the client until it delivers the result back to the client. The step numbers match the labels in Fig. 4.
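The device-side counterpart can be pictured as a simple daemon loop. The sketch below mirrors steps 2, 5 and 6 of Table III in a simplified, single-task form; the agent_* helpers and the header are hypothetical placeholders, since the internals of the ISPS agent are not described at code level in this paper.

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include "isps_agent.h"  // hypothetical header: Minion type plus agent_* helpers

// Simplified, single-task-at-a-time sketch of the ISPS agent daemon.
int main() {
    while (true) {
        // Step 2: receive a minion from a client and extract its command.
        Minion m = agent_receive_minion();

        pid_t pid = fork();
        if (pid == 0) {
            // Child: spawn the off-loadable executable or shell command.
            // At runtime it reaches the flash through the flash access device
            // driver as if it were ordinary storage (steps 3 and 4).
            execlp("/bin/sh", "sh", "-c", m.command.executable.c_str(),
                   (char*)nullptr);
            _exit(127);  // exec failed
        }

        // Step 5: keep track of the status of the in-situ processing.
        int status = 0;
        waitpid(pid, &status, 0);

        // Step 6: populate the response fields and return the minion to the client.
        m.response.exit_status = WEXITSTATUS(status);
        agent_send_minion_back(m);
    }
}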
A CompStor client is able to send several concurrent minions to different CompStors. This gives the client the ability to trigger multiple parallel in-storage processing requests within a storage node. Considering a data center containing hundreds of CompStor-equipped storage nodes, there could be thousands of concurrent minions, resulting in heavy parallelism at the storage unit level.
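To illustrate this fan-out pattern, a client could dispatch one minion per device and collect the responses concurrently, for example with std::async. This sketch reuses the hypothetical isp_* API from the earlier example and assumes one minion has been prepared per device.

#include <cstddef>
#include <future>
#include <string>
#include <vector>
#include "insitu_library.h"  // hypothetical header from the earlier sketch

// Dispatch one pre-configured minion per CompStor device and wait for all
// responses; dev_paths[i] and minions[i] are assumed to correspond.
std::vector<Minion> run_on_all_devices(const std::vector<std::string>& dev_paths,
                                       const std::vector<Minion>& minions) {
    std::vector<std::future<Minion>> pending;
    for (std::size_t i = 0; i < dev_paths.size(); ++i) {
        pending.push_back(std::async(std::launch::async, [&, i] {
            isp_device_t dev = isp_open(dev_paths[i].c_str());
            Minion m = minions[i];
            isp_send_minion(dev, m);  // each device processes its share in-situ
            isp_wait_minion(dev, m);  // blocks until its ISPS agent replies
            isp_close(dev);
            return m;
        }));
    }
    std::vector<Minion> results;
    for (auto& f : pending) results.push_back(f.get());
    return results;
}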
IV. PROTOTYPE AND EXPERIMENTAL RESULTS
In this section, we introduce a fully functional 24TB
CompStor NVMe SSD prototype and run several experiments
to assess energy consumption and performance.
A. CompStor prototype
The prototype is an NVMe over PCIe SSD with an FPGA-based enterprise-class SSD controller coupled with an in-situ processing subsystem built around a quad-core 64-bit ARM A53 application processor.
Fig. 5 shows the prototype developed for the experiments described in this section. For the prototype, we built two boards, one for the basic SSD controller modules together with the flash memories and another one for the ISPS. The ISPS component is attached to the main board as a daughter board via a high-speed FMC connector, providing a seamless connection with the other components of the SSD controller. On the main board we used a Xilinx Virtex-7 2000T FPGA, while the ISPS is built on a Xilinx Zynq UltraScale+ MPSoC. The latter is an SoC containing an FPGA together with a quad-core 64-bit ARM A53 processor and two ARM R5 cores. This SoC also contains a graphics processing unit (GPU) and a set of ASIC modules such as encryption and decryption units; however, we have not used these facilities in the current version of CompStor. To the best of our knowledge, this prototype is the first Linux-powered SSD equipped with a complete software stack for sending in-storage processing commands to the device and receiving the results back from the SSD.
We have also considered the cost of the ISPS implementation relative to the cost of manufacturing the whole SSD. Our analysis shows that the cost of implementing the in-situ computation capability is less than 8% of the whole SSD, because the major costs are related to the flash media modules and the SSD controller logic. In CompStor, the ISPS, which consists mainly of hard ASIC ARM cores, is relatively inexpensive compared to the other hardware and software modules.
B. System setup
Two identical servers have been used for the experiments described in this section (see specifications in Table IV). In one of the servers we used an off-the-shelf enterprise SSD, while in the other CompStor was used as a direct-attached storage device. The arguments, input files, and test scripts are the same for both servers.
Since processing huge plain text files is common in datacenter applications, for the experiments described in this section we prepared a dataset containing 348 large compressed text files. These text files are books in different fields transformed into plain text. The total size of the dataset is about 11.3GB. The books are individually compressed using the bzip2 and gzip algorithms.
The user applications selected for the experiments include both IO-intensive and compute-intensive applications. Compression and decompression functions are commonly used in big data analytics frameworks; we used gzip/gunzip and bzip2/bunzip2 as representatives of the compute-intensive class of applications. For the IO-intensive experiments, two search applications were selected: grep and gawk [21]. Grep is a Linux shell command designed to search text inputs, while the gawk utility searches text and makes changes based on user-specified patterns.

TABLE IV. SERVER SPECIFICATION
CPU type: Intel Xeon E5-2620 v4
Memory: 32 GB DDR4
Operating system: Ubuntu 16.04
Off-the-shelf SSD: 256GB NVMe SSD
In-situ SSD: CompStor 24TB NVMe SSD

Fig. 5. The CompStor prototype
C. Experimental Results
Performance Experiments: In the first set of experiments, we used a host-side client to trigger in-situ processing and ran the applications on CompStor. Obviously, the performance of one CompStor with a quad-core ARM processor is lower than that of a high-end Xeon processor. However, the performance of in-storage computation systems scales linearly with the number of storage devices, as is the case for storage servers architected as in [2,3]. Fig. 6 depicts how performance scales linearly with capacity, i.e. with the number of CompStor devices. For this experiment, several CompStors are attached to a single host via PCIe slots.
Even though highly parallel in-storage computation performance can equal or surpass that of a server CPU, it makes sense to consider that one augments the other, resulting in a higher-performance and more efficient system. Fig. 7 depicts the performance of the Xeon processor combined with the performance of multiple CompStor devices when running the bzip2 compression algorithm. In this experiment, we distributed the whole set of input files between the host and several CompStors; the performance of the CompStors and the host was then measured separately.

Fig. 6. Performance experimental results (scaling with the number of CompStor devices)
Fig. 7. Aggregated system performance for compression using bzip2
This result shows that in-situ processing adds comparable processing power to the whole system, while the manufacturing cost of adding this feature to the SSD is reasonable. In addition, regarding energy consumption, CompStor achieves a compelling result thanks to a more efficient data access link and a more power-efficient processing engine in comparison with the host.
Energy Consumption Experiment: In this experiment, we measured the energy efficiency of the server using conventional storage devices compared to the server equipped with CompStor. In the latter case, for the computation to take place, only the computational request and the resulting data need to be transferred over the storage interface, greatly reducing the interface traffic and the required energy.
Fig. 8. Energy consumption per gigabyte of data (Joule/GB) for CompStor versus the Xeon E5-2620 host, for the compression/decompression workloads (gzip, gunzip, bzip2, bunzip2) and the search workloads (grep, gawk).
The reason we chose energy consumption over power consumption is to make the results of these experiments independent of the performance of the systems. We ran the experiments on both servers as described in Section IV.B and measured the energy consumed when executing the compression/decompression and search applications. The energy consumption is measured as the average power consumption multiplied by the time consumed to execute the benchmarks; we therefore measured the average power consumption over time as well as the execution time, and calculated the energy consumption accordingly.
Results were normalized per gigabyte of data processed, i.e. Watts per GB/s or, equivalently, Joules/GB, as shown in Fig. 8, so the energy consumption results in this figure are independent of the number of CompStors executing the benchmarks. In other words, considering a fixed amount of input data per CompStor, as we increase the number of CompStors the volume of input data and the energy consumption both increase linearly, so the energy consumption per unit of input data remains independent of the number of CompStors.
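Stated as a formula, the normalized metric reported in Fig. 8 is

\[
E_{\mathrm{per\,GB}} \;=\; \frac{P_{\mathrm{avg}} \cdot t_{\mathrm{exec}}}{D_{\mathrm{input}}} \quad \mathrm{[J/GB]},
\]

where P_avg is the measured average power draw, t_exec is the execution time of the benchmark, and D_input is the size of the processed input data in gigabytes.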
V. CONCLUSION
We have presented a novel approach that pushes the "move computation to data" paradigm to its ultimate limit by architecting and designing a highly efficient and flexible in-storage processing capability in solid state drives. We have designed CompStor, an in-storage computation platform with the appropriate software stack (devices, protocol, interface, software, and systems) and dedicated hardware for in-situ processing, including a powerful multi-core application processor subsystem. The dedicated hardware resources provide in-situ data analytics capability without degrading the performance of common storage device functions such as read, write and trim. By moving data analysis tasks closer to where the data resides, these storage devices dramatically reduce the storage bandwidth bottleneck and the data movement cost, and improve the overall energy efficiency. Experimental results show up to 3X energy saving for some applications in comparison to the host CPU. To the best of our knowledge, the 24TB CompStor SSD is the first one capable of supporting in-storage computation
running an operating system, enabling all types of applications
and Linux shell commands to be executed in-place with no
modification.
REFERENCES
[1] Cisco, "Cisco Global Cloud Index: Forecast and Methodology, 2015–2020," white paper, 2016. [Online]. URL: https://www.cisco.com/c/dam/en/us/solutions/collateral/service-provider/global-cloud-index-gci/white-paper-c11-738085.pdf
[2] S. Tavallaei, "Microsoft Project Olympus Hyperscale GPU Accelerator (HGX-1)," CSI, Azure Cloud, tech blog, 2017. [Online]. URL: https://azure.microsoft.com/mediahandler/files/resourcefiles/00c18868-eba9-43d5-b8c6-e59f9fa219ee/HGX-1%20Blog_5_26_2017.pdf
[3] S. Jacobs and C. P. Bean, "Fine particles, thin films and exchange anisotropy," in Magnetism, vol. III, G. T. Rado and H. Suhl, Eds. New York: Academic, 1963, pp. 271–350.
[4] M. A. Shaw, "Project Olympus Flash Expansion FX-16," tech blog, 2017. [Online]. URL: http://www.opencompute.org/blog/ocp-us-summit-2017-and-now-a-word-from-our-sponsors/
[5] Q. Xu, H. Siyamwala, M. Ghosh, T. Suri, M. Awasthi, Z. Guz, A. Shayesteh, and V. Balakrishnan, "Performance analysis of NVMe SSDs and their implication on real world databases," in Proceedings of the 8th ACM International Systems and Storage Conference (SYSTOR), ACM, 2015.
[6] J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi, "A scalable processing-in-memory accelerator for parallel graph processing," in Proceedings of the International Symposium on Computer Architecture (ISCA), pp. 105–117, 2015.
[7] S. Cho, C. Park, H. Oh, S. Kim, Y. Yi, and G. R. Ganger, "Active disk meets flash: A case for intelligent SSDs," in Proceedings of the 27th International ACM Conference on International Conference on Supercomputing (ICS '13), pp. 91–102, ACM, 2013.
[8] C. Li, Y. Hu, L. Liu, J. Gu, M. Song, X. Liang, J. Yuan, and T. Li, "Towards sustainable in-situ server systems in the big data era," in Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA '15), New York, NY, USA, pp. 14–26, ACM, 2015.
[9] Y. Kang, Y. Kee, E. L. Miller, and C. Park, "Enabling cost-effective data processing with smart SSD," in Proceedings of the IEEE 29th Symposium on Mass Storage Systems and Technologies (MSST '13), pp. 1–12, 2013.
[10] S. Y. W. Su and G. J. Lipovski, "CASSM: A cellular system for very large data bases," in Proceedings of the International Conference on Very Large Data Bases, pp. 456–472, Sept. 1975.
[11] E. Riedel, G. A. Gibson, and C. Faloutsos, "Active storage for large-scale data mining and multimedia," in Proceedings of the 24th International Conference on Very Large Data Bases (VLDB '98), pp. 62–73, Morgan Kaufmann, 1998.
[12] S.-W. Jun, M. Liu, K. E. Fleming, and Arvind, "Scalable multi-access flash store for big data analytics," in Proceedings of the 2014 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 55–64, 2014.
[13] S.-W. Jun, M. Liu, S. Lee, J. Hicks, J. Ankcorn, M. King, S. Xu, and Arvind, "BlueDBM: An appliance for big data analytics," in Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA), New York, NY, USA, pp. 1–13, ACM, 2015.
[14] J. Do, Y. S. Kee, J. M. Patel, C. Park, K. Park, and D. J. DeWitt, "Query processing on smart SSDs: Opportunities and challenges," in Proceedings of the ACM International Conference on Management of Data (SIGMOD), pp. 1221–1230, ACM, 2013.
[15] S. Kim, H. Oh, C. Park, S. Cho, S. W. Lee, and B. Moon, "In-storage processing of database scans and joins," Information Sciences: An International Journal, pp. 183–200, January 2016.
[16] D. Tiwari, S. Boboila, S. S. Vazhkudai, Y. Kim, X. Ma, P. J. Desnoyers, and Y. Solihin, "Active Flash: Towards energy-efficient, in-situ data analytics on extreme-scale machines," in Proceedings of FAST '13, pp. 119–132, 2013.
[17] Y. Kang, Y. S. Kee, E. Miller, and C. Park, "Enabling cost-effective data processing with smart SSD," in Proceedings of the IEEE 29th Symposium on Mass Storage Systems and Technologies (MSST), 2013.
[18] M. Gao and C. Kozyrakis, "HRL: Efficient and flexible reconfigurable logic for near-data processing," in Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), pp. 126–137, 2016.
[19] B. Gu, A. Yoon, D. Bae, I. Jo, J. Lee, J. Yoon, J. Kang, M. Kwon, C. Yoon, S. Cho, J. Jeong, and D. Chang, "Biscuit: A framework for near-data processing of big data workloads," in Proceedings of ISCA, 2016.
[20] M. Gao, G. Ayers, and C. Kozyrakis, "Practical near-data processing for in-memory analytics frameworks," in Proceedings of PACT-24, pp. 113–124, Oct. 2015.
[21] S. L. Xi, O. Babarinsa, M. Athanassoulis, and S. Idreos, "Beyond the wall: Near-data processing for databases," in Proceedings of the 11th International Workshop on Data Management on New Hardware, pp. 1–10, 2015.
[22] "GNU awk project," 2017. [Online]. URL: http://savannah.gnu.org/projects/gawk/
[23] N. Abbani et al., "A distributed reconfigurable active SSD platform for data intensive applications," in Proceedings of the IEEE 13th International Conference on High Performance Computing and Communications (HPCC), 2011.
[24] "The Seagate Kinetic Open Storage Vision," [Online]. URL: https://www.seagate.com/tech-insights/kinetic-vision-how-seagate-new-developer-tools-meets-the-needs-of-cloud-storage-platforms-master-ti/