CompStor: An In-Storage Computation Platform for
Scalable Distributed Processing
Mahdi Torabzadehkashi
EECS Department
University of California, Irvine
Irvine, USA
Torabzam@uci.edu
Siavash Rezaei
ICS Department
University of California, Irvine
Irvine, USA
siavashr@uci.edu
Vladimir Alves
NGD Systems, Inc
Irvine, USA
vladimir.alves@ngdsystems.com
Nader Bagherzadeh
EECS Department
University of California, Irvine
Irvine, USA
nader@uci.edu
Abstract—The explosion of data-centric and data-dependent
applications requires new storage devices, interfaces, and software
stacks. Big data analytics solutions such as Hadoop, MapReduce
and Spark have addressed the performance challenge by using a
distributed architecture based on a new paradigm that relies on
moving computation closer to data. In this paper, we describe a
novel approach aimed at pushing the “move computation to data”
paradigm to its ultimate limit by enabling highly efficient and
flexible in-storage processing capability in solid state drives
(SSDs). We have designed CompStor, an FPGA-based SSD that
implements computational storage through a software stack
(devices, protocol, interface, software, and systems) and a
dedicated hardware for in-storage processing including a quad-
core ARM processor subsystem. The dedicated hardware
resources provide in-storage data analytics capability without
degrading the performance of common storage device functions
such as read, write and trim. Experimental results show up to 3X
energy saving for some applications in comparison to the host
CPU. To the best of our knowledge, the 24TB CompStor SSD is
the first one capable of supporting in-storage computation
running an operating system, enabling all types of applications
and Linux shell commands to be executed in-place with no
modification.
Keywords—near data processing, in-storage computing, in-situ processing, distributed processing, non-volatile memory, SSD, direct-attached storage
I. INTRODUCTION
By 2020, roughly 485 hyper-scale data centers will contain
47% of all servers in data centers worldwide with Big Data
applications driving the overall growth in stored data [1].
Moreover, the traffic within these storage units will quintuple by the same time. For a hyperscale storage architect, finding a way to avoid moving data from device to device, even within the same rack, will become a paramount need, as will reducing the power consumed to process and move this data. Webscale data centers have been actively developing new storage server architectures that favor a significant increase of capacity [2,3]. The advent of high-performance, high-capacity flash storage has changed the dynamics of the storage-compute relationship. Today, a handful of NVMe flash devices can easily saturate the PCIe bus complex of most servers. To address this significant bandwidth mismatch, a new paradigm is required that moves computing capabilities closer to data. This concept, which the authors refer to as in-situ processing, provides storage platforms with significant compute capabilities, reducing the computing demands on servers. In-situ processing takes advantage of distributed processing to:
• significantly reduce data transfers between SSD and host when the computation can take place in-situ;
• take advantage of the enormous aggregated bandwidth at the media interface and reduce data ingestion time by more than an order of magnitude;
• achieve very high performance for IO-intensive applications that can be parallelized.
When storage systems were built around hard disk drives, this balance favored the CPU, which often sat idle waiting for the storage system to respond, even if there were hundreds of disk drives all operating in parallel. With flash storage media, the reverse is true [4]. Fig. 1 highlights the bandwidth mismatch between the flash media and the host CPU in storage servers being designed for webscale data centers. For example, in the Open Compute storage server proposed in [2,3], the mismatch can be as high as 80x, which translates into a dramatic slow-down for data access and processing. In-situ processing can alleviate this problem since it removes the need to move data from the media to the host CPU.

Fig. 1. Bandwidth mismatch in high-capacity storage servers: each 8TB SSD provides roughly 16 channels x 533 MB/s (~8.5 GB/s) at the flash interface but delivers about 2.0 GB/s to the host, while a 64-drive server aggregates 1024 channels x 533 MB/s (~545 GB/s) at the media against 16 lanes of PCIe (~16 GB/s) at the host.
In addition, energy consumption is a major challenge in current datacenter architectures. In fact, the energy consumption and cost of moving data will only be accentuated as capacities increase [1]. Our experimental results, presented in Section IV, show that significant energy savings are achievable.
Moving applications closer to the data instead of moving data to the applications is a viable solution to this problem [5-8], and there have been significant efforts to make this idea applicable to storage systems.
In this paper, we present CompStor, an in-situ processing SSD framework encompassing in-storage processing SSD hardware and a software stack for communication between a host and a PCIe/NVMe-attached in-situ processing SSD. The current version of the software stack also supports communication between a single host and multiple SSDs, a feature that is exhaustively tested in the experimental results described later in this paper. CompStor addresses the limitations of prior work discussed in the next section and adds critical features and capabilities to storage systems.
Table I compares critical features of different state-of-the-art solutions, including CompStor, and clearly shows the motivations that drive the development of CompStor. The works listed in Table I are discussed in the next section.
We can summarize the main contributions of this paper as follows:
• Pushing the "move computation to data" paradigm to its ultimate limit by enabling highly efficient and flexible in-situ processing capability in solid-state drives, thanks to several architectural innovations.
• Design of an SSD controller with dedicated hardware and software resources providing in-situ data analytics capability without degrading the performance of common storage device functions such as read, write, erase and trim.
• An SSD controller architecture that supports the use of a Linux OS dedicated to in-situ processing. This OS provides a familiar execution environment for existing (big data) applications and makes porting effortless. In contrast to the previous works reviewed in the next section, CompStor can run executable code as well as shell scripts. In addition, one of the most useful features of CompStor is dynamic task loading, i.e. the ability to load tasks into a computational SSD at runtime. These features cannot be provided concurrently without an operating system running inside the SSD.
• A complete software stack to support scalability, consistency and system-level parallelism.
• A comparative analysis of energy consumption and performance using real-world use cases.
The rest of this paper is organized as follows. In Section II, we discuss some of the state-of-the-art related works. A detailed explanation of the hardware and software architecture of CompStor is presented in Section III. In Section IV, we introduce our 24TB in-situ processing SSD prototype and report experimental results regarding energy consumption and performance improvements. Finally, Section V concludes the paper.
TABLE I. COMPARISON OF IN-STORAGE COMPUTATION RELATED WORK
(criteria compared: prototype description, dynamic task loading, programming library, OS-level flexibility)
Jun [13]: FPGA-based SSD / FPGA accelerator
Abbani [23]: FPGA-based SSD / soft microprocessor
Kang [17]: off-the-shelf SATA SSD / 2 ARM cores (unknown)
Kim [15]: simulation model / ARM A9 (simulated)
Tiwari [16]: model / ARM A9 (modeled)
Gu [19]: off-the-shelf NVMe SSD / ARM R7
Gao [20]: simulation model / ARM A7 (modeled)
CompStor: 24TB NVMe SSD / quad-core ARM A53
II. RELATED WORK
In-storage processing is not a new concept [9,10].
However, in recent years several attempts have been made at
developing technologies that translate into performance
improvement and power savings in the data center [11-20].
Seagate introduced Kinetic hard disk drives as object-oriented storage devices a few years ago [24]. This type of storage is fundamentally different from traditional block-oriented storage. Instead of using read and write commands to access data blocks, Kinetic HDDs provide object-level data access via a RESTful API. In other words, the operating system does not have to deal with low-level file system block addresses; instead, it can simply read an object, such as a file or an image, by referring to its object identifier. In-situ processing is orthogonal to object-oriented storage, which means a storage device could support in-situ processing, object orientation, or both at the same time. The in-situ processing idea proposes in-storage data processing regardless of whether the data is stored as an object or as a series of data blocks.
Jun et al. proposed BlueDBM [12], a scalable architecture combining flash storage with FPGA-based in-storage processing. Pure FPGA accelerators provide power efficiency but lack flexibility for addressing different types of applications. Moreover, despite the existence of high-level synthesis tools, there are still many steps required to generate an RTL design and bitstream files. The extra time it takes to generate the RTL design makes it impractical to reconfigure the FPGA frequently to accommodate different applications, and FPGA-based accelerators are often specialized for specific applications. An extended version of BlueDBM [13] provides experimental results for more storage nodes and real applications.
In [23], a simple operating system is provided, composed of drive-resident utility programs running on top of a MicroBlaze soft microprocessor in an FPGA. This platform sacrifices FPGA efficiency because in-storage applications run as instruction-based executables on a soft microprocessor in order to avoid the difficulty of generating RTL code for each application.
Some recent works [14-16,19] exploit embedded processors in SSD controllers originally designed to execute SSD firmware for flash management purposes. They all suffer from the limited processing power of embedded processors that are not dedicated to user data processing tasks. Exploiting the flash management firmware processor will inevitably interfere with critical flash controlling tasks such as garbage collection and wear leveling, and impact user read/write performance. Smart SSD [17] introduces a kernel module on the host side to schedule and coordinate tasks between the host and SSDs; the module shares workloads between host and device. In [14], Do et al. used an off-the-shelf SSD with an extended version of Microsoft SQL Server and ran selection and aggregation queries. However, an arbitrary application needs significant modifications to become executable on the SSD's embedded processor, because most applications are heavily linked to libraries that exist in the Linux OS or can be added to it. In [15], a special-purpose architecture is proposed for the scan and join operations in databases, resulting in a very narrow scope of usage. In [16], the authors target a special category of applications, scientific simulation, where simulation results typically require a sequence of data analysis tasks; the data analysis part is executed on one of the SSD's embedded processors. As one would expect, an embedded processor targeted for SSD management does not provide the necessary performance for many big data applications, as alluded to earlier.
Biscuit [19] is the framework closest to our approach. Biscuit uses ARM Cortex-R7 embedded processors inside the storage device, shared with the SSD controller. This approach can degrade the performance of the storage device, and it restricts designers to a defined programming model in order to use in-storage processing.
In [20], Gao et al. proposed a heterogeneous architecture for the near-data processing unit, constructed from fine-grained configurable logic blocks (CLBs, similar to those in FPGAs) and coarse-grained functional units (similar to those in CGRAs). There are, however, significant difficulties in converting software applications to run on this platform.
Overall, the limitations of previous work can be summarized as follows:
• no dedicated hardware resources for in-situ processing that would guarantee unchanged performance of read and write commands;
• no support for an operating system running in-storage;
• no software stack that truly supports scalability of storage capacity and processing power, consistency, and system-level parallelism;
• low processing power, supporting only a limited domain of applications.
III. COMPSTOR ARCHITECTURE
In a conventional SSD, there are two processing subsystems. The front-end subsystem talks to the host and takes care of the PCIe/NVMe protocols, while the back-end subsystem handles flash management tasks such as garbage collection. These two subsystems cooperate to handle the host's read and write commands. In our design, we first implemented both subsystems in an FPGA and tested the resulting SSD with different benchmarks; as expected, it behaves like a conventional SSD. We then modified the architecture to support in-storage processing without interfering with the basic tasks of an SSD.
With a computational SSD, for the computation to take place, only a command and the resulting data need to be transferred over the storage interface, which greatly reduces interface traffic and significantly lowers the required power. In this section, we cover the hardware architecture and software stack implemented in the CompStor platform, which provide in-situ processing without degrading the performance of storage data access.
A. CompStor Hardware Architecture
Fig. 2 shows a host attached to N CompStors via a PCIe root complex and switch. The PCIe root complex together with the PCIe switch allows several PCIe endpoints to be connected to a single host. The figure also gives a high-level view of the CompStor hardware architecture. The SSD controller of this platform contains all the common subsystems necessary for the implementation of a very high capacity enterprise-grade SSD, such as embedded real-time processors, PCIe and NVMe controllers, a fast-release host data buffer, an advanced ECC engine, a programmable flash media interface, and encryption. These modules are not shown in Fig. 2 for the sake of simplicity.

Fig. 2. An in-situ processing system containing a host and several CompStors

In addition to these foundational elements, CompStor is equipped with a dedicated in-situ processing subsystem (ISPS) with full access to the flash media. Table II describes the CompStor processor subsystem specifications.

TABLE II. ISPS CHARACTERISTICS
64-bit quad-core ARM Cortex-A53 @ 1.5 GHz
32KB I-cache & D-cache
1MB L2 cache
8GB DDR4 @ 2133 MT/s

The choice of an application processor subsystem dedicated to in-storage computation, with fully isolated control and data paths, is critical to ensuring:
• the ability to port a Linux operating system;
• concurrent data processing and storage functionality without degradation of either one.
The ISPS has a direct connection to the flash management processor that provides flash media read/write accesses. We have modified the SSD controller hardware and software to provide a high-bandwidth, low-latency data path between the ISPS and the flash media interface. In other words, the ISPS can access the flash data more efficiently than the host CPU.
B. CompStor Software Stack
In CompStor, a host-side client controls the in-situ processing flow, so from a master-slave perspective, the client is the master while CompStor behaves as a slave. The client is responsible for performing a defined sequence of steps: sending an in-situ task to CompStor, waiting for the completion of the task, and receiving back the results of the execution. In this section, the software stack that takes the user through these steps is discussed.
A set of entities is shared between all components in the software stack. They are virtual entities traveling through the layers of the software stack to deliver information, and they may get encapsulated into other entities. Each layer may either process them or redirect them to the next layer. These entities and their uses are defined below (a minimal illustrative sketch of the corresponding data structures follows the list):
• Command: A data structure containing detailed information about an in-situ computation task, including the names of the input and output files, the Linux shell command/script or the application name, the arguments to be passed to the application, and access permissions. Linux OS support enables dynamic task loading.
• Response: A data structure containing information about the outcome of an in-storage computation task, such as the final status of the command and the time consumed to execute it inside CompStor.
• Minion: A virtual entity that travels from a client to a CompStor and delivers a command. It then waits until the in-situ processing is done in order to deliver the response back to the client. This entity is composed of a command and a response; the fields of the command are populated by the client, while the fields of the response are populated by CompStor. Fig. 3 depicts a minion containing a command and a response traveling between a client and a CompStor: on the way to the CompStor only the command is populated, and on the way back the response is populated as well.

Fig. 3. The minion virtual entity
• Query: A virtual entity that travels from a client to a CompStor to deliver an administrative message. Similar to a minion, it travels back to the client after delivering the message, but it cannot trigger an in-situ processing task. Instead, it can load an executable at runtime (dynamic task loading) or retrieve information about the current status of CompStor, such as the utilization of the ARM cores or their temperature. This information can be used for load balancing.
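To make these entities more concrete, the following is a minimal C++ sketch of how the command, response and minion data structures described above could look. All type and field names are illustrative assumptions derived from the descriptions in this section, not the actual CompStor definitions.

#include <cstdint>
#include <string>
#include <vector>

// Hypothetical layout of the shared entities; field names are assumptions
// based on the descriptions above, not the actual CompStor definitions.
struct Command {
    std::string executable;                // application name or Linux shell command/script
    std::vector<std::string> arguments;    // arguments passed to the application
    std::vector<std::string> input_files;  // input file names on the SSD
    std::vector<std::string> output_files; // output file names on the SSD
    uint32_t access_permissions = 0;       // access permissions for the task
};

struct Response {
    int exit_status = 0;            // final status of the command
    double execution_time_s = 0.0;  // time consumed to execute it inside CompStor
};

// A minion carries a command to a CompStor and brings the response back.
struct Minion {
    Command command;    // populated by the client
    Response response;  // populated by CompStor
};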
The software stack is spread over the host and CompStor, and each layer accomplishes specific tasks and serves the other layers. The commands, responses, minions, and queries are the only entities traveling from one layer to another. Fig. 4 depicts the software stack architecture and how the layers communicate with each other. These layers are defined as follows:
Off-loadable executable: A C/C++ application, a Linux shell command/script, or a combination of both. The application the user aims to run on the CompStor embedded Linux can be the same source code the user runs on the host OS, but it needs to be compiled with ARM compilers.
In-situ library: A C/C++ library that provides high-level APIs for the client and is statically linked to it. In contrast to some related works where an in-storage library is provided to rewrite the off-loadable application [5,6], the CompStor in-situ library is only intended to be used by the client, not by the off-loadable executable, which does not need any modification to be executed in CompStor.
Client: As mentioned before, a C/C++ application that controls the in-storage processing flow using the in-situ library (a minimal usage sketch is given after these definitions).
ISPS agent: A daemon running on CompStor which is responsible for receiving minions from clients and spawning in-storage processes based on the command inside the received minions. The daemon populates the response fields of the minion and sends it back to the client after task completion.
Flash access device driver: A Linux device driver inside the ISPS Linux OS that communicates with the SSD controller for flash read/write access. The flash access device driver abstracts the flash read/write accesses, so the off-loadable executable sees the flash memory as if it were running on the host CPU.
SSD controller software: The software responsible for flash management, garbage collection, and table-keeping tasks. This software handles host read/write commands and also provides efficient flash read/write access for the flash access device driver.
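As an illustration of how a client might drive this flow, the sketch below sends a single minion to one CompStor and waits for the response. The header name, the isp_* functions, and the device and file paths are hypothetical placeholders; the actual CompStor in-situ library API is not specified in this paper.

#include <cstdio>
#include "insitu_library.h"  // hypothetical header: assumed to declare the Minion
                             // structure sketched earlier and the isp_* calls

int main() {
    // Open a handle to one PCIe/NVMe-attached CompStor (device path is illustrative).
    isp_device_t dev = isp_open("/dev/nvme0n1");

    // Configure a minion: the command names the off-loadable executable or shell
    // command, its arguments, and the input/output files that reside on the SSD.
    Minion m;
    m.command.executable   = "grep";
    m.command.arguments    = {"-c", "storage"};
    m.command.input_files  = {"books/corpus.txt"};
    m.command.output_files = {"results/matches.txt"};

    // Send the minion and block until the ISPS agent returns it with the
    // response fields populated (steps 1 and 6 of Table III).
    isp_send_minion(dev, m);
    isp_wait_minion(dev, m);

    std::printf("status=%d, in-storage time=%.2f s\n",
                m.response.exit_status, m.response.execution_time_s);
    isp_close(dev);
    return 0;
}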
Fig. 4. Software stack: on the host, the client linked with the in-situ library; on the CompStor ISPS, the ISPS agent, the off-loaded executables, the flash access device driver, and the SSD controller software. Numbered arrows correspond to the steps in Table III.
TABLE III. LIFETIME OF A MINION
Step 1: The host-side client configures a minion and sends it to the ISPS agent using the in-situ library APIs.
Step 2: The ISPS agent extracts the command from the received minion and spawns the off-loadable executable or the Linux shell command/script specified by the command.
Step 3: At runtime, the executable accesses the flash storage through the device driver.
Step 4: The device driver sends read/write commands to the flash controller.
Step 5: At runtime, the ISPS agent keeps track of the status of the in-situ processing.
Step 6: At the end, the ISPS agent populates the response fields of the minion and sends the minion back to the client.
Whenever a client launches a minion, it triggers multiple messages passing between different software layers. Table III describes the lifetime of a minion, from the time it is configured in the client until it delivers the result back to the client. The step numbers match the labels in Fig. 4.
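The device-side counterpart can be pictured as a simple daemon loop. The sketch below mirrors steps 2, 5 and 6 of Table III in a simplified, single-task form; the agent_* helpers and the header are hypothetical placeholders, since the internals of the ISPS agent are not described at code level in this paper.

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include "isps_agent.h"  // hypothetical header: Minion type plus agent_* helpers

// Simplified, single-task-at-a-time sketch of the ISPS agent daemon.
int main() {
    while (true) {
        // Step 2: receive a minion from a client and extract its command.
        Minion m = agent_receive_minion();

        pid_t pid = fork();
        if (pid == 0) {
            // Child: spawn the off-loadable executable or shell command.
            // At runtime it reaches the flash through the flash access device
            // driver as if it were ordinary storage (steps 3 and 4).
            execlp("/bin/sh", "sh", "-c", m.command.executable.c_str(),
                   (char*)nullptr);
            _exit(127);  // exec failed
        }

        // Step 5: keep track of the status of the in-situ processing.
        int status = 0;
        waitpid(pid, &status, 0);

        // Step 6: populate the response fields and return the minion to the client.
        m.response.exit_status = WEXITSTATUS(status);
        agent_send_minion_back(m);
    }
}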
A CompStor client is able to send several concurrent minions to different CompStors. This gives the client the ability to trigger multiple parallel in-storage processing requests within a storage node. Considering a data center containing hundreds of CompStor-equipped storage nodes, there could be thousands of concurrent minions, resulting in heavy parallelism at the storage unit level.
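To illustrate this fan-out pattern, a client could dispatch one minion per device and collect the responses concurrently, for example with std::async. This sketch reuses the hypothetical isp_* API from the earlier example and assumes one minion has been prepared per device.

#include <cstddef>
#include <future>
#include <string>
#include <vector>
#include "insitu_library.h"  // hypothetical header from the earlier sketch

// Dispatch one pre-configured minion per CompStor device and wait for all
// responses; dev_paths[i] and minions[i] are assumed to correspond.
std::vector<Minion> run_on_all_devices(const std::vector<std::string>& dev_paths,
                                       const std::vector<Minion>& minions) {
    std::vector<std::future<Minion>> pending;
    for (std::size_t i = 0; i < dev_paths.size(); ++i) {
        pending.push_back(std::async(std::launch::async, [&, i] {
            isp_device_t dev = isp_open(dev_paths[i].c_str());
            Minion m = minions[i];
            isp_send_minion(dev, m);  // each device processes its share in-situ
            isp_wait_minion(dev, m);  // blocks until its ISPS agent replies
            isp_close(dev);
            return m;
        }));
    }
    std::vector<Minion> results;
    for (auto& f : pending) results.push_back(f.get());
    return results;
}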
IV. PROTOTYPE AND EXPERIMENTAL RESULTS
In this section, we introduce a fully functional 24TB
CompStor NVMe SSD prototype and run several experiments
to assess energy consumption and performance.
A. CompStor prototype
The prototype is an NVMe over PCIe SSD with an FPGA-based enterprise-class SSD controller coupled with an in-situ processing subsystem built around a quad-core 64-bit ARM A53 application processor.
Fig. 5 shows the prototype developed for the experiments described in this section. For the prototype, we built two boards, one for the basic SSD controller modules together with the flash memories and another one for the ISPS. The ISPS component is attached to the main board as a daughter board via a high-speed FMC connector, providing a seamless connection with the other components of the SSD controller. On the main board we used a Xilinx Virtex-7 2000T FPGA, while the ISPS is built on a Xilinx Zynq UltraScale+ MPSoC. The latter is an SoC containing an FPGA together with a quad-core 64-bit ARM A53 processor and two ARM R5 cores. This SoC also contains a graphics processing unit (GPU) and a set of ASIC modules such as encryption and decryption units; however, we have not used these facilities in the current version of CompStor. To the best of our knowledge, this prototype is the first Linux-powered SSD equipped with a complete software stack for sending in-storage processing commands to the device and receiving the results back from the SSD.
We have also considered the cost of the ISPS implementation relative to the cost of manufacturing the whole SSD. Our analysis shows that the cost of implementing the in-situ computation capability is less than 8% of the whole SSD, because the major costs are related to the flash media modules and the SSD controller logic. In CompStor, the ISPS, which consists mainly of hard ASIC ARM cores, is relatively inexpensive compared to the other hardware and software modules.
B. System setup
Two identical servers have been used for the experiments described in this section (see specifications in Table IV). In one of the servers we used an off-the-shelf enterprise SSD, while in the other CompStor was used as a direct-attached storage device. The arguments, input files, and test scripts are the same for both servers.
Since processing huge plain text files is common in datacenter applications, for the experiments described in this section we prepared a dataset containing 348 large compressed text files. These text files are books in different fields transformed into plain text. The total size of the dataset is about 11.3GB. The books are individually compressed using the bzip2 and gzip algorithms.
The user applications selected for the experiments include both IO-intensive and compute-intensive applications. Compression and decompression functions are commonly used in big data analytics frameworks; we used gzip/gunzip and bzip2/bunzip2 as representatives of the compute-intensive class of applications. For the IO-intensive experiments, two search applications were selected: grep and gawk [21]. Grep is a Linux shell command designed to search text inputs, while the gawk utility searches text and makes changes based on user-specified patterns.

TABLE IV. SERVER SPECIFICATION
CPU type: Intel Xeon E5-2620 v4
Memory: 32 GB DDR4
Operating system: Ubuntu 16.04
Off-the-shelf SSD: 256GB NVMe SSD
In-situ SSD: CompStor 24TB NVMe SSD

Fig. 5. The CompStor prototype
C. Experimental Results
Performance Experiments: In the first set of experiments, we used a host-side client to trigger in-situ processing and ran the applications on CompStor. Obviously, the performance of one CompStor with a quad-core ARM processor is lower than that of a high-end Xeon processor. However, the performance of in-storage computation systems scales linearly with the number of storage devices, as is the case for storage servers architected as in [2,3]. Fig. 6 depicts how performance scales linearly with capacity, i.e. with the number of CompStor devices. For this experiment, several CompStors are attached to a single host via PCIe slots.
Even though highly parallel in-storage computation performance can equal or surpass that of a server CPU, it makes sense to consider that one augments the other, resulting in a higher-performance and more efficient system. Fig. 7 depicts the performance of the Xeon processor combined with the performance of multiple CompStor devices when running the bzip2 compression algorithm. In this experiment, we distributed the whole set of input files between the host and several CompStors; the performance of the CompStors and the host was then measured separately.

Fig. 6. Performance experimental results (scaling with the number of CompStor devices)
Fig. 7. Aggregated system performance for compression using bzip2
This result shows that in-situ processing adds comparable processing power to the whole system, while the manufacturing cost of adding this feature to the SSD is reasonable. In addition, regarding energy consumption, CompStor achieves a compelling result thanks to a more efficient data access link and a more power-efficient processing engine in comparison with the host.
Energy Consumption Experiment: In this experiment, we measured the energy efficiency of the server using conventional storage devices compared to the server equipped with CompStor. In the latter case, for the computation to take place, only the computational request and the resulting data need to be transferred over the storage interface, greatly reducing the interface traffic and the required energy.
Fig. 8. Energy consumption per gigabyte of data (Joule/GB) for CompStor versus the Xeon E5-2620 host, for the compression/decompression workloads (gzip, gunzip, bzip2, bunzip2) and the search workloads (grep, gawk).
The reason we chose energy consumption over power consumption is to make the results of these experiments independent of the performance of the systems. We ran the experiments on both servers as described in Section IV.B and measured the energy consumed when executing the compression/decompression and search applications. The energy consumption is measured as the average power consumption multiplied by the time consumed to execute the benchmarks; we therefore measured the average power consumption over time as well as the execution time, and calculated the energy consumption accordingly.
Results were normalized per gigabyte of data processed, i.e. Watts per GB/s or, equivalently, Joules/GB, as shown in Fig. 8, so the energy consumption results in this figure are independent of the number of CompStors executing the benchmarks. In other words, considering a fixed amount of input data per CompStor, as we increase the number of CompStors the volume of input data and the energy consumption both increase linearly, so the energy consumption per unit of input data remains independent of the number of CompStors.
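Stated as a formula, the normalized metric reported in Fig. 8 is

\[
E_{\mathrm{per\,GB}} \;=\; \frac{P_{\mathrm{avg}} \cdot t_{\mathrm{exec}}}{D_{\mathrm{input}}} \quad \mathrm{[J/GB]},
\]

where P_avg is the measured average power draw, t_exec is the execution time of the benchmark, and D_input is the size of the processed input data in gigabytes.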
V. CONCLUSION
We have presented a novel approach that pushes the "move computation to data" paradigm to its ultimate limit by architecting and designing a highly efficient and flexible in-storage processing capability in solid state drives. We have designed CompStor, an in-storage computation platform with the appropriate software stack (devices, protocol, interface, software, and systems) and dedicated hardware for in-situ processing, including a powerful multi-core application processor subsystem. The dedicated hardware resources provide in-situ data analytics capability without degrading the performance of common storage device functions such as read, write and trim. By moving data analysis tasks closer to where the data resides, these storage devices dramatically reduce the storage bandwidth bottleneck and the data movement cost, and improve the overall energy efficiency. Experimental results show up to 3X energy saving for some applications in comparison to the host CPU. To the best of our knowledge, the 24TB CompStor SSD is the first one capable of supporting in-storage computation
running an operating system, enabling all types of applications
and Linux shell commands to be executed in-place with no
modification.
REFERENCES
[1] Cisco, "Cisco Global Cloud Index: Forecast and Methodology, 2015–2020," white paper, 2016. [Online]. URL: https://www.cisco.com/c/dam/en/us/solutions/collateral/service-provider/global-cloud-index-gci/white-paper-c11-738085.pdf
[2] S. Tavallaei, "Microsoft Project Olympus Hyperscale GPU Accelerator (HGX-1)," CSI, Azure Cloud, tech blog, 2017. [Online]. URL: https://azure.microsoft.com/mediahandler/files/resourcefiles/00c18868-eba9-43d5-b8c6-e59f9fa219ee/HGX-1%20Blog_5_26_2017.pdf
[3] S. Jacobs and C. P. Bean, "Fine particles, thin films and exchange anisotropy," in Magnetism, vol. III, G. T. Rado and H. Suhl, Eds. New York: Academic, 1963, pp. 271–350.
[4] M. A. Shaw, "Project Olympus Flash Expansion FX-16," tech blog, 2017. [Online]. URL: http://www.opencompute.org/blog/ocp-us-summit-2017-and-now-a-word-from-our-sponsors/
[5] Q. Xu, H. Siyamwala, M. Ghosh, T. Suri, M. Awasthi, Z. Guz, A. Shayesteh, and V. Balakrishnan, "Performance analysis of NVMe SSDs and their implication on real world databases," in Proceedings of the 8th ACM International Systems and Storage Conference (SYSTOR), ACM, 2015.
[6] J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi, "A scalable processing-in-memory accelerator for parallel graph processing," in Proceedings of the International Symposium on Computer Architecture (ISCA), pp. 105–117, 2015.
[7] S. Cho, C. Park, H. Oh, S. Kim, Y. Yi, and G. R. Ganger, "Active disk meets flash: A case for intelligent SSDs," in Proceedings of the 27th International ACM Conference on International Conference on Supercomputing (ICS '13), pp. 91–102, ACM, 2013.
[8] C. Li, Y. Hu, L. Liu, J. Gu, M. Song, X. Liang, J. Yuan, and T. Li, "Towards sustainable in-situ server systems in the big data era," in Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA '15), New York, NY, USA, pp. 14–26, ACM, 2015.
[9] Y. Kang, Y. Kee, E. L. Miller, and C. Park, "Enabling cost-effective data processing with smart SSD," in Proceedings of the IEEE 29th Symposium on Mass Storage Systems and Technologies (MSST '13), pp. 1–12, 2013.
[10] S. Y. W. Su and G. J. Lipovski, "CASSM: A cellular system for very large data bases," in Proceedings of the International Conference on Very Large Data Bases, pp. 456–472, Sept. 1975.
[11] E. Riedel, G. A. Gibson, and C. Faloutsos, "Active storage for large-scale data mining and multimedia," in Proceedings of the 24th International Conference on Very Large Data Bases (VLDB '98), pp. 62–73, Morgan Kaufmann, 1998.
[12] S.-W. Jun, M. Liu, K. E. Fleming, and Arvind, "Scalable multi-access flash store for big data analytics," in Proceedings of the 2014 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 55–64, 2014.
[13] S.-W. Jun, M. Liu, S. Lee, J. Hicks, J. Ankcorn, M. King, S. Xu, and Arvind, "BlueDBM: An appliance for big data analytics," in Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA), New York, NY, USA, pp. 1–13, ACM, 2015.
[14] J. Do, Y. S. Kee, J. M. Patel, C. Park, K. Park, and D. J. DeWitt, "Query processing on smart SSDs: Opportunities and challenges," in Proceedings of the ACM International Conference on Management of Data (SIGMOD), pp. 1221–1230, ACM, 2013.
[15] S. Kim, H. Oh, C. Park, S. Cho, S. W. Lee, and B. Moon, "In-storage processing of database scans and joins," Information Sciences: An International Journal, pp. 183–200, January 2016.
[16] D. Tiwari, S. Boboila, S. S. Vazhkudai, Y. Kim, X. Ma, P. J. Desnoyers, and Y. Solihin, "Active Flash: Towards energy-efficient, in-situ data analytics on extreme-scale machines," in Proceedings of FAST '13, pp. 119–132, 2013.
[17] Y. Kang, Y. S. Kee, E. Miller, and C. Park, "Enabling cost-effective data processing with smart SSD," in Proceedings of the IEEE 29th Symposium on Mass Storage Systems and Technologies (MSST), 2013.
[18] M. Gao and C. Kozyrakis, "HRL: Efficient and flexible reconfigurable logic for near-data processing," in Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), pp. 126–137, 2016.
[19] B. Gu, A. Yoon, D. Bae, I. Jo, J. Lee, J. Yoon, J. Kang, M. Kwon, C. Yoon, S. Cho, J. Jeong, and D. Chang, "Biscuit: A framework for near-data processing of big data workloads," in Proceedings of ISCA, 2016.
[20] M. Gao, G. Ayers, and C. Kozyrakis, "Practical near-data processing for in-memory analytics frameworks," in Proceedings of PACT-24, pp. 113–124, Oct. 2015.
[21] S. L. Xi, O. Babarinsa, M. Athanassoulis, and S. Idreos, "Beyond the wall: Near-data processing for databases," in Proceedings of the 11th International Workshop on Data Management on New Hardware, pp. 1–10, 2015.
[22] "GNU awk project," 2017. [Online]. URL: http://savannah.gnu.org/projects/gawk/
[23] N. Abbani et al., "A distributed reconfigurable active SSD platform for data intensive applications," in Proceedings of the IEEE 13th International Conference on High Performance Computing and Communications (HPCC), 2011.
[24] "The Seagate Kinetic Open Storage Vision," [Online]. URL: https://www.seagate.com/tech-insights/kinetic-vision-how-seagate-new-developer-tools-meets-the-needs-of-cloud-storage-platforms-master-ti/