CompStor: An In-Storage Computation Platform for
Scalable Distributed Processing
Mahdi Torabzadehkashi
EECS Department
University of California, Irvine
Irvine, USA
Torabzam@uci.edu
Siavash Rezaei
ICS Department
University of California, Irvine
Irvine, USA
siavashr@uci.edu
Vladimir Alves
NGD Systems, Inc
Irvine, USA
vladimir.alves@ngdsystems.com
Nader Bagherzadeh
EECS Department
University of California, Irvine
Irvine, USA
nader@uci.edu
Abstract— The explosion of data-centric and data-dependent applications requires new storage devices, interfaces, and software stacks. Big data analytics solutions such as Hadoop, MapReduce, and Spark have addressed the performance challenge by using a distributed architecture based on a new paradigm that relies on moving computation closer to data. In this paper, we describe a novel approach aimed at pushing the “move computation to data” paradigm to its ultimate limit by enabling highly efficient and flexible in-storage processing capability in solid-state drives (SSDs). We have designed CompStor, an FPGA-based SSD that implements computational storage through a software stack (devices, protocol, interface, software, and systems) and dedicated hardware for in-storage processing, including a quad-core ARM processor subsystem. The dedicated hardware resources provide in-storage data analytics capability without degrading the performance of common storage device functions such as read, write, and trim. Experimental results show up to 3X energy savings for some applications in comparison to the host CPU. To the best of our knowledge, the 24TB CompStor SSD is the first one capable of supporting in-storage computation running an operating system, enabling all types of applications and Linux shell commands to be executed in-place with no modification.
Keywords—near data processing, in-storage computing, in-situ
processing, distributed processing, non-volatile memory, SSD, direct-
attached storage
I. INTRODUCTION
By 2020, roughly 485 hyper-scale data centers will contain 47% of all servers in data centers worldwide, with Big Data applications driving the overall growth in stored data [1]. Moreover, the traffic within these storage units will quintuple by the same time. For a hyperscale storage architect, finding a way to avoid moving data from device to device, even within the same rack, will become a paramount need, as will reducing the power consumed to process and move this data. Webscale data centers have been actively developing new storage server architectures that favor a significant increase of capacity [2,3]. The advent of high-performance, high-capacity flash storage has changed the dynamics of the storage-compute relationship. Today, a handful of NVMe flash devices can easily saturate the PCIe bus complex of most servers. To address the significant bandwidth mismatch, a new paradigm is required that moves computing capabilities closer to data. This concept, which the authors refer to as in-situ processing, provides storage platforms with significant compute capabilities, reducing the computing demands on servers. In-situ processing takes advantage of distributed processing to:
● Significantly reduce data transfers between the SSD and the host when the computation can take place in-situ
● Take advantage of the enormous aggregated bandwidth at the media interface and reduce data ingestion time by more than an order of magnitude
● Achieve very high performance for IO-intensive applications that can be parallelized
When storage systems were built around hard disk drives, this balance favored the CPU, which often sat idle waiting for the storage system to respond, even if there were hundreds of disk drives all operating in parallel. With flash
Fig. 1. Bandwidth mismatch in high-capacity storage servers: 64 SSDs with 16 flash channels each at 533 MB/s give an aggregate media bandwidth of roughly 545 GB/s (~8.5 GB/s per SSD at 8TB per SSD), while the host sees only 2.0 GB/s per SSD and 16 GB/s over 16 PCIe lanes.
TABLE I. COMPARISON OF IN-STORAGE COMPUTATION RELATED WORK
storage media, the reverse is true [4]. Fig. 1 highlights the bandwidth mismatch between the flash media and the host CPU in storage servers being designed for webscale data centers. For example, in the Open Compute storage server proposed in [2,3], the mismatch can be as high as 80x, which translates into a dramatic slowdown for data access and processing. In-situ processing can alleviate this problem since it eliminates data movement from the media to the host CPU.
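The mismatch in Fig. 1 can be reproduced with back-of-the-envelope arithmetic; the sketch below uses only the channel rate, channel count, and device count from Fig. 1. Note that this simple PCIe-level ratio comes out near 34x; the 80x worst case quoted for the Open Compute server reflects that specific configuration.

```python
# Back-of-the-envelope bandwidth-mismatch estimate from Fig. 1's numbers.
CHANNEL_MBPS = 533          # per-flash-channel bandwidth, MB/s
CHANNELS_PER_SSD = 16
NUM_SSDS = 64
PCIE_GBPS = 16.0            # 16 PCIe lanes to the host, GB/s

per_ssd_gbps = CHANNEL_MBPS * CHANNELS_PER_SSD / 1000   # ~8.5 GB/s at the media
media_gbps = per_ssd_gbps * NUM_SSDS                    # ~545 GB/s aggregate
mismatch = media_gbps / PCIE_GBPS                       # media vs. host PCIe

print(f"per-SSD media bandwidth: {per_ssd_gbps:.1f} GB/s")
print(f"aggregate media bandwidth: {media_gbps:.0f} GB/s")
print(f"media-to-host mismatch: {mismatch:.0f}x")
```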
In addition, energy consumption is a major challenge in the current datacenter architecture. In fact, the energy consumption and cost of moving data will only be accentuated as capacities increase [1]. Our experimental results, presented in Section IV, show that significant energy savings are achievable.
Moving applications closer to the data instead of moving data to the applications is a viable solution to this problem [5-8], and there have been significant efforts to make this idea applicable in storage systems.
In this paper, we present CompStor, an in-situ processing SSD framework encompassing in-storage processing SSD hardware and a software stack for the communication between a host and a PCIe/NVMe-attached in-situ processing SSD. The current version of the software stack also supports single-host, multiple-SSD communication, and this feature is exhaustively tested in the experimental results described later in this paper. CompStor addresses the limitations of prior work, discussed in the next section, and adds further critical features and capabilities to storage systems.
Table I provides a comparison of critical features of different state-of-the-art solutions, including CompStor. This table clearly shows the motivations that drove the development of CompStor. The works mentioned in Table I are discussed in the next section.
We can summarize the main contributions of this paper as
follows:
● Pushing the “move computation to data” paradigm
to its ultimate limit by enabling highly efficient and
flexible in-situ processing capability in solid-state
drives thanks to several architectural innovations.
● Design of an SSD controller with dedicated hardware
and software resources providing in-situ data analytics
capability without degrading the performance of
common storage device functions such as read, write,
erase and trim.
● An SSD controller architecture that supports a Linux OS dedicated to in-situ processing. This OS provides a familiar execution environment for existing (big data) applications and makes porting effortless. In contrast to the previous works reviewed in the next section, CompStor can run executable code as well as shell scripts. In addition, one of the most useful features of CompStor is dynamic task loading, the ability to load tasks into a computational SSD at runtime. These features cannot be provided concurrently without an OS running inside the SSD.
● A complete software stack to support scalability, consistency, and system-level parallelism.
● A comparative analysis of energy consumption and
performance using real-world use cases.
The rest of this paper is organized as follows. In Section II, we discuss some of the state-of-the-art related work. A detailed explanation of the hardware and software architecture of CompStor is presented in Section III. In Section IV, we introduce our 24TB in-situ processing SSD prototype and report experimental results regarding energy consumption and performance improvements. Finally, Section V concludes the paper.
| Work | Prototype description | Dynamic task loading | Programming library | OS-level flexibility |
| Jun [13] | FPGA-based SSD / FPGA accelerator | | | |
| Abbani [23] | FPGA-based SSD / soft microprocessor | | | |
| Kang [17] | OTS SATA SSD / 2 ARM cores (unknown) | | ✓ | |
| Kim [15] | Simulation model / ARM A9 (sim) | | | |
| Tiwari [16] | Model / ARM A9 (model) | | ✓ | |
| Gu [19] | OTS NVMe SSD / ARM R7 | ✓ | ✓ | |
| Gao [20] | Simulation model / ARM A7 (model) | | | |
| CompStor | 24TB NVMe SSD / quad-core ARM A53 | ✓ | ✓ | ✓ |
II. RELATED WORK
In-storage processing is not a new concept [9,10].
However, in recent years several attempts have been made at
developing technologies that translate into performance
improvement and power savings in the data center [11-20].
Seagate has introduced Kinetic hard disk drives as object-
oriented storages a few years ago [24]. This type of storages is
fundamentally different from traditional block-oriented
storages. Instead of using read and write co mmands to access
data blocks, Kinetic HDDs provide object-level data access via
a RESTful API. In the other words, the operating system does
not have to deal with low-level file system block addresses,
instead, it could simply read an object like a file or an image by
referring to its object identification. The in-situ processing is
orthogonal to the object-oriented storage systems which means
a storage could be either in-situ processing or object-oriented
or both at the same time. The in-situ processing idea proposes
in-storage data processing without considering if the data is
stored as an object or as a series of data blocks.
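To make the block-versus-object contrast concrete, the sketch below compares the two access patterns; the object-store path scheme, key names, and in-memory stand-in are purely illustrative, not the actual Kinetic API.

```python
# Illustrative only: block-level vs. object-level access patterns.
# The object path scheme and key below are hypothetical placeholders,
# not the real Kinetic API.

def read_blocks(dev, lba, count, block_size=4096):
    """Block-oriented access: the caller must know where the data lives."""
    with open(dev, "rb") as f:
        f.seek(lba * block_size)
        return f.read(count * block_size)

def read_object(get, object_id):
    """Object-oriented access: the caller only names the object."""
    return get(f"/objects/{object_id}")   # e.g. a RESTful GET

# A toy in-memory "object store" stands in for the drive:
store = {"/objects/book-001": b"plain text of a book"}
data = read_object(store.get, "book-001")
```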
Jun et al. proposed a scalable architecture, called BlueDBM [12], consisting of flash storage with FPGA-based in-storage processing. Pure FPGA accelerators provide power efficiency but lack flexibility for addressing different types of applications. Moreover, despite the existence of high-level synthesis tools, too many steps are still required to generate the RTL design and bitstream files. The extra time it takes to generate the RTL design makes it impractical to reconfigure the FPGA frequently to accommodate different applications; FPGA-based accelerators are therefore often specialized for specific applications. An extended version of BlueDBM [13] provides experimental results for more storage nodes and real applications.
In [23], a simple operating system is provided, composed of drive-resident utility programs running on top of a MicroBlaze soft microprocessor in an FPGA. This platform sacrifices FPGA efficiency because in-storage applications run as instruction-based executables on a soft microprocessor, in order to avoid the difficulty of generating RTL code for each application.
Some recent works [14-16,19] exploit the embedded processors in SSD controllers originally designed to execute SSD firmware for flash management purposes. They all suffer from the limited processing power of embedded processors that are not dedicated to user data processing tasks. Exploiting the flash management firmware processor will undoubtedly interfere with critical flash controlling tasks such as garbage collection and wear leveling, and impact user read/write performance. Smart SSD [17] introduces a kernel module on the host side to schedule and coordinate tasks between the host and SSDs; the module shares workloads between host and device. In [14], Do et al. used an off-the-shelf SSD for an extended version of Microsoft SQL Server and ran selection and aggregation queries. However, an arbitrary application needs significant modifications to become executable on the SSD's embedded processor, because most applications are heavily linked to libraries that exist in a Linux OS or can be added to it. In [15], a special-purpose architecture is proposed for the scan and join operations in databases, resulting in a very narrow scope of usage. In [16], the authors target a special category of applications, scientific simulation applications, whose simulation results usually require a sequence of data analysis tasks; the data analysis part is executed on one of the SSD's embedded processors. As one would expect, an embedded processor targeted for SSD management does not provide the necessary performance for many big data applications, as alluded to earlier.
Biscuit [19] is the framework closest to our approach. Biscuit uses the ARM Cortex-R7 embedded processors in the storage device, shared with the SSD controller. This approach can degrade the performance of the storage device. It also restricts designers to a defined programming model in order to use in-storage processing.
In [20], Gao et al. proposed a heterogeneous architecture for the near-data processing unit, constructed from fine-grained configurable logic blocks (CLBs, similar to those in FPGAs) and coarse-grained functional units (similar to those in CGRAs). There are, however, significant difficulties in converting software applications to run on this platform. Overall, the limitations of previous work can be summarized as follows:
• no dedicated hardware resources for in-situ processing that would guarantee unchanged performance of read and write commands;
• no support for an operating system running in-storage;
• no software stack that truly supports scalability of storage capacity and processing power, consistency, and system-level parallelism;
• low processing power and support for only a limited domain of applications.
III. COMPSTOR ARCHITECTURE
In a conventional SSD, there are two processing
subsystems. The front-end subsystem talks to host and takes
care of PCIe/NVMe protocols, while on the other side, the
back-end subsystem handles flash management tasks such as
garbage collection. These two subsystems talk to each other to
handle host’s read and write co mmands. In our design, we have
implemented both subsystems in FPGA and tested the
developed SSD for different benchmarks . Expectedly, it works
similar to conventional SSDs. Later, we modified the
architecture to support in-storage processing without
interfering with the basic tasks of an SSD.
Using computational SSD, for the computation to take
place, only a command and a resulting data need to transfer
over the storage interface, and this greatly reduces the interface
traffic and significantly lowers the required power. In this
section, we will cover the hardware architecture and software
stack implemented in the CompStor platform which provides
in-situ processing without degrading the performance of the
storage data access.
A. CompStor Hardware Architecture
Fig. 2 shows a host attached to N CompStors via a PCIe
root complex and switch. The PCIe root complex together with
the PCIe switch provide means for several PCIe endpoints to
be connected to a single host. This figure also demonstrates a
high-level view of the block diagram of the CompStor
hardware architecture. The SSD controller of this platform
contains all the common subsystems necessary for the
implementation of a very h igh capacity enterprise-grade SSD,
such as embedded real-time processors, PCIe and NVMe
controllers, a fast-release host data buffer, advanced ECC
engine, programmable flash media interface, and encryption.
These modules are not shown in Fig. 2 for the sake of
simplicity.
In addition to these foundational elements, Co mpStor is
equipped with a dedicated in-situ processing subsystem (ISPS)
with full access to the flash media. Table II describes the
CompStor processor subsystem specifications.
The choice of an application processor subsystem dedicated to in-storage computation, with fully isolated control and data paths, is critical to ensuring:
• the ability to port a Linux operating system
• concurrent data processing and storage functionality without degradation of either one
The ISPS has a direct connection to the flash management processor, which provides flash media read/write access. We have modified the SSD controller hardware and software to provide a high-bandwidth, low-latency data path between the ISPS and the flash media interface. In other words, the ISPS can access the flash data more efficiently than the host CPU can.
B. CompStor Software Stack
In CompStor, a host side client controls the in-situ
processing flow, so from a master-slave perspective, the client
is the master, while CompStor behaves as a slave. The client is
responsible to perform a defined sequence of steps: sending an
in-situ task to CompStor, waiting for the completion of the
task, and receiving back results of the execution. In this
section, the software stack that helps the user to go through
these steps is discussed.
A set of entities is shared between all components in the software stack. They are virtual entities traveling through the layers of the software stack to deliver information, and they may get encapsulated into other entities. Each layer may either process them or redirect them to the next layer. These entities and their uses are defined as follows:
• Command: A data structure containing detailed information about an in-situ computation task, including the names of input and output files, the Linux shell command/script or the application name, the arguments to pass to the application, and access permissions. Linux OS support enables dynamic task loading.
• Response: A data structure containing the information
about the outcome of an in-storage computation task,
such as the final status of the command and time
consumed to execute it inside CompStor.
• Minion: A virtual entity that travels from a client to a
CompStor and delivers a command. It then waits until
the in-situ processing is done to deliver the response
TABLE II. ISPS CHARACTERISTICS

| Processor | 64-bit quad-core ARM Cortex-A53 @ 1.5GHz |
| Memory | 32KB I-cache & D-cache; 1MB L2 cache; 8GB DDR4 @ 2133MT/s |

Fig. 2. An in-situ processing system containing a host and several CompStors.
Fig. 3. The minion virtual entity: the minion travels from the client to CompStor carrying only the command, and travels back to the client carrying the response as well.
back to the client. This entity is composed of a command and a response. The fields of the command are populated by the client, while the fields of the response are populated by CompStor. Fig. 3 depicts a minion containing a command and a response traveling between the client and CompStor.
• Query: A virtual entity that travels from the client to CompStor to deliver an administrative message. Similar to a minion, it travels back to the client after delivering the message, but it cannot trigger an in-situ processing task. Instead, it can load an executable at runtime (dynamic task loading) or get information about the current status of CompStor, such as ARM core utilization or core temperatures. This information can be used for load balancing.
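The four entities above can be summarized as plain data structures; the rendering below is hypothetical, with field names inferred from the descriptions rather than taken from CompStor's code.

```python
# Hypothetical data-structure view of the shared entities described above.
# Field names are inferred from the text, not from CompStor's implementation.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Command:                  # what the client wants executed in-situ
    executable: str             # application name or shell command/script
    args: List[str]
    input_files: List[str]
    output_files: List[str]
    permissions: str = "r"

@dataclass
class Response:                 # outcome reported back by CompStor
    status: Optional[str] = None
    exec_time_s: Optional[float] = None

@dataclass
class Minion:                   # travels client -> CompStor -> client
    command: Command                                      # filled by the client
    response: Response = field(default_factory=Response)  # filled by CompStor

@dataclass
class Query:                    # administrative message (no task spawned)
    kind: str                   # e.g. "load_executable" or "get_status"
    payload: bytes = b""

m = Minion(Command("grep", ["-c", "storage"], ["books.txt"], ["hits.txt"]))
```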
The software stack spreads over the host and CompStor, and each layer accomplishes specific tasks and serves the other layers. The commands, responses, minions, and queries are the only entities traveling from one layer to another. Fig. 4 depicts the software stack architecture and how the layers communicate with each other. These layers are defined as follows:
Off-loadable executable: A C/C++ application, a Linux shell command/script, or a combination of both. The application the user aims to run on CompStor's embedded Linux can be the same source code the user runs on the host OS, but it needs to be compiled with an ARM compiler.
In-situ library: A C/C++ library that provides high-level APIs for the client and should be statically linked to it. In contrast to some related works, where an in-storage library is provided to rewrite the off-loadable application [5,6], the CompStor in-situ library is only intended to be used in the client, not in the off-loadable executable, which does not need any modification to be executed on CompStor.
Client: As mentioned before, a C/C++ application that
controls the in-storage processing flow using the in-situ
library.
ISPS agent: A daemon running on CompStor that is responsible for receiving minions from clients and spawning in-storage processes based on the command inside each received minion. The daemon populates the response fields of the minion and sends it back to the client after task completion.
Flash access device driver: A Linux device driver inside the ISPS Linux OS that communicates with the SSD controller for flash read/write access. The flash access device driver abstracts the flash read/write accesses, so the off-loadable executable sees the flash memory as if it were running on the host CPU.
SSD controller software: The software responsible for flash management, garbage collection, and table-keeping tasks. This software handles host read/write commands and also provides efficient flash read/write access for the flash access device driver.
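The spawn-and-report loop at the heart of an ISPS-agent-style daemon can be approximated in a few lines; the sketch below is illustrative (dictionary-based minions and a local subprocess stand in for the spawned in-storage process), not CompStor's actual implementation.

```python
# Minimal sketch of what an ISPS-agent-style daemon does for one minion:
# spawn the requested shell command/executable, then fill in the response.
# Names and structure are illustrative, not CompStor's actual implementation.
import subprocess
import time

def handle_minion(minion):
    start = time.monotonic()
    proc = subprocess.run(
        minion["command"],            # e.g. a Linux shell command/script
        shell=True, capture_output=True, text=True,
    )
    minion["response"] = {
        "status": "ok" if proc.returncode == 0 else "error",
        "exec_time_s": time.monotonic() - start,
        "stdout": proc.stdout,
    }
    return minion                     # sent back to the client

result = handle_minion({"command": "echo in-situ"})
```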
Fig. 4. The software stack: the client and the in-situ library run on the host, while the ISPS agent, the off-loaded executables, the flash access device driver, and the SSD controller software run inside CompStor.
TABLE III. THE LIFETIME OF A MINION

| Step | Description |
| 1 | The host-side client configures a minion and sends it to the ISPS agent using the in-situ library APIs. |
| 2 | The ISPS agent extracts the command from the received minion and spawns the off-loadable executable or the Linux shell command/script inside the ISPS. |
| 3 | At runtime, the executable accesses the flash storage through the device driver. |
| 4 | The device driver sends read/write commands to the flash controller. |
| 5 | At runtime, the ISPS agent keeps track of the status of the in-situ processing. |
| 6 | In the end, the ISPS agent populates the response fields of the minion and sends the minion back to the client. |
Whenever the client launches a minion, it triggers multiple messages passing between the different software layers. Table III describes the lifetime of a minion, from the time it is configured in the client until it delivers the result back to the client. The step numbers match the labels in Fig. 4.
A CompStor client is able to send several concurrent minions to different CompStors. This gives the client the ability to trigger multiple parallel in-storage processing requests in a storage node. Considering a data center containing hundreds of CompStor-equipped storage nodes, there could be thousands of concurrent minions, resulting in heavy parallelism at the storage unit level.
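The fan-out described above, one client dispatching concurrent minions to several CompStors, can be sketched with a thread pool; the devices here are simulated callables, and all names are illustrative rather than the actual CompStor transport.

```python
# Sketch of a client launching concurrent minions to several CompStors.
# Each "device" is simulated by a function that executes a command locally;
# in the real system the call would go over PCIe/NVMe.
from concurrent.futures import ThreadPoolExecutor

def make_device(name):
    def execute(command):
        # Stand-in for in-situ execution inside one CompStor.
        return {"device": name, "command": command, "status": "ok"}
    return execute

devices = [make_device(f"compstor{i}") for i in range(4)]
commands = [f"grep pattern shard{i}.txt" for i in range(4)]

# One minion per device, all in flight at once:
with ThreadPoolExecutor(max_workers=len(devices)) as pool:
    responses = list(pool.map(lambda dc: dc[0](dc[1]), zip(devices, commands)))
```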
IV. PROTOTYPE AND EXPERIMENTAL RESULTS
In this section, we introduce a fully functional 24TB
CompStor NVMe SSD prototype and run several experiments
to assess energy consumption and performance.
A. CompStor prototype
The prototype is an NVMe-over-PCIe SSD with an FPGA-based enterprise-class SSD controller coupled with the in-situ processing subsystem, itself built around a quad-core 64-bit ARM A53 application processor.
Fig. 5 shows the prototype developed for the experiments described in this section. For the prototype, we built two boards: one for the basic SSD controller modules together with the flash memories, and another for the ISPS. The ISPS is attached to the main board as a daughter board via a high-speed FMC connector, providing a seamless connection with the other components of the SSD controller. On the main board, we used a Xilinx Virtex-7 2000T FPGA, while the ISPS benefits from a Xilinx Zynq UltraScale+ MPSoC. The latter is an SoC containing an FPGA together with a quad-core 64-bit ARM A53 processor and two ARM R5 cores. This SoC also contains a graphics processing unit (GPU) and a set of ASIC modules, such as encryption and decryption units; however, we have not used these facilities in the current version of CompStor. To the best of our knowledge, this prototype is the first Linux-powered SSD equipped with a complete software stack to support sending in-storage processing commands to the CompStor and receiving the results from the SSD.
We have considered the cost of the ISPS implementation in comparison with the cost of the whole SSD. Our analysis shows that the cost of implementing the in-situ computation is less than 8% of the whole SSD manufacturing cost, because the major costs are related to the flash media modules and the SSD controller logic. In fact, in CompStor, the ISPS, which mainly consists of hard ARM cores, is relatively cheap in comparison to the other hardware and software modules.
B. System setup
Two identical servers have been used for the experiments
described in this section (see specifications in table IV). In one
of the servers, we used an off-the-shelf enterprise SSD while
CompStor was used as a directed attached storage device in
the other. The arguments, input files, and test scripts are the
same for both servers.
Since processing huge plain text files is common in
datacenter applications, for the experiments described in this
section, we prepared a dataset which contains 348 compressed
big text files. In fact, these te xt files are books in different
fields which are transformed to plain text files. The total size
of the dataset is about 11.3GB. The books are individually
compressed using bzip2 and gzip algorithms.
The user applications selected for the experiments included both IO-intensive and compute-intensive applications. Compression and decompression functions are commonly used in big data analytics frameworks; we used gzip/gunzip and bzip2/bunzip2 as representatives of the compute-intensive class of applications. On the other hand, for the IO-intensive experiments, two search applications were selected:
TABLE IV. SERVER SPECIFICATION

| CPU type | Intel Xeon E5-2620 v4 |
| Memory | 32 GB DDR4 |
| Operating system | Ubuntu 16.04 |
| Off-the-shelf SSD | 256GB NVMe SSD |
| In-situ SSD | CompStor 24TB NVMe SSD |

Fig. 5. The CompStor prototype.
grep and gawk [22]. Grep is a Linux shell command designed to search text inputs, while the gawk utility searches text and makes changes based on user-specified patterns.
C. Experimental Results
Performance experiments: In the first set of experiments, we used a host-side client to trigger in-situ processing and ran the applications on CompStor. Obviously, the performance of one CompStor with a quad-core ARM processor is lower than that of a high-end Xeon processor. However, the performance of in-storage computation systems scales linearly with the number of storage devices, as is the case for the storage servers architected in [2,3]. Fig. 6 depicts how performance scales linearly with capacity, i.e., the number of CompStor devices. For this experiment, several CompStors were attached to a single host via PCIe slots.
Even though highly parallel in-storage computation performance can equal or surpass that of a server CPU, it makes sense to let one augment the other, resulting in a higher-performance and more efficient system. Fig. 7 depicts the performance of the Xeon processor combined with the performance of multiple CompStor devices when running the bzip2 compression algorithm. In this experiment, we distributed the whole set of input files between the host and several CompStors; the performance of the CompStors and the host was then measured separately.
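The linear scaling shown in Figs. 6 and 7 amounts to a simple additive throughput model, since the host and the CompStors process disjoint shards of the input; a minimal sketch follows (the throughput numbers are illustrative placeholders, not measured values).

```python
# Simple additive model of aggregate throughput when the host and N
# in-situ devices process disjoint shards of the input in parallel.
# The per-unit throughputs below are illustrative, not measured.

def aggregate_throughput(host_tput, per_ssd_tput, num_ssds):
    """Host and CompStors work on disjoint data, so their rates add."""
    return host_tput + per_ssd_tput * num_ssds

# In-storage performance scales linearly with the number of devices:
scaling = [aggregate_throughput(0.0, 1.0, n) for n in (1, 2, 4, 8)]

# Combined host + CompStors (Fig. 7 style):
combined = aggregate_throughput(10.0, 1.0, 8)
```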
This result shows that in-situ processing adds comparable processing power to the whole system, while the manufacturing cost of adding this feature to the SSD is reasonable. In addition, regarding energy consumption, CompStor achieves compelling results thanks to a more efficient data access link and a more power-efficient processing engine than the host's.
Energy consumption experiment: In this experiment, we measured the energy efficiency of the server using conventional storage devices compared to the server that benefits from CompStor. In the latter case, for the computation to take place, only the computational request and the resulting data need to be transferred over the storage interface, greatly reducing the interface traffic and the required energy.
Fig. 8. Energy consumption per gigabyte of data (Joule/GB), CompStor vs. Xeon E5-2620:

| Application | CompStor | Xeon E5-2620 |
| gzip | 880.9 | 1908 |
| gunzip | 177.6 | 522 |
| bzip2 | 1462 | 2621.4 |
| bunzip2 | 1717 | 4666 |
| grep | 68.5 | 222.7 |
| gawk | 89.17 | 295.4 |
We chose energy consumption over power consumption to make the results of these experiments independent of the performance of the systems. We ran the experiments on both servers as described in Section IV.B and measured the energy consumed when executing the compression/decompression and search applications. Energy consumption is computed as the average power consumption multiplied by the time consumed executing the benchmarks; we therefore measured the average power consumption over time, as well as the elapsed time, and calculated the energy consumption accordingly.
Results were normalized per gigabyte of data transferred, i.e., Watts per GB/s, or Joules/GB, as shown in Fig. 8, so the energy consumption results in this figure are independent of the number of CompStors executing the benchmarks. In other words, considering a fixed amount of input data per CompStor, as we increase the number of CompStors, the volume of input data and the energy consumption both increase linearly, so the energy consumption per unit of input data is always independent of the number of CompStors.
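The normalization described above is simply average power multiplied by elapsed time, divided by the data volume; a minimal sketch follows (the power, time, and size values are illustrative, not measurements from Fig. 8).

```python
# Energy-per-gigabyte normalization: average power times elapsed time,
# divided by the amount of input data processed. Values are illustrative.

def energy_per_gb(avg_power_w, elapsed_s, data_gb):
    """Joules per gigabyte = (W * s) / GB."""
    return avg_power_w * elapsed_s / data_gb

# e.g. 50 W average power for 113 s over 11.3 GB of input (~500 J/GB):
j_per_gb = energy_per_gb(50.0, 113.0, 11.3)
```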
V. CONCLUSION
We have presented a novel approach that pushes the “move computation to data” paradigm to its ultimate limit by architecting and designing highly efficient and flexible in-storage processing capability in solid-state drives. We have designed CompStor, an in-storage computation platform with the appropriate software stack (devices, protocol, interface, software, and systems) and dedicated hardware for in-situ processing, including a powerful multi-core application processor subsystem. The dedicated hardware resources provide in-situ data analytics capability without degrading the performance of common storage device functions such as read, write, and trim. By moving data analysis tasks closer to where the data resides, these storage devices dramatically reduce the storage bandwidth bottleneck and data movement cost, and improve overall energy efficiency. Experimental results show up to 3X energy savings for some applications in comparison to the host CPU. To the best of our knowledge, the 24TB CompStor SSD is the first one capable of supporting in-storage computation running an operating system, enabling all types of applications and Linux shell commands to be executed in-place with no modification.

Fig. 6. Performance experimental results.
Fig. 7. Aggregated system performance for compression using bzip2.
Fig. 8. Energy consumption per data unit results.
REFERENCES
[1] Cisco, “Cisco Global Cloud Index: Forecast and Methodology, 2015–2020,” white paper, 2016. [Online]. URL: https://www.cisco.com/c/dam/en/us/solutions/collateral/service-provider/global-cloud-index-gci/white-paper-c11-738085.pdf
[2] S. Tavallaei, “Microsoft Project Olympus Hyperscale GPU Accelerator (HGX-1),” CSI, Azure Cloud, tech blog, 2017. [Online]. URL: https://azure.microsoft.com/mediahandler/files/resourcefiles/00c18868-eba9-43d5-b8c6-e59f9fa219ee/HGX-1%20Blog_5_26_2017.pdfI
[3] S. Jacobs and C. P. Bean, “Fine particles, thin films and exchange anisotropy,” in Magnetism, vol. III, G. T. Rado and H. Suhl, Eds. New York: Academic, 1963, pp. 271–350.
[4] M. A. Shaw, “Project Olympus Flash Expansion FX-16,” tech blog, 2017. [Online]. URL: http://www.opencompute.org/blog/ocp-us-summit-2017-and-now-a-word-from-our-sponsors/
[5] Q. Xu, H. Siyamwala, M. Ghosh, T. Suri, M. Awasthi, Z. Guz, A. Shayesteh, and V. Balakrishnan, “Performance analysis of NVMe SSDs and their implication on real world databases,” in Proceedings of the 8th ACM International Systems and Storage Conference (SYSTOR), 2015.
[6] J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi, “A scalable processing-in-memory accelerator for parallel graph processing,” in Proceedings of the International Symposium on Computer Architecture (ISCA), pp. 105–117, 2015.
[7] S. Cho, C. Park, H. Oh, S. Kim, Y. Yi, and G. R. Ganger, “Active disk meets flash: A case for intelligent SSDs,” in Proceedings of the 27th International ACM Conference on Supercomputing (ICS '13), pp. 91–102, 2013.
[8] C. Li, Y. Hu, L. Liu, J. Gu, M. Song, X. Liang, J. Yuan, and T. Li, “Towards sustainable in-situ server systems in the big data era,” in Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA '15), pp. 14–26, 2015.
[9] Y. Kang, Y. Kee, E. L. Miller, and C. Park, “Enabling cost-effective data processing with smart SSD,” in Proceedings of the IEEE 29th Symposium on Mass Storage Systems and Technologies (MSST '13), pp. 1–12, 2013.
[10] S. Y. W. Su and G. J. Lipovski, “CASSM: A cellular system for very large data bases,” in Proceedings of the International Conference on Very Large Data Bases, pp. 456–472, Sept. 1975.
[11] E. Riedel, G. A. Gibson, and C. Faloutsos, “Active storage for large-scale data mining and multimedia,” in Proceedings of the 24th International Conference on Very Large Data Bases (VLDB '98), pp. 62–73, Morgan Kaufmann, 1998.
[12] S.-W. Jun, M. Liu, K. E. Fleming, and Arvind, “Scalable multi-access flash store for big data analytics,” in Proceedings of the 2014 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 55–64, 2014.
[13] S.-W. Jun, M. Liu, S. Lee, J. Hicks, J. Ankcorn, M. King, S. Xu, and Arvind, “BlueDBM: An appliance for big data analytics,” in Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA), pp. 1–13, 2015.
[14] J. Do, Y. S. Kee, J. M. Patel, C. Park, K. Park, and D. J. DeWitt, “Query processing on smart SSDs: Opportunities and challenges,” in Proceedings of the ACM International Conference on Management of Data (SIGMOD), pp. 1221–1230, 2013.
[15] S. Kim, H. Oh, C. Park, S. Cho, S. W. Lee, and B. Moon, “In-storage processing of database scans and joins,” Information Sciences: An International Journal, pp. 183–200, January 2016.
[16] D. Tiwari, S. Boboila, S. S. Vazhkudai, Y. Kim, X. Ma, P. J. Desnoyers, and Y. Solihin, “Active Flash: Towards energy-efficient, in-situ data analytics on extreme-scale machines,” in Proceedings of FAST '13, pp. 119–132, 2013.
[17] Y. Kang, Y. S. Kee, E. Miller, and C. Park, “Enabling cost-effective data processing with smart SSD,” in Proceedings of the IEEE 29th Symposium on Mass Storage Systems and Technologies (MSST), 2013.
[18] M. Gao and C. Kozyrakis, “HRL: Efficient and flexible reconfigurable logic for near-data processing,” in Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), pp. 126–137, 2016.
[19] B. Gu, A. Yoon, D. Bae, I. Jo, J. Lee, J. Yoon, J. Kang, M. Kwon, C. Yoon, S. Cho, J. Jeong, and D. Chang, “Biscuit: A framework for near-data processing of big data workloads,” in Proceedings of ISCA, 2016.
[20] M. Gao, G. Ayers, and C. Kozyrakis, “Practical near-data processing for in-memory analytics frameworks,” in Proceedings of PACT-24, pp. 113–124, Oct. 2015.
[21] S. L. Xi, O. Babarinsa, M. Athanassoulis, and S. Idreos, “Beyond the wall: Near-data processing for databases,” in Proceedings of the 11th International Workshop on Data Management on New Hardware, pp. 1–10, 2015.
[22] “GNU awk project,” 2017. [Online]. URL: http://savannah.gnu.org/projects/gawk/
[23] N. Abbani et al., “A distributed reconfigurable active SSD platform for data intensive applications,” in Proceedings of the IEEE 13th International Conference on High Performance Computing and Communications (HPCC), 2011.
[24] “The Seagate Kinetic Open Storage Vision.” [Online]. URL: https://www.seagate.com/tech-insights/kinetic-vision-how-seagate-new-developer-tools-meets-the-needs-of-cloud-storage-platforms-master-ti/