Skylight—A Window on Shingled Disk Operation
ABUTALIB AGHAYEV, MANSOUR SHAFAEI, and PETER DESNOYERS,
Northeastern University
We introduce Skylight, a novel methodology that combines software and hardware techniques to reverse
engineer key properties of drive-managed Shingled Magnetic Recording (SMR) drives. The software part of
Skylight measures the latency of controlled I/O operations to infer important properties of drive-managed
SMR, including type, structure, and size of the persistent cache; type of cleaning algorithm; type of block
mapping; and size of bands. The hardware part of Skylight tracks drive head movements during these tests,
using a high-speed camera through an observation window drilled through the cover of the drive. These
observations not only confirm inferences from measurements, but resolve ambiguities that arise from the
use of latency measurements alone. We show the generality and efficacy of our techniques by running them
on top of three emulated and two real SMR drives, discovering valuable performance-relevant details of the
behavior of the real SMR drives.
Categories and Subject Descriptors: C.4 [Performance of Systems]: Design studies; Measurement
techniques; Modeling techniques; Performance attributes; D.4.2 [Storage Management]: Allocation/
deallocation strategies; Garbage collection; Secondary storage
General Terms: Design, Algorithms, Experimentation, Measurement, Performance, Reliability
Additional Key Words and Phrases: Shingled magnetic recording, shingle translation layer, emulation,
microbenchmarks, disks
ACM Reference Format:
Abutalib Aghayev, Mansour Shafaei, and Peter Desnoyers. 2015. Skylight—A window on shingled disk
operation. ACM Trans. Storage 11, 4, Article 16 (October 2015), 28 pages.
DOI: http://dx.doi.org/10.1145/2821511
1. INTRODUCTION
In the nearly 60 years since the Hard Disk Drive (HDD) was introduced, it has
become the mainstay of computer storage systems. In 2013 the hard drive industry
shipped over 400EB [Seagate 2013b] of storage, or almost 60GB for every person on
earth. Although facing strong competition from NAND flash-based Solid-State Drives
(SSDs), magnetic disks hold a 10× advantage over flash in both total bits shipped [Riley
2013] and per-bit cost [DRAMeXchange 2014], an advantage that will persist if density
improvements continue at current rates.
The most recent growth in disk capacity is the result of improvements to Perpendic-
ular Magnetic Recording (PMR) [Piramanayagam 2007], which has yielded terabyte
drives by enabling bits as short as 20nm in tracks 70nm wide [Seagate 2013c], but
further increases will require new technologies [Thompson and Best 2000]. Shingled
Magnetic Recording (SMR) [Wood et al. 2009] is the first such technology to reach
market: 5TB drives are available from Seagate [2013a] and shipments of 8 and 10TB
drives have been announced by Seagate [2014] and HGST [2014]. Other technolo-
gies (Heat-Assisted Magnetic Recording [Kryder et al. 2008] and Bit-Patterned Media
[Dobisz et al. 2008]) remain in the research stage, and may in fact use shingled record-
ing when they are released [Wang et al. 2013].
Shingled recording spaces tracks more closely, so they overlap like rows of shingles on
a roof, squeezing more tracks and bits onto each platter [Wood et al. 2009]. The increase
in density comes at a cost in complexity, as modifying a disk sector will corrupt other
data on the overlapped tracks, requiring copying to avoid data loss [Amer et al. 2010;
Feldman and Gibson 2013; Gibson and Ganger 2011; Gibson and Polte 2009]. Rather
than push this work onto the host file system [INCITS T10 Technical Committee 2014;
Le Moal et al. 2012], SMR drives shipped to date preserve compatibility with existing
drives by implementing a Shingle Translation Layer (STL) [Cassuto et al. 2010; Gibson
and Ganger 2011; Hall et al. 2012] that hides this complexity.
Like an SSD, an SMR drive combines out-of-place writes with dynamic mapping in
order to efficiently update data, resulting in a drive with performance much different
from that of a Conventional Magnetic Recording (CMR) drive due to seek overhead for
out-of-order operations. However, unlike SSDs, which have been extensively measured
and characterized [Bouganim et al. 2009; Chen et al. 2009], little is known about
the behavior and performance of SMR drives and their translation layers, or how to
optimize file systems, storage arrays, and applications to best use them.
We introduce a methodology for measuring and characterizing such drives, devel-
oping a specific series of microbenchmarks for this characterization process, much as
has been done in the past for conventional drives [Gim and Won 2010; Talagala et al.
1999; Worthington et al. 1995]. We augment these timing measurements with a novel
technique that tracks actual head movements via high-speed camera and image pro-
cessing and provides a source of reliable information in cases where timing results are
ambiguous.
We validate this methodology on three different emulated drives that use STLs
previously described in the literature [Cassuto et al. 2010; Coker and Hall 2013; Hall
et al. 2012], implemented as a Linux device mapper target [Device-Mapper 2001] over
a conventional drive, demonstrating accurate inference of properties. We then apply
this methodology to 5 and 8TB SMR drives provided by Seagate, inferring the STL
algorithm and its properties and providing the first public characterization of such
drives.
Using our approach we are able to discover important characteristics of the Seagate
SMR drives and their translation layer, including the following:
Cache type and size. The drives use a persistent disk cache of 20 and 25GiB on the
5 and 8TB drives, respectively, with high random write speed until the cache is full.
The effective cache size is a function of write size and queue depth.
Persistent cache structure. The persistent disk cache is written as journal entries
with quantized sizes—a phenomenon absent from the academic literature on SMRs.
Block mapping. Noncached data is statically mapped, using a fixed assignment of
Logical Block Addresses (LBAs) to Physical Block Addresses (PBAs), similar to that
used in CMR drives, with implications for performance and durability.
Band size. SMR drives organize data in bands—a set of contiguous tracks that are
rewritten as a unit; the examined drives have a small band size of 15–40MiB.
Cleaning mechanism. Aggressive cleaning during idle times moves data from the
persistent cache to bands; cleaning duration is 0.6–1.6s per modified band.
Fig. 1. Shingled disk tracks with head width k=2.
Our results show the details that may be discovered using Skylight, most of which
impact (negatively or positively) the performance of different workloads, as described
in Section 6. These results—and the toolset allowing similar measurements on new
drives—should thus be useful to users of SMR drives, both in determining what work-
loads are best suited for these drives and in modifying applications to better use them.
In addition, we hope that they will be of use to designers of SMR drives and their trans-
lation layers, by illustrating the effects of low-level design decisions on system-level
performance.
In the rest of the article we give an overview of SMR (Section 2) followed by the
description of emulated and real drives examined (Section 3). We then present our
characterization methodology and apply it to all of the drives (Section 4); finally, we
survey related work (Section 5) and present our conclusions (Section 6).
2. BACKGROUND
Shingled recording is a response to limitations on areal density with perpendicular
magnetic recording due to the superparamagnetic limit [Thompson and Best 2000].
In brief, for bits to become smaller, write heads must become narrower, resulting in
weaker magnetic fields. This requires lower coercivity (easily recordable) media, which
is more vulnerable to bit flips due to thermal noise, requiring larger bits for reliability.
As the head gets smaller this minimum bit size gets larger, until it reaches the width
of the head and further scaling is impossible.
Several technologies have been proposed to go beyond this limit, of which SMR is the
simplest [Wood et al. 2009]. To decrease the bit size further, SMR reduces the track
width while keeping the head size constant, resulting in a head that writes a path
several tracks wide. Tracks are then overlapped like rows of shingles on a roof, as seen
in Figure 1. Writing these overlapping tracks requires only incremental changes in
manufacturing, but much greater system changes, as it becomes impossible to rewrite
a single sector without destroying data on the overlapped sectors.
For maximum capacity an SMR drive could be written from beginning to end, utiliz-
ing all tracks. Modifying any of this data, however, would require reading and rewriting
the data that would be damaged by that write, and data to be damaged by the rewrite
and so on, until the end of the surface is reached. This cascade of copying may be halted
by inserting guard regions—tracks written at the full head width—so that the tracks
before the guard region may be rewritten without affecting any tracks following it, as
shown in Figure 2. These guard regions divide each disk surface into rewritable bands;
Fig. 2. Surface of a platter in a hypothetical SMR drive. A persistent cache consisting of nine tracks is
located at the outer diameter. The guard region that separates the persistent cache from the first band is
simply a track that is written at a full head width of k tracks. Although the guard region occupies the width
of k tracks, it contains a single track's worth of data and the remaining k − 1 tracks are wasted. The bands
consist of four tracks, also separated with a guard region. Overwriting a sector in the last track of any band
will not affect the following band. Overwriting a sector in any of the tracks will require reading and rewriting
all of the tracks starting at the affected track and ending at the guard region within the band.
since the guards hold a single track's worth of data, storage efficiency for a band size
of b tracks is b/(b + k − 1).
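As a quick check of this expression, consider the four-track bands with k = 2 shown in Figure 2 and, as our own added illustration, a larger hypothetical band of 100 tracks:

```latex
\[
\text{efficiency} = \frac{b}{b + k - 1}:
\qquad
\frac{4}{4 + 2 - 1} = 80\% \quad (b = 4,\ k = 2),
\qquad
\frac{100}{100 + 2 - 1} \approx 99\% \quad (b = 100,\ k = 2).
\]
```

Larger bands therefore waste less capacity on guard regions, at the cost of more expensive rewrites.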
Given knowledge of these bands, a host file system can ensure they are only written
sequentially, for example, by implementing a log-structured file system [Le Moal et al.
2012; Rosenblum and Ousterhout 1991]. Standards are being developed to allow a drive
to identify these bands to the host [INCITS T10 Technical Committee 2014]: host-aware
drives report sequential-write-preferred bands (an internal STL handles nonsequen-
tial writes), and host-managed drives report sequential-write-required bands. These
standards are still in draft form, and to date no drives based on them are available on
the open market.
Alternatively, drive-managed disks present a standard rewritable block interface
that is implemented by an internal STL, much as an SSD uses a Flash Translation
Layer (FTL). Although the two are logically similar, appropriate algorithms differ due
to differences in the constraints placed by the underlying media: (a) high seek times
for nonsequential access, (b) lack of high-speed reads, (c) use of large (tens to hundreds
of MB) cleaning units, and (d) lack of wear-out, eliminating the need for wear leveling.
These translation layers typically store all data in bands where it is mapped at a
coarse granularity, and devote a small fraction of the disk to a persistent cache, as
shown in Figure 2, which contains copies of recently written data. Data that should
be retrieved from the persistent cache may be identified by checking a persistent cache
map (or exception map) [Cassuto et al. 2010; Hall et al. 2012]. Data is moved back from
the persistent cache to bands by the process of cleaning, which performs Read-Modify-
Write (RMW) on every band whose data was overwritten. The cleaning process may be
lazy, running only when the free cache space is low, or aggressive, running during idle
times.
In one translation approach, a static mapping algorithmically assigns a native loca-
tion [Cassuto et al. 2010] (a PBA) to each LBA in the same way as is done in a CMR
drive. An alternate approach uses coarse-grained dynamic mapping for noncached
LBAs [Cassuto et al. 2010], in combination with a small number of free bands. During
cleaning, the drive writes an updated band to one of these free bands and then updates
the dynamic map, potentially eliminating the need for a temporary staging area for
cleaning updates and sequential writes.
In any of these cases drive operation may change based on the setting of the volatile
cache (enabled or disabled) [SATA-IO 2011]. When the volatile cache is disabled, writes
are required to be persistent before completion is reported to the host. When it is
enabled, persistence is only guaranteed after a FLUSH command or a write command
with the flush (FUA) flag set.
3. TEST DRIVES
We now describe the drives we study. First, we discuss how we emulate three SMR
drives using our implementation of two STLs described in the literature. Second, we
describe the real SMR drives we study in this article and the real CMR drive we use
for emulating SMR drives.
3.1. Emulated Drives
We implement Cassuto et al.’s set-associative STL [Cassuto et al. 2010] and a variant of
their S-blocks STL [Cassuto et al. 2010; Hall 2014], which we call fully associative STL,
as Linux device mapper targets. These are kernel modules that export a pseudo block
device to user space that internally behaves like a drive-managed SMR—the module
translates incoming requests using the translation algorithm and executes them on a
CMR drive.
The set-associative STL manages the disk as a set of N isocapacity (same-sized) data
bands, with typical sizes of 20–40MiB, and uses a small (1%–10%) section of the disk
as the persistent cache. The persistent cache is also managed as a set of n isocapacity
cache bands, where n ≪ N. When a block in data band a is to be written, a cache band
is chosen through (a mod n); the next empty block in this cache band is written and
the persistent cache map is updated. Further accesses to the block are served from the
cache band until cleaning moves the block to its native location, which happens when
the cache band becomes full.
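The band arithmetic of the set-associative STL can be summarized in a few lines of C. This is a minimal sketch, not the STL implementation itself; the constants (4KiB blocks, 40MiB bands, 128 cache bands) and the helper names are illustrative choices within the ranges quoted above:

```c
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE      4096ULL                  /* 4 KiB write unit (assumption)  */
#define BAND_SIZE       (40ULL << 20)            /* 40 MiB data bands              */
#define BLOCKS_PER_BAND (BAND_SIZE / BLOCK_SIZE)
#define N_CACHE_BANDS   128ULL                   /* n cache bands, with n << N     */

/* Data band that a logical block address belongs to. */
static uint64_t data_band_of(uint64_t lba) { return lba / BLOCKS_PER_BAND; }

/* Set-associative placement: data band a may only be cached in band a mod n. */
static uint64_t cache_band_of(uint64_t lba) { return data_band_of(lba) % N_CACHE_BANDS; }

int main(void)
{
    uint64_t lba = 123456789;
    printf("LBA %llu -> data band %llu -> cache band %llu\n",
           (unsigned long long)lba,
           (unsigned long long)data_band_of(lba),
           (unsigned long long)cache_band_of(lba));
    return 0;
}
```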
The fully associative STL, on the other hand, divides the disk into large (we used
40GiB) zones and manages each zone independently. A zone starts with 5% of its
capacity provisioned as free bands for handling updates. When a block in logical band
a is to be written to the corresponding physical band b, a free band c is chosen and
written to, and the persistent cache map is updated. When the number of free bands
falls below a threshold, cleaning merges the bands b and c, writes the result to a new
band d, and remaps the logical band a to the physical band d, freeing bands b and c in
the process. This dynamic mapping of bands allows the fully associative STL to handle
streaming writes with zero overhead.
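A minimal sketch of the dynamic band map this implies, assuming a per-zone logical-to-physical band table and a free-band list; the structure and constants below are our own illustration of the description above, not code from the S-blocks STL:

```c
#include <stdio.h>

#define BANDS_PER_ZONE 1024   /* physical bands in one zone (illustrative) */

struct zone {
    int band_map[BANDS_PER_ZONE];   /* logical band -> physical band */
    int free_list[BANDS_PER_ZONE];  /* stack of free physical bands  */
    int n_free;
};

/* Cleaning: merge logical band a (native location b) with the update band c,
 * write the result to a fresh free band d, and remap a -> d.               */
static void clean_band(struct zone *z, int a, int c)
{
    int b = z->band_map[a];             /* current native location          */
    int d = z->free_list[--z->n_free];  /* destination of the merged band   */
    /* ... read band b, apply updates held in band c, write band d ...      */
    z->band_map[a] = d;                 /* dynamic remapping                */
    z->free_list[z->n_free++] = b;      /* b and c return to the free pool  */
    z->free_list[z->n_free++] = c;
}

int main(void)
{
    static struct zone z;
    for (int i = 0; i < BANDS_PER_ZONE; i++) z.band_map[i] = i;
    z.free_list[0] = 1023;              /* one band still free               */
    z.n_free = 1;
    clean_band(&z, 7, 1022);            /* band 1022 held updates for band 7 */
    printf("logical band 7 now maps to physical band %d\n", z.band_map[7]);
    return 0;
}
```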
To evaluate the accuracy of our emulation strategy, we implemented a pass-through
device mapper target and found negligible overhead for our tests, confirming a previous
study [Pitchumani et al. 2012]. Although in theory, this emulation approach may seem
disadvantaged by the lack of access to exact sector layout, in practice this is not the
case—even in real SMR drives, the STL running inside the drive is implemented on
top of a layer that provides linear PBAs by hiding sector layout and defect manage-
ment [Feldman 2014b]. Therefore, we believe that the device mapper target running
on top of a CMR drive provides an accurate model for predicting the behavior of an
STL implemented by the controller of an SMR drive.
Table I. Emulated SMR Drive Configurations

                                     Persistent Cache   Disk Cache     Cleaning     Band    Mapping
Drive Name       STL                 Type and Size      Multiplicity   Type         Size    Type      Size
Emulated-SMR-1   Set associative     Disk, 37.2GiB      Single at ID   Lazy         40MiB   Static    3.9TB
Emulated-SMR-2   Set associative     Flash, 9.98GiB     N/A            Lazy         25MiB   Static    3.9TB
Emulated-SMR-3   Fully associative   Disk, 37.2GiB      Multiple       Aggressive   20MiB   Dynamic   3.9TB
Table I shows the three emulated SMR drive configurations we use in our tests. The
first two drives use the set-associative STL, and they differ in the type of persistent
cache and band size. The last drive uses the fully associative STL and disk for the
persistent cache. We do not have a drive configuration combining the fully associative
STL and flash for persistent cache, since the fully associative STL aims to reduce
long seeks during cleaning in disks, by using multiple caches evenly spread out on a
disk—flash does not suffer from long seek times.
To emulate an SMR drive with a flash cache (Emulated-SMR-2) we use the Emulated-
SMR-1 implementation, but use a device mapper linear target to redirect the under-
lying LBAs corresponding to the persistent cache, storing them on an SSD.
To check the correctness of the emulated SMR drives we ran repeated burn-in tests
using fio [Axboe 2015]. We also formatted emulated drives with ext4, compiled the
Linux kernel on top, and successfully booted the system with the compiled kernel. The
source code for the set-associative STL (1,200 lines of C) and a testing framework (250
lines of Go) are available at http://sssl.ccs.neu.edu/skylight.
3.2. Real Drives
Two real SMR drives were tested: Seagate ST5000AS0011, a 5,900RPM desktop drive
(rotation time 10ms) with four platters, eight heads, and 5TB capacity (termed
Seagate-SMR in the following), and Seagate ST8000AS0002, a similar drive with six
platters, 12 heads, and 8TB capacity. Emulated drives use a Seagate ST4000NC001
(Seagate-CMR), a real CMR drive identical in drive mechanics and specification (except
the 4TB capacity) to the ST5000AS0011. Results for the 8 and 5TB SMR drives were
similar; to save space, we only present results for the publicly available 5TB drive.
4. CHARACTERIZATION TESTS
To motivate our drive characterization methodology we first describe the goals of our
measurements. We then describe the mechanisms and methodology for the tests, and
finally present results for each tested drive. For emulated SMR drives, we show that
the tests produce accurate answers, based on implemented parameters; for real SMR
drives we discover their properties. The behavior of the real SMR drives under some of
the tests engenders further investigation, leading to the discovery of important details
about their operation.
4.1. Characterization Goals
The goal of our measurements is to determine key drive characteristics and parameters:
Drive type. In the absence of information from the vendor, is a drive an SMR or a
CMR?
Persistent cache type. Does the drive use flash or disk for the persistent cache? The
type of the persistent cache affects the performance of random writes and reliable
(volatile cache-disabled) sequential writes. If the drive uses disk for persistent cache,
is it a single cache, or is it distributed across the drive [Cassuto et al. 2010; Hall
2014]? The layout of the persistent disk cache affects the cleaning performance and
the performance of the sequential read of a sparsely overwritten linear region.
Fig. 3. SMR drive with the observation window encircled in red. Head assembly is visible parked at the
inner diameter.
Cleaning. Does the drive use aggressive cleaning, improving performance for low
duty-cycle applications, or lazy cleaning, which may be better for throughput-oriented
ones? Can we predict the performance impact of cleaning?
Persistent cache size. After some number of out-of-place writes the drive will need
to begin a cleaning process, moving data from the persistent cache to bands so that
it can accept new writes, negatively affecting performance. What is this limit, as a
function of total blocks written, number of write operations, and other factors?
Band size. Since a band is the smallest unit that may be rewritten efficiently, knowl-
edge of band size is important for optimizing SMR drive workloads [Cassuto et al.
2010; Coker and Hall 2013]. What are the band sizes for a drive, and are these sizes
constant over time and space [Feldman 2011]?
Block mapping. The mapping type affects performance of both cleaning and reliable
sequential writes. For LBAs that are not in the persistent cache, is there a static
mapping from LBAs to PBAs, or is this mapping dynamic?
Zone structure. Determining the zone structure of a drive is a common step in under-
standing block mapping and band size, although the structure itself has little effect
on external performance.
4.2. Test Mechanisms
The software part of Skylight uses fio to generate microbenchmarks that elicit the
drive characteristics. The hardware part of Skylight tracks the head movement during
these tests. It resolves ambiguities when interpreting the latency of the data obtained
from the microbenchmarks and leads to discoveries that are not possible with mi-
crobenchmarks alone. To track the head movements, we installed (under clean-room
conditions) a transparent window in the drive casing over the region traversed by
the head. Figure 3 shows the head assembly parked at the Inner Diameter (ID). We
recorded the head movements using a Casio EX-ZR500 camera at 1,000 frames per
second and processed the recordings with ffmpeg to generate a head location value for
each video frame.
We ran the tests on a 64-bit Intel Core-i3 Haswell system with 16GiB RAM and
64-bit Linux kernel version 3.14. Unless otherwise stated, we disabled kernel read-
ahead, drive look-ahead, and drive volatile cache using hdparm.
Fig. 4. Discovering drive type using latency of random writes. Y axis varies in each graph.
Fig. 5. Seagate-SMR head position during random writes.
Extensions to fio developed for these tests have been integrated back and are avail-
able in the latest fio release. Slow-motion clips for the head position graphs shown
in the article, as well as the tests themselves, are available at http://sssl.ccs.neu.edu/
skylight.
4.3. Drive Type and Persistent Cache Type
TEST 1 exploits the unusual random write behavior of the SMR drives to differentiate
them from CMR drives. While random writes to a CMR drive incur varying latency
due to random seek time and rotational delay, random writes to an SMR drive are
sequentially logged to the persistent cache with a fixed latency. If random writes are
not confined to a small region, SMR drives that use separate persistent caches for
different LBA ranges [Cassuto et al. 2010] may still incur varying write latency.
Therefore, random writes are done within a small region, ensuring that a single
persistent cache is used.
TEST 1: Discovering Drive Type
1 Write blocks in the first 1GiB in random order to the drive.
2 If latency is fixed then the drive is SMR else the drive is CMR.
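A minimal sketch of TEST 1 in C, assuming a raw block device (the path /dev/sdX is a placeholder) and 256 writes; O_DIRECT | O_SYNC stands in for the hdparm settings used in our setup, and error handling is trimmed:

```c
/* Sketch of TEST 1: synchronous 4 KiB random writes within the first 1 GiB. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

#define BLK  4096
#define SPAN (1ULL << 30)            /* first 1 GiB of the drive */

static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

int main(void)
{
    int fd = open("/dev/sdX", O_WRONLY | O_DIRECT | O_SYNC);   /* placeholder device */
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, BLK, BLK)) return 1;   /* O_DIRECT needs aligned I/O */
    memset(buf, 0, BLK);

    srand(42);
    for (int i = 0; i < 256; i++) {
        off_t off = (off_t)(rand() % (int)(SPAN / BLK)) * BLK;  /* random 4 KiB block */
        double t0 = now_ms();
        if (pwrite(fd, buf, BLK, off) != BLK) { perror("pwrite"); return 1; }
        /* Fixed per-write latency suggests SMR; varying latency suggests CMR. */
        printf("%d %.2f\n", i, now_ms() - t0);
    }
    close(fd);
    return 0;
}
```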
Figure 4 shows the results for this test. Emulated-SMR-1 sequentially writes in-
coming random writes to the persistent cache. It fills one empty block after another
and, because the writes are synchronous, it misses the next empty block by the time the
next write arrives. Therefore, it waits for a complete rotation, resulting in a 10ms write
latency, which is the rotation time of the underlying CMR drive. The submillisecond
latency of Emulated-SMR-2 shows that this drive uses flash for the persistent cache.
The latency of Emulated-SMR-3 is identical to that of Emulated-SMR-1, suggesting a
similar setup. The varying latency of Seagate-CMR identifies it as a conventional drive.
Seagate-SMR shows a fixed 25ms latency with a 325ms bump at the 240th write.
While the fixed latency indicates that it is an SMR drive, we resort to the head position
graph to understand why it takes 25ms to write a single block and what causes the
325ms latency.
Figure 5 shows that the head, initially parked at the ID, seeks to the Outer Diameter
(OD) for the first write. It stays there during the first 239 writes (incidentally, showing
that the persistent cache is at the OD), and on the 240th write it seeks to the center,
staying there for 285ms before seeking back and continuing to write.
Fig. 6. Surface of a disk platter in a hypothetical SMR drive divided into two 2.5 track imaginary regions.
The left figure shows the placement of random blocks 3 and 7 when writing synchronously. Each internal
write contains a single block and takes 25ms (50ms in total) to complete. The drive reports 25ms write
latency for each block; reading the blocks in the written order results in a 5ms latency. The right figure
shows the placement of blocks when writing asynchronously with high queue depth. A single internal write
contains both of the blocks, taking 25ms to complete. The drive still reports 25ms write latency for each
block; reading the blocks back in the written order results in a 10ms latency due to missed rotation.
Is all of the 25ms latency associated with every block write spent writing, or is some of it
spent in rotational delay? When we repeat the test multiple times, the completion time
of the first write ranges between 41 and 52ms, while the remaining writes complete in
25ms. The latency of the first write always consists of a seek from the ID to the OD
(16ms). We hypothesize that the remaining time is spent in rotational delay—likely
waiting for the beginning of a delimited location—and writing (25ms). Depending on
where the head lands after the seek, the latency of the first write changes between
41 and 52ms. The remaining writes are written as they arrive, without seek time and
rotational delay, each taking 25ms. Hence, we hypothesize that a single block host write
results in a 2.5 track internal write. We realize that the 25ms latency is artificially high
and expect it to drop in future drives; nevertheless, we base our further explanations
on this assumption. In the following section we explore this phenomenon further.
4.3.1. Journal Entries with Quantized Sizes.
If after TEST 1 we immediately read blocks
in the written order, read latency is fixed at 5ms, indicating 0.5 track distance (cov-
ering a complete track takes a full rotation, which is 10ms for the drive; therefore,
5ms translates to 0.5 track distance) between blocks. On the other hand, if we write
blocks asynchronously at the maximum queue depth of 31 [Libata FAQ 2011] and im-
mediately read them, latency is fixed at 10ms, indicating a missed rotation due to
contiguous placement. Furthermore, although the drive still reports 25ms completion
time for every write, asynchronous writes complete faster—for the 256 write opera-
tions, asynchronous writes complete in 216ms, whereas synchronous writes complete
in 6,539ms, as seen in Figure 5. Gathering these facts, we arrive at Figure 6. Writing
asynchronously with high queue depth allows the drive to pack multiple blocks into a
single internal write, placing them contiguously (shown on the right). The drive reports
the completion of individual host writes packed into the same internal write once the
internal write completes. Thus, although each of the host writes in the same internal
write is reported to take 25ms, it is the same 25ms that went into writing the inter-
nal write. As a result, in the asynchronous case, the drive does fewer internal writes,
which accounts for the fast completion time. The contiguous placement also explains
the 10ms latency when reading blocks in the written order. Writing synchronously,
however, results in doing a separate internal write for every block (shown on the left),
taking longer to complete. Placing blocks starting at the beginning of 2.5 track internal
writes explains the 5ms latency when reading blocks in the written order.
Fig. 7. Random write latency of different write sizes on Seagate-SMR, when writing at the queue depth
of 31. Each latency graph corresponds to the latency of a group of writes. For example, the graph at 25ms
corresponds to the latency of writes with sizes in the range of 4–26KiB. Since writes with different sizes in
a range produced similar latency, we plotted a single latency as a representative.
To understand how the internal write size changes with the increasing host write
size, we keep writing at the maximum queue depth, gradually increasing the write
size. Figure 7 shows that the writes in the range of 4–26KiB result in 25ms latency,
suggesting that 31 host writes in this size range fit in a single internal write. As we
jump to the 28KiB writes, the latency increases by 5ms (or 0.5 track) and remains
approximately constant for the writes of sizes up to 54KiB. We observe a similar jump
in latency as we cross from 54 to 56KiB and also from 82 to 84KiB. This shows that
the internal write size increases in 0.5 track increments. Given that the persistent
cache is written using a “log-structured journaling mechanism” [Feldman 2014a], we
infer that the 0.5 track of 2.5 track minimum internal write is the journal entry that
grows in 0.5 track increments, and the remaining two tracks contain out-of-band data,
like parts of the persistent cache map affected by the host writes. The purpose of this
quantization of journal entries is not known, but may be in order to reduce rotational
delay or simplify delimiting and locating them. We further hypothesize that the 325ms
delay in Figure 4, observed every 240th write, is a map merge operation that stores the
updated map at the middle tracks.
As the write size increases to 256KiB we see varying delays, and inspection of com-
pletion times shows less than 31 writes completing in each burst, implying a bound
on the journal entry size. Different completion times for large writes suggest that for
these, the journal entry size is determined dynamically, likely based on the available
drive resources at the time when the journal entry is formed.
4.4. Disk Cache Location and Layout
We next determine the location and layout of the disk cache, exploiting a phenomenon
called fragmented reads [Cassuto et al. 2010]. When sequentially reading a region
in an SMR drive, if the cache contains a newer version of some of the blocks in the
region, the head has to seek to the persistent cache and back, physically fragmenting
a logically sequential read. In TEST 2, we use these variations in seek time to discover
the location and layout of the disk cache.
The test works by choosing a small region and writing every other block in it and
then reading the region sequentially from the beginning, forcing a fragmented read.
LBA numbering conventionally starts at the OD and grows towards the ID. Therefore,
Fig. 8. Discovering disk cache structure and location using fragmented reads.
Fig. 9. Seagate-SMR head position during fragmented reads.
TEST 2: Discovering Disk Cache Location and Layout
1 Starting at a given offset, write a block and skip a block, and so on, writing 512 blocks in total.
2 Starting at the same offset, read 1024 blocks; call the average latency lat_offset.
3 Repeat steps 1 and 2 at the offsets high, low, mid.
4 if lat_high < lat_mid < lat_low then
      There is a single disk cache at the ID.
  else if lat_high > lat_mid > lat_low then
      There is a single disk cache at the OD.
  else if lat_high = lat_mid = lat_low then
      There are multiple disk caches.
  else
      assert(lat_high = lat_low and lat_high > lat_mid);
      There is a single disk cache in the middle.
a fragmented read at low LBAs on a drive with the disk cache located at the OD would
incur negligible seek time, whereas a fragmented read at high LBAs on the same drive
would incur high seek time. Conversely, on a drive with the disk cache located at the
ID, a fragmented read would incur high seek time at low LBAs and negligible seek time
at high LBAs. On a drive with the disk cache located at the Middle Diameter (MD),
fragmented reads at low and high LBAs would incur similar high seek times and they
would incur negligible seek times at middle LBAs. Finally, on a drive with multiple
disk caches evenly distributed across the drive, the fragmented read latency would be
mostly due to rotational delay and vary little across the LBA space. Guided by these
assumptions, to identify the location of the disk cache, the test chooses a small region
at low, middle, and high LBAs and forces fragmented reads at these regions.
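TEST 2 can be approximated with a short C program; the sketch below assumes a roughly 5TB drive behind a placeholder device path, with read-ahead already disabled as in our setup, and with error handling trimmed. The three offsets are illustrative and should be adjusted to the drive's capacity:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define BLK 4096

static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

/* Average latency of a fragmented read of 1024 blocks starting at `offset`:
 * every other block was just overwritten, so half of the reads are served
 * from the persistent cache and half from the native location.            */
static double fragmented_read_ms(int fd, off_t offset)
{
    void *buf;
    if (posix_memalign(&buf, BLK, BLK)) return -1;
    memset(buf, 0, BLK);

    for (int i = 0; i < 1024; i += 2)                 /* write every other block */
        pwrite(fd, buf, BLK, offset + (off_t)i * BLK);

    double t0 = now_ms();
    for (int i = 0; i < 1024; i++)                    /* sequential read back    */
        pread(fd, buf, BLK, offset + (off_t)i * BLK);
    free(buf);
    return (now_ms() - t0) / 1024;
}

int main(void)
{
    int fd = open("/dev/sdX", O_RDWR | O_DIRECT);     /* placeholder device      */
    if (fd < 0) { perror("open"); return 1; }

    off_t low = 0, mid = 2300ULL << 30, high = 4600ULL << 30;  /* ~0, 2.5, 4.9 TB */
    printf("low %.2f ms  mid %.2f ms  high %.2f ms  (compare as in TEST 2)\n",
           fragmented_read_ms(fd, low),
           fragmented_read_ms(fd, mid),
           fragmented_read_ms(fd, high));
    return 0;
}
```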
Figure 8 shows the latency of fragmented reads at three offsets on all SMR drives.
The test correctly identifies the Emulated-SMR-1 as having a single cache at the ID. For
Emulated-SMR-2 with flash cache, latency is seen to be negligible for flash reads, and
a full missed rotation for each disk read. Emulated-SMR-3 is also correctly identified
as having multiple disk caches—the latency graph of all fragmented reads overlap, all
having the same 10ms average latency. For Seagate-SMR (this test was performed with
the volatile cache enabled, using hdparm -W1) we confirm that it has a single disk cache
at the OD.
Fig. 10. Discovering cleaning type.
Fig. 11. Seagate-SMR head position during the 3.5-s period starting at the beginning of step 2 of TEST 3.
Figure 9 shows the Seagate-SMR head position during fragmented reads at offsets
of 0, 2.5, and 5TB. For the offsets of 2.5 and 5 TB, we see that the head seeks back and
forth between the OD and near-center and between the OD and the ID, respectively,
occasionally missing a rotation. The cache-to-data distance for the LBAs near 0TB was
too small for the resolution of our camera.
4.5. Cleaning Algorithm
The fragmented read effect is also used in TEST 3 to determine whether the drive uses
aggressive or lazy cleaning, by creating a fragmented region and then pausing to allow
an aggressive cleaning to run before reading the region.
TEST 3: Discovering Cleaning Type
1 Starting at a given offset, write a block and skip a block, and so on, writing 512 blocks in total.
2 Pause for 3–5s.
3 Starting at the same offset, read 1,024 blocks.
4 If latency is fixed then cleaning is aggressive else cleaning is lazy.
Figure 10 shows the read latency graph of step 3 from TEST 3 at the offset of 2.5TB,
with a 3s pause in step 2. For all drives, offsets were chosen to land within a single band
(Section 4.8). After a pause, the top two emulated drives continue to show fragmented
read behavior, indicating lazy cleaning, while in Emulated-SMR-3 and Seagate-SMR
reads are no longer fragmented, indicating aggressive cleaning.
Figure 11 shows the Seagate-SMR head position during the 3.5-second period starting
at the beginning of step 2. Two short seeks from the OD to the ID and back are seen in
the first 200ms; their purpose is not known. The RMW operation for cleaning a band
starts at 1,242ms after the last write, when the head seeks to the band at 2.5TB offset,
reads for 180ms, and seeks back to the cache at the OD where it spends 1,210ms. We
believe this time is spent forming an updated band and persisting it to the disk cache,
to protect against power failure during band overwrite. Next, the head seeks to the
band, taking 227ms to overwrite it and then seeks to the center to update the map.
Hence, cleaning a band in this case took 1.6s. We believe the center to contain the
map because the head always moves to this position after performing a RMW, and
stays there for a short period before eventually parking at the ID. After 3s, reads begin
Fig. 12. Latency of reads of random writes immediately after the writes and after 10–20min pauses.
Fig. 13. Verifying hypothesized cleaning algorithm on Seagate-SMR.
and the head seeks back to the band location, where it stays until reads complete (only
the first 500ms is seen in Figure 11).
We confirmed that the operation starting at 1,242ms is indeed an RMW: when step 3
is begun before the entire cleaning sequence has completed, read behavior is unchanged
from TEST 2. We did not explore the details of the RMW; alternatives like partial read-
modify-write [Poudyal 2013] may also have been used.
4.5.1. Seagate-SMR Cleaning Algorithm.
We next start exploring performance-relevant
details that are specific to the Seagate-SMR cleaning algorithm, by running TEST 4.
In step 1, as the drive receives random writes, it sequentially logs them to the persistent
cache as they arrive. Therefore, immediately reading the blocks back in the written
order should result in a fixed rotational delay with no seek time. During the pause
in step 3, the cleaning process moves the blocks from the persistent cache to their
native locations. As a result, reading after the pause should incur varying seek time
and rotational delay for the blocks moved by the cleaning process, whereas unmoved
blocks should still incur a fixed latency.
TEST 4: Exploring Cleaning Algorithm
1 Write 4096 random blocks.
2 Read back the blocks in the written order.
3 Pause for 10–20min.
4 Repeat steps 2 and 3.
In Figure 12, read latency is shown immediately after step 2, and then after 10, 30,
and 50min. We observe that the latency is fixed when we read the blocks immediately
after the writes. If we reread the blocks after a 10min pause, we observe random
latencies for the first 800 blocks, indicating that the cleaning process has moved
these blocks to their native locations. Since every block is expected to be on a different
band, the number of operations with random read latencies after each pause shows
the progress of the cleaning process, that is, the number of bands it has cleaned. Given
that it takes 30min to clean 3,000 bands, it takes 600ms to clean a band whose
single block has been overwritten. We also observe a growing number of cleaned blocks
in the unprocessed region (for example, operations 3,000–4,000 in the 30-min graph);
based on this behavior, we hypothesize that cleaning follows Algorithm 1.
Fig. 14. Three different scenarios triggering cleaning on drives using journal entries with quantized sizes
and extent mapping. The text on the left in the figure explains the meaning of the colors.
ALGORITHM 1: Hypothesized Cleaning Algorithm of Seagate-SMR
1 Read the next block from the persistent cache, find the block's band.
2 Scan the persistent cache identifying blocks belonging to the band.
3 Read-modify-write the band, update the map.
To test this hypothesis we run TEST 5. In Figure 13 we see that after 1 min, all of
the blocks written in step 1, some of those written in step 2, and all of those written in
step 3 have been cleaned, as indicated by the nonuniform latency, while the remainder
of step 2 blocks remain in the cache, confirming our hypothesis. After 2min all blocks
have been cleaned. (The higher latency for step 2 blocks is due to their higher mean
seek distance.)
TEST 5: Verifying the Hypothesized Cleaning Algorithm
1 Write 128 blocks from a 256MiB linear region in random order.
2 Write 128 random blocks across the LBA space.
3 Repeat step 1, using different blocks.
4 Pause for 1min; read all blocks in the written order.
4.6. Persistent Cache Size
We discover the size of the persistent cache by ensuring that the cache is empty and
then measuring how much data may be written before cleaning begins. We use random
writes across the LBA space to fill the cache, because sequential writes may fill the
drive bypassing the cache [Cassuto et al. 2010] and cleaning may never start. Also,
with sequential writes, a drive with multiple caches may fill only one of the caches
and start cleaning before all of the caches are full [Cassuto et al. 2010]. With random
writes, bypassing the cache is not possible; also, they will fill multiple caches at the
same rate and cleaning will start when all of the caches are almost full.
The simple task of filling the cache is complicated in drives using extent mapping:
a cache is considered full when the extent map is full or when the disk cache is full,
whichever happens first. The latter is further complicated by journal entries with
quantized sizes—as seen previously (Section 4.3.1), a single 4KB write may consume
as much cache space as dozens of 8KB writes. Due to this overhead, the actual size of
the disk cache is larger than what is available to host writes—we differentiate the two
by calling them persistent cache raw size and persistent cache size, respectively.
Figure 14 shows three possible scenarios on a hypothetical drive with a persistent
cache raw size of 36 blocks and a 12 entry extent map. The minimum journal entry
size is two blocks, and it grows in units of two blocks to the maximum of 16 blocks;
out-of-band data of two blocks is written with every journal entry; the persistent cache
size is 32 blocks.
Part (a) of Figure 14 shows the case of queue depth 1 and one-block writes. After the
host issues nine writes, the drive puts every write to a separate two-block journal entry,
fills the cache with nine journal entries, and starts cleaning. Every write consumes a
slot in the map, shown by the arrows. Due to low queue depth, the drive leaves one
empty block in each journal entry, wasting nine blocks. Exploiting this behavior, TEST 6
discovers the persistent cache raw size. (In this and the following tests, we detect the
start of cleaning when the IOPS drops to near zero.)
TEST 6: Discovering Persistent Cache Raw Size
1 Write with a small size and low queue depth until cleaning starts.
2 Persistent cache raw size = number of writes × (minimum journal entry size + out-of-band data size).
Part (b) of Figure 14 shows the case of queue depth 4 and one-block writes. After
the host issues 12 writes, the drive forms three four-block journal entries. Writing
these journal entries to the cache fills the map and the drive starts cleaning despite a
half-empty cache. We use TEST 7 to discover the persistent cache map size.
TEST 7: Discovering Persistent Cache Map Size
1 Write with a small size and high queue depth until cleaning starts.
2 Persistent cache map size = number of writes.
Finally, part (c) of Figure 14 shows the case of queue depth 4 and four-block writes.
After the host issues eight writes, the drive forms two 16-block journal entries, filling
the cache. Due to high queue depth and large write size, the drive is able to fill the
cache (without wasting any blocks) before the map fills. We use TEST 8 to discover the
persistent cache size.
TEST 8: Discovering Persistent Cache Size
1 Write with a large size and high queue depth until cleaning starts.
2 Persistent cache size = total host write size.
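The high-queue-depth writes that TESTS 7 and 8 call for can be generated with fio or, as sketched below, directly with Linux asynchronous I/O (libaio). The device path and the usable LBA span are placeholders, and the "batch takes longer than two seconds" heuristic for detecting the onset of cleaning is our own illustrative threshold, not part of the article's setup:

```c
/* Build with: gcc test8.c -laio */
#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define BLK  (256 * 1024)        /* large writes, as TEST 8 requires            */
#define QD   31                  /* maximum NCQ queue depth                     */
#define SPAN (4500ULL << 30)     /* usable LBA span; placeholder below capacity */

static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

int main(void)
{
    int fd = open("/dev/sdX", O_WRONLY | O_DIRECT);   /* placeholder device */
    if (fd < 0) { perror("open"); return 1; }

    io_context_t ctx = 0;
    if (io_setup(QD, &ctx)) { perror("io_setup"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 4096, BLK)) return 1;
    memset(buf, 0, BLK);

    struct iocb cbs[QD], *ptrs[QD];
    struct io_event events[QD];
    unsigned long long total = 0;
    srand(42);

    for (int batch = 0; ; batch++) {
        for (int i = 0; i < QD; i++) {
            off_t off = (off_t)(rand() % (SPAN / BLK)) * BLK;   /* random offset */
            io_prep_pwrite(&cbs[i], fd, buf, BLK, off);
            ptrs[i] = &cbs[i];
        }
        double t0 = now_ms();
        if (io_submit(ctx, QD, ptrs) != QD) { perror("io_submit"); return 1; }
        if (io_getevents(ctx, QD, QD, events, NULL) != QD) return 1;
        double dt = now_ms() - t0;
        total += (unsigned long long)QD * BLK;

        /* When the batch time jumps by an order of magnitude, cleaning has
         * started; the bytes written so far approximate the cache size.    */
        printf("batch %d  %.1f ms  %.2f GiB written\n",
               batch, dt, total / (double)(1ULL << 30));
        if (dt > 2000) break;
    }
    io_destroy(ctx);
    return 0;
}
```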
Table II shows the result of the tests on Seagate-SMR and Figure 15 shows the
corresponding graph. In the first row, we discover persistent cache raw size using
TEST 6. Writing with 4KiB size and queue depth of 1 produces a fixed 25ms latency
(Section 4.3), that is, 2.5 rotations. Hypothesizing that all of the 25ms is spent writing
and that the track size is 2MiB at the OD, 22,800 operations correspond to 100GiB.
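Spelling out the arithmetic behind this estimate, under the stated hypothesis of a 2 MiB track at the OD and a 2.5-track internal write per synchronous 4 KiB host write:

```latex
\[
22{,}800 \ \text{writes} \;\times\; 2.5 \ \tfrac{\text{tracks}}{\text{write}}
        \;\times\; 2 \ \tfrac{\text{MiB}}{\text{track}}
 \;=\; 114{,}000 \ \text{MiB} \;\approx\; 111 \ \text{GiB},
\]
```

which is on the order of the 100GiB reported in the first row of Table II.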
In rows 2 and 3 we discover the persistent cache map size using TEST 7. For write
sizes of 4 and 64KiB cleaning starts after 182,200 writes, which corresponds to 0.7
and 11.12GiB of host writes, respectively. This confirms that in both cases the drive hits
the map size limit, corresponding to scenario (b) in Figure 14. Assuming that the drive
uses a low watermark to trigger cleaning, we estimate that the map size is 200,000
entries.
In rows 4 and 5 we discover the persistent cache size using TEST 8. With 128KiB
writes we write 17GiB in fewer operations than in row 3, indicating that we are
Table II. Discovering Persistent Cache Parameters

Drive            Write Size   Queue Depth   Operation Count   Host Writes   Internal Writes
Seagate-SMR      4KiB         1             22,800            89MiB         100GiB (a)
Seagate-SMR      4KiB         31            182,270           0.7GiB        N/A
Seagate-SMR      64KiB        31            182,231           11.12GiB      N/A
Seagate-SMR      128KiB       31            137,496           16.78GiB      N/A
Seagate-SMR      256KiB       31            67,830            16.56GiB      N/A
Emulated-SMR-1   4KiB         1             9,175,056         35GiB         35GiB
Emulated-SMR-2   4KiB         1             2,464,153         9.4GiB        9.4GiB
Emulated-SMR-3   4KiB         1             9,175,056         35GiB         35GiB

(a) This estimate is based on the hypothesis that all of the 25ms during a single-block write is spent
writing to disk. While the results of the experiments indicate this to be the case, we think the
25ms latency is artificially high and expect it to drop in future drives, which would require
recalculation of this estimate.
Fig. 15. Write latency of asynchronous writes of varying sizes with queue depth of 31 until cleaning starts.
Starting from the top, the graphs correspond to lines 2–5 in Table II. When writing asynchronously, more
writes are packed into the same journal entry. Therefore, although the map merge operations still occur at
every 240th journal write, the interval seems greater than in Figure 16. For 4 and 64KiB write sizes, we
hit the map size limit first, hence cleaning starts after the same number of operations. For 128KiB write
size we hit the space limit before hitting the map size limit; therefore, cleaning starts after a smaller number
of operations than with 64KiB writes. Doubling the write size to 256KiB confirms that we are hitting the space
limit, since cleaning starts after half the number of operations of 128KiB writes.
hitting the size limit. To confirm this, we increase write size to 256KiB in row 5; as
expected, the number of operations drops by half while the total write size stays the
same. Again, assuming that the drive has hit the low watermark, we estimate that the
persistent cache size is 20GiB.
Journal entries with quantized sizes and extent mapping are absent from the aca-
demic literature on SMR, so the emulated drives implement neither feature. Running
TEST 6 on the emulated drives therefore answers all three questions at once, since in
these drives the cache is block mapped and the cache size and the cache raw size are
the same. Furthermore, the set-associative STL divides the persistent cache into cache
bands and assigns data bands to them using modulo arithmetic. Therefore, despite
having a single cache, under random writes it behaves similarly to a fully associative
cache. The bottom rows of Table II show that in emulated drives, TEST 8 discovers the
cache size (see Table I) with 95% accuracy.
4.7. Is Persistent Cache Shingled?
We next determine whether the STL manages the persistent cache as a circular log.
While this would not guarantee that the persistent cache is shingled (an STL could
also manage a random-write region as a circular log), it would strongly indicate that
Fig. 16. Write latency of 4KiB synchronous random writes, corresponding to the first line in Table II. As was
explained in Section 4.3.1, when writing synchronously the drive writes a journal entry for every write opera-
tion. Every 240th journal entry write results in a 325ms latency, which as was hypothesized in Section 4.3.1
includes a map merge operation. After 23,000 writes, cleaning starts and the IOPS drops precipitously to
0–3. To emphasize the high latency of writes during cleaning we perform 3,000 more operations. As the
graph shows, these writes (23,000–26,000) have 500ms latency.
Fig. 17. Write latency of 30,000 random block writes with a repeating pattern. We choose 10,000 random
blocks across the LBA space and write them in the chosen order. We then write the same 10,000 blocks in
the same order two more times. Unlike Figure 16, the cleaning does not start after 23,000 writes, because
due to the repeating pattern, as the head of the log wraps around, the STL only finds stale blocks that it can
overwrite without cleaning.
the persistent cache is shingled. We start with TEST 9, which chooses a sequence of
10,000 random LBAs across the drive space and writes the sequence straight through,
three times. Given that the persistent cache has space for 23,000 synchronous block
writes (Table II), a dumb STL would fill the cache and start cleaning before the writes
complete.
TEST 9: Discovering if the Persistent Cache is a Circular Log—Part I
1 Choose 10,000 random blocks across the LBA space.
2 for i ← 0 to i < 3 do
3     Write the 10,000 blocks from step 1 in the chosen order.
Figure 17 shows that unlike Figure 16, cleaning does not start after 23,000 writes.
Two simple hypotheses that explain this phenomenon are as follows:
(1) The STL manages the persistent cache as a circular log. When the head of the log
wraps around, STL detects stale blocks and overwrites without cleaning.
(2) The STL overwrites blocks in-place. Since there are 10,000 unique blocks, we never
fill the persistent cache and cleaning never starts.
To find out which one of these is true, we run TEST 10. Since there are still 10,000
unique blocks, if hypothesis (2) is true, that is, if the STL overwrites the blocks in-
place, we should never consume more than 10,000 writes’ worth of space and cleaning
should not start before the writes complete. Figure 18 shows that cleaning starts after
23,000 writes, invalidating hypothesis (2). Furthermore, if we compare Figure 18 to
Figure 16, we see that the latency of writes after cleaning starts is 100ms and 500ms,
respectively. This corroborates hypothesis (1)—latency is lower in the former, because
after the head of the log wraps around, the STL finds some stale blocks (since these
blocks were chosen from a small pool of 10,000 unique blocks), that it can overwrite
without cleaning. When the blocks are chosen across the LBA space, as in Figure 16,
Fig. 18. Write latency of 30,000 random block writes chosen from a pool of 10,000 unique blocks. Unlike
Figure 17, cleaning starts after 23,000 writes, because as the head of the log wraps around, the STL does not
immediately find stale blocks. However, since the blocks are chosen from a small pool, the STL still does find
a large number of stale blocks and can often overwrite without cleaning. Therefore, compared to Figure 16
the write latency during cleaning (operations 23,000–26,000) is not as high, since in Figure 16 the blocks are
chosen across the LBA space and the STL almost never finds a stale block when the head of the log wraps
around.
once the head wraps around, the STL ends up effectively cleaning before every write
since it almost never finds a stale block.
TEST 10: Discovering if the Persistent Cache is a Circular Log—Part II
1 Choose 10,000 random blocks across the LBA space.
2 for i ← 0 to i < 30,000 do
3     Randomly choose a block from the blocks in step 1 and write.
4.8. Band Size
STLs proposed to date [Amer et al. 2010; Cassuto et al. 2010; Hall 2014] clean a single
band at a time, by reading unmodified data from a band and updates from the cache,
merging them, and writing the merge result back to a band. TEST 11 determines the
band size, by measuring the granularity at which this cleaning process occurs.
TEST 11: Discovering the Band Size
1 Select an accuracy granularity a, and a band size estimate b.
2 Choose a linear region of size 100 × b and divide it into a-sized blocks.
3 Write 4KiB to the beginning of every a-sized block, in random order.
4 Force cleaning to run for a few seconds and read 4KiB from the beginning of every a-sized block in sequential order.
5 Consecutive reads with identical high latency identify a cleaned band.
Assuming that the linear region chosen in TEST 11 lies within a region of equal track
length, for data that is not in the persistent cache, 4KB reads at a fixed stride a should
see identical latencies—that is, a rotational delay equivalent to (a mod T) bytes, where
T is the track length. Conversely, reads of data from cache will see varying delays in
the case of a disk cache due to the different (and random) order in which they were
written or submillisecond delays in the case of a flash cache.
With aggressive cleaning, after pausing to allow the disk to clean a few bands, a
linear read of the written blocks will identify the bands that have been cleaned. For
a drive with lazy cleaning the linear region is chosen so that writes fill the persistent
cache and force a few bands to be cleaned, which again may be detected by a linear
read of the written data.
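For an aggressive-cleaning drive, TEST 11 reduces to the sketch below; a lazy-cleaning drive needs the cache-filling variant just described. The device path, the 10-second pause, and the fixed a, b, and offset values are placeholders, and error handling is trimmed:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define A      (1ULL << 20)          /* accuracy granularity a = 1 MiB       */
#define B_EST  (50ULL << 20)         /* band size estimate b = 50 MiB        */
#define REGION (100 * B_EST)         /* test region of 100 x b               */
#define OFFSET (2500ULL << 30)       /* region offset, ~2.5 TB (placeholder) */
#define BLK    4096
#define N      ((int)(REGION / A))   /* number of a-sized slots              */

static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

int main(void)
{
    int fd = open("/dev/sdX", O_RDWR | O_DIRECT);    /* placeholder device   */
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, BLK, BLK)) return 1;
    memset(buf, 0, BLK);

    /* Step 3: 4 KiB write to the start of every a-sized slot, random order.  */
    int order[N];
    for (int i = 0; i < N; i++) order[i] = i;
    srand(42);
    for (int i = N - 1; i > 0; i--) {                /* Fisher-Yates shuffle  */
        int j = rand() % (i + 1), t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (int i = 0; i < N; i++)
        pwrite(fd, buf, BLK, OFFSET + (off_t)order[i] * A);

    sleep(10);   /* Step 4: let aggressive cleaning process a few bands       */

    /* Steps 4-5: sequential strided reads; a run of slots with identical high
     * latency has been cleaned, and the run length times a is the band size. */
    for (int i = 0; i < N; i++) {
        double t0 = now_ms();
        pread(fd, buf, BLK, OFFSET + (off_t)i * A);
        printf("%d %.2f\n", i, now_ms() - t0);
    }
    return 0;
}
```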
In Figure 19, we see the results of TEST 11 for a = 1MiB and b = 50MiB,
with the region located at the 2.5TB offset; for each drive we zoom in to show an
individual band that has been cleaned. We correctly identify the band size for the
Fig. 19. Discovering band size.
Fig. 20. Head position during the sequential read for Seagate-SMR, corresponding to the time period in Figure 19.
emulated drives (see Table I). The band size of Seagate-SMR at this location is seen to
be 30MiB; running tests at different offsets shows that bands are isocapacity within a
zone (Section 4.12) but vary from 36MiB at the OD to 17MiB at the ID.
Figure 20 shows the head position of Seagate-SMR corresponding to the time period
in Figure 19. It shows that the head remains at the OD during the reads from the
persistent cache up to 454MiB, then seeks to the 2.5TB offset and stays there while
30MiB is read, and then seeks back to the cache at the OD, confirming that the blocks
in the band are read from their native locations.
4.9. Cleaning Time of a Single Band
We observed that cleaning a band in which a single block was overwritten can take 600ms,
whereas if we overwrite 2MiB of the band by skipping every other block, the cleaning time
increases to 1.6s (Section 4.5). While the 600ms cleaning time due to a single block
overwrite gives us a lower bound on the cleaning time, we do not know the upper bound.
Now that we understand the persistent cache structure and band size, in addition to
the cleaning algorithm, we create an adversarial workload that will give us an upper
bound for the cleaning time of a single band.
Table II shows that with a queue depth of 31, we can write 182,270 blocks, that
is, 5,880 journal entries, resulting in 700MiB of host writes. Assuming the band size is
35MiB at the OD, 700MiB corresponds to 20 bands. Therefore, if we distribute (through
random writes) the blocks of 20 bands among 5,880 journal entries, the drive will need
to read every packet to clean a single band. Assuming a 5–10ms read time per packet,
reading all of the packets to assemble a band will take 29–60s.
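The arithmetic behind this estimate, with the 5–10ms per-packet read time as the stated assumption:

# Worst-case cleaning time estimate for a single band (figures taken from
# Table II and Section 4.8; the per-packet read time is an assumption).
blocks_written = 182_270                # blocks accepted before the cache fills (QD 31)
journal_entries = 5_880                 # packets holding those blocks
host_writes_mib = 700
band_size_mib = 35                      # band size near the OD
bands_touched = host_writes_mib // band_size_mib
print("bands touched:", bands_touched)  # 700 MiB / 35 MiB = 20
for packet_read_ms in (5, 10):
    print(packet_read_ms, "ms/packet ->",
          round(journal_entries * packet_read_ms / 1000), "s to assemble one band")
# prints 29 s and 59 s, matching the 29-60 s estimate above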
To confirm this hypothesis, we shuffled the first 700MiB worth of blocks and wrote
them with a queue depth of 31. The cleaning took 15min, which is 45s per band.
4.10. Block Mapping
Once we discover the band size (Section 4.8), we can use TEST 12 to determine the
mapping type. This test exploits varying intertrack switching latency between different
track pairs to detect if a band was remapped. After overwriting the first two tracks of
band b, cleaning will move the band to its new location—a different physical location
only if dynamic mapping is used. Plotting latency graphs of step 2 and step 4 will
produce the same pattern for the static mapping and a different pattern for the dynamic
mapping.
Fig. 21. Detecting mapping type.
TEST 12: Discovering Mapping Type
1 Choose two adjacent isocapacity bands a and b; set n to the number of blocks in a track.
2 for i ← 0 to i < 2 do
3     for j ← 0 to j < n do
          Read block j of track 0 of band a
          Read block j of track i of band b
4 Overwrite the first two tracks of band b; force cleaning to run.
5 Repeat step 2.
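A sketch of the read phase of TEST 12 (steps 2, 3, and 5), using the same surrogate sizes as Figure 21 (2MiB "tracks" and 16KiB "blocks") and the 30MiB band size found in Section 4.8; the device path and band offsets are placeholders:

# Sketch of TEST 12 read passes: alternate reads between track 0 of band a and
# track i of band b, recording the track-switch latencies before and after cleaning.
import mmap, os, time

DEV = "/dev/sdX"                            # placeholder device path
TRACK, BLK = 2 << 20, 16 << 10              # surrogate track and block sizes (Figure 21)
BAND = 30 << 20                             # band size from Section 4.8
BAND_A = (2_500_000_000_000 // BLK) * BLK   # placeholder offset of band a, aligned
BAND_B = BAND_A + BAND                      # band b is assumed adjacent

fd = os.open(DEV, os.O_RDONLY | os.O_DIRECT)
rbuf = mmap.mmap(-1, BLK)                   # page-aligned buffer for O_DIRECT reads
n = TRACK // BLK                            # blocks per (surrogate) track

def read_pass(label):                       # steps 2-3: alternate band a / band b reads
    for i in range(2):
        for j in range(n):
            os.preadv(fd, [rbuf], BAND_A + j * BLK)              # block j, track 0, band a
            t0 = time.perf_counter()
            os.preadv(fd, [rbuf], BAND_B + i * TRACK + j * BLK)  # block j, track i, band b
            print(label, i, j, round((time.perf_counter() - t0) * 1000, 3))

read_pass("before")
# step 4: overwrite the first two tracks of band b and force cleaning, then:
read_pass("after")                          # step 5: compare the two latency patterns
os.close(fd)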
Adapting this test to a drive with lazy cleaning involves some extra work. First, we
should start the test on a drive after a secure erase, so that the persistent cache is
empty. Due to lazy cleaning, the graph of step 5 will show the latency of switching between
a track and the persistent cache. Therefore, we fill the cache until cleaning starts
and repeat step 2 periodically, comparing its graph to the previous two: if it is
similar to the last, then the data is still in the cache; if it is similar to the first, then the
drive uses static mapping; otherwise, the drive uses dynamic mapping.
We used the terms track and block to describe the preceding test concisely, but the sizes
chosen for these parameters need not match the track and block sizes of
the underlying drive. Figure 21, for example, shows the plots for the test on all of the
drives using 2MiB for the track size and 16KiB for the block size. The latency pattern
before and after cleaning differs only for Emulated-SMR-3 (seen on the top right),
correctly indicating that it uses dynamic mapping. For all of the remaining drives,
including Seagate-SMR, the latency pattern is the same before and after cleaning,
indicating static mapping.
4.11. Effect of Mapping Type on Drive Reliability
The type of band mapping used in an SMR drive affects the drive reliability for the
reasons explained next. When we enable the volatile cache on Seagate-SMR, it sustains
full throughput for sequential writes. Since Seagate-SMR does not contain flash and uses
static mapping, it can achieve full throughput only by buffering the data in
the volatile cache and writing directly to the band, bypassing the persistent cache.
This performance improvement, however, comes with a risk of data loss. Since there is
no backup of the overwritten data, if power is lost midway through the band overwrite,
blocks in the following tracks are left in a corrupt state, resulting in data loss. We also
lose the new data since it was buffered in the volatile cache.
A similar error, known as torn write [Bairavasundaram et al. 2008; Krioukov et al.
2008], occurs in CMR drives as well, wherein only a portion of a sector gets written
ACM Transactions on Storage, Vol. 11, No. 4, Article 16, Publication date: October 2015.
Skylight—A Window on Shingled Disk Operation 16:21
Fig. 22. Torn write scenarios in a hypothetical SMR drive with bands consisting of three tracks and a
write head width of 1.5 tracks. The tracks are shown horizontally instead of circularly to make the illustration
clear. (1) shows the logical view of the band consisting of three tracks, each track having 4,000 sectors.
(2) shows the physical layout of the tracks on a platter that accounts for the track skew. (3a) and (3b) show
the corruption scenario when the power is lost during the track switch. The red region in (3a) shows the
situation after track 1 of the band has been overwritten—track 1 contains new data, whereas track 2 is
corrupted; track 3 contains the old data. (3b) shows the situation after track 2 has been overwritten—track 1
and track 2 contain new data, whereas track 3 is corrupted. If power is lost while the head switches from
track 2 to track 3, block ranges 10,000–11,999 and 8,000–9,999, or the single range of 8,000–11,999 is left in
a corrupt state. (4a) and (4b) show the corruption scenario when the power is lost during the track overwrite.
(4a) is identical to (3a) and shows the situation after track 1 of the band has been overwritten. (4b) shows
the situation where power is lost after blocks 4,000–4,999 have been overwritten. In this case, block ranges
7,000–7,999, 5,000–6,999, and 11,000–11,999, or two ranges of 5,000–7,999 and 11,000–11,999 are left in a
corrupt state.
before power is lost. In CMR drives, the time required to atomically overwrite a sector
is small enough that reports of such errors are rare [Krioukov et al. 2008]. An SMR
drive with static mapping, on the other hand, is similar to a CMR drive with large
(in the case of Seagate-SMR, 17–36MiB) sectors. Therefore, there is a high probability
that a random power loss during streaming sequential writes will disrupt a band
overwrite.
Figure 22 describes two error scenarios in a hypothetical SMR drive that uses static
mapping. These errors are a consequence of the mapping scheme used, since the
only way to sustain full throughput with such a scheme is to write to the band directly.
Introducing a small amount of flash to an SMR drive for persistent buffering has its own
challenges—exploiting parallelism for fast flash writes and managing wear leveling is
possible only if large amounts of flash are used, which is not feasible inside an SMR
drive. On the other hand, when using a dynamic band mapping scheme, similar to fully
associative STL, a drive can write the new contents of a band directly to a free band
without jeopardizing the existing band data. This, followed by an atomic switch in the
mapping table, would result in full-throughput sequential writes without sacrificing
reliability. The idea is similar to log-block FTLs [Kim et al. 2002; Park et al. 2008]
that have been successful in overcoming slow block overwrites in NAND flash. For the
reasons described, we expect that the next generation of SMR drives will use dynamic
band mapping.
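To make the idea concrete, here is a minimal sketch of dynamic band mapping with an atomic map switch; the data structures are illustrative, not the drive's actual implementation:

# Sketch of dynamic band mapping: a band overwrite goes to a free physical band,
# and only then is the logical-to-physical map entry switched atomically.
class DynamicBandMap:
    def __init__(self, nbands, nspare):
        self.map = {lb: lb for lb in range(nbands)}       # logical band -> physical band
        self.free = list(range(nbands, nbands + nspare))  # spare physical bands

    def overwrite_band(self, logical_band, write_band_fn):
        new_pb = self.free.pop()              # pick a free physical band
        write_band_fn(new_pb)                 # write the new contents there first
        old_pb = self.map[logical_band]
        self.map[logical_band] = new_pb       # atomic map switch; old data intact until now
        self.free.append(old_pb)              # the old physical band becomes free

# Example: overwriting logical band 7 never leaves band 7's data half-written.
m = DynamicBandMap(nbands=100, nspare=4)
m.overwrite_band(7, lambda pb: print("writing new contents to physical band", pb))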
We successfully reproduced torn writes on Seagate-SMR by using an Arduino UNO
R3 board with a two-channel relay shield to control the power to the drive. After
running TEST 13 at arbitrary offsets, we could reproduce hard read errors, as shown
in Figure 23 on all of our sample drives. The offset where errors occurred differed
between drives. These errors disappeared after overwriting the affected regions.
Fig. 23. Hard read error under Linux kernel 3.16 when reading a region affected by a torn write.
Fig. 24. Sequential read throughput of Seagate-SMR.
Fig. 25. Seagate-SMR head position during sequential reads at different offsets.
TEST 13: Reproducing Torn Writes
1 Choose an offset.
2 for i ← 0 to i < 50 do
3     Power on the drive and start 1MiB sequential writes at offset.
4     After 10s power off the drive; wait for 5s and power on the drive.
5     Starting at the crash point, go back 5,000 blocks and read 6,000 blocks.
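A host-side sketch of TEST 13; set_power() is a hypothetical placeholder for the relay control (an Arduino with a relay shield in our setup), and the device path and starting offset are placeholders as well. The volatile write cache is left enabled so that writes bypass the persistent cache:

# Sketch of TEST 13: repeated power cycles during sequential writes, followed by
# a read-back around the crash point looking for hard read errors.
import mmap, os, time

DEV, MIB, BLK = "/dev/sdX", 1 << 20, 512          # placeholder device path, 512B blocks
OFFSET = 1_000_000_000_000                        # placeholder starting offset

def set_power(on):
    # Hypothetical placeholder for the relay control: prompt for a manual toggle.
    input(f"switch drive power {'on' if on else 'off'}, then press Enter ")

wbuf = mmap.mmap(-1, MIB); wbuf.write(os.urandom(MIB))
rbuf = mmap.mmap(-1, BLK)

for i in range(50):
    set_power(True)
    wfd = os.open(DEV, os.O_WRONLY | os.O_DIRECT)
    off, t0 = OFFSET, time.time()
    while time.time() - t0 < 10:                  # 1MiB sequential writes for 10s
        os.pwrite(wfd, wbuf, off)
        off += MIB
    os.close(wfd)                                 # no flush: data may still sit in the volatile cache
    set_power(False)                              # cut power while the drive destages to the band
    time.sleep(5)
    set_power(True)
    rfd = os.open(DEV, os.O_RDONLY | os.O_DIRECT)
    start = off - 5000 * BLK                      # go back 5,000 blocks, read 6,000
    for j in range(6000):
        try:
            os.preadv(rfd, [rbuf], start + j * BLK)
        except OSError as err:                    # a torn write surfaces as a hard read error
            print("read error at LBA", (start + j * BLK) // BLK, err)
    os.close(rfd)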
4.12. Zone Structure
We use sequential reads (TEST 14) to discover the zone structure of Seagate-SMR.
While no such drives exist yet, on drives with dynamic mapping a secure erase
that restores the mapping to its default state may be necessary for this test to
work. Figure 24 shows the zone profile of Seagate-SMR, with a zoom in on the beginning.
TEST 14: Discovering Zone Structure
1 Enable kernel read-ahead and drive look-ahead.
2 Sequentially read the whole drive in 1MiB blocks.
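A sketch of TEST 14; the device path is a placeholder, and read-ahead is assumed to have been enabled beforehand (for example, with blockdev --setra and hdparm -A1):

# Sketch of TEST 14: sequential 1MiB reads over the whole drive, logging
# per-second throughput.
import os, time

DEV, MIB = "/dev/sdX", 1 << 20                  # placeholder device path
fd = os.open(DEV, os.O_RDONLY)                  # buffered reads, so kernel read-ahead applies
last, window = time.time(), 0

while True:
    data = os.read(fd, MIB)                     # step 2: sequential 1MiB reads
    if not data:
        break                                   # end of device
    window += len(data)
    now = time.time()
    if now - last >= 1:                         # one throughput sample per second (Figure 24)
        print(round(now), round(window / (now - last) / MIB, 1), "MiB/s")
        window, last = 0, now
os.close(fd)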
Similar to CMR drives, the throughput falls as we reach higher LBAs; unlike CMR
drives, there is a pattern that repeats throughout the graph, shown by the zoomed
part. This pattern has an axis of symmetry indicated by the dotted vertical line at
the 2,264th second. There are eight distinct plateaus to the left and to the right of the
axis with similar throughputs. The fixed throughput in a single plateau and a sharp
change in throughput between plateaus suggest a wide radial stroke and a head switch.
Fig. 26. Sequential read latency of Seagate-CMR (top) and Seagate-SMR (bottom) corresponding to a com-
plete cycle of ascent and descent through platter surfaces. While Seagate-CMR completes the cycle in 3.5s,
Seagate-SMR completes it in 1800s, since the latter reads thousands of tracks from a single surface before
switching to the next surface.
Plateaus correspond to large zones of 18–20GiB, gradually decreasing to 4GiB as
we approach higher LBAs. The slight decrease in throughput in the symmetric plateaus on
the right is due to moving from larger to smaller radii, where the sector-per-track count
decreases; therefore, throughput decreases as well.
We confirmed these hypotheses using the head position graph shown in Figure 25(a),
which corresponds to the time interval of the zoomed graph of Figure 24. Unlike with
CMR drives, where we could not observe head switches due to narrow radial strokes,
with this SMR drive head switches are visible to the unaided eye. Figure 25(a) shows
that the head starts at the OD and slowly moves toward the MD, completing this
inward move at the 1,457th second, indicated by the vertical dotted line. At this
point, the head has just completed a wide radial stroke reading gigabytes from the
top surface of the first platter, and it performs a jump back to the OD and starts a
similar stroke on the bottom surface of the first platter. The direction of the head
movement indicates that the shingling direction is toward the ID at the OD. The head
completes the descent through the platters at the 2,264th second—indicated by the
vertical solid line—and starts its ascent reading surfaces in the reverse order. These
wide radial strokes create “horizontal zones” that consist of thousands of tracks on
the same surface, as opposed to “vertical zones” spanning multiple platters in CMR
drives. We expect these horizontal zones to be the norm in SMR drives, since they
facilitate SMR mechanisms like allocation of isocapacity bands, static mapping, and
dynamic band size adjustment [Feldman 2011]. Figure 25(b) corresponds to the end
of Figure 24, and shows that the direction of the head movement is reversed at the
ID, indicating that both at the OD and at the ID, shingling direction is toward the
middle diameter. To our surprise, Figure 25(c) shows that a conventional serpentine
layout with wide serpents is used at the MD. We speculate that although the whole
surface is managed as if it is shingled, there is a large region in the middle that is not
shingled.
It is hard to confirm the shingling direction without observing the head movement.
The existence of “horizontal zones,” on the other hand, can also be confirmed by con-
trasting the sequential latency graphs of Seagate-SMR and Seagate-CMR. The bottom
of Figure 26 shows the latency graph for the zoomed region in Figure 24. As expected,
the shape of the latency graph matches the shape of the throughput graph mirrored
around the x axis. The top of Figure 26 shows an excerpt from the latency graph
of Seagate-CMR that is also repeated throughout the latency graph. This graph too
has a pattern that is mirrored at the center, also indicating a completed ascent and
descent through the surfaces. However, Seagate-CMR completes the cycle in 3.5s since
it reads only a few tracks from each surface, whereas Seagate-SMR completes the cycle
in 1800s, indicating that it reads thousands of tracks from a single surface.
Smaller spikes in the graph of Seagate-CMR correspond to track switches, and higher
spikes correspond to head switches. While the extra 1ms head switch latency every few
megabytes does not affect the accuracy of emulation, it shows up in some of the tests,
for example, as the bump around the 4,030th MiB in Figure 19. Figure 26 also shows that
the number of platters can be inferred from the latency graph of sequential reads.
5. RELATED WORK
Little has been published on the subject of system-level behavior of SMR drives. Al-
though several works (for example, Amer et al. [2010] and Le et al. [2011]) have
discussed requirements and possibilities for use of shingled drives in systems, only
three papers to date—Cassuto et al. [2010], Lin et al. [2012], and Hall et al. [2012]—
present example translation layers and simulation results. A range of STL approaches
is found in the patent literature [Coker and Hall 2013; Fallone and Boyle 2013; Feldman 2011;
Hall 2014], but evaluation and analysis are lacking. Several SMR-specific file
systems have been proposed, such as SMRfs [Gibson and Polte 2009], SFS [Le Moal
et al. 2012], and HiSMRfs [Jin et al. 2014]. He and Du [2014] propose a static mapping
to minimize rewrites for in-place updates, which requires high guard overhead (20%)
and assumes file system free space is contiguous in the upper LBA region. Pitchumani
et al. [2012] present an emulator implemented as a Linux device mapper target that
mimics shingled writing on top of a CMR drive. Tan et al. [2013] describe a simulation
of the S-blocks algorithm, with a more accurate simulator calibrated with data from a real
CMR drive. To date, no work (to the authors’ knowledge) has presented measurements
of read and write operations on an SMR drive, or performance-accurate emulation
of STLs.
This work draws heavily on earlier disk characterization studies that have used mi-
crobenchmarks to elicit details of internal performance, such as Schlosser et al. [2005],
Gim and Won [2010], Krevat et al. [2011], Talagala et al. [1999], and Worthington et al.
[1995]. Due to the presence of a translation layer, however, the specific parameters
examined in this work (and the microbenchmarks for measuring them) are different.
6. CONCLUSIONS AND RECOMMENDATIONS
As Table III shows, the Skylight methodology enables us to discover key properties of
two drive-managed SMR disks automatically. With manual intervention, it allows us
to completely reverse engineer a drive. The purpose of doing so is not just to satisfy our
curiosity, however, but to guide both the use and evolution of such drives. In particular, we draw
the following conclusions from our measurements of the 5TB Seagate drive:
—Write latency with the volatile cache disabled is high (TEST 1). This appears to be
an artifact of specific design choices rather than fundamental requirements, and we
hope for it to drop in later firmware revisions.
—Sequential throughput (with the volatile cache disabled) is much lower (by 3×or
more, depending on write size) than for conventional drives. (We omitted these test
results, as performance is identical to the random writes in TEST 1.) Due to the use
of static mapping (TEST 12), achieving full sequential throughput requires enabling
the volatile cache.
—Random I/O throughput (with the volatile cache enabled or with high queue depth)
is high (TEST 7)—15×that of the equivalent CMR drive. This is a general property
of any SMR drive using a persistent cache.
Table III. Properties of the 5 and the 8TB Seagate Drives Discovered Using the Skylight Methodology

                              Drive Model
Property                      ST5000AS0011         ST8000AS0002 (a)
Drive Type                    SMR                  SMR
Persistent Cache Type         Disk                 Disk
Cache Layout and Location     Single, at the OD    Single, at the OD
Usable Cache Size             20GiB                25GiB
Cache Map Size                200,000              250,000
Band Size                     17–36MiB             15–40MiB
Block Mapping                 Static               Static
Cleaning Type                 Aggressive           Aggressive
Cleaning Algorithm            FIFO                 FIFO
Cleaning Time                 0.6–45s/band         0.6–45s/band
Zone Structure                4–20GiB              5–40GiB
Shingling Direction           Toward MD            N/A
(a) The benchmarks worked out of the box on the 8TB drive. Since the 8TB drive was on loan, we did not
drill a hole in it; therefore, the shingling direction for it is not available.
—Throughput may degrade precipitously when the cache fills after many writes
(Table II). The point at which this occurs depends on write size and queue depth.
(Although results with the volatile cache enabled are not presented in Section 4.6,
they are similar to those for a queue depth of 31.)
—Background cleaning begins after 1s of idle time, and proceeds in steps requiring
0.6–45s of idle time to clean a single band (Section 4.9).
—Sequential reads of randomly-written data will result in random-like read perfor-
mance until cleaning completes (Section 4.4).
In summary, SMR drives like the ones we studied should offer good performance if
the following conditions are met: (a) the volatile cache is enabled or a high queue depth
is used, (b) writes display strong spatial locality, modifying only a few bands at any
particular time, (c) nonsequential writes (or all writes, if the volatile cache is disabled)
occur in bursts of less than 16GB or 180,000 operations (Table II), and (d) long powered-
on idle periods are available for background cleaning. From the use of aggressive
cleaning that presumes long idle periods, we may conclude that the drive is adapted to
desktop use, but may perform poorly on server workloads. Further work will include
investigation of STL algorithms that may offer a better balance of performance for both.
ACKNOWLEDGMENTS
The authors would like to thank the anonymous reviewers, Kimberly Keeton, Tim Feldman, and Remzi
Arpaci-Dusseau for their feedback on the FAST version of the article.
REFERENCES
Ahmed Amer, Darrell D. E. Long, Ethan L. Miller, Jehan-Francois Paris, and S. J. Thomas Schwarz. 2010.
Design issues for a shingled write disk system. In Proceedings of the 2010 IEEE 26th Symposium on
Mass Storage Systems and Technologies (MSST) (MSST’10). IEEE Computer Society, Washington, DC,
1–12. DOI:http://dx.doi.org/10.1109/MSST.2010.5496991
Jens Axboe. 2015. Flexible I/O Tester. git://git.kernel.dk/fio.git.
Lakshmi N. Bairavasundaram, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Garth R. Goodson,
and Bianca Schroeder. 2008. An analysis of data corruption in the storage stack. Trans. Storage 4, 3,
Article 8 (Nov. 2008), 28 pages. DOI:http://dx.doi.org/10.1145/1416944.1416947
Luc Bouganim, Björn Jónsson, and Philippe Bonnet. 2009. uFLIP: Understanding flash IO patterns. In
Proceedings of the International Conference on Innovative Data Systems Research (CIDR). Asilomar,
California.
Yuval Cassuto, Marco A. A. Sanvido, Cyril Guyot, David R. Hall, and Zvonimir Z. Bandic. 2010. Indirection
systems for shingled-recording disk drives. In Proceedings of the 2010 IEEE 26th Symposium on Mass
Storage Systems and Technologies (MSST) (MSST’10). IEEE Computer Society, Washington, DC, 1–14.
DOI:http://dx.doi.org/10.1109/MSST.2010.5496971
Feng Chen, David A. Koufaty, and Xiaodong Zhang. 2009. Understanding intrinsic characteristics and system
implications of flash memory based solid state drives. In Proceedings of the 11th International Joint
Conference on Measurement and Modeling of Computer Systems (SIGMETRICS’09). ACM, New York,
NY, 181–192. DOI:http://dx.doi.org/10.1145/1555349.1555371
Jonathan Darrel Coker and David Robison Hall. 2013. Indirection memory architecture with reduced memory
requirements for shingled magnetic recording devices. (Nov. 5, 2013). US Patent 8,578,122.
Linux Device-Mapper. 2001. Device-Mapper Resource Page. https://sourceware.org/dm/.
Elizabeth A. Dobisz, Z. Z. Bandic, Tsai-Wei Wu, and T. Albrecht. 2008. Patterned media: Nanofabrication
challenges of future disk drives. Proc. IEEE 96, 11 (Nov. 2008), 1836–1846. DOI:http://dx.doi.org/10.1109/
JPROC.2008.2007600
DRAMeXchange. 2014. NAND Flash Spot Price. (Sept. 2014). http://dramexchange.com.
Robert M. Fallone and William B. Boyle. 2013. Data storage device employing a run-length mapping table
and a single address mapping table. (May 14, 2013). US Patent 8,443,167.
Tim Feldman. 2014a. Host-aware SMR. OpenZFS Developer Summit. Available from https://www.youtube.com/watch?v=b1yqjV8qemU.
Tim Feldman. 2014b. Personal communication. (Aug. 2014).
Tim Feldman and Garth Gibson. 2013. Shingled magnetic recording: Areal density increase requires new
data management. USENIX 38, 3 (2013).
Timothy Richard Feldman. 2011. Dynamic storage regions. (Feb. 14, 2011). US Patent Appl. 13/026,535.
Garth Gibson and Greg Ganger. 2011. Principles of Operation for Shingled Disk Devices. Technical Report
CMU-PDL-11-107. CMU Parallel Data Laboratory. http://repository.cmu.edu/pdl/7.
Garth Gibson and Milo Polte. 2009. Directions for Shingled-Write and Two-Dimensional Magnetic Record-
ing System Architectures: Synergies with Solid-State Disks. Technical Report CMU-PDL-09-104. CMU
Parallel Data Laboratory. http://repository.cmu.edu/pdl/7.
Jongmin Gim and Youjip Won. 2010. Extract and infer quickly: Obtaining sector geometry of modern
hard disk drives. ACM Trans. Storage 6, 2, Article 6 (July 2010), 26 pages. DOI:http://dx.doi.org/10.1145/1807060.1807063
David Hall, John H. Marcos, and Jonathan D. Coker. 2012. Data handling algorithms for autonomous
shingled magnetic recording HDDs. IEEE Trans. Magn. 48, 5, 1777–1781.
David Robison Hall. 2014. Shingle-written magnetic recording (SMR) device with hybrid E-region. (April 1,
2014). US Patent 8,687,303.
Weiping He and David H. C. Du. 2014. Novel address mappings for shingled write disks. In Proceedings
of the 6th USENIX Conference on Hot Topics in Storage and File Systems (HotStorage'14). USENIX
Association, Berkeley, CA, 5–5. http://dl.acm.org/citation.cfm?id=2696578.2696583
HGST. 2014. HGST Unveils Intelligent, Dynamic Storage Solutions to Transform the Data Center. (Sept.
2014). Available from http://www.hgst.com/press-room/.
INCITS T10 Technical Committee. 2014. Information technology—Zoned Block Commands (ZBC). Draft
Standard T10/BSR INCITS 536. American National Standards Institute, Inc. Available from http://
www.t10.org/drafts.htm.
Chao Jin, Wei-Ya Xi, Zhi-Yong Ching, Feng Huo, and Chun-Teck Lim. 2014. HiSMRfs: A high performance
file system for shingled storage array. In Proceedings of the 2014 IEEE 30th Symposium on Mass Storage
Systems and Technologies (MSST). 1–6. DOI:http://dx.doi.org/10.1109/MSST.2014.6855539
Jesung Kim, Jong Min Kim, S. H. Noh, Sang Lyul Min, and Yookun Cho. 2002. A space-efficient flash
translation layer for CompactFlash systems. IEEE Trans. Consumer Electron. 48, 2 (May 2002), 366–
375. DOI:http://dx.doi.org/10.1109/TCE.2002.1010143
Elie Krevat, Joseph Tucek, and Gregory R. Ganger. 2011. Disks are like snowflakes: No two are alike. In
Proceedings of the 13th USENIX Conference on Hot Topics in Operating Systems (HotOS'13). USENIX
Association, Berkeley, CA, 14–14. http://dl.acm.org/citation.cfm?id=1991596.1991615
Andrew Krioukov, Lakshmi N. Bairavasundaram, Garth R. Goodson, Kiran Srinivasan, Randy Thelen,
Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2008. Parity lost and parity regained. In
Proceedings of the 6th USENIX Conference on File and Storage Technologies (FAST’08). USENIX Asso-
ciation, Berkeley, CA, Article 9, 15 pages. http://dl.acm.org/citation.cfm?id=1364813.1364822
Mark H. Kryder, Edward C. Gage, Terry W. McDaniel, William A. Challener, Robert E. Rottmayer, Ganping
Ju, Yiao-Tee Hsia, and M. Fatih Erden. 2008. Heat assisted magnetic recording. Proc. IEEE 96, 11 (Nov.
2008), 1810–1835. DOI:http://dx.doi.org/10.1109/JPROC.2008.2004315
Quoc M. Le, Kumar Sathyanarayana Raju, Ahmed Amer, and JoAnne Holliday. 2011. Workload impact
on shingled write disks: All-writes can be alright. In Proceedings of the 2011 IEEE 19th Annual
International Symposium on Modelling, Analysis, and Simulation of Computer and Telecommunica-
tion Systems (MASCOTS’11). IEEE Computer Society, Washington, DC, 444–446. DOI:http://dx.doi.org/
10.1109/MASCOTS.2011.58
Damien Le Moal, Zvonimir Bandic, and Cyril Guyot. 2012. Shingled file system host-side management
of Shingled magnetic recording disks. In Proceedings of the 2012 IEEE International Conference on
Consumer Electronics (ICCE). 425–426. DOI:http://dx.doi.org/10.1109/ICCE.2012.6161799
Libata FAQ. 2011. https://ata.wiki.kernel.org/index.php/Libata_FAQ.
Chung-I Lin, Dongchul Park, Weiping He, and David H. C. Du. 2012. H-SWD: Incorporating hot data
identification into Shingled write disks. In Proceedings of the 2012 IEEE 20th International Symposium
on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS’12).
IEEE Computer Society, Washington, DC, 321–330. DOI:http://dx.doi.org/10.1109/MASCOTS.2012.44
Chanik Park, Wonmoon Cheon, Jeonguk Kang, Kangho Roh, Wonhee Cho, and Jin-Soo Kim. 2008. A reconfig-
urable FTL (Flash Translation Layer) architecture for NAND flash-based applications. ACM Trans. Em-
bed. Comput. Syst. 7, 4, Article 38 (Aug. 2008), 23 pages. DOI:http://dx.doi.org/10.1145/1376804.1376806
S. N. Piramanayagam. 2007. Perpendicular recording media for hard disk drives. J. Appl. Phys. 102, 1 (July
2007), 011301. DOI:http://dx.doi.org/10.1063/1.2750414
Rekha Pitchumani, Andy Hospodor, Ahmed Amer, Yangwook Kang, Ethan L. Miller, and Darrell D. E. Long.
2012. Emulating a Shingled write disk. In Proceedings of the 2012 IEEE 20th International Symposium
on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS’12).
IEEE Computer Society, Washington, DC, 339–346. DOI:http://dx.doi.org/10.1109/MASCOTS.2012.46
Sundar Poudyal. 2013. Partial write system. (March 13, 2013). US Patent Appl. 13/799,827.
Drew Riley. 2013. Samsung’s SSD Global Summit: Samsung: Flexing Its Dominance In The NAND Market.
(Aug. 2013). http://www.tomshardware.com/reviews/samsung-global-ssd-summit-2013,3570.html.
Mendel Rosenblum and John K. Ousterhout. 1991. The design and implementation of a log-structured file
system. In Proceedings of the 13th ACM Symposium on Operating Systems Principles (SOSP’91). ACM,
New York, NY, 1–15. DOI:http://dx.doi.org/10.1145/121132.121137
SATA-IO. 2011. Serial ATA Revision 3.1 Specification. Technical Report. SATA-IO.
Steven W. Schlosser, Jiri Schindler, Stratos Papadomanolakis, Minglong Shao, Anastassia Ailamaki, Christos
Faloutsos, and Gregory R. Ganger. 2005. On multidimensional data and modern disks. In Proceedings
of the 4th Conference on USENIX Conference on File and Storage Technologies—Volume 4 (FAST’05).
USENIX Association, Berkeley, CA, 17–17. http://dl.acm.org/citation.cfm?id=1251028.1251045
Seagate 2013a. Seagate Desktop HDD: ST5000DM000, ST4000DM001. Product Manual 100743772. Seagate
Technology LLC.
Seagate 2013b. Seagate Technology PLC Fiscal Fourth Quarter and Year End 2013 Financial Results Sup-
plemental Commentary. (July 2013). Available from http://www.seagate.com/investors.
Seagate 2013c. Terascale HDD. Data sheet DS1793.1-1306US. Seagate Technology PLC.
Seagate 2014. Seagate Ships World's First 8TB Hard Drives. (Aug. 2014). Available from http://www.seagate.com/about/newsroom/.
Nisha Talagala, Remzi H. Arpaci-Dusseau, and David Patterson. 1999. Microbenchmark-based Extraction
of Local and Global Disk Characteristics. Technical Report UCB/CSD-99-1063. EECS Department,
University of California, Berkeley. http://www.eecs.berkeley.edu/Pubs/TechRpts/1999/6275.html.
Sophia Tan, Weiya Xi, Zhi Yong Ching, Chao Jin, and Chun Teck Lim. 2013. Simulation for a Shingled
magnetic recording disk. IEEE Trans. Magn. 49, 6 (June 2013), 2677–2681. DOI:http://dx.doi.org/10.1109/TMAG.2013.2245872
David A. Thompson and John S. Best. 2000. The future of magnetic data storage technology. IBM J. Res. Dev.
44, 3 (May 2000), 311–322. DOI:http://dx.doi.org/10.1147/rd.443.0311
Sumei Wang, Yao Wang, and Randall H. Victora. 2013. Shingled magnetic recording on bit patterned
media at 10 Tb/in². IEEE Trans. Magn. 49, 7 (July 2013), 3644–3647. DOI:http://dx.doi.org/10.1109/
TMAG.2012.2237545
R. Wood, Mason Williams, A. Kavcic, and Jim Miles. 2009. The feasibility of magnetic recording at 10
terabits per square inch on conventional media. IEEE Trans. Magn. 45, 2 (Feb. 2009), 917–923.
DOI:http://dx.doi.org/10.1109/TMAG.2008.2010676
Bruce L. Worthington, Gregory R. Ganger, Yale N. Patt, and John Wilkes. 1995. On-line extraction of SCSI
disk drive parameters. In Proceedings of the 1995 ACM SIGMETRICS Joint International Conference on
Measurement and Modeling of Computer Systems (SIGMETRICS’95/PERFORMANCE’95). ACM, New
York, NY, 146–156. DOI:http://dx.doi.org/10.1145/223587.223604
Received June 2015; revised August 2015; accepted September 2015