RESEARCH PAPERS
Special Focus
SCIENCE CHINA
Information Sciences
June 2011 Vol. 54 No. 6: 1129–1141
doi: 10.1007/s11432-011-4266-z
© Science China Press and Springer-Verlag Berlin Heidelberg 2011    info.scichina.com    www.springerlink.com
P3Stor: A parallel, durable flash-based SSD for
enterprise-scale storage systems
XIAO Nong, CHEN ZhiGuang, LIU Fang, LAI MingChe & AN LongFei
Department of Computer Science, National University of Defense Technology,
Changsha 410073, China
Received September 11, 2010; accepted April 26, 2011
Abstract Driven by data-intensive applications, flash-based solid state drives (SSDs) have become increasingly
popular in enterprise-scale storage systems. Flash memory exhibits inherent parallelism. However, existing solid
state drives have not fully exploited this superiority. We propose P3Stor, a parallel solid state storage architecture
that makes full use of flash memory by utilizing module- and bus-level parallelisms to increase average bandwidth
and employing chip-level interleaving to hide I/O latency. To improve the bandwidth utilization of traditional
interface protocols (e.g., SATA), P3Stor adopts PCI-E interface to support concurrent transactions. Based on the
proposed parallel architecture, we design a lazy flash translation layer (LazyFTL) to manage the address space.
The proposed LazyFTL adopts a flexible super page-level mapping scheme to support multi-level parallelism. It
is able to distinguish hot data from cold data, and hot data identification enables LazyFTL to direct hot and cold
data to separate physical blocks, which reduces page migrations when reclaiming blocks. As the garbage collector
migrates fewer valid pages, write amplification is significantly reduced, which in turn helps to extend the life
span. Moreover, LazyFTL rarely triggers the wear-leveling process. The lazy wear-leveling mechanism protects
users’ requests from being disrupted by background operations. With the guidance of hot data identification,
an intelligent write buffer is used to reduce program operations to flash chips. This is meaningful in extending
P3Stor’s life span. The performance evaluation using trace-driven simulations and theoretical analysis shows
that P3Stor achieves high performance and its life span is more than doubled.
Keywords flash memory, SSD, P3Stor, FTL, storage
Citation Xiao N, Chen Z G, Liu F, et al. P3Stor: A parallel, durable flash-based SSD for enterprise-scale
storage systems. Sci China Inf Sci, 2011, 54: 1129–1141, doi: 10.1007/s11432-011-4266-z
1 Introduction
As the reliance on data-intensive applications continuously increases in various domains, disk-based storage
devices are failing to meet the demands owing to high latency and low throughput. In contrast,
flash memory outperforms hard disks in terms of its lower latency, lower power consumption, lighter
weight, and shock resistance. Because of the increase in capacity and decrease in price, flash memory
has attracted increasing attention in recent years. Compared with traditional hard disks, flash memory
has several peculiarities making it unsuitable for direct application to current storage systems. First,
a single flash memory chip is inadequate for large-scale storage systems in terms of both capacity and
Corresponding author (email: nongxiao@nudt.edu.cn)
performance. As such, it is better to explore a new architecture to aggregate sufficient flash chips. Second,
flash chips do not support the specific interface protocols for hard disks. To be compatible with the
software stack accumulated by traditional hard disks, flash chips need to be packaged as a solid state
drive (SSD), which presents a hard disk-like interface. The software layer that hides the peculiarities of
flash chips is called flash translation layer (FTL). SSD’s performance is highly dependent on two aspects
described above, the architecture and FTL design.
In this paper, we design a high performance parallel solid state storage device called P3Stor. P3Stor
aims at high bandwidth, low latency, and long life span. To achieve these goals, we optimize both the
architecture configuration and FTL design. Techniques employed in P3Stor are listed below.
A high-speed interface such as PCI-E is used to support multiple concurrent transactions. This improves
the poor bandwidth utilization of traditional interface protocols.
Module- and bus-level parallelisms are used to increase average bandwidth. In addition, chip-level
interleaving is used to reduce I/O latency.
P3Stor takes advantage of pipelining to accelerate the command interpretation process.
To alleviate the negative impact of background operations, we adopt a flexible mapping scheme based
on super pages. The fine-grained mapping table is able to distinguish hot data from cold data, which
enhances the efficiency of the FTL remarkably.
The write buffer, guided by the hot data-aware mapping table, not only responds to write requests rapidly,
but also remarkably decreases write operations to flash chips, thereby lengthening the life span of P3Stor.
The lazy garbage collection and wear-leveling significantly reduce the overhead of FTL. They can
efficiently prevent read requests from being congested by background operations.
The rest of this paper is organized as follows. Section 2 presents the background and related work.
Section 3 describes principles of our work. Section 4 explains the architecture of P3Stor, while section 5
discusses the design of LazyFTL. Section 6 explores the design space of FTL and presents our performance
evaluation. The final section concludes our work.
2 Background and related work
2.1 NAND flash memory overview
Compared with traditional hard disks, flash memory has several novel characteristics. A NAND flash
chip is organized into several planes. A plane contains thousands of blocks, while a block consists of 64
or 128 pages [1]. Each page is appended with a small spare out-of-band (OOB) area. Metadata that describes
the page of data, such as the ECC (error correction code) and the page state (free/valid/invalid), can be
written to this additional area.
Essentially, pages of flash chips can be accessed in parallel and at random. A typical flash-based device
is made up of a package of flash chips, which are connected by multiple buses. Besides the external-chip
parallelism among the buses, a flash chip with multiple planes also presents internal-chip parallelism
among the planes. The external- and internal-chip parallelisms make it feasible for a flash-based device
to achieve a higher aggregate bandwidth.
Flash memory has three operations: read, program, and erase. Program and erase are destructive
operations. Too many program/erase operations on a block lead to data corruption. The nominal life
span of a block in SLC NAND flash is 10^5 erasure cycles. For MLC flash, it is only 10^4 erasure cycles.
When manufacturers announce the nominal life span of a chip, they guarantee that the bit error rate
will be lower than 10^-15 to 10^-14. However, Desnoyers et al. [2] pointed out that the measured life span
of blocks varies greatly, and is often as much as 100 times higher than the manufacturer specifications.
2.2 Typical architectures based on flash memory
With the novel characteristics presented in the previous subsection, flash memory has attracted much
attention over the past few years. Two typical architectures based on flash memory are briefly described
below.
Gordon. Gordon [3] is a flash-based system for massively parallel, data-centric computing. It leverages
solid state drives, low-power processors, and data-centric programming paradigms to deliver enormous
gains in performance and power efficiency. The basis is the adoption of solid state storage. Gordon
exploits parallelism via two techniques. The first is aggressively pursuing dynamic parallelism between
requests to the flash array. The second is combining physical pages from several planes into super pages.
There are three ways of creating super pages: horizontal striping, vertical striping, and 2-dimensional
striping. Horizontal striping effectively creates a wider bandwidth bus, and increases the raw performance
of the array for large reads. Vertical striping further enhances throughput by increasing bus utilization.
Two-dimensional striping combines both horizontal and vertical striping schemes to generate an even
larger super page. This scheme achieves greater bus bandwidth and lower I/O latency.
Hydra. Hydra [4] also exploits the parallelism among multiple flash chips to enhance storage performance.
Techniques adopted to achieve this goal include the following: 1) Hydra adopts bus-level
interleaving to generate a wider bandwidth by aggregating enough flash buses. 2) Chip-level interleaving
is used to hide read latency. These two techniques are similar to the horizontal and vertical striping
schemes adopted by Gordon. 3) Hydra uses a dedicated foreground unit with higher priority to handle
read requests. This scheme prevents read requests from being delayed by background operations. 4) Hy-
dra employs a write buffer to respond rapidly to write requests. The write buffer also makes it possible for
programs to be performed in parallel. Hydra primarily focuses on parallelism. It emphasizes scheduling
of multiple foreground and background units at system level. However, flash memory operations induced
by interleaving are inflexible. Garbage collection and wear-leveling (explained in the next subsection)
have not been optimized. Moreover, Hydra does not address the reliability issue.
2.3 Introduction to FTLs
To hide the peculiarities of flash memory and emulate the interface of hard disk drives, flash chips are
packaged into SSDs. The core of an SSD is the FTL, which consists of three components: wear-leveling,
garbage collection, and address mapping. Since each block has a limited life span, it is important to
distribute erase operations evenly throughout the chip to prevent some blocks from being corrupted
too early; this is exactly what wear-leveling does. Since a page cannot be overwritten in place, the
alternative is to direct the updated data to another page and mark the old page as invalid; the FTL
needs a garbage collection mechanism to reclaim these invalid pages. Out-of-place update also leads to
a dynamic mapping between logical and physical addresses, so a table is necessary to maintain
the mapping information. The state-of-the-art FTLs include BAST [5], FAST [6], SAST [7], KAST [8] and
DFTL [9]. The focus of FTLs mentioned above is generally constrained to mapping schemes and hardly
ever involves parallelism, wear-leveling or reliability. CAFTL [10] focuses on de-duplication at device
level and can be seen as an add-on for the basic FTLs. CA-SSD [11] employs content addressable storage
to exploit value locality, which helps to reduce the average response time.
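To make the three components concrete, the sketch below shows a toy page-level mapping in Python with out-of-place updates; the class and its fields are illustrative stand-ins rather than the design of any FTL cited above, and garbage collection is reduced to the bookkeeping it would operate on.

    # A toy page-level FTL: out-of-place update plus the bookkeeping that
    # garbage collection later acts on. Illustrative stand-in, not a cited design.
    class ToyFTL:
        def __init__(self, num_physical_pages):
            self.l2p = {}                        # logical page -> physical page
            self.free = list(range(num_physical_pages))
            self.invalid = set()                 # invalidated pages awaiting reclamation

        def write(self, lpn):
            ppn = self.free.pop(0)               # flash forbids overwriting in place
            if lpn in self.l2p:
                self.invalid.add(self.l2p[lpn])  # the old copy becomes garbage
            self.l2p[lpn] = ppn

        def read(self, lpn):
            return self.l2p.get(lpn)             # None if never written

    ftl = ToyFTL(8)
    ftl.write(0); ftl.write(0)                   # the overwrite invalidates page 0
    assert ftl.read(0) == 1 and ftl.invalid == {0}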
3 Design principles
We try to build a high bandwidth, low latency, long life span SSD. The following principles guide our
design.
Aggressively exploit parallelism. Typically, a page of a read request is subject to about a 20
µs read delay and a 100 µs bus delay for an SLC NAND flash chip. With a page size of 2 KB, the
bandwidth for read requests is no more than 16.7 MB/s. A read request consisting of 64 consecutive
pages must endure a latency of 7.68 ms. Neither bandwidth nor latency of a single chip with only a
single plane is comparable to that of hard disks. Fortunately, external- and internal-chip parallelisms
can overcome these disadvantages. Chip-level interleaving improves bus utilization significantly. A chip
with four planes achieves a bandwidth of about 20 MB/s for both read and write. Bus-level interleaving
produces an aggregate bandwidth, which can exceed the peak of host’s interface protocol. As interleaving
is the preliminary method to increase bandwidth, we adopt this method as well.
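These figures follow directly from the quoted timings; a minimal check (using the nominal 20 µs array read and 100 µs bus transfer per 2 KB page quoted above):

    # Single-chip, single-plane read path: array read + bus transfer, serialized.
    READ_US, BUS_US, PAGE_KB = 20, 100, 2
    per_page_us = READ_US + BUS_US             # 120 us per page
    print(PAGE_KB * 1000 / per_page_us)        # ~16.7 MB/s peak read bandwidth
    print(64 * per_page_us / 1000)             # 7.68 ms for 64 consecutive pages

Chip-level interleaving overlaps the array reads of some chips with the bus transfers of others, so the bus rather than the flash array becomes the limiting resource.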
Figure 1 The overall architecture of P3Stor.
Write as little as possible. As each block has a limited life span, it is important to reduce program
operations to flash chips. As presented in the work of Soundararajan et al. [12], write requests from hosts
exhibit great temporal and spatial localities. Most of the write requests are overwrites of a small set of data. A
write buffer can significantly reduce the program operations to flash chips.
Do collect the garbage. When a block is recycled, if it holds valid pages, the garbage collector must move
these valid pages first. Moving pages leads to extra reads and writes. The extra traffic not only impairs
performance, but also wears out flash chips. If there are no valid pages in the block to be reclaimed,
this negative impact disappears. So, we endeavor to reduce the number of valid pages in a block to be
reclaimed.
Belittle wear-leveling. Wear-leveling distributes erasures over all blocks. This usually involves migrating
pages from a block with the most erase cycles to a block with the fewest erase cycles. This process
imposes an even greater negative impact on performance than garbage collection does. We argue
that frequent wear-leveling is unnecessary, and that wear-leveling is not the only way to achieve
higher reliability. This research employs a novel mechanism with little overhead, but which, nevertheless,
guarantees reliability.
Endeavor to exhaust the lifecycle. Flash chips in the market have a nominal life span. However,
Desnoyers et al. [2] pointed out that, even if a block has been erased more than the nominal number of
times, it can still be of use. The measured lifetimes of blocks are usually as much as 100 times higher
than the manufacturer specifications. This motivates us to make use of flash chips for as long as possible,
while guaranteeing the expected reliability.
4 The P3Stor architecture
An overview of P3Stor is shown in Figure 1, with the main features of this architecture summarized
below.
To improve the poor bandwidth utilization of traditional interface protocols, P3Stor adopts high-
speed interfaces, such as PCI-E, to support multiple concurrent transactions.
To fully exploit the performance of the flash chip array, P3Stor adopts both module- and bus-level
parallelisms to increase average bandwidth for multiple requests, and chip-level interleaving to reduce
average latency.
To accelerate the command interpretation process, we design a pipeline for the command interpreter.
The pipeline eliminates the bottleneck of centralized interpretation in traditional architectures.
4.1 Front-end and back-end
The overall architecture is divided into Front-end and Back-end as shown in Figure 1. They are described
below.
Front-end. There are three major components in the Front-end: the command FIFO queue, the buffer feeders,
and the DRAM write buffer. When an I/O request arrives, the command FIFO queue accommodates it and
immediately sends it to the command interpreter. The command interpreter determines whether the requested
data is located in DRAM. For a read request, if the data hits in DRAM, it is immediately returned to
the corresponding buffer feeder; otherwise, translation to a physical address is invoked. Similarly, when a
write request arrives, the data to be written is cached in one of the four buffer feeders. P3Stor is deployed
with four buffer feeders and thus supports four simultaneous I/O requests from hosts. The DRAM is devoted
to write requests: once the data of a write request has been written to DRAM, the host is informed that
the write request has been completed. For read requests that miss in DRAM, the data fetched
from flash chips is delivered to the buffer feeder directly.
Back-end. The Back-end includes SRAM, command interpreter, command dispenser, agents, flash
controllers, and ECC encoders and decoders. The SRAM is the cache for the mapping table. P3Stor uses a
fine-grained mapping scheme, so the mapping table is too large to be loaded into SRAM completely; we cache
mapping information on demand. The command interpreter is responsible for translating logical addresses into
physical addresses and for converting I/O requests to the format understood by flash controllers. The
command dispenser dispatches commands to different agents according to the physical address supplied by
the command interpreter. The agents distribute commands to flash controllers in out-of-order
mode to exploit chip-level interleaving. In addition, the ECC encoder and decoder are used to detect and
correct bit flips. Each flash memory controller is equipped with multiple encoders and decoders. As
explained in section 5, P3Stor applies different ECCs to different data.
4.2 Multi-level parallelism
As illustrated in Figure 1, we exploit three different levels of parallelism. Details of them are given below.
Module-level parallelism. The entire flash array is divided into 4 modules. Each module is controlled
by an agent. The command dispenser issues multiple commands to invoke different agents according to
addresses of I/O requests. The agent is responsible for scheduling I/O commands to buses. In this way,
all agents respond to I/O transactions concurrently. P3Stor dynamically balances the workload between
the four agents.
Bus-level parallelism. In the P3Stor configuration, the data-paths of each module are 32 bits and
can operate at 60 MHz in the FPGA prototype. In contrast, the datasheet of our selected flash memory
indicates an 8-bit data path and a 60 MHz operating frequency. To bridge the bandwidth gap between the
agent and the flash chips, one reasonable configuration is to exploit 4-way bus-level parallelism. Parallelism
between the four buses achieves an effective bandwidth of 240 MB/s when using four 60 MB/s flash chips.
Chip-level parallelism. P3Stor supports chip-level interleaving among different chips sharing a bus,
thereby hiding the flash command latency. For example, after issuing an erase command to super-chip 0,
the agent waiting for the erase command completion is not suspended. If other commands are inspected
and found to be directed to different super-chips, they will be issued immediately to overlap the execution
of erase. Afterwards, the agent examines the completion of all commands by issuing status check, and
also issues arriving commands directed to free chips. In this way, chip-level parallelism can hide the I/O
latency and increase bus utilization.
To support multi-level parallelisms, the address space is arranged as follows: the entire flash array
is divided into 4 modules. Each module containing four buses is controlled by an agent as shown in
Figure 2 Address arrangement of modules.
Figure 3 Address arrangement of a super chip.
Figure 4 Pipeline of the command interpreter.
Figure 2. In each module, flash chips of the four buses are grouped into four super chips. Each super
chip organizes four flash chips with the address arrangement as shown in Figure 3.
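One plausible decoding of a flat super-page address under this arrangement is sketched below; the field order and the block count per super chip are our assumptions for illustration (Figures 2 and 3 define the actual layout):

    # Assumed decomposition of a super-page number into (module, super chip,
    # super block, super page); field order and sizes are illustrative.
    PAGES_PER_BLOCK = 64          # per the chip used in Section 6
    BLOCKS_PER_SUPER_CHIP = 2048  # assumed; depends on planes grouped per chip
    SUPER_CHIPS_PER_MODULE = 4
    MODULES = 4

    def decode(spn):
        page = spn % PAGES_PER_BLOCK;               spn //= PAGES_PER_BLOCK
        block = spn % BLOCKS_PER_SUPER_CHIP;        spn //= BLOCKS_PER_SUPER_CHIP
        super_chip = spn % SUPER_CHIPS_PER_MODULE;  spn //= SUPER_CHIPS_PER_MODULE
        return spn % MODULES, super_chip, block, page

    print(decode(0), decode(64))  # consecutive blocks stay within one super chip

Striping the module and super-chip fields into the low-order bits instead would spread consecutive super pages across agents; the choice trades sequential bandwidth against locality.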
4.3 Command interpretation pipeline
To reduce the latency of each transaction, a multifunctional pipeline consisting of four stages is implemented
in the command interpreter (see Figure 4). The analysis stage parses I/O requests and converts them into
several I/O commands at the granularity of a super page. In addition, the analysis stage is responsible for
triggering background operations of the FTL. For example, if no I/O request arrives
for a long time, the analysis stage invokes garbage collection. The access cache stage checks whether
the requested data is cached in the write buffer. If the data misses, the address translation stage translates
the logical address into a physical address with the help of the page table cache. The physical address is
further converted to a triple by the throw command stage. The triple designates the target super chip, super
block, and super page. The command is finally forwarded to the corresponding agent.
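In software terms, the four stages compose as follows (a toy model of the dataflow in Figure 4; in the FPGA each stage works on a different command in the same cycle, and the tables below are illustrative stand-ins):

    # Toy model of the four-stage command interpreter of Figure 4.
    SUPER_PAGE_BYTES = 4 * 2048                # four 2 KB pages per super page

    def analysis(offset, nbytes):              # stage 1: request -> super-page commands
        first = offset // SUPER_PAGE_BYTES
        last = (offset + nbytes - 1) // SUPER_PAGE_BYTES
        return list(range(first, last + 1))

    def access_cache(cmds, write_buffer):      # stage 2: drop write-buffer hits
        return [c for c in cmds if c not in write_buffer]

    def translate(cmds, page_table):           # stage 3: logical -> physical
        return [page_table[c] for c in cmds]

    def throw_command(ppns):                   # stage 4: physical -> (chip, block, page)
        return [(p // (2048 * 64), p // 64 % 2048, p % 64) for p in ppns]

    cmds = access_cache(analysis(0, 20000), write_buffer={1})
    print(throw_command(translate(cmds, page_table={0: 70, 2: 131072})))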
5 The LazyFTL
Following the principles mentioned in section 3, we propose LazyFTL to complement the architecture
described above. On the one hand, LazyFTL is lazy and resists reclaiming blocks that contain valid
pages. It attempts to reduce program operations by employing a write buffer and to avoid wear-leveling. On the other
hand, LazyFTL is smart. It adopts a super page-level mapping scheme for flexibility. It separates hot
and cold data to improve the efficiency of garbage collection. It even enables P3Stor to survive beyond
its nominal life span.
Figure 5 2-bit hot data predictor.
5.1 Hot page-aware page table management
For P3Stor, the chip-level interleaving packages four pages from different chips into a super page, which
is the atomic access unit. LazyFTL adopts a super page-level mapping scheme for flexibility. Management
of the page table includes two aspects: the configuration of the page table cache and hot page prediction.
The prototype of P3Stor is 256 GB. A super page contains four 2 KB physical pages, so there are 32 M
super pages and the mapping table contains 32 M entries. Each entry takes up 4 B, so a 2 KB physical
page accommodates 256 entries (the remaining 1024 bytes are reserved for future use). The whole page
table thus occupies 32 M / 256 = 128 K physical pages, i.e., 256 MB of flash. It is too costly to hold the
entire mapping table in the SSD's memory. Fortunately, the localities exhibited by most workloads make it
feasible to cache recently used entries on demand, while the complete copy of the mapping table is
maintained on flash chips. Hereafter, the mapping information is referred to as kernel data. The spatial
overhead of kernel data is negligible: for a 256 GB SSD, there are only 256 MB of kernel data, approximately
0.1% of the SSD capacity. The data structure that indexes the 128 K pages of kernel data is called the page
table directory. Each entry in the page table directory also takes up 4 B, so the directory consumes
0.5 MB. The page table directory is the root used to boot the system; it is kept in battery-backed
RAM.
The cache is managed at the granularity of a page, which contains 256 entries. If some of the entries
in a page are accessed, the whole page is fetched into cache. Entries in a page are also evicted out of the
cache as a whole. To exploit temporal locality, the cache employs the 4-way set-associative structure. To
exploit spatial locality, the cache employs a conservative prefetch policy. When a page of kernel data is
demanded, the next page beyond the demanded one is also fetched into cache.
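A compact model of this cache organization (4 ways per set, LRU within a set, 256 mapping entries per cached kernel-data page, one-page conservative prefetch); the loader argument stands in for the flash read that fetches a page of kernel data:

    # 4-way set-associative page-table cache with one-page conservative prefetch.
    # The cached unit is a whole kernel-data page of 256 mapping entries.
    from collections import OrderedDict

    class PageTableCache:
        ENTRIES_PER_PAGE, WAYS = 256, 4

        def __init__(self, num_sets, loader):
            self.sets = [OrderedDict() for _ in range(num_sets)]
            self.loader = loader                   # stand-in for a flash read

        def _fill(self, page_no):
            s = self.sets[page_no % len(self.sets)]
            if page_no not in s:
                if len(s) >= self.WAYS:
                    s.popitem(last=False)          # evict the LRU way (flush if dirty)
                s[page_no] = self.loader(page_no)

        def lookup(self, spn):
            page_no = spn // self.ENTRIES_PER_PAGE
            s = self.sets[page_no % len(self.sets)]
            if page_no not in s:
                self._fill(page_no)                # demand fetch
                self._fill(page_no + 1)            # conservative one-page prefetch
            s.move_to_end(page_no)                 # LRU update within the set
            return s[page_no][spn % self.ENTRIES_PER_PAGE]

    cache = PageTableCache(64, loader=lambda p: [p * 256 + i for i in range(256)])
    assert cache.lookup(1000) == 1000              # stub loader yields an identity map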
Flash is write-sensitive. We call the logical pages that are updated frequently hot pages; the others
are cold pages. It is useful to direct hot and cold pages to separate blocks. We call the process that
distinguishes hot data from cold data hot page prediction. Garbage collection benefits from hot page
prediction. Since the page table cache observes all the access traces, LazyFTL performs hot page
prediction in the page table cache.
LazyFTL adopts 2-bit predictors [13] to predict hot pages. Figure 5 reviews the principle of the 2-bit
predictor. All the super pages are divided into four groups, very hot, hot, cold, and very cold, which are
represented by 11, 10, 01, and 00, respectively. If a very cold super page has been over-written many
times, it will ultimately become a very hot super page. On the contrary, if a very hot super page has
not been over-written for a long time, it will become a very cold one. Conversions between the states are
illustrated in Figure 5. When a super page is accessed for the first time, LazyFTL fetches its mapping
entry into the cache (more precisely, the whole page containing the entry). If the super
page has been overwritten, LazyFTL upgrades its hot degree. On the other hand, if the super page has
not been overwritten by the time its mapping entry is evicted from the cache, LazyFTL degrades its hot
degree. As discussed above, a 2 KB physical page contains 256 mapping entries, which consume 1024
bytes. The remaining 1024 bytes are reserved for storing the states of predictors. LazyFTL maintains a
2-bit predictor for each entry; 256 predictors consume 64 bytes. When a page of entries is evicted from
the cache, the states of their predictors are also written to flash chips for future use.
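The predictor's state machine is small enough to state exactly; below is a sketch with the four states encoded as a saturating counter (our encoding of the transitions in Figure 5):

    # 2-bit saturating predictor: 0 = very cold, 1 = cold, 2 = hot, 3 = very hot.
    def on_overwrite(state):        # super page overwritten while cached: warm up
        return min(state + 1, 3)

    def on_evict_untouched(state):  # evicted without an overwrite: cool down
        return max(state - 1, 0)

    def is_hot(state):              # hot/very hot pages go to hot super blocks
        return state >= 2

    s = 0                           # a very cold page overwritten twice becomes hot
    s = on_overwrite(on_overwrite(s))
    assert is_hot(s)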
5.2 Write buffer
As each block has a limited life span, LazyFTL employs DRAM as a write buffer to reduce program
operations. The write buffer adopts a 4-way set-associative mapping scheme. It supports two triggers to
flush pages to the chip array.
Capacity-threshold trigger: If the free capacity of the write buffer is lower than a given threshold, a flush
operation is triggered.
Page table cache flush trigger: If the free capacity of the page table cache is lower than a given threshold,
the cache must flush some mapping entries to flash chips. If a flushed entry is dirty, the corresponding
user data in the write buffer is flushed as well. As the page table cache is aware of hot and cold data,
it always evicts cold entries, and the corresponding cold user data is flushed by the write buffer.
Guided by the page table cache, the write buffer achieves a high hit ratio and significantly reduces program
operations.
When data is flushed, LazyFTL optimizes program operations in two ways. The first is performing
programs in parallel by adopting bus- and chip-level interleaving. The other is directing hot and cold
super pages to separate super blocks. Separating hot data from cold data enhances the efficiency of
garbage collection.
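The flush path is therefore a simple routing decision; a sketch in which the predictor states and the open super blocks are illustrative stand-ins:

    # On flush, route each super page to the open hot or cold super block so
    # that blocks fill with uniformly hot or uniformly cold data.
    def flush(victims, predictor, hot_block, cold_block):
        for lpn in victims:
            dest = hot_block if predictor.get(lpn, 0) >= 2 else cold_block
            dest.append(lpn)        # stands in for an interleaved program operation

    hot, cold = [], []
    flush([7, 8, 9], predictor={7: 3, 8: 0, 9: 2}, hot_block=hot, cold_block=cold)
    assert hot == [7, 9] and cold == [8]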
5.3 Lazy garbage collection
Although interleaving increases the bandwidth, it imposes a negative impact on garbage collection. Suppose
that a write request consists of 16 pages: the 4-way bus-level interleaving dispatches the 16 pages to 4
buses, and the 4-way chip-level interleaving then dispatches each bus's 4 pages to 4 different chips. As a
result, the 16 pages from a single write request scatter across 16 physical blocks, and overwriting them
leaves each of those blocks with an invalid page. Interleaving thus causes invalid pages to be dispersed
across more blocks.
LazyFTL overcomes this disadvantage in two ways. First, it adopts a super page-level mapping scheme.
The fine-grained mapping enables any physical super page to accept updates, so garbage collection is not
triggered until the free blocks are used up. Usually, an SSD reserves some blocks that are prepared to
replace worn-out blocks; this large redundant capacity ensures that garbage collection is rarely triggered.
As garbage collection is delayed, more pages have a chance to become invalid, and the garbage collector
migrates fewer valid pages when recycling a block. Second, hot and cold pages are directed to separate
super blocks. Combined with the long delay of garbage collection under the super page-level mapping
scheme, super blocks containing hot pages tend to have no valid pages left. LazyFTL, therefore, moves
almost no valid pages during garbage collection.
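The resulting collector reduces to a few lines of greedy logic; a sketch under the assumption that a block is represented by its set of valid pages:

    # Lazy greedy garbage collection: run only when free blocks are exhausted,
    # always reclaim the block with the fewest valid pages (ideally zero).
    def collect(used_blocks, free_blocks, low_watermark=1):
        migrated = 0
        while len(free_blocks) < low_watermark and used_blocks:
            victim = min(used_blocks, key=lambda b: len(b["valid"]))
            migrated += len(victim["valid"])     # extra reads/writes, ideally none
            used_blocks.remove(victim)
            victim["valid"] = set()              # erase; valid pages rewritten elsewhere
            free_blocks.append(victim)
        return migrated

    used = [{"valid": set()}, {"valid": {4, 5}}] # a fully invalid block and a cold one
    free = []
    assert collect(used, free) == 0              # the empty victim costs no migration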
5.4 Lazy wear-leveling
LazyFTL hardly ever performs wear-leveling. Only when some super blocks' erase counts approach the
nominal life span does LazyFTL move some cold pages to them. Meanwhile, a more powerful ECC, such
as a Reed-Solomon [14] code, is employed to guarantee reliability. Although Reed-Solomon coding is more
complex, cold data is rarely accessed, so the complex encoding and decoding process is rarely triggered.
As hot data is stored in reliable super blocks, LazyFTL applies a commonplace ECC to hot data for
simplicity.
Desnoyers’ experiments show that the measured life spans of blocks are often as much as 100 times
higher than manufacturer specifications [2]. Even if the nominal life span has expired, the block can still
be in service. LazyFTL makes full use of the actual life span by adopting a more powerful ECC, and thus
extends the life span of P3Stor. Theoretically, LazyFTL can extend the actual life span as long as the
powerful ECC can guarantee reliability. But, as bit errors happen more frequently, repeated encoding
and decoding will impair performance. So, LazyFTL only stores cold data in nominally worn out blocks.
Hot data is stored in reliable blocks. When all blocks have exceeded their nominal life spans, there are
no reliable blocks to store hot data. The SSD is ultimately worn out.
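This policy amounts to a placement rule plus an ECC switch; a sketch using the nominal SLC endurance as the threshold (the ECC labels are placeholders for the two code strengths):

    # Blocks past the nominal erase budget are demoted to cold-data duty and
    # protected by the stronger (but costlier) Reed-Solomon code.
    NOMINAL_ERASES = 10**5          # SLC nominal endurance per block

    def placement_ok(block_erases, data_is_hot):
        return not (data_is_hot and block_erases >= NOMINAL_ERASES)

    def ecc_for(block_erases):
        return "reed-solomon" if block_erases >= NOMINAL_ERASES else "commonplace"

    assert not placement_ok(10**5, data_is_hot=True)   # hot data never on worn blocks
    assert ecc_for(2 * 10**5) == "reed-solomon"        # cold data may still live there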
6 Design space exploration
We evaluate P3Stor by trace-driven simulations. Subsection 6.1 introduces the experimental setup.
Subsection 6.2 presents the design of the page table. Subsection 6.3 discusses the write buffer. A comparison
of garbage collection with and without hot data prediction is presented in subsection 6.4. The last
subsection analyzes the life span.
6.1 Experimental setup
The experiments are based on trace-driven simulations. PC1 and PC2 were collected from desktops
running Windows XP. The applications include ftp, email, VMware, databases, and text editor tools
[15–17]. Server1 and Server2 were collected from servers supplying a platform for online sales. Financial1
and Financial2 are traces from an OLTP application running at a financial institution [18] and made
available by Storage Performance Council (SPC). Web1 and Web2 are read-dominant Web Search engine
traces [19].
We used the well-known simulator FlashSim [20]. FlashSim models all components of a flash chip,
including pages, blocks, planes, dies, packages, channels and buses. The developers of FlashSim validated
its performance against a number of commercial SSDs for behavioral similarity. All simulations
are based on the Samsung K9WAG08U1A chip [1]. A chip contains 4 planes, with each plane consisting
of 2048 blocks. A block contains 64 pages. A page is 2 KB with a 64 B OOB area. Generally, SSDs
contain redundant blocks. In our experiments, we assume that the extra capacity is 10% of the nominal
capacity of SSD.
6.2 Page table management
In this subsection, we explore the management of the page table cache. We believe that simplicity is as
important as performance in designing a system. The baseline cache replacement policy is a 4-way
set-associative scheme without prefetch; within each set, we adopt LRU [21] for simplicity. Table 1 presents
the hit ratios achieved by the 4-way set-associative policy and a fully-associative mechanism in a 4 MB
cache, respectively. As the table shows, the 4-way set-associative policy is comparable with the fully-
associative one. In the page table, a page contains 256 entries indexing 256 super pages, so access to any
of these super pages results in an access to the same page of kernel data. Thus, spatial locality of user
data requests is converted into temporal locality of kernel data requests. As temporal locality is
consolidated, a simple replacement policy is sufficient for the page table cache; this is why the 4-way
set-associative policy is comparable with a fully-associative one. We adopt the 4-way set-associative policy
due to its simplicity.
To improve the hit ratio further, we integrate a conservative prefetch mechanism into the cache replacement
policy. Whenever a page is demanded, we prefetch the consecutive page of kernel data. Hereafter, the
4-way set-associative replacement policy with a conservative prefetch mechanism is adopted for the page
table cache.
There are two kinds of overhead for the page table: capacity overhead and I/O overhead. Kernel data
consumes 0.1% of the capacity of P3Stor, which is negligible. I/O overhead is induced by requests for
kernel data and consists of two types: fetching demanded entries into the cache, and evicting dirty entries.
Table 2 analyzes the I/O overhead. The column Requested pages denotes the total number of pages
requested by hosts; Extra read covers fetches of demanded entries into the cache, and Extra write covers
evictions of dirty entries. As the last column shows, the total number of requested pages of kernel data is
less than 5% of the total number of requested pages of user data. Specifically, extra writes are no more than 0.5%
of user data writes. Write requests of kernel data hardly reduce the life span of P3Stor.
Table 1 Comparison of two types of cache configurations
Traces Set-assoc (%) Fully-assoc (%)
PC1 80.55 80.96
PC2 81.82 82.06
Server1 95.41 95.82
Server2 95.00 95.46
Financ1 99.40 99.43
Financ2 97.12 97.68
Web1 48.67 48.68
Web2 48.60 48.59
Table 2 I/O overhead of kernel data
Traces Requested pages Extra read (%) Extra write (%) Total (%)
PC1 140909866 0.804 0.279 1.083
PC2 184729385 0.663 0.202 0.865
Server1 44654475 0.369 0.085 0.454
Server2 42749805 0.420 0.099 0.519
Financ1 36115196 0.222 0.083 0.305
Financ2 17693541 1.640 0.393 2.033
Web1 138040640 3.141 0.000 3.141
Web2 135387780 2.928 0.000 2.928
Table 3 Write savings achieved by 256 MB write buffer
Traces Write savings (%)
Arbitrary buffer Improved buffer
PC1 9.31 17.63
PC2 16.63 30.88
Server1 20.73 41.01
Server2 21.11 41.62
Financ1 29.10 54.67
Financ2 33.44 61.20
Table 4 Write savings achieved by 512MB write buffer
Traces Write savings (%)
Arbitrary buffer Improved buffer
PC1 10.67 19.51
PC2 21.10 35.00
Server1 22.90 42.37
Server2 24.12 43.73
Financ1 33.81 63.90
Financ2 36.86 68.10
6.3 Write buffer management
To reduce program operations to flash chips, we employ a write buffer to filter some of the write requests.
Write saving [12] is used to measure the performance achieved by the write buffer. Write saving is the
percentage of total writes prevented from reaching flash chips: if the host writes 100 pages to the SSD,
but only 60 pages are received by flash chips, the write saving is 40%. We explore two replacement policies
for the write buffer. The first is the arbitrary 4-way set-associative scheme, while the other is the improved
4-way set-associative scheme guided by the hot data-aware page table. Tables 3 and 4 present the write
savings achieved by 256 and 512 MB write buffers, respectively. It is obvious that the improved buffer
consistently outperforms the arbitrary one. As described in subsection 5.2, evictions in the improved write
Table 5 Comparison of two types of programs
Traces Arbitrary program Hot data-aware program
Erase blocks Avg. valid pages Erase blocks Avg. valid pages
PC1 902959 7.41 851402 4.72
PC2 704565 0.96 691701 0.82
Server1 128175 30.11 93736 19.06
Server2 142574 30.08 106195 19.46
Financ1 1013722 35.11 978375 34.05
Financ2 69204 22.97 68387 22.61
buffer are mostly triggered by replacement in the page table cache. First, the page table cache tends to
evict cold entries, and the user data related to these cold entries is also cold. Second, hints supplied by
the 2-bit predictors inform the write buffer which pages are cold. As the improved write buffer is aware
of cold data, it obtains a higher write saving.
6.4 Hot data-aware program
LazyFTL distinguishes hot pages from cold pages, and writes them to separate super blocks. We refer
to this intelligent behavior as hot data-aware program. Correspondingly, a method unaware of hot pages
is called arbitrary program. We compare the two methods via simulations. Two metrics are defined to
measure the performance. One is the number of blocks to be erased during garbage collection. This
metric relates to the life span of P3Stor. The other is the average number of valid pages to be migrated
when a block is recycled. This metric relates to the delay of garbage collection. Ideally, we expect these
two quantities to be as small as possible.
The experimental results are given in Table 5. Generally, the hot data-aware program outperforms
the arbitrary method, especially on Server1 and Server2. These two traces exhibit hybrid access patterns
mixing cold and hot pages, which the arbitrary method handles poorly. PC2 contains only large chunks
of overwrites, leaving no valid pages to migrate during garbage collection, so the two methods behave
similarly. Financial1 and Financial2 are write-intensive workloads in which all pages are overwritten
randomly and frequently; it is thus not possible to optimize the programs for either method. The hot
data-aware program is good at scheduling write sequences with mixed cold and hot pages, which are the
most common access patterns in enterprise-scale storage systems. As the hot data-aware program migrates
fewer valid pages and erases fewer physical blocks for garbage collection, LazyFTL responds rapidly to
users' I/O requests. Moreover, P3Stor has a longer life span.
6.5 Lifecycle analysis
P3Stor uses three methods to extend its life span. The most important one is the write buffer. As
explained in subsection 6.3, with a 512 MB write buffer, P3Stor reduces write traffic by almost 70%
for some workloads. The prototype of P3Stor is deployed with a 2 GB write buffer, which can filter at
least 50% of program operations for most workloads; this means that the life span is almost doubled. The
second method used to extend the life span is the efficient garbage collection. As discussed in subsection
6.4, guided by the hot data-aware page table, the garbage collector erases fewer blocks. For typical
enterprise-scale workloads mixing hot and cold data requests, as in Server1 and Server2, LazyFTL's garbage
collection policy erases about 1.3 times fewer blocks than the arbitrary method. The last measure used to
extend the life span is the lazy wear-leveling. Since cold data is stored in blocks that are nominally worn
out, theoretically all physical blocks are worn out by hot data; the nominally worn out blocks occupied
by cold data are extensions of P3Stor's life span. Generally, the life span of P3Stor can be doubled or
lengthened even further.
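The write buffer's contribution can be quantified: if a fraction s of host writes is absorbed, flash receives only 1 - s of the traffic, so endurance stretches by a factor of 1/(1 - s). A one-line check against the numbers above:

    # Life-span multiplier implied by a given write saving s.
    def lifespan_multiplier(s):
        return 1.0 / (1.0 - s)

    print(lifespan_multiplier(0.50))    # filtering 50% of programs doubles the span
    print(lifespan_multiplier(0.681))   # Financ2, 512 MB buffer (Table 4): ~3.1x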
7 Conclusions
We have designed a parallel flash-based storage device called P3Stor that makes full use of flash chips. The
architecture of P3Stor employs four levels of parallelism. Module- and bus-level parallelisms are used to
increase bandwidth, while chip-level parallelism is used to hide I/O latency. Furthermore, interface-level
parallelism is used to support multiple concurrent transactions. Based on the proposed architecture, we
designed LazyFTL to manage the address space. LazyFTL is able to distinguish hot data from cold data,
which produces useful hints for garbage collection. Moreover, guided by hot data identification, the write
buffer not only reduces the response time of write requests, but also significantly reduces programs to flash
chips, which helps in extending P3Stor’s life span. Besides that, LazyFTL reduces background operations
by performing as little wear-leveling as possible. Performance evaluation by trace-driven simulations and
theoretical analysis shows that P3Stor achieves high performance and that its life span can be more than
doubled.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant Nos. 60736013, 60903040,
61025009), and the Program for New Century Excellent Talents in University (Grant No. NCET-08-0145). We
are grateful to Professor Weisong Shi for his valuable suggestions to improve this paper.
References
1 Samsung Electronics. K9WAG08U1A/K9K8G08U0A/K9NBG08U5A Flash Memory Datasheet
2 Boboila S, Desnoyers P. Write endurance in flash drives: measurements and analysis. In: Proc of the Eighth USENIX
Symposium on File and Storage Technologies (FAST10). San Jose, CA, USA, 2010. 115–128
3 Caulfield A M, Grupp L M, Swanson S. Gordon: using flash memory to build fast, power-efficient clusters for data-
intensive applications. In: Proc of ASPLOS’09. Washington, DC, USA, 2009
4 Seong Y J, Nam E H, Yoon J H, et al. Hydra: A block-mapped parallel flash memory solid-state disk architecture.
IEEE Trans Comput, 2010, 59: 905–921
5 Chung T, Park D, Park S, et al. System software for flash memory: A survey. In: Proc of the International Conference
on Embedded and Ubiquitous Computing, Seoul, Korea, 2006. 394–404
6 Lee S, Park D, Chung T, et al. A log buffer based flash translation layer using fully associative sector translation. IEEE
Trans Embed Comput Syst, 2007, 6: 18
7 Park S Y, Cheon W, Lee Y, et al. A re-configurable FTL (flash translation layer) architecture for NAND flash based
applications. In: Proc of International Workshop on Rapid System Prototyping, Paris, France, 2007. 202–208
8 Cho H, Shin D, Eom Y. KAST: K-associative sector translation for NAND flash memory in real-time systems. In: Proc
of Design, Automation and Test in Europe (DATE09), Nice, France, 2009. 507–512
9 Gupta A, Kim Y, Urgaonkar B. DFTL: A flash translation layer employing demand-based selective caching of page-level
address mappings. In: Proc of the 14th international conference on Architectural Support for Programming Languages
and Operating Systems (ASPLOS09), Washington, DC, USA, 2009. 229–240
10 Chen F, Luo T, Zhang X D. CAFTL: A content-aware flash translation layer enhancing the lifespan of flash memory
based solid state drives. In: Proc of the Ninth USENIX Symposium on File and Storage Technologies (FAST11). San
Jose, CA, 2011
11 Gupta A, Pisolkar R, Urgaonkar B, et al. Leveraging value locality in optimizing NAND flash-based SSDs. In: Proc of
the Ninth USENIX Symposium on File and Storage Technologies (FAST11). San Jose, CA, 2011
12 Soundararajan G, Prabhakaran V, Balakrishnan M, et al. Extending SSD lifetimes with disk-based write caches. In:
Proc of the Eighth USENIX Symposium on File and Storage Technologies (FAST10). San Jose, CA, 2010. 101–114
13 Hennessy J, Patterson D. Computer Architecture: A Quantitative Approach. 3rd ed. San Francisco: Morgan Kaufmann, 2003
14 Reed I S, Solomon G. Polynomial codes over certain finite fields. SIAM J Appl Math, 1960, 8: 300–304
15 Chang L P, Kuo T W. An adaptive stripping architecture for flash memory storage systems of embedded systems. In:
Proc of IEEE Eighth Real-Time and Embedded Technology and Applications Symposium. San Jose, CA, USA, 2002.
178–196
16 Wu C H, Chang L P, Kuo T W. An efficient R-tree implementation over flash-memory storage systems. In: Proc of the
11th International Symposium on Advances in Geographic Information Systems. New Orleans, Louisiana, USA, 2003.
17–24
17 PC traces for flash memory research. http://newslab.csie.ntu.edu.tw/flash/index.php?SelectedItem=Traces
18 OLTP Trace from UMass Trace Repository. http://traces.cs.umass.edu/index.php/Storage/Storage
19 Websearch Trace from UMass Trace Repository. http://traces.cs.umass.edu/index.php/Storage/Storage
20 Kim Y, Tauras B, Gupta A, et al. FlashSim: a simulator for NAND flash-based solid-state drives. Technical Report,
CSE-09-008 of the Pennsylvania State University. 2009
21 Belady L. A study of replacement algorithms for a virtual-storage computer. IBM Syst J, 1966, 5: 78–101