IEEE TRANSACTIONS ON MAGNETICS, VOL. 49, NO. 6, JUNE 2013 2805
Design and Evaluation of a Provenance-Based Rebuild Framework
Yulai Xie, Dan Feng, Zhipeng Tan, and Junzhe Zhou
Wuhan National Lab for Optoelectronics, School of Computers, Huazhong University of Science and Technology, Wuhan
430074, China
Previous storage reliability solutions mainly include log files, snapshots, backup, and ECC-based schemes. These technologies focus on whole-disk or whole-system reliability rather than on individual data objects, and recovering an entire system merely to rebuild a single file is inefficient. In this paper, we propose to use provenance, the origin or history of data objects, to rebuild damaged or lost files. Provenance-based rebuild reconstructs exactly the lost or damaged file by backtracking through its original generation process. The evaluation results show that provenance-based rebuild performs significantly better than log-based technology, with reasonable time and space overhead.
Index Terms—Data rebuild, provenance, reliability, storage system.
I. INTRODUCTION
STORAGE reliability has long been an important concern
in both desktop and server platforms. There already exists
a series of solutions, such as log-based technology, snapshot,
backup, and ECC-based schemes. These technologies ensure whole-system security or storage reliability by recording disk activities, copying the original data, or computing integrity-check data to guard against a crash.
On the other hand, users are increasingly concerned with rebuilding a single data object (e.g., an important damaged file). Compared to recovering a whole system, it is a far more common case that a user simply wants to rebuild one file or recover the contents of a vital file. Provenance, which describes the
origin or lineage of a data object, shows the details of the cre-
ation of a data object and reveals the dependency between dif-
ferent objects. Provenance can be used to debug the programs
[1], disclose the root cause of a system attack [2] and speed up
the desktop search [3]. Provenance can be used also as an im-
portant clue to regenerate the experimental data [4] and rebuild
the individual file by replaying the generation process.
Some studies [5] have already pointed out the advantages of using provenance to rebuild lost or broken files compared to traditional storage reliability schemes, such as per-file rebuild and parallel rebuild. In addition, they proposed a framework that uses the cloud as an infrastructure providing rich and dynamic resources to regenerate data, and analyzed the issues that need to be considered when building such a framework.
However, we still need to address a series of important issues before provenance-based rebuild can be applied in the real world.
1) How can we find the proper provenance for a rebuild?
2) Which factors can affect provenance-based rebuild performance?
3) Can provenance-based rebuild outperform traditional reliability schemes?
Manuscript received December 11, 2012; accepted February 28, 2013. Date
of current version May 30, 2013. Corresponding author: D. Feng (e-mail:
dfeng@hust.edu.cn).
Digital Object Identifier 10.1109/TMAG.2013.2251460
To address these problems, we present the design, implemen-
tation and evaluation of a provenance-based rebuild framework.
The framework is built with the following two design principles.
Transparency: Provenance-based rebuild is invoked automatically to regenerate a lost file during a read or write.
Modularity: The provenance-based rebuild framework can be easily incorporated into existing provenance collection systems, such as PASS [1].
The framework can accurately invoke a rebuild process, retrieve the provenance of the lost files, analyze the possible cases that may affect the rebuild, and execute the rebuild process in a user-space system. The evaluation results on this framework show that provenance-based rebuild significantly outperforms log-based technology. Though a series of factors may affect rebuild performance, we demonstrate that their impact is small.
The rest of the paper is organized as follows. We summarize background and related work in Section II and elaborate the design and implementation of our rebuild framework in Section III. In Section IV, we evaluate our framework. In Section V, we discuss some important research challenges. In Section VI, we list a series of typical application use cases for provenance-based rebuild. In Section VII, we conclude the paper.
II. BACKGROUND AND RELATED WORK
In this section, we first give an overview of provenance-aware storage systems. We then present related work on data reconstruction approaches and motivate our research.
A. Provenance-Aware Storage System
Generally, there exist three kinds of metadata: common metadata such as file ID or name, content-based metadata, and provenance. Provenance commonly takes the form of causality-based graphs that represent the dependency relationships between different data objects. For example, if A depends on B, then B should appear in the provenance of A.
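As an illustrative sketch (the names and structure are our assumptions, not PASS's actual data model), such a causality-based graph can be held as an adjacency map and walked to find every ancestor of an object:

```python
# Minimal sketch of a causality-based provenance graph: each object maps
# to the set of objects it directly depends on. Object names are illustrative.
provenance = {
    "A": {"B"},   # A depends on B, so B appears in A's provenance
    "B": {"C"},
    "C": set(),
}

def ancestors(obj, graph):
    """Return every object that obj transitively depends on."""
    seen, stack = set(), [obj]
    while stack:
        for parent in graph.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

For example, `ancestors("A", provenance)` walks the chain A → B → C and returns both ancestors of A.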
The Provenance-Aware Storage System (i.e., PASS [1]) observes traditional READ/WRITE system calls and collects provenance records accordingly. Any application system call such as READ or WRITE will be recorded in the provenance graphs. For example, the invocation of a READ system call on a file
by a process will construct a provenance edge that indicates the
process depends on the file. There are three kinds of nodes in the
PASS provenance graph: file, process and pipe. Each node has
related attributes that describe its identity information. For file,
PASS records its file ID and name. For process, PASS records its
PID, name, command line arguments, environment variables, a
reference to the file being executed, etc. PASS can use a local file system, network-attached storage, or the cloud as its storage backend [6].
B. Existing Reconstruction Approaches
Currently, the prevalent data reconstruction approach is the ECC-based scheme, which has been widely used in large-scale storage systems such as RAID5-structured storage [7]. The advantage of this scheme is its highly efficient reconstruction performance, which is independent of any software installed on the disks. Another widely used technology is the log-structured file system [8], which uses log files to record disk activities and can rebuild the system in case of a crash. Similarly, snapshot technology ensures system security by employing techniques such as copy-on-write. These approaches focus on whole-system security but neglect to secure a single file at a finer granularity.
Some recent work [9], [10] has proposed making maximum use of intermediate data generated in scientific workflows to reduce storage overhead and recomputation time. There is also work [11] exploring the tradeoff between storage and computation by using provenance. However, these works focus only on using provenance and recomputation to improve whole-system efficiency, not on the design and implementation of a provenance-based rebuild scheme.
Madden et al. [5] proposed using provenance generated by the PASS system to rebuild lost files. They focus on elaborating the advantages of provenance-based rebuild over the ECC scheme and on employing the cloud as the infrastructure to process rebuild tasks. For instance, provenance-based rebuild can provide a finer-grained, parallel, and priority-based method than the traditional ECC scheme.
In prior work [12], we outlined the basic rebuild framework and proposed utilizing active storage technology as an optional solution employed by the cloud to improve rebuild performance. This work makes a more comprehensive evaluation by explicitly comparing the performance of provenance-based rebuild with other schemes and analyzing the various factors that can impact rebuild performance.
III. DESIGN AND IMPLEMENTATION
In this section, we first describe the basic framework of provenance-based rebuild. Then we elaborate on the design issues of each component of the framework.
A. Provenance-Based Rebuild Framework
Fig. 1 shows the framework of Provenance-based Rebuild (PR). As depicted in Fig. 1, PR consists of five modules, namely, the Provenance Collection, the Provenance Query, the Factor Analysis, the Rebuild Execution, and the Recover Process.
Fig. 1. Architecture of PR.

Provenance Collection intercepts and analyzes READ/WRITE commands, translates them into provenance records, and then writes both the data and the provenance records to the disks. For a READ command, if Provenance Collection cannot get the data from the disk (i.e., the data is lost), it invokes the provenance rebuild modules for data reconstruction. In this case, Provenance Collection communicates with the Provenance Query module in user space and passes it information such as the name of the file to be rebuilt. Provenance Query is responsible for
querying the provenance of the lost file from the provenance database on the disk. Usually, the provenance of a file includes the process that generated it and the input parameters needed during the process execution. Factor Analysis is responsible for analyzing the cases in which data reconstruction may affect normal files. Rebuild Execution schedules the queried process for execution, using the queried input parameters as input, to regenerate the lost file. Finally, Recover Process restores the affected files to their normal state. In PASS, the provenance is usually stored in a database such as BerkeleyDB, and the query engine is a series of tools such as PQL [13].
B. Provenance Collection
We have to collect the provenance information necessary for data rebuild. For example, to make a process re-executable, we have to collect its complete execution environment. This typically includes the input parameters of the process, environment variables, the process name and identifiers, etc. For a file to be re-processed, this usually includes the file name, inode number, etc. In addition, we have to collect the relationship between a file and the process that processes it. For instance, "process P reads file A" will trigger the collection of a provenance record indicating that P depends on A.
Furthermore, in some cases we have to record the offset and the exact bytes written to a file. For example, if a user types some characters into a text file, then once this file is lost or broken, its successful recreation depends entirely on whether we have recorded the contents written to it. Generally speaking, if the creation of a file depends only on external input or messages from the network, then we have to collect the offset and the bytes written to the file as provenance information.
Note that the collected provenance should be written to disk before the data; otherwise, the data stored on the disks may have no provenance after a system crash. This ordering imposes a time overhead on normal reads and writes, which we evaluate in Section IV.
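This write ordering can be sketched as follows. The record layout, file names, and JSON encoding are our illustrative assumptions, not PASS's actual on-disk format:

```python
import json
import os

def write_with_provenance(path, offset, data, prov_log="prov.log"):
    """Sketch: record the file name, offset, and exact bytes written as a
    provenance record, and force that record to disk *before* writing the
    data itself, so a crash cannot leave data on disk without provenance."""
    record = {"file": path, "offset": offset,
              "bytes": data.decode("latin-1")}      # keep raw bytes losslessly
    with open(prov_log, "a") as log:
        log.write(json.dumps(record) + "\n")
        log.flush()
        os.fsync(log.fileno())                      # provenance reaches disk first
    mode = "r+b" if os.path.exists(path) else "wb"  # then write the data
    with open(path, mode) as f:
        f.seek(offset)
        f.write(data)
```

The fsync before the data write is what enforces the ordering the text requires; without it, the file system may reorder the two writes.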
C. Provenance Queries
Efficient provenance query is a must for high-performance rebuild. In the current implementation, we store the collected provenance in a BerkeleyDB database. Information describing the basic identity of a process or file, such as environment variables or file names, is stored in an identityDB database; information describing the relationships between processes and files is stored in an ancestorDB database. We assign each process or file a unique number (called a pnode number). For more efficient queries, we build some indices, such as a NameDB that maps the name of a file or process to its pnode number. If process P reads file A and then writes the information to file B, the queries for the provenance of file B typically proceed as follows.
1) Query NameDB to find the pnode number of file B, using the file name of B as the key.
2) Query ancestorDB to find the process P that file B depends on, using the pnode number of file B as the key.
3) Query identityDB to find the identity information of process P, such as input parameters and environment variables.
4) Query ancestorDB to find the file A that process P depends on.
5) Execute process P to regenerate file B.
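The query sequence above can be sketched with plain dictionaries standing in for the BerkeleyDB databases; the pnode numbers and records here are illustrative, not the framework's actual schema:

```python
# In-memory stand-ins for the three databases described in Section III-C.
name_db = {"B": 3, "P": 2, "A": 1}           # name -> pnode number (NameDB)
ancestor_db = {3: [2], 2: [1]}               # pnode -> ancestor pnodes
identity_db = {                              # pnode -> identity information
    2: {"type": "process", "name": "P", "argv": ["P"], "env": {}},
    1: {"type": "file", "name": "A"},
    3: {"type": "file", "name": "B"},
}

def provenance_of(file_name):
    """Steps 1-4 of the query sequence for a lost file."""
    pnode = name_db[file_name]               # 1) pnode number of file B
    producer = ancestor_db[pnode][0]         # 2) process P that B depends on
    identity = identity_db[producer]         # 3) identity information of P
    inputs = [identity_db[a]["name"]
              for a in ancestor_db.get(producer, [])]  # 4) files P depends on
    return identity, inputs                  # 5) re-run P on these inputs
```

Step 5, the actual re-execution, would then launch the process described by `identity` with the recovered inputs.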
D. Factor Analysis
Although the rebuild execution can rebuild a file after acquiring its provenance (i.e., the process that generated it and the input parameters), it can also affect other normal files. Typically, if a rebuild process generates multiple files at a time but only one of them is the lost file that needs to be rebuilt, then the other generated files are extraneous. The extraneous files generated during rebuild can be categorized into the following three types.
1) The files exist before rebuild and are exactly the same as the generated files. Thus, the generated files simply overlay the existing files.
2) The files do not exist before rebuild, so the generated files should be deleted.
3) The files exist with a higher version before rebuild. The existing higher-version files should therefore be renamed before rebuild and should overlay the generated files after rebuild.
For instance, as shown in Fig. 2, process P generates file A, file B, and file C. File B and file C are then modified into B1 (i.e., file B with version 1) and C1 (i.e., file C with version 1) by processes P1 and P2, respectively. When file A is lost and needs to be rebuilt using process P, file B and file C will also be regenerated. As file B and file C have the same names as B1 and C1, they would automatically overlay the existing B1 and C1. To prevent this, we record the version numbers of B1 and C1 and rename them before rebuild if their versions are not zero. Then, when the new B and C are generated, we use B1 and C1 to overlay them, respectively.
Fig. 2. Example of how a file is affected in rebuild.
Note that in this step, we first have to query the provenance database to judge whether the generated extraneous files existed before rebuild. If they did, we have to further check whether they have higher versions and rename those that do. We evaluate how these checks impact the overall rebuild performance in Section IV.
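The decision among the three types can be sketched as follows; the two callbacks are hypothetical stand-ins for the provenance database queries just described, not the framework's actual interfaces:

```python
def classify_generated(name, existed_before, version_of):
    """Classify one extraneous file generated during rebuild into the
    three types of Section III-D. `existed_before(name)` and
    `version_of(name)` stand in for queries to the provenance database."""
    if not existed_before(name):
        return "delete"    # type 2: did not exist, delete after rebuild
    if version_of(name) > 0:
        return "rename"    # type 3: higher version exists, rename it first
    return "overlay"       # type 1: identical file, simply overwritten
```

In Fig. 2's scenario, B and C would both classify as "rename", since B1 and C1 carry nonzero versions.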
E. Rebuild Execution
As stated above, the basic provenance rebuild scheme regenerates a file by first searching for its ancestors in the provenance chain. In most cases, the ancestors are already on the disks. However, if any ancestor does not exist (it may also be lost), we have to regenerate the ancestors first. For instance, if process P reads file A and writes file B (i.e., B depends on P, which depends on A), then when we regenerate B and find that A does not exist, we have to first regenerate A (e.g., using process P1), and then use P to regenerate B.
However, this rebuild sequence (first P1, then P) is not always correct. Sometimes a process cannot execute separately but must interact with another process to complete a rebuild task. For instance, suppose we use gcc to compile a file called hello.c to hello.o. The provenance graph for this process is shown in Fig. 3. In our experiment, we search the provenance graph and locate the tmp file as an ancestor of P (i.e., gcc). But we find that the tmp file does not exist on the disk, so we run P1 (i.e., cc1) to generate it. However, we still cannot find the regenerated tmp file on the disk, because it is a temporary file that never exists on the disk in a stable state. Furthermore, we find that P1 depends on P (i.e., gcc), which forks P1. In this case, we do not have to rebuild the tmp file separately using P1; we only have to run process P, which automatically forks P1 to generate the tmp file, and the tmp file is used as an input to P before it disappears from the disk.
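Setting aside the process-interaction case just described, the basic "regenerate missing ancestors first" logic can be sketched recursively, with hypothetical callbacks standing in for the provenance database and process execution:

```python
def rebuild(name, exists, inputs_of, producer_of, run):
    """Sketch of recursive rebuild (Section III-E): regenerate any missing
    ancestor files first, then re-run the process that produced `name`.
    The callbacks (exists, inputs_of, producer_of, run) are illustrative
    hooks onto the provenance database and process execution."""
    for ancestor in inputs_of(name):
        if not exists(ancestor):               # ancestor also lost:
            rebuild(ancestor, exists, inputs_of, producer_of, run)
    run(producer_of(name))                     # re-execute the producer
```

Note that this sketch does not cover the gcc/cc1 case above, where a missing ancestor is produced by a child forked by the very process being re-run; that case requires running the parent process instead.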
F. Recover Process
After rebuild, we have to delete the unnecessary files and restore the files with higher versions. For instance, in Fig. 2, we should delete file B and file C, and rename files B1 and C1 back to the names of file B and file C, respectively.
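This recovery step can be sketched as follows; the argument names are illustrative, and the actual framework performs these operations in user space after Rebuild Execution finishes:

```python
import os

def recover(generated_extras, renamed_back):
    """Sketch of the Recover Process (Section III-F): delete the extraneous
    files regenerated during rebuild, then give the renamed higher-version
    files their original names back (e.g., B1 -> B, C1 -> C in Fig. 2)."""
    for path in generated_extras:              # regenerated extraneous files
        if os.path.exists(path):
            os.remove(path)
    for temp_name, original_name in renamed_back.items():
        os.rename(temp_name, original_name)    # restore the higher version
```

On POSIX systems os.rename overwrites an existing target atomically, which makes the overlay of the regenerated file by its higher version safe even if the delete step is skipped.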
IV. EVALUATION
In this section, we evaluate the performance of the provenance-based rebuild scheme. First, we compare provenance-based rebuild performance with log-based and snapshot technologies. Second, we evaluate how a variety of factors impact provenance-based rebuild performance. Then, we analyze the space and time overhead imposed by provenance.

Fig. 3. Example of process interaction in rebuild.
A. Experimental Setup
We run our experiments on a Fedora Linux 2.6.23.17 operating system with an Intel Pentium 4 2.66 GHz processor, 1 GB DDR400 memory, and an 80 GB hard drive.
We collect provenance using the PASS system (PASSv2) and store it in a BerkeleyDB (version 4.6.21) database. For log-based technology, we use a tool called ext3grep [14] to analyze the ext3 log file, find the logical address of the lost file, and recover it. For snapshot technology, we employ LVM (Logical Volume Management), which utilizes copy-on-write to back up and recover the data.
Since the purpose of our evaluation is to validate whether provenance is good enough for rebuilding data, we adopt two typical data generation methods that are frequently used in everyday computing: copy and uncompress. They use the /bin/cp and /bin/tar processes, respectively, to process and generate files in our experiments. Note that we do not use benchmarks (e.g., Postmark) to measure rebuild performance, since most of these benchmarks are designed purely to evaluate system I/O performance, not to rebuild real data. For instance, one never rebuilds a lost or broken file using the Postmark process.
B. Rebuild Performance
Fig. 4 shows the rebuild performance of the provenance-based, log-based, and snapshot technologies for files of different sizes. For each technology, we first copy data of different sizes to an empty file in a directory using the /bin/cp process, then delete the file, and finally rebuild it using that technology. In the provenance case, the directory is where we mount a PASS volume, so the PASS system automatically collects provenance when the copy happens. In the snapshot case, we make a snapshot using LVM before deleting the file. We can see that provenance-based rebuild performs comparably to the snapshot technology and significantly outperforms the log-based technology.
Fig. 4. Rebuild performance of provenance-based, log-based, and snapshot technologies for files of different sizes. In provenance-based rebuild, we copy files of different sizes to the PASS volume and collect provenance. We then delete the file on the PASS volume, query the provenance of that file, and regenerate it using the provenance.
Fig. 5. Breakdown of rebuild performance with different numbers of files generated during rebuild.
This is because the log-based technology has to scan the log file for the address of the deleted/lost file, which consumes a lot of time. By comparison, the provenance information is stored in a database, and querying provenance from an indexed database is much more efficient. The snapshot technology only needs to copy the data back from the snapshot volume, so it is also very efficient. But the drawback of snapshot technology is that the time at which a snapshot is taken depends on the user, not on the time at which the file was broken or lost. It is therefore not easy to recover the lost file to its most recent version using snapshot technology.
C. Factors That Impact Rebuild Performance
The rebuild time is composed of the time costs of a series of steps, as shown in Fig. 1. The performance of these steps is impacted by a variety of factors, as follows.
Number of Files Generated in Rebuild: The rebuild process may generate a series of files even though only one or two files actually need to be rebuilt. These extraneous files can raise the rebuild execution time, as well as the Provenance Query and Factor Analysis times, both of which typically need more time to query the provenance database. As shown in Fig. 5, we use the /bin/tar process to uncompress a package that contains different numbers of files (10, 100, and 1000, respectively). We can see that the time costs of rebuild execution, provenance query, and factor analysis all grow with the number of files generated.
Files Affected: As stated in Section III-D, the extraneous files generated during rebuild fall into three types: they are overlaid by identical existing files, deleted directly, or overlaid by existing higher-version files, which must be renamed before rebuild and renamed back to their original names afterward.

Fig. 6. Rebuild time breakdown for different applications. For the /bin/tar process, we uncompress a .tar.gz file and generate 100 files. For the /bin/cp process, we copy 1 KB of data to each of 100 newly created files. (a) /bin/tar. (b) /bin/cp.

Fig. 7. Breakdown of rebuild performance with different sizes of the provenance database. The number of records in the database ranges from 10 000 to 100 million.
Fig. 6 shows the rebuild time breakdown of the /bin/tar and /bin/cp processes for these three types. In the first type, the two processes generate 100 files that are exactly the same as the 100 existing files. In the second type (i.e., 50-delete), 50 of the 100 files do not exist before rebuild and need to be deleted. In the third type (i.e., 50-version), 50 of the 100 files have higher versions before rebuild. Because this involves querying the database for the higher-version files and renaming them, the Factor Analysis time increases considerably in the third type. However, since the Factor Analysis time takes up only a small portion of the overall rebuild time, the overall rebuild time does not increase much. For the tar process, the whole rebuild time in the third type exceeds that of the first type by 17.0% and that of the second type by 11.8%. For the /bin/cp process, the whole rebuild time in the third type exceeds that of the first type by only 10.3% and that of the second type by 7.1%.
Size of the Provenance Database: If the provenance database is very big, querying a few provenance records from it can incur a large time overhead, making the Provenance Query and Factor Analysis times very long. We measured the breakdown of rebuild time using the /bin/cp process for databases of different sizes, as shown in Fig. 7. The whole rebuild time scales linearly as the time costs of Provenance Query and Factor Analysis increase. This indicates that maintaining a small provenance database can significantly benefit the rebuild performance.
Execution Process: The rebuild time strongly depends on the
execution process. The discussion on this is beyond the scope
of this paper.
D. Overhead Analysis
The provenance used for rebuild has to be collected during the read or write process. As provenance has to be written to the disk before the data, this collection process can impact normal read/write performance. In addition, collecting and storing provenance incurs space overhead.
We evaluate the overhead of the /bin/tar and /bin/cp processes as shown in Fig. 8. "Ext3" represents the case without provenance; "PR" indicates that we collect provenance using the PASS system and rebuild data using our provenance-based rebuild framework. The elapsed time overhead is 24.5% for tar and 40.9% for cp; the increase comes from the extra writes needed to record provenance. The space overhead is 10.3% for tar and 55.3% for cp.
To better explain why some of the overheads are so high, we measured the space and elapsed time overheads of the cp process for rebuilding files of different sizes, as shown in Tables I and II, respectively. Table I shows that the size of the provenance stays constant. This is because the provenance records only the relationship between the cp process and the files, not the contents of the files. As the size of the provenance does not change, the space overhead decreases as the size of the rebuilt file increases. Likewise, the time to write this provenance is almost constant, but because the cp process takes longer as the size of the copied data grows, the relative provenance time overhead becomes smaller and smaller (see Table II). We can see that the space overhead becomes negligible and the time overhead very small when the rebuilt file is very large. This indicates that provenance-based rebuild can be very useful for rebuilding big data.
V. CHALLENGES
Though we have presented a basic framework for provenance-based rebuild and evaluated its performance, its overhead, and a variety of factors that can impact its performance, a number of challenges remain that we consider decisive for whether it is good enough for practical use.
Nondeterministic Processes: Theoretically, we can rebuild any lost file if the execution environment is exactly the same as when the file was first generated. However, we cannot always obtain the same result if the execution of a process is nondeterministic, for example, a process seeded with random values.
Fig. 8. Provenance time and space overhead for the tar and cp workloads. For the tar process, we uncompress a .tar.gz file and generate 100 files. For the cp process, we copy 1 KB of data to each of 100 newly created files. (a) Time overhead. (b) Space overhead.
TABLE I
SPACE OVERHEADS (IN MB) FOR PR (PROVENANCE-BASED REBUILD) WHEN USING THE CP PROCESS TO REBUILD FILES OF DIFFERENT SIZES

TABLE II
ELAPSED TIME OVERHEADS (IN SECONDS) FOR PR WHEN USING THE CP PROCESS TO REBUILD FILES OF DIFFERENT SIZES
Size of Provenance: Though the provenance we have collected so far enables rebuild most of the time, in more complex cases we may need to collect more information, such as operating system information or the version of the Linux kernel. On the other hand, unoptimized provenance can take up substantial storage space. How to collect provenance accurately enough to enable rebuild while retaining a relatively small amount of it is a big challenge. We are exploring how to reduce the provenance size by exploiting the characteristics of provenance [15], [16].
Provenance or RAID?: Provenance-based rebuild has some important advantages over the traditional ECC scheme in RAID-structured systems, such as per-file rebuild and parallel rebuild. Additionally, it can rebuild the data on any number of disks in a RAID system, even when all of those disks have crashed, as long as the provenance of the data on those disks has been saved in a separate storage pool. However, we still have to use classical techniques such as ECC or backup to secure the provenance data itself.
Rebuild in a Network-Attached Environment: There already exists a provenance-aware NFS architecture [4] that generates data and collects provenance at the client and stores both on the server. There are two options for rebuilding data using provenance in this setting. The first is to rebuild the data at the client, since the lost/damaged data was originally generated there; however, all the provenance of the file must be transferred to the client, which incurs network overhead. The second is to rebuild the data on the server; however, the execution environment on the server must then be similar or identical to that of the client where the data was first generated.
VI. APPLICATION USE CASES
To give a better understanding of where provenance-based rebuild can be used, we list a series of typical application use cases.
Validate Experimental Data: Scientists often want to reproduce experimental results to check whether the whole experimental process is correct, even when they have already obtained some results. They can use the provenance of the experimental data to replay the experiment automatically and conveniently.
Enhance Backup: Backup technology has been widely used in enterprise systems and cloud infrastructures. However, we consider provenance better suited in the following cases. 1) Backing up big data is expensive, since storing multiple copies of the data consumes more storage space than storing only the common workflows and input data (i.e., the provenance) that generate it. 2) Backup can provide a series of historical versions of a document for a user to choose from for data recovery; provenance further tells the user the relationships between these versions and which one is the right one to recover.
Play Video: Many video websites provide the same video at different definitions for users with different requirements. For instance, a user may only want the video to play smoothly even if the definition is not high. A website usually keeps several copies of the video at each definition in case of a crash. A better solution may be to keep only one copy at each definition for online play, and to use provenance to record how to convert the original video into the other definitions. This can usually save at least 50% of the storage space.
High Performance Computing: In HPC, scientists commonly use a small amount of input data and some transformation rules to generate enormous amounts of data for analysis. It is not cost-effective to keep all these data. Recording only the input data and the transformation rules (i.e., the provenance) can significantly reduce storage overhead.
VII. CONCLUSION
In this paper, we have presented our experience in designing and implementing a provenance-based rebuild framework. We have discussed a wide variety of issues that must be solved when implementing such a framework. The experimental results show that provenance-based rebuild significantly outperforms log-based technology for per-file rebuild, with reasonable space and time overhead.
ACKNOWLEDGMENT
This work was supported in part by the National Basic Research 973 Program of China under Grant 2011CB302301, by the NSFC under Grants 61025008, 61232004, and 61173043, and by the 863 Program under Grant 2012AA012403. The authors thank the anonymous reviewers for their comments on this paper.
REFERENCES
[1] K.-K. Muniswamy-Reddy, D. A. Holland, U. Braun, and M. I. Seltzer,
“Provenance-aware storage systems,” in Proc. USENIX’06, 2006.
[2] S. T. King and P. M. Chen, “Backtracking intrusions,” in Proc.
SOSP’03, Bolton Landing, NY, USA, 2003.
[3] S. Shah, C. A. N. Soules, G. R. Ganger, and B. D. Noble, “Using prove-
nance to aid in personal file search,” in Proc. USENIX’07, 2007.
[4] K.-K. Muniswamy-Reddy, “Foundations for provenance-aware sys-
tems,” Ph.D. thesis, Harvard Univ., Cambridge, MA, 2010.
[5] B. A. Madden, I. F. Adams, M. W. Storer, E. L. Miller, D. D. E. Long, and T. M. Kroeger, "Provenance based rebuild: Using data provenance to improve reliability," Tech. Rep. UCSC-SSRC-11-04, 2011.
[6] K.-K. Muniswamy-Reddy, P. Macko, and M. I. Seltzer, “Provenance
for the cloud,” in Proc. FAST’10, 2010.
[7] S. Wu, H. Jiang, D. Feng, L. Tian, and B. Mao, “WorkOut: I/O work-
load outsourcing for boosting RAID reconstruction performance,” in
Proc. 7th USENIX Conf. File and Storage Technol., 2009.
[8] M. Rosenblum and J. K. Ousterhout, “The design and implementation
of a log-structured file system,” ACM Trans. Comput. Syst., vol. 10, pp.
26–52, 1992.
[9] S. Y. Ko, I. Hoque, B. Cho, and I. Gupta, "On availability of intermediate data in cloud computations," in Proc. HotOS'09, 2009.
[10] D. Yuan, Y. Yang, X. Liu, and J. Chen, “A cost-effective strategy for in-
termediate data storage in scientific cloud workflow systems,” in Proc.
24th IEEE Int. Parallel Distrib. Process. Symp., 2010.
[11] I. Adams, D. D. E. Long, E. L. Miller, S. Pasupathy, and M. W. Storer,
“Maximizing efficiency by trading storage for computation,” in Proc.
Workshop on Hot Topics in Cloud Comput., 2009.
[12] Y. Xie, D. Feng, Z. Tan, L. Chen, and J. Zhou, “Experiences building a
provenance-based reconstruction system,” in Proc. Storage Syst., Hard
Disk and Solid State Technol. Summit in Conjunction With the APMRC
Conf., 2012.
[13] [Online]. Available: http://www.eecs.harvard.edu/syrah/pql
[14] [Online]. Available: http://code.google.com/p/ext3grep/
[15] Y. Xie, K.-K. Muniswamy-Reddy, D. Feng, L. Yan, D. D. E. Long, Z. Tan, and L. Chen, "A hybrid approach for efficient provenance storage," in Proc. 21st ACM Int. Conf. Inf. Knowledge Manage. (CIKM), 2012.
[16] Y. Xie, K.-K. Muniswamy-Reddy, D. D. E. Long, A. Amer, D. Feng,
and Z. Tan, “Compressing provenance graphs,” in Proc. 3rd USENIX
Workshop on the Theory and Practice of Provenance, 2011.