IEEE TRANSACTIONS ON MAGNETICS, VOL. 49, NO. 6, JUNE 2013 2805
Design and Evaluation of a Provenance-Based Rebuild Framework
Yulai Xie, Dan Feng, Zhipeng Tan, and Junzhe Zhou
Wuhan National Lab for Optoelectronics, School of Computers, Huazhong University of Science and Technology, Wuhan
430074, China
Previous storage reliability solutions mainly include log files, snapshots, backups or ECC-based schemes. These technologies focus on whole-disk or whole-system reliability rather than on single data objects. However, it is not efficient to rebuild a file by recovering a whole system using the technologies above. In this paper, we propose to use provenance, the origin or history of objects, to rebuild damaged or lost files. Provenance-based rebuild can exactly reconstruct a lost or damaged file by backtracking its generation process. The evaluation results show that provenance-based rebuild performs significantly better than log-based technology, with reasonable time and space overhead.
Index Terms—Data rebuild, provenance, reliability, storage system.
I. INTRODUCTION
STORAGE reliability has long been an important concern on both desktop and server platforms. A series of solutions already exists, such as log-based technology, snapshots, backups and ECC-based schemes. These technologies ensure whole-system security or storage reliability by recording disk activities, copying original data or computing integrity-check data in case of a crash.
On the other hand, people are more and more concerned about rebuilding a single data object (e.g., an important damaged file). Compared to recovering a whole system, it is also a very common case that a user wants to rebuild a file or recover the content of a vital file. Provenance, which exactly describes the origin or lineage of a data object, shows the details of the creation of a data object and reveals the dependencies between different objects. Provenance can be used to debug programs [1], disclose the root cause of a system attack [2] and speed up desktop search [3]. Provenance can also be used as an important clue to regenerate experimental data [4] and to rebuild an individual file by replaying its generation process.
Some studies [5] have already pointed out the advantages of using provenance to rebuild lost/broken files compared to traditional storage reliability schemes, such as per-file rebuild and parallel rebuild. In addition, they proposed to build a framework that uses the cloud as an infrastructure providing rich and dynamic resources to regenerate data, and analyzed the issues that need to be considered when building this kind of framework.
However, we still need to address a series of important issues
before provenance-based rebuild can be really applied in the real
world.
1) How can we find the proper provenance for rebuild?
2) Which factors can affect provenance-based rebuild performance?
3) Can the performance of provenance-based rebuild beat that of traditional reliability schemes?
Manuscript received December 11, 2012; accepted February 28, 2013. Date
of current version May 30, 2013. Corresponding author: D. Feng (e-mail:
dfeng@hust.edu.cn).
Digital Object Identifier 10.1109/TMAG.2013.2251460
To address these problems, we present the design, implementation and evaluation of a provenance-based rebuild framework. The framework is built with the following two design principles.
Transparency: Provenance-based rebuild is invoked to execute and regenerate a lost file during a read or write.
Modularity: The provenance-based rebuild framework can be easily incorporated into existing provenance collection systems, such as PASS [1].
The framework can accurately invoke a rebuild process, retrieve the provenance of lost files, analyze the possible cases that may affect the rebuild, and execute the rebuild process in a user-space system. The evaluation results on this framework show that provenance-based rebuild significantly outperforms log-based technology. Though a series of factors may affect rebuild performance, we demonstrate that they do not affect it much.
The rest of the paper is organized as follows. We summarize background and related work in Section II and elaborate the design and implementation of our rebuild framework in Section III. In Section IV, we evaluate our framework. In Section V, we discuss some important research challenges. In Section VI, we list a series of typical application use cases for provenance-based rebuild. In Section VII, we conclude the paper.
II. BACKGROUND AND RELATED WORK
In this section, we first give an overview of provenance-aware storage systems. Second, we present related work on data reconstruction approaches and then motivate our research.
A. Provenance-Aware Storage System
Generally, there exist three kinds of metadata: common metadata such as file ID or name, content-based metadata and provenance. Provenance is commonly in the form of causality-based graphs that represent the dependency relationships between different data objects. For example, if A depends on B, then B should appear in the provenance of A.
The Provenance-Aware Storage System (i.e., PASS [1]) observes traditional READ/WRITE system calls and accordingly collects provenance records. Any application system call such as READ or WRITE will be recorded in the provenance graphs. For example, the invocation of a READ system call to a file
by a process will construct a provenance edge that indicates the process depends on the file. There are three kinds of nodes in the PASS provenance graph: file, process and pipe. Each node has related attributes that describe its identity information. For a file, PASS records its file ID and name. For a process, PASS records its PID, name, command line arguments, environment variables, a reference to the file being executed, etc. PASS can use a local file system, network-attached storage or the cloud as its storage backend [6].
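As a concrete illustration, the READ/WRITE-to-edge mapping described above can be sketched with a small in-memory graph. The field names and helper functions here are our own simplification, not PASS's actual record schema:

```python
# Minimal sketch of PASS-style provenance records (illustrative field
# names, not the actual PASS schema). Each node is a file, process or
# pipe; a READ by a process adds an edge "process depends on file",
# and a WRITE adds an edge "file depends on process".

provenance = {"nodes": {}, "edges": []}  # edges: (dependent, ancestor)

def add_node(pnode, kind, **attrs):
    provenance["nodes"][pnode] = {"kind": kind, **attrs}

def record_read(proc, file):
    # READ system call: the process now depends on the file's contents.
    provenance["edges"].append((proc, file))

def record_write(proc, file):
    # WRITE system call: the written file depends on the process.
    provenance["edges"].append((file, proc))

add_node("A", "file", name="input.txt")
add_node("P", "process", name="sort", argv=["sort", "input.txt"])
add_node("B", "file", name="output.txt")
record_read("P", "A")   # P reads A  -> P depends on A
record_write("P", "B")  # P writes B -> B depends on A via P

def ancestors(pnode):
    """All nodes reachable upstream of pnode in the dependency graph."""
    found = set()
    stack = [pnode]
    while stack:
        n = stack.pop()
        for dep, anc in provenance["edges"]:
            if dep == n and anc not in found:
                found.add(anc)
                stack.append(anc)
    return found
```

In this toy graph, the ancestors of B are exactly the process P and the file A, which is the dependency chain a rebuild would replay.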
B. Existing Reconstruction Approaches
Currently, the prevalent data reconstruction approach is the ECC-based scheme, which has been widely used in large scale storage systems such as RAID5-structured storage systems [7]. The advantage of this scheme is its highly efficient reconstruction performance, independent of any software installed on the disks. Another widely used technology is the log-structured file system [8], which uses log files to record disk activities and can rebuild the system in case of a crash. Similarly, snapshot technology ensures system security by employing techniques such as copy-on-write. These approaches focus more on whole-system security, but neglect to secure a single file at a finer granularity.
Some recent work [9], [10] has proposed to make maximum use of the intermediate data generated in scientific workflows to reduce storage overhead and improve recomputation times. There are also works [11] looking for the tradeoff between storage and computation by using provenance. However, these works only focus on using provenance and recomputation to improve whole-system efficiency, not on the design and implementation of a provenance-based rebuild scheme.
Madden et al. [5] proposed to use provenance generated by the PASS system to rebuild lost files. They focus on elaborating the advantages of provenance-based rebuild compared to the ECC scheme and on employing the cloud as the infrastructure to process rebuild tasks. For instance, provenance-based rebuild can provide a finer-granularity, parallel and priority-based method than the traditional ECC scheme.
In prior work [12], we outlined the basic rebuild framework
and proposed to utilize active storage technology as an optional
solution to be employed by the cloud to improve rebuild performance. This work makes a more comprehensive evaluation by
explicitly comparing the performance of provenance-based re-
build with other schemes and analyzing the various factors that
can impact the rebuild performance.
III. DESIGN AND IMPLEMENTATION
In this section, we first state the basic framework of provenance-based rebuild. Then we elaborate the design issues
of each component of the framework.
A. Provenance-Based Rebuild Framework
Fig. 1 shows the framework of Provenance-based Rebuild (PR). As depicted in Fig. 1, PR consists of five modules, namely, the Provenance Collection, the Provenance Query, the Factor Analysis, the Rebuild Execution and the Recover Process.
Provenance Collection gets and analyzes READ/WRITE commands and translates them into provenance records, then
Fig. 1. Architecture of PR.
writes both the data and the provenance records to the disks. For a READ command, if Provenance Collection cannot get the data from disk (i.e., the data is lost), it will invoke the provenance rebuild module for data reconstruction. In this case, Provenance Collection communicates with the Provenance Query module in user space and passes it information such as the name of the file to be rebuilt. Provenance Query is responsible for querying the provenance of the lost file from the provenance database on disk. Usually, the provenance of a file includes the process that generated it and the input parameters needed during the process execution. Factor Analysis is responsible for analyzing the cases in which data reconstruction may affect normal files. Rebuild Execution schedules the queried process for execution, using the queried input parameters as input, to generate the lost file. At last, Recover Process recovers the affected files to their normal state. In PASS, the provenance is usually stored in a database such as BerkeleyDB. The query engine is a series of tools such as PQL [13].
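The interplay of these modules on a failed READ can be sketched as a toy read path. Dicts stand in for the disk and the provenance database; the module boundaries follow Fig. 1, but every name and body here is illustrative rather than the actual implementation:

```python
# Skeleton of the PR pipeline on a failed READ. A provenance record
# stores the generating process (here a plain function) and its input
# files; module names follow Fig. 1 (illustrative bodies only).

disk = {"input.txt": "data"}                      # files currently on disk
prov_db = {"output.txt": {"process": lambda d: d.upper(),
                          "inputs": ["input.txt"]}}

def read(name):
    if name in disk:
        return disk[name]                         # normal READ path
    return rebuild(name)                          # data lost: invoke PR

def rebuild(name):
    record = prov_db[name]                        # Provenance Query
    missing = [f for f in record["inputs"]
               if f not in disk]                  # Factor Analysis (simplified)
    assert not missing, "ancestors must be rebuilt first"
    inputs = [disk[f] for f in record["inputs"]]
    disk[name] = record["process"](*inputs)       # Rebuild Execution
    return disk[name]                             # nothing for Recover Process here
```

A call such as `read("output.txt")` then transparently regenerates the lost file from its recorded process and inputs, which is the behavior the Transparency principle asks for.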
B. Provenance Collection
We have to collect the necessary provenance information for data rebuild. For example, in order to make a process re-executable, we have to collect the complete execution environment of the process. This typically includes the input parameters of the process, environment variables, the process name/identifiers, etc. For a file to be re-processed, this usually includes the file name, inode number, etc. In addition, we have to collect the relationship between a file and the process that processes it. For instance, “process P reads file A” will invoke the collection of a provenance record indicating that P depends on A.
Furthermore, we have to record the offset and the exact bytes written to a file in some cases. For example, if a user inputs some characters into a text file, then once this file is lost or broken, the success of recreating this text file depends solely on whether we have recorded the contents written to it. Generally speaking, if the creation of a file depends only on external input or messages from the network, then we have to collect the offset and the bytes written to the file as provenance information.
Note that the provenance collected should be written to disk before the data; otherwise, the data stored on the disks may have no provenance in case of a system crash. But this imposes a time overhead on normal reads and writes, which we evaluate in Section IV.
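A minimal sketch of this collection and ordering discipline follows, with in-memory lists standing in for the provenance log and the disk (this is not PASS's real on-disk format; the record fields are our own):

```python
# Sketch of the write path in Section III-B: when a file's content comes
# only from external input, the offset and the bytes themselves are kept
# as provenance, and the provenance record is persisted *before* the
# data so a crash never leaves data without provenance.

prov_log = []   # persisted first (stand-in for the provenance database)
disk = {}       # persisted second (stand-in for the data on disk)

def provenance_aware_write(fname, offset, data, source="external"):
    record = {"file": fname, "offset": offset, "source": source}
    if source == "external":
        record["bytes"] = data        # content not reproducible: keep bytes
    prov_log.append(record)           # step 1: provenance hits the disk
    buf = bytearray(disk.get(fname, b""))
    buf[offset:offset + len(data)] = data
    disk[fname] = bytes(buf)          # step 2: only then the data

provenance_aware_write("notes.txt", 0, b"hello")

def rebuild_from_log(fname):
    # Replay the recorded writes to reconstruct the lost file.
    buf = bytearray()
    for rec in prov_log:
        if rec["file"] == fname and "bytes" in rec:
            off = rec["offset"]
            buf[off:off + len(rec["bytes"])] = rec["bytes"]
    return bytes(buf)
```

Replaying the log reproduces the file byte-for-byte, which is exactly why externally sourced writes must carry their contents in the provenance.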
C. Provenance Queries
Efficient provenance query is a must for high performance rebuild. In the current implementation, we store the collected provenance in a BerkeleyDB database. Information describing the basic identity of a process or file, such as environment variables or file names, is stored in an identityDB database; information describing the relationship between processes and files is stored in an ancestorDB database. We assign each process or file a unique number (called a pnode number). For more efficient queries, we build some indices, such as NameDB, which stores the mapping between the name of a file or process and its pnode number. If process P reads file A, then writes the information to file B, the queries for the provenance of file B typically proceed as follows.
1) Query NameDB to find the pnode number of file B, using the file name of B as the keyword.
2) Query ancestorDB to find the process P that file B depends on, using the pnode number of file B as the keyword.
3) Query identityDB to find the identity information of process P, such as input parameters and environment variables.
4) Query ancestorDB to find the file A that process P depends on.
5) Execute the process P to generate file B.
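The five steps above can be traced with dict stand-ins for NameDB, ancestorDB and identityDB. The keys and record layouts are illustrative only, not the actual BerkeleyDB schemas:

```python
# Dict stand-ins for the three databases in Section III-C.
name_db = {"B": 3, "P": 2, "A": 1}            # name -> pnode number
ancestor_db = {3: [2], 2: [1]}                # pnode -> ancestor pnodes
identity_db = {2: {"kind": "process", "name": "P",
                   "argv": ["P", "A"], "env": {"LANG": "C"}},
               1: {"kind": "file", "name": "A"}}

def provenance_of(file_name):
    pnode = name_db[file_name]                    # step 1: NameDB lookup
    (proc,) = ancestor_db[pnode]                  # step 2: generating process
    identity = identity_db[proc]                  # step 3: process identity
    inputs = [identity_db[a]["name"]
              for a in ancestor_db.get(proc, [])]  # step 4: process inputs
    return identity, inputs                        # step 5: caller runs the process

identity, inputs = provenance_of("B")
```

With the identity record (command line, environment) and the input list in hand, the Rebuild Execution module has everything needed to re-run P and regenerate B.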
D. Factor Analysis
The rebuild execution, though it can rebuild a file after acquiring its provenance (i.e., the process that generates it and the input parameters), can also affect other normal files. Typically, if a rebuild process generates multiple files at a time, but only one of them is the lost file that needs to be rebuilt, then the other generated files are excrescent. The excrescent files generated during rebuild can be categorized into the following three types.
1) The files exist before rebuild and are completely the same as the generated files. Thus, the generated files will automatically overlay the existing files.
2) The files do not exist before rebuild. So, the generated files should be deleted.
3) The files exist with a higher version before rebuild. So, the existing files with a higher version should be renamed before rebuild, and overlay the generated files after rebuild.
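The three types above can be sketched as a pre-rebuild classification pass. This is a simplified model in which `on_disk` maps a file name to its version number, version 0 means the file is unmodified since generation, and the function name is ours:

```python
# Classify the excrescent files a rebuild will emit (Section III-D).
# generated: files the rebuild process will produce
# needed:    the one lost file we actually want back
# on_disk:   existing file -> version (0 = unmodified since generation)

def plan_excrescent_files(generated, needed, on_disk):
    plan = {"overlay": [], "delete": [], "rename": []}
    for f in generated:
        if f == needed:
            continue                       # the file we actually want
        if f not in on_disk:
            plan["delete"].append(f)       # type 2: did not exist before
        elif on_disk[f] == 0:
            plan["overlay"].append(f)      # type 1: identical, safe to overlay
        else:
            plan["rename"].append(f)       # type 3: higher version, save it first
    return plan

# The Fig. 2 scenario: rebuilding A via P also emits B and C, while
# modified versions B1 and C1 (version 1) already sit on disk.
plan = plan_excrescent_files(
    generated=["A", "B", "C"], needed="A",
    on_disk={"B": 1, "C": 1})
```

In the Fig. 2 scenario, the plan marks B and C for the rename-and-restore treatment, matching the third case.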
For instance, as shown in Fig. 2, process P generates file A, file B and file C. Then file B and file C are modified and converted to B1 (i.e., file B with version 1) and C1 (i.e., file C with version 1) by processes P1 and P2, respectively. When file A is lost and needs to be rebuilt using process P, file B and file C will also be generated. As file B and file C have the same names as B1 and C1, they would automatically overlay the existing B1 and C1, respectively. However, we record the version numbers of B1 and C1, and rename B1 and C1 before rebuild if their versions are not zero. Then, when the new B and C are generated, we use B1 and C1 to overlay them, respectively.
Fig. 2. Example of how a file is affected in rebuild.
Note that in this step, we first have to query the provenance database to judge whether the generated excrescent files exist before rebuild. If they do exist, we further query whether they have higher versions, and rename those files that do. We evaluate how these steps impact the whole rebuild performance in Section IV.
E. Rebuild Execution
As we have stated above, the basic provenance rebuild scheme regenerates a file by first searching for its ancestors in the provenance chain. In most cases, the ancestors are already on the disks. However, if any ancestor does not exist (it may also be lost), we have to regenerate the ancestors first. For instance, if process P reads file A and writes file B (i.e., B depends on P, which depends on A), then when we regenerate B and find that A does not exist, we first have to regenerate A (e.g., using process P1), and then use P to regenerate B.
However, this rebuild sequence (first P1, then P) is not always correct. Sometimes a process cannot execute separately but must interact with another process to complete a rebuild task. For instance, we use gcc to compile a file called hello.c into hello.o. The provenance graph for this process is shown in Fig. 3. In our experiment, we search the provenance graph and locate a tmp file as an ancestor of P (i.e., gcc). But we find that the tmp file does not exist on the disk, so we run P1 (i.e., cc1) to generate it. However, we still cannot find the regenerated tmp file on the disk, because it is a temporary file that never exists on the disk in a stable state. Furthermore, we find that P1 depends on P (i.e., gcc), which forks P1. So in this case, we do not have to rebuild the tmp file separately using P1; we only have to run the process P, which will automatically fork P1 to generate the tmp file, which is used as an input for P before it disappears from the disk.
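The recursive case can be sketched as follows. In-memory stands-ins replace the disk and provenance store, and the gcc/cc1 temporary-file case is merely flagged rather than recursed into, per the discussion above:

```python
# Recursive rebuild sketch (Section III-E): to regenerate a file, first
# regenerate any missing ancestors. A record marked "forked" models the
# gcc/cc1 case: its producer is forked by a parent process, so it must
# not be rebuilt in isolation.

disk = {"A": "source"}
prov = {
    "B":   {"process": str.upper, "inputs": ["A"], "forked": False},
    "tmp": {"process": None, "inputs": [], "forked": True},
}

def rebuild(name):
    if name in disk:                  # ancestor already on disk
        return disk[name]
    rec = prov[name]
    if rec["forked"]:
        # gcc forks cc1 to make the tmp file: rerunning the forking
        # parent regenerates this temporary input implicitly.
        raise RuntimeError("rebuild via the forking parent process")
    inputs = [rebuild(i) for i in rec["inputs"]]  # regenerate missing ancestors first
    disk[name] = rec["process"](*inputs)
    return disk[name]
```

Rebuilding B finds A already on disk and replays its process directly; asking for the tmp file alone raises the flag, signaling that the parent process must be run instead.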
F. Recover Process
After rebuild, we have to delete the unnecessary files and recover the files with a higher version. For instance, in Fig. 2, we should delete file B and file C, and rename files B1 and C1 with the same names as file B and file C, respectively.
IV. EVALUATION
In this section, we evaluate the performance of the provenance-based rebuild scheme. First, we compare provenance-based rebuild performance with log-based and snapshot technology. Second, we evaluate how a variety of factors impact the provenance-based rebuild performance. Then, we analyze the space and time overhead imposed by provenance.
Fig. 3. Example of process interaction in rebuild.
A. Experimental Setup
We run our experiments on a Fedora Linux 2.6.23.17 operating system with Intel Pentium 4 2.66 GHz processors, 1 GB of DDR400 memory and an 80 GB hard drive.
We collect provenance using the PASS system (PASSv2) and store it in a BerkeleyDB (version 4.6.21) database. For log-based technology, we use a tool called ext3grep [14] to analyze the ext3 log file, and then find the logical address of the lost file and recover it. For snapshot technology, we employ LVM (Logical Volume Management), which utilizes copy-on-write to back up and recover data.
Since the purpose of our evaluation is to validate whether it is good enough to rebuild data using provenance, we adopt two typical data generation methods that are frequently used in everyday computing: copy and uncompress. They use the /bin/cp and /bin/tar processes to process and generate files in our experiments, respectively. Note that we do not use benchmarks (e.g., Postmark) to measure rebuild performance, since most of these benchmarks are purely designed to evaluate system I/O performance, not to rebuild real data. For instance, we never rebuild a lost/broken file using the postmark process.
B. Rebuild Performance
Fig. 4 shows the rebuild performance of the provenance-based, log-based and snapshot technologies with files of different sizes. For all these technologies, we first copy data of different sizes to an empty file in a directory using the /bin/cp process, then we delete the file, and at last we rebuild it using the different technologies. For the provenance case, the directory is where we mount a PASS volume, so the PASS system automatically collects provenance when the copy process happens. For the snapshot case, we make a snapshot using LVM before deleting the file. We can see that provenance-based rebuild has performance comparable with the snapshot technology, and significantly outperforms log-based technology.
Fig. 4. Rebuild performance for provenance-based, log-based and snapshot technologies with files of different sizes. In provenance-based rebuild, we copy files of different sizes to the PASS volume and collect provenance. Then we delete the file on the PASS volume, query the provenance of that file and regenerate it using provenance.
Fig. 5. Breakdown of rebuild performance with different numbers of files generated during rebuild.
This is because the log-based technology has to scan the log file for the address of the deleted/lost file, which consumes lots of time. Comparatively, the provenance information is stored in a database, and querying provenance from an indexed database is more efficient and much faster. The snapshot technology only needs to copy back the data from the snapshot volume, so it is also very efficient. But the drawback of snapshot technology is that the time of making a snapshot depends on the person who uses it, not on the time when the file was broken or lost. So it is not easy to recover the lost file to the most recent version using snapshot technology.
C. Factors That Impact Rebuild Performance
The rebuild time is composed of the time costs of a series of steps as shown in Fig. 1. The performance of these steps is impacted by a variety of factors as follows.
Number of Files Generated in Rebuild: The rebuild process may generate a series of files even though the number of files that really need to be rebuilt is only one or two. These excrescent generated files can raise the rebuild execution time, and also the Provenance Query and Factor Analysis times, which both typically need more time to query the provenance database. As shown in Fig. 5, we use a /bin/tar process to uncompress a package that contains different numbers of files (10, 100 and 1000, respectively). We can see that the time costs of rebuild execution, provenance query and factor analysis all grow consistently with the increase in the number of files generated.
Files Affected: As stated in Section III-D, the excrescent files generated during rebuild can be categorized into three types. They should be overlaid by identical existing files, directly deleted, or overlaid by existing files with higher versions, which should be renamed before rebuild and renamed back to their original names after rebuild.
Fig. 6. Rebuild time breakdown for different applications. For the /bin/tar process, we uncompress a .tar.gz file and generate 100 files. For the /bin/cp process, we copy 1 KB of data to each of the 100 newly created files. (a) /bin/tar. (b) /bin/cp.
Fig. 7. Breakdown of rebuild performance with different sizes of the provenance database. The number of records in the database ranges from 10 000 to 100 million.
Fig. 6 shows the rebuild time breakdown of the /bin/tar and /bin/cp processes in these three types. In the first type, the two processes generate 100 files which are completely the same as the existing 100 files. In the second type (i.e., 50-delete), 50 of these 100 files do not exist before rebuild and need to be deleted. In the third type (i.e., 50-version), 50 of these 100 files have higher versions before rebuild. As it involves querying the files with a higher version in the database and renaming them, the Factor Analysis time rises a lot in the third type. However, since the Factor Analysis time only takes up a small portion of the overall rebuild time, the overall rebuild time does not rise too much. For the tar process, the whole rebuild time in the third type exceeds the first type by 17.0% and the second type by 11.8%, respectively. For the /bin/cp process, the whole rebuild time in the third type exceeds the first type by only 10.3% and the second type by 7.1%.
Size of Provenance Database: If the provenance database is very big, querying a few provenance records from it can incur a large time overhead. This makes the Provenance Query and Factor Analysis times very long. We have measured the breakdown of rebuild time using the /bin/cp process for databases of different sizes, as shown in Fig. 7. The whole rebuild time scales linearly as the time costs of Provenance Query and Factor Analysis increase. This indicates that maintaining a small provenance database can significantly benefit the rebuild performance.
Execution Process: The rebuild time strongly depends on the execution process. A discussion of this is beyond the scope of this paper.
D. Overhead Analysis
The provenance used for rebuild has to be collected first, during the read or write process. As provenance has to be written to the disk before the data, this collection process can impact normal read/write performance. On the other hand, collecting and storing provenance incurs space overhead.
We evaluate the overhead of the /bin/tar and /bin/cp processes as shown in Fig. 8. “Ext3” represents the case without provenance; “PR” indicates that we collect provenance using the PASS system and rebuild data using our provenance-based rebuild framework. The elapsed time overhead for tar is 24.5%, and for cp it is 40.9%. The increase lies in the extra writes to record provenance. The space overhead for tar is 10.3%, and for cp it is 55.3%.
To give a better understanding of why some of the overheads are so high, we measured the space and elapsed time overheads of the cp process for rebuilding files of different sizes, as shown in Tables I and II, respectively. Table I shows that the size of provenance is constant. This is because provenance only records the relationship between the cp process and the files, not the contents of the files. As the size of provenance does not change, the space overhead decreases as the size of the rebuilt file increases. On the other hand, the time to write this provenance is almost unaltered. But as the cp process costs more time as the size of the copied data increases, the provenance time overhead becomes smaller and smaller (see Table II). We can see that the space overhead can even be neglected and the time overhead can also be very small when the rebuilt file becomes very big. This indicates that provenance-based rebuild can be very useful for big data rebuild.
V. CHALLENGES
Though we have presented a basic framework for provenance-based rebuild and evaluated its performance, its overhead and a variety of factors that can impact its performance, there still exist a number of challenges that we consider can define whether it is good enough and worth adopting for practical use.
Indefinite Process: Theoretically, we can rebuild any of the lost files if the execution environment is completely the same as when the file was first generated. However, we cannot always acquire the same result if the execution results of processes are indefinite, such as for a process with a random seed.
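To illustrate, a process that draws random numbers rebuilds to the same output only if its randomness is replayed. Capturing the seed as part of the provenance is our illustration of one possible mitigation, not a mechanism proposed in this paper:

```python
import random

# Illustration of the indefinite-process problem: output depends on a
# random seed, so replaying the process reproduces the file only when
# the seed was captured as provenance at generation time (recording
# seeds is our illustration, not a mechanism from the paper).

def generate(seed):
    rng = random.Random(seed)
    return [rng.randint(0, 9) for _ in range(5)]

seed = 1234
provenance = {"process": "generate", "seed": seed}   # seed captured
original = generate(seed)
rebuilt = generate(provenance["seed"])               # deterministic replay
# generate(None) would seed from system entropy: replay without the
# recorded seed is not guaranteed to match the original output.
```

The same caveat applies to any other source of nondeterminism (time, thread interleaving, network responses): whatever the process consumed must be recorded, or the rebuilt file may differ from the lost one.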
Fig. 8. Provenance time and space overhead for tar and cp workloads. For the tar process, we uncompress a .tar.gz file and generate 100 files. For the cp process, we copy 1 KB of data to each of the 100 newly created files. (a) Time overhead. (b) Space overhead.
TABLE I
SPACE OVERHEADS (IN MB) FOR PR (PROVENANCE-BASED REBUILD) WHEN USING THE CP PROCESS TO REBUILD FILES OF DIFFERENT SIZES
TABLE II
ELAPSED TIME OVERHEADS (IN SECONDS) FOR PR WHEN USING THE CP PROCESS TO REBUILD FILES OF DIFFERENT SIZES
Size of Provenance: Though the provenance we have collected so far enables rebuild most of the time, we may need to collect more information, such as operating system information, the version of the Linux kernel, etc., in some more complex cases. On the other hand, unoptimized provenance can take up substantial storage space. How to collect provenance accurately enough to enable rebuild while retaining a relatively small amount of provenance is a big challenge. We are exploring how to reduce the provenance size by exploiting the characteristics of provenance [15], [16].
Provenance or RAID?: Provenance-based rebuild has some important advantages compared to the traditional ECC scheme in RAID-structured systems, such as per-file rebuild and parallel rebuild. Additionally, it can rebuild the data on any number of disks in a RAID system, even when all these disks have crashed, as long as the provenance of the data on those disks has been saved in a separate storage pool. However, we still have to use classical techniques such as ECC or backup technologies to secure the provenance data itself.
Rebuild in Network-Attached Environments: There already exists a provenance-aware NFS architecture [4] which generates data and collects provenance at the client and stores both of them on the server. There are two choices for rebuilding data using provenance in this case. The first is rebuilding data at the client, because the lost/damaged data was previously generated at the client. However, all the provenance of the file must be transferred to the client, which can incur network overhead. The second is rebuilding data on the server. However, the execution environment on the server must then be similar or identical to that of the client where the data was first generated.
VI. APPLICATION USE CASES
To give a better understanding of where provenance-based
rebuild can be used, we list a series of typical application use
cases as follows.
Validate Experimental Data: Scientists usually want to reproduce experimental results to see whether the whole experiment process is normal and correct, even when they have already obtained some experimental results. Scientists can use the provenance of the experimental data to automatically and conveniently replay the experiment.
Enhance Backup: Backup technology has been widely used in enterprise systems and cloud infrastructures. However, we consider provenance better suited for the following cases. 1) Backup of big data is expensive, since storing a number of copies of these data consumes more storage space than only storing the common workflows and input data (i.e., provenance) that generate them. 2) Backup can provide a series of historical versions of a document for a user to choose from for data recovery; provenance further tells the user the relationship between these versions and which one is the exact one for recovery.
Play Video: Many video websites provide different definitions of the same video for users with different requirements. For instance, a user may only want the video to play fluently, even if the definition is not high. A website usually keeps several copies of the video in each definition in case of crashes. A better solution may be keeping only one copy of the video in each definition for online play, and using provenance to record how to convert the original video into the different definitions. This usually saves at least 50% of the storage space.
High Performance Computing: It is a common case in HPC areas that scientists use a small amount of input data and some transformation rules to generate enormous data for analysis. It is not cost-effective to keep all these data. Recording only the input data and the transformation rules (i.e., provenance) can significantly reduce storage overhead.
VII. CONCLUSION
In this paper, we have presented our experience in designing and implementing a provenance-based rebuild framework. We have discussed a wide variety of issues that we had to solve when implementing this kind of framework. The experimental results show that provenance-based rebuild significantly outperforms log-based technology for per-file rebuild, with reasonable space and time overhead.
ACKNOWLEDGMENT
This work was supported in part by the National Basic Research 973 Program of China under Grant 2011CB302301, by the NSFC under Grants 61025008, 61232004 and 61173043, and by the 863 Program under Grant 2012AA012403. The authors thank the anonymous reviewers for their comments on this paper.
REFERENCES
[1] K.-K. Muniswamy-Reddy, D. A. Holland, U. Braun, and M. I. Seltzer,
“Provenance-aware storage systems,” in Proc. USENIX’06, 2006.
[2] S. T. King and P. M. Chen, “Backtracking intrusions,” in Proc.
SOSP’03, Bolton Landing, NY, USA, 2003.
[3] S. Shah, C. A. N. Soules, G. R. Ganger, and B. D. Noble, “Using prove-
nance to aid in personal le search,” in Proc. USENIX’07, 2007.
[4] K.-K. Muniswamy-Reddy, “Foundations for provenance-aware sys-
tems,” Ph.D. thesis, Harvard Univ., Cambridge, MA, 2010.
[5] B. A. Madden, I. F. Adams, M. W. Storer, E. L. Miller, D. D. E. Long,
andT.M.Kroeger,“Provenance based rebuild: Using data provenance
to improve reliability,” Tech. Rep. UCSC-SSRC-11-04, 2011.
[6] K.-K. Muniswamy-Reddy, P. Macko, and M. I. Seltzer, “Provenance
for the cloud,” in Proc. FAST’10, 2010.
[7] S. Wu, H. Jiang, D. Feng, L. Tian, and B. Mao, “WorkOut: I/O work-
load outsourcing for boosting RAID reconstruction performance,” in
Proc. 7th USENIX Conf. File and Storage Technol., 2009.
[8] M. Rosenblum and J. K. Ousterhout, “The design and implementation
of a log-structured file system," ACM Trans. Comput. Syst., vol. 10, pp. 26–52, 1992.
[9] S. Y. Ko, I. Hoque, B. Cho, and I. Gupta, "On availability of intermediate data in cloud computations," in Proc. HotOS'09, 2009.
[10] D. Yuan, Y. Yang, X. Liu, and J. Chen, "A cost-effective strategy for intermediate data storage in scientific cloud workflow systems," in Proc. 24th IEEE Int. Parallel Distrib. Process. Symp., 2010.
[11] I. Adams, D. D. E. Long, E. L. Miller, S. Pasupathy, and M. W. Storer,
"Maximizing efficiency by trading storage for computation," in Proc. Workshop on Hot Topics in Cloud Comput., 2009.
[12] Y. Xie, D. Feng, Z. Tan, L. Chen, and J. Zhou, “Experiences building a
provenance-based reconstruction system,” in Proc. Storage Syst., Hard
Disk and Solid State Technol. Summit in Conjunction With the APMRC
Conf., 2012.
[13] [Online]. Available: http://www.eecs.harvard.edu/syrah/pql
[14] [Online]. Available: http://code.google.com/p/ext3grep/
[15] Y. Xie, K.-K. Muniswamy-Reddy, D. Feng, L. Yan, D. D. E. Long,
Z. Tan, and L. Chen, "A hybrid approach for efficient provenance storage," in Proc. 21st ACM Int. Conf. Inf. Knowledge Manage.
(CIKM), 2012.
[16] Y. Xie, K.-K. Muniswamy-Reddy, D. D. E. Long, A. Amer, D. Feng,
and Z. Tan, “Compressing provenance graphs,” in Proc. 3rd USENIX
Workshop on the Theory and Practice of Provenance, 2011.