IEEE TRANSACTIONS ON MAGNETICS, VOL. 49, NO. 6, JUNE 2013 2805
Design and Evaluation of a Provenance-Based Rebuild Framework
Yulai Xie, Dan Feng, Zhipeng Tan, and Junzhe Zhou
Wuhan National Lab for Optoelectronics, School of Computers, Huazhong University of Science and Technology, Wuhan
430074, China
Previous storage reliability solutions mainly include log files, snapshots, backups or ECC-based schemes. These technologies focus on whole-disk or whole-system reliability rather than on single data objects. However, it is not efficient to rebuild a file by recovering a whole system using the technologies above. In this paper, we propose to use provenance, the origin or history of objects, to rebuild damaged or lost files. Provenance-based rebuild can exactly reconstruct a lost or damaged file by backtracking its generation process. The evaluation results show that provenance-based rebuild performs significantly better than log-based technology, with reasonable time and space overhead.
Index Terms—Data rebuild, provenance, reliability, storage system.
I. INTRODUCTION
STORAGE reliability has long been an important concern on both desktop and server platforms. A series of solutions already exists, such as log-based technology, snapshots, backups and ECC-based schemes. These technologies ensure whole-system security or storage reliability by recording disk activities, copying original data or computing integrity-check data in case of a crash.
On the other hand, people are more and more concerned about rebuilding a single data object (e.g., an important damaged file). Compared to recovering a whole system, it is also a very common case that a user wants to rebuild a file or recover the content of a vital file. Provenance, which exactly describes the origin or lineage of a data object, shows the details of the creation of a data object and reveals the dependencies between different objects. Provenance can be used to debug programs [1], disclose the root cause of a system attack [2] and speed up desktop search [3]. Provenance can also be used as an important clue to regenerate experimental data [4] and to rebuild an individual file by replaying its generation process.
Some studies [5] have already pointed out the advantages of using provenance to rebuild lost/broken files compared to traditional storage reliability schemes, such as per-file rebuild and parallel rebuild. In addition, they proposed to build a framework that uses the cloud as an infrastructure providing rich and dynamic resources to regenerate data, and analyzed the issues that need to be considered when building this kind of framework.
However, we still need to address a series of important issues
before provenance-based rebuild can be really applied in the real
world.
1) How can we find the proper provenance for rebuild?
2) Which factors can affect provenance-based rebuild performance?
3) Can the performance of provenance-based rebuild beat that of traditional reliability schemes?
Manuscript received December 11, 2012; accepted February 28, 2013. Date
of current version May 30, 2013. Corresponding author: D. Feng (e-mail:
dfeng@hust.edu.cn).
Digital Object Identifier 10.1109/TMAG.2013.2251460
To address these problems, we present the design, implementation and evaluation of a provenance-based rebuild framework. The framework is built with the following two design principles.
Transparency: Provenance-based rebuild is invoked to execute and regenerate a lost file during a read or write.
Modularity: The provenance-based rebuild framework can be easily incorporated into existing provenance collection systems, such as PASS [1].
The framework can accurately invoke a rebuild process, retrieve the provenance of lost files, analyze the possible cases that may affect the rebuild, and execute the rebuild process in a user-space system. The evaluation results on this framework show that provenance-based rebuild significantly outperforms log-based technology. Though a series of factors may affect rebuild performance, we demonstrate that they do not affect it much.
The rest of the paper is organized as follows. We summarize background and related work in Section II and elaborate the design and implementation of our rebuild framework in Section III. In Section IV, we evaluate our framework. In Section V, we discuss some important research challenges. In Section VI, we list a series of typical application use cases for provenance-based rebuild. In Section VII, we conclude the paper.
II. BACKGROUND AND RELATED WORK
In this section, we first give an overview of provenance-aware storage systems. Second, we present related work on data reconstruction approaches and then motivate our research.
A. Provenance-Aware Storage System
Generally, there exist three kinds of metadata: common metadata such as file ID or name, content-based metadata and provenance. Provenance is commonly in the form of causality-based graphs that represent the dependency relationships between different data objects. For example, if A depends on B, then B should appear in the provenance of A.
The Provenance-Aware Storage System (i.e., PASS [1]) observes traditional READ/WRITE system calls and accordingly collects provenance records. Any application system call such as READ or WRITE will be recorded in the provenance graphs. For example, the invocation of a READ system call to a file
by a process will construct a provenance edge that indicates the process depends on the file. There are three kinds of nodes in the PASS provenance graph: file, process and pipe. Each node has related attributes that describe its identity information. For a file, PASS records its file ID and name. For a process, PASS records its PID, name, command line arguments, environment variables, a reference to the file being executed, etc. PASS can use a local file system, network-attached storage or the cloud as its storage backend [6].
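As a concrete illustration, the READ/WRITE-to-edge mapping described above can be sketched with a small in-memory graph. The field names and helper functions here are our own simplification, not PASS's actual record schema:

```python
# Minimal sketch of PASS-style provenance records (illustrative field
# names, not the actual PASS schema). Each node is a file, process or
# pipe; a READ by a process adds an edge "process depends on file",
# and a WRITE adds an edge "file depends on process".

provenance = {"nodes": {}, "edges": []}  # edges: (dependent, ancestor)

def add_node(pnode, kind, **attrs):
    provenance["nodes"][pnode] = {"kind": kind, **attrs}

def record_read(proc, file):
    # READ system call: the process now depends on the file's contents.
    provenance["edges"].append((proc, file))

def record_write(proc, file):
    # WRITE system call: the written file depends on the process.
    provenance["edges"].append((file, proc))

add_node("A", "file", name="input.txt")
add_node("P", "process", name="sort", argv=["sort", "input.txt"])
add_node("B", "file", name="output.txt")
record_read("P", "A")   # P reads A  -> P depends on A
record_write("P", "B")  # P writes B -> B depends on A via P

def ancestors(pnode):
    """All nodes reachable upstream of pnode in the dependency graph."""
    found = set()
    stack = [pnode]
    while stack:
        n = stack.pop()
        for dep, anc in provenance["edges"]:
            if dep == n and anc not in found:
                found.add(anc)
                stack.append(anc)
    return found
```

In this toy graph, the ancestors of B are exactly the process P and the file A, which is the dependency chain a rebuild would replay.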
B. Existing Reconstruction Approaches
Currently, the prevalent data reconstruction approach is the ECC-based scheme, which has been widely used in large scale storage systems such as RAID5-structured storage systems [7]. The advantage of this scheme is its highly efficient reconstruction performance, independent of any software installed on the disks. Another widely used technology is the log-structured file system [8], which uses log files to record disk activities and can rebuild the system in case of a crash. Similarly, snapshot technology ensures system security by employing techniques such as copy-on-write. These approaches focus more on whole-system security, but neglect to secure a single file at a finer granularity.
Some recent work [9], [10] has proposed to make maximum use of the intermediate data generated in scientific workflows to reduce storage overhead and improve recomputation times. There are also works [11] looking for the tradeoff between storage and computation by using provenance. However, these works only focus on using provenance and recomputation to improve whole-system efficiency, not on the design and implementation of a provenance-based rebuild scheme.
Madden et al. [5] proposed to use provenance generated by the PASS system to rebuild lost files. They focus on elaborating the advantages of provenance-based rebuild compared to the ECC scheme and on employing the cloud as the infrastructure to process rebuild tasks. For instance, provenance-based rebuild can provide a finer-granularity, parallel and priority-based method than the traditional ECC scheme.
In prior work [12], we outlined the basic rebuild framework
and proposed to utilize active storage technology as an optional
solution to be employed by the cloud to improve rebuild performance. This work makes a more comprehensive evaluation by
explicitly comparing the performance of provenance-based re-
build with other schemes and analyzing the various factors that
can impact the rebuild performance.
III. DESIGN AND IMPLEMENTATION
In this section, we first state the basic framework of provenance-based rebuild. Then we elaborate the design issues
of each component of the framework.
A. Provenance-Based Rebuild Framework
Fig. 1 shows the framework of Provenance-based Rebuild (PR). As depicted in Fig. 1, PR consists of five modules, namely, the Provenance Collection, the Provenance Query, the Factor Analysis, the Rebuild Execution and the Recover Process.
Provenance Collection gets and analyzes READ/WRITE commands and translates them into provenance records, then
Fig. 1. Architecture of PR.
writes both the data and the provenance records to the disks. For a READ command, if Provenance Collection cannot get the data from disk (i.e., the data is lost), it will invoke the provenance rebuild module for data reconstruction. In this case, Provenance Collection communicates with the Provenance Query module in user space and passes it information such as the name of the file to be rebuilt. Provenance Query is responsible for querying the provenance of the lost file from the provenance database on disk. Usually, the provenance of a file includes the process that generated it and the input parameters needed during the process execution. Factor Analysis is responsible for analyzing the cases in which data reconstruction may affect normal files. Rebuild Execution schedules the queried process for execution, using the queried input parameters as input, to generate the lost file. At last, Recover Process recovers the affected files to their normal state. In PASS, the provenance is usually stored in a database such as BerkeleyDB. The query engine is a series of tools such as PQL [13].
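The interplay of these modules on a failed READ can be sketched as a toy read path. Dicts stand in for the disk and the provenance database; the module boundaries follow Fig. 1, but every name and body here is illustrative rather than the actual implementation:

```python
# Skeleton of the PR pipeline on a failed READ. A provenance record
# stores the generating process (here a plain function) and its input
# files; module names follow Fig. 1 (illustrative bodies only).

disk = {"input.txt": "data"}                      # files currently on disk
prov_db = {"output.txt": {"process": lambda d: d.upper(),
                          "inputs": ["input.txt"]}}

def read(name):
    if name in disk:
        return disk[name]                         # normal READ path
    return rebuild(name)                          # data lost: invoke PR

def rebuild(name):
    record = prov_db[name]                        # Provenance Query
    missing = [f for f in record["inputs"]
               if f not in disk]                  # Factor Analysis (simplified)
    assert not missing, "ancestors must be rebuilt first"
    inputs = [disk[f] for f in record["inputs"]]
    disk[name] = record["process"](*inputs)       # Rebuild Execution
    return disk[name]                             # nothing for Recover Process here
```

A call such as `read("output.txt")` then transparently regenerates the lost file from its recorded process and inputs, which is the behavior the Transparency principle asks for.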
B. Provenance Collection
We have to collect the necessary provenance information for data rebuild. For example, in order to make a process re-executable, we have to collect the complete execution environment of the process. This typically includes the input parameters of the process, environment variables, the process name/identifiers, etc. For a file to be re-processed, this usually includes the file name, inode number, etc. In addition, we have to collect the relationship between a file and the process that processes it. For instance, “process P reads file A” will invoke the collection of a provenance record indicating that P depends on A.
Furthermore, we have to record the offset and the exact bytes written to a file in some cases. For example, if a user inputs some characters into a text file, then once this file is lost or broken, the success of recreating this text file depends solely on whether we have recorded the contents written to it. Generally speaking, if the creation of a file depends only on external input or messages from the network, then we have to collect the offset and the bytes written to the file as provenance information.
Note that the provenance collected should be written to disk before the data; otherwise, the data stored on the disks may have no provenance in case of a system crash. But this imposes a time overhead on normal reads and writes, which we evaluate in Section IV.
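A minimal sketch of this collection and ordering discipline follows, with in-memory lists standing in for the provenance log and the disk (this is not PASS's real on-disk format; the record fields are our own):

```python
# Sketch of the write path in Section III-B: when a file's content comes
# only from external input, the offset and the bytes themselves are kept
# as provenance, and the provenance record is persisted *before* the
# data so a crash never leaves data without provenance.

prov_log = []   # persisted first (stand-in for the provenance database)
disk = {}       # persisted second (stand-in for the data on disk)

def provenance_aware_write(fname, offset, data, source="external"):
    record = {"file": fname, "offset": offset, "source": source}
    if source == "external":
        record["bytes"] = data        # content not reproducible: keep bytes
    prov_log.append(record)           # step 1: provenance hits the disk
    buf = bytearray(disk.get(fname, b""))
    buf[offset:offset + len(data)] = data
    disk[fname] = bytes(buf)          # step 2: only then the data

provenance_aware_write("notes.txt", 0, b"hello")

def rebuild_from_log(fname):
    # Replay the recorded writes to reconstruct the lost file.
    buf = bytearray()
    for rec in prov_log:
        if rec["file"] == fname and "bytes" in rec:
            off = rec["offset"]
            buf[off:off + len(rec["bytes"])] = rec["bytes"]
    return bytes(buf)
```

Replaying the log reproduces the file byte-for-byte, which is exactly why externally sourced writes must carry their contents in the provenance.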
C. Provenance Queries
Efficient provenance query is a must for high performance rebuild. In the current implementation, we store the collected provenance in a BerkeleyDB database. Information describing the basic identity of a process or file, such as environment variables or file names, is stored in an identityDB database; information describing the relationship between processes and files is stored in an ancestorDB database. We assign each process or file a unique number (called a pnode number). For more efficient queries, we build some indices, such as NameDB, which stores the mapping between the name of a file or process and its pnode number. If process P reads file A, then writes the information to file B, the queries for the provenance of file B typically proceed as follows.
1) Query NameDB to find the pnode number of file B, using the file name of B as the keyword.
2) Query ancestorDB to find the process P that file B depends on, using the pnode number of file B as the keyword.
3) Query identityDB to find the identity information of process P, such as input parameters and environment variables.
4) Query ancestorDB to find the file A that process P depends on.
5) Execute the process P to generate file B.
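The five steps above can be traced with dict stand-ins for NameDB, ancestorDB and identityDB. The keys and record layouts are illustrative only, not the actual BerkeleyDB schemas:

```python
# Dict stand-ins for the three databases in Section III-C.
name_db = {"B": 3, "P": 2, "A": 1}            # name -> pnode number
ancestor_db = {3: [2], 2: [1]}                # pnode -> ancestor pnodes
identity_db = {2: {"kind": "process", "name": "P",
                   "argv": ["P", "A"], "env": {"LANG": "C"}},
               1: {"kind": "file", "name": "A"}}

def provenance_of(file_name):
    pnode = name_db[file_name]                    # step 1: NameDB lookup
    (proc,) = ancestor_db[pnode]                  # step 2: generating process
    identity = identity_db[proc]                  # step 3: process identity
    inputs = [identity_db[a]["name"]
              for a in ancestor_db.get(proc, [])]  # step 4: process inputs
    return identity, inputs                        # step 5: caller runs the process

identity, inputs = provenance_of("B")
```

With the identity record (command line, environment) and the input list in hand, the Rebuild Execution module has everything needed to re-run P and regenerate B.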
D. Factor Analysis
The rebuild execution, though it can rebuild a file after acquiring its provenance (i.e., the process that generates it and the input parameters), can also affect other normal files. Typically, if a rebuild process generates multiple files at a time, but only one of them is the lost file that needs to be rebuilt, then the other generated files are excrescent. The excrescent files generated during rebuild can be categorized into the following three types.
1) The files exist before rebuild and are completely the same as the generated files. Thus, the generated files will automatically overlay the existing files.
2) The files do not exist before rebuild. So, the generated files should be deleted.
3) The files exist with a higher version before rebuild. So, the existing files with a higher version should be renamed before rebuild, and overlay the generated files after rebuild.
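The three types above can be sketched as a pre-rebuild classification pass. This is a simplified model in which `on_disk` maps a file name to its version number, version 0 means the file is unmodified since generation, and the function name is ours:

```python
# Classify the excrescent files a rebuild will emit (Section III-D).
# generated: files the rebuild process will produce
# needed:    the one lost file we actually want back
# on_disk:   existing file -> version (0 = unmodified since generation)

def plan_excrescent_files(generated, needed, on_disk):
    plan = {"overlay": [], "delete": [], "rename": []}
    for f in generated:
        if f == needed:
            continue                       # the file we actually want
        if f not in on_disk:
            plan["delete"].append(f)       # type 2: did not exist before
        elif on_disk[f] == 0:
            plan["overlay"].append(f)      # type 1: identical, safe to overlay
        else:
            plan["rename"].append(f)       # type 3: higher version, save it first
    return plan

# The Fig. 2 scenario: rebuilding A via P also emits B and C, while
# modified versions B1 and C1 (version 1) already sit on disk.
plan = plan_excrescent_files(
    generated=["A", "B", "C"], needed="A",
    on_disk={"B": 1, "C": 1})
```

In the Fig. 2 scenario, the plan marks B and C for the rename-and-restore treatment, matching the third case.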
For instance, as shown in Fig. 2, process P generates file A, file B and file C. Then file B and file C are modified and converted to B1 (i.e., file B with version 1) and C1 (i.e., file C with version 1) by processes P1 and P2, respectively. When file A is lost and needs to be rebuilt using process P, file B and file C will also be generated. As file B and file C have the same names as B1 and C1, they would automatically overlay the existing B1 and C1, respectively. However, we record the version numbers of B1 and C1, and rename B1 and C1 before rebuild if their versions are not zero. Then, when the new B and C are generated, we use B1 and C1 to overlay them, respectively.
Fig. 2. Example of how a file is affected in rebuild.
Note that in this step, we first have to query the provenance database to judge whether the generated excrescent files exist before rebuild. If they do exist, we further query whether they have higher versions, and rename those files that do. We evaluate how these steps impact the whole rebuild performance in Section IV.
E. Rebuild Execution
As we have stated above, the basic provenance rebuild scheme regenerates a file by first searching for its ancestors in the provenance chain. In most cases, the ancestors are already on the disks. However, if any ancestor does not exist (it may also be lost), we have to regenerate the ancestors first. For instance, if process P reads file A and writes file B (i.e., B depends on P, which depends on A), then when we regenerate B and find that A does not exist, we first have to regenerate A (e.g., using process P1), and then use P to regenerate B.
However, this rebuild sequence (first P1, then P) is not always correct. Sometimes a process cannot execute separately but must interact with another process to complete a rebuild task. For instance, we use gcc to compile a file called hello.c into hello.o. The provenance graph for this process is shown in Fig. 3. In our experiment, we search the provenance graph and locate a tmp file as an ancestor of P (i.e., gcc). But we find that the tmp file does not exist on the disk, so we run P1 (i.e., cc1) to generate it. However, we still cannot find the regenerated tmp file on the disk, because it is a temporary file that never exists on the disk in a stable state. Furthermore, we find that P1 depends on P (i.e., gcc), which forks P1. So in this case, we do not have to rebuild the tmp file separately using P1; we only have to run the process P, which will automatically fork P1 to generate the tmp file, which is used as an input for P before it disappears from the disk.
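The recursive case can be sketched as follows. In-memory stands-ins replace the disk and provenance store, and the gcc/cc1 temporary-file case is merely flagged rather than recursed into, per the discussion above:

```python
# Recursive rebuild sketch (Section III-E): to regenerate a file, first
# regenerate any missing ancestors. A record marked "forked" models the
# gcc/cc1 case: its producer is forked by a parent process, so it must
# not be rebuilt in isolation.

disk = {"A": "source"}
prov = {
    "B":   {"process": str.upper, "inputs": ["A"], "forked": False},
    "tmp": {"process": None, "inputs": [], "forked": True},
}

def rebuild(name):
    if name in disk:                  # ancestor already on disk
        return disk[name]
    rec = prov[name]
    if rec["forked"]:
        # gcc forks cc1 to make the tmp file: rerunning the forking
        # parent regenerates this temporary input implicitly.
        raise RuntimeError("rebuild via the forking parent process")
    inputs = [rebuild(i) for i in rec["inputs"]]  # regenerate missing ancestors first
    disk[name] = rec["process"](*inputs)
    return disk[name]
```

Rebuilding B finds A already on disk and replays its process directly; asking for the tmp file alone raises the flag, signaling that the parent process must be run instead.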
F. Recover Process
After rebuild, we have to delete the unnecessary files and recover the files with a higher version. For instance, in Fig. 2, we should delete file B and file C, and rename files B1 and C1 with the same names as file B and file C, respectively.
IV. EVALUATION
In this section, we evaluate the performance of the provenance-based rebuild scheme. First, we compare provenance-based rebuild performance with log-based and snapshot technology. Second, we evaluate how a variety of factors impact the provenance-based rebuild performance. Then, we analyze the space and time overhead imposed by provenance.
Fig. 3. Example of process interaction in rebuild.
A. Experimental Setup
We run our experiments on a Fedora Linux 2.6.23.17 operating system with Intel Pentium 4 2.66 GHz processors, 1 GB of DDR400 memory and an 80 GB hard drive.
We collect provenance using the PASS system (PASSv2) and store it in a BerkeleyDB (version 4.6.21) database. For log-based technology, we use a tool called ext3grep [14] to analyze the ext3 log file, and then find the logical address of the lost file and recover it. For snapshot technology, we employ LVM (Logical Volume Management), which utilizes copy-on-write to back up and recover data.
Since the purpose of our evaluation is to validate whether it is good enough to rebuild data using provenance, we adopt two typical data generation methods that are frequently used in everyday computing: copy and uncompress. They use the /bin/cp and /bin/tar processes to process and generate files in our experiments, respectively. Note that we do not use benchmarks (e.g., Postmark) to measure rebuild performance, since most of these benchmarks are purely designed to evaluate system I/O performance, not to rebuild real data. For instance, we never rebuild a lost/broken file using the postmark process.
B. Rebuild Performance
Fig. 4 shows the rebuild performance of the provenance-based, log-based and snapshot technologies with files of different sizes. For all these technologies, we first copy data of different sizes to an empty file in a directory using the /bin/cp process, then we delete the file, and at last we rebuild it using the different technologies. For the provenance case, the directory is where we mount a PASS volume, so the PASS system automatically collects provenance when the copy process happens. For the snapshot case, we make a snapshot using LVM before deleting the file. We can see that provenance-based rebuild has performance comparable with the snapshot technology, and significantly outperforms log-based technology.
Fig. 4. Rebuild performance for provenance-based, log-based and snapshot technologies with files of different sizes. In provenance-based rebuild, we copy files of different sizes to the PASS volume and collect provenance. Then we delete the file on the PASS volume, query the provenance of that file and regenerate it using provenance.
Fig. 5. Breakdown of rebuild performance with different numbers of files generated during rebuild.
This is because the log-based technology has to scan the log file for the address of the deleted/lost file, which consumes lots of time. Comparatively, the provenance information is stored in a database, and querying provenance from an indexed database is more efficient and much faster. The snapshot technology only needs to copy back the data from the snapshot volume, so it is also very efficient. But the drawback of snapshot technology is that the time of making a snapshot depends on the person who uses it, not on the time when the file was broken or lost. So it is not easy to recover the lost file to the most recent version using snapshot technology.
C. Factors That Impact Rebuild Performance
The rebuild time is composed of the time costs of a series of steps as shown in Fig. 1. The performance of these steps is impacted by a variety of factors as follows.
Number of Files Generated in Rebuild: The rebuild process may generate a series of files even though the number of files that really need to be rebuilt is only one or two. These excrescent generated files can raise the rebuild execution time, and also the Provenance Query and Factor Analysis times, which both typically need more time to query the provenance database. As shown in Fig. 5, we use a /bin/tar process to uncompress a package that contains different numbers of files (10, 100 and 1000, respectively). We can see that the time costs of rebuild execution, provenance query and factor analysis all grow consistently with the increase in the number of files generated.
Files Affected: As stated in Section III-D, the excrescent files generated during rebuild can be categorized into three types. They should be overlaid by identical existing files, directly deleted, or overlaid by existing files with higher versions, which should be renamed before rebuild and renamed back to their original names after rebuild.
Fig. 6. Rebuild time breakdown for different applications. For the /bin/tar process, we uncompress a .tar.gz file and generate 100 files. For the /bin/cp process, we copy 1 KB of data to each of the 100 newly created files. (a) /bin/tar. (b) /bin/cp.
Fig. 7. Breakdown of rebuild performance with different sizes of the provenance database. The number of records in the database ranges from 10 000 to 100 million.
Fig. 6 shows the rebuild time breakdown of the /bin/tar and /bin/cp processes in these three types. In the first type, the two processes generate 100 files which are completely the same as the existing 100 files. In the second type (i.e., 50-delete), 50 of these 100 files do not exist before rebuild and need to be deleted. In the third type (i.e., 50-version), 50 of these 100 files have higher versions before rebuild. As it involves querying the files with a higher version in the database and renaming them, the Factor Analysis time rises a lot in the third type. However, since the Factor Analysis time only takes up a small portion of the overall rebuild time, the overall rebuild time does not rise too much. For the tar process, the whole rebuild time in the third type exceeds the first type by 17.0% and the second type by 11.8%, respectively. For the /bin/cp process, the whole rebuild time in the third type exceeds the first type by only 10.3% and the second type by 7.1%.
Size of Provenance Database: If the provenance database is very big, querying a few provenance records from it can incur a large time overhead. This makes the Provenance Query and Factor Analysis times very long. We have measured the breakdown of rebuild time using the /bin/cp process for databases of different sizes, as shown in Fig. 7. The whole rebuild time scales linearly as the time costs of Provenance Query and Factor Analysis increase. This indicates that maintaining a small provenance database can significantly benefit the rebuild performance.
Execution Process: The rebuild time strongly depends on the execution process. A discussion of this is beyond the scope of this paper.
D. Overhead Analysis
The provenance used for rebuild has to be collected first, during the read or write process. As provenance has to be written to the disk before the data, this collection process can impact normal read/write performance. On the other hand, collecting and storing provenance incurs space overhead.
We evaluate the overhead of the /bin/tar and /bin/cp processes as shown in Fig. 8. “Ext3” represents the case without provenance; “PR” indicates that we collect provenance using the PASS system and rebuild data using our provenance-based rebuild framework. The elapsed time overhead for tar is 24.5%, and for cp it is 40.9%. The increase lies in the extra writes to record provenance. The space overhead for tar is 10.3%, and for cp it is 55.3%.
To give a better understanding of why some of the overheads are so high, we measured the space and elapsed time overheads of the cp process for rebuilding files of different sizes, as shown in Tables I and II, respectively. Table I shows that the size of provenance is constant. This is because provenance only records the relationship between the cp process and the files, not the contents of the files. As the size of provenance does not change, the space overhead decreases as the size of the rebuilt file increases. On the other hand, the time to write this provenance is almost unaltered. But as the cp process costs more time as the size of the copied data increases, the provenance time overhead becomes smaller and smaller (see Table II). We can see that the space overhead can even be neglected and the time overhead can also be very small when the rebuilt file becomes very big. This indicates that provenance-based rebuild can be very useful for big data rebuild.
V. CHALLENGES
Though we have presented a basic framework for provenance-based rebuild and evaluated its performance, its overhead and a variety of factors that can impact its performance, there still exist a number of challenges that we consider can define whether it is good enough and worth adopting for practical use.
Indefinite Process: Theoretically, we can rebuild any of the lost files if the execution environment is completely the same as when the file was first generated. However, we cannot always acquire the same result if the execution results of processes are indefinite, such as for a process with a random seed.
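To illustrate, a process that draws random numbers rebuilds to the same output only if its randomness is replayed. Capturing the seed as part of the provenance is our illustration of one possible mitigation, not a mechanism proposed in this paper:

```python
import random

# Illustration of the indefinite-process problem: output depends on a
# random seed, so replaying the process reproduces the file only when
# the seed was captured as provenance at generation time (recording
# seeds is our illustration, not a mechanism from the paper).

def generate(seed):
    rng = random.Random(seed)
    return [rng.randint(0, 9) for _ in range(5)]

seed = 1234
provenance = {"process": "generate", "seed": seed}   # seed captured
original = generate(seed)
rebuilt = generate(provenance["seed"])               # deterministic replay
# generate(None) would seed from system entropy: replay without the
# recorded seed is not guaranteed to match the original output.
```

The same caveat applies to any other source of nondeterminism (time, thread interleaving, network responses): whatever the process consumed must be recorded, or the rebuilt file may differ from the lost one.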
Fig. 8. Provenance time and space overhead for tar and cp workloads. For the tar process, we uncompress a .tar.gz file and generate 100 files. For the cp process, we copy 1 KB of data to each of the 100 newly created files. (a) Time overhead. (b) Space overhead.
TABLE I
SPACE OVERHEADS (IN MB) FOR PR (PROVENANCE-BASED REBUILD) WHEN USING THE CP PROCESS TO REBUILD FILES OF DIFFERENT SIZES
TABLE II
ELAPSED TIME OVERHEADS (IN SECONDS) FOR PR WHEN USING THE CP PROCESS TO REBUILD FILES OF DIFFERENT SIZES
Size of Provenance: Though the provenance we have collected so far enables rebuild most of the time, we may need to collect more information, such as operating system information, the version of the Linux kernel, etc., in some more complex cases. On the other hand, unoptimized provenance can take up substantial storage space. How to collect provenance accurately enough to enable rebuild while retaining a relatively small amount of provenance is a big challenge. We are exploring how to reduce the provenance size by exploiting the characteristics of provenance [15], [16].
Provenance or RAID?: Provenance-based rebuild has some important advantages compared to the traditional ECC scheme in RAID-structured systems, such as per-file rebuild and parallel rebuild. Additionally, it can rebuild the data on any number of disks in a RAID system, even when all these disks have crashed, as long as the provenance of the data on those disks has been saved in a separate storage pool. However, we still have to use classical techniques such as ECC or backup technologies to secure the provenance data itself.
Rebuild in Network-Attached Environments: There already exists a provenance-aware NFS architecture [4] which generates data and collects provenance at the client and stores both of them on the server. There are two choices for rebuilding data using provenance in this case. The first is rebuilding data at the client, because the lost/damaged data was previously generated at the client. However, all the provenance of the file must be transferred to the client, which can incur network overhead. The second is rebuilding data on the server. However, the execution environment on the server must then be similar or identical to that of the client where the data was first generated.
VI. APPLICATION USE CASES
To give a better understanding of where provenance-based
rebuild can be used, we list a series of typical application use
cases as follows.
Validate Experimental Data: Scientists usually want to reproduce experimental results to see whether the whole experiment process is normal and correct, even when they have already obtained some experimental results. Scientists can use the provenance of the experimental data to automatically and conveniently replay the experiment.
Enhance Backup: Backup technology has been widely used in enterprise systems and cloud infrastructures. However, we consider provenance better suited for the following cases. 1) Backup of big data is expensive, since storing a number of copies of these data consumes more storage space than only storing the common workflows and input data (i.e., provenance) that generate them. 2) Backup can provide a series of historical versions of a document for a user to choose from for data recovery; provenance further tells the user the relationship between these versions and which one is the exact one for recovery.
Play Video: Many video websites provide different definitions of the same video for users with different requirements. For instance, a user may only want the video to play fluently, even if the definition is not high. A website usually keeps several copies of the video in each definition in case of crashes. A better solution may be keeping only one copy of the video in each definition for online play, and using provenance to record how to convert the original video into the different definitions. This usually saves at least 50% of the storage space.
High Performance Computing: It is a common case in HPC areas that scientists use a small amount of input data and some transformation rules to generate enormous data for analysis. It is not cost-effective to keep all these data. Recording only the input data and the transformation rules (i.e., provenance) can significantly reduce storage overhead.
VII. CONCLUSION
In this paper, we have presented our experience in designing and implementing a provenance-based rebuild framework. We have discussed a wide variety of issues that we had to solve when implementing this kind of framework. The experimental results show that provenance-based rebuild significantly outperforms log-based technology for per-file rebuild, with reasonable space and time overhead.
ACKNOWLEDGMENT
This work was supported in part by the National Basic Research 973 Program of China under Grant 2011CB302301, by the NSFC under Grants 61025008, 61232004 and 61173043, and by the 863 Program under Grant 2012AA012403. The authors thank the anonymous reviewers for their comments on this paper.
REFERENCES
[1] K.-K. Muniswamy-Reddy, D. A. Holland, U. Braun, and M. I. Seltzer,
“Provenance-aware storage systems,” in Proc. USENIX’06, 2006.
[2] S. T. King and P. M. Chen, “Backtracking intrusions,” in Proc.
SOSP’03, Bolton Landing, NY, USA, 2003.
[3] S. Shah, C. A. N. Soules, G. R. Ganger, and B. D. Noble, “Using prove-
nance to aid in personal le search,” in Proc. USENIX’07, 2007.
[4] K.-K. Muniswamy-Reddy, “Foundations for provenance-aware sys-
tems,” Ph.D. thesis, Harvard Univ., Cambridge, MA, 2010.
[5] B. A. Madden, I. F. Adams, M. W. Storer, E. L. Miller, D. D. E. Long,
andT.M.Kroeger,“Provenance based rebuild: Using data provenance
to improve reliability,” Tech. Rep. UCSC-SSRC-11-04, 2011.
[6] K.-K. Muniswamy-Reddy, P. Macko, and M. I. Seltzer, “Provenance
for the cloud,” in Proc. FAST’10, 2010.
[7] S. Wu, H. Jiang, D. Feng, L. Tian, and B. Mao, “WorkOut: I/O work-
load outsourcing for boosting RAID reconstruction performance,” in
Proc. 7th USENIX Conf. File and Storage Technol., 2009.
[8] M. Rosenblum and J. K. Ousterhout, “The design and implementation
of a log-structured file system," ACM Trans. Comput. Syst., vol. 10, pp. 26–52, 1992.
[9] S. Y. Ko, I. Hoque, B. Cho, and I. Gupta, "On availability of intermediate data in cloud computations," in Proc. HotOS'09, 2009.
[10] D. Yuan, Y. Yang, X. Liu, and J. Chen, "A cost-effective strategy for intermediate data storage in scientific cloud workflow systems," in Proc. 24th IEEE Int. Parallel Distrib. Process. Symp., 2010.
[11] I. Adams, D. D. E. Long, E. L. Miller, S. Pasupathy, and M. W. Storer,
"Maximizing efficiency by trading storage for computation," in Proc. Workshop on Hot Topics in Cloud Comput., 2009.
[12] Y. Xie, D. Feng, Z. Tan, L. Chen, and J. Zhou, “Experiences building a
provenance-based reconstruction system,” in Proc. Storage Syst., Hard
Disk and Solid State Technol. Summit in Conjunction With the APMRC
Conf., 2012.
[13] [Online]. Available: http://www.eecs.harvard.edu/syrah/pql
[14] [Online]. Available: http://code.google.com/p/ext3grep/
[15] Y. Xie, K.-K. Muniswamy-Reddy, D. Feng, L. Yan, D. D. E. Long,
Z. Tan, and L. Chen, "A hybrid approach for efficient provenance storage," in Proc. 21st ACM Int. Conf. Inf. Knowledge Manage.
(CIKM), 2012.
[16] Y. Xie, K.-K. Muniswamy-Reddy, D. D. E. Long, A. Amer, D. Feng,
and Z. Tan, “Compressing provenance graphs,” in Proc. 3rd USENIX
Workshop on the Theory and Practice of Provenance, 2011.