Contextualised Workflow Execution in MyGrid
M. Nedim Alpdemir1, Arijit Mukherjee2, Norman W. Paton1,
Alvaro A.A. Fernandes1, Paul Watson2, Kevin Glover3,
Chris Greenhalgh3, Tom Oinn4, and Hannah Tipney1
1Department of Computer Science, University of Manchester, Oxford Road,
Manchester M13 9PL, United Kingdom
2School of Computing Science, University of Newcastle upon Tyne,
Newcastle upon Tyne NE1 7RU, United Kingdom
3School of Comp. Sci. and Inf. Tech., University of Nottingham,
Jubilee Campus, Wollaton Road, Nottingham NG8 1BB, UK
4European Bioinformatics Institute, Wellcome Trust Genome Campus,
Hinxton, Cambridge CB10 1SD, United Kingdom

P.M.A. Sloot et al. (Eds.): EGC 2005, LNCS 3470, pp. 444–453, 2005.
© Springer-Verlag Berlin Heidelberg 2005
Abstract. e-Scientists stand to benefit from tools and environments
that either hide, or help to manage, the inherent complexity involved in
accessing and making concerted use of the diverse resources that might be
used as part of an in silico experiment. This paper illustrates the benefits
that derive from the provision of integrated access to contextual informa-
tion that links the phases of a problem-solving activity, so that the steps
of a solution do not happen in isolation, but rather as the components
of a coherent whole. Experiences with the myGrid workflow execution
environment (Taverna) are presented, where an information model provides
the conceptual basis for contextualisation. This information model de-
scribes key characteristics that are shared by many e-Science activities,
and is used both to organise the scientist’s personal data resources, and
to support data sharing and capture within the myGrid environment.
1 Introduction and Related Work
Grid-based solutions to typical e-Science problems require the integration of
many distributed resources, and the orchestration of diverse analysis services in a
semantically rich, collaborative environment [5]. In such a context, it is important
that e-Scientists are supported in their day-to-day experiments with tools and
environments that allow the principal focus to be on scientific challenges, rather
than on the management and organisation of computational activities.
Research into Problem Solving Environments (PSEs) has long targeted this
particular challenge. Although the term Problem Solving Environment means
different things to different people [4], and its meaning seems to have been evolv-
ing over time, a number of common concepts can be identified from the relevant
research (e.g. [4, 6, 12, 8]). For example, the following features are commonly sup-
ported: problem definition; solution formulation; execution of the problem solu-
tion; provenance recording while applying the solution; result visualisation and
analysis; and support for communicating results to others (i.e. collaboration).
Although these are among the most common features, different PSEs add var-
ious other capabilities, such as intelligent support for problem formulation and
solution selection, or highlight a particular feature, such as the use of workflow
(e.g. [3, 1, 11]).
This paper emphasises a specific aspect that has been largely overlooked,
namely the provision of integrated access to contextual information that links
the phases of a problem-solving exercise in a meaningful way. In myGrid, the
following principles underpin support for contextualisation:
Consistent Representation: The information model ensures that informa-
tion required to establish the execution context conforms to a well-defined data
model, and therefore is understood by the myGrid components that take part
in an experiment as well as external parties.
Automatic Capture: When a workflow is executed, contextual information is
preserved by the workflow enactment engine, and used to annotate both inter-
mediate and final results.
Long-term Preservation: The contextual information is used to organise the
persistent storage of provenance information on workflows and their results,
easing interpretation and sharing.
Uniform Identification: Both contextual and experimental data are identified
and linked using a standard data and metadata identification scheme, namely
LSIDs [2] (an illustrative sketch of the scheme follows).
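An LSID is a URN of the form urn:lsid:&lt;authority&gt;:&lt;namespace&gt;:&lt;object&gt;[:&lt;revision&gt;]
[2]. The following Java sketch, with hypothetical class and method names (it is not
part of the myGrid codebase), shows how such identifiers can be constructed and parsed:

// Illustrative only: a minimal representation of the LSID naming scheme.
public final class Lsid {
    private final String authority;  // e.g. "mygrid.org.uk"
    private final String namespace;  // e.g. "experiment"
    private final String object;     // an opaque object identifier
    private final String revision;   // optional; may be null

    public Lsid(String authority, String namespace, String object, String revision) {
        this.authority = authority;
        this.namespace = namespace;
        this.object = object;
        this.revision = revision;
    }

    /** Parses a string such as "urn:lsid:mygrid.org.uk:experiment:42:1". */
    public static Lsid parse(String s) {
        String[] p = s.split(":");
        if (p.length < 5 || !p[0].equalsIgnoreCase("urn") || !p[1].equalsIgnoreCase("lsid")) {
            throw new IllegalArgumentException("Not an LSID: " + s);
        }
        return new Lsid(p[2], p[3], p[4], p.length > 5 ? p[5] : null);
    }

    @Override
    public String toString() {
        return "urn:lsid:" + authority + ":" + namespace + ":" + object
                + (revision == null ? "" : ":" + revision);
    }
}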
The rest of the paper describes contextualisation in myGrid, indicating both
how contextualisation supports users and how the myGrid architecture captures
and conveys the relevant information. As such, Section 2 summarises the in-
formation model, which is at the heart of contextualisation. Section 3 provides
some concrete examples of contextualisation in practice, in a bioinformatics ap-
plication. Section 4 provides an architectural view of the execution environment,
and finally Section 5 presents some conclusions.
2 The Information Model
The myGrid project (http://www.mygrid.org.uk/) is developing high-level
middleware to support the e-Scientist in conducting in silico experiments in
biology. An important part of this has been the design of an Information Model
(IM) [9], which defines the basic concepts through which different aspects of
an e-Science process can be represented and linked. By providing shared data
abstractions that underpin important service interactions, the IM promotes syn-
ergy between myGrid components. The IM is defined in UML, and equivalent
XML Schema definitions have been derived from the UML to facilitate the design
of the myGrid service interfaces.
Fig. 1. A UML class diagram providing an overview of the information model
Figure 1 illustrates several of the principal classes and associations in the IM.
In summary: a Programme is a structuring device for grouping other Studies and
can be used to represent, e.g., a project or sub-project. An Experiment Design
represents the method to be used (typically as a workflow script) to solve a
scientific problem. An Experiment Instance is an application of an Experiment
Design and represents some executing or completed task. The relationship of a
Person with an Organisational Structure is captured by an Affiliation. A Study
Participation qualifies a person’s relationship to the study by a set of study roles.
An Operation Trace represents inputs, outputs and the intermediate results of an
experiment (i.e. the experiment provenance), as opposed to the Data Provenance
which primarily indicates a data item’s creation method and time.
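Purely as an illustration of how Figure 1 reads, the following Java sketch renders
some of the core IM classes and associations. The names follow the figure, but the
actual IM is defined in UML with derived XML Schemas [9], so this rendering is an
assumption rather than myGrid code:

import java.util.ArrayList;
import java.util.Date;
import java.util.List;

// Hypothetical Java rendering of some core IM concepts from Figure 1.
class Study {
    String name, description, status;
    Date startTime, endTime;
    List<ExperimentDesign> experimentMethods = new ArrayList<>(); // methods used
    List<StudyParticipation> participants = new ArrayList<>();    // who takes part
}

class Programme {                              // structuring device that groups
    String name;                               // Studies, e.g. a project
    List<Study> contains = new ArrayList<>();
}

class ExperimentDesign {                       // the method, typically a workflow script
    String name;
    List<ExperimentInstance> instances = new ArrayList<>();
}

class ExperimentInstance {                     // an executing or completed application
    String name;                               // of an ExperimentDesign
    OperationTrace trace;                      // inputs, outputs, intermediates
}

class OperationTrace {                         // the experiment provenance, as opposed
    List<Operation> operations = new ArrayList<>(); // to a DataProvenance record
}

class Operation {
    String name, operationScriptXML;
}

class StudyParticipation {                     // qualifies a Person's relationship
    Person person;                             // to a Study by a set of study roles
    List<String> roleNames = new ArrayList<>();
}

class Person {
    String email, firstName, lastName;
}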
An important feature of the IM is that it does not model application-specific
data, but rather treats such data as opaque, and delegates responsibility for
its interpretation to users and to application-specific services. As such, concepts
such as sequence or gene, although they are relevant to the Williams-Beuren Syn-
drome (WBS) case study described in Section 3.1, are not explicitly described
in the IM. Rather, the IM captures information that is common to, and may
even be shared by, many e-Science applications, such as scientists, studies and
experiments. A consequence of this design decision is that the myGrid compo-
nents are less coupled to each other and to a particular domain, and so are more
easily deployable in different contexts. However, interpretation and processing
(e.g. content aware storage and visualisation) of the results for the end user be-
comes more difficult, and falls largely on the application developer’s shoulders.
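As a hedged sketch of this design decision, an IM-level data record might look as
follows; the class and field names are hypothetical:

// The IM-level record carries identification and provenance, but the domain
// payload is an uninterpreted byte array whose meaning is left to
// application-specific services.
class DataItem {
    String lsid;           // uniform identification via an LSID [2]
    String mimeTypeHint;   // optional hint for application-specific viewers
    byte[] payload;        // e.g. a sequence or BLAST report: opaque to the IM
    DataProvenance provenance;
}

class DataProvenance {     // primarily the item's creation method and time
    String title, description;
}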
3 Contextualised Workflows: A User's Perspective
3.1 Case Study
Informatic studies in Williams-Beuren Syndrome (WBS) are used to illustrate
the added value obtained from contextualisation. WBS is a rare disorder charac-
terized by physical and developmental problems. The search for improved under-
standing of the genetic basis for WBS requires repeated application of a range
of standard bioinformatics techniques. Due to the highly repetitive nature of
the genome sequence flanking the Williams-Beuren syndrome critical region, se-
quencing of the region is incomplete, leaving documented gaps in the released
genomic sequence. In order to produce a complete and accurate map of the re-
gion, researchers must constantly search for newly sequenced human DNA clones
that extend into these gap regions [10].
Several requirements of the WBS application stand to benefit from integrated
support for contextualisation:
– The experimenter needs to conduct several tasks repeatedly (e.g. execution
  of follow-on workflows), which requires the inputs, outputs and intermediate
  results of one step to be kept in the same experimental context, to ease
  comparisons of multiple runs and of alternative approaches.
– Results or experimental procedures need to be shared among researchers in
  the group. A contextualised environment helps scientists to migrate from ad-
  hoc practices for capturing the processes followed and the results obtained,
  to computer-supported, information-rich collaboration schemes.
3.2 Contextual Data in Use
This section illustrates how integrated support for contextualisation surfaces
to the user. In myGrid, workflows are developed and executed using the Tav-
erna workbench [7], which is essentially a workflow editor and a front-end to
a workflow execution environment with an extensible architecture, into which
additional components can be plugged. The following myGrid plug-ins provide
users with access to external information resources when designing and execut-
ing workflows:
MIR Browser: The myGrid Information Repository (MIR) is a web service
that provides long-term storage of information model data and associated
application-specific data. The plug-in supports access to and modification of
data in the MIR.
Fig. 2. MIR Browser displaying context information
Metadata Browser: The myGrid Metadata Store is a web service that sup-
ports application-specific annotations of data, including data in the MIR, using
semantic-web technologies. The plug-in supports searching and browsing of in-
formation in the metadata store, as well as the addition of new annotations.
Feta Search Engine: The Feta Search Engine provides access to registry data
on available services and resources.
Users are supported in providing, managing or accessing contextual data using
one or more of the above plug-ins, and benefit from the automatic maintenance
of contextual data in the MIR.
When developing or using workflows, the e-Scientist first launches the Tav-
erna workbench, which provides workflow-specific user interface elements, as well
as the plug-ins described above. When the MIR browser is launched, the user
provides login details, and is then provided with access to their personal instance
of the information model, as illustrated in Figure 2. This figure shows how ac-
cess has been provided, among other things, to: (i) data from the studies in
which the e-Scientist is participating, in this case a Williams-Beuren Syndrome
Study; (ii) the experiment designs that are being used in the study, in this case
a single WBS-Scenario Experiment; and (iii) the definitions of the workflows
that are used in the in silico experiments, in this case a single workflow named
WBS part-A workflow. In this particular case, the absence of a '+' beside the
ExperimentInstance indicates that the WBS-Scenario Experiment has not yet
been executed. At this point, the e-Scientist is free either to select the existing
workflow, or to search for a new workflow using the Feta search engine. Either way,
it is possible for the workflow to be edited, for example by adding new services
discovered using Feta to try variations to an existing solution.
When a workflow is selected for execution, the e-Scientist can obtain data
values for use as inputs to the workflow from previous experiment results stored
in the MIR. The execution of follow-on analyses is common practice in the WBS
case study.
The user’s view in Figure 2 illustrates that resources, such as individual
workflows, do not exist in isolation, but rather are part of a context, in this
case a study into WBS. There may be many experiments and workflows that
are part of that study. In addition, when a workflow from a study is executed,
it is executed in the context of that study, and the results of the workflow are
automatically considered to be among the results of the study.

Fig. 3. Workflow execution results in the MIR Browser

For example,
Figure 3 illustrates the results obtained by executing the WBS part-A workflow
from Figure 2. The results of the execution are automatically recorded under
the Experiment Instance entity, which is associated with the Experiment Design
from Figure 2. The values obtained for each of the FormalOutputParameter values
from Figure 2 are made available as ActualOutputParameter values in Figure 3.
In addition, provenance information about the workflow execution has also been
captured automatically. For example, the LSID [2] of each operation invoked
from the workflow is made available as an OperationTrace value. Further browsing
of the individual operation invocations indicates exactly what values were used
as input and output parameters. Such provenance information can be useful in
helping to explain and interpret the results of in silico experiments.
4 Contextualised Workflows: An Architectural Perspective

Fig. 4. A simplified architectural view
The core components that participate in the process of formulating and exe-
cuting a workflow were introduced in Section 3.2. Figure 4 illustrates principal
relationships between the components, where the Taverna workbench constitutes
the presentation layer, and includes a number of GUI plug-ins to facilitate user
interaction. The workflow enactor (Freefluo) is the central component of the
execution environment, and communicates with other myGrid components via
plug-ins that observe the events generated as the enactor’s internal state changes.
For example, when an intermediate step in the workflow completes its execution,
the enactor generates an event and makes the intermediate results available to
the event listeners. The MIR plug-in responds to this event by obtaining the
intermediate results and storing them in the MIR in their appropriate context.
As such, the plug-in architecture is instrumental in facilitating the automatic
propagation of the experimental context across the participating components.
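The following Java sketch illustrates the observer pattern just described; the
interface and event names are hypothetical and do not reproduce the actual
Freefluo plug-in API:

interface WorkflowEventListener {
    void workflowCreated(WorkflowEvent e);
    void processCompleted(WorkflowEvent e);  // an intermediate step finished
    void workflowCompleted(WorkflowEvent e);
}

class WorkflowEvent {
    String workflowLsid;         // identifies the running workflow instance
    String experimentContext;    // e.g. LSID of the enclosing ExperimentInstance
    Object enactorState;         // results exposed by the enactor at this event
}

// The MIR plug-in subscribes and stores results in their context:
class MirPlugin implements WorkflowEventListener {
    public void workflowCreated(WorkflowEvent e) {
        // create a new ExperimentInstance record in the MIR
    }
    public void processCompleted(WorkflowEvent e) {
        // store intermediate results in the MIR under e.experimentContext
    }
    public void workflowCompleted(WorkflowEvent e) {
        // store final outputs and the OperationTrace provenance
    }
}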
Fig. 5. Interactions between core myGrid components

Figure 5 is a UML sequence diagram that illustrates a basic set of interactions
between the myGrid components from Figure 4, and provides a simplified
view of the steps involved in contextualised execution of a workflow. A typical
interaction for a contextualised workflow execution starts with the user's login via
the MIRBrowser. The next step is normally finding a workflow to execute. This
could be done either by a simple load from the local file system, by a get
operation from the MIR, or by a search query via the Feta GUI panel. Next, the
experimenter executes the workflow. Although in the diagram this is shown as
a direct interaction with the enactor for simplicity, in reality this is done via a
separate GUI panel and the context is passed to the enactor implicitly. As the
enactor executes the workflow, it informs event listeners (i.e. the MIR plug-in
and the Metadata Store plug-in) that act as proxies on behalf of other myGrid
components, at each critical event. Only two important events, namely Work-
flowCreated and WorkflowCompleted, are shown in the diagram, although there
are several other intermediate events emitted by the enactor, for example for
capturing provenance data on operation calls. The listeners respond to those
events by extracting the context information and any other information they
need from the enactor’s state, thereby ensuring that the MIR and the Metadata
Store receive the relevant provenance information.
An additional benefit of the automatic capturing of workflow results and
provenance information is that the e-Scientist can pose complex queries against
historical records. For example, a query could be expressed to select all the
workflows that were executed after 30th March 2004, by the person 'Hannah
Tipney', that had an output of type 'BLAST output'.
5 Conclusions
This paper has described how a workflow definition and enactment environment
can provide enhanced support for e-Science activities by closely associating work-
flows with their broader context. The particular benefits that have been obtained
in myGrid derive from the principles introduced in Section 1, namely:
Consistent Representation: the paper has described how an e-Science spe-
cific, but application-independent, information model can be used not only to
manage the data resources associated with a study, but also to drive interface
components. In addition, many myGrid components take and return values that
conform to the information model, leading to more consistent interfaces and
more efficient development.
Automatic Capture: the paper has described how the results of a workflow ex-
ecution, plus associated provenance information, can be captured automatically,
and made available throughout an e-Science infrastructure as events to which
different components may subscribe. The properties of these events are generally
modelled using the information model, and are used in the core myGrid services
to support the updating of displays and the automatic storage of contextualised
information.
Long-term Preservation: the paper has described how an information repos-
itory can be used to store the data artifacts of individual scientists in a consis-
tent fashion, thereby supporting future interpretation, sharing and analysis of
the data. Most current bioinformatics analyses are conducted in environments in
which the user rather than the system has responsibility for recording precisely
what tasks have taken place, and how specific derived values have been obtained.
Uniform Identification: the paper has described how a wide range of different
kinds of data can provide useful context for the conducting of in silico experi-
ments. Such information often has to be shared, or cross-referenced. In myGrid,
LSIDs are used to identify the different kinds of data stored in the MIR, and
LSIDs also enable cross-referencing between stores. For example, if an assertion
is made about a workflow from an MIR in a Metadata Store, the Metadata Store
will refer to the workflow by way of its LSID.
As such, the contribution of this paper has been both to demonstrate the benefits
of contextualisation for workflow enactment, and also to describe the myGrid ap-
proach, both from a user and architectural perspective. The software described
in this paper is available from http://www.mygrid.org.uk.
Acknowledgements. The work reported in this paper has been supported by
the UK e-Science Programme.
References
1. S. AlSairafi et al. The design of Discovery Net: Towards Open Grid Services for
Knowledge Discovery. The International Journal of High Performance Computing
Applications, 17(3):297–315, Fall 2003.
2. T. Clark, S. Martin, and T. Liefeld. Globally distributed object identification for
biological knowledgebases. Briefings in Bioinformatics, 5(1):59–70, 2004.
3. E. Deelman, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, S. Patil, M.-H. Su, K. Vahi,
and M. Livny. Pegasus: Mapping scientific workflows onto the grid. In I. Foster
and C. Kesselman, editors, 2nd European Across Grids Conference, 2004.
4. E. Gallopoulos, E. Houstis, and J. R. Rice. Computer as thinker/doer: Problem-
solving environments for computational science. IEEE Comput. Sci. Eng., 1(2):11–
23, 1994.
5. C. Goble, C. Greenhalgh, S. Pettifer, and R. Stevens. Knowledge integration:
In silico experiments in bioinformatics. In I. Foster and C. Kesselman, editors,
The Grid: Blueprint for a New Computing Infrastructure, pages 121–134. Morgan
Kaufmann, 2004.
6. E. N. Houstis and J. R. Rice. Future problem solving environments for computa-
tional science. Math. Comput. Simul., 54(4-5):243–257, 2000.
7. T. Oinn, M. Addis, J. Ferris, D. Marvin, M. Greenwood, T. Carver, M. R. Pocock,
A. Wipat, and P. Li. Taverna: a tool for the composition and enactment of bioin-
formatics workflows. Bioinformatics, page bth361, 2004.
8. K. Schuchardt, B. Didier, and G. Black. Ecce: a problem-solving environment's
evolution toward grid services and a web architecture. Concurrency and Compu-
tation: Practice and Experience, 14(13-15):1221–1239, 2002.
9. N. Sharman, N. Alpdemir, J. Ferris, M. Greenwood, P. Li, and C. Wroe. The
myGrid Information Model. In S. J. Cox, editor, Proceedings of UK e-Science All
Hands Meeting 2004. EPSRC, September 2004.
10. R. D. Stevens, H. J. Tipney, C. J. Wroe, T. M. Oinn, M. Senger, P. W. Lord, C. A.
Goble, A. Brass, and M. Tassabehji. Exploring Williams-Beuren syndrome using
myGrid. Bioinformatics, 20(suppl 1):i303–i310, 2004.
11. I. Taylor, M. Shields, and I. Wang. Grid Resource Management, chapter Resource
Management of Triana P2P Services. Kluwer, June 2003.
12. D. W. Walker, M. Li, O. F. Rana, M. S. Shields, and Y. Huang. The software
architecture of a distributed problem-solving environment. Concurrency: Practice
and Experience, 12(15):1455–1480, 2000.