Contextualised Workflow Execution in MyGrid
M. Nedim Alpdemir1, Arijit Mukherjee2, Norman W. Paton1,
Alvaro A.A. Fernandes1, Paul Watson2, Kevin Glover3,
Chris Greenhalgh3, Tom Oinn4, and Hannah Tipney1
1Department of Computer Science, University of Manchester, Oxford Road,
Manchester M13 9PL, United Kingdom
2School of Computing Science, University of Newcastle upon Tyne,
Newcastle upon Tyne NE1 7RU, United Kingdom
3School of Comp. Sci. and Inf. Tech., University of Nottingham,
Jubilee Campus, Wollaton Road, Nottingham NG8 1BB, UK
4European Bioinformatics Institute, Wellcome Trust Genome Campus,
Hinxton, Cambridge CB10 1SD, United Kingdom

P.M.A. Sloot et al. (Eds.): EGC 2005, LNCS 3470, pp. 444–453, 2005.
© Springer-Verlag Berlin Heidelberg 2005
Abstract. e-Scientists stand to benefit from tools and environments
that either hide, or help to manage, the inherent complexity involved in
accessing and making concerted use of the diverse resources that might be
used as part of an in silico experiment. This paper illustrates the benefits
that derive from the provision of integrated access to contextual informa-
tion that links the phases of a problem-solving activity, so that the steps
of a solution do not happen in isolation, but rather as the components
of a coherent whole. Experiences with the myGrid workflow execution
environment (Taverna) are presented, where an information model provides
the conceptual basis for contextualisation. This information model de-
scribes key characteristics that are shared by many e-Science activities,
and is used both to organise the scientist’s personal data resources, and
to support data sharing and capture within the myGrid environment.
1 Introduction and Related Work
Grid-based solutions to typical e-Science problems require the integration of
many distributed resources, and the orchestration of diverse analysis services in a
semantically rich, collaborative environment [5]. In such a context, it is important
that e-Scientists are supported in their day-to-day experiments with tools and
environments that allow the principal focus to be on scientific challenges, rather
than on the management and organisation of computational activities.
Research into Problem Solving Environments (PSEs) has long targeted this
particular challenge. Although the term Problem Solving Environment means
different things to different people [4], and its meaning seems to have been evolv-
ing over time, a number of common concepts can be identified from the relevant
research (e.g. [4, 6, 12, 8]). For example, the following features are commonly sup-
ported: problem definition; solution formulation; execution of the problem solu-
tion; provenance recording while applying the solution; result visualisation and
analysis; and support for communicating results to others (i.e. collaboration).
Although these are among the most common features, different PSEs add var-
ious other capabilities, such as intelligent support for problem formulation and
solution selection, or highlight a particular feature, such as the use of workflow
(e.g. [3, 1, 11]).
This paper emphasises a specific aspect that has been largely overlooked,
namely the provision of integrated access to contextual information that links
the phases of a problem-solving exercise in a meaningful way. In myGrid, the
following principles underpin support for contextualisation:
Consistent Representation: The information model ensures that informa-
tion required to establish the execution context conforms to a well-defined data
model, and therefore is understood by the myGrid components that take part
in an experiment as well as external parties.
Automatic Capture: When a workflow is executed, contextual information is
preserved by the workflow enactment engine, and used to annotate both inter-
mediate and final results.
Long-term Preservation: The contextual information is used to organise the
persistent storage of provenance information on workflows and their results,
easing interpretation and sharing.
Uniform Identification: Both contextual and experimental data are identified
and linked using a standard data and metadata identification scheme, namely
LSIDs [2] (an illustrative sketch of the scheme follows).
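An LSID is a URN of the form urn:lsid:&lt;authority&gt;:&lt;namespace&gt;:&lt;object&gt;[:&lt;revision&gt;]
[2]. The following Java sketch, with hypothetical class and method names (it is not
part of the myGrid codebase), shows how such identifiers can be constructed and parsed:

// Illustrative only: a minimal representation of the LSID naming scheme.
public final class Lsid {
    private final String authority;  // e.g. "mygrid.org.uk"
    private final String namespace;  // e.g. "experiment"
    private final String object;     // an opaque object identifier
    private final String revision;   // optional; may be null

    public Lsid(String authority, String namespace, String object, String revision) {
        this.authority = authority;
        this.namespace = namespace;
        this.object = object;
        this.revision = revision;
    }

    /** Parses a string such as "urn:lsid:mygrid.org.uk:experiment:42:1". */
    public static Lsid parse(String s) {
        String[] p = s.split(":");
        if (p.length < 5 || !p[0].equalsIgnoreCase("urn") || !p[1].equalsIgnoreCase("lsid")) {
            throw new IllegalArgumentException("Not an LSID: " + s);
        }
        return new Lsid(p[2], p[3], p[4], p.length > 5 ? p[5] : null);
    }

    @Override
    public String toString() {
        return "urn:lsid:" + authority + ":" + namespace + ":" + object
                + (revision == null ? "" : ":" + revision);
    }
}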
The rest of the paper describes contextualisation in myGrid, indicating both
how contextualisation supports users and how the myGrid architecture captures
and conveys the relevant information. As such, Section 2 summarises the in-
formation model, which is at the heart of contextualisation. Section 3 provides
some concrete examples of contextualisation in practice, in a bioinformatics ap-
plication. Section 4 provides an architectural view of the execution environment,
and finally Section 5 presents some conclusions.
2 The Information Model
The myGrid project (http://www.mygrid.org.uk/) is developing high-level
middleware to support the e-Scientist in conducting in silico experiments in
biology. An important part of this has been the design of an Information Model
(IM) [9], which defines the basic concepts through which different aspects of
an e-Science process can be represented and linked. By providing shared data
abstractions that underpin important service interactions, the IM promotes syn-
ergy between myGrid components. The IM is defined in UML, and equivalent
XML Schema definitions have been derived from the UML to facilitate the design
of the myGrid service interfaces.
Fig. 1. A UML class diagram providing an overview of the information model
Figure 1 illustrates several of the principal classes and associations in the IM.
In summary: a Programme is a structuring device for grouping other Studies and
can be used to represent, e.g., a project or sub-project. An Experiment Design
represents the method to be used (typically as a workflow script) to solve a
scientific problem. An Experiment Instance is an application of an Experiment
Design and represents some executing or completed task. The relationship of a
Person with an Organisational Structure is captured by an Affiliation. A Study
Participation qualifies a person’s relationship to the study by a set of study roles.
An Operation Trace represents inputs, outputs and the intermediate results of an
experiment (i.e. the experiment provenance), as opposed to the Data Provenance
which primarily indicates a data item’s creation method and time.
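Purely as an illustration of how Figure 1 reads, the following Java sketch renders
some of the core IM classes and associations. The names follow the figure, but the
actual IM is defined in UML with derived XML Schemas [9], so this rendering is an
assumption rather than myGrid code:

import java.util.ArrayList;
import java.util.Date;
import java.util.List;

// Hypothetical Java rendering of some core IM concepts from Figure 1.
class Study {
    String name, description, status;
    Date startTime, endTime;
    List<ExperimentDesign> experimentMethods = new ArrayList<>(); // methods used
    List<StudyParticipation> participants = new ArrayList<>();    // who takes part
}

class Programme {                              // structuring device that groups
    String name;                               // Studies, e.g. a project
    List<Study> contains = new ArrayList<>();
}

class ExperimentDesign {                       // the method, typically a workflow script
    String name;
    List<ExperimentInstance> instances = new ArrayList<>();
}

class ExperimentInstance {                     // an executing or completed application
    String name;                               // of an ExperimentDesign
    OperationTrace trace;                      // inputs, outputs, intermediates
}

class OperationTrace {                         // the experiment provenance, as opposed
    List<Operation> operations = new ArrayList<>(); // to a DataProvenance record
}

class Operation {
    String name, operationScriptXML;
}

class StudyParticipation {                     // qualifies a Person's relationship
    Person person;                             // to a Study by a set of study roles
    List<String> roleNames = new ArrayList<>();
}

class Person {
    String email, firstName, lastName;
}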
An important feature of the IM is that it does not model application-specific
data, but rather treats such data as opaque, and delegates responsibility for
its interpretation to users and to application-specific services. As such, concepts
such as sequence or gene, although they are relevant to the Williams-Beuren Syn-
drome (WBS) case study described in Section 3.1, are not explicitly described
in the IM. Rather, the IM captures information that is common to, and may
even be shared by, many e-Science applications, such as scientists, studies and
experiments. A consequence of this design decision is that the myGrid compo-
nents are less coupled to each other and to a particular domain, and so are more
easily deployable in different contexts. However, interpretation and processing
(e.g. content aware storage and visualisation) of the results for the end user be-
comes more difficult, and falls largely on the application developer’s shoulders.
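As a hedged sketch of this design decision, an IM-level data record might look as
follows; the class and field names are hypothetical:

// The IM-level record carries identification and provenance, but the domain
// payload is an uninterpreted byte array whose meaning is left to
// application-specific services.
class DataItem {
    String lsid;           // uniform identification via an LSID [2]
    String mimeTypeHint;   // optional hint for application-specific viewers
    byte[] payload;        // e.g. a sequence or BLAST report: opaque to the IM
    DataProvenance provenance;
}

class DataProvenance {     // primarily the item's creation method and time
    String title, description;
}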
3 Contextualised Workflows: A User's Perspective
3.1 Case Study
Informatic studies in Williams-Beuren Syndrome (WBS) are used to illustrate
the added value obtained from contextualisation. WBS is a rare disorder charac-
terized by physical and developmental problems. The search for improved under-
standing of the genetic basis for WBS requires repeated application of a range
of standard bioinformatics techniques. Due to the highly repetitive nature of
the genome sequence flanking the Williams-Beuren syndrome critical region, se-
quencing of the region is incomplete, leaving documented gaps in the released
genomic sequence. In order to produce a complete and accurate map of the re-
gion, researchers must constantly search for newly sequenced human DNA clones
that extend into these gap regions [10].
Several requirements of the WBS application stand to benefit from integrated
support for contextualisation:
– The experimenter needs to conduct several tasks repeatedly (e.g. execution
  of follow-on workflows), which requires the inputs, outputs and intermediate
  results of one step to be kept in the same experimental context, to ease
  comparisons of multiple runs and of alternative approaches.
– Results or experimental procedures need to be shared among researchers in
  the group. A contextualised environment helps scientists to migrate from ad-
  hoc practices for capturing the processes followed and the results obtained,
  to computer-supported, information-rich collaboration schemes.
3.2 Contextual Data in Use
This section illustrates how integrated support for contextualisation surfaces
to the user. In myGrid, workflows are developed and executed using the Tav-
erna workbench [7], which is essentially a workflow editor and a front-end to
a workflow execution environment with an extensible architecture, into which
additional components can be plugged. The following myGrid plug-ins provide
users with access to external information resources when designing and execut-
ing workflows:
MIR Browser: The myGrid Information Repository (MIR) is a web service
that provides long-term storage of information model data and associated
application-specific data. The plug-in supports access to and modification of
data in the MIR.
Fig. 2. MIR Browser displaying context information
Metadata Browser: The myGrid Metadata Store is a web service that sup-
ports application-specific annotations of data, including data in the MIR, using
semantic-web technologies. The plug-in supports searching and browsing of in-
formation in the metadata store, as well as the addition of new annotations.
Feta Search Engine: The Feta Search Engine provides access to registry data
on available services and resources.
Users are supported in providing, managing or accessing contextual data using
one or more of the above plug-ins, and benefit from the automatic maintenance
of contextual data in the MIR.
When developing or using workflows, the e-Scientist first launches the Tav-
erna workbench, which provides workflow-specific user interface elements, as well
as the plug-ins described above. When the MIR browser is launched, the user
provides login details, and is then provided with access to their personal instance
of the information model, as illustrated in Figure 2. This figure shows how ac-
cess has been provided, among other things, to: (i) data from the studies in
which the e-Scientist is participating, in this case a Williams-Beuren Syndrome
Study; (ii) the experiment designs that are being used in the study, in this case
a single WBS-Scenario Experiment; and (iii) the definitions of the workflows
that are used in the in silico experiments, in this case a single workflow named
WBS part-A workflow. In this particular case, the absence of a '+' beside the
ExperimentInstance indicates that the WBS-Scenario Experiment has not yet
been executed. At this point, the e-Scientist is free either to select the existing
workflow, or to search for a new workflow using the Feta search engine. Either way,
it is possible for the workflow to be edited, for example by adding new services
discovered using Feta to try variations to an existing solution.
When a workflow is selected for execution, the e-Scientist can obtain data
values for use as inputs to the workflow from previous experiment results stored
in the MIR. The execution of follow-on analyses is common practice in the WBS
case study.
The user’s view in Figure 2 illustrates that resources, such as individual
workflows, do not exist in isolation, but rather are part of a context, in this
case a study into WBS. There may be many experiments and workflows that
are part of that study. In addition, when a workflow from a study is executed,
it is executed in the context of that study, and the results of the workflow are
automatically considered to be among the results of the study.

Fig. 3. Workflow execution results in the MIR Browser

For example,
Figure 3 illustrates the results obtained by executing the WBS part-A workflow
from Figure 2. The results of the execution are automatically recorded under
the Experiment Instance entity, which is associated with the Experiment Design
from Figure 2. The values obtained for each of the FormalOutputParameter values
from Figure 2 are made available as ActualOutputParameter values in Figure 3.
In addition, provenance information about the workflow execution has also been
captured automatically. For example, the LSID [2] of each operation invoked
from the workflow is made available as an OperationTrace value. Further browsing
of the individual operation invocations indicates exactly what values were used
as input and output parameters. Such provenance information can be useful in
helping to explain and interpret the results of in silico experiments.
4 Contextualised Workflows: An Architectural Perspective

Fig. 4. A simplified architectural view
The core components that participate in the process of formulating and exe-
cuting a workflow were introduced in Section 3.2. Figure 4 illustrates principal
relationships between the components, where the Taverna workbench constitutes
the presentation layer, and includes a number of GUI plug-ins to facilitate user
interaction. The workflow enactor (Freefluo) is the central component of the
execution environment, and communicates with other myGrid components via
plug-ins that observe the events generated as the enactor’s internal state changes.
For example, when an intermediate step in the workflow completes its execution,
the enactor generates an event and makes the intermediate results available to
the event listeners. The MIR plug-in responds to this event by obtaining the
intermediate results and storing them in the MIR in their appropriate context.
As such, the plug-in architecture is instrumental in facilitating the automatic
propagation of the experimental context across the participating components.
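The following Java sketch illustrates the observer pattern just described; the
interface and event names are hypothetical and do not reproduce the actual
Freefluo plug-in API:

interface WorkflowEventListener {
    void workflowCreated(WorkflowEvent e);
    void processCompleted(WorkflowEvent e);  // an intermediate step finished
    void workflowCompleted(WorkflowEvent e);
}

class WorkflowEvent {
    String workflowLsid;         // identifies the running workflow instance
    String experimentContext;    // e.g. LSID of the enclosing ExperimentInstance
    Object enactorState;         // results exposed by the enactor at this event
}

// The MIR plug-in subscribes and stores results in their context:
class MirPlugin implements WorkflowEventListener {
    public void workflowCreated(WorkflowEvent e) {
        // create a new ExperimentInstance record in the MIR
    }
    public void processCompleted(WorkflowEvent e) {
        // store intermediate results in the MIR under e.experimentContext
    }
    public void workflowCompleted(WorkflowEvent e) {
        // store final outputs and the OperationTrace provenance
    }
}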
Fig. 5. Interactions between core myGrid components

Figure 5 is a UML sequence diagram that illustrates a basic set of interactions
between the myGrid components from Figure 4, and provides a simplified
view of the steps involved in contextualised execution of a workflow. A typical
interaction for a contextualised workflow execution starts with the user's login via
the MIRBrowser. The next step is normally finding a workflow to execute. This
could be done either by a simple load from the local file system, by a get
operation from the MIR, or by a search query via the Feta GUI panel. Next, the
experimenter executes the workflow. Although in the diagram this is shown as
a direct interaction with the enactor for simplicity, in reality this is done via a
separate GUI panel and the context is passed to the enactor implicitly. As the
enactor executes the workflow, it informs event listeners (i.e. the MIR plug-in
and the Metadata Store plug-in) that act as proxies on behalf of other myGrid
components, at each critical event. Only two important events, namely Work-
flowCreated and WorkflowCompleted, are shown in the diagram, although there
are several other intermediate events emitted by the enactor, for example for
capturing provenance data on operation calls. The listeners respond to those
events by extracting the context information and any other information they
need from the enactor’s state, thereby ensuring that the MIR and the Metadata
Store receive the relevant provenance information.
An additional benefit of the automatic capturing of workflow results and
provenance information is that the e-Scientist can pose complex queries against
historical records. For example, a query could be expressed to select all the
workflows that were executed after 30th March 2004, by the person 'Hannah
Tipney', that had an output of type 'BLAST output'.
5 Conclusions
This paper has described how a workflow definition and enactment environment
can provide enhanced support for e-Science activities by closely associating work-
flows with their broader context. The particular benefits that have been obtained
in myGrid derive from the principles introduced in Section 1, namely:
Consistent Representation: the paper has described how an e-Science spe-
cific, but application-independent, information model can be used not only to
manage the data resources associated with a study, but also to drive interface
components. In addition, many myGrid components take and return values that
conform to the information model, leading to more consistent interfaces and
more efficient development.
Automatic Capture: the paper has described how the results of a workflow ex-
ecution, plus associated provenance information, can be captured automatically,
and made available throughout an e-Science infrastructure as events to which
different components may subscribe. The properties of these events are generally
modelled using the information model, and are used in the core myGrid services
to support the updating of displays and the automatic storage of contextualised
information.
Long-term Preservation: the paper has described how an information repos-
itory can be used to store the data artifacts of individual scientists in a consis-
tent fashion, thereby supporting future interpretation, sharing and analysis of
the data. Most current bioinformatics analyses are conducted in environments in
which the user rather than the system has responsibility for recording precisely
what tasks have taken place, and how specific derived values have been obtained.
Uniform Identification: the paper has described how a wide range of different
kinds of data can provide useful context for the conducting of in silico experi-
ments. Such information often has to be shared, or cross-referenced. In myGrid,
LSIDs are used to identify the different kinds of data stored in the MIR, and
LSIDs also enable cross-referencing between stores. For example, if an assertion
is made about a workflow from an MIR in a Metadata Store, the Metadata Store
will refer to the workflow by way of its LSID.
As such, the contribution of this paper has been both to demonstrate the benefits
of contextualisation for workflow enactment, and also to describe the myGrid ap-
proach, both from a user and architectural perspective. The software described
in this paper is available from http://www.mygrid.org.uk.
Acknowledgements. The work reported in this paper has been supported by
the UK e-Science Programme.
References
1. S. AlSairafi et al. The design of Discovery Net: Towards Open Grid Services for
Knowledge Discovery. The International Journal of High Performance Computing
Applications, 17(3):297–315, Fall 2003.
2. T. Clark, S. Martin, and T. Liefeld. Globally distributed object identification for
biological knowledgebases. Briefings in Bioinformatics, 5(1):59–70, 2004.
3. E. Deelman, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, S. Patil, M.-H. Su, K. Vahi,
and M. Livny. Pegasus: Mapping scientific workflows onto the grid. In I. Foster
and C. Kesselman, editors, 2nd European Across Grids Conference, 2004.
4. E. Gallopoulos, E. Houstis, and J. R. Rice. Computer as thinker/doer: Problem-
solving environments for computational science. IEEE Comput. Sci. Eng., 1(2):11–
23, 1994.
5. C. Goble, C. Greenhalgh, S. Pettifer, and R. Stevens. Knowledge integration:
In silico experiments in bioinformatics. In I. Foster and C. Kesselman, editors,
The Grid: Blueprint for a New Computing Infrastructure, pages 121–134. Morgan
Kaufmann, 2004.
6. E. N. Houstis and J. R. Rice. Future problem solving environments for computa-
tional science. Math. Comput. Simul., 54(4-5):243–257, 2000.
7. T. Oinn, M. Addis, J. Ferris, D. Marvin, M. Greenwood, T. Carver, M. R. Pocock,
A. Wipat, and P. Li. Taverna: a tool for the composition and enactment of bioin-
formatics workflows. Bioinformatics, page bth361, 2004.
8. K. Schuchardt, B. Didier, and G. Black. Ecce: a problem-solving environment's
evolution toward grid services and a web architecture. Concurrency and Compu-
tation: Practice and Experience, 14(13-15):1221–1239, 2002.
9. N. Sharman, N. Alpdemir, J. Ferris, M. Greenwood, P. Li, and C. Wroe. The
myGrid Information Model. In S. J. Cox, editor, Proceedings of UK e-Science All
Hands Meeting 2004. EPSRC, September 2004.
10. R. D. Stevens, H. J. Tipney, C. J. Wroe, T. M. Oinn, M. Senger, P. W. Lord, C. A.
Goble, A. Brass, and M. Tassabehji. Exploring Williams-Beuren syndrome using
myGrid. Bioinformatics, 20(suppl 1):i303–i310, 2004.
11. I. Taylor, M. Shields, and I. Wang. Grid Resource Management, chapter Resource
Management of Triana P2P Services. Kluwer, June 2003.
12. D. W. Walker, M. Li, O. F. Rana, M. S. Shields, and Y. Huang. The software
architecture of a distributed problem-solving environment. Concurrency: Practice
and Experience, 12(15):1455–1480, 2000.