Procedia Computer Science 108C (2017) 445–454
1877-0509 © 2017 The Authors. Published by Elsevier B.V.
Peer-review under responsibility of the scientific committee of the International Conference on Computational Science
10.1016/j.procs.2017.05.054
International Conference on Computational Science, ICCS 2017, 12-14 June 2017,
Zurich, Switzerland
Effective and Scalable Data Access Control in Onedata
Large Scale Distributed Virtual File System
Michal Wrzeszcz1,2, Lukasz Opiola1,2, Konrad Zemek2, Bartosz Kryza1, Lukasz
Dutka1, Renata Slota2, and Jacek Kitowski1,2
1Academic Computer Centre CYFRONET-AGH, University of Science and Technology, Krakow,
Poland
2AGH University of Science and Technology, Faculty of Computer Science, Electronics and
Telecommunications, Department of Computer Science, Krakow, Poland
kito@agh.edu.pl, rena@agh.edu.pl
Abstract
Nowadays, as large amounts of data are generated from experiments, satellite imagery or simulations, access to this data becomes challenging for users who need to process it further, since existing data management systems make it difficult to effectively access and share large data sets. In this paper we present an approach to enabling easy and secure collaboration, based on state-of-the-art authentication and authorization mechanisms, an advanced group/role mechanism for flexible authorization management, and support for identity mapping between local systems, as applied in an eventually consistent distributed file system called Onedata.
Keywords: big data, open data, data management, authorization, security
1 Introduction
Today, more and more research and commercial applications rely heavily on distributed access to large data sets, including data collected from physical experiments as well as data obtained through pure simulations or statistical data collected from web applications. Such data sets are created in distributed infrastructures by various organizations using heterogeneous storage systems, and are often too large to be completely transferred between data centers for processing. These issues lead to several requirements for a modern distributed large-scale data management system: transparent data access from any machine, access to large data sets without completely transferring them to the computational nodes, flexible metadata support enabling data discovery, support for single- and multi-tenant deployment, secure and easy data sharing, advanced group and role mechanisms for large groups of collaborators, support for open data publishing, and data access using standard interfaces and protocols including POSIX and CDMI (Cloud Data Management Interface) [21].
However, existing data management platforms, which are either focused on high-performance data access on a local network or are Dropbox-like solutions for desktop users, often have complex authentication and authorization mechanisms (for instance based on X.509 certificates that users must manage manually) and are difficult for smaller user communities to deploy. Furthermore, users are accustomed to accessing and managing their personal data through Cloud-based services such as Dropbox or Google Drive, yet in order to access and process these data on virtual machines or containers in the Cloud, they still have to use legacy protocols such as FTP and share data by exchanging URLs or email attachments.
In order to address these challenges, we have proposed a novel solution for global data access that gives users an experience and ease of use similar to commercial data management and file synchronization solutions, while providing means for high-performance transparent data access and ensuring security at every step, including when data are stored on systems under the control of separate organizations. We have provided the corresponding architectural design and, finally, a practical implementation of software for global data access without barriers: a distributed, eventually consistent virtual file system, Onedata [6, 17, 25].
In the next sections we briefly describe the Onedata data management platform, discuss in detail our approach to data access control in a globally distributed file system, and review related work on large scale data management in distributed infrastructures, including aspects related to data access control.
2 Data management in Onedata
One of the main issues in modern large scale data management is how to manage and efficiently share data between large, distributed user communities. Onedata addresses this challenge by implementing a globally distributed storage system divided into zones (or federations), which are created by deploying a dedicated service called Onezone. Zones enable the creation of Onedata deployments that are independent of other federations. Any organization, community or user group can deploy its own Onezone service (single-tenant mode) with a customized login page, or use a public Onedata deployment (e.g. onedata.org or datahub.egi.eu) and rely on the Onedata group mechanism for user authorization and isolation (multi-tenant mode). Storage providers can connect to selected zones to form storage federations based on heterogeneous storage backends, while still providing users with unified, transparent data access.
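The relationships above (a zone federating providers, providers supporting selected user spaces) can be sketched as a minimal data model. All class and field names here are illustrative assumptions, not Onedata's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Space:
    """A logical data container shared by a group of users."""
    name: str
    members: set = field(default_factory=set)           # user IDs

@dataclass
class Provider:
    """A storage provider registered in a zone; supports selected spaces."""
    name: str
    supported_spaces: set = field(default_factory=set)  # space names

@dataclass
class Zone:
    """A federation of providers created by one Onezone deployment."""
    name: str
    providers: list = field(default_factory=list)

    def providers_for(self, user: str, space: Space) -> list:
        """Providers through which `user` can transparently reach `space`."""
        if user not in space.members:
            return []
        return [p for p in self.providers if space.name in p.supported_spaces]

# Example: two providers in one zone, only one of which supports the space.
space = Space("climate-data", members={"alice"})
zone = Zone("onedata.org", providers=[
    Provider("provider-1", {"climate-data"}),
    Provider("provider-2", set()),
])
print([p.name for p in zone.providers_for("alice", space)])  # ['provider-1']
```

The point of the sketch is that support is per-space and per-provider: a provider that does not support a space never sees it, and a non-member gets nothing even from supporting providers.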
A typical distributed Onedata deployment is depicted in Fig. 1. The Onezone service is the main point of access for users, providing a Single Sign-On login mechanism (1) for all providers who have granted the user access to their resources. Based on the Onezone authentication and authorization decisions, Oneprovider instances running in storage providers' data centers control data access operations on the user spaces they support (2). Onezone also enables easy and secure sharing of data between users within a single zone, by means of a simple token exchange (3). Our system design also envisions support for data sharing based on trust established between different zones, for use cases requiring integration between storage federations (4). Furthermore, it is not assumed that storage providers within a single zone must trust each other: only data and metadata about users who are supported by specific providers within a zone are exchanged between the providers concerned, and data center administrators retain full control over which users will be supported (5). Once users have authenticated in the Onezone service, they can directly access their data by connecting to a selected Oneprovider service (e.g. the one closest to their computing node); moreover, thanks to the interconnection between Oneprovider services within a single federation, users have transparent access to all files available from all storage providers supporting their spaces (6). This allows users to access their
Figure 1: Overview of a typical Onedata deployment. The figure annotates: (1) different login methods; (2) limited permissions to data containers (spaces) for providers; (3) advanced cooperation of different providers' users; (4) different providers' cooperation agreements; (5) lack of trust between providers; (6) a single access point to the multiprovider environment (a PC with oneclient or a web browser; Web GUI, REST, CDMI, POSIX client); (7) small delays in permissions checking needed; (8) access to remote data on behalf of the user; (9) delegation of permissions for direct access; (10) different authentication/permissions systems.
data from any location without pre-staging, via an efficient POSIX protocol, by simply mounting them on their local machines or attaching them to Cloud virtual machines or containers. At the lowest level, a special transfer protocol called RTransfer [6] has been developed, which enables efficient replication and real-time access to remote data between data centers, as well as POSIX access for end users (8). The transparency of data access is particularly evident when running data processing jobs (including legacy applications) on remote computing nodes, which can use the native POSIX API to access and write files, while all access permissions are delegated using bearer tokens generated during the first authentication (9). Finally, an important issue in every federated data management system is allowing local site administrators to enforce full control over which users have access to which storage resources. In our solution this is achieved via a special mechanism called LUMA (Local User MApping) [16], an extensible mechanism that allows storage administrators to provide a mapping between global user identities and local, storage-specific user credentials (10).
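The LUMA idea, resolving a global Onedata identity to local storage credentials, can be sketched roughly as follows. The table contents, identifiers and function name are hypothetical; the real LUMA is an external, administrator-provided mapping service:

```python
# A minimal sketch of LUMA-style identity mapping (hypothetical data;
# the real LUMA is an extensible service configured by storage admins).

LOCAL_USER_MAP = {
    # (global user ID, storage ID) -> local POSIX credentials
    ("global-user-42", "ceph-storage-A"): {"uid": 1001, "gid": 2001},
    ("global-user-42", "nfs-storage-B"):  {"uid": 3507, "gid": 100},
}

def map_identity(global_user_id: str, storage_id: str) -> dict:
    """Resolve a global identity to local credentials for one storage.

    Raising on a missing entry models the administrator's full control:
    a user without a mapping simply cannot touch that storage backend.
    """
    try:
        return LOCAL_USER_MAP[(global_user_id, storage_id)]
    except KeyError:
        raise PermissionError(
            f"{global_user_id} has no mapping on {storage_id}")

print(map_identity("global-user-42", "ceph-storage-A"))  # {'uid': 1001, 'gid': 2001}
```

Note that the same global user maps to different uid/gid pairs on different backends, which is exactly what lets each site keep its local account scheme untouched.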
3 Data access authentication and authorization
In order to enable effective data access and sharing in large distributed user communities, the data management system has to address several issues, such as unified identity management across different storage sites, distributed authorization, and flexible group and role management. This section details the data access control mechanisms that address these issues.
3.1 Identity management
One of the main issues HPC users face when accessing data is the complexity of certificate-based authentication and authorization systems, including certificate management and renewal procedures.
Onedata utilizes the OpenID and OpenID Connect (based on OAuth 2.0) standards to provide easy and unified identity management. From the user's point of view, this simplifies registration and login, as users can sign in with one of their existing institutional or social accounts. The minimum required information is an email address, served by virtually any OpenID provider. Users can connect multiple OpenID accounts to an existing Onedata account, which gives them more login methods. Onezone serves as the account management center for users, where they can personalize their settings and authentication methods, or obtain client tokens (see 3.2) to authorize operations on their behalf across the whole system.
Internally, identity management is the responsibility of Onezone, which is the authentication and authorization center for all storage providers and users in a federation. Support for concrete OpenID providers is configurable and extensible via plugins, which makes it easy to widen the range of supported providers or to customize the available authentication methods for each Onezone instance independently. Onedata also supports basic (login/password) authentication, which is mostly targeted at system administrators or small isolated deployments. Upon registration, a new user is given a unique ID, which is used universally across the system from then on. By storing the user identifiers obtained from OpenID providers (subject ids), Onezone can easily map OpenID accounts onto the unique user ID. Later, when access to resources or files is negotiated, this ID is used for privilege verification (see section 3.3).
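The account-linking scheme described above can be sketched as a lookup from (OpenID provider, subject id) pairs to a single internal, immutable user ID. Function and variable names are illustrative, not Onezone's actual interface:

```python
import uuid
from typing import Optional

# linked_accounts: (openid_provider, subject_id) -> internal user ID
linked_accounts: dict = {}

def login(provider: str, subject_id: str, link_to: Optional[str] = None) -> str:
    """Return the internal user ID for an OpenID identity.

    The first login creates a new immutable ID; passing `link_to`
    attaches an additional OpenID account to an existing user,
    giving them another login method for the same account.
    """
    key = (provider, subject_id)
    if key in linked_accounts:
        return linked_accounts[key]
    user_id = link_to if link_to is not None else uuid.uuid4().hex
    linked_accounts[key] = user_id
    return user_id

uid = login("github", "octocat-7")             # first login creates the ID
same = login("github", "octocat-7")            # later logins map back to it
also = login("google", "12345", link_to=uid)   # linked second account
assert uid == same == also
```

Because privilege verification (section 3.3) only ever sees the internal ID, which external account the user signed in with is irrelevant to authorization decisions.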
3.2 Macaroon-based bearer tokens
Internally, the Onedata system delegates authority through the use of Macaroons [2]. Macaroons are a type of bearer credential that leverages chained MACs (message authentication codes) to allow the holder to add new caveats: contextual confinements that limit the scope or degree of the authorization. In particular, Macaroons allow adding third-party caveats that can only be satisfied by presenting a macaroon-bound proof from the specified third party. All conditions imposed on the credentials, including those added later by subsequent credential holders, are verified by the authorizing party on each authorization request.
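The chained-MAC construction can be illustrated with a stdlib-only sketch (a simplified model for first-party caveats only, not Onedata's implementation and not a full Macaroon library). Each appended caveat re-keys the signature, so any holder can attenuate a token but nobody can strip a caveat without invalidating it:

```python
import hmac
import hashlib

def _mac(key: bytes, msg: bytes) -> bytes:
    return hmac.new(key, msg, hashlib.sha256).digest()

def mint(root_key: bytes, identifier: bytes):
    """Create a bare token: (identifier, caveat list, chained signature)."""
    return (identifier, [], _mac(root_key, identifier))

def attenuate(token, caveat: bytes):
    """Anyone holding the token may add a caveat; the old signature
    becomes the key for the new one, so caveats cannot be removed."""
    identifier, caveats, sig = token
    return (identifier, caveats + [caveat], _mac(sig, caveat))

def verify(root_key: bytes, token, context: dict) -> bool:
    """The minter recomputes the chain and checks every caveat holds."""
    identifier, caveats, sig = token
    expected = _mac(root_key, identifier)
    for caveat in caveats:
        key, _, value = caveat.decode().partition("=")
        if context.get(key) != value:     # simple key=value caveats
            return False
        expected = _mac(expected, caveat)
    return hmac.compare_digest(expected, sig)

root = b"onezone-root-key"
t = attenuate(mint(root, b"user-42"), b"access-type=read-only")
assert verify(root, t, {"access-type": "read-only"})
assert not verify(root, t, {"access-type": "read-write"})
```

Third-party caveats extend the same chain with an encrypted challenge that only the named third party can discharge; that part is omitted here for brevity.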
The basic use of Macaroons in Onedata resembles the OAuth model. Upon authentication in Onezone and redirection to a specific storage provider, the provider receives an authorization token in the form of a serialized Macaroon. The Macaroon is time-restricted but intended to be long-lived, and has an additional third-party caveat that requires the bearer to present proof that the user is authenticated. Before using the credentials, the storage provider first obtains this proof from Onezone. The proof is valid for a short period of time and has to be reacquired when it expires. The storage provider can interact with Onezone on the user's behalf only with both the Macaroon and a valid proof of authentication. Another use case of Macaroons in Onedata is authorizing native clients. In this case, the client is given only the long-lived token, without the authentication caveat. Note that the authentication caveat serves to ensure that the client's actions are authorized not only by the authorization server (Onezone) but also by the user. In the command-line client's case, the client is under the full and constant control of the user and thus does not require reauthentication. However, the native client connects to a storage provider, which also requires authorization with Onezone to function properly and, unlike the native client, is outside the user's control. To mitigate the risk to the user, the native client delegates its authorization to the storage provider via a Macaroon with a short expiration time, and refreshes the authorization periodically. The flow of authorization for the native client is shown in Fig. 2.
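The split between a long-lived token and a short-lived, reacquirable proof of authentication can be sketched as follows. This is a toy HMAC model with an assumed 60-second lifetime; real Onedata uses Macaroon discharge proofs, and the provider would present the proof to Onezone rather than share a key with it as done here for brevity:

```python
import hmac
import hashlib

ONEZONE_KEY = b"onezone-secret"
PROOF_TTL = 60  # seconds a proof of authentication stays valid (assumed)

def issue_proof(token_id: str, now: float) -> tuple:
    """Onezone signs (token_id, timestamp) while the user session is live."""
    ts = str(int(now)).encode()
    sig = hmac.new(ONEZONE_KEY, token_id.encode() + b"|" + ts,
                   hashlib.sha256).hexdigest()
    return (int(now), sig)

def accept(token_id: str, proof: tuple, now: float) -> bool:
    """Accept the long-lived token only alongside a fresh, matching proof.

    In the real system this check happens at Onezone; the toy model
    shares ONEZONE_KEY with the checker to keep the sketch short.
    """
    issued_at, sig = proof
    ts = str(issued_at).encode()
    good = hmac.new(ONEZONE_KEY, token_id.encode() + b"|" + ts,
                    hashlib.sha256).hexdigest()
    fresh = now - issued_at <= PROOF_TTL
    return fresh and hmac.compare_digest(good, sig)

p = issue_proof("macaroon-abc", now=1000.0)
assert accept("macaroon-abc", p, now=1030.0)      # within TTL: accepted
assert not accept("macaroon-abc", p, now=2000.0)  # expired: must reacquire
```

The effect is the one described in the text: once the user stops using the system, proofs stop being issued, and the long-lived token alone grants nothing.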
Both the third-party authentication caveat for the web-based interface and the short-lived authorization Macaroon delegated by the native client enable the whole system to work on the user's behalf while the user is actively using the system, and to revoke the authorization when the user stops using it, leaving the user's identity under their control.
Figure 2: An overview of the authorization flow for a native client of Onedata.
Macaroon-based
authorization allows for refinement of the granted access, and thus tightening of security, in future versions of the subsystem. Macaroons can also leverage asymmetric encryption to enable third parties to determine whether the credentials are valid. This mechanism could be used by the storage provider to independently verify a given credential before, or even after, using it for authorization with Onezone. For example, a Macaroon might contain an access-type=read-only caveat that would be checked by the storage provider before a write operation. Other possible refinements include restricting Macaroons to a specific space, a specific time period, or a given pool of storage providers.
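A provider-side check of such a restriction caveat might look like this sketch; the caveat syntax and operation names are assumptions extrapolated from the access-type=read-only example above:

```python
def allowed(operation: str, caveats: list) -> bool:
    """Check an operation against restriction caveats carried by a token.

    Hypothetical caveat forms, following the examples in the text:
    'access-type=read-only', 'space=<space-id>'. An access-type=read-only
    caveat blocks any mutating operation before it reaches the storage.
    """
    for caveat in caveats:
        key, _, value = caveat.partition("=")
        if key == "access-type" and value == "read-only" \
                and operation in ("write", "truncate", "unlink"):
            return False
    return True

token_caveats = ["access-type=read-only", "space=space-123"]
assert allowed("read", token_caveats)
assert not allowed("write", token_caveats)
```

Since caveats can only ever be added, a provider applying this check never grants more than the token's issuer intended, even without contacting Onezone.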
3.3 Groups and privileges mechanism
Existing data management systems tend to be either very complex, targeted at large user communities and with a steep learning curve, or very basic solutions, typically targeted at the long tail of science. In order to provide a unified data management solution which can scale from small user groups to large user communities, we have implemented a flexible nested-group mechanism. Its usability is best illustrated by the privileges system in Onedata. Privileges are fine-grained and concern the members of a specific resource, constraining the rights of those members towards the resource. For example, each member of a space can be individually granted (or revoked) the privileges to modify the space, invite new members, delete the space, or write data within the space, among others. The memberships and privileges of every user are crucial information which influences low-level decisions, e.g. whether a given user can write or read a certain file in a certain space.
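The fine-grained privilege model can be sketched as per-member privilege sets on a resource; the privilege names below are illustrative and may not match Onedata's actual privilege identifiers:

```python
# Illustrative privilege names; actual Onedata privileges may differ.
SPACE_PRIVILEGES = {"modify_space", "invite_member", "delete_space",
                    "write_data", "read_data"}

space_members = {
    # user ID -> privileges granted (and revocable) individually
    "alice": {"read_data", "write_data", "invite_member"},
    "bob":   {"read_data"},
}

def has_privilege(user: str, privilege: str) -> bool:
    """Low-level decision point: may this member act on the resource?"""
    assert privilege in SPACE_PRIVILEGES, "unknown privilege"
    return privilege in space_members.get(user, set())

assert has_privilege("alice", "write_data")
assert not has_privilege("bob", "write_data")   # bob is read-only here
assert not has_privilege("carol", "read_data")  # not a member at all
```

Every low-level file operation ultimately reduces to a membership-plus-privilege lookup of this shape, which is what makes the model cheap to evaluate at scale.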
Groups enable collaboration among users and other groups that wish to have shared access to some data sets, and that should share memberships and privileges towards other resources. For instance, a group can itself become a member of a space, in which case all members of the group inherit the group's privileges for the space. This way, by adding a single group to the space and setting the proper privileges, the administrator can effectively set the privileges of a large pool of users. To achieve diversification of privileges among members, more groups can
5
Michaƚ Wrzeszcz et al. / Procedia Computer Science 108C (2017) 445–454 449
Effective and Scalable Data Access Control . . . Wrzeszcz, Opiola, Zemek, . . .
Onedata utilizes the OpenID and OpenID Connect (based on OAuth 2.0) standards to provide easy and unified identity management. From the users' point of view, this simplifies the registration and login process, as they can use one of their existing institutional or social accounts. The minimum required information is the email address, served by virtually any OpenID provider. Users can connect multiple OpenID accounts to an already existing account in Onedata, which gives them more login methods. Onezone serves as the account management center for users, where they can personalize their settings and authentication methods, or obtain client tokens (see 3.2) to authorize operations on their behalf across the whole system.
Internally, identity management is the responsibility of Onezone, which is the authentication and authorization center for all storage providers and users in a federation. Support for concrete OpenID providers is extendable via plugins and configurable, which makes it easy to widen the range of supported providers or to customize the available authentication methods for each instance of Onezone independently. Onedata also supports basic (login/password) authentication, which is mostly targeted at system administrators or small isolated deployments. Upon registration, a new user is given a unique ID, which is used universally in the system from then on. By storing the user identifiers obtained from OpenID providers (subject ids), Onezone can easily map OpenID accounts onto the unique user ID. Later, when access to resources or files is negotiated, this ID is used for privilege verification (see section 3.3).
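The account-linking scheme described above can be sketched as a small registry keyed by (provider, subject id) pairs. The class and method names below are illustrative, not Onezone's actual API.

```python
# Sketch (not Onedata's implementation): mapping external OpenID identities,
# keyed by (provider, subject id), onto one internal user ID.
import uuid

class IdentityRegistry:
    def __init__(self):
        self._accounts = {}   # (provider, subject_id) -> internal user ID
        self._users = {}      # internal user ID -> profile dict

    def login(self, provider, subject_id, email):
        """Return the internal user ID, creating a new user on first login."""
        key = (provider, subject_id)
        if key not in self._accounts:
            user_id = uuid.uuid4().hex
            self._users[user_id] = {"email": email, "accounts": [key]}
            self._accounts[key] = user_id
        return self._accounts[key]

    def link(self, user_id, provider, subject_id):
        """Attach another OpenID account to an already existing user."""
        key = (provider, subject_id)
        self._accounts[key] = user_id
        self._users[user_id]["accounts"].append(key)

registry = IdentityRegistry()
uid = registry.login("github", "12345", "user@example.com")
registry.link(uid, "google", "67890")
# Both external accounts now resolve to the same internal ID:
assert registry.login("google", "67890", "user@example.com") == uid
```

Privilege checks then operate only on the internal ID, regardless of which external account was used to authenticate.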
3.2 Macaroon-based bearer tokens
Internally, the Onedata system delegates authority through the use of Macaroons [2]. Macaroons are a type of bearer credentials that leverage chained MACs (message authentication codes) to allow the holder to add new caveats - contextual confinements that limit the scope or degree of the authorization. In particular, Macaroons allow adding third-party caveats that can only be satisfied by presenting a macaroon-bound proof from the specified third party. All conditions imposed on the credentials - including those added later by subsequent credential holders - are verified by the authorizing party on each authorization request.
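The chained-MAC construction can be illustrated with a minimal stdlib sketch (first-party caveats only; real Macaroons also support third-party caveats, serialization and key management, and this is not the Onedata implementation):

```python
# Minimal illustration of the chained-MAC idea behind Macaroons.
import hmac, hashlib

def mac(key, msg):
    return hmac.new(key, msg.encode(), hashlib.sha256).digest()

def mint(root_key, identifier):
    # The signature chain starts from a MAC of the identifier under the root key.
    return {"id": identifier, "caveats": [], "sig": mac(root_key, identifier)}

def add_caveat(m, caveat):
    # Any holder can attenuate the credential: the new signature is a MAC of
    # the caveat keyed with the previous signature, so caveats cannot be removed.
    return {"id": m["id"], "caveats": m["caveats"] + [caveat],
            "sig": mac(m["sig"], caveat)}

def verify(m, root_key, caveat_holds):
    # Only the authorizing party knows root_key; it replays the chain and
    # checks every caveat against the context of the current request.
    sig = mac(root_key, m["id"])
    for caveat in m["caveats"]:
        if not caveat_holds(caveat):
            return False
        sig = mac(sig, caveat)
    return hmac.compare_digest(sig, m["sig"])

root_key = b"onezone-root-key"
m = add_caveat(mint(root_key, "token-for-user-1"), "access-type = read-only")
assert verify(m, root_key, lambda c: True)
# Stripping a caveat breaks the signature chain:
stripped = {"id": m["id"], "caveats": [], "sig": m["sig"]}
assert not verify(stripped, root_key, lambda c: True)
```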
The basic use of Macaroons in Onedata resembles the OAuth model. Upon authentication in Onezone and redirection to a specific storage provider, the provider receives an authorization token in the form of a serialized Macaroon. The Macaroon is time-restricted but intended to be long-lived, and has an additional third-party caveat that requires the bearer to present proof that the user is authenticated. Before using the credentials, the storage provider first obtains the proof from Onezone. The proof is valid for a short period of time and has to be reacquired when it expires. The storage provider can interact with Onezone in the user's name only with both the Macaroon and a valid proof of authentication. Another use case of Macaroons in Onedata is authorizing native clients. In this case, the client is given only the long-lived token, without the authentication caveat. Note that the authentication caveat serves to ensure that the client's actions are authorized not only by the authorization server (Onezone) but also by the user. In the command line client's case, the client is under the full and constant control of the user and thus does not require reauthentication. However, the native client connects to a storage provider, which also requires authorization with Onezone to function properly and, unlike the native client, is outside of the user's control. To mitigate the risk to the user, the native client delegates its authorization to the storage provider via a Macaroon with a short expiration time, and refreshes the authorization periodically. The flow of authorization for the native client is shown in Fig. 2.
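The short-expiry delegation pattern can be sketched as follows, with the expiry modeled as a signed timestamp rather than a real Macaroon caveat; all names and the trust setup are illustrative:

```python
# Sketch of short-lived delegation: the native client issues a token with a
# near-future expiry and re-issues it periodically, so the storage provider's
# delegated authority lapses quickly if the client stops refreshing.
import hmac, hashlib, time

CLIENT_KEY = b"native-client-secret"

def delegate(now, ttl=60):
    expiry = str(int(now + ttl))
    sig = hmac.new(CLIENT_KEY, expiry.encode(), hashlib.sha256).hexdigest()
    return {"expires": expiry, "sig": sig}

def accept(token, now):
    expected = hmac.new(CLIENT_KEY, token["expires"].encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, token["sig"]) and now < int(token["expires"])

t0 = time.time()
token = delegate(t0)
assert accept(token, t0 + 30)        # still fresh
assert not accept(token, t0 + 120)   # expired; authority has lapsed
token = delegate(t0 + 120)           # periodic refresh restores access
assert accept(token, t0 + 150)
```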
Figure 2: An overview of the authorization flow for a native client of Onedata.

Both the third-party authentication caveat for the web-based interface and the short-lived authorization Macaroon delegated by the native client enable the whole system to work on the user's behalf while the user is actively using the system, and to revoke the authorization when the user stops using it, leaving the user's identity under their control. Macaroon-based authorization allows for refinement of granted access, and thus tightening of security, in future versions of the subsystem. Macaroons make it possible to leverage asymmetric encryption to enable third parties to determine whether the credentials are valid. This mechanism could be used by the storage provider to independently verify a given credential before, or even after, using it for authorization with Onezone. For example, the Macaroon might contain an access-type=read-only caveat that would be checked by the storage provider before a write operation. Examples of other possible refinements include restricting Macaroons to a specific space, a specific time period or a given pool of storage providers.
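Since caveats travel in plain text inside the credential, such a local pre-check could look roughly like this; the access-type caveat follows the example in the text, while the space and provider caveat keys are hypothetical:

```python
# Sketch: a storage provider screening a credential's plaintext caveats
# locally before attempting an operation, without contacting Onezone.
def screen_request(caveats, request):
    checks = {
        "access-type": lambda v: v != "read-only" or request["op"] == "read",
        "space": lambda v: v == request["space"],
        "provider": lambda v: request["provider"] in v.split(","),
    }
    for caveat in caveats:
        key, _, value = caveat.partition("=")
        check = checks.get(key)
        if check is not None and not check(value):
            return False          # credential cannot authorize this request
    return True                   # proceed to full authorization with Onezone

req = {"op": "write", "space": "Space1", "provider": "provider-A"}
caveats = ["access-type=read-only", "space=Space1",
           "provider=provider-A,provider-B"]
assert not screen_request(caveats, req)                # write blocked locally
assert screen_request(caveats, dict(req, op="read"))   # read may proceed
assert not screen_request(caveats, dict(req, op="read", space="Space2"))
```

Note this only rejects requests early; acceptance still requires the full MAC-chain verification by the authorizing party.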
3.3 Groups and privileges mechanism
Existing data management systems tend to be either very complex solutions targeted at large user communities, with a steep learning curve, or very basic solutions, typically targeted at the long tail of science. In order to provide a unified data management solution that can scale from small user groups to large user communities, we have implemented a flexible mechanism based on nested groups. Its usability is best justified by the privileges system in Onedata. Privileges are fine-grained and concern the members of a specific resource, constraining the rights of the members towards the resource. For example, each member of a space can be individually granted (or revoked) the privileges to modify the space, invite new members, delete the space, or write data within the space (among others). The memberships and privileges of every user are crucial information, which influences low-level decisions, e.g. whether a given user can write or read a certain file in a certain space.
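The four example privileges (abbreviated M/I/W/D in Fig. 3) can be modeled as a flag set per member; the identifier names below are illustrative, not Onedata's actual privilege names:

```python
# Sketch of fine-grained per-member space privileges as a flag set.
from enum import Flag, auto

class SpacePrivilege(Flag):
    MODIFY = auto()
    INVITE_MEMBERS = auto()
    WRITE_DATA = auto()
    DELETE = auto()

# Each member of the space carries an individually granted set of rights.
members = {
    "user1": SpacePrivilege.MODIFY | SpacePrivilege.INVITE_MEMBERS
             | SpacePrivilege.WRITE_DATA,
    "user2": SpacePrivilege.INVITE_MEMBERS | SpacePrivilege.WRITE_DATA,
}

def can(user, privilege):
    return privilege in members.get(user, SpacePrivilege(0))

assert can("user1", SpacePrivilege.MODIFY)
assert not can("user2", SpacePrivilege.DELETE)
# Revoking is a set difference on the member's flags:
members["user1"] &= ~SpacePrivilege.WRITE_DATA
assert not can("user1", SpacePrivilege.WRITE_DATA)
```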
Groups enable collaboration among users and other groups that wish to have shared access to some data sets and should share memberships and privileges towards other resources. For instance, a group can itself become a member of a space. In this case, all members of the group inherit the privileges of the group for the space. This way, by adding a single group to the space and setting proper privileges, the administrator can effectively set privileges for a large pool of users. To achieve diversification of privileges among members, more groups can
be added. There are more resources besides spaces that use the privileges system, and their privileges are analogous to space privileges (modify, invite members, delete, etc.). These resources include groups, handle services and handles (for instance Digital Object Identifiers). Handle services and handles are resources connected with Open Data publications, and they also have members (users or groups) with associated privileges. Besides privileges associated with specific resources, there are also general privileges enabling special features in Onezone. They are typically granted to system administrators and include the rights to view and modify resources in the system. These privileges can be granted to a user or a group of users. The group system in Onedata is very flexible and allows for creating complicated structures of nested groups. In fact, groups can form an arbitrary graph, where cycles are allowed. Nevertheless, such an unconstrained approach has one significant pitfall: how can the privileges of a given user towards a resource be verified efficiently when the user might belong to it via a long chain of nested groups? The naive approach would be
when he might belong to it via a long chain of nested groups? The naive approach would be
to analyse the graph of relations every time. However, resources can be accessed with high
frequency (thousands of requests per second), especially because the privileges must be checked
during every file-system operation - thus an efficient solution is required. In Onedata, we
observed that the relations graph is not modified often (adding/removing relations or updating
privileges) - in fact entire organizations can run on the same group membership setup for
months, with single users joining or leaving groups occasionally. Considering this, we devised
an algorithm where the relations graph is analysed incrementally to collect information about
direct and indirect memberships and privileges, which we call effective relations and effective
privileges (see Fig. 3). The algorithm operates on a graph of entities (users, groups, spaces and other related resources).

Figure 3: Simplified entity graph with pre-calculated effective members and their privileges.

When a relation changes, a recalculation is scheduled which
analyses only the affected entities. If the effective relations of an entity have changed because of the update, all adjacent entities are analysed recursively. The process spans wider and wider until all changes have been propagated. This way, shortly after each update, we obtain a graph of entities where every entity carries pre-calculated information about all its effective relations and privileges. Thanks to this approach, verifying whether a user has a given privilege towards a resource is reduced to looking up a single record and its effective privileges. Most importantly, this ensures very low overheads at the file-system level. To enrich the functionality of groups, Onedata introduces roles - an attribute of each group defining the characteristics of
its members. There are several available roles:
- role - the simplest group type, associating members holding a certain role in arbitrary organizations,
- team - a group of members that form a team,
- unit - a group of members that belong to the same administrative unit,
- organization - a group associating multiple units (a virtual organization).
Roles allow for creating clear and orderly group structures for easier maintenance.
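The incremental recomputation of effective relations described in this section can be sketched as follows. The data model is deliberately simplified: privileges are plain string sets and are intersected along a membership chain, which is an assumption made for illustration rather than Onedata's exact semantics.

```python
# Sketch: each entity caches its effective user members; when a relation
# changes, only affected entities are recomputed and changes propagate
# outward through parent entities until the caches stabilize.
def recompute(entity, graph, cache):
    """Recompute one entity's effective users from its direct members."""
    effective = {}
    for member, privs in graph[entity]:
        if member in graph:          # a nested group: inherit its effective users
            for user, inner in cache.get(member, {}).items():
                effective[user] = effective.get(user, set()) | (privs & inner)
        else:                        # a direct user member
            effective[member] = effective.get(member, set()) | privs
    return effective

def update_relation(graph, cache, parents, changed):
    """Propagate a change: recompute the entity, then its parents if needed."""
    queue = [changed]
    while queue:
        entity = queue.pop()
        new = recompute(entity, graph, cache)
        if cache.get(entity) != new:     # spread wider only if something changed
            cache[entity] = new
            queue.extend(parents.get(entity, []))

# Space S has group G with {read, write}; G contains user u1 with full rights.
graph = {"G": [("u1", {"read", "write", "delete"})],
         "S": [("G", {"read", "write"})]}
parents = {"G": ["S"]}                   # S must be revisited when G changes
cache = {}
update_relation(graph, cache, parents, "G")
assert cache["S"]["u1"] == {"read", "write"}
graph["S"] = [("G", {"read"})]           # revoke write on the G -> S relation
update_relation(graph, cache, parents, "S")
assert cache["S"]["u1"] == {"read"}      # only the affected cache was touched
```

After propagation, a privilege check is a single dictionary lookup in the cached effective members of the resource, which matches the low-overhead property claimed above.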
3.4 Mapping user identities to local accounts
In order to enable direct mapping between global user identities registered in the Onezone service and local storage user identities, Onedata provides an extensible mechanism called Local User MApping (LUMA) [16]. It allows site administrators to provide a simple RESTful service (or use our reference implementation) which returns a mapping from the global user identity, as registered in the Onezone service, to a local user account, which can be storage-system specific. Currently, LUMA supports mapping to the following storage systems: Unix uid/gid identifiers, Amazon S3, OpenStack Swift and Ceph, but more storage systems can easily be integrated by site administrators. An example mapping returned by this service is presented below:
{
"storageId" : "a5ec372b-9f47-44e2-8d98-87d62f055a12",
"storageType" : "POSIX",
"spaceName" : "Space1",
"userDetails" : {
"name" : "User One",
"connectedAccounts" : [ ],
"alias" : "user.one",
"emailList" : [ "user@example.com"]
}
}
The LUMA mechanism also supports the Onedata feature that allows multiple external identity providers (e.g. Facebook, Google, GitHub) to be connected to a single user identity in the system, allowing users to authenticate with several identity providers depending on their context.
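For illustration, a storage provider consuming such a mapping document might process it as follows; the response body mirrors the example above, while how the LUMA endpoint is addressed is site-specific:

```python
# Parsing a LUMA mapping response (fields taken from the example in the text).
import json

luma_response = """{
  "storageId": "a5ec372b-9f47-44e2-8d98-87d62f055a12",
  "storageType": "POSIX",
  "spaceName": "Space1",
  "userDetails": {
    "name": "User One",
    "connectedAccounts": [],
    "alias": "user.one",
    "emailList": ["user@example.com"]
  }
}"""

mapping = json.loads(luma_response)
assert mapping["storageType"] == "POSIX"
assert mapping["userDetails"]["alias"] == "user.one"
# A provider would use storageId to select the backend and userDetails to
# resolve the local account (e.g. a uid/gid pair on a POSIX storage).
```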
4 Related work
Several data management solutions have emerged that try to deal with the increasing requirements of user applications in terms of large-scale data processing, several of which address the needs of scientific Grid computing infrastructures [10, 11, 13].
ownCloud [14] is an open-source framework for creating self-managed file hosting services similar to Dropbox, i.e. sync-and-share. It enables organizations to maintain full control over data location and transfers, while hiding the underlying storage infrastructure, which can be composed of multiple storage resources. The main features of ownCloud include abstracting file storage available through directory structures or WebDAV, file synchronization between various operating systems, user group administration, sharing of files using public URLs, online text editing, viewers for various file formats, and support for external Cloud storage services (e.g. Dropbox or Google Drive). The Integrated Rule-Oriented Data System (iRODS) [20] is open source data management software used to manage and take control of users' data regardless of the device used to store the data. Its main features include data discovery using a triple-based metadata catalog; support for data workflows, with a rule engine allowing any action to be initiated by any trigger on any server or client in the grid; secure collaboration; and data virtualization, allowing access to distributed storage assets under a unified namespace and freeing organizations from lock-in to single-vendor storage solutions. Distributed, parallel filesystems such
as Lustre [12] or Ceph [24] can be classified as high-performance data access solutions. They are mature and widely used solutions, designed especially for single data centres that maintain locally distributed data on multiple storage systems. Globus Connect [4] is a client-server solution allowing users and researchers to use the Globus transfer service. It simplifies the creation of Globus endpoints - the different locations where data can be moved to or from using the Globus service. It is free to install and use for users at non-profit research and education institutions.
An emerging requirement for data management systems is support for open data publishing, in particular easy integration with open access services such as DataCite [5] or OpenAIRE [19]. These services rely on established standards such as OAI-PMH [22], which enable them to integrate with the existing platforms for publication metadata harvesting, and identify datasets through globally unique handles such as DOIs [9] or PIDs. However, while these services enable discovery and identification of open data sets, they do not directly address the issue of end users accessing the underlying data. Moreover, the publication of a data set often involves publishing a URL where the data set is available, along with the DOI or PID used for its resolution.
With respect to authentication and authorization methods, access to data management systems has classically been based on X.509 certificates and their extensions for role and attribute information [23, 3]. However, several new mechanisms have evolved recently, mainly addressing the need for easy-to-use and secure single sign-on identity management and authorization. OpenID Connect [15] is a simple authentication mechanism which allows users to be identified by remote clients based on an authentication to an OIDC provider. SAML 2.0, on the other hand, is a protocol for exchanging both authentication and authorization security tokens, which can contain various authorization and identity assertions [18]. In federated data management systems, a common problem is the mapping of global user identities to local user accounts within the storage systems. So far this has been addressed using solutions such as local mapping files, which raised several administrative issues [1].
The choice of tools and systems for distributed data management is wide and diversified, but typically they offer only selected features and are not able to comprehensively address the needs of users operating in organizationally distributed environments. This is summarized in Table 1. The innovative approach of Onedata is to fulfill all the presented requirements within a single, unified platform.
Classification                            Examples            Disadvantages
File synchronization services             ownCloud, Dropbox   Limits on storage size and transfer speed
Services for fast data movement           Globus Connect      Lack of location transparency
High-performance parallel file systems    Lustre, Ceph        Centralized management
Widely distributed data storage systems   iRODS               Manual management of data location and low efficiency

Table 1: Summary of existing data management solutions
5 Conclusions
In this paper we have presented the Onedata distributed data management platform and its support for effective data access authentication and authorization in a distributed storage system.
Onedata has a very strong focus on enabling users to easily and securely access and share their data, regardless of whether they work in small teams or in large international collaborations. At the same time, Onedata ensures that storage system administrators have full control over their storage resources. Performance tests conducted in the PLGrid production environment also confirmed that Onedata offers good data access performance [25]. These features were made possible by the development of a flexible authentication and authorization mechanism based on OpenID Connect and Macaroons. Onedata is targeted at global, highly distributed environments and was developed to support a large data scale and user base. Performance scalability is achieved thanks to advanced block replication mechanisms. Files are split into blocks, and only the required blocks are replicated to the site where the data is processed. Local access to blocks ensures maximum efficiency, while the blocks are simultaneously synchronized and available globally.
The main novelty achieved by the Onedata platform in the context of data access control lies in the provision of a unified data access control mechanism for diversified types of user communities, scalable from small research groups to large international communities. Privileges can be managed easily thanks to the automatic computation of the effective group memberships and effective privileges of each user, implemented using fast lookups of the user and group graph structure to ensure low overheads irrespective of user base growth. All data access requests are independently authorized, which ensures that the data can remain secure even at the level of the underlying storage systems.
Currently, Onedata is being used in several international projects and initiatives, including PLGrid, EGI-Engage and INDIGO-DataCloud, and is used as the basis for EGI DataHub [7], a public service for provisioning large reference data sets. Recently, it has also been accepted for the second phase of the Helix Nebula Science Cloud procurement, enabling high-throughput scientific data processing on commercial Cloud infrastructures [8].
Future work will include the integration of a SAML 2.0 identity service, enabling integration with additional community identity providers, and the implementation of a P2P mechanism for establishing trust between different zones.
Acknowledgements This work has been partially funded under Horizon 2020 EU projects: INDIGO-
DataCloud (Project ID: 653549) and EGI-Engage (Project ID: 654142). RS and JK are grateful for AGH-UST
grant no. 11.11.230.124. LO is grateful for his doctoral grant at AGH-UST.
References
[1] Alfieri, R., Cecchini, R., Ciaschini, V., dell’Agnello, L., Frohner, A., Lorentey, K., and Spataro, F.
From gridmap-file to VOMS: managing authorization in a Grid environment. Future Generation
Comp. Syst., 21(4):549–558, 2005.
[2] Birgisson, A., Politz, J. G., Erlingsson, U., Taly, A., Vrable, M., and Lentczner, M. Macaroons:
Cookies with contextual caveats for decentralized authorization in the cloud. In NDSS. The
Internet Society, 2014.
[3] Chadwick, D. W., Otenko, A., and Ball, E. Role-based access control with x.509 attribute certifi-
cates. IEEE Internet Computing, 7(2):62–69, 2003.
[4] Chard, K., Pruyne, J., Blaiszik, B., Ananthakrishnan, R., Tuecke, S., and Foster, I. Globus data
publication as a service: Lowering barriers to reproducible science. In 11th IEEE International
Conference on eScience, 2015.
[5] DataCite. DataCite: helping you to find, access, and reuse research data, 2011. http://datacite.org.
[6] Dutka, L., Wrzeszcz, M., Lichoń, T., Slota, R., Zemek, K., Trzepla, K., Opiola, L., Slota, R. G., and Kitowski, J. Onedata - a step forward towards globalization of data access for computing
Michaƚ Wrzeszcz et al. / Procedia Computer Science 108C (2017) 445–454 453
Effective and Scalable Data Access Control . . . Wrzeszcz, Opiola, Zemek, . . .
as Lustre [12] or Ceph [24] can be classified as high performance data access solutions. They
are mature and widely used solutions designed especially for single data centres that maintain
locally distributed data on multiple storage systems. Globus Connect [4] is a client-server solu-
tion allowing users and researchers to use the Globus transfer service. It simplifies the way of
creating Globus endpoints - the different locations where data can be moved to or from using
the Globus service. It is free to install and use for users at non-profit research and education
institutions.
An emerging requirement from data management systems is the support for open data
publishing, in particular to enable easy integration with open access services such as DataCite
[5] or OpenAIRE [19]. These services rely on established standards such as OAI-PMH[22], which
enable them to integrate with the existing platforms for publication metadata harvesting, and
identify datasets through globally unique handles such as DOI[9] or PID. However, while these
services enable discovery and identification of open data sets, they do not address directly the
issue of accessing the underlying data by end users. Moreover, the publication of data sets
often involves publishing a URL, where the dataset is available along with the DOI or PID for
resolution of the data set.
With respect to authentication and authorization methods, classically most authentication
and authorization to data management systems has been based on X.509 certifcates and its
extensions for role and attribute information [23, 3]. However, several new mechanisms have
evolved recently, mainly addressing the need for easy to use and secure single sign on identity
management and authorization. OpenID Connect [15] is a simple authentication mechanism,
which allows users to be identified against remote clients based on an authentiation to a OIDC
provider. On the other hand, SAML 2.0, is a protocol for exchanging both authentication and
authorization security tokens which can contain various authorization and identity assertions
[18]. In federated data management systems, a common problem is mapping of global user
identities to local user accounts within the storage systems. So far this has been addressed
using such solutions as local mapping files, which raised several administrative issues [1].
The choice of tools and systems for distributed data management is wide and diversified, but
typically they offer selective features and are not able to comprehensively address the needs of
users operating in organizationally distributed environments. This is depicted in Table 1. The
innovative approach of Onedata is to fulfill all presented requirements within a single, unified
platform.
Classification Examples Disadvantages
File synchronization services ownCloud, Dropbox Limits on storage size and trans-
fer speed
Services for fast data movement Globus Connect Lack of location transparency
High-performance parallel file
systems Lustre, Ceph Centralized management
Widely distributed data storage
systems iRODS Manual management of data lo-
cation and low efficiency
Table 1: Summary of existing data management solutions
5 Conclusions
In this paper we have presented Onedata distributed data management platform and its sup-
port for effective data access authentication and authorization in a distributed storage system.
8
Effective and Scalable Data Access Control . . . Wrzeszcz, Opiola, Zemek, . . .
Onedata has a strong focus on enabling users to easily and securely access and share their
data, regardless of whether they work in small teams or in large international collaborations. At
the same time, Onedata ensures that storage system administrators retain full control over
their storage resources. Performance tests conducted in the PLGrid production environment
also confirmed that Onedata offers good data access performance [25]. These features were
made possible by the development of a flexible authentication and authorization mechanism based
on OpenID Connect and Macaroons. Onedata is targeted at global, highly distributed environments
and was developed to support large data volumes and a large user base. Performance scalability is
achieved thanks to advanced block replication mechanisms: files are split into blocks, and only
the required blocks are replicated to the site where the data is processed. Local access to blocks
ensures maximum efficiency, while the blocks are simultaneously synchronized and available globally.
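The block replication scheme described above can be sketched as follows. This is a minimal illustrative model only: the class and method names are our own, not the Onedata API, and we assume a fixed block size with on-demand transfer of missing blocks from a remote provider.

```python
# Sketch (hypothetical, not the Onedata API): a file is split into fixed-size
# blocks, and a read replicates only the blocks it actually touches.

BLOCK_SIZE = 4  # deliberately tiny block size for illustration


class ReplicatedFile:
    def __init__(self, remote_data: bytes):
        self.remote = remote_data  # authoritative copy held at a remote site
        self.local = {}            # block index -> locally replicated bytes

    def _fetch(self, idx: int) -> bytes:
        # Replicate a single block on demand from the remote provider.
        start = idx * BLOCK_SIZE
        return self.remote[start:start + BLOCK_SIZE]

    def read(self, offset: int, length: int) -> bytes:
        first = offset // BLOCK_SIZE
        last = (offset + length - 1) // BLOCK_SIZE
        for idx in range(first, last + 1):
            if idx not in self.local:  # only missing blocks are transferred
                self.local[idx] = self._fetch(idx)
        data = b"".join(self.local[i] for i in range(first, last + 1))
        skip = offset - first * BLOCK_SIZE
        return data[skip:skip + length]


f = ReplicatedFile(b"abcdefghijklmnop")
assert f.read(5, 6) == b"fghijk"      # spans blocks 1 and 2 only
assert sorted(f.local) == [1, 2]      # blocks 0 and 3 were never transferred
```

Subsequent reads of the same byte range are then served entirely from the local copies, which is what makes processing at the replicating site efficient.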
The main novelty of the Onedata platform in the context of data access control lies in
the provision of a unified data access control mechanism for diverse types of user communities,
scalable from small research groups to large international collaborations. Privileges can be
managed easily thanks to the automatic computation of each user's effective group membership
and effective privileges, implemented using fast lookups over the user and group graph structure
to ensure low overheads irrespective of user base growth. All data access requests are
independently authorized, which ensures that the data remains secure even at the level of the
underlying storage systems.
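The computation of effective group membership and effective privileges can be illustrated with a short sketch. The graph layout, privilege names, and the propagation policy (privileges flow through nested group membership, capped by what each membership edge itself grants) are assumptions made for illustration, not the actual Onedata implementation.

```python
# Illustrative sketch: a user is effectively a member of every group reachable
# through nested membership edges, and the privileges carried along each path
# are capped by the grant on every edge traversed.
from collections import deque

# group -> set of parent groups it is a (nested) member of
parents = {"devs": {"staff"}, "staff": {"org"}, "qa": {"org"}}

# privileges granted by each direct membership edge (member, group)
privileges = {
    ("alice", "devs"): {"read"},
    ("devs", "staff"): {"read", "write"},
    ("staff", "org"): {"read"},
}


def effective(user, direct_groups):
    """BFS over the group graph; returns {group: effective privilege set}."""
    eff = {}
    queue = deque((g, privileges.get((user, g), set())) for g in direct_groups)
    while queue:
        group, privs = queue.popleft()
        if group in eff and privs <= eff[group]:
            continue  # nothing new along this path; also stops cycles
        eff[group] = eff.get(group, set()) | privs
        for parent in parents.get(group, ()):
            # privileges carried into the parent are capped by what the
            # (group, parent) membership edge itself grants
            queue.append((parent, privs & privileges.get((group, parent), set())))
    return eff


result = effective("alice", ["devs"])
assert result == {"devs": {"read"}, "staff": {"read"}, "org": {"read"}}
```

Because the result depends only on the graph, it can be recomputed incrementally when an edge changes and cached for fast per-request lookups, which is how low authorization overheads can be kept independent of user base growth.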
Currently, Onedata is used in several international projects and initiatives, including
PLGrid, EGI-Engage and INDIGO-DataCloud, and serves as the basis for the EGI DataHub [7], a
public service for the provisioning of large reference data sets. Recently, it has also been
accepted for the second phase of the Helix Nebula Science Cloud procurement, for enabling
high-throughput scientific data processing on commercial cloud infrastructures [8].
Future work will include the integration of a SAML 2.0 identity service, enabling integration
with additional community identity providers, and the implementation of a P2P mechanism for
establishing trust between different zones.
Acknowledgements This work has been partially funded under Horizon 2020 EU projects: INDIGO-
DataCloud (Project ID: 653549) and EGI-Engage (Project ID: 654142). RS and JK are grateful for AGH-UST
grant no. 11.11.230.124. LO is grateful for his doctoral grant at AGH-UST.
References
[1] Alfieri, R., Cecchini, R., Ciaschini, V., dell’Agnello, L., Frohner, A., Lorentey, K., and Spataro, F.
From gridmap-file to VOMS: managing authorization in a Grid environment. Future Generation
Comp. Syst., 21(4):549–558, 2005.
[2] Birgisson, A., Politz, J. G., Erlingsson, U., Taly, A., Vrable, M., and Lentczner, M. Macaroons:
Cookies with contextual caveats for decentralized authorization in the cloud. In NDSS. The
Internet Society, 2014.
[3] Chadwick, D. W., Otenko, A., and Ball, E. Role-based access control with X.509 attribute
certificates. IEEE Internet Computing, 7(2):62–69, 2003.
[4] Chard, K., Pruyne, J., Blaiszik, B., Ananthakrishnan, R., Tuecke, S., and Foster, I. Globus data
publication as a service: Lowering barriers to reproducible science. In 11th IEEE International
Conference on eScience, 2015.
[5] DataCite. DataCite: helping you to find, access, and reuse research data, 2011.
http://datacite.org.
[6] Dutka, L., Wrzeszcz, M., Licho´n, T., Slota, R., Zemek, K., Trzepla, K., Opiola, L., Slota, R. G.,
and Kitowski, J. Onedata - a step forward towards globalization of data access for computing
infrastructures. In Koziel, S., Leifsson, L. P., Lees, M., Krzhizhanovskaya, V. V., Dongarra, J.,
and Sloot, P. M. A., editors, ICCS, volume 51 of Procedia Computer Science, pages 2843–2847.
Elsevier, 2015.
[7] EGI. EGI DataHub website, 2016. Available at http://datahub.egi.eu.
[8] HNSciCloud. Helix Nebula Science Cloud website, 2016. Available at
http://www.hnscicloud.eu/.
[9] International DOI Foundation, editor. DOI Handbook. 2012.
[10] Kapanowski, M., Slota, R., and Kitowski, J. Resource storage management model for ensuring
quality of service in the cloud archive systems. Computer Science, 15(1):3–18, 2014.
[11] Korcyl, K., Chwastowski, J., Plazek, J., and Poznanski, P. Selected issues on histograming
on GPUs. Computing and Informatics, 35(2):282–298, 2016.
[12] Lustre. Lustre website, 2016. Available at http://lustre.org/.
[13] Marco, J. et al. The interactive european grid: Project objectives and achievements. Computing
and Informatics, 27(2):161–171, 2008.
[14] Martini, B. and Choo, R. Cloud storage forensics: ownCloud as a case study. Digital Investigation,
10(4):287–299, 2013.
[15] Mladenov, V., Mainka, C., Krautwald, J., Feldmann, F., and Schwenk, J. On the security of
modern single sign-on protocols: Openid connect 1.0. CoRR, abs/1508.04324, 2015.
[16] Onedata. Local User Mapping service documentation, 2016. Available at
https://onedata.org/docs/doc/administering_onedata/luma.html.
[17] Onedata. Onedata project website, 2016. Available at http://onedata.org.
[18] Organization for the Advancement of Structured Information Standards. Security Assertion
Markup Language (SAML) v2.0, 2005.
[19] Rettberg, N. and Principe, P. Paving the way to open access scientific scholarly information:
Openaire and openaireplus. In Baptista, A. A., Linde, P., Lavesson, N., and de Brito, M. A.,
editors, International Conference on Electronic Publishing, ELPUB. IOS Press, 2012.
[20] Roblitz, T. Towards implementing virtual data infrastructures - a case study with iRODS. Com-
puter Science, 13(4):21–34, 2012.
[21] SNIA. Cloud Data Management Interface. Technical report, April 2010. Available at
http://www.snia.org/cdmi.
[22] Van de Sompel, H., Nelson, M., Lagoze, C., and Warner, S. Resource harvesting within the
OAI-PMH framework. D-Lib Magazine, 10(12), 2004.
[23] Venturi, V., Stagni, F., Gianoli, A., Ceccanti, A., and Ciaschini, V. Virtual organization man-
agement across middleware boundaries. In eScience, pages 545–552. IEEE Computer Society,
2007.
[24] Weil, S. A., Brandt, S. A., Miller, E. L., Long, D. D. E., and Maltzahn, C. Ceph: A scalable, high-
performance distributed file system. In Proceedings of the 7th Symposium on Operating Systems
Design and Implementation (OSDI), pages 307–320, 2006.
[25] Wrzeszcz, M., Trzepla, K., Słota, R., Zemek, K., Lichoń, T., Opioła, Ł., Nikolow, D., Dutka, Ł.,
Słota, R., and Kitowski, J. Metadata organization and management for globalization of data access
with Onedata. In Wyrzykowski, R. et al., editors, Parallel Processing and Applied Mathematics:
11th Intnl. Conf., PPAM 2015, Krakow, Poland, September 6-9, 2015. Revised Selected Papers,
Part I, pages 312–321, Cham, 2016. Springer International Publishing.