All content in this area was uploaded by Kenneth B. Kent on Nov 02, 2015
Monetary-and-QoS Aware Replica Placements in
Cloud-Based Storage Systems
Lingfang Zeng†, Shijie Xu‡, Yang Wang‡, Xiang Cui§, Tan Wee Kiat§, David Bremner‡, Kenneth Kent‡
†Wuhan National Laboratory for Optoelectronics
School of Computer, Huazhong University of Science and Technology
‡IBM Centre for Advanced Studies (CAS Atlantic)
University of New Brunswick, Fredericton, Canada E3B 5A3
§Department of Electrical and Computer Engineering,
National University of Singapore, Singapore 117576
E-mail: lfzeng@hust.edu.cn, {shijiexu, ywang8, bremner, ken}@unb.ca, cuixiang23@gmail.com
Abstract—This paper proposes a replication cost model and
two greedy algorithms, named GS QoS and GS QoS C1, for
replication placements in cloud-based storage systems. The model
aims to minimize replication cost with full consideration of
user access qualities to storage nodes. Our two algorithms
employ a utility measurement to guide placement procedures.
Our experimental results show that 1) GS QoS outperforms
GS QoS C1; and 2) both algorithms yield more economical
results than the existing greedy site (GS) algorithm.
Index Terms—replication, greedy, cloud storage system
I. INTRODUCTION
Replication technology has been widely used to improve
the performance of network-based applications. By serving
clients from nearby replicas, it can significantly reduce the
overall network latency. Replication also benefits service
reliability. In practice, no computing resource, whether
processor, storage, or network, is failure free, and a failure
is usually fatal to a running system. Replica sites therefore
have to be selected to serve all clients of the failed nodes so
that the service remains continuous.
In spite of these benefits, replication should be designed
carefully because of the cost it incurs. In a cloud environment,
cloud vendors invest a large amount of money in hardware
resources (i.e., data centers, power, network bandwidth, and
machines), and then hire employees to monitor and maintain
these resources. As a result, service providers have to pay for
the resources their applications consume (e.g., network traffic)
on cloud platforms. The more resources the service
applications consume, the higher the fee charged by the cloud
vendors.
Although replication has been studied intensively in traditional
content delivery networks (CDNs) and Grid systems, it still
needs to be studied in cloud storage systems. In contrast
to existing CDNs and Grids, replication in cloud storage
systems has two distinct characteristics: 1) end users
contribute the majority of the network traffic because they are
the content owners; and 2) the data sets are diverse and dynamic.
In cloud storage systems (e.g., Dropbox, Tencent Weiyun, and
Google Storage), users update their data in the system
continuously, and content sizes can range from several MBs to
GBs. This traffic is quite different from that of a traditional
CDN, where the content comes from the service providers and
remains the same for a long time. Consequently, user access
traffic patterns in CDNs and Grids are relatively more uniform
than those in cloud storage systems.
This paper addresses the issue of replication placements
in cloud storage systems. We build a mathematical model
to minimize the replication cost and provide two monetary-
and QoS-aware algorithms, named GS QoS and GS QoS C1,
to solve this model.
II. RELATED WORK
Numerous works have been conducted to address the replica
placement problem in CDNs and Grids [3]. For example,
Mansouri et al. [5] provide a selection algorithm for replica
allocation, named combination of Modified DHR Algorithm
(MDHRA). Response time is calculated using factors, e.g.,
data transfer time and storage access latency, and then the
best replica location is determined. Li et al. [4] argue that the
placement of web proxies is critical to network performance
and can significantly reduce the overall latency. They use a
tree topology to model the placement problem, but only the
download cost is included in their cost model. Xu et al. [7]
also offer solutions based on a tree topology and include both
upload and download costs. However, their work does not
consider how the replication direction between replica sites
affects the provisioning cost. Chen et al. [1] take a different
approach that moves away from the usual tree topology, and
their cost model includes all three costs, i.e., upload,
download, and storage costs. Although they emphasize the
importance of both choosing the set of replica sites and
specifying the replication directions, their algorithm is not
QoS friendly. Though all of the above algorithms may work well
in traditional CDNs and Grid systems, they overlook the traffic
volume and its cost, which are the two main characteristics of
current cloud storage systems.
The replica placement problem can be defined as a
procedure to select N nodes to host m replicas, given a
distance matrix D, so that the objectives are optimized. The
element d(i, j) of the distance matrix is a distance
metric between the ith request location and the jth storage
node location. As discussed in [2], this is an NP-hard problem,
and thus many heuristic algorithms have been proposed to
solve it. For example, three different greedy algorithms,
i.e., a normal greedy algorithm, a simple greedy algorithm, and
a heuristic algorithm, are discussed in [6].
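To make the distance matrix concrete, the sketch below builds D from planar coordinates. The Euclidean metric, the coordinate layout, and the name `distance_matrix` are assumptions made for illustration only; the text requires only some distance metric between request locations and storage node locations.

```python
import math

def distance_matrix(request_locations, node_locations):
    """Return D where D[i][j] is the distance between request i and node j.

    Euclidean distance is an assumed stand-in for the unspecified metric.
    """
    return [[math.dist(r, s) for s in node_locations]
            for r in request_locations]

# Hypothetical coordinates: two request locations and two storage nodes.
D = distance_matrix([(0, 0), (3, 4)], [(0, 0), (6, 8)])
print(D[1][0])  # distance from request 1 to node 0
```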
III. PLACEMENT MODEL AND ALGORITHMS
In this section, we first provide a model for replication
placements in the cloud-based storage system, and then present
two new monetary- and QoS-aware greedy site (GS) algorithms,
GS QoS and GS QoS C1, to obtain a cost-efficient placement
strategy.
A. Model
There are three kinds of entities in cloud storage
systems, i.e., cloud storage nodes, client users, and data sets.
The cloud nodes provide storage space for the data. If one
node fails, a nearby storage node can be used to
continue the service. The users, on the other hand, contribute
the data access volume of the cloud storage systems. They
synchronize contents from their local nodes to the storage nodes
every day, but rarely download. This traffic is typically
unidirectional upload traffic, which differs from the
bidirectional replication traffic among cloud nodes or CDN nodes.
The total cost is therefore the sum of the replication cost and
the user access cost. The replication cost is the sum of the
network traffic costs (i.e., incoming and outgoing costs at one
node) incurred during replication and synchronization, and the
storage cost incurred when a new node is selected for storage.
The access cost, on the other hand, refers only to the uploading
traffic cost for a node when users access it, and its value is
proportional to the user access frequency.
Finally, our model is to find a procedure for selecting storage
nodes from N given nodes so that the objective cost in Eq. (1)
is minimized:

$$\text{cost} = \sum_{j \in \text{nodes}} \left( C_{\text{users access traffic}}(j) + C_{\text{replication}}(j) \right) \tag{1}$$
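As a rough illustration of Eq. (1), the following sketch sums the user access traffic cost and the replication cost over a set of nodes. The dictionary layout, the function name `total_cost`, and the dollar values are assumptions made for this example, not data from the paper.

```python
def total_cost(nodes, access_cost, replication_cost):
    """Eq. (1): sum over nodes of access traffic cost plus replication cost."""
    return sum(access_cost[j] + replication_cost[j] for j in nodes)

# Hypothetical per-node costs in dollars (illustrative values only).
access_cost = {0: 1.5, 1: 0.5, 2: 2.0}
replication_cost = {0: 3.0, 1: 4.0, 2: 1.0}

# Cost of a placement that uses nodes 0 and 2.
print(total_cost([0, 2], access_cost, replication_cost))  # 7.5
```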
B. Modification to Greedy Site algorithm GS QoS
Algorithm 1 shows the general GS procedure for replica
placements. In this procedure, a node with the highest utility
is first selected, and then potential users are assigned to it in
each round. The selection is repeated until all the users are
assigned.
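The selection loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `greedy_site` and `candidates`, the toy utility function, and the rule of skipping a node that can serve no remaining user are all assumptions.

```python
def greedy_site(users, nodes, candidates, utility):
    """Greedy site selection: repeatedly pick the highest-utility node.

    candidates[j] -- set of users that node j could serve (assumed input)
    utility       -- callable (node, unassigned_users) -> float
    Returns (selection_order, assignment) with assignment: user -> node.
    """
    unassigned = set(users)
    unselected = set(nodes)
    order, assignment = [], {}
    while unassigned and unselected:
        # sorted() makes tie-breaking deterministic (first node wins).
        best = max(sorted(unselected), key=lambda j: utility(j, unassigned))
        unselected.remove(best)
        served = candidates[best] & unassigned
        if not served:
            continue  # assumption: skip nodes that serve no remaining user
        for k in served:
            assignment[k] = best
        order.append(best)
        unassigned -= served
    return order, assignment

# Toy example: utility is just the number of remaining users a node can serve.
candidates = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4}}
toy_utility = lambda j, unassigned: len(candidates[j] & unassigned)
order, assignment = greedy_site([1, 2, 3, 4], ["A", "B", "C"],
                                candidates, toy_utility)
print(order)  # ['A', 'B']
```

Passing the utility as a callable keeps the loop unchanged across GS variants, which matches the remark that the utility function is what distinguishes them.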
The function utility(j) (GB/$) distinguishes among different
GS algorithms. A common definition of utility is the ratio
of the total potential traffic volume at node j to the total
cost. A lower utility value indicates less traffic volume per
unit of expense. Compared to the GS algorithm in [1], our GS QoS
introduces a QoS penalization factor a into the utility function
(Algorithm 2), which adheres to the following rule:

$$a = \begin{cases} 1, & \text{if the user is not within the QoS distance} \\ \dfrac{QoSD(i,j)}{Q}, & \text{otherwise} \end{cases} \tag{2}$$

As an element of the QoS Distance matrix, QoSD(i, j) is the
maximal QoS distance between the user and a node.

Algorithm 1 GS QoS procedure
1: $E$ is the set of unassigned users
2: $E_j$ is the current set of users who can be assigned to $C_j$
3: $U_k$ is the $k$th user and $C_j$ is the $j$th cloud node
4: $w_k$ is the data size of requests from user $U_k$
5: while $E \neq \emptyset$ do
6:   $j^* = \arg\max_j \text{utility}(j)$, $j \in$ all unselected nodes
7:   $E_{j^*}$ is the set of unassigned users for node $j^*$
8:   Assign all users in $E_{j^*}$ to $j^*$
9:   Select node $j^*$
10:  $E = E - E_{j^*}$
11: end while

Algorithm 2 Utility(j) Calculation
1: $D_j$: download price, $P_j$: upload price, $S_j$: storage price.
2: $F$ is the synchronization frequency; its default value is 1.
3: Data set size $W$ and replication cost:
   $C_{\text{replication}}(j) = W \cdot (S_j + P_j \cdot F + D_i \cdot F)$
4: Traffic cost of serving a user $k$ assigned to node $j$:
   $C_k(j) = a \cdot w_k \cdot D_j$
5: Analytical size of request objects from user $k$ to $j$:
   $w_k(j) = a \cdot w_k$
6: Size of all request objects for unassigned users:
   $W_T(j) = \sum_{k \in E} w_k(j)$
7: Analytical utility of site $j$, with $k \in$ all unassigned users:
   $\text{utility}(j) = \dfrac{W_T(j)}{\sum_k C_k(j) + C_{\text{replication}}(j)}$
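The utility calculation of Algorithm 2, combined with the factor a of Eq. (2), can be sketched as below. The code follows the formulas as printed; the function names, the tuple layout of `requests`, the test "distance > QoSD(i, j)" for a user outside the QoS distance, and all numeric values are assumptions made for illustration.

```python
def qos_factor(distance, qosd, q):
    """QoS penalization factor a, following Eq. (2) as printed."""
    if distance > qosd:  # user is not within the QoS distance
        return 1.0
    return qosd / q

def replication_cost(W, S_j, P_j, D_i, F=1):
    """C_replication(j) = W * (S_j + P_j * F + D_i * F)."""
    return W * (S_j + P_j * F + D_i * F)

def utility(requests, D_j, c_replication, q):
    """utility(j) = W_T(j) / (sum_k C_k(j) + C_replication(j)).

    requests -- one (w_k, distance, qosd) tuple per unassigned user k
    """
    # w_k(j) = a * w_k for every unassigned user k
    weighted = [qos_factor(d, qosd, q) * w for w, d, qosd in requests]
    W_T = sum(weighted)
    traffic = sum(wk_j * D_j for wk_j in weighted)  # sum_k a * w_k * D_j
    return W_T / (traffic + c_replication)

# Hypothetical inputs: sizes in GB, prices in $/GB.
c_repl = replication_cost(W=10, S_j=0.25, P_j=0.25, D_i=0.25)  # 7.5
requests = [(4, 8, 5), (2, 3, 5)]  # (w_k, distance, QoSD)
print(utility(requests, D_j=0.5, c_replication=c_repl, q=10))  # 0.5
```

In this example the first user lies outside the QoS distance (a = 1) and the second inside it (a = 5/10), so W_T = 5 GB, the traffic cost is $2.5, and the utility is 5 / (2.5 + 7.5) = 0.5 GB/$.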
C. Improvement GS QoS C1
It is possible that a selected node does not have a large
number of potential users, in which case selecting it would
be wasteful. To avoid this case, an additional constraint is
used to determine whether or not to select a new node for the
replication. In GS QoS C1, we check whether there would be
sufficient potential users for the node to be selected. The
constraint is:

$$W_t \ge \sum_{k} w_k \left( \frac{1}{n} + \beta \frac{1}{N} \right)$$

In this formula, k ranges over all assigned users, β is a
coefficient, n is the total number of nodes, and N is the total
number of users. This formula implies that the potential volume
for the newly selected node should not be less than the average
value over all the existing selected nodes. The complete
GS QoS C1 algorithm is shown in Algorithm 3.

TABLE I: Summary of instances where GS QoS outperforms GS

| | GS | GS QoS |
|---|---|---|
| n | 20 | 20 |
| Node selection order | 17,6,1,3,11,12,16,18,14,2,0,7,10,19,4,9 | 17,16,14,6,12,3,1,10,18,7,11,0,4,9,2,19 |
| Number of nodes selected | 16 | 16 |
| Total cost | 9.9 | 9.4 |
| Relative cost | 1 | 0.97456 |
Algorithm 3 GS QoS C1
1: $E$ is the set of unassigned users
2: while $E \neq \emptyset$ do
3:   $j^* = \arg\max_j \text{utility}(j)$, $j \in$ all unselected sites
4:   $E_{j^*}$ is the set of unassigned users; select node $j^*$ and
     assign users to it if:
5:   $W_t = \sum_{k \in E_{j^*}} w_k$
6:   if $W_t \ge \sum_k w_k \left( \frac{1}{n} + \beta \frac{1}{N} \right)$ then select node $j^*$
7:   else repeat step 3 to find the next best site
8:   Assign all users in $E_{j^*}$ to $j^*$
9:   Select node $j^*$
10:  $E = E - E_{j^*}$
11: end while
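The admission test in step 6 of Algorithm 3 can be sketched as a small predicate. The function name `passes_c1` and the argument layout are assumptions; the threshold follows the constraint as printed, with the sum taken over the users already assigned.

```python
def passes_c1(candidate_sizes, assigned_sizes, n, N, beta):
    """Check W_t >= sum_k w_k * (1/n + beta/N).

    candidate_sizes -- w_k of the unassigned users the candidate node would get
    assigned_sizes  -- w_k of all users assigned so far
    n, N            -- total number of nodes and total number of users
    beta            -- tuning coefficient
    """
    w_t = sum(candidate_sizes)
    threshold = sum(assigned_sizes) * (1.0 / n + beta / N)
    return w_t >= threshold

# Hypothetical values: n = 4 nodes, N = 8 users, beta = 2 -> factor 0.5.
assigned = [40, 60]                               # threshold = 100 * 0.5 = 50
print(passes_c1([30, 25], assigned, 4, 8, 2))     # True  (55 >= 50)
print(passes_c1([10, 5], assigned, 4, 8, 2))      # False (15 < 50)
```

Under this reading the first candidate always passes, because no user has been assigned yet and the threshold is zero.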
IV. RESULTS
To make our results comparable, we reuse the existing data from
[1], namely the prices and parameters for cloud storage, in
our experiments. The replica sites are assumed to be
randomly selected in a geographical space. Over repeated
tests, we calculate statistics and study the impact of the
parameters on the replication results.
In addition, we define two terms. One is
QoSD(i, j) in the QoS Distance matrix, which is the maximal
QoS distance between the user and a node. If the distance
between user i and node j exceeds QoSD(i, j), the QoS of
the user cannot be satisfied. The other is the relative cost for
algorithm i:

$$\text{relative cost}(i) = \frac{\sum_{GS,k} \left( C_{\text{allusers}}(k) + C_{\text{replication}}(k) \right)}{\sum_{i,k} \left( C_{\text{allusers}}(k) + C_{\text{replication}}(k) \right)}$$

Here the numerator sums over the nodes k selected by GS and the
denominator over the nodes selected by algorithm i. According to
the relative cost, algorithm i is better than the
GS algorithm if its relative cost is greater than 1.
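The relative cost can be computed as in the sketch below. The per-node (access cost, replication cost) pairs and the name `relative_cost` are hypothetical and are not taken from the experiments.

```python
def relative_cost(gs_nodes, algo_nodes):
    """Total cost of GS divided by total cost of the compared algorithm.

    Each argument is a list of (access_cost, replication_cost) pairs,
    one pair per selected node.
    """
    gs_total = sum(access + repl for access, repl in gs_nodes)
    algo_total = sum(access + repl for access, repl in algo_nodes)
    return gs_total / algo_total

# Hypothetical placements: GS costs 10 in total, the other algorithm 8.
gs = [(2.0, 4.0), (1.0, 3.0)]
other = [(2.0, 3.0), (1.0, 2.0)]
print(relative_cost(gs, other))  # 1.25, so the other algorithm is cheaper
```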
A. Modified Greedy Site algorithm GS QoS
The y-axis in Fig. 1 is the CDF (Cumulative Distribution
Function) of the relative cost. According to this figure,
GS QoS is better than GS the majority of the time. The
relative cost(GS QoS) value is at most 1.5 in 95% and at most
1.0 in 20% of the test cases. This indicates that GS QoS
results in more cost than GS in only 20% of the test cases.

Fig. 1: Performance CDF of GS QoS.
Fig. 2: Overall Performance CDF of GS QoS and GS QoS C1.
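Reading a percentage off such a CDF amounts to counting the test cases at or below a threshold. The sketch below shows this with made-up relative cost values; the list and the name `cdf_at` are assumptions, not the measured data.

```python
def cdf_at(values, x):
    """Empirical CDF: fraction of values that are less than or equal to x."""
    return sum(v <= x for v in values) / len(values)

# Hypothetical relative costs from five test runs (illustrative only).
runs = [0.9, 1.0, 1.2, 1.3, 1.6]
print(cdf_at(runs, 1.0))  # 0.4: share of runs that did not beat GS
print(cdf_at(runs, 1.5))  # 0.8: share of runs with relative cost <= 1.5
```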
The value of the relative cost is affected by both the node
selection order and the number of selected nodes. In the example
of Table I, the relative cost is less than 1 even though the two
selected node sets are equivalent. According to this table, if
nodes having a lower outgoing cost are selected first, the
resultant selection cost is effectively lower. As the utility in
GS QoS is calculated by including all unassigned users, it
tends to choose sites with a lower outgoing cost first. Like the
order, the number of nodes selected for replication also
affects the relative cost value.
The relative cost comparisons of GS QoS and
GS QoS C1 are shown in Fig. 2 and Fig. 3. In Fig. 2,
a relative cost of less than 1.0 occurs in 18% of the test cases
for GS QoS C1 and in 23% for GS QoS. This
implies that GS QoS C1 is better than GS QoS here,
as GS QoS C1 outperforms GS in 82% of the test cases while
GS QoS does so in only 77%. Additionally, there is a significant
reduction of the relative cost when GS QoS C1 outperforms
GS QoS, i.e., in instances with relative cost > 1. In Fig. 3,
the boxed area for GS QoS C1 is also smaller than that of
GS QoS, which further implies that the results of GS QoS C1 are
more consistent with GS than those of GS QoS.

Fig. 3: Comparison of boxplots of GS QoS and GS QoS C1.
Fig. 4: Varying Replica Size (W).
The relationship between the replica size and the relative cost
is shown in Fig. 4. According to this figure, the relative cost
rises with the replica size and then remains steady at 1.1 for
both algorithms. The explanation is that the difference between
the replication cost and the user access traffic cost diminishes
as the replica size grows.
The relative cost is also affected by the QoS Distance (Q) and
the number of nodes (n), as shown in Fig. 5 and Fig. 6.
In both algorithms, the relative cost climbs with the QoS
distance (Q) at first, but goes down when Q is greater than 10.
This is because a single node can serve nearly all users when
Q is large, which in turn favors the utility of GS QoS.
Fig. 5 also shows that GS QoS outperforms GS QoS C1.
With regard to the number of nodes, GS QoS is also better than
GS QoS C1. According to Fig. 6, the rise in relative cost
with an increasing number of nodes for both algorithms is
a typical case where an increase in the solution space lowers
the performance of a heuristic-based algorithm. However, the
performance of the algorithms is still reasonably good, at a
relative cost of below 1.1 even when there are 40 sites in the
cloud.
Fig. 5: Varying QoS Distance (Q).
Fig. 6: Varying number of nodes (n).
V. CONCLUSION
We provide a model for replication placements in the
cloud storage systems and present two new monetary-and-QoS
aware greedy algorithms to minimize the replication costs.
Our results show that both proposed algorithms not only
yield more economical results than GS [1] but also
guarantee the QoS for user accesses.
REFERENCES
[1] Fangfei Chen, Katherine Guo, John Lin, and Thomas F. La Porta. Intra-cloud lightning: Building CDNs in the cloud. In INFOCOM, pages 433–441, 2012.
[2] Magnus Karlsson and Christos Karamanolis. Bounds on the replication cost for QoS. Technical report, 2003.
[3] R. Kingsy Grace and R. Manimegalai. Dynamic replica placement and selection strategies in data grids - a comprehensive survey. J. Parallel Distrib. Comput., 74(2):2099–2108, February 2014.
[4] Bo Li, M.J. Golin, G.F. Italiano, Xin Deng, and K. Sohraby. On the optimal placement of web proxies in the Internet. In INFOCOM '99. Eighteenth Annual Joint Conference of the IEEE Computer and Communications Societies. Proceedings. IEEE, volume 3, pages 1282–1290, Mar 1999.
[5] Najme Mansouri, Gholam Hosein Dastghaibyfard, and Ehsan Mansouri. Combination of data replication and scheduling algorithm for improving data availability in data grids. Journal of Network and Computer Applications, 36(2):711–722, 2013.
[6] Vassiliki Andronikou, Konstantinos Mamouras, Konstantinos Tserpes, Dimosthenis Kyriazis, and Theodora Varvarigou. Dynamic QoS-aware data replication in grid environments based on data importance. Future Generation Computer Systems, 28:544–553, 2011.
[7] Jianliang Xu, Bo Li, and D.L. Lee. Placement problems for transparent data replication proxy services. IEEE Journal on Selected Areas in Communications, 20(7):1383–1398, Sep 2002.