Minimizing Replica Idle Time
Dr. Mahdi S. Almhanna
Information System Management Department
University of Information Technology and Communication,
Baghdad, Iraq.
E-mail: Mahdialmhanna@gmail.com
Abstract — Servers do not all run at the same speed; performance depends on several factors such as processor speed, bandwidth, and congestion. This means that if we use multiple servers to download a file, a problem arises: the swift servers must wait idly for the lazy one to complete its task. In this work we adopt a new file transfer strategy, called "Minimizing Replica Idle Time", that solves this problem and improves performance by decreasing file transfer time more than other strategies.
Keywords — Data grid; Replica Management System; Replica Optimization Service; Data Transfer Service; GridFTP
I. INTRODUCTION
Data grid "DG" is a collection of services give the power to
access[1,2,3,4,5,7,8] modify and transfer huge amounts of data
spread in multiple different places. "DG" makes this potential, by
using middleware of applications and services that haul data and
resources from several domains then provided to the users. In DG the
data can be located at multiple sites, each one has its own
administrative domain controlled by collection of rules determine
who can access those data.
Since several replicas are distributed outside their original administrative domain, they should be made available as efficiently as possible. Downloading huge data sets from multiple replicas will certainly show different performance rates. The most important factor affecting download speed is bandwidth, which varies unpredictably because of the nature of the Internet; the other factor is congestion on the links between servers and clients. Choosing the best replica over an uncongested link is therefore a good way to improve download speed [17, 18, 19, 20].
There are several ways to download files. One way is to use multiple servers to download multiple files, but the bottleneck problem remains, because the faster servers must wait extra time for the slowest server to finish its job. It is therefore best to minimize the variation in finishing times among the servers.
Another way is to use co-allocation technology [10, 11, 13, 14, 15, 16, 17, 18] to download data. This technique enables a client to download a file from multiple servers by setting up multiple connections in parallel, which improves performance [17, 18]. In our work we combine this approach with a new technique, called "Minimizing Replica Idle Time", to improve transfer performance in Data Grid environments. Our proposal eliminates server idle time and decreases file transfer time.
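As a rough illustration of the co-allocation idea, the sketch below downloads one file over several parallel HTTP range requests, one per mirror. It is only a minimal sketch of the general technique, not the paper's Data Grid implementation; the mirror URLs are hypothetical, and the parts here are sized evenly (the proposed method sizes them by server speed instead).

```python
# Minimal co-allocation sketch: fetch one file from several mirrors in
# parallel, each mirror serving a different byte range.
import concurrent.futures
import urllib.request

def fetch_range(url, start, end):
    """Download bytes [start, end] of the file behind `url` via an HTTP Range request."""
    req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as resp:
        return start, resp.read()

def coallocated_download(mirror_urls, file_size):
    """Split [0, file_size) evenly across the mirrors and fetch all parts in parallel."""
    n = len(mirror_urls)
    chunk = file_size // n
    ranges = [(url, i * chunk, file_size - 1 if i == n - 1 else (i + 1) * chunk - 1)
              for i, url in enumerate(mirror_urls)]
    data = bytearray(file_size)
    with concurrent.futures.ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(fetch_range, u, s, e) for u, s, e in ranges]
        for future in concurrent.futures.as_completed(futures):
            start, payload = future.result()
            data[start:start + len(payload)] = payload
    return bytes(data)
```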
II. DATA ACCESS SERVICE
Data grids support replication of datasets, Data access services
and data transfer service work simultaneously to provide control for
access and management of data transfers within data grid. Useing
multiple replicas allow multiple users access the datasets. So, replicas
placed strategically within the sites where users need them, but with
some restriction because the replication of datasets and creation of
replicas depended on the availability of storage within the sites and
the bandwidth between them.
Replica Management Systems (RMS) controls the creation of replica
datasets and replication. RMS determines needs of users for replicas
and produce them dependent on storage and bandwidth availability
III. REPLICA MANAGEMENT SYSTEM (RMS)
The RMS acts as a single logical entity; it is a main component of the Data Grid and interacts with the other components as follows:
A. Replica Location Service (RLS)
Replicas reside on physical storage systems [17, 18, 19], but where exactly does a given replica reside? The RLS answers this question by maintaining a catalogue of files registered by users or services at the time the files are created, as illustrated in Figure 1 below.
The RLS maps the logical file name (LFN) of a replica to its physical file names (PFNs). We can send an LFN to the RLS server and request the registered PFNs of its replicas; conversely, we can ask the server for the LFN corresponding to a particular physical file location. We can also query the RLS server for attributes such as file size or checksum.
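To make the LFN-to-PFN mapping concrete, here is a toy catalogue; it is only a sketch of the idea, not the real RLS interface, and every name in it is hypothetical.

```python
# Toy replica catalogue in the spirit of the RLS: one logical file name (LFN)
# maps to the physical file names (PFNs) of its replicas plus a few
# attributes such as size and checksum.
replica_catalogue = {
    "lfn://experiment/run42.dat": {
        "pfns": [
            "gsiftp://site-a.example.org/storage/run42.dat",
            "gsiftp://site-b.example.org/data/run42.dat",
        ],
        "size_bytes": 1_000_000_000,
        "checksum": "adler32:0x1a2b3c4d",
    },
}

def lookup_pfns(lfn):
    """Forward query: return the registered physical replicas of a logical file."""
    return replica_catalogue[lfn]["pfns"]

def lookup_lfn(pfn):
    """Reverse query: find the LFN that a physical location belongs to."""
    for lfn, entry in replica_catalogue.items():
        if pfn in entry["pfns"]:
            return lfn
    return None
```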
Figure 1: Replica Files
B. Replica Optimization Service (ROS)
This service is used to choose the best replica site, which is done by collecting notifications from the Network Monitoring Service (NMS).
C. Data Transfer Service (DTS)
Once the physical address is known, the Replica Management Service asks the Data Transfer Service (DTS) to transfer the related file sets using a secure, reliable, high-performance data transfer protocol such as GridFTP or UDT [17, 18].
D. Replica Management and Selection
A set of services, collectively named Replica Management, may be used to create or remove replicas at sites. The addresses of replica sites and files are recorded in replica catalogues, and a data grid may contain multiple catalogues; the replica manager maintains them [10, 11, 13, 14].
IV. MULTIPLE REPLICA SITES FOR A SINGLE FILE
First of all, the appropriate replica candidates should be chosen [17, 18, 19, 20]. The best sites [17, 18, 19, 20] may be determined by sending a request to each site and measuring the response time: the site with the shortest response time is the safest choice, since it indicates an uncongested and therefore faster link. After that, check whether the candidate sites actually hold our files, and neglect the sites that do not contain the required files.
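A minimal sketch of this selection step follows, assuming a plain TCP connect time as the response-time probe (the paper measures RTT with a small packet) and a caller-supplied check for file presence; the port (2811, commonly used by GridFTP) and the helper names are our assumptions.

```python
# Sketch of the site-selection step: probe each candidate site, measure its
# response time, and keep only reachable sites that hold the required file,
# ordered fastest first.
import socket
import time

def probe(host, port=2811, timeout=2.0):
    """Return the TCP connect time to `host` in seconds, or None if unreachable."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None

def rank_sites(candidate_hosts, has_file):
    """Order responsive sites that hold the file by ascending response time."""
    timed = [(probe(host), host) for host in candidate_hosts if has_file(host)]
    return [host for rtt, host in sorted(t for t in timed if t[0] is not None)]
```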
Data grids consist of many resources located in many countries, or even many counties within a single country. We used the Globus Toolkit grid middleware [9] as our data grid infrastructure. The Globus Toolkit answers the questions of resource management, information services, data management, and security; its components are designed to supply mechanisms for configuration information and resource discovery, and it uses GridFTP [1, 6, 9] to provide data management and transfer in a wide-area environment.
GridFTP [1, 6, 9] enables parallel data transfer: the available servers are assigned pieces of the requested file to deliver in parallel. The file is divided into several parts, one per server, and the size of each part depends on the speed of the corresponding server.
Processing a file in this way is an effective approach to reducing the time consumed in transferring it between two machines. However, the inactive period during which the quicker servers wait for the delayed one to finish its job is a significant factor that expands the total time and so hurts efficiency. Factors such as bandwidth, CPU speed, and link availability differ among servers; therefore, balancing the download load across the servers is unavoidable.
Our method, called Minimizing Replica Idle Time, overcomes this dilemma by eliminating the idle time: each server is assigned a calculated amount of work that suits its capability at that moment, so that all servers finish their duty at the same time. With no waiting time, the overall transfer performance improves.
i. Mathematical Formulas
Let
U = size of the whole (downloaded) file,
n = number of servers that hold a copy of the requested file,
ti = time for the i-th server to download the entire file on its own,
T = time spent by all the servers when they work in parallel,
ui = size of the file part that the i-th server can download during the time T.
After calculating the speed of each server, we compute the time T that the servers need to finish the entire work using formula (2). We then use this amount of time in formula (1) to calculate ui, the size of the file part downloaded by the i-th server during the time T.
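For clarity, formula (2) follows directly from requiring that the parts computed by formula (1) cover the whole file; this short derivation is implicit in the paper and is added here:

```latex
% Server i runs at rate U/t_i, so in time T it downloads
\[
  u_i = \frac{T}{t_i}\,U \tag{1}
\]
% The n parts must reassemble the whole file:
\[
  \sum_{i=1}^{n} u_i = U
  \quad\Longrightarrow\quad
  U\,T \sum_{i=1}^{n} \frac{1}{t_i} = U
  \quad\Longrightarrow\quad
  1 = T \sum_{i=1}^{n} \frac{1}{t_i} \tag{2}
\]
```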
ii. How to Calculate the Data Rate of Servers?
To calculate the performance of the sliding-window protocol, let U be the channel utilization, W the window size, and a = tp / tt, where tt is the transmission time and tp is the propagation time. Then
U = 1                  when W ≥ 2a + 1,
U = W / (2a + 1)       when W < 2a + 1.
Thus, Data Rate = BW (bandwidth) * U.
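The following small Python sketch, with hypothetical numbers, shows how these two formulas combine into an effective data-rate estimate for a server:

```python
# Sliding-window utilization and effective data rate, as defined above,
# with a = tp / tt (propagation time over transmission time).
def channel_utilization(window, tt, tp):
    """Return U: 1 if W >= 2a + 1, else W / (2a + 1)."""
    a = tp / tt
    return 1.0 if window >= 2 * a + 1 else window / (2 * a + 1)

def data_rate(bandwidth, window, tt, tp):
    """Effective data rate = raw bandwidth * channel utilization U."""
    return bandwidth * channel_utilization(window, tt, tp)

# Hypothetical example: 1 Gb/s link, window of 8 frames, tt = 1 ms,
# tp = 5 ms, so a = 5 and U = 8 / 11.
print(data_rate(1e9, 8, 0.001, 0.005))  # ~7.27e8 bits per second
```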
iii. Procedure
1. Start.
2. Determine U, the size of the requested file.
3. Calculate ti, the time each server would spend downloading the entire file if it worked individually: ti = U / (Data Rate of server i).
4. Calculate T, the total time to transfer the requested file, from 1 = T * ∑(1/ti), i = 1, 2, …, n.
5. Calculate ui using the formula ui = (T / ti) * U.
6. Assign each part of the file to a single server.
7. End.
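A minimal sketch of steps 2-5, assuming the server speeds have already been measured (the function and variable names are ours, not the paper's):

```python
# "Minimizing Replica Idle Time" partitioning: split a file among servers in
# proportion to their measured speeds so that all of them finish together.
def partition_file(file_size, speeds):
    """Return (T, parts): the common finish time and the per-server part sizes.

    speeds[i] is server i's measured data rate in file-size units per time unit.
    """
    # Step 3: time for each server to fetch the whole file on its own.
    t = [file_size / s for s in speeds]
    # Step 4: total parallel time T from 1 = T * sum(1/ti).
    T = 1.0 / sum(1.0 / ti for ti in t)
    # Step 5: part for server i, ui = (T / ti) * U.
    parts = [(T / ti) * file_size for ti in t]
    return T, parts
```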
V. CASE STUDY
Suppose we need to download a file of size 1000 GB (U) and we want to choose 4 servers (n) to load this file. The first step is to send a small packet to measure the RTT and hence the speed of each server, from which ti follows. Suppose the server speeds are 100, 200, 400, and 500 GB per time unit respectively; the servers then need 10, 5, 2.5, and 2 time units respectively to download the 1000 GB file on their own.
Server Speed                  File Download Time
S1 = 100 GB per time unit     t1 = 1000/100 = 10 time units
S2 = 200 GB per time unit     t2 = 1000/200 = 5 time units
S3 = 400 GB per time unit     t3 = 1000/400 = 2.5 time units
S4 = 500 GB per time unit     t4 = 1000/500 = 2 time units
Table 1: Time taken by each server to download the entire file.
In the traditional way, all the servers work in parallel: before starting, the file is divided into four equal parts and each part is assigned to one server, so each server tries to download 250 GB. The table above then becomes the following:
Server Speed                  Download Time
S1 = 100 GB per time unit     t1 = 10/4 = 2.5 time units
S2 = 200 GB per time unit     t2 = 5/4 = 1.25 time units
S3 = 400 GB per time unit     t3 = 2.5/4 = 0.625 time units
S4 = 500 GB per time unit     t4 = 2/4 = 0.5 time units
Table 2: Time taken by each server to download one quarter of the file.
Thus
ui = (T / ti) * U ……………… (1)
and
1 = T * ∑ (1/ti), i = 1, 2, …, n ……………… (2)
From the above table, and as illustrated in Figure 2, Server 4 can finish its job in 0.5 time units whereas Server 1 needs 2.5 time units; this means Server 4 waits 2 time units as idle time. The same holds for Server 3 and Server 2, which wait 1.875 and 1.25 time units respectively.
Clearly, if we used only Server 4 it would need just two time units to download the entire file; but when we divided the file into quarters, the time taken to download the entire file became worse, 2.5 time units, because Server 4 waits for Server 1 to finish its job. This method is therefore inappropriate: it expanded the download time despite the increased number of servers. The idle time of Server 4, the fastest one, is 2 time units, yet the total time needed to finish the entire job equals that of the slowest server, 2.5 time units.
Figure 2: Idle Time of Servers
In our way, all the servers again work in parallel, but before starting, the file is divided into four parts whose sizes are determined by formula (2):
1 = T * (∑ 1/ti) ……… (2)
Thus 1 = T * (1/10 + 1/5 + 1/2.5 + 1/2)
T = 1/1.2 ≈ 0.8333
Thus T ≈ 0.8333 time units, the total time needed to download the entire file. Using formula (1), ui = (T / ti) * U, we can calculate the part of the file assigned to each server:
u1 = (0.8333/10) * 1000 ≈ 83.33 GB (part of the file assigned to Server 1)
u2 = (0.8333/5) * 1000 ≈ 166.67 GB (part of the file assigned to Server 2)
u3 = (0.8333/2.5) * 1000 ≈ 333.33 GB (part of the file assigned to Server 3)
u4 = (0.8333/2) * 1000 ≈ 416.67 GB (part of the file assigned to Server 4)
The parts sum to 83.33 + 166.67 + 333.33 + 416.67 ≈ 1000 GB.
Thus all the servers finish the entire work at the same time, with no idle time; the total working time is about 0.8333 time units. We save about 2.5 − 0.8333 ≈ 1.67 time units, i.e. the download is roughly three times faster (2.5 / 0.8333 ≈ 3) than with the traditional way.
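Under the same assumptions, the partition_file sketch given after the procedure in Section IV reproduces these numbers:

```python
# Reproducing the case-study numbers with the partition_file sketch above.
T, parts = partition_file(1000, [100, 200, 400, 500])
print(T)           # 0.8333... time units
print(parts)       # [83.33..., 166.66..., 333.33..., 416.66...] GB
print(sum(parts))  # 1000.0 GB: the servers finish simultaneously
```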
Figure 3: Speed of Downloading File for Each Server
Figure 4: Compare with Traditional Ways
Figure 5: Speed of Servers
Figure 6: Portion Ratio of File Assigned to Each Server
VI. CONCLUSION
Processor speed, bandwidth, congestion, and many other factors vary among servers, which leads to different file transfer speeds. In other words, if multiple servers work simultaneously to download a huge file, the faster servers are forced to wait for the slowest server to finish the entire job.
The current work eliminates this idle time of servers. To do so, the requested file is divided into multiple parts proportional to the servers' speeds, and each part is assigned to a single server.
The proposed method made the transfer roughly three times faster than the traditional method.
REFERENCES
1. ECR 2005 Scientific Programme Abstracts, Eur Radiol Suppl (2005) 15 (Suppl 1): 1. doi:10.1007/s10406-005-0100-2
2. Kumar, K.A., Quamar, A., Deshpande, A. et al., "SWORD: workload-aware data placement and replica selection for cloud data management systems", The VLDB Journal (2014) 23: 845. doi:10.1007/s00778-014-0362-1
3. Toporkov, V.V. & Yemelyanov, D.M., "Economic model of scheduling and fair resource sharing in distributed computations", Program Comput Soft (2014) 40: 35. doi:10.1134/S0361768814010071
4. Pethuru Raj, Anupama Raman, Dhivya Nagaraj and Siddhartha Duggirala, "High-Performance Grids and Clusters", in High-Performance Big-Data Analytics, Computer Communications and Networks series, pp. 275-315.
5. Foster, I. & Kesselman, C., "Globus: A Metacomputing Infrastructure Toolkit", International Journal of Supercomputer Applications and High Performance Computing (1997) 11(2): 115-128.
6. Linstead, E., Bajracharya, S., Ngo, T. et al., "Sourcerer: mining and searching internet-scale software repositories", Data Min Knowl Disc (2009) 18: 300. doi:10.1007/s10618-008-0118-x
7. Sourav Mazumder, "Big Data Tools and Platforms".
8. Manuel Sánchez, Óscar Cánovas, Diego Sevilla and Antonio F. Gómez-Skarmeta, "A Service-Based Architecture for Integrating Globus 2 and Globus 3", in Advances in Grid Computing - EGC 2005.
9. Tan, J., Abramson, D. & Enticott, C., "A Rerouting and Multiplexing System for Grid Connectivity Across Firewalls", J Grid Computing (2009) 7: 25. doi:10.1007/s10723-008-9104-
10. Cameron, D., Casey, J., Guy, L. et al., "Replica Management in the European DataGrid Project", J Grid Computing (2004) 2: 341. doi:10.1007/s10723-004-5745-x
11. Ravimaran, S. & Maluk Mohamed, M.A., "Integrated Obj_FedRep: Evaluation of Surrogate Object based Mobile Cloud System for Federation, Replica and Data Management", Arab J Sci Eng (2014) 39: 4577. doi:10.1007/s13369-014-1001-2
12. Caron, E., Desprez, F. & Muresan, A., "Pattern Matching Based Forecast of Non-periodic Repetitive Behavior for Cloud Clients", J Grid Computing (2011) 9: 49. doi:10.1007/s10723-010-9178-4
13. Hai Jin, Jin Huang, Xia Xie and Qin Zhang, "Using Classification Techniques to Improve Replica Selection in Data Grid".
14. Yang, C.T., Shih, P.C., Lin, C.F. et al., "A resource broker with an efficient network information model on grid environments", J Supercomput (2007) 40: 249. doi:10.1007/s11227-006-0025-0
15. Mansouri, N., "Adaptive data replication strategy in cloud computing for performance improvement", Front. Comput. Sci. (2016) 10: 925. doi:10.1007/s11704-016-5182-6
16. Bang Zhang, Xingwei Wang and Min Huang, "A PGSA Based Data Replica Selection Scheme for Accessing Cloud Storage System".
17. M. S. Almahanaa & R. M. Almuttairi, "Enhanced Replica Selection Technique for Binding Replica Sites in Data Grids", in Proc. of the International Conference on Intelligent Infrastructure, 47th Annual National Convention of the Computer Society of India, Kolkata Chapter, December 1-2, 2012, Science City, Kolkata.
18. R.M. Almuttairi, R. Wankar, A. Negi, C.R. Rao, A. Agrawal & R. Buyya, "A two phased service oriented broker for replica selection in data grids (2SOB)", Future Generation Computer Systems (2012). doi:10.1016/j.future.2012.09.007
19. R. M. Almuttairi, "Replica Optimization in Data Grids", International Journal of Science Research (IJSR), ISSN 2277-8179, Volume 4, Issue 2, February 2015.
20. R.M. Almuttairi, "Smart Vogel's Approximation Method (SVAM)", International Journal of Advanced Computer Research, ISSN (print) 2249-7277, ISSN (online) 2277-7970, Volume 4, Number 1, Issue 14, March 2014.