Chapter

The Hadoop Ecosystem Technologies and Tools

Authors:
  • Reliance Jio Platforms Ltd.

Abstract

There are several interesting and inspiring trends and transitions happening in the business as well as IT spaces. One noteworthy fact is that fresh data sources keep emerging and pouring out a lot of usable and reusable data. With the number of different, distributed, and decentralized data sources consistently on the rise, the resulting data scope, size, structure, schema, and speed are changing greatly and becoming challenging too. Another dominant and prominent aspect is that polyglot microservices are solidifying as the new building and deployment/execution block of the software world, enabling the much-needed acceleration of software design, development, deployment, and delivery. The device ecosystem is expanding frenetically with the arrival of trendy and handy, slim and sleek, disappearing and disposable gadgets and gizmos, so that ubiquitous (anywhere, anytime, any device) access to and usage of web-scale information, content, and services is becoming a reality. Finally, all sorts of casually found and cheap articles in our everyday environments (homes, hotels, hospitals, etc.) are being systematically digitized and service-enabled in order to exhibit a kind of real-world smartness and sagacity in their individual as well as collective actions and reactions.

Thus trillions of digitized objects, billions of connected devices, and millions of polyglot software services are bound to interact insightfully with one another, locally as well as remotely, over any network. Hence the amount of transactional, operational, analytical, commercial, social, personal, and professional data created through a growing array of interactions and collaborations is growing very rapidly. If the data being collected, processed, and stored is not subjected to deeper, deft, and decisive investigation, then the tactically as well as strategically sound knowledge hidden inside the data heaps (beneficial patterns, tips, techniques, associations, alerts, risk factors, fresh opportunities, possibilities, etc.) goes literally unused. For collecting, storing, and processing such large amounts of multistructured data, traditional databases, analytics platforms, ETL tools, etc., have been found insufficient. Hence the Apache Hadoop ecosystem technologies and tools are being touted as the best way forward for squeezing out the right and relevant knowledge. In this chapter, you can find the details of the emerging technologies and platforms spearheading the big data movement.

... The Job Tracker distributes those tasks to the worker nodes. The output of each map task is partitioned into groups of key-value pairs, one group per reducer [13,18]. ...
... The Job Tracker is responsible for data processing and resource management: maintaining the list of live nodes, maintaining the list of available and occupied map and reduce slots, and allocating the available slots to appropriate jobs and tasks according to the selected scheduling policy. Large Hadoop clusters revealed a scalability bottleneck [18] (https://data-flair.training/blogs/13-limitations-of-hadoop/) caused by having a single Job Tracker:
  • If the Job Tracker fails, all running jobs are lost.
  • According to Yahoo, the practical limits of this design are reached with a cluster of 5000 nodes and 40,000 tasks running concurrently.
  • A node cannot run more map tasks than it has map slots at any given moment, even if no reduce tasks are running. This harms cluster utilization: when all map slots are taken (and more maps are wanted), idle reduce slots cannot be used for them, or vice versa.
  • Hadoop was designed to run MapReduce jobs only; it cannot run other kinds of applications, such as graph processing. ...
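To make the map-to-reduce data flow described in these excerpts concrete, here is a minimal word-count sketch for Hadoop Streaming in Python. This is an illustrative sketch, not code from the chapter; the mapper emits tab-separated key-value pairs, and the framework partitions them by key so that all counts for a given word reach the same reducer.

    #!/usr/bin/env python3
    # mapper.py - minimal Hadoop Streaming mapper (word count).
    # Each printed line is one key-value pair, separated by a tab;
    # the framework partitions these pairs by key across reducers.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    #!/usr/bin/env python3
    # reducer.py - receives the pairs sorted by key and sums per word.
    import sys

    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

Under a typical installation this pair of scripts is submitted through the hadoop-streaming jar (exact paths vary by distribution), with the Job Tracker, or the YARN ResourceManager in Hadoop 2, scheduling the resulting map and reduce tasks.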
... It was designed to store structured data in tables that can have billions of rows and millions of columns. HBase is not a relational database and was not designed to support transactional and other real-time applications [18]. ...
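The wide-table model sketched in this excerpt can be tried out with the third-party happybase Python client. In the minimal sketch below, the 'users' table, the 'info' column family, and the locally running HBase Thrift server are assumptions made for illustration, not details from the cited work.

    # Minimal sketch of HBase's column-family data model using the
    # third-party "happybase" client; assumes an HBase Thrift server
    # on localhost and an existing 'users' table with an 'info' family.
    import happybase

    connection = happybase.Connection('localhost')
    table = connection.table('users')

    # Rows are keyed by arbitrary byte strings; columns live inside
    # column families and can differ freely from row to row.
    table.put(b'user#1001', {b'info:name': b'Asha', b'info:city': b'Pune'})

    row = table.row(b'user#1001')
    print(row[b'info:name'])  # b'Asha'

Because column qualifiers need not be declared up front, a table can grow toward the billions of rows and millions of columns mentioned above without a fixed schema.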
Article
Full-text available
Introduction: Nowadays, large data volumes are generated daily at a high rate. Data from health systems, social networks, finance, government, marketing, and bank transactions, as well as from sensors and smart devices, are increasing, so tools and models have to be optimized. In this paper we applied and compared machine learning algorithms (Linear Regression, Naïve Bayes, Decision Tree) to predict diabetes. Furthermore, we performed analytics on flight delays. The main contribution of this paper is to give an overview of big data tools and machine learning models; we highlight metrics that allow us to choose the more accurate model. We predict diabetes using three machine learning models and compare their performance, and we analyze flight delays and produce a dashboard which can help managers of flight companies have a 360° view of their flights and take strategic decisions.
Case description: We applied three machine learning algorithms for predicting diabetes and compared their performance to see which model gives the best results. We performed analytics on flight datasets to help decision making and to predict flight delays.
Discussion and evaluation: The experiments show that Linear Regression, Naïve Bayes, and Decision Tree give the same accuracy (0.766), but Decision Tree outperforms the two other models with the greatest score (1) and the smallest error (0). For the flight-delay analytics, the model could show, for example, the airport that recorded the most flight delays.
Conclusions: Several tools and machine learning models for big data analytics have been discussed in this paper. We conclude that, for the same dataset, the prediction model must be chosen carefully. In future work, we will test different models in other fields (climate, banking, insurance).
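As a rough illustration of the model comparison this paper describes (not the authors' code), the following scikit-learn sketch cross-validates a linear model, Naïve Bayes, and a decision tree on a hypothetical Pima-style diabetes.csv with an 'Outcome' label column; logistic regression stands in here for the paper's linear model, since the task is classification.

    # Hedged sketch of a three-model comparison on a diabetes dataset.
    # 'diabetes.csv' and its 'Outcome' column are illustrative assumptions.
    import pandas as pd
    from sklearn.model_selection import cross_val_score
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier

    df = pd.read_csv('diabetes.csv')
    X, y = df.drop(columns=['Outcome']), df['Outcome']

    models = {
        'linear (logistic)': LogisticRegression(max_iter=1000),
        'naive bayes': GaussianNB(),
        'decision tree': DecisionTreeClassifier(random_state=0),
    }
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
        print(f'{name}: mean accuracy {scores.mean():.3f}')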
... Further, each result of the map task is split into two components, the key and the value, which are then used for the reduction. The second function in the MapReduce model is the reduce function: for each key, it receives the set of values associated with that key and produces a set of output values [21]. ...
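The per-key grouping this excerpt describes can be simulated in a few lines of plain Python; the toy map-shuffle-reduce loop below is only a didactic sketch of the data flow, not Hadoop itself.

    # Toy simulation of the MapReduce data flow: map emits (key, value)
    # pairs, the shuffle groups values by key, and reduce receives each
    # key together with its full set of values.
    from collections import defaultdict

    def map_fn(line):
        for word in line.split():
            yield word, 1

    def reduce_fn(key, values):
        yield key, sum(values)

    lines = ["big data tools", "big data platforms"]

    groups = defaultdict(list)           # the "shuffle" phase
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)

    for key in sorted(groups):
        print(list(reduce_fn(key, groups[key])))   # e.g. [('big', 2)]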
Article
Full-text available
Nowadays, cloud computing plays an important role in the process of storing both structured and unstructured data. This has contributed to very large data growth on web servers, which has come to be called big data. Cloud computing technology is adopted in many applications, perhaps the most important of which are social networking applications, e-mail, and others, which represent an important source of data generated through communication between web users. These data represent views and opinions on various topics, which can help businesses and other decision-makers make decisions based on future predictions. To achieve this goal, several methods have been proposed; recently, they rely on deep learning as a tool for processing large volumes of data, owing to its high performance in extracting predictions from the opinions of web users. This paper presents a new Prediction Approach based on Big Data Analysis and Deep Learning for large-scale data, called PABIDDL. The infrastructure of the proposed approach comprises three stages: first, the reduction of big data based on MapReduce using the Hadoop framework; second, the initialization of these data using the GloVe technique; and finally, the classification of the text data into positive ("advantages") and negative ("disadvantages") poles using a CNN deep learning approach. We also conducted an empirical study of our proposed approach PABIDDL against related models on two standard data sets, the IMDB and MR datasets. The results we obtained show that the best performance is given by our approach: we recorded 0.93, 0.90, and 0.92 for accuracy, recall, and F1-score, respectively. Our approach also achieved the fastest response time.
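A minimal Keras sketch of the embedding-plus-CNN classification stage described above follows; the vocabulary size, sequence length, and layer sizes are illustrative assumptions, and a faithful reproduction would load pretrained GloVe vectors into the Embedding layer instead of training it from scratch.

    # Hedged sketch of a GloVe-style embedding + CNN text classifier.
    # Hyperparameters are illustrative; a real run would set the
    # Embedding weights from pretrained GloVe vectors.
    import tensorflow as tf
    from tensorflow.keras import layers

    VOCAB, SEQ_LEN, EMB_DIM = 20000, 200, 100   # GloVe vectors are often 100-d

    model = tf.keras.Sequential([
        layers.Input(shape=(SEQ_LEN,)),
        layers.Embedding(VOCAB, EMB_DIM),
        layers.Conv1D(128, 5, activation='relu'),   # n-gram feature detectors
        layers.GlobalMaxPooling1D(),
        layers.Dense(64, activation='relu'),
        layers.Dense(1, activation='sigmoid'),      # positive vs. negative pole
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=['accuracy'])
    model.summary()   # model.fit(...) would follow with tokenized reviews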
... It has proven its great ability to store and process a huge volume of data. This has made it an important framework used across many different industries [33,35]. It works mainly with HDFS for data storage and the MapReduce model for data analysis. ...
Chapter
The rapid pace of technological progress has led to increasing growth in the volume of digital data circulating on servers and on the web. This has contributed to the birth of the concept of Big Data. Simply put, this concept refers to the huge amount of information on the Internet; yet it also reflects the heterogeneity and complexity of such data. Analyzing these data, especially unstructured data, has therefore become important, since they can be used in many areas such as company management, health, and smart cities. Analyzing them requires novel, efficient tools, as the current ones are not effective enough. This paper surveys the most frequently used tools and platforms for Big Data analysis, with due emphasis on Machine Learning-based models. The results of this study provide in-depth knowledge of Big Data analytics applications related to machine learning that can contribute to the innovation and development of big data analytics platforms. Moreover, it helps in choosing the right tools to ensure the best performance when designing an analytics system.
... Apache Spark (Shyam et al. 2015) can be used for this because of its distributed nature, and it is scalable enough to handle large amounts of data. Before the release of Spark, the entire big data analytics task was done using Hadoop (Chelliah 2017), which processes vast amounts of data on commodity hardware. Like Hadoop, Spark is used to process vast amounts of data, but up to 100 times faster. ...
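The contrast drawn in this excerpt shows up directly in code: the same word count as the earlier streaming sketch collapses into a few lines of Spark's RDD API, and intermediate results stay in memory rather than being written to disk between stages ('input.txt' is a placeholder path).

    # Word count in PySpark; compare with the two-script streaming sketch.
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "wordcount")
    counts = (sc.textFile("input.txt")
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
    print(counts.take(10))
    sc.stop()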
Article
Full-text available
Link prediction in a given instance of a network topology is a crucial task for extracting and inspecting the evolution of social networks. It predicts missing links in existing community networks and new or terminating links in future systems, and it has attracted much attention in many fields. In the past decade, many methodologies have been proposed to predict suitable links in a given social network. Analyzing link prediction methods is difficult when the network is very complex, owing to the prohibitive computing cost, and it remains a very challenging task to predict missing links efficiently and accurately in an incomplete complex network. Intuitively, nodes that share a large number of common neighbors are likely to be connected, and numerous similarity indices built on this intuition have achieved considerable accuracy and efficiency in this task. In this paper, we propose one such index, namely the Clustering Coefficient Index, using triangle counting implemented on Apache Spark's GraphX component. The proposed index uses the formation of triangles in the given network topology and the clustering coefficients. Experimental results show that the proposed methodology outperforms other existing methods in linking the suitable communications.
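The paper builds its index on the Scala triangle-counting primitive of GraphX; as a rough Python approximation of the same idea, the following PySpark sketch computes each vertex's local clustering coefficient from triangle counts on a tiny hypothetical edge list (broadcasting the whole neighbor map is only viable for small demo graphs).

    # Hedged sketch: local clustering coefficient via triangle counting
    # on plain RDDs, approximating what GraphX's triangleCount provides.
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "clustering-coefficient")

    # Undirected edges of a tiny hypothetical graph.
    edges = sc.parallelize([(1, 2), (2, 3), (1, 3), (3, 4)])

    # Symmetrize and build one neighbor set per vertex.
    sym = edges.flatMap(lambda e: [e, (e[1], e[0])]).distinct()
    neighbors = sym.groupByKey().mapValues(set)

    nbr_map = sc.broadcast(dict(neighbors.collect()))  # demo-sized only

    def local_cc(pair):
        v, nbrs = pair
        k = len(nbrs)
        if k < 2:
            return (v, 0.0)
        # Each edge among v's neighbors closes one triangle through v.
        links = sum(len(nbr_map.value[u] & nbrs) for u in nbrs) // 2
        return (v, 2.0 * links / (k * (k - 1)))

    print(sorted(neighbors.map(local_cc).collect()))
    sc.stop()

On this toy graph, vertices 1 and 2 each score 1.0 (their two neighbors are connected to each other), while vertex 3 scores 1/3.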
... Cloudera (https://www.cloudera.com/products/open-source/apachehadoop.html) is an example of a platform that offers a scalable and flexible integration interface, facilitating the management of large volumes and varieties of data in an enterprise. Cloudera enables the deployment and management of Apache Hadoop and related projects, allowing data to be manipulated and analyzed while keeping it protected [63]. This is the reason why we use Cloudera in this study. ...
Article
Full-text available
This exploratory research examines the potential for applying a big data analytic framework to the modeling and analysis of cases in pharmaceutical patent validity brought before the U.S. Court of Appeals for the Federal Circuit. We start with two specific goals: one, to identify the key issues or reasons the Court uses to make validity decisions; and two, to attempt to predict outcomes for new cases. The ultimate goal is to support legal decision-making with automation. The legal domain is a challenging one to tackle; however, current advances in analytic technologies and models hold the promise of success. Our application of Hadoop MapReduce in conjunction with a number of algorithms, such as clustering, classification, word count, word co-occurrence, and row similarity, is encouraging, in that the results are robust enough to suggest these approaches have promise and are worth pursuing. By utilizing larger case data sets and sample sizes and by using deep machine learning models in text analytics, more breakthroughs can be achieved to provide decision support to the legal domain. From an economic standpoint, the potential for litigation-cost reduction is another objective of our study. Synergies are obtained in applying lessons from the legal domain to the computational field and vice versa, leading to acceleration in our understanding.
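Of the algorithms this study lists, word co-occurrence is a classic MapReduce pattern. Below is a hedged sketch of the "pairs" approach as a Hadoop Streaming mapper; the window size is an illustrative choice, and a summing reducer like the word-count one shown earlier would aggregate the emitted pairs.

    #!/usr/bin/env python3
    # cooccur_mapper.py - the "pairs" pattern for word co-occurrence:
    # emit one key-value pair per co-occurring word pair in a window;
    # a summing reducer then totals the counts per pair.
    import sys

    WINDOW = 2  # neighborhood size, an illustrative choice

    for line in sys.stdin:
        words = line.split()
        for i, w in enumerate(words):
            lo, hi = max(0, i - WINDOW), min(len(words), i + WINDOW + 1)
            for j in range(lo, hi):
                if i != j:
                    print(f"{w},{words[j]}\t1")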
Article
Full-text available
Due to the ever-increasing amount of data that is produced and captured in today's world, the concept of big data has risen to prominence. However, implementing the respective applications is still a challenging task, especially since a high degree of flexibility is desirable. One potential approach is the utilization of novel decentralized technologies, such as microservices, to construct such big data analytics solutions. To obtain an overview of the current research, this bibliometric review analyzes the literature using the scientific database Scopus and its search and analytics tools, and subsequently discusses avenues for future research.
Article
Full-text available
Current industry trends in enterprise architectures indicate movement from Service-Oriented Architecture (SOA) to Microservices. By understanding the key differences between these two approaches and their features, we can design a more effective Microservice architecture by avoiding SOA pitfalls. To do this, we must know why this shift is happening and how key SOA functionality is addressed by key features of the Microservice-based system. Unfortunately, Microservices do not address all SOA shortcomings. In addition, Microservices introduce new challenges. This work provides a detailed analysis of the differences between these two architectures and their features. Next, we describe both research and industry perspectives on the strengths and weaknesses of both architectural directions. Finally, we perform a systematic mapping study related to Microservice research, identifying interest and challenges in multiple categories from a range of recent research.