Chapter

The Hadoop Ecosystem Technologies and Tools

Authors:
  • Reliance Jio Platforms Ltd.

Abstract

There are several interesting and inspiring trends and transitions happening in the business as well as IT spaces. One noteworthy fact is that fresh data sources keep emerging and pouring out a lot of usable and reusable data. With the number of different, distributed, and decentralized data sources consistently on the rise, the resulting data scope, size, structure, schema, and speed are changing greatly and becoming challenging too. Another dominant and prominent aspect is that polyglot microservices are solidifying as the new building and deployment/execution block of the software world, enabling the much-needed acceleration of software design, development, deployment, and delivery. The device ecosystem is expanding frenetically with the arrival of trendy and handy, slim and sleek, disappearing and disposable gadgets and gizmos, so that ubiquitous (anywhere, anytime, any device) access to and usage of web-scale information, content, and services is becoming a reality. Finally, all sorts of casually found and cheap articles in our everyday environments (homes, hotels, hospitals, etc.) are being systematically digitized and service-enabled in order to exhibit a kind of real-world smartness and sagacity in their individual as well as collective actions and reactions.

Thus trillions of digitized objects, billions of connected devices, and millions of polyglot software services are bound to interact insightfully with one another, locally as well as remotely, over any network. Hence the amount of transactional, operational, analytical, commercial, social, personal, and professional data created through a growing array of interactions and collaborations is growing very rapidly. If the data being collected, processed, and stored is not subjected to deeper, deft, and decisive investigation, then the tactically as well as strategically sound knowledge hidden inside the data heaps (beneficial patterns, tips, techniques, associations, alerts, risk factors, fresh opportunities, possibilities, etc.) goes literally unused. For collecting, storing, and processing such large amounts of multistructured data, traditional databases, analytics platforms, ETL tools, etc., have been found insufficient. Hence the Apache Hadoop ecosystem technologies and tools are being touted as the best way forward for squeezing out the right and relevant knowledge. In this chapter, you can find the details of the emerging technologies and platforms spearheading the big data movement.

... The Job Tracker distributes those tasks to the worker nodes. The output of each map task is partitioned into groups of key-value pairs, one group per reducer [13,18]. ...
... The Job Tracker is responsible for data processing and resource management: maintaining the list of live nodes, maintaining the list of available and occupied map and reduce slots, and allocating the available slots to appropriate jobs and tasks according to the selected scheduling policy. Large Hadoop clusters revealed a scalability bottleneck [18] (https://data-flair.training/blogs/13-limitations-of-hadoop/) caused by having a single Job Tracker:
  • If the Job Tracker fails, all running jobs are lost.
  • According to Yahoo, the practical limits of this design are reached with a cluster of 5000 nodes and 40,000 tasks running concurrently.
  • A node cannot run more map tasks than it has map slots at any given moment, even if no reduce tasks are running. This harms cluster utilization: when all map slots are taken (and more maps are wanted), idle reduce slots cannot be used for them, or vice versa.
  • Hadoop was designed to run MapReduce jobs only; it cannot run other kinds of applications, such as graph processing. ...
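To make the map-to-reduce data flow described in these excerpts concrete, here is a minimal word-count sketch for Hadoop Streaming in Python. This is an illustrative sketch, not code from the chapter; the mapper emits tab-separated key-value pairs, and the framework partitions them by key so that all counts for a given word reach the same reducer.

    #!/usr/bin/env python3
    # mapper.py - minimal Hadoop Streaming mapper (word count).
    # Each printed line is one key-value pair, separated by a tab;
    # the framework partitions these pairs by key across reducers.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    #!/usr/bin/env python3
    # reducer.py - receives the pairs sorted by key and sums per word.
    import sys

    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

Under a typical installation this pair of scripts is submitted through the hadoop-streaming jar (exact paths vary by distribution), with the Job Tracker, or the YARN ResourceManager in Hadoop 2, scheduling the resulting map and reduce tasks.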
... It was designed to store structured data in tables that can have billions of rows and millions of columns. HBase is not a relational database and was not designed to support transactional and other real-time applications [18]. ...
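The wide-table model sketched in this excerpt can be tried out with the third-party happybase Python client. In the minimal sketch below, the 'users' table, the 'info' column family, and the locally running HBase Thrift server are assumptions made for illustration, not details from the cited work.

    # Minimal sketch of HBase's column-family data model using the
    # third-party "happybase" client; assumes an HBase Thrift server
    # on localhost and an existing 'users' table with an 'info' family.
    import happybase

    connection = happybase.Connection('localhost')
    table = connection.table('users')

    # Rows are keyed by arbitrary byte strings; columns live inside
    # column families and can differ freely from row to row.
    table.put(b'user#1001', {b'info:name': b'Asha', b'info:city': b'Pune'})

    row = table.row(b'user#1001')
    print(row[b'info:name'])  # b'Asha'

Because column qualifiers need not be declared up front, a table can grow toward the billions of rows and millions of columns mentioned above without a fixed schema.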
Article
Full-text available
Introduction: Nowadays, large data volumes are generated daily at a high rate. Data from health systems, social networks, finance, government, marketing, and bank transactions, as well as from sensors and smart devices, are increasing, so tools and models have to be optimized. In this paper we applied and compared machine learning algorithms (Linear Regression, Naïve Bayes, Decision Tree) to predict diabetes. Furthermore, we performed analytics on flight delays. The main contribution of this paper is to give an overview of big data tools and machine learning models; we highlight metrics that allow us to choose the more accurate model. We predict diabetes using three machine learning models and compare their performance, and we analyze flight delays and produce a dashboard which can help managers of flight companies have a 360° view of their flights and take strategic decisions.
Case description: We applied three machine learning algorithms for predicting diabetes and compared their performance to see which model gives the best results. We performed analytics on flight datasets to help decision making and to predict flight delays.
Discussion and evaluation: The experiments show that Linear Regression, Naïve Bayes, and Decision Tree give the same accuracy (0.766), but Decision Tree outperforms the two other models with the greatest score (1) and the smallest error (0). For the flight-delay analytics, the model could show, for example, the airport that recorded the most flight delays.
Conclusions: Several tools and machine learning models for big data analytics have been discussed in this paper. We conclude that, for the same dataset, the prediction model must be chosen carefully. In future work, we will test different models in other fields (climate, banking, insurance).
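As a rough illustration of the model comparison this paper describes (not the authors' code), the following scikit-learn sketch cross-validates a linear model, Naïve Bayes, and a decision tree on a hypothetical Pima-style diabetes.csv with an 'Outcome' label column; logistic regression stands in here for the paper's linear model, since the task is classification.

    # Hedged sketch of a three-model comparison on a diabetes dataset.
    # 'diabetes.csv' and its 'Outcome' column are illustrative assumptions.
    import pandas as pd
    from sklearn.model_selection import cross_val_score
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier

    df = pd.read_csv('diabetes.csv')
    X, y = df.drop(columns=['Outcome']), df['Outcome']

    models = {
        'linear (logistic)': LogisticRegression(max_iter=1000),
        'naive bayes': GaussianNB(),
        'decision tree': DecisionTreeClassifier(random_state=0),
    }
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
        print(f'{name}: mean accuracy {scores.mean():.3f}')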
... Further, each result of the map task is split into two components, the key and the value, which are then used for the reduction. The second function in the MapReduce model is the reduce function: for each key, it receives the set of values associated with that key and produces a set of output values [21]. ...
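The per-key grouping this excerpt describes can be simulated in a few lines of plain Python; the toy map-shuffle-reduce loop below is only a didactic sketch of the data flow, not Hadoop itself.

    # Toy simulation of the MapReduce data flow: map emits (key, value)
    # pairs, the shuffle groups values by key, and reduce receives each
    # key together with its full set of values.
    from collections import defaultdict

    def map_fn(line):
        for word in line.split():
            yield word, 1

    def reduce_fn(key, values):
        yield key, sum(values)

    lines = ["big data tools", "big data platforms"]

    groups = defaultdict(list)           # the "shuffle" phase
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)

    for key in sorted(groups):
        print(list(reduce_fn(key, groups[key])))   # e.g. [('big', 2)]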
Article
Full-text available
Nowadays, cloud computing plays an important role in the process of storing both structured and unstructured data. This has contributed to very large data growth on web servers, which has come to be called big data. Cloud computing technology is adopted in many applications, perhaps the most important of which are social networking applications, e-mail, and others, which represent an important source of data generated through communication between web users. These data represent views and opinions on various topics, which can help businesses and other decision-makers make decisions based on future predictions. To achieve this goal, several methods have been proposed; recently, they rely on deep learning as a tool for processing large volumes of data, owing to its high performance in extracting predictions from the opinions of web users. This paper presents a new Prediction Approach based on Big Data Analysis and Deep Learning for large-scale data, called PABIDDL. The infrastructure of the proposed approach comprises three stages: first, the reduction of big data based on MapReduce using the Hadoop framework; second, the initialization of these data using the GloVe technique; and finally, the classification of the text data into positive ("advantages") and negative ("disadvantages") poles using a CNN deep learning approach. We also conducted an empirical study of our proposed approach PABIDDL against related models on two standard data sets, the IMDB and MR datasets. The results we obtained show that the best performance is given by our approach: we recorded 0.93, 0.90, and 0.92 for accuracy, recall, and F1-score, respectively. Our approach also achieved the fastest response time.
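A minimal Keras sketch of the embedding-plus-CNN classification stage described above follows; the vocabulary size, sequence length, and layer sizes are illustrative assumptions, and a faithful reproduction would load pretrained GloVe vectors into the Embedding layer instead of training it from scratch.

    # Hedged sketch of a GloVe-style embedding + CNN text classifier.
    # Hyperparameters are illustrative; a real run would set the
    # Embedding weights from pretrained GloVe vectors.
    import tensorflow as tf
    from tensorflow.keras import layers

    VOCAB, SEQ_LEN, EMB_DIM = 20000, 200, 100   # GloVe vectors are often 100-d

    model = tf.keras.Sequential([
        layers.Input(shape=(SEQ_LEN,)),
        layers.Embedding(VOCAB, EMB_DIM),
        layers.Conv1D(128, 5, activation='relu'),   # n-gram feature detectors
        layers.GlobalMaxPooling1D(),
        layers.Dense(64, activation='relu'),
        layers.Dense(1, activation='sigmoid'),      # positive vs. negative pole
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=['accuracy'])
    model.summary()   # model.fit(...) would follow with tokenized reviews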
... It has proven its great ability to store and process a huge volume of data. This has made it an important framework used across many different industries [33,35]. It works mainly with HDFS for data storage and the MapReduce model for data analysis. ...
Chapter
The rapid pace of technological progress has led to increasing growth in the volume of digital data circulating on servers and on the web. This has contributed to the birth of the concept of Big Data. Simply put, this concept refers to the huge amount of information on the Internet; yet it also reflects the heterogeneity and complexity of such data. Analyzing these data, especially unstructured data, has therefore become important, since they can be used in many areas such as company management, health, and smart cities. Analyzing them requires novel, efficient tools, as the current ones are not effective enough. This paper surveys the most frequently used tools and platforms for Big Data analysis, with due emphasis on Machine Learning-based models. The results of this study provide in-depth knowledge of Big Data analytics applications related to machine learning that can contribute to the innovation and development of big data analytics platforms. Moreover, it helps in choosing the right tools to ensure the best performance when designing an analytics system.
... Apache Spark (Shyam et al. 2015) can be used for this because of its distributed nature, and it is scalable enough to handle large amounts of data. Before the release of Spark, the entire big data analytics task was done using Hadoop (Chelliah 2017), which processes vast amounts of data on commodity hardware. Like Hadoop, Spark is used to process vast amounts of data, but up to 100 times faster. ...
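The contrast drawn in this excerpt shows up directly in code: the same word count as the earlier streaming sketch collapses into a few lines of Spark's RDD API, and intermediate results stay in memory rather than being written to disk between stages ('input.txt' is a placeholder path).

    # Word count in PySpark; compare with the two-script streaming sketch.
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "wordcount")
    counts = (sc.textFile("input.txt")
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
    print(counts.take(10))
    sc.stop()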
Article
Full-text available
Link prediction in a given instance of a network topology is a crucial task for extracting and inspecting the evolution of social networks. It predicts missing links in existing community networks and new or terminating links in future systems, and it has attracted much attention in many fields. In the past decade, many methodologies have been proposed to predict suitable links in a given social network. Analyzing link prediction methods is difficult when the network is very complex, owing to the prohibitive computing cost, and it remains a very challenging task to predict missing links efficiently and accurately in an incomplete complex network. Intuitively, nodes that share a large number of common neighbors are likely to be connected, and numerous similarity indices built on this intuition have achieved considerable accuracy and efficiency in this task. In this paper, we propose one such index, namely the Clustering Coefficient Index, using triangle counting implemented on Apache Spark's GraphX component. The proposed index uses the formation of triangles in the given network topology and the clustering coefficients. Experimental results show that the proposed methodology outperforms other existing methods in linking the suitable communications.
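The paper builds its index on the Scala triangle-counting primitive of GraphX; as a rough Python approximation of the same idea, the following PySpark sketch computes each vertex's local clustering coefficient from triangle counts on a tiny hypothetical edge list (broadcasting the whole neighbor map is only viable for small demo graphs).

    # Hedged sketch: local clustering coefficient via triangle counting
    # on plain RDDs, approximating what GraphX's triangleCount provides.
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "clustering-coefficient")

    # Undirected edges of a tiny hypothetical graph.
    edges = sc.parallelize([(1, 2), (2, 3), (1, 3), (3, 4)])

    # Symmetrize and build one neighbor set per vertex.
    sym = edges.flatMap(lambda e: [e, (e[1], e[0])]).distinct()
    neighbors = sym.groupByKey().mapValues(set)

    nbr_map = sc.broadcast(dict(neighbors.collect()))  # demo-sized only

    def local_cc(pair):
        v, nbrs = pair
        k = len(nbrs)
        if k < 2:
            return (v, 0.0)
        # Each edge among v's neighbors closes one triangle through v.
        links = sum(len(nbr_map.value[u] & nbrs) for u in nbrs) // 2
        return (v, 2.0 * links / (k * (k - 1)))

    print(sorted(neighbors.map(local_cc).collect()))
    sc.stop()

On this toy graph, vertices 1 and 2 each score 1.0 (their two neighbors are connected to each other), while vertex 3 scores 1/3.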
... Cloudera (https://www.cloudera.com/products/open-source/apachehadoop.html) is an example of a platform that offers a scalable and flexible integration interface, facilitating the management of large volumes and varieties of data in an enterprise. Cloudera enables the deployment and management of Apache Hadoop and related projects, allowing data to be manipulated and analyzed while keeping it protected [63]. This is the reason why we use Cloudera in this study. ...
Article
Full-text available
This exploratory research examines the potential for applying a big data analytic framework to the modeling and analysis of cases in pharmaceutical patent validity brought before the U.S. Court of Appeals for the Federal Circuit. We start with two specific goals: one, to identify the key issues or reasons the Court uses to make validity decisions; and two, to attempt to predict outcomes for new cases. The ultimate goal is to support legal decision-making with automation. The legal domain is a challenging one to tackle; however, current advances in analytic technologies and models hold the promise of success. Our application of Hadoop MapReduce in conjunction with a number of algorithms, such as clustering, classification, word count, word co-occurrence, and row similarity, is encouraging, in that the results are robust enough to suggest these approaches have promise and are worth pursuing. By utilizing larger case data sets and sample sizes and by using deep machine learning models in text analytics, more breakthroughs can be achieved to provide decision support to the legal domain. From an economic standpoint, the potential for litigation-cost reduction is another objective of our study. Synergies are obtained in applying lessons from the legal domain to the computational field and vice versa, leading to acceleration in our understanding.
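Of the algorithms this study lists, word co-occurrence is a classic MapReduce pattern. Below is a hedged sketch of the "pairs" approach as a Hadoop Streaming mapper; the window size is an illustrative choice, and a summing reducer like the word-count one shown earlier would aggregate the emitted pairs.

    #!/usr/bin/env python3
    # cooccur_mapper.py - the "pairs" pattern for word co-occurrence:
    # emit one key-value pair per co-occurring word pair in a window;
    # a summing reducer then totals the counts per pair.
    import sys

    WINDOW = 2  # neighborhood size, an illustrative choice

    for line in sys.stdin:
        words = line.split()
        for i, w in enumerate(words):
            lo, hi = max(0, i - WINDOW), min(len(words), i + WINDOW + 1)
            for j in range(lo, hi):
                if i != j:
                    print(f"{w},{words[j]}\t1")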
Article
Full-text available
Due to the ever-increasing amount of data that is produced and captured in today's world, the concept of big data has risen to prominence. However, implementing the respective applications is still a challenging task, especially since a high degree of flexibility is desirable. One potential approach is the utilization of novel decentralized technologies, such as microservices, to construct such big data analytics solutions. To obtain an overview of the current research, this bibliometric review analyzes the literature using the scientific database Scopus and its search and analytics tools, and subsequently discusses avenues for future research.
Article
Full-text available
Current industry trends in enterprise architectures indicate movement from Service-Oriented Architecture (SOA) to Microservices. By understanding the key differences between these two approaches and their features, we can design a more effective Microservice architecture by avoiding SOA pitfalls. To do this, we must know why this shift is happening and how key SOA functionality is addressed by key features of the Microservice-based system. Unfortunately, Microservices do not address all SOA shortcomings. In addition, Microservices introduce new challenges. This work provides a detailed analysis of the differences between these two architectures and their features. Next, we describe both research and industry perspectives on the strengths and weaknesses of both architectural directions. Finally, we perform a systematic mapping study related to Microservice research, identifying interest and challenges in multiple categories from a range of recent research.