Big data: definition, characteristics, life cycle, applications, and challenges

IOP Conference Series: Materials Science and Engineering
PAPER • OPEN ACCESS
To cite this article: Hiba Basim Alwan and Ku Ruhana Ku-Mahamud 2020 IOP Conf. Ser.: Mater. Sci. Eng. 769 012007
Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution
of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
Published under licence by IOP Publishing Ltd
The 6th International Conference on Software Engineering & Computer Systems
IOP Conf. Series: Materials Science and Engineering 769 (2020) 012007
IOP Publishing
doi:10.1088/1757-899X/769/1/012007
Big data: definition, characteristics, life cycle, applications,
and challenges
Hiba Basim Alwan1 and Ku Ruhana Ku-Mahamud2
1 Department of Computer Engineering, Al-Mansour University College, 10068 Al-
Andalus Sq., Baghdad, Iraq
2 School of Computing, Universiti Utara Malaysia, 06010 Sintok, Kedah, Malaysia
hiba.basim@muc.edu.iq
Abstract. Any data set that contains large volumes of information and complex data is called Big Data (BD). BD is unlike traditional data sets, so it requires special processing to manage. BD faces many challenges, from data capture through to the final results. BD exists in many subject areas, such as business, government, science, healthcare and transport, and thus touches people's lives in many aspects. BD is an important topic that requires good understanding in order to be fully utilized. This paper presents basic information on BD, including its properties and applications. Descriptions and examples of BD and its categories are elaborated upon. The BD architectural establishment is presented, followed by a conclusion on the importance of BD.
1. Introduction
One of the trendiest ideas these days is Big Data (BD). Everyone speaks about BD, as can be seen in the media, and governments and businesses attempt to use and implement BD to their benefit [1]. The term BD was not widely known until the middle of 2011. Like cloud computing, the term has been adopted by product vendors, large-scale outsourcing firms and cloud service suppliers to powerfully promote their offerings [2]. But what actually is BD?
Lisa [3] defines BD as a group of data from conventional and digital sources inside and outside the enterprise that provides a basis for continuing discovery and analysis. Another definition is found in [4], which describes BD as a quantity of data so large that it cannot be managed through the usual methods, while [5] defines BD not as a single technology, but as a group of old and modern technologies that help businesses obtain actionable insights. BD comprises large quantities of dissimilar data, which allows processing, analysis and response in real time. BD can also be defined simply as a giant volume of data [6] on which analysis, visualization or any other processing can be performed. From these definitions, it can be concluded that if data cannot be stored or processed within a common system's capabilities, then these data are considered BD. The powerful forces in this modern domain are the continuously growing volumes of data, together with advances in technology that make it possible to mine the data for commercial purposes [4].
The unexpected rise of BD as a modern source of knowledge has encouraged business decision-makers to make decisions more quickly and to proactively detect environmental changes [7].
BD requires study and thought about both technical and business needs. Some people need to investigate the technological specifics, whereas others need to know the cost-effectiveness of using BD equipment. Applying a BD setting requires an architectural and business method and a great deal of planning [5, 8].
To manage BD, data scientists are needed because an immense amount of data is now available where, in the past, there were no algorithms able to manage it and large amounts of data could not be stored. Now,
Exabyte-scale storage and the tools needed to manipulate BD are available and inexpensive, and data virtualization and efficient preservation of BD now rely on cost-efficient cloud storage [5].
BD technology is an essential line of progress in the area of Internet science and technology. It has been broadly evaluated and developed around the globe and has been used in various areas of manufacturing as well as daily life [9].
There are many advantages to utilizing BD technologies. If utilized in an efficient way, BD can produce benefits such as maximizing organizational productivity, informing strategic positioning, improving client services, and recognizing and developing modern products and services [10, 11]. Other advantages of utilizing BD are improved marketing, automated decision making, descriptions of client behaviour, better return on investment, quantification of risks and market trends, understanding of commercial change, planning and prediction, recognition of client activities starting from click streams, and expansion of manufacturing income [12, 13].
To the knowledge of the authors, there is no single paper that gathers the basic concepts of BD, including its definition, types of data, technologies to deal with BD, characteristics of BD, the life cycle of BD, the architecture of BD, applications of BD, its platforms, challenges of BD, limitations in implementing BD, and motivations for doing BD research. Thus, this paper tries to fill these gaps with the aim of providing an understanding of BD in a simple and easy way.
In addition to the introductory section, Section 2 presents the types and properties of BD followed by
applications of BD in Section 3. Challenges in performing BD research are presented in Section 4. Finally, Section
5 presents the conclusion.
2. Types and properties of big data
There are three main categories of data within BD. The categories are structured data, semi structured data and
unstructured data [1, 14].
Structured Data usually denotes data that have a described length and format, such as strings and dates. The majority of experts conclude that this category occupies about one quarter of the available data. It is frequently kept in a database and is created by machines or humans. Structured data created by humans include input data, such as people's names and ages, click-stream data and game-move data, while structured data created by machines include sensor data, website log data, point-of-sale data and financial data [5].
Unstructured Data are data that cannot be kept in a particular format. Usually, unstructured data comprise three quarters of any company's data. Unstructured data may be located everywhere and are created by machines or humans. Those created by machines include satellite images, scientific data, pictures, video and sensor data. Unstructured data created by humans include texts in social media, web sites and cellular phones [5, 15].
Semi Structured Data are data that are not categorized as structured or unstructured. This type of data does not necessarily conform to a fixed representation but may contain simple values [5, 15].
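To make the three categories concrete, the following minimal Python sketch (illustrative only; the records and the crude classification rule are invented for this example) tags one record of each kind:

```python
# Illustrative sketch (not from the paper): one invented record per category,
# plus a crude rule of thumb for telling the three categories apart.

# Structured data: a described length and format (strings, numbers, dates).
structured_row = {"name": "Alice", "age": 34, "login_date": "2020-01-15"}

# Semi-structured data: simple values, but no stable representation
# (nested fields that may vary from record to record).
semi_structured = {"user": "Alice", "events": [{"type": "click", "x": 10}]}

# Unstructured data: free text with no particular format at all.
unstructured = "Loved the new release!! shipping was slow though :("

def category(record):
    """Classify a record: plain text -> unstructured; a flat mapping of
    scalar values -> structured; anything nested -> semi-structured."""
    if isinstance(record, str):
        return "unstructured"
    if all(isinstance(v, (str, int, float)) for v in record.values()):
        return "structured"
    return "semi-structured"

print(category(structured_row))   # structured
print(category(semi_structured))  # semi-structured
print(category(unstructured))     # unstructured
```

In practice the boundary is far less crisp than this rule suggests; the sketch only illustrates that structure is a property of how the data are represented, not of the data's subject matter.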
The 5Vs are used to describe BD [9, 16]; these are also known as the characteristics of BD and relate to volume, velocity, variety, veracity and value [17, 18, 19].
Volume is the quantity of data. This characteristic is foremost in people's minds when handling BD. Many businesses have huge volumes of data archived in the form of logs but do not have the ability to manage them. The benefit obtained from the ability to handle huge quantities of data to produce information is the most important attraction of BD analytics [17]. The huge volume of data can be helpful for businesses, but it affects the retrieval and analytic procedures, which become time consuming because of the required calculations [8].
Velocity is the speed at which data are generated and the speed required during handling. This implies the quickness that has to be considered in managing, storing and analyzing the data. Every second of every day, hundreds of hours of video are uploaded to YouTube and more than 200 million emails are sent via Gmail.
Variety refers to the range of data types and sources. The group to which BD belongs is also an important feature that needs to be recognized by data analysts, because BD is not usually structured and is not usually easy to maintain in a relational database. The difficulty of keeping and analyzing BD increases the difficulty of handling structured and unstructured data together, as 90% of created data are unstructured.
Veracity relates to the truth of the data, which is important for precision in analysis. It is impossible to ensure that all data are 100% accurate when managing a huge volume of data with great velocity and variety. The quality of the data will vary, and the precision of analysis depends on the veracity of the data source.
Value is the importance of the data and is a very significant feature in BD. The potential value of BD is substantial; however, without suitable access, its value cannot be exploited.
Table 1, provided by [20], summarizes four characteristics of BD.
Table 1. Summary of the 4V's.

Volume — Description: the amount of data created is huge compared with standard data resources. Attribute: Exabyte, Zettabyte, Yottabyte, etc. Driver: maximized data resources, higher resolution, sensors, scalable infrastructure.

Velocity — Description: data are being created very speedily, at a progression that never ends and at which data are converted into perception. Attribute: batch, near/real-time streams. Driver: enhanced linking, competitive benefit, pre-calculated information.

Veracity — Description: quality and origin of data. Attribute: reliability, totality, integrity, uncertainty. Driver: cost, requirements for traceability and explanation.
The life cycle of BD consists of five phases: data collection, data cleaning, data classification, data modelling and data delivery. The data collection phase includes gathering and storing data from different resources. In the data cleaning phase, the data are checked for unwanted items and missing values. Data classification classifies the data according to type: structured, semi-structured or unstructured. In the data modelling phase, the data are analysed and the result is data clustered by objective. Finally, the data delivery phase involves the creation of reports based on the results of the modelling phase. These phases are depicted in figure 1 [5, 21].
Figure 1. Life cycle of BD.
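The five phases can be sketched as a chained pipeline. The following Python toy (illustrative only; the records and the size-based classification rule are invented for this sketch) runs a few records from collection through delivery:

```python
# A toy walk through the five life-cycle phases (illustrative only; the
# records and the "large"/"small" classification rule are invented here).

def collect(sources):
    """Data collection: gather and store records from different resources."""
    return [record for source in sources for record in source]

def clean(records):
    """Data cleaning: drop records with missing values or unwanted items."""
    return [r for r in records if r.get("value") is not None]

def classify(records):
    """Data classification: tag each record (a stand-in for the structured /
    semi-structured / unstructured split described in section 2)."""
    return [{**r, "cls": "large" if r["value"] >= 10 else "small"} for r in records]

def model(records):
    """Data modelling: analyse the data; here, count records per cluster."""
    clusters = {}
    for r in records:
        clusters[r["cls"]] = clusters.get(r["cls"], 0) + 1
    return clusters

def deliver(clusters):
    """Data delivery: create a report based on the modelling results."""
    return "; ".join(f"{k}: {v}" for k, v in sorted(clusters.items()))

sources = [[{"value": 3}, {"value": None}], [{"value": 12}, {"value": 7}]]
report = deliver(model(classify(clean(collect(sources)))))
print(report)  # large: 1; small: 2
```

Each phase consumes the previous phase's output, which is why the order of the life cycle matters: cleaning before classification prevents incomplete records from polluting the clusters.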
Several questions need to be considered in establishing any BD architecture: how much data will the company require to handle now and in the future, and will the company usually need to handle data in real time or near real time? Other technical issues that need to be resolved are the speed and accuracy of the data. The elements required in any BD architectural establishment are shown in figure 2 [5].
Figure 2. Big Data Architecture.
3. Applications of big data
Organizations are often uncertain as to how and when data can be used. Applications are invented particularly to take advantage of the exclusive features of BD, especially in the fields of health care, manufacturing and city planning [5]. Figure 3 compares different applications based on the 3V's (variety, velocity and volume) [22].
Figure 3. Comparison of different data attributes in BD applications.
Other applications of BD, as reported in [15], are: smart grid, e-health, Internet of Things (IoT), public services, transportation and logistics, and political services and government monitoring.
Smart Grid Case: Smart grids generate ever-increasing volumes of data [23]. This is a critical area that requires real-time monitoring for almost all of its operations, achieved through the connected devices that form the infrastructure. BD analytics helps to provide insights such as identifying deteriorating critical equipment on the national grid that exhibits abnormal behaviour, for example faulty transformers. In such cases, proactive measures and the best line of preventive and maintenance action can be deployed, thereby saving cost and optimizing operation.
E-health: Healthcare is among the various fields that have adopted and benefited from BD. Patient monitoring sensors, laboratory data and the medical histories of patients with different ailments are a few of the various sources of useful data that, if utilized properly, can aid personalized medication, help policy makers to provide adaptive health care policies, reduce general hospital operational running costs and enhance service delivery.
Internet of Things: IoT is another area that has benefited much from BD, owing to the variety of interconnected objects that consume, generate and share different types of data. Objects in IoT can be anything that is capable of being connected to and accessed online.
Public Services: Public services such as public water systems are putting sensors in place to monitor consumption, illegal connections and leakage on pipelines in order to benefit from real-time monitoring of infrastructure. This reduces the manpower needed to monitor facilities, enables timely intervention when needed and helps in rendering efficient service to the public.
Transportation and Logistics: The transportation sector is among the most prominent areas where the application of BD cannot be overemphasized. Near-field communication devices such as radio frequency identification enable transporters to equip their fleets with capable sensors attached to commuting vehicles. This enables administrators to efficiently plan and manage delivery routes, access up-to-date track records of employees, monitor the fleet in real time and access up-to-date patterns of passengers' commuting behaviour, which enables optimized planning and effective management.
Political Facilities and Government Observation: Several governments are mining social data to observe political trends and analyze community opinions. Governments may utilize BD systems to enhance the usage of scarce resources and services. For example, data from sensors placed on public infrastructure such as water pipes can be used to determine the consumption rates of different regions and zones of a city, which can result in provision of the required quantity of the needed resource.
Big organizations (like Yahoo, Google and Facebook) needed to develop modern tools that permit them to store, access and analyze massive amounts of data in near real time. Platforms like MapReduce,
Hadoop, YARN, Oozie, Flume, Hive, HBase, Apache Pig, Apache Spark, Sqoop, ZooKeeper and Big Table are a modern generation of data management tools that can be used to analyze huge amounts of data effectively and in a timely manner [5].
MapReduce: MapReduce is one of the data processing options that can be executed on Hadoop [15, 24]. MapReduce has been designed to handle and schedule data processing jobs and cluster assignment effectively. Its main advantage is its simplification of processing huge volumes of data, achieved through an effective computing resource sharing mechanism based on parallel processing. The effectiveness of MapReduce comes from its capability to distribute tasks across the many available mapping nodes. The number of maps is often determined by the total size of the input, that is, the total number of blocks of the input files. Once the computation tasks on the various partitions are accomplished, an extra task named "reduce" gathers all the processed results to produce a complete processed solution. This approach facilitates better load balancing, maximizes the number of reduces, maximizes load equalization and minimizes the number of breakdowns [25]. Figure 4 shows the data flow within MapReduce [5].
Figure 4. Data flow in MapReduce.
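The map/shuffle/reduce flow described above can be conveyed with a minimal single-machine word count (illustrative only; a real MapReduce framework distributes the map tasks across cluster nodes and shuffles the intermediate pairs over the network):

```python
# A single-machine sketch of the MapReduce word-count pattern (illustrative
# only; real MapReduce distributes map tasks across many computing nodes).
from collections import defaultdict

def map_phase(block):
    """Map: emit an intermediate (key, 1) pair for every word in one block."""
    return [(word, 1) for word in block.split()]

def shuffle(pairs):
    """Shuffle: group the intermediate values by key across all mappers."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values into a complete processed result."""
    return {key: sum(values) for key, values in groups.items()}

blocks = ["big data big", "data big"]  # an input file split into blocks
intermediate = [pair for block in blocks for pair in map_phase(block)]
result = reduce_phase(shuffle(intermediate))
print(result)  # {'big': 3, 'data': 2}
```

Note how the number of map calls is driven by the number of input blocks, mirroring the paper's point that the quantity of maps is determined by the number of blocks of the input files.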
Big Table: Big Table was created by Google to act as a distributed storage system aimed at processing extremely huge amounts of structured data using secure servers. Data are organized as a table with tuples and attributes. Big Table differs from a classical relational database in various ways: it is a sparse, distributed, persistent multi-dimensional sorted map [5].
Hadoop: There are different stories about the name of Hadoop. [26] stated that Hadoop is an acronym standing for Highly Archived Distributed Object Oriented Programming. [27] stated that Hadoop is not an acronym but the name of a yellow toy elephant belonging to the son of its creator, Doug Cutting. In any case, Apache Hadoop is a hugely scalable storage platform created to manage huge data sets across hundreds or thousands of computing nodes that work concurrently [2, 17, 28]. Hadoop is an Apache-managed software framework derived from Big Table and MapReduce. It permits applications built on MapReduce to execute on big clusters of commodity hardware, and it is the basis of the computing architecture powering Yahoo!. Hadoop is intended to process data simultaneously across computation nodes, which speeds up calculation and minimizes latency. The two main elements of Hadoop are (1) a hugely scalable distributed file system providing for petabytes of data and (2) a hugely scalable MapReduce engine that calculates outcomes in batch. Figure 5 illustrates how Hadoop clusters are mapped onto hardware [5].
Figure 5. Mapping Hadoop clusters into hardware.
4. Challenges in big data research
Big Data Mining: BD mining presents many desirable opportunities but also immense difficulties. The complexities appear at various stages, including data capture, storage, searching, sharing, analysis, administration and visualization [15].
Big Data Management: BD management aims to provide dependable, clean data through various means of gathering huge volumes of different types of data from sources such as companies, governments and the private/public sectors. This is achieved by various processing tasks that include preprocessing, processing and other related activities such as encrypting the data for security, confidentiality and dependability [26]. Certainly, suitable data management is the basis for BD analytics [15].
Big Data Recovery and Storage: Storage in BD is accomplished via virtualization, processing huge sets of data from sensors, media, videos, transaction data from e-businesses and mobile signal coordinates. Many corporations manage data in huge volumes by utilizing instruments such as NoSQL, Apache Drill, Hortonworks, SAMOA, IKANOW, Hadoop, MapReduce and GridGain [15]. Large-volume storage facilities and faster I/O speeds enable improvements in working with BD; thus, access to data should be quick and simple for on-time analysis. Previously, persistent data were stored on Hard Disk Drives (HDD), whose well-known major drawback is slow input/output performance. Improvements in storage devices such as Solid State Drives (SSD) can minimize these problems, but they have not yet been completely exploited. HDDs are gradually being replaced by SSDs, and other improvements such as phase-change memory are also on the increase [22].
Big Data Processing: BD processing analyses huge volumes of BD at petabyte, exabyte and zettabyte scales, depending on whether batch or real-time management is best [26].
Data Visualization: The major aim of data visualization is to present data efficiently and sufficiently through several kinds of charts. Data visualization poses a challenge for BD applications because of the huge volume and dimensionality of the data; thus, there is a need to rethink the methods by which BD is pictured. The structure and usefulness of the presented data are of paramount importance in permitting the demonstration of knowledge that is hidden in large, non-trivial data sets. Structured data organized in tables, with their associated characteristics, are necessary for informative analysis [22].
Data Transmission: When the communication infrastructure is extremely large, the system's data transfer capability becomes bound and blocked in a distributed cloud framework. In this case, cloud provisioning is replaced by a cloud data feed as its improved form [22].
Big Data Security: It is a challenge to guarantee the safety and security of big data. This is attributed to many factors, such as incompetent instruments and public and private databases. In distributed programming structures, the safety challenge begins when huge amounts of personal information are kept in a database that is not encoded in a standard form. Leaving the data in the hands of disgruntled and unreliable persons adds extra complexity to data security. The challenge of data security also surfaces when migrating or updating between similar and/or different data-specific instruments. Occasionally, data thieves and system thieves copy a publicly accessible BD collection and keep it on a device such as a USB drive, hard disk or laptop. Therefore, when the keeping of data expands from one storage level to multiple storage levels, the safety level should also be raised [26].
Data Curation: This area contains several sub-areas, such as validation, documentation, supervision, security, recovery and demonstration. Existing database management tools cannot manage BD. Data warehouses and data marts have been used to manage big data sets in a suitably structured approach; these methods follow data frameworks built on structured query language. These days, NoSQL is utilized for BD because of the four Vs of BD [22].
Big Data Cleaning: This challenge involves five phases (cleaning, aggregation, encoding, storage and access), which are not new and are used in conventional information handling. The challenge in BD is how to handle the difficulties of BD's nature (velocity, volume, variety, veracity and value) and operate in a distributed situation with a combination of functions. Information resources may include noise, errors or incomplete data; the issue is how to clean large data sets and how to decide which data are relevant and which data are beneficial [15].
Big Data Aggregation: This issue relates to synchronizing external data resources and distributed BD platforms (involving applications, repositories, sensors and networks) with the internal infrastructure of a company. It is not enough to analyze only the data created inside companies. To mine important insights and information, required tools should be put in place to collect not only internally generated data but also external data resources. External data can include third-party resources, information about market fluctuations, weather forecasts and traffic conditions, data from social networks, customer comments and citizen feedback [15].
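The merging of internal and external sources can be sketched as follows (illustrative only; the feed names, regions and fields such as "forecast" and "sentiment" are invented for this example):

```python
# Illustrative sketch (not from the paper): aggregating internally generated
# records with external resources, keyed by region. All feed names and
# fields (weather forecast, social sentiment) are invented here.
def aggregate(internal, *external_feeds):
    """Merge per-key fields from the internal feed and all external feeds."""
    merged = {}
    for feed in (internal, *external_feeds):
        for key, fields in feed.items():
            merged.setdefault(key, {}).update(fields)
    return merged

internal = {"zone-1": {"sales": 120}}           # generated inside the company
weather = {"zone-1": {"forecast": "rain"}}      # third-party resource
social = {"zone-1": {"sentiment": "positive"}}  # customer comments
merged = aggregate(internal, weather, social)
print(merged)
# {'zone-1': {'sales': 120, 'forecast': 'rain', 'sentiment': 'positive'}}
```

The hard part in practice, which this sketch glosses over, is agreeing on the shared key (here, the region) across sources that were never designed to be joined.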
Big Data Imbalance: The issue of classifying an imbalanced data set has received much attention. Real-world applications with diverse distributions can be categorized into two major groups. The first is the under-represented group, characterized by an insignificant number of data points (also known as the minority or positive group). The second group has a significant number of data points (also known as the common or negative group). Identification of the positive group has significant importance in many areas such as medical diagnosis, software defect detection, finance, drug discovery and bioinformatics. Traditional Machine Learning (ML) techniques cannot be applied directly to imbalanced data sets, because model building is founded on global performance measures, which by default favour the majority group and thereby disregard the significance of the minority group [15].
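One common remedy, not specific to this paper, is to reweight classes inversely to their frequency so that training no longer favours the majority group. A minimal sketch with invented labels:

```python
# Illustrative sketch (not from the paper): inverse-frequency class weights,
# a common remedy so that learning does not disregard the minority group.
from collections import Counter

def class_weights(labels):
    """Weight each class inversely to its frequency: the minority (positive)
    group gets a large weight, the majority (negative) group a small one."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / (len(counts) * n) for cls, n in counts.items()}

# Nine majority (negative) instances versus one minority (positive) instance.
labels = ["negative"] * 9 + ["positive"]
weights = class_weights(labels)
print(weights["positive"])            # 5.0    = 10 / (2 * 1)
print(round(weights["negative"], 3))  # 0.556  = 10 / (2 * 9)
```

Multiplying each training instance's loss by its class weight is one way to counteract the global measures that favour the majority; resampling the minority group is another.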
Big Data Analytics: BD brings many challenges in extracting meaning from this large, ever-increasing volume of data. For example, data analysis allows a company to obtain important insights and observe the patterns that may positively or negatively influence its business. Other data-driven applications require additional real-time analysis, such as social networks, biomedicine, astronomy and intelligent transport systems. Therefore, advanced algorithms and efficient data mining approaches are required to obtain correct outcomes, track the changes in different areas and make future predictions with real-time responsiveness. Complexities are also observed when applying existing analytical solutions such as ML, deep learning, incremental approaches and granular computing [15]. The main issue for data analytics on BD relates to the volume of data: timeliness is the highest priority for some BD applications, and the main test is how to guarantee the timeliness of responses when the data being processed are huge [22].
Big Data Machine Learning: The primary aim of ML is to discover knowledge from either organized or unorganized data. ML presently serves as the backbone of many applications that rely on and produce part of the big data composition, ranging from search engines, recognition systems and aeronautics to military applications, to mention a few [15].
BD is an innovation that will fundamentally change the way information is grouped, kept, monitored and consumed by users, which in turn will change the way work is done. Several of the motivations for doing BD research are [22]:
Changing from the Classical Relational Database Management System (RDBMS): The RDBMS is still used by many enterprise information technology corporations. Today, however, data are unstructured and non-clustered; NoSQL keeps all the data with no clustering and describes them within the framework, in contrast to the RDBMS, which keeps data in fixed structures or tables.
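The contrast can be shown in a few lines using SQLite as a stand-in for an RDBMS and plain JSON documents as a stand-in for a NoSQL document store (illustrative only; the records and field names are invented):

```python
# Illustrative contrast (not from the paper): an RDBMS keeps data in fixed
# tables, while a NoSQL-style document store keeps records with no clustering
# into a predefined structure. Records and field names are invented.
import json
import sqlite3

# RDBMS style: every row must fit the declared table structure.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (name TEXT, age INTEGER)")
db.execute("INSERT INTO users VALUES (?, ?)", ("Alice", 34))
rows = db.execute("SELECT name, age FROM users").fetchall()

# Document style: records in one collection may differ in shape, so the
# second record can carry extra fields and omit "age" entirely.
documents = [
    {"name": "Alice", "age": 34},
    {"name": "Bob", "tweets": ["big data!"], "location": "Sintok"},
]
collection = [json.dumps(doc) for doc in documents]  # stored schema-free

print(rows)             # [('Alice', 34)]
print(len(collection))  # 2
```

Inserting Bob's irregular record into the `users` table would require either discarding his extra fields or altering the schema, which is precisely the rigidity that motivates NoSQL for unstructured BD.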
Managing Unstructured Data: BD has the capability to manage structured as well as unstructured data. In line with the data variety characteristic, BD encompasses both text and numbers (alphanumeric fields) and unstructured data; by utilizing NoSQL, BD can manage unstructured data.
Real Time Data Processing: In the future, information systems will need the ability to manage increasingly huge volumes of data at the velocity at which BD is created. The expression "near real time" is often used for existing information systems, but it is not sufficient. Real-time data management involves the ability to manage online data or sensor information as they are created.
Most Data are either User or Machine Created: Earlier, most data were created internally, within the firewall of an enterprise. However, present data are created either by end users or by machines, external to the bounds of the enterprise firewall.
There are many limitations in BD, some of which can be summarized as follows [29]. Firstly, the needed data are not always available because (i) the data simply do not exist, (ii) there is trouble with the holding phases, and (iii) various data platforms turn out not to be interoperable. Secondly, the main core of BD is pattern recognition; results from pattern analysis are significant because they demonstrate problems of security, risk and types of crime, but present data mining approaches will not be able to handle BD. Lastly, BD's ability to predict new data is limited, because prediction is established on old data, which consist of previous patterns.
5. Conclusion
It can be concluded that BD means any quantity of data, whether structured, unstructured or semi-structured, that cannot fit into a common processing system. Thus, BD needs special tools and technologies to handle it and can be characterized by the 5Vs. If a company wants to model its BD and gain benefits from it, it needs to design an architecture for its BD, which will require answers to questions related to the nature of the company. BD has many application fields in life and many platforms to handle it. In reality, BD also has many challenges and limitations in its implementation, as well as several motivations for performing BD research.
References
[1] José T and Juan R 2018 Data learning from big data Statistics and Probability Letters 136 15-19
[2] Mitchell I, Locke M, Wilson M and Fuller A 2012 The white book of big data (UK: Fujitsu Services Ltd.)
[3] Lisa A 2013 Big data marketing (New Jersey: John Wiley & Sons, Inc.)
[4] Bernard M 2016 Big data in practice: How 45 successful companies used big data analytics to deliver
extraordinary results (New Jersey: John Wiley & Sons, Inc.)
[5] Judith H, Alan N, Fern H and Marcia K 2013 Big data for dummies (New Jersey: John Wiley & Sons, Inc.)
[6] Maria B, Liyana S and Elaheh Y 2019 Big data adoption: state of the art and research challenges Information
Processing and Management 56
[7] Alessandro M, Sally D, Maureen M, Lee Q, David W, Lyndon S and Ana C 2018 Big data, big decisions:
the impact of big data on board level decision making Journal of Business Research 93 67-78
[8] Michele I, Elio M, Giuseppe M, Mario M and Carlo Z 2020 Fast and effective big data exploration by
clustering Future Generation Computer Systems 102 84-94
[9] Yinghao Y, Meilin W, Shuhong Y, Jarvis J and Qing L 2019 Big data processing framework for
manufacturing Procedia CIRP 83 661-64
[10] Jaime C, Pankaj S, Unai G, Erkki J and David B 2017 A big data analytical architecture for the Asset
Management Procedia CIRP 64 369-74
[11] Manyika J, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C and Hung B 2012 Big data: the next
frontier for innovation, competition, and productivity (McKinsey Global Institute)
[12] Seref S and Duygu S 2013 Big data: a review Proc. Int. Conf. on Collaboration Technologies and Systems
(CTS) (San Diego: CA/ USA IEEE) p 42
[13] Philip R 2011 Big Data Analytics TDWI Best Practices Report
[14] Karim M 2019 State of the art in big data applications in microgrid: a review Advanced Engineering
Informatics 42
[15] Ahmed O, Fatim-Zahra B, Ayoub A and Samir B 2018 Big data technologies: a survey Journal of King
Saud University-Computer and Information Sciences 30 431-48
[16] Ada B, Devis B, Valeria A, Massimiliano G and Alessandro M 2019 A relevance-based approach for big
data exploration Future Generation Computer Systems 101 51-69
[17] Ishwarappa K and Anuradha J 2015 A brief introduction on big data 5Vs characteristics and Hadoop
technology Procedia Computer Science 48 319-24
[18] Jean-Louis M and Soraya S 2016 Big data, open data and data development (ISTE Ltd and John Wiley &
Sons, Inc.)
[19] Abdulkhaliq A, Vlad K and Michael B 2017 Addressing barriers to big data Business Horizons 60 285-92
[20] https://courses.cognitiveclass.ai/courses/coursev1:BigDataUniversity+BD0101EN+2016_T2/courseware/407a9f86565c44189740699636b4fb85/12eab34ec218468995e4d06566ef4a32
[21] Archenaa J and Mary E 2015 A survey of big data analytics in healthcare and government Procedia
Computer Science 50 408-13
[22] Kirtida N and Abhijit J 2017 Role of big data in various sectors Proc. Int. Conf. on IoT in Social, Mobile,
Analytics and Cloud (I-SMAC) (Palladam/India IEEE) p 117
[23] Tom W, Nanlin J, Peter F and Joshua T 2019 A big data platform for smart meter data analytics Computers
in Industry 105 250-59
[24] https://www.ibm.com/analytics/us/en/technology/hadoop/mapreduce/#what-is-mapreduce.
[25] https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html.
[26] Saraladevi B, Pazhanirajam N, Victer P, Saleem M S and Dhavachelvan P 2015 Big data and hadoop-a
study in security perspective Procedia Computer Science 50 596-601
[27] https://www.sas.com/en_us/insights/big-data/hadoop.html.
[28] https://www.ibm.com/analytics/us/en/technology/hadoop/.
[29] Dennis B, Erik S and Bart S 2017 Big data and security policies: towards a framework for regulating the
phases of analytics and use of big data Computer Law & Security Review 33 309-23