A Framework for Improving Data Quality in Data
Warehouse: A Case Study
Taghrid Z. Ali
Higher Institute for Engineering
Libya
taghreed_zidan@hotmail.com
Tawfig M. Abdelaziz
Faculty of Information Technology
Benghazi University, Libya
tawfig.tawuill@uob.edu.ly
Salwa M. Elakeili
Faculty of Information Technology
Benghazi University, Libya
salwa.elakeili@uob.edu.ly
Abdelsalam M. Maatuk
Faculty of Information Technology
Benghazi University, Libya
abdelsalam.maatuk@uob.edu.ly
ABSTRACT
Nowadays, the development of data warehouses shows the importance of data quality in business success. Data warehouse projects fail for many reasons, one of which is the poor quality of data. Achieving high-quality data in data warehouses is a persistent challenge. Data cleaning aims at finding and correcting data errors and inconsistencies. This paper presents a general framework for the implementation of data cleaning according to the scientific principles followed in the data warehouse field. The framework offers guidelines that define and facilitate the implementation of the data cleaning process for enterprises interested in the data warehouse field. The research methodology used in this study is qualitative research, in which the data are collected through interviews with system analysts. The study concluded that a low level of data quality is an obstacle to any progress in the implementation of modern technological projects, since data quality is a prerequisite for the success of an enterprise's business, including the data warehouse.
Keywords
Data warehousing, data quality, data cleaning
1. INTRODUCTION
The importance of data warehouse has emerged with the presence of
major institutions with multiple fields of work, as each business field
manages its databases (administrative, financial, marketing, etc.),
which includes numerous data shared with other fields. A data
warehouse can be described as a large database that includes millions,
or billions, of data records designed to support enterprises' decision-
making. It also allows institutions to organize, update, and coordinate
their data, and to show the relationships between the information
gathered from their various departments. Data warehouse requires
data cleaning to be done as data is collected from various sources and
then cleaned and updated before being loaded into the data
warehouse. Data cleaning is the process of detecting and correcting data errors, and replacing missing data, in a database. The data warehouse mechanism
works by extracting data from the different databases intended to be
integrated, and then they are cleaned and transformed into a unified
form so that they are consistent with each other. Data is extracted,
transformed, and loaded across a series of logical phases. These
phases are integrated into a process known as Extract, Transform and
Load (ETL). The data cleaning step is the first challenge in data
warehousing. Existing studies have estimated that around 40% of the data collected from various sources is contaminated in one way or another [1]. Data contamination is a problem that exists in virtually any system, since the share of ideal data is estimated at only 5% [2]. To meet their data quality needs, enterprises may resort to cleaning the data at its sources, which can be difficult to accomplish. Such a process is conducted by manual or technological methods, which may be hard to apply due to their complexity, high cost, and the problems associated with them. This research aims to develop a framework for a system that includes appropriate methods and procedures for cleaning data, in order to reduce data quality problems as much as possible and to improve and maintain data warehouse efficiency. Besides, the study would help
enterprises to identify and clarify the impact of some of the sources
of data quality problems and ways to address and reduce them. This
paper investigates several existing methods, strategies and data
quality-oriented data warehouse processes. The importance of this
study is derived from the following points:
1. The study contributes to knowledge in the area of data warehouses.
2. It increases the effectiveness and efficiency of work in enterprises by improving the quality of their data.
3. It reduces expenses related to data quality problems.
4. It supports the use of various modern technologies that focus on cleaning data within their field of work.
5. It provides a general framework that other enterprises can use as a guide.
The remainder of this paper is organized as follows. Section 2 gives a
brief background and related work about data quality and ways to
improve it in the data warehouse. Section 3 describes the proposed
framework. Section 4 explains the case study outcomes. Finally,
Section 5 concludes the paper and provides some recommendations.
2. RELATED WORK
A classification of data quality issues into single-source and multi-source problems, and into schema-level and instance-level concerns in data sources, is presented in [1]. The study discussed the important steps for data conversion and data cleaning and highlighted the need for comprehensive reporting of schema- and instance-related data transformations. Furthermore, it presented a description of industrial techniques for data cleaning. Several subjects that require further study are mentioned; in particular, further research is required on designing and implementing a suitable language to support both schema and data transformations.
A study that aims to analyze the challenges, solutions, and approaches for data cleaning is presented in [2]. It described the different types of inconsistencies that arise in data and need to be avoided, and the authors establish a set of quality criteria that comprehensively cleansed data has to satisfy. Based on this classification, existing data cleaning methods are analyzed and assessed with respect to the kinds of inconsistencies they address and remove. The various steps in data cleaning are defined, the approaches used in the cleaning process are identified, and an outlook on research directions that complement current systems is provided.
The work described in [3] emphasizes the importance of data warehousing, which embraces technologies for integrating data from various distributed data sources and using those data in annotated and aggregated form to assist enterprises in decision-making and information management. While many data warehouse techniques have been explored or newly developed, including view maintenance and Online Analytical Processing (OLAP), little consideration has been paid to data mining strategies that could support the most significant and expensive data integration activities in data warehouse design.
An analysis of the data cleaning problem and a discussion of possible errors in data sets are presented in [4]. A survey and analysis of the different perspectives of data cleaning and a brief description of existing data cleaning methods are given. Furthermore, a general data cleaning process framework is introduced, as well as a set of general approaches that can be used to deal with the issue. The techniques used include pattern matching, statistical outlier analysis, clustering, and data mining techniques. Besides, the experimental findings of applying these techniques to a real-world data set are presented.
The study in [5] introduces a taxonomy that is used to identify costs
associated with the effects of low-quality data as well as the cost of
enhancing and ensuring continuing data quality. Moreover, a method
for assessing the importance of data quality for enterprises is
discussed. Ultimately, a data governance model is introduced that focuses on three basic interconnected aspects, namely individuals, procedures, and data, where any effort to improve the quality of data in an institution should concentrate on these three essential components.
The work described in [6] emphasizes the consequences of poor-quality data and describes their relationship to the effort required to maintain data quality. It reflects on how the desired level of data quality can be determined. As the case study shows, the principles found in the study can be utilized to determine the appropriate data management strategy and the costs of poor-quality data.
The research presented in [7] discussed current methods, solutions
and data quality-oriented data warehousing frameworks for designing
and developing data cleaning. A novel framework has been presented
based on this study, which intends to address two concerns: to reduce
data cleaning time and to increase the degree of efficiency in data
cleaning. This framework preserves the most positive attributes of
current solutions to data cleaning and maintains the ability to improve
data cleaning efficiency in data warehouse applications. The study suggests a range of further work, which includes: a) analyzing additional features of data quality measurements to establish a comprehensive guide for evaluating the suitability of a specific data cleaning technique in data warehouses; b) constructing a complete data cleaning tool based on the framework described in the paper; and c) testing the system by applying it to larger multi-source data sets.
The research in [8] introduced a Data Quality Framework for Classification Tasks (DQF4CT) to overcome data quality problems. The approach consists of two parts: a conceptual framework that offers user guidance on how to address data issues in classification tasks, and an ontology that represents data cleaning knowledge and proposes suitable data cleaning solutions. The approach was demonstrated in two case studies using real data sets: physical activity monitoring (PAM) and occupancy detection of an office room (OD).
3. THE PROPOSED DATA CLEANING
FRAMEWORK
The research methodology used in this study is qualitative research,
where the data is gathered by doing interviews with system analysts
and users, and by observing the course of action in the information
systems units used within the targeted work areas in an enterprise.
The purpose of this study is to propose a set of processes and
procedures to facilitate the data cleaning and to achieve an
improvement in the data quality level required for data warehouse
work. The proposed framework was conceived by developing an initial conceptualization of the system, based on how data warehouses operate as defined by many existing international enterprises. The framework for data cleaning is divided into four phases: Phase 0 concerns data quality assessment, Phase I covers improving data quality, Phase II concerns data cleaning using the ETL process, and Phase III is for evaluating the process, as shown in Figure 1. Each of these phases consists of several key activities that have the greatest impact on the quality of data and its operating environment. The following sections describe these phases and their activities.
[Figure 1 depicts the framework's phases and activities: Phase 0: Data Quality Assessment; Phase I: Improving Data Quality (Roles Distribution, Data Management, Staff Training, Data Problems Addressing, Common Data Concepts Establishment); Phase II: Data Cleaning using the ETL Process (Clean Single-source Data, Clean Multiple-source Data); Phase III: Performance Evaluation Process; supported by Data Profiling, Data Cleaning, Data Integration and Enhancement, Data Reporting and Monitoring, Documentation and Publishing, and Continuous Data Quality Improvement.]
Figure 1: The Basic Phases of the Data Cleaning Framework
Phase 0: Data Quality Assessment
This phase starts before implementing the framework by assessing the current status of the data used by the information systems at the enterprise. This is achieved by reviewing a range of factors closely related to the quality and safety of data, which include the following:
Data efficiency and effectiveness.
Data entry.
Data security and integrity.
Financial potential.
Resistance to change.
Administrative change.
Planning.
Powers granted to information systems.
Data quality check.
Data management.
Design of information systems.
Documentation of the information system.
Training courses.
3.1 Phase I: Improving Data Quality
The following subsections describe the activities of Phase I.
3.1.1 Roles Distribution
We begin with the first activity where we distribute tasks and assign
responsibilities in the context of clear obligations that fulfill the
framework's goals to ensuring that the operations are conducted
effectively and efficiently. This requires the identification of main
roles and responsibilities, add new positions for the enterprise and
assign them to different members. The following roles are necessary
for the framework.
System Senior Management: represents the highest authority and is responsible for the development and maintenance of the system policy; it ensures the system's proper functioning and is, in practice, part of the enterprise's executive management.
Administrative Coordinator: performs all the
administrative actions relevant to the framework's tasks and
operations.
Data Manager: responsible for managing the entire data
quality of the information systems and is responsible for all
data management operations such as data collection, access,
handling, use and deletion.
Data Administrator: a domain expert, responsible for defining and maintaining data quality standards within the data set for which they are responsible, in a manner that guarantees data quality.
Database Administrator: a professional with comprehensive administrative experience, responsible for developing, implementing, and operating the database system, and for establishing and defining policies and procedures for setting up, handling, running, and using the database management system. The specialist also aims to keep up with the latest database architecture techniques and methods.
Data Entry Clerk: responsible for entering data into the information system.
3.1.2 Data Management
The data management department refers to a group that is responsible
for managing business data as a critical resource for the enterprise.
This department is responsible for developing, managing policies,
procedures, plans and processes within the enterprise to define, clean,
protect, and efficiently use data. Enterprise data management assists
in the development of long-term database design plans, structures,
policies and laws. The data management process involves assigning a group of employees and domain experts to perform data management activities, policy development, technology selection, control, and data analysis planning, and setting up a special department known as the data management department.
Data quality needs support from the enterprise structure, which
includes specialists from all main enterprise sectors. A data
management department must be established to start data
management activities within the proposed system. This department
manages the data used in the specified area of work for information
systems. The data manager heads the new department and supervises its employees (data administrators, database administrators, and data entry clerks).
At this point, we have laid the foundation on which data management tasks for the targeted business areas of the enterprise can be performed, so data quality improvement activities can start and data issues can be addressed. An important initial step to be considered here is a clear and
complete understanding of the current state of data quality in an
enterprise. Data quality assessment tools enable identifying and addressing data issues and their causes, reducing effort, improving the
extent and speed of data analysis, and helping to create a
comprehensive understanding of data quality levels. Data quality
measurement and evaluation techniques provide a certain level of
examination using the data quality business rules. These are standards
developed by industry experts or analysts of information systems to
evaluate the data quality level. Data quality rules provide a way to
describe what is expected from a data quality perspective. These rules
are used to differentiate between valid and invalid data. They are
integrated into the data quality assessment and measurement software
tools to compare the data in the source database with the data quality
business rules. Data that violates business rules are then changed to
comply with these established rules. The purpose of applying these
rules is to produce data quality reports. Once the reports are available,
industry experts check the data quality information documents and
figure out details about data quality issues. The specialists correct the issues and re-check the data to confirm that no problems remain, or that the remaining issues fall within an acceptable tolerance level. The data are then passed to the final, quality-assured database.
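To make the idea of data quality business rules concrete, the sketch below shows one possible way such rules could be encoded and applied to source records, separating valid from invalid data and producing a simple quality report. The rule set, field names, and thresholds are illustrative assumptions, not the tooling prescribed by the framework.

```python
# A minimal, illustrative sketch of data quality business rules
# (assumed field names and thresholds; not a specific vendor tool).
import re

# Each rule: (rule name, predicate over a record dict)
RULES = [
    ("student_id is present", lambda r: bool(r.get("student_id"))),
    ("grade in 0..100",       lambda r: r.get("grade") is not None and 0 <= r["grade"] <= 100),
    ("email looks valid",     lambda r: re.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", r.get("email") or "") is not None),
]

def assess(records):
    """Split records into valid/invalid and count violations per rule."""
    valid, invalid = [], []
    violations = {name: 0 for name, _ in RULES}
    for rec in records:
        failed = [name for name, check in RULES if not check(rec)]
        for name in failed:
            violations[name] += 1
        (invalid if failed else valid).append(rec)
    return valid, invalid, violations

if __name__ == "__main__":
    sample = [
        {"student_id": "S001", "grade": 87,  "email": "a@uob.edu.ly"},
        {"student_id": "",     "grade": 105, "email": "not-an-email"},
    ]
    valid, invalid, report = assess(sample)
    print(f"valid={len(valid)}, invalid={len(invalid)}")
    for rule, count in report.items():
        print(f"  {rule}: {count} violation(s)")
```

Records that violate a rule would be routed to the correction step described above, while the per-rule counts feed the data quality reports that the experts review.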
In some cases, a particular business field cannot find a solution to data
problems. For example, the required solution may be outside the
administrative authority granted to the data manager within a specific
area of work or it may require changes in procedures, policies, and
processes throughout the enterprise as a whole to solve the data
problem. Such problems must be solved collectively, through discussion and mutual exchange between data managers at the enterprise level, the framework's senior management, and any other parties that may contribute to the solution, if necessary.
The right data management structure starts with the development of
a common set of data management technologies [9] capable of
supporting and automating various data quality improvement
processes whenever possible. This is to reduce the cost of time and
effort to improve data quality. The work areas of the software tools
can be divided into five main groups, as shown in Figure 1:
1. Data profiling: Analyzes characteristics of target source
data to evaluate and understand situations within data
quality rules.
2. Data cleaning: Corrects data errors and establishes data
integrity and consistency standards within the data set.
3. Data integration and data enhancement: Integrates,
consolidates the data relationship between varieties of
sources and improves data to make it more compatible.
4. Data Reporting and monitoring: The enterprise needs to
track data quality over time and assess and measure the
results of data quality changes. This task is performed by
data monitoring and reporting tools. Reports provide and
monitor data quality issues such as data quality reporting,
data breach alerts, approved databases, and detailed
analysis of data breaches.
5. Documentation and publishing: The framework adopts the
idea of creating a website as a means of documenting and
disseminating data quality issues and as a means of
communication for all stakeholders within the enterprise.
All new work and activities within the enterprise's data cleaning framework are published on it, and experience with data quality is continuously shared across the enterprise.
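As a small illustration of the first group (data profiling), the sketch below computes a few per-column statistics (fill rate, distinct values, most frequent values) that analysts could compare against the data quality rules. The column names and the statistics chosen are assumptions made for illustration only.

```python
# Illustrative data profiling sketch: per-column completeness and cardinality.
from collections import Counter

def profile(records, columns):
    """Return basic profile statistics for the given columns."""
    total = len(records)
    stats = {}
    for col in columns:
        values = [r.get(col) for r in records]
        non_null = [v for v in values if v not in (None, "")]
        stats[col] = {
            "fill_rate": len(non_null) / total if total else 0.0,  # completeness
            "distinct": len(set(non_null)),                        # cardinality
            "top_values": Counter(non_null).most_common(3),        # frequent values
        }
    return stats

if __name__ == "__main__":
    rows = [
        {"department": "IT", "city": "Benghazi"},
        {"department": "IT", "city": ""},
        {"department": "Finance", "city": "Benghazi"},
    ]
    for col, s in profile(rows, ["department", "city"]).items():
        print(col, s)
```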
3.1.3 Staff Training
Enterprise senior management should establish a center for professional and technical training and development, offering a training environment equipped with the required technological and training equipment. It should also
develop a long-term training plan for the development of information
systems employees. The plan's purpose is to provide trainees with
information technology knowledge to recognize the types of
technologies used and to provide them with the relevant skills
necessary to successfully perform their duties. The framework's
senior management also needs to develop the implementation
framework for this plan along with the timeline, and determine the
human and material resources needed to effectively execute it.
Furthermore, the team that will implement this plan should be
identified.
3.1.4 Data Problems Addressing
The framework introduces a set of rules, instructions and procedures
to maintain the validity and consistency of data elements, applied during the design phase to improve data quality. To
improve the quality of the data, the focus must be placed on certain
aspects of the information system design process, which has a
negative or positive effect on the quality of the data contained in the
system.
In this activity of the framework, we briefly present the problems of
data quality related to database design, the reasons that lead to its
occurrence and ways to address it. We also explain a set of design
standards related to designing effective input interfaces to the system,
as the application of these standards would lead to the production of
a design that supports the required data quality.
3.1.4.1 Addressing database design issues:
Most applications within the enterprise have been developed for
specific purposes and sections and have been kept separate from other
applications from the beginning. This, in turn, leads to a significant
degree of inconsistency in their respective components and especially
the components of their data. Quality problems related to database
design are grouped into two sections: single-source problems, and
problems when combining multiple sources of data. Problems within
each are grouped into two levels, one at the Schema-level database,
and another at the instance-level of the database. The following is a
brief explanation of the above problems and the proposed solutions to
them.
Single-source problems
a. Data quality problems at the schema level are caused by poor schema design and by a lack of integrity constraints enforced on the data.
b. Data quality problems at the instance level are caused by data entry errors.
The suggested solution to these problems is to create an appropriate design for both the database schema and the data integrity constraints and, in addition, to design robust input interfaces that support data quality.
Multi-source problems
a. Data quality problems at the schema level are caused
by the difference between data models and schema
designs. This type of data problem occurs because of
the heterogeneous data models and schema designs
across enterprise applications between different data
sources.
b. Data quality problems at the instance level are caused by inconsistency of the data.
The suggested solution to these issues is to match entity definitions across the different data sources and to perform the data cleaning process; in addition, the database schemas of the different data sources have to be integrated.
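To illustrate what matching entity definitions and integrating schemas across data sources might involve in practice, the sketch below normalizes two source schemas to a common set of attribute names and then groups records that likely refer to the same entity. The field mappings, source names, and matching key are simplifying assumptions, not the method prescribed by the framework.

```python
# Illustrative sketch: map heterogeneous source schemas to a unified form and
# detect likely duplicate entities across sources (assumed field mappings).

# Per-source mapping from local column names to the unified schema.
SCHEMA_MAP = {
    "registry_db": {"StudNo": "student_id", "FullName": "name", "BirthDate": "dob"},
    "finance_db":  {"sid": "student_id",    "student":  "name", "date_of_birth": "dob"},
}

def to_unified(source, record):
    """Rename a source record's fields according to the unified schema."""
    mapping = SCHEMA_MAP[source]
    return {unified: record.get(local) for local, unified in mapping.items()}

def merge_sources(batches):
    """batches: iterable of (source_name, list_of_records).
    Returns unified rows grouped by a simple matching key (student_id)."""
    by_key = {}
    for source, records in batches:
        for rec in records:
            row = to_unified(source, rec)
            by_key.setdefault(row["student_id"], []).append((source, row))
    return by_key

if __name__ == "__main__":
    merged = merge_sources([
        ("registry_db", [{"StudNo": "S001", "FullName": "A. Ali", "BirthDate": "1999-01-05"}]),
        ("finance_db",  [{"sid": "S001", "student": "Ali, A.", "date_of_birth": "05/01/1999"}]),
    ])
    for key, rows in merged.items():
        names = {r["name"] for _, r in rows}
        print(key, "conflicting representations" if len(names) > 1 else "consistent", names)
```

Conflicting representations detected this way are exactly the instance-level inconsistencies that the cleaning process must resolve.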
3.1.4.2 Solving the problems related to designing data entry forms:
The most common cause of data inaccuracy is manual data entry, and the complexity of data input types also raises data input issues. When safety constraints apply to the data to be entered, the input interfaces will prevent the user from entering data that violates these restrictions. Users may then resort to new ways of entering data that circumvent the system: data that formally satisfies the constraints but is incorrect or inaccurate for the intended purpose. Entering data through forms or electronic interfaces of web-based information systems is also problematic; users often tend to find the easiest way to complete the form, even if that means an intentional error. Well-designed electronic or paper data entry forms, together with the instructions that accompany them, can reduce some kinds of data entry problems. However, manual data entry must be recognized as a cause of data quality problems. In this part of the framework, we explain the principles for the correct design of input forms, whether paper or electronic, which help the system analyst design efficient system inputs. The data entry user interface must satisfy a set of conditions to reach the required level of design quality, as follows (a validation sketch follows the list below):
1. Effectiveness: ensures that both paper and electronic
input types fulfill the task for which they were effectively
prepared.
2. Data accuracy: means that a form for data entry can be
properly filled in.
3. Easy entry: indicates the user's ability to use the input forms directly, without spending much time understanding how they work or how to fill them in.
4. Regularity: means directing the user's attention by keeping the interface elements in an orderly design.
5. Attractiveness: gives the user a feeling of comfort and pleasure when using the form.
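As a small illustration of how an electronic entry form can enforce safety constraints at the point of entry, the sketch below validates a hypothetical student-record form before accepting it. The fields and constraints are assumptions made purely for illustration.

```python
# Illustrative sketch of field-level validation for an electronic entry form.
# Field names and constraints are hypothetical.
from datetime import datetime

def validate_entry(form):
    """Return a list of error messages; an empty list means the entry is accepted."""
    errors = []
    if not form.get("student_id", "").strip():
        errors.append("student_id is required")
    if len(form.get("name", "").strip()) < 3:
        errors.append("name must contain at least 3 characters")
    try:
        dob = datetime.strptime(form.get("dob", ""), "%Y-%m-%d")
        if dob.year < 1900 or dob > datetime.now():
            errors.append("dob is out of the accepted range")
    except ValueError:
        errors.append("dob must use the YYYY-MM-DD format")
    return errors

if __name__ == "__main__":
    print(validate_entry({"student_id": "S001", "name": "Ali", "dob": "1999-01-05"}))  # []
    print(validate_entry({"student_id": "", "name": "A", "dob": "5/1/99"}))            # 3 errors
```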
3.1.5 Common Data Concepts Establishment
The effective implementation of various information systems depends on several factors, including a proper understanding of how data is used in those systems, which in turn requires an understanding of the definitions of data, the data design, and the enterprise data. The problem of incoherent information systems arises because each business domain in the enterprise builds its applications in isolation from other applications. By sharing metadata, however, all applications can rely on the same data descriptions, so the enterprise will have structured standards, common data descriptions for all applications, and consistent data management. The data dictionary is
just one step towards creating a common understanding of all the
enterprise data elements. It is a reference that contains data describing
the data or what is called Metadata, referring to all data processing
operations, data warehouses, data flows, data structures, system
logical and physical data elements. The enterprise data dictionary is a
single tool to help ensure the accuracy and consistency of the data.
One of the important reasons for maintaining a data dictionary is to keep the data clean, meaning that every system follows the same data definitions. Data management performs best when it works alongside the data dictionary, which is the most important and most accurate information resource on data operations for all the enterprise's employees.
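A data dictionary entry of the kind described above could be represented in many ways; the sketch below shows one minimal possibility, where each data element carries its definition, type, system of record, and the quality rules attached to it. The structure and field names are assumptions for illustration.

```python
# Minimal illustrative data dictionary entry (metadata about a data element).
from dataclasses import dataclass, field
from typing import List

@dataclass
class DataElement:
    name: str                 # logical name shared across applications
    definition: str           # agreed business definition
    data_type: str            # e.g. "string", "date", "integer"
    source_system: str        # system of record for this element
    quality_rules: List[str] = field(default_factory=list)  # attached business rules

# A tiny dictionary keyed by element name, shared by all applications.
DATA_DICTIONARY = {
    "student_id": DataElement(
        name="student_id",
        definition="Unique identifier assigned to a student at registration",
        data_type="string",
        source_system="registry_db",
        quality_rules=["must be present", "must be unique"],
    ),
}

if __name__ == "__main__":
    elem = DATA_DICTIONARY["student_id"]
    print(elem.name, "-", elem.definition, "| rules:", elem.quality_rules)
```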
3.2 Phase II: Data Cleaning using the ETL Process
During this phase, the data is prepared and cleaned again using the ETL process to remove the errors remaining from the applications, before the cleaned data is transferred to the data warehouse. The data cleaning process is split into two steps.
Step 1: Clean single-source data: data cleaning is applied to each data source on its own.
Step 2: Clean multiple-source data: data cleaning is then applied across the multiple data sources together.
The methodology referred to in [7] is also characterized by the use of
two different methods (Auto-cleaning process and Semi-auto-
cleaning process) to address data quality issues. In this methodology,
the appropriate timeliness standard was chosen as the primary
criterion for performing the cleaning process where data error types
are determined based on the time taken to perform the cleaning of the
data. To contribute to optimum efficiency and performance when
cleaning is complete, the selected criteria can be integrated into the
data cleaning process. According to the methodology in [7], the data
was cleaned in two stages: the first stage addresses errors in the single
data source and the second stage addresses errors in multiple data
sources. Each stage has two processes on data sources (Auto-cleaning
process), and (Semi-auto-cleaning process). Errors are detected and
automatically removed or corrected using appropriate algorithms in
the process of automated processing, without any user intervention.
The semi-automatic process then addresses the data errors remaining after automated processing is completed. This type of error cannot be removed or corrected without the domain expert's intervention: the errors are only detected and identified by algorithms, while the data cleaning operator handles them. Accordingly, this phase of the proposed framework is completed by adopting the approach in [7], in which the methodology addresses the data contamination problem.
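The sketch below illustrates, under heavy simplification, how the two cleaning passes described above could be organized: an automatic pass that corrects errors it can handle by rule, and a semi-automatic pass that only flags the remaining suspect records for a domain expert. It is not the implementation from [7]; the specific checks and the notion of routing records for review are illustrative assumptions.

```python
# Illustrative two-pass cleaning sketch (auto pass, then semi-auto pass).
# The specific checks are assumptions; [7] defines its own algorithms.

def auto_clean(records):
    """Automatic pass: fix errors that can be corrected by rule, without a user."""
    cleaned = []
    for rec in records:
        rec = dict(rec)
        rec["name"] = (rec.get("name") or "").strip().title()      # normalize spacing/case
        if isinstance(rec.get("grade"), str) and rec["grade"].isdigit():
            rec["grade"] = int(rec["grade"])                        # repair obvious type errors
        cleaned.append(rec)
    return cleaned

def semi_auto_clean(records):
    """Semi-automatic pass: detect remaining suspect records and route them to an expert."""
    accepted, for_review = [], []
    for rec in records:
        grade = rec.get("grade")
        if not isinstance(grade, int) or not (0 <= grade <= 100):
            for_review.append(rec)          # an expert must decide how to correct this
        else:
            accepted.append(rec)
    return accepted, for_review

if __name__ == "__main__":
    raw = [{"name": "  ali  ", "grade": "87"}, {"name": "omar", "grade": 140}]
    accepted, review = semi_auto_clean(auto_clean(raw))
    print("accepted:", accepted)
    print("needs expert review:", review)
```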
3.3 Phase III: Performance Evaluation Process
The performance of the framework is assessed by the top management
and managers responsible for the targeted work areas of the
enterprise, using documented material on the level of data quality and
the results achieved. Reports on data quality provide managers and all
data staff with an explanation if the data meets or exceeds the
acceptable level of data quality. These reports also evaluate the effectiveness of the proposed framework's performance and reflect whether some processes or methods need to be re-evaluated. If the level of data quality reached after a predetermined period indicates an improvement, this demonstrates the efficiency and success of the methods that were adopted, and vice versa. It should be noted that the timeframe required to evaluate the framework should be taken into account: a timetable should be set out indicating the time required to complete the tasks of resolving data problems, so that the framework is not evaluated before the necessary time has been given to address those problems.
4. THE CASE STUDY
This research was implemented to study the problems of data quality in the data used by the information systems installed in the faculties of the University of Benghazi.
4.1 Data Quality Assessment and Operation
Environment
This section describes the current status of the data used by the information systems at the University of Benghazi, which was reviewed against a range of factors closely related to the quality and safety of data.
Data efficiency and effectiveness: A range of problems has been noted in the data used in the information systems, such as data being incomplete, inaccurate, and not available to decision-makers promptly. These problems indicate the data's inefficiency and ineffectiveness in accomplishing various tasks.
Data entry: Most data entry rooms in information systems
do not provide the appropriate environment for the
operation, which in turn adversely affects the accuracy and
validity of the data entry process. Besides, there is no
monitoring and follow-up of data entry work, as well as no
appropriate error correction mechanisms.
Data security and integrity: Non-specialists are assigned to data-entry tasks, which compromises the security and integrity of data, for example through the entry or manipulation of incorrect data.
Financial potential: The inability of the University's senior
management to provide the necessary financial means for
improving, maintaining and developing information
systems.
Resistance to change: Some managers and users of
information systems are concerned about changing and
developing their working style and rejecting what is
unusual for them.
Administrative change: Ill-considered changes of department heads have caused many problems, the most important of which is a lack of stability in the work environment, as work plans change with every change of department head, confusing staff, wasting time, and leaving plans without continuity.
Planning: There are no clear, informed plans for the senior
management of the University to follow, regarding the
improvement of the quality of the data.
Powers granted to information systems: Some managers exceed the pre-defined powers of the information systems and modify data without complying with the conditions and restrictions that should apply to the data. This causes severe confusion in data entry processes and multiple modifications.
Data quality check: Lack of tools and techniques to check
data quality, and solve data problems.
Data management: The lack of specialized departments
within the targeted areas of data management work
exacerbates their problems.
Design of information systems: Information systems used
by the University are individual and separate applications,
designed independently, which has led to inconsistency of data elements between information systems and to their duplication.
Documentation of the information system: The documentation process is neglected, whether for documenting users' data or for the documentation that explains how to use the system and how to resolve the problems that might be faced.
Training courses: The management neglects plans for developing and training staff, so employees rely on their personal experience with computers.
4.2 Impact of Data Quality Problems
From the study and analysis of previous points, we conclude that data
in information systems and their operating environment are
problematic, with several negative effects, which can be summarized
as follows:
Impact on data quality: Low data quality within targeted areas
of work.
Impact on the performance of information systems: The
quality of data entered into the information system determines
the quality of the resulting data, so it can be argued that the
desired results of the performance of information systems are
below the required level, as a result of the low quality of their
inputs.
Impact on decision-making: The low level of data quality is an obstacle to sound and accurate decision-making.
Impact on cost: Subsequent remediation of data quality problems, whether through the improvement of legacy information systems or the implementation of new systems when the legacy systems prove useless, leads to a significant increase in material costs and wastes time and effort on finding temporary solutions to permanent problems.
The results of the study confirm that there are several problems in the data environment, an unequivocal indication of the low level of data quality in the work areas of the University of Benghazi. This is an obstacle to any progress in the application of modern technology projects, since the quality of data is a prerequisite for the success of such work, including the data warehouse.
The study revealed that many factors have negatively affected the data quality level, including poor infrastructure, lack of strategic planning, the rigidity of the current organizational structure, the lack of investment in developing and staffing data quality work, poor understanding of the value of data quality, and lack of employee interest. The study also indicated that the data cleaning process can be made easier and less complicated when planning and preparation for cleaning are carried out in advance and when the cleaning is performed in stages rather than in one batch; otherwise, subsequent cleaning in the data warehouse's ETL process can be impeded.
5. CONCLUSIONS
5.1 Conclusion
In this study, we have proposed a framework for cleaning data, which
includes many procedures and methods aimed at achieving
improvement in the level of data quality, including assessing,
improving and monitoring the level of data quality and solving its
problems within each field of work separately. A set of basic tools
and techniques has been proposed to build the necessary technical
infrastructure to improve the level of data quality within the
enterprise. We emphasize developing the skills of employees by setting up a specific training mechanism and by creating a website that deals with data quality issues. The design of information
systems was discussed, because the methods used to design
information systems in the enterprise often focus on providing
operational needs for them, and ignore the procedures for improving
the quality of the data. We believe that achieving data quality requires reformulating the viewpoint on the design of information systems, so that data quality is achieved alongside the operational aspects, rather than the functional needs of the system being addressed in isolation. A unified and common
base has been established to develop unified and common concepts of
data at the level of various applications in all areas of the enterprise's
work by sharing metadata between them. In the last stage of the
framework, we explained how to improve the efficiency of data
cleaning performance within the ETL process before moving to its
final destination, which is the data warehouse. We also clarified how the framework's performance can be evaluated: we relied on the reports received on the level of data quality and assessed them in terms of whether what was expected was accomplished.
We applied the proposed framework to the University of Benghazi as a case study, reviewed the problems found there, such as data quality problems and operating environment problems, and explained their impact on the work. We then established a link between the methods and procedures put forward within the framework and the problems from which the case study suffers.
We believe that the data cleaning process can be accomplished more easily and with less complexity when planning and preparation for the cleaning process are done in advance, and when it is carried out in stages instead of all at once in the ETL process.
5.2 Recommendations
In light of the findings, we offer a set of recommendations that support long-term improvement in the quality of data and increase the chances of success of the proposed framework, if adopted by an organization, as follows:
1. We recommend applying the proposed framework at the University of Benghazi and having it adopted by the higher management of the organization, once the assessment and improvement of data quality are guaranteed and the operating environment necessary to implement the proposed framework is well prepared.
2. The implementation of a comprehensive and consistent data quality plan is significant in moving the enterprise from merely reacting to and fixing data issues to proactively controlling data issues and limiting the presence of defects in the data environment.
3. Apply the principle of preventing data defects rather than fixing them afterwards. This requires data quality and integrity to be achieved from the outset, which will save much subsequent correction of data errors.
4. The need to work to establish a culture of data quality, for
employees and all administrators in the enterprise.
5. Reduce resistance to change, both from employees and from the management leadership of the enterprise, by engaging them in the process of improving data quality and enhancing their understanding of the quality concept.
6. Scientific bases and controls should be established for selecting administrative leaders, especially those working in the data field, and the relative stability of these leaders should be maintained, so that sudden and unplanned changes do not alter work plans and methods in ways that negatively affect the work.
7. Attract scientific competence in strategic planning for institutions, experienced professionals, and those qualified to improve data quality, by providing them with full material and moral support to improve the scientific and productive reality of the enterprise. The availability of qualified personnel is critical and a prerequisite for the success of the framework.
8. Detailed and targeted plans that suit the nature and capabilities of the enterprise are needed to address data quality problems, serving as a roadmap that managers remain committed to implementing even in the event of a management change. It is recommended to start with initial short-term plans and then develop and extend them.
9. Senior management of an enterprise has to review and
adjust its organizational structures to suit new business
variables.
10. Pre-conceptualization of information systems design must
be developed to have pre-defined and unified standards at
the level of all applications at the university.
11. Allocation of data entry rooms for information systems,
with the necessary technology and equipment, so that data
entry can be accomplished with minimal errors and with
sufficient efficiency.
12. Improve and expand computer networking for information
systems used at the University of Benghazi, so that better
data exchange services can be provided between the various
work units and departments of the University.
13. To make use of information systems, whether
administrative, support or expert, which are an essential
component of modern institution-building, to provide
information to decision-makers as needed with the speed,
quantity and accuracy required. A gradual transition to the
use of new technologies to improve data quality is also
recommended.
14. Data personnel must be trained, developed, and qualified so that they are able to use modern technologies.
6. REFERENCES
[1] Maletic J.I. and Marcus A. 2009. Data Cleansing: A
Prelude to Knowledge Discovery. In: Maimon O., Rokach
L. (eds) Data Mining and Knowledge Discovery
Handbook. Springer, Boston, MA.
[2] Rahm E. and Do H. H. 2000. Data Cleaning: Problems and Current Approaches. In IEEE Data Engineering Bulletin, vol. 23, no. 4.
[3] Müller H. and Freytag J. C. 2003. Problems, Methods, and Challenges in Comprehensive Data Cleansing. Technical report, Humboldt University Berlin.
[4] Kalinka M. and Kaloyanova K. 2005. Improving data
integration for a data warehouse: a data mining approach. The
University of Sofia, Bulgaria. Available In:
http://www.nbu.bg.
[5] O’Brien T., Helfert M. and Sukumar A. 2012. Classifying
costs and effects of poor Data Quality examples and
discussion. In Annual Conference of Irish Academy of
Management, Maynooth, Ireland.
[6] Haug A., Zachariassen F. and Liempd D. V. 2011. The costs
of poor data quality. In Journal of Industrial Engineering and
Management, vol. 4, p. 171.
[7] Peng T. 2008. A Framework for Data Cleaning in Data Warehouses. In Proceedings of the Tenth International Conference on Enterprise Information Systems.
[8] Corrales D. C., Ledezma A. and Corrales J. C. 2018. From
Theory to Practice: A Data Quality Framework for
Classification Tasks. In Symmetry, vol. 10(7), 248.
https://doi.org/10.3390/sym10070248
[9] Ferguson, M. (2007). Data Ownership and Enterprise Data
Management: Leveraging Technology to Get Control of Your
Data (Part 2). A DataFlux White Paper, SAS Institute.