A Framework for Improving Data Quality in Data
Warehouse: A Case Study
Taghrid Z. Ali
Higher Institute for Engineering
Libya
taghreed_zidan@hotmail.com
Tawfig M. Abdelaziz
Faculty of Information Technology
Benghazi University, Libya
tawfig.tawuill@uob.edu.ly
Salwa M. Elakeili
Faculty of Information Technology
Benghazi University, Libya
salwa.elakeili@uob.edu.ly
Abdelsalam M. Maatuk
Faculty of Information Technology
Benghazi University, Libya
abdelsalam.maatuk@uob.edu.ly
ABSTRACT
Nowadays, the development of data warehouses shows the importance of data quality in business success. Data warehouse projects fail for many reasons, one of which is the poor quality of data. Achieving high-quality data in data warehouses is a persistent challenge. Data cleaning aims at finding and correcting data errors and inconsistencies. This paper presents a general framework for the implementation of data cleaning according to the scientific principles followed in the data warehouse field. The framework offers guidelines that define and facilitate the implementation of the data cleaning process for enterprises interested in the data warehouse field. The research methodology used in this study is qualitative research, in which the data are collected through interviews with system analysts. The study concluded that a low level of data quality is an obstacle to any progress in the implementation of modern technological projects, since data quality is a prerequisite for the success of an enterprise's business, including the data warehouse.
Keywords
Data warehousing, data quality, data cleaning
1. INTRODUCTION
The importance of data warehouse has emerged with the presence of
major institutions with multiple fields of work, as each business field
manages its databases (administrative, financial, marketing, etc.),
which includes numerous data shared with other fields. A data
warehouse can be described as a large database that includes millions,
or billions, of data records designed to support enterprises' decision-
making. It also allows institutions to organize, update, and coordinate
their data, and to show the relationships between the information
gathered from their various departments. Data warehouse requires
data cleaning to be done as data is collected from various sources and
then cleaned and updated before being loaded into the data
warehouse. Data cleaning is the process of detecting and correcting data errors, and replacing missing data, in a database. The data warehouse mechanism
works by extracting data from the different databases intended to be
integrated, and then they are cleaned and transformed into a unified
form so that they are consistent with each other. Data is extracted,
transformed, and loaded across a series of logical phases. These
phases are integrated into a process known as Extract, Transform and
Load (ETL). The data cleaning step is the first challenge in data
warehousing. Existing studies have estimated that around 40% of the data collected from various sources is contaminated in one way or another [1]. Data contamination is a problem that exists in virtually any system, since the share of ideal data is estimated at only 5% [2]. To meet their data quality needs, enterprises may resort to cleaning the data at its sources, which can be difficult to accomplish. Such a process is conducted by manual or technological methods, which may be hard to apply due to their complexity, high cost, and the problems associated with them. This research aims to develop a framework for a system that includes appropriate methods and procedures for cleaning data, in order to reduce data quality problems as much as possible and to improve and maintain data warehouse efficiency. Besides, the study would help
enterprises to identify and clarify the impact of some of the sources
of data quality problems and ways to address and reduce them. This
paper investigates several existing methods, strategies and data
quality-oriented data warehouse processes. The importance of this
study is derived from the following points:
1. The study contributes to knowledge in the area of data warehouses.
2. It increases the effectiveness and efficiency of work in enterprises by improving the quality of their data.
3. It reduces expenses related to data quality problems.
4. It supports the use of various modern technologies that focus on cleaning data within their field of work.
5. It provides a general framework that other enterprises can use as a guide.
The remainder of this paper is organized as follows. Section 2 gives a
brief background and related work about data quality and ways to
improve it in the data warehouse. Section 3 describes the proposed
framework. Section 4 explains the case study outcomes. Finally,
Section 5 concludes the paper and provides some recommendations.
2. RELATED WORK
A classification of data quality issues into single-source and multi-source problems, and into schema-level and instance-level concerns in data sources, is presented in [1]. The study discussed the important steps for data conversion and data cleaning and highlighted the need for comprehensive reporting of schema- and instance-related data transformations. Furthermore, it presented a description of industrial techniques for data cleaning. Several subjects that require further study are mentioned; in particular, further research is required on designing and implementing a suitable language to support both schema and data transformations.
A study that aims to analyze the challenges, solutions, and approaches for data cleaning is presented in [2]. It described the different types of inconsistencies that arise in data and need to be avoided, and the authors establish a set of quality criteria that comprehensively cleansed data has to satisfy. Based on this classification, existing data cleaning methods are analyzed and assessed with respect to the kinds of inconsistencies they address and remove. The various steps in data cleaning are defined, the approaches used in the cleaning process are identified, and an outlook on research directions that complement current systems is provided.
The work described in [3] emphasizes the importance of data warehousing, which embraces technologies for integrating data from various distributed data sources and using those data in annotated and aggregated form to assist enterprises in decision-making and information management. While many data warehouse techniques have been explored or newly developed, including view maintenance and Online Analytical Processing (OLAP), little consideration has been paid to data mining strategies that could support the most significant and expensive data integration activities in data warehouse design.
An analysis of the data cleaning problem and a discussion of possible errors in data sets are presented in [4]. A survey and analysis of the different perspectives of data cleaning and a brief description of existing data cleaning methods are given. Furthermore, a general data cleaning process framework is introduced, as well as a set of general approaches that can be used to deal with the issue. The techniques used include pattern matching, statistical outlier analysis, clustering, and data mining techniques. Besides, the experimental findings of applying these techniques to a real-world data set are presented.
The study in [5] introduces a taxonomy that is used to identify costs
associated with the effects of low-quality data as well as the cost of
enhancing and ensuring continuing data quality. Moreover, a method
for assessing the importance of data quality for enterprises is
discussed. Ultimately, a data governance model is introduced that focuses on three basic interconnected aspects, namely individuals, procedures, and data, where any effort to improve the quality of data in an institution should concentrate on these three essential components.
The work described in [6] emphasizes the consequences of poor-quality data and describes their relationship to the effort required to maintain data quality. It reflects on how the desired level of data quality can be determined. As the case study shows, the principles found in the study can be utilized to determine the appropriate data management strategy and the costs of poor-quality data.
The research presented in [7] discussed current methods, solutions
and data quality-oriented data warehousing frameworks for designing
and developing data cleaning. A novel framework has been presented
based on this study, which intends to address two concerns: to reduce
data cleaning time and to increase the degree of efficiency in data
cleaning. This framework preserves the most positive attributes of
current solutions to data cleaning and maintains the ability to improve
data cleaning efficiency in data warehouse applications. The study suggests a range of further work, which includes: a) analyzing additional features of data quality measurements to establish a comprehensive guide for evaluating the suitability of a specific data cleaning technique in data warehouses; b) constructing a complete data cleaning tool based on the framework described in the paper; and c) testing the system by applying it to larger multi-source data sets.
The research in [8] introduced a Data Quality Framework for Classification Tasks (DQF4CT) to overcome data quality problems. The approach consists of two parts: a conceptual framework that offers user guidance on how to address data issues in classification tasks, and an ontology that represents data cleaning knowledge and proposes suitable data cleaning solutions. The approach was demonstrated in two case studies using real data sets: physical activity monitoring (PAM) and occupancy detection of an office room (OD).
3. THE PROPOSED DATA CLEANING
FRAMEWORK
The research methodology used in this study is qualitative research,
where the data is gathered by doing interviews with system analysts
and users, and by observing the course of action in the information
systems units used within the targeted work areas in an enterprise.
The purpose of this study is to propose a set of processes and
procedures to facilitate the data cleaning and to achieve an
improvement in the data quality level required for data warehouse
work. The proposed framework was conceived by developing an initial conceptualization of the system, based on how data warehouses operate as defined by many existing international enterprises. The framework for data cleaning is divided into four phases: Phase 0 concerns data quality assessment, Phase I covers improving data quality, Phase II concerns data cleaning using the ETL process, and Phase III is for evaluating the process, as shown in Figure 1. Each of these phases consists of several key activities that have the greatest impact on the quality of data and its operating environment. The following sections describe these phases and their activities.
[Figure 1 depicts the framework's phases and activities: Phase 0: Data Quality Assessment; Phase I: Improving Data Quality (Roles Distribution, Data Management, Staff Training, Data Problems Addressing, Common Data Concepts Establishment); Phase II: Data Cleaning using the ETL Process (Clean Single-source Data, Clean Multiple-source Data); Phase III: Performance Evaluation Process; supported by Data Profiling, Data Cleaning, Data Integration and Enhancement, Data Reporting and Monitoring, Documentation and Publishing, and Continuous Data Quality Improvement.]
Figure 1: The Basic Phases of the Data Cleaning Framework
Phase 0: Data Quality Assessment
This phase starts before implementing the framework by assessing the current status of the data used by the information systems at the enterprise. This is achieved by reviewing a range of factors closely related to the quality and safety of data, which include the following:
Data efficiency and effectiveness.
Data entry.
Data security and integrity.
Financial potential.
Resistance to change.
Administrative change.
Planning.
Powers granted to information systems.
Data quality check.
Data management.
Design of information systems.
Documentation of the information system.
Training courses.
3.1 Phase I: Improving Data Quality
The following subsections describe the activities of Phase I.
3.1.1 Roles Distribution
We begin with the first activity where we distribute tasks and assign
responsibilities in the context of clear obligations that fulfill the
framework's goals to ensuring that the operations are conducted
effectively and efficiently. This requires the identification of main
roles and responsibilities, add new positions for the enterprise and
assign them to different members. The following roles are necessary
for the framework.
System Senior Management: represents the highest authority and is responsible for the development and maintenance of the system policy; it ensures the system's proper functioning and is, in practice, part of the enterprise's executive management.
Administrative Coordinator: performs all the
administrative actions relevant to the framework's tasks and
operations.
Data Manager: responsible for managing the entire data
quality of the information systems and is responsible for all
data management operations such as data collection, access,
handling, use and deletion.
Data Administrator: a domain expert, responsible for defining and maintaining data quality standards within the data set for which they are responsible, in a manner that guarantees data quality.
Database Administrator: a professional with comprehensive administrative experience, responsible for developing, implementing, and operating the database system, and for establishing and defining policies and procedures for setting up, handling, running, and using the database management system. The specialist also aims to keep up with the latest database architecture techniques and methods.
Data Entry Clerk: responsible for entering data into the information system.
3.1.2 Data Management
The data management department refers to a group that is responsible
for managing business data as a critical resource for the enterprise.
This department is responsible for developing, managing policies,
procedures, plans and processes within the enterprise to define, clean,
protect, and efficiently use data. Enterprise data management assists
in the development of long-term database design plans, structures,
policies and laws. The data management process involves assigning a group of employees and domain experts to perform data management activities, policy development, technology selection, control, and data analysis planning, and setting up a special department known as the data management department.
Data quality needs support from the enterprise structure, which
includes specialists from all main enterprise sectors. A data
management department must be established to start data
management activities within the proposed system. This department
manages the data used in the specified area of work for information
systems. The data manager heads the new department and supervises its employees (data administrators, database administrators, and data entry clerks).
At this point, we have laid the foundation on which data management tasks for the targeted business areas of the enterprise can be performed, so data quality improvement activities can start and data issues can be addressed. An important initial step to be considered here is a clear and
complete understanding of the current state of data quality in an
enterprise. Data quality assessment tools enable identifying and addressing data issues and their causes, reducing effort, improving the
extent and speed of data analysis, and helping to create a
comprehensive understanding of data quality levels. Data quality
measurement and evaluation techniques provide a certain level of
examination using the data quality business rules. These are standards
developed by industry experts or analysts of information systems to
evaluate the data quality level. Data quality rules provide a way to
describe what is expected from a data quality perspective. These rules
are used to differentiate between valid and invalid data. They are
integrated into the data quality assessment and measurement software
tools to compare the data in the source database with the data quality
business rules. Data that violates business rules are then changed to
comply with these established rules. The purpose of applying these
rules is to produce data quality reports. Once the reports are available,
industry experts check the data quality information documents and
figure out details about data quality issues. The specialists correct the issues and re-check the data to confirm that no problems remain, or that the remaining issues fall within an acceptable tolerance level. The data are then passed to the final, quality-assured database.
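To make the idea of data quality business rules concrete, the sketch below shows one possible way such rules could be encoded and applied to source records, separating valid from invalid data and producing a simple quality report. The rule set, field names, and thresholds are illustrative assumptions, not the tooling prescribed by the framework.

```python
# A minimal, illustrative sketch of data quality business rules
# (assumed field names and thresholds; not a specific vendor tool).
import re

# Each rule: (rule name, predicate over a record dict)
RULES = [
    ("student_id is present", lambda r: bool(r.get("student_id"))),
    ("grade in 0..100",       lambda r: r.get("grade") is not None and 0 <= r["grade"] <= 100),
    ("email looks valid",     lambda r: re.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", r.get("email") or "") is not None),
]

def assess(records):
    """Split records into valid/invalid and count violations per rule."""
    valid, invalid = [], []
    violations = {name: 0 for name, _ in RULES}
    for rec in records:
        failed = [name for name, check in RULES if not check(rec)]
        for name in failed:
            violations[name] += 1
        (invalid if failed else valid).append(rec)
    return valid, invalid, violations

if __name__ == "__main__":
    sample = [
        {"student_id": "S001", "grade": 87,  "email": "a@uob.edu.ly"},
        {"student_id": "",     "grade": 105, "email": "not-an-email"},
    ]
    valid, invalid, report = assess(sample)
    print(f"valid={len(valid)}, invalid={len(invalid)}")
    for rule, count in report.items():
        print(f"  {rule}: {count} violation(s)")
```

Records that violate a rule would be routed to the correction step described above, while the per-rule counts feed the data quality reports that the experts review.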
In some cases, a particular business field cannot find a solution to data
problems. For example, the required solution may be outside the
administrative authority granted to the data manager within a specific
area of work or it may require changes in procedures, policies, and
processes throughout the enterprise as a whole to solve the data
problem. Such problems must be solved collectively, through discussion and mutual exchange between data managers at the enterprise level, the framework's senior management, and any other parties that may contribute to the solution, if necessary.
The right data management structure starts with the development of
a common set of data management technologies [9] capable of
supporting and automating various data quality improvement
processes whenever possible. This is to reduce the cost of time and
effort to improve data quality. The work areas of the software tools
can be divided into five main groups, as shown in Figure 1:
1. Data profiling: Analyzes characteristics of target source
data to evaluate and understand situations within data
quality rules.
2. Data cleaning: Corrects data errors and establishes data
integrity and consistency standards within the data set.
3. Data integration and data enhancement: Integrates,
consolidates the data relationship between varieties of
sources and improves data to make it more compatible.
4. Data Reporting and monitoring: The enterprise needs to
track data quality over time and assess and measure the
results of data quality changes. This task is performed by
data monitoring and reporting tools. Reports provide and
monitor data quality issues such as data quality reporting,
data breach alerts, approved databases, and detailed
analysis of data breaches.
5. Documentation and publishing: The framework adopts the
idea of creating a website as a means of documenting and
disseminating data quality issues and as a means of
communication for all stakeholders within the enterprise.
All new work and activities within the enterprise's data cleaning framework are published on it, and experience with data quality is continuously shared across the enterprise.
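As a small illustration of the first group (data profiling), the sketch below computes a few per-column statistics (fill rate, distinct values, most frequent values) that analysts could compare against the data quality rules. The column names and the statistics chosen are assumptions made for illustration only.

```python
# Illustrative data profiling sketch: per-column completeness and cardinality.
from collections import Counter

def profile(records, columns):
    """Return basic profile statistics for the given columns."""
    total = len(records)
    stats = {}
    for col in columns:
        values = [r.get(col) for r in records]
        non_null = [v for v in values if v not in (None, "")]
        stats[col] = {
            "fill_rate": len(non_null) / total if total else 0.0,  # completeness
            "distinct": len(set(non_null)),                        # cardinality
            "top_values": Counter(non_null).most_common(3),        # frequent values
        }
    return stats

if __name__ == "__main__":
    rows = [
        {"department": "IT", "city": "Benghazi"},
        {"department": "IT", "city": ""},
        {"department": "Finance", "city": "Benghazi"},
    ]
    for col, s in profile(rows, ["department", "city"]).items():
        print(col, s)
```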
3.1.3 Staff Training
Enterprise senior management should establish a center for professional and technical training and development, offering a training environment equipped with the required technological and training equipment. It should also
develop a long-term training plan for the development of information
systems employees. The plan's purpose is to provide trainees with
information technology knowledge to recognize the types of
technologies used and to provide them with the relevant skills
necessary to successfully perform their duties. The framework's
senior management also needs to develop the implementation
framework for this plan along with the timeline, and determine the
human and material resources needed to effectively execute it.
Furthermore, the team that will implement this plan should be
identified.
3.1.4 Data Problems Addressing
The framework introduces a set of rules, instructions and procedures
to maintain the validity and consistency of data elements, applied during the design phase to improve data quality. To
improve the quality of the data, the focus must be placed on certain
aspects of the information system design process, which has a
negative or positive effect on the quality of the data contained in the
system.
In this activity of the framework, we briefly present the problems of
data quality related to database design, the reasons that lead to its
occurrence and ways to address it. We also explain a set of design
standards related to designing effective input interfaces to the system,
as the application of these standards would lead to the production of
a design that supports the required data quality.
3.1.4.1 Addressing database design issues:
Most applications within the enterprise have been developed for
specific purposes and sections and have been kept separate from other
applications from the beginning. This, in turn, leads to a significant
degree of inconsistency in their respective components and especially
the components of their data. Quality problems related to database
design are grouped into two sections: single-source problems, and
problems when combining multiple sources of data. Problems within
each are grouped into two levels, one at the Schema-level database,
and another at the instance-level of the database. The following is a
brief explanation of the above problems and the proposed solutions to
them.
Single-source problems
a. Data quality problems at the schema level are caused by poor schema design and by a lack of integrity constraints enforced on the data.
b. Data quality problems at the instance level are caused by data entry errors.
The suggested solution to these problems is to create an appropriate design for both the database schema and the data integrity constraints and, in addition, to design robust input interfaces that support data quality.
Multi-source problems
a. Data quality problems at the schema level are caused
by the difference between data models and schema
designs. This type of data problem occurs because of
the heterogeneous data models and schema designs
across enterprise applications between different data
sources.
b. Data quality problems at the instance level are caused by inconsistency of the data.
The suggested solution to these issues is to match entity definitions across the different data sources and to perform the data cleaning process; in addition, the database schemas of the different data sources have to be integrated.
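To illustrate what matching entity definitions and integrating schemas across data sources might involve in practice, the sketch below normalizes two source schemas to a common set of attribute names and then groups records that likely refer to the same entity. The field mappings, source names, and matching key are simplifying assumptions, not the method prescribed by the framework.

```python
# Illustrative sketch: map heterogeneous source schemas to a unified form and
# detect likely duplicate entities across sources (assumed field mappings).

# Per-source mapping from local column names to the unified schema.
SCHEMA_MAP = {
    "registry_db": {"StudNo": "student_id", "FullName": "name", "BirthDate": "dob"},
    "finance_db":  {"sid": "student_id",    "student":  "name", "date_of_birth": "dob"},
}

def to_unified(source, record):
    """Rename a source record's fields according to the unified schema."""
    mapping = SCHEMA_MAP[source]
    return {unified: record.get(local) for local, unified in mapping.items()}

def merge_sources(batches):
    """batches: iterable of (source_name, list_of_records).
    Returns unified rows grouped by a simple matching key (student_id)."""
    by_key = {}
    for source, records in batches:
        for rec in records:
            row = to_unified(source, rec)
            by_key.setdefault(row["student_id"], []).append((source, row))
    return by_key

if __name__ == "__main__":
    merged = merge_sources([
        ("registry_db", [{"StudNo": "S001", "FullName": "A. Ali", "BirthDate": "1999-01-05"}]),
        ("finance_db",  [{"sid": "S001", "student": "Ali, A.", "date_of_birth": "05/01/1999"}]),
    ])
    for key, rows in merged.items():
        names = {r["name"] for _, r in rows}
        print(key, "conflicting representations" if len(names) > 1 else "consistent", names)
```

Conflicting representations detected this way are exactly the instance-level inconsistencies that the cleaning process must resolve.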
3.1.4.2 Solving the problems related to designing data entry forms:
The most common cause of data inaccuracy is manual data entry, and the complexity of data input types also raises data input issues. When safety constraints apply to the data to be entered, the input interfaces will prevent the user from entering data that violates these restrictions. Users may then resort to new ways of entering data that circumvent the system: data that formally satisfies the constraints but is incorrect or inaccurate for the intended purpose. Entering data through forms or electronic interfaces of web-based information systems is also problematic; users often tend to find the easiest way to complete the form, even if that means an intentional error. Well-designed electronic or paper data entry forms, together with the instructions that accompany them, can reduce some kinds of data entry problems. However, manual data entry must be recognized as a cause of data quality problems. In this part of the framework, we explain the principles for the correct design of input forms, whether paper or electronic, which help the system analyst design efficient system inputs. The data entry user interface must satisfy a set of conditions to reach the required level of design quality, as follows (a validation sketch follows the list below):
1. Effectiveness: ensures that both paper and electronic
input types fulfill the task for which they were effectively
prepared.
2. Data accuracy: means that a form for data entry can be
properly filled in.
3. Easy entry: indicates the user's ability to use the input forms directly, without spending much time understanding how they work or how to fill them in.
4. Regularity: means directing the user's attention by keeping the interface elements in an orderly design.
5. Attractiveness: gives the user a feeling of comfort and pleasure when using the form.
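As a small illustration of how an electronic entry form can enforce safety constraints at the point of entry, the sketch below validates a hypothetical student-record form before accepting it. The fields and constraints are assumptions made purely for illustration.

```python
# Illustrative sketch of field-level validation for an electronic entry form.
# Field names and constraints are hypothetical.
from datetime import datetime

def validate_entry(form):
    """Return a list of error messages; an empty list means the entry is accepted."""
    errors = []
    if not form.get("student_id", "").strip():
        errors.append("student_id is required")
    if len(form.get("name", "").strip()) < 3:
        errors.append("name must contain at least 3 characters")
    try:
        dob = datetime.strptime(form.get("dob", ""), "%Y-%m-%d")
        if dob.year < 1900 or dob > datetime.now():
            errors.append("dob is out of the accepted range")
    except ValueError:
        errors.append("dob must use the YYYY-MM-DD format")
    return errors

if __name__ == "__main__":
    print(validate_entry({"student_id": "S001", "name": "Ali", "dob": "1999-01-05"}))  # []
    print(validate_entry({"student_id": "", "name": "A", "dob": "5/1/99"}))            # 3 errors
```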
3.1.5 Common Data Concepts Establishment
The effective implementation of various information systems depends on several factors, including a proper understanding of how data is used in those systems, which in turn requires an understanding of the definitions of data, the data design, and the enterprise data. The problem of incoherent information systems arises because each business domain in the enterprise builds its applications in isolation from other applications. By sharing metadata, however, all applications can rely on the same data descriptions, so the enterprise will have structured standards, common data descriptions for all applications, and consistent data management. The data dictionary is
just one step towards creating a common understanding of all the
enterprise data elements. It is a reference that contains data describing
the data or what is called Metadata, referring to all data processing
operations, data warehouses, data flows, data structures, system
logical and physical data elements. The enterprise data dictionary is a
single tool to help ensure the accuracy and consistency of the data.
One of the important reasons for maintaining a data dictionary is to keep the data clean, meaning that every system follows the same data definitions. Data management performs best when it works alongside the data dictionary, which is the most important and most accurate information resource on data operations for all the enterprise's employees.
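A data dictionary entry of the kind described above could be represented in many ways; the sketch below shows one minimal possibility, where each data element carries its definition, type, system of record, and the quality rules attached to it. The structure and field names are assumptions for illustration.

```python
# Minimal illustrative data dictionary entry (metadata about a data element).
from dataclasses import dataclass, field
from typing import List

@dataclass
class DataElement:
    name: str                 # logical name shared across applications
    definition: str           # agreed business definition
    data_type: str            # e.g. "string", "date", "integer"
    source_system: str        # system of record for this element
    quality_rules: List[str] = field(default_factory=list)  # attached business rules

# A tiny dictionary keyed by element name, shared by all applications.
DATA_DICTIONARY = {
    "student_id": DataElement(
        name="student_id",
        definition="Unique identifier assigned to a student at registration",
        data_type="string",
        source_system="registry_db",
        quality_rules=["must be present", "must be unique"],
    ),
}

if __name__ == "__main__":
    elem = DATA_DICTIONARY["student_id"]
    print(elem.name, "-", elem.definition, "| rules:", elem.quality_rules)
```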
3.2 Phase II: Data Cleaning using the ETL Process
During this phase, the data is prepared and cleaned again using the ETL process to remove the errors remaining from the applications, before the cleaned data is transferred to the data warehouse. The data cleaning process is split into two steps.
Step 1: Clean single-source data: data cleaning is applied to each data source on its own.
Step 2: Clean multiple-source data: data cleaning is then applied across the multiple data sources together.
The methodology referred to in [7] is also characterized by the use of
two different methods (Auto-cleaning process and Semi-auto-
cleaning process) to address data quality issues. In this methodology,
the appropriate timeliness standard was chosen as the primary
criterion for performing the cleaning process where data error types
are determined based on the time taken to perform the cleaning of the
data. To contribute to optimum efficiency and performance when
cleaning is complete, the selected criteria can be integrated into the
data cleaning process. According to the methodology in [7], the data
was cleaned in two stages: the first stage addresses errors in the single
data source and the second stage addresses errors in multiple data
sources. Each stage has two processes on data sources (Auto-cleaning
process), and (Semi-auto-cleaning process). Errors are detected and
automatically removed or corrected using appropriate algorithms in
the process of automated processing, without any user intervention.
The semi-automatic process then addresses the data errors remaining after automated processing is completed. This type of error cannot be removed or corrected without the domain expert's intervention: the errors are only detected and identified by algorithms, while the data cleaning operator handles them. Accordingly, this phase of the proposed framework is completed by adopting the approach in [7], in which the methodology addresses the data contamination problem.
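The sketch below illustrates, under heavy simplification, how the two cleaning passes described above could be organized: an automatic pass that corrects errors it can handle by rule, and a semi-automatic pass that only flags the remaining suspect records for a domain expert. It is not the implementation from [7]; the specific checks and the notion of routing records for review are illustrative assumptions.

```python
# Illustrative two-pass cleaning sketch (auto pass, then semi-auto pass).
# The specific checks are assumptions; [7] defines its own algorithms.

def auto_clean(records):
    """Automatic pass: fix errors that can be corrected by rule, without a user."""
    cleaned = []
    for rec in records:
        rec = dict(rec)
        rec["name"] = (rec.get("name") or "").strip().title()      # normalize spacing/case
        if isinstance(rec.get("grade"), str) and rec["grade"].isdigit():
            rec["grade"] = int(rec["grade"])                        # repair obvious type errors
        cleaned.append(rec)
    return cleaned

def semi_auto_clean(records):
    """Semi-automatic pass: detect remaining suspect records and route them to an expert."""
    accepted, for_review = [], []
    for rec in records:
        grade = rec.get("grade")
        if not isinstance(grade, int) or not (0 <= grade <= 100):
            for_review.append(rec)          # an expert must decide how to correct this
        else:
            accepted.append(rec)
    return accepted, for_review

if __name__ == "__main__":
    raw = [{"name": "  ali  ", "grade": "87"}, {"name": "omar", "grade": 140}]
    accepted, review = semi_auto_clean(auto_clean(raw))
    print("accepted:", accepted)
    print("needs expert review:", review)
```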
3.3 Phase III: Performance Evaluation Process
The performance of the framework is assessed by the top management
and managers responsible for the targeted work areas of the
enterprise, using documented material on the level of data quality and
the results achieved. Reports on data quality provide managers and all
data staff with an explanation if the data meets or exceeds the
acceptable level of data quality. These reports also evaluate the effectiveness of the proposed framework's performance and reflect whether some processes or methods need to be re-evaluated. If the level of data quality reached after a predetermined period indicates an improvement, this demonstrates the efficiency and success of the methods that were adopted, and vice versa. It should be noted that the timeframe required to evaluate the framework should be taken into account: a timetable should be set out indicating the time required to complete the tasks of resolving data problems, so that the framework is not evaluated before the necessary time has been given to address those problems.
4. THE CASE STUDY
This research was implemented to study the problems of data quality in the data used by the information systems installed in the faculties of the University of Benghazi.
4.1 Data Quality Assessment and Operation
Environment
This section describes the current status of the data used by the information systems at the University of Benghazi, which was reviewed against a range of factors closely related to the quality and safety of data.
Data efficiency and effectiveness: A range of problems has been noted in the data used in the information systems, such as data being incomplete, inaccurate, and not available to decision-makers promptly. These problems indicate the data's inefficiency and ineffectiveness in accomplishing various tasks.
Data entry: Most data entry rooms in information systems
do not provide the appropriate environment for the
operation, which in turn adversely affects the accuracy and
validity of the data entry process. Besides, there is no
monitoring and follow-up of data entry work, as well as no
appropriate error correction mechanisms.
Data security and integrity: Non-specialists are assigned to data-entry tasks, which compromises the security and integrity of data, for example through the entry or manipulation of incorrect data.
Financial potential: The inability of the University's senior
management to provide the necessary financial means for
improving, maintaining and developing information
systems.
Resistance to change: Some managers and users of
information systems are concerned about changing and
developing their working style and rejecting what is
unusual for them.
Administrative change: Ill-considered changes of department heads have caused many problems, the most important of which is a lack of stability in the work environment, as work plans change with every change of department head, confusing staff, wasting time, and leaving plans without continuity.
Planning: There are no clear, informed plans for the senior
management of the University to follow, regarding the
improvement of the quality of the data.
Powers granted to information systems: Some managers exceed the pre-defined powers of the information systems and modify data without complying with the conditions and restrictions that should apply to the data. This causes severe confusion in data entry processes and multiple modifications.
Data quality check: Lack of tools and techniques to check
data quality, and solve data problems.
Data management: The lack of specialized departments
within the targeted areas of data management work
exacerbates their problems.
Design of information systems: Information systems used
by the University are individual and separate applications,
designed independently, which has led to inconsistency of data elements between information systems and to their duplication.
Documentation of the information system: The documentation process is neglected, whether for documenting users' data or for the documentation that explains how to use the system and how to resolve the problems that might be faced.
Training courses: The management neglects plans for developing and training staff, so employees rely on their personal experience with computers.
4.2 Impact of Data Quality Problems
From the study and analysis of previous points, we conclude that data
in information systems and their operating environment are
problematic, with several negative effects, which can be summarized
as follows:
Impact on data quality: Low data quality within targeted areas
of work.
Impact on the performance of information systems: The
quality of data entered into the information system determines
the quality of the resulting data, so it can be argued that the
desired results of the performance of information systems are
below the required level, as a result of the low quality of their
inputs.
Impact on decision-making: The low level of data quality is an obstacle to sound and accurate decision-making.
Impact on cost: Subsequent remediation of data quality problems, whether through the improvement of legacy information systems or the implementation of new systems when the legacy systems prove useless, leads to a significant increase in material costs and wastes time and effort on finding temporary solutions to permanent problems.
The results of the study confirm that there are several problems in the data environment, an unequivocal indication of the low level of data quality in the work areas of the University of Benghazi. This is an obstacle to any progress in the application of modern technology projects, since the quality of data is a prerequisite for the success of such work, including the data warehouse.
The study revealed that many factors have negatively affected the data quality level, including poor infrastructure, lack of strategic planning, the rigidity of the current organizational structure, the lack of investment in developing and staffing data quality work, poor understanding of the value of data quality, and lack of employee interest. The study also indicated that the data cleaning process can be made easier and less complicated when planning and preparation for cleaning are carried out in advance and when the cleaning is performed in stages rather than in one batch; otherwise, subsequent cleaning in the data warehouse's ETL process can be impeded.
5. CONCLUSIONS
5.1 Conclusion
In this study, we have proposed a framework for cleaning data, which
includes many procedures and methods aimed at achieving
improvement in the level of data quality, including assessing,
improving and monitoring the level of data quality and solving its
problems within each field of work separately. A set of basic tools
and techniques has been proposed to build the necessary technical
infrastructure to improve the level of data quality within the
enterprise. We emphasize developing the skills of employees by setting up a specific training mechanism and by creating a website that deals with data quality issues. The design of information
systems was discussed, because the methods used to design
information systems in the enterprise often focus on providing
operational needs for them, and ignore the procedures for improving
the quality of the data. We believe that achieving data quality requires reformulating the viewpoint on the design of information systems, so that data quality is achieved alongside the operational aspects, rather than the functional needs of the system being addressed in isolation. A unified and common
base has been established to develop unified and common concepts of
data at the level of various applications in all areas of the enterprise's
work by sharing metadata between them. In the last stage of the
framework, we explained how to improve the efficiency of data
cleaning performance within the ETL process before moving to its
final destination, which is the data warehouse. We also clarified how the framework's performance can be evaluated: we relied on the reports received on the level of data quality and assessed them in terms of whether what was expected was accomplished.
We applied the proposed framework to the University of Benghazi as a case study, reviewed the problems found there, such as data quality problems and operating environment problems, and explained their impact on the work. We then established a link between the methods and procedures put forward within the framework and the problems from which the case study suffers.
We believe that the data cleaning process can be accomplished more easily and with less complexity when planning and preparation for the cleaning process are done in advance, and when it is carried out in stages instead of all at once in the ETL process.
5.2 Recommendations
In light of the findings, we offer a set of recommendations that support long-term improvement in the quality of data and increase the chances of success of the proposed framework, if adopted by an organization, as follows:
1. We recommend applying the proposed framework at the University of Benghazi and having it adopted by the higher management of the organization, once the assessment and improvement of data quality are guaranteed and the operating environment necessary to implement the proposed framework is well prepared.
2. The implementation of a comprehensive and consistent data quality plan is significant in moving the enterprise from merely reacting to and fixing data issues to proactively controlling data issues and limiting the presence of defects in the data environment.
3. Apply the principle of preventing data defects rather than fixing them afterwards. This requires data quality and integrity to be achieved from the outset, which will save much subsequent correction of data errors.
4. The need to work to establish a culture of data quality, for
employees and all administrators in the enterprise.
5. Reduce resistance to change, both from employees and from the management leadership of the enterprise, by engaging them in the process of improving data quality and enhancing their understanding of the quality concept.
6. Scientific bases and controls should be established for selecting administrative leaders, especially those working in the data field, and the relative stability of these leaders should be maintained, so that sudden and unplanned changes do not alter work plans and methods in ways that negatively affect the work.
7. Attract scientific competence in strategic planning for institutions, experienced professionals, and those qualified to improve data quality, by providing them with full material and moral support to improve the scientific and productive reality of the enterprise. The availability of qualified personnel is critical and a prerequisite for the success of the framework.
8. Detailed and targeted plans that suit the nature and capabilities of the enterprise are needed to address data quality problems, serving as a roadmap that managers remain committed to implementing even in the event of a management change. It is recommended to start with initial short-term plans and then develop and extend them.
9. Senior management of an enterprise has to review and
adjust its organizational structures to suit new business
variables.
10. Pre-conceptualization of information systems design must
be developed to have pre-defined and unified standards at
the level of all applications at the university.
11. Allocation of data entry rooms for information systems,
with the necessary technology and equipment, so that data
entry can be accomplished with minimal errors and with
sufficient efficiency.
12. Improve and expand computer networking for information
systems used at the University of Benghazi, so that better
data exchange services can be provided between the various
work units and departments of the University.
13. To make use of information systems, whether
administrative, support or expert, which are an essential
component of modern institution-building, to provide
information to decision-makers as needed with the speed,
quantity and accuracy required. A gradual transition to the
use of new technologies to improve data quality is also
recommended.
14. Data personnel must be trained, developed, and qualified so that they are able to use modern technologies.
6. REFERENCES
[1] Maletic J.I. and Marcus A. 2009. Data Cleansing: A
Prelude to Knowledge Discovery. In: Maimon O., Rokach
L. (eds) Data Mining and Knowledge Discovery
Handbook. Springer, Boston, MA.
[2] Rahm E. and Do H. H. 2000. Data Cleaning: Problems and Current Approaches. In IEEE Data Engineering Bulletin, vol. 23, no. 4.
[3] Müller H. and Freytag J. C. 2003. Problems, Methods, and Challenges in Comprehensive Data Cleansing. Technical report, Humboldt University Berlin.
[4] Kalinka M. and Kaloyanova K. 2005. Improving data
integration for a data warehouse: a data mining approach. The
University of Sofia, Bulgaria. Available In:
http://www.nbu.bg.
[5] O’Brien T., Helfert M. and Sukumar A. 2012. Classifying
costs and effects of poor Data Quality examples and
discussion. In Annual Conference of Irish Academy of
Management, Maynooth, Ireland.
[6] Haug A., Zachariassen F. and Liempd D. V. 2011. The costs
of poor data quality. In Journal of Industrial Engineering and
Management, vol. 4, p. 171.
[7] Peng T. 2008. A Framework for Data Cleaning in Data Warehouses. In Proceedings of the Tenth International Conference on Enterprise Information Systems.
[8] Corrales D. C., Ledezma A. and Corrales J. C. 2018. From
Theory to Practice: A Data Quality Framework for
Classification Tasks. In Symmetry, vol. 10(7), 248.
https://doi.org/10.3390/sym10070248
[9] Ferguson, M. (2007). Data Ownership and Enterprise Data
Management: Leveraging Technology to Get Control of Your
Data (Part 2). A DataFlux White Paper, SAS Institute.