Article

Cross Industry Standard Process for Data Mining

... In another case, reports can be used in scientific publishing and so on. 6. Intelligent data modelling: after EDA, we want to model our data using some intelligent algorithm like Logistic regression, Naïve Bayes, neural network, or something else. ...
... CRISP-DM is one of the most widely used DM concepts and methodologies [6]. The business understanding is usually based on the provided quest formulations and data description. ...
... The evaluation phase can be performed under various criteria for thorough testing of the machine learning models in order to choose the best model for the deployment phase. 6. The deployment phase, also called the production phase, involves usage of a trained model to exploit its functionality, and the creation of a data pipeline into production. ...
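For illustration, the excerpts above map naturally onto a small scikit-learn workflow. The sketch below (illustrative only; the synthetic data, the candidate models, and the output file name are assumptions, not taken from the cited works) trains a few candidate classifiers, evaluates them under a common criterion, and persists the best one as a stand-in for the deployment phase.

```python
# Minimal sketch of the modelling, evaluation and deployment steps of a
# CRISP-DM style project; data and model choices are illustrative only.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split, cross_val_score
import joblib

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": GaussianNB(),
    "neural_network": MLPClassifier(max_iter=1000, random_state=0),
}

# Evaluation phase: score every candidate under the same criterion (here CV accuracy).
scores = {name: cross_val_score(model, X_train, y_train, cv=5).mean()
          for name, model in candidates.items()}
best_name = max(scores, key=scores.get)

# Deployment phase: refit the selected model and persist it for the production pipeline.
best_model = candidates[best_name].fit(X_train, y_train)
print(best_name, scores[best_name], best_model.score(X_test, y_test))
joblib.dump(best_model, "model.joblib")
```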
Book
http://elvira.fiit.stuba.sk/ This textbook presents an introduction to Data Science in the context of the responsible development of human-centric and trustworthy Artificial Intelligence systems. It presents the recent transition from focusing on modeling to the underlying data used to train and evaluate models. In the textbook, a systematic way to examine data is described, with details about data sources, data collection, data integration, and data preparation as part of the Data Science process. The aim of the process is to provide the best quality of machine learning data and to build an intelligent application in an organization. A separate chapter on intelligent data modelling is dedicated to several selected machine learning methods and deep learning architectures. The emphasis is on model evaluation and model selection, with optimizations to select the best models to deploy in production. In the context of intelligent software development, the textbook presents the most popular and most used machine learning and deep learning frameworks and libraries, with the dominance of Python open source software at small scale as well as large scale. A note about specialized hardware for speeding up general-purpose computation in the last decade is included. The textbook also emphasises ethics in Artificial Intelligence development, with the most notable document being the “European Union Guideline on Ethics in Artificial Intelligence: Context and Implementation”.
... In order to understand the process involved in business analytics implementation, several business analytics standards have been reviewed. The most famous standard employed by organizations is CRISP-DM [26]. It involves six levels of processes, beginning with understanding the organization's need to perform analytics and moving to the business analyst, who defines the suitable data to be used and its source. ...
... Factor categories cited include Financial [31], [32]; Infrastructure [26], [27], [32]; Technology; and Hardware and Software [26], [27]; along with performance [30]. This research was conducted to inquire into the implementation of BI for Organizational Performance Management (OPM). ...
Article
Full-text available
Business intelligence and analytics (BIA) is emerging as a critical area to boost organizational performance. Nowadays, data is not only important and valuable to the organization but recognized as necessary to drive the organization's performance and success. As a result, many organizations spend a considerable amount of investment toward obtaining faster, accurate information on a real-time basis. Previous studies revealed that even though many organizations use business intelligence technologies for obtaining information, they still lack analytics implementation. Therefore, this study aims to discover the integrated implementation factors of business intelligence and analytics in managing organizational performance, particularly for organizations of the public sector. In achieving this, an in-depth literature review was carried out to identify the influential factors in the implementation of business intelligence, business analytics, and performance management. Subject matter experts in Business Intelligence (BI), Business Analytics (BA) and Organisational Performance Management (OPM) were invited to participate in this empirical study, which was conducted in Malaysia. The study was carried out through interviewing the experts in order to identify the essential factors for business intelligence and data analytics implementation. Twenty essential factors and sixty-four sub-factors were identified and analyzed to construct the integrated factors in BIA and OPM implementation. The result of the study revealed four integrated factors of BIA and OPM implementation, namely skill, documentation, visualization, and work culture. Finance, data management, software, strategic planning, and decision-making are other factors integrated with BI, BA, and OPM respectively. Finally, this study illustrates the integrated factors in a visual form.
... The general criteria for this assessment were: access (cost of the tools), the user interface (how easy or difficult the tool may be for users to use), the process (or methodology) on which they are based, extensibility (the ability to easily and dynamically extend the set of algorithms the tool offers), and support for project development by work teams. As a result, it was found that none of the tools fully complies with CRISP-DM (Cross-Industry Standard Process for Data Mining) (CRISP-DM, 2006; Chapman et al., 2000), an iterative, open, customizable process for developing data mining projects that is widely recognized by industry and academia; that none of these tools allows dynamic, run-time extension (without recompiling the code) of the set of mining algorithms initially delivered with the tool; and that even though some tools have an easy-to-use interface, none of them properly guides the development of a project, much less helps its users to learn and deepen their command of the process and of mining project development in general. For these reasons, the GTI research group decided to develop an integrated CASE tool (one that supports all phases of a process), based on CRISP-DM (CRISP-DM, 2006; Chapman et al., 2000), easily extensible at run time, easy to use, and able to help users improve their knowledge and skills in developing data mining projects. ...
... Several methodologies exist to guide the data mining process; they aim to make it easier to carry out new projects with similar characteristics, to optimize their planning and management, to reduce their complexity, and to allow them to be tracked better (Gondar Nores, 2004). Among these methodologies, CRISP-DM (2006) and SEMMA (Sample, Explore, Modify, Model, Assess) (SAS, 2009b) stand out. SEMMA focuses on the technical characteristics of carrying out the process, whereas CRISP-DM keeps the business objectives of the project as its central focus. ...
Article
Full-text available
This paper introduces CMIN, an integrated computer aided software engineering (CASE) tool based on cross-industry standard process for data mining (CRISP-DM) 1.0 designed to support carrying out data mining projects. It is "integrated" in the sense that it supports all phases of a process. A general overview of how CMIN works is presented first, including a treatment of processes, templates and project management. CMIN's capacity for easily and intuitively monitoring projects is highlighted, as is the manner in which CMIN allows a user to increase knowledge regarding using CRISP-DM or any other process defined in the CASE tool through the help and information presented in each step. Next, it is shown how CMIN can bind new data mining algorithms in runtime (without the need to recompile the tool) to support modelling tasks (based on a Workflow) and evaluate data mining projects. Finally, the results of two evaluations of the tool, some conclusions and suggestions for future work are presented.
... Given that we start from a historical data set taken from former projects, a data mining methodology will be applied. The CRISP-DM [2] methodology used to solve this problem will be described. Finally, the modelling method and the conclusions we have drawn to solve the problem faced will also be described. ...
... In order to get a successful model, a working methodology must be used for Data Mining projects. CRISP-DM [2] is one of the most common process models. It divides the life cycle of Data Mining projects into six phases. ...
Conference Paper
Full-text available
Project scheduling is a crucial task; a project can fail because its duration lengthens or because the effort needed to implement it is wrongly estimated. It is necessary to have a tool that helps us learn more about the project, in order to choose the attributes that influence project deviations and to obtain more accurate estimations. This paper analyses the feasibility and the advantages, compared with current techniques, of developing a system based on Artificial Intelligence techniques capable of selecting the attributes that affect the project duration and the effort it requires, using a data set of historical information on software projects. To do that, a method is proposed for analysing the existing data set and pre-processing it, in order to obtain a model that can meet the project manager's standards.
... This data includes approximately 100 million clicks on 6 million pages in about 20 million search sessions. In order to understand the Clickstream data, identify its quality, and discover insights into the data (CRISP-DM 1996), several statistical tests (Zhang and Segall 2008) were performed on the data and the results are shown in Tables 1, 2, and 3. Table 1 lists overall web page popularity as measured by number of clicks. Take the bolded row as an example. ...
... In a data mining project, it is very important to understand the project objectives and requirements from a business perspective (CRISP-DM 1996). In order to find the good quality web page clusters, we need to define cluster in search domain from a business perspective. ...
Article
Full-text available
The developments in the World Wide Web and the advances in digital data collection and storage technologies during the last two decades allow companies and organizations to store and share huge amounts of electronic documents. It is hard and inefficient to manually organize, analyze and present these documents. Search engines help users find relevant information by presenting a list of web pages in response to queries. How to assist users in finding the most relevant web pages from vast text collections efficiently is a big challenge. The purpose of this study is to propose a hierarchical clustering method that combines multiple factors to identify clusters of web pages that can satisfy users’ information needs. The clusters are primarily envisioned to be used for search and navigation and potentially for some form of visualization as well. An experiment on Clickstream data from a professional search engine was conducted; the results show that the clustering method is effective and efficient, in terms of both objective and subjective measures.
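As a rough illustration of the kind of hierarchical clustering described in this abstract (the paper's own method combines several domain-specific factors), the sketch below clusters pages from an invented feature matrix; the feature choices, distance metric, and cluster count are assumptions.

```python
# Sketch of hierarchical clustering of web pages from aggregated clickstream
# features; the feature construction is hypothetical, not the paper's method.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# One row per page, e.g. [click count, distinct sessions, mean dwell time].
page_features = rng.random((50, 3))

# Combine the factors into a single pairwise distance matrix.
distances = pdist(page_features, metric="euclidean")
tree = linkage(distances, method="average")

# Cut the dendrogram to obtain flat clusters of pages.
labels = fcluster(tree, t=5, criterion="maxclust")
for cluster_id in np.unique(labels):
    print(cluster_id, np.where(labels == cluster_id)[0])
```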
... Thus, several models of Data Mining processes were proposed by researchers and professionals. The examples include Fayyad, et al. (1996), Cabena et al. (1998), Cios et al. (2000), CRISP-DM (2003), Berry & Linoff (1997), Sharma, Osei-Bryson & Kasper (2012) and Ławrynowicz & Potoniec (2014). ...
Article
Full-text available
The Big Data phenomenon has imposed maturity on companies regarding the exploration of their data, as a prerogative to obtain valuable insights into their clients and the power of analysis to guide decision-making processes. Therefore, a general approach that describes how to extract knowledge for the execution of the business strategy needs to be established. The purpose of this research paper is to introduce and evaluate the implementation of a process for the experimental development of Data Mining (DM), AI and Data Science applications aligned with the strategic planning. A case study with the proposed process was conducted in a federal educational institution. The results generated evidence showing that it is possible to integrate a strategic alignment approach, an experimental method, and a methodology for the development of DM applications. Data Mining (DM) and Data Science (DS) applications also present the risks of other Information Systems, and the adoption of strategy-driven and scientific method processes are critical success factors. Moreover, it was possible to conclude that the application of the scientific method was facilitated, besides being an important tool to ensure the quality, reproducibility and transparency of intelligent applications. In conclusion, the process needs to be mapped to foment and guide the strategic alignment. Keywords: Big Data; Strategic Alignment; Experimentation; Small Data; Reproducibility
... In this way, several Data Mining process models have been proposed by researchers and practitioners. Examples of authors include Fayyad et al. (1996), Cabena et al. (1998), Cios et al. (2000), CRISP-DM (2003), Berry & Linoff (1997), Sharma, Osei-Bryson & Kasper (2012) and Ławrynowicz & Potoniec (2014). ...
Article
Full-text available
Objective: Identify and characterize the methodologies used for the experimental development of intelligent applications aligned with strategic planning. Methodology: A systematic mapping was carried out to characterize the research in the area, considering the last ten years. Originality: No scientific studies were found with the same research object as this article, namely to identify and characterize the methodologies for the experimental development of intelligent applications aligned with strategic planning, which increases the importance of the results presented here. Main results: No studies were found that presented a complete approach to disciplining strategic alignment and experimentation, providing clear compliance with strategic objectives and an experimental phase in the validation of results. However, partial instances of these characteristics could be mapped, such as experimentation, found in 28.57% of the studies. Among the countries, China, the United States and Brazil led the ranking of publications on the subject. As for the medium of publication, journals were the most used option. In addition, the "IEEE International Conference on Advanced Communications, Control and Computing Technologies" and the journal "Expert Systems with Applications" stood out as major publication venues. Theoretical Contributions: This research presents results relevant to academia and entrepreneurs, providing evidence that there is a gap in research on a formal method for the experimental, strategy-driven development of BI and Data Mining applications. In addition, this work serves as a source of consultation on the existing method standards for the development of intelligent applications, and the applied systematization can be replicated and extended. Finally, there is a focus on research that proposes methods of creating applications validated experimentally and aligned with strategy.
... CRISP-DM was conceived in 1996 and became a European Union project under the ESPRIT funding initiative in 1997 under the leadership of several companies that included Integral Solutions Ltd, Teradata, Daimler AG, NCR Corporation, and OHRA. The first version of the methodology was presented at the 4th CRISP-DM SIG Workshop in Brussels in March 1999 and was published as a step-by-step data mining guide later that year [22]. While many non-IBM data mining practitioners use CRISP-DM, IBM is the primary corporation that currently uses the CRISP-DM process model, and it has incorporated it into its SPSS (Statistical Package for the Social Sciences) modeler product [23]. ...
Article
Full-text available
This research article presents a study to compare the teaching performance of teaching-only versus teaching-and-research professors at higher education institutions. It is a common belief that, generally, teaching professors outperform research professors in teaching-and-research universities according to student perceptions reflected in student surveys. This case study presents experimental evidence that shows this is not always the case and that, under certain circumstances, it can be the contrary. The case study is from Tecnologico de Monterrey (Tec), a teaching-and-research, private university in Mexico that has developed a research profile during the last two decades using a mix of teaching-only and teaching-and-research faculty members; during this time period, the university has had a growing ascendancy in world university rankings. Data from an institutional student survey called the ECOA was used. The data set contains more than 118,000 graduate and undergraduate courses for 5 semesters (January 2017 to May 2019). The results presented were derived from statistical to data mining methods, including Analysis of Variance and Logistic Regression, that were applied to this data set of more than nine thousand professors who taught those courses. The results show that teaching-and-research professors perform better or at least the same as teaching-only professors. The differences found in teaching with respect to attributes like professors’ gender, age, and research level are also presented.
... Furthermore, a large body of literature exists in which detailed discussions of DM and its techniques can be found; some examples are [11]; [12]; [13]; [14]; [9]; [10]; [15]. In this paper, three classification methods from among the DM-based techniques are applied to propose a predictive model for immunize-able diseases. ...
Article
Full-text available
Disease rates vary between different locations, particularly in rural areas. While a database of disease occurrences can easily be found, studies have been limited to descriptive statistical analysis and are mostly restricted to diseases affecting adults. This paper therefore presents a Mathematical Model (MM) for predicting immunize-able diseases that affect children between ages 0-5 years. The model was adapted and deployed for use in six (6) selected localized areas within Osun State in Nigeria. The MM was developed using MATLAB's ANN toolbox, the Statistics toolbox for classification and regression, and the Naïve Bayesian classifier. The MM is robust in that it takes advantage of three (3) data mining techniques: ANN, Decision Tree Algorithm and Naïve Bayes Classifier. These data mining techniques provided the means by which hidden information was discovered for detecting trends within databases, thus facilitating the prediction of future disease occurrence in the tested locations. Results obtained showed that diseases have peak periods depending on their epidemicity, hence the need to adequately administer immunization to the right places at the right time. Therefore, this paper argues that using this model would enhance the effectiveness of routine immunization in Nigeria.
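The cited work used MATLAB toolboxes; as an illustrative Python analogue only, the sketch below compares the same three classifier families (neural network, decision tree, Naïve Bayes) on synthetic data. The data and parameters are assumptions.

```python
# Python analogue of comparing the three classifier families mentioned above;
# the data here is synthetic, standing in for disease-occurrence records.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

# Stand-in for records such as (location, month, age group, ...) -> outbreak label.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4, random_state=1)

models = {
    "neural_network": MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=1),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=1),
    "naive_bayes": GaussianNB(),
}
for name, model in models.items():
    # Cross-validated accuracy as a simple basis for comparing the techniques.
    print(name, cross_val_score(model, X, y, cv=5).mean())
```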
... As mentioned in Section 2.2.3, the data analytics modeling approach aims at creating machine-specific and granular models using manufacturing data. Some good approaches, such as CRISP-DM, which defines a standardized process model for data mining from a business perspective [54], have been introduced; however, they are not specific to the manufacturing domain. The proposed approach identifies a logical modeling procedure for creating models based on machine-learning, statistical, or stochastic analysis, as shown in Figure 9. Here, "component model" means the model that figures out a numerical relationship between Cause-and-Effect (CE) data up to the designated level and, thus, can predict target performance at a certain manufacturing configuration. ...
Article
Full-text available
Manufacturing industries have recently promoted smart manufacturing (SM) for achieving intelligence, connectedness, and responsiveness of manufacturing objects consisting of man, machine, and material. Traditional manufacturing platforms, which identify generic frameworks where common functionalities are shareable and diverse applications are workable, mainly focused on remote collaboration, distributed control, and data integration; however, they are limited to incorporating those characteristic achievements. The present work introduces an SM-toward manufacturing platform. The proposed platform incorporates the capabilities of (1) virtualization of manufacturing objects for their autonomy and cooperation, (2) processing of real and various manufacturing data for mediating physical and virtual objects, and (3) data-driven decision-making for predictive planning on those objects. For such capabilities, the proposed platform advances the framework of Holonic Manufacturing Systems with the use of agent technology. It integrates a distributed data warehouse to encompass data specification, storage, processing, and retrieval. It applies a data analytics approach to create empirical decision-making models based on real and historical data. Furthermore, it uses open and standardized data interfaces to embody interoperable data exchange across shop floors and manufacturing applications. We present the architecture and technical methods for implementing the proposed platform. We also present a prototype implementation to demonstrate the feasibility and effectiveness of the platform in energy-efficient machining.
... In reviewing the implementation of business analytics, the CRISP-DM (2013) was used as the main reference. CRISP-DM is the most popular standard used by most organisations [24]. It consists of six processes initiated by the process of obtaining an understanding of the organization and the need to perform analytics. ...
... These works underscore the need for human interaction and its role in the successful culmination of the KDD endeavor. CRISP-DM (CRoss-Industry Standard Process for Data Mining) [5] advocates a data mining methodology consisting of tasks described at four levels of abstraction. The methodology is based on the KDD process model that offers a systematic understanding of step-by-step direction, tasks and objectives for every stage of the process. ...
Article
This paper presents an overview of the fast growing field of Knowledge Discovery in Database (KDD) and Data Mining. Data Mining and knowledge discovery have numerous applications in business and scientific domains. They improve effectiveness, efficiency and enhance the quality of decision making in business organizations and result in interesting discoveries in scientific research. Various techniques of data mining along with some related issues are also presented.
... The CRISP-DM methodology [10] [12] is described in terms of a hierarchical process model, consisting of a set of tasks described at four levels of abstraction (from the general to the specific): phase, generic task, specialized task, and process instance. ...
Article
A software tool is implemented to support the analysis of the behaviour of the learning activities that teachers record in the Extended Classroom (Aula Extendida) of the Universidad Autónoma del Caribe. For this, a methodology for developing data mining projects, CRISP-DM, is adopted, which provides the phases of business understanding, data understanding, and data preparation, taking into account the ETL (Extraction, Transformation and Load) tool Talend Open, which allows the data to be cleansed for integration. To search for patterns of behaviour in teaching performance in the extended classroom, socio-demographic variables and levels of education are taken into account. The WEKA tool is used, which allows the models that determine a teacher's behaviour to be built from training and validation data. The behaviour models are generated through different techniques [4]: decision trees, neural networks, and decision rules, which, depending on a specific situation and the variable under study, make it possible to display and select the best analysis. These results make it possible to generate strategies that support the academic training process and raise the quality of education at the Universidad Autónoma del Caribe.
... Data mining techniques can be applied with several knowledge process models (Kurgan & Musilek, 2006; Cios et al., 2007). The Cross Industry Standard Process for Data Mining (CRISP-DM), which is a knowledge discovery and data mining process, is one of these models. This process model was jointly developed by the corporations DaimlerChrysler AG, SPSS, NCR, and OHRA (CRISP-DM, 2000). As shown in Figure 1, the phases of the CRISP-DM method can be listed as business understanding, data understanding, data preparation, modelling (the step of using data mining methods), evaluation, and deployment. ...
Article
Full-text available
Decision makers develop transportation plans and models for providing sustainable transport systems in urban areas. Mode choice is one of the stages in transportation modelling. Data mining techniques can discover the factors affecting mode choice. These techniques can be applied with a knowledge process approach. In this study, a data mining process model is applied to determine the factors affecting mode choice with decision tree techniques, considering individual trip behaviours from household survey data collected within the Izmir Transportation Master Plan. From this perspective, the transport mode choice problem is solved for a case in the district of Buca, Izmir, Turkey, with the CRISP-DM knowledge process model.
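The following sketch illustrates, on invented survey-style attributes, how a decision tree can expose the factors driving mode choice; the variable names and data are assumptions and do not come from the Izmir study.

```python
# Illustrative decision-tree model for mode choice from household-survey style
# attributes; the attributes and values are invented for the sketch.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

survey = pd.DataFrame({
    "car_ownership": [0, 1, 1, 0, 2, 1, 0, 2],
    "trip_distance_km": [1.2, 8.5, 3.0, 0.8, 12.0, 5.5, 2.1, 15.0],
    "income_level": [1, 3, 2, 1, 3, 2, 1, 3],
    "mode": ["walk", "car", "bus", "walk", "car", "bus", "bus", "car"],
})

X = survey[["car_ownership", "trip_distance_km", "income_level"]]
y = survey["mode"]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
# The induced rules show which attributes drive the chosen transport mode.
print(export_text(tree, feature_names=list(X.columns)))
```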
... Data pre-processing entails the discretization of the target range, because it has been demonstrated that by so doing NBC for regression performs comparably to well-known methods for time series prediction [14]. The pre-processing step, or data preparation, is a key step in the non-trivial Knowledge Discovery and Data Mining process, upon which the success of the entire process depends [6,4,2,3]. The second and the third steps are necessary within a supervised learning scheme, such as the one proposed in this work. ...
Article
Full-text available
In this paper, the estimation of the Residual Useful Life (RUL) of degraded thrust ball bearings is made resorting to a data-driven stochastic approach that relies on an iterative Naïve Bayesian Classifier (NBC) for regression task. NBC is a simple stochastic classifier based on applying Bayes' theorem for posterior estimate updating. Indeed, the implemented iterative procedure allows for updating the RUL estimation based on new information collected by sensors located on the degrading bearing, and is suitable for an on-line monitoring of the component health status. The feasibility of the approach is shown with respect to real world vibration-based degradation data.
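A minimal numerical sketch of the iterative Bayesian-updating idea described above follows, assuming a discretized RUL range and invented class-conditional distributions for a single vibration feature; it is not the paper's implementation.

```python
# Toy sketch of an iterative Bayesian update over a discretized RUL range,
# in the spirit of the approach described above; all distributions are invented.
import numpy as np

rul_bins = np.array([100.0, 60.0, 30.0, 10.0])       # discretized RUL classes (hours)
prior = np.full(len(rul_bins), 1.0 / len(rul_bins))  # start from a uniform prior

# Assumed class-conditional model of the vibration feature for each RUL class.
means = np.array([1.0, 2.0, 3.5, 5.0])
std = 0.8

def gaussian_likelihood(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

posterior = prior
for measurement in [1.1, 2.3, 3.2, 4.8]:              # new sensor readings over time
    likelihood = gaussian_likelihood(measurement, means, std)
    posterior = posterior * likelihood
    posterior = posterior / posterior.sum()            # Bayes' rule: normalize
    rul_estimate = float(np.dot(posterior, rul_bins))  # expected RUL under the posterior
    print(measurement, posterior.round(3), rul_estimate)
```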
... Moreover, nowadays some initiatives to standardize the definition of data mining techniques and the knowledge discovery process, and to provide APIs, are gaining strength (Grossman et al., 2002). Good examples are: the Predictive Model Markup Language (PMML, 2004), an XML-based language which provides a way for applications to define statistical and data mining models and to share models between PMML-compliant applications; the SQL Multimedia and Applications Packages Standard (Melton & Eisenberg, 2001), which specifies an SQL interface to data mining applications and services, and provides an API for data mining applications to access data from SQL/MM-compliant relational databases; the Java Specification Request-73 (JSR, 2004), which defines a pure Java API supporting the building of data mining models and the creation, storage, and access to data and metadata; the Microsoft-supported OLE DB for DM, defining an API for data mining for Microsoft-based applications (OLE DB, 2004); and the CRoss-Industry Standard Process for Data Mining (CRISP-DM, 2004), capturing the data mining process from business problems to deployment of the knowledge gained during the process. ...
... CRISP-DM [CRISP-DM 1996] (Cross Industry Standard Process for Data Mining) is a methodology for knowledge discovery in databases that imposes detailed planning and evaluation of the process across its phases, facilitating the organization, understanding, and control of events in project coordination. The CRISP-DM model was conceived with the aim of being a standard methodology; its phases range from planning, with the identification of objectives from the perspective of business understanding, to the application of the extracted knowledge. ...
Conference Paper
Full-text available
Data mining is becoming increasingly common in both the private and public sectors. In the area of public safety, data mining can be used to determine where the levels of crime are higher, define profiles of victims and criminals, and detect the days on which the greatest number of crimes occur. The aim of this paper is to use data mining on the SISGOP system, a database that records the police reports of the occurrences of Maceió, in order to discover information that aids the strategic actions of the police department, based on the behaviour of criminals and victims.
... The field of data mining has a relatively structured and consistent methodology for addressing data mining problems with machine learning techniques. This method is articulated in most data mining books (for example, see [24]) and is sometimes referred to as Knowledge Discovery in Databases (KDD) [65]; another example is CRISP-DM [64]. The method provides a best practice for clearly defining a given problem, preparing the tools and data, applying the tools, and analysing and interpreting the results. ...
Conference Paper
Full-text available
ABSTRACT In the oil industry, turbocompressor units are essential for handling natural gas production. In the case of the Muscar compressor plant of PDVSA, these units exhibit failures and low availability due to the absence of maintenance that would extend their useful life. The objective of this work was to apply artificial intelligence to the prediction of failures in turbocompressor units, in order to optimize managerial decision-making applicable to predictive maintenance through the selection of an efficient model. To achieve the stated objectives, the Cross Industry Standard Process for Data Mining methodology applied to artificial intelligence was used, supported by a mixed design of documentary and field research. The development made it possible to analyse the functionality of machine learning techniques, neural networks combined with fuzzy logic, to predict failures and make decisions in planning the maintenance activities of the turbocompressor units, based on the detected conditions and the corresponding optimal solution in each particular case. Building the artificial intelligence model provided the classification and prediction of failures in order to finally offer a set of solutions with expert knowledge. Keywords: Machine Learning, Failure Prediction, Managerial Decisions. 1. INTRODUCTION Managerial decision-making is an elementary process in organizations whose complexity depends on the organizational context; it consists of choosing among alternative solutions to the situations encountered, in the most optimal way possible and within the accepted time, without affecting the company's productivity. Information is important for making decisions; through it, the knowledge and experience needed to evaluate possible courses of action are built. From this perspective, artificial intelligence is currently applicable in various fields. Artificial intelligence techniques can be employed as technological tools to support decision-making, as a complement for the possible uncertainties not covered by managers' decisions; moreover, they offer analytical results and predictions that generate information useful to management for understanding and anticipating decisions about a specific process.
Article
The increasingly competitive higher educational environment compels the management of universities and colleges to assign high priority to an overall maximisation of client services. Consequently, while academic leaders must become familiar with the aspects of on-line communication much favoured by today’s younger generation, the intensification and improvement of the quality of available on-line services cannot be imagined without reliable information on the Internet use habits and behaviour of clients. The managers and administrators of Hungarian college and university websites are mostly unfamiliar with the web-related conduct or habits of their customers since, in the case of long-running web pages based on an unchanging structure, only basic visitor statistics are available at best. Yet marketing communication decisions should be based on information reflecting real website-consumer traits acquired via a more professional analysis. Data mining is one such decision-making support mechanism. Data mining models are capable of revealing and predicting information hidden beneath the respective critical mass. Therefore, inspired by the methodology of marketing science, this type of research concentrates on the segmentation of on-line consumers via the elaboration of visitor clusters. The present article provides a scientific overview and analysis of the main difficulties related to cluster construction, especially the development of the relevant algorithmic forms. The successful application of the model provides much-needed reliable and vital support to the institutional decision-making process. Thus pertinent data yielded by cluster research can facilitate more effective on-line services customized to the needs of the users. Key words: clustering model, data mining, marketing communication, on-line conduct, web-ergonomics.
Chapter
In this chapter, the authors explore the operational data related to transactions in a financial organisation to find out the suitable techniques to assess the origin and purpose of these transactions and to detect if they are relevant to money laundering. The authors' purpose is to provide an AML/CTF compliance report that provides AUSTRAC with information about reporting entities' compliance with the Anti-Money Laundering and Counter-Terrorism Financing Act 2006. Their aim is to look into the Money Laundering activities and try to identify the most critical classifiers that can be used in building a decision tree. The tree has been tested using a sample of the data and passing it through the relevant paths/scenarios on the tree. The success rate is 92%; however, the tree needs to be enhanced so that it can be used solely to identify the suspicious transactions. The authors propose that a decision tree using the classifiers identified in this chapter can be incorporated into financial applications to enable organizations to identify the High Risk transactions and monitor or report them accordingly.
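As an illustration of this kind of decision-tree classification and hold-out evaluation (the chapter's own classifiers, data, and 92% figure are its own), a hedged sketch on synthetic transaction features follows; the feature names and labelling rule are invented.

```python
# Hedged sketch of a transaction-risk decision tree and its hold-out success
# rate; features, labels, and thresholds are illustrative, not the chapter's model.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
n = 1000
amount = rng.exponential(scale=5000, size=n)
cross_border = rng.integers(0, 2, size=n)
cash_intensive = rng.integers(0, 2, size=n)
# Synthetic labelling rule standing in for analyst-flagged transactions.
suspicious = ((amount > 10000) & (cross_border == 1)) | (cash_intensive == 1)

X = np.column_stack([amount, cross_border, cash_intensive])
X_train, X_test, y_train, y_test = train_test_split(X, suspicious, test_size=0.3, random_state=42)

tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)
print("success rate:", accuracy_score(y_test, tree.predict(X_test)))
```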
Chapter
Knowledge discovery is a critical component in improving health care. Health 2.0 leverages Web 2.0 technologies to integrate and share data from a wide variety of sources on the Internet. There are a number of issues which must be addressed before knowledge discovery can be leveraged effectively and ubiquitously in Health 2.0. Health care data is very sensitive in nature so privacy and security of personal data must be protected. Regulatory compliance must also be addressed if cooperative sharing of data is to be facilitated to ensure that relevant legislation and policies of individual health care organizations are respected. Finally, interoperability and data quality must be addressed in any framework for knowledge discovery on the Internet. In this chapter, we lay out a framework for ubiquitous knowledge discovery in Health 2.0 based on a combination of architecture and process. Emerging Internet standards and specifications for defining a Circle of Trust, in which data is shared but identity and personal information protected, are used to define an enabling architecture for knowledge discovery. Within that context, a step-by-step process for knowledge discovery is defined and illustrated using a scenario related to analyzing the correlation between emergency room visits and adverse effects of prescription drugs. The process we define is arrived at by reviewing an existing standards-based process, CRISP-DM, and extending it to address the new context of Health 2.0.
Chapter
The number of available Internet of Things (IoT) devices is growing rapidly, and users can utilize them via associated services to accomplish their tasks more efficiently. However, setting up IoT services based on the user and environmental context and the task requirements is usually a time-consuming job. Moreover, these IoT services operate in distributed computing environments in which spatially-cohesive IoT devices communicate via an ad-hoc network, and their availability is not predictable due to their mobility characteristics. To the best of our knowledge, no research has been done on saving and recovering users’ task-based IoT service settings while considering the context and task requirements. In this paper, we propose a framework for describing task-based IoT services and their settings in a semantical manner, and for providing semantic task-based IoT services in an effective manner. The framework uses a machine learning technique to store and recover users’ task-based IoT service settings. We evaluated the effectiveness of the framework by conducting a user study.
Article
Full-text available
The combined impact of new computing resources and techniques with an increasing avalanche of large datasets is transforming many research areas and may lead to technological breakthroughs that can be used by billions of people. In recent years, Machine Learning and especially its subfield Deep Learning have seen impressive advances. Techniques developed within these two fields are now able to analyze and learn from huge amounts of real-world examples in disparate formats. While the number of Machine Learning algorithms is extensive and growing, their implementations through frameworks and libraries are also extensive and growing. The software development in this field is fast-paced, with a large amount of open-source software coming from academia, industry, start-ups or wider open-source communities. This survey presents a recent time-slide comprehensive overview, with comparisons as well as trends in the development and usage of cutting-edge Artificial Intelligence software. It also provides an overview of massive parallelism support that is capable of scaling computation effectively and efficiently in the era of Big Data.
Thesis
Full-text available
The interest in the fields of Knowledge Discovery in Databases (KDD) and Data Mining emerged due to the rapid development of Information and Communication Technologies, which made vast amounts of data available to be stored in computers. Human experts have limitations and may fail to identify important details. As an alternative, automatic discovery tools can be used in order to obtain high-level knowledge from raw data. Considering this need, several Data Mining techniques have been proposed. This dissertation intends to examine the advantages of two non-linear Data Mining models: Artificial Neural Networks (ANN) and Support Vector Machines (SVM). In particular, it aims to measure their performance when applied to classification and regression tasks, compared with other techniques, i.e. Decision/Regression Trees. Thus, an analysis was performed over a wide range of software tools that implement the referred models. From this set, two open-source applications (the R programming environment and Weka) were selected to conduct the experiments. Several real-world problems from the UCI public repository were used as benchmarks. The results show that in general the SVM achieves better forecasts, followed by the ANN. Nevertheless, this increase in performance is achieved with a higher computational effort.
Article
Full-text available
Faced with the internationalization of the economy, organizations need to rely on information and knowledge, supported by information and communication technologies (ICT), to think globally in terms of comprehensive policies, and to rely on networked economies under associative schemes that strengthen them. The growth in the number of companies in recent years makes it a priority to try to obtain useful knowledge from the data themselves and to go a step further in supporting better decision-making. To that end, the document offers basic information about data mining, identifies its different stages, and determines its relationship with other disciplines. In addition, the operation of the "decision tree" type of algorithm is explained, and the "Weka" tool is used to fit models to data sets.
Article
Full-text available
Food quality is associated with a set of properties and characteristics that give foods the capacity to satisfy consumers' needs. In sensory evaluation, the food industry has a tool for assessing the consumer's perception of a product as a whole, or of a specific aspect of it. This tool, however, is intrinsically subjective because of its dependence on the human senses; consequently, different evaluators may differ in their appreciation of a given product. The uncertainty associated with sensory perception need not be a problem; it can be exploited as part of the evaluation process if handled with fuzzy logic. The objectives of the following work are to assess the application of fuzzy logic in sensory evaluation, and to determine the acceptability of a beverage using a series of affective tests and instrumental data. For this purpose, the evaluation of a sample of a pineapple-based beverage is used as an example. The results show that it is possible to predict the acceptance of the beverage using the fuzzy logic system with an accuracy comparable to that exhibited by the human evaluators.
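A minimal sketch of the fuzzy-logic idea, assuming invented triangular membership functions and a toy rule base rather than the paper's actual system, is shown below.

```python
# Toy fuzzy-logic sketch mapping a sensory score to an acceptance degree;
# membership functions and rules are invented for illustration.
def triangular(x, a, b, c):
    # Triangular membership with support [a, c] and peak at b.
    return max(min((x - a) / (b - a), (c - x) / (c - b)), 0.0)

# Fuzzify a panellist's overall liking score on a 0-10 scale.
score = 6.5
low = triangular(score, 0.0, 2.5, 5.0)
medium = triangular(score, 2.0, 5.0, 8.0)
high = triangular(score, 5.0, 7.5, 10.0)

# Toy rule base: medium or high liking -> acceptable; low liking -> not acceptable.
acceptable = max(medium, high)
not_acceptable = low

# Defuzzify with a weighted average of the rule activations.
acceptance = (acceptable * 1.0 + not_acceptable * 0.0) / (acceptable + not_acceptable)
print(round(acceptance, 2))
```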
Conference Paper
Today, our social, economic and political systems all make increasing use of the underlying computing infrastructure, and are heavily reliant on its safety and robustness. The ubiquitous collection and analysis of data through this infrastructure creates a burgeoning privacy problem. Indeed, special care must be taken to ensure that privacy is not breached from misuse of data flowing through these systems. Recently, the severity of this problem has been recognized both in the legislature and in the computing research field. However, we still lack a comprehensive view of this important topic in the undergraduate curriculum. Privacy is a critical problem for individuals and society at large. Serious problems are caused inadvertently due to ignorance of the subject and general lack of knowledge. Raising awareness of privacy issues, along with knowledge of the current state of the art technical and sociological solutions is best inculcated in young minds right from the start. In this paper, we explore how a comprehensive view of privacy can be incorporated into the undergraduate curriculum at the appropriate level. We present two alternative approaches towards this -- having an independent course for privacy or including small modules on privacy within existing courses.
Article
Full-text available
Companies have realized they need to hire data scientists, academic institutions are scrambling to put together data science programs, and publications are touting data science as a hot -- even "sexy" -- career choice. However, there is confusion about what exactly data science is, and this confusion could lead to disillusionment as the concept diffuses into meaningless buzz. In this paper we argue that there are good reasons why it has been hard to pin down exactly what data science is. One reason is that data science is intricately intertwined with other important concepts also of growing importance, such as big data and data-driven decision making. Another reason is the natural tendency to associate what a practitioner does with the definition of the practitioner's field; this can result in overlooking the fundamentals of the field. We believe that trying to define the boundaries of Data Science precisely right now is not of the utmost importance. We can debate the boundaries of the field in an academic setting, but in order for data science to serve business effectively, it is important (i) to understand its relationships to other important related concepts, and (ii) to begin to identify the fundamental principles underlying data science. Once we embrace (ii) we can much better understand and explain exactly what data science has to offer. Furthermore, only once we embrace (ii) should we be comfortable calling it data science. In this paper we present a perspective that addresses all these things. We close by offering as examples a partial list of fundamental principles underlying data science.
Book
HENUFOOD aims to reduce the risk factors of chronic disease pathologies and, in this way, improve the health of the adult population between 45 and 65 years of age. However, the benefits of this project, based on the development of healthy ingredients and foods, are intended to reach the rest of the population, from the youngest to the seniors. The main objective of HENUFOOD is to discover the health benefits of foods using innovative methodologies, and to demonstrate them scientifically. That will make it possible to develop products of nutritional value and demonstrate their health effects. These foods must remain foods, and must demonstrate their effects in the quantities usually consumed in a diet. The project seeks to determine clearly which foods or ingredients are absorbed by the organism and produce the beneficial effect they are supposed to. This paper will focus on describing the ICT platform developed to support the scientists in reaching that purpose.
Article
Full-text available
Several studies have focused on problems related to data mining techniques, including several applications of these techniques in the e-commerce setting. In this work, we describe how data mining technology can be effectively applied in an e-commerce environment, delivering significant benefits to the business analyst. We propose a framework that takes the results of the data mining process as input, and converts these results into actionable knowledge, by enriching them with information that can be readily interpreted by the business analyst. The framework can accommodate various data mining algorithms, and provides a customizable user interface. We experimentally evaluate the proposed approach by using a real-world case study that demonstrates the added benefit of the proposed method. The same study validates the claim that the produced results represent actionable knowledge that can help the business analyst improve the business performance, since it significantly reduces the time needed for data analysis, which results in substantial financial savings.
Article
Exploratory data analysis is a data analysis method that analyses data and finds its inherent patterns based on the actual distribution of the data. This article explores the use of exploratory data analysis on a communication operator's pseudo-family customers, in order to identify the customers to focus on and achieve targeted marketing.
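A small pandas sketch of this kind of exploratory analysis is shown below; the customer attributes and the selection rule are hypothetical and only illustrate the general idea of examining the distribution before selecting customers to target.

```python
# Illustrative exploratory data analysis over hypothetical customer records.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": range(1, 9),
    "monthly_minutes": [120, 300, 80, 450, 60, 500, 220, 90],
    "lines_on_account": [1, 3, 1, 4, 1, 5, 2, 1],
    "months_active": [3, 24, 6, 36, 2, 48, 12, 5],
})

# Describe the actual distribution of the data before any modelling.
print(customers.describe())

# A simple "pseudo-family" style cut: multi-line, long-tenured accounts.
focus = customers[(customers["lines_on_account"] >= 3) & (customers["months_active"] >= 24)]
print(focus[["customer_id", "monthly_minutes"]])
```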
Article
Full-text available
Human capital is of high concern for companies' management, whose main interest is in hiring highly qualified personnel who are expected to perform highly as well. Recently, there has been a growing interest in the data mining area, where the objective is the discovery of knowledge that is correct and of high benefit for users. In this paper, data mining techniques were utilized to build a classification model to predict the performance of employees. To build the classification model, the CRISP-DM data mining methodology was adopted. Decision tree was the main data mining tool used to build the classification model, and several classification rules were generated. To validate the generated model, several experiments were conducted using real data collected from several companies. The model is intended to be used for predicting new applicants' performance.
Chapter
The aim of this chapter is to explore the application of data mining for analyzing the performance and satisfaction of the students enrolled in an online two-year master's degree programme in project management. This programme is delivered by the Academy of Economic Studies, the biggest Romanian university in economics and business administration, in parallel as an online programme and as a traditional one. The main data sources for the mining process are the survey made for gathering students' opinions, the operational database with the students' records, and data regarding student activities recorded by the e-learning platform. More than 180 students responded, and more than 150 distinct characteristics/variables per student were identified. Due to the large number of variables, data mining is a recommended approach to analyse this data. Clustering, classification, and association rules were employed in order to identify the factors explaining students' performance and satisfaction, and the relationship between them. The results are very encouraging and suggest several future developments.
Chapter
Data Mining is an iterative, multi-step process consisting of different phases such as domain (or business) understanding, data understanding, data preparation, modeling, evaluation and deployment. Various data mining tasks are dependent on the human user for their execution. These tasks and activities that require human intelligence are not amenable to automation like tasks in other phases such as data preparation or modeling are. Nearly all Data Mining methodologies acknowledge the importance of the human user but do not clearly delineate and explain the tasks where human intelligence should be leveraged or in what manner. In this chapter we propose to describe various tasks of the domain understanding phase which require human intelligence for their appropriate execution.
Conference Paper
Full-text available
Thousands of news stories are reported each day, so how to extract useful information from the large volume of web news is an important technology today. Advances in information technology have partially automated the processing of documents, reducing the amount of text which must be read. In this paper we present a Web News Search System, called WNSS. WNSS can automatically discover and extract phrases from large corpora of web news stories. In addition, we give concrete examples of how to preprocess texts based on the intended use of the discovered results. We also evaluate whether the extracted phrases can be used for important tasks. Keywords: web news, information technology, phrase extraction, pre-processing texts
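A generic frequency-based phrase (bigram) extraction sketch is shown below for illustration; it is not the WNSS algorithm, and the example documents are invented.

```python
# Rough sketch of frequency-based bigram extraction from news text; a generic
# technique, not the WNSS system's actual phrase-extraction method.
import re
from collections import Counter

documents = [
    "The central bank raised interest rates again this quarter.",
    "Analysts expect the central bank to hold interest rates steady.",
    "Interest rates influence mortgage costs across the country.",
]

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

bigrams = Counter()
for doc in documents:
    tokens = tokenize(doc)
    bigrams.update(zip(tokens, tokens[1:]))

# Keep the bigrams that recur across the corpus as candidate phrases.
for phrase, count in bigrams.most_common(5):
    if count > 1:
        print(" ".join(phrase), count)
```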
Article
The maintenance and service records collected and maintained by engineering companies are a useful resource for the ongoing support of products. Such records are typically semi-structured and contain key information such as a description of the issue and the product affected. It is suggested that further value can be realised from the collection of these records for indicating recurrent and systemic issues which may not have been apparent previously. This paper presents a faceted classification approach to organise the information collection that might enhance retrieval and also facilitate learning from in-service experiences. The faceted classification may help to expedite responses to urgent in-service issues as well as to allow for patterns and trends in the records to be analysed, either automatically using suitable data mining algorithms or by manually browsing the classification tree. The paper describes the application of the approach to aerospace in-service records, where the potential for knowledge discovery is demonstrated.
Article
Various data mining methodologies have been proposed in the literature to provide guidance towards the process of implementing data mining projects. The methodologies describe a data mining project as comprised of a sequence of phases and highlight the particular tasks and their corresponding activities to be performed during each of the phases. It seems that the large number of tasks and activities, often presented in a checklist manner, are cumbersome to implement and may explain why all the recommended tasks are not always formally implemented. Additionally, there is often little guidance provided towards how to implement a particular task. These issues seem to be especially dominant in case of the business understanding phase which is the foundational phase of any data mining project. In this paper, we present an organizationally grounded framework to formally implement the business understanding phase of data mining projects. The framework serves to highlight the dependencies between the various tasks of this phase and proposes how and when each task can be implemented. An illustrative example of a credit scoring application from the financial sector is used to exemplify the tasks discussed in the proposed framework.
Conference Paper
Online mining of changes from data streams is an important problem in view of growing number of applications such as network flow analysis, e-business, stock market analysis etc. Monitoring of these changes is a challenging task because of the high speed, high volume, only-one-look characteristics of the data streams. User subjectivity in monitoring and modeling of the changes adds to the complexity of the problem. This paper addresses the problem of i) capturing user subjectivity and ii) change modeling, in applications that monitor frequency behavior of item-sets. We propose a three stage strategy for focusing on item-sets, which are of current interest to the user and introduce metrics that model changes in their frequency (support) behavior.
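The sketch below illustrates, with invented transactions and thresholds, how the support of a focused item-set can be monitored over successive portions of a transaction stream; it is not the paper's three-stage strategy or its metrics.

```python
# Toy sketch of monitoring the support (frequency) of a focused item-set over a
# bounded window of a transaction stream; data and thresholds are illustrative.
from collections import deque

focused_itemset = {"milk", "bread"}
window = deque(maxlen=100)          # keep only a bounded window of recent transactions
previous_support = None

def support(transactions, itemset):
    if not transactions:
        return 0.0
    return sum(itemset <= t for t in transactions) / len(transactions)

stream = [{"milk", "bread"}, {"milk"}, {"bread", "eggs"}, {"milk", "bread", "eggs"},
          {"eggs"}, {"milk", "bread"}] * 40

for i, transaction in enumerate(stream, 1):
    window.append(frozenset(transaction))
    if i % 50 == 0:                 # evaluate once per batch of arrivals
        current = support(window, focused_itemset)
        # Flag the item-set when its support drifts by more than a chosen threshold.
        if previous_support is not None and abs(current - previous_support) > 0.1:
            print(f"support change at transaction {i}: {previous_support:.2f} -> {current:.2f}")
        previous_support = current
```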