Article

Cross Industry Standard Process for Data Mining

... In another case, reports can be used in scientific publishing and so on. 6. Intelligent data modelling: after EDA, we want to model our data using some intelligent algorithm like Logistic regression, Naïve Bayes, neural network, or something else. ...
... CRISP-DM is one of the most widely used DM concepts and methodologies [6]. The business understanding is usually based on the provided quest formulations and data description. ...
... The evaluation phase can be performed under various criteria for thorough testing of the machine learning models in order to choose the best model for the deployment phase. 6. The deployment phase, also called the production phase, involves usage of a trained model to exploit its functionality, and the creation of a data pipeline into production. ...
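For illustration, the excerpts above map naturally onto a small scikit-learn workflow. The sketch below (illustrative only; the synthetic data, the candidate models, and the output file name are assumptions, not taken from the cited works) trains a few candidate classifiers, evaluates them under a common criterion, and persists the best one as a stand-in for the deployment phase.

```python
# Minimal sketch of the modelling, evaluation and deployment steps of a
# CRISP-DM style project; data and model choices are illustrative only.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split, cross_val_score
import joblib

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": GaussianNB(),
    "neural_network": MLPClassifier(max_iter=1000, random_state=0),
}

# Evaluation phase: score every candidate under the same criterion (here CV accuracy).
scores = {name: cross_val_score(model, X_train, y_train, cv=5).mean()
          for name, model in candidates.items()}
best_name = max(scores, key=scores.get)

# Deployment phase: refit the selected model and persist it for the production pipeline.
best_model = candidates[best_name].fit(X_train, y_train)
print(best_name, scores[best_name], best_model.score(X_test, y_test))
joblib.dump(best_model, "model.joblib")
```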
Book
http://elvira.fiit.stuba.sk/ This textbook presents an introduction to Data Science in the context of the responsible development of human-centric and trustworthy Artificial Intelligence systems. It presents the recent transition from focusing on modeling to the underlying data used to train and evaluate models. In the textbook, a systematic way to examine data is described, with details about data sources, data collection, data integration, and data preparation as part of the Data Science process. The aim of the process is to provide the best quality of machine learning data and to build an intelligent application in an organization. A separate chapter on intelligent data modelling is dedicated to several selected machine learning methods and deep learning architectures. The emphasis is on model evaluation and model selection, with optimizations to select the best models to deploy in production. In the context of intelligent software development, the textbook presents the most popular and most used machine learning and deep learning frameworks and libraries, with the dominance of Python open source software at small scale as well as large scale. A note about specialized hardware for speeding up general-purpose computation in the last decade is included. The textbook also emphasises ethics in Artificial Intelligence development, with the most notable document being the “European Union Guideline on Ethics in Artificial Intelligence: Context and Implementation”.
... In order to understand the process involved in business analytics implementation, several business analytics standards have been reviewed. The most famous standard employed by organizations is CRISP-DM [26]. It involves six levels of processes, beginning with understanding the organization's need to perform analytics and moving to the business analyst, who defines the suitable data to be used and its source. ...
... Factor categories cited include Financial [31], [32]; Infrastructure [26], [27], [32]; Technology; and Hardware and Software [26], [27]; along with performance [30]. This research was conducted to inquire into the implementation of BI for Organizational Performance Management (OPM). ...
Article
Full-text available
Business intelligence and analytics (BIA) is emerging as a critical area to boost organizational performance. Nowadays, data is not only important and valuable to the organization but recognized as necessary to drive the organization's performance and success. As a result, many organizations spend a considerable amount of investment toward obtaining faster, accurate information on a real-time basis. Previous studies revealed that even though many organizations use business intelligence technologies for obtaining information, they still lack analytics implementation. Therefore, this study aims to discover the integrated implementation factors of business intelligence and analytics in managing organizational performance, particularly for organizations of the public sector. In achieving this, an in-depth literature review was carried out to identify the influential factors in the implementation of business intelligence, business analytics, and performance management. Subject matter experts in Business Intelligence (BI), Business Analytics (BA) and Organisational Performance Management (OPM) were invited to participate in this empirical study, which was conducted in Malaysia. The study was carried out through interviewing the experts in order to identify the essential factors for business intelligence and data analytics implementation. Twenty essential factors and sixty-four sub-factors were identified and analyzed to construct the integrated factors in BIA and OPM implementation. The result of the study revealed four integrated factors of BIA and OPM implementation, namely skill, documentation, visualization, and work culture. Finance, data management, software, strategic planning, and decision-making are other factors integrated with BI, BA, and OPM respectively. Finally, this study illustrates the integrated factors in a visual form.
... The general criteria for this assessment were: access (cost of the tools), the user interface (how easy or difficult the tool may be for users to use), the process (or methodology) on which they are based, extensibility (the ability to easily and dynamically extend the set of algorithms the tool offers), and support for project development by work teams. As a result, it was found that none of the tools fully complies with CRISP-DM (Cross-Industry Standard Process for Data Mining) (CRISP-DM, 2006; Chapman et al., 2000), an iterative, open, customizable process for developing data mining projects that is widely recognized by industry and academia; that none of these tools allows dynamic, run-time extension (without recompiling the code) of the set of mining algorithms initially delivered with the tool; and that even though some tools have an easy-to-use interface, none of them properly guides the development of a project, much less helps its users to learn and deepen their command of the process and of mining project development in general. For these reasons, the GTI research group decided to develop an integrated CASE tool (one that supports all phases of a process), based on CRISP-DM (CRISP-DM, 2006; Chapman et al., 2000), easily extensible at run time, easy to use, and able to help users improve their knowledge and skills in developing data mining projects. ...
... Several methodologies exist to guide the data mining process; they aim to make it easier to carry out new projects with similar characteristics, to optimize their planning and management, to reduce their complexity, and to allow them to be tracked better (Gondar Nores, 2004). Among these methodologies, CRISP-DM (2006) and SEMMA (Sample, Explore, Modify, Model, Assess) (SAS, 2009b) stand out. SEMMA focuses on the technical characteristics of carrying out the process, whereas CRISP-DM keeps the business objectives of the project as its central focus. ...
Article
Full-text available
This paper introduces CMIN, an integrated computer aided software engineering (CASE) tool based on cross-industry standard process for data mining (CRISP-DM) 1.0 designed to support carrying out data mining projects. It is "integrated" in the sense that it supports all phases of a process. A general overview of how CMIN works is presented first, including a treatment of processes, templates and project management. CMIN's capacity for easily and intuitively monitoring projects is highlighted, as is the manner in which CMIN allows a user to increase knowledge regarding using CRISP-DM or any other process defined in the CASE tool through the help and information presented in each step. Next, it is shown how CMIN can bind new data mining algorithms in runtime (without the need to recompile the tool) to support modelling tasks (based on a Workflow) and evaluate data mining projects. Finally, the results of two evaluations of the tool, some conclusions and suggestions for future work are presented.
... Given that we start from a historical data set taken from former projects, a data mining methodology will be applied. The CRISP-DM [2] methodology used to solve this problem will be described. Finally, the modelling method and the conclusions we have drawn to solve the problem faced will also be described. ...
... In order to get a successful model, a working methodology must be used for Data Mining projects. CRISP-DM [2] is one of the most common process models. It divides the life cycle of Data Mining projects into six phases. ...
Conference Paper
Full-text available
Project scheduling is a crucial task; a project can fail because its duration lengthens or because the effort needed to implement it is wrongly estimated. It is necessary to have a tool that helps us learn more about the project, in order to choose the attributes that influence project deviations and to obtain more accurate estimations. This paper analyses the feasibility and the advantages, compared with current techniques, of developing a system based on Artificial Intelligence techniques capable of selecting the attributes that affect the project duration and the effort it requires, using a data set of historical information on software projects. To do that, a method is proposed for analysing the existing data set and pre-processing it, in order to obtain a model that can meet the project manager's standards.
... This data includes approximately 100 million clicks on 6 million pages in about 20 million search sessions. In order to understand the Clickstream data, identify its quality, and discover insights into the data (CRISP-DM 1996), several statistical tests (Zhang and Segall 2008) were performed on the data and the results are shown in Tables 1, 2, and 3. Table 1 lists overall web page popularity as measured by number of clicks. Take the bolded row as an example. ...
... In a data mining project, it is very important to understand the project objectives and requirements from a business perspective (CRISP-DM 1996). In order to find the good quality web page clusters, we need to define cluster in search domain from a business perspective. ...
Article
Full-text available
The developments in the World Wide Web and the advances in digital data collection and storage technologies during the last two decades allow companies and organizations to store and share huge amounts of electronic documents. It is hard and inefficient to manually organize, analyze and present these documents. Search engines help users find relevant information by presenting a list of web pages in response to queries. How to assist users in finding the most relevant web pages from vast text collections efficiently is a big challenge. The purpose of this study is to propose a hierarchical clustering method that combines multiple factors to identify clusters of web pages that can satisfy users’ information needs. The clusters are primarily envisioned to be used for search and navigation and potentially for some form of visualization as well. An experiment on Clickstream data from a professional search engine was conducted; the results show that the clustering method is effective and efficient, in terms of both objective and subjective measures.
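As a rough illustration of the kind of hierarchical clustering described in this abstract (the paper's own method combines several domain-specific factors), the sketch below clusters pages from an invented feature matrix; the feature choices, distance metric, and cluster count are assumptions.

```python
# Sketch of hierarchical clustering of web pages from aggregated clickstream
# features; the feature construction is hypothetical, not the paper's method.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# One row per page, e.g. [click count, distinct sessions, mean dwell time].
page_features = rng.random((50, 3))

# Combine the factors into a single pairwise distance matrix.
distances = pdist(page_features, metric="euclidean")
tree = linkage(distances, method="average")

# Cut the dendrogram to obtain flat clusters of pages.
labels = fcluster(tree, t=5, criterion="maxclust")
for cluster_id in np.unique(labels):
    print(cluster_id, np.where(labels == cluster_id)[0])
```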
... Thus, several models of Data Mining processes were proposed by researchers and professionals. The examples include Fayyad, et al. (1996), Cabena et al. (1998), Cios et al. (2000), CRISP-DM (2003), Berry & Linoff (1997), Sharma, Osei-Bryson & Kasper (2012) and Ławrynowicz & Potoniec (2014). ...
Article
Full-text available
The Big Data phenomenon has imposed maturity on companies regarding the exploration of their data, as a prerogative to obtain valuable insights into their clients and the power of analysis to guide decision-making processes. Therefore, a general approach that describes how to extract knowledge for the execution of the business strategy needs to be established. The purpose of this research paper is to introduce and evaluate the implementation of a process for the experimental development of Data Mining (DM), AI and Data Science applications aligned with the strategic planning. A case study with the proposed process was conducted in a federal educational institution. The results generated evidence showing that it is possible to integrate a strategic alignment approach, an experimental method, and a methodology for the development of DM applications. Data Mining (DM) and Data Science (DS) applications also present the risks of other Information Systems, and the adoption of strategy-driven and scientific method processes are critical success factors. Moreover, it was possible to conclude that the application of the scientific method was facilitated, besides being an important tool to ensure the quality, reproducibility and transparency of intelligent applications. In conclusion, the process needs to be mapped to foment and guide the strategic alignment. Keywords: Big Data; Strategic Alignment; Experimentation; Small Data; Reproducibility
... In this way, several Data Mining process models have been proposed by researchers and practitioners. Examples of authors include Fayyad et al. (1996), Cabena et al. (1998), Cios et al. (2000), CRISP-DM (2003), Berry & Linoff (1997), Sharma, Osei-Bryson & Kasper (2012) and Ławrynowicz & Potoniec (2014). ...
Article
Full-text available
Objective: Identify and characterize the methodologies used for the experimental development of intelligent applications aligned with strategic planning. Methodology: A systematic mapping was carried out to characterize the research in the area, considering the last ten years. Originality: No scientific studies were found with the same research object as this article, namely to identify and characterize the methodologies for the experimental development of intelligent applications aligned with strategic planning, which increases the importance of the results presented here. Main results: No studies were found that presented a complete approach to disciplining strategic alignment and experimentation, providing clear compliance with strategic objectives and an experimental phase in the validation of results. However, partial instances of these characteristics could be mapped, such as experimentation, found in 28.57% of the studies. Among the countries, China, the United States and Brazil led the ranking of publications on the subject. As for the medium of publication, journals were the most used option. In addition, the "IEEE International Conference on Advanced Communications, Control and Computing Technologies" and the journal "Expert Systems with Applications" stood out as major publication venues. Theoretical Contributions: This research presents results relevant to academia and entrepreneurs, providing evidence that there is a gap in research on a formal method for the experimental, strategy-driven development of BI and Data Mining applications. In addition, this work serves as a source of consultation on the existing method standards for the development of intelligent applications, and the applied systematization can be replicated and extended. Finally, there is a focus on research that proposes methods of creating applications validated experimentally and aligned with strategy.
... CRISP-DM was conceived in 1996 and became a European Union project under the ESPRIT funding initiative in 1997 under the leadership of several companies that included Integral Solutions Ltd, Teradata, Daimler AG, NCR Corporation, and OHRA. The first version of the methodology was presented at the 4th CRISP-DM SIG Workshop in Brussels in March 1999 and was published as a step-by-step data mining guide later that year [22]. While many non-IBM data mining practitioners use CRISP-DM, IBM is the primary corporation that currently uses the CRISP-DM process model, and it has incorporated it into its SPSS (Statistical Package for the Social Sciences) modeler product [23]. ...
Article
Full-text available
This research article presents a study to compare the teaching performance of teaching-only versus teaching-and-research professors at higher education institutions. It is a common belief that, generally, teaching professors outperform research professors in teaching-and-research universities according to student perceptions reflected in student surveys. This case study presents experimental evidence that shows this is not always the case and that, under certain circumstances, it can be the contrary. The case study is from Tecnologico de Monterrey (Tec), a teaching-and-research, private university in Mexico that has developed a research profile during the last two decades using a mix of teaching-only and teaching-and-research faculty members; during this time period, the university has had a growing ascendancy in world university rankings. Data from an institutional student survey called the ECOA was used. The data set contains more than 118,000 graduate and undergraduate courses for 5 semesters (January 2017 to May 2019). The results presented were derived from statistical to data mining methods, including Analysis of Variance and Logistic Regression, that were applied to this data set of more than nine thousand professors who taught those courses. The results show that teaching-and-research professors perform better or at least the same as teaching-only professors. The differences found in teaching with respect to attributes like professors’ gender, age, and research level are also presented.
... Furthermore, a large body of literature exists in which detailed discussions of DM and its techniques can be found; some examples are [11]; [12]; [13]; [14]; [9]; [10]; [15]. In this paper, three classification methods from among the DM-based techniques are applied to propose a predictive model for immunize-able diseases. ...
Article
Full-text available
Disease rates vary between different locations, particularly in rural areas. While a database of disease occurrences can easily be found, studies have been limited to descriptive statistical analysis and are mostly restricted to diseases affecting adults. This paper therefore presents a Mathematical Model (MM) for predicting immunize-able diseases that affect children between ages 0-5 years. The model was adapted and deployed for use in six (6) selected localized areas within Osun State in Nigeria. The MM was developed using MATLAB's ANN toolbox, the Statistics toolbox for classification and regression, and the Naïve Bayesian classifier. The MM is robust in that it takes advantage of three (3) data mining techniques: ANN, Decision Tree Algorithm and Naïve Bayes Classifier. These data mining techniques provided the means by which hidden information was discovered for detecting trends within databases, thus facilitating the prediction of future disease occurrence in the tested locations. Results obtained showed that diseases have peak periods depending on their epidemicity, hence the need to adequately administer immunization to the right places at the right time. Therefore, this paper argues that using this model would enhance the effectiveness of routine immunization in Nigeria.
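The cited work used MATLAB toolboxes; as an illustrative Python analogue only, the sketch below compares the same three classifier families (neural network, decision tree, Naïve Bayes) on synthetic data. The data and parameters are assumptions.

```python
# Python analogue of comparing the three classifier families mentioned above;
# the data here is synthetic, standing in for disease-occurrence records.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

# Stand-in for records such as (location, month, age group, ...) -> outbreak label.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4, random_state=1)

models = {
    "neural_network": MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=1),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=1),
    "naive_bayes": GaussianNB(),
}
for name, model in models.items():
    # Cross-validated accuracy as a simple basis for comparing the techniques.
    print(name, cross_val_score(model, X, y, cv=5).mean())
```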
... As mentioned in Section 2.2.3, the data analytics modeling approach aims at creating machine-specific and granular models using manufacturing data. Some good approaches, such as CRISP-DM, which defines a standardized process model for data mining from a business perspective [54], have been introduced; however, they are not specific to the manufacturing domain. The proposed approach identifies a logical modeling procedure for creating models based on machine-learning, statistical, or stochastic analysis, as shown in Figure 9. Here, "component model" means the model that figures out a numerical relationship between Cause-and-Effect (CE) data up to the designated level and, thus, can predict target performance at a certain manufacturing configuration. ...
Article
Full-text available
Manufacturing industries have recently promoted smart manufacturing (SM) for achieving intelligence, connectedness, and responsiveness of manufacturing objects consisting of man, machine, and material. Traditional manufacturing platforms, which identify generic frameworks where common functionalities are shareable and diverse applications are workable, mainly focused on remote collaboration, distributed control, and data integration; however, they are limited to incorporating those characteristic achievements. The present work introduces an SM-toward manufacturing platform. The proposed platform incorporates the capabilities of (1) virtualization of manufacturing objects for their autonomy and cooperation, (2) processing of real and various manufacturing data for mediating physical and virtual objects, and (3) data-driven decision-making for predictive planning on those objects. For such capabilities, the proposed platform advances the framework of Holonic Manufacturing Systems with the use of agent technology. It integrates a distributed data warehouse to encompass data specification, storage, processing, and retrieval. It applies a data analytics approach to create empirical decision-making models based on real and historical data. Furthermore, it uses open and standardized data interfaces to embody interoperable data exchange across shop floors and manufacturing applications. We present the architecture and technical methods for implementing the proposed platform. We also present a prototype implementation to demonstrate the feasibility and effectiveness of the platform in energy-efficient machining.
... In reviewing the implementation of business analytics, the CRISP-DM (2013) was used as the main reference. CRISP-DM is the most popular standard used by most organisations [24]. It consists of six processes initiated by the process of obtaining an understanding of the organization and the need to perform analytics. ...
... These works underscore the need for human interaction and its role in the successful culmination of the KDD endeavor. CRISP-DM (CRoss-Industry Standard Process for Data Mining) [5] advocates a data mining methodology consisting of tasks described at four levels of abstraction. The methodology is based on the KDD process model that offers a systematic understanding of step-by-step direction, tasks and objectives for every stage of the process. ...
Article
This paper presents an overview of the fast growing field of Knowledge Discovery in Database (KDD) and Data Mining. Data Mining and knowledge discovery have numerous applications in business and scientific domains. They improve effectiveness, efficiency and enhance the quality of decision making in business organizations and result in interesting discoveries in scientific research. Various techniques of data mining along with some related issues are also presented.
... The CRISP-DM methodology [10] [12] is described in terms of a hierarchical process model, consisting of a set of tasks described at four levels of abstraction (from the general to the specific): phase, generic task, specialized task, and process instance. ...
Article
A software tool is implemented to support the analysis of the behaviour of the learning activities that teachers record in the Extended Classroom (Aula Extendida) of the Universidad Autónoma del Caribe. For this, a methodology for developing data mining projects, CRISP-DM, is adopted, which provides the phases of business understanding, data understanding, and data preparation, taking into account the ETL (Extraction, Transformation and Load) tool Talend Open, which allows the data to be cleansed for integration. To search for patterns of behaviour in teaching performance in the extended classroom, socio-demographic variables and levels of education are taken into account. The WEKA tool is used, which allows the models that determine a teacher's behaviour to be built from training and validation data. The behaviour models are generated through different techniques [4]: decision trees, neural networks, and decision rules, which, depending on a specific situation and the variable under study, make it possible to display and select the best analysis. These results make it possible to generate strategies that support the academic training process and raise the quality of education at the Universidad Autónoma del Caribe.
... Data mining techniques can be applied with several knowledge process models (Kurgan & Musilek, 2006; Cios et al., 2007). The Cross Industry Standard Process for Data Mining (CRISP-DM), which is a knowledge discovery and data mining process, is one of these models. This process model was jointly developed by the corporations DaimlerChrysler AG, SPSS, NCR, and OHRA (CRISP-DM, 2000). As shown in Figure 1, the phases of the CRISP-DM method can be listed as business understanding, data understanding, data preparation, modelling (the step of using data mining methods), evaluation, and deployment. ...
Article
Full-text available
Decision makers develop transportation plans and models for providing sustainable transport systems in urban areas. Mode choice is one of the stages in transportation modelling. Data mining techniques can discover the factors affecting mode choice. These techniques can be applied with a knowledge process approach. In this study, a data mining process model is applied to determine the factors affecting mode choice with decision tree techniques, considering individual trip behaviours from household survey data collected within the Izmir Transportation Master Plan. From this perspective, the transport mode choice problem is solved for a case in the district of Buca, Izmir, Turkey, with the CRISP-DM knowledge process model.
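The following sketch illustrates, on invented survey-style attributes, how a decision tree can expose the factors driving mode choice; the variable names and data are assumptions and do not come from the Izmir study.

```python
# Illustrative decision-tree model for mode choice from household-survey style
# attributes; the attributes and values are invented for the sketch.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

survey = pd.DataFrame({
    "car_ownership": [0, 1, 1, 0, 2, 1, 0, 2],
    "trip_distance_km": [1.2, 8.5, 3.0, 0.8, 12.0, 5.5, 2.1, 15.0],
    "income_level": [1, 3, 2, 1, 3, 2, 1, 3],
    "mode": ["walk", "car", "bus", "walk", "car", "bus", "bus", "car"],
})

X = survey[["car_ownership", "trip_distance_km", "income_level"]]
y = survey["mode"]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
# The induced rules show which attributes drive the chosen transport mode.
print(export_text(tree, feature_names=list(X.columns)))
```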
... Data pre-processing entails the discretization of the target range, because it has been demonstrated that by so doing NBC for regression performs comparably to well-known methods for time series prediction [14]. The pre-processing step, or data preparation, is a key step in the non-trivial Knowledge Discovery and Data Mining process, upon which the success of the entire process depends [6,4,2,3]. The second and the third steps are necessary within a supervised learning scheme, such as the one proposed in this work. ...
Article
Full-text available
In this paper, the estimation of the Residual Useful Life (RUL) of degraded thrust ball bearings is made resorting to a data-driven stochastic approach that relies on an iterative Naïve Bayesian Classifier (NBC) for regression task. NBC is a simple stochastic classifier based on applying Bayes' theorem for posterior estimate updating. Indeed, the implemented iterative procedure allows for updating the RUL estimation based on new information collected by sensors located on the degrading bearing, and is suitable for an on-line monitoring of the component health status. The feasibility of the approach is shown with respect to real world vibration-based degradation data.
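A minimal numerical sketch of the iterative Bayesian-updating idea described above follows, assuming a discretized RUL range and invented class-conditional distributions for a single vibration feature; it is not the paper's implementation.

```python
# Toy sketch of an iterative Bayesian update over a discretized RUL range,
# in the spirit of the approach described above; all distributions are invented.
import numpy as np

rul_bins = np.array([100.0, 60.0, 30.0, 10.0])       # discretized RUL classes (hours)
prior = np.full(len(rul_bins), 1.0 / len(rul_bins))  # start from a uniform prior

# Assumed class-conditional model of the vibration feature for each RUL class.
means = np.array([1.0, 2.0, 3.5, 5.0])
std = 0.8

def gaussian_likelihood(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

posterior = prior
for measurement in [1.1, 2.3, 3.2, 4.8]:              # new sensor readings over time
    likelihood = gaussian_likelihood(measurement, means, std)
    posterior = posterior * likelihood
    posterior = posterior / posterior.sum()            # Bayes' rule: normalize
    rul_estimate = float(np.dot(posterior, rul_bins))  # expected RUL under the posterior
    print(measurement, posterior.round(3), rul_estimate)
```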
... Moreover, nowadays some initiatives to standardize the definition of data mining techniques and the knowledge discovery process, and to provide APIs, are gaining strength (Grossman et al., 2002). Good examples are: the Predictive Model Markup Language (PMML, 2004), an XML-based language which provides a way for applications to define statistical and data mining models and to share models between PMML-compliant applications; the SQL Multimedia and Applications Packages Standard (Melton & Eisenberg, 2001), which specifies an SQL interface to data mining applications and services, and provides an API for data mining applications to access data from SQL/MM-compliant relational databases; the Java Specification Request-73 (JSR, 2004), which defines a pure Java API supporting the building of data mining models and the creation, storage, and access to data and metadata; the Microsoft-supported OLE DB for DM, defining an API for data mining for Microsoft-based applications (OLE DB, 2004); and the CRoss-Industry Standard Process for Data Mining (CRISP-DM, 2004), capturing the data mining process from business problems to deployment of the knowledge gained during the process. ...
... CRISP-DM [CRISP-DM 1996] (Cross Industry Standard Process for Data Mining) is a methodology for knowledge discovery in databases that imposes detailed planning and evaluation of the process across its phases, facilitating the organization, understanding, and control of events in project coordination. The CRISP-DM model was conceived with the aim of being a standard methodology; its phases range from planning, with the identification of objectives from the perspective of business understanding, to the application of the extracted knowledge. ...
Conference Paper
Full-text available
Data mining is becoming increasingly common in both the private and public sectors. In the area of public safety, data mining can be used to determine where the levels of crime are higher, define profiles of victims and criminals, and detect the days on which the greatest number of crimes occur. The aim of this paper is to use data mining on the SISGOP system, a database that records the police reports of the occurrences of Maceió, in order to discover information that aids the strategic actions of the police department, based on the behaviour of criminals and victims.
... The field of data mining has a relatively structured and consistent methodology for addressing data mining problems with machine learning techniques. This method is articulated in most data mining books (for example, see [24]) and is sometimes referred to as Knowledge Discovery in Databases (KDD) [65]; another example is CRISP-DM [64]. The method provides a best practice for clearly defining a given problem, preparing the tools and data, applying the tools, and analysing and interpreting the results. ...
Conference Paper
Full-text available
ABSTRACT In the oil industry, turbocompressor units are essential for handling natural gas production. In the case of the Muscar compressor plant of PDVSA, these units exhibit failures and low availability due to the absence of maintenance that would extend their useful life. The objective of this work was to apply artificial intelligence to the prediction of failures in turbocompressor units, in order to optimize managerial decision-making applicable to predictive maintenance through the selection of an efficient model. To achieve the stated objectives, the Cross Industry Standard Process for Data Mining methodology applied to artificial intelligence was used, supported by a mixed design of documentary and field research. The development made it possible to analyse the functionality of machine learning techniques, neural networks combined with fuzzy logic, to predict failures and make decisions in planning the maintenance activities of the turbocompressor units, based on the detected conditions and the corresponding optimal solution in each particular case. Building the artificial intelligence model provided the classification and prediction of failures in order to finally offer a set of solutions with expert knowledge. Keywords: Machine Learning, Failure Prediction, Managerial Decisions. 1. INTRODUCTION Managerial decision-making is an elementary process in organizations whose complexity depends on the organizational context; it consists of choosing among alternative solutions to the situations encountered, in the most optimal way possible and within the accepted time, without affecting the company's productivity. Information is important for making decisions; through it, the knowledge and experience needed to evaluate possible courses of action are built. From this perspective, artificial intelligence is currently applicable in various fields. Artificial intelligence techniques can be employed as technological tools to support decision-making, as a complement for the possible uncertainties not covered by managers' decisions; moreover, they offer analytical results and predictions that generate information useful to management for understanding and anticipating decisions about a specific process.
Article
The increasingly competitive higher educational environment compels the management of universities and colleges to assign high priority to an overall maximisation of client services. Consequently, while academic leaders must become familiar with the aspects of on-line communication much favoured by today’s younger generation, the intensification and improvement of the quality of available on-line services cannot be imagined without reliable information on the Internet use habits and behaviour of clients. The managers and administrators of Hungarian college and university websites are mostly unfamiliar with the web-related conduct or habits of their customers since, in the case of long-running web pages based on an unchanging structure, only basic visitor statistics are available at best. Yet marketing communication decisions should be based on information reflecting real website-consumer traits acquired via a more professional analysis. Data mining is one such decision-making support mechanism. Data mining models are capable of revealing and predicting information hidden beneath the respective critical mass. Therefore, inspired by the methodology of marketing science, this type of research concentrates on the segmentation of on-line consumers via the elaboration of visitor clusters. The present article provides a scientific overview and analysis of the main difficulties related to cluster construction, especially the development of the relevant algorithmic forms. The successful application of the model provides much-needed reliable and vital support to the institutional decision-making process. Thus pertinent data yielded by cluster research can facilitate more effective on-line services customized to the needs of the users. Key words: clustering model, data mining, marketing communication, on-line conduct, web-ergonomics.
Chapter
In this chapter, the authors explore the operational data related to transactions in a financial organisation to find out the suitable techniques to assess the origin and purpose of these transactions and to detect if they are relevant to money laundering. The authors' purpose is to provide an AML/CTF compliance report that provides AUSTRAC with information about reporting entities' compliance with the Anti-Money Laundering and Counter-Terrorism Financing Act 2006. Their aim is to look into the Money Laundering activities and try to identify the most critical classifiers that can be used in building a decision tree. The tree has been tested using a sample of the data and passing it through the relevant paths/scenarios on the tree. The success rate is 92%; however, the tree needs to be enhanced so that it can be used solely to identify the suspicious transactions. The authors propose that a decision tree using the classifiers identified in this chapter can be incorporated into financial applications to enable organizations to identify the High Risk transactions and monitor or report them accordingly.
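As an illustration of this kind of decision-tree classification and hold-out evaluation (the chapter's own classifiers, data, and 92% figure are its own), a hedged sketch on synthetic transaction features follows; the feature names and labelling rule are invented.

```python
# Hedged sketch of a transaction-risk decision tree and its hold-out success
# rate; features, labels, and thresholds are illustrative, not the chapter's model.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
n = 1000
amount = rng.exponential(scale=5000, size=n)
cross_border = rng.integers(0, 2, size=n)
cash_intensive = rng.integers(0, 2, size=n)
# Synthetic labelling rule standing in for analyst-flagged transactions.
suspicious = ((amount > 10000) & (cross_border == 1)) | (cash_intensive == 1)

X = np.column_stack([amount, cross_border, cash_intensive])
X_train, X_test, y_train, y_test = train_test_split(X, suspicious, test_size=0.3, random_state=42)

tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)
print("success rate:", accuracy_score(y_test, tree.predict(X_test)))
```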
Chapter
Knowledge discovery is a critical component in improving health care. Health 2.0 leverages Web 2.0 technologies to integrate and share data from a wide variety of sources on the Internet. There are a number of issues which must be addressed before knowledge discovery can be leveraged effectively and ubiquitously in Health 2.0. Health care data is very sensitive in nature so privacy and security of personal data must be protected. Regulatory compliance must also be addressed if cooperative sharing of data is to be facilitated to ensure that relevant legislation and policies of individual health care organizations are respected. Finally, interoperability and data quality must be addressed in any framework for knowledge discovery on the Internet. In this chapter, we lay out a framework for ubiquitous knowledge discovery in Health 2.0 based on a combination of architecture and process. Emerging Internet standards and specifications for defining a Circle of Trust, in which data is shared but identity and personal information protected, are used to define an enabling architecture for knowledge discovery. Within that context, a step-by-step process for knowledge discovery is defined and illustrated using a scenario related to analyzing the correlation between emergency room visits and adverse effects of prescription drugs. The process we define is arrived at by reviewing an existing standards-based process, CRISP-DM, and extending it to address the new context of Health 2.0.
Chapter
The number of available Internet of Things (IoT) devices is growing rapidly, and users can utilize them via associated services to accomplish their tasks more efficiently. However, setting up IoT services based on the user and environmental context and the task requirements is usually a time-consuming job. Moreover, these IoT services operate in distributed computing environments in which spatially-cohesive IoT devices communicate via an ad-hoc network, and their availability is not predictable due to their mobility characteristics. To the best of our knowledge, no research has been done on saving and recovering users’ task-based IoT service settings while considering the context and task requirements. In this paper, we propose a framework for describing task-based IoT services and their settings in a semantical manner, and for providing semantic task-based IoT services in an effective manner. The framework uses a machine learning technique to store and recover users’ task-based IoT service settings. We evaluated the effectiveness of the framework by conducting a user study.
Article
Full-text available
The combined impact of new computing resources and techniques with an increasing avalanche of large datasets is transforming many research areas and may lead to technological breakthroughs that can be used by billions of people. In recent years, Machine Learning and especially its subfield Deep Learning have seen impressive advances. Techniques developed within these two fields are now able to analyze and learn from huge amounts of real-world examples in disparate formats. While the number of Machine Learning algorithms is extensive and growing, their implementations through frameworks and libraries are also extensive and growing. The software development in this field is fast-paced, with a large amount of open-source software coming from academia, industry, start-ups or wider open-source communities. This survey presents a recent time-slide comprehensive overview, with comparisons as well as trends in the development and usage of cutting-edge Artificial Intelligence software. It also provides an overview of massive parallelism support that is capable of scaling computation effectively and efficiently in the era of Big Data.
Thesis
Full-text available
The interest in the fields of Knowledge Discovery in Databases (KDD) and Data Mining emerged due to the rapid development of Information and Communication Technologies, which made vast amounts of data available to be stored in computers. Human experts have limitations and may fail to identify important details. As an alternative, automatic discovery tools can be used in order to obtain high-level knowledge from raw data. Considering this need, several Data Mining techniques have been proposed. This dissertation intends to examine the advantages of two non-linear Data Mining models: Artificial Neural Networks (ANN) and Support Vector Machines (SVM). In particular, it aims to measure their performance when applied to classification and regression tasks, compared with other techniques, i.e. Decision/Regression Trees. Thus, an analysis was performed over a wide range of software tools that implement the referred models. From this set, two open-source applications (the R programming environment and Weka) were selected to conduct the experiments. Several real-world problems from the UCI public repository were used as benchmarks. The results show that in general the SVM achieves better forecasts, followed by the ANN. Nevertheless, this increase in performance is achieved with a higher computational effort.
Article
Full-text available
Faced with the internationalization of the economy, organizations need to rely on information and knowledge, supported by information and communication technologies (ICT), to think globally in terms of comprehensive policies, and to rely on networked economies under associative schemes that strengthen them. The growth in the number of companies in recent years makes it a priority to try to obtain useful knowledge from the data themselves and to go a step further in supporting better decision-making. To that end, the document offers basic information about data mining, identifies its different stages, and determines its relationship with other disciplines. In addition, the operation of the "decision tree" type of algorithm is explained, and the "Weka" tool is used to fit models to data sets.
Article
Full-text available
Food quality is associated with a set of properties and characteristics that give foods the capacity to satisfy consumers' needs. In sensory evaluation, the food industry has a tool for assessing the consumer's perception of a product as a whole, or of a specific aspect of it. This tool, however, is intrinsically subjective because of its dependence on the human senses; consequently, different evaluators may differ in their appreciation of a given product. The uncertainty associated with sensory perception need not be a problem; it can be exploited as part of the evaluation process if handled with fuzzy logic. The objectives of the following work are to assess the application of fuzzy logic in sensory evaluation, and to determine the acceptability of a beverage using a series of affective tests and instrumental data. For this purpose, the evaluation of a sample of a pineapple-based beverage is used as an example. The results show that it is possible to predict the acceptance of the beverage using the fuzzy logic system with an accuracy comparable to that exhibited by the human evaluators.
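A minimal sketch of the fuzzy-logic idea, assuming invented triangular membership functions and a toy rule base rather than the paper's actual system, is shown below.

```python
# Toy fuzzy-logic sketch mapping a sensory score to an acceptance degree;
# membership functions and rules are invented for illustration.
def triangular(x, a, b, c):
    # Triangular membership with support [a, c] and peak at b.
    return max(min((x - a) / (b - a), (c - x) / (c - b)), 0.0)

# Fuzzify a panellist's overall liking score on a 0-10 scale.
score = 6.5
low = triangular(score, 0.0, 2.5, 5.0)
medium = triangular(score, 2.0, 5.0, 8.0)
high = triangular(score, 5.0, 7.5, 10.0)

# Toy rule base: medium or high liking -> acceptable; low liking -> not acceptable.
acceptable = max(medium, high)
not_acceptable = low

# Defuzzify with a weighted average of the rule activations.
acceptance = (acceptable * 1.0 + not_acceptable * 0.0) / (acceptable + not_acceptable)
print(round(acceptance, 2))
```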
Conference Paper
Today, our social, economic and political systems all make increasing use of the underlying computing infrastructure, and are heavily reliant on its safety and robustness. The ubiquitous collection and analysis of data through this infrastructure creates a burgeoning privacy problem. Indeed, special care must be taken to ensure that privacy is not breached from misuse of data flowing through these systems. Recently, the severity of this problem has been recognized both in the legislature and in the computing research field. However, we still lack a comprehensive view of this important topic in the undergraduate curriculum. Privacy is a critical problem for individuals and society at large. Serious problems are caused inadvertently due to ignorance of the subject and general lack of knowledge. Raising awareness of privacy issues, along with knowledge of the current state of the art technical and sociological solutions is best inculcated in young minds right from the start. In this paper, we explore how a comprehensive view of privacy can be incorporated into the undergraduate curriculum at the appropriate level. We present two alternative approaches towards this -- having an independent course for privacy or including small modules on privacy within existing courses.
Article
Full-text available
Companies have realized they need to hire data scientists, academic institutions are scrambling to put together data science programs, and publications are touting data science as a hot -- even "sexy" -- career choice. However, there is confusion about what exactly data science is, and this confusion could lead to disillusionment as the concept diffuses into meaningless buzz. In this paper we argue that there are good reasons why it has been hard to pin down exactly what data science is. One reason is that data science is intricately intertwined with other important concepts also of growing importance, such as big data and data-driven decision making. Another reason is the natural tendency to associate what a practitioner does with the definition of the practitioner's field; this can result in overlooking the fundamentals of the field. We believe that trying to define the boundaries of Data Science precisely right now is not of the utmost importance. We can debate the boundaries of the field in an academic setting, but in order for data science to serve business effectively, it is important (i) to understand its relationships to other important related concepts, and (ii) to begin to identify the fundamental principles underlying data science. Once we embrace (ii) we can much better understand and explain exactly what data science has to offer. Furthermore, only once we embrace (ii) should we be comfortable calling it data science. In this paper we present a perspective that addresses all these things. We close by offering as examples a partial list of fundamental principles underlying data science.
Book
HENUFOOD aims to reduce the risk factors of chronic disease pathologies and, in this way, improve the health of the adult population between 45 and 65 years of age. However, the benefits of this project, based on the development of healthy ingredients and foods, are intended to reach the rest of the population, from the youngest to the seniors. The main objective of HENUFOOD is to discover the health benefits of foods using innovative methodologies, and to demonstrate them scientifically. That will make it possible to develop products of nutritional value and demonstrate their health effects. These foods must remain foods, and must demonstrate their effects in the quantities usually consumed in a diet. The project seeks to determine clearly which foods or ingredients are absorbed by the organism and produce the beneficial effect they are supposed to. This paper will focus on describing the ICT platform developed to support the scientists in reaching that purpose.
Article
Full-text available
Several studies have focused on problems related to data mining techniques, including several applications of these techniques in the e-commerce setting. In this work, we describe how data mining technology can be effectively applied in an e-commerce environment, delivering significant benefits to the business analyst. We propose a framework that takes the results of the data mining process as input, and converts these results into actionable knowledge, by enriching them with information that can be readily interpreted by the business analyst. The framework can accommodate various data mining algorithms, and provides a customizable user interface. We experimentally evaluate the proposed approach by using a real-world case study that demonstrates the added benefit of the proposed method. The same study validates the claim that the produced results represent actionable knowledge that can help the business analyst improve the business performance, since it significantly reduces the time needed for data analysis, which results in substantial financial savings.
Article
Exploratory data analysis is a data analysis method that analyses data and finds its inherent patterns based on the actual distribution of the data. This article explores the use of exploratory data analysis on a communication operator's pseudo-family customers, in order to identify the customers to focus on and achieve targeted marketing.
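A small pandas sketch of this kind of exploratory analysis is shown below; the customer attributes and the selection rule are hypothetical and only illustrate the general idea of examining the distribution before selecting customers to target.

```python
# Illustrative exploratory data analysis over hypothetical customer records.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": range(1, 9),
    "monthly_minutes": [120, 300, 80, 450, 60, 500, 220, 90],
    "lines_on_account": [1, 3, 1, 4, 1, 5, 2, 1],
    "months_active": [3, 24, 6, 36, 2, 48, 12, 5],
})

# Describe the actual distribution of the data before any modelling.
print(customers.describe())

# A simple "pseudo-family" style cut: multi-line, long-tenured accounts.
focus = customers[(customers["lines_on_account"] >= 3) & (customers["months_active"] >= 24)]
print(focus[["customer_id", "monthly_minutes"]])
```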
Article
Full-text available
Human capital is of high concern for companies' management, whose main interest is in hiring highly qualified personnel who are expected to perform highly as well. Recently, there has been a growing interest in the data mining area, where the objective is the discovery of knowledge that is correct and of high benefit for users. In this paper, data mining techniques were utilized to build a classification model to predict the performance of employees. To build the classification model, the CRISP-DM data mining methodology was adopted. Decision tree was the main data mining tool used to build the classification model, and several classification rules were generated. To validate the generated model, several experiments were conducted using real data collected from several companies. The model is intended to be used for predicting new applicants' performance.
Chapter
The aim of this chapter is to explore the application of data mining for analyzing the performance and satisfaction of the students enrolled in an online two-year master's degree programme in project management. This programme is delivered by the Academy of Economic Studies, the biggest Romanian university in economics and business administration, in parallel as an online programme and as a traditional one. The main data sources for the mining process are the survey made for gathering students' opinions, the operational database with the students' records, and data regarding student activities recorded by the e-learning platform. More than 180 students responded, and more than 150 distinct characteristics/variables per student were identified. Due to the large number of variables, data mining is a recommended approach to analyse this data. Clustering, classification, and association rules were employed in order to identify the factors explaining students' performance and satisfaction, and the relationship between them. The results are very encouraging and suggest several future developments.
Chapter
Data Mining is an iterative, multi-step process consisting of different phases such as domain (or business) understanding, data understanding, data preparation, modeling, evaluation and deployment. Various data mining tasks are dependent on the human user for their execution. These tasks and activities that require human intelligence are not amenable to automation like tasks in other phases such as data preparation or modeling are. Nearly all Data Mining methodologies acknowledge the importance of the human user but do not clearly delineate and explain the tasks where human intelligence should be leveraged or in what manner. In this chapter we propose to describe various tasks of the domain understanding phase which require human intelligence for their appropriate execution.
Conference Paper
Full-text available
Thousands of news stories are reported each day, so how to extract useful information from the large volume of web news is an important technology today. Advances in information technology have partially automated the processing of documents, reducing the amount of text which must be read. In this paper we present a Web News Search System, called WNSS. WNSS can automatically discover and extract phrases from large corpora of web news stories. In addition, we give concrete examples of how to preprocess texts based on the intended use of the discovered results. We also evaluate whether the extracted phrases can be used for important tasks. Keywords: web news, information technology, phrase extraction, pre-processing texts
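A generic frequency-based phrase (bigram) extraction sketch is shown below for illustration; it is not the WNSS algorithm, and the example documents are invented.

```python
# Rough sketch of frequency-based bigram extraction from news text; a generic
# technique, not the WNSS system's actual phrase-extraction method.
import re
from collections import Counter

documents = [
    "The central bank raised interest rates again this quarter.",
    "Analysts expect the central bank to hold interest rates steady.",
    "Interest rates influence mortgage costs across the country.",
]

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

bigrams = Counter()
for doc in documents:
    tokens = tokenize(doc)
    bigrams.update(zip(tokens, tokens[1:]))

# Keep the bigrams that recur across the corpus as candidate phrases.
for phrase, count in bigrams.most_common(5):
    if count > 1:
        print(" ".join(phrase), count)
```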
Article
The maintenance and service records collected and maintained by engineering companies are a useful resource for the ongoing support of products. Such records are typically semi-structured and contain key information such as a description of the issue and the product affected. It is suggested that further value can be realised from the collection of these records for indicating recurrent and systemic issues which may not have been apparent previously. This paper presents a faceted classification approach to organise the information collection that might enhance retrieval and also facilitate learning from in-service experiences. The faceted classification may help to expedite responses to urgent in-service issues as well as to allow for patterns and trends in the records to be analysed, either automatically using suitable data mining algorithms or by manually browsing the classification tree. The paper describes the application of the approach to aerospace in-service records, where the potential for knowledge discovery is demonstrated.
Article
Various data mining methodologies have been proposed in the literature to provide guidance towards the process of implementing data mining projects. The methodologies describe a data mining project as comprised of a sequence of phases and highlight the particular tasks and their corresponding activities to be performed during each of the phases. It seems that the large number of tasks and activities, often presented in a checklist manner, are cumbersome to implement and may explain why all the recommended tasks are not always formally implemented. Additionally, there is often little guidance provided towards how to implement a particular task. These issues seem to be especially dominant in case of the business understanding phase which is the foundational phase of any data mining project. In this paper, we present an organizationally grounded framework to formally implement the business understanding phase of data mining projects. The framework serves to highlight the dependencies between the various tasks of this phase and proposes how and when each task can be implemented. An illustrative example of a credit scoring application from the financial sector is used to exemplify the tasks discussed in the proposed framework.
Conference Paper
Online mining of changes from data streams is an important problem in view of growing number of applications such as network flow analysis, e-business, stock market analysis etc. Monitoring of these changes is a challenging task because of the high speed, high volume, only-one-look characteristics of the data streams. User subjectivity in monitoring and modeling of the changes adds to the complexity of the problem. This paper addresses the problem of i) capturing user subjectivity and ii) change modeling, in applications that monitor frequency behavior of item-sets. We propose a three stage strategy for focusing on item-sets, which are of current interest to the user and introduce metrics that model changes in their frequency (support) behavior.
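The sketch below illustrates, with invented transactions and thresholds, how the support of a focused item-set can be monitored over successive portions of a transaction stream; it is not the paper's three-stage strategy or its metrics.

```python
# Toy sketch of monitoring the support (frequency) of a focused item-set over a
# bounded window of a transaction stream; data and thresholds are illustrative.
from collections import deque

focused_itemset = {"milk", "bread"}
window = deque(maxlen=100)          # keep only a bounded window of recent transactions
previous_support = None

def support(transactions, itemset):
    if not transactions:
        return 0.0
    return sum(itemset <= t for t in transactions) / len(transactions)

stream = [{"milk", "bread"}, {"milk"}, {"bread", "eggs"}, {"milk", "bread", "eggs"},
          {"eggs"}, {"milk", "bread"}] * 40

for i, transaction in enumerate(stream, 1):
    window.append(frozenset(transaction))
    if i % 50 == 0:                 # evaluate once per batch of arrivals
        current = support(window, focused_itemset)
        # Flag the item-set when its support drifts by more than a chosen threshold.
        if previous_support is not None and abs(current - previous_support) > 0.1:
            print(f"support change at transaction {i}: {previous_support:.2f} -> {current:.2f}")
        previous_support = current
```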