Article (PDF available)

Hybrid clustering analysis using improved krill herd algorithm

Abstract and Figures

In this paper, a novel text clustering method, an improved krill herd algorithm with a hybrid function called MMKHA, is proposed as an efficient way to obtain promising and precise clustering results in this domain. Krill herd is a recent swarm-based optimization algorithm that imitates the behavior of a group of live krill. The potential of this algorithm is high because it performs better than other optimization methods; it balances exploration and exploitation by complementing the strength of local nearby searching with global wide-range searching. Text clustering is the process of grouping significant amounts of text documents into coherent clusters in which documents in the same cluster are relevant to each other. For the experiments, six versions are thoroughly investigated to determine the best version for solving the text clustering problem. Eight benchmark text datasets, available at the Laboratory of Computational Intelligence (LABIC), are used for the evaluation. Seven evaluation measures are utilized to validate the proposed algorithms, namely, ASDC, accuracy, precision, recall, F-measure, purity, and entropy. The proposed algorithms are compared with other successful algorithms published in the literature. The results show that the proposed improved krill herd algorithm with a hybrid function achieved the best results on almost all datasets in comparison with the other algorithms.
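The purity and entropy measures named in the abstract can be sketched as follows (a minimal Python illustration of the standard definitions; the helper names `purity` and `cluster_entropy` are mine, not the paper's):

```python
import math
from collections import Counter

def purity(true_labels, cluster_labels):
    """Purity: fraction of documents that belong to the majority class of their cluster."""
    clusters = {}
    for t, c in zip(true_labels, cluster_labels):
        clusters.setdefault(c, []).append(t)
    correct = sum(Counter(members).most_common(1)[0][1] for members in clusters.values())
    return correct / len(true_labels)

def cluster_entropy(true_labels, cluster_labels):
    """Weighted average of the per-cluster class entropy (lower is better)."""
    clusters = {}
    for t, c in zip(true_labels, cluster_labels):
        clusters.setdefault(c, []).append(t)
    n = len(true_labels)
    total = 0.0
    for members in clusters.values():
        counts = Counter(members)
        # Shannon entropy of the class distribution inside this cluster
        e = -sum((k / len(members)) * math.log2(k / len(members)) for k in counts.values())
        total += len(members) / n * e
    return total
```

A perfect clustering yields purity 1.0 and entropy 0.0.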
... Further evaluation is performed using different metrics such as F-measure (F), Precision (P), and Recall (R). F, P, and R have been frequently employed in the literature for comparing different FS methods in the context of text clustering [73], [74]. The same criteria have also been adopted to examine the reliability of various detection methods [21], [75]. ...
Article
Full-text available
Several Event Detection (ED) applications utilize various wrapper Feature Selection (FS) techniques based on wrapping the Markov Clustering Algorithm (MCL) with the Binary Bat Algorithm (BBA) or Adaptive Bat Algorithm (ABBA). These approaches have shown promising results in identifying relevant feature subsets for MCL, leading to more precise event clusters from heterogeneous news articles. However, such wrapped FS methods couple two methods (FS and ED) within the ED model, with their performance influencing each other. While ABBA improved upon BBA's limitations, MCL's rapid convergence can hinder detection effectiveness: fast convergence can lead to local optima and the detection of meaningless clusters. Additionally, MCL's identification ability diminishes as the feature space grows. To address these issues, this paper develops two novel adaptive techniques to control MCL's inflation ( inf ) and pruning ( p ) parameters, thereby managing its convergence behavior. Consequently, a new variant called Adaptive MCL (AMCL) is introduced and combined with ABBA. The effectiveness of the ABBA-AMCL method is evaluated using 10 benchmark datasets and two substantial Facebook news datasets. Various performance measures are employed to compare ABBA-AMCL against established methods. The empirical results demonstrate that ABBA-AMCL excels at extracting high-quality, real-world event clusters from various news text sources.
... Operations research is a field that employs mathematical formulation to solve complex engineering and management problems and gain insight into potential solutions [2]. Mathematical approaches used in this field can be classified into classical techniques, like the simplex method and dynamic programming, and heuristic and metaheuristic methods, like the Emperor Penguin Optimizer (EPO) algorithm [3]. ...
... The CE hypothesis looks at correlations among many factors that adhere to Copula function theory. A discrete form of the Shannon entropy formula is defined by Eq. (8) (Pan, et al., 2023), from which a continuous form can be obtained by Eq. (9) (Abualigah, et al., 2018). ...
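The discrete Shannon entropy referred to as Eq. (8) can be sketched in a few lines (the function name is mine; the continuous form of Eq. (9) replaces the sum with an integral over a probability density):

```python
import math

def shannon_entropy(probs):
    """Discrete Shannon entropy H = -sum(p_i * log(p_i)), skipping zero-probability outcomes."""
    return -sum(p * math.log(p) for p in probs if p > 0)
```

For a uniform distribution over n outcomes this returns log(n), the maximum possible value.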
... These papers were carefully selected based on their relevance to the topic, source of publication and originality of findings. During the evaluation process, 653 articles were excluded after reading the titles, keywords, and abstracts, mainly because of the following two reasons: 1) the articles were not in the context of information research; 2) empirical studies were not within the scope of understanding serendipity (Abualigah et al., 2018a;Abualigah et al., 2018b;Agarwal, 2015;Björneborn, 2017;Foster & Ellis, 2014). ...
Article
Background: The concept of serendipity has become increasingly interesting for those undertaking serendipity research in recent years. However, serendipitous encounters are subjective and rare in a real-world context, making this an extremely challenging subject to study. Methods: Various methods have been proposed to enable researchers to understand and measure serendipity, but there is no broad consensus on which methods to use in different experimental settings. A comprehensive literature review was first conducted, which summarizes the research methods being employed to study serendipity. It was followed by a series of interviews with experts that specified the relative strengths and weaknesses of each method identified in the literature review, in addition to the challenges usually confronted in serendipity research. Results: The findings suggest using mixed research methods to produce a more complete picture of serendipity and contribute to the verification of any research findings. Several challenges and implications relating to empirical studies in the investigation of serendipity have been derived from this study. Conclusions: This paper investigated research methods employed to study serendipity by synthesizing findings from a literature review and interviews with experts. It provides a methodological contribution to serendipity studies by systematically summarizing the methods employed in studies of serendipity and identifying the strengths and weaknesses of each. It also suggests the novel approach of using mixed research methods to study serendipity. This study has potential limitations related to the small number of experts involved in the interviews. However, it should be noted that the topic is a relatively focused area, and it was observed during the interviews that new data no longer contributed to the findings owing to the repetition of comments.
Article
Full-text available
The government may be able to develop more effective strategies for dealing with COVID-19 cases if it groups districts and cities according to the features of the number of Covid-19 cases being reported in each district or city. The data can be more easily summarized with the help of cluster analysis, which organizes items into groups according to the degree of similarity between members. Since it is possible to group more than one period together, the generation of clusters based on time series is a more efficient method than clusters that are created for each individual unit. Using a time series cluster hierarchical technique that has complete linkage, the purpose of this study is to categorize the number of instances of Covid-19 that have been found in West Java by district or city. The data that was used comes from monthly reports of Covid-19 instances compiled by West Java districts from 2020 to 2022. The Autocorrelation Function (ACF) distance cluster was utilized in this investigation to determine how closely cluster members are related to one another. According to the findings, there could be as many as seven separate clusters, each including a unique assortment of districts and cities. Cluster 3, which is comprised of three different cities and regencies, including Bandung City, West Bandung Regency, and Sumedang Regency, has an average number of cases that is 66, making it the cluster with the highest number of cases overall. A value of 0.2787590 is obtained for the silhouette coefficient as a result of the established grouping. This value suggests that the structure of the newly created cluster is quite fragile.
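The ACF-based distance underlying this kind of time-series clustering can be sketched as follows (a pure-Python illustration assuming a Euclidean distance between autocorrelation vectors; complete-linkage clustering would then be run on the pairwise distance matrix, e.g. with `scipy.cluster.hierarchy.linkage`):

```python
def acf(x, nlags):
    """Sample autocorrelations of series x at lags 1..nlags."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    return [sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k)) / denom
            for k in range(1, nlags + 1)]

def acf_distance(x, y, nlags=3):
    """Euclidean distance between the ACF vectors of two series."""
    ax, ay = acf(x, nlags), acf(y, nlags)
    return sum((a - b) ** 2 for a, b in zip(ax, ay)) ** 0.5
```

Two series that differ only by a constant level shift have identical autocorrelations, so their ACF distance is (numerically) zero, which is what makes this distance suitable for grouping regions by case *dynamics* rather than case counts.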
Article
Full-text available
This paper focuses on connectivity-based data clustering for categorizing similar and dissimilar data into distinct groups. Although classical clustering algorithms such as K-means are efficient techniques, they often become trapped in local optima and converge slowly on high-dimensional problems. To address these issues, many successful meta-heuristic optimization algorithms and intelligence-based methods have been introduced to attain the optimal solution in a reasonable time. In this study, we attempt to conceptualize a powerful approach using three main components: the Chimp Optimization Algorithm (ChOA), the Generalized Normal Distribution Algorithm (GNDA), and the Opposition-Based Learning (OBL) method. First, two versions of ChOA with two different dynamic coefficients and seven chaotic maps, entitled ChOA(I) and ChOA(II), are presented to achieve the best possible result for data clustering purposes. Second, a novel combination of the ChOA and GNDA algorithms with the OBL strategy is devised to solve the major shortcomings of the original algorithms. Lastly, the proposed ChOAGNDA method is a Selective Opposition (SO) algorithm based on ChOA and GNDA, which can be used to tackle large and complex real-world optimization problems, particularly data clustering applications. In this study, eight benchmark datasets, including five datasets from the UCI machine learning repository and three challenging shape datasets, are used to investigate the general performance of the proposed method. The results are evaluated against several popular and recent state-of-the-art clustering techniques. Experimental results illustrate that the proposed work significantly outperforms other existing methods in terms of the Sum of Intra-Cluster Distances (SICD), Error Rate (ER), and convergence rate.
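The Opposition-Based Learning component can be sketched in a few lines (a minimal illustration of the generic OBL idea, not the paper's exact ChOAGNDA update; the helper names are mine):

```python
def opposite(position, lower, upper):
    """OBL: the opposite of x in [l, u] along each dimension is l + u - x."""
    return [l + u - x for x, l, u in zip(position, lower, upper)]

def obl_select(position, lower, upper, fitness):
    """Keep whichever of a candidate and its opposite has the better (lower) fitness."""
    opp = opposite(position, lower, upper)
    return position if fitness(position) <= fitness(opp) else opp
```

Evaluating both a candidate and its opposite doubles the chance of starting near a good region, which is why OBL is commonly grafted onto metaheuristics as an initialization or jumping step.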
Article
Full-text available
In the coastal areas of China, scientists have collected nearly 500 species of coastal plants and seaweeds. The collected information includes species descriptions, morphological characteristics, habitat distribution, and the resource value of plants in China. By effectively extracting Chinese text information, this article establishes a Chinese text information extraction model based on DL. The article classifies short texts using long short-term memory (LSTM) artificial neural networks and additionally integrates the MFCNN-based L-MFCNN model for short text classification. Comparing the two methods with traditional text recognition algorithms and with information extraction based on syntax analysis, the results show that the recognition accuracy of this neural network model on Chinese text can reach 96.69%. Through model training and parameter adjustment, Chinese text information on coastal biodiversity can be quickly extracted, and species categories or names can be identified.
Article
Full-text available
In this paper, the bat-inspired algorithm (BA) is applied to gene selection for cancer classification using microarray datasets. Microarray data consists of irrelevant, redundant, and noisy genes. The gene selection problem is tackled by determining the most informative genes from microarray data to accurately diagnose cancer, and it is widely solved with optimisation algorithms. BA is a recent swarm-based algorithm that imitates the echolocation behaviour of bats and has been successfully applied to several optimisation problems. Gene selection is tackled by combining two stages: a filter stage, which uses the Minimum Redundancy Maximum Relevancy (MRMR) method, and a wrapper stage, which uses BA and SVM. To test the accuracy of the proposed method, ten microarray datasets were used. For comparative evaluation, the proposed method was compared with popular gene selection methods. The proposed method achieves comparable results on some datasets and produces new results for one dataset.
Chapter
Full-text available
Text clustering is an efficient analysis technique used in the text mining domain to arrange a huge number of unorganized text documents into a subset of coherent clusters, where similar documents fall in the same cluster. In this paper, we propose a novel term weighting scheme, namely, length feature weight (LFW), to improve text document clustering algorithms based on new factors. The proposed scheme assigns a favorable term weight according to the information obtained from the document collection. It recognizes the terms that are particular to each cluster and enhances their weights, based on the proposed factors, at the document level. The β-hill climbing technique is used to validate the proposed scheme in text clustering. The proposed weighting scheme is compared with the existing weighting scheme (TF-IDF) to validate its results in that domain. Experiments are conducted on eight standard benchmark text datasets taken from the Laboratory of Computational Intelligence (LABIC). The results show that the proposed weighting scheme LFW outperforms the existing weighting scheme and enhances the results of the text document clustering technique in terms of F-measure, precision, and recall.
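The TF-IDF baseline that LFW is compared against can be sketched as follows (a standard TF-IDF formulation for illustration, not the proposed LFW scheme itself):

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document,
    with weight = (term frequency / doc length) * log(N / document frequency)."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))  # document frequency per term
    out = []
    for doc in docs:
        tf = Counter(doc)
        out.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return out
```

Note that a term appearing in every document gets weight zero, which is exactly the behaviour cluster-specific schemes like LFW aim to improve on.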
Conference Paper
Full-text available
Rapid technological change and the need to respond to massive growth in big data create another challenge for governments: making it easier to deal with huge amounts of data and to implement multi-channel platforms for digital transformation effectively. The need for technologies such as social media, e-participation tools, and new open-data models that generate big data adds to these challenges, as does the slow adoption by the public sector and citizens of these new concepts of openness and effective interaction through electronic technology. To improve e-participation processes, government innovation, and citizen satisfaction, governments need to enhance collaboration and engagement. They also need to improve the value delivered inside and outside government sectors and satisfy citizens' demands for better services by collecting data from citizens' activities. When e-government utilizes big data technologies, which offer an effective way to provide interactive services, e-government becomes more than just "big" and more than just "data". In this paper, the authors review big data issues as applied to e-government, as well as the challenges facing these agencies, and propose possible solutions for the challenges of implementing big data in e-government. Most recently published papers clearly show that the challenges are difficult and that big data is growing exponentially.
Conference Paper
Full-text available
Regarding the increasing volume of document information (text) on Internet pages, recent applications, and so on, dealing with this knowledge has become incredibly complex because of its size. Text clustering is a proper technique for organizing a tremendous amount of text information by partitioning it into a subset of clusters. In this paper, we present a novel local search method, namely, the β-hill climbing technique, to solve the text document clustering problem. The primary innovation of the β-hill climbing technique is the β operator, introduced to strike a balance between local and global search. Local search (exploitation) methods, such as k-means, have been successfully applied to the text document clustering problem. Experiments were conducted on five benchmark text datasets with varying characteristics, taken randomly from the "Dmoz-Business" dataset. The results show that the proposed β-hill climbing, with its β parameter tuned, achieved better results than the original hill climbing technique as measured by F-measure, precision, recall, and accuracy.
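A minimal sketch of β-hill climbing on a continuous objective (the parameter names and defaults are illustrative; the paper applies the technique to clustering objectives rather than this toy sphere function):

```python
import random

def beta_hill_climbing(fitness, x, lower, upper, beta=0.05, bw=0.1, iters=2000, seed=0):
    """β-hill climbing sketch: a small neighbour move per iteration, plus a β-probability
    random reset per dimension (the β operator supplies the global-search component)."""
    rng = random.Random(seed)
    best = list(x)
    for _ in range(iters):
        cand = [v + rng.uniform(-bw, bw) for v in best]           # neighbour (local) step
        cand = [rng.uniform(l, u) if rng.random() < beta else v   # β (random) operator
                for v, l, u in zip(cand, lower, upper)]
        cand = [min(max(v, l), u) for v, l, u in zip(cand, lower, upper)]
        if fitness(cand) < fitness(best):                         # greedy acceptance
            best = cand
    return best
```

Setting beta=0 recovers plain stochastic hill climbing; the β resets are what let the search escape local optima.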
Article
Full-text available
Microarray technology facilitates biologists in monitoring the activity of thousands of genes (features) in one experiment. This technology generates gene expression data, which is significantly applicable to cancer classification. However, gene expression data are considered high-dimensional data consisting of irrelevant, redundant, and noisy genes that are unnecessary from the classification point of view. Recently, researchers have tried to identify the most informative genes that contribute to cancer classification using computational intelligence algorithms. In this paper, we propose a filter method (Minimum Redundancy Maximum Relevancy, MRMR) and a wrapper method (bat algorithm, BA) for gene selection in microarray datasets. MRMR was used to find the most important genes among all genes in the gene expression data, and BA was employed to find the most informative gene subset, from the reduced set generated by MRMR, that can contribute to identifying the cancers. The wrapper method, using a support vector machine (SVM) with 10-fold cross-validation, served as the evaluator for BA. To test the accuracy of the proposed method, extensive experiments were conducted on three microarray datasets: Colon, Breast, and Ovarian. The same procedure was applied to the genetic algorithm (GA) for comparison with our proposed method (MRMR-BA). The results show that our proposed method is able to find the smallest gene subset with the highest classification accuracy.
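The greedy MRMR filter stage can be sketched as follows (a minimal illustration assuming precomputed relevance scores per gene and pairwise redundancy scores, e.g. mutual information; the function name is mine):

```python
def mrmr_rank(relevance, redundancy, k):
    """Greedy MRMR: relevance[i] scores gene i against the class label,
    redundancy[i][j] scores genes i and j against each other. At each step,
    pick the gene maximizing relevance minus mean redundancy with the chosen set."""
    selected = [max(range(len(relevance)), key=lambda i: relevance[i])]
    while len(selected) < k:
        remaining = [i for i in range(len(relevance)) if i not in selected]
        selected.append(max(remaining,
                            key=lambda i: relevance[i] -
                            sum(redundancy[i][j] for j in selected) / len(selected)))
    return selected
```

A highly relevant gene that duplicates an already-selected gene is skipped in favour of a less relevant but complementary one, which is the point of the redundancy penalty.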
Article
Feature selection is a significant task in data mining and machine learning applications; it eliminates irrelevant and redundant features and improves learning performance. This paper proposes a new feature selection method for unsupervised text clustering named link-based particle swarm optimization (LBPSO). This method introduces a new neighbour selection strategy into BPSO to select prominent features. The performance of traditional particle swarm optimization (PSO) for feature selection is limited by its global-best updating mechanism. Instead of using the global best, LBPSO particles are updated based on a neighbour's best position to enhance exploitation and exploration capability. The prominent features are then tested using the k-means clustering algorithm to improve performance and reduce the computational time of the proposed algorithm. The performance of LBPSO is investigated on three published datasets, namely Reuter 21578, TDT2, and Tr11. Our results, based on several evaluation measures, show that the performance of LBPSO is superior to other PSO-based algorithms.
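The neighbour-best idea behind LBPSO can be sketched for one particle as follows (a standard sigmoid-transfer binary-PSO step with the social term pointed at a neighbour's best position instead of the global best; not the paper's exact equations, and the parameter defaults are illustrative):

```python
import math
import random

def bpso_step(position, velocity, personal_best, neighbour_best,
              w=0.7, c1=1.4, c2=1.4, rng=None):
    """One binary-PSO update for a single particle. Each bit flips to 1 with
    probability sigmoid(velocity); the social pull targets a neighbour's best."""
    rng = rng or random.Random(0)
    new_pos, new_vel = [], []
    for x, v, p, n in zip(position, velocity, personal_best, neighbour_best):
        v = w * v + c1 * rng.random() * (p - x) + c2 * rng.random() * (n - x)
        s = 1.0 / (1.0 + math.exp(-v))           # sigmoid transfer function
        new_pos.append(1 if rng.random() < s else 0)
        new_vel.append(v)
    return new_pos, new_vel
```

Each bit of the position marks a feature as selected (1) or dropped (0); the resulting subset would then be scored by running k-means on the selected features.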
Article
The large amount of text information on the Internet and in modern applications makes dealing with this volume of information complicated. The text clustering technique is an appropriate tool for dealing with an enormous number of text documents by grouping them into coherent clusters. Large document collections, however, decrease the effectiveness of the text clustering technique, since text documents contain sparse and uninformative features (i.e., noisy, irrelevant, and unnecessary features). The feature selection technique is a primary unsupervised learning method employed to select informative text features and create a new subset of a document's features, thereby increasing the effectiveness of the underlying clustering algorithm. Recently, several complex optimization problems have been solved successfully using metaheuristic algorithms. This paper proposes a novel feature selection method using the particle swarm optimization (PSO) algorithm (FSPSOTC) to solve the feature selection problem by creating a new subset of informative text features. This new subset of features can improve the performance of the text clustering technique and reduce the computational time. Experiments were conducted using six standard text datasets with several characteristics, commonly used in the text clustering domain. The results revealed that the proposed method (FSPSOTC) enhanced the effectiveness of the text clustering technique by dealing with a new subset of informative features. The proposed method is compared with other well-known algorithms, i.e., the feature selection method using a genetic algorithm (FSGATC) and the feature selection method using the harmony search algorithm (FSHSTC) for text feature selection.