ArticlePDF Available

Performance Analysis on Clustering Approaches for Gene Expression Data

February 2016

February 2016
5(2):196-200

DOI:10.17148/IJARCCE.2016.5242

Authors:

Asir Antony Gnana Singh Danasingh

Anna University , Tiruchirappalli

JEBAMALAR LEAVLINE EPIPHANY

Anna University,Tiruchirappalli , India

Clustering is a way of finding the structures from a collection of unlabeled gene expression data. A number of algorithms are developed to tackle the problem of clustering the gene expression data. It is important for solving the problems that originate due to unsupervised learning. This paper presents a performance analysis on various clustering algorithm namely K-means, expectation maximization, and density based clustering in order to identify the best clustering algorithm for microarray data. Sum of squared error, log likelihood measures are used to evaluate the performance of these clustering methods.

SUM OF SQUARED ERRORS FOR THE K-MEANS AND DENSITY BASED CLUSTERING

…

Figures - uploaded by Asir Antony Gnana Singh Danasingh

Content may be subject to copyright.

Content uploaded by Asir Antony Gnana Singh Danasingh

Content may be subject to copyright.

IJARCCE

ISSN (Online) 2278-1021

ISSN (Print) 2319 5940

International Journal of Advanced Research in Computer and Communication Engineering

Vol. 5, Issue 2, February 2016

Performance Analysis on Clustering Approaches

for Gene Expression Data

D. Asir Antony Gnana Singh1, A. Escalin Fernando2, E. Jebamalar Leavline3

Department of CSE, Anna University, BIT Campus, Tiruchirappalli, India1, 2

Department of ECE, Anna University, BIT Campus, Tiruchirappalli, India3

Abstract: Clustering is a way of finding the structures from a collection of unlabeled gene expression data. A number

of algorithms are developed to tackle the problem of clustering the gene expression data. It is important for solving the

problems that originate due to unsupervised learning. This paper presents a performance analysis on various clustering

algorithm namely K-means, expectation maximization, and density based clustering in order to identify the best

clustering algorithm for microarray data. Sum of squared error, log likelihood measures are used to evaluate the

performance of these clustering methods.

Keywords: Clustering analysis on microarray data, comparison of clustering algorithms, clustering analysis on gene

expression data, literature review on clustering methods, survey on clustering techniques.

I. INTRODUCTION

Clustering is a process of organizing objects into groups

whose members are similar in any of the ways. Therefore,

a cluster contains similar objects and dissimilar objects are

present in different clusters. The ultimate aim of the

clustering algorithm is to form the perfect clusters that

means grouping objects based on their similarity. There

are many similarity measures such as density; distance,

etc. have been used. In general, the clustering algorithms

learn the unlabeled data therefore it is also called

unsupervised learning algorithm. The unsupervised

learning algorithm learns the unlabeled data and develops

the clustering model. Then, the developed clustering

model can be employed to predict the group or cluster of

the ungrouped or un-clustered data. The clustering

algorithms can be used for various applications such as

gene expression data analysis, outlier detection, features

selection [1-3], etc. The clustering algorithms can be

classified into various types based on the fashion with

which the objects are clustered namely model based

clustering, density based clustering, connectivity based

clustering, centroid based clustering, etc. In this paper, the

performance of the density based clustering, expectation

maximization (EM) clustering, and K-means clustering are

analysed in terms of the sum of squared error (SSE), and

log likelihood on various gene expression data.

The rest of the paper is organised as follows. In Section II

the clustering techniques are discussed. In Section III

experimental setup and the experimental procedures are

explained. The results are discussed in Section IV. Finally,

Section V concludes this paper.

II. CLUSTERING TECHNIQUES

This section presents the various clustering techniques and

their merits and demerits.

A. Model Based Clustering

Cobweb is one of the model based clustering method. It is

basically an incremental system intended for hierarchical

conceptual clustering. It can set the acuity value which is

the minimum standard deviation of the numerical data.

The categorical utility threshold value is set by using cut-

off to prune the data. Arifovic et al [4] explained that the

genetic algorithm would have a better convergence for

wider range of parameters. Hommes et al [5] suggested

that cobweb model can show consistent rational behaviour

for non-linear dynamic models. The similarity measure

used in the cobweb is the distance measure. Alejos et al

[6] presented a technique to calculate the magnetization of

the simulated system with improved accuracy by means of

the Preisach model. Zhechong Zhao et al [7] presented a

cobweb plot which is used to illustrate graphically the

iterative procedure and to analyse stability.

It investigates the quantitative behaviour of one

dimensional iterated function using a fixed point known as

invariant point. Yuni Xia et al [8] proposed a conceptual

clustering algorithm which can explicitly handle the

uncertainties in the values of the dataset. Total utility (TU)

index is introduced to measure the quality of the

clustering. This finally increases the internal probabilistic

information of the clustering performance. The advantage

of cobweb is that it allows a bidirectional search. It allows

merging and splitting of classes using the category utility.

The major disadvantage is that it is purely based on the

assumption of probability distributions were updating and

storing of clusters is quite expensive.

Expectation maximization (EM) clustering is another type

under model based clustering, which is an iterative method

and is capable of finding the maximum likelihood in

statistical methods. The expectation of the likelihood is

computed by performing the alternate iteration between

the performances of the expectation step. Tood et al [9]

stated that this algorithm is suitable for outcomes that are

clumped together. Brankov et al [10] proposed the

normalized cross correlation that has better performance

than the traditionally used Euclidean distance which is

IJARCCE

ISSN (Online) 2278-1021

ISSN (Print) 2319 5940

International Journal of Advanced Research in Computer and Communication Engineering

Vol. 5, Issue 2, February 2016

used as the similarity measure for the expectation

maximization. The EM algorithm is mainly suitable for the

analysis of the image data. Lagendijk et al [11] applied the

maximum likelihood approach to identify and restore

noisy data in the blurred images. This EM method can

facilitate maximizing likelihood functions that arise in

statistically estimating problems. Figueiredo et al [12]

presented the algorithm for the restoration of the images

using the penalized likelihood. Fessler et al [13] presented

a new update of sequentially alternating the parameters

between the several small hidden data spaces which are

defined by the algorithm designer. EM is most suitable for

the real world dataset and is best suited for performing

cluster analysis for a small scene and when not satisfactory

with the results of simple K-means algorithm. The

drawback of EM algorithm is its inherent complexity.

B. Farthest First Clustering

The farthest first cluster places the centre of each cluster at

a point farthest from the existing cluster centres. It is a

variant of simple K-means. Manoj et al [14] suggested that

the farthest first algorithm is suitable for the large dataset

and the clusters produced are non-uniform. So they

developed an optimized farthest first clustering algorithm

to produce uniform clusters. Chung-Ming et al [15]

proposed a farthest first forwarding algorithm to reduce

the transmission delay in the vehicular adhoc networks

(VANETs). H. K. Yogish et al [16] proposed a strategy of

farthest first traversal for finding the frequent traversal

path in the navigation and reorganization of the website

structure. This clustering algorithm can eventually speed

up the clustering since there are only few adjustments in

the data. The constraint based methods and distance-

function learning methods according to Bilenko et al [17]

are the similarity metric used in the algorithm. The major

advantage is that it is a heuristic based method that is fast,

scalable and appropriate for large datasets. But it is

difficult to compare the quality of the cluster produced. It

does not hold good for non-globular clusters and is very

sensitive to outliers.

C. Filter Based Clustering

This type of clustering method is for filtering the

information or any pattern which are essentially needed.

The filtration is carried out based on the keywords that are

supplied or some relevant information. Jiang-She Zhang et

al [18] proposed a clustering algorithm for the processing

of the images. They are computationally stable and

insensitive to initialization. They also produce consistent

clusters. Thomas et al [19] proposed a collaborative

filtering which is a combination of the correlation and

singular value decomposition (SVD) to improve accuracy.

A weighted co-clustering algorithm is designed in

incremental and parallel versions and the results are

empirically evaluated. Lagendijk et al [20] proposed two

different methods to estimate the performances of

individual classifiers and then combine them based on the

weight of the individual classifiers. The advantage is that

it compares the new arriving keywords with the existing

profile and information is provided to the user. It also

checks the information instantly rather than waiting for the

other information from the user. The major drawback is

that the user cannot get the information about the filtering

algorithm that is being used. It also depends on the

feedback of the retrieved information.

D. Connectivity Based Clustering

Hierarchical clustering is a type of connectivity based

clustering and it is the way of relating the objects having

core idea of the objects closer than the objects that are

farther. The main categories of the hierarchical clustering

are “Agglomerative” and “Divisive” methods. Edward J.

Coyle et al [21] proposed a randomized algorithm which is

mainly applicable for the sensors for the generation of the

cluster heads in a hierarchical manner. Michael Dittenbach

et al [22] presented a growing hierarchical self-organizing

map that evolves on the input data during the unsupervised

training process. Guangyu et al [23] developed a

comparative analysis and suggested that hierarchical

clustering is better when compared with the conventional

clustering. It produces an extensive hierarchy of clusters

that merge with other ideas that are present at a certain

distance. The disadvantage of using hierarchical cluster is

that it cannot provide single partitioning of the dataset.

E. Density Based Clustering

The density based clustering (DBC) groups the objects

mainly based on the density of the objects that are

reachable and connective. Li Tu et al [24] proposed a

framework called D-stream for clustering using the density

based approach. Mitra et al [25] suggested a

nonparametric data reduction scheme. The procedure

followed here is separating the dense area objects from

less dense area with the aid of an arbitrary object. The

density based clusters (DBC) are robust to noise but the

datasets are problematic and requires high densely

connected data.

F. Centroid Based clustering

K-Means is a centroid based clustering method. It

partitions the dataset into various clusters based on the

mean distance. It is one of the simplest forms of

unsupervised algorithm. The main objective of this

algorithm is to reduce the squared error. Tapas et al [26]

identified that the algorithm works faster as the separation

between the cluster increases. This algorithm is applicable

for the segmentation of images and data compression.

Kanungo et al [27] proposed that the K-means algorithm

runs faster as the separation between the cluster increases.

Jakob J. Verbeek et al [28] suggested a solution to reduce

the computational load without affecting the quality of the

solution significantly. The algorithm is robust, fast and

easy to understand. It also yields better results when the

dataset are well separated or distinct from each other. It

does not work efficiently for non-linear and categorical

data. Further, it is unable to handle outliers and noisy data

if the cluster centres are randomly chosen.

III. EXPERIMENTAL SETUP

The performances of the clustering methods are compared

by considering the gene expression datasets such as

SRBCT, Lymphoma and three different Leukaemia

IJARCCE

ISSN (Online) 2278-1021

ISSN (Print) 2319 5940

International Journal of Advanced Research in Computer and Communication Engineering

Vol. 5, Issue 2, February 2016

datasets namely Leukaemia, Leukaemia3C, and

Leukaemia-1. The medical datasets that were considered

are SRBCT dataset with 2309 attributes and 83 instances.

The Lymphoma dataset consists of 4027 attributes and 66

instances. Three different datasets of Leukaemia with

7130 attributes and 72 instances each have also been used.

The performance is evaluated by performing the operation

of clustering in each of the datasets. The numbers of the

clusters are varied from two to ten and the resulting sum of

squared error (SSE) and the log likelihood (LL) are noted

for each of the methods. The comparison is carried out

among the clustering methods namely, K-means, density

based clustering (DBC) and expectation maximization

(EM) clustering. The experiment is carried out to obtain

the better clustering method for the gene expression data.

The experiments were performed using the WEKA data

mining tool. It is developed with the Java programming

language and it contains the GUI that is capable of

interacting with the various data files and even produces

visual results. WEKA tool provides various other options

on pre-processing, classification, clustering, association,

selection of attributes, and visualization.

2.1 Experimental Procedure

The experiment is carried out using the experimental

procedure with the following steps:

Step 1: Read the dataset.

Step 2: Set the number of clusters to be formed for

clustering the instances of the dataset.

Step 3: The sum of squared error is noted for the K-means

and density based clustering method.

Step 4: The log likelihood is noted for the density based

clustering and expectation maximization clustering

method.

Initially, the data set is read. Then, the number of clusters

to formed is set (from 2 to 10) for clustering the instances

of the dataset. Then, sum of squared error is noted for the

K- means and density based clustering method and the log

likelihood is noted for the density based clustering and

expectation maximization clustering method.

IV. RESULTS AND DISCUSSION

This section illustrates the results obtained from the

conducted experiments. Table I shows the sum of squared

errors for the K-means and density based clustering, Table

II shows the log likelihood values of the density based

clustering, Table III shows the log likelihood values of the

expectation maximization clustering, Figure 1 depicts the

sum of squared error for the K-means and density based

clustering for five different datasets.

Figure 2 illustrates the log likelihood of the density based

clustering and expectation maximization clustering for the

SRBCT dataset. Figure 3 depicts log likelihood of the

density based clustering and expectation maximization

clustering for the Lymphoma dataset. Figure 4 illustrates

the log likelihood of the density based clustering and

expectation maximization clustering for the Leukaemia

dataset. Figure 5 depicts log likelihood of the density

based clustering and expectation maximization clustering

for the Leukaemia 3C dataset. Figure 6 depicts the log

likelihood of the density based clustering and expectation

maximization clustering for the Leukaemia-1 dataset.

TABLE 1 SUM OF SQUARED ERRORS FOR THE K-MEANS AND DENSITY BASED CLUSTERING

Datasets

Number of clusters

SRBCT

07131.72

06743.21

06440.08

06123.40

05719.74

05467.57

05062.60

05022.87

04817.73

Lymphoma

08970.75

08546.91

07960.22

07684.71

07516.37

07139.95

06880.25

06791.88

06669.48

Leukaemia

16376.73

15409.02

15137.53

14803.38

14006.20

13693.59

13411.96

12966.56

12632.31

Leukaemia3C

16373.73

15407.02

15236.14

14900.66

14028.98

13765.59

13483.96

13011.07

12676.82

Leukaemia-1

16368.73

15400.02

15227.14

14891.66

14020.98

13737.54

13351.06

13005.07

12670.82

TABLE 2 LOG LIKELIHOOD VALUES OF THE DENSITY BASED CLUSTERING

Datasets

Number of Clusters

SRBCT

-01252.27

-01093.28

-01016.29

-00865.44

-00701.76

-00596.40

-00418.96

-00334.48

-00286.77

Lymphoma

-03162.03

-02993.16

-02787.76

-02658.80

-02630.60

-02469.81

-02352.89

-02341.43

-02231.36

Leukaemia

-47267.70

-46848.50

-46753.89

-46686.34

-46276.76

-46089.52

-45885.16

-45665.65

-45582.44

Leukaemia3C

-47267.55

-46848.34

-46715.70

-46648.40

-46216.85

-46041.36

-45837.00

-45612.04

-45528.83

Leukaemia-1

-47267.23

-46848.02

-46715.37

-46648.07

-46216.55

-46034.78

-45789.04

-45611.77

-45528.56

TABLE 3 LOG LIKELIHOOD VALUES OF THE EXPECTATION MAXIMIZATION CLUSTERING

Datasets

Number of Clusters

SRBCT

-01122.50

-00943.74

-00759.65

-00654.04

-00535.71

-00451.53

-00340.92

-00259.80

-00168.69

Lymphoma

-03137.02

-02916.35

-02759.67

-02683.67

-02413.22

-02260.96

-02355.60

-01979.22

-01910.53

Leukaemia

-47351.64

-46882.29

-46578.92

-46291.84

-46070.29

-45886.79

-45770.37

-45456.07

-45410.69

Leukaemia3C

-47351.49

-46882.13

-46578.76

-46291.69

-46070.18

-45886.67

-45770.27

-45455.95

-45410.59

Leukaemia -1

-47324.03

-46881.82

-46560.46

-46291.44

-46069.86

-45886.44

-45770.03

-45455.67

-45410.34

IJARCCE

ISSN (Online) 2278-1021

ISSN (Print) 2319 5940

International Journal of Advanced Research in Computer and Communication Engineering

Vol. 5, Issue 2, February 2016

Fig1. Sum of squared error for the K-means and density

based clustering for five different datasets.

Fig2. Log likelihood of the density based clustering and

expectation maximization clustering for the SRBCT

dataset.

Fig3. Log likelihood of the density based clustering and

expectation maximization clustering for the Lymphoma

dataset.

Fig4. Log likelihood of the density based clustering and

expectation maximization clustering for the Leukaemia

dataset.

Fig5. Log likelihood of the density based clustering and

expectation maximization clustering for the Leukaemia 3C

dataset.

Fig6. Log likelihood of the density based clustering and

expectation maximization clustering for the Leukaemia-1

dataset.

From Table 1 and Figure 1, it observed that the K-means

clustering and density based clustering methods perform

similarly in terms of SSE for different number of clusters

on all the datasets and also it is observed that the

clustering methods produce lesser SSE with SRBCT

dataset compared to other datasets. From the Table 2,

Table 3 and Figure 2 to 6, it is observed that the

expectation maximization clustering method performs

better than the density based clustering method in terms of

log likelihood.

V. CONCLUSION

This paper conducted an empirical study on various

clustering algorithms in order to observe their performance

on gene expression data in terms of sum of squared error

and log likelihood. In this empirical study, the

performance of the clustering algorithms namely density

based clustering, expectation maximization clustering and

K-means clustering are evaluated on various gene

expression data. From this evaluation, it is observed that

the performance of expectation maximization clustering

algorithm is comparatively better than the density based

clustering algorithm in terms of log likelihood. The

clustering algorithms namely K-means and density based

clustering method perform similarly in terms of sum of

squared error. The SRBCT dataset possesses less sum of

IJARCCE

ISSN (Online) 2278-1021

ISSN (Print) 2319 5940

International Journal of Advanced Research in Computer and Communication Engineering

Vol. 5, Issue 2, February 2016

squared error for the clustering algorithms K-means and

density based clustering than all other datasets compared.

REFERENCES

[1] Danasingh Asir Antony Gnana Singh, Subramanian Appavu Alias

Balamurugan, Epiphany Jebamalar Leavline, „A novel feature

selection method for image classification‟, Optoelectronics and

Advanced Materials, Rapid Communications, 9.11-12 (2015)

:1362- 1368

[2] Danasingh Asir Antony Gnana Singh, Subramanian Appavu Alias

Balamurugan, Epiphany Jebamalar Leavline. "An unsupervised

feature selection algorithm with feature ranking for maximizing

performance of the classifiers." International Journal of Automation

and Computing 12.5 (2015): 511-517.

[3] Danasingh Asir Antony Gnana Singh, Subramanian Appavu Alias

Balamurugan, Epiphany Jebamalar Leavline, „Improving the

Accuracy of the Supervised Learners using Unsupervised based

Variable Selection‟, Asian Journal of Information Technology, 13.9

(2014): 530-537.

[4] Arifovic, Jasmina. "Genetic algorithm learning and the cobweb

model." Journal of Economic dynamics and Control 18.1 (1994): 3-

28.

[5] Hommes, Cars H. "On the consistency of backward-looking

expectations: The case of the cobweb." Journal of Economic

Behavior & Organization 33.3 (1998): 333-362.

[6] Alejos, Óscar, and Edward Della Torre. "The generalized cobweb

method." Magnetics, IEEE Transactions on 41.5 (2005): 1552-1555.

[7] Zhao, Zhechong, and Lei Wu. "Stability analysis for power systems

with pricebased demand response via Cobweb Plot." Proc. IEEE

PES General Meeting. 2013.

[8] Yuni Xia,Bowei Xi “Conceptual Clustering Categorical Data with

Uncertainty” 19th IEEE International Conference on Tools with

Artificial Intelligence

[9] Moon, Tood K. "The expectation-maximization algorithm." Signal

processing magazine, IEEE 13.6 (1996): 47-60.

[10] Brankov, Jovan G., et al. "Similarity based clustering using the

expectation maximization algorithm." Image Processing. 2002.

Proceedings. 2002 International Conference on. Vol. 1. IEEE, 2002.

[11] Lagendijk, Reginald L., Jan Biemond, and Dick E. Boekee.

"Identification and restoration of noisy blurred images using the

expectation-maximization algorithm." IEEE Transactions on

Acoustics, Speech, and Signal Processing [see also IEEE

Transactions on Signal Processing], 38 (7) (1990).

[12] Figueiredo, Mário AT, and Robert D. Nowak. "An EM algorithm

for wavelet-based image restoration." Image Processing, IEEE

Transactions on 12.8 (2003): 906-916.

[13] Fessler, Jeffrey, and Alfred O. Hero. "Space-alternating generalized

expectation-maximization algorithm." Signal Processing, IEEE

Transactions on 42.10 (1994): 2664-2677.

[14] Kumar, Manoj. "An optimized farthest first clustering algorithm."

Engineering (NUiCONE), 2013 Nirma University International

Conference on. IEEE, 2013.

[15] Huang, Chung-Ming, et al. "A farthest-first forwarding algorithm in

VANETs." ITS Telecommunications (ITST), 2012 12th

International Conference on. IEEE, 2012.

[16] Vadeyar, Deepshree A., and H. K. Yogish. "Farthest First

Clustering in Links Reorganization." International Journal of Web

& Semantic Technology 5.3 (2014): 17.

[17] Bilenko, Mikhail, Sugato Basu, and Raymond J. Mooney.

"Integrating constraints and metric learning in semi-supervised

clustering." Proceedings of the twenty-first international conference

on Machine learning. ACM, 2004.

[18] Leung, Yee, Jiang-She Zhang, and Zong-Ben Xu. "Clustering by

scale-space filtering." Pattern Analysis and Machine Intelligence,

IEEE Transactions on 22.12 (2000): 1396-1410.

[19] George, Thomas, and Srujana Merugu. "A scalable collaborative

filtering framework based on co-clustering." Data Mining, Fifth

IEEE International Conference on. IEEE, 2005.

[20] Lagendijk, Reginald L., Jan Biemond, and Dick E. Boekee.

"Identification and restoration of noisy blurred images using the

expectation-maximization algorithm." IEEE Transactions on

Acoustics, Speech, and Signal Processing [see also IEEE

Transactions on Signal Processing], 38 (7) (1990).

[21] Bandyopadhyay, Seema, and Edward J. Coyle. "An energy efficient

hierarchical clustering algorithm for wireless sensor networks."

INFOCOM 2003. Twenty-Second Annual Joint Conferences of the

IEEE Computer and Communications. IEEE Societies. Vol. 3.

IEEE, 2003.

[22] Dittenbach, Michael, Dieter Merkl, and Andreas Rauber. "The

growing hierarchical self-organizing map." ijcnn. IEEE, 2000.

[23] Pei, Guangyu, et al. "A wireless hierarchical routing protocol with

group mobility." Wireless Communications and Networking

Conference, 1999. WCNC. 1999 IEEE. IEEE, 1999.

[24] Chen, Yixin, and Li Tu. "Density-based clustering for real-time

stream data." Proceedings of the 13th ACM SIGKDD international

conference on Knowledge discovery and data mining. ACM, 2007.

[25] Mitra, Pabitra, C. A. Murthy, and Sankar K. Pal. "Density-based

multiscale data condensation." Pattern Analysis and Machine

Intelligence, IEEE Transactions on 24.6 (2002): 734-747.

[26] Kanungo, Tapas, et al. "An efficient k-means clustering algorithm:

Analysis and implementation." Pattern Analysis and Machine

Intelligence, IEEE Transactions on 24.7 (2002): 881-892.

[27] Kanungo, Tapas, et al. "An efficient k-means clustering algorithm:

Analysis and implementation." Pattern Analysis and Machine

Intelligence, IEEE Transactions on 24.7 (2002): 881-892.

[28] Likas, Aristidis, Nikos Vlassis, and Jakob J. Verbeek. "The global

k-means clustering algorithm." Pattern recognition 36.2 (2003):

451-461.

Big Data Architecture for Network Security

Chapter

Full-text available

Feb 2022

Research is considering security of big data and retaining the performance during its transmission over network. It has been observed that there have been several researches that have considered the concept of big data. Moreover, a lot of those researches also provided security against data but failed to retain the performance. Use of several encryption mechanisms such as RSA [43] and AES [44] has been used in previous researches. But, if these encryption mechanisms are applied, then the performance of network system gets degraded. In order to resolve those issues, the proposed work is making using of compression mechanism to reduce the size before implementing encryption. Moreover, data is spitted in order to make the transmission more reliable. After splitting the data contents data has been transferred from multiple route. If some hackers opt to capture that data in unauthentic manner, then they would be unable to get complete and meaning full information. Thus, the proposed model has improved the security of big data in network environment by integration of compression and splitting mechanism with big data encryption. Moreover, the use of user‐defined port and use of multiple paths during transmission of big data in split manner increases the reliability and security of big data over network environment.

An Experimental Study on Microarray Expression Data from Plants under Salt Stress by using Clustering Methods

Article

Full-text available

Jun 2020

Current Genome-wide advancements in Gene chips technology provide in the “Omics (genomics, proteomics and transcriptomics) research”, an opportunity to analyze the expression levels of thousand of genes across multiple experiments. In this regard, many machine learning approaches were proposed to deal with this deluge of information. Clustering methods are one of these approaches. Their process consists of grouping data (gene profiles) into homogeneous clusters using distance measurements. Various clustering techniques are applied, but there is no consensus for the best one. In this context, a comparison of seven clustering algorithms was performed and tested against the gene expression datasets of three model plants under salt stress. These techniques are evaluated by internal and relative validity measures. It appears that the AGNES algorithm is the best one for internal validity measures for the three plant datasets. Also, K-Means profiles a trend for relative validity measures for these datasets.

Performance Evaluation of VANETs for Evaluating Node Stability in Dynamic Scenarios

Article

Full-text available

Sep 2018

Vehicular ad hoc networks (VANETs) are a favorable area of exploration which empowers the interconnection amid the movable vehicles and between transportable units (vehicles) and road side units (RSU). In Vehicular Ad Hoc Networks (VANETs), mobile vehicles can be organized into assemblage to promote interconnection links. The assemblage arrangement according to dimensions and geographical extend has serious influence on attribute of interaction. Vehicular ad hoc networks (VANETs) are subclass of mobile Ad-hoc network involving more complex mobility patterns. Because of mobility the topology changes very frequently. This raises a number of technical challenges including the stability of the network. There is a need for assemblage configuration leading to more stable realistic network. The paper provides investigation of various simulation scenarios in which cluster using k-means algorithm are generated and their numbers are varied to find the more stable configuration in real scenario of road.

Clarans & birch datamining techniques for disease diagnosis

Article

Full-text available

Feb 2024
MULTIMED TOOLS APPL

Data Mining is the action of performing intelligent analysis of discovering patterns, trends, and insights from large datasets various methods. The most common challenges include selecting the appropriate data mining algorithms from widely used techniques include classification, regression, clustering and association rule mining, and anomaly detection based on the the nature of the issue. By analyzing medical information, data mining techniques can help with predicting the likelihood of diseases and improving diagnostic accuracy. This can lead to early detection and intervention, potentially saving lives and reducing healthcare costs. Also, data mining enables the identification of patient subgroups with similar characteristics, allowing for personalized treatment plans. This can enhance the effectiveness of treatments by tailoring them to individual patient profiles. In this paper, the performance evaluation in the classification algorithms like k-means clustering and FCM methods are dicussed in relation to medical dataset. The objective is to enhance the intelligence and comprehension of researchers and methodically retrieve pertinent information from large databases that contain details on long-term patients. This process is a valuable option for improving the analysis of scientific data and gaining insights into the health patterns of individuals over extended periods. Treating lung, diabetes and liver diseases are important and laborious step for medicine. In this work, the above said algorithm’s performance were evaluated using UCI Repository datasets like Diabetes, Liver Disorder and Lung Cancer data on the basis of Performance Measure and Cost Measure. This empirical outcome caters the better accuracy by experimenting in tuning various parameters.

Artificial intelligence for a bio-sensored detection of tuberculosis

Article

Full-text available

Apr 2021

The advancement and well-evolved technology in international clinical discipline, tuberculosis remains a major health hazard. To resolve the hassle of tuberculosis, synthetic artificial intelligence (AI) affords the manner for solving trouble in actual global and enlightens the sector in bringing a human’s mind to a gadget. This paper targets to discover the presence of Mycobacterium tuberculosis infection within a brief span of time as compared to historical technique. it is designed in this type of way that the breath of the inflamed man or woman may be used to diagnose the sickness at premiere stage. The main goal is to design and implement a portable diagnostic package for Tuberculosis the use of Neural Networks and synthetic Intelligence. The device package known as digital nose that is enormous synthetic sensible element built through neural network consists of a biosensor having an electrode lined with the galectin. The indicators of hybridization on binding can be captured and processed with the aid of system gaining knowledge of sensor and the output is displayed the use of artificial intelligence. Feed-ahead back propagation neural community is used as a classifier to distinguish between inflamed or non-infected man and woman. In advance case of analysis approached on the basis of grouping microorganism’s days together. However this observe quicker to find the document in an hour. Therefore, the time taken for diagnosing the presence of the bacterium may be decreased and this additionally paves the manner for starting the remedy right now with ease.

Dimensionality Reduction for Classification and Clustering

Article

Apr 2019
IJISA

Proposed KDBSCAN Algorithm for Clustering

Article

Full-text available

Jun 2018

Science, technology and many other fields are use clustering algorithm widely for many applications, this paper presents a new hybrid algorithm called KDBSCAN that work on improving k-mean algorithm and solve two of its problems, the first problem is number of cluster, when it`s must be entered by user, this problem solved by using DBSCAN algorithm for estimating number of cluster, and the second problem is randomly initial centroid problem that has been dealt with by choosing the centroid in steady method and removing randomly choosing for a better results, this work used DUC 2002 dataset to obtain the results of KDBSCAN algorithm, it`s work in many application fields such as electronics libraries, biology and marketing, the KDBSCAN algorithm that described in this paper has better results than traditional K-mean and DBSCAN algorithms in many aspects, its preform stable result with lower entropy.

ACGLM: A Hybrid Approach to Select and Combine Gene Expression Regulation in Cancer Datasets

Article

Full-text available

Mar 2020

Cancer is one of the causes of death in the world and many genes are involved in it. Transcription factors (TFs) and microRNAs (miRNAs) are primary gene regulators and regulatory mechanisms for cells to define their targets. The study of the Regulatory mechanisms of the two main regulators is complex, but this lead to a deeper interpretation of biological processes. In order to avoid exhaustive search and unnecessary genes, firstly, mRNA expression and miRNA expression are clustered by K-means cluster, then, applied ANOVA test to select significant genes. We proposed a gene regulatory network (GRN) estimation method, using Directed networks with generalized linear regression to predict and explain the relationships between regulators and their targets. Where through GO TERM and KEGG PATHWAY for target genes we got many processes such as cell communication, regulation of the biologic process, biological regulation and cell cycle, DNA replication, and cell cycle, these processes are considered significant to the cancer diseases. by comparing with other methodologies Our approach was better, as well as the results were consistent with the medical literature, where the important regulators in our gene regulatory network have a major role in cancer this explains the efficiency of this approach.

DeepCancer: Detecting Cancer via Deep Generative Learning Through Gene Expressions

Conference Paper

Nov 2017

Identifying Sensitive Attributes for Preserving Privacy

Chapter

Jul 2017

In recent past, the data are generated massively from medical sector due to the advancements and growth of technology leading to high-dimensional and massive data. Handling the medical data is crucial since it contains some sensitive data of the individuals. If the sensitive data is revealed to the adversaries or others then that may be vulnerable to attack. Hiding the huge volume of data is practically difficult task among the researchers. Therefore, in real life only the sensitive data are hidden from the huge volume of data for providing security since hiding the entire data is costlier. Therefore, the sensitive attributes must be identified from the huge volume of data in order to hide them for preserving the privacy. In order to identify the sensitive data to hide them for providing security, reducing the computational and transmission cost in secured data transmission, this chapter presents a pragmatic approach to identify the sensitive attributes for preserving privacy. This proposed method is tested on the various real-world datasets with different classifiers and also the results are presented.

A novel feature selection method for image classification

Article

Full-text available

Nov 2015
OPTOELECTRON ADV MAT

Image classification plays a significant role in pattern recognition. In the recent past, due to the advancements in imaging technology, massive data are being generated through various image acquisition techniques. Classifying these massive images is a challenging task among the researchers. This paper presents a novel feature selection method to improve the performance of image classification. The performance of the proposed method is tested on the publically available real image dataset and compared with various state-of-the-art feature selection methods. The experimental results show that the proposed method outperforms the other state-of-the-art methods. © 2015, National Institute of Optoelectronics. All right reserved.

Improving the Accuracy of the Supervised Learners using Unsupervised based Variable Selection

Article

Full-text available

Jan 2014

An unsupervised feature selection algorithm with feature ranking for maximizing performance of the classifiers

Article

Full-text available

Jul 2015

Prediction plays a vital role in decision making. Correct prediction leads to right decision making to save the life, energy, efforts, money and time. The right decision prevents physical and material losses and it is practiced in all the fields including medical, finance, environmental studies, engineering and emerging technologies. Prediction is carried out by a model called classifier. The predictive accuracy of the classifier highly depends on the training datasets utilized for training the classifier. The irrelevant and redundant features of the training dataset reduce the accuracy of the classifier. Hence, the irrelevant and redundant features must be removed from the training dataset through the process known as feature selection. This paper proposes a feature selection algorithm namely unsupervised learning with ranking based feature selection (FSULR). It removes redundant features by clustering and eliminates irrelevant features by statistical measures to select the most significant features from the training dataset. The performance of this proposed algorithm is compared with the other seven feature selection algorithms by well known classifiers namely naive Bayes (NB), instance based (IB1) and tree based J48. Experimental results show that the proposed algorithm yields better prediction accuracy for classifiers. © 2015, Institute of Automation, Chinese Academy of Sciences and Springer-Verlag Berlin Heidelberg.

Farthest First Clustering in Links Reorganization

Article

Full-text available

Jul 2014

Website can be easily design but to efficient user navigation is not a easy task since user behavior is keep changing and developer view is quite different from what user wants, so to improve navigation one way is reorganization of website structure. For reorganization here proposed strategy is farthest first traversal clustering algorithm perform clustering on two numeric parameters and for finding frequent traversal path of user Apriori algorithm is used. Our aim is to perform reorganization with fewer changes in website structure.

An optimized farthest first clustering algorithm

Conference Paper

Full-text available

Nov 2013

Mukesh Kumar

Data mining is the process of analyzing data from different viewpoints and summarizing it into useful information. Clustering is the process of grouping of similar objects together. The group of the objects is called the cluster which contains similar objects compared to objects of the other cluster. Different clustering algorithm can be used according to the behavior of data. Farthest first algorithm is suitable for the large dataset but it creates the non-uniform cluster. The paper forms optimization of farthest first algorithm of clustering resulting uniform clusters.

Stability Analysis for Power Systems with Price-Based Demand Response Via Cobweb Plot

Conference Paper

Jan 2013

Price-based demand response (DR) is a mechanism for dynamically managing electric energy consumption in response to varying electricity prices. Price-sensitive DR would encourage consumers to reduce energy consumption when prices are higher, thereby reducing the peak electricity demand and alleviating pressure to power systems. However, it brings additional dynamics and challenges to the real-time supply and demand balance. Specifically, price-sensitive DR loads are constantly changing based on dynamic real-time prices, and changes in DR loads will in turn impact electricity prices. This paper adopts a closed loop model between economic dispatch (ED) and the price-sensitive DR load adjustment for exploring the stability of power systems. The stability refers to the ability that the system will finally converge to a fixed point after finite iterations between the ED and the price-sensitive DR load adjustment procedure. Ellipse and exponential curves are used to represent the non-linear price elasticity characteristics of DR, and the cobweb plot is applied to analyze the stability of power systems. Numerical case studies illustrate how non-linear price elasticity DR curve parameters will affect the system stability.

A farthest-first forwarding algorithm in VANETs

Conference Paper

Nov 2012

In order to reduce the transmission delay between the source vehicle and the destination vehicle, a Farthest-First forwarding algorithm (FF) algorithm is proposed in this paper to find dissemination paths with fewer hop counts, lower transmission delay, and fewer control packets. The proposed FF algorithm utilizes the general idea of the contention-based technique to choose the farthest vehicle from a sender to be a packet forwarder, which is decided by receivers themselves. In the FF algorithm, each receiver waits for a specific time period when a packet is received. The time period is calculated based on the distance between the sender vehicle and each receiver vehicle, and the waiting time period of the farthest vehicle is always less than other nearby vehicles. When the timer expires, farthest vehicle will first forwards the packet. Although the concept of FF algorithm similar to the greedy forwarding algorithm, the message exchanging for neighboring information maintenance in the proposed FF is unnecessary. Compare to the other mechanisms, the simulation results show that of using the proposed FF algorithm can reduce the transmission delay and fewer hop counts, significantly.

An Efficient K-Means Clustering Algorithm Analysis and Implementation

Article

Jul 2002

In k\hbox{-}{\rm{means}} clustering, we are given a set of n data points in d\hbox{-}{\rm{dimensional}} space {\bf{R}}^d and an integer k and the problem is to determine a set of k points in {\bf{R}}^d, called centers, so as to minimize the mean squared distance from each data point to its nearest center. A popular heuristic for k\hbox{-}{\rm{means}} clustering is Lloyd's algorithm. In this paper, we present a simple and efficient implementation of Lloyd's k\hbox{-}{\rm{means}} clustering algorithm, which we call the filtering algorithm. This algorithm is easy to implement, requiring a kd-tree as the only major data structure. We establish the practical efficiency of the filtering algorithm in two ways. First, we present a data-sensitive analysis of the algorithm's running time, which shows that the algorithm runs faster as the separation between clusters increases. Second, we present a number of empirical studies both on synthetically generated data and on real data sets from applications in color quantization, data compression, and image segmentation.

The Global K-Means Clustering Algorithm

Article

Aug 2002
PATTERN RECOGN

We present the global k-means algorithm which is an incremental approach to clustering that dynamically adds one cluster center at a time through a deterministic global search procedure consisting of N (with N being the size of the data set) executions of the k-means algorithm from suitable initial positions. We also propose modifications of the method to reduce the computational load without significantly affecting solution quality. The proposed clustering methods are tested on well-known data sets and they compare favorably to the k-means algorithm with random restarts.

On the consistency of backward-looking expectations: The case of the cobweb

Article

Jan 1998
J ECON BEHAV ORGAN

Cars Hommes

In dynamic models of economic fluctuations backward-looking expectations with systematic forecasting errors are inconsistent with rational behaviour. In non-linear dynamic models exhibiting seemingly unpredictable, chaotic fluctuations, however, simple habitual ‘rule of thumb’ backward-looking expectation rules may yield non-zero but nevertheless non-systematic forecasting errors. In a chaotic model expectational forecasting errors may have zero autocorrelations at all lags. Even for rational agents patterns in these forecasting errors may be very difficult to detect, especially in the presence of (small) noise. Backward-looking expectations are then not necessarily inconsistent with rational behaviour. We investigate whether simple expectation schemes such as naive or adaptive expectations can be consistent with rational behaviour in the simplest of all non-linear dynamic economic models, the non-linear cobweb model.

Performance Analysis on Clustering Approaches for Gene Expression Data

Abstract and Figures

Recommended publications

A pattern matching approach for clustering gene expression data

An improved quantum-inspired evolutionary algorithm for clustering gene expression data

Experimental study on feature selection methods for software fault detection

Dimensionality Reduction for Classification and Clustering

Decision Making in Enterprise Computing: A Data Mining Approach

Identifying redundant features using unsupervised learning for high-dimensional data