IJARCCE
ISSN (Online) 2278-1021
ISSN (Print) 2319 5940
International Journal of Advanced Research in Computer and Communication Engineering
Vol. 5, Issue 2, February 2016
Copyright to IJARCCE DOI 10.17148/IJARCCE.2016.5242 196
Performance Analysis on Clustering Approaches
for Gene Expression Data
D. Asir Antony Gnana Singh1, A. Escalin Fernando2, E. Jebamalar Leavline3
Department of CSE, Anna University, BIT Campus, Tiruchirappalli, India1, 2
Department of ECE, Anna University, BIT Campus, Tiruchirappalli, India3
Abstract: Clustering is a way of finding structure in a collection of unlabeled gene expression data. A number of algorithms have been developed to tackle the problem of clustering gene expression data, which is important for solving problems that arise in unsupervised learning. This paper presents a performance analysis of various clustering algorithms, namely K-means, expectation maximization, and density based clustering, in order to identify the best clustering algorithm for microarray data. Sum of squared error and log likelihood measures are used to evaluate the performance of these clustering methods.
Keywords: Clustering analysis on microarray data, comparison of clustering algorithms, clustering analysis on gene
expression data, literature review on clustering methods, survey on clustering techniques.
I. INTRODUCTION
Clustering is the process of organizing objects into groups whose members are similar in some way. Therefore, a cluster contains similar objects, while dissimilar objects fall into different clusters. The ultimate aim of a clustering algorithm is to form well-formed clusters, that is, to group objects based on their similarity. Many similarity measures, such as density and distance, have been used. In general, clustering algorithms learn from unlabeled data; hence they are also called unsupervised learning algorithms. An unsupervised learning algorithm learns the unlabeled data and builds a clustering model. The developed clustering model can then be employed to predict the group or cluster of ungrouped or un-clustered data. Clustering algorithms are used in various applications such as gene expression data analysis, outlier detection, and feature selection [1-3]. Clustering algorithms can be classified into various types based on the fashion in which the objects are clustered, namely model based clustering, density based clustering, connectivity based clustering, centroid based clustering, etc. In this paper, the performance of density based clustering, expectation maximization (EM) clustering, and K-means clustering is analysed in terms of the sum of squared error (SSE) and log likelihood on various gene expression data.
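To make the two evaluation measures concrete, the following minimal sketch computes the SSE of a hard clustering and the log likelihood of data under a Gaussian mixture. The function names and the spherical-covariance assumption are ours, for illustration only; they are not taken from the paper or from WEKA.

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of squared error: squared distance of each point to its assigned centre."""
    return float(((X - centroids[labels]) ** 2).sum())

def log_likelihood(X, means, variances, weights):
    """Log likelihood of the data under a spherical Gaussian mixture (higher is better)."""
    d = X.shape[1]
    total = 0.0
    for x in X:
        p = 0.0
        for m, v, w in zip(means, variances, weights):
            # spherical Gaussian density of component (m, v), weighted by w
            p += w * np.exp(-((x - m) ** 2).sum() / (2 * v)) / (2 * np.pi * v) ** (d / 2)
        total += np.log(p)
    return total

# tiny example: two points at the first centre, one at the second
X = np.array([[0.0, 0.0], [0.0, 0.0], [4.0, 4.0]])
err = sse(X, np.array([0, 0, 1]), np.array([[0.0, 0.0], [4.0, 4.0]]))  # perfect fit
ll = log_likelihood(X, [np.array([0.0, 0.0]), np.array([4.0, 4.0])],
                    [1.0, 1.0], [0.5, 0.5])
```

A lower SSE and a higher (less negative) log likelihood both indicate a better fit, which is how the tables in Section IV are read.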
The rest of the paper is organised as follows. In Section II
the clustering techniques are discussed. In Section III
experimental setup and the experimental procedures are
explained. The results are discussed in Section IV. Finally,
Section V concludes this paper.
II. CLUSTERING TECHNIQUES
This section presents the various clustering techniques and
their merits and demerits.
A. Model Based Clustering
Cobweb is one of the model based clustering methods. It is an incremental system intended for hierarchical conceptual clustering. Its acuity parameter sets the minimum standard deviation for numerical data, and a category utility threshold (the cut-off) is used to prune the data. Arifovic et al. [4] explained that a genetic algorithm achieves better convergence over a wider range of parameters. Hommes et al. [5] suggested that the cobweb model can show consistent rational behaviour for non-linear dynamic models. The similarity measure used in cobweb is a distance measure. Alejos et al. [6] presented a technique to calculate the magnetization of a simulated system with improved accuracy by means of the Preisach model. Zhechong Zhao et al. [7] presented a cobweb plot, which graphically illustrates an iterative procedure and is used to analyse stability. It investigates the quantitative behaviour of a one dimensional iterated function using a fixed point known as an invariant point. Yuni Xia et al. [8] proposed a conceptual clustering algorithm which can explicitly handle uncertainty in the values of the dataset. A total utility (TU) index is introduced to measure the quality of the clustering, which ultimately increases the internal probabilistic information of the clustering performance. The advantage of cobweb is that it allows a bidirectional search, merging and splitting classes using the category utility. The major disadvantage is that it relies purely on assumed probability distributions, where updating and storing the clusters is quite expensive.
Expectation maximization (EM) clustering is another type of model based clustering; it is an iterative method capable of finding maximum likelihood estimates in statistical models. It alternates between an expectation (E) step, which computes the expected log likelihood, and a maximization (M) step, which re-estimates the parameters. Moon [9] stated that this algorithm is suitable for outcomes that are clumped together. Brankov et al. [10] proposed normalized cross correlation, which performs better than the traditionally used Euclidean distance which is
used as the similarity measure for expectation maximization. The EM algorithm is particularly suitable for the analysis of image data. Lagendijk et al. [11] applied the maximum likelihood approach to identify and restore noisy data in blurred images. The EM method facilitates maximizing likelihood functions that arise in statistical estimation problems. Figueiredo et al. [12] presented an algorithm for image restoration using the penalized likelihood. Fessler et al. [13] presented a new update scheme that sequentially alternates the parameters between several small hidden data spaces defined by the algorithm designer. EM suits real world datasets and is well suited to cluster analysis of a small scene, or when the results of simple K-means are unsatisfactory. The drawback of the EM algorithm is its inherent complexity.
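The E-step/M-step alternation can be sketched for a two-component one-dimensional Gaussian mixture. This is our own illustrative code under simplifying assumptions (1-D data, two components, extreme-value initialisation), not the WEKA implementation used in the experiments:

```python
import numpy as np

def em_gmm_1d(x, iters=50):
    """EM for a two-component 1-D Gaussian mixture; returns means and log likelihood."""
    mu = np.array([x.min(), x.max()])          # crude initialisation at the extremes
    var = np.array([x.var(), x.var()]) + 1e-6  # small floor keeps variances positive
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances from the responsibilities
        nk = resp.sum(axis=0)
        pi, mu = nk / len(x), (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    # log likelihood of the data under the fitted mixture
    dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return mu, np.log(dens.sum(axis=1)).sum()

# two tight groups of points, one around 0 and one around 8
x = np.concatenate([np.full(30, 0.0), np.full(30, 8.0)]) + np.linspace(-0.1, 0.1, 60)
mu, ll = em_gmm_1d(x)
```

On this toy data the two fitted means land near 0 and 8, and the returned log likelihood is the quantity reported for EM in Tables 2 and 3.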
B. Farthest First Clustering
The farthest first algorithm places the centre of each cluster at the point farthest from the existing cluster centres. It is a variant of simple K-means. Manoj et al. [14] suggested that the farthest first algorithm is suitable for large datasets but produces non-uniform clusters, so they developed an optimized farthest first clustering algorithm that produces uniform clusters. Chung-Ming et al. [15] proposed a farthest-first forwarding algorithm to reduce transmission delay in vehicular ad hoc networks (VANETs). H. K. Yogish et al. [16] proposed a farthest first traversal strategy for finding frequent traversal paths in the navigation and reorganization of website structure. This clustering algorithm can eventually speed up clustering, since only a few adjustments to the data are needed. According to Bilenko et al. [17], constraint based methods and distance-function learning methods supply the similarity metric used in the algorithm. The major advantage is that it is a heuristic method that is fast, scalable and appropriate for large datasets. However, it is difficult to compare the quality of the clusters produced; the method does not hold for non-globular clusters and is very sensitive to outliers.
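The centre-selection idea described above can be sketched in a few lines. This is a generic farthest-first traversal (our own illustration, starting from an arbitrary first point), not the optimized variant of [14]:

```python
import numpy as np

def farthest_first_centres(X, k):
    """Pick k centre points: start anywhere, then repeatedly take the point
    farthest from all centres chosen so far."""
    centres = [0]                                   # arbitrary starting point
    d = np.sqrt(((X - X[0]) ** 2).sum(axis=1))      # distance to nearest centre so far
    for _ in range(k - 1):
        nxt = int(d.argmax())                       # farthest remaining point
        centres.append(nxt)
        d = np.minimum(d, np.sqrt(((X - X[nxt]) ** 2).sum(axis=1)))
    return centres

# two pairs of nearby points plus one point off on its own
X = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [10.1, 0.0], [5.0, 9.0]])
centres = farthest_first_centres(X, 3)
```

Because each new centre maximises the distance to the chosen set, the isolated point is picked early, which illustrates both the speed of the heuristic and its sensitivity to outliers.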
C. Filter Based Clustering
This type of clustering method filters out the information or patterns that are essentially needed. The filtering is carried out based on supplied keywords or other relevant information. Jiang-She Zhang et al. [18] proposed a clustering algorithm for image processing that is computationally stable, insensitive to initialization, and produces consistent clusters. Thomas et al. [19] proposed a collaborative filtering scheme that combines correlation and singular value decomposition (SVD) to improve accuracy; a weighted co-clustering algorithm is designed in incremental and parallel versions, and the results are empirically evaluated. Lagendijk et al. [20] proposed two different methods to estimate the performance of individual classifiers and then combine them based on the weight of each classifier. The advantage is that newly arriving keywords are compared with the existing profile and the information is provided to the user; the information is also checked instantly rather than waiting for further input from the user. The major drawback is that the user cannot get information about the filtering algorithm being used, and the method depends on feedback about the retrieved information.
D. Connectivity Based Clustering
Hierarchical clustering is a type of connectivity based clustering. Its core idea is that objects are more related to nearby objects than to objects farther away. The main categories of hierarchical clustering are the "agglomerative" and "divisive" methods. Edward J. Coyle et al. [21] proposed a randomized algorithm, mainly applicable to sensor networks, for generating cluster heads in a hierarchical manner. Michael Dittenbach et al. [22] presented a growing hierarchical self-organizing map that evolves on the input data during the unsupervised training process. Guangyu et al. [23] developed a comparative analysis and suggested that hierarchical clustering is better than conventional clustering. It produces an extensive hierarchy of clusters that merge with one another at certain distances. The disadvantage of hierarchical clustering is that it cannot provide a single partitioning of the dataset.
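As a concrete sketch of the agglomerative variant, the following code merges the two closest clusters until a requested number remain. The single-linkage criterion and the brute-force search are our own illustrative choices; they are not drawn from the cited works:

```python
import numpy as np

def single_linkage(X, k):
    """Agglomerative clustering with single linkage: repeatedly merge the two
    clusters whose closest members are nearest, until k clusters remain."""
    clusters = [[i] for i in range(len(X))]
    d = np.sqrt(((X[:, None] - X[None, :]) ** 2).sum(axis=-1))  # pairwise distances
    while len(clusters) > k:
        best, pair = float("inf"), None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: distance between the closest pair of members
                dist = min(d[i, j] for i in clusters[a] for j in clusters[b])
                if dist < best:
                    best, pair = dist, (a, b)
        a, b = pair
        clusters[a] += clusters.pop(b)  # merge the closest pair of clusters
    return clusters

# two well-separated groups on a line
X = np.array([[0.0], [0.2], [0.4], [9.0], [9.2]])
groups = single_linkage(X, 2)
```

Stopping the merges at a chosen k is how a single partition is extracted from the hierarchy, which is otherwise a full dendrogram rather than one partitioning.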
E. Density Based Clustering
Density based clustering (DBC) groups objects mainly based on the density of objects that are reachable and connected. Li Tu et al. [24] proposed a framework called D-Stream for clustering using the density based approach. Mitra et al. [25] suggested a nonparametric data reduction scheme; the procedure separates objects in dense areas from those in less dense areas with the aid of an arbitrary object. Density based clusters are robust to noise, but the method is problematic on datasets that lack highly densely connected data.
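A compact sketch of the density based idea, in the style of DBSCAN (our own illustration; the paper's experiments use WEKA's density based clusterer, which differs in detail), grows clusters from points whose neighbourhoods are sufficiently dense and marks the rest as noise:

```python
import numpy as np

def dbscan(X, eps=1.0, min_pts=3):
    """Tiny DBSCAN-style clusterer: grow clusters from core points; -1 marks noise."""
    n = len(X)
    labels = np.full(n, -1)
    d = np.sqrt(((X[:, None] - X[None, :]) ** 2).sum(axis=-1))   # pairwise distances
    neighbours = [np.flatnonzero(d[i] <= eps) for i in range(n)]  # includes the point itself
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbours[i]) < min_pts:
            continue  # already claimed, or not a core point
        labels[i] = cluster
        frontier = list(neighbours[i])       # breadth-first expansion from core point i
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if len(neighbours[j]) >= min_pts:   # core points keep expanding
                    frontier.extend(neighbours[j])
        cluster += 1
    return labels

# two dense groups plus one distant outlier
X = np.vstack([np.linspace(0, 1, 10)[:, None].repeat(2, 1),
               np.linspace(10, 11, 10)[:, None].repeat(2, 1),
               [[100.0, 100.0]]])
labels = dbscan(X, eps=0.5, min_pts=3)
```

The isolated point never reaches the density threshold and is labelled noise, illustrating both the robustness to noise and the reliance on densely connected data noted above.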
F. Centroid Based Clustering
K-means is a centroid based clustering method. It partitions the dataset into clusters based on the mean distance and is one of the simplest unsupervised algorithms. The main objective of the algorithm is to reduce the squared error. Kanungo et al. [26], [27] identified that the algorithm runs faster as the separation between the clusters increases. The algorithm is applicable to image segmentation and data compression. Jakob J. Verbeek et al. [28] suggested a solution that reduces the computational load without significantly affecting the quality of the solution. The algorithm is robust, fast and easy to understand, and it yields better results when the clusters in the dataset are well separated or distinct from each other. However, it does not work efficiently for non-linear and categorical data. Further, it is unable to handle outliers and noisy data when the cluster centres are chosen randomly.
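A minimal K-means sketch (Lloyd's algorithm) shows the assign/recompute loop and the SSE used as the evaluation measure in this paper. The deterministic initialisation at evenly spaced data points is our own simplification; WEKA's SimpleKMeans seeds randomly:

```python
import numpy as np

def kmeans(X, k, iters=100):
    """Plain K-means: assign points to the nearest centroid, then recompute means.
    Centroids start at k evenly spaced data points (a simple deterministic init)."""
    centroids = X[np.linspace(0, len(X) - 1, k, dtype=int)].copy()
    for _ in range(iters):
        # assign every point to its nearest centroid (squared Euclidean distance)
        labels = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break                     # converged: assignments no longer change
        centroids = new
    sse = float(((X - centroids[labels]) ** 2).sum())  # the SSE reported in Section IV
    return labels, centroids, sse

# two identical tight blobs far apart: the split is unambiguous and the SSE is 0
X = np.vstack([np.zeros((20, 2)), np.full((20, 2), 10.0)])
labels, cents, err = kmeans(X, 2)
```

On well-separated data like this, the algorithm converges in one pass, matching the observation above that K-means works best when the clusters are distinct.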
III. EXPERIMENTAL SETUP
The performances of the clustering methods are compared on gene expression datasets, namely SRBCT, Lymphoma, and three different Leukaemia
datasets, namely Leukaemia, Leukaemia3C, and Leukaemia-1. The SRBCT dataset has 2309 attributes and 83 instances, the Lymphoma dataset consists of 4027 attributes and 66 instances, and the three Leukaemia datasets each have 7130 attributes and 72 instances. The performance is evaluated by clustering each of the datasets. The number of clusters is varied from two to ten, and the resulting sum of squared error (SSE) and log likelihood (LL) are noted for each method. The comparison is carried out among the clustering methods, namely K-means, density based clustering (DBC), and expectation maximization (EM) clustering, in order to identify the better clustering method for gene expression data.
The experiments were performed using the WEKA data mining tool. It is developed in the Java programming language and provides a GUI that can interact with various data files and produce visual results. The WEKA tool also offers options for pre-processing, classification, clustering, association, attribute selection, and visualization.
A. Experimental Procedure
The experiment is carried out using the experimental
procedure with the following steps:
Step 1: Read the dataset.
Step 2: Set the number of clusters to be formed for
clustering the instances of the dataset.
Step 3: The sum of squared error is noted for the K-means
and density based clustering method.
Step 4: The log likelihood is noted for the density based
clustering and expectation maximization clustering
method.
Initially, the dataset is read. Then, the number of clusters to be formed is set (from 2 to 10) for clustering the instances of the dataset. The sum of squared error is then noted for the K-means and density based clustering methods, and the log likelihood is noted for the density based clustering and expectation maximization clustering methods.
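The steps above can be sketched end to end. This is our own minimal stand-in for the WEKA runs (a compact K-means with a deterministic initialisation on synthetic data, not the actual tool or datasets), sweeping the cluster count from 2 to 10 and recording the SSE at each step:

```python
import numpy as np

def sse_for_k(X, k, iters=50):
    """K-means with a deterministic init, returning only the sum of squared error."""
    c = X[np.linspace(0, len(X) - 1, k, dtype=int)].copy()
    for _ in range(iters):
        lab = ((X[:, None] - c) ** 2).sum(axis=-1).argmin(axis=1)  # nearest centroid
        for j in range(k):
            if (lab == j).any():
                c[j] = X[lab == j].mean(axis=0)                    # recompute means
    return float(((X - c[lab]) ** 2).sum())

# Step 1: read the dataset (here, a random stand-in for a gene expression matrix)
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))
# Steps 2-3: vary the number of clusters from 2 to 10 and note the SSE
sse_by_k = {k: sse_for_k(X, k) for k in range(2, 11)}
```

As in Table 1, the recorded SSE generally decreases as the number of clusters grows, since each point lies closer to one of a larger set of centroids.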
IV. RESULTS AND DISCUSSION
This section illustrates the results obtained from the
conducted experiments. Table I shows the sum of squared
errors for the K-means and density based clustering, Table
II shows the log likelihood values of the density based
clustering, Table III shows the log likelihood values of the
expectation maximization clustering, Figure 1 depicts the
sum of squared error for the K-means and density based
clustering for five different datasets.
Figure 2 illustrates the log likelihood of the density based
clustering and expectation maximization clustering for the
SRBCT dataset. Figure 3 depicts log likelihood of the
density based clustering and expectation maximization
clustering for the Lymphoma dataset. Figure 4 illustrates
the log likelihood of the density based clustering and
expectation maximization clustering for the Leukaemia
dataset. Figure 5 depicts log likelihood of the density
based clustering and expectation maximization clustering
for the Leukaemia 3C dataset. Figure 6 depicts the log
likelihood of the density based clustering and expectation
maximization clustering for the Leukaemia-1 dataset.
TABLE 1 SUM OF SQUARED ERRORS FOR THE K-MEANS AND DENSITY BASED CLUSTERING (COLUMNS: NUMBER OF CLUSTERS)
Datasets    |        2 |        3 |        4 |        5 |        6 |        7 |        8 |        9 |       10
SRBCT       |  7131.72 |  6743.21 |  6440.08 |  6123.40 |  5719.74 |  5467.57 |  5062.60 |  5022.87 |  4817.73
Lymphoma    |  8970.75 |  8546.91 |  7960.22 |  7684.71 |  7516.37 |  7139.95 |  6880.25 |  6791.88 |  6669.48
Leukaemia   | 16376.73 | 15409.02 | 15137.53 | 14803.38 | 14006.20 | 13693.59 | 13411.96 | 12966.56 | 12632.31
Leukaemia3C | 16373.73 | 15407.02 | 15236.14 | 14900.66 | 14028.98 | 13765.59 | 13483.96 | 13011.07 | 12676.82
Leukaemia-1 | 16368.73 | 15400.02 | 15227.14 | 14891.66 | 14020.98 | 13737.54 | 13351.06 | 13005.07 | 12670.82
TABLE 2 LOG LIKELIHOOD VALUES OF THE DENSITY BASED CLUSTERING (COLUMNS: NUMBER OF CLUSTERS)
Datasets    |         2 |         3 |         4 |         5 |         6 |         7 |         8 |         9 |        10
SRBCT       |  -1252.27 |  -1093.28 |  -1016.29 |   -865.44 |   -701.76 |   -596.40 |   -418.96 |   -334.48 |   -286.77
Lymphoma    |  -3162.03 |  -2993.16 |  -2787.76 |  -2658.80 |  -2630.60 |  -2469.81 |  -2352.89 |  -2341.43 |  -2231.36
Leukaemia   | -47267.70 | -46848.50 | -46753.89 | -46686.34 | -46276.76 | -46089.52 | -45885.16 | -45665.65 | -45582.44
Leukaemia3C | -47267.55 | -46848.34 | -46715.70 | -46648.40 | -46216.85 | -46041.36 | -45837.00 | -45612.04 | -45528.83
Leukaemia-1 | -47267.23 | -46848.02 | -46715.37 | -46648.07 | -46216.55 | -46034.78 | -45789.04 | -45611.77 | -45528.56
TABLE 3 LOG LIKELIHOOD VALUES OF THE EXPECTATION MAXIMIZATION CLUSTERING (COLUMNS: NUMBER OF CLUSTERS)
Datasets    |         2 |         3 |         4 |         5 |         6 |         7 |         8 |         9 |        10
SRBCT       |  -1122.50 |   -943.74 |   -759.65 |   -654.04 |   -535.71 |   -451.53 |   -340.92 |   -259.80 |   -168.69
Lymphoma    |  -3137.02 |  -2916.35 |  -2759.67 |  -2683.67 |  -2413.22 |  -2260.96 |  -2355.60 |  -1979.22 |  -1910.53
Leukaemia   | -47351.64 | -46882.29 | -46578.92 | -46291.84 | -46070.29 | -45886.79 | -45770.37 | -45456.07 | -45410.69
Leukaemia3C | -47351.49 | -46882.13 | -46578.76 | -46291.69 | -46070.18 | -45886.67 | -45770.27 | -45455.95 | -45410.59
Leukaemia-1 | -47324.03 | -46881.82 | -46560.46 | -46291.44 | -46069.86 | -45886.44 | -45770.03 | -45455.67 | -45410.34
Fig. 1. Sum of squared error for the K-means and density based clustering for five different datasets.
Fig. 2. Log likelihood of the density based clustering and expectation maximization clustering for the SRBCT dataset.
Fig. 3. Log likelihood of the density based clustering and expectation maximization clustering for the Lymphoma dataset.
Fig. 4. Log likelihood of the density based clustering and expectation maximization clustering for the Leukaemia dataset.
Fig. 5. Log likelihood of the density based clustering and expectation maximization clustering for the Leukaemia 3C dataset.
Fig. 6. Log likelihood of the density based clustering and expectation maximization clustering for the Leukaemia-1 dataset.
From Table 1 and Figure 1, it is observed that the K-means and density based clustering methods perform similarly in terms of SSE for different numbers of clusters on all the datasets; it is also observed that the clustering methods produce a lower SSE on the SRBCT dataset than on the other datasets. From Table 2, Table 3 and Figures 2 to 6, it is observed that the expectation maximization clustering method performs better than the density based clustering method in terms of log likelihood.
V. CONCLUSION
This paper conducted an empirical study of various clustering algorithms to observe their performance on gene expression data in terms of sum of squared error and log likelihood. In this empirical study, the performance of the clustering algorithms, namely density based clustering, expectation maximization clustering and K-means clustering, was evaluated on various gene expression datasets. From this evaluation, it is observed that the expectation maximization clustering algorithm performs comparatively better than the density based clustering algorithm in terms of log likelihood, while the K-means and density based clustering methods perform similarly in terms of sum of squared error. The SRBCT dataset yields a lower sum of
squared error for the K-means and density based clustering algorithms than all the other datasets compared.
REFERENCES
[1] Danasingh Asir Antony Gnana Singh, Subramanian Appavu Alias Balamurugan, Epiphany Jebamalar Leavline, "A novel feature selection method for image classification", Optoelectronics and Advanced Materials, Rapid Communications, 9.11-12 (2015): 1362-1368.
[2] Danasingh Asir Antony Gnana Singh, Subramanian Appavu Alias
Balamurugan, Epiphany Jebamalar Leavline. "An unsupervised
feature selection algorithm with feature ranking for maximizing
performance of the classifiers." International Journal of Automation
and Computing 12.5 (2015): 511-517.
[3] Danasingh Asir Antony Gnana Singh, Subramanian Appavu Alias Balamurugan, Epiphany Jebamalar Leavline, "Improving the Accuracy of the Supervised Learners using Unsupervised based Variable Selection", Asian Journal of Information Technology, 13.9 (2014): 530-537.
[4] Arifovic, Jasmina. "Genetic algorithm learning and the cobweb
model." Journal of Economic dynamics and Control 18.1 (1994): 3-
28.
[5] Hommes, Cars H. "On the consistency of backward-looking
expectations: The case of the cobweb." Journal of Economic
Behavior & Organization 33.3 (1998): 333-362.
[6] Alejos, Óscar, and Edward Della Torre. "The generalized cobweb
method." Magnetics, IEEE Transactions on 41.5 (2005): 1552-1555.
[7] Zhao, Zhechong, and Lei Wu. "Stability analysis for power systems
with pricebased demand response via Cobweb Plot." Proc. IEEE
PES General Meeting. 2013.
[8] Yuni Xia and Bowei Xi, "Conceptual Clustering Categorical Data with Uncertainty", 19th IEEE International Conference on Tools with Artificial Intelligence.
[9] Moon, Todd K. "The expectation-maximization algorithm." Signal Processing Magazine, IEEE 13.6 (1996): 47-60.
[10] Brankov, Jovan G., et al. "Similarity based clustering using the
expectation maximization algorithm." Image Processing. 2002.
Proceedings. 2002 International Conference on. Vol. 1. IEEE, 2002.
[11] Lagendijk, Reginald L., Jan Biemond, and Dick E. Boekee.
"Identification and restoration of noisy blurred images using the
expectation-maximization algorithm." IEEE Transactions on
Acoustics, Speech, and Signal Processing [see also IEEE
Transactions on Signal Processing], 38 (7) (1990).
[12] Figueiredo, Mário AT, and Robert D. Nowak. "An EM algorithm
for wavelet-based image restoration." Image Processing, IEEE
Transactions on 12.8 (2003): 906-916.
[13] Fessler, Jeffrey, and Alfred O. Hero. "Space-alternating generalized
expectation-maximization algorithm." Signal Processing, IEEE
Transactions on 42.10 (1994): 2664-2677.
[14] Kumar, Manoj. "An optimized farthest first clustering algorithm."
Engineering (NUiCONE), 2013 Nirma University International
Conference on. IEEE, 2013.
[15] Huang, Chung-Ming, et al. "A farthest-first forwarding algorithm in
VANETs." ITS Telecommunications (ITST), 2012 12th
International Conference on. IEEE, 2012.
[16] Vadeyar, Deepshree A., and H. K. Yogish. "Farthest First
Clustering in Links Reorganization." International Journal of Web
& Semantic Technology 5.3 (2014): 17.
[17] Bilenko, Mikhail, Sugato Basu, and Raymond J. Mooney.
"Integrating constraints and metric learning in semi-supervised
clustering." Proceedings of the twenty-first international conference
on Machine learning. ACM, 2004.
[18] Leung, Yee, Jiang-She Zhang, and Zong-Ben Xu. "Clustering by
scale-space filtering." Pattern Analysis and Machine Intelligence,
IEEE Transactions on 22.12 (2000): 1396-1410.
[19] George, Thomas, and Srujana Merugu. "A scalable collaborative
filtering framework based on co-clustering." Data Mining, Fifth
IEEE International Conference on. IEEE, 2005.
[20] Lagendijk, Reginald L., Jan Biemond, and Dick E. Boekee.
"Identification and restoration of noisy blurred images using the
expectation-maximization algorithm." IEEE Transactions on
Acoustics, Speech, and Signal Processing [see also IEEE
Transactions on Signal Processing], 38 (7) (1990).
[21] Bandyopadhyay, Seema, and Edward J. Coyle. "An energy efficient
hierarchical clustering algorithm for wireless sensor networks."
INFOCOM 2003. Twenty-Second Annual Joint Conferences of the
IEEE Computer and Communications. IEEE Societies. Vol. 3.
IEEE, 2003.
[22] Dittenbach, Michael, Dieter Merkl, and Andreas Rauber. "The
growing hierarchical self-organizing map." ijcnn. IEEE, 2000.
[23] Pei, Guangyu, et al. "A wireless hierarchical routing protocol with
group mobility." Wireless Communications and Networking
Conference, 1999. WCNC. 1999 IEEE. IEEE, 1999.
[24] Chen, Yixin, and Li Tu. "Density-based clustering for real-time
stream data." Proceedings of the 13th ACM SIGKDD international
conference on Knowledge discovery and data mining. ACM, 2007.
[25] Mitra, Pabitra, C. A. Murthy, and Sankar K. Pal. "Density-based
multiscale data condensation." Pattern Analysis and Machine
Intelligence, IEEE Transactions on 24.6 (2002): 734-747.
[26] Kanungo, Tapas, et al. "An efficient k-means clustering algorithm:
Analysis and implementation." Pattern Analysis and Machine
Intelligence, IEEE Transactions on 24.7 (2002): 881-892.
[27] Kanungo, Tapas, et al. "An efficient k-means clustering algorithm:
Analysis and implementation." Pattern Analysis and Machine
Intelligence, IEEE Transactions on 24.7 (2002): 881-892.
[28] Likas, Aristidis, Nikos Vlassis, and Jakob J. Verbeek. "The global
k-means clustering algorithm." Pattern recognition 36.2 (2003):
451-461.
... It may be flexibly encrypted one-to-many rather than one-to-one. All are depended on performance of big data [8]. ...
... The encryption methods employed to improve the support system's [8] security that is hosting big data are time-consuming and have impact on system's performance. Several variables affect the big data transmission performance, as follows: ...
... Bhardwaj [7] implemented of various MapReduce jobs like Pi, TeraSort, Word-Count has been done on cloud based Hadoop deployment by using Microsoft Azure cloud services. D. Asir [8] presented a performance analysis on various clustering algorithm namely K-means, expectation maximization, and density based clustering in order to identify the best clustering algorithm for microarray data. Sum of squared error, log likelihood measures are used to evaluate the performance of these clustering methods. ...
Chapter
Full-text available
Research is considering security of big data and retaining the performance during its transmission over network. It has been observed that there have been several researches that have considered the concept of big data. Moreover, a lot of those researches also provided security against data but failed to retain the performance. Use of several encryption mechanisms such as RSA [43] and AES [44] has been used in previous researches. But, if these encryption mechanisms are applied, then the performance of network system gets degraded. In order to resolve those issues, the proposed work is making using of compression mechanism to reduce the size before implementing encryption. Moreover, data is spitted in order to make the transmission more reliable. After splitting the data contents data has been transferred from multiple route. If some hackers opt to capture that data in unauthentic manner, then they would be unable to get complete and meaning full information. Thus, the proposed model has improved the security of big data in network environment by integration of compression and splitting mechanism with big data encryption. Moreover, the use of user‐defined port and use of multiple paths during transmission of big data in split manner increases the reliability and security of big data over network environment.
... Mostly, each clustering method has its own parameters for calculating clusters. The decision to use a particular method for clustering will depend on the nature of the datasets being studied and what the researcher expects to achieve using that method [26]- [29]. ...
... Bihari et al. [29] compared the performance of KM, HC clustering, SOM and DBSCAN on Iris flower gene expression data. The comparison results of these methods were validated by using internal and external indices. ...
Article
Full-text available
Current Genome-wide advancements in Gene chips technology provide in the “Omics (genomics, proteomics and transcriptomics) research”, an opportunity to analyze the expression levels of thousand of genes across multiple experiments. In this regard, many machine learning approaches were proposed to deal with this deluge of information. Clustering methods are one of these approaches. Their process consists of grouping data (gene profiles) into homogeneous clusters using distance measurements. Various clustering techniques are applied, but there is no consensus for the best one. In this context, a comparison of seven clustering algorithms was performed and tested against the gene expression datasets of three model plants under salt stress. These techniques are evaluated by internal and relative validity measures. It appears that the AGNES algorithm is the best one for internal validity measures for the three plant datasets. Also, K-Means profiles a trend for relative validity measures for these datasets.
... file to its local xml document and then traffic is created and then ported in NS2.So as the simulation can be done in NS2 using the locations as traced by mobility file. The clusters are formed by nodes using the K-means algorithm travelling in different directions according to real scenario of road as shown in figure 4 and figure 5.The node is chosen as cluster head within one cluster when the distance of that node is minimum with the cluster head of other node in different cluster [23] [30]. Thus, nearest cluster heads forward the packets following different routes. ...
... Then again recalculating the new centre's of clusters and loop has been initiated again. The assemblage of node data is used by reducing the aggregate of squares of range between nodes i.e. location values and the corresponding cluster Centroid [29] [30]. The Centroid are considered as cluster head for communication in order to attain the stability in terms of data dissemination or flow of packets successfully from one node location to another node within the stipulated period of time. ...
Article
Full-text available
Vehicular ad hoc networks (VANETs) are a favorable area of exploration which empowers the interconnection amid the movable vehicles and between transportable units (vehicles) and road side units (RSU). In Vehicular Ad Hoc Networks (VANETs), mobile vehicles can be organized into assemblage to promote interconnection links. The assemblage arrangement according to dimensions and geographical extend has serious influence on attribute of interaction. Vehicular ad hoc networks (VANETs) are subclass of mobile Ad-hoc network involving more complex mobility patterns. Because of mobility the topology changes very frequently. This raises a number of technical challenges including the stability of the network. There is a need for assemblage configuration leading to more stable realistic network. The paper provides investigation of various simulation scenarios in which cluster using k-means algorithm are generated and their numbers are varied to find the more stable configuration in real scenario of road.
... The log likelihood measure and sum of squared error are utilized to find the efficiency of the clustering techniques. This work is an practical study of different cluster algorithms in the way to detect the efficiency on the data of gene expression (likelihood measure and sum of squared error) [5]. ...
Article
Full-text available
Data Mining is the action of performing intelligent analysis of discovering patterns, trends, and insights from large datasets various methods. The most common challenges include selecting the appropriate data mining algorithms from widely used techniques include classification, regression, clustering and association rule mining, and anomaly detection based on the the nature of the issue. By analyzing medical information, data mining techniques can help with predicting the likelihood of diseases and improving diagnostic accuracy. This can lead to early detection and intervention, potentially saving lives and reducing healthcare costs. Also, data mining enables the identification of patient subgroups with similar characteristics, allowing for personalized treatment plans. This can enhance the effectiveness of treatments by tailoring them to individual patient profiles. In this paper, the performance evaluation in the classification algorithms like k-means clustering and FCM methods are dicussed in relation to medical dataset. The objective is to enhance the intelligence and comprehension of researchers and methodically retrieve pertinent information from large databases that contain details on long-term patients. This process is a valuable option for improving the analysis of scientific data and gaining insights into the health patterns of individuals over extended periods. Treating lung, diabetes and liver diseases are important and laborious step for medicine. In this work, the above said algorithm’s performance were evaluated using UCI Repository datasets like Diabetes, Liver Disorder and Lung Cancer data on the basis of Performance Measure and Cost Measure. This empirical outcome caters the better accuracy by experimenting in tuning various parameters.
... A neural network, also referred to as an artificial neural network (ANN), is described as an architectural system with its own processing paradigm. An ANN comprises a number of neurons that work in unity and are highly consistent for problem solving (Danasingh et al. 2016). Following the basic pattern of a real neuron, each node receives an input signal sent from the node to which it is connected, and the signal is analysed (Fig. 1). ...
Article
Full-text available
Despite the advancement of well-evolved technology in the international clinical field, tuberculosis remains a major health hazard. To address the problem of tuberculosis, artificial intelligence (AI) provides a way of solving real-world problems by bringing a human-like mind to a machine. This paper aims to detect the presence of Mycobacterium tuberculosis infection within a short span of time compared to conventional methods. The system is designed in such a way that the breath of the infected person can be used to diagnose the disease at an early stage. The main goal is to design and implement a portable diagnostic kit for tuberculosis using neural networks and artificial intelligence. The device, known as an electronic nose, is an artificially intelligent element built with a neural network and consists of a biosensor with an electrode coated with galectin. The signals of hybridization on binding are captured and processed by a machine learning sensor, and the output is displayed using artificial intelligence. A feed-forward back-propagation neural network is used as a classifier to distinguish between infected and non-infected persons. Earlier diagnostic approaches based on culturing the microorganism take days; this study finds the result within an hour. Therefore, the time taken to diagnose the presence of the bacterium can be reduced, which also paves the way for starting treatment immediately.
... Initially, the chosen k data points are considered as the cluster centers, and the distance between each cluster center and the other data points is calculated in an iterative manner. The cluster centers are then updated in each iteration to form the final clusters [25] [26]. ...
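The center-update loop described in this snippet can be sketched in a few lines of Python (a minimal illustration with made-up sample points, taking the first k points as initial centers as the snippet suggests; not the implementation from the cited works):

```python
import math

def kmeans(points, k, iters=100):
    """Minimal k-means: take the first k points as initial centers,
    assign each point to its nearest center, then move each center
    to the mean of its assigned points, repeating until stable."""
    centers = list(points[:k])
    for _ in range(iters):
        # Assignment step: nearest center by Euclidean distance.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            tuple(sum(x) / len(c) for x in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: centers stopped moving
            break
        centers = new_centers
    return centers, clusters

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, clusters = kmeans(points, k=2)
```

On these four points the loop converges in three iterations, with one center per well-separated pair.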
... Clustering techniques are used to partition a dataset into several groups; each group contains objects that share similar features while differing from the objects of other groups, and these groups represent the clusters. Clustering is an unsupervised method that learns from unlabeled data. The main purpose of all clustering methods is to offer a good partition of a set of objects, and also to predict the group of ungrouped data [1]. Clustering algorithms have a wide range of uses worldwide and are applied in many fields such as pattern recognition, artificial intelligence, information technology, image ...
Article
Full-text available
Clustering algorithms are widely used in science, technology, and many other fields for numerous applications. This paper presents a new hybrid algorithm called KDBSCAN that improves the k-means algorithm and solves two of its problems. The first problem is that the number of clusters must be entered by the user; this is solved by using the DBSCAN algorithm to estimate the number of clusters. The second problem, the random initial centroid problem, is dealt with by choosing the centroids in a deterministic manner, removing the random choice for better results. This work uses the DUC 2002 dataset to obtain the results of the KDBSCAN algorithm, which is applicable in many fields such as electronic libraries, biology, and marketing. The KDBSCAN algorithm described in this paper gives better results than the traditional k-means and DBSCAN algorithms in many respects, producing stable results with lower entropy.
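The first stage of the hybrid idea — run a density-based pass to estimate the number of clusters instead of asking the user for k — can be illustrated with a toy, quadratic-time DBSCAN sketch (the `eps`, `min_pts`, and sample points are illustrative assumptions, not values from the paper):

```python
import math

def dbscan(points, eps, min_pts):
    """Toy DBSCAN: label each point with a cluster id (-1 = noise)
    and return the number of clusters found."""
    labels = [None] * len(points)
    cid = 0

    def neighbors(i):
        # Brute-force range query (fine for a small illustration).
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1          # tentatively noise
            continue
        labels[i] = cid             # new cluster seeded at a core point
        queue = list(nbrs)
        while queue:
            j = queue.pop()
            if labels[j] == -1:     # noise becomes a border point
                labels[j] = cid
            if labels[j] is not None:
                continue            # already claimed by a cluster
            labels[j] = cid
            jn = neighbors(j)
            if len(jn) >= min_pts:  # core point: keep expanding
                queue.extend(jn)
        cid += 1
    return labels, cid

points = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.0),
          (9.0, 9.0), (9.0, 9.5), (9.5, 9.0)]
labels, k = dbscan(points, eps=1.0, min_pts=2)
# k can then seed k-means in place of a user-supplied cluster count
```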
Article
Full-text available
Cancer is one of the leading causes of death in the world, and many genes are involved in it. Transcription factors (TFs) and microRNAs (miRNAs) are the primary gene regulators and the regulatory mechanisms by which cells define their targets. Studying the regulatory mechanisms of these two main regulators is complex, but it leads to a deeper interpretation of biological processes. In order to avoid exhaustive search and unnecessary genes, the mRNA and miRNA expression data are first clustered by k-means, and then an ANOVA test is applied to select significant genes. We propose a gene regulatory network (GRN) estimation method using directed networks with generalized linear regression to predict and explain the relationships between regulators and their targets. Through GO term and KEGG pathway analysis of the target genes, we obtained many processes such as cell communication, regulation of biological processes, biological regulation, cell cycle, and DNA replication; these processes are considered significant for cancer diseases. Compared with other methodologies, our approach performed better, and the results were consistent with the medical literature: the important regulators in our gene regulatory network play a major role in cancer, which demonstrates the efficiency of this approach.
Chapter
In the recent past, data have been generated massively from the medical sector due to the advancement and growth of technology, leading to high-dimensional and massive datasets. Handling medical data is crucial since it contains sensitive data of individuals. If the sensitive data are revealed to adversaries or others, the individuals may be vulnerable to attack. Hiding the huge volume of data is a practically difficult task for researchers; therefore, in real life only the sensitive data are hidden from the huge volume of data to provide security, since hiding the entire data is costlier. The sensitive attributes must thus be identified from the huge volume of data in order to hide them and preserve privacy. To identify the sensitive attributes to be hidden, and to reduce the computational and transmission cost of secured data transmission, this chapter presents a pragmatic approach to identifying sensitive attributes for preserving privacy. The proposed method is tested on various real-world datasets with different classifiers, and the results are presented.
Article
Full-text available
Image classification plays a significant role in pattern recognition. In the recent past, due to advancements in imaging technology, massive data are being generated through various image acquisition techniques. Classifying these massive image collections is a challenging task for researchers. This paper presents a novel feature selection method to improve the performance of image classification. The performance of the proposed method is tested on a publicly available real image dataset and compared with various state-of-the-art feature selection methods. The experimental results show that the proposed method outperforms the other state-of-the-art methods. © 2015, National Institute of Optoelectronics. All rights reserved.
Article
Full-text available
Prediction plays a vital role in decision making. Correct prediction leads to right decisions that save lives, energy, effort, money, and time. The right decision prevents physical and material losses, and prediction is practiced in all fields including medicine, finance, environmental studies, engineering, and emerging technologies. Prediction is carried out by a model called a classifier. The predictive accuracy of the classifier highly depends on the training dataset used to train it. Irrelevant and redundant features in the training dataset reduce the accuracy of the classifier; hence, they must be removed from the training dataset through the process known as feature selection. This paper proposes a feature selection algorithm, namely unsupervised learning with ranking based feature selection (FSULR). It removes redundant features by clustering and eliminates irrelevant features by statistical measures to select the most significant features from the training dataset. The performance of the proposed algorithm is compared with seven other feature selection algorithms using well-known classifiers, namely naive Bayes (NB), instance based (IB1), and tree based J48. Experimental results show that the proposed algorithm yields better prediction accuracy for classifiers. © 2015, Institute of Automation, Chinese Academy of Sciences and Springer-Verlag Berlin Heidelberg.
Article
Full-text available
A website can be designed easily, but efficient user navigation is not an easy task, since user behavior keeps changing and the developer's view is often quite different from what users want. One way to improve navigation is to reorganize the website structure. For this reorganization, the proposed strategy uses the farthest-first traversal clustering algorithm to perform clustering on two numeric parameters, and the Apriori algorithm to find users' frequent traversal paths. Our aim is to perform the reorganization with fewer changes to the website structure.
Conference Paper
Full-text available
Data mining is the process of analyzing data from different viewpoints and summarizing it into useful information. Clustering is the process of grouping similar objects together; each group, called a cluster, contains objects that are similar to one another and dissimilar to the objects of other clusters. Different clustering algorithms can be used according to the behavior of the data. The farthest-first algorithm is suitable for large datasets but creates non-uniform clusters. This paper presents an optimization of the farthest-first clustering algorithm that results in uniform clusters.
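The farthest-first traversal that these works build on can be sketched as follows (a minimal illustration with made-up points; the uniformity optimization proposed in the paper is not reproduced here):

```python
import math

def farthest_first_centers(points, k):
    """Farthest-first traversal: start from the first point, then
    repeatedly pick the point whose distance to its nearest chosen
    center is largest, so centers spread across the dataset."""
    centers = [points[0]]
    while len(centers) < k:
        nxt = max(points,
                  key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(nxt)
    return centers

points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (10.0, 1.0), (5.0, 5.0)]
centers = farthest_first_centers(points, 3)
```

Because each new center is the point farthest from all existing ones, the traversal tends to pick outliers as centers, which is one reason the plain algorithm can yield non-uniform clusters.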
Conference Paper
Price-based demand response (DR) is a mechanism for dynamically managing electric energy consumption in response to varying electricity prices. Price-sensitive DR would encourage consumers to reduce energy consumption when prices are higher, thereby reducing the peak electricity demand and alleviating pressure to power systems. However, it brings additional dynamics and challenges to the real-time supply and demand balance. Specifically, price-sensitive DR loads are constantly changing based on dynamic real-time prices, and changes in DR loads will in turn impact electricity prices. This paper adopts a closed loop model between economic dispatch (ED) and the price-sensitive DR load adjustment for exploring the stability of power systems. The stability refers to the ability that the system will finally converge to a fixed point after finite iterations between the ED and the price-sensitive DR load adjustment procedure. Ellipse and exponential curves are used to represent the non-linear price elasticity characteristics of DR, and the cobweb plot is applied to analyze the stability of power systems. Numerical case studies illustrate how non-linear price elasticity DR curve parameters will affect the system stability.
Conference Paper
In order to reduce the transmission delay between the source vehicle and the destination vehicle, a farthest-first forwarding (FF) algorithm is proposed in this paper to find dissemination paths with fewer hop counts, lower transmission delay, and fewer control packets. The proposed FF algorithm uses the general idea of the contention-based technique to choose the farthest vehicle from a sender as the packet forwarder, a decision made by the receivers themselves. In the FF algorithm, each receiver waits for a specific time period when a packet is received. The time period is calculated based on the distance between the sender vehicle and the receiver vehicle, and the waiting time of the farthest vehicle is always less than that of nearby vehicles. When its timer expires, the farthest vehicle forwards the packet first. Although the concept of the FF algorithm is similar to greedy forwarding, the message exchange for maintaining neighbor information is unnecessary in the proposed FF. Compared to other mechanisms, the simulation results show that the proposed FF algorithm significantly reduces the transmission delay and the hop count.
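The distance-based contention timer can be illustrated with a toy linear formula (a hypothetical timer; the paper's exact expression is not given here, only the property that farther receivers wait less and therefore win the contention):

```python
def waiting_time(distance, comm_range, max_wait):
    """Contention timer: the farther a receiver is from the sender,
    the shorter it waits, so the farthest vehicle forwards first.
    Linear mapping is an illustrative assumption."""
    return max_wait * (1.0 - min(distance, comm_range) / comm_range)

# Receivers at 50 m, 120 m, and 240 m from the sender (range 250 m):
waits = {d: waiting_time(d, comm_range=250.0, max_wait=0.01)
         for d in (50.0, 120.0, 240.0)}
forwarder = min(waits, key=waits.get)  # the receiver whose timer fires first
```

No neighbor tables are needed: each receiver computes its own timer from the sender's position carried in the packet, and the winner's forwarded copy suppresses the others.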
Article
In k-means clustering, we are given a set of n data points in d-dimensional space R^d and an integer k, and the problem is to determine a set of k points in R^d, called centers, so as to minimize the mean squared distance from each data point to its nearest center. A popular heuristic for k-means clustering is Lloyd's algorithm. In this paper, we present a simple and efficient implementation of Lloyd's k-means clustering algorithm, which we call the filtering algorithm. This algorithm is easy to implement, requiring a kd-tree as the only major data structure. We establish the practical efficiency of the filtering algorithm in two ways. First, we present a data-sensitive analysis of the algorithm's running time, which shows that the algorithm runs faster as the separation between clusters increases. Second, we present a number of empirical studies both on synthetically generated data and on real data sets from applications in color quantization, data compression, and image segmentation.
Article
We present the global k-means algorithm which is an incremental approach to clustering that dynamically adds one cluster center at a time through a deterministic global search procedure consisting of N (with N being the size of the data set) executions of the k-means algorithm from suitable initial positions. We also propose modifications of the method to reduce the computational load without significantly affecting solution quality. The proposed clustering methods are tested on well-known data sets and they compare favorably to the k-means algorithm with random restarts.
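The incremental scheme described here — grow the solution one center at a time, trying every data point as the position of the new center and keeping the run with the lowest sum of squared error — can be sketched as follows (a minimal illustration; the paper's modifications for reducing the computational load are omitted):

```python
import math

def kmeans(points, centers, iters=50):
    """Plain Lloyd iterations from the given initial centers;
    returns the final centers and the sum of squared error (SSE)."""
    centers = list(centers)
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)),
                    key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        centers = [tuple(sum(x) / len(c) for x in zip(*c)) if c
                   else centers[i]
                   for i, c in enumerate(clusters)]
    sse = sum(min(math.dist(p, c) ** 2 for c in centers) for p in points)
    return centers, sse

def global_kmeans(points, k):
    """Global k-means: start from the one-center solution (the data
    centroid), then for each new center try every data point as its
    initial position and keep the best resulting k-means run."""
    centroid = tuple(sum(x) / len(points) for x in zip(*points))
    centers = [centroid]
    for _ in range(2, k + 1):
        best = min((kmeans(points, centers + [p]) for p in points),
                   key=lambda run: run[1])   # lowest SSE wins
        centers = best[0]
    return centers

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers = global_kmeans(points, 2)
```

The deterministic global search trades N extra k-means runs per added center for independence from random initialization.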
Article
In dynamic models of economic fluctuations backward-looking expectations with systematic forecasting errors are inconsistent with rational behaviour. In non-linear dynamic models exhibiting seemingly unpredictable, chaotic fluctuations, however, simple habitual ‘rule of thumb’ backward-looking expectation rules may yield non-zero but nevertheless non-systematic forecasting errors. In a chaotic model expectational forecasting errors may have zero autocorrelations at all lags. Even for rational agents patterns in these forecasting errors may be very difficult to detect, especially in the presence of (small) noise. Backward-looking expectations are then not necessarily inconsistent with rational behaviour. We investigate whether simple expectation schemes such as naive or adaptive expectations can be consistent with rational behaviour in the simplest of all non-linear dynamic economic models, the non-linear cobweb model.