K-Means Clustering using Max-min Distance
Measure
N. Karthikeyani Visalakshi
Lecturer in Computer Science
Vellalar College for Women
Erode, Tamil Nadu, India
karthichitru@yahoo.co.in
J. Suguna
Lecturer (SG) in Computer Science
Vellalar College for Women
Erode, Tamil Nadu, India
sugunajravi@yahoo.co.in
Abstract - Cluster analysis deals with the problem of
organizing a collection of data objects into clusters based on
similarity. It is also known as the unsupervised classification of
objects and has found many applications in different areas. An
important component of a clustering algorithm is the distance
measure used to find the similarity between data objects.
K-means is one of the most popular and widespread partitioning
clustering algorithms due to its superior scalability and
efficiency. Typically, the K-means algorithm determines the
distance between an object and its cluster centroid by the
Euclidean distance measure. This paper proposes a variant of
K-means which uses an alternate distance measure, namely the
Max-min measure. The modified K-means algorithm is tested on
six benchmark datasets taken from the UCI machine learning
repository and is found to require fewer iterations to converge
than the existing algorithm, with improved performance.
Keywords - Clustering; Euclidean distance; K-Means algorithm;
Max-min distance.
I. INTRODUCTION
Cluster analysis is an unsupervised learning method that
constitutes a cornerstone of an intelligent data analysis
process. It is the process of grouping objects into clusters such
that the objects in the same cluster are similar, whereas objects
in different clusters are dissimilar. Clustering has become an
increasingly important task in modern application domains
such as marketing and purchasing assistance, multimedia, and
molecular biology, as well as many others. No single
clustering algorithm performs best for all datasets; choosing the
best algorithm for a given dataset requires both expertise and
insight, and depends on the nature of the application and the
patterns to be extracted. Different types of
algorithms have been proposed in the literature [7, 13] to solve
the clustering problem.
Each clustering algorithm is based on some kind of
distance measures, which leads to grouping of related objects.
As each distance measure follows different methods for
determining the degree of similarity between two objects, the
selection of an appropriate distance measure plays a vital role
in any clustering algorithm. The distance measure will
influence the shape of the clusters, as some elements may be
close to one another according to one distance measure and
farther away according to another.
K-Means [11] is a prototype-based, partitional clustering
technique that attempts to find a user-specified number of
clusters (K), which are represented by centroids. There are
different variants of K-Means algorithm in the literature [1, 2,
3, 9, 14], each emphasizes various aspects like centroid
initialization [14], number of clusters determination [2, 3],
global optimization [1], etc.
The K-Means algorithm typically uses Euclidean or
squared Euclidean distance to measure the distortion between
a data object and its cluster centroid [17]. The Euclidean and
squared Euclidean distances are usually computed from raw
data and not from standardized data. While using Euclidean
distances, the distance between any two objects is not affected
by the addition of new objects to the analysis. However, the
clustering results can be greatly affected by differences in
scale among the dimensions from which the distances are
computed. Hence, an effort is made here to standardize the
input objects before clustering and to suggest an alternate
distance measure. This paper proposes a modified K-Means
algorithm, by applying Min-max normalization procedure [10]
and Max-min distance measure [16] to reach better
performance.
The paper is organized as follows: Section 2 presents an
overview of clustering algorithms, K-Means clustering,
distance measures and need for normalization. The modified
K-Means algorithm is proposed in Section 3. Section 4
discusses the experimental analysis. Section 5 concludes the
paper and outlines scope for future research work.
II. BACKGROUND
A. Clustering Algorithms
Clustering is a process of partitioning a set of data objects
into a set of meaningful subclasses, called clusters. A cluster is
a collection of data objects that are similar to one another
based on their attribute values, and thus can be treated
collectively as one group. The goal of the clustering technique
is to decompose or partition a data set into groups such that
both intra-group similarity and inter-group dissimilarity are
maximized [17].
Clustering algorithms can be classified along different,
independent dimensions. One well-known dimension
categorizes clustering methods according to the result they
produce. Here, we can distinguish between hierarchical and
partitioning clustering algorithms [8]. Partitioning algorithms
construct a flat (single level) partition of a database D of n
objects into a set of K clusters such that the objects in a cluster
are more similar to each other than to objects in different
clusters. Hierarchical algorithms decompose the database into
several levels of nested partitioning (clustering), represented
for example by a dendrogram, i.e. a tree that iteratively splits D
into smaller subsets until each subset consists of only one
object. In such a hierarchy, each node of the tree represents a
cluster of D.
Another dimension according to which we can classify
clustering algorithms is from an algorithmic point of view.
Here we can distinguish between optimization based or
distance based algorithms and density based algorithms.
Distance based methods use the distances between the objects
directly in order to optimize a global cluster criterion. In
contrast, density based algorithms apply a local cluster
criterion. Clusters are regarded as regions in the data space in
which the objects are dense, and which are separated by
regions of low object density (noise).
Studies have shown that partitional clustering algorithms
are more suitable for clustering large datasets due to their
relatively low computational requirements [15]. The time
complexity of the partitioning technique is almost linear,
which makes it widely used. One of the best-known distance
based partitioning clustering algorithms is the K-means
algorithm [7, 13].
B. K-Means Clustering
K-Means is a typical clustering algorithm [17]. It is
attractive in practice, because it is simple and it is generally
very fast. A set of n objects $x_i$, $i = 1, 2, \ldots, n$, is to be
partitioned into K clusters $C_j$, $j = 1, 2, \ldots, K$. The objective
function, based on the Euclidean distance between an object $x_i$
in cluster j and the corresponding cluster centroid $C_j$, can be
defined by:

$J = \sum_{j=1}^{K} \sum_{i=1}^{n} \left\| x_i - C_j \right\|^2$    (1)
In detail, it randomly selects K of the given objects to
represent the cluster centroids. Based on the selected objects,
all remaining objects are assigned to their closest centroid one
by one. The Euclidean distance between the object and every
centroid is computed, and then the object is moved to the
cluster which yields the minimum distance. The value of
the selected centroid is recalculated by taking the mean of all
data points belonging to the same cluster. The operation is
iterated for all the objects. The same procedure is repeated
until the objective function converges. If K cannot be known
ahead of time, various values of K can be evaluated until the
most suitable one is found. The effectiveness of this method as
well as of others relies heavily on the objective function used
in measuring the distance between objects. Choosing the
proper initial centroid is the key step of the basic K-Means
procedure.
Generally, the K-Means algorithm has the following
important properties: (i) it is efficient in processing large data
sets, (ii) it often terminates at a local optimum, (iii) the
clusters have spherical shapes, (iv) it is sensitive to noise.
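To make the procedure concrete, the following is a minimal,
illustrative sketch of the basic K-Means loop in Python/NumPy
(random selection of K objects as initial centroids, Euclidean
assignment, mean update); it is our sketch, not the authors' code.

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Basic K-Means sketch: X is an (n, d) array of data objects."""
    rng = np.random.default_rng(seed)
    # randomly select K objects as the initial centroids
    C = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(max_iter):
        # assign each object to its nearest centroid (squared Euclidean)
        dist = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # recompute each centroid as the mean of its assigned objects
        new_C = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                          else C[j] for j in range(K)])
        if np.allclose(new_C, C):  # centroids stable: objective converged
            break
        C = new_C
    return labels, C
```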
C. Distance Measures
The concept of distance is the essential component of any
form of clustering that helps to navigate through the data space
and form clusters [15]. By computing distance, it is sensed and
articulated how close together two patterns are and, based on
this closeness, allocate them to the same cluster. Formally, the
distance $d(x, y)$ between x and y is considered to be a two-
argument function satisfying the following conditions:

$d(x, y) \geq 0$ for every x and y
$d(x, x) = 0$ for every x
$d(x, y) = d(y, x)$                                  (2)
$d(x, y) + d(y, z) \geq d(x, z)$
When the components of the data object vectors are all in
the same physical units then it is possible that the simple
Euclidean distance metric is sufficient to successfully group
similar data instances. However, even in this case the
Euclidean distance may sometimes be misleading. Despite
different measurements being taken in the same physical units,
an informed decision has to be made as to the relative scaling,
because different scaling can lead to different types of
clustering.
In practice, a mathematical formula is used to combine the
distances between the individual components of the data feature
vectors into a single distance measure, and different formulas
may lead to different kinds of clustering. Domain
knowledge must be used to guide the formulation of a suitable
distance measure for each particular application. However,
there are no general theoretical guidelines for selecting a
measure for any given application.
It is often the case that the components of the data feature
vectors are not immediately comparable. It can be that the
components are not continuous variables, like length, but
nominal categories, such as the days of the week. In these
cases again, domain knowledge must be used to formulate an
appropriate measure.
There are several distance measures used in the literature
[15, 17]. Three distance measures used in this work for
comparative analysis are defined as follows:
Euclidean distances are usually computed from raw
data and not from standardized data:

$d(t_i, t_j) = \sqrt{\sum_{k=1}^{d} (t_{ik} - t_{jk})^2}$    (3)
The Cosine coefficient relates the overlap to the
geometric average of the two sets:

$d(t_i, t_j) = \dfrac{\sum_{h=1}^{d} t_{ih} t_{jh}}{\sqrt{\sum_{h=1}^{d} t_{ih}^2 \sum_{h=1}^{d} t_{jh}^2}}$    (4)
The Max-min distance measure is found through simple
min and max operations on pairs of data objects:

$d(t_i, t_j) = \dfrac{\sum_{k=1}^{d} \min(t_{ik}, t_{jk})}{\sum_{k=1}^{d} \max(t_{ik}, t_{jk})}$    (5)
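For illustration, Eqs. (3) through (5) translate directly into
NumPy; the sketch below is ours, assuming $t_i$ and $t_j$ are
d-dimensional vectors (the Max-min measure further assumes values
in [0, 1], as required in Section III).

```python
import numpy as np

def euclidean(ti, tj):
    """Euclidean distance, Eq. (3)."""
    return np.sqrt(((ti - tj) ** 2).sum())

def cosine(ti, tj):
    """Cosine coefficient, Eq. (4): overlap over the geometric average."""
    return ti @ tj / np.sqrt((ti ** 2).sum() * (tj ** 2).sum())

def max_min(ti, tj):
    """Max-min measure, Eq. (5): component-wise minima over maxima.
    Assumes values in [0, 1]; equals 1 for identical objects."""
    return np.minimum(ti, tj).sum() / np.maximum(ti, tj).sum()
```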
D. Need for Normalization
Preprocessing [10] is required before applying any data
mining algorithm in order to improve the quality of the results.
Data normalization is one of the preprocessing procedures in
data mining, where the attribute data are scaled so as to fall
within a small specified range such as -1.0 to 1.0 or 0.0 to 1.0.
Normalization before clustering is often needed for distance
metrics, such as Euclidean distance, which are sensitive to
differences in the magnitude or scales of the attributes. In real
applications, because of differences in the ranges of attribute
values, one attribute might overpower another. Normalization
prevents features with large ranges, like 'salary', from
outweighing features with smaller ranges, like 'age'. The goal is
to equalize the size or magnitude and the variability of these
features.
There are many methods for data normalization which
include Min-max normalization, z-score normalization and
normalization by decimal scaling [5, 10]. Min-max
normalization performs a linear transformation on the original
data. Min-max normalization maps a value of each attribute to
the range [0, 1]. In z-score normalization, the values of
attributes are normalized based on the mean and standard
deviation of corresponding attributes. This method of
normalization is useful when the actual minimum and
maximum of every attribute is unknown. Normalization by
decimal scaling normalizes by moving the decimal point of
values of attribute. The number of decimal points moved
depends on the maximum absolute value of the attribute.
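As a sketch under the usual textbook definitions [5, 10] (not
code from the paper), the three normalization methods can be
applied column-wise to an (n, d) data matrix as follows:

```python
import numpy as np

def min_max(X):
    """Min-max normalization: map each attribute linearly to [0, 1].
    Assumes no constant attribute (max > min)."""
    mn, mx = X.min(axis=0), X.max(axis=0)
    return (X - mn) / (mx - mn)

def z_score(X):
    """Z-score normalization: center by the mean, scale by the
    standard deviation of each attribute."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def decimal_scaling(X):
    """Decimal scaling: divide each attribute by the smallest power of
    ten that brings its maximum absolute value below 1 (assumes the
    attribute is not identically zero)."""
    j = np.floor(np.log10(np.abs(X).max(axis=0))) + 1
    return X / (10.0 ** j)
```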
III. PROPOSED ALGORITHM
Normally, K-Means clustering algorithm uses Euclidean
distance as the distance measure to compute the similarity
between the object and its centroid. Alternately, the Cosine
distance measure [15] is often used for document clustering. In
this paper, Max-min distance measure [16] is suggested in
place of Euclidean distance measure. The Max-min distance
measure requires the values to be in the range [0, 1]. In order
to scale the given objects to fall within a small specified range
[0, 1], Min-max normalization procedure [10] is followed as a
pre-processing step for the proposed K-Means algorithm. The
step-by-step procedure of the proposed K-Means clustering
algorithm based on the Max-min distance measure is given
below.
---------------------------------------------------------------------------
Algorithm 1
K-Means Algorithm based on Max-min Distance Measure
---------------------------------------------------------------------------
Input : Dataset of n objects with d features and the value of K
Output: Partition of the input data into K clusters
Procedure:
Step 1: Normalize the input data objects to fall into the range
[0, 1] using Min-max normalization procedure
Step 2: Declare a membership matrix U of size n x K
Step 3: Generate K cluster centroids randomly within the
range of the data or select K objects randomly as
initial cluster centroids. Let the centroids be C1, C2, …, CK
Step 4: Calculate the distance measure $d_{ij}$ using the Max-min
        similarity measure

        $d_{ij} = \dfrac{\sum_{k=1}^{d} \min(x_{ik}, C_{jk})}{\sum_{k=1}^{d} \max(x_{ik}, C_{jk})}$

        for all cluster centroids $C_j$, $j = 1, 2, \ldots, K$, and
        data objects $x_i$, $i = 1, 2, \ldots, n$
Step 5: Compute the membership matrix U

        $U_{ij} = \begin{cases} 1, & d_{ij} \geq d_{il} \text{ for all } l \neq j \\ 0, & \text{otherwise} \end{cases}$
        for $i = 1, 2, \ldots, n$; $j = 1, 2, \ldots, K$

Step 6: Compute the new cluster centroids $C_j$

        $C_j = \dfrac{\sum_{i=1}^{n} U_{ij}\, x_i}{\sum_{i=1}^{n} U_{ij}}$ for $j = 1, 2, \ldots, K$
Step 7: Repeat steps 4 to 6 until convergence
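A minimal end-to-end sketch of Algorithm 1 in Python/NumPy is
given below. It reflects our reading of the steps above, in
particular that each object is assigned to the centroid with the
largest Max-min similarity; it is illustrative, not the authors'
reference implementation.

```python
import numpy as np

def max_min_kmeans(X, K, max_iter=100, seed=0):
    """K-Means based on the Max-min measure (sketch of Algorithm 1)."""
    rng = np.random.default_rng(seed)
    # Step 1: Min-max normalization to [0, 1] (assumes no constant attribute)
    X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    n = len(X)
    # Step 3: select K objects at random as the initial centroids
    C = X[rng.choice(n, size=K, replace=False)].copy()
    labels = None
    for _ in range(max_iter):
        # Step 4: Max-min similarity between every object and every centroid
        d_ij = (np.minimum(X[:, None, :], C[None, :, :]).sum(axis=2) /
                np.maximum(X[:, None, :], C[None, :, :]).sum(axis=2))
        # Step 5: hard membership: assign each object to its most
        # similar centroid (largest d_ij)
        new_labels = d_ij.argmax(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # Step 7: memberships stable, so the algorithm converged
        labels = new_labels
        # Step 6: recompute each centroid as the mean of its member objects
        for j in range(K):
            if np.any(labels == j):
                C[j] = X[labels == j].mean(axis=0)
    return labels, C
```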
IV. EXPERIMENTAL ANALYSIS
The main purpose of this work is to explore the impact of
Max-min distance measure in the K-Means clustering
algorithm with normalization. The experimental analysis is
performed with six benchmark datasets available in the UCI
machine learning data repository [12]. The information about
the datasets is shown in Table 1. The values of all datasets are
normalized by Min-max normalization method before
performing clustering process using K-Means algorithm.
The performance of the K-Means algorithm is measured in
terms of four external validity measures [4, 6], namely the Rand
index, Jaccard index, F-Measure and Entropy, along with the
number of iterations required to reach the desired clusters. The
external validity measures test the quality of clusters by
comparing the results of clustering with the 'ground truth'
(true class labels). All four measures take values between 0 and
1. For the Rand index, Jaccard index and F-Measure, a value of 1
indicates that the obtained clusters exactly match the true
classes, so higher values of these measures indicate better
performance. For the Entropy measure, by contrast, a value of 1
signifies that the clusters are entirely different from the true
classes, so lower values indicate better-quality clusters.
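As an illustration of how such pair-counting indices can be
computed from the true class labels and the cluster labels (a
common formulation; the exact definitions used in [4, 6] may
differ in detail):

```python
import numpy as np

def pair_counts(truth, pred):
    """Classify all unordered object pairs by whether the two
    labelings place them together or apart."""
    truth, pred = np.asarray(truth), np.asarray(pred)
    iu = np.triu_indices(len(truth), k=1)            # all unordered pairs
    same_t = (truth[:, None] == truth[None, :])[iu]  # together in truth
    same_p = (pred[:, None] == pred[None, :])[iu]    # together in clustering
    a = np.sum(same_t & same_p)     # together in both
    b = np.sum(same_t & ~same_p)    # together only in truth
    c = np.sum(~same_t & same_p)    # together only in clustering
    d = np.sum(~same_t & ~same_p)   # apart in both
    return a, b, c, d

def rand_index(truth, pred):
    a, b, c, d = pair_counts(truth, pred)
    return (a + d) / (a + b + c + d)

def jaccard_index(truth, pred):
    a, b, c, d = pair_counts(truth, pred)
    return a / (a + b + c)
```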
The results of K-Means algorithm with Max-min distance,
in comparison with the results of K-Means algorithm with
traditional distances Euclidean and Cosine, in terms of Rand
index, Jaccard index, F-Measure and Entropy are shown in
Table 2, Table 3, Table 4 and Table 5 respectively. From the
Tables, it is observed that the Max-min distance yields better
results than the other two distances for almost all datasets. It is
noted that the Euclidean and Max-min distances produce exactly
the same performance for the Iris dataset, whereas they yield
approximately the same performance for the Mammography
dataset, in terms of all four validity measures. For the
Dermatology dataset, the results of the Max-min distance are
considerably better than those of the Euclidean distance in terms
of all four validity measures, while being approximately the same
as the Cosine distance in terms of the Rand index, Jaccard index
and F-Measure. The effect of the Max-min distance in the
K-Means algorithm, based on the four validity measures Rand
index, Jaccard index, F-Measure and Entropy, is illustrated in
Figure 1, Figure 2, Figure 3 and Figure 4 respectively.
In order to measure the computational efficiency of the
proposed K-Means clustering algorithm, the number of iterations
required for convergence is compared in Figure 5. The figure
shows that the Max-min distance measure has a considerable
advantage over the Euclidean and Cosine distance measures,
because it requires fewer iterations to reach convergence. From
the comparative analysis, it is concluded that the Max-min
distance is more suitable than the other two distances, namely
Euclidean and Cosine, for all the experimented numerical
datasets.
TABLE 1. DETAILS OF DATASETS

S. No.   Dataset         No. of Attributes   No. of Classes   No. of Instances
1        Australian      14                  2                690
2        Breast Cancer   10                  2                699
3        Dermatology     34                  6                366
4        Hepatitis       19                  2                155
5        Iris            4                   3                150
6        Mammography     5                   2                961
TABLE 2. COMPARATIVE ANALYSIS BASED ON RAND INDEX
Dataset Euclidean Cosine Max-min
Australian 0.5071 0.6228 0.6697
Breast Cancer 0.9049 0.9417 0.8647
Dermatology 0.7018 0.9104 0.9134
Hepatitis 0.6434 0.5934 0.6723
Iris 0.9499 0.9272 0.9499
Mammography 0.6573 0.6305 0.6550
TABLE 3. COMPARATIVE ANALYSIS BASED ON JACCARD INDEX
Dataset Euclidean Cosine Max-min
Australian 0.5047 0.4585 0.5096
Breast Cancer 0.8419 0.7758 0.8984
Dermatology 0.3065 0.6359 0.6476
Hepatitis 0.5700 0.5106 0.6012
Iris 0.8602 0.8033 0.8602
Mammography 0.4952 0.4705 0.4954
TABLE 4. COMPARATIVE ANALYSIS BASED ON F-MEASURE
Dataset Euclidean Cosine Max-min
Australian 0.6071 0.4561 0.4538
Breast Cancer 0.6589 0.6179 0.6426
Dermatology 0.5631 0.8084 0.8191
Hepatitis 0.7602 0.6992 0.7892
Iris 0.9600 0.9401 0.9600
Mammography 0.7815 0.7579 0.7802
TABLE 5. COMPARATIVE ANALYSIS BASED ON ENTROPY INDEX
Dataset Euclidean Cosine Max-min
Australian 0.3592 0.2943 0.2668
Breast Cancer 0.0811 0.1254 0.0689
Dermatology 0.9264 0.2831 0.2289
Hepatitis 0.4576 0.4772 0.4391
Iris 0.1494 0.1992 0.1494
Mammography 0.5057 0.5313 0.5013
Figure 1. Performance Analysis based on Rand Index (bar chart: Rand index values, 0 to 1, for Euclidean, Cosine and Max-min on each dataset)
Figure 2. Performance Analysis based on Jaccard Coefficient (bar chart: Jaccard index values, 0 to 1, for Euclidean, Cosine and Max-min on each dataset)
Figure 3. Performance Analysis based on F-Measure (bar chart: F-Measure values, 0 to 1, for Euclidean, Cosine and Max-min on each dataset)
Figure 4. Performance Analysis based on Entropy (bar chart: Entropy values, 0 to 1, for Euclidean, Cosine and Max-min on each dataset)
Figure 5. Performance Analysis based on No. of Iterations (bar chart: iterations to convergence, 0 to 18, for Euclidean, Cosine and Max-min on each dataset)
V. CONCLUSION
The clustering problem has been widely studied since it
arises in many knowledge-management-oriented applications. It
aims at identifying the distribution of patterns and intrinsic
correlations in datasets by partitioning the data objects into
similarity clusters. In this paper, a modified K-Means clustering
algorithm is proposed by applying the Max-min distance measure
in place of the Euclidean distance measure. The proposed
algorithm performs normalization using the Min-max method as
the initial step of the clustering process. The results of
numerical experiments on six benchmark datasets demonstrate
the superiority of the proposed algorithm, which produces
high-quality clusters with minimal computational cost. In the
future, the Max-min distance measure can be applied to the fuzzy
C-Means algorithm and its variants to improve their
performance.
REFERENCES
[1] Adil M. Bagirov, “Modified Global K-Means Algorithm for Minimum
Sum-of-squares Clustering Problems”, Pattern Recognition, Vol. 41,
2008, pp. 3192-3199.
[2] Chieh-Yuan Tsai, Chuang-Cheng Chiu, “Developing a Feature Weight
self-adjustment Mechanism for a K-Means Clustering Algorithm”,
Computational Statistics and Data Analysis, Vol. 52, 2008, pp. 4658-
4672.
[3] Daxin Jiang, Chun Tang and Aidong Zhang, “Cluster Analysis for Gene
Expression Data: A Survey”, IEEE Transactions on Knowledge and
Data Engineering, Vol. 16, No. 11, November 2004, pp. 1370-1386.
[4] M. Halkidi, Y. Batistakis, M. Vazirgiannis, “Cluster Validity Methods”,
ACM SIGMOD Record, Vol. 31, No. 3, 2002, pp. 19-27.
[5] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan
Kaufmann Publishers, San Francisco, 2006.
[6] Hui Xiong, Junjie Wu, Jian Chen, “K-means clustering versus validation
measures: a data distribution perspective”, Proceedings of the 12th
ACM SIGKDD international conference on Knowledge Discovery and
Data Mining, August 2006, pp. 779-784.
[7] Jain A K, Murty M N, Flynn P J., “Data Clustering: A Review”, ACM
Computing Surveys, Vol. 31, No. 3, September 1999, pp. 265-323.
[8] Januzaj E, Kriegel Hans P, Pfeifle M., “DBDC: Density Based
Distributed Clustering”, Advances in Database Technology – EDBT
2004, Springer Berlin / Heidelberg, Vol. 2992, February 2004, pp.
529-530.
[9] Krista Rizman Zalik, “An Efficient K'-means Clustering Algorithm”,
Pattern Recognition Letters, Vol. 29, 2008, pp. 1385-1391.
[10] Luai A. Shalabi, Zyad Shaaban and Basel Kassabeh, “Data Mining A
Preprocessing Engine”, Journal of Computer Science, Vol. 2, No. 9,
2006, pp. 735-739.
[11] J. B. MacQueen, “Some Methods for Classification and Analysis of
Multivariate Observations”, Proceedings of the 5th Berkeley Symposium on
Mathematical Statistics and Probability, University of California Press,
1967, Vol. 1, pp. 281-297.
[12] Merz C J, Murphy P M., “UCI Repository of Machine Learning
Databases”, Irvine, University of California, 1998,
http://www.ics.uci.edu/~mlearn/.
[13] Pang-Ning Tan, Steinbach M, Kumar V., “Cluster Analysis: Basic
Concepts and Algorithms”, Introduction to Data Mining, Pearson
Addison Wesley, Boston, 2006.
[14] Shehroz S. Khan, Amir Ahmad, “Cluster Center Initialization Algorithm
for K-Means Clustering”, Pattern Recognition Letters, Vol. 25, 2004, pp.
1293-1302.
[15] K. P. Soman, Shyam Diwakar and V. Ajay, Insight into Data Mining
Theory and Practice, PHI, India, 2006.
[16] Timothy J. Ross, Fuzzy Logic with Engineering Applications, McGraw-
Hill, New York.
[17] Xu R, Wunsch D II, “Survey of clustering algorithms”, IEEE
Transactions on Neural Networks, Vol. 16, Issue 3, May 2005, pp.
645-678.