Conference Paper

K-means clustering using Max-min distance measure

July 2009


DOI:10.1109/NAFIPS.2009.5156398

Source
IEEE Xplore

Conference: Fuzzy Information Processing Society, 2009. NAFIPS 2009. Annual Meeting of the North American

Authors:

N. Karthikeyani Visalakshi

Kongu Engineering College

J. Suguna

Vellalar College for Women



K-Means Clustering using Max-min Distance Measure

N. Karthikeyani Visalakshi

Lecturer in Computer Science

Vellalar College for Women

Erode, Tamil Nadu, India

karthichitru@yahoo.co.in

J. Suguna

Lecturer (SG) in Computer Science

Vellalar College for Women

Erode, Tamil Nadu, India

sugunajravi@yahoo.co.in

Abstract - Cluster analysis deals with the problem of organizing a collection of data objects into clusters based on similarity. It is also known as unsupervised classification of objects and has found applications in many areas. An important component of a clustering algorithm is the distance measure used to assess the similarity between data objects. K-means is one of the most popular and widespread partitioning clustering algorithms because of its scalability and efficiency. Typically, the K-means algorithm measures the distance between an object and its cluster centroid using the Euclidean distance. This paper proposes a variant of K-means that uses an alternative distance measure, namely the Max-min measure. The modified K-means algorithm is tested on six benchmark datasets taken from the UCI machine learning data repository and is found to converge in fewer iterations than the existing algorithm, with improved performance.

Keywords - Clustering; Euclidean distance; K-Means algorithm; Max-min distance.

I. INTRODUCTION

Cluster analysis is an unsupervised learning method that constitutes a cornerstone of an intelligent data analysis process. It is the process of grouping objects into clusters such that objects in the same cluster are similar, whereas objects in different clusters are dissimilar. Clustering has become an increasingly important task in modern application domains such as marketing and purchasing assistance, multimedia, and molecular biology, among many others. No clustering algorithm performs best for all datasets. Choosing a suitable clustering algorithm for a dataset requires both expertise and insight, and the choice depends on the nature of the application and the patterns to be extracted. Different types of algorithms have been proposed in the literature [7, 13] to solve the clustering problem.

Each clustering algorithm is based on some kind of distance measure, which leads to the grouping of related objects. As each distance measure determines the degree of similarity between two objects in a different way, the selection of an appropriate distance measure plays a vital role in any clustering algorithm. The distance measure influences the shape of the clusters, as some elements may be close to one another according to one distance measure and farther apart according to another.

K-Means [11] is a prototype-based, partitional clustering technique that attempts to find a user-specified number of clusters (K), which are represented by centroids. There are different variants of the K-Means algorithm in the literature [1, 2, 3, 9, 14], each emphasizing a different aspect such as centroid initialization [14], determining the number of clusters [2, 3], or global optimization [1].

The K-Means algorithm typically uses the Euclidean or squared Euclidean distance to measure the distortion between a data object and its cluster centroid [17]. The Euclidean and squared Euclidean distances are usually computed from raw data and not from standardized data. With Euclidean distances, the distance between any two objects is not affected by the addition of new objects to the analysis. However, the clustering results can be greatly affected by differences in scale among the dimensions from which the distances are computed. Hence, an effort is made here to standardize the input objects before clustering and to suggest an alternative distance measure. This paper proposes a modified K-Means algorithm that applies the Min-max normalization procedure [10] and the Max-min distance measure [16] to achieve better performance.

The paper is organized as follows: Section II presents an overview of clustering algorithms, K-Means clustering, distance measures and the need for normalization. The modified K-Means algorithm is proposed in Section III. Section IV discusses the experimental analysis. Section V concludes the paper and outlines the scope for future research.

II. BACKGROUND

A. Clustering Algorithms

Clustering is a process of partitioning a set of data objects

into a set of meaningful subclasses, called clusters. A cluster is

a collection of data objects that are similar to one another

based on their attribute values, and thus can be treated

collectively as one group. The goal of the clustering technique

is to decompose or partition a data set into groups such that

both intra-group similarity and inter-group dissimilarity are

maximized [17].

Clustering algorithms can be classified along different, independent dimensions. One well-known dimension categorizes clustering methods according to the result they produce. Here, we can distinguish between hierarchical and partitioning clustering algorithms [8]. Partitioning algorithms construct a flat (single-level) partition of a database D of n objects into a set of K clusters such that the objects in a cluster are more similar to each other than to objects in different clusters. Hierarchical algorithms decompose the database into several levels of nested partitioning (clustering), represented for example by a dendrogram, i.e. a tree that iteratively splits D into smaller subsets until each subset consists of only one object. In such a hierarchy, each node of the tree represents a cluster of D.

Another dimension according to which we can classify

clustering algorithms is from an algorithmic point of view.

Here we can distinguish between optimization based or

distance based algorithms and density based algorithms.

Distance based methods use the distances between the objects

directly in order to optimize a global cluster criterion. In

contrast, density based algorithms apply a local cluster

criterion. Clusters are regarded as regions in the data space in

which the objects are dense, and which are separated by

regions of low object density (noise).

Studies have shown that partitional clustering algorithms are more suitable for clustering large datasets because of their relatively low computational requirements [15]. The time complexity of the partitioning technique is almost linear, which makes it widely used. One of the best-known distance-based partitioning clustering algorithms is the K-means algorithm [7, 13].

B. K-Means Clustering

K-Means is a typical clustering algorithm [17]. It is attractive in practice because it is simple and generally very fast. A set of n objects $x_i$, $i = 1, 2, \ldots, n$, is to be partitioned into K groups $C_j$, $j = 1, 2, \ldots, K$. The objective function, based on the Euclidean distance between an object x in group j and the corresponding cluster centroid $C_j$, can be defined by:

$$J = \sum_{j=1}^{K} \sum_{i=1}^{n} \left\lVert x_i^{(j)} - C_j \right\rVert^2 \qquad (1)$$

where $x_i^{(j)}$ denotes an object $x_i$ assigned to group j.

In detail, the algorithm randomly selects K of the given objects to represent the initial cluster centroids. Based on the selected objects, all remaining objects are assigned to their closest centroid one by one: the Euclidean distance between the object and every centroid is computed, and the object is moved to the cluster that yields the minimum distance. Each centroid is then recalculated as the mean of all data points belonging to its cluster. The operation is iterated over all the objects, and the same procedure is repeated until the objective function converges. If K is not known ahead of time, various values of K can be evaluated until the most suitable one is found. The effectiveness of this method, as well as of others, relies heavily on the objective function used in measuring the distance between objects. Choosing proper initial centroids is the key step of the basic K-Means procedure.
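The assignment and update steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the toy data and the choice of the first two objects as initial centroids are assumptions made for the example, and convergence is detected here by unchanged assignments rather than by monitoring the objective function directly.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def assign(points, centroids):
    """For each point, return the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda j: euclidean(x, centroids[j]))
            for x in points]

def update(points, labels, centroids):
    """Recompute each centroid as the mean of its members (unchanged if empty)."""
    new = []
    for j, old in enumerate(centroids):
        members = [x for x, lab in zip(points, labels) if lab == j]
        if members:
            new.append(tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            new.append(tuple(old))
    return new

def objective(points, labels, centroids):
    """Sum of squared Euclidean distances of the points to their centroids."""
    return sum(euclidean(x, centroids[lab]) ** 2 for x, lab in zip(points, labels))

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update until the assignments stop changing."""
    labels = None
    for iteration in range(1, max_iter + 1):
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
        centroids = update(points, labels, centroids)
    return labels, centroids, iteration

# Toy data (hypothetical): two well-separated groups of three objects each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
# Deliberately poor start: both initial centroids come from the first group.
labels, centroids, iterations = kmeans(points, [points[0], points[1]])
j_value = objective(points, labels, centroids)
```

Even with both initial centroids drawn from the same group, the toy run recovers the two groups, but it needs an extra pass to do so, which illustrates why the choice of initial centroids matters.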

Generally, the K-Means algorithm has the following

important properties: (i) it is efficient in processing large data

sets, (ii) it often terminates at a local optimum, (iii) the

clusters have spherical shapes, (iv) it is sensitive to noise.

Distance Measures

The concept of distance is the essential component of any form of clustering; it helps to navigate through the data space and form clusters [15]. Computing a distance expresses how close together two patterns are and, based on this closeness, the patterns are allocated to the same cluster. Formally, the distance $d(x, y)$ between x and y is considered to be a two-argument function satisfying the following conditions:

$$
\begin{aligned}
d(x, y) &\ge 0 && \text{for every } x \text{ and } y \\
d(x, x) &= 0 && \text{for every } x \\
d(x, y) &= d(y, x) \\
d(x, y) + d(y, z) &\ge d(x, z)
\end{aligned}
\qquad (2)
$$
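The four conditions can be checked numerically on a small set of points. The sketch below is only an illustration under stated assumptions: the helper name `satisfies_distance_conditions`, the tolerance and the sample points are hypothetical, and passing the check on a finite sample does not prove that a function is a distance in general.

```python
import itertools
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def satisfies_distance_conditions(d, points, tol=1e-12):
    """Check the four conditions on every pair and triple drawn from `points`."""
    for x in points:
        if abs(d(x, x)) > tol:                      # d(x, x) = 0
            return False
    for x, y in itertools.product(points, repeat=2):
        if d(x, y) < 0:                             # d(x, y) >= 0
            return False
        if abs(d(x, y) - d(y, x)) > tol:            # d(x, y) = d(y, x)
            return False
    for x, y, z in itertools.product(points, repeat=3):
        if d(x, y) + d(y, z) < d(x, z) - tol:       # d(x, y) + d(y, z) >= d(x, z)
            return False
    return True

sample = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0), (0.5, 0.2)]
euclidean_ok = satisfies_distance_conditions(euclidean, sample)

def squared_euclidean(x, y):
    """Squared Euclidean distance, which is not a distance in the above sense."""
    return euclidean(x, y) ** 2

# On the collinear points 0, 1, 2: d(0,1) + d(1,2) = 2 but d(0,2) = 4.
squared_ok = satisfies_distance_conditions(squared_euclidean, [(0.0,), (1.0,), (2.0,)])
```

The Euclidean distance passes the check on the sample, while the squared Euclidean distance fails the last condition on three collinear points.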

When the components of the data object vectors are all in

the same physical units then it is possible that the simple

Euclidean distance metric is sufficient to successfully group

similar data instances. However, even in this case the

Euclidean distance may sometimes be misleading. Despite

different measurements being taken in the same physical units,

an informed decision has to be made as to the relative scaling,

because different scaling can lead to different types of

clustering.
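A small numerical example makes the effect of relative scaling concrete. The objects, feature names and scaling factor below are hypothetical and chosen only so that the nearest neighbour of a query object changes when one feature is rescaled.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def rescale(point, factors):
    """Multiply every feature of `point` by the matching scaling factor."""
    return tuple(v * f for v, f in zip(point, factors))

def nearest(query, candidates):
    """Return the name of the candidate closest to `query`."""
    return min(candidates, key=lambda name: euclidean(query, candidates[name]))

# Hypothetical objects described by (width, height), both in the same unit.
query = (1.0, 1.0)
candidates = {"a": (1.0, 1.5), "b": (1.4, 1.0)}
before = nearest(query, candidates)           # a is 0.5 away, b is 0.4 away

# Doubling the weight of the first feature changes which object is closest.
factors = (2.0, 1.0)
scaled_candidates = {name: rescale(p, factors) for name, p in candidates.items()}
after = nearest(rescale(query, factors), scaled_candidates)   # a: 0.5, b: 0.8
```

Because cluster assignment in K-Means is a nearest-centroid decision, a change of this kind in the relative scaling can move objects from one cluster to another.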

A mathematical formula is used to combine the distances between the individual components of the data feature vectors into a single distance measure. Different formulas may lead to different clusterings. Domain knowledge must be used to guide the formulation of a suitable distance measure for each particular application. However, there are no general theoretical guidelines for selecting a measure for any given application.

It is often the case that the components of the data feature

vectors are not immediately comparable. It can be that the

components are not continuous variables, like length, but

nominal categories, such as the days of the week. In these

cases again, domain knowledge must be used to formulate an

appropriate measure.

There are several distance measures used in the literature [15, 17]. Three distance measures used in this work for comparative analysis are defined as follows:

• Euclidean distances are usually computed from raw data and not from standardized data.

$$d(t_i, t_j) = \sqrt{\sum_{k} (t_{ik} - t_{jk})^2} \qquad (3)$$

• Cosine coefficient relates the overlap to the geometric average of the two sets.

$$d(t_i, t_j) = \frac{\sum_{h} t_{ih}\, t_{jh}}{\sqrt{\sum_{h} t_{ih}^2 \, \sum_{h} t_{jh}^2}} \qquad (4)$$

• Max-min distance measure is found through simple min and max operations on pairs of data objects.

$$d(t_i, t_j) = \frac{\sum_{k} \min(t_{ik}, t_{jk})}{\sum_{k} \max(t_{ik}, t_{jk})} \qquad (5)$$
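The three measures can be written directly as Python functions. This is a sketch under assumptions: the Max-min formula is taken to be the ratio of the summed minima to the summed maxima, as in the fuzzy max-min method of [16], the function names and sample vectors are hypothetical, and the inputs are assumed to be non-negative and not all zero.

```python
import math

def euclidean(t_i, t_j):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t_i, t_j)))

def cosine(t_i, t_j):
    """Cosine coefficient: dot product over the geometric mean of squared norms."""
    dot = sum(a * b for a, b in zip(t_i, t_j))
    norm = math.sqrt(sum(a * a for a in t_i) * sum(b * b for b in t_j))
    return dot / norm        # assumes neither vector is all zeros

def max_min(t_i, t_j):
    """Max-min measure: sum of feature-wise minima over sum of feature-wise maxima."""
    top = sum(min(a, b) for a, b in zip(t_i, t_j))
    bottom = sum(max(a, b) for a, b in zip(t_i, t_j))
    return top / bottom      # assumes non-negative features, not all zero

# Hypothetical objects with features already scaled to [0, 1].
t_i = (0.2, 0.4, 0.6)
t_j = (0.4, 0.4, 0.2)

d_euclidean = euclidean(t_i, t_j)   # sqrt(0.04 + 0 + 0.16)
d_cosine = cosine(t_i, t_j)         # 0.36 / sqrt(0.56 * 0.36)
d_max_min = max_min(t_i, t_j)       # 0.8 / 1.4
```

Note that the cosine and Max-min values as computed here equal 1 for identical objects and decrease as objects differ, whereas the Euclidean distance equals 0 for identical objects and grows. Code that picks the closest centroid therefore has to treat the direction of each measure consistently, for example by using one minus the value. That conversion is an assumption of this note, not a step stated in the text.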

C. Need for Normalization

Preprocessing [10] is required before using any data mining algorithm in order to improve the quality of the results. Data normalization is one of the preprocessing procedures in data mining, in which the attribute data are scaled so as to fall within a small specified range such as -1.0 to 1.0 or 0.0 to 1.0. Normalization before clustering is often needed for distance metrics, such as the Euclidean distance, that are sensitive to differences in the magnitude or scale of the attributes. In real applications, because of differences in the ranges of attribute values, one attribute might overpower another. Normalization prevents features with a large range, such as 'salary', from outweighing features with a smaller range, such as 'age'. The goal is to equalize the size or magnitude and the variability of these features.

There are many methods for data normalization, including Min-max normalization, z-score normalization and normalization by decimal scaling [5, 10]. Min-max normalization performs a linear transformation on the original data and maps each value of an attribute to the range [0, 1]. In z-score normalization, the values of an attribute are normalized based on the mean and standard deviation of that attribute. This method of normalization is useful when the actual minimum and maximum of an attribute are unknown. Normalization by decimal scaling normalizes by moving the decimal point of the values of an attribute. The number of decimal places moved depends on the maximum absolute value of the attribute.
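The three normalization methods can be sketched for a single attribute as follows. The formulas are the usual textbook ones (see [5]); the function names and the 'salary' values are hypothetical, the population standard deviation is used for the z-score as an assumption, and constant attributes (zero range or zero deviation) are not handled.

```python
import statistics

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly map the attribute so its minimum and maximum hit the new range."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

def z_score(values):
    """Centre on the mean and divide by the (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def decimal_scaling(values):
    """Divide by the smallest power of ten that brings every |value| below 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

# Hypothetical 'salary' attribute with a large range.
salary = [30000.0, 45000.0, 60000.0, 90000.0]

salary_min_max = min_max(salary)           # [0.0, 0.25, 0.5, 1.0]
salary_z = z_score(salary)                 # mean 0, standard deviation 1
salary_decimal = decimal_scaling(salary)   # divided by 10**5
```

Only Min-max normalization guarantees values in [0, 1], which is the range the Max-min measure is later said to require.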

III. PROPOSED ALGORITHM

Normally, the K-Means clustering algorithm uses the Euclidean distance as the distance measure to compute the similarity between an object and its centroid. Alternatively, the Cosine distance measure [15] is often used for document clustering. In this paper, the Max-min distance measure [16] is suggested in place of the Euclidean distance measure. The Max-min distance measure requires the values to be in the range [0, 1]. In order to scale the given objects to fall within the range [0, 1], the Min-max normalization procedure [10] is applied as a pre-processing step of the proposed K-Means algorithm. The step-by-step procedure of the proposed K-Means clustering algorithm based on the Max-min distance measure is given below.

---------------------------------------------------------------------------
Algorithm 1
K-Means Algorithm based on Max-min Distance Measure
---------------------------------------------------------------------------
Input : Dataset of n objects with d features and the value of K
Output: Partition of the input data into K clusters
Procedure:
Step 1: Normalize the input data objects to fall into the range [0, 1] using Min-max normalization procedure
Step 2: Declare a membership matrix U of size n x K
Step 3: Generate K cluster centroids randomly within the range of the data or select K objects randomly as initial cluster centroids. Let the centroids be C1, C2, ..., CK
Step 4: Calculate the distance measure $d_{ij}$ using Max-min similarity measure

        $$d_{ij} = \frac{\sum_{k} \min(x_{ik}, C_{jk})}{\sum_{k} \max(x_{ik}, C_{jk})}$$

        for all cluster centroids $C_j$, $j = 1, 2, \ldots, K$ and data objects $x_i$, $i = 1, 2, \ldots, n$
Step 5: Compute the U membership matrix

        $$U_{ij} = \begin{cases} 1 & d_{ij} \le d_{il},\ l \ne j \\ 0 & \text{otherwise} \end{cases} \qquad i = 1, 2, \ldots, n;\ j = 1, 2, \ldots, K$$

Step 6: Compute new cluster centroids $C_j$

        $$C_j = \frac{\sum_{i} U_{ij}\, x_i}{\sum_{i} U_{ij}} \qquad \text{for } j = 1, 2, \ldots, K$$

Step 7: Repeat steps 4 to 6 until convergence
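A compact Python sketch of Algorithm 1 follows. It is an illustration under explicit assumptions rather than the authors' code. The max-min ratio is converted to a dissimilarity as one minus the ratio so that the smallest value marks the closest centroid. The membership matrix U is stored as one cluster label per object. Convergence is taken to mean unchanged memberships. The `init` parameter is a hypothetical convenience for reproducible runs.

```python
import random

def min_max_normalize(data):
    """Step 1: rescale every feature (column) to the range [0, 1]."""
    cols = list(zip(*data))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]   # constant feature: avoid /0
    return [tuple((v - lo) / s for v, lo, s in zip(row, lows, spans)) for row in data]

def max_min_dissimilarity(x, c):
    """ASSUMPTION: 1 - (sum of minima / sum of maxima), so smaller means closer."""
    top = sum(min(a, b) for a, b in zip(x, c))
    bottom = sum(max(a, b) for a, b in zip(x, c))
    return 1.0 - top / bottom if bottom > 0 else 0.0

def kmeans_max_min(data, k, init=None, max_iter=100, seed=0):
    """K-Means on min-max normalized data with the max-min based dissimilarity."""
    x = min_max_normalize(data)                                   # Step 1
    if init is not None:                                          # Step 3
        centroids = [x[i] for i in init]
    else:
        centroids = random.Random(seed).sample(x, k)
    labels = None
    for iteration in range(1, max_iter + 1):
        # Steps 4-5: crisp membership, stored as the index of the closest centroid
        new_labels = [
            min(range(k), key=lambda j: max_min_dissimilarity(xi, centroids[j]))
            for xi in x
        ]
        if new_labels == labels:                                  # Step 7
            break
        labels = new_labels
        for j in range(k):                                        # Step 6
            members = [xi for xi, lab in zip(x, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids, iteration

# Hypothetical two-feature data with very different feature ranges.
data = [(1, 12), (2, 10), (1.5, 11), (8, 40), (9, 42), (8.5, 41)]
labels, centroids, iterations = kmeans_max_min(data, k=2, init=(0, 3))
```

On this toy data the run separates the first three objects from the last three. One caveat of the ratio form: an object that is normalized to all zeros has a ratio of 0 to every non-zero centroid, so its assignment is decided by tie-breaking alone.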

IV. EXPERIMENTAL ANALYSIS

The main purpose of this work is to explore the impact of the Max-min distance measure on the K-Means clustering algorithm with normalization. The experimental analysis is performed with six benchmark datasets available in the UCI machine learning data repository [12]. Information about the datasets is shown in Table 1. The values of all datasets are normalized by the Min-max normalization method before clustering with the K-Means algorithm.

The performance of the K-Means algorithm is measured in terms of four external validity measures [4, 6], namely Rand index, Jaccard index, F-Measure and Entropy, along with the number of iterations required to reach the desired clusters. The external validity measures test the quality of clusters by comparing the results of clustering with the 'ground truth' (true class labels). All four measures take a value between 0 and 1. For Rand index, Jaccard index and F-Measure, the value 1 indicates that the clusters match the true classes exactly, so higher values of these measures indicate better performance. For Entropy, in contrast, the value 1 signifies that the clusters are entirely different from the true classes, so lower values of this measure indicate better quality clusters.
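Two of these measures, the Rand index and the Jaccard index, have a simple pair-counting form that can be sketched as follows. The definitions used here are the common ones based on pairs of objects, which is an assumption since the text defers to [4, 6] for the exact formulas. The function names and labels are hypothetical.

```python
from itertools import combinations

def pair_counts(truth, predicted):
    """Count object pairs by whether they share a true class and/or a cluster."""
    ss = sd = ds = dd = 0
    for i, j in combinations(range(len(truth)), 2):
        same_class = truth[i] == truth[j]
        same_cluster = predicted[i] == predicted[j]
        if same_class and same_cluster:
            ss += 1          # together in both partitions
        elif same_class:
            sd += 1          # same class, split across clusters
        elif same_cluster:
            ds += 1          # different classes, merged into one cluster
        else:
            dd += 1          # separated in both partitions
    return ss, sd, ds, dd

def rand_index(truth, predicted):
    """Fraction of pairs on which the clustering and the true classes agree."""
    ss, sd, ds, dd = pair_counts(truth, predicted)
    return (ss + dd) / (ss + sd + ds + dd)

def jaccard_index(truth, predicted):
    """Like the Rand index, but ignoring pairs separated in both partitions."""
    ss, sd, ds, _ = pair_counts(truth, predicted)
    return ss / (ss + sd + ds)   # assumes at least one pair is grouped somewhere

# Hypothetical true classes and cluster labels for six objects.
truth = [0, 0, 0, 1, 1, 1]
predicted = [1, 1, 0, 0, 0, 0]

ri = rand_index(truth, predicted)       # (4 + 6) / 15
ji = jaccard_index(truth, predicted)    # 4 / (4 + 2 + 3)
```

Both indices depend only on which objects are grouped together, so they are unaffected by how the cluster labels are numbered.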

The results of the K-Means algorithm with the Max-min distance, in comparison with the results of the K-Means algorithm with the traditional Euclidean and Cosine distances, in terms of Rand index, Jaccard index, F-Measure and Entropy are shown in Table 2, Table 3, Table 4 and Table 5 respectively. From the tables, it is observed that the Max-min distance yields better results than the other two distances for almost all datasets. Both Euclidean and Max-min distances produce exactly the same performance for the iris dataset, whereas they yield approximately the same performance for the mammography dataset, in terms of all four validity measures. For the dermatology dataset, the results of the Max-min distance are considerably better than those of the Euclidean distance in terms of all four validity measures (for example, a Rand index of 0.9134 against 0.7018). For this dataset, however, the results of the Max-min distance are approximately the same as those of the Cosine distance in terms of Rand index, Jaccard index and F-Measure. The effect of the Max-min distance in the K-Means algorithm on the four validity measures Rand index, Jaccard index, F-Measure and Entropy is illustrated in Figure 1, Figure 2, Figure 3 and Figure 4 respectively.

In order to measure the computational efficiency of the proposed K-Means clustering algorithm, the number of iterations required for convergence is compared, as shown in Figure 5. The figure shows that the Max-min distance measure has a considerable advantage over the Euclidean and Cosine distance measures, because it requires fewer iterations to reach convergence. From the comparative analysis, it is concluded that the Max-min distance is more suitable than the other two distances, namely Euclidean and Cosine, for all the numerical datasets used in the experiments.

TABLE 1. DETAILS OF DATASETS

| S. No. | Dataset | No. of Attributes | No. of Classes | No. of Instances |
|---|---|---|---|---|
| 1 | Australian | 14 | 2 | 690 |
| 2 | Breast Cancer | 10 | 2 | 699 |
| 3 | Dermatology | 34 | 6 | 366 |
| 4 | Hepatitis | 19 | 2 | 155 |
| 5 | Iris | 4 | 3 | 150 |
| 6 | Mammography | 5 | 2 | 961 |

TABLE 2. COMPARATIVE ANALYSIS BASED ON RAND INDEX

| Dataset | Euclidean | Cosine | Max-min |
|---|---|---|---|
| Australian | 0.5071 | 0.6228 | 0.6697 |
| Breast Cancer | 0.9049 | 0.9417 | 0.8647 |
| Dermatology | 0.7018 | 0.9104 | 0.9134 |
| Hepatitis | 0.6434 | 0.5934 | 0.6723 |
| Iris | 0.9499 | 0.9272 | 0.9499 |
| Mammography | 0.6573 | 0.6305 | 0.6550 |

TABLE 3. COMPARATIVE ANALYSIS BASED ON JACCARD INDEX

| Dataset | Euclidean | Cosine | Max-min |
|---|---|---|---|
| Australian | 0.5047 | 0.4585 | 0.5096 |
| Breast Cancer | 0.8419 | 0.7758 | 0.8984 |
| Dermatology | 0.3065 | 0.6359 | 0.6476 |
| Hepatitis | 0.5700 | 0.5106 | 0.6012 |
| Iris | 0.8602 | 0.8033 | 0.8602 |
| Mammography | 0.4952 | 0.4705 | 0.4954 |

TABLE 4. COMPARATIVE ANALYSIS BASED ON F-MEASURE

| Dataset | Euclidean | Cosine | Max-min |
|---|---|---|---|
| Australian | 0.6071 | 0.4561 | 0.4538 |
| Breast Cancer | 0.6589 | 0.6179 | 0.6426 |
| Dermatology | 0.5631 | 0.8084 | 0.8191 |
| Hepatitis | 0.7602 | 0.6992 | 0.7892 |
| Iris | 0.9600 | 0.9401 | 0.9600 |
| Mammography | 0.7815 | 0.7579 | 0.7802 |

TABLE 5. COMPARATIVE ANALYSIS BASED ON ENTROPY INDEX

| Dataset | Euclidean | Cosine | Max-min |
|---|---|---|---|
| Australian | 0.3592 | 0.2943 | 0.2668 |
| Breast Cancer | 0.0811 | 0.1254 | 0.0689 |
| Dermatology | 0.9264 | 0.2831 | 0.2289 |
| Hepatitis | 0.4576 | 0.4772 | 0.4391 |
| Iris | 0.1494 | 0.1992 | 0.1494 |
| Mammography | 0.5057 | 0.5313 | 0.5013 |

Figure 1. Performance Analysis based on Rand Index [axes: dataset vs. Rand index; series: Euclidean, Cosine, Max-min]

Figure 2. Performance Analysis based on Jaccard Coefficient [axes: dataset vs. Jaccard index; series: Euclidean, Cosine, Max-min]

Figure 3. Performance Analysis based on F-Measure [axes: dataset vs. F-Measure; series: Euclidean, Cosine, Max-min]

Figure 4. Performance Analysis based on Entropy [axes: dataset vs. Entropy; series: Euclidean, Cosine, Max-min]

Figure 5. Performance Analysis based on No. of Iterations [axes: dataset vs. number of iterations; series: Euclidean, Cosine, Max-min]

V. CONCLUSION

The clustering problem has been widely studied since it arises in many knowledge management oriented applications. It aims at identifying the distribution of patterns and intrinsic correlations in datasets by partitioning the data objects into clusters of similar objects. In this paper, a modified K-Means clustering algorithm is proposed that applies the Max-min distance measure in place of the Euclidean distance measure. The proposed algorithm normalizes the data using the Min-max method as the initial step of the clustering process. The results of numerical experiments on six benchmark datasets indicate that the proposed algorithm produces high quality clusters while requiring fewer iterations to converge. In future, the Max-min distance measure can be applied in the fuzzy C-Means algorithm and its variants to improve their performance.

REFERENCES

[1] Adil M. Bagirov, "Modified Global K-Means Algorithm for Minimum Sum-of-squares Clustering Problems", Pattern Recognition, Vol. 41, 2008, pp. 3192-3199.

[2] Chieh-Yuan Tsai, Chuang-Cheng Chiu, "Developing a Feature Weight Self-adjustment Mechanism for a K-Means Clustering Algorithm", Computational Statistics and Data Analysis, Vol. 52, 2008, pp. 4658-4672.

[3] Daxin Jiang, Chun Tang and Aidong Zhang, "Cluster Analysis for Gene Expression Data: A Survey", IEEE Transactions on Knowledge and Data Engineering, Vol. 16, No. 11, November 2004, pp. 1370-1386.

[4] M. Halkidi, Y. Batistakis, M. Vazirgiannis, "Cluster Validity Methods", ACM SIGMOD Record, Vol. 31, No. 3, 2002, pp. 19-27.

[5] J. Han, M. Kamber, Data Mining Concepts and Techniques, Morgan Kaufmann Publishers, San Francisco, 2006.

[6] Hui Xiong, Junjie Wu, Jian Chen, "K-means clustering versus validation measures: a data distribution perspective", Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2006, pp. 779-784.

[7] Jain A K, Murty M N, Flynn P J, "Data Clustering: A Review", ACM Computing Surveys, Vol. 31, No. 3, September 1999, pp. 265-323.

[8] Januzaj E, Kriegel Hans P, Pfeifle M, "DBDC: Density Based Distributed Clustering", Advances in Database Technology - EDBT 2004, Springer Berlin / Heidelberg, Vol. 2992, February 2004, pp. 529-530.

[9] Krista Rizman Zalik, "An Efficient K'-means Clustering Algorithm", Pattern Recognition Letters, Vol. 29, 2008, pp. 1385-1391.

[10] Luai A. Shalabi, Zyad Shaaban and Basel Kassabeh, "Data Mining: A Preprocessing Engine", Journal of Computer Science, Vol. 2, No. 9, 2006, pp. 735-739.

[11] J. B. MacQueen, "Some Methods for Classification and Analysis of Multivariate Observations", Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, 1967, Vol. 1, pp. 281-297.

[12] Merz C J, Murphy P M, "UCI Repository of Machine Learning Databases", Irvine, University of California, 1998, http://www.ics.uci.eedu/~mlearn/.

[13] Pang-Ning Tan, Steinbach M, Kumar V, "Cluster Analysis: Basic Concepts and Algorithms", Introduction to Data Mining, Pearson Addison Wesley, Boston, 2006.

[14] Shehroz S. Khan, Amir Ahmad, "Cluster Center Initialization Algorithm for K-Means Clustering", Pattern Recognition Letters, Vol. 25, 2004, pp. 1293-1302.

[15] K. P. Soman, Shyam Diwakar and V. Ajay, Insight into Data Mining Theory and Practice, PHI, India, 2006.

[16] Timothy J Ross, Fuzzy Logic with Engineering Applications, McGraw-Hill, New York.

[17] Xu R, Wunsch D II, "Survey of clustering algorithms", IEEE Transactions on Neural Networks, Vol. 16, Issue 3, May 2005, pp. 645-678.

Real-Time Tracking Data and Machine Learning Approaches for Mapping Pedestrian Walking Behavior: A Case Study at the University of Moratuwa

Article

Full-text available

Jun 2024
SENSORS-BASEL

The growing urban population and traffic congestion underline the importance of building pedestrian-friendly environments to encourage walking as a preferred mode of transportation. However, a major challenge remains, which is the absence of such pedestrian-friendly walking environments. Identifying locations and routes with high pedestrian concentration is critical for improving pedestrian-friendly walking environments. This paper presents a quantitative method to map pedestrian walking behavior by utilizing real-time data from mobile phone sensors, focusing on the University of Moratuwa, Sri Lanka, as a case study. This holistic method integrates new urban data, such as location-based service (LBS) positioning data, and data clustering with unsupervised machine learning techniques. This study focused on the following three criteria for quantifying walking behavior: walking speed, walking time, and walking direction inside the experimental research context. A novel signal processing method has been used to evaluate speed signals, resulting in the identification of 622 speed clusters using K-means clustering techniques during specific morning and evening hours. This project uses mobile GPS signals and machine learning algorithms to track and classify pedestrian walking activity in crucial sites and routes, potentially improving urban walking through mapping.

Multi-Category Prediction of Students’ Academic Performance Using Machine Learning: For Students Joining Higher Educational Institution in Ethiopia

Preprint

Full-text available

Sep 2023

Following the deployment of the Learning Management System (LMS) platform in higher educational institutions in Ethiopia, a massive amount of potentially helpful but as-yet untapped educational data has been generated. Despite the fact that the data is powerful enough to contribute to reducing student dropout rates through the application of modern educational data mining techniques such as machine learning, it has not been successfully employed to tackle student academic performance problems in higher education institutions(HEIs). As a result, a machine learning model was proposed based on data from three semesters of undergraduate students at Bule Hora University. To predict students' academic achievement, five machine learning methods (SVM, Random Forest, KNN, Gradient Boosting, and Decision Tree) were used. The Decision Tree model outperformed other models with a promising result of 97.3% on test accuracy and was selected as a proposed model. Moreover, our findings suggest that academic factors (entrance result, study time, attendance, and Internet access) and socio-demographic factors (age, gender, father job, mother job, family size, and address) had a greater impact on students' academic success. However, the academic performance of students was less affected by additional features like extra classes, another job, and guidance and counselling. Moreover, we are working on improving the accuracy of the proposed model. In this study, the student entrance result is the only variable used from the pre-university data. However, a CGPA result is not sufficient to qualify a student. Therefore, in the future, the exit exam results of the students will be incorporated.

Multi-Category Prediction of Students’ Academic Performance Using Machine Learning: For Students Joining Higher Educational Institutions in Ethiopia

Preprint

Full-text available

Sep 2023

Article

Full-text available

Jun 2023

Target detection and segmentation in synthetic aperture radar (SAR) images are vital steps for many remote sensing applications. In the era of data-driven deep learning, this task is extremely challenging due to the limited labeled data. Few-shot learning has the ability to learn quickly from a few samples with supervised information. Inspired by this, a few-shot learning framework named MSG-FN is proposed to solve the segmentation of ship targets in heterologous SAR images with few annotated samples. The proposed MSG-FN adopts a dual-branch network consisting of a support branch and a query branch. The support branch is used to extract features with an encoder, and the query branch uses a U-shaped encoder–decoder structure to segment the target in the query image. The encoder of each branch is composed of well-designed residual blocks combined with filter response normalization to capture robust and domain-independent features. A multi-scale similarity guidance module is proposed to improve the scale adaptability of detection by applying hand-on-hand guidance of support features to query features of various scales. In addition, a SAR dataset named SARShip-4i is built to evaluate the proposed MSG-FN, and the experimental results show that the proposed method achieves superior segmentation results compared with the state-of-the-art.

Preprint

Full-text available

May 2023

Target detection and segmentation in synthetic aperture radar (SAR) images are vital steps for many remote sensing applications. In the era of data-driven deep learning, this task is extremely challenging due to the limited labeled data. Few-shot Learning has the ability to learn quickly from few samples with supervised information. Inspired by this, a few-shot learning framework named MSG-FN is proposed to solve the segmentation of ship targets in heterologous SAR images with few annotated samples. The proposed MSG-FN adopts a dual-branch network consisting of a support branch and a query branch. The support branch is used to extract features with an encoder, and the query branch uses a U-shaped encoder-decoder structure to segment the target in the query image. The encoder of each branch is composed of well-designed residual blocks combined with filter response normalization to capture robust and domain-independent features. A multi-scale similarity guidance module is proposed to improve the scale adaptability of detection by applying hand-on-hand guidance of support features to query features of various scales. In addition, a SAR dataset named SARShip-4i is built to evaluate the proposed MSG-FN and the experimental results show that the proposed method achieves superior segmentation results compared with the state-of-the-arts.

Understanding the COVID-19 pandemic prevalence in Africa through optimal feature selection and clustering: evidence from a statistical perspective

Article

Full-text available

Aug 2022
Environ Dev Sustain

The COVID-19 pandemic, which outbroke in Wuhan (China) in December 2019, severely hit almost all sectors of activity in the world as a consequence of the restrictive measures imposed. Two years later, Africa still emerges as the least affected continent by the pandemic. This study analyzed COVID-19 prevalence across African countries through country-level variables prior to clustering. Using Spearman-rank correlation, multicollinearity analysis and univariate filtering, 9 country-level variables were identified from an initial set of 34 variables. These variables relate to socioeconomic status, population structure, healthcare system and environment and the climatic setting. A clustering of the 54 African countries is further carried out through the use of agglomerative hierarchical clustering (AHC) method, which generated 3 distinctive clusters. Cluster 1 (11 countries) is the most affected by COVID-19 (median of 63,508.6 confirmed cases and 946.5 deaths per million) and is composed of countries with the highest socioeconomic status. Cluster 2 (27 countries) is the least affected (median of 4473.7 confirmed cases and 81.2 deaths per million), and mainly features countries with the least socioeconomic features and international exposure. Cluster 3 (16 countries) is intermediate in terms of COVID-19 prevalence (median of 2569.3 confirmed cases and 35.7 deaths per million) and features countries the least urbanized and geographically close to the equator, with intermediate international exposure and socioeconomic features. These findings shed light on the main features of COVID-19 prevalence in Africa and might help refine effectively coping management strategies of the ongoing pandemic. Supplementary information: The online version contains supplementary material available at 10.1007/s10668-022-02646-3.

Hybrid fuzzy clustering technique to enhance the performance based on a fusion of intuitionistic modified fuzzy c-means and improved genetic algorithm

Article

Dec 2023

In the modern era, there has been a sudden rise in data due to the wide use of the web, social networks and so on. It has become difficult to find the desired data in such fused, complex data. Many clustering algorithms have been proposed for mining the data, and it is very difficult to find a clustering method that is suitable for all types of datasets. The present study proposes a hybrid fuzzy clustering technique based on the fusion of intuitionistic modified fuzzy c-means and an improved genetic algorithm. This study overcomes the problem of high sensitivity toward initial centroids by using an improved genetic algorithm with a newly developed normalized crossover and mutation operator. The proposed algorithm reduces the effect of noise by developing a metric and resolves uncertainty in assigning membership values by using two negation functions. The strength of the proposed algorithm over existing methods in the literature is shown by testing it on 11 benchmark real-world and three artificial datasets. The experimental results indicate that the proposed algorithm outperforms the other tested algorithms.

Image Intelligence-Assisted Time-Series Analysis Method for Identifying “Dispersed, Disordered, and Polluting” Sites Based on Power Consumption Data

Chapter

Aug 2023

A novel effective method for identifying “Dispersed, Disordered, and Polluting” (DDP) sites was proposed for the purpose of promoting the modernization of ecological and environmental governance capabilities and building a big data application platform for enterprises’ pollution prevention. This paper aggregated data from the electricity consumption information collection system and characteristic sensing terminals, including user daily total electricity consumption records, peak and valley electricity consumption, and other information. Firstly, we used the hierarchical K-means algorithm to cluster the time series of user electricity consumption data. After ranking by cluster features, the electricity consumption time series of the selected suspicious users were encoded into Gramian angular field (GAF) images. Finally, we adopted the perceptual hash algorithm to build the model to identify “Dispersed, Disordered, and Polluting” sites. The case analysis results verified this method’s feasibility, rationality, and effectiveness.

Keywords: Power consumption data; Image intelligence; Perceptual hash algorithm; Pollution prevention

Range clustering: An algorithm for empirical evaluation of classical clustering algorithms

Conference Paper

Aug 2016

A fuzzy clustering technique for enhancing the convergence performance by using improved Fuzzy c-means and Particle Swarm Optimization algorithms

Article

Jul 2022
DATA KNOWL ENG

Fuzzy clustering is well established among clustering techniques in several real-world applications because it is easy to implement and produces satisfactory clustering results. However, it has some deficiencies, such as sensitivity to outliers and dependence of the result on the choice of initial centroids. To overcome these shortcomings of the FCM algorithm, this article introduces a robust clustering technique, particle swarm optimization improved fuzzy c-means, developed by hybridizing particle swarm optimization (PSO) and improved fuzzy c-means techniques to deal with noisy data and the initialization problem. In this article, a fuzzy clustering technique is developed to increase the convergence performance of clustering techniques. Fuzzy c-means is improved by developing a new metric to tolerate noisy environments. Particle swarm optimization has an inbuilt guidance strategy that lets each solution obtain useful information from better solutions and thereby improve. To handle the initialization problem of fuzzy c-means, the particle swarm optimization technique is used. PSO effectively enhances the performance of the improved FCM, increasing the effectiveness of clustering. The effectiveness of the proposed clustering technique over existing techniques in the literature is illustrated on eight real-world and three artificial data sets. The results show that the proposed algorithm generates encouraging results compared with established clustering techniques in the literature.

K-Means Clustering Versus Validation Measures: A Data-Distribution Perspective

Article

May 2009

K-means is a well-known and widely used partitional clustering method. While there are considerable research efforts to characterize the key features of the K-means clustering algorithm, further investigation is needed to understand how data distributions can have impact on the performance of K-means clustering. To that end, in this paper, we provide a formal and organized study of the effect of skewed data distributions on K-means clustering. Along this line, we first formally illustrate that K-means tends to produce clusters of relatively uniform size, even if input data have varied "true" cluster sizes. In addition, we show that some clustering validation measures, such as the entropy measure, may not capture this uniform effect and provide misleading information on the clustering performance. Viewed in this light, we provide the coefficient of variation (CV) as a necessary criterion to validate the clustering results. Our findings reveal that K-means tends to produce clusters in which the variations of cluster sizes, as measured by CV, are in a range of about 0.3-1.0. Specifically, for data sets with large variation in "true" cluster sizes (e.g., CV > 1.0), K-means reduces variation in resultant cluster sizes to less than 1.0. In contrast, for data sets with small variation in "true" cluster sizes (e.g., CV < 0.3), K-means increases variation in resultant cluster sizes to greater than 0.3. In other words, for the earlier two cases, K-means produces the clustering results which are away from the "true" cluster distributions.

Data Clustering: A Review

Article

Jan 1999

Cluster validity methods

Article

Jan 2002
SIGMOD Record

Cluster Analysis: Basic Concepts and Algorithms

Article

Jan 2005

K-means clustering versus validation measures

Conference Paper

Aug 2006

K-means is a widely used partitional clustering method. While there are considerable research efforts to characterize the key features of K-means clustering, further investigation is needed to reveal whether and how data distributions affect the performance of K-means clustering. In this paper, we revisit the K-means clustering problem by answering three questions. First, how do the "true" cluster sizes affect the performance of K-means clustering? Second, is entropy an algorithm-independent validation measure for K-means clustering? Finally, what is the distribution of the clustering results produced by K-means? To that end, we first illustrate that K-means tends to generate clusters with a relatively uniform distribution of cluster sizes. In addition, we show that the entropy measure, an external clustering validation measure, favors clustering algorithms that tend to reduce high variation in cluster sizes. Finally, our experimental results indicate that K-means tends to produce clusters in which the variation of the cluster sizes, as measured by the Coefficient of Variation (CV), is in a specific range, approximately from 0.3 to 1.0.

UCI Repository of Machine Learning Databases

Article

Jan 1998

Data clustering: A survey

Article

Jan 1999

Fuzzy Logic With Engineering Applications

Article

Jan 2009

Timothy J. Ross

This chapter summarizes only two popular methods of classification. The first is classification using equivalence relations. This approach makes use of certain special properties of equivalence relations and the concept of defuzzification known as lambda-cuts on the relations. The second method of classification is a very popular method known as fuzzy c-means (FCM), so named because of its close analog in the crisp world, hard c-means (HCM). This method uses concepts in n-dimensional Euclidean space to determine the geometric closeness of data points by assigning them to various clusters or classes and then determining the distance between the clusters. In the case of fuzzy relations, for all fuzzy equivalence relations, their λ-cuts are equivalent ordinary relations. Hence, to classify data points in the universe using fuzzy relations, we need to find the associated fuzzy equivalence relation.

Keywords: fuzzy logic; pattern clustering

Data Mining: Concepts and Techniques, Morgan Kaufmann

Article

Jan 2006

Some Methods for Classification and Analysis of Multivariate Observations

Conference Paper

Jan 1967

J.B. MacQueen
