Article

Clustering of interval data based on City-Block distances

Authors: Renata M. C. R. de Souza · Francisco de A. T. de Carvalho

Abstract

The recording of interval data has become a common practice with recent advances in database technologies. This paper introduces clustering methods for interval data based on the dynamic cluster algorithm and City-Block distances. Two methods are considered: one with adaptive distances and the other with non-adaptive distances.
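The dynamic cluster algorithm alternates two steps until convergence: a representation step that computes a prototype for each cluster, and an allocation step that reassigns each object to the cluster with the nearest prototype. Below is a minimal sketch of that loop for interval data under a non-adaptive city-block distance; the data layout, the function names, and the median-based prototype update are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def city_block(a, b):
    # a, b: (p, 2) arrays of [lower, upper] bounds, one row per variable.
    # City-block distance for intervals: sum of absolute differences of
    # the lower and upper bounds over all variables.
    return np.abs(a - b).sum()

def dynamic_cluster(X, k, n_iter=100, seed=0):
    # X: (n, p, 2) interval data; returns cluster labels and prototypes.
    rng = np.random.default_rng(seed)
    G = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Allocation step: assign each object to its nearest prototype.
        new = np.array([np.argmin([city_block(x, g) for g in G]) for x in X])
        if np.array_equal(new, labels):
            break
        labels = new
        # Representation step: the component-wise median of the bounds
        # minimizes the within-cluster city-block criterion.
        for j in range(k):
            if np.any(labels == j):
                G[j] = np.median(X[labels == j], axis=0)
    return labels, G
```

An adaptive variant would additionally re-estimate per-cluster weights on the variables inside the loop, letting the distance adjust to each cluster's shape.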


... In general, interval data can represent data analysis in a specific area, e.g., user data mining [30]. Interval data definitions and distance measures were briefly discussed, as were their applications in clustering [31,32]. Peng W et al. [33] (2006) proposed interval data clustering with applications. ...
... The first distance measure is termed the L1 norm, or city-block distance [31]. For interval representations it is the sum of the absolute values of the differences of the interval bounds over all variables: $d_1(x, y) = \sum_{j=1}^{p} \left( |x_j^L - y_j^L| + |x_j^U - y_j^U| \right)$. ...
... In this paper, an interval was defined by its lower and upper bounds. ...
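As a worked example of the city-block distance just defined (notation as reconstructed above, with $x_j^L$ and $x_j^U$ the lower and upper bounds of the $j$-th interval): for $x = ([1,3],[2,5])$ and $y = ([2,4],[0,1])$,

$$d_1(x, y) = |1-2| + |3-4| + |2-0| + |5-1| = 1 + 1 + 2 + 4 = 8.$$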
Article
Full-text available
Symbolic interval data analysis (SIDA) has been successfully applied in a wide range of fields, including finance, engineering, and environmental science, making it a valuable tool for incorporating the uncertainty and imprecision often present in real-world data. This paper proposed the interval improved fuzzy partitions fuzzy C-means (IIFPFCM) clustering algorithm, designed for fast convergence and combined independently with the Euclidean distance and the city-block distance. Both proposed methods converge faster than the traditional interval fuzzy C-means (IFCM) clustering method in SIDA, and they also perform better than IFCM on the problem of dividing symbolic interval data into large and small groups. In addition, since the traditional IFCM clustering method is affected by outliers, this paper also applied the IIFPFCM algorithm to deal with outliers from the perspective of interval distance measurement. Experimental comparative analysis showed that the proposed IIFPFCM clustering algorithm with the city-block distance measure is suitable for dealing with SIDA containing outliers. Finally, nine symbolic interval datasets were assessed in the experiments; the statistical results on convergence and efficiency revealed that the proposed algorithm performs better.
... However, it has attracted great interest from statisticians in recent years. Souza and Carvalho (2004) proposed a clustering method for intervals using the dynamic algorithm with squared Euclidean distance. From this result, Masson and Denoeux (2004) proposed a clustering approach based on interval-valued dissimilarities, tackled in the framework of the Dempster-Shafer theory. ...
... In comparison with other methods, as shown in Table 5, the CR index obtained from (De Souza et al. 2004) is the lowest at 0.5778, and the CR indexes of k-Means-H (the k-means algorithm using Hausdorff distance), k-Means-C (the k-means algorithm using city-block distance), k-Means-E (the k-means algorithm using Euclidean distance) and k-Means-O (the k-means algorithm using overlap distance) have the same value of 0.5921. Meanwhile, the CR indexes of the proposed algorithm and of (Hung et al. 2016) are the best, with CR = 1. ...
... The comparison between the proposed methods and the others is presented in Table 8, which again demonstrates the superiority of ACIG over the other methods. Through the two examples above and the application to image recognition with two data sets, it can be seen that the algorithm of Hung et al. (2016) achieves better performance than the algorithms of De Souza et al. (2004) and Souza and Carvalho (2004), but it is more suitable for data with a low overlapping degree (Example 1 and Example 2) than for complex objects, such as images. Using the Hausdorff, city-block and Euclidean distances, the proposed algorithm achieves quite good and stable results. ...
Article
Full-text available
This paper proposes an Automatic Clustering algorithm for Interval data using the Genetic algorithm (ACIG). In this algorithm, the overlapped distance between intervals is applied to determine the suitable number of clusters. Moreover, to optimize the clustering, we modify the Davies-Bouldin index and improve the crossover, mutation, and selection operators of the original genetic algorithm. The convergence of ACIG is theoretically proved and illustrated by numerical examples. ACIG can be implemented effectively by the established Matlab procedure. Through experiments on data sets with different characteristics, the proposed algorithm shows outstanding advantages in comparison to existing ones. Image recognition with the proposed algorithm demonstrates the potential of this research in real applications.
... A similarity measure is a real-valued function that quantifies the similarity between two objects. Several similarity measures have been proposed in the literature for interval-valued fuzzy sets [47][48][49] and fuzzy numbers [58][59][60]. The following similarity measures are chosen for their ease of producing accurate results. ...
... (9)-(12). Finally, with operational law (iv), we obtain the result from Eqs. (48) and (49). ...
... It is observed that $0 \le A, B, C, D \le 1$. So, using Eqs. (48), (49) and operational laws (i)-(iv), we get ...
Article
Full-text available
In group decision making, each expert’s background and the level of knowledge and ability differ, which makes the expert’s information inputs to the decision-making process heterogeneous. Such heterogeneity in the information can affect the outcome of the selection of the decision alternatives. This paper therefore attempts to partition the heterogeneous information into homogeneous groups to elicit similar (related) and dissimilar (unrelated) data using a clustering algorithm. We then develop an aggregation approach to gather the collective opinions from the homogeneous clusters to accurately model the decision problem in a group setting. The proposed aggregation approach, labeled as the generalized partitioned Bonferroni mean (GPBM), is studied to investigate the characteristics of the aggregation operator. Further, we extend the GPBM concept to an interval-valued fuzzy set context using the additive generators of the strict t-conorms and we develop two other new aggregation operators: the interval-valued GPBM (IVGPBM) and the weighted IVGPBM (WIVGPBM). We analyze the aggregation of fuzzy numbers by the IVGPBM operator using interval arithmetic involving α-cuts and the α-cut-based decomposition principle of fuzzy numbers. Two practical examples are presented to illustrate the applicability of these operators, and a comparison is conducted to highlight the effects of the confidence level and the sensitivity of the parameters chosen, analyzing the results with the parameterized strict t-conorm. Finally, we compare the experimental results of the proposed method with existing methods.
... Ezugwu et al. [21] conducted a systematic review of traditional and state-of-the-art clustering techniques for different domains. Moreover, some dissimilarity measures for interval data can be found in the literature concerning symbolic data (e.g., [16,[22][23][24][25][26][27][28]), and there are also some similarity measures that can deal with interval data (e.g., [17,[29][30][31]). Gordon [32] (p. ...
... Cluster 1: {2, 3, 4, 5, 6, 8, 11, 12, 15, 17, 18, 19, 22, 23, 29, 31}; Cluster 2: {0, 1, 7, 9, 10, 13, 14, 16, 20, 21, 24, 25, 26, 27, 28, 30, 33, 34, 35, ...}; Cluster 3: {32} (Figure 1). ...
Article
Full-text available
From the affinity coefficient between two discrete probability distributions proposed by Matusita, Bacelar-Nicolau introduced the affinity coefficient in a cluster analysis context and extended it to different types of data, including complex and heterogeneous data within the scope of symbolic data analysis (SDA). In this study, we refer to the most significant partitions obtained using the hierarchical cluster analysis (h.c.a.) of two well-known datasets taken from the literature on complex (symbolic) data analysis. The h.c.a. is based on the weighted generalized affinity coefficient for the case of interval data and on probabilistic aggregation criteria from a VL parametric family. To calculate the values of this coefficient, two alternative algorithms were used and compared. Both algorithms were able to detect clusters of macrodata (data aggregated into groups of interest) that were consistent and consonant with those reported in the literature, but one performed better than the other in some specific cases. Moreover, both approaches allow for the treatment of large microdatabases (non-aggregated data) after their transformation into macrodata.
... In light of its advantages, the CURE algorithm is adopted in this study as the basic algorithm for LSM modeling. However, the algorithm has some fundamental problems, including: 1) the use of a random approach to select representative points, which may result in incorrect clustering results; and 2) as in most traditional clustering algorithms, the use of the Euclidean distance to calculate distances between points, which cannot work well with uncertain data and can also affect the clustering results (de Souza and De Carvalho, 2004; Ren et al., 2009). ...
... Alternatively, the data can be represented using its midpoint (mp) and radius (r) as p = (mp, r), where mp = (a + b)/2 and r = (b − a)/2. Thus, to calculate the distance between data points in the CURE algorithm, the CIBD (de Souza and De Carvalho, 2004) is applied, which facilitates the successful processing of uncertain data, whereas the other data types are considered as special uncertain data with mp(p) = p and r(p) = 0. Hence, the CIBD method can be applied to attributes with different data types: continuous, discrete and uncertain attributes. ...
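The midpoint/radius representation above is easy to make concrete. Below is a small sketch (the function names and the final print are illustrative; the city-block distance is computed on the recovered bounds, following its usual definition for intervals):

```python
def to_mid_radius(lo, hi):
    # Represent the interval [lo, hi] by its midpoint and radius:
    # mp = (lo + hi) / 2, r = (hi - lo) / 2.
    return (lo + hi) / 2.0, (hi - lo) / 2.0

def from_mid_radius(mp, r):
    # Recover the bounds; a crisp value p is the special case r = 0.
    return mp - r, mp + r

def cibd(p, q):
    # City-block distance between two interval attributes given as
    # (midpoint, radius) pairs, computed on the recovered bounds:
    # |lo_p - lo_q| + |hi_p - hi_q|.
    lo_p, hi_p = from_mid_radius(*p)
    lo_q, hi_q = from_mid_radius(*q)
    return abs(lo_p - lo_q) + abs(hi_p - hi_q)

# A crisp attribute value 4.0 is the degenerate interval (mp=4.0, r=0),
# so continuous, discrete and uncertain attributes share one distance.
print(cibd(to_mid_radius(1.0, 3.0), (4.0, 0.0)))  # |1-4| + |3-4| = 4.0
```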
Article
Full-text available
Landslide susceptibility mapping (LSM) is a crucial step in landslide assessment and environmental management. Clustering algorithms can construct effective models for LSM. However, a random selection of important parameters, failure to consider uncertain data, noise data, and large datasets can limit the implementation of clustering in LSM, resulting in low and unreliable performance. Thus, to address these problems, this study proposed an optimized clustering algorithm named O-CURE, which combines the traditional Clustering Using REpresentatives algorithm (CURE), which is efficient for large datasets and noisy data, the partition influence weight (PIW)-based method to enhance the selection of sample sets, and the city-block distance (CIBD) for processing uncertain data in CURE clustering during LSM modeling. A database containing 293 landslide location samples, 213 non-landslide samples, and 7 landslide conditioning factors was prepared for the implementation and evaluation of the method. A multicollinearity analysis was also conducted to select the most appropriate factors, and all the factors were acceptable for modeling. Based on O-CURE, landslide density, and the partitioning around medoids (PAM) algorithm, a susceptibility map was constructed and classified into very high (33%), high (18%), moderate (24%), low (13%), and very low (12%) landslide susceptibility levels. To evaluate the performance of the O-CURE model, five statistical metrics including accuracy, sensitivity, specificity, kappa, and AUC were applied. The analysis shows that O-CURE obtained accuracy = 0.9368, sensitivity = 0.9215, specificity = 0.9577, kappa = 0.8496, and AUC = 0.896, an indication of high performance capability. The proposed method was also compared with the CURE algorithm, three existing clustering methods, and popular supervised learning methods. In this assessment, O-CURE outperformed the other clustering methods while showing significant and more consistent performance than the supervised learning methods. Therefore, we recommend that the O-CURE model and the constructed map can be useful in assessing landslides and contribute to sustainable land-use planning and environmental management in light of future disasters.
... Some similarity and dissimilarity measures for the case of symbolic data can be found, for instance, in Bock and Diday (2000). The recording of interval data has become a more frequent practice with the recent advances in database technologies (Souza and De Carvalho, 2004). Some dissimilarity measures for interval data can be found in the literature (see, for instance, ...
... Some dissimilarity measures for interval data can be found in the literature (see, for instance, Chavent and Lechevallier, 2002; Chavent et al., 2003; Souza and De Carvalho, 2004; De Carvalho et al., 2006a, 2006b), as well as some similarity measures which are capable of dealing with the particular case of interval data (e.g. Bacelar-Nicolau et al., 2009, 2014a, 2014b). ...
Article
Full-text available
Symbolic Data Analysis can be defined as the extension of standard data analysis to more complex data tables. We illustrate the application of the Ascendant Hierarchical Cluster Analysis (AHCA) to a symbolic data set (with a known structure) in the field of the automobile industry (car data set), in which objects are described by variables whose values are intervals of the real axis (interval variables). The AHCA of thirty-three car models, described by eight interval variables (with different scales of measure), was based on the standardized weighted generalized affinity coefficient, by the method of Wald and Wolfowitz. We applied three probabilistic aggregation criteria in the scope of the VL methodology (V for Validity, L for Linkage). Moreover, we compare the achieved results with those obtained by other authors.
... For large-scale data, it is not realistic to use the EDT algorithm to obtain the Euclidean distance value quickly and accurately. Therefore, if there is no strict requirement on the calculation precision, the distance transformation is usually accomplished by approximate algorithms, such as Manhattan distance [20], Chebyshev distance [21], and chamfer distance [22]. To ensure the rotation invariance and accuracy of Euclidean distance, a lot of efforts have been made to reduce the complexity of Euclidean distance calculation. ...
Preprint
Full-text available
Distance transform (DT) can accurately represent spatial information relationships, and it has been widely used in digital core analysis, computer image processing, pattern recognition, etc. However, for large-resolution 2D/3D images, the existing serial Euclidean distance transform (EDT) algorithm is time-consuming and memory-consuming. This paper proposes a parallel implementation of the fast Euclidean distance transform in a neighborhood based on data set decomposition, called Parallel Computing EDT (PCEDT). The distance transformation is completed efficiently and accurately by data segmentation, batch calculation, and result combination. Compared with other EDT algorithms, PCEDT is more efficient and more suitable in practical applications, especially for data volumes of 2000³ to 4000³ voxels.
... From the point of view of distance measures, the 13 possible relationships [1] can be reduced to only three relations: before, overlap and contain [2]. In spite of the difference between interval and numeric data, several distance measures used for numeric data are also used for interval data [2][3][4][5][6][7]. These measures may or may not consider the relationship between the intervals. ...
... Liu et al. [39] and Ullah et al. [40] gave the idea of normal T-SF aggregation operators for multi-attribute decision making. Garg et al. [41] gave the concept of power aggregation operators on T-SFs, while Wang and Ullah [42] proposed an uncertain linguistic MARCOS method in the T-SF environment. ...
Article
Full-text available
To handle problematic and ambiguous data, Schweizer and Sklar introduced a parameter p in 1960, which helped to develop the theory of the SS t-norm (SSTN) and t-conorm (SSTCN). The parameter p=-1.1 can be used to easily derive the information of the Hamacher and Lukasiewicz t-norms. Furthermore, prioritized aggregation operators (PAOs) choose which data will be collected into a singleton set. The main contribution of this work is the construction of new aggregation operators for T-spherical fuzzy (T-SF) information based on the SS t-norm and t-conorm. Moreover, the fundamental characteristics of the operators are identified. Further, we developed MADM (Multi-Attribute Decision-Making) models and deduced several useful properties of the operators T-SFSSPA, T-SFSSWPA, T-SFSSPG, and T-SFSSWPG. Finally, using an actual case study, we concluded that, in comparison to the ground-breaking and current methods, the proposed MADM algorithm performs noticeably better than the operators in place for resolving the water recycling problem, in a way that is easy to understand.
... In the case of this study, a city-block metric was selected. The city-block metric relies on the sum of absolute differences [42]. Each centroid is the component-wise median of the points in that cluster. ...
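The centroid claim in this snippet is quick to verify numerically; a tiny check (numpy, illustrative data) is:

```python
import numpy as np

# Under the city-block (L1) metric, the point minimizing the sum of
# absolute deviations in each coordinate is the component-wise median.
cluster = np.array([[1.0, 10.0],
                    [2.0, 12.0],
                    [9.0, 11.0]])
centroid = np.median(cluster, axis=0)
print(centroid)  # [ 2. 11.]
```

This coordinate-wise optimality of the median is why k-medians-style algorithms pair naturally with the city-block metric.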
Article
Full-text available
In recent decades, an increasing number of studies in psychophysiology and, in general, in clinical medicine have employed the technique of facial thermal infrared imaging (IRI), which allows information about the emotional and physical states of subjects to be obtained in a completely non-invasive and contactless fashion. Several regions of interest (ROIs) have been reported in the literature as salient areas for the psychophysiological characterization of a subject (e.g. the nose tip and glabella ROIs). There is, however, a lack of studies focusing on the functional correlation among these ROIs and on the physiological basis of the relation existing between thermal IRI and vital signals, such as the electrodermal activity, i.e. the galvanic skin response (GSR). The present study offers a new methodology able to assess the functional connection between salient seed ROIs of thermal IRI and all the pixels of the face. The same approach was also applied considering as seed signal the GSR and its phasic and tonic components. Seed correlation analysis on 63 healthy volunteers demonstrated the presence of a common pathway regulating the facial thermal functionality and the electrodermal activity. The procedure was also tested on a pathological case study, finding a completely different pattern compared to the healthy cases. The method represents a promising tool in neurology, physiology and applied neurosciences.
... For instance, Lingras and West [25] propose a version of the k-means algorithm to develop interval clusters of web visitors using rough set theory. In the same year, de Souza and de Carvalho [11] introduce some methods for clustering with interval data. Those methods are extensions of the standard dynamic cluster method. ...
Article
Clustering algorithms create groups of objects based on their similarity. As objects are usually defined by data points, this similarity is commonly measured by a distance function. When the objects are defined by variables that are intervals, it is more difficult to determine how to measure the similarity between the objects of the dataset. In this work, we propose some similarity measures between intervals based on average embedding functions. Using these, new similarity measures between interval-based objects are proposed. All the proposed similarities are based on measuring the similarity between the objects variable by variable and then averaging the obtained results to get a single value. By definition, the objects can be considered as interval-valued fuzzy sets (IVFS), so the similarities introduced are proved to be valid similarities for IVFS. The proposed measures are used in a hierarchical clustering algorithm with the aim of grouping the objects of the dataset into different clusters based on their similarity as interval-valued data. The described process is applied to real data regarding the Spanish weather in order to cluster the provinces of Spain based on the interval temperature of each month in 2021, showing different results from the ones obtained using non-interval-valued data.
... From the point of view of distance measures, the 13 possible relationships [1] can be reduced to only three relations: before, overlap and contain [2]. In spite of the difference between interval and numeric data, several distance measures used for numeric data are also used for interval data [2][3][4][5][6][7]. These measures may or may not consider the relationship between the intervals. ...
Article
Full-text available
Interval data, a special case of symbolic data, are becoming more and more frequent in different fields of application, including Data Mining. Measuring the dissimilarity or similarity between two intervals is an important task in Data Mining. In this paper, an analysis has been carried out of ten desirable properties that should be fulfilled by measures for interval data to make them suitable for applications like clustering and classification. It has also been verified whether these properties are satisfied by three existing measures (the L1-norm, L2-norm and L∞-norm), and a new dissimilarity measure for interval data has been proposed. The performance of all the existing distance measures is compared with the proposed measure by applying the well-known k-means algorithm on 6 interval datasets. The proposed measure gives better clustering accuracy than the existing measures on most of the datasets.
... The ISODATA algorithm with the L1-norm was introduced in (Jajuga, 1987). In the paper (de Souza and de Carvalho, 2004), the authors introduce adaptive and non-adaptive clustering algorithms using the L1-norm. ...
Article
A new version of the k-medians algorithm, the adaptive k-medians algorithm, is introduced to solve clustering problems with the similarity measure defined using the L1-norm. The proposed algorithm first calculates the center of the whole data set as its median. To solve the k-clustering problem (k > 1), we formulate an auxiliary clustering problem to generate a set of starting points for the k-th cluster center. Then the k-medians algorithm is applied, starting from the previous (k-1) cluster centers together with each point from the set of starting points, to solve the k-clustering problem. The solution with the least value of the clustering function is accepted as the solution to the k-clustering problem. We evaluate the performance of the adaptive k-medians algorithm and compare it with other clustering algorithms using 8 real-world data sets.
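A compact sketch of this incremental scheme follows. The paper's auxiliary clustering problem for generating starting points is replaced here by a simple farthest-point heuristic, an assumption made only to keep the sketch short; everything else (median prototypes, L1 assignments, keeping the least-cost solution) follows the description above.

```python
import numpy as np

def l1(X, c):
    # L1 (city-block) distance from every row of X to center c.
    return np.abs(X - c).sum(axis=-1)

def k_medians(X, centers, n_iter=100):
    # Standard Lloyd-style k-medians under the L1 norm.
    C = centers.astype(float)
    for _ in range(n_iter):
        labels = np.argmin([l1(X, c) for c in C], axis=0)
        newC = np.array([np.median(X[labels == j], axis=0)
                         if np.any(labels == j) else C[j]
                         for j in range(len(C))])
        if np.allclose(newC, C):
            break
        C = newC
    cost = sum(l1(X[labels == j], C[j]).sum() for j in range(len(C)))
    return C, cost

def incremental_k_medians(X, k):
    # Start from the median of the whole data set, then add one center
    # at a time, trying several starting points for the new center and
    # keeping the solution with the least clustering-function value.
    C = np.median(X, axis=0)[None, :]
    for _ in range(1, k):
        d = np.min([l1(X, c) for c in C], axis=0)
        starts = X[np.argsort(d)[-3:]]   # far-away points as candidate starts
        trials = [k_medians(X, np.vstack([C, s[None, :]])) for s in starts]
        C, _ = min(trials, key=lambda t: t[1])
    return C
```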
... [2] for a survey). Indeed, data can be transactional, sequential, trees, graphs, texts, or even of a symbolic nature [6,7,10]. This last kind of data is particularly suitable for modeling complex and heterogeneous objects usually described by a set of multivalued variables of different types (e.g. ...
... In many applications, we seek to cluster data that is non-Gaussian. In the literature, most approaches do this using distance metrics other than the Euclidean distance (Choi et al., 2010; Fowlkes and Mallows, 1983; de Souza and De Carvalho, 2004). Some use losses based on the exponential family or deviances closely related to Bregman divergences (Banerjee et al., 2005). ...
Article
Full-text available
In mixed multi-view data, multiple sets of diverse features are measured on the same set of samples. By integrating all available data sources, we seek to discover common group structure among the samples that may be hidden in individualistic cluster analyses of a single data view. While several techniques for such integrative clustering have been explored, we propose and develop a convex formalization that enjoys strong empirical performance and inherits the mathematical properties of increasingly popular convex clustering methods. Specifically, our Integrative Generalized Convex Clustering Optimization (iGecco) method employs different convex distances, losses, or divergences for each of the different data views with a joint convex fusion penalty that leads to common groups. Additionally, integrating mixed multi-view data is often challenging when each data source is high-dimensional. To perform feature selection in such scenarios, we develop an adaptive shifted group-lasso penalty that selects features by shrinking them towards their loss-specific centers. Our so-called iGecco+ approach selects features from each data view that are best for determining the groups, often leading to improved integrative clustering. To solve our problem, we develop a new type of generalized multi-block ADMM algorithm using sub-problem approximations that more efficiently fits our model for big data sets. Through a series of numerical experiments and real data examples on text mining and genomics, we show that iGecco+ achieves superior empirical performance for high-dimensional mixed multi-view data.
... From the perspective of clustering, interval data are more challenging than crisp numbers [35]. Lu et al. [36] proposed a fuzzy clustering algorithm for interval data based on the Gaussian distribution, and D'Urso et al. [37] proposed a fuzzy c-ordered medoids clustering algorithm for interval data. ...
Article
Customer segmentation refers to dividing customer groups into multiple different sub-communities according to customer characteristics. The accurate segmentation of customers is critical for decision-makers to fully understand the customer requirements in the market and then design market activities that satisfy customers, which will impact customer loyalty. In past studies, clustering algorithms have been widely used to solve customer segmentation. However, it is still difficult to divide customers clearly when facing real customer requirement data (CRD). To overcome these difficulties, this paper develops a heuristic clustering method for customer segmentation, termed Gaussian Peak Heuristic Clustering (GPHC). Specifically, this paper utilizes the entropy method and the standardized Gaussian distribution to filter and model interval customer requirement data. Then, the customer preference pattern hidden in the CRD is recognized by a niching genetic algorithm and hierarchical clustering. Finally, the CRD clustering result is obtained by the k-means algorithm based on heuristic information from the customer preference patterns, from which the customer segmentation can be extracted. A practical case is used to illustrate the effectiveness of GPHC in solving customer segmentation problems. The experimental results show that the customer segmentation produced by the proposed method is consistent with the segmentation given by experts. Moreover, the robustness of GPHC in the face of complex customer segmentation scenarios has been verified through numerical experiments.
... Souza defined the dissimilarity measure named the city-block distance; this measure is an appropriate extension of the Minkowski distance metric to interval-valued data $\tilde{x}$ and $\tilde{y}$ [32]: $d(\tilde{x}, \tilde{y}) = \sum_{j=1}^{p} \left( |x_j^L - y_j^L| + |x_j^U - y_j^U| \right)$, with $x_j^L, x_j^U$ (resp. $y_j^L, y_j^U$) the lower and upper bounds of the $j$-th interval. ...
Article
Full-text available
In the process of land cover segmentation of remote sensing images, there are uncertainties such as "significant difference in class density", "different body with same spectrum" and "same body with different spectrum". Existing fuzzy c-means clustering is not sufficient to describe high-order fuzzy uncertainty and cannot achieve accurate segmentation. Type-2 fuzzy sets can effectively handle inter-class multiple uncertainties, and the clustering algorithm can effectively suppress the noise of remote sensing images by embedding local information. Therefore, on the basis of integrating local information, this paper proposes a robust single-fuzzifier interval type-2 fuzzy local C-means clustering based on adaptive interval-valued data for land cover segmentation. Firstly, interval-valued data modeling is performed on the remote sensing data: remote sensing features are represented as interval-valued vectors, and a robust interval-valued distance measure that maximizes the distance between interval-valued numbers is used to generate an interval type-2 fuzzy set through robust fuzzy clustering. Secondly, this paper adopts an efficient type-reduction method to seek equivalent type-1 fuzzy sets adaptively, and realizes the segmentation of land cover by the criterion of maximum type-1 fuzzy membership. Test results on multi-spectral remote sensing images show that the segmentation performance of the proposed algorithm outperforms existing state-of-the-art adaptive interval type-2 fuzzy clustering algorithms, and it is beneficial to remote sensing image interpretation.
... The X-means algorithm introduced in [33] can use various distance functions as similarity measures. In the paper [39], the authors introduce adaptive and non-adaptive clustering methods using the function $d_1$. A one-dimensional center-based L1-clustering algorithm is proposed in [37]. ...
... This paper is focused on agglomerative methods in the context of cluster analysis and on a symbolic data table where each cell contains an interval of the real axis. Some dissimilarity measures for interval data have been reported in the literature (e.g., Chavent and Lechevallier, 2002; Chavent et al., 2003; Souza and De Carvalho, 2004; De Carvalho et al., 2006a, 2006b, 2007), as well as some similarity measures which allow us to deal with this type of data (e.g., Guru et al., 2004; Bacelar-Nicolau et al., 2009, 2014a, 2014b; Sousa et al., 2010, 2013a, 2015). ...
Article
Full-text available
In this paper, we compare the best partitions of data units (cities) obtained from different algorithms of Ascendant Hierarchical Cluster Analysis (AHCA) of a well-known data set from the literature on symbolic data analysis (the "city temperature interval data set") with an a priori partition of the cities given by a panel of human observers. The AHCA was based on the weighted generalized affinity coefficient with equal weights, and on the probabilistic coefficient associated with the asymptotic standardized weighted generalized affinity coefficient by the method of Wald and Wolfowitz. These similarity coefficients between elements were combined with three aggregation criteria: one classical, Single Linkage (SL), and two probabilistic ones, AV1 and AVB, the latter in the scope of the VL methodology. The evaluation of the partitions, in order to find the partitioning that best fits the underlying data, was carried out using validation measures based on the similarity matrices. In general, satisfactory global results were obtained with our methods, with the best partitions being quite close to (or even coinciding with) the a priori partition provided by the panel of human observers.
... In many applications, we seek to cluster data that is non-Gaussian. In the literature, most approaches do this using distance metrics other than the Euclidean distance (Choi et al. 2010; Fowlkes and Mallows 1983; de Souza and De Carvalho 2004). Some use losses based on the exponential family or deviances closely related to Bregman divergences (Banerjee et al. 2005). ...
Preprint
In mixed multi-view data, multiple sets of diverse features are measured on the same set of samples. By integrating all available data sources, we seek to discover common group structure among the samples that may be hidden in individualistic cluster analyses of a single data-view. While several techniques for such integrative clustering have been explored, we propose and develop a convex formalization that will inherit the strong statistical, mathematical and empirical properties of increasingly popular convex clustering methods. Specifically, our Integrative Generalized Convex Clustering Optimization (iGecco) method employs different convex distances, losses, or divergences for each of the different data views with a joint convex fusion penalty that leads to common groups. Additionally, integrating mixed multi-view data is often challenging when each data source is high-dimensional. To perform feature selection in such scenarios, we develop an adaptive shifted group-lasso penalty that selects features by shrinking them towards their loss-specific centers. Our so-called iGecco+ approach selects features from each data-view that are best for determining the groups, often leading to improved integrative clustering. To fit our model, we develop a new type of generalized multi-block ADMM algorithm using sub-problem approximations that more efficiently fits our model for big data sets. Through a series of numerical experiments and real data examples on text mining and genomics, we show that iGecco+ achieves superior empirical performance for high-dimensional mixed multi-view data.
... According to the Manhattan distance (also known as the city-block distance), the difference between two points is calculated by summing the absolute differences between the corresponding vector values, as seen in Eq. (7) [15,16]: $d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$ ...
Article
Full-text available
Clustering is an important stage for many data mining applications. In this process, data elements are grouped according to their similarities. One of the best-known clustering algorithms is the k-means algorithm. The algorithm initially requires the number of clusters as a parameter and runs iteratively. Many remote sensing image processing applications need a clustering stage, like many other image processing applications. Remote sensing images provide more information about the environment with the development of multispectral sensor and laser technologies. In the dataset used in this paper, the infrared (IR) values and the digital surface maps (DSM) are supplied in addition to the red (R), green (G), and blue (B) color values of the pixels. However, remote sensing images come in very large sizes (6000 × 6000 pixels for each image in the dataset used). Clustering these large images using their multiattributes directly consumes too much time. In the literature, some studies are available to accelerate the k-means algorithm. One of them is the normalized distance value (NDV)-based fast k-means algorithm, which benefits from the speed of the histogram-based approach and uses the multiattributes of the pixels. In this paper, we evaluated the effects of these attributes on the correctness of the clustering process with different color space transformations and distance measurements. We give the success results as peak signal-to-noise ratio and structural similarity index values using two different types of reference data (the source images and the ground-truth images) separately. Finally, we give accuracy-based results for evaluating both the success of the clustering outputs and the reliability of the NDV-based measurement methods presented in this paper.
... Urban blocks have received increasing attention in urban research [33][34][35]. This paper selected the traffic community as the research unit based on the concept of urban block. ...
Article
Full-text available
Along with the rapid development of China’s economy as well as the continuing urbanization, the internal spatial and functional structures of cities within this country are also gradually changing and restructuring. The study of functional region identification of a city is of great significance to the city’s functional cognition, spatial planning, economic development, human livability, and so forth. Backed by the emerging urban Big Data, and taking the traffic community as the smallest research unit, a method is proposed to identify urban functional regions by combining floating car track data with point of interest (POI) data recorded on an electronic map. It provides a new perspective for the study of urban functional region identification. Firstly, the main functional regions of the city studied are identified through clustering analysis according to the passenger’s spatial-temporal travel characteristics derived from the floating car data. Secondly, the fine-grained identification of the functional region attributes of the traffic communities is achieved using the label information from POI data. Finally, the AND-OR operation is performed on the recognition results derived by the clustering algorithm and the Delphi method, to obtain the identification of urban functional regions. This approach is verified by applying it to the main urban zone within Chengdu’s Third Ring Road. The results show that: (1) There are fewer single functional regions and more mixed functional regions in the main urban zone of Chengdu, and the distribution of the functional regions are roughly concentric centering in the city center. (2) Using the traffic community as a research unit, combined with dynamic human activity trajectory data and static urban interest point data, complex urban functional regions can be effectively identified.
... Biospectra analysis has been shown to be critical in mining pharmacology datasets as well as predicting possible adverse drug effects based on profile similarity with drug-like molecules known for adverse reactions [84]. Similarity between molecules is measured using the Tanimoto similarity coefficient [86], cosine correlation [87], Euclidean distance [88] or city block distance [89]. ...
Article
Full-text available
Drug-like compounds are often denied approval and use owing to unexpected clinical side effects and cross-reactivity observed during clinical trials. These unexpected outcomes, which significantly increase the attrition rate, center on the selected drug targets. These targets may be disease candidate proteins or genes, biological pathways, disease-associated microRNAs, disease-related biomarkers, abnormal molecular phenotypes, crucial nodes of biological networks, or molecular functions. This is generally linked to several factors, including incomplete knowledge of the drug targets and unpredicted pharmacokinetic expressions upon target interaction, or off-target effects. A method to identify targets, especially for polygenic diseases, is essential; the identification and validation of drug targets of interest is the fundamental stage and constitutes a major bottleneck in drug development for further downstream processes. Thus, various computational methods have been developed to complement experimental approaches in drug discovery. Here, we present an overview of various computational methods and tools applied in predicting or validating drug targets and drug-like molecules. We provide an overview of their advantages and compare these methods to identify the effective ones, which likely lead to optimal results. We also explore major sources of drug failure, considering the challenges and opportunities involved. This review may guide researchers in selecting the most efficient approach or technique during the computational drug discovery process.
Article
Along with the abundant appearance of interval-valued time series (ITS), the study of ITS clustering, especially shape-based ITS clustering, is becoming increasingly important. As an effective approach to extracting trend information from time series, fuzzy trend-granulation addresses the needs of shape-based ITS clustering. However, when extracting trend information from ITS, granules of unequal size are inevitably produced, which makes ITS clustering difficult and challenging. Facing this issue, this paper aims to generalize the widely used Fuzzy C-Means (FCM) algorithm to a fuzzy trend-granulation-based FCM algorithm for ITS clustering. To this end, a suite of algorithms, including ITS segmenting, segment merging, and granule building algorithms, is first developed for the fuzzy trend-granulation of ITS, with which the given ITS are transformed into granular ITS that consist of double linear fuzzy information granules (DLFIGs) and may be of different lengths. With the defined distance between DLFIGs, the distance between granular ITS is further developed through the dynamic time warping (DTW) algorithm. In designing the fuzzy trend-granulation-based FCM algorithm, the key step is to design the method for updating cluster prototypes to cope with the unequal lengths of granular ITS. The weighted DTW barycenter averaging (wDBA) method is a previously adopted prototype-updating approach, but it hardly changes the lengths of the prototypes, which often makes them less representative. Thus, a granule splitting and merging algorithm is designed to resolve this issue. Additionally, a prototype initialization method is proposed to improve the clustering performance. The proposed fuzzy trend-granulation-based FCM algorithm for clustering ITS, being a typical shape-based clustering algorithm, exhibits superior performance, as validated by ablation experiments as well as comparative experiments.
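Since the comparison of granular series of unequal lengths above rests on dynamic time warping, a minimal classic DTW sketch may be useful. The pointwise distance here is a plain absolute difference standing in for the paper's distance between DLFIGs, which is an assumption for illustration only.

```python
import numpy as np

def dtw(s, t, dist=lambda a, b: abs(a - b)):
    # Classic dynamic time warping between sequences of possibly
    # different lengths; D[i, j] is the cheapest cumulative cost of
    # aligning the first i elements of s with the first j of t.
    n, m = len(s), len(t)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = dist(s[i - 1], t[j - 1])
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

print(dtw([1, 2, 3], [1, 2, 2, 3]))  # 0.0: warping absorbs the repeated 2
```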
Article
Millions of papers are submitted and published every year, but researchers often do not have much information about the journals that interest them. In this paper, we introduced the first dynamical clustering algorithm for symbolic polygonal data and applied it to build scientific journal profiles. Dynamic clustering algorithms are a family of iterative two-step relocation algorithms involving the construction of clusters at each iteration and the identification of a suitable representation or prototype (means, axes, probability laws, groups of elements, etc.) for each cluster by locally optimizing an adequacy criterion that measures the fit between clusters and their corresponding prototypes. The application gives a powerful vision for understanding the main variables that describe journals. Symbolic polygonal data can summarize extensive datasets while taking variability into account. In addition, we developed cluster and partition interpretation indices for polygonal data that can extract insights about clustering results. From these indices, we discovered, e.g., that the number of difficult words in abstracts is fundamental to building journal profiles.
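The adequacy criterion mentioned above is typically a within-cluster sum of dissimilarities between objects and prototypes. A common generic form, stated here as an assumption since the paper's polygonal criterion is not reproduced in this excerpt, is

$$W(P, L) = \sum_{k=1}^{K} \sum_{i \in C_k} d(x_i, g_k),$$

where $P = (C_1, \dots, C_K)$ is the partition, $L = (g_1, \dots, g_K)$ the corresponding prototypes, and $d$ the chosen dissimilarity; alternating the representation and allocation steps never increases $W$, which is what guarantees the convergence of dynamic cluster algorithms.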
Article
Land cover classification of remote sensing images is faced with uncertainties such as "significant difference in category density", "the same object with different spectra", and "different objects with the same spectrum". Existing possibilistic fuzzy clustering-related methods still cannot fully meet the interpretation requirements of remote sensing data. To accurately classify remote sensing images with complex geographical distributions, this paper proposes a robust interval type-2 dual-distance-driven possibilistic fuzzy clustering motivated by interval-valued numbers for land cover classification. Firstly, an interval-valued data model is established by using the local mean and variance of the single-valued data. Secondly, the Hausdorff distance and the MW distance for interval-valued numbers are introduced to realize the maximum separability of overlapping categories in interval-valued data and to generate interval uncertainty sets. To further enhance its robustness, weighted local information factors are constructed by making full use of the generated fuzzy membership and possibilistic typicality, and a novel robust interval type-2 possibilistic fuzzy C-means clustering is proposed. Finally, the adaptive type-reduction method is introduced, and the adaptive expansion of the interval-valued data model is realized by using an adaptive contraction-expansion control factor. After the iterative algorithm converges, the accurate classification of covered objects in remote sensing images is realized. Experimental results indicate that the classification performance of the proposed algorithm is better than that of existing interval type-2 fuzzy clustering algorithms and their variants, and it is more suitable for the interpretation of remote sensing images in actual environments.
Chapter
Full-text available
Dynamic cluster analysis is understood as the partitioning of spatio-temporal objects into groups which can change their composition over time. The aim of the paper is to propose a measure of cluster stability for such groups. The measure takes values from the [0; 1] interval. Some of its characteristics are discussed on the basis of a simple example. The distribution of the measure under random membership is estimated by Monte Carlo simulation. A real example from the analysis of European Union countries over a 15-year period is also provided. Keywords: Dynamic cluster analysis; Spatio-temporal analysis; Cluster stability; EU countries.
Article
Full-text available
Due to the uncertainty contained in observed real-valued time series, the aim of this paper is to explore interval implicitization in real-valued time series classification problems. A novel real-valued time series classification method under the transformed implicit interval-valued data environment is developed, namely 1NN-IDTW. To do this, by utilizing the ARIMA model, real-valued time series are first converted in parallel to interval-valued time series. Then, the integration of the explored interval implicitization process, the Dynamic Time Warping algorithm, and the simple nearest-neighbor classifier is proposed. In the numerical experiments, the developed 1NN-IDTW is first directly applied to 16 randomly selected real-world datasets from the UCR time series archive for time series classification. The explored interval implicitization process is also integrated with different classification models, so as to verify its performance. The results indicate that our developed model outperforms 6 baselines on 13 datasets. Furthermore, compared with existing time series classification methods, the integration of interval implicitization can improve the prediction accuracy by more than 10%.
Article
This article builds the fuzzy clustering algorithm for interval data (FCAI). In the proposed algorithm, we use the overlap distance as a criterion to cluster interval data. The FCAI can determine not only the suitable number of clusters and the elements in each cluster but also, at the same time, the probability of assigning the elements to the established clusters. In addition, we consider the convergence of the proposed algorithm theoretically and illustrate it with numerical examples. The FCAI applies well to image recognition, a problem with many challenges nowadays. Using Grey Level Co-occurrence Matrices (GLCMs), we propose a novel texture extraction approach to generate featured intervals. The complex computations of the FCAI can be performed conveniently and efficiently by the established Matlab program. We utilize the corrected Rand index (CR) to find the suitable number of clusters, while the partition entropy (PE) and partition coefficient (PC) are applied to assess the quality of the fuzzy clusters. As a result, the experiments on data sets having different characteristics and numbers of elements show the reasonableness of the proposed algorithm and its advantages in comparison to existing ones. To the best of our knowledge, it has also shown potential for real applications.
Article
In the process of land cover classification, existing fuzzy clustering is not enough to describe high-order fuzzy uncertainty, while possibilistic clustering suffers from serious parameter dependence and cluster consistency problems, which makes them unable to effectively deal with the phenomena of "the same object with different spectra" and "different objects with the same spectrum". Hence, this paper proposes a robust type-2 possibilistic C-means clustering with local information and an interval-valued data model for remote sensing land cover classification. Firstly, according to the local neighborhood variance, the remote sensing information is modeled as interval-valued data. Secondly, existing possibilistic C-means clustering is modified to obtain an enhanced possibilistic C-means clustering. This is then used to cluster the interval-valued data, and a single-weighting-exponent type-2 possibilistic C-means clustering with double distance measures is constructed. Finally, to further improve the robustness of the clustering method, local information is embedded into the objective function of this enhanced type-2 possibilistic C-means clustering, and a novel robust possibilistic clustering-related algorithm for remote sensing information classification is proposed. Experimental results show that the proposed algorithm outperforms existing state-of-the-art type-2 clustering-related algorithms and is of great significance for the interpretation of remote sensing images.
Article
Full-text available
This article proposes the genetic algorithm for the fuzzy clustering problem for interval values (IGI). In this algorithm, we use the overlap divergence to assess the similarity of the intervals and take the new index (IDB) as the objective function to build the IGI. The crossover and selection operators in the IGI are modified to optimize the clustering results. The IGI not only determines the suitable number of groups and optimizes the result of clustering but also finds the probability of assigning the elements to the established clusters. The proposed algorithm is also applied to image recognition. The convergence of the IGI is considered and illustrated by numerical examples. The complex computations of the IGI are performed conveniently and efficiently by the built Matlab program. The experiments on data sets having different characteristics and elements show the reasonableness of the IGI and its advantages over other algorithms.
Preprint
Full-text available
Landslide susceptibility mapping (LSM) is one of the crucial steps in managing and mitigating landslides. This study targets developing a new LSM model for Baota District in China, using a new clustering method named CIBD-CURE, which combines the city-block distance (CIBD) and the traditional clustering using representatives (CURE) algorithm. It aims to address several limitations that affect the results of traditional clustering algorithms in LSM: the inability to identify clusters (subclasses) with arbitrary shapes and varying sizes, sensitivity to noise, the inability to perform well in large study areas with large datasets, and the inability to process rainfall (uncertain) data. The CIBD was introduced into the CURE algorithm for processing uncertain data; CIBD-CURE then partitioned the mapping units into arbitrarily shaped and sized subclasses with respect to their underlying geology and topography characteristics, and handled noise successfully. Furthermore, the LEPAM method was proposed to sort the subclasses into five landslide susceptibility levels. Finally, standard statistical measures were applied to evaluate the model's performance and compare it against the CURE, AHC-OLID, HC and KPSO clustering models, along with the DTU and NBU classification models. The result analysis showed that the proposed model attained higher performance. This LSM study will be useful for landslide management strategies, not only for this study area but also for other affected areas around the world.
Article
In several applications, data information is obtained in the form of intervals, such as the monthly temperature in a meteorological station or daily pollution levels in different locations. This paper proposes partitioning clustering algorithms for interval-valued data based on adaptive Euclidean and City-Block distances. Since some boundary variables may be more relevant for the clustering process, the proposals consider the joint weights of the relevance of the lower and upper boundaries of the interval-valued variables. Consequently, clusters of different shapes and sizes in some subspaces of the variables, even in specific boundaries of the interval-valued data, can be recognized. In addition, robust dissimilarity functions were introduced to reduce the influence of outliers in the data. The adaptive distances change at each iteration of the algorithms and can be different from one cluster to another. The methods optimize an objective function by alternating three steps for obtaining the representatives of each group, the cluster partition, and the relevance weights for the interval-valued variables. Experiments on synthetic and real data sets corroborate the robustness and usefulness of the proposed adaptive clustering methods.
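Since the abstract above describes joint relevance weights for the lower and upper boundaries, one plausible form of such an adaptive city-block distance is written out below. The notation and the product-to-one constraint are assumptions in the usual style of adaptive-distance clustering, not a quotation of the paper's exact criterion:

$$d_k(x_i, g_k) = \sum_{j=1}^{p} \left( \lambda_{kj}^{L}\, |x_{ij}^{L} - g_{kj}^{L}| + \lambda_{kj}^{U}\, |x_{ij}^{U} - g_{kj}^{U}| \right), \qquad \prod_{j=1}^{p} \lambda_{kj}^{L} \lambda_{kj}^{U} = 1,$$

where $x_{ij}^{L}, x_{ij}^{U}$ are the lower and upper bounds of variable $j$ for object $i$, $g_k$ is the prototype of cluster $k$, and the weights $\lambda_{kj}^{L}, \lambda_{kj}^{U}$ are re-estimated at each iteration and may differ from one cluster to another.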
Article
The use of the overlap distance to build a fuzzy cluster analysis algorithm for interval data is proposed in this article, in which the number of clusters, the specific elements in each cluster, and the probability of each element belonging to a cluster are determined simultaneously. The proposed algorithm is presented step by step theoretically and illustrated by a numerical example. The study also considers extracting the textural features of images into two-dimensional intervals for recognition and applies the proposed algorithm to them. The numerical example and the application show the advantages of the proposed algorithm over many currently popular algorithms, as measured by statistical parameters.
Chapter
This article proposes a novel technique for unsupervised learning in image recognition using an automatic fuzzy clustering algorithm (AFCA) for discrete data. There are two main stages to recognize images in this study. First, a new technique is shown to extract sixty-four textural features from n images, represented by an n × 64 matrix. Afterwards, we use the proposed method based on the Hausdorff distance to simultaneously determine the appropriate number of clusters. At the end of the unsupervised clustering process, discrete data belonging to the same cluster converge to the same position, which represents the cluster's center. After determining the number of clusters, we obtain the probability of assigning objects to the established clusters. The simulation results, built with a Matlab program, show the effectiveness of the proposed method using the corrected Rand, partition entropy, and partition coefficient indexes. The experimental outcomes illustrate that the proposed method is better than existing ones such as fuzzy C-means. As a result, we believe that the proposed method has strong potential for practical applications.
Article
Both regression modeling and clustering methodologies have been extensively studied as separate techniques. There has been some activity in using regression-based algorithms to partition a data set into clusters for classical data; we propose one such algorithm to cluster interval-valued data. The new algorithm is based on the k-means algorithm of MacQueen (1967) and the dynamical partitioning method of Diday and Simon (1976), with the partitioning criteria being based on establishing regression models for each sub-cluster. This also depends on distance measures between the underlying regression models for each sub-cluster. Several types of simulated data sets are generated for several different data structures. The proposed k-regressions algorithm consistently out-performs the k-means algorithm. Elbow plots are used to identify the total number of clusters K in the partition. The new method is also applied to real data.
Preprint
Empirical Bayes methods have been around for a long time and have a wide range of applications. These methods provide a way in which historical data can be aggregated to provide estimates of the posterior mean. This thesis revisits some of the empirical Bayesian methods and develops new applications. We first look at a linear empirical Bayes estimator and apply it on ranking and symbolic data. Next, we consider Tweedie's formula and show how it can be applied to analyze a microarray dataset. The application of the formula is simplified with the Pearson system of distributions. Saddlepoint approximations enable us to generalize several results in this direction. The results show that the proposed methods perform well in applications to real data sets.
Book
This volume presents the latest advances in statistics and data science, including theoretical, methodological and computational developments and practical applications related to classification and clustering, data gathering, exploratory and multivariate data analysis, statistical modeling, and knowledge discovery and seeking. It includes contributions on analyzing and interpreting large, complex and aggregated datasets, and highlights numerous applications in economics, finance, computer science, political science and education. It gathers a selection of peer-reviewed contributions presented at the 16th Conference of the International Federation of Classification Societies (IFCS 2019), which was organized by the Greek Society of Data Analysis and held in Thessaloniki, Greece, on August 26-29, 2019.
Article
Data analysis plays an indispensable role in understanding different phenomena. One of the vital means of handling these data is to group them into a set of clusters given a measure of similarity. Usually, clustering methods deal with objects described by single-valued variables. Nevertheless, this representation is too restrictive for complex data such as lists, histograms, or even intervals. Furthermore, in some problems, many dimensions are irrelevant and can mask existing clusters. In this regard, new interval-valued data clustering methods with regularizations and adaptive distances are proposed. These approaches consider variants in which the boundaries of the interval-valued variables have either the same or different importance for the clustering process. The algorithms optimize an objective function by alternating three steps to obtain the representatives of each group, a fuzzy partition, and the relevance weights of the variables. Experiments on synthetic and real data sets corroborate the robustness and usefulness of the proposed methods.
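The "fuzzy partition" step above can be made concrete with the standard fuzzy c-means membership update, shown below for a given matrix of squared distances between objects and cluster prototypes; the paper's exact update additionally involves its regularization terms and adaptive distances, which this sketch omits.

```python
import numpy as np

def fuzzy_memberships(sq_dist, m=2.0):
    """Standard fuzzy c-means membership update: sq_dist[c, i] is the
    squared distance from object i to the prototype of cluster c, and
    m > 1 is the fuzzifier controlling how soft the partition is."""
    d = np.maximum(sq_dist, 1e-12) ** (1.0 / (m - 1.0))
    u = (1.0 / d) / (1.0 / d).sum(axis=0)
    return u   # u[c, i] in [0, 1]; each column sums to one over clusters
```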
Article
Interval-valued variables arise in data analysis because this type of data represents either the uncertainty in an error measurement or the natural variability of the data, so methods and algorithms for managing interval-valued data are much needed. Hence, this paper presents a center-and-range clusterwise nonlinear regression algorithm for interval-valued data. The proposed algorithm combines a k-means-type algorithm with center-and-range linear and nonlinear regression methods for interval-valued data, aiming to identify simultaneously both the partition of the data and the relevant regression models, fitted on the centers and ranges of the intervals, one for each cluster. The proposed method automatically selects the best pair of center and range (linear and/or nonlinear) functions according to optimization criteria. A simulation study with synthetic data sets assessed the parameter estimation and prediction performance of the proposed algorithm. Finally, applications to real data sets were performed, and the prediction accuracy of the proposed method was compared to the linear case. The results showed that the proposed method performed well on both synthetic and real data sets.
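The basic center-and-range building block that the clusterwise algorithm above fits per cluster can be sketched as follows; names are illustrative, the predictor is single-variable for brevity, and only the linear case is shown, whereas the paper also considers nonlinear variants and selects the best pair automatically.

```python
import numpy as np

def fit_center_range(X_lo, X_hi, y_lo, y_hi):
    """Fit one linear model on interval midpoints and one on half-ranges
    (illustrative sketch of the center-and-range idea)."""
    Xc, Xr = (X_lo + X_hi) / 2.0, (X_hi - X_lo) / 2.0   # centers, half-ranges
    yc, yr = (y_lo + y_hi) / 2.0, (y_hi - y_lo) / 2.0
    A = np.column_stack([np.ones(len(Xc)), Xc])
    B = np.column_stack([np.ones(len(Xr)), Xr])
    beta_c, *_ = np.linalg.lstsq(A, yc, rcond=None)      # center model
    beta_r, *_ = np.linalg.lstsq(B, yr, rcond=None)      # range model
    # Predicted interval: [center_hat - range_hat, center_hat + range_hat].
    return beta_c, beta_r
```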
Chapter
Entrepreneurial regimes are a topic receiving ever more research attention. Existing studies on entrepreneurial regimes mainly use common methods from multivariate analysis and some type of institution-related analysis. In our analysis, entrepreneurial regimes are analyzed by applying a novel polygonal symbolic data cluster analysis approach. Among the diverse data structures in Symbolic Data Analysis (SDA), interval-valued data are the most popular; yet this approach requires the equidistribution hypothesis. We use a novel polygonal cluster analysis approach to address this limitation, with additional advantages: it stores more information, significantly reduces large data sets while preserving the classical variability through the polygon radius, and opens new possibilities in symbolic data analysis. We construct a dynamic cluster analysis algorithm for this type of data, proving the main theorems and lemmata that justify its usage. In the empirical part, we use the Global Entrepreneurship Monitor (GEM) data set for the year 2015 to construct typologies of countries based on responses to the main entrepreneurial questions. The article presents a novel approach to clustering in statistical theory (with a novel type of variable not accounted for previously) and an application, with novel results, to a pressing issue in entrepreneurship.
Article
Full-text available
Clustering interval data has been studied for decades. High-dimensional interval data can be expressed in terms of hyperrectangles in $\mathbb{R}^d$ (or d-orthotopes) in the case of real-valued d-attribute data. This paper investigates such high-dimensional interval data: the Cartesian product of intervals, or a vector of intervals. For the efficient computation of related Boolean functions, some interesting aspects have been discovered using the vertices and edges of the graph generated from the given events. We also study the lower- and upper-bounded orthants in $\mathbb{R}^d$ as events, for which we show the existence of a polynomial-time algorithm to calculate the probability of the union of such events. This efficient algorithm was discovered by constructing a suitable partial order relation based on a recursive projection onto lower-dimensional spaces. Illustrative real-life applications are presented.
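As a point of reference for the union-probability problem above, the brute-force route is inclusion-exclusion over the events: the intersection of orthant events {x ≤ p_i} is again an orthant whose corner is the coordinate-wise minimum. The sketch below implements this exponential baseline; the paper's partial-order construction is what brings the cost down to polynomial time.

```python
import numpy as np
from itertools import combinations

def union_prob_orthants(corners, cdf):
    """Probability of the union of orthant events {x <= p_i} by brute-force
    inclusion-exclusion. Exponential in the number of events; shown only
    to motivate the polynomial-time algorithm discussed above."""
    n = len(corners)
    total = 0.0
    for r in range(1, n + 1):
        for S in combinations(range(n), r):
            corner = np.min([corners[i] for i in S], axis=0)
            total += (-1) ** (r + 1) * cdf(corner)
    return total

# Example with independent Uniform(0,1) coordinates: cdf(q) = prod(clip(q, 0, 1)).
cdf = lambda q: float(np.prod(np.clip(q, 0.0, 1.0)))
print(union_prob_orthants([np.array([0.5, 0.5]), np.array([0.8, 0.2])], cdf))  # 0.31
```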
Article
Vowel formants provide information on how a vowel is uttered. Formant frequencies are relevant in applications involving human speech processing; however, such implementations have mainly involved non-Spanish speakers, so the characterization of Spanish vowels should be further explored. In this study, a method for formant extraction based on the discrete wavelet transform is presented. The work focuses on Spanish speakers from Antioquia, Colombia. The parameters of the wavelet analysis are adjusted to establish a suitable characterization of the vowels within the formant frequency space. The results show that the vowel-specific wavelet analysis yields well-defined clusters in the formant space. A k-means algorithm was trained to obtain representative centroids for each vowel. These centroids were tested in a vowel identification task, with good performance. Moreover, the centroids were compared with vowel formants of Spanish speakers reported in the literature. The comparison reveals that speakers from distinct regions express specific features of vowel utterance, suggesting that speakers from regional populations within countries can be better characterized. The proposed wavelet parametrization combined with the clustering algorithm can be attractive for real-time voice processing applications. Furthermore, the proposed methodology can be applied in future studies with speakers from other Colombian and Spanish-speaking regions.
Preprint
Full-text available
We present a Cognitive Amplifier framework to augment things that are part of an IoT with cognitive capabilities, with the aim of improving everyday convenience. Specifically, the Cognitive Amplifier consists of knowledge discovery and prediction components. The knowledge discovery component focuses on finding natural activity patterns, considering their regularity, variations, and transitions in real-life settings. The prediction component takes the discovered knowledge as the basis for inferring what, when, and where the next activity will happen. Experimental results on real-life data validate the feasibility and applicability of the proposed approach.
Article
Full-text available
Cluster analysis, or classification, usually concerns a set of exploratory multivariate data analysis methods and techniques for grouping either a set of statistical data units or the associated set of descriptive variables into clusters of similar and, hopefully, well-separated elements. In this work we refer to an extension of this paradigm to generalized three-way data representations, and particularly to the classification of interval variables. Such an approach appears to be especially useful for large databases, mostly in a data mining context. A health sciences case study is partially discussed.
Chapter
This chapter is devoted to the most popular heuristic partitional clustering algorithms such as k-means, k-medians, and k-medoids. In addition, we give an overview of some clustering algorithms based on mixture models, self-organizing map, and fuzzy clustering. The description of these algorithms as well as their flowcharts is presented. The convergence results for the k-means and the k-medians algorithms using nonsmooth optimization techniques are discussed.
Article
Full-text available
This paper is dedicated to the metrological aspects of the clustering procedure. It shows that taking into account information on the uncertainty of the data to be clustered makes it easy to reach reasonable, metrologically supported decisions. Problems that are usually difficult to solve in clustering practice become simpler: the proposed approach determines the maximum number of recognizable clusters, which protects against unreasonable conclusions, and it simplifies the preconditioning of the data to be clustered.
Article
Full-text available
In most existing clustering methods, the resemblance measure is fixed. A family of algorithms is proposed in which the resemblance measure evolves by adapting itself to the local structure of the analyzed set. A given quality criterion of this adaptation is iteratively improved until the process converges. Other algorithm families are considered as limits of a more general algorithm.
Article
Full-text available
Most techniques in the hierarchical clustering literature operate off-line. The main contribution of this paper is a new algorithm for on-line hierarchical clustering that finds the k nearest objects to each object introduced so far, continuously updating these k nearest neighbours as new objects arrive. After the final object, each object and its k nearest neighbours are available and are sorted to produce the hierarchical dendrogram. Results on real and synthetic data, as well as simulation experiments, show that the new technique is quite efficient and, in many respects, superior to traditional off-line hierarchical methods.
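The on-line update step described above might be sketched as follows; the data layout (dictionaries of points and of per-object neighbour heaps) is an assumption of this sketch, not the paper's exact bookkeeping.

```python
import heapq

def online_knn_update(neighbors, points, new_id, new_point, k, dist):
    """On arrival of a new object: (a) build its own k-nearest list and
    (b) let it displace a farther neighbour in every existing list.
    `points` maps id -> vector; `neighbors` maps id -> max-heap of
    (-distance, id) pairs kept at size <= k."""
    candidates = []
    for i, p in points.items():
        d = dist(new_point, p)
        heapq.heappush(neighbors[i], (-d, new_id))   # max-heap via negation
        if len(neighbors[i]) > k:
            heapq.heappop(neighbors[i])              # drop the farthest neighbour
        candidates.append((d, i))
    nearest = [(-d, i) for d, i in heapq.nsmallest(k, candidates)]
    heapq.heapify(nearest)
    neighbors[new_id] = nearest
    points[new_id] = new_point
```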
Article
Full-text available
Most techniques in the literature for clustering symbolic data follow the hierarchical methodology, with agglomerative or divisive methods at the core of the algorithm. The main contribution of this paper is to show how to apply the concept of fuzziness to a data set of symbolic objects and how to use it to formulate the clustering of symbolic objects as a partitioning problem. Finally, a fuzzy symbolic c-means algorithm is introduced, applied, and tested on real and synthetic data sets. The results show that the new technique is quite efficient and, in many respects, superior to traditional hierarchical methods.
Article
Full-text available
This paper presents simple and convenient generalized Minkowski metrics on a multidimensional feature space whose coordinate axes are associated not only with quantitative features but also with qualitative and structural features. The metrics are defined on a new mathematical model $(U^{(d)}, [+], [\times])$, called simply the Cartesian space model, where $U^{(d)}$ is the feature space permitting mixed feature types, $[+]$ is the Cartesian join operator, which yields a generalized description for given descriptions on $U^{(d)}$, and $[\times]$ is the Cartesian meet operator, which extracts a common description from given descriptions on $U^{(d)}$. To illustrate the effectiveness of our generalized Minkowski metrics, we present an approach to hierarchical conceptual clustering and a generalization of principal component analysis for mixed feature data.
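For the purely quantitative (interval) case, the Cartesian join and meet reduce to simple interval operations, as the illustrative sketch below shows; handling qualitative and structural features, as in the paper, requires richer descriptions.

```python
def cartesian_join(a, b):
    """[+] for the interval case: per feature, the smallest interval
    covering both descriptions (a and b are lists of (lo, hi) pairs)."""
    return [(min(x[0], y[0]), max(x[1], y[1])) for x, y in zip(a, b)]

def cartesian_meet(a, b):
    """[x] for the interval case: per feature, the common part of the two
    descriptions (None where the intervals do not intersect)."""
    out = []
    for (x_lo, x_hi), (y_lo, y_hi) in zip(a, b):
        lo, hi = max(x_lo, y_lo), min(x_hi, y_hi)
        out.append((lo, hi) if lo <= hi else None)
    return out

print(cartesian_join([(0, 2)], [(1, 3)]))  # [(0, 3)]
print(cartesian_meet([(0, 2)], [(1, 3)]))  # [(1, 2)]
```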
Chapter
The paper presents an iterative relocation algorithm that seeks to partition the descriptions of Boolean symbolic objects into classes so as to minimize the sum of the description potentials of the classes.
Chapter
The aim of this paper is to present an approach to calculating the proximity between Boolean symbolic objects (BSOs) that takes into account simultaneously the variability, as a range of values, and some kinds of logical dependencies between variables. A BSO is described by a logical conjunction of properties, each property being a disjunction of values on a variable. Our approach is based on both a comparison function and an aggregation function. A comparison function is a proximity index based on a positive measure, called the description potential of a Boolean elementary event (the cardinality of the disjunction of values on a variable of a BSO), and on the proximity indices related to a data matrix of binary variables. An aggregation function is a proximity index, related to the Minkowski distance, that aggregates the p results given by the comparison functions.
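As a small illustration of the two-level construction, an aggregation function can be a Minkowski-type mean of the p per-variable comparison indices; the sketch below assumes the comparison values are already computed and normalized, and the paper's comparison functions, built from description potentials, are not reproduced here.

```python
def aggregate_minkowski(comparisons, r=2.0):
    """Combine p per-variable comparison indices into one proximity value
    via a Minkowski-type mean of order r (r = 1 gives the plain average)."""
    p = len(comparisons)
    return (sum(c ** r for c in comparisons) / p) ** (1.0 / r)

print(aggregate_minkowski([0.2, 0.5, 0.9], r=1.0))  # plain mean ≈ 0.533
```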
Chapter
The aim of this paper is to introduce the symbolic approach in data analysis and to show that it extends data analysis to more complex data which may be closer to the multidimensional reality. We introduce several kinds of symbolic objects ("events", "assertions", and also "hordes" and "synthesis" objects) which are defined by a logical conjunction of properties concerning the variables. They can take, for instance, several values on a same variable and they are adapted to the case of missing and nonsense values. Background knowledge may be represented by hierarchical or pyramidal taxonomies. In clustering, the problem remains to find inter-class structures such as partitions, hierarchies and pyramids on symbolic objects. Symbolic data analysis is conducted on several principles: accuracy of the representation, coherence between the kind of objects used at input and output, knowledge predominance for driving the algorithms, self-explanation of the results. We define order, union and intersection between symbolic objects and we conclude that they are organised according to an inheritance lattice. We study several properties and qualities of symbolic objects, of classes and of classifications of symbolic objects. Modal symbolic objects are then introduced. Finally, we present an algorithm to represent the clusters of a partition by modal assertions and obtain a locally optimal partition according to a given criterion.
Article
We present a dynamical clustering algorithm to partition a set of multi-nominal data into k classes. This kind of data can be considered a particular description of symbolic objects. In this algorithm, the representation of the classes is given by prototypes that generalize the characteristics of the elements belonging to each class. A suitable context-dependent allocation function is used to assign an object to a class. The final classes are described by the distributions associated with the multi-nominal variables of the elements belonging to each class. That representation corresponds to the usual description of so-called modal symbolic objects.
Article
In order to extend the dynamical clustering algorithm to interval data sets, we define the prototype of a cluster by optimizing a classical adequacy criterion based on the Hausdorff distance. Once this class prototype is properly defined, we give a simple and convergent algorithm for this new type of interval data.
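For concreteness, the Hausdorff distance between two intervals reduces to a simple closed form, which is what makes an adequacy criterion based on it tractable; per-variable distances are then combined over the interval-valued variables.

```python
def hausdorff_interval(a, b):
    """Hausdorff distance between intervals a = (a_lo, a_hi) and
    b = (b_lo, b_hi); for intervals it reduces to the maximum of the
    two boundary differences."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))

print(hausdorff_interval((0.0, 4.0), (1.0, 3.0)))  # -> 1.0
```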
Article
A novel ISODATA clustering procedure for symbolic objects is presented using distributed genetic algorithms, wherein a structured organisation is introduced in the distribution of the population, and selection and mating take place within locally distributed subgroups of individuals rather than across the whole population.
Article
The proposed divisive clustering method simultaneously builds a hierarchy of a set of objects and a monothetic characterization of each cluster in the hierarchy. A division is performed according to the within-cluster inertia criterion, which is minimized among the bipartitions induced by a set of binary questions. To improve the clustering, the algorithm revises at each step the division that induced the cluster chosen for division.
Article
A hierarchical, agglomerative clustering methodology is presented in which composite symbolic objects are formed using a Cartesian join operator whenever symbolic objects are selected for agglomeration based on both similarity and dissimilarity.
Article
A new approach to clustering symbolic objects which makes use of both similarity and dissimilarity measures is proposed. The proposed modified similarity and dissimilarity measures take into consideration the position, span, and content of symbolic objects, and are of a new type. The advantages of the proposed modified measures are presented. A divisive clustering algorithm which makes use of both similarity and dissimilarity is proposed. The results obtained by the proposed method are compared with those of other methods.
Article
A method for determining the mutual nearest neighbours (MNN) and mutual neighbourhood value (mnv) of a sample point, using the conventional nearest neighbours, is suggested. A nonparametric, hierarchical, agglomerative clustering algorithm is developed using the above concepts. The algorithm is simple, deterministic, noniterative, requires low storage and is able to discern spherical and nonspherical clusters. The method is applicable to a wide class of data of arbitrary shape, large size and high dimensionality. The algorithm can discern mutually homogenous clusters. Strong or weak patterns can be discerned by properly choosing the neighbourhood width.
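A short sketch of computing mutual neighbourhood values from a distance matrix may make the concept concrete; the layout is illustrative, not the paper's exact procedure.

```python
import numpy as np

def mutual_neighborhood_values(D):
    """Mutual neighbourhood value for every pair, from a symmetric distance
    matrix D with zero diagonal: mnv(a, b) is the rank of b among a's
    nearest neighbours plus the rank of a among b's (self sits at rank 0,
    so the closest other point has rank 1). Small mnv = mutually near."""
    n = len(D)
    order = np.argsort(D, axis=1)              # neighbours sorted by distance
    rank = np.empty_like(order)
    rank[np.arange(n)[:, None], order] = np.arange(n)
    return rank + rank.T                       # symmetric mnv matrix
```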
Article
This paper presents an overview of pattern clustering methods from a statistical pattern recognition perspective, with a goal of providing useful advice and references to fundamental concepts accessible to the broad community of clustering practitioners. We present a taxonomy of clustering techniques, and identify cross-cutting themes and recent advances. We also describe some important applications of clustering algorithms such as image segmentation, object recognition, and information retrieval
Article
The problem of comparing two different partitions of a finite set of objects reappears continually in the clustering literature. We begin by reviewing a well-known measure of partition correspondence often attributed to Rand (1971), discuss the issue of correcting this index for chance, and note that a recent normalization strategy developed by Morey and Agresti (1984) and adopted by others (e.g., Milligan and Cooper 1985) is based on an incorrect assumption. Then, the general problem of comparing partitions is approached indirectly by assessing, with a simple cross-product measure, the congruence of two proximity matrices generated from the corresponding partitions using various scoring rules. Special cases derivable include traditionally familiar statistics and/or ones tailored to weight certain object pairs differentially. Finally, we propose a measure based on the comparison of object triples that has the advantage of a probabilistic interpretation, in addition to being corrected for chance (i.e., assuming a constant value under a reasonable null hypothesis) and bounded between ±1.
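Since the chance-corrected index discussed here is widely used to score clustering results, a compact reference implementation of the standard corrected-for-chance (adjusted) Rand index follows.

```python
import numpy as np

def corrected_rand(labels_a, labels_b):
    """Corrected-for-chance Rand index: pair-counting agreement, recentred
    so its expected value under random labellings is 0 and its maximum 1."""
    a, b = np.asarray(labels_a), np.asarray(labels_b)
    _, ia = np.unique(a, return_inverse=True)
    _, ib = np.unique(b, return_inverse=True)
    n = len(a)
    C = np.zeros((ia.max() + 1, ib.max() + 1))
    np.add.at(C, (ia, ib), 1)                  # contingency table
    comb2 = lambda x: x * (x - 1) / 2.0
    sum_ij = comb2(C).sum()
    sum_i = comb2(C.sum(axis=1)).sum()
    sum_j = comb2(C.sum(axis=0)).sum()
    expected = sum_i * sum_j / comb2(n)
    max_index = (sum_i + sum_j) / 2.0
    return (sum_ij - expected) / (max_index - expected)

print(corrected_rand([0, 0, 1, 1], [1, 1, 0, 0]))  # identical partitions -> 1.0
```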
Article
Most techniques in the literature for clustering symbolic data follow the hierarchical methodology, which uses the concept of agglomeration or division as the core of the algorithm. The main contribution of this paper is to formulate a clustering algorithm for symbolic objects based on the gravitational approach. The proposed procedure is based on the physical phenomenon in which a system of particles in space converges to the centroid of the system due to gravitational attraction between the particles. Some pairs of samples, called mutual pairs, which have a tendency to gravitate toward each other, are discerned at each stage of this multistage scheme. The notions of cluster conglomerate strength and global conglomerate strength are used to accomplish or abandon the merging of a mutual pair. The methodology forms composite symbolic objects whenever two symbolic objects are merged. The merging at each stage reduces the number of samples available for consideration. The procedure terminates at a stage where no more mutual pairs are available for merging. The efficacy of the proposed methodology is examined by applying it to numeric data and also to data sets drawn from the domains of fat-oil, microcomputers, microprocessors, and botany. A detailed comparative study is carried out with other methods, and the results are presented.
Article
A hierarchical, agglomerative, symbolic clustering methodology based on a similarity measure that takes into consideration the position, span, and content of symbolic objects is proposed. The similarity measure used is of a new type in the sense that it is not just another aspect of dissimilarity. The clustering methodology forms composite symbolic objects using a Cartesian join operator when two symbolic objects are merged. The maximum and minimum similarity values at various merging levels permit the determination of the number of clusters in the data set. The composite symbolic objects representing different clusters give a description of the resulting classes and lead to knowledge acquisition. The algorithm is capable of discerning clusters in data sets made up of numeric as well as symbolic objects consisting of different types and combinations of qualitative and quantitative feature values. In particular, the algorithm is applied to fat-oil and microcomputer data
Verde, R., de Carvalho, F.A.T., Lechevallier, Y., 2001. A dynamical clustering algorithm for symbolic data. Tutorial on Symbolic Data Analysis, GfKl Conference, Munich.
Bobou. Mercury in the food web: Accumulation and transfer mechanisms.
Jain. Data clustering: A review.
Diday. The symbolic approach in clustering and related methods of data analysis.