The clustering method has two significant limitations. First, unambiguous clustering results require well-separated data: almost all existing algorithms produce the same clusters when the classes are non-overlapping and clearly distinct from one another. When the classes are diffuse and interpenetrating, however, the results of each algorithm differ: each draws its own boundaries, reaches its own "optimal" partition, and extracts different information from the same data. To explain why different algorithms give different results on the same data, attention must be paid to how those differences are judged. Correctly interpreting the clusters produced by any given algorithm, and especially their boundaries, is difficult for the geneticist. Ultimately, empirical plausibility, for example from sequence comparisons, is needed to guide the interpretation of clustering results.
The second limitation arises from the assumption of linear correlation. All of the clustering methods described above analyze only simple one-to-one relationships. Because they rely on linear, pairwise comparisons, they greatly reduce the computational effort of discovering relationships among expression patterns, but they ignore the multifactorial and nonlinear nature of biological systems.
From a statistical point of view, cluster analysis is a way to simplify data through data modeling. Traditional statistical cluster analysis methods include systematic (hierarchical) clustering, decomposition, joining, dynamic clustering, ordered-sample clustering, overlapping clustering, and fuzzy clustering. Cluster analysis tools based on algorithms such as k-means and k-medoids have been added to many well-known statistical software packages, such as SPSS and SAS.
From a machine learning perspective, clusters correspond to hidden patterns, and clustering is the unsupervised learning process of searching for them. Unlike classification, which learns from instances or data objects labeled with classes, clustering does not rely on predefined classes or labeled training examples; the labels must be determined automatically by the clustering algorithm. In this sense, clustering is learning by observation rather than learning by example.
From a practical point of view, cluster analysis is one of the main tasks of data mining. As a data mining function, clustering can be used as a standalone tool to examine the distribution of the data, observe the characteristics of each cluster, and focus further analysis on particular clusters.
Clustering analysis can also be used as a preprocessing step for other data mining tasks (e.g., classification, association rules).
The field of data mining focuses on efficient and practical cluster analysis algorithms for large databases and data warehouses.
Cluster analysis is a very active research area in data mining and many clustering algorithms have been proposed.
These algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods.
1. Partitioning methods (PAM: PArtitioning Method): first, k partitions are created, k being the number of partitions to construct; then an iterative relocation technique is used to improve the quality of the partitions by moving objects from one partition to another. Typical partitioning methods include k-means, k-medoids, CLARA (Clustering LARge Applications), CLARANS (Clustering Large Applications based upon RANdomized Search), and FCM (fuzzy c-means); a minimal k-means sketch follows this list.
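To make the iterative relocation idea concrete, here is a minimal k-means sketch in Python with NumPy; the toy data, the value of k, and the iteration cap are illustrative assumptions, not part of the original text.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: alternately assign points to the nearest
    centroid and move each centroid to the mean of its points
    (iterative relocation)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: label each point with its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Relocation step: recompute each centroid as its cluster mean.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # no centroid moved: converged
        centroids = new_centroids
    return labels, centroids

# Toy usage: two well-separated blobs should be recovered as two clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centroids = kmeans(X, k=2)
```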
2. Hierarchical methods: create a hierarchical decomposition of a given dataset. These methods work in one of two directions: top-down (divisive) or bottom-up (agglomerative). To compensate for the rigidity of pure splitting or merging, hierarchical clustering often has to be combined with other techniques, such as iterative relocation. Typical hierarchical methods include the following (an agglomerative sketch appears after the list):
The first is BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), which first uses a tree structure to partition the set of objects; these clusters are then refined using other clustering methods.
The second is CURE (Clustering Using REpresentatives), which represents a cluster by a fixed number of representative objects that are then shrunk by a specified fraction toward the cluster center.
The third is ROCK, which merges clusters on the basis of the links between them.
The last is CHAMELEON, which constructs a dynamic model during hierarchical clustering.
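None of the four hierarchical algorithms above is reproduced here; as a generic illustration of bottom-up (agglomerative) merging, the following sketch uses SciPy's standard linkage routines. The data and the choice of average linkage are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two loose groups of points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(6, 1, (30, 2))])

# Bottom-up merging: start from singleton clusters and repeatedly
# merge the two closest clusters; "average" linkage measures the
# distance between clusters as the mean pairwise distance.
Z = linkage(X, method="average")

# Cut the resulting merge tree into a flat clustering of 2 clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
```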
3. Density-based methods: cluster objects according to density, growing each cluster based on the density of its surrounding region (as DBSCAN does). Typical density-based methods include the following (a DBSCAN sketch appears after the list):
DBSCAN (Density-Based Spatial Clustering of Applications with Noise): this algorithm performs clustering by continuously growing regions of sufficiently high density; it can discover clusters of arbitrary shape in spatial databases that contain noise. The method defines a cluster as a maximal set of density-connected points.
OPTICS (Ordering Points To Identify the Clustering Structure): does not explicitly produce a clustering, but computes an augmented cluster ordering for automatic and interactive cluster analysis.
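As a sketch of the density-based idea, the following uses scikit-learn's DBSCAN implementation on toy data; the eps (neighborhood radius) and min_samples (density threshold) values are illustrative and would need tuning on real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data: two concentric rings plus scattered noise. Ring-shaped
# clusters are a classic case where density-based methods succeed
# and centroid-based methods fail.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
ring = np.c_[np.cos(theta), np.sin(theta)] + rng.normal(0, 0.05, (200, 2))
noise = rng.uniform(-3, 3, (20, 2))
X = np.vstack([ring, 3 * ring, noise])

# Grow clusters from points whose eps-neighborhood holds at least
# min_samples points; points in no dense region are labeled -1 (noise).
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
```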
4. Grid-based methods: the object space is first divided into a finite number of cells to form a grid structure, and clustering is then carried out on that grid structure (a simplified grid-clustering sketch appears after the examples).
STING (STatistical INformation Grid) is a grid-based clustering method that makes use of the statistical information stored in the grid cells.
CLIQUE (Clustering In QUEst) and WaveCluster are methods that combine the grid-based and density-based approaches.
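Neither STING nor CLIQUE is implemented in the common Python scientific libraries, so the following is only a simplified, hand-rolled sketch of the grid idea itself: bin 2-D points into cells, keep the dense cells, and connect adjacent dense cells into clusters. The cell size and density threshold are illustrative assumptions.

```python
import numpy as np
from collections import deque

def grid_cluster(X, cell_size=1.0, density_threshold=5):
    """Toy 2-D grid clustering: not STING or CLIQUE themselves,
    just the shared grid-plus-density idea behind them."""
    # Quantize the object space: map each point to its grid cell.
    cells = {}
    for i, p in enumerate(X):
        key = tuple((p // cell_size).astype(int))
        cells.setdefault(key, []).append(i)

    # Keep only cells holding at least density_threshold points.
    dense = {k for k, idx in cells.items() if len(idx) >= density_threshold}

    # Flood-fill over 8-connected dense cells: each connected group
    # of dense cells becomes one cluster.
    labels = np.full(len(X), -1)  # -1 marks points in sparse cells
    cluster_id = 0
    unvisited = set(dense)
    while unvisited:
        queue = deque([unvisited.pop()])
        while queue:
            cx, cy = queue.popleft()
            for i in cells[(cx, cy)]:
                labels[i] = cluster_id
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in unvisited:
                        unvisited.remove(nb)
                        queue.append(nb)
        cluster_id += 1
    return labels
```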
5. Model-based methods: hypothesize a model for each cluster and look for the data that best fit that model. Typical model-based methods include the following (an illustrative sketch follows these examples):
The statistical method COBWEB is a commonly used, simple incremental conceptual clustering method. Its input objects are described by symbolic attribute-value pairs, and it creates a hierarchical clustering in the form of a classification tree.
CLASSIT is another version of COBWEB that allows incremental clustering of continuous-valued attributes. For each attribute in each node it stores a continuous normal distribution (mean and variance), and it uses a modified measure of category utility: rather than summing over discrete attribute values as COBWEB does, it integrates over the continuous attributes. CLASSIT, however, suffers from problems similar to COBWEB's, so neither method is suitable for clustering large databases.
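COBWEB and CLASSIT are rarely available in mainstream libraries, so as a stand-in illustration of the model-based idea (assume a model per cluster, then find the data that fit it), the following sketch fits a Gaussian mixture with scikit-learn. This is a different model-based method, not COBWEB, and the data and component count are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data: two Gaussian blobs of different spread.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1.0, (100, 2)), rng.normal(4, 0.5, (100, 2))])

# Model-based clustering: assume each cluster follows its own
# Gaussian, fit the mixture by EM, then assign each point to the
# component that explains it best.
gm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gm.predict(X)
probs = gm.predict_proba(X)  # soft, per-component membership probabilities
```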