The clustering method has two significant limitations. First, unambiguous clustering results require well-separated data: almost all existing algorithms produce the same clusters when the classes are non-overlapping and clearly distinct from one another. When the classes are diffuse and interpenetrating, however, the results of each algorithm differ: each draws its own boundaries, reaches its own "optimal" partition, and extracts different information from the same data. To explain why different algorithms give different results on the same data, attention must be paid to how those differences are judged. Correctly interpreting the clusters produced by any given algorithm, and especially their boundaries, is difficult for the geneticist. Ultimately, empirical plausibility, for example from sequence comparisons, is needed to guide the interpretation of clustering results.
The second limitation arises from the assumption of linear correlation. All of the clustering methods described above analyze only simple one-to-one relationships. Because they rely on linear, pairwise comparisons, they greatly reduce the computational effort of discovering relationships among expression patterns, but they ignore the multifactorial and nonlinear nature of biological systems.
From a statistical point of view, cluster analysis is a way to simplify data through data modeling. Traditional statistical cluster analysis methods include systematic (hierarchical) clustering, decomposition, joining, dynamic clustering, ordered-sample clustering, overlapping clustering, and fuzzy clustering. Cluster analysis tools based on algorithms such as k-means and k-medoids have been added to many well-known statistical software packages, such as SPSS and SAS.
From a machine learning perspective, clusters correspond to hidden patterns, and clustering is the unsupervised learning process of searching for them. Unlike classification, which learns from instances or data objects labeled with classes, clustering does not rely on predefined classes or labeled training examples; the labels must be determined automatically by the clustering algorithm. In this sense, clustering is learning by observation rather than learning by example.
From a practical point of view, cluster analysis is one of the main tasks of data mining. As a data mining function, clustering can be used as a standalone tool to examine the distribution of the data, observe the characteristics of each cluster, and focus further analysis on particular clusters.
Clustering analysis can also be used as a preprocessing step for other data mining tasks (e.g., classification, association rules).
The field of data mining focuses on efficient and practical cluster analysis algorithms for large databases and data warehouses.
Cluster analysis is a very active research area in data mining and many clustering algorithms have been proposed.
These algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods.
1. Partitioning methods (PAM: PArtitioning Method): first, k partitions are created, k being the number of partitions to construct; then an iterative relocation technique is used to improve the quality of the partitions by moving objects from one partition to another. Typical partitioning methods include k-means, k-medoids, CLARA (Clustering LARge Applications), CLARANS (Clustering Large Applications based upon RANdomized Search), and FCM (fuzzy c-means); a minimal k-means sketch follows this list.
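To make the iterative relocation idea concrete, here is a minimal k-means sketch in Python with NumPy; the toy data, the value of k, and the iteration cap are illustrative assumptions, not part of the original text.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: alternately assign points to the nearest
    centroid and move each centroid to the mean of its points
    (iterative relocation)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: label each point with its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Relocation step: recompute each centroid as its cluster mean.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # no centroid moved: converged
        centroids = new_centroids
    return labels, centroids

# Toy usage: two well-separated blobs should be recovered as two clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centroids = kmeans(X, k=2)
```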
2. Hierarchical methods: create a hierarchical decomposition of a given dataset. These methods work in one of two directions: top-down (divisive) or bottom-up (agglomerative). To compensate for the rigidity of pure splitting or merging, hierarchical clustering often has to be combined with other techniques, such as iterative relocation. Typical hierarchical methods include the following (an agglomerative sketch appears after the list):
The first is BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), which first uses a tree structure to partition the set of objects; these clusters are then refined using other clustering methods.
The second is CURE (Clustering Using REpresentatives), which represents a cluster by a fixed number of representative objects that are then shrunk by a specified fraction toward the cluster center.
The third is ROCK, which merges clusters on the basis of the links between them.
The last is CHAMELEON, which constructs a dynamic model during hierarchical clustering.
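None of the four hierarchical algorithms above is reproduced here; as a generic illustration of bottom-up (agglomerative) merging, the following sketch uses SciPy's standard linkage routines. The data and the choice of average linkage are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two loose groups of points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(6, 1, (30, 2))])

# Bottom-up merging: start from singleton clusters and repeatedly
# merge the two closest clusters; "average" linkage measures the
# distance between clusters as the mean pairwise distance.
Z = linkage(X, method="average")

# Cut the resulting merge tree into a flat clustering of 2 clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
```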
3. Density-based methods: cluster objects according to density, growing each cluster based on the density of its surrounding region (as DBSCAN does). Typical density-based methods include the following (a DBSCAN sketch appears after the list):
DBSCAN (Density-Based Spatial Clustering of Applications with Noise): this algorithm performs clustering by continuously growing regions of sufficiently high density; it can discover clusters of arbitrary shape in spatial databases that contain noise. The method defines a cluster as a maximal set of density-connected points.
OPTICS (Ordering Points To Identify the Clustering Structure): does not explicitly produce a clustering, but computes an augmented cluster ordering for automatic and interactive cluster analysis.
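As a sketch of the density-based idea, the following uses scikit-learn's DBSCAN implementation on toy data; the eps (neighborhood radius) and min_samples (density threshold) values are illustrative and would need tuning on real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data: two concentric rings plus scattered noise. Ring-shaped
# clusters are a classic case where density-based methods succeed
# and centroid-based methods fail.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
ring = np.c_[np.cos(theta), np.sin(theta)] + rng.normal(0, 0.05, (200, 2))
noise = rng.uniform(-3, 3, (20, 2))
X = np.vstack([ring, 3 * ring, noise])

# Grow clusters from points whose eps-neighborhood holds at least
# min_samples points; points in no dense region are labeled -1 (noise).
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
```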
4. Grid-based methods: the object space is first divided into a finite number of cells to form a grid structure, and clustering is then carried out on that grid structure (a simplified grid-clustering sketch appears after the examples).
STING (STatistical INformation Grid) is a grid-based clustering method that makes use of the statistical information stored in the grid cells.
CLIQUE (Clustering In QUEst) and WaveCluster are methods that combine the grid-based and density-based approaches.
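Neither STING nor CLIQUE is implemented in the common Python scientific libraries, so the following is only a simplified, hand-rolled sketch of the grid idea itself: bin 2-D points into cells, keep the dense cells, and connect adjacent dense cells into clusters. The cell size and density threshold are illustrative assumptions.

```python
import numpy as np
from collections import deque

def grid_cluster(X, cell_size=1.0, density_threshold=5):
    """Toy 2-D grid clustering: not STING or CLIQUE themselves,
    just the shared grid-plus-density idea behind them."""
    # Quantize the object space: map each point to its grid cell.
    cells = {}
    for i, p in enumerate(X):
        key = tuple((p // cell_size).astype(int))
        cells.setdefault(key, []).append(i)

    # Keep only cells holding at least density_threshold points.
    dense = {k for k, idx in cells.items() if len(idx) >= density_threshold}

    # Flood-fill over 8-connected dense cells: each connected group
    # of dense cells becomes one cluster.
    labels = np.full(len(X), -1)  # -1 marks points in sparse cells
    cluster_id = 0
    unvisited = set(dense)
    while unvisited:
        queue = deque([unvisited.pop()])
        while queue:
            cx, cy = queue.popleft()
            for i in cells[(cx, cy)]:
                labels[i] = cluster_id
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in unvisited:
                        unvisited.remove(nb)
                        queue.append(nb)
        cluster_id += 1
    return labels
```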
5. Model-based methods: hypothesize a model for each cluster and look for the data that best fit that model. Typical model-based methods include the following (an illustrative sketch follows these examples):
The statistical method COBWEB is a commonly used, simple incremental conceptual clustering method. Its input objects are described by symbolic attribute-value pairs, and it creates a hierarchical clustering in the form of a classification tree.
CLASSIT is another version of COBWEB that allows incremental clustering of continuous-valued attributes. For each attribute in each node it stores a continuous normal distribution (mean and variance), and it uses a modified measure of category utility: rather than summing over discrete attribute values as COBWEB does, it integrates over the continuous attributes. CLASSIT, however, suffers from problems similar to COBWEB's, so neither method is suitable for clustering large databases.
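COBWEB and CLASSIT are rarely available in mainstream libraries, so as a stand-in illustration of the model-based idea (assume a model per cluster, then find the data that fit it), the following sketch fits a Gaussian mixture with scikit-learn. This is a different model-based method, not COBWEB, and the data and component count are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data: two Gaussian blobs of different spread.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1.0, (100, 2)), rng.normal(4, 0.5, (100, 2))])

# Model-based clustering: assume each cluster follows its own
# Gaussian, fit the mixture by EM, then assign each point to the
# component that explains it best.
gm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gm.predict(X)
probs = gm.predict_proba(X)  # soft, per-component membership probabilities
```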