Cluster Analysis (Cluster Analysis)

Clustering is the process of grouping similar things together and assigning dissimilar things to different categories. It is a way of simplifying complex data into a small number of classes.

There are m sample units, each measured on n indicators (variables); together they form the original information matrix:
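In the usual notation, with x_ij denoting the value of the j-th indicator measured on the i-th sample unit, this matrix is:

X = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1n} \\
x_{21} & x_{22} & \cdots & x_{2n} \\
\vdots & \vdots &        & \vdots \\
x_{m1} & x_{m2} & \cdots & x_{mn}
\end{pmatrix}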

The choice of indicators is very important:

Necessity: each indicator should be closely related to the purpose of the cluster analysis; more indicators are not necessarily better.

Representativeness: the indicators should reflect the characteristics of the objects being classified.

Distinctiveness: the values of an indicator should differ clearly across the objects being studied.

Independence: the indicators should not be highly correlated with one another (for example, height and weight are highly correlated).

Dispersion: ideally the values of each indicator are spread out over their range rather than concentrated in a narrow band.

Data standardization may be required when the scales of the indicators differ too much or when the data do not follow a normal distribution.

(1) Sum standardization: each value of an indicator is divided by the sum of that indicator, so that the standardized values of each indicator sum to 1.
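As a minimal sketch in R (the data matrix x below is made up for illustration), sum standardization simply divides every column by its column sum:

# hypothetical data: 5 sample units (rows), 3 indicators (columns)
x <- matrix(c(2, 4, 6, 8, 10,
              1, 3, 5, 7, 9,
              10, 20, 30, 40, 50),
            nrow = 5)

# divide each indicator (column) by the sum of that indicator
x_sum_std <- sweep(x, 2, colSums(x), FUN = "/")

colSums(x_sum_std)   # every column now sums to 1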

According to the object being clustered, cluster analysis is divided into Q-type clustering (clustering of samples) and R-type clustering (clustering of variables).

(1) Common distance statistics - the Minkowski distance family (linear distances)

For samples x = (x1, x2, ..., xn) and y = (y1, y2, ..., yn), the Minkowski distance of order p is d(x, y) = (Σ |xi − yi|^p)^(1/p).

When p = 1, it is the Manhattan (absolute value) distance.

When p = 2, it is the Euclidean distance (the geometric distance in n-dimensional space).

When p = ∞, it is the Chebyshev distance (chessboard distance).

(2) Common distance statistic - Mahalanobis distance (covariance distance)

For two vectors x = (x1, x2, ..., xn) and y = (y1, y2, ..., yn) with covariance matrix S, the Mahalanobis distance is d(x, y) = sqrt((x − y)' S⁻¹ (x − y)).

In contrast to the Euclidean distance, the Mahalanobis distance takes into account the correlations between indicators (e.g., height and weight are not independent). It is also scale-invariant, so the data do not need to be standardized first.

If the covariance matrix is the identity matrix (the indicators are completely independent of each other), the Mahalanobis distance reduces to the Euclidean distance.

If the covariance matrix is a diagonal matrix, the Mahalanobis distance reduces to the standardized (normalized) Euclidean distance.
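A minimal R sketch (the correlated "height"/"weight" data are simulated for illustration): the mahalanobis() function in the stats package returns squared Mahalanobis distances from a given center, and the distance between two individual observations can be computed directly from the inverse covariance matrix:

set.seed(1)
h <- rnorm(100, mean = 170, sd = 8)       # simulated "height"
w <- 0.9 * h - 90 + rnorm(100, sd = 5)    # simulated "weight", correlated with height
x <- cbind(h, w)

S  <- cov(x)                              # sample covariance matrix
d2 <- mahalanobis(x, colMeans(x), S)      # squared distance of each row from the mean

# Mahalanobis distance between the first two observations
diff <- x[1, ] - x[2, ]
sqrt(drop(t(diff) %*% solve(S) %*% diff))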

(3) Common Distance Statistic - Text Distance

Text distance is used to measure the similarity between texts; it is also widely used in biological research for sequence alignment.
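As an illustration (the example words are made up), base R's adist() computes the generalized Levenshtein (edit) distance, one common text distance:

words <- c("cluster", "clusters", "classify", "cloister")
adist(words)   # pairwise edit distances: insertions, deletions and substitutions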

Common similarity coefficient statistics

Similarity coefficient = 1: completely similar

Similarity coefficient = -1: completely opposite

Similarity coefficient = 0: completely unrelated (independent)

Correlation coefficient:
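In its usual (Pearson) form, the correlation coefficient between two indicators j and k over the m samples is:

r_{jk} = \frac{\sum_{i=1}^{m} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)}{\sqrt{\sum_{i=1}^{m} (x_{ij} - \bar{x}_j)^2 \sum_{i=1}^{m} (x_{ik} - \bar{x}_k)^2}}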

Measurement of class-to-class distance. Methods:

The systematic clustering method requires the measurement of not only individual-to-individual distances, but also class-to-class distances. After the inter-class distance is measured, the two subclasses with the smallest distance will be merged into one class first. The different definitions of inter-class distances give rise to different systematic clustering methods.

There are currently more than 1,000 clustering algorithms. No single algorithm works for every problem, and the parameters of a clustering algorithm must be tuned to the specific problem.

Common clustering algorithms are categorized as follows:

1. Hierarchical clustering

2. Partitioning clustering

3. Density-based clustering

4. Expectation-Maximization (EM) clustering

5. Grid-based clustering

6. Model-based clustering

1. Hierarchical clustering

Basic idea:

At the start of the analysis, each sample (or variable) forms a class of its own. The closeness of all samples (or variables) is then measured in some way, and the two most similar ones are merged into a small class first. Next, the closeness between the remaining samples (or variables) and this small class is measured, and the closest one is merged with it; this continues until all the samples have been merged into a single class.

Example:

Suppose we have a data set D = {a, b, c, d, e} and the distance matrix between these points is given.

At the start, each sample is a class of its own:
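A minimal R sketch of this example (the numbers in the distance matrix are hypothetical; they only need to be symmetric):

# hypothetical pairwise distances between a, b, c, d, e
m <- matrix(c( 0, 2, 6, 10, 9,
               2, 0, 5,  9, 8,
               6, 5, 0,  4, 5,
              10, 9, 4,  0, 3,
               9, 8, 5,  3, 0),
            nrow = 5,
            dimnames = list(letters[1:5], letters[1:5]))

d  <- as.dist(m)                    # turn the matrix into a 'dist' object
hc <- hclust(d, method = "single")  # single-linkage systematic clustering
plot(hc)                            # dendrogram: the closest pair is merged first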

2. Partitioning Clustering Method

Partitioning Clustering Algorithm:

Given a dataset containing n samples, a partitioning method (Partitioning Method) divides the n samples into k clusters (k ≤ n) according to a specific metric, such that each cluster contains at least one object, each object belongs to exactly one cluster, and there is no hierarchical relationship between the clusters.

Most partition-based methods divide the data according to distance: the samples are first assigned to an initial partition, then the distances between samples are computed and each sample is reassigned to the cluster closest to it, producing a new partition. This computation is iterated until the clustering result meets the user-specified requirements.

To obtain the optimal clustering result, an algorithm would have to enumerate every possible partition of the dataset. Since the amount of data in practical applications is large, exhaustive search is clearly unrealistic, so most partition-based clustering methods adopt a greedy strategy: they seek the best solution at each partitioning step and iterate on it, gradually improving the quality of the clustering. Although this only yields locally optimal results, it is acceptable once efficiency is taken into account.

Algorithm:

Example:

There are some points in a 2D space, and we want to classify them into 3 classes, i.e., K=3.

First, we randomly select 3 initial centers of mass, each of which is a class:

Then, for each point that is not a center of mass, we calculate its distance to each of the three centers of mass:

Classify these points into the class of the center of mass that is closest to them:

Recalculate the centers of mass for these three classifications:

Repeat the above two steps over and over again, updating the three classes:

The iteration stops when the assignment is stable, and the three classes at that point are the final clustering result:

The best-known partitioning algorithms are the k-means algorithm and the k-medoids algorithm (medoid-based clustering).

Density-based clustering deals with data that look like "a number of islands in the ocean", where the islands (clusters) are distinguished from one another by their density.

Most partition-based methods use distance measures to divide the dataset. This works well when the clusters are spherical, but fails to cluster the samples correctly when they are not, and it is heavily influenced by noise in the data. Density-based methods (Density-based Method) can overcome these two weaknesses.

Density-based methods introduce the idea of "density": given a neighborhood, when the number of sample points in it reaches or exceeds a density threshold, the samples in the neighborhood are added to the current cluster; if the density in the neighborhood does not meet the threshold, the current cluster is complete and the next cluster is started. Density-based methods can therefore detect and filter out outliers in the dataset.

Algorithm:
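The best-known density-based algorithm is DBSCAN. A minimal R sketch using the dbscan package (the simulated data and the eps/minPts settings are illustrative assumptions):

library(dbscan)

set.seed(1)
# two dense "islands" plus some scattered background noise
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
           matrix(runif(20, min = -1, max = 4), ncol = 2))

# eps: neighbourhood radius; minPts: density threshold within that radius
db <- dbscan(x, eps = 0.5, minPts = 5)
db$cluster   # 0 marks points treated as noise/outliers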

Grid-based Method (Grid-based Method) divides the dataset space into a finite number of grid cells to form a network structure, and in the subsequent clustering process, clustering is performed with the grid cells as the basic unit, rather than with the samples. Since the processing time of the algorithm is independent of the number of samples and only related to the number of grid cells, this method is highly efficient when dealing with large data sets. Grid-based methods can be used in combination with density-based methods, hierarchy-based methods, and so on, on the basis of grid cell division.

Model-based Method assumes that the dataset satisfies a certain distribution model, and by finding such a distribution model, the dataset can be clustered. Model-based methods mainly include two categories: statistical-based and neural network-based, the former represented by Gaussian Mixture Models (GMM) and the latter represented by Self Organizing Map (SOM). Statistical model-based approaches are currently dominant.
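For the statistical branch, a Gaussian mixture model can be fitted in R with the mclust package; a minimal sketch (the built-in iris measurements are used purely for illustration):

library(mclust)

fit <- Mclust(iris[, 1:4])   # number of components and covariance model chosen by BIC
summary(fit)                 # chosen model and number of mixture components
head(fit$classification)     # cluster assignment of each observation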

The following was added later:

Example of data:

In order to use a clustering algorithm effectively, the first thing you need to do is measure the distance between observations, which in R is usually done with the dist function in the stats package:

dist(x, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)

The dist function calculates the distances between the rows of a matrix or data frame and returns a distance matrix (an object of class dist). The parameters of the dist function are described below.
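A short usage sketch (the data matrix is made up):

set.seed(1)
x <- matrix(rnorm(20), nrow = 5)     # 5 observations, 4 variables

d1 <- dist(x)                        # Euclidean distances between the rows
d2 <- dist(x, method = "manhattan")  # other methods: "maximum", "canberra", "minkowski", ...
as.matrix(d1)                        # view the 'dist' object as a full 5 x 5 matrix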

Another way to calculate the distance between points is the daisy function in the cluster package:

The daisy function calculates the dissimilarity between every pair of observations in a dataset. The parameters of the daisy function are described as follows:
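A short usage sketch (the mixed-type data frame is made up); when factor columns are present, daisy uses Gower's coefficient, which dist cannot do:

library(cluster)

df <- data.frame(height = c(170, 182, 165, 175),
                 weight = c(65, 80, 55, 72),
                 sex    = factor(c("f", "m", "f", "m")))

d <- daisy(df, metric = "gower")   # Gower dissimilarity handles numeric and factor columns
as.matrix(d)                       # pairwise dissimilarities between the four observations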

k-means clustering is one of the simplest clustering algorithms. In R it is implemented by the kmeans function in the stats package:

kmeans(x, centers, iter.max = 10, nstart = 1, algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"), trace = FALSE)

The parameters of the kmeans function are described as follows.
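A short usage sketch (the three simulated groups and the choice of centers = 3 are illustrative):

set.seed(1)
# three hypothetical groups of points in 2-D space
x <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
           matrix(rnorm(50, mean = 4), ncol = 2),
           matrix(rnorm(50, mean = 8), ncol = 2))

km <- kmeans(x, centers = 3, nstart = 25)   # nstart > 1 restarts from several random initial centers
km$centers                                  # final centers of mass of the three classes
table(km$cluster)                           # how many points were assigned to each class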