In a training image, the number of data events is very large. Comparing these data events one by one with the simulated regional data patterns demands high computer performance and is computationally inefficient. Analysis of the data events shows that many of them are highly similar and can be grouped into the same category, which greatly reduces the number of distinct data events and improves computational efficiency. Based on this consideration, the cluster analysis technique was introduced into multipoint geostatistics.
The K-means algorithm, proposed by J. B. MacQueen in 1967, is one of the most influential of the many clustering algorithms that have been applied in science and industry to date. It is a basic partitional clustering method and commonly uses the sum-of-squared-error criterion as its clustering criterion function, which is defined as
$J_e = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - m_i \rVert^2$

where $m_i$ ($i = 1, 2, \ldots, k$) is the mean of the data objects in class $C_i$, and $C_1, C_2, \ldots, C_k$ denote the $k$ classes.
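As a concrete illustration (not part of the original text), the criterion can be evaluated with a few lines of Python/NumPy; the names X, labels and centers are assumed placeholders for the data objects, their class assignments and the class means:

```python
import numpy as np

def sse_criterion(X, labels, centers):
    """Sum-of-squared-error criterion J_e for a given partition.

    X       : (n, d) array of data objects (e.g., vectorized data events)
    labels  : (n,) array assigning each object to one of the k classes
    centers : (k, d) array of class means m_i
    """
    # Squared Euclidean distance of each object to the mean of its own
    # class, summed over all objects and hence over all k classes.
    return float(np.sum((X - centers[labels]) ** 2))
```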
The K-means algorithm works as follows: first, K points are randomly selected from the data set as the initial cluster centers; then the distance from each sample to every cluster center is calculated, and each sample is assigned to the class whose center is closest. The mean of the data objects in each newly formed cluster is then computed to obtain the new cluster centers. If the cluster centers do not change between two successive iterations, the sample adjustment is finished and the clustering criterion function has converged. A characteristic of this algorithm is that every iteration checks whether the classification of each sample is correct; if it is not, the sample is reassigned, and after all samples have been adjusted the cluster centers are updated and the next iteration begins. If, in one iteration, all samples are classified correctly, no adjustment takes place and the cluster centers do not change, which marks convergence and ends the algorithm; a minimal sketch of this procedure is given after the step list below.
The basic steps are as follows:
a. For the set of data objects, arbitrarily select K objects as the initial class centers;
b. According to the average value of the objects in the class, reassign each object to the most similar class;
c. Update the average value of the class, i.e., compute the average of the objects in each class;
d. Repeat steps b and c;
e. Stop when no more changes occur.
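The following is a minimal Python/NumPy sketch of steps a–e (an illustrative assumption, not the implementation used for the results in this book); X is taken to be an (n, d) array whose rows are the objects to be clustered:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=None):
    """Minimal K-means following steps a-e: random initial centers,
    assignment to the nearest center, center update, repeat until stable."""
    rng = np.random.default_rng(seed)
    # a. arbitrarily select k objects as the initial class centers
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # b. reassign each object to the nearest (most similar) class center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # d./e. stop when no object changes class between two iterations
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # c. recompute each class center as the mean of its members
        for i in range(k):
            members = X[labels == i]
            if len(members) > 0:
                centers[i] = members.mean(axis=0)
    return labels, centers
```

The sketch stops as soon as two successive assignments are identical, which corresponds to the convergence condition described above.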
Figure 2-7 shows the result of a cluster analysis of data events carried out with the K-means method. The number of data classes is set to 10. The data events are taken from Figure 2-8, and the data samples used are 8×8 in size.
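Under the assumption that the training image of Figure 2-8 is available as a 2-D NumPy array (here called ti, a hypothetical name), a comparable experiment can be sketched with scikit-learn's KMeans, scanning the image with an 8×8 template and grouping the resulting data events into 10 classes:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_data_events(ti, template=8, n_classes=10, seed=0):
    """Scan a 2-D training image with an 8x8 template, vectorize every
    data event, and group the events into n_classes clusters."""
    rows, cols = ti.shape
    events = np.array([
        ti[i:i + template, j:j + template].ravel()
        for i in range(rows - template + 1)
        for j in range(cols - template + 1)
    ])
    km = KMeans(n_clusters=n_classes, n_init=10, random_state=seed).fit(events)
    # km.labels_ assigns every data event to one of the n_classes classes;
    # the cluster centers are the class-mean data events (cf. Figure 2-7).
    return km.labels_, km.cluster_centers_.reshape(n_classes, template, template)
```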
The advantage of the K-means algorithm is that it works well when the clusters are dense and the distinction between classes is clear, and it is relatively scalable and efficient for processing large data sets. There are, however, three main disadvantages:
Figure 2-7 Clustering results of the K-means method
Figure 2-8 Training image used for clustering, with data samples of 8×8
1) In the K-means algorithm, K is given in advance, and it is very difficult to estimate an appropriate value of K. In many cases it is not known beforehand into how many categories a given data set should be divided for the result to be most appropriate. This is a shortcoming of the K-means algorithm.
2) In the K-means algorithm, an initial partition must first be determined from the initial cluster centers, and this initial partition is then optimized. The selection of the initial cluster centers has a large impact on the clustering result; if the initial values are poorly chosen, an effective clustering result may not be obtained, which has become a major problem of the K-means algorithm.
3) From the framework of the K-means algorithm it can be seen that the algorithm must repeatedly adjust the sample classification and recompute the cluster centers after each adjustment, so when the data volume is very large the time overhead of the algorithm is considerable. The time complexity of the algorithm therefore needs to be analyzed and improved in order to broaden its scope of application.