The idea of the K-Means algorithm is simple: given a set of samples, divide them into K clusters according to the distances between samples, so that points within a cluster are as close together as possible while the distances between clusters are as large as possible.
1. Randomly select the initial centers of the K clusters.
2. For each sample point, compute its distance to each of the K cluster centers and assign it to the cluster whose center is closest.
3. In each iteration, update the center (centroid) of each cluster, typically as the mean of the samples assigned to it.
4. Repeat steps 2 and 3 to iteratively update the K cluster centers; if the centers move only very little (a threshold can be set), the algorithm is considered to have reached a stable state and the iteration ends. (When plotting, different colors can be used to mark the different clusters and their centers.) A minimal sketch of these steps is given below.
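The following is a minimal NumPy sketch of the four steps above; the function name kmeans and its parameters (max_iter, tol) are illustrative, not taken from the original text:

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-4, seed=0):
    """Plain K-Means: X is an (n, d) array, k is the number of clusters."""
    rng = np.random.default_rng(seed)
    # step 1: pick k random samples as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # step 2: assign each sample to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: recompute each center as the mean of its assigned samples
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # step 4: stop when the centers barely move
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    return centers, labels
```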
Advantages of K-Means:
1. The principle is simple, the implementation is easy, and convergence is fast.
2. The clustering effect is fairly good.
3. The algorithm is highly interpretable.
4. The only main parameter that needs to be tuned is the number of clusters k.
Disadvantages of K-Means:
1. The value of K is hard to choose.
2. Convergence is difficult for data sets that are not convex.
3. If the hidden categories of the data are unbalanced, for example the amount of data in each hidden category is severely imbalanced or the variances of the hidden categories differ, the clustering result is poor.
4. The final result depends on the choice of the initial points, and the algorithm easily falls into a local optimum.
5. It is sensitive to noise and outliers.
To address the K-Means algorithm's sensitivity to the initial centroids, the bisecting K-Means algorithm was proposed as a way of weakening the influence of the initial centroids.
1. Put all the sample data into a queue as a single cluster.
2. Select a cluster from the queue, split it into two sub-clusters with the K-Means algorithm (k = 2), and add both sub-clusters back to the queue.
3. Repeat step 2 until a stopping condition is reached (number of clusters, minimum sum of squared errors, number of iterations, etc.).
4. The clusters remaining in the queue are the final clustering result.
There are generally two rules for choosing which cluster to take from the queue for splitting:
1. Compute the sum of squared errors (SSE) of every cluster (SSE can also be regarded as a variant of the distance function) and split the cluster with the largest SSE (this strategy is usually preferred).
2. Split the cluster containing the largest number of samples.
A minimal sketch of the bisecting procedure using the SSE rule is given below.
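The sketch below assumes scikit-learn's KMeans for the inner two-way split; the function name bisecting_kmeans and its parameters are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, seed=0):
    """Split the cluster with the largest SSE until k clusters remain."""
    clusters = [X]                      # the "queue" of clusters, starting with all the data
    while len(clusters) < k:
        # pick the cluster with the largest SSE (sum of squared distances to its mean)
        sse = [((c - c.mean(axis=0)) ** 2).sum() for c in clusters]
        target = clusters.pop(int(np.argmax(sse)))
        # split it into two sub-clusters with ordinary K-Means (k = 2)
        labels = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(target)
        clusters.append(target[labels == 0])
        clusters.append(target[labels == 1])
    return clusters
```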
Since the clustering result of the K-Means algorithm depends on the selection of the initial points, an improvement of the algorithm was proposed: K-Means++.
In fact, this algorithm only improves the selection of the initial points; all other steps are the same. The basic idea of the initial centroid selection is that the initial cluster centers should be as far away from each other as possible.
1. Randomly select one sample as the first cluster center c1.
2. For each sample, compute the shortest distance to the existing cluster centers (that is, the distance to its nearest cluster center), denoted D(x); the larger this value, the higher the probability that the sample is selected as the next cluster center. The next center is then chosen by the roulette-wheel method.
3. Repeat step 2 until k cluster centers have been selected.
4. Use these initial points (cluster centers) to run the standard K-Means algorithm. (A sketch of the initialization step is shown below.)
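A minimal sketch of the K-Means++ initialization, assuming NumPy; the standard formulation weights the roulette wheel by D(x)², and the function name kmeans_pp_init is illustrative:

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """K-Means++ initialization: pick centers far apart, with probability proportional to D(x)^2."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]          # step 1: first center chosen at random
    for _ in range(k - 1):
        # D(x): distance from each sample to its nearest existing center
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        # roulette-wheel selection with probability proportional to D(x)^2
        probs = d ** 2 / (d ** 2).sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```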
Although K-Means++ spends extra time computing the initial cluster centers, K-Means itself then converges quickly during the iterations, so the algorithm actually reduces the overall computation time.
The K-Means|| algorithm was proposed to address a shortcoming of K-Means++ (the centers are chosen one at a time, so k sequential passes over the data are needed). The main idea is to change the sampling rule in each pass: instead of drawing only one sample per pass as in K-Means++, K samples are drawn per pass, and the sampling is repeated O(log n) times (n is the number of samples). The sampled points are then clustered into K points, and these K points are used as the initial cluster centers of the K-Means algorithm. Practice shows that about 5 passes are usually enough to obtain good cluster centers.
1. Draw K samples from the n samples in each pass, and repeat the drawing log n times, forming a new sample set with a total of K·log n points.
2. Run the K-Means algorithm on this new sample set to find K cluster centers.
3. Use these K cluster centers as the initial cluster centers for the original sample set.
4. Run the K-Means algorithm on the original data set with these initial cluster centers to compute the final clusters. (A simplified sketch follows below.)
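A simplified sketch of this idea, assuming NumPy and scikit-learn's KMeans; the function name and the choice of 5 passes are illustrative, and candidate points are drawn with probability proportional to their squared distance to the current centers:

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_parallel_init(X, k, passes=5, seed=0):
    """K-Means||-style initialization: oversample candidates, then reduce them to k centers."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]           # start from one random center
    for _ in range(passes):
        # squared distance of each sample to its nearest current center
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        # draw k new candidate points per pass, with probability proportional to d^2
        new_idx = rng.choice(len(X), size=k, replace=False, p=d2 / d2.sum())
        centers.extend(X[new_idx])
    # cluster the roughly k * passes candidates down to k centers
    candidates = np.array(centers)
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit(candidates).cluster_centers_

# usage sketch: initialize, then run standard K-Means on the full data set
# init_centers = kmeans_parallel_init(X, k=5)
# model = KMeans(n_clusters=5, init=init_centers, n_init=1).fit(X)
```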
Canopy is a 'coarse' clustering algorithm: it uses a simple, fast distance computation to divide the data set into a number of possibly overlapping subsets called canopies. The algorithm does not require the value of k to be specified, but its accuracy is low, so it is usually combined with the K-Means algorithm: first the Canopy algorithm performs coarse clustering to obtain k centroids, and then the K-Means algorithm is used for the actual clustering.
1. Randomly arrange the original samples into a list L = [x1, x2, ..., xm] (the list is not changed afterwards), and set the initial distance thresholds T1 and T2 according to prior knowledge or cross-validation, with T1 > T2.
2. Randomly select a sample P from the list L as the centroid of the first canopy, and remove P from the list.
3. Randomly select a sample Q from the list L, compute the distances from Q to the centroids of all existing canopies, and examine the smallest of these distances, D:
If D ≤ T1, give Q a weak mark indicating that Q belongs to that canopy, and add Q to it;
If D ≤ T2, give Q a strong mark indicating that Q belongs to that canopy and is very close to its centroid; set the centroid of that canopy to the center of all its strongly marked samples, and remove Q from the list L;
If D > T1, Q forms a new canopy, and Q is removed from the list L.
4. Repeat step 3 until the number of elements in list L is zero. (A minimal sketch of the Canopy procedure follows.)
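The sketch below is a minimal NumPy version of Canopy; the function name and parameters are illustrative. For simplicity it follows the common formulation in which each pass picks a new centroid at random and removes all strongly marked points, rather than re-averaging centroids as described in step 3:

```python
import numpy as np

def canopy(X, t1, t2, seed=0):
    """Coarse Canopy clustering; returns canopy centroids and member indices. Assumes t1 > t2."""
    rng = np.random.default_rng(seed)
    remaining = list(rng.permutation(len(X)))    # the list L
    centroids, members = [], []
    while remaining:
        # pick a random remaining sample as the centroid of a new canopy
        p = remaining.pop(rng.integers(len(remaining)))
        c = X[p]
        d = np.linalg.norm(X[remaining] - c, axis=1) if remaining else np.array([])
        # weak mark: all points within t1 belong to this canopy (they may also join others)
        canopy_idx = [p] + [remaining[i] for i in np.flatnonzero(d <= t1)]
        # strong mark: points within t2 are removed from L and will not seed new canopies
        strong = {remaining[i] for i in np.flatnonzero(d <= t2)}
        remaining = [i for i in remaining if i not in strong]
        centroids.append(c)
        members.append(canopy_idx)
    return np.array(centroids), members
```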
1. The choice of the 'coarse' distance measure strongly affects the resulting canopies, e.g. using a single attribute, some external attribute, the Euclidean distance, etc.
2. When T2 < D ≤ T1, the sample is not removed from the list; it continues to participate in the following iterations until it becomes a new centroid or a strongly marked member of some canopy.
3. The values of T1 and T2 affect the overlap rate and granularity of the canopies: if T1 is too large, samples belong to multiple canopies and the differences between canopies become unclear; if T2 is too large, the number of canopies decreases; if T2 is too small, the number of canopies increases and so does the computation time.
4. Canopies may overlap, but there will never be a sample that does not belong to any canopy.
5. The Canopy algorithm can eliminate isolated points: canopies that contain only a few samples can be removed, since these canopies usually contain isolated points or noise points.
Since the K-Means algorithm is sensitive to the initial cluster centroids, the Canopy and K-Means algorithms are often combined to build the model.
1. First use the Canopy algorithm for 'coarse' clustering to obtain K cluster centroids.
2. Then run the K-Means algorithm for 'fine' clustering, using the K centroids obtained by the Canopy algorithm as the initial centroids. (A short end-to-end sketch is given after the advantages below.)
Advantages of Canopy + K-Means:
1. It is fast (a preprocessing step selects the cluster center points first).
2. The value of K does not need to be given in advance, which suits many application scenarios.
3. It alleviates the K-Means algorithm's sensitivity to the initial cluster centers.
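A short, self-contained sketch of the Canopy + K-Means combination; the toy data and the threshold t2 are illustrative, and only the strong-mark threshold is used here because the coarse step only needs to produce centroids:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                    # toy data, illustrative only

# 'coarse' step: a tiny canopy pass to collect initial centroids
t2 = 1.0                                         # strong-mark threshold (T2)
remaining = list(rng.permutation(len(X)))
centroids = []
while remaining:
    p = remaining.pop(rng.integers(len(remaining)))
    centroids.append(X[p])
    d = np.linalg.norm(X[remaining] - X[p], axis=1) if remaining else np.array([])
    remaining = [i for i, di in zip(remaining, d) if di > t2]   # drop strongly marked points

# 'fine' step: K-Means initialized with the canopy centroids
init = np.array(centroids)
model = KMeans(n_clusters=len(init), init=init, n_init=1).fit(X)
print("number of canopies / clusters:", len(init))
```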
The Mini Batch K-Means algorithm is an optimized variant of the K-Means algorithm. It uses small subsets of the data (each training step works on a randomly selected subset of the data) to reduce computation time while still trying to optimize the same objective function. Mini Batch K-Means reduces the convergence time of the K-Means algorithm and produces results that are only slightly worse than those of the standard K-Means algorithm.
1. First draw a subset of the data set and use the K-Means algorithm to build a model with K cluster centers.
2. Continue to draw sample data from the training set, add it to the model, and assign each sample to its nearest cluster center.
3. Update the cluster center values.
4. Repeat steps 2 and 3 until the centers are stable or the maximum number of iterations is reached, then stop. (A usage example follows below.)
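A usage example based on scikit-learn's MiniBatchKMeans, which implements the mini-batch procedure described above; the toy data, batch size, and number of batches are illustrative:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 2))        # toy data, illustrative only

# one-shot: fit on mini-batches drawn internally from X
model = MiniBatchKMeans(n_clusters=5, batch_size=256, random_state=0).fit(X)
print(model.cluster_centers_)

# streaming style, closer to steps 1-4 above: feed batches yourself with partial_fit
stream = MiniBatchKMeans(n_clusters=5, random_state=0)
for _ in range(50):
    batch = X[rng.choice(len(X), size=256, replace=False)]
    stream.partial_fit(batch)          # assign the batch to the nearest centers and update them
print(stream.cluster_centers_)
```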