R language learning notes: cluster analysis

Packages required to use k-means clustering:

factoextra

cluster

#Load packages

library(factoextra)

library(cluster)

#Data preparation

Use the built-in R data set USArrests

#load the dataset

data("USArrests")

#Remove any missing values (NA, i.e. "not available")

#that might be present in the data

USArrests <- na.omit(USArrests)

#View the first 6 rows of the data

head(USArrests, n = 6)

In this data set, the columns are variables (Murder, Assault, UrbanPop, Rape) and the rows are observations (the 50 US states).

Before clustering, we first run some basic checks on the data by computing descriptive statistics such as the minimum, median, mean, standard deviation and maximum of each variable.

desc_stats <- data.frame(

  Min = apply(USArrests, 2, min), #minimum

  Med = apply(USArrests, 2, median), #median

  Mean = apply(USArrests, 2, mean), #mean

  SD = apply(USArrests, 2, sd), #standard deviation

  Max = apply(USArrests, 2, max) #maximum

)

desc_stats <- round(desc_stats, 1) #Retain one decimal place

head(desc_stats)

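Standardization is required when the variables are measured on different scales, so that their means and variances differ widely. Otherwise the variable with the largest variance (here Assault) dominates the distance calculation.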

df <- scale(USArrests)
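To make clear what scale() does here, the following minimal sketch standardizes the data by hand and compares the result with scale(). It uses base R only; the names x, centered, manual and auto are illustrative and do not appear in the original notes.

```r
# Standardize by hand: subtract each column mean, then divide by each column SD
data("USArrests")
x <- as.matrix(USArrests)
centered <- sweep(x, 2, colMeans(x), "-")
manual <- sweep(centered, 2, apply(x, 2, sd), "/")

# scale() gives the same values (it also stores the centers and scales as attributes)
auto <- scale(USArrests)
all.equal(manual, auto, check.attributes = FALSE)

# After scaling, every column has mean 0 and standard deviation 1
round(colMeans(auto), 10)
apply(auto, 2, sd)
```

Because every variable now has unit variance, each one contributes comparably to the Euclidean distances that k-means relies on.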

#Data cluster assessment

Use get_clust_tendency() to calculate the Hopkins statistic

res <- get_clust_tendency(df, 40, graph = TRUE)

res$hopkins_stat

## [1] 0.3440875

#Visualize the dissimilarity matrix

res$plot

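The Hopkins statistic here is below 0.5. With the version of factoextra used for these notes, values well below 0.5 indicate that the data is highly clusterable. (Newer factoextra releases report the statistic on the opposite scale, where values close to 1 indicate clusterable data, so check the documentation of your installed version.) The plot of the dissimilarity matrix also suggests that the data contains clusters.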

#Estimate the number of clusters

Since k-means clustering requires the number of clusters to be specified in advance, we use the function clusGap() from the cluster package to compute the gap statistic for each candidate number of clusters. The function fviz_gap_stat() from factoextra visualizes the result.

set.seed(123)

## Compute the gap statistic

gap_stat <- clusGap(df, FUN = kmeans, nstart = 25, K.max = 10, B = 500)

# Plot the result

fviz_gap_stat(gap_stat)

The plot suggests that four clusters (k = 4) is the best choice.
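The suggested number of clusters can also be read from the clusGap() result without a plot. The sketch below is one way to do it, under two assumptions that are not in the original notes: B is reduced to 50 only to keep the example fast, and the selection rule is "firstSEmax", which is the default of the print method for clusGap objects. The names gap_small, tab and k_hat are illustrative. With fewer bootstrap samples the suggested k may differ from the one shown in the plot.

```r
library(cluster)

data("USArrests")
df <- scale(USArrests)

set.seed(123)
# Assumption: B = 50 bootstrap samples, chosen only to keep the sketch fast
gap_small <- clusGap(df, FUN = kmeans, nstart = 25, K.max = 10, B = 50,
                     verbose = FALSE)

# Tab holds one row per candidate k, with the gap value and its simulation SE
tab <- gap_small$Tab
head(tab)

# Apply the "firstSEmax" rule to pick the number of clusters
k_hat <- maxSE(f = tab[, "gap"], SE.f = tab[, "SE.sim"],
               method = "firstSEmax", SE.factor = 1)
k_hat
```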

#Perform clustering

set.seed(123)

km.res <- kmeans(df, 4, nstart = 25)

head(km.res$cluster, 20)

# Visualize clusters using factoextra

fviz_cluster(km.res, USArrests)
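Besides the plot, it can help to look at the clusters numerically. The following sketch, using base R only, prints the cluster sizes and the mean of each original (unscaled) variable per cluster. The name cluster_means is illustrative, and the clustering is recomputed so that the example runs on its own.

```r
data("USArrests")
df <- scale(USArrests)

set.seed(123)
km.res <- kmeans(df, 4, nstart = 25)

# Number of states in each cluster
km.res$size

# Mean of each original variable within each cluster
cluster_means <- aggregate(USArrests,
                           by = list(cluster = km.res$cluster),
                           FUN = mean)
cluster_means

# Share of the total variance explained by the clustering
km.res$betweenss / km.res$totss
```

Reading the means in the original units makes it easier to describe each cluster, for example as a high-crime or low-crime group.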

# Check the cluster silhouette plot

Recall that the silhouette width S_i measures how similar an object i is to the other objects in its own cluster compared with those in the nearest neighboring cluster. S_i values range from 1 to -1:

A value of S_i close to 1 indicates that the object is well clustered. In other words, object i is similar to the other objects in its group.

A value of S_i close to -1 indicates that the object is poorly clustered, and that assigning it to some other cluster would probably improve the overall result.
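The silhouette width can be computed by hand to see where these values come from. The sketch below assumes the usual definition: a is the mean distance from the object to the other members of its own cluster, b is the smallest mean distance from the object to the members of any other cluster, and the width is (b - a) / max(a, b). The helper sil_by_hand() and the names cl, d and s_manual are hypothetical and written only for this illustration.

```r
data("USArrests")
df <- scale(USArrests)

set.seed(123)
cl <- kmeans(df, 4, nstart = 25)$cluster
d <- as.matrix(dist(df))  # full matrix of pairwise Euclidean distances

# Hypothetical helper: silhouette width of observation i
sil_by_hand <- function(i, cl, d) {
  own <- cl[i]
  same <- setdiff(which(cl == own), i)
  if (length(same) == 0) return(0)  # an object alone in its cluster gets width 0
  a <- mean(d[i, same])             # mean distance within its own cluster
  others <- setdiff(unique(cl), own)
  b <- min(sapply(others, function(k) mean(d[i, cl == k])))  # nearest other cluster
  (b - a) / max(a, b)
}

s_manual <- sapply(seq_len(nrow(df)), sil_by_hand, cl = cl, d = d)
names(s_manual) <- rownames(df)
head(round(s_manual, 3))
```

An object gets a negative width exactly when b is smaller than a, that is, when it lies closer on average to another cluster than to its own.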

sil <- silhouette(km.res$cluster, dist(df))

rownames(sil) <- rownames(USArrests)

head(sil[, 1:3])

#Visualize

fviz_silhouette(sil)

The plot shows that some silhouette widths are negative. The matrix returned by silhouette() lets us find which observations they belong to:

neg_sil_index <- which(sil[, "sil_width"] < 0)

sil[neg_sil_index, , drop = FALSE]

##          cluster neighbor   sil_width

## Missouri       3        2 -0.07318144

#eclust(): Enhanced cluster analysis

Compared with the functions in other cluster analysis packages, eclust() has the following advantages: