R language learning notes: cluster analysis
Packages required for k-means clustering:
factoextra
cluster
#Load packages
library(factoextra)
library(cluster)
#Data preparation
We use the built-in R data set USArrests.
#load the dataset
data("USArrests")
#remove any missing values (i.e., NA values, for "not available")
#that might be present in the data
USArrests <- na.omit(USArrests)
#view the first 6 rows of the data
head(USArrests, n = 6)
In this data set, the columns are variables and the rows are observations.
Before clustering, we can first run some basic checks on the data, that is, compute descriptive statistics such as the mean and the standard deviation of each variable.
desc_stats <- data.frame(
  Min  = apply(USArrests, 2, min),    # minimum
  Med  = apply(USArrests, 2, median), # median
  Mean = apply(USArrests, 2, mean),   # mean
  SD   = apply(USArrests, 2, sd),     # standard deviation
  Max  = apply(USArrests, 2, max)     # maximum
)
desc_stats <- round(desc_stats, 1) # retain one decimal place
head(desc_stats)
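Standardization is required when the variables differ widely in mean and variance; otherwise the variables measured on the largest scales dominate the distance calculation. scale() centers each column to mean 0 and rescales it to standard deviation 1.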
df <- scale(USArrests)
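To see what standardization does, the short check below (a sketch using base R only) compares the raw variances with the column means and standard deviations after scale(), and reproduces one scaled column by hand. The variable names col_means, col_sds and manual_assault are chosen here for illustration.

```r
data("USArrests")
USArrests <- na.omit(USArrests)

# Raw variances differ by orders of magnitude between variables
print(apply(USArrests, 2, var))

# Standardize: subtract the column mean, divide by the column standard deviation
df <- scale(USArrests)

# After scaling, every column has mean 0 and standard deviation 1
col_means <- colMeans(df)
col_sds <- apply(df, 2, sd)
print(round(col_means, 10))
print(col_sds)

# The same result computed by hand for a single column
manual_assault <- (USArrests$Assault - mean(USArrests$Assault)) / sd(USArrests$Assault)
print(all.equal(manual_assault, unname(df[, "Assault"])))
```

Because all columns now share the same scale, each variable contributes comparably to the Euclidean distances used by k-means.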
#Assessing clustering tendency
Use get_clust_tendency() to calculate the Hopkins statistic.
res <- get_clust_tendency(df, 40, graph = TRUE)
res$hopkins_stat
## [1] 0.3440875
#Visualize the dissimilarity matrix
res$plot
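The Hopkins statistic is below 0.5, which, under the convention used by the factoextra version in these notes (values near 0 indicate clusterable data), suggests that the data are clusterable. The dissimilarity plot points the same way. Newer factoextra releases appear to report the statistic on a reversed scale (values near 1 indicate clusterable data), so check the documentation of the installed version before interpreting the value.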
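The idea behind the Hopkins statistic can be sketched in base R. This is a minimal illustration, not factoextra's implementation: the function name hopkins_sketch and its arguments are hypothetical, and it assumes the convention in which values near 0 suggest clusterable data. It compares nearest-neighbor distances of sampled real points (w) with nearest-neighbor distances of uniformly generated points (u).

```r
# Hypothetical helper: Hopkins statistic, convention "near 0 = clusterable"
hopkins_sketch <- function(x, n, seed = 123) {
  set.seed(seed)
  x <- as.matrix(x)

  # w: distance from each of n sampled real points to its nearest other real point
  idx <- sample(nrow(x), n)
  d_real <- as.matrix(dist(x))
  diag(d_real) <- Inf                      # exclude the point itself
  w <- apply(d_real[idx, , drop = FALSE], 1, min)

  # u: distance from each of n uniform random points (drawn within the range
  # of every column) to its nearest real point
  rand <- apply(x, 2, function(col) runif(n, min(col), max(col)))
  u <- apply(rand, 1, function(p) min(sqrt(colSums((t(x) - p)^2))))

  sum(w) / (sum(u) + sum(w))
}

data("USArrests")
df <- scale(na.omit(USArrests))
h <- hopkins_sketch(df, n = 40)
print(h)
```

The exact value depends on the random sample and will generally differ from the get_clust_tendency() output; only the interpretation (real points sit closer to their neighbors than random points do) carries over.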
#Estimating the number of clusters
Since k-means clustering requires the number of clusters to be specified in advance, we use the function clusGap() from the cluster package to compute the gap statistic for a range of cluster numbers and estimate the optimal one. The function fviz_gap_stat() visualizes the result.
set.seed(123)
## Compute the gap statistic
gap_stat <- clusGap(df, FUN = kmeans, nstart = 25, K.max = 10, B = 500)
# Plot the result
fviz_gap_stat(gap_stat)
The plot suggests that the optimal number of clusters is four (k = 4).
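The suggested number of clusters can also be read off numerically rather than from the plot. The sketch below uses maxSE() from the cluster package on the table stored in the clusGap result. Two assumptions are mine: B is reduced to 50 so the example runs quickly, and the "firstSEmax" rule is used, which to my knowledge is also the default of fviz_gap_stat() but is worth verifying for the installed version.

```r
library(cluster)

data("USArrests")
df <- scale(na.omit(USArrests))

set.seed(123)
# B = 50 bootstrap samples keeps the run short; the notes use B = 500
gap_stat <- clusGap(df, FUN = kmeans, nstart = 25, K.max = 10, B = 50)

# Tab holds one row per k with columns logW, E.logW, gap and SE.sim
print(round(gap_stat$Tab, 3))

# Pick k with the "firstSEmax" rule
k_opt <- maxSE(gap_stat$Tab[, "gap"], gap_stat$Tab[, "SE.sim"],
               method = "firstSEmax")
print(k_opt)
```

With a smaller B the standard errors are noisier, so the selected k may differ from the one obtained with B = 500.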
#Perform clustering
set.seed(123)
km.res <- kmeans(df, 4, nstart = 25)
head(km.res$cluster, 20)
# Visualize clusters using factoextra
fviz_cluster(km.res, USArrests)
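Besides the plot, the kmeans() result can be inspected directly. The following sketch uses base R only; the names cluster_means and between_share are introduced here for illustration. It prints the cluster sizes and centers, and summarizes the original (unscaled) variables per cluster, which is usually easier to interpret than the standardized centers.

```r
data("USArrests")
USArrests <- na.omit(USArrests)
df <- scale(USArrests)

set.seed(123)
km.res <- kmeans(df, 4, nstart = 25)

# Number of observations in each cluster
print(km.res$size)

# Cluster centers on the standardized scale
print(round(km.res$centers, 2))

# Mean of each original variable within each cluster
cluster_means <- aggregate(USArrests, by = list(cluster = km.res$cluster), FUN = mean)
print(cluster_means)

# Share of the total variance explained by the partition
between_share <- km.res$betweenss / km.res$totss
print(between_share)
```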
#Check the cluster silhouette plot
Recall that the silhouette width (Si) measures how similar an object i is to the other objects in its own cluster compared with those in the neighboring cluster. Si values range from 1 to -1:
A value of Si close to 1 indicates that the object is well clustered. In other words, the object i is similar to the other objects in its group.
A value of Si close to -1 indicates that the object is poorly clustered, and that assigning it to some other cluster would probably improve the overall result.
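The definition above can be made concrete with a hand computation. For object i, let a be its mean distance to the other members of its own cluster and b the smallest mean distance to the members of any other cluster; then Si = (b - a) / max(a, b). The sketch below implements this in base R and compares it with silhouette() from the cluster package. The helper silhouette_one is a hypothetical name used for illustration only.

```r
library(cluster)

data("USArrests")
df <- scale(na.omit(USArrests))
set.seed(123)
km <- kmeans(df, 4, nstart = 25)
d <- as.matrix(dist(df))

# Hypothetical helper: silhouette width of observation i
silhouette_one <- function(i, cluster, d) {
  own <- cluster[i]
  same <- setdiff(which(cluster == own), i)
  if (length(same) == 0) return(0)       # singleton cluster: width set to 0
  a <- mean(d[i, same])                   # cohesion within the own cluster
  others <- setdiff(unique(cluster), own)
  b <- min(sapply(others, function(k) mean(d[i, cluster == k])))  # nearest other cluster
  (b - a) / max(a, b)
}

manual <- sapply(seq_len(nrow(df)), silhouette_one, cluster = km$cluster, d = d)
reference <- silhouette(km$cluster, dist(df))[, "sil_width"]

# The two computations should agree up to numerical tolerance
print(all.equal(unname(manual), unname(reference)))
```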
sil <- silhouette(km.res$cluster, dist(df))
rownames(sil) <- rownames(USArrests)
head(sil[, 1:3])
#Visualize
fviz_silhouette(sil)
The plot shows that some silhouette widths are negative. The affected observations can be identified from the output of silhouette():
neg_sil_index <- which(sil[, "sil_width"] < 0)
sil[neg_sil_index, , drop = FALSE]
##          cluster neighbor   sil_width
## Missouri       3        2 -0.07318144
#eclust(): Enhanced cluster analysis
Compared with the functions in other cluster analysis packages, eclust() from factoextra has the following advantages:
It simplifies the cluster analysis workflow.
It can be used for both hierarchical clustering and partitioning clustering.
It automatically computes the optimal number of clusters (using the gap statistic).
It automatically provides silhouette information.
It can be combined with ggplot2 to draw elegant graphics.
#K-means clustering using eclust()
# Compute k-means
res.km <- eclust(df, "kmeans")
# Gap statistic plot
fviz_gap_stat(res.km$gap_stat)
# Silhouette plot
fviz_silhouette(res.km)
##   cluster size ave.sil.width
## 1       1   13          0.31
## 2       2   29          0.38
## 3       3    8          0.39
#Hierarchical clustering using eclust()
# Enhanced hierarchical clustering
res.hc <- eclust(df, "hclust") # compute hclust
fviz_dend(res.hc, rect = TRUE) # dendrogram
#The following R code generates the silhouette plot and a scatter plot for the hierarchical clustering.
fviz_silhouette(res.hc) # silhouette plot
##   cluster size ave.sil.width
## 1       1   19          0.26
## 2       2   19          0.28
## 3       3   12          0.43
fviz_cluster(res.hc) # scatter plot
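For comparison, a hierarchical clustering can also be built with base R alone. The sketch below is not what eclust() runs internally; it assumes Euclidean distance and Ward linkage ("ward.D2"), which I believe correspond to eclust()'s defaults, and it cuts the tree into three groups to mirror the three clusters in the output above. Group sizes may therefore differ from those reported by eclust().

```r
data("USArrests")
df <- scale(na.omit(USArrests))

# Agglomerative clustering on Euclidean distances with Ward linkage (assumed settings)
hc <- hclust(dist(df, method = "euclidean"), method = "ward.D2")

# Cut the dendrogram into 3 groups
groups <- cutree(hc, k = 3)
print(table(groups))

# Cluster membership of the first few states
print(head(groups))
```

Calling plot(hc) draws the corresponding dendrogram with base graphics, without the styling added by fviz_dend().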
#Infos
This analysis was performed using R (version 3.3.2).