Assumptions and Process
Why do we make assumptions in the first place? Assumptions simplify a problem, and a simplified problem is easier to solve reliably. To divide a dataset into clusters, one must first define what a cluster is, and that definition carries the assumptions of the technique. K-means makes two assumptions about the clusters: first, that they are roughly spherical, and second, that they are of similar size. The spherical assumption matters because every point is assigned to its nearest center, so the boundary between two clusters is always a straight line (a hyperplane in higher dimensions) halfway between their centers. If the true clusters are elongated or curved, the clusters formed may not be what one expects. The size assumption matters for the same reason: because the boundary sits midway between two centers, a large cluster lying next to a small one tends to lose some of its points to the smaller cluster.
Clusters in K-means are defined by their centers, and each center is the mean of all the data points in the cluster. When both assumptions hold and the clusters are well separated, the choice of starting centers matters little: starting from randomly placed centers will usually converge to the same final clusters as starting from centers placed as far apart as possible. This is not guaranteed in general, because K-means only finds a local optimum, so it is common to run it from several random starts and keep the best result.
Now let’s understand how the algorithm works. The first step is to choose K initial cluster centers. You can specify them yourself or let the algorithm pick them randomly. The algorithm then works in iterations. In every iteration, each data point is assigned to the cluster whose center is nearest. After all points have been assigned, the cluster centers are updated: each new center is the centroid (mean) of all the points in that cluster. This is repeated, iteration after iteration, until the cluster assignment of every data point stops changing. Several details of this procedure are not fixed by the algorithm, however.
For example, one can decide how the distance between a data point and a cluster center is defined. All of us are familiar with the Euclidean distance, which is what standard K-means uses. The algorithm is straightforward and easy to understand, but using the technique well is not as easy as it looks. Let’s try out some examples in R.
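The iteration described above can be written out in a few lines of base R. This is only a minimal sketch for illustration, not how R's own kmeans() is implemented: the function name simple_kmeans() and its arguments are hypothetical, and squared Euclidean distance is assumed as the distance definition.

```r
# Illustrative sketch of the K-means loop (hypothetical helper, not R's kmeans()).
simple_kmeans <- function(data, k, max_iter = 100) {
  x <- as.matrix(data)
  # Step 1: pick k rows at random as the initial centers
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Step 2: squared Euclidean distance from every point to every center.
    # This is the place where a different distance definition could be used.
    dists <- sapply(seq_len(k), function(j) {
      rowSums((x - matrix(centers[j, ], nrow(x), ncol(x), byrow = TRUE))^2)
    })
    # Assign each point to its nearest center
    new_assignment <- max.col(-dists, ties.method = "first")
    # Stop when no point changes cluster
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment
    # Step 3: move each center to the mean of the points assigned to it
    for (j in seq_len(k)) {
      members <- x[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(cluster = assignment, centers = centers, iterations = iter)
}

set.seed(1)
fit <- simple_kmeans(faithful, k = 2)
table(fit$cluster)  # cluster sizes
fit$centers         # final cluster centers
```

When the loop stops, each center equals the mean of the points assigned to it, which is exactly the stopping condition described in the text.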
K-Means Starter
To understand how K-Means works, we start with an example where all our assumptions hold. R includes a dataset called ‘faithful’, which records the waiting time between eruptions and the duration of each eruption for the Old Faithful geyser in Yellowstone National Park. The dataset consists of 272 observations of 2 features.
    # Viewing the faithful dataset
    plot(faithful)

Looking at the plot, we can notice two clusters.
I will now use the kmeans() function in R to form clusters and see how K-Means clustering works on this data.
    # Specify 2 centers
    k_clust_start = kmeans(faithful, centers = 2)
    # Plot the data using clusters
    plot(faithful, col = k_clust_start$cluster, pch = 2)
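Because kmeans() picks its starting centers at random, it can be worth checking how much the result depends on them. The sketch below is one way to do that, assuming the total within-cluster sum of squares (the tot.withinss component) is a reasonable yardstick. The seed and the number of repetitions are arbitrary choices for illustration.

```r
set.seed(42)
# Run K-means 20 times, each from a single random start, and record the objective
single_runs <- sapply(1:20, function(i) {
  kmeans(faithful, centers = 2, nstart = 1)$tot.withinss
})
range(single_runs)  # a narrow range suggests the starts barely matter here

# nstart = 20 tries 20 random starts internally and keeps the best one
best_fit <- kmeans(faithful, centers = 2, nstart = 20)
best_fit$tot.withinss
best_fit$size
```

On data that satisfies the assumptions, the runs should agree closely. On harder data they may not, and a larger nstart is a cheap safeguard.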

Because the dataset is small, the clusters are formed almost instantaneously, but how do we see the clusters, their centers or their sizes? The k_clust_start variable I used contains information on both the centers and the sizes of the clusters. Let’s check them out.
    # Use the centers to find the cluster centers
    k_clust_start$centers
      eruptions  waiting
    1   4.29793 80.28488
    2   2.09433 54.75000
    # Use the size to find the cluster sizes
    k_clust_start$size
    [1] 172 100
This means the first cluster consists of 172 members and is centered at eruptions = 4.29793 and waiting = 80.28488. Similarly, the second cluster consists of 100 members and is centered at eruptions = 2.09433 and waiting = 54.75. Since the starting centers are random, the cluster numbers (1 and 2) may be swapped when you run this yourself. Now this information is golden! We know that these centers are the cluster means. So, eruptions typically last either ~2 minutes or ~4.3 minutes, and for longer eruptions, the waiting time is also longer.
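The statement that the centers are the cluster means can be checked directly. The following is a small sketch that averages each column within each cluster using aggregate() and compares the result with the centers component. The seed is an arbitrary choice, so the cluster numbering may differ from the output shown above.

```r
set.seed(123)
k_clust_start <- kmeans(faithful, centers = 2)

# Average eruptions and waiting within each cluster
manual_means <- aggregate(faithful,
                          by = list(cluster = k_clust_start$cluster),
                          FUN = mean)
manual_means

# These should match the centers reported by kmeans()
k_clust_start$centers
```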
Getting into the depths
Imagine a dataset with clusters that one can clearly identify but k-means cannot. I’m talking about a dataset which does not satisfy the assumptions. A common example is a dataset of two concentric circles. Let’s generate it and see what it looks like.
    # The following code will generate different plots for you but they will be similar
    # Only dplyr is needed; tibble() replaces the deprecated data_frame()
    library(dplyr)
    # Generate random data which will be the first cluster
    clust1 = tibble(x = rnorm(200), y = rnorm(200))
    # Generate the second cluster which will 'surround' the first cluster
    clust2 = tibble(r = rnorm(200, 15, .5), theta = runif(200, 0, 2 * pi),
                    x = r * cos(theta), y = r * sin(theta)) %>%
      dplyr::select(x, y)
    # Combine the data
    dataset_cir = rbind(clust1, clust2)
    # See the plot
    plot(dataset_cir)

Simple, isn’t it? There are two clusters: one in the middle and the other circling the first. However, this violates the assumption that the clusters are spherical. The inner cluster is spherical while the outer ring is not. Even though the clustering will not be good, let’s see how k-means performs on this data.
    # Fit the k-means model
    k_clust_spher1 = kmeans(dataset_cir, centers = 2)
    # Plot the data and clusters
    plot(dataset_cir, col = k_clust_spher1$cluster, pch = 2)

How do we solve this problem? There are clearly 2 clusters, but k-means does not recover them: it tends to cut the plane in two, so each cluster ends up with part of the ring. A simple way out in this case is to transform our data into polar coordinates. Let’s convert it and plot it.
    # Using a function for the transformation
    cart2pol = function(x, y) {
      # This is r
      newx = sqrt(x^2 + y^2)
      # This is theta; atan2() keeps the correct quadrant, unlike atan(y/x)
      newy = atan2(y, x)
      x_y = cbind(newx, newy)
      return(x_y)
    }
    dataset_cir2 = cart2pol(dataset_cir$x, dataset_cir$y)
    plot(dataset_cir2)
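The polar conversion uses r = sqrt(x^2 + y^2) for the radius and an angle theta. A quick check on four made-up points, one per quadrant, shows how the two common ways of computing the angle differ: atan(y/x) only returns values between -pi/2 and pi/2, while atan2(y, x) returns the full angle.

```r
# Four illustrative points, one in each quadrant
pts <- data.frame(x = c(1, -1, -1, 1), y = c(1, 1, -1, -1))

r <- sqrt(pts$x^2 + pts$y^2)
theta_atan <- atan(pts$y / pts$x)    # loses the quadrant
theta_atan2 <- atan2(pts$y, pts$x)   # keeps the quadrant

cbind(pts, r, theta_atan, theta_atan2)

# Only the atan2 angle maps back to the original coordinates
back_x <- r * cos(theta_atan2)
back_y <- r * sin(theta_atan2)
```

Either way, the radius column of dataset_cir2 is what separates the two groups, since the inner cluster has a small r and the ring has an r of about 15.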

Now we run the k-means model on this data
    k_clust_spher2 = kmeans(dataset_cir2, centers = 2)
    # Plot the data and clusters
    plot(dataset_cir2, col = k_clust_spher2$cluster, pch = 2)

This time the k-means algorithm works well and correctly separates the two clusters. We can also view the clusters on the original data to double-check this.
    plot(dataset_cir, col = k_clust_spher2$cluster, pch = 2)
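Besides plotting, the result can be checked numerically, because we know which points were generated as the inner cluster and which as the ring. The sketch below regenerates similar data with base R only and cross-tabulates the known group against the k-means cluster, before and after the polar transformation. The names ring_data, truth and purity() are hypothetical, and the seed and nstart value are arbitrary choices.

```r
set.seed(7)
n <- 200
# Inner blob and outer ring, generated like the data above
inner <- data.frame(x = rnorm(n), y = rnorm(n))
radius <- rnorm(n, 15, 0.5)
angle <- runif(n, 0, 2 * pi)
ring <- data.frame(x = radius * cos(angle), y = radius * sin(angle))
ring_data <- rbind(inner, ring)
truth <- rep(c("inner", "ring"), each = n)

# K-means on the raw coordinates and on the polar coordinates
raw_fit <- kmeans(ring_data, centers = 2, nstart = 10)
polar <- cbind(r = sqrt(ring_data$x^2 + ring_data$y^2),
               theta = atan2(ring_data$y, ring_data$x))
polar_fit <- kmeans(polar, centers = 2, nstart = 10)

# Rows: known group, columns: k-means cluster
raw_tab <- table(truth, cluster = raw_fit$cluster)
polar_tab <- table(truth, cluster = polar_fit$cluster)
raw_tab
polar_tab

# Share of points that fall in their group's majority cluster
purity <- function(tab) sum(apply(tab, 1, max)) / sum(tab)
purity(raw_tab)
purity(polar_tab)
```

A table with all counts on one diagonal (in either order, since cluster numbers are arbitrary) means each known group landed entirely in its own cluster.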