K-means clustering
In data mining, k-means clustering is a method of cluster analysis that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.
Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k sets (k ≤ n) S = {S1, S2, …, Sk} so as to minimize the within-cluster sum of squares (WCSS):

    arg min_S  Σ_{i=1}^{k}  Σ_{x ∈ S_i}  ‖x − μ_i‖²

where μ_i is the mean of the points in S_i.
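As a concrete illustration, the WCSS objective can be computed directly from a set of points, their cluster assignments, and the cluster means. This is a minimal sketch using NumPy; the function name and array layout are illustrative, not from the source:

```python
import numpy as np

def wcss(points, labels, centroids):
    """Within-cluster sum of squares: total squared distance of each
    point to the centroid of the cluster it is assigned to."""
    total = 0.0
    for i, mu in enumerate(centroids):
        members = points[labels == i]          # points assigned to cluster i
        total += np.sum((members - mu) ** 2)   # squared distances to the mean
    return total
```

For example, four 1-dimensional-style points split into two pairs, each pair one unit from its centroid, give a WCSS of 4.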
The algorithm is composed of the following steps:
1. Place K points into the space represented by the objects that are being clustered. These points represent initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
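The steps above can be sketched as a basic Lloyd-style k-means in Python with NumPy. This is a minimal illustration, not a production implementation; the function name, the random choice of initial centroids, and the empty-cluster handling are my own assumptions:

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k of the data points as initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to the nearest centroid
        # (Euclidean distance to every centroid, then argmin).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its cluster
        # (keeping the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            points[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Step 4: stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

On two well-separated blobs, the returned labels recover the two groups; on harder data the result depends on the initial centroids, which is why k-means is often restarted several times.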
Discussion
The two key features of k-means that make it efficient are often regarded as its biggest drawbacks: Euclidean distance is used as the metric, and variance is used as the measure of cluster scatter. The number of clusters k is an input parameter: an inappropriate choice of k may yield poor results. That is why, when performing k-means, it is important to run diagnostic checks to determine the number of clusters in the data set.
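One widely used diagnostic is the "elbow" heuristic: run k-means for several values of k and watch where the drop in WCSS flattens, since WCSS always decreases as k grows. A rough sketch follows; the helper is a basic Lloyd-style k-means written for this illustration, and the three synthetic blobs are my own example data, not from the source:

```python
import numpy as np

def kmeans_wcss(points, k, max_iter=100, seed=0):
    """Run a basic Lloyd-style k-means and return the final WCSS."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new = np.array([points[labels == i].mean(axis=0) if np.any(labels == i)
                        else centroids[i] for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return np.sum((points - centroids[labels]) ** 2)

# Three well-separated Gaussian blobs: the WCSS curve should show a
# clear "elbow" around k = 3.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in (0.0, 5.0, 10.0)])
for k in range(1, 6):
    print(k, round(kmeans_wcss(data, k), 1))
```

Past the elbow, adding clusters mostly splits genuine groups rather than explaining new structure, which mirrors the Iris example discussed below.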
A key limitation of k-means is its cluster model. The concept is based on spherical clusters that are separable such that the mean value converges towards the cluster center. The clusters are also expected to be of similar size, so that assignment to the nearest cluster center is the correct assignment.
For example, when applying k-means with k = 3 to the well-known Iris flower data set, the result often fails to separate the three Iris species contained in the data set. With k = 2, the two visible clusters (one containing two species) will be discovered, whereas with k = 3 one of the two clusters will be split into two even parts. In fact, k = 2 is more appropriate for this data set, despite the data set containing three classes.
As with any other clustering algorithm, the k-means result relies on the data set satisfying the assumptions made by the algorithm. It works well on some data sets while failing on others.
Applications of the algorithm
k-means clustering, in particular when using heuristics such as Lloyd's algorithm, is rather easy to implement and apply even on large data sets. As such, it has been used successfully in fields ranging from market segmentation, computer vision, geostatistics, and astronomy to agriculture. It is often used as a preprocessing step for other algorithms, for example to find a starting configuration.
Source:
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/kmeans.html
Author: Kevin Rajeev Abraham