Business Analytics Workshop SIBM 2011 Marketing : Day 4

Monday, September 10, 2012

Day 4 - Team H

K-means clustering

In data mining, k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.

Given a set of observations (x₁, x₂, …, x_n), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k sets (k ≤ n) S = {S₁, S₂, …, S_k} so as to minimize the within-cluster sum of squares (WCSS):

Where μ_i is the mean of points in S_i.

The algorithm is composed of the following steps:

· Place K points into the space represented by the objects that are being clustered. These points represent initial group centroids.

· Assign each object to the group that has the closest centroid.

· When all objects have been assigned, recalculate the positions of the K centroids.

· Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.

Discussion

The two key features of k-means which make it efficient are often regarded as its biggest drawbacks:

Euclidean distance is used as a metric and variance is used as a measure of cluster scatter.The number of clusters k is an input parameter: an inappropriate choice of k may yield poor results. That is why, when performing k-means, it is important to run diagnostic checks for determining the number of clusters in the data set.

A key limitation of k-means is its cluster model. The concept is based on spherical clusters that are separable in a way so that the mean value converges towards the cluster center. The clusters are expected to be of similar size, so that the assignment to the nearest cluster center is the correct assignment.

When for example applying k-means with a value of k=3 onto the well-known Iris flower data set, the result often fails to separate the three Iris species contained in the data set. With k=2, the two visible clusters (one containing two species) will be discovered, whereas with k=3 one of the two clusters will be split into two even parts. In fact, k=2 is more appropriate for this data set, despite the data set containing 3 classes.

As with any other clustering algorithm, the k-means result relies on the data set to satisfy the assumptions made by the clustering algorithms. It works well on some data sets, while failing on others.

Applications of the algorithm

k-means clustering in particular when using heuristics such as Lloyd's algorithm is rather easy to implement and apply even on large data sets. As such, it has been successfully used in various topics, ranging from market segmentation, computer vision, geostatistics, and astronomy to agriculture. It often is used as a preprocessing step for other algorithms, for example to find a starting configuration.

Source:

http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/kmeans.html

Author-

Kevin Rajeev Abraham

Business Analytics Workshop SIBM 2011 Marketing

Monday, September 10, 2012

Day 4 - Team H

Discussion

No comments:

Post a Comment