K-Means Clustering:
In data mining, k-means clustering is a method of cluster analysis that aims to
partition n observations into k clusters, in which each observation belongs to
the cluster with the nearest mean. k-means clustering tends to find clusters of
comparable spatial extent, while the expectation-maximization mechanism allows
clusters to have different shapes. It is used to cluster observations into
groups of related observations without any prior knowledge of those
relationships. It is generally faster than hierarchical clustering, it can
handle large data files, and it is used to classify cases into groups.
Clustering analysis has had a long and active history in marketing research.
The K-means clustering method, which requires ratio or interval-scaled data, is
an iterative partitioning method. It doesn’t require computation of all possible
distances between cases, and it differs from hierarchical clustering in several
ways. The algorithm is called k-means, where k is the number of clusters
required, since a case is assigned to the cluster for which its distance to the
cluster mean is the smallest. The action in the algorithm centers around finding
the k means. We start out with an initial set of means and classify cases based
on their distances to the centers. Next, we compute the cluster means again,
using the cases that are assigned to each cluster, and then we reclassify all
cases based on the new set of means. These two steps are repeated until the
cluster assignments stop changing.
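The iteration described above can be sketched in a few lines of Python. This is only a minimal illustration, not a reference implementation: the function names (assign, update, k_means), the use of Euclidean distance, the rule that an empty cluster keeps its old mean, and the sample points are all assumptions made for the example.

```python
import math

def assign(points, means):
    """Classify each point by the index of its nearest mean (Euclidean distance)."""
    return [min(range(len(means)), key=lambda j: math.dist(p, means[j]))
            for p in points]

def update(points, labels, means):
    """Recompute each cluster mean from the points currently assigned to it."""
    new_means = []
    for j, old_mean in enumerate(means):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            # Assumption: a cluster that lost all its points keeps its old mean.
            new_means.append(old_mean)
            continue
        new_means.append(tuple(sum(p[d] for p in members) / len(members)
                               for d in range(len(old_mean))))
    return new_means

def k_means(points, initial_means, max_iter=100):
    """Alternate between classifying cases and recomputing means until stable."""
    means = [tuple(m) for m in initial_means]
    labels = assign(points, means)
    for _ in range(max_iter):
        means = update(points, labels, means)
        new_labels = assign(points, means)
        if new_labels == labels:  # no case changed cluster, so we can stop
            break
        labels = new_labels
    return labels, means

# Hypothetical sample data: two well-separated groups in the plane.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, means = k_means(points, initial_means=[(0.0, 0.0), (10.0, 10.0)])
print(labels)
print(means)
```

Note that the result depends on the initial set of means; a different starting point may lead to a different partition.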
Hierarchical vs. k-means clustering:
Hierarchical clustering only requires a measure of similarity between groups of
data points, whereas k-means requires:
· A number of clusters k
· An initial assignment of data to clusters
· A distance measure between data points, d(xn, xm)
Merits of K-means clustering:
· It is faster than hierarchical clustering.
· It produces tighter clusters than hierarchical clustering, especially if the clusters are globular.
Demerits:
· It is difficult to compare the quality of the clusters produced.
· The number of clusters must be fixed in advance, and it can be difficult to predict a suitable value of K.
Boxplot: A boxplot is a statistical tool that graphically represents the
distribution of a set of numerical data. It splits a data set into quartiles by
calculating the median, the upper quartile, the lower quartile, the minimum
value and the maximum value. The box itself runs from the lower quartile to the
upper quartile.
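The five values behind a boxplot can be computed with Python's standard library, as in the rough sketch below. The function name five_number_summary and the sample data are made up for illustration, and the "inclusive" quartile method is just one of several conventions; other tools may report slightly different quartiles for the same data.

```python
import statistics

def five_number_summary(values):
    """Return the minimum, lower quartile, median, upper quartile and maximum."""
    data = sorted(values)
    # Assumption: quartiles use the "inclusive" convention of statistics.quantiles.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {"min": data[0], "q1": q1, "median": median, "q3": q3, "max": data[-1]}

# Hypothetical sample data.
summary = five_number_summary([7, 2, 9, 4, 12, 5, 4, 10, 8])
print(summary)
```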
Outlier: An outlier is any value that lies more than one and a half times the
length of the box (the distance between the lower and upper quartiles) from
either end of the box. Such values are ‘too far’ from the central value: they
lie outside the range in which we expected them to be.
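As a small sketch of this rule, the Python snippet below flags values that fall more than 1.5 box lengths beyond either quartile. The function name find_outliers, the whisker parameter, the "inclusive" quartile convention and the sample data are assumptions chosen for the example.

```python
import statistics

def find_outliers(values, whisker=1.5):
    """Return the values lying more than `whisker` box lengths outside the box."""
    # Assumption: quartiles use the "inclusive" convention of statistics.quantiles.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    box_length = q3 - q1                      # length of the box
    lower_fence = q1 - whisker * box_length   # anything below this is an outlier
    upper_fence = q3 + whisker * box_length   # anything above this is an outlier
    return [v for v in values if v < lower_fence or v > upper_fence]

# Hypothetical sample data with one value far from the rest.
outliers = find_outliers([2, 4, 4, 5, 7, 8, 9, 10, 30])
print(outliers)
```

Here the quartiles are 4 and 9, so the box has length 5 and anything below -3.5 or above 16.5 is flagged; only the value 30 qualifies.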