K-Means Clustering
Cluster analysis helps identify relatively homogeneous groups (clusters) of cases, based on parameters selected according to the nature of the problem to be solved. K-means uses an algorithm that can handle a large number of cases, i.e. more than 50.
We have to specify the number of clusters we want before clustering. One usually starts with 3 clusters and keeps increasing the number until a satisfactory number of objects falls in one cluster (1/5th of the total cases). At times one finds an outlier, that is, a cluster with only 1 or 2 people in it. We then have to validate the outlier to see whether it is genuine, and select data cases in such a way as to eliminate the outliers, e.g. select Monthly expenditure < 600.

To recognise the outliers we use a box plot. A box plot is a convenient way of graphically depicting groups of numerical data through their five-number summaries: the smallest observation (sample minimum), lower quartile (Q1), median (Q2), upper quartile (Q3), and largest observation (sample maximum). A box plot may also indicate which observations, if any, might be considered outliers.
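The five-number summary and the outlier check can be sketched in a few lines of Python. This is only an illustration: the expenditure figures are made up, and the 1.5 × IQR cut-off is an assumed convention (commonly used for box plot whiskers), not something stated above.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    # method="inclusive" is one of several quartile conventions;
    # other conventions give slightly different Q1 and Q3.
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, q2, q3, ordered[-1]

def box_plot_outliers(values, whisker=1.5):
    """Flag values further than whisker * IQR from the quartiles (assumed rule)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical monthly expenditures, not taken from the survey data.
expenditure = [250, 300, 310, 320, 330, 350, 400, 450, 2000]
print(five_number_summary(expenditure))  # (250, 310.0, 330.0, 400.0, 2000)
print(box_plot_outliers(expenditure))    # [2000]
```

Cases flagged this way are only candidates: they still need the validation step described above before being excluded.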

You can select one of two methods for classifying cases: updating the cluster centres iteratively, or classifying only. You also have the option of saving cluster membership, distance information, and final cluster centres.
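The difference between the two methods can be illustrated with a minimal Python sketch. It assumes Euclidean distance and uses hypothetical cases of (monthly expenditure, SMS bill); the function names and the iteration cap are illustrative and are not the statistical package's own procedure.

```python
import math

def nearest_center(case, centers):
    """Index of the center closest to the case (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(case, centers[i]))

def classify_only(cases, centers):
    """Assign every case to its nearest center without moving the centers."""
    return [nearest_center(case, centers) for case in cases]

def iterate_and_classify(cases, centers, max_iter=10):
    """Alternate assignment and center updates until membership is stable."""
    centers = [list(c) for c in centers]
    membership = None
    for _ in range(max_iter):
        new_membership = classify_only(cases, centers)
        if new_membership == membership:
            break
        membership = new_membership
        for k in range(len(centers)):
            members = [c for c, m in zip(cases, membership) if m == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return membership, centers

# Hypothetical cases and starting centers, for illustration only.
cases = [(300, 25), (320, 28), (700, 30), (750, 27), (2000, 40)]
start = [(300, 25), (700, 30), (2000, 40)]

membership, final_centers = iterate_and_classify(cases, start)
# The three things one may want to save: membership, distances, final centers.
distances = [math.dist(c, final_centers[m]) for c, m in zip(cases, membership)]
print(membership)     # [0, 0, 1, 1, 2]
print(final_centers)  # [[310.0, 26.5], [725.0, 28.5], [2000.0, 40.0]]
```

With "classify only", the starting centers are treated as fixed and each case simply receives the label of its nearest center; the iterative method additionally moves each center to the mean of its members.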
You can also request analysis of variance (ANOVA) F statistics. While these statistics are opportunistic (the procedure tries to form groups that do differ), the relative size of the statistics provides information about each variable's contribution to the separation of the groups.
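Assuming the statistics in question are one-way ANOVA F ratios computed per variable across the clusters, their relative size can be compared with a sketch like the one below. The data and labels are hypothetical, and because the clusters were built to differ, the values are descriptive only and should not be read as significance tests.

```python
def f_statistic(values, membership):
    """One-way ANOVA F ratio of one variable across cluster labels."""
    groups = {}
    for value, label in zip(values, membership):
        groups.setdefault(label, []).append(value)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    # Between-cluster sum of squares: how far cluster means lie from the grand mean.
    between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
    # Within-cluster sum of squares: spread of cases around their own cluster mean.
    within = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)
    return (between / (k - 1)) / (within / (n - k))

# Hypothetical cluster labels and two variables, for illustration only.
labels = [0, 0, 0, 1, 1, 1]
expenditure = [300, 320, 310, 700, 750, 725]
sms_bill = [25, 28, 30, 31, 27, 26]

f_expenditure = f_statistic(expenditure, labels)
f_sms = f_statistic(sms_bill, labels)
print(f_expenditure > f_sms)  # True: expenditure separates the clusters far more
```

A variable with a much larger F ratio than the others is the one doing most of the work in separating the groups.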
The output has 2 main tables:

Number of Cases in each Cluster

Cluster 1 |  13.000
Cluster 2 |   1.000
Cluster 3 | 192.000
Valid     | 206.000
Missing   |    .000
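As a quick check on these counts, the short Python sketch below computes each cluster's share of the valid cases and flags clusters small enough to be outlier candidates. The cut-off of 2 cases follows the "only 1 or 2 people in a cluster" rule of thumb and is an assumption, not a fixed standard.

```python
# Case counts copied from the "Number of Cases in each Cluster" table.
cluster_sizes = {1: 13, 2: 1, 3: 192}
total = sum(cluster_sizes.values())

# Share of valid cases held by each cluster.
shares = {c: size / total for c, size in cluster_sizes.items()}

# Assumed threshold: clusters with 2 or fewer cases are outlier candidates.
outlier_clusters = [c for c, size in cluster_sizes.items() if size <= 2]

print(total)             # 206
print(outlier_clusters)  # [2]
for c, share in shares.items():
    print(f"Cluster {c}: {share:.1%}")
```

Here cluster 2 holds a single case, so it would be the one to validate before deciding whether to exclude it and re-run the clustering.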
Final Cluster Centers

Variable                     | Cluster 1 | Cluster 2 | Cluster 3
Monthly expenditure on phone |    734.69 |   2000.00 |    318.14
Fixed component of bill      |     44.08 |     70.00 |     48.29
Voice calls bill             |     46.54 |     60.00 |     48.54
SMS bill                     |     27.85 |     40.00 |     26.65
Other charges                |      3.08 |       .00 |      5.77
DRAWBACKS:
· The number of clusters k is an input parameter: an inappropriate choice of k may yield poor results. That is why, when performing k-means, it is important to run diagnostic checks for determining the number of clusters in the data set.
· Convergence to a local minimum may produce counterintuitive ("wrong") results.
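Both drawbacks are commonly mitigated in the same way: run the algorithm from several random starts, keep the run with the lowest within-cluster sum of squares (WCSS), and compare that value across candidate values of k. The sketch below is a minimal pure-Python version of that idea, using hypothetical 2-D points; the function names, the number of restarts and the seed are all illustrative choices.

```python
import math
import random

def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            for p in points]

def kmeans(points, k, rng, max_iter=50):
    """One run of Lloyd's algorithm from k randomly chosen starting points."""
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            # Keep the previous center when a cluster ends up empty.
            new_centers.append(
                tuple(sum(c) / len(members) for c in zip(*members))
                if members else centers[i]
            )
        if new_centers == centers:
            break
        centers = new_centers
    labels = assign(points, centers)
    wcss = sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
    return wcss, centers, labels

def best_of(points, k, restarts=10, seed=0):
    """Repeat k-means from different starts and keep the lowest-WCSS run."""
    rng = random.Random(seed)
    return min((kmeans(points, k, rng) for _ in range(restarts)),
               key=lambda run: run[0])

# Hypothetical data, for illustration only.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (1, 8), (2, 9), (1, 9)]

for k in range(1, 5):
    wcss, _, _ = best_of(points, k)
    print(k, round(wcss, 2))
```

Reading the printed values from k = 1 upward, the point after which WCSS stops dropping sharply is one informal indication of a reasonable k. Restarts reduce the chance of reporting a poor local minimum but do not guarantee the global one.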
APPLICATIONS:
k-means clustering, in particular when using heuristics such as Lloyd's algorithm, is rather easy to implement and apply even on large data sets. It has been successfully used in various fields, ranging from:
· market segmentation,
· computer vision,
· geostatistics,
· astronomy, to
· agriculture.
After clustering, various primary first-level techniques, such as cross tabs and frequencies, are used to draw insights from the clusters.