Monday, September 10, 2012

TEAM A- Day 4- Avinash Pandey


K-Means Clustering
K-Means cluster analysis helps identify relatively homogeneous groups (clusters) of cases based on selected parameters, chosen according to the nature of the problem to be solved. It uses an algorithm that can handle a large number of cases, i.e. above 50.
We have to specify the number of clusters we want before clustering. One usually starts with 3 clusters and keeps increasing the number until a cluster contains a satisfactory number of objects (1/5th of the total cases). At times one finds an outlier, that is, a cluster with only 1 or 2 people in it. We then have to validate the outlier to see if it is genuine, and afterwards select data cases in such a way as to eliminate the outliers, e.g. Select Monthly expenditure < 600.
To recognise the outliers we use a box plot. A box plot is a convenient way of graphically depicting groups of numerical data through their five-number summaries: the smallest observation (sample minimum), lower quartile (Q1), median (Q2), upper quartile (Q3), and largest observation (sample maximum). A box plot may also indicate which observations, if any, might be considered outliers.
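The five-number summary behind a box plot can be sketched in a few lines of Python. This is only an illustration: the expenditure values are hypothetical (not the survey data used in class), the function names are made up for this post, and the 1.5 × IQR cut-off is a common box plot convention that is assumed here. Quartile definitions also vary between packages, so the numbers may differ slightly from what SPSS draws.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]                    # values below the median position
    upper = data[half + len(data) % 2:]    # values above the median position
    return data[0], median(lower), median(data), median(upper), data[-1]

def flag_outliers(values, k=1.5):
    """Flag values lying more than k * IQR outside the quartiles (assumed rule)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical monthly expenditures, for illustration only.
expenditure = [250, 280, 300, 310, 320, 330, 350, 400, 450, 2000]

print(five_number_summary(expenditure))   # (250, 300, 325.0, 400, 2000)
outliers = flag_outliers(expenditure)
print(outliers)                           # [2000]

# Keep only the cases that are not flagged, similar in spirit to
# "Select Monthly expenditure < 600".
kept = [v for v in expenditure if v not in outliers]
```

Whether a flagged case is really dropped should still depend on the validation exercise described above.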
You can select one of two methods for classifying cases: either updating the cluster centres iteratively or classifying only. There is also an option to save cluster membership, distance information, and final cluster centres.
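The difference between the two methods can be sketched as follows. This is a minimal illustration of the usual k-means iteration, not the code SPSS runs: the function names, the two-variable cases (expenditure, SMS bill), the starting centres and the iteration cap are all assumptions made for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two cases."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def classify(cases, centers):
    """'Classify only': assign each case to the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: squared_distance(c, centers[j]))
            for c in cases]

def update_centers(cases, labels, centers):
    """Move each center to the mean of its members; an empty cluster keeps its center."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [c for c, lab in zip(cases, labels) if lab == j]
        if members:
            new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            new_centers.append(old)
    return new_centers

def iterate_and_classify(cases, centers, max_iter=10):
    """Alternate classification and center updates until the centers stop moving."""
    for _ in range(max_iter):
        labels = classify(cases, centers)
        new_centers = update_centers(cases, labels, centers)
        if new_centers == centers:
            break
        centers = new_centers
    return classify(cases, centers), centers

# Hypothetical cases: (monthly expenditure, SMS bill).
cases = [(300, 25), (320, 27), (340, 26), (700, 28), (740, 30), (760, 27)]
start = [(300, 25), (760, 27)]

labels, centers = iterate_and_classify(cases, start)
print(labels)    # [0, 0, 0, 1, 1, 1]
print(centers)

# Classifying a new case against fixed centers, without moving them.
print(classify([(500, 26)], centers))   # [0]
```

With "classify only" the centres are taken as given, so only the `classify` step is performed.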
One can also request analysis of variance F statistics. While these statistics are opportunistic (the procedure tries to form groups that do differ), the relative size of the statistics provides information about each variable's contribution to the separation of the groups.
The output has 2 main tables:

Number of Cases in each Cluster

Cluster    1       13.000
           2        1.000
           3      192.000
Valid             206.000
Missing              .000

Final Cluster Centers

                                  Cluster
                                  1          2          3
Monthly expenditure on phone      734.69     2000.00    318.14
Fixed component of bill            44.08       70.00     48.29
Voice calls bill                   46.54       60.00     48.54
SMS bill                           27.85       40.00     26.65
Other charges                       3.08         .00      5.77
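To see how far apart the three clusters are, the Euclidean distance between each pair of final centres can be computed from the table above. The sketch below uses the centre values exactly as reported. The helper names are made up for illustration, and the second helper is only an assumed way of checking how much of the separation comes from the first variable.

```python
from math import dist  # Euclidean distance, available from Python 3.8

# Order of variables: monthly expenditure, fixed component, voice calls,
# SMS bill, other charges (values copied from the Final Cluster Centers table).
centers = {
    1: (734.69, 44.08, 46.54, 27.85, 3.08),
    2: (2000.00, 70.00, 60.00, 40.00, 0.00),
    3: (318.14, 48.29, 48.54, 26.65, 5.77),
}

def center_distances(centers):
    """Euclidean distance between every pair of cluster centers."""
    ids = sorted(centers)
    return {(a, b): dist(centers[a], centers[b])
            for i, a in enumerate(ids) for b in ids[i + 1:]}

def first_variable_share(a, b):
    """Fraction of the squared distance between two centers due to the first variable."""
    first = (a[0] - b[0]) ** 2
    total = sum((x - y) ** 2 for x, y in zip(a, b))
    return first / total

distances = center_distances(centers)
for pair, d in distances.items():
    print(pair, round(d, 2))

print(round(first_variable_share(centers[1], centers[3]), 4))
```

In this output nearly all of the distance between the centres comes from monthly expenditure, which is on a much larger scale than the other bill components.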

 DRAWBACKS:
·         Euclidean distance is used as a metric and variance is used as a measure of cluster scatter.
·         The number of clusters k is an input parameter: an inappropriate choice of k may yield poor results. That is why, when performing k-means, it is important to run diagnostic checks for determining the number of clusters in the data set.
·         Convergence to a local minimum may produce counterintuitive ("wrong") results.
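The last two drawbacks can be illustrated with a small made-up example. The sketch below runs the same plain k-means iteration from two different starting positions on four hypothetical points and compares the within-cluster sum of squares. The points, the starting centres and the function names are assumptions chosen so that the effect is easy to see, and the within-cluster sum of squares is just one possible diagnostic.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda j: sq_dist(point, centers[j]))

def lloyd(points, centers, max_iter=100):
    """Plain k-means iterations from the given starting centers."""
    for _ in range(max_iter):
        labels = [nearest(p, centers) for p in points]
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(old)   # empty cluster keeps its center
        if new_centers == centers:
            break
        centers = new_centers
    return [nearest(p, centers) for p in points], centers

def within_ss(points, labels, centers):
    """Within-cluster sum of squares: total squared distance to the assigned centers."""
    return sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))

# Four hypothetical points at the corners of a wide rectangle.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

good = lloyd(points, [(0, 0), (4, 0)])   # start near the left and right pairs
bad = lloyd(points, [(2, 0), (2, 1)])    # start between the pairs
one = lloyd(points, [(0, 0)])            # k = 1 for comparison

ss_good = within_ss(points, *good)
ss_bad = within_ss(points, *bad)
ss_one = within_ss(points, *one)
print(ss_good, ss_bad, ss_one)   # 1.0 16.0 17.0
```

Both runs with k = 2 converge, but the second one is stuck in a local minimum that is hardly better than a single cluster. Trying several starting positions and keeping the solution with the smallest within-cluster sum of squares is one hedge against this, and the drop in that statistic as k increases is one way to judge the number of clusters.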
 APPLICATIONS:
k-means clustering, in particular when using heuristics such as Lloyd's algorithm, is rather easy to implement and apply even on large data sets. It has been successfully used in various fields, including:
·         market segmentation,
·         computer vision,
·         geostatistics,
·         astronomy,
·         agriculture.

After clustering, various primary first-level techniques such as cross tabs and frequencies are used to gain insights into the clusters.

