Business Analytics Workshop SIBM 2011 Marketing : TEAM A- Day 4- Avinash Pandey

Monday, September 10, 2012

TEAM A- Day 4- Avinash Pandey

K-Means Clustering

K-Means cluster analysis identifies relatively homogeneous groups (clusters) of cases, based on parameters selected according to the nature of the problem to be solved. It uses an algorithm that can handle a large number of cases, i.e. more than 50.

We have to specify the number of clusters we want before clustering. One usually starts with 3 clusters and keeps increasing the number until a cluster holds a satisfactory number of objects (1/5th of the total cases). At times a cluster turns out to contain only 1 or 2 people, i.e. an outlier. We then run a validation exercise on the outlier to see whether it is genuine, and select data cases in such a way as to eliminate the outliers, e.g. Select Monthly expenditure < 600.

To recognise outliers we use a box plot. A box plot is a convenient way of graphically depicting groups of numerical data through their five-number summaries: the smallest observation (sample minimum), lower quartile (Q1), median (Q2), upper quartile (Q3), and largest observation (sample maximum). A box plot may also indicate which observations, if any, might be considered outliers.
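The five-number summary and the outlier check can be sketched in a few lines of Python. This is only an illustration: the expenditure figures are made up (not the workshop data), and the 1.5 x IQR whisker rule is an assumption, a common box plot convention that the notes above do not spell out.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    # With n=4, statistics.quantiles returns the three quartile cut points.
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, q2, q3, ordered[-1]

def flag_outliers(values, whisker=1.5):
    """Return the values lying beyond whisker * IQR from the quartiles.

    The 1.5 default is an assumed convention, not taken from the workshop.
    """
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical monthly expenditures, for illustration only.
expenditure = [250, 300, 320, 310, 280, 350, 400, 330, 290, 2000]

summary = five_number_summary(expenditure)
outliers = flag_outliers(expenditure)

# Equivalent of the "Select Monthly expenditure < 600" filter.
kept = [v for v in expenditure if v < 600]

print("five-number summary:", summary)
print("flagged as outliers:", outliers)
print("cases kept:", len(kept))
```

Whether a flagged case is really dropped should still depend on the validation exercise; the filter only reproduces the selection step.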

You can select one of two methods for classifying cases: updating the cluster centres iteratively, or classifying only. You also have the option of saving cluster membership, distance information, and final cluster centres.
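The difference between the two methods can be shown with a minimal sketch. It is not the routine the statistics package runs; the function names, the `iterate` flag, the data, and the starting centres are all hypothetical. With `iterate=False` the cases are only assigned to the nearest given centre; with `iterate=True` the centres are also recomputed until they stop moving.

```python
import math

def nearest(case, centers):
    """Index of the center closest to the case (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(case, centers[i]))

def k_means(cases, initial_centers, iterate=True, max_iter=10):
    """Return (membership, distances, final_centers).

    iterate=True  : update the centers until they stop changing.
    iterate=False : classify only, keeping the given centers.
    """
    centers = [list(c) for c in initial_centers]
    rounds = max_iter if iterate else 0
    for _ in range(rounds):
        membership = [nearest(case, centers) for case in cases]
        updated = []
        for i, old in enumerate(centers):
            members = [case for case, m in zip(cases, membership) if m == i]
            if members:
                # New center = mean of each variable over the members.
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(old)  # keep the center of an empty cluster
        if updated == centers:
            break
        centers = updated
    membership = [nearest(case, centers) for case in cases]
    distances = [math.dist(case, centers[m]) for case, m in zip(cases, membership)]
    return membership, distances, centers

# Hypothetical two-variable cases and starting centers.
cases = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
start = [(0.0, 0.0), (5.0, 5.0)]

membership, distances, final_centers = k_means(cases, start, iterate=True)
only_membership, only_distances, only_centers = k_means(cases, start, iterate=False)

print("iterate and classify:", membership, final_centers)
print("classify only:       ", only_membership, only_centers)
```

The three returned values correspond to the items that can be saved: cluster membership, distance of each case from its centre, and the final centres.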

The procedure can also report analysis of variance F statistics. While these statistics are opportunistic (the procedure tries to form groups that do differ), the relative size of the statistics provides information about each variable's contribution to the separation of the groups.
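To make the idea of "relative size" concrete, here is a rough sketch that assumes the statistics in question are one-way ANOVA F ratios computed per variable, with cluster membership as the grouping factor. The function name and the data are hypothetical. A variable on which the clusters differ strongly gets a large ratio; one on which they barely differ gets a small one.

```python
def f_statistic(values, membership):
    """One-way ANOVA F ratio of one variable across clusters."""
    groups = {}
    for value, cluster in zip(values, membership):
        groups.setdefault(cluster, []).append(value)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    # Between-cluster sum of squares: spread of the cluster means.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    # Within-cluster sum of squares: spread of cases around their own mean.
    ss_within = sum(
        (v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g
    )
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical data: six cases already assigned to two clusters.
membership = [0, 0, 0, 1, 1, 1]
expenditure = [300, 320, 310, 700, 740, 720]  # differs strongly by cluster
sms_bill = [25, 30, 27, 26, 29, 28]           # hardly differs by cluster

f_expenditure = f_statistic(expenditure, membership)
f_sms = f_statistic(sms_bill, membership)

print("F for expenditure:", f_expenditure)
print("F for SMS bill:   ", f_sms)
```

Because the clusters were built to differ, these ratios should be read only as a ranking of the variables, not as significance tests.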

The Output has 2 main tables:

Number of Cases in each Cluster

Cluster	1	13.000
	2	1.000
	3	192.000
Valid		206.000
Missing		.000

Final Cluster Centers

	Cluster
	1	2	3
Monthly expenditure on phone	734.69	2000.00	318.14
Fixed component of bill	44.08	70.00	48.29
Voice calls bill	46.54	60.00	48.54
SMS bill	27.85	40.00	26.65
Other charges	3.08	.00	5.77
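The two tables can be read together with a short sketch like the one below. The cluster sizes and centres are copied from the output above; the two-case threshold for a "tiny" cluster follows the outlier remark earlier in the post, and the new case being classified is hypothetical.

```python
import math

# Values copied from "Number of Cases in each Cluster".
cluster_sizes = {1: 13, 2: 1, 3: 192}

# Values copied from "Final Cluster Centers", in table row order:
# monthly expenditure, fixed component, voice calls, SMS, other charges.
final_centers = {
    1: (734.69, 44.08, 46.54, 27.85, 3.08),
    2: (2000.00, 70.00, 60.00, 40.00, 0.00),
    3: (318.14, 48.29, 48.54, 26.65, 5.77),
}

total = sum(cluster_sizes.values())
shares = {c: n / total for c, n in cluster_sizes.items()}

# Clusters with only 1 or 2 cases are candidates for the outlier check.
tiny = [c for c, n in cluster_sizes.items() if n <= 2]

def classify(case):
    """Assign a case to the cluster whose final center is nearest."""
    return min(final_centers, key=lambda c: math.dist(case, final_centers[c]))

# Hypothetical new respondent, same variable order as the table.
new_case = (650.0, 45.0, 50.0, 30.0, 2.0)

for c in sorted(shares):
    print(f"cluster {c}: {cluster_sizes[c]} cases, {shares[c]:.1%} of total")
print("clusters to validate as outliers:", tiny)
print("new case goes to cluster", classify(new_case))
```

Monthly expenditure dominates the distance here because it is on a much larger scale than the other variables, which is also visible in how far apart the centres sit on that row.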

DRAWBACKS:

· Euclidean distance is used as a metric and variance is used as a measure of cluster scatter.

· The number of clusters k is an input parameter: an inappropriate choice of k may yield poor results. That is why, when performing k-means, it is important to run diagnostic checks for determining the number of clusters in the data set.

· Convergence to a local minimum may produce counterintuitive ("wrong") results.
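One simple diagnostic for the last two drawbacks is to run the algorithm several times from different random starting centres and compare the within-cluster sum of squares for each k. The sketch below is an assumption about how such a check might look, not part of the workshop procedure; the data, the function names, and the restart count are hypothetical.

```python
import math
import random

def lloyd(cases, k, rng, max_iter=50):
    """One k-means run from random starting centers.

    Returns the within-cluster sum of squares (WCSS) of the result.
    """
    centers = rng.sample(cases, k)
    for _ in range(max_iter):
        groups = [[] for _ in range(k)]
        for case in cases:
            i = min(range(k), key=lambda j: math.dist(case, centers[j]))
            groups[i].append(case)
        updated = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if updated == centers:
            break
        centers = updated
    return sum(min(math.dist(c, ctr) ** 2 for ctr in centers) for c in cases)

def best_of(cases, k, restarts=10, seed=0):
    """Run several restarts; return (lowest WCSS, WCSS of every run)."""
    rng = random.Random(seed)
    runs = [lloyd(cases, k, rng) for _ in range(restarts)]
    return min(runs), runs

# Hypothetical data with three visible groups.
cases = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (1, 9), (2, 9), (1, 8)]

wcss_by_k = {k: best_of(cases, k)[0] for k in range(1, 5)}
for k, wcss in wcss_by_k.items():
    print(f"k={k}: best WCSS over restarts = {wcss:.2f}")
```

If individual runs for the same k give different sums of squares, some of them stopped in a local minimum; a k beyond which the best value stops dropping sharply is a reasonable candidate for the number of clusters.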

APPLICATIONS:

k-means clustering, in particular when using heuristics such as Lloyd's algorithm, is rather easy to implement and apply even on large data sets. It has been successfully used in various fields:

· market segmentation,

· computer vision,

· geostatistics,

· astronomy,

· agriculture.

After clustering, basic first-level techniques such as cross tabs and frequencies are used to draw insights from the clusters.
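As a small illustration of this follow-up step, frequencies and a cross tab can be built from saved cluster membership with nothing more than a counter. The second variable ("plan type") and all the cases below are hypothetical.

```python
from collections import Counter

# Hypothetical cases: (cluster membership, plan type).
cases = [
    (1, "prepaid"), (3, "postpaid"), (3, "prepaid"),
    (3, "prepaid"), (1, "postpaid"), (3, "postpaid"),
]

# Frequencies: how many cases fall in each cluster.
frequencies = Counter(cluster for cluster, _ in cases)

# Cross tab: how many cases fall in each (cluster, plan type) cell.
crosstab = Counter(cases)

plans = ("prepaid", "postpaid")
for cluster in sorted(frequencies):
    row = {plan: crosstab[(cluster, plan)] for plan in plans}
    print(f"cluster {cluster}: n={frequencies[cluster]}, {row}")
```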
