Monday, September 10, 2012

Day 4 'K-Means Clustering-Team C'


K-Means Cluster Analysis

K-means falls under non-hierarchical clustering. The number of clusters is decided by trial and error: the analysis is repeated with different cluster counts until each cluster contains a reasonable number of cases.
It is generally suggested to use a hierarchical method to decide on the number of clusters and then use k-means to actually form the clusters. Only interval/continuous (scale) variables are used for k-means.
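
To make the idea concrete, here is a minimal sketch of the textbook k-means loop in plain Python. It is not the exact procedure SPSS runs internally; the function name `kmeans`, the starting centers and the sample bills are all made up for illustration.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Textbook k-means: assign each point to its nearest center, then move
    each center to the mean of its points, until the assignments stop changing."""
    centers = [tuple(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no case changed cluster
        labels = new_labels
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers

# Hypothetical (monthly expenditure, SMS bill) pairs in rupees, not from Cell.sav.
bills = [(300, 20), (320, 25), (310, 30), (550, 60), (580, 55), (560, 65), (2000, 150)]
labels, centers = kmeans(bills, centers=[bills[0], bills[3], bills[6]])
print(labels)  # cluster index (0, 1 or 2) for each case
```

Note that the single high spender ends up in a cluster of its own, which is the kind of uneven result discussed further down.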

We used the Cell.sav file to understand k-means. The aim is to prioritize customers based on the average revenue generated per user.
For that, the variables selected are Monthly expenditure on phone, Fixed component of bill, Voice calls bill, SMS bill and Other charges.
Navigation is Analyze--->Classify--->K-Means Cluster. The following window then appears; the number of clusters chosen is 3.


The clusters obtained are as follows. This is not a good cluster base, as the distribution of cases across clusters is highly uneven. Clusters containing very few cases usually consist of “outliers”.
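
As a rough illustration of how an uneven distribution can be spotted, the sketch below counts cases per cluster and flags clusters holding less than a chosen share of all cases. The 20% threshold and the membership list are assumptions for the example, not values from our output.

```python
from collections import Counter

def cluster_sizes(labels):
    """Return {cluster: number of cases}."""
    return dict(Counter(labels))

def tiny_clusters(labels, min_share=0.2):
    """Clusters holding less than min_share of all cases.
    The default threshold is an arbitrary assumption."""
    n = len(labels)
    return sorted(c for c, size in Counter(labels).items() if size / n < min_share)

# Hypothetical cluster membership for 20 cases.
membership = [1] * 16 + [2] * 3 + [3] * 1
sizes = cluster_sizes(membership)
small = tiny_clusters(membership)
print(sizes)
print(small)
```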




To identify outliers we use a boxplot.
Navigation is Graphs--->Legacy Dialogs--->Boxplot
Since monthly expenditure is causing the uneven clusters, the boxplot is drawn for that variable. It looks like this:

The boxplot is read like this:

The thick line passing through the box is the median. Roughly 25% of cases lie between the end of the lower whisker and the bottom line of the box; another 25% lie between the bottom line of the box and the median. The remaining 50% are spread from the median upwards.

The circles represent outliers and the stars represent extreme cases.
For example, the star labelled 39 is the case in row number 39, which spends Rs 2000 per month.
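
The sketch below shows one common way the circles and stars can be derived: a case more than 1.5 box lengths beyond the edge of the box is flagged as an outlier, and one more than 3 box lengths beyond it as an extreme case. The quartile method used here (`statistics.quantiles` with `method="inclusive"`) is an assumption and may differ slightly from the hinges SPSS computes; the spending figures are invented.

```python
import statistics

def boxplot_flags(values):
    """Return the quartiles plus the row numbers of outliers and extreme cases."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    box = q3 - q1  # box length (interquartile range)
    outliers, extremes = [], []
    for row, v in enumerate(values, start=1):
        distance = max(q1 - v, v - q3)  # distance beyond the nearer box edge
        if distance > 3 * box:
            extremes.append(row)   # drawn as a star
        elif distance > 1.5 * box:
            outliers.append(row)   # drawn as a circle
    return (q1, median, q3), outliers, extremes

# Hypothetical monthly expenditure in rupees, one value per row.
spend = [300, 350, 400, 450, 500, 550, 600, 1000, 2000]
quartiles, outliers, extremes = boxplot_flags(spend)
print(quartiles, outliers, extremes)
```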
Since most of the data lies around 300-600, we run k-means clustering again only on cases with monthly expenditure < 600, which gives the following clusters:
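
The condition itself is a simple filter. A small sketch, assuming each case is a dictionary with a hypothetical `monthly_exp` field, that keeps the original row numbers so cases can still be traced back:

```python
def select_cases(rows, key, upper):
    """Keep (row_number, row) pairs whose value for `key` is strictly below `upper`."""
    return [(i, r) for i, r in enumerate(rows, start=1) if r[key] < upper]

# Hypothetical cases; the field name monthly_exp is made up for this example.
cases = [{"monthly_exp": 300}, {"monthly_exp": 2000}, {"monthly_exp": 550}, {"monthly_exp": 600}]
kept = select_cases(cases, "monthly_exp", 600)
kept_rows = [i for i, _ in kept]
print(kept_rows)
```

Because the condition is a strict "less than", a case spending exactly 600 is dropped as well.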


This seems to be a valid basis for study. To study the profile of each cluster we need to compare the clusters on price, demographics etc., for which we use the Split File option. But before that, we need to save the cluster membership as a variable.

Navigation is Analyze--->Classify--->K-Means Cluster--->Save--->Cluster membership
This creates a new variable, named “QCL_1”, that records which cluster each case belongs to.

Now we go to Data--->Split File--->Compare groups and split on QCL_1, so that the output of any later analysis is divided into three groups, one per cluster.
A frequency or crosstab analysis can then be performed on parameters such as level of education, name of current service, connection type etc. to profile the clusters and suggest a strategy for each.
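
The per-cluster frequency idea can be sketched outside SPSS as well. The example below counts one categorical variable within each cluster; `QCL_1` is the membership variable described above, while the `connection` field and the five customers are invented for illustration.

```python
from collections import Counter, defaultdict

def crosstab_by_cluster(rows, cluster_key, category_key):
    """Count how often each category occurs within each cluster."""
    table = defaultdict(Counter)
    for row in rows:
        table[row[cluster_key]][row[category_key]] += 1
    return {cluster: dict(counts) for cluster, counts in sorted(table.items())}

# Hypothetical customers; only the name QCL_1 comes from the SPSS output.
customers = [
    {"QCL_1": 1, "connection": "prepaid"},
    {"QCL_1": 1, "connection": "prepaid"},
    {"QCL_1": 2, "connection": "postpaid"},
    {"QCL_1": 2, "connection": "prepaid"},
    {"QCL_1": 3, "connection": "postpaid"},
]
table = crosstab_by_cluster(customers, "QCL_1", "connection")
print(table)
```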






