K-Means Cluster Analysis
K-means falls under non-hierarchical clustering. The number of clusters is
decided by trial and error: the analysis is repeated with different values
until each cluster contains a reasonable number of cases.
It is generally suggested to use a hierarchical method to decide on the number
of clusters and then use k-means to actually form the clusters. Only
interval/continuous (scale) variables are used for k-means.
We used the Cell.sav file to understand k-means. The aim is to prioritize
customers based on the average revenue generated per user.
For that, the variables selected are Monthly expenditure on phone, Fixed
component of bill, Voice calls bill, SMS bill and Other charges.
Navigation is Analyze--->Classify--->K-Means Cluster. In the window that
appears, the number of clusters is set to 3.
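To make the procedure behind this dialog concrete, here is a minimal sketch of k-means in plain Python. It is not SPSS's implementation: the random choice of starting centres, the function name `kmeans` and the seven made-up customers are all assumptions for illustration. Each tuple holds the five variables in the order listed above.

```python
import random
from collections import Counter


def kmeans(points, k, max_iter=100, seed=0):
    """Basic k-means: returns (centroids, labels). Points are tuples of numbers."""
    rng = random.Random(seed)
    # Assumption: start from k randomly chosen cases (SPSS picks its own initial centres).
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each case joins the cluster with the nearest centre
        # (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centre moves to the mean of its current members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # empty cluster: keep the old centre
        if new_centroids == centroids:
            break  # centres stopped moving, so the solution is stable
        centroids = new_centroids
    return centroids, labels


# Hypothetical data (NOT from Cell.sav):
# (monthly expenditure, fixed component, voice bill, SMS bill, other charges)
customers = [
    (300, 100, 150, 30, 20),
    (320, 100, 170, 30, 20),
    (350, 120, 180, 30, 20),
    (550, 150, 300, 60, 40),
    (580, 150, 320, 70, 40),
    (600, 150, 340, 70, 40),
    (2000, 500, 1200, 200, 100),
]

centroids, labels = kmeans(customers, k=3)
sizes = Counter(labels)  # number of cases per cluster
print(sizes)
```

Printing the cluster sizes is the quickest check on whether the solution is usable: a cluster holding only one or two cases is a hint that extreme values are driving the result.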
The clusters obtained do not form a good cluster base, as the distribution of
cases across them is highly uneven. The few cases that end up in very small
clusters are treated as “outliers”.
To identify outliers we use a boxplot. Navigation is
Graphs--->Legacy Dialogs--->Boxplot.
Since monthly
expenditure is causing the difference, the boxplot is obtained based on that
variable. It looks like this:
The box is read like this: the thick line passing through the box is the
median. 25% of cases lie between the end of the lower whisker and the bottom
line of the box; another 25% lie between the bottom line of the box and the
median. The remaining 50% are spread from the median towards the top.
The circles represent outliers and the stars represent extreme cases. The
number next to each marker is the row number of the case. For example, star 39
is the case in row 39, which spends Rs 2000 per month.
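The distinction between circles and stars can be sketched in Python. SPSS boxplots mark a case as an outlier when it lies more than 1.5 box lengths beyond the edge of the box, and as extreme when it lies more than 3 box lengths beyond it. The function name `classify_boxplot_points`, the spending figures and the use of the "inclusive" quartile method are assumptions for illustration; SPSS may compute the box edges slightly differently.

```python
import statistics


def classify_boxplot_points(values):
    """Flag outliers (> 1.5 box lengths from the box) and extremes (> 3 box lengths)."""
    # Assumption: box edges taken as inclusive quartiles.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    box_length = q3 - q1  # interquartile range
    result = {"median": median, "outliers": [], "extremes": []}
    for row, value in enumerate(values, start=1):  # row numbers start at 1, as in SPSS
        distance = max(q1 - value, value - q3)  # distance outside the box (negative if inside)
        if distance > 3 * box_length:
            result["extremes"].append((row, value))   # drawn as a star
        elif distance > 1.5 * box_length:
            result["outliers"].append((row, value))   # drawn as a circle
    return result


# Hypothetical monthly expenditure in Rs (NOT from Cell.sav).
spending = [300, 320, 350, 400, 450, 500, 550, 600, 1000, 2000]
flags = classify_boxplot_points(spending)
print(flags)
```

With these made-up figures the box runs from 362.5 to 587.5, so the case spending 1000 is flagged as an outlier and the case spending 2000 as extreme.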
Since most of the data lies around 300-600, we perform k-means clustering again
only on cases with monthly expenditure < 600, which gives a new set of
clusters.
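The selection condition can be sketched as a simple filter in Python. The field names `row` and `monthly_expenditure` and the values are hypothetical; the point is only that the condition is a strict "less than", so a case spending exactly 600 is excluded.

```python
# Hypothetical cases (NOT from Cell.sav); "row" keeps the original row number
# so that excluded cases can still be traced back.
cases = [
    {"row": 1, "monthly_expenditure": 300},
    {"row": 2, "monthly_expenditure": 450},
    {"row": 3, "monthly_expenditure": 600},
    {"row": 4, "monthly_expenditure": 2000},
]

# Keep only cases with monthly expenditure strictly below 600.
kept = [c for c in cases if c["monthly_expenditure"] < 600]
dropped_rows = [c["row"] for c in cases if c["monthly_expenditure"] >= 600]

print(len(kept), "cases kept; rows dropped:", dropped_rows)
```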
These clusters seem to be a valid basis for study. To study the profile of each
cluster we need to compare the clusters on price, demographics etc., for which
we use the Split File option. But before that, we need to save the cluster
memberships as a variable.
Navigation is Analyze--->Classify--->K-Means Cluster--->Save--->Cluster
Membership
This creates a new variable named “QCL_1” that records which cluster each case
belongs to.
Now we do Data--->Split File--->Compare Groups based on QCL_1, and the output
of every later analysis will be divided into three groups, one per cluster.
Now a frequency or crosstab analysis can be performed on various parameters
such as level of education, name of current service, connection type etc. to
study the different clusters and suggest a strategy for each.
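As a rough illustration of what a crosstab by cluster produces, the sketch below counts one categorical variable within each cluster. The records, the field name `connection_type` and its values are made up; only the variable name QCL_1 comes from the steps above.

```python
from collections import Counter, defaultdict

# Hypothetical cases (NOT from Cell.sav): saved cluster membership plus one attribute.
records = [
    {"QCL_1": 1, "connection_type": "prepaid"},
    {"QCL_1": 1, "connection_type": "prepaid"},
    {"QCL_1": 1, "connection_type": "postpaid"},
    {"QCL_1": 2, "connection_type": "postpaid"},
    {"QCL_1": 2, "connection_type": "postpaid"},
    {"QCL_1": 3, "connection_type": "prepaid"},
]

# One frequency table per cluster, similar to splitting the file on QCL_1.
crosstab = defaultdict(Counter)
for record in records:
    crosstab[record["QCL_1"]][record["connection_type"]] += 1

for cluster in sorted(crosstab):
    print("Cluster", cluster, dict(crosstab[cluster]))
```

Reading the counts row by row shows which attribute values dominate each cluster, which is the basis for suggesting a separate strategy per cluster.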