Monday, September 10, 2012

Day 4 - Team E - Kranthi


Continuing the course, today we started discussing the topic of “K-Means” clustering.

Before proceeding to the topic, there was a recap of the previous class, i.e. the variables used for hierarchical clustering: Euclidean distance for interval measures, chi-square for count measures, and Jaccard & simple matching for binary measures. Continuous / interval / scale / summary variables are variables that can take infinitely many values; category / grouping variables take a finite set of values.
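For reference, two of the recapped proximity measures can be written out in Python (the vectors are made up for illustration; the Jaccard coefficient shown here ignores 0-0 matches, as is usual for the binary measure):

```python
# Sketch of two proximity measures: Euclidean distance for
# interval data, Jaccard similarity for binary data.
from math import sqrt

def euclidean(a, b):
    # Straight-line distance between two interval-scale vectors.
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def jaccard(a, b):
    # Similarity for binary vectors: shared 1s over positions where
    # at least one vector has a 1 (joint absences are ignored).
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return both / either

d = euclidean([1.0, 2.0], [4.0, 6.0])    # 5.0
s = jaccard([1, 1, 0, 0], [1, 0, 1, 0])  # 1/3
```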
Now coming to today’s subject, below are the questions that were explained:

What is K Means clustering?
K Means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.
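The definition above can be sketched in Python; this is a minimal 1-D version of the standard iterative (Lloyd's) procedure with made-up data, while real tools such as SPSS add smarter initialisation and convergence checks:

```python
# Minimal 1-D k-means sketch: assign each point to its nearest
# centre, move each centre to the mean of its points, repeat.
def k_means(points, centres, iterations=10):
    for _ in range(iterations):
        clusters = {c: [] for c in range(len(centres))}
        for p in points:
            nearest = min(range(len(centres)),
                          key=lambda c: abs(p - centres[c]))
            clusters[nearest].append(p)
        # Empty clusters keep their old centre.
        centres = [sum(m) / len(m) if m else centres[c]
                   for c, m in clusters.items()]
    return centres, clusters

# Two well-separated groups of values; centres converge near 1.0 and 9.5.
centres, clusters = k_means([1.0, 1.2, 0.8, 9.0, 9.5, 10.0], [0.0, 5.0])
```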

Why is it used?
It is used to cluster observations into groups of related observations without any prior knowledge of those relationships. It is commonly used in medical imaging, biometrics and related fields. It is faster than hierarchical clustering and can handle large data files. It can be used only for classification, not for regression.

Pros and Cons of K-Means clustering:
- Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations
- Often terminates at a local optimum
- Applicable only when a mean is defined
- Requires the number of clusters to be specified in advance
- Unable to handle noisy data and outliers

Comparison of K-Means & Hierarchical Clustering:
K-means clustering does not assume a tree structure. In its pure form you might ask the computer to split the data into three groups or into four groups, but there is no guarantee that merging two groups from the four-group solution will reproduce the three-group solution. The results of k-means clustering may also be affected by the choice of initial centres for the clusters, i.e. the starting points for the iteration.
Hierarchical clustering, on the other hand, is the sort you might apply when there is a "tree" structure to the data. It accepts any kind of variable provided you choose an adequate proximity measure, and the procedure then forms successive groupings, from the initial situation of N clusters with one member each to the final situation of one giant cluster with N members.
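The sensitivity to initial centres can be seen with a tiny hypothetical 1-D example: the same data clustered from two different starting points converges to different answers, one of them a local optimum (the helper below is our own sketch, not library code):

```python
# Same data, two different starting points: one run finds the three
# true groups, the other gets stuck with two centres splitting the
# first group while one centre covers the remaining two groups.
def lloyd(points, centres, iterations=20):
    for _ in range(iterations):
        groups = [[] for _ in centres]
        for p in points:
            i = min(range(len(centres)), key=lambda j: abs(p - centres[j]))
            groups[i].append(p)
        centres = [sum(g) / len(g) if g else centres[i]
                   for i, g in enumerate(groups)]
    return centres

data = [0.0, 1.0, 10.0, 11.0, 20.0, 21.0]
good = lloyd(data, [0.0, 10.0, 20.0])   # one centre per true group
stuck = lloyd(data, [0.0, 1.0, 2.0])    # local optimum: [0.0, 1.0, 15.5]
```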

When is it used?
When the number of objects is greater than 50, we go for K-Means clustering.

What variables are used?
Quantitative variables (scale / interval / ratio, i.e. continuous variables) are used for this clustering. Now the question arises: why do we use only these variables?
If your variables are binary or counts, use the Hierarchical Cluster Analysis procedure instead. In a few instances, however, binary variables can be used as well. The next question is: why do we use binary variables?

How is it used?
In SPSS, navigate to Analyze > Classify > K-Means Cluster Analysis.

Two rules of Segmentation:
- Each segment should be sizeable
- Each segment should be different from the others

What is an Outlier? How to figure out an Outlier in the cluster?
An outlier is a noisy observation that does not fit the assumed model that generated the data. In clustering, outliers are observations that should be removed to make the clustering more reliable; they typically show up as clusters with very few cases. To remove them, we first need to find out what variables make them outliers, and then remove those cases.
O – represents Outliers in the Box plot diagram.
* – represents Extreme cases in the Box plot diagram.
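As a rough illustration of those markers, the usual box-plot fences flag a case as an outlier beyond 1.5 × IQR from the quartiles and as extreme beyond 3 × IQR; the data values below are invented, and this is only a sketch of the rule, not SPSS itself:

```python
# Flag box-plot outliers ('O') and extreme cases ('*') using
# quartile fences: 1.5 * IQR for outliers, 3 * IQR for extremes.
import statistics

def classify(data):
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    labels = {}
    for x in data:
        if x < q1 - 3 * iqr or x > q3 + 3 * iqr:
            labels[x] = "*"   # extreme case
        elif x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr:
            labels[x] = "O"   # outlier
    return labels

flags = classify([10, 12, 11, 13, 12, 11, 40, 95])  # {40: 'O', 95: '*'}
```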

Graphs for Outliers:
In SPSS, navigate to Graphs > Legacy Dialogs > Boxplot. The box plot is a quick way of examining one or more sets of data graphically.
To remove outliers, select a suitable cut-off value.

What variables make them Outliers?
K-Means clustering process:
- Segmenting
- Building a profile
- Organising the data (using ‘Split File’ to organise and compare the data)
- Interpretation
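The ‘Split File’ step can be approximated in plain Python: group cases by cluster, then compare sizes and mean profiles side by side (the cluster labels and age values below are invented for illustration):

```python
# Group cases by cluster label, then build a per-cluster profile
# (here: cluster size and mean age) so segments can be compared.
from collections import defaultdict
from statistics import mean

cases = [("c1", 25), ("c1", 30), ("c2", 55), ("c2", 60), ("c2", 65)]

by_cluster = defaultdict(list)
for cluster, age in cases:
    by_cluster[cluster].append(age)

sizes = {c: len(v) for c, v in by_cluster.items()}     # cases per cluster
profile = {c: mean(v) for c, v in by_cluster.items()}  # mean age per cluster
```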

Critical elements of hierarchical clustering are: the “Dendrogram” & the “Proximity matrix”.
Critical Elements of K-Means clustering are: “No. of cases in each cluster” & “Final cluster centres”.
