In today's class we started discussing the topic “K-Means” clustering.
Before proceeding to the topic, a recap of the previous class was given: the distance measures used for Hierarchical clustering are Euclidean for interval measures, Chi-square for count measures, and Jaccard & Simple matching for binary measures. Continuous / Interval / Scale / Summary variables are those that can take infinitely many values; Category / Grouping variables are those that take only a finite set of values.
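As a quick refresher on those measures, here is an illustrative computation in Python (my own sketch, not from the class; the Chi-square form shown is one common variant for two count vectors):

    import numpy as np

    # Interval variables -> Euclidean distance
    a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])
    euclid = np.sqrt(((a - b) ** 2).sum())

    # Count variables -> Chi-square measure (one common form)
    c, d = np.array([10, 20, 30]), np.array([12, 18, 33])
    chi_sq = np.sqrt((((c - d) ** 2) / (c + d)).sum())

    # Binary variables -> Jaccard and Simple matching
    u, v = np.array([1, 0, 1, 1, 0]), np.array([1, 1, 1, 0, 0])
    jaccard = (u & v).sum() / (u | v).sum()    # ignores 0-0 agreements
    simple_matching = (u == v).mean()          # counts 0-0 agreements too

    print(euclid, chi_sq, jaccard, simple_matching)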
Now coming to today’s subject, below are the questions that
were explained:
What is K Means clustering?
K-Means clustering is a method of cluster analysis that aims to partition n observations into k clusters, where each observation belongs to the cluster with the nearest mean.
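The class demonstrations use SPSS, but as a rough illustration of the same idea, here is a minimal sketch in Python with scikit-learn (the data and k = 3 are made up for the example):

    import numpy as np
    from sklearn.cluster import KMeans

    # Toy data: n = 6 observations measured on 2 interval-scale variables.
    X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
                  [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

    # Partition the n observations into k = 3 clusters.
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

    print(kmeans.labels_)           # which cluster each observation belongs to
    print(kmeans.cluster_centers_)  # the cluster means ("nearest mean" above)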
Why is it used?
It is used to cluster observations into groups of related observations without any prior knowledge of those relationships. It is commonly used in medical imaging, biometrics and related fields. It is faster than Hierarchical clustering and can handle large data files. It can be used only for classification, not for regression.
Pros and Cons of K-Means clustering:
- Relatively efficient: O(tkn), where n = objects, k = clusters, t = iterations (see the sketch below)
- Often terminates at a local optimum
- Applicable only when a mean is defined
- Need to specify the number of clusters in advance
- Unable to handle noisy data and outliers
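To make the O(tkn) cost and the local-optimum behaviour concrete, here is a rough from-scratch sketch (NumPy assumed; an illustration, not the SPSS procedure): each of the t iterations computes the distance from every one of the n objects to each of the k cluster means.

    import numpy as np

    def kmeans(X, k, t=100, seed=0):
        rng = np.random.default_rng(seed)
        # Start from k randomly chosen observations as the initial means.
        centers = X[rng.choice(len(X), k, replace=False)]
        for _ in range(t):                              # t iterations
            # Distance from every object (n) to every centre (k): O(kn) per pass.
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = d.argmin(axis=1)                   # assign to nearest mean
            new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                            else centers[j] for j in range(k)])
            if np.allclose(new, centers):
                break       # converged, possibly only at a local optimum
            centers = new
        return labels, centers

    labels, centers = kmeans(np.random.default_rng(1).normal(size=(50, 2)), k=3)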
Comparison of K-Means & Hierarchical Clustering:
K-Means clustering does not assume a tree structure. In its pure form you might ask the computer to split the data values into three groups or into four groups, but you cannot guarantee that merging two groups from the four-group solution will produce the same result as the three-group solution. The results of K-Means clustering may also be affected by the choice of initial centres for the clusters, i.e. the starting points for the iteration.
On the other hand, hierarchical clustering is the sort you might apply when there is a "tree" structure to the data. It accepts any kind of variables provided you choose an adequate measure of proximity for them, and the cluster procedure then forms successive groupings, from the initial situation of N clusters with one member each to the final situation of one giant cluster with N members.
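A quick way to see both points, sketched in Python (scikit-learn and SciPy assumed, data made up): run K-Means from two different starting points and compare, then cut a hierarchical tree at four and at three clusters to see the nesting.

    import numpy as np
    from sklearn.cluster import KMeans
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(1)
    X = rng.normal(size=(60, 2))

    # K-Means: single runs from two different random starts (n_init=1)
    # can end at different local optima with different total error.
    for seed in (0, 1):
        km = KMeans(n_clusters=3, n_init=1, random_state=seed).fit(X)
        print(seed, round(km.inertia_, 2))

    # Hierarchical: the 3-group solution is always a merge of the
    # 4-group solution, because both are cuts of the same tree.
    Z = linkage(X, method="ward")
    print(fcluster(Z, t=4, criterion="maxclust"))
    print(fcluster(Z, t=3, criterion="maxclust"))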
When is it used?
When the number of objects is greater than 50, we go for K-Means clustering.
What variables are used?
Quantitative variables like Scale / Interval / Ratio, i.e. continuous variables, are used for this clustering. Now the question arises: why do we use only these variables?
K-Means computes cluster means and Euclidean distances, which are only really meaningful for interval-scale data; if your variables are binary or counts, use the Hierarchical Cluster Analysis procedure instead. In a few instances, though, binary variables can be used as well. The next question is: why do we use binary variables?
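One practical note on those continuous variables (my addition, not from the lecture): because K-Means works with Euclidean distances to the means, variables on very different scales can dominate the result, so it is common to standardise first. A sketch in Python:

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans

    # Two interval variables on very different scales, e.g. age and income.
    X = np.array([[25, 20000], [30, 22000], [45, 90000], [50, 95000]], dtype=float)

    X_std = StandardScaler().fit_transform(X)   # mean 0, std 1 per variable
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_std)
    print(labels)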
How is it used?
In SPSS, navigate to Analyze > Classify > K-Means Cluster Analysis.
Two rules of Segmentation:
- Find a sizeable segment
- Each segment should be different from the others
What is an Outlier? How to figure out an Outlier in the cluster?
An outlier is defined as a noisy observation that does not fit the assumed model that generated the data. In clustering, outliers are observations that should be removed in order to make the clustering more reliable; they typically show up as clusters with a small number of cases. To remove them, we need to find out ‘what variables make them Outliers?’ and then remove those cases.
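As an illustrative sketch of spotting those small clusters (Python with scikit-learn assumed; the 5% threshold is an arbitrary choice for the example):

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (95, 2)),    # main mass of cases
                   rng.normal(10, 1, (5, 2))])   # a few far-away cases

    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
    sizes = np.bincount(labels)
    print(sizes)

    # Flag clusters holding fewer than 5% of all cases as outlier clusters.
    outlier_clusters = np.where(sizes < 0.05 * len(X))[0]
    keep = ~np.isin(labels, outlier_clusters)
    X_clean = X[keep]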
O – represents Outliers in the box plot diagram.
* – represents Extreme cases in the box plot diagram.
Graphs for Outliers:
In the SPSS tool, navigate to Graphs > Legacy Dialogs > Boxplot. The box plot is a quick way of examining one or more sets of data graphically.
To remove outliers, select a suitable cut-off and drop the cases that fall beyond it.
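For readers without SPSS, here is a rough Python equivalent. The 1.5×IQR and 3×IQR fences below mirror SPSS's convention for O (outliers) and * (extremes); matplotlib itself does not distinguish the two by default:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    x = np.append(rng.normal(50, 5, 200), [70.0, 120.0])  # two suspicious cases

    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lo_o, hi_o = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # outlier fences (O in SPSS)
    lo_e, hi_e = q1 - 3.0 * iqr, q3 + 3.0 * iqr   # extreme fences (* in SPSS)

    print("outliers:", x[((x < lo_o) | (x > hi_o)) & (x >= lo_e) & (x <= hi_e)])
    print("extremes:", x[(x < lo_e) | (x > hi_e)])

    plt.boxplot(x)
    plt.show()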
K-Means clustering process:
- Segmenting
- Building a profile
- Organising the data (using ‘Split file’ to organise and compare the data; see the pandas sketch below)
- Interpretation
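SPSS's ‘Split file’ groups the output by cluster so the segments can be compared; a rough pandas equivalent of that profiling step (variable names made up):

    import numpy as np
    import pandas as pd
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    df = pd.DataFrame({"age": rng.normal(40, 12, 100),
                       "income": rng.normal(50000, 15000, 100)})

    # In practice standardise first, as noted above.
    df["cluster"] = KMeans(n_clusters=3, n_init=10,
                           random_state=0).fit_predict(df[["age", "income"]])

    # Profile: compare the segments variable by variable, like Split file.
    print(df.groupby("cluster").agg(["mean", "count"]))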
Critical Elements of Hierarchical clustering are: “Dendrogram” & “Proximity matrix”.
Critical Elements of K-Means clustering are: “No. of cases in each cluster” & “Final cluster centres”.
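Both of those K-Means outputs have direct counterparts in the scikit-learn sketches above (again an illustration, not the SPSS output itself):

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.default_rng(0).normal(size=(100, 2))
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

    print(np.bincount(km.labels_))   # No. of cases in each cluster
    print(km.cluster_centers_)       # Final cluster centres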