Analytics is a vast ocean of knowledge mining, and we have only just entered
it. In today’s world there is no shortage of data, only of smart people who can
decipher the nuances in the available data!
On day 2 of the Business Analytics Workshop at SIBM Bangalore by Team nmore, we
entered the domain of cluster analysis. Cluster analysis classifies data into a
small number of mutually exclusive and exhaustive groups, so that objects within
a group are as alike as possible and the groups are as different from one
another as possible. A typical use of cluster analysis is to facilitate market
segmentation by identifying subjects or individuals who have similar needs,
lifestyles, or responses to marketing strategies.
The basic methods of clustering used in computer packages are of two types:
a) Hierarchical clustering/Linkage methods
b) Non-hierarchical clustering/Nodal methods
The first method does not take any input from us as to how many clusters are to
be formed. The computer provides a range of solutions, from a 1-cluster solution
to an n-cluster solution (n being the number of objects in the study). The
second method requires us to specify the number of clusters to be formed. The
K-means method is a non-hierarchical method of clustering.
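To make the role of that input concrete, here is a minimal sketch of K-means in plain Python. It is an illustration only, not the routine any particular statistics package uses: the function name `k_means`, the sample points and the choice of the first k points as starting centroids are all assumptions made for the example.

```python
import math

def k_means(points, k, max_iter=100):
    """Naive K-means: the caller must supply k, the number of clusters."""
    # Assumption: start from the first k points (deterministic, but not robust).
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # nothing moved, so the solution is stable
        centroids = new_centroids
    return labels, centroids

# Hypothetical data: two visibly separate groups of respondents.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.2, 0.8), (8.5, 9.0)]
labels, centroids = k_means(points, k=2)
print(labels)
```

Changing `k` changes the answer, which is exactly the decision a non-hierarchical method leaves to the analyst.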
Note: As a rule of thumb, hierarchical clustering is used when the number of
objects is less than 50, and a non-hierarchical method when the number of
objects is greater than 50. This keeps the number of cluster solutions to be
examined within manageable limits.
In each case, the distance between the objects/nodes being clustered has to be
calculated using a distance measure. One of the most commonly used measures is
Euclidean distance.
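As a quick illustration, Euclidean distance is the square root of the sum of squared differences across the variables. The sketch below computes it for one pair of objects and then for every pair; the helper name `euclidean` and the sample scores are assumptions made for the example.

```python
import math

def euclidean(a, b):
    """Square root of the summed squared differences between two objects."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical objects, each scored on two variables.
objects = [(1, 2), (4, 6), (1, 6)]

d = euclidean(objects[0], objects[1])

# Pairwise distance matrix: the starting point for any clustering method.
matrix = [[euclidean(p, q) for q in objects] for p in objects]
for row in matrix:
    print(row)
```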
Generally, interval-scaled variables are ideally suited for cluster analysis.
Continuous or ratio-scaled variables can also be used, but such instances are
rarer. Standardisation may be necessary if the units of measurement of the
variables differ widely, because otherwise the variable with the largest
numeric range dominates the distance between objects.
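One common way to standardise is to convert each variable to z-scores (subtract the mean, divide by the standard deviation). The sketch below assumes the population standard deviation; the function name `standardise` and the income and age figures are made up for illustration.

```python
import statistics

def standardise(column):
    """Convert one variable to z-scores so it has mean 0 and unit spread."""
    mean = statistics.mean(column)
    sd = statistics.pstdev(column)  # assumption: population SD; sample SD also works
    return [(x - mean) / sd for x in column]

# Hypothetical variables measured in very different units.
income = [20000, 40000, 60000]
age = [25, 35, 45]

z_income = standardise(income)
z_age = standardise(age)
print(z_income)
print(z_age)
```

After standardising, both variables are on the same scale, so neither one swamps the other when distances are calculated.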
The clustering process consists of the following 3 basic steps:
a) Selection of variables
b) Distance measurement
c) Clustering criteria
* Clustering criteria tell us how to measure the distance between two objects,
between two clusters, and between an object and a cluster.
Distance (dissimilarity) measures:
a) Interval data – Euclidean distance, Squared Euclidean distance, Chebychev,
block, Minkowski or customised
b) Count data – Chi-square measure or phi-square measure
c) Binary data – Euclidean distance, Squared Euclidean distance, size
difference, variance, shape or Lance and Williams. [Enter values for Present and
Absent to specify which two values are meaningful]
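The interval-data measures listed above are closely related, which a short sketch makes visible. The function names below are assumptions made for the example, and the formulas are the standard textbook definitions rather than the exact code of any package.

```python
def squared_euclidean(a, b):
    """Sum of squared differences (Euclidean distance without the square root)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def block(a, b):
    """Block (city-block) distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def chebychev(a, b):
    """Chebychev distance: the largest absolute difference on any one variable."""
    return max(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p):
    """Minkowski distance of order p; p=1 gives block, p=2 gives Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

# Hypothetical pair of objects measured on two interval-scaled variables.
a, b = (1, 2), (4, 6)

print(squared_euclidean(a, b))
print(block(a, b))
print(chebychev(a, b))
print(minkowski(a, b, 1), minkowski(a, b, 2))
```

The count-data and binary-data measures are left out here, since their formulas depend on how the package defines them.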
Distance measurement between clusters:
a) Nearest neighbour – In this method, the distance between the two nearest
objects in two different clusters is taken as the distance between the clusters
b) Farthest neighbour – In this method, the distance between the two farthest
objects in two different clusters is taken as the distance between the clusters
c) Centroid clustering – In this method, the distance between the centroids of
the two clusters is taken as the distance between the clusters
There are other methods available
for distance measurement, but we do not go beyond the above 3 methods for
business analytics.
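The three rules above can be sketched in a few lines of plain Python. The function names and the two small clusters are assumptions made for illustration, and Euclidean distance is assumed as the underlying measure between objects.

```python
import math
from itertools import product

def nearest_neighbour(c1, c2):
    """Distance between the two closest objects, one from each cluster."""
    return min(math.dist(p, q) for p, q in product(c1, c2))

def farthest_neighbour(c1, c2):
    """Distance between the two most distant objects, one from each cluster."""
    return max(math.dist(p, q) for p, q in product(c1, c2))

def centroid_distance(c1, c2):
    """Distance between the mean points (centroids) of the two clusters."""
    def centroid(cluster):
        return tuple(sum(col) / len(cluster) for col in zip(*cluster))
    return math.dist(centroid(c1), centroid(c2))

# Hypothetical clusters of two objects each.
cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 0.0), (5.0, 2.0)]

near = nearest_neighbour(cluster_a, cluster_b)
far = farthest_neighbour(cluster_a, cluster_b)
cent = centroid_distance(cluster_a, cluster_b)
print(near, far, cent)
```

The same pair of clusters gets three different distances, which is why the choice of method can change which clusters are merged first.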
Dendrogram – It is the pictorial
representation of the clustering process. It can be used to assess the
cohesiveness of the clusters formed and can provide information about the
appropriate number of clusters to keep.
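A dendrogram is drawn from the merge history of a hierarchical clustering run. The sketch below produces that history in plain Python, assuming nearest-neighbour linkage and Euclidean distance; the function name `agglomerate` and the one-variable data are made up for the example, and no plotting is attempted.

```python
import math

def agglomerate(points):
    """Nearest-neighbour agglomerative clustering; returns the merge history."""
    clusters = [[i] for i in range(len(points))]  # start with n one-object clusters
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest nearest-neighbour distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(
                    math.dist(points[a], points[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return merges

# Hypothetical objects measured on a single variable.
points = [(0.0,), (1.0,), (5.0,), (6.5,), (20.0,)]
merges = agglomerate(points)
heights = [h for _, _, h in merges]
for left, right, h in merges:
    print(left, "+", right, "at distance", h)
```

Each merge distance is the height of one join in the dendrogram. A common heuristic is to look for a large jump in these heights and stop before it; in this made-up data the last merge is far higher than the others, which suggests keeping the clusters that existed just before it.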
References:
· Nargundkar, R. Marketing Research – Text and Cases. Third Edition
· Zikmund, W. Business Research Methods. Seventh Edition
Author: Anand Chandran