Introduction
Cluster analysis is an exploratory data
analysis tool that sorts information into a
manageable form by maximizing the
similarity of cases within each cluster while maximizing the
dissimilarity between clusters. The groups are initially unknown:
there is no prior knowledge about which elements belong to which clusters.
The technique reduces the data to subgroups that are more manageable than
the original data, and the grouping is derived from the analysis of
the data itself.
Each cluster thus describes, in terms of the
data collected, the class to which its members belong. Items in each
cluster are similar in some ways to each other and
dissimilar to those in other clusters.
How Should Clusters Be Combined?
Agglomerative hierarchical clustering starts
with each variable as its own cluster. At the
first step, the two variables that have the smallest value for the distance measure (or the largest value if you
are using similarities) are joined into a single cluster. At the second step, either a
third variable is added to the cluster that already contains two variables, or two other variables
are merged into a new cluster. At every step, either an individual variable is added to
an existing cluster, two variables are combined, or two existing clusters are combined.
When you have only one variable in each cluster, the
distance between two clusters is unambiguous. It’s the
distance or similarity measure you selected for
the proximity matrix. Once you start forming
clusters with more than one variable, you
need to define a distance between pairs of
clusters. For example, if cluster A has variables
1 and 4, and cluster B has variables 5, 6,
and 7, you need a measure of how different or
similar the two clusters are.
There are many ways to define the distance
between two clusters with more than one
variable in a cluster. For example, you can
average the distances between all pairs of variables
formed by taking one member from each of the
two clusters. Or you can take the largest
or smallest distance between two variables
that are in different clusters. Different methods
for computing the distance between clusters
are available and may well result in
different solutions.
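The merging procedure described above can be sketched in a few lines of Python. This is only an illustration, not the routine SPSS uses: the function name agglomerate, its linkage parameter, and the distance values are all made up for the example. Passing min as the linkage takes the smallest cross-cluster distance, and passing max takes the largest.

```python
from itertools import combinations

def agglomerate(dist, linkage=min):
    """Naive agglomerative clustering on a symmetric distance matrix.

    dist    -- square list of lists, dist[i][j] = distance between items i and j
    linkage -- function that reduces the list of cross-cluster distances
               to a single number (min, max, a mean, ...)
    Returns the merge history as (cluster_a, cluster_b, distance) tuples.
    """
    clusters = [(i,) for i in range(len(dist))]  # every item starts alone
    history = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        best = None
        for a, b in combinations(clusters, 2):
            d = linkage([dist[i][j] for i in a for j in b])
            if best is None or d < best[2]:
                best = (a, b, d)
        a, b, d = best
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(tuple(sorted(a + b)))
        history.append((a, b, d))
    return history

# Hypothetical distances between four items (values invented for illustration).
D = [
    [0, 2, 6, 10],
    [2, 0, 5, 9],
    [6, 5, 0, 4],
    [10, 9, 4, 0],
]

history = agglomerate(D)           # smallest cross-cluster distance
history_max = agglomerate(D, max)  # largest cross-cluster distance
for step in history:
    print(step)
```

With these numbers both choices merge the same pairs in the same order, but the distance recorded for the final merge differs, which is one way the choice of method shows up in the results.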
Distance between Cluster Pairs
The most frequently used methods for
combining clusters at each stage are available in
SPSS. These methods define the distance
between two clusters at each stage of the
procedure. If cluster A has variables 1 and 2
and if cluster B has variables 5, 6, and 7, you need a measure of how different or similar the two
clusters are.
Nearest neighbor (single linkage)
If you use the nearest neighbor method to
form clusters, the distance between two clusters is defined as the smallest
distance between two variables in the different clusters. That
means the distance between cluster A and cluster B is the smallest of the distances
between the following pairs of variables: (1,5), (1,6), (1,7), (2,5), (2,6), and (2,7). At
every step, the distance between two clusters is taken to be the distance between their two
closest members.
Furthest neighbor (complete linkage)
If you use the furthest neighbor method (also known as complete linkage), the distance
between two clusters is defined as the distance between their two furthest members.
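Both definitions can be checked on the example with cluster A = {1, 2} and cluster B = {5, 6, 7}. The sketch below assumes a small table of pairwise distances; the numbers are invented purely for illustration.

```python
# Hypothetical pairwise distances between variables (invented values).
dist = {
    (1, 5): 4.0, (1, 6): 7.0, (1, 7): 3.0,
    (2, 5): 6.0, (2, 6): 2.0, (2, 7): 8.0,
}

cluster_a = [1, 2]
cluster_b = [5, 6, 7]

# All six distances with one member taken from each cluster.
cross = [dist[(i, j)] for i in cluster_a for j in cluster_b]

single_linkage = min(cross)    # nearest neighbor: closest pair
complete_linkage = max(cross)  # furthest neighbor: most distant pair

print(single_linkage, complete_linkage)
```

Here the nearest neighbor distance comes from the pair (2,6) and the furthest neighbor distance from the pair (2,7); every other pair is ignored by both methods.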
UPGMA. The average-linkage-between-groups
method, often aptly called UPGMA
(unweighted pair-group method using
arithmetic averages), defines the distance
between two clusters as the average of the
distances between all pairs of variables in which
one member of the pair is from each of the
clusters. For example, if variables 1 and 2 form
cluster A and variables 5, 6, and 7 form
cluster B, the average-linkage-between-groups
distance between clusters A and B is the
average of the distances between the same
pairs of variables as before: (1,5), (1,6),
(1,7), (2,5), (2,6), and (2,7). This differs from the
linkage methods in that it uses information
about all pairs of distances, not just the
nearest or the furthest. For this reason, it
is usually preferred to the single and complete
linkage methods for cluster analysis.
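As a rough illustration, the UPGMA distance for the same A and B clusters is just the mean of the six cross-cluster distances. The distance values below are assumed for the example and do not come from any real data set.

```python
from statistics import fmean

# Hypothetical pairwise distances between variables (invented values).
dist = {
    (1, 5): 4.0, (1, 6): 7.0, (1, 7): 3.0,
    (2, 5): 6.0, (2, 6): 2.0, (2, 7): 8.0,
}

cluster_a = [1, 2]
cluster_b = [5, 6, 7]

# One member from each cluster: (1,5), (1,6), (1,7), (2,5), (2,6), (2,7).
cross = [dist[(i, j)] for i in cluster_a for j in cluster_b]

upgma = fmean(cross)  # average linkage between groups
print(upgma, min(cross), max(cross))
```

Because it averages all six values, the result always lies between the nearest neighbor and furthest neighbor distances for the same pair of clusters.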
Average linkage within groups
The UPGMA method considers only distances
between pairs of variables in different clusters. A
variant of it, average linkage within groups, combines clusters so that the average
distance between all variables in the resulting cluster is as small as possible. Thus, the distance
between two clusters is the average of the distances between all possible pairs of variables
in the resulting cluster.
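The difference from UPGMA is easiest to see side by side. In the sketch below, which again uses invented distances, the within-groups average runs over all ten pairs in the merged cluster, including the pairs that were already together in A or in B, while the between-groups average uses only the six cross-cluster pairs.

```python
from itertools import combinations
from statistics import fmean

# Hypothetical distances for every pair among variables 1, 2, 5, 6, 7
# (invented values; keys are ordered so that the smaller label comes first).
dist = {
    (1, 2): 1.0,
    (1, 5): 4.0, (1, 6): 7.0, (1, 7): 3.0,
    (2, 5): 6.0, (2, 6): 2.0, (2, 7): 8.0,
    (5, 6): 2.0, (5, 7): 3.0, (6, 7): 1.0,
}

cluster_a = [1, 2]
cluster_b = [5, 6, 7]
merged = cluster_a + cluster_b

# Between groups (UPGMA): only pairs that straddle the two clusters.
between_avg = fmean([dist[(i, j)] for i in cluster_a for j in cluster_b])

# Within groups: every pair in the cluster that the merge would produce.
within_avg = fmean([dist[pair] for pair in combinations(merged, 2)])

print(between_avg, within_avg)
```

With these numbers the within-groups value is smaller, because the tight pairs inside A and inside B pull the average down.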
The methods discussed above can be used with
any kind of similarity or distance
measure between variables. The next three
methods use squared Euclidean distances.
Ward’s method. For each cluster, the means
for all variables are calculated. Then, for
each member of the cluster, the squared Euclidean distance
to the cluster means is calculated. These
distances are summed over all of the members.
At each step, the two clusters that merge are
those that result in the smallest increase in
the overall sum of the squared within-cluster
distances. The coefficient in the
agglomeration schedule is the within-cluster sum of
squares at that step, not the distance at
which clusters are joined.
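A minimal sketch of the merge criterion, assuming each cluster member is a point with two coordinates. The helper names (centroid, within_ss, ward_cost) and the points are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def centroid(points):
    """Mean of each coordinate over the members of a cluster."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(len(points[0])))

def within_ss(points):
    """Sum of squared Euclidean distances from each member to the cluster means."""
    c = centroid(points)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in points)

def ward_cost(a, b):
    """Increase in the total within-cluster sum of squares if a and b merge."""
    return within_ss(a + b) - within_ss(a) - within_ss(b)

# Hypothetical clusters (invented coordinates).
cluster_a = [(1.0, 1.0), (3.0, 1.0)]
cluster_b = [(6.0, 5.0), (8.0, 5.0)]
cluster_c = [(4.0, 1.0)]

costs = {
    "A+B": ward_cost(cluster_a, cluster_b),
    "A+C": ward_cost(cluster_a, cluster_c),
    "B+C": ward_cost(cluster_b, cluster_c),
}
best_merge = min(costs, key=costs.get)
print(costs, best_merge)
```

Under these assumptions the merge of A and C produces the smallest increase, so that pair would be joined at this step.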
Centroid method
This method calculates the distance between two clusters as the sum of distances between cluster means for all of
the variables. In the centroid method, the centroid of a merged cluster is a weighted combination
of the centroids of the two individual clusters, where the weights are
proportional to the sizes of the clusters. One disadvantage of the centroid method is that
the distance at which clusters are combined can actually decrease from one step to the
next. This is an undesirable property because clusters merged at later stages should be more
dissimilar than those merged at early stages.
Median method. With this method, the two
clusters being combined are weighted equally in the computation of the centroid,
regardless of the number of cases in each. This allows small groups to have an equal
effect on the characterization of larger clusters into which they are merged.
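The two weighting schemes, and the way the merge distance can decrease under the centroid method, can be illustrated with a short sketch. The function names and all coordinates below are assumptions made for the example.

```python
def merged_centroid(c_a, n_a, c_b, n_b):
    """Centroid method: weights proportional to the cluster sizes."""
    total = n_a + n_b
    return tuple((n_a * x + n_b * y) / total for x, y in zip(c_a, c_b))

def merged_median(c_a, c_b):
    """Median method: both clusters weighted equally, whatever their sizes."""
    return tuple((x + y) / 2 for x, y in zip(c_a, c_b))

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))

# A cluster of 9 members at (0, 0) absorbs a single member at (10, 0).
by_size = merged_centroid((0.0, 0.0), 9, (10.0, 0.0), 1)
by_equal = merged_median((0.0, 0.0), (10.0, 0.0))
print(by_size, by_equal)

# Decreasing merge distance: three hypothetical points that are almost
# equidistant from one another.
p1, p2, p3 = (0.0, 0.0), (2.0, 0.0), (1.0, 1.8)
pairwise = [sq_dist(p1, p2), sq_dist(p1, p3), sq_dist(p2, p3)]
first_merge = min(pairwise)              # p1 and p2 are joined first
c12 = merged_centroid(p1, 1, p2, 1)      # centroid of the new cluster
second_merge = sq_dist(c12, p3)          # distance for the next merge
print(first_merge, second_merge)
```

In the first part the small group barely moves the size-weighted centroid, while the equal weighting places the new center halfway between the two. In the second part the later merge happens at a smaller distance than the earlier one.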
Akshith.M