Tuesday, September 4, 2012

Day 2 - Team A - Cluster Analysis


Introduction

Cluster analysis is an exploratory data analysis tool for sorting information into manageable groups. It maximizes the similarity of cases within each cluster while maximizing the dissimilarity between clusters, and the groups themselves are unknown at the outset. It reduces the data to subgroups that are more manageable than the original data. There is no prior knowledge about which elements belong to which clusters; the grouping emerges from the analysis of the data itself.

Each cluster thus describes, in terms of the data collected, the class to which its members belong. Items in each cluster are similar in some ways to each other and
dissimilar to those in other clusters.


How Should Clusters Be Combined?

Agglomerative hierarchical clustering starts with each variable as its own cluster. At the first step, the two variables that have the smallest value for the distance measure (or the largest value if you are using similarities) are joined into a single cluster. At the second step, either a third variable is added to the cluster that already contains two variables, or two other variables are merged into a new cluster. At every later step, either an individual variable is added to an existing cluster, two variables are combined, or two existing clusters are combined.
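The first merge can be sketched in a few lines of Python. This is only an illustration: the labels and the distance values in the matrix are made up, and closest_pair is a hypothetical helper name, not something from SPSS.

```python
from itertools import combinations

# Hypothetical symmetric distance matrix for four variables (illustrative values).
labels = ["v1", "v2", "v3", "v4"]
dist = [
    [0.0, 2.0, 6.0, 10.0],
    [2.0, 0.0, 5.0, 9.0],
    [6.0, 5.0, 0.0, 4.0],
    [10.0, 9.0, 4.0, 0.0],
]

def closest_pair(dist):
    """Return (i, j), i < j, for the smallest off-diagonal distance."""
    n = len(dist)
    return min(combinations(range(n), 2), key=lambda p: dist[p[0]][p[1]])

i, j = closest_pair(dist)
first_merge = (labels[i], labels[j])
print(first_merge, dist[i][j])  # ('v1', 'v2') 2.0
```

With similarities instead of distances, the same idea applies with max in place of min.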

When each cluster contains only one variable, the distance between two clusters is
unambiguous: it is the distance or similarity measure you selected for the proximity
matrix. Once you start forming clusters with more than one variable, you need to
define a distance between pairs of clusters. For example, if cluster A has variables
1 and 4, and cluster B has variables 5, 6, and 7, you need a measure of how different or
similar the two clusters are.

There are many ways to define the distance between two clusters when a cluster holds
more than one variable. For example, you can average the distances between all pairs of
variables formed by taking one member from each of the two clusters. Or you can take the
largest or smallest distance between two variables that are in different clusters.
Different methods for computing the distance between clusters are available, and they
may well result in different solutions.
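To see how the choice of rule can change the solution, here is a minimal sketch of the agglomerative loop with the between-cluster rule passed in as a function. The four positions are invented for illustration, and agglomerate is a hypothetical helper, not a library routine; it is a naive version meant for reading, not for large data.

```python
from itertools import combinations

def agglomerate(dist, linkage=min):
    """Naive agglomerative clustering over a symmetric distance matrix.

    `linkage` reduces the list of between-cluster pairwise distances to one
    number: min takes the smallest distance, max takes the largest.
    Returns the merge history as (cluster_a, cluster_b, distance) tuples.
    """
    clusters = [(i,) for i in range(len(dist))]  # every item starts alone
    history = []
    while len(clusters) > 1:
        def between(pair):
            a, b = pair
            return linkage([dist[i][j] for i in a for j in b])
        a, b = min(combinations(clusters, 2), key=between)
        history.append((a, b, between((a, b))))
        merged = tuple(sorted(a + b))
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
    return history

# Four hypothetical items placed on a line; distance is the absolute difference.
points = [0, 2, 5, 9]
dist = [[abs(p - q) for q in points] for p in points]

smallest = agglomerate(dist, linkage=min)
largest = agglomerate(dist, linkage=max)
print(smallest)  # [((0,), (1,), 2), ((2,), (0, 1), 3), ((3,), (0, 1, 2), 4)]
print(largest)   # [((0,), (1,), 2), ((2,), (3,), 4), ((0, 1), (2, 3), 9)]
```

Both runs start by joining items 0 and 1, but the second merge differs: the smallest-distance rule attaches item 2 to the existing cluster, while the largest-distance rule pairs items 2 and 3 instead.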


Distance between Cluster Pairs

The most frequently used methods for combining clusters at each stage are available in
SPSS. These methods define the distance between two clusters at each stage of the
procedure. If cluster A has variables 1 and 2 and if cluster B has variables 5, 6, and 7, you need a measure of how different or similar the two clusters are.

Nearest neighbor (single linkage)

If you use the nearest neighbor method to form clusters, the distance between two clusters is defined as the smallest distance between two variables in the different clusters. That means the distance between cluster A and cluster B is the smallest of the distances between the following pairs of variables: (1,5), (1,6), (1,7), (2,5), (2,6), and (2,7). At every step, the distance between two clusters is taken to be the distance between their two closest members.
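As a small worked example, the sketch below enumerates those six pairs and picks the smallest distance. The one-dimensional positions assigned to variables 1, 2, 5, 6, and 7 are assumptions made up for the illustration.

```python
from itertools import product

# Hypothetical positions for the five variables (illustrative values only).
position = {1: 0, 2: 2, 5: 4, 6: 7, 7: 10}
cluster_a = [1, 2]
cluster_b = [5, 6, 7]

def distance(i, j):
    """Absolute difference between the assumed positions."""
    return abs(position[i] - position[j])

# One member from each cluster: (1,5), (1,6), (1,7), (2,5), (2,6), (2,7).
pairs = list(product(cluster_a, cluster_b))
closest = min(pairs, key=lambda p: distance(*p))
single_linkage = distance(*closest)
print(closest, single_linkage)  # (2, 5) 2
```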

Furthest neighbor (complete linkage)

If you use the furthest neighbor method (also known as complete linkage), the distance between two clusters is defined as the distance between their two furthest members, that is, the largest distance between a variable in one cluster and a variable in the other.

UPGMA (average linkage between groups)

The average-linkage-between-groups method, often called UPGMA (unweighted pair-group method using arithmetic averages), defines the distance between two clusters as the average of the distances between all pairs of variables in which one member of the pair is from each of the clusters. For example, if variables 1 and 2 form cluster A and variables 5, 6, and 7 form cluster B, the average-linkage-between-groups distance between clusters A and B is the average of the distances between the same pairs of variables as before: (1,5), (1,6), (1,7), (2,5), (2,6), and (2,7). This differs from the single and complete linkage methods in that it uses information about all pairs of distances, not just the nearest or the furthest. For this reason, it is usually preferred to the single and complete linkage methods for cluster analysis.
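The sketch below computes the furthest neighbor distance and the UPGMA distance for clusters A and B side by side. The positions given to the variables are hypothetical values chosen only to make the arithmetic easy to follow.

```python
from itertools import product

# Hypothetical positions for the five variables (illustrative values only).
position = {1: 0, 2: 2, 5: 4, 6: 7, 7: 10}
cluster_a = [1, 2]
cluster_b = [5, 6, 7]

def distance(i, j):
    """Absolute difference between the assumed positions."""
    return abs(position[i] - position[j])

# Distances for (1,5), (1,6), (1,7), (2,5), (2,6), (2,7), in that order.
between = [distance(i, j) for i, j in product(cluster_a, cluster_b)]

complete_linkage = max(between)               # furthest pair only
average_linkage = sum(between) / len(between)  # every pair contributes
print(between)           # [4, 7, 10, 2, 5, 8]
print(complete_linkage)  # 10
print(average_linkage)   # 6.0
```

Moving a single variable in cluster B changes the average only a little, while the maximum can jump, which is one way to see why the average uses more of the available information.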

Average linkage within groups

The UPGMA method considers only distances between pairs of variables in different clusters. A variant of it, average linkage within groups, combines clusters so that the average distance between all variables in the resulting cluster is as small as possible. Thus, the distance between two clusters is the average of the distances between all possible pairs of variables in the resulting cluster, including pairs that were already together before the merge.
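The difference between the two averages can be sketched as follows, again with made-up positions for variables 1, 2, 5, 6, and 7. The between-groups average uses 6 pairs; the within-groups average uses all 10 pairs of the merged cluster.

```python
from itertools import combinations, product

# Hypothetical positions for the five variables (illustrative values only).
position = {1: 0, 2: 2, 5: 4, 6: 7, 7: 10}
cluster_a = [1, 2]
cluster_b = [5, 6, 7]

def distance(i, j):
    """Absolute difference between the assumed positions."""
    return abs(position[i] - position[j])

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Between groups: only pairs with one member from each cluster.
between_groups = mean(distance(i, j) for i, j in product(cluster_a, cluster_b))

# Within groups: every pair inside the cluster that the merge would create.
merged = cluster_a + cluster_b
within_groups = mean(distance(i, j) for i, j in combinations(merged, 2))

print(between_groups)  # 6.0
print(within_groups)   # 5.0
```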

The methods discussed above can be used with any kind of similarity or distance measure between variables. The next three methods use squared Euclidean distances.

Ward’s method

For each cluster, the mean of every measurement is calculated. Then, for each member of the cluster, the squared Euclidean distance to the cluster means is calculated. These distances are summed over all of the members. At each step, the two clusters that merge are those that result in the smallest increase in the overall sum of the squared within-cluster distances. The coefficient in the agglomeration schedule is the within-cluster sum of squares at that step, not the distance at which clusters are joined.
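A minimal sketch of this merge criterion, assuming three small hypothetical clusters of two-dimensional points; sse and ward_increase are illustrative helper names, not SPSS functions.

```python
from itertools import combinations

def sse(points):
    """Sum of squared Euclidean distances from each point to the cluster mean."""
    n = len(points)
    mean = [sum(coord) / n for coord in zip(*points)]
    return sum((x - m) ** 2 for p in points for x, m in zip(p, mean))

def ward_increase(a, b):
    """Growth in the within-cluster sum of squares if a and b were merged."""
    return sse(a + b) - sse(a) - sse(b)

# Hypothetical clusters (illustrative coordinates only).
clusters = {
    "A": [(0, 0), (2, 0)],
    "B": [(4, 3), (6, 3)],
    "C": [(1, 8)],
}

increases = {
    (m, n): ward_increase(clusters[m], clusters[n])
    for m, n in combinations(clusters, 2)
}
best = min(increases, key=increases.get)
print(best, increases[best])  # ('A', 'B') 25.0
```

Here A and B would merge next, because joining them adds the least to the total within-cluster sum of squares.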

Centroid method

This method calculates the distance between two clusters as the sum, over all of the variables, of the squared differences between the cluster means, which is the squared Euclidean distance between the two centroids. In the centroid method, the centroid of a merged cluster is a weighted combination of the centroids of the two individual clusters, where the weights are proportional to the sizes of the clusters. One disadvantage of the centroid method is that the distance at which clusters are combined can actually decrease from one step to the next. This is an undesirable property because clusters merged at later stages should be more dissimilar than those merged at early stages.

Median method

With this method, the two clusters being combined are weighted equally in the computation of the centroid, regardless of the number of cases in each. This allows small groups to have an equal effect on the characterization of larger clusters into which they are merged.
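The two centroid rules, and the decreasing merge distance mentioned above, can be illustrated with a short sketch. The coordinates and cluster sizes are invented, the helper names are hypothetical, and plain Euclidean distance (math.dist, Python 3.8 or later) is used for readability; the same reversal shows up with squared distances.

```python
import math

def centroid_merge(c1, n1, c2, n2):
    """Centroid method: weight each centroid by its cluster size."""
    return tuple((n1 * x + n2 * y) / (n1 + n2) for x, y in zip(c1, c2))

def median_merge(c1, c2):
    """Median method: weight both centroids equally, ignoring cluster sizes."""
    return tuple((x + y) / 2 for x, y in zip(c1, c2))

# A cluster of 4 cases at (0, 0) merged with a single case at (10, 0).
big, small = (0.0, 0.0), (10.0, 0.0)
weighted = centroid_merge(big, 4, small, 1)
unweighted = median_merge(big, small)
print(weighted)    # (2.0, 0.0)
print(unweighted)  # (5.0, 0.0)

# Reversal with the centroid method: p and q are the closest pair (2.0 apart,
# while r is about 2.06 from each), so they merge first. Their centroid then
# sits closer to r than p and q were to each other.
p, q, r = (0.0, 0.0), (2.0, 0.0), (1.0, 1.8)
first_merge = math.dist(p, q)
pq = centroid_merge(p, 1, q, 1)
second_merge = math.dist(pq, r)
print(second_merge < first_merge)  # True
```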


 Akshith.M






















