Business Analytics Workshop SIBM 2011 Marketing : Day 2 - Team A

Introduction

Cluster analysis is an exploratory data analysis tool for organizing information into a manageable form. It sorts cases into groups that are initially unknown, maximizing the similarity of cases within each cluster while maximizing the dissimilarity between clusters. It is used to reduce data to subgroups that are more manageable than the original data. There is no prior knowledge about which elements belong to which clusters; the grouping is done through analysis of the data.

Each cluster thus describes, in terms of the data collected, the class to which its members belong. Items in each cluster are similar in some ways to each other and dissimilar to those in other clusters.

How Should Clusters Be Combined?

Agglomerative hierarchical clustering starts with each variable being its own cluster. At the first step, the two variables that have the smallest value for the distance measure (or the largest value if you are using similarities) are joined into a single cluster. At the second step, either a third variable is added to the cluster that already contains two variables, or two other variables are merged into a new cluster. At every step, either an individual variable is added to an existing cluster, two variables are combined, or two existing clusters are combined.

When each cluster contains only one variable, the distance between two clusters is unambiguous: it is the distance or similarity measure you selected for the proximity matrix. Once you start forming clusters with more than one variable, you need to define a distance between pairs of clusters. For example, if cluster A has variables 1 and 4, and cluster B has variables 5, 6, and 7, you need a measure of how different or similar the two clusters are.

There are many ways to define the distance between two clusters when a cluster has more than one member. For example, you can average the distances between all pairs of variables formed by taking one member from each of the two clusters. Or you can take the largest or smallest distance between two variables that are in different clusters. Different methods for computing the distance between clusters are available and may well result in different solutions.
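The merging procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the SPSS implementation: the function name `agglomerate`, the one-dimensional `values`, and the use of absolute difference as the distance are all assumptions made for the example. Passing `min` or `max` as the linkage rule gives the "smallest" and "largest" pairwise-distance definitions mentioned in the text.

```python
from itertools import combinations

# Hypothetical one-dimensional scores for five items (not from the source).
values = {1: 1.0, 2: 3.0, 5: 6.0, 6: 7.0, 7: 11.0}


def dist(i, j):
    """Distance between two items: absolute difference of their scores."""
    return abs(values[i] - values[j])


def agglomerate(items, dist, linkage=min):
    """Repeatedly merge the two closest clusters; return the merge history.

    `linkage` reduces all between-cluster pair distances to one number
    (min = smallest pairwise distance, max = largest pairwise distance).
    """
    clusters = [(name,) for name in items]  # every item starts as its own cluster
    history = []
    while len(clusters) > 1:
        def cluster_distance(pair):
            return linkage(dist(x, y) for x in pair[0] for y in pair[1])

        best = min(combinations(clusters, 2), key=cluster_distance)
        merged_at = cluster_distance(best)
        clusters = [c for c in clusters if c not in best] + [best[0] + best[1]]
        history.append((best[0], best[1], merged_at))
    return history


smallest_rule = agglomerate(list(values), dist, linkage=min)
largest_rule = agglomerate(list(values), dist, linkage=max)
for step in smallest_rule:
    print("smallest-distance rule:", step)
for step in largest_rule:
    print("largest-distance rule: ", step)
```

With these made-up scores the two rules agree on the first two merges but disagree on the third, which illustrates how the choice of between-cluster distance can lead to different solutions.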

Distance between Cluster Pairs

The most frequently used methods for combining clusters at each stage are available in SPSS. These methods define the distance between two clusters at each stage of the procedure. If cluster A has variables 1 and 2 and cluster B has variables 5, 6, and 7, you need a measure of how different or similar the two clusters are.

Nearest neighbor (single linkage)

If you use the nearest neighbor method to form clusters, the distance between two clusters is defined as the smallest distance between two variables in the different clusters. That means the distance between cluster A and cluster B is the smallest of the distances between the following pairs of variables: (1,5), (1,6), (1,7), (2,5), (2,6), and (2,7). At every step, the distance between two clusters is taken to be the distance between their two closest members.
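As a small worked example of the nearest neighbor rule, the sketch below lists the six between-cluster pairs and picks the smallest distance. The scores in `values` are hypothetical one-dimensional data chosen for illustration, and absolute difference is assumed as the distance measure.

```python
# Hypothetical one-dimensional scores for the five items (not from the source).
values = {1: 1.0, 2: 3.0, 5: 6.0, 6: 7.0, 7: 11.0}
cluster_a = [1, 2]
cluster_b = [5, 6, 7]

# Distance for every pair with one member from each cluster.
pair_distances = {
    (i, j): abs(values[i] - values[j]) for i in cluster_a for j in cluster_b
}

single_linkage = min(pair_distances.values())
closest_pair = min(pair_distances, key=pair_distances.get)
print(pair_distances)
print("nearest neighbor distance:", single_linkage, "from pair", closest_pair)
```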

Furthest neighbor (complete linkage)

If you use the furthest neighbor method (also known as complete linkage), the distance between two clusters is defined as the largest distance between two variables in the different clusters, that is, the distance between their two furthest members.
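The furthest neighbor rule can be sketched the same way, taking the maximum rather than the minimum over the between-cluster pairs. As before, the scores in `values` and the absolute-difference distance are assumptions made only for this example.

```python
# Hypothetical one-dimensional scores for the five items (not from the source).
values = {1: 1.0, 2: 3.0, 5: 6.0, 6: 7.0, 7: 11.0}
cluster_a = [1, 2]
cluster_b = [5, 6, 7]

# Distance for every pair with one member from each cluster.
pair_distances = {
    (i, j): abs(values[i] - values[j]) for i in cluster_a for j in cluster_b
}

complete_linkage = max(pair_distances.values())
furthest_pair = max(pair_distances, key=pair_distances.get)
print("furthest neighbor distance:", complete_linkage, "from pair", furthest_pair)
```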

UPGMA (average linkage between groups)

The average-linkage-between-groups method, often called UPGMA (unweighted pair-group method using arithmetic averages), defines the distance between two clusters as the average of the distances between all pairs of variables in which one member of the pair is from each of the clusters. For example, if variables 1 and 2 form cluster A and variables 5, 6, and 7 form cluster B, the average-linkage-between-groups distance between clusters A and B is the average of the distances between the same pairs of variables as before: (1,5), (1,6), (1,7), (2,5), (2,6), and (2,7). This differs from the single and complete linkage methods in that it uses information about all pairs of distances, not just the nearest or the furthest. For this reason, it is usually preferred to the single and complete linkage methods for cluster analysis.
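A minimal sketch of the between-groups average, assuming hypothetical one-dimensional scores and absolute difference as the distance measure (neither comes from the source):

```python
from statistics import mean

# Hypothetical one-dimensional scores for the five items (not from the source).
values = {1: 1.0, 2: 3.0, 5: 6.0, 6: 7.0, 7: 11.0}
cluster_a = [1, 2]
cluster_b = [5, 6, 7]

# Only pairs that straddle the two clusters are used: 2 x 3 = 6 pairs.
between_pairs = [abs(values[i] - values[j]) for i in cluster_a for j in cluster_b]
upgma_distance = mean(between_pairs)
print("between-groups average distance:", upgma_distance)
```

Every one of the six pair distances contributes to the result, whereas the nearest and furthest neighbor rules each depend on a single pair.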

Average linkage within groups

The UPGMA method considers only distances between pairs of cases in different clusters. A variant of it, the average linkage within groups, combines clusters so that the average distance between all variables in the resulting cluster is as small as possible. Thus, the distance between two clusters is the average of the distances between all possible pairs of cases in the resulting cluster.
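The within-groups variant can be sketched by pooling the two clusters and averaging over every pair in the pooled cluster, including pairs that were already together. The data and the absolute-difference distance below are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from statistics import mean

# Hypothetical one-dimensional scores for the five items (not from the source).
values = {1: 1.0, 2: 3.0, 5: 6.0, 6: 7.0, 7: 11.0}
cluster_a = [1, 2]
cluster_b = [5, 6, 7]

merged = cluster_a + cluster_b
# All possible pairs in the resulting cluster: 5 items give 10 pairs.
all_pairs = [abs(values[i] - values[j]) for i, j in combinations(merged, 2)]
within_groups_distance = mean(all_pairs)
print("within-groups average distance:", within_groups_distance)
```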

The methods discussed above can be used with any kind of similarity or distance measure between variables. The next three methods use squared Euclidean distances.

Ward’s method

For each cluster, the means for all variables are calculated. Then, for each case in the cluster, the squared Euclidean distance to the cluster means is calculated. These distances are summed over all of the cases. At each step, the two clusters that merge are those that result in the smallest increase in the overall sum of the squared within-cluster distances. The coefficient in the agglomeration schedule is the within-cluster sum of squares at that step, not the distance at which clusters are joined.
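The quantity Ward's method tries to keep small can be illustrated with a short sketch. It assumes a single measured variable per case and uses hypothetical scores; the helper name `sse` is introduced here only for the example. The closed-form shortcut shown at the end is a standard identity for the increase in the within-cluster sum of squares.

```python
def sse(scores):
    """Within-cluster sum of squared deviations from the cluster mean."""
    cluster_mean = sum(scores) / len(scores)
    return sum((x - cluster_mean) ** 2 for x in scores)


# Hypothetical scores on one variable for the cases in two clusters.
cluster_a = [1.0, 3.0]
cluster_b = [6.0, 7.0, 11.0]

# Increase in the overall within-cluster sum of squares if A and B merge.
increase = sse(cluster_a + cluster_b) - sse(cluster_a) - sse(cluster_b)

# Equivalent shortcut: n_a * n_b / (n_a + n_b) * (mean_a - mean_b) ** 2
n_a, n_b = len(cluster_a), len(cluster_b)
mean_a, mean_b = sum(cluster_a) / n_a, sum(cluster_b) / n_b
shortcut = n_a * n_b / (n_a + n_b) * (mean_a - mean_b) ** 2

print("increase in within-cluster sum of squares:", increase)
print("same value from the shortcut formula:", shortcut)
```

At each step, this increase would be computed for every candidate pair of clusters, and the pair with the smallest value would be merged.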

Centroid method

This method calculates the distance between two clusters as the squared Euclidean distance between their centroids, that is, the squared differences between the cluster means summed over all of the variables. In the centroid method, the centroid of a merged cluster is a weighted combination of the centroids of the two individual clusters, where the weights are proportional to the sizes of the clusters. One disadvantage of the centroid method is that the distance at which clusters are combined can actually decrease from one step to the next. This is an undesirable property because clusters merged at later stages should be more dissimilar than those merged at early stages.

Median method

With this method, the two clusters being combined are weighted equally in the computation of the centroid, regardless of the number of cases in each. This allows small groups to have an equal effect on the characterization of larger clusters into which they are merged.
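The difference between the size-weighted centroid and the equally weighted (median method) centroid, and the way a centroid merge distance can decrease, can be sketched as follows. All points are hypothetical two-variable data, and the helper names `centroid` and `squared_distance` are introduced only for this example.

```python
def centroid(points):
    """Mean of each variable (coordinate) over the points in a cluster."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def squared_distance(p, q):
    """Squared Euclidean distance: squared differences summed over variables."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


# Hypothetical two-variable data (not from the source).
cluster_a = [(1.0, 1.0), (3.0, 1.0)]
cluster_b = [(6.0, 2.0), (7.0, 4.0), (11.0, 3.0)]
centroid_a, centroid_b = centroid(cluster_a), centroid(cluster_b)
between = squared_distance(centroid_a, centroid_b)

n_a, n_b = len(cluster_a), len(cluster_b)
# Centroid method: weights proportional to cluster sizes.
weighted_merged = tuple(
    (n_a * a + n_b * b) / (n_a + n_b) for a, b in zip(centroid_a, centroid_b)
)
# Median method: both clusters weighted equally, whatever their sizes.
equal_merged = tuple((a + b) / 2 for a, b in zip(centroid_a, centroid_b))

# A decreasing merge distance: three points in a nearly equilateral triangle.
p, q, r = (0.0, 0.0), (2.0, 0.0), (1.0, 1.8)
first_merge = squared_distance(p, q)                   # p and q are the closest pair
second_merge = squared_distance(centroid([p, q]), r)   # then r joins their centroid

print("distance between centroids:", between)
print("size-weighted centroid:", weighted_merged, "equal-weight centroid:", equal_merged)
print("first merge at", first_merge, "second merge at", second_merge)
```

In the triangle example the second merge happens at a smaller distance than the first, which is the behavior the text describes as undesirable.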

Akshith.M

Tuesday, September 4, 2012

Day 2 - Team A - Cluster Analysis
