Thursday, September 6, 2012

Day 3 - Team G - Adarsh Kumar


Cluster Analysis  


What is Cluster Analysis?

Cluster analysis is a statistical technique used to group cases (individuals or objects) into homogeneous sub‐groups based on their responses to a set of variables. PASW (SPSS) 17.0 offers three clustering procedures: two‐step, k‐means, and hierarchical.  
Hierarchical clustering is best suited to small datasets because the procedure computes a proximity matrix holding the distance or similarity of every case to every other case in the dataset, and that matrix grows quickly as cases are added. Cases can be clustered with an agglomerative or a divisive method. The agglomerative method begins with each case as a cluster by itself and repeatedly merges the two most similar clusters until all cases belong to one cluster. The divisive method begins with every case in one cluster and repeatedly splits clusters until each case is a cluster by itself.
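
To make the agglomerative idea concrete, here is a minimal Python sketch (not SPSS output, and not how SPSS is implemented internally). It assumes squared Euclidean distance and nearest-neighbor (single) linkage; the five two-variable cases in `toy_cases` are invented for illustration and are not taken from the car sales data.

```python
from itertools import combinations


def squared_euclidean(a, b):
    """Sum of squared differences between two cases."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def single_linkage(points, c1, c2):
    """Cluster distance = distance between the two closest members."""
    return min(squared_euclidean(points[i], points[j]) for i in c1 for j in c2)


def agglomerate(points):
    """Start with one cluster per case and merge the closest pair until one remains."""
    clusters = [[i] for i in range(len(points))]
    history = []  # one (left cluster, right cluster, distance) entry per merge
    while len(clusters) > 1:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: single_linkage(points, clusters[pair[0]], clusters[pair[1]]),
        )
        distance = single_linkage(points, clusters[a], clusters[b])
        history.append((clusters[a], clusters[b], distance))
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return history


# Hypothetical cases, invented for this sketch.
toy_cases = [(1.0, 1.0), (1.5, 1.0), (5.0, 5.0), (5.0, 5.6), (9.0, 1.0)]
history = agglomerate(toy_cases)
for left, right, distance in history:
    print(left, "+", right, "at distance", round(distance, 2))
```

Each printed line corresponds to one merge step, which is roughly the information an agglomeration schedule reports.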

Hierarchical Clustering

This example of hierarchical clustering uses car_sales.sav, a sample dataset of car manufacturers' data that ships with PASW 17.0. The analysis is based on automobiles that sold at least 100,000 units and clusters them based on their physical properties.
In PASW 17.0, go to Analyze ‐> Classify ‐> Hierarchical Cluster


Next, the Hierarchical Cluster Analysis box appears. Select the variables Price in thousands through Fuel efficiency and move them to the Variables box. Select the Model variable and move it to the Label Cases by box.




















Click Statistics; the Hierarchical Cluster Analysis: Statistics box appears. Check the Proximity matrix box. Click Continue.
Agglomeration schedule. By default this will be checked. The output will print a distance (or similarity) statistic to give you an idea of how unlike (or alike) the clusters being combined are.
Proximity matrix. The output will print distances or similarities computed for any pair of cases.
Cluster Membership. This option prints the cluster each case is assigned to. If you have a hypothesis about how many clusters there are, you can request a single solution with a set number of clusters, or request solutions for a range of cluster counts.
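
As a rough illustration of what a proximity matrix contains, the Python sketch below computes the squared Euclidean distance for every pair of cases. The three cases are invented for illustration, and the function name `proximity_matrix` is my own, not an SPSS term.

```python
def proximity_matrix(cases):
    """Return a symmetric matrix of squared Euclidean distances between cases."""
    n = len(cases)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = sum((a - b) ** 2 for a, b in zip(cases[i], cases[j]))
            matrix[i][j] = matrix[j][i] = d  # distance is the same in both directions
    return matrix


# Hypothetical cases measured on two variables.
toy_cases = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
matrix = proximity_matrix(toy_cases)
for row in matrix:
    print(row)
```

The diagonal is zero because every case has zero distance to itself. With n cases the matrix has n × n entries, which is why this procedure is better suited to small datasets.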




















Click Plots; the Hierarchical Cluster Analysis: Plots box appears. Check the Dendrogram box. Click Continue.
Dendrogram. This plot depicts the links between cases, and its structure lets you see how cases form clusters. Dendrograms, or tree diagrams, represent the process of going from individual cases to one large cluster.
Icicle. Default choice by SPSS. Icicle plots visually represent the information in the agglomeration schedule. You can include all clusters in the icicle plot or restrict it to a range of clusters. You can also read the plot from the bottom up (vertical orientation) or from left to right (horizontal orientation).




















Click Method; the Hierarchical Cluster Analysis: Method box appears. Select “Nearest Neighbor” from the Cluster Method drop down box. Select “Z scores” from the Transform Values, Standardize drop down box. Click Continue.
Cluster Method. This section allows you to choose how cases or clusters are combined; different methods will result in different cluster patterns. See the Help system for a description of the different cluster methods available in SPSS.
The “Nearest Neighbor” method, or single linkage method, selected in this example, defines the distance between two clusters as the distance between their two closest cases. The “Furthest Neighbor” method, or complete linkage method, defines the distance between two clusters as the distance between their two most dissimilar cases. With either method, each step merges the two clusters that are closest under that definition, so a case joins a complete linkage cluster only if it is similar to every case already in the cluster.
Measure. There are different distance measure choices depending on the level of measurement of the data: interval, count, or binary.
For this example, the data were on an interval scale, and “Squared Euclidean distance” was chosen as the distance measure. The Euclidean distance is the geometric (straight-line) distance between two cases: the square root of the sum of the squared differences between the cases over all of the variables. The squared Euclidean distance omits the square root and is simply the sum of the squared differences.
Please see SPSS’ Help system for a description of the more than 30 distance and similarity measures available.
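Transform Values. You standardize scores because the variables are measured on different scales, which affects the Euclidean distance measure: variables with larger ranges would otherwise dominate the distance. Z scores rescale each variable to a mean of 0 and a standard deviation of 1.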
Transform Measure. This option transforms the values generated by the distance measure.
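
The options above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reproduction of SPSS: the price and horsepower values are invented, and I am assuming the Z scores use the sample standard deviation (n - 1 in the denominator).

```python
from math import sqrt
from statistics import mean, stdev


def z_scores(column):
    """Standardize one variable to mean 0 and standard deviation 1 (sample SD assumed)."""
    m, s = mean(column), stdev(column)
    return [(v - m) / s for v in column]


def squared_euclidean(a, b):
    """Sum of squared differences over all variables."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def euclidean(a, b):
    """Square root of the squared Euclidean distance."""
    return sqrt(squared_euclidean(a, b))


def nearest_neighbor(c1, c2):
    """Single linkage: distance between the two closest cases."""
    return min(squared_euclidean(a, b) for a in c1 for b in c2)


def furthest_neighbor(c1, c2):
    """Complete linkage: distance between the two most dissimilar cases."""
    return max(squared_euclidean(a, b) for a in c1 for b in c2)


# Hypothetical raw values on very different scales.
price = [10.0, 20.0, 30.0]
horsepower = [100.0, 150.0, 200.0]

# After standardizing, both variables contribute on the same scale.
cases = list(zip(z_scores(price), z_scores(horsepower)))

print(squared_euclidean(cases[0], cases[2]), euclidean(cases[0], cases[2]))
print(nearest_neighbor([cases[0]], [cases[1], cases[2]]))
print(furthest_neighbor([cases[0]], [cases[1], cases[2]]))
```

Without the standardization step, horsepower would dominate the distance simply because its raw values are larger than the price values.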




















Click Save; the Hierarchical Cluster Analysis: Save box appears. Click Continue.
Cluster Membership. Allows you to save cluster memberships for a single solution or a range of solutions, which can be used in subsequent analyses to explore other differences between groups.
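
To show what saved cluster memberships look like, here is a small Python sketch that stops merging once a requested number of clusters remains and then labels each case. It assumes squared Euclidean distance and nearest-neighbor linkage; the cases are invented, and the way labels are numbered here (by the lowest case index in each cluster) is my own convention, not necessarily how SPSS numbers them.

```python
def squared_euclidean(a, b):
    """Sum of squared differences between two cases."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cluster_membership(points, k):
    """Merge the closest clusters until k remain, then return one label per case."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None  # (distance, index of first cluster, index of second cluster)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(
                    squared_euclidean(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    membership = [0] * len(points)
    for label, members in enumerate(sorted(clusters, key=min), start=1):
        for i in members:
            membership[i] = label
    return membership


# Hypothetical cases, invented for this sketch.
toy_cases = [(1.0, 1.0), (1.5, 1.0), (5.0, 5.0), (5.0, 5.6), (9.0, 1.0)]

# A range of solutions: one membership list per requested number of clusters.
solutions = {k: cluster_membership(toy_cases, k) for k in (2, 3)}
for k, labels in solutions.items():
    print(k, "clusters:", labels)
```

Each membership list plays the role of a new variable in the dataset, with one value per case, that later analyses can use as a grouping variable.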

















Click OK.
