K-Means Clustering
On Day 4 we learnt about K-Means Cluster Analysis, presented by team N-more:
A clustering method that doesn’t require computation of all possible
distances is k-means clustering. It differs from hierarchical clustering in
several ways. You have to know in advance the number of clusters you want. You
can’t get solutions for a range of cluster numbers unless you rerun the
analysis for each different number of clusters. The algorithm repeatedly
reassigns cases to clusters, so the same case can move from cluster to cluster
during the analysis. In agglomerative hierarchical clustering, on the other
hand, cases are added only to existing clusters. They’re forever captive in
their cluster, with a widening circle of neighbors. The algorithm is called
k-means, where k is the number of clusters you want, because a case is assigned to the cluster for which its distance to the cluster mean is smallest. The
action in the algorithm centers around finding the k-means. You start out with
an initial set of means and classify cases based on their distances to the
centers. Next, you compute the cluster means again, using the cases that are assigned
to the cluster; then, you reclassify all cases based on the new set of means.
You keep repeating this step until cluster means don’t change much between
successive steps. Finally, you calculate the means of the clusters once again
and assign the cases to their permanent clusters.
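To make that iteration concrete, here is a minimal sketch of the loop in Python with NumPy. The function name, the random choice of initial centers, the tolerance, and the empty-cluster guard are all illustrative assumptions, not a description of SPSS's internals:

```python
import numpy as np

def kmeans(X, k, max_iter=10, tol=1e-4, seed=0):
    """Minimal k-means sketch: classify cases by distance to the
    current means, recompute the means, and repeat until the means
    stop changing much between successive steps."""
    rng = np.random.default_rng(seed)
    # Initial centers: k distinct cases chosen at random (SPSS instead
    # picks well-separated cases; random choice is an assumption here).
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Assign each case to the cluster with the nearest mean.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each cluster mean from its assigned cases,
        # keeping the old center if a cluster ends up empty.
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    # Final pass: assign every case to its permanent cluster.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return centers, dists.argmin(axis=1)
```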
This procedure attempts to identify relatively homogeneous groups of
cases based on selected characteristics, using an algorithm that can handle
large numbers of cases. However, the algorithm requires you to specify the
number of clusters. You can specify initial cluster centers if you know this
information. You can select one of two methods for classifying cases, either
updating cluster centers iteratively or classifying only. You can save cluster
membership, distance information, and final cluster centers. Optionally, you
can specify a variable whose values are used to label casewise output. You can
also request analysis of variance F statistics. While these statistics are
opportunistic (the procedure tries to form groups that do differ), the relative
size of the statistics provides information about each variable's contribution
to the separation of the groups.
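As an aside, you can reproduce per-variable F statistics like these outside SPSS. The sketch below is a hedged illustration using scipy.stats.f_oneway, treating the final clusters as groups; the function name variable_f_stats is my own:

```python
import numpy as np
from scipy.stats import f_oneway

def variable_f_stats(X, labels):
    """One-way ANOVA F statistic for each variable, treating the
    k-means clusters as groups. Because the clusters were formed to
    differ, these F values are descriptive, not valid tests."""
    f_values = []
    for j in range(X.shape[1]):
        groups = [X[labels == c, j] for c in np.unique(labels)]
        f, _ = f_oneway(*groups)
        f_values.append(f)
    return np.array(f_values)
```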
Statistics: For the complete solution: initial cluster centers, ANOVA table. For each case: cluster information, distance from cluster center.
Assumptions: Distances are computed using simple
Euclidean distance. If you want to use another distance or similarity measure,
use the Hierarchical Cluster Analysis procedure. Scaling of variables is an important consideration: if your variables are measured on different scales (for example, one variable is expressed in dollars and another is expressed in years), your results may be misleading. In such cases, you should consider standardizing your variables before you perform the k-means cluster analysis (this can be done in the Descriptives procedure). The procedure assumes that you
have selected the appropriate number of clusters and that you have included all
relevant variables. If you have chosen an inappropriate number of clusters or
omitted important variables, your results may be misleading.
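A small, hypothetical illustration of why scaling matters, in Python with NumPy (the dollar and year figures are made up):

```python
import numpy as np

# Hypothetical cases: income in dollars and tenure in years. The raw
# Euclidean distance is dominated almost entirely by the dollar scale.
X = np.array([[52000.0, 3.0],
              [51000.0, 30.0],
              [90000.0, 3.0]])

raw_d01 = np.linalg.norm(X[0] - X[1])   # cases differ mainly in years
raw_d02 = np.linalg.norm(X[0] - X[2])   # cases differ mainly in dollars
print(raw_d01, raw_d02)  # the years difference is nearly invisible

# Standardize each variable to mean 0, standard deviation 1
# (the equivalent of saving z-scores from Descriptives in SPSS).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
print(np.linalg.norm(Z[0] - Z[1]),
      np.linalg.norm(Z[0] - Z[2]))  # now both differences carry weight
```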
Case and Initial Cluster Center Order: The default algorithm for choosing
initial cluster centers is not invariant to case ordering. The Use running
means option on the Iterate dialog box makes the resulting solution potentially
dependent upon case order regardless of how initial cluster centers are chosen.
If you are using either of these methods, you may want to obtain several
different solutions with cases sorted in different random orders to verify the
stability of a given solution. Specifying initial cluster centers and not using
the Use running means option will avoid issues related to case order. However,
ordering of the initial cluster centers may affect the solution, if there are
tied distances from cases to cluster centers. Comparing results from analyses
with different permutations of the initial center values may be used to assess
the stability of a given solution.
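One way to run such a check outside SPSS is sketched below, assuming scikit-learn; the adjusted Rand index compares two labelings regardless of how the cluster numbers are permuted. The function name and number of runs are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def stability_check(X, k, n_runs=5, seed=0):
    """Refit k-means on randomly reordered cases and compare each
    solution to the first one; scores near 1.0 suggest stability."""
    rng = np.random.default_rng(seed)
    base, scores = None, []
    for _ in range(n_runs):
        order = rng.permutation(len(X))
        # Fit on the shuffled cases, then map labels back to the
        # original row order before comparing solutions.
        labels = np.empty(len(X), dtype=int)
        labels[order] = KMeans(n_clusters=k, n_init=1,
                               random_state=0).fit_predict(X[order])
        if base is None:
            base = labels
        else:
            scores.append(adjusted_rand_score(base, labels))
    return scores
```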
Before You Start
Whenever you use a statistical
procedure that calculates distances, you have to worry about the impact of the
different units in which variables are measured. Variables that have large
values will have a large impact on the distance compared to variables that have
smaller values. In this example, the average percentages of the oxides differ
quite a bit, so it’s a good idea to standardize the variables to a mean of 0
and a standard deviation of 1. (Standardized variables are used in the
example.) You also have to specify the number of clusters (k) that you want
produced.
Tip: If you
have a large data file, you can take a random sample of the data and try to
determine a good number, or range of numbers, for a cluster solution based on
the hierarchical clustering procedure. You can also use hierarchical cluster
analysis to estimate starting values for the k-means algorithm.
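A hedged sketch of that tip using SciPy's hierarchical clustering and scikit-learn's KMeans (the sample size, Ward linkage, and function name are assumptions, not SPSS behavior):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

def hierarchical_start(X, k, sample_size=500, seed=0):
    """Cluster a random sample hierarchically and use the sample
    cluster means as explicit starting centers for k-means."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
    sample = X[idx]
    # Ward linkage on the sample, cut into k clusters (labels 1..k).
    labels = fcluster(linkage(sample, method="ward"),
                      t=k, criterion="maxclust")
    centers = np.array([sample[labels == c].mean(axis=0)
                        for c in range(1, k + 1)])
    return KMeans(n_clusters=k, init=centers, n_init=1).fit(X)
```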
Initial Cluster Centers
The first step in k-means
clustering is finding the k centers. This is done iteratively. You start with
an initial set of centers and then modify them until the change between two
iterations is small enough. If you have good guesses for the centers, you can
use those as initial starting points; otherwise, you can let SPSS find k cases
that are well separated and use these values as initial cluster centers.
Warning:
K-means clustering is very sensitive to outliers, since they will usually be
selected as initial cluster centers. This will result in outliers forming
clusters with small numbers of cases. Before you start a cluster analysis,
screen the data for outliers and remove them from the initial analysis. The
solution may also depend on the order of the cases in the file. After the
initial cluster centers have been selected, each case is assigned to the
closest cluster, based on its distance from the cluster centers. After all of
the cases have been assigned to clusters, the cluster centers are recomputed,
based on all of the cases in the cluster. Case assignment is done again, using
these updated cluster centers. You keep assigning cases and recomputing the
cluster centers until no cluster center changes appreciably or the maximum
number of iterations (10 by default) is reached.
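Returning to the warning about outliers, a simple screening rule is sketched below; the 3-standard-deviation cutoff is a common rule of thumb, not an SPSS default:

```python
import numpy as np

def screen_outliers(X, z_cut=3.0):
    """Split cases into (kept, flagged): a case is flagged if it lies
    more than z_cut standard deviations from the mean on any variable."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    keep = (np.abs(Z) < z_cut).all(axis=1)
    return X[keep], X[~keep]
```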
Tip: You can
update the cluster centers after each case is classified, instead of after all
cases are classified, if you select the Use Running Means check box in the
Iterate dialog box.
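For intuition, here is a rough sketch of that running-means update in Python. Starting each count at 1 (treating the initial center as one observation) is an assumption of this sketch, not documented SPSS behavior:

```python
import numpy as np

def running_means_pass(X, centers):
    """One pass of the running-means variant: each case updates the
    mean of its nearest cluster immediately, so the result depends
    on the order of the cases."""
    centers = centers.astype(float)
    counts = np.ones(len(centers))  # treat each initial center as one case
    for x in X:
        j = np.linalg.norm(centers - x, axis=1).argmin()
        counts[j] += 1
        # Incremental update of the winning cluster's mean.
        centers[j] += (x - centers[j]) / counts[j]
    return centers
```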
To Obtain a K-Means Cluster Analysis
From the menus choose:
Analyze > Classify > K-Means Cluster...
· Select the variables to be used in the cluster analysis.
· Specify the number of clusters. The number of clusters must be at least two and must not be greater than the number of cases in the data file.
· Select either Iterate and classify or Classify only.
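For readers working outside SPSS, here is a rough scikit-learn equivalent of those steps; the file name, column names, and choice of three clusters are placeholders:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Placeholder file and variable names; substitute your own data.
df = pd.read_csv("cases.csv")
X = StandardScaler().fit_transform(df[["var1", "var2", "var3"]])

# "Iterate and classify": find the centers iteratively, then assign
# every case (max_iter=10 mirrors the default mentioned above).
km = KMeans(n_clusters=3, max_iter=10, n_init=1, random_state=0).fit(X)
df["cluster"] = km.labels_          # saved cluster membership
print(km.cluster_centers_)          # final cluster centers

# "Classify only": assign cases to fixed centers without updating them.
labels_only = km.predict(X)
```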
By,
Saket Deepak
Team J