K-Means Clustering
On Day 4 we learnt about K-Means Cluster Analysis, presented by team N-more:
A clustering method that doesn’t require computation of all possible
distances is k-means clustering. It differs from hierarchical clustering in
several ways. You have to know in advance the number of clusters you want. You
can’t get solutions for a range of cluster numbers unless you rerun the
analysis for each different number of clusters. The algorithm repeatedly
reassigns cases to clusters, so the same case can move from cluster to cluster
during the analysis. In agglomerative hierarchical clustering, on the other
hand, cases are added only to existing clusters. They’re forever captive in
their cluster, with a widening circle of neighbors. The algorithm is called
k-means, where k is the number of clusters you want, because a case is assigned to the cluster for which its distance to the cluster mean is smallest. The
action in the algorithm centers around finding the k-means. You start out with
an initial set of means and classify cases based on their distances to the
centers. Next, you compute the cluster means again, using the cases that are assigned
to the cluster; then, you reclassify all cases based on the new set of means.
You keep repeating this step until cluster means don’t change much between
successive steps. Finally, you calculate the means of the clusters once again
and assign the cases to their permanent clusters.
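To make that iteration concrete, here is a minimal sketch of the loop in Python with NumPy. The function name, the random choice of initial centers, the tolerance, and the empty-cluster guard are all illustrative assumptions, not a description of SPSS's internals:

```python
import numpy as np

def kmeans(X, k, max_iter=10, tol=1e-4, seed=0):
    """Minimal k-means sketch: classify cases by distance to the
    current means, recompute the means, and repeat until the means
    stop changing much between successive steps."""
    rng = np.random.default_rng(seed)
    # Initial centers: k distinct cases chosen at random (SPSS instead
    # picks well-separated cases; random choice is an assumption here).
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Assign each case to the cluster with the nearest mean.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each cluster mean from its assigned cases,
        # keeping the old center if a cluster ends up empty.
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    # Final pass: assign every case to its permanent cluster.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return centers, dists.argmin(axis=1)
```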
This procedure attempts to identify relatively homogeneous groups of
cases based on selected characteristics, using an algorithm that can handle
large numbers of cases. However, the algorithm requires you to specify the
number of clusters. You can specify initial cluster centers if you know this
information. You can select one of two methods for classifying cases, either
updating cluster centers iteratively or classifying only. You can save cluster
membership, distance information, and final cluster centers. Optionally, you
can specify a variable whose values are used to label casewise output. You can
also request analysis of variance F statistics. While these statistics are
opportunistic (the procedure tries to form groups that do differ), the relative
size of the statistics provides information about each variable's contribution
to the separation of the groups.
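As an aside, you can reproduce per-variable F statistics like these outside SPSS. The sketch below is a hedged illustration using scipy.stats.f_oneway, treating the final clusters as groups; the function name variable_f_stats is my own:

```python
import numpy as np
from scipy.stats import f_oneway

def variable_f_stats(X, labels):
    """One-way ANOVA F statistic for each variable, treating the
    k-means clusters as groups. Because the clusters were formed to
    differ, these F values are descriptive, not valid tests."""
    f_values = []
    for j in range(X.shape[1]):
        groups = [X[labels == c, j] for c in np.unique(labels)]
        f, _ = f_oneway(*groups)
        f_values.append(f)
    return np.array(f_values)
```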
Statistics: For the complete solution: initial cluster centers, ANOVA table. For each case: cluster information, distance from cluster center.
Assumptions: Distances are computed using simple
Euclidean distance. If you want to use another distance or similarity measure,
use the Hierarchical Cluster Analysis procedure. Scaling of variables is an important consideration: if your variables are measured on different scales (for example, one variable is expressed in dollars and another is expressed in years), your results may be misleading. In such cases, you should consider standardizing your variables before you perform the k-means cluster analysis (this can be done in the Descriptives procedure). The procedure assumes that you
have selected the appropriate number of clusters and that you have included all
relevant variables. If you have chosen an inappropriate number of clusters or
omitted important variables, your results may be misleading.
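A small, hypothetical illustration of why scaling matters, in Python with NumPy (the dollar and year figures are made up):

```python
import numpy as np

# Hypothetical cases: income in dollars and tenure in years. The raw
# Euclidean distance is dominated almost entirely by the dollar scale.
X = np.array([[52000.0, 3.0],
              [51000.0, 30.0],
              [90000.0, 3.0]])

raw_d01 = np.linalg.norm(X[0] - X[1])   # cases differ mainly in years
raw_d02 = np.linalg.norm(X[0] - X[2])   # cases differ mainly in dollars
print(raw_d01, raw_d02)  # the years difference is nearly invisible

# Standardize each variable to mean 0, standard deviation 1
# (the equivalent of saving z-scores from Descriptives in SPSS).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
print(np.linalg.norm(Z[0] - Z[1]),
      np.linalg.norm(Z[0] - Z[2]))  # now both differences carry weight
```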
Case and Initial Cluster Center Order: The default algorithm for choosing
initial cluster centers is not invariant to case ordering. The Use running
means option on the Iterate dialog box makes the resulting solution potentially
dependent upon case order regardless of how initial cluster centers are chosen.
If you are using either of these methods, you may want to obtain several
different solutions with cases sorted in different random orders to verify the
stability of a given solution. Specifying initial cluster centers and not using
the Use running means option will avoid issues related to case order. However,
ordering of the initial cluster centers may affect the solution, if there are
tied distances from cases to cluster centers. Comparing results from analyses
with different permutations of the initial center values may be used to assess
the stability of a given solution.
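One way to run such a check outside SPSS is sketched below, assuming scikit-learn; the adjusted Rand index compares two labelings regardless of how the cluster numbers are permuted. The function name and number of runs are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def stability_check(X, k, n_runs=5, seed=0):
    """Refit k-means on randomly reordered cases and compare each
    solution to the first one; scores near 1.0 suggest stability."""
    rng = np.random.default_rng(seed)
    base, scores = None, []
    for _ in range(n_runs):
        order = rng.permutation(len(X))
        # Fit on the shuffled cases, then map labels back to the
        # original row order before comparing solutions.
        labels = np.empty(len(X), dtype=int)
        labels[order] = KMeans(n_clusters=k, n_init=1,
                               random_state=0).fit_predict(X[order])
        if base is None:
            base = labels
        else:
            scores.append(adjusted_rand_score(base, labels))
    return scores
```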
Before You Start
Whenever you use a statistical
procedure that calculates distances, you have to worry about the impact of the
different units in which variables are measured. Variables that have large
values will have a large impact on the distance compared to variables that have
smaller values. In this example, the average percentages of the oxides differ
quite a bit, so it’s a good idea to standardize the variables to a mean of 0
and a standard deviation of 1. (Standardized variables are used in the
example.) You also have to specify the number of clusters (k) that you want
produced.
Tip: If you
have a large data file, you can take a random sample of the data and try to
determine a good number, or range of numbers, for a cluster solution based on
the hierarchical clustering procedure. You can also use hierarchical cluster
analysis to estimate starting values for the k-means algorithm.
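A hedged sketch of that tip using SciPy's hierarchical clustering and scikit-learn's KMeans (the sample size, Ward linkage, and function name are assumptions, not SPSS behavior):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

def hierarchical_start(X, k, sample_size=500, seed=0):
    """Cluster a random sample hierarchically and use the sample
    cluster means as explicit starting centers for k-means."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
    sample = X[idx]
    # Ward linkage on the sample, cut into k clusters (labels 1..k).
    labels = fcluster(linkage(sample, method="ward"),
                      t=k, criterion="maxclust")
    centers = np.array([sample[labels == c].mean(axis=0)
                        for c in range(1, k + 1)])
    return KMeans(n_clusters=k, init=centers, n_init=1).fit(X)
```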
Initial Cluster Centers
The first step in k-means
clustering is finding the k centers. This is done iteratively. You start with
an initial set of centers and then modify them until the change between two
iterations is small enough. If you have good guesses for the centers, you can
use those as initial starting points; otherwise, you can let SPSS find k cases
that are well separated and use these values as initial cluster centers.
Warning:
K-means clustering is very sensitive to outliers, since they will usually be
selected as initial cluster centers. This will result in outliers forming
clusters with small numbers of cases. Before you start a cluster analysis,
screen the data for outliers and remove them from the initial analysis. The
solution may also depend on the order of the cases in the file. After the
initial cluster centers have been selected, each case is assigned to the
closest cluster, based on its distance from the cluster centers. After all of
the cases have been assigned to clusters, the cluster centers are recomputed,
based on all of the cases in the cluster. Case assignment is done again, using
these updated cluster centers. You keep assigning cases and recomputing the
cluster centers until no cluster center changes appreciably or the maximum
number of iterations (10 by default) is reached.
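Returning to the warning about outliers, a simple screening rule is sketched below; the 3-standard-deviation cutoff is a common rule of thumb, not an SPSS default:

```python
import numpy as np

def screen_outliers(X, z_cut=3.0):
    """Split cases into (kept, flagged): a case is flagged if it lies
    more than z_cut standard deviations from the mean on any variable."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    keep = (np.abs(Z) < z_cut).all(axis=1)
    return X[keep], X[~keep]
```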
Tip: You can
update the cluster centers after each case is classified, instead of after all
cases are classified, if you select the Use Running Means check box in the
Iterate dialog box.
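For intuition, here is a rough sketch of that running-means update in Python. Starting each count at 1 (treating the initial center as one observation) is an assumption of this sketch, not documented SPSS behavior:

```python
import numpy as np

def running_means_pass(X, centers):
    """One pass of the running-means variant: each case updates the
    mean of its nearest cluster immediately, so the result depends
    on the order of the cases."""
    centers = centers.astype(float)
    counts = np.ones(len(centers))  # treat each initial center as one case
    for x in X:
        j = np.linalg.norm(centers - x, axis=1).argmin()
        counts[j] += 1
        # Incremental update of the winning cluster's mean.
        centers[j] += (x - centers[j]) / counts[j]
    return centers
```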
To Obtain a K-Means Cluster Analysis
From the menus choose:
Analyze > Classify > K-Means Cluster...
· Select the variables to be used in the cluster analysis.
· Specify the number of clusters. The number of clusters must be at least two and must not be greater than the number of cases in the data file.
· Select either Iterate and classify or Classify only.
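For readers working outside SPSS, here is a rough scikit-learn equivalent of those steps; the file name, column names, and choice of three clusters are placeholders:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Placeholder file and variable names; substitute your own data.
df = pd.read_csv("cases.csv")
X = StandardScaler().fit_transform(df[["var1", "var2", "var3"]])

# "Iterate and classify": find the centers iteratively, then assign
# every case (max_iter=10 mirrors the default mentioned above).
km = KMeans(n_clusters=3, max_iter=10, n_init=1, random_state=0).fit(X)
df["cluster"] = km.labels_          # saved cluster membership
print(km.cluster_centers_)          # final cluster centers

# "Classify only": assign cases to fixed centers without updating them.
labels_only = km.predict(X)
```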
By,
Saket Deepak
Team J