Monday, September 10, 2012

Team A (Day 4) - Ankit Jaiswal


K-Means Clustering

K-means clustering is a method of cluster analysis which aims to partition n observations into k clusters, with each observation assigned to the cluster with the nearest mean. The grouping is done by minimizing the sum of squared distances between each data point and its cluster centroid. Thus, the purpose of K-means clustering is to classify the data into homogeneous groups.
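The algorithm alternates between assigning each point to its nearest centroid and recomputing each centroid as the mean of its assigned points. A minimal one-dimensional sketch in Python (the data values and starting centroids below are made up purely for illustration):

```python
# Minimal K-means sketch on 1-D data (illustrative, not SPSS output).
# Fixed initial centroids are used for reproducibility; real runs
# often pick them at random, which is why results can vary.

def kmeans(data, centroids, iterations=10):
    """Assign each point to the nearest centroid, then move each
    centroid to the mean of the points assigned to it."""
    for _ in range(iterations):
        clusters = {i: [] for i in range(len(centroids))}
        for x in data:
            nearest = min(range(len(centroids)),
                          key=lambda i: (x - centroids[i]) ** 2)
            clusters[nearest].append(x)
        centroids = [sum(pts) / len(pts) if pts else centroids[i]
                     for i, pts in clusters.items()]
    return centroids, clusters

data = [2, 3, 4, 10, 11, 12, 20, 25, 30]
centroids, clusters = kmeans(data, [3, 11, 25])
print(centroids)   # → [3.0, 11.0, 25.0]
print(clusters)    # → {0: [2, 3, 4], 1: [10, 11, 12], 2: [20, 25, 30]}
```

Each cluster mean ends up at the centre of its group, which is exactly the "nearest mean" partition described above.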

This method is generally used when the data set contains more than about 50 cases. The variables should be continuous data (also called scale or interval data).

Example: Suppose that while dividing the entire data into 3 clusters we get clusters of sizes 13, 1, and 192; the single observation in the cluster of size 1 is known as an outlier. This type of clustering is not good, as it is not uniform and would not help us.

Advantages to Using this Technique
  • With a large number of variables, K-Means may be computationally faster than hierarchical clustering (if K is small).
  • K-Means may produce tighter clusters than hierarchical clustering, especially if the clusters are globular.
Disadvantages to Using this Technique
  • Difficulty in comparing the quality of the clusters produced (e.g., different initial partitions or values of K affect the outcome).
  • A fixed number of clusters can make it difficult to predict what K should be.
  • Does not work well with non-globular clusters.
  • Different initial partitions can result in different final clusters. It is helpful to rerun the analysis several times, with both the same and different K values, and compare the results.
In K-Means cluster analysis, we mainly look for two things:
  • Number of cases in each cluster – the number of respondents in each cluster after all respondents have been divided into clusters. For example, if there are 206 respondents in total and, after applying K-means cluster analysis on some parameter (e.g., monthly expenditure), we get 3 clusters of sizes 192, 13, and 1, then those sizes are the number of cases in each cluster.
  • Final clusters – in the above example, the 3 clusters formed are the final clusters.
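The number of cases in each cluster can be read straight off the cluster-assignment list that K-means produces. A small sketch, assuming a hypothetical `labels` list matching the 206-respondent example above:

```python
# Counting the number of cases in each cluster.
# `labels` is a hypothetical cluster assignment for each respondent,
# built to match the 206-respondent example in the text.
from collections import Counter

labels = [0] * 192 + [1] * 13 + [2] * 1   # 206 respondents in total
cases_per_cluster = Counter(labels)
print(cases_per_cluster)   # → Counter({0: 192, 1: 13, 2: 1})
```

A count of 1 in a cluster, as here, flags the outlier situation described earlier.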

Boxplot

A boxplot is a statistical tool that graphically represents the distribution of a set of numerical data. It splits a data set into quartiles by calculating five numbers:
  • the median (Q2): the value separating the higher half of a sample from the lower half;
  • the upper quartile (Q3): the median of the higher half of the data set;
  • the lower quartile (Q1): the median of the lower half of the data set;
  • the minimum value; and
  • the maximum value.
The length of the box is given by the interquartile range (IQR), which is the difference between the upper and lower quartiles. The IQR tells how spread out the "middle" values are; it can also be used to tell when some of the other values are "too far" from the central values. These "too far" points are called "outliers", because they "lie outside" the range in which we expect them.

An outlier is any value that lies more than one and a half times the length of the box from either end of the box. That is, if a data point is below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR, it is viewed as being too far from the central values to be reasonable.
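The five numbers and the 1.5 × IQR fence can be computed directly. A sketch using the median-of-halves quartile convention (other quartile conventions give slightly different Q1 and Q3); the data set is made up for illustration:

```python
# Five-number summary and the 1.5 * IQR outlier fence.
# Quartiles are computed as medians of the lower and upper halves;
# note that statistical packages may use other quartile conventions.
from statistics import median

def five_number_summary(values):
    s = sorted(values)
    n = len(s)
    lower = s[: n // 2]          # lower half (overall median excluded when n is odd)
    upper = s[(n + 1) // 2 :]    # upper half
    return min(s), median(lower), median(s), median(upper), max(s)

data = [1, 3, 4, 5, 6, 7, 8, 9, 40]
lo, q1, q2, q3, hi = five_number_summary(data)
iqr = q3 - q1
outliers = [x for x in data
            if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
print((lo, q1, q2, q3, hi))   # → (1, 3.5, 6, 8.5, 40)
print(iqr, outliers)          # → 5.0 [40]
```

Here the fences are 3.5 − 7.5 = −4 and 8.5 + 7.5 = 16, so the value 40 "lies outside" the expected range and is flagged as an outlier.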

