Monday, September 10, 2012

Day 4 - Team B - Gopalkrishnan Iyer




K-Means Clustering:
In data mining, k-means clustering is a method of cluster analysis that aims to partition n observations into k clusters, with each observation belonging to the cluster whose mean is nearest. K-means tends to find clusters of comparable spatial extent, whereas the expectation-maximization approach allows clusters to have different shapes. It groups related observations without any prior knowledge of those relationships. It is generally faster than hierarchical clustering, can handle large data files, and is used to sort cases into groups when no class labels are known in advance.
Cluster analysis has had a long and active history in marketing research. The k-means method, which requires interval- or ratio-scaled data, is an iterative partitioning method. It does not require computing the distances between all possible pairs of cases; only the distance from each case to each of the k cluster means is needed. It differs from hierarchical clustering in several ways. The algorithm is called k-means, where k is the number of clusters required, because each case is assigned to the cluster whose mean is closest to it. The action in the algorithm centers on finding the k means. We start with an initial set of means and classify cases based on their distances to these centers. Next, we recompute the cluster means using the cases assigned to each cluster, and then reclassify all cases based on the new set of means. These two steps are repeated until the assignments stop changing.
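The classify-then-recompute loop described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming Euclidean distance and a caller-supplied initial set of means; the function name k_means and its parameters are made up for this example and do not come from any library.

```python
import math


def k_means(points, initial_means, max_iter=100):
    """Cluster `points` (tuples of numbers) around len(initial_means) means.

    Illustrative sketch only: Euclidean distance is assumed, and the
    initial means are supplied by the caller.
    """
    means = [tuple(m) for m in initial_means]
    k = len(means)
    labels = None
    for _ in range(max_iter):
        # Classify every case by its nearest current mean.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, means[j])) for p in points
        ]
        if new_labels == labels:
            break  # no case changed cluster, so the means are stable
        labels = new_labels
        # Recompute each mean from the cases now assigned to that cluster.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old mean if a cluster ends up empty
                means[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return means, labels
```

Calling k_means(points, initial_means) returns the final means together with a cluster index for each case. A different initial set of means can lead to a different final partition, which is why the starting point matters.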

Hierarchical vs. k-means clustering:

Hierarchical clustering only requires a measure of similarity between groups of data points.

K-means, by contrast, requires:
·         The number of clusters, k
·         An initial assignment of data to clusters
·         A distance measure between data points, d(x_n, x_m)

Merits of K-means clustering:

·         It is faster than hierarchical clustering.
·         It produces tighter clusters than hierarchical clustering, especially if the clusters are globular.
Demerits:
·         It is difficult to compare the quality of the clusters produced.
·         The number of clusters, K, must be fixed in advance, and a suitable value can be difficult to predict.

Boxplot: A boxplot is a statistical tool that graphically represents the distribution of a set of numerical data. It divides the data set into quartiles using five values: the minimum, the lower quartile, the median, the upper quartile, and the maximum.
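The five values behind a boxplot can be computed with Python's standard library. This is a small sketch; the helper name five_number_summary is hypothetical, and the "inclusive" quartile method is an assumption, since different tools interpolate quartiles slightly differently.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    # quantiles(..., n=4) returns the three cut points that split the data
    # into four parts; the "inclusive" method is an assumption made here.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)
```

The box of the plot spans the lower to the upper quartile, with the median drawn as a line inside it.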

Outlier: An outlier is any value that lies more than one and a half times the length of the box (the distance between the lower and upper quartiles) beyond either end of the box. Such values are ‘too far’ from the central value: they lie outside the range in which we expected them to fall.
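The one-and-a-half-box-lengths rule can be written out as a short check. The sketch below is illustrative only: the function name boxplot_outliers is invented for this example, and the quartiles are computed with the standard library's "inclusive" method, which is an assumption rather than the only possible choice.

```python
import statistics


def boxplot_outliers(values):
    """Return the values lying more than 1.5 box lengths beyond the box."""
    # Lower and upper quartiles mark the two ends of the box.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    box_length = q3 - q1
    lower_limit = q1 - 1.5 * box_length
    upper_limit = q3 + 1.5 * box_length
    return [v for v in values if v < lower_limit or v > upper_limit]
```

For the data 1, 2, 3, 4, 5, 6, 7, 8, 9, 50, the box runs from 3.25 to 7.75, so the limits are -3.5 and 14.5 and only the value 50 is flagged.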
