K-Means Clustering
K-means clustering is a method of cluster analysis that aims to partition n observations into k clusters, with each observation assigned to the cluster with the nearest mean. The grouping is done by minimizing the sum of squared distances between each data point and its cluster centroid. Thus, the purpose of K-means clustering is to classify the data. This method is generally used for data sets with more than 50 observations, measured on a continuous (scale/interval) level.
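The procedure described above can be sketched in a few lines of code. This is a minimal illustration, not a production implementation: it assumes NumPy, a small made-up data set, and the classic two-step loop (assign each point to the nearest centroid, then recompute each centroid as the mean of its points), which is what minimizes the sum of squared distances.

```python
import numpy as np

def kmeans(data, k, n_iter=100, seed=0):
    """Minimal K-means: assign points to the nearest centroid,
    then recompute each centroid as the mean of its points."""
    rng = np.random.default_rng(seed)
    # Initialise centroids by picking k distinct observations at random.
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid.
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = np.array([
            data[labels == j].mean(axis=0) if (labels == j).any() else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated groups of hypothetical observations.
data = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                 [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centroids = kmeans(data, k=2)
```

With well-separated groups like these, the first three points end up in one cluster and the last three in the other, and each centroid sits at the mean of its group.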
Example:
Suppose that dividing the entire data into 3 clusters yields clusters of size 13, 1, and 192; the cluster containing only 1 observation is an outlier. Such a clustering is not good: it is highly unbalanced and would not help us.
Advantages of Using This Technique
- With a large number of variables, K-means may be computationally faster than hierarchical clustering (if K is small).
- K-means may produce tighter clusters than hierarchical clustering, especially if the clusters are globular.
Disadvantages of Using This Technique
- Difficulty in comparing the quality of the clusters produced (e.g., different initial partitions or different values of K affect the outcome).
- The fixed number of clusters can make it difficult to predict what K should be.
- It does not work well with non-globular clusters.
- Different initial partitions can result in different final clusters. It is therefore helpful to rerun the analysis with the same as well as different values of K, and to compare the results.
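One concrete way to compare runs with different values of K, as the last point suggests, is to compute the total within-cluster sum of squared distances for each K (lower means tighter clusters). The sketch below assumes NumPy and a basic assign-then-update K-means loop on made-up data; it is an illustration, not a tuned implementation.

```python
import numpy as np

def total_withinss(data, k, seed=0, n_iter=50):
    """Run a basic K-means and return the total within-cluster
    sum of squared distances to the final centroids."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        centroids = np.array([
            data[labels == j].mean(axis=0) if (labels == j).any() else centroids[j]
            for j in range(k)
        ])
    # Sum of squared distances from each point to its own centroid.
    final = np.linalg.norm(data - centroids[labels], axis=1)
    return float((final ** 2).sum())

# Three tight, well-separated groups: K = 3 should fit far better than K = 1.
data = np.vstack([np.random.default_rng(1).normal(c, 0.1, size=(20, 2))
                  for c in ((0, 0), (5, 5), (10, 0))])
scores = {k: total_withinss(data, k) for k in (1, 2, 3, 4)}
```

Comparing `scores` across K makes the trade-off visible: the score drops sharply up to the true number of groups and only marginally afterwards.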
In K-means cluster analysis, we mainly look for two things:
- Number of cases in each cluster – the number of respondents in each cluster after all respondents have been divided among the clusters. For example, if there are 206 respondents in total and, after applying K-means cluster analysis on some parameter (e.g., monthly expenditure), we get 3 clusters of sizes 1, 13, and 192, then 192, 13, and 1 are the numbers of cases in each cluster.
- Final clusters – in the above example, the 3 clusters formed are the final clusters.
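The number of cases in each cluster can be tallied directly from the cluster labels. The label list below is hypothetical, constructed to match the 206-respondent example above:

```python
from collections import Counter

# Hypothetical cluster labels for 206 respondents, matching the
# example: clusters of size 192, 13, and 1.
labels = [0] * 192 + [1] * 13 + [2] * 1

cases_per_cluster = Counter(labels)
print(dict(cases_per_cluster))  # {0: 192, 1: 13, 2: 1}
```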
A boxplot is a statistical tool that graphically represents the distribution of a set of numerical data. It splits a data set into quartiles by calculating five numbers:
- the median (Q2): the value separating the higher half of a sample from the lower half;
- the upper quartile (Q3): the median of the higher half of the data set;
- the lower quartile (Q1): the median of the lower half of the data set;
- the minimum value; and
- the maximum value.
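The five numbers listed above can be computed with the standard library alone. This sketch uses a small made-up sample and the common convention that, for an even number of observations, Q1 and Q3 are the medians of the lower and upper halves (quartile conventions vary between packages):

```python
import statistics

data = sorted([2, 4, 4, 5, 6, 7, 8, 9, 12, 15])

minimum, maximum = data[0], data[-1]
median = statistics.median(data)       # Q2
lower_half = data[:len(data) // 2]     # values below the median
upper_half = data[-(len(data) // 2):]  # values above the median
q1 = statistics.median(lower_half)     # lower quartile
q3 = statistics.median(upper_half)     # upper quartile

print(minimum, q1, median, q3, maximum)  # 2 4 6.5 9 15
```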
An outlier is any value that lies more than one and a half times the length of the box (the interquartile range, IQR = Q3 − Q1) from either end of the box. That is, if a data point is below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR, it is viewed as being too far from the central values to be reasonable.
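The 1.5 × IQR rule translates directly into code. The function and the sample values below are illustrative, with Q1 = 4 and Q3 = 9 assumed for the example data:

```python
def iqr_outliers(data, q1, q3):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < lower or x > upper]

# With Q1 = 4 and Q3 = 9, IQR = 5, so the fences are -3.5 and 16.5.
data = [2, 4, 4, 5, 6, 7, 8, 9, 12, 40]
print(iqr_outliers(data, q1=4, q3=9))  # [40]
```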