Cluster analysis
SPSS offers two separate approaches to cluster analysis: K-Means clustering (also called Quick
clustering) and Hierarchical (or agglomerative) clustering.
K-Means Clustering
In data mining, k-means clustering is a method
of cluster analysis which aims to partition n observations
into k clusters in
which each observation belongs to the cluster with the nearest mean. It requires the number of clusters to be specified in
advance, and the initial number chosen may split natural groupings or combine
two or more groups that are rather different from each other.
The main disadvantage is
that there needs to be a certain amount of trial and error in choosing the
number of clusters.
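The assign-to-nearest-mean idea described above can be sketched in a few lines of Python. This is only an illustration of the general k-means procedure (often called Lloyd's algorithm), not the routine SPSS runs internally. The function name `k_means`, its parameters, and the sample points are all assumptions made for this example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, initial_centers=None, max_iter=100, seed=0):
    """Partition points into k clusters by repeated assign/update steps."""
    if initial_centers is None:
        # Assumption: start from k randomly chosen observations.
        centers = random.Random(seed).sample(points, k)
    else:
        centers = list(initial_centers)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest mean.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # memberships stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = mean_point(members)
    return labels, centers

# Made-up data: two obvious groups, with k chosen in advance as 2.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
labels, centers = k_means(points, k=2, initial_centers=[(1.0, 1.0), (8.0, 8.0)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Running the same function with a different `k` shows the trial and error involved: the algorithm always returns exactly `k` clusters, whether or not that matches the natural grouping.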
How to carry out the analysis using K-Means
To carry out the analysis, choose Analyze > Classify > K-Means Cluster. Copy all of your
variables across into the list, and specify the number of clusters that you want it to find.
Keep increasing the number of clusters for as long as every cluster still contains a
satisfactory number of objects; stop once a cluster becomes too small.
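Checking whether every cluster holds enough objects is a simple count of cluster memberships. The sketch below is in Python, not SPSS syntax. The membership lists are made up, clusters are numbered from 0 for convenience, and the threshold `min_size` is an assumption, since what counts as a satisfactory number depends on the data.

```python
from collections import Counter

def cluster_sizes(labels, k):
    """Number of cases in each of the k clusters (clusters numbered from 0)."""
    counts = Counter(labels)
    return [counts.get(c, 0) for c in range(k)]

# Hypothetical cluster memberships for ten cases from two separate runs.
labels_k2 = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
labels_k3 = [0, 0, 0, 0, 0, 1, 1, 1, 1, 2]

min_size = 3  # assumed threshold for a "satisfactory" cluster
for k, labels in [(2, labels_k2), (3, labels_k3)]:
    sizes = cluster_sizes(labels, k)
    verdict = "ok" if min(sizes) >= min_size else "too small"
    print(k, sizes, verdict)
# 2 [5, 5] ok
# 3 [5, 4, 1] too small
```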
Boxplot: To Detect Outliers
Graphs > Legacy Dialogs > Boxplot
The box plot is a
useful graphical display for describing the behaviour of the data in the middle
as well as at the ends of the distributions. The box plot uses
the median and the lower and upper quartiles (defined as the 25th and
75th percentiles). If the lower quartile is Q1 and the upper quartile is
Q3, then the difference (Q3 - Q1) is called the interquartile range or IQR.
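As a small worked example, the quartiles and the interquartile range can be computed with Python's standard library. The data values are made up, and the `"inclusive"` interpolation method is an assumption: statistical packages, SPSS included, may use a slightly different percentile rule and so report slightly different quartiles.

```python
import statistics

values = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # made-up sample

# n=4 splits the data into quarters and returns the three cut points.
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range, Q3 - Q1

print(q1, median, q3, iqr)  # 3.0 5.0 7.0 4.0
```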
Box and whisker plots
are uniform in their use of the box: the bottom and top of the box are always
the 25th and 75th percentile (the lower and upper quartiles,
respectively), and the band near the middle of the box is always the 50th percentile (the median). But the ends of the whiskers can represent several
possible alternative values, among them:
§ The minimum and maximum of all the data
§ The lowest datum still within 1.5 IQR of the lower quartile,
and the highest datum still within 1.5 IQR of
the upper quartile.
§ One standard deviation above and below the
mean of the data
§ The 9th percentile and the 91st percentile
§ The 2nd percentile and the 98th percentile.
Any data not included
between the whiskers should be plotted as an outlier with a dot, small circle,
or star, but occasionally this is not done.
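The 1.5 IQR convention from the list above can be sketched as follows: the whiskers end at the most extreme data values that still lie within 1.5 IQR of the quartiles, and anything beyond them is reported as an outlier. The function name `iqr_whiskers` and the sample data are assumptions for illustration, and the quartile method is again Python's `"inclusive"` rule, not necessarily the one a given package uses.

```python
import statistics

def iqr_whiskers(values):
    """Return (lower whisker, upper whisker, outliers) under the 1.5 IQR rule."""
    data = sorted(values)
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    # Whiskers stop at the lowest and highest data values inside the fences.
    return inside[0], inside[-1], outliers

values = [2, 4, 4, 5, 6, 7, 8, 9, 30]  # made-up sample with one extreme value
low, high, outliers = iqr_whiskers(values)
print(low, high, outliers)  # 2 9 [30]
```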
Some box plots include
an additional character to represent the mean of the data.
On some box plots a crosshatch is
placed on each whisker, before the end of the whisker.
Rarely, box plots can
be presented with no whiskers at all. Because of this variability, it is
appropriate to describe the convention being used for the whiskers and outliers
in the caption for the plot.
As long as outliers are present, the clusters are unlikely to be good. So to remove
outliers, an “If” condition is used:
Data > Select Cases > If condition is satisfied
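The effect of selecting cases by a condition is a row filter: only the cases for which the condition holds go forward to the analysis. Here is a rough Python equivalent. The variable name `income`, the cutoff of 100, and the helper `select_cases` are all hypothetical; in practice the cutoff would come from inspecting the boxplot.

```python
def select_cases(cases, condition):
    """Keep only the cases for which the condition is true."""
    return [case for case in cases if condition(case)]

# Made-up cases; case 4 has an extreme value on the hypothetical variable.
cases = [
    {"id": 1, "income": 32},
    {"id": 2, "income": 41},
    {"id": 3, "income": 38},
    {"id": 4, "income": 950},
]

# Assumed condition: keep cases whose income is at most 100.
kept = select_cases(cases, lambda case: case["income"] <= 100)
print([case["id"] for case in kept])  # [1, 2, 3]
```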
Hierarchical Clustering
The hierarchical algorithms result in a tree-like dendrogram.
· At the bottom of the tree, each observation is represented as a separate “cluster”.
· At intermediate levels, observations are grouped into fewer “clusters” than at the lower levels.
· At the top, all of the observations are merged into one “cluster”.
· In some problems, the entire tree structure may be of interest.
· In others, the tree is just a convenient tool for obtaining a partition.
· This is done by cutting the tree at a suitable level, which forces a particular partition.
· Some hierarchical algorithms form the tree from the top down in a divisive fashion, but
most work agglomeratively from the bottom up.
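The agglomerative idea can be sketched as follows: start with one cluster per observation, repeatedly merge the two closest clusters, and stop at the desired number of clusters, which corresponds to cutting the tree at that level. Single linkage (the distance between the closest members) is used here only to keep the code short; it is an assumption, not necessarily the linkage a given SPSS run uses. The function name and the data are also made up.

```python
import math

def agglomerate(points, num_clusters):
    """Merge the closest clusters until num_clusters remain (single linkage)."""
    clusters = [[i] for i in range(len(points))]  # one cluster per observation
    merges = []  # history of merges, i.e. the tree built so far
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(
                    math.dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((list(clusters[a]), list(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges

# Made-up one-dimensional observations, written as 1-tuples.
points = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
clusters, merges = agglomerate(points, num_clusters=2)
print(clusters)  # [[0, 1, 2, 3], [4]]
```

The `merges` list records which clusters were joined and at what distance, which is the information a dendrogram displays.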