SPSS offers two separate approaches to cluster analysis:
· K-Means clustering (also called Quick clustering)
· Hierarchical (or agglomerative) clustering
K-Means Clustering
In data mining, k-means clustering is a method of cluster analysis which aims to partition n observations
into k clusters
in which each observation belongs to the cluster with the nearest mean. It requires the
number of clusters to be specified in advance, and the initial number chosen
may split natural groupings or combine two or more groups that are rather
different from each other.
The main disadvantage is that there needs to be
a certain amount of trial and error in choosing the number of clusters.
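The trial and error over the number of clusters can be sketched outside SPSS. The following is a minimal illustration in Python of plain k-means (Lloyd's algorithm), not the exact procedure SPSS uses. The function names `kmeans` and `cluster_sizes`, the sample points, and the choice of starting centres (observations taken at evenly spaced positions in the list, chosen only for reproducibility) are all assumptions made for this example.

```python
def kmeans(points, k, max_iterations=100):
    """Plain k-means (Lloyd's algorithm) on a list of equal-length numeric tuples."""
    n = len(points)
    # Assumption: start from k observations spread evenly through the list.
    centroids = [points[i * n // k] for i in range(k)]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each observation joins the cluster with the nearest mean.
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:  # no observation changed cluster, so stop
            break
        labels = new_labels
        # Update step: recompute each cluster mean from its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centre
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


def cluster_sizes(labels, k):
    """Number of observations assigned to each of the k clusters."""
    return [labels.count(c) for c in range(k)]


# Hypothetical data: two visibly separate groups of three observations each.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]

# Trial and error: run the analysis for several k and inspect the cluster sizes.
for k in (2, 3, 4):
    labels, _ = kmeans(points, k)
    print(k, cluster_sizes(labels, k))
```

With k = 2 the two natural groups are recovered; with larger k the same groups get split into smaller clusters, which is the effect described above.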
How to carry out the analysis using K-Means
To carry out the analysis, choose Classify > K-Means Cluster from the Analyze menu. Move all of your variables into the Variables list, and specify the number of clusters that you want it to find.
You need to keep increasing the number of clusters until each and every cluster contains a satisfactory number of objects.
Outliers
· Outliers are cases that have data values that are very different from the data values for the majority of cases in the data set.
· Outliers are important because they can change the results of data analysis.
Whether to include or exclude outliers from a data
analysis depends on the reason why the case is an outlier and the purpose of
the analysis.
In other words, an outlier is an observation that lies an abnormal distance from other values in a random sample from a population. To remove outliers, we first need to find out which variables make those cases outliers, and then remove the cases.
O – represents outliers in the box plot diagram.
* – represents extreme cases in the box plot diagram.
Investigating outliers carefully: Often outliers contain valuable information about the process under investigation or the data gathering and recording process. Before considering the possible elimination of these points from the data, one should try to understand why they appeared and whether it is likely that similar values will continue to appear. Of course, outliers are often bad data points.
Boxplot: To detect outliers
Graphs > Legacy Dialogs > Boxplot
The box plot is a useful graphical display for describing the behaviour of the data in the middle as well as at the ends of the distribution. The box plot uses the median and the lower and upper quartiles (defined as the 25th and 75th percentiles). If the lower quartile is Q1 and the upper quartile is Q3, then the difference (Q3 - Q1) is called the interquartile range or IQR.
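As a small worked example of the quartiles and the interquartile range, here is a sketch in Python using the standard library's `statistics.quantiles`. The data values and the function name `quartiles_and_iqr` are made up for illustration. Quartile definitions differ slightly between packages, so the "inclusive" interpolation method used here is an assumption and may not match the figures SPSS reports for the same data.

```python
import statistics


def quartiles_and_iqr(values):
    """Return (Q1, median, Q3, Q3 - Q1) for a list of numbers."""
    # n=4 asks for the three cut points that divide the data into quarters.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, median, q3, q3 - q1


# Hypothetical sample with one unusually large value.
values = [2, 4, 4, 5, 6, 7, 8, 9, 10, 12, 30]

q1, median, q3, iqr = quartiles_and_iqr(values)
print(q1, median, q3, iqr)  # 4.5 7.0 9.5 5.0
```

Note that the large value 30 has no effect on the median or the quartiles, which is why the box plot is built on them rather than on the mean.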
Box and whisker plots are uniform in their use of the box: the bottom and top of the box are always the 25th and 75th percentiles (the lower and upper quartiles, respectively), and the band near the middle of the box is always the 50th percentile (the median). But the ends of the whiskers can represent several possible alternative values, among them:
· The minimum and maximum of all the data (as in Figure 1)
· The lowest datum still within 1.5 IQR of the lower quartile, and the highest datum still within 1.5 IQR of the upper quartile (as in Figure 2)
· One standard deviation above and below the mean of the data
· The 9th percentile and the 91st percentile
· The 2nd percentile and the 98th percentile.
Any data not included between the whiskers should
be plotted as an outlier with a dot, small circle, or star, but occasionally
this is not done.
Some box plots include an additional character to
represent the mean of the data.
On some box plots a crosshatch is placed on each whisker, before the end of the
whisker.
Rarely, box plots can be presented with no whiskers
at all. Because of this variability, it is appropriate to describe the
convention being used for the whiskers and outliers in the caption for the
plot.
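To make one of these conventions concrete, the sketch below takes the 1.5 IQR rule from the list above, finds where the whiskers would end, and lists the points that fall outside them. It is an illustration only: the function name `boxplot_summary`, the sample data, and the quartile method are assumptions. The split into "outliers" (beyond 1.5 IQR) and "extremes" (beyond 3 IQR) follows the convention commonly described for the O and * markers in SPSS box plots, and should likewise be read as an assumption rather than a statement of what SPSS computes.

```python
import statistics


def boxplot_summary(values, whisker=1.5, extreme=3.0):
    """Whisker ends, outliers and extremes under the 1.5 IQR convention."""
    data = sorted(values)
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr              # whisker limits
    far_low, far_high = q1 - extreme * iqr, q3 + extreme * iqr      # extreme limits
    inside = [v for v in data if low <= v <= high]
    return {
        # Whiskers end at the last data points still inside the limits.
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        # Outside the whiskers but within 3 IQR of the box (the O marker).
        "outliers": [v for v in data if (far_low <= v < low) or (high < v <= far_high)],
        # More than 3 IQR from the box (the * marker).
        "extremes": [v for v in data if v < far_low or v > far_high],
    }


# Hypothetical sample with one moderate and one very large value.
sample = [2, 4, 4, 5, 6, 7, 8, 9, 10, 12, 20, 30]
summary = boxplot_summary(sample)
print(summary)
```

For this sample the whiskers run from 2 to 12, the value 20 is flagged as an outlier and 30 as an extreme case.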
As long as outliers are present, good clusters will not form. So to remove outliers, an "If" condition is used.
Data > Select Cases > If condition is satisfied
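The idea behind selecting cases with a condition can be sketched in Python as a simple filter. This is only an analogy to the SPSS dialog, not a reproduction of it. The variable name `income`, the case values, the helper `select_cases`, and the cut-off of 100 are all hypothetical; in practice the condition would come from the limits found in the box plot.

```python
def select_cases(cases, condition):
    """Keep only the cases for which the condition is true."""
    return [case for case in cases if condition(case)]


# Hypothetical data set: one record per case.
cases = [
    {"id": 1, "income": 32.0},
    {"id": 2, "income": 41.5},
    {"id": 3, "income": 950.0},  # far from the other values
    {"id": 4, "income": 38.0},
]

# Hypothetical condition: keep cases whose income lies in a plausible range.
kept = select_cases(cases, lambda case: 0 <= case["income"] <= 100)
print([case["id"] for case in kept])  # [1, 2, 4]
```

The clustering would then be run again on the remaining cases only.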
Author
Akhil Aggarwal (14126)