Monday, September 10, 2012

Day 4 - Team G - Abhishek Kumar Rai


Cluster analysis
SPSS offers two separate approaches to cluster analysis: K-Means clustering (also called Quick Cluster) and hierarchical (agglomerative) clustering.

K-Means Clustering
In data mining, k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. It requires the number of clusters to be specified in advance, and the initial number chosen may split natural groupings or combine two or more groups that are rather different from each other.
The main disadvantage is that choosing the number of clusters involves a certain amount of trial and error.
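The partitioning idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the routine SPSS runs: the sample points are made up, the function name k_means is hypothetical, and the first k points are used as starting centers purely to keep the example deterministic.

```python
from math import dist

def k_means(points, k, iterations=100):
    """Basic k-means: returns (centers, labels) for a list of coordinate tuples."""
    # Assumption: seed the centers with the first k points (simple, deterministic).
    centers = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest mean.
        new_labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break  # no point changed cluster, so the centers are stable
        labels = new_labels
        # Update step: move each center to the mean of its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, labels

# Made-up data: two visibly separate groups of observations.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, labels = k_means(points, 2)
print(labels)
print(centers)
```

Running it again with a different k on the same data shows how the chosen number of clusters can split or merge the natural groups.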

How to carry out analysis using K-Means
To carry out the analysis, choose Classify > K-Means Cluster from the Analyze menu. Copy all of your variables across into the list, and specify the number of clusters that you want it to find.
Keep adjusting the number of clusters until every cluster contains a satisfactory number of objects.
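One way to check cluster sizes outside SPSS is to count the cases assigned to each cluster. The sketch below assumes the cluster membership has been exported as a list of cluster numbers; the function names, the sample memberships and the minimum size of 2 are all hypothetical.

```python
from collections import Counter

def cluster_sizes(labels, k):
    """Number of cases in each of the k clusters (0 for an empty cluster)."""
    counts = Counter(labels)
    return [counts.get(c, 0) for c in range(k)]

def all_clusters_large_enough(labels, k, min_size):
    """True when every cluster holds at least min_size cases."""
    return min(cluster_sizes(labels, k)) >= min_size

# Hypothetical memberships for the same six cases under two choices of k.
labels_k2 = [0, 0, 0, 1, 1, 1]
labels_k3 = [0, 0, 1, 1, 1, 2]

print(cluster_sizes(labels_k2, 2), all_clusters_large_enough(labels_k2, 2, 2))
print(cluster_sizes(labels_k3, 3), all_clusters_large_enough(labels_k3, 3, 2))
```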

Boxplot: To Detect Outliers

Graphs > Legacy Dialogs > Boxplot
The box plot is a useful graphical display for describing the behaviour of the data in the middle as well as at the ends of the distribution. The box plot uses the median and the lower and upper quartiles (defined as the 25th and 75th percentiles). If the lower quartile is Q1 and the upper quartile is Q3, then the difference (Q3 - Q1) is called the interquartile range, or IQR.
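As a small worked example, the quartiles and the interquartile range can be computed with Python's standard library. The data values are made up, and quartile conventions differ between packages, so SPSS may report slightly different quartiles for the same data.

```python
import statistics

# Made-up sample of 11 observations.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# n=4 splits the data into quartiles and returns the three cut points.
q1, median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1  # interquartile range = Q3 - Q1

print(q1, median, q3, iqr)
```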


Box and whisker plots are uniform in their use of the box: the bottom and top of the box are always the 25th and 75th percentile (the lower and upper quartiles, respectively), and the band near the middle of the box is always the 50th percentile (the median). But the ends of the whiskers can represent several possible alternative values, among them:
·         The minimum and maximum of all the data
·         The lowest datum still within 1.5 IQR of the lower quartile, and the highest datum still within 1.5 IQR of the upper quartile
·         One standard deviation above and below the mean of the data
·         The 9th percentile and the 91st percentile
·         The 2nd percentile and the 98th percentile
Any data not included between the whiskers should be plotted as an outlier with a dot, small circle, or star, but occasionally this is not done.
Some box plots include an additional character to represent the mean of the data.
On some box plots a crosshatch is placed on each whisker, before the end of the whisker.
Rarely, box plots can be presented with no whiskers at all. Because of this variability, it is appropriate to describe the convention being used for the whiskers and outliers in the caption for the plot.
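The 1.5 IQR whisker convention from the list above can be sketched as follows. This is an illustration only: the function name whisker_limits and the data are made up, and the quartiles come from Python's statistics module, which may not match the quartile method a given plotting tool uses.

```python
import statistics

def whisker_limits(data):
    """Whisker ends and outliers under the 1.5 * IQR convention."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [x for x in data if low_fence <= x <= high_fence]
    outliers = [x for x in data if x < low_fence or x > high_fence]
    # Whiskers stop at the most extreme data values that are still inside the fences.
    return min(inside), max(inside), outliers

# Made-up sample with one extreme value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 40]
low_whisker, high_whisker, outliers = whisker_limits(data)
print(low_whisker, high_whisker, outliers)
```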

Good clusters cannot be obtained as long as outliers remain in the data. To exclude them, use the “If” condition in Select Cases:
Data > Select Cases > If condition is satisfied
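The same idea, keeping only the cases that satisfy a condition, looks like this in Python. It is a rough analogue rather than SPSS syntax; the variable names (id, income), the values and the cutoff of 18 are hypothetical.

```python
def select_cases(rows, condition):
    """Keep only the rows for which condition(row) is true."""
    return [row for row in rows if condition(row)]

# Hypothetical cases; "income" is a made-up variable with one extreme value.
rows = [
    {"id": 1, "income": 5},
    {"id": 2, "income": 7},
    {"id": 3, "income": 40},  # flagged as an outlier on the box plot
    {"id": 4, "income": 9},
]

# Assumed condition: keep cases at or below an upper limit of 18.
kept = select_cases(rows, lambda row: row["income"] <= 18)
print([row["id"] for row in kept])
```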

Hierarchical
The hierarchical algorithms result in a tree-like dendrogram.
·         At the leaves of the tree, each observation is represented as a separate “cluster”.
·         At intermediate levels, observations are grouped into fewer “clusters” than at the levels nearer the leaves.
·         At the root, all of the observations are merged into one “cluster”.
·         In some problems, the entire tree structure may be of interest.
·         In others, the tree is just a convenient tool for obtaining a partition.
·         This is done by cutting the tree at a suitable level, which forces a particular partition.
·         Most hierarchical algorithms work agglomeratively, merging clusters from the leaves toward the root; some work divisively, splitting from the root toward the leaves.
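A minimal sketch of the agglomerative approach: start with every observation as its own cluster, repeatedly merge the two closest clusters until only one remains, and then cut the resulting sequence at the level that gives the desired number of clusters. The assumptions here are single linkage (distance between the closest members), one-dimensional made-up data, and hypothetical function names; SPSS offers several other linkage methods and distance measures.

```python
from itertools import combinations

def linkage_distance(a, b):
    """Single linkage: distance between the two closest members of a and b."""
    return min(abs(x - y) for x in a for y in b)

def agglomerate(values):
    """Return every partition, from n single-observation clusters down to one cluster."""
    clusters = [[v] for v in values]
    levels = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Find the pair of clusters that are closest to each other.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda pair: linkage_distance(clusters[pair[0]], clusters[pair[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
        levels.append([list(c) for c in clusters])
    return levels

def cut(levels, k):
    """Cut the tree at the level that has exactly k clusters."""
    return levels[len(levels[0]) - k]

# Made-up observations.
levels = agglomerate([1.0, 1.2, 5.0, 5.1, 9.0])
three = sorted(sorted(c) for c in cut(levels, 3))
two = sorted(sorted(c) for c in cut(levels, 2))
print(three)
print(two)
```

Each entry in levels corresponds to one horizontal cut through the dendrogram, so the whole tree and any single partition are both available from the same result.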



