Business Analytics Workshop SIBM 2011 Marketing : Day 4 -Team G-Akhil

Monday, September 10, 2012

Day 4 -Team G-Akhil

Cluster analysis

SPSS offers two separate approaches to cluster analysis:

· K-Means clustering (also called Quick clustering) and

· Hierarchical (or agglomerative) clustering

K-Means Clustering

In data mining, k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. It requires the number of clusters to be specified in advance, and the initial number chosen may split natural groupings or combine two or more groups that are rather different from each other.
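The assign-and-update idea described above can be sketched in a few lines of Python. This is only a minimal illustration, not the SPSS implementation: the function names `k_means` and `squared_distance`, the sample points, and the choice of the first k points as starting centers are all assumptions made for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iterations=100):
    """Partition points into k clusters; returns (labels, centers).

    Assumption: the first k points serve as initial centers. This is an
    illustrative choice, not necessarily how SPSS picks its starting centers.
    """
    centers = [tuple(p) for p in points[:k]]
    labels = [None] * len(points)
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster with the nearest mean.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # no point changed cluster, so the solution is stable
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers


# Hypothetical data: two well-separated groups of three points each.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.9, 1.1), (7.9, 8.1)]
labels, centers = k_means(points, k=2)
print(labels)
print(centers)
```

Because k is fixed before the loop starts, asking for k=2 on data that really has three groups would force two of them together, which is the trial-and-error problem with this method.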

The main disadvantage is that there needs to be a certain amount of trial and error in choosing the number of clusters.

How to carry out analysis using K-Means

To carry out the analysis, choose Classify > K-Means Cluster from the Analyze menu. Copy all of your variables across into the list, and specify the number of clusters that you want it to find.

Keep increasing the number of clusters, re-running the analysis each time, until every cluster contains a satisfactory number of objects.
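One way to check whether a given run is acceptable is to count the cases in each cluster and flag any cluster that is too small. The sketch below assumes the cluster membership has already been saved as a list of labels; the names `cluster_sizes` and `has_small_cluster`, the sample labels, and the threshold `min_size` are hypothetical, since what counts as a "satisfactory" size is a judgment call.

```python
from collections import Counter


def cluster_sizes(labels, k):
    """Number of cases in each of the k clusters (0 for an empty cluster)."""
    counts = Counter(labels)
    return [counts.get(c, 0) for c in range(k)]


def has_small_cluster(labels, k, min_size):
    """True if any cluster has fewer than min_size cases."""
    return any(size < min_size for size in cluster_sizes(labels, k))


# Hypothetical membership for eight cases and k = 3.
labels = [0, 0, 0, 1, 1, 1, 1, 2]
sizes = cluster_sizes(labels, 3)
print(sizes)
print(has_small_cluster(labels, 3, min_size=2))
```

A cluster holding a single case, as in this example, is often a sign that the case is an outlier rather than a genuine group.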

Outliers

· Outliers are cases that have data values that are very different from the data values for the majority of cases in the data set.

· Outliers are important because they can change the results of data analysis.

Whether to include or exclude outliers from a data analysis depends on the reason why the case is an outlier and the purpose of the analysis.

In other words, an Outlier is an observation that lies an abnormal distance from other values in a random sample from a population.

To remove them, we first need to find out which variables make these cases outliers, and then remove the cases.

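O: represents outliers in the box plot diagram (in SPSS, cases between 1.5 and 3 box lengths from the edge of the box).

*: represents extreme cases in the box plot diagram (in SPSS, cases more than 3 box lengths from the edge of the box).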

Investigating outliers carefully: Often outliers contain valuable information about the process under investigation or the data gathering and recording process. Before considering the possible elimination of these points from the data, one should try to understand why they appeared and whether it is likely similar values will continue to appear. Of course, outliers are often bad data points.

Boxplot: To detect outliers

Graphs > Legacy Dialogs > Boxplot

The box plot is a useful graphical display for describing the behaviour of the data in the middle as well as at the ends of the distributions. The box plot uses the median and the lower and upper quartiles (defined as the 25th and 75th percentiles). If the lower quartile is Q1 and the upper quartile is Q3, then the difference (Q3 - Q1) is called the interquartile range, or IQR.
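As a small worked example, the quartiles and the interquartile range can be computed with Python's standard library. The helper name `quartiles_and_iqr` and the sample data are made up for illustration, and statistical packages use slightly different quantile definitions, so SPSS may report slightly different quartiles for the same data.

```python
import statistics


def quartiles_and_iqr(values):
    """Return (Q1, median, Q3, IQR) for a list of numbers."""
    # method="inclusive" treats the data as the whole population of interest.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, median, q3, q3 - q1


# Hypothetical sample of nine observations.
data = [2, 4, 4, 5, 6, 7, 8, 9, 30]
q1, median, q3, iqr = quartiles_and_iqr(data)
print(q1, median, q3, iqr)
```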

Box and whisker plots are uniform in their use of the box: the bottom and top of the box are always the 25th and 75th percentile (the lower and upper quartiles, respectively), and the band near the middle of the box is always the 50th percentile (the median). But the ends of the whiskers can represent several possible alternative values, among them:

· The minimum and maximum of all the data (as in Figure 1)

· The lowest datum still within 1.5 IQR of the lower quartile, and the highest datum still within 1.5 IQR of the upper quartile (as in Figure 2)

· One standard deviation above and below the mean of the data

· The 9th percentile and the 91st percentile

· The 2nd percentile and the 98th percentile.

Any data not included between the whiskers should be plotted as an outlier with a dot, small circle, or star, but occasionally this is not done.
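Under the 1.5 IQR whisker convention, points beyond the whiskers can be flagged with a short script. The sketch below assumes the common rule that values more than 1.5 IQR from the box are outliers and values more than 3 IQR from the box are extreme cases. The function name `classify_cases` and the data are hypothetical, and the quartile method used here may not match the one SPSS uses.

```python
import statistics


def classify_cases(values):
    """Label each value as 'normal', 'outlier', or 'extreme'."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    labels = []
    for v in values:
        if v < q1 - 3 * iqr or v > q3 + 3 * iqr:
            labels.append("extreme")   # more than 3 IQR from the box
        elif v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr:
            labels.append("outlier")   # between 1.5 and 3 IQR from the box
        else:
            labels.append("normal")
    return labels


# Hypothetical data: Q1 = 4 and Q3 = 8, so the fences sit at 14 and 20.
data = [2, 4, 4, 5, 6, 7, 8, 15, 30]
result = dict(zip(data, classify_cases(data)))
print(result)
```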

Some box plots include an additional character to represent the mean of the data.

On some box plots a crosshatch is placed on each whisker, before the end of the whisker.

Rarely, box plots can be presented with no whiskers at all. Because of this variability, it is appropriate to describe the convention being used for the whiskers and outliers in the caption for the plot.

As long as outliers are present, good clusters cannot be formed. So, to remove outliers, the "If" condition in Select Cases is used.

Data > Select Cases > If condition is satisfied
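The same filtering idea can be sketched outside SPSS: keep only the cases that satisfy a condition and drop the rest before clustering. The variable name `income`, the cut-off of 100, and the helper `select_cases` are hypothetical and stand in for whatever variable and limit the box plot suggests.

```python
def select_cases(cases, condition):
    """Keep only the cases for which condition(case) is true."""
    return [case for case in cases if condition(case)]


# Hypothetical data set; case 3 has an unusually high income.
cases = [
    {"id": 1, "income": 42},
    {"id": 2, "income": 55},
    {"id": 3, "income": 400},
]

# Analogue of an "If" condition such as income <= 100.
kept = select_cases(cases, lambda case: case["income"] <= 100)
print([case["id"] for case in kept])
```

The filtered cases can then be clustered again to see whether the groups become more balanced.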

Author

Akhil Aggarwal (14126)
