Tuesday, September 4, 2012

Day 2 - Team E


Today's session started with the Retail example in the SPSS tool. We:
  • Went through the values of each variable
  • Discussed the scale type of each variable

From this exercise we gained a practical understanding of the different types of scales and when each is used.
We also covered the following SPSS options using the Retail example:
  • Frequencies (Analyze>Descriptive Statistics>Frequencies)
  • Crosstabs (Analyze>Descriptive Statistics>Crosstabs)
  • Control variable in the Crosstabs section
  • Using the ‘If’ condition in the ‘Select cases’ section

Frequencies
  • This command is used to obtain counts on a single variable's values.
  • The Frequencies command can be used to determine quartiles, percentiles, measures of central tendency (mean, median, and mode), and measures of dispersion (range, standard deviation, variance, minimum and maximum).
  • The output has two columns. The left column names the statistic and the right column gives the value of the statistic.
  • Summarizes information about one variable.

Crosstabs
  • This command is used to obtain counts on more than one variable's values.
  • Crosstabs is designed for discrete variables, usually those measured on nominal or ordinal scales. The procedure is not suitable for continuous variables that assume many values.
  • Crosstabs are usually presented with the independent variable across the top and the dependent along the side.
  • Crosstabs generates information about bivariate relationships.
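
To make the difference concrete, here is a minimal sketch in Python (not SPSS syntax) of what the two commands count. The records and the variable names are made up for illustration and are not taken from the Retail data set.

```python
from collections import Counter

# Hypothetical retail records: (gender, payment_mode). Illustrative data only.
records = [
    ("F", "Card"), ("M", "Cash"), ("F", "Cash"),
    ("M", "Card"), ("F", "Card"), ("M", "Cash"),
]

def frequencies(rows, col):
    """Count how often each value of ONE variable occurs (like Frequencies)."""
    return Counter(row[col] for row in rows)

def crosstab(rows, row_col, col_col):
    """Count each (row value, column value) pair (like a two-way Crosstabs)."""
    return Counter((row[row_col], row[col_col]) for row in rows)

gender_counts = frequencies(records, 0)
gender_by_payment = crosstab(records, 0, 1)

print(gender_counts)      # counts for one variable
print(gender_by_payment)  # counts for every combination of two variables
```

Frequencies gives one count per value, while Crosstabs gives one count per combination of values, which is why it suits variables with only a few categories.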

Control variables in Crosstabs:
To produce cross tabulations with more than two dimensions (rows and columns), you need to specify a layer variable, called the control variable. Use Analyze>Descriptive Statistics>Crosstabs to open the Crosstabs dialog, then fill in the row and column variables as for a bivariate table and add the control variable as the layer.
The control variable (CV) must be categorical, with at least two categories and ideally no more than five. With this procedure you get a separate crosstab for every category of the control variable, including a chi-square and p value for each crosstab. This allows you to isolate the effect of the independent variable (IV) on the dependent variable (DV) for every value of the CV.
The hypotheses are:
H0: There is no relationship between IV and DV, controlling for CV.  Chi-square = 0
H1: There is a relationship between IV and DV, controlling for CV.  Chi-square ≠ 0
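
As a rough illustration of "one crosstab per control category", the sketch below splits some made-up cases by the control variable and computes the Pearson chi-square statistic for each layer. It is plain Python, not SPSS output; the case data and category names are assumptions, and the p value (which needs the chi-square distribution) is left out.

```python
from collections import defaultdict

def chi_square(table):
    """Pearson chi-square statistic for a table given as {(row, col): count}."""
    row_tot, col_tot = defaultdict(int), defaultdict(int)
    for (r, c), n in table.items():
        row_tot[r] += n
        col_tot[c] += n
    total = sum(row_tot.values())
    stat = 0.0
    for r in row_tot:
        for c in col_tot:
            expected = row_tot[r] * col_tot[c] / total
            observed = table.get((r, c), 0)
            stat += (observed - expected) ** 2 / expected
    return stat

def layered_crosstabs(rows):
    """Build one {(iv, dv): count} table per control-variable category."""
    layers = defaultdict(lambda: defaultdict(int))
    for iv, dv, cv in rows:
        layers[cv][(iv, dv)] += 1
    return layers

# Hypothetical cases as (IV, DV, CV) triples; illustrative data only.
cases = (
    [("Promo", "Bought", "Urban")] * 8 + [("Promo", "NotBought", "Urban")] * 2
    + [("NoPromo", "Bought", "Urban")] * 2 + [("NoPromo", "NotBought", "Urban")] * 8
    + [("Promo", "Bought", "Rural")] * 5 + [("Promo", "NotBought", "Rural")] * 5
    + [("NoPromo", "Bought", "Rural")] * 5 + [("NoPromo", "NotBought", "Rural")] * 5
)

chi_by_layer = {cv: chi_square(tbl) for cv, tbl in layered_crosstabs(cases).items()}
print(chi_by_layer)
```

In this made-up data the IV and DV are related in the Urban layer but not in the Rural layer, which is the kind of pattern a control variable is meant to reveal.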

‘If’ condition in the ‘Select cases’ option:
The ‘Select cases’ section can restrict the analysis to the required data in several ways: a random sample of cases, a sample based on a range, a filter variable, or an ‘If’ condition. The condition for ‘If’ is built using the keypad provided in the ‘If’ dialog box.
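
The idea of an ‘If’ condition can be sketched outside SPSS as a simple filter. The snippet below is a Python illustration only; the field names (`age`, `spend`) and the threshold values are hypothetical.

```python
# Hypothetical cases; the field names and values are illustrative only.
cases = [
    {"id": 1, "age": 23, "spend": 120.0},
    {"id": 2, "age": 35, "spend": 80.0},
    {"id": 3, "age": 41, "spend": 150.0},
    {"id": 4, "age": 30, "spend": 101.0},
]

def select_cases(rows, condition):
    """Keep only the rows for which the condition is true."""
    return [row for row in rows if condition(row)]

# Equivalent in spirit to an 'If' condition such as: age >= 30 & spend > 100
selected = select_cases(cases, lambda r: r["age"] >= 30 and r["spend"] > 100)
print([row["id"] for row in selected])  # [3, 4]
```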

Cluster Analysis
Cluster analysis is an exploratory data analysis tool which aims at sorting different objects into groups in a way that the degree of association between two objects is maximal if they belong to the same group and minimal otherwise.
Below are the two types of clustering:
1. Hierarchical clustering (generally used for fewer than 50 objects)
   a. Agglomerative clustering (multiple clusters unite to form a single cluster)
   b. Divisive clustering (a single cluster is divided into multiple clusters)
2. Non-hierarchical clustering (generally used for more than 50 objects)
   a. K-Means clustering
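
K-Means was only named in the session, so the following is a minimal textbook-style sketch in Python rather than what SPSS does internally. The points and the starting centroids are assumptions chosen to keep the example deterministic.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Plain k-means: assign each point to its nearest centroid, then
    recompute each centroid as the mean of its points, until nothing changes."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [
            tuple(sum(v) / len(c) for v in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2-D objects and hand-picked starting centroids (k = 2).
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, [(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

Unlike hierarchical clustering, the number of clusters (here 2) has to be chosen up front.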
Clustering process
Step 1: Selection of variables. Choose them by asking: “What are the objectives of creating clusters?”
Step 2: Distance Measurement – The joining or tree clustering method uses the dissimilarities (similarities) or distances between objects when forming the clusters. Similarities are a set of rules that serve as criteria for grouping or separating items.
There are different ways of measuring distance, for example by checking:
  • Probabilities
  • Correlations
  • Differences
Euclidean distance: This is probably the most commonly chosen type of distance. It simply is the geometric distance in the multidimensional space. It is computed as:
$$\mathrm{distance}(x, y) = \Big\{ \sum_i (x_i - y_i)^2 \Big\}^{1/2}$$
Note that Euclidean (and squared Euclidean) distances are usually computed from raw data, and not from standardized data. This method has certain advantages (e.g., the distance between any two objects is not affected by the addition of new objects to the analysis, which may be outliers).
City-block (Manhattan) distance: This distance is simply the sum of the absolute differences across dimensions. In most cases, this distance measure yields results similar to the simple Euclidean distance. However, note that in this measure, the effect of single large differences (outliers) is dampened (since they are not squared). The city-block distance is computed as:
$$\mathrm{distance}(x, y) = \sum_i |x_i - y_i|$$
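
Both distances are easy to check by hand. The sketch below implements the Euclidean and city-block formulas in Python for two made-up objects; the function names are our own.

```python
import math

def euclidean(x, y):
    """Square root of the sum of squared differences across dimensions."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def city_block(x, y):
    """Sum of absolute differences across dimensions (Manhattan distance)."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

# Two hypothetical objects measured on two variables.
a = (1.0, 2.0)
b = (4.0, 6.0)

print(euclidean(a, b))   # 5.0  (sqrt of 3^2 + 4^2)
print(city_block(a, b))  # 7.0  (3 + 4)
```

Because the differences are not squared, one large difference contributes less to the city-block distance than to the squared terms of the Euclidean distance.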
The most commonly used distance measurements in business are:

Measure | Option
For Interval | Euclidean distance; Block
For Counts | Chi-square measure; Phi-square measure
For Binary | Simple matching; Jaccard
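
For binary data, the two options can be sketched as similarity coefficients. This is a Python illustration using the usual textbook definitions, not the exact SPSS implementation; the example vectors are made up.

```python
def simple_matching(x, y):
    """Share of positions where the two binary vectors agree (1-1 or 0-0)."""
    matches = sum(1 for xi, yi in zip(x, y) if xi == yi)
    return matches / len(x)

def jaccard(x, y):
    """Share of 1-1 matches among positions where at least one vector is 1.
    Joint absences (0-0) are ignored."""
    both = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    either = sum(1 for xi, yi in zip(x, y) if xi == 1 or yi == 1)
    # Assumption: return 0.0 when neither vector has any 1s (conventions vary).
    return both / either if either else 0.0

# Hypothetical binary attributes (1 = bought the item, 0 = did not).
x = [1, 1, 0, 0, 1]
y = [1, 0, 0, 0, 1]

print(simple_matching(x, y))  # 0.8
print(jaccard(x, y))
```

Simple matching counts shared absences as agreement, while Jaccard does not, so Jaccard is usually lower when most attributes are 0.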

Step 3: Clustering criteria - Linkage Rules
At the first step, when each object represents its own cluster, the distances between those objects are defined by the chosen distance measure. However, once several objects have been linked together, how do we determine the distances between those new clusters? In other words, we need a linkage rule to determine when two clusters are sufficiently similar to be linked together. There are numerous linkage rules, but below are the most commonly used:
Single linkage (nearest neighbour): Distance between two clusters is determined by the distance of the two closest objects (nearest neighbours) in the different clusters.
Complete linkage (furthest neighbour): Distances between clusters are determined by the greatest distance between any two objects in the different clusters (i.e., by the "furthest neighbours").
Group centroid: The centroid of a cluster is the average point in the multidimensional space defined by the dimensions. In this method, the distance between two clusters is determined as the distance between their centroids.
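
The three linkage rules can be compared on the same pair of clusters. The following Python sketch assumes Euclidean distance between objects; the cluster contents are made up.

```python
import math
from itertools import product

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def single_linkage(a, b):
    """Distance between the two closest objects in different clusters."""
    return min(euclidean(p, q) for p, q in product(a, b))

def complete_linkage(a, b):
    """Distance between the two furthest objects in different clusters."""
    return max(euclidean(p, q) for p, q in product(a, b))

def centroid(cluster):
    """Average point of a cluster, dimension by dimension."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def centroid_linkage(a, b):
    """Distance between the centroids of the two clusters."""
    return euclidean(centroid(a), centroid(b))

# Two hypothetical clusters of 2-D objects.
cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 0.0), (5.0, 2.0)]

print(single_linkage(cluster_a, cluster_b))    # 3.0
print(complete_linkage(cluster_a, cluster_b))  # sqrt(29), about 5.39
print(centroid_linkage(cluster_a, cluster_b))  # 4.0
```

The same two clusters are 3.0, about 5.39, or 4.0 apart depending on the rule, which is why the choice of linkage can change which clusters get merged first.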

PHA – Proportional Hazard Analysis:
Proportional hazards models are a class of survival models in statistics. Survival models relate the time that passes before some event occurs to one or more covariates that may be associated with that quantity. In a proportional hazards model, the unique effect of a unit increase in a covariate is multiplicative with respect to the hazard rate. For example, taking a drug may halve one's hazard rate for a stroke occurring, or, changing the material from which a manufactured component is constructed may double its hazard rate for failure.
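
The "multiplicative" effect can be shown with a small calculation. This sketch assumes the common form of the model, hazard = baseline hazard × exp(beta × x); the baseline value of 0.02 and the coefficient are made up to reproduce the "drug halves the hazard" example.

```python
import math

def hazard(baseline_hazard, beta, x):
    """Hazard under a proportional hazards model: h0 * exp(beta * x)."""
    return baseline_hazard * math.exp(beta * x)

# Assumption: a coefficient chosen so that taking the drug (x = 1) halves the hazard.
beta_drug = math.log(0.5)

h_untreated = hazard(0.02, beta_drug, 0)  # hypothetical baseline hazard of 0.02
h_treated = hazard(0.02, beta_drug, 1)
hazard_ratio = h_treated / h_untreated

print(h_untreated, h_treated, hazard_ratio)
```

The hazard ratio equals exp(beta) whatever the baseline hazard is, which is what "proportional" refers to.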

Dendrogram: a tree diagram frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering. Dendrograms are often used in computational biology to illustrate the clustering of genes or samples.
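
A dendrogram is essentially a drawing of the merge history of agglomerative clustering. The sketch below records that history in Python, assuming single linkage and Euclidean distance; the points are made up and no plotting is done.

```python
import math

def agglomerate(points):
    """Single-linkage agglomerative clustering.
    Returns the merge history as (cluster_1, cluster_2, distance) tuples,
    where clusters are lists of point indices."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]  # unite the two closest clusters
        del clusters[b]
    return merges

# Four hypothetical objects in two dimensions.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.5)]
merges = agglomerate(points)
for left, right, height in merges:
    print(left, "+", right, "at distance", height)
```

Each merge corresponds to one join in the dendrogram, and the merge distance is the height at which the two branches meet.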
