Today's session started with the Retail example, using the SPSS tool:
- Went through the values of each variable
- Discussed the scale types of every variable
From the above, we gained a practical understanding of the different types of scales and when they are used.
We also covered the following options available in the SPSS tool, using the Retail example:
- Frequencies (Analyze>Descriptive Statistics>Frequencies)
- Crosstabs (Analyze>Descriptive Statistics>Crosstabs)
- Control variable in the Crosstabs section
- Using the ‘If’ condition in the ‘Select Cases’ section
Frequencies
- Used to obtain counts of a single variable's values.
- Can also be used to determine quartiles, percentiles, measures of central tendency (mean, median, and mode), and measures of dispersion (range, standard deviation, variance, minimum, and maximum).
- The output has two columns: the left column names the statistic and the right column gives the value of the statistic.
- Summarizes information about one variable.
Crosstabs
- Used to obtain counts across the values of more than one variable.
- Designed for discrete variables, usually those measured on nominal or ordinal scales; not suitable for continuous variables that take many values.
- Tables are usually presented with the independent variable across the top and the dependent variable along the side.
- Generates information about bivariate relationships.
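For reference outside SPSS, the same two kinds of counts can be produced in a few lines of Python with pandas; the sketch below uses a hypothetical gender/store_type data frame, not the actual Retail file.

import pandas as pd

# Hypothetical retail data; the column names and values are made up for illustration.
retail = pd.DataFrame({
    "gender": ["M", "F", "F", "M", "F", "M"],
    "store_type": ["Outlet", "Mall", "Mall", "Outlet", "Online", "Mall"],
})

# Frequencies: counts of a single variable's values.
print(retail["gender"].value_counts())

# Crosstabs: counts across the values of two variables.
print(pd.crosstab(retail["store_type"], retail["gender"]))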
Control variables in Crosstabs:
To produce cross tabulations with more than two dimensions (rows and columns), you need to specify a layer variable, called a control variable. Use Analyze>Descriptive Statistics>Crosstabs to open the Crosstabs dialog and fill in the row and column variables as for a bivariate table.
The control variable must be categorical, with at least two categories and ideally no more than five. With this procedure you get a separate crosstab for every category of your control variable, including a chi-square and p value for each crosstab. This allows you to isolate the effect of the IV on the DV for every value of the CV.
The hypotheses are:
H0: There is no relationship between IV and DV, controlling for CV (chi-square = 0).
H1: There is a relationship between IV and DV, controlling for CV (chi-square ≠ 0).
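As a rough illustration of what a layered crosstab does, the Python sketch below (hypothetical promotion/purchased/region columns, scipy assumed available) builds one sub-table, with its own chi-square test, per category of the control variable.

import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: IV = promotion, DV = purchased, CV (control/layer) = region.
df = pd.DataFrame({
    "promotion": ["Yes", "No", "Yes", "No", "Yes", "No", "Yes", "No"],
    "purchased": ["Yes", "No", "Yes", "Yes", "No", "No", "Yes", "No"],
    "region":    ["North", "North", "North", "North", "South", "South", "South", "South"],
})

# One crosstab and one chi-square test per category of the control variable.
for level, part in df.groupby("region"):
    table = pd.crosstab(part["purchased"], part["promotion"])
    chi2, p, dof, expected = chi2_contingency(table)
    print(f"region = {level}")
    print(table)
    print(f"chi-square = {chi2:.3f}, p = {p:.3f}")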
‘If’ condition in the ‘Select Cases’ option:
In the ‘Select Cases’ section, the required data can be fetched using a random sample of cases, a sample based on a range, a filter variable, or a condition. The condition for ‘If’ needs to be scripted using the keypad made available in the ‘If’ dialog box.
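Conceptually, the ‘If’ selection is just a logical condition on the cases; a minimal pandas sketch with invented column names:

import pandas as pd

# Hypothetical cases; column names are invented for illustration.
df = pd.DataFrame({
    "age": [23, 35, 41, 29, 52],
    "amount": [120.0, 80.5, 210.0, 95.0, 310.0],
})

# Keep only the cases that satisfy the condition, similar to Select Cases > If.
selected = df[(df["age"] >= 30) & (df["amount"] > 100)]
print(selected)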
Cluster Analysis
Cluster analysis is an exploratory data analysis tool which
aims at sorting different objects into groups in a way that the degree of
association between two objects is maximal if they belong to the same group and
minimal otherwise.
Below are the two types of clustering:
1. Hierarchical clustering – generally used for fewer than 50 objects
   a. Agglomerative clustering (multiple clusters unite to form a single cluster)
   b. Divisive clustering (a single cluster is divided into multiple clusters)
2. Non-hierarchical clustering – generally used for more than 50 objects
   a. K-means clustering
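To make the two families concrete, here is a small Python sketch (scipy and scikit-learn assumed available; the data points are invented) that runs an agglomerative hierarchical clustering and a K-means clustering on the same observations.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

# Hypothetical two-dimensional observations (e.g. standardised spend and visit frequency).
X = np.array([[1.0, 2.0], [1.2, 1.8], [5.0, 8.0], [5.2, 7.9], [9.0, 1.0], [8.8, 1.2]])

# Hierarchical (agglomerative) clustering: build a tree, then cut it into 3 groups.
tree = linkage(X, method="average", metric="euclidean")
hier_labels = fcluster(tree, t=3, criterion="maxclust")

# Non-hierarchical clustering: K-means with K = 3.
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(hier_labels)
print(km_labels)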
Clustering process
Step 1: Selection of variables – guided by the question "What are the objectives of creating clusters?"
Step 2: Distance measurement – The joining or tree clustering method uses the dissimilarities (similarities) or distances between objects when forming the clusters. Similarities are a set of rules that serve as criteria for grouping or separating items.
There are different ways of measuring distances, for example based on:
- Probabilities
- Correlations
- Differences, etc.
Euclidean distance: This is probably the most commonly chosen type of distance. It is simply the geometric distance in the multidimensional space. It is computed as:
distance(x, y) = { Σ_i (x_i − y_i)² }^½
Note that Euclidean (and squared Euclidean) distances are usually computed from raw data, and not from standardized data. This method has certain advantages (e.g., the distance between any two objects is not affected by the addition of new objects to the analysis, which may be outliers).
City-block (Manhattan) distance: This distance is simply the sum of the absolute differences across dimensions. In most cases, this distance measure yields results similar to the simple Euclidean distance. However, note that in this measure, the effect of single large differences (outliers) is dampened (since they are not squared). The city-block distance is computed as:
distance(x, y) = Σ_i |x_i − y_i|
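Both formulas are easy to verify numerically; a short Python check (scipy assumed available, with made-up vectors):

import numpy as np
from scipy.spatial import distance

x = np.array([2.0, 4.0, 6.0])
y = np.array([1.0, 1.0, 2.0])

# Euclidean distance: square root of the sum of squared differences.
print(distance.euclidean(x, y))        # sqrt(1 + 9 + 16) = sqrt(26) ≈ 5.099
print(np.sqrt(np.sum((x - y) ** 2)))   # same value, computed directly from the formula

# City-block (Manhattan) distance: sum of absolute differences.
print(distance.cityblock(x, y))        # 1 + 3 + 4 = 8
print(np.sum(np.abs(x - y)))           # same value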

Most commonly used distance measures in business are:
- For interval data: Euclidean distance; block (city-block/Manhattan)
- For counts: chi-square measure; phi-square measure
- For binary data: simple matching; Jaccard
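For binary attributes, simple matching and Jaccard count agreement differently (Jaccard ignores joint absences). A small sketch, assuming scipy and two invented binary profiles:

import numpy as np
from scipy.spatial import distance

# Two hypothetical binary profiles (1 = attribute present, 0 = absent).
a = np.array([1, 1, 0, 0, 1])
b = np.array([1, 0, 0, 1, 1])

# Simple matching similarity: share of positions that agree (a joint 0 counts as a match).
simple_matching = 1 - distance.hamming(a, b)   # 3 matches out of 5 positions = 0.6

# Jaccard similarity: ignores positions where both objects are 0.
jaccard = 1 - distance.jaccard(a, b)           # 2 shared 1s out of 4 non-(0,0) positions = 0.5

print(simple_matching, jaccard)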
Step 3: Clustering criteria – linkage rules
At the first step, when each object represents its own cluster, the distances between those objects are defined by the chosen distance measure. However, once several objects have been linked together, how do we determine the distances between those new clusters? In other words, we need a linkage rule to determine when two clusters are sufficiently similar to be linked together. There are numerous linkage rules, but below are the most commonly used:
Single linkage (nearest neighbour): Distance between two clusters is
determined by the distance of the two closest objects (nearest neighbours) in
the different clusters.
Complete linkage (furthest neighbour): Distances
between clusters are determined by the greatest distance between any two
objects in the different clusters (i.e., by the "furthest neighbours").
Group centroid: The centroid of a cluster is the average point in the multidimensional space defined by the dimensions. In this method, the distance between two clusters is determined as the distance between their centroids.
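In Python's scipy these rules correspond to the method argument of linkage(); a brief sketch applying the three rules above to the same invented data:

import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical observations.
X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0], [5.5, 4.8], [9.0, 0.5]])

# The same data joined under three different linkage rules.
single_link   = linkage(X, method="single")    # nearest neighbour
complete_link = linkage(X, method="complete")  # furthest neighbour
centroid_link = linkage(X, method="centroid")  # distance between cluster centroids

print(single_link)
print(complete_link)
print(centroid_link)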
PHA – Proportional Hazards Analysis:
Proportional hazards models are a class of survival models
in statistics. Survival models relate the time that passes before some event
occurs to one or more covariates that may be associated with that quantity. In
a proportional hazards model, the unique effect of a unit increase in a
covariate is multiplicative with respect to the hazard rate. For example,
taking a drug may halve one's hazard rate for a stroke occurring, or, changing the
material from which a manufactured component is constructed may double its
hazard rate for failure.
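A proportional hazards model can be fitted in Python with the lifelines package (an illustration only; the session itself used SPSS). A minimal sketch on the Rossi recidivism data bundled with lifelines:

from lifelines import CoxPHFitter
from lifelines.datasets import load_rossi

# Time-to-arrest data bundled with lifelines: 'week' is the duration, 'arrest' the event flag.
rossi = load_rossi()

cph = CoxPHFitter()
cph.fit(rossi, duration_col="week", event_col="arrest")

# In the summary, exp(coef) is the hazard ratio: a value of 0.5 for a covariate would mean
# a one-unit increase halves the hazard, holding the other covariates fixed.
cph.print_summary()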
Dendrogram: A tree diagram frequently used to
illustrate the arrangement of the clusters produced by hierarchical clustering.
Dendrograms are often used in computational biology to illustrate the
clustering of genes or samples.
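A dendrogram can be drawn directly from a hierarchical clustering result; a short Python sketch with scipy and matplotlib (invented data):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical observations.
X = np.array([[1.0, 1.0], [1.4, 1.1], [5.0, 5.2], [5.3, 5.0], [9.0, 0.8]])

Z = linkage(X, method="average")  # hierarchical clustering result (the tree)
dendrogram(Z)                     # tree diagram showing how the clusters merge
plt.show()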