Jai Mohan Singh Sood
It was the second day of the marketing BA Workshop, and I have already developed a liking for the subject! With all the practicality involved and minimal focus on actual theory, I look forward to all the cross tabulations and chi-square tests!
So let’s have a look at what we learnt today!
In the 1st lecture, we began where we left off yesterday. We opened the retail-format example and started studying all the variables and the values associated with them. The following areas were covered:
- Cross tabulation and chi-square testing
- The ‘If’ condition
- Frequencies
- Cluster Analysis
Cross tabulation and chi-square testing:
Cross tabulation is used to obtain counts on more than one variable's values and generates information about bivariate relationships. The procedure is designed for discrete variables, i.e., nominal or ordinal scales, and is not suitable for continuous variables that take many values. Cross tabulations are presented with the independent variable across the top and the dependent variable along the side.
Chi-square tests are used to determine whether a relationship exists between any two or more variables in consideration. For our testing, we use Pearson's chi-squared test to check for possible relationships. We use a 95% confidence level, which means a 5% level of significance. Hence, in our tests, we reject the null hypothesis (H0) if the significance (p-value) of the chi-square test comes out to be less than 0.05.
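Here is a minimal sketch of the same workflow in Python rather than the point-and-click tool we used in class; the tiny 'retail' dataset and the 'gender'/'format' columns are made up purely for illustration.

```python
# A hedged sketch: cross tabulation + Pearson chi-square in Python.
# The 'gender' and 'format' columns and these eight rows are invented;
# they are NOT the workshop's retail data.
import pandas as pd
from scipy.stats import chi2_contingency

retail = pd.DataFrame({
    "gender": ["M", "F", "F", "M", "F", "M", "F", "M"],
    "format": ["Hypermarket", "Kirana", "Hypermarket", "Kirana",
               "Hypermarket", "Hypermarket", "Kirana", "Kirana"],
})

# Cross tabulation: independent variable across the top,
# dependent variable along the side
table = pd.crosstab(retail["format"], retail["gender"])
print(table)

# Pearson's chi-squared test of independence
chi2, p_value, dof, expected = chi2_contingency(table)
print("chi-square =", chi2, " p-value =", p_value)

# 95% confidence level: reject H0 (no relationship) if p-value < 0.05
if p_value < 0.05:
    print("Reject H0: the two variables appear to be related")
else:
    print("Fail to reject H0")
```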
The ‘If’ condition
‘If’ statements are used to transform existing variables into new variables. For example, based on individuals’ scores on a ‘Masculinity and Femininity’ scale, you might want to create a new variable that contains 4 categories (Masculine, Feminine, Androgynous, Undifferentiated). A similar grouping can be done for age, say 18-24, 25-45, 45+, etc.
The new variable created appears at the end of the list of variables. Now we need to assign value labels for the new variable so as to remember how each group was defined. Click on the new variable; directions for how to assign value labels can be found under the instructions for ‘Defining Variables’.
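As a rough Python equivalent of this kind of recoding (the 'age' values and cut-points below are hypothetical assumptions, not the class data):

```python
# A hedged sketch of the same recoding idea in Python; the 'age' values
# and the cut-points are hypothetical, not the workshop data.
import pandas as pd

df = pd.DataFrame({"age": [19, 23, 31, 44, 52, 67]})

# Derive a new categorical variable from an existing one; as in class,
# the new column appears at the end of the variable list
df["age_group"] = pd.cut(
    df["age"],
    bins=[18, 24, 45, 120],              # 18-24, 25-45, 45+
    labels=["18-24", "25-45", "45+"],    # value labels for each group
    include_lowest=True,
)
print(df)
```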
Frequencies
This command is used to obtain counts on a single variable's values and generates information about a single variable. The frequencies command can be used to determine quartiles, percentiles, measures of central tendency (mean, median, and mode), and measures of dispersion (range, standard deviation, variance, minimum and maximum). The output has two columns: the left column names the statistic and the right column gives its value.
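A minimal sketch of the same idea in Python; the satisfaction scores below are made up for illustration.

```python
# A hedged sketch of the Frequencies idea in Python; the scores are invented.
import pandas as pd

scores = pd.Series([3, 4, 4, 5, 2, 4, 3, 5, 4, 1], name="satisfaction")

print(scores.value_counts())     # counts for each value of the variable
print(scores.describe())         # mean, std, quartiles, min, max
print("median:", scores.median())
print("mode:", scores.mode().iloc[0])
print("variance:", scores.var())
print("range:", scores.max() - scores.min())
```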
Cluster Analysis
This was started in the 2nd lecture. Cluster analysis is defined as the statistical method of partitioning a sample into homogeneous classes to produce an operational classification. It is the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other than to those in other clusters.
2 major cluster models were discussed in class:
- Hierarchical Clustering: It is based on the core idea of objects being more related to nearby objects than to objects farther away. As such, these algorithms connect "objects" to form "clusters" based on their distance. A cluster can be described largely by the maximum distance needed to connect its parts. At different distances, different clusters will form, and this can be represented using a dendrogram; that is where the common name "hierarchical clustering" comes from: these algorithms do not provide a single partitioning of the data set, but instead provide an extensive hierarchy of clusters that merge with each other at certain distances. In a dendrogram, the y-axis marks the distance at which the clusters merge, while the objects are placed along the x-axis such that the clusters don't mix. This model is generally used for < 50 objects.
Hierarchical clustering is of 2 types:
- Divisive: This is a "top down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
- Agglomerative: This is a "bottom up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
- Non-Hierarchical Clustering: Here, clusters are represented by a central vector, which may not necessarily be a member of the data set. When the number of clusters is fixed to k, this becomes k-means clustering, which gives a formal definition as an optimization problem: find the k cluster centers and assign each object to the nearest center, such that the squared distances from the objects to their cluster centers are minimized. K-means clustering thus aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which results in a partitioning of the data space into Voronoi cells. This method is generally used for > 50 objects. (A small k-means sketch follows this list.)
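A minimal k-means sketch in Python with scikit-learn; the two-feature points and the choice of k = 3 are invented for illustration.

```python
# A hedged k-means sketch with scikit-learn; the points and k = 3 are
# illustrative assumptions, not the class example.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([
    [1.0, 2.0], [1.2, 1.8], [0.8, 2.1],   # one tight group
    [8.0, 8.5], [8.2, 8.0], [7.9, 8.3],   # another group
    [4.5, 0.5], [4.8, 0.7], [4.4, 0.4],   # a third group
])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment for each observation
print(km.cluster_centers_)  # the central vectors (cluster means)
print(km.inertia_)          # sum of squared distances to the nearest center
```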
Clustering Process: It consists of 3 stages:
- Selection of Variables: This is done on an intuitive basis, simply by understanding and estimating what the objective of creating clusters is.
- Distance Measurement: An important component of a clustering algorithm is the distance measure between data points. If the components of the data instance vectors are all in the same physical units then it is possible that the simple Euclidean distance metric is sufficient to successfully group similar data instances. However, even in this case the Euclidean distance can sometimes be misleading.
Most commonly used distance measurements in business are:
- Interval data: Euclidean distance; Block
- Count data: Chi-square and Phi-square measures
- Binary data: Simple matching; Jaccard
- Clustering Criteria: In the first step, where each object represents its own cluster, the distances between those objects are defined by the chosen distance measure. However, once several objects have been linked together, we need a linkage rule to determine when two clusters are sufficiently similar to be linked together. There are numerous linkage rules, but the most commonly used are listed below:
- Single linkage (nearest neighbour): Distance between two clusters is determined by the distance of the two closest objects (nearest neighbours) in the different clusters.
- Complete linkage (furthest neighbour): Distances between clusters are determined by the greatest distance between any two objects in the different clusters (i.e., by the ‘furthest neighbours’).
- Group centroid: The centroid of a cluster is the average point in the multidimensional space defined by the dimensions. In this method, the distance between two clusters is determined as the distance between their centroids. (A small sketch comparing these linkage rules follows this list.)
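Here is a rough SciPy sketch of the Euclidean distance measure and the three linkage rules above; the six sample points are invented.

```python
# A hedged sketch of the Euclidean distance measure and three linkage rules
# with SciPy; the six points are made up for illustration.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])

print(pdist(X, metric="euclidean"))   # pairwise Euclidean distances

# 'single' = nearest neighbour, 'complete' = furthest neighbour,
# 'centroid' = group centroid
for method in ["single", "complete", "centroid"]:
    Z = linkage(X, method=method, metric="euclidean")
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 clusters
    print(method, labels)
```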
Towards the end of the lecture, we were given a brief explanation of the dendrogram. It is a tree diagram frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering. Dendrograms are often used in computational biology to illustrate the clustering of genes or samples.
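A minimal sketch of agglomerative clustering with a dendrogram in Python (SciPy and matplotlib); the data points and the choice of complete linkage are assumptions for illustration, not the class example.

```python
# A hedged sketch: agglomerative (hierarchical) clustering plus a dendrogram,
# using SciPy and matplotlib; the points and complete linkage are assumptions.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])

Z = linkage(X, method="complete", metric="euclidean")

dendrogram(Z)                  # y-axis: distance at which clusters merge
plt.xlabel("objects")
plt.ylabel("merge distance")
plt.show()
```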
Phew! Quite exhaustive to say the least! Looking forward to tomorrow!