Wednesday, September 5, 2012

Day 2 - Team E (2)


 Jai Mohan Singh Sood

It was the second day of the marketing BA Workshop, and already I’ve developed a liking for the subject! With all the practicality involved, and minimal focus on actual theory, I look forward to all the cross tabulations and the chi-square tests!

So let’s have a look at what all we learnt today!

In the 1st lecture, we began where we left off yesterday. We opened the retail format example and started to study all the variables and the values associated with them. The following areas were covered:
  • Cross tabulation and chi-square testing
  • The ‘If’ condition
  • Frequencies
  • Cluster Analysis

Cross tabulation and chi-square testing:
Cross tabulation is used to obtain counts on more than one variable’s values and generates information about bivariate relationships. This function is designed for discrete variables, i.e. nominal or ordinal scales. The procedure is not suitable for continuous variables that assume many values. Cross tabulations are presented with the independent variable across the top and the dependent variable along the side.
Chi-square tests are used to determine whether an association exists between the two variables in the cross tabulation. For our testing, we use Pearson’s chi-squared test to determine possible relations. We use a 95% confidence level, which means a 5% level of significance. Hence in our tests, we reject the null hypothesis (Ho) if the significance (p-value) of the chi-square statistic comes out to be less than 0.05.
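To make this concrete, here is a small Python sketch (my own illustration, not something we did in class) of the same crosstab-plus-chi-square check; the retail_format and age_group columns are made-up stand-ins for the survey data:

# Hypothetical data: cross-tabulate two categorical variables and run
# Pearson's chi-square test of independence at the 5% significance level.
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({
    "age_group":     ["18-24", "25-44", "45+"] * 40,
    "retail_format": ["kirana", "supermarket", "hypermarket",
                      "kirana", "supermarket", "kirana"] * 20,
})

# Cross tabulation: independent variable across the top, dependent along the side
table = pd.crosstab(df["retail_format"], df["age_group"])
print(table)

# Reject H0 (no association) if the p-value comes out below 0.05
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, p-value = {p_value:.4f}, dof = {dof}")
if p_value < 0.05:
    print("Reject H0: the variables appear to be associated.")
else:
    print("Fail to reject H0: no evidence of an association.")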

The ‘If’ condition
‘If’ statements are used to transform existing variables into new variables. For example, based on individuals’ scores on a ‘Masculinity and Femininity’ scale, you might want to create a new variable that contains 4 categories (Masculine, Feminine, Androgynous, Undifferentiated). A similar grouping of age groups can be done – say 18-24; 25-44; 45+ etc.
The new variable created appears at the end of the list of variables. Now we need to assign value labels for the new variable so as to remember how each group was defined. Click on the new variable. Directions for how to assign value labels can be found under the instructions for ‘Defining Variables’.
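Roughly the same recode can be sketched in Python if you want to see it outside SPSS; the age column and the band cut-offs below are just assumed for illustration:

# Hypothetical illustration of the SPSS 'If' / recode idea: derive a new
# grouped variable (age_band) from an existing numeric one (age).
import pandas as pd

df = pd.DataFrame({"age": [19, 23, 31, 44, 52, 67]})  # made-up respondent ages

# Bins are right-inclusive: (17, 24] -> "18-24", (24, 44] -> "25-44", (44, 120] -> "45+"
df["age_band"] = pd.cut(df["age"], bins=[17, 24, 44, 120],
                        labels=["18-24", "25-44", "45+"])

print(df)  # the new variable appears as an extra column, carrying its value labels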

Frequencies
This command is used to obtain counts on a single variable's values and generates information about a single variable. The Frequencies command can be used to determine quartiles, percentiles, measures of central tendency (mean, median, and mode), and measures of dispersion (range, standard deviation, variance, minimum and maximum). The output has two columns: the left column names the statistic and the right column gives the value of the statistic.
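Again as a rough outside-of-SPSS sketch, the same single-variable summary can be pulled in Python; the monthly_spend figures here are invented:

# Hypothetical illustration of what the Frequencies command reports for one variable.
import pandas as pd

monthly_spend = pd.Series([200, 350, 350, 500, 750, 900, 1200])  # made-up values

print(monthly_spend.value_counts().sort_index())       # frequency counts
print("mean   :", monthly_spend.mean())
print("median :", monthly_spend.median())
print("mode   :", monthly_spend.mode().iloc[0])
print("std dev:", monthly_spend.std())
print("range  :", monthly_spend.max() - monthly_spend.min())
print(monthly_spend.quantile([0.25, 0.5, 0.75]))        # quartiles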

Cluster Analysis
This was started in the 2nd lecture. It is defined as the statistical method of partitioning a sample into homogeneous classes to produce an operational classification. It is the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other than to those in other clusters.
2 major cluster models were discussed in class:

  • Hierarchical Clustering: It is based on the core idea of objects being more related to nearby objects than to objects farther away. As such, these algorithms connect "objects" to form "clusters" based on their distance. A cluster can be described largely by the maximum distance needed to connect its parts. At different distances, different clusters will form, and this can be represented using a dendrogram, which is where the common name "hierarchical clustering" comes from: these algorithms do not provide a single partitioning of the data set, but instead provide an extensive hierarchy of clusters that merge with each other at certain distances. In a dendrogram, the y-axis marks the distance at which the clusters merge, while the objects are placed along the x-axis such that the clusters don't mix. This model is generally used for fewer than 50 objects (a small sketch in Python follows the list below).

Hierarchical clustering is of 2 types:
o   Divisive: This is a "top down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
o   Agglomerative: This is a "bottom up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
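Here is the small Python sketch promised above: agglomerative (bottom-up) hierarchical clustering on a tiny made-up dataset, ending with the dendrogram. The choice of complete linkage and Euclidean distance is just one reasonable combination, not the only one:

# Hypothetical sketch: agglomerative hierarchical clustering with scipy.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (10, 2)),    # two loose groups of 10 points each
               rng.normal(5, 1, (10, 2))])

Z = linkage(X, method="complete", metric="euclidean")  # furthest-neighbour linkage

labels = fcluster(Z, t=2, criterion="maxclust")  # cut the hierarchy into 2 clusters
print(labels)

dendrogram(Z)   # y-axis: distance at which clusters merge; x-axis: the objects
plt.show()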

  • Non-Hierarchical Clustering: In this, clusters are represented by a central vector, which may not necessarily be a member of the data set. When the number of clusters is fixed to k, k-means clustering gives a formal definition as an optimization problem: find the k cluster centers and assign the objects to the nearest cluster center, such that the squared distances from the cluster centers are minimized. This method is generally used for more than 50 objects.
This method is also called K-means clustering. It is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. This results in a partitioning of the data space into Voronoi cells.
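A quick k-means sketch in Python, again on invented data, just to see the centroids and the within-cluster sum of squares that the method minimises (k = 3 here is an assumption, not something derived from the data):

# Hypothetical sketch: non-hierarchical (k-means) clustering with scikit-learn.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal((0, 0), 1, (60, 2)),   # three made-up blobs of points
               rng.normal((6, 0), 1, (60, 2)),
               rng.normal((0, 6), 1, (60, 2))])

km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)

print(km.cluster_centers_)   # the central vectors (centroids)
print(km.inertia_)           # total squared distance of objects to their centroids
print(km.labels_[:10])       # cluster assignment of the first few observations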


Clustering Process: It consists of 3 stages:
  • Selection of Variables: This is done on an intuitive basis, simply by understanding and estimating what the objective of creating clusters is.
  • Distance Measurement: An important component of a clustering algorithm is the distance measure between data points. If the components of the data instance vectors are all in the same physical units then it is possible that the simple Euclidean distance metric is sufficient to successfully group similar data instances. However, even in this case the Euclidean distance can sometimes be misleading.

The most commonly used distance measures in business are:

  • Interval data: Euclidean distance; Block (city-block)
  • Count data: Chi-square measure; Phi-square measure
  • Binary data: Simple matching; Jaccard

  • Clustering Criteria: In the first step, where each object represents its own cluster, the distances between those objects are defined by the chosen distance measure. However, once several objects have been linked together, we need a linkage rule to determine when two clusters are sufficiently similar to be linked together. There are numerous linkage rules, but the most commonly used are below (a short scipy sketch of these distance measures and linkage rules follows this list):
    • Single linkage (nearest neighbour): Distance between two clusters is determined by the distance of the two closest objects (nearest neighbours) in the different clusters.
    • Complete linkage (furthest neighbour): Distances between clusters are determined by the greatest distance between any two objects in the different clusters (i.e., by the ‘furthest neighbours’).
    • Group centroid: The centroid of a cluster is the average point in the multidimensional space defined by the dimensions. In this method, the distance between two clusters is determined as the distance between their centroids.
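To tie the distance measures and linkage rules together, here is one more Python sketch on a toy data matrix (the numbers are arbitrary, and simple matching is shown via the Hamming distance, which is 1 minus the simple matching coefficient):

# Hypothetical sketch: distance measures and linkage rules with scipy.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [8.0, 9.0],
              [9.0, 8.0]])

print(pdist(X, metric="euclidean"))   # interval data: Euclidean distance
print(pdist(X, metric="cityblock"))   # interval data: Block (city-block) distance

B = np.array([[1, 0, 1, 1],           # binary (yes/no) data
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
print(pdist(B, metric="jaccard"))     # binary data: Jaccard distance
print(pdist(B, metric="hamming"))     # binary data: 1 - simple matching coefficient

# Linkage rules for merging clusters, applied to the interval data
print(linkage(X, method="single"))    # nearest neighbour
print(linkage(X, method="complete"))  # furthest neighbour
print(linkage(X, method="centroid"))  # group centroid (requires Euclidean distance)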


Towards the end of the lecture, we were given a brief explanation of the dendrogram. It is a tree diagram frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering. Dendrograms are often used in computational biology to illustrate the clustering of genes or samples.


Phew! Quite exhaustive to say the least! Looking forward to tomorrow!

