Tuesday, September 4, 2012

Day 2 - Team B



Continuing from yesterday's class, we began today by brushing up on the concepts of scale, ordinal values and nominal values. We also looked at the types of scale, such as:
Interval scale (a Likert scale is commonly treated as one): This is used when the researcher knows the exact distance between two points on the scale. For example, a person scoring 70 marks is equidistant from a person scoring 60 marks (on the left side) and a person scoring 80 marks (on the right side) on the scale.
We then continued with first-level analysis, which is done with the help of two tools:

1.       Frequency: It is used to look at detailed information on nominal data and to describe the results. The Frequencies options include a table showing counts and percentages; statistics including percentile values, central tendency, dispersion and distribution; and charts including bar charts and histograms.
Choosing the Frequencies procedure: From the "Analyse" menu, highlight "Descriptive Statistics", then move to the submenu and click on "Frequencies".

2.       Cross Tab: It helps us identify the relationship between two variables and understand the direction of that relationship. For example, in the retail exercise that we did in class, we examined the relationship between the store visited and the customers' satisfaction with the service.
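
As a rough illustration of what these two tools compute, here is a minimal Python sketch. The records in `responses` and the helper names `frequency_table` and `cross_tab` are made up for this example; they are not the class data or the menu procedure described above.

```python
from collections import Counter

# Hypothetical survey records: (store, satisfaction). Not the class data.
responses = [
    ("Store 1", "Satisfied"), ("Store 1", "Dissatisfied"),
    ("Store 1", "Satisfied"), ("Store 2", "Dissatisfied"),
    ("Store 2", "Dissatisfied"), ("Store 2", "Satisfied"),
    ("Store 3", "Satisfied"), ("Store 3", "Satisfied"),
]

def frequency_table(values):
    """Return {value: (count, percent)} for one nominal variable."""
    counts = Counter(values)
    total = sum(counts.values())
    return {v: (n, 100.0 * n / total) for v, n in counts.items()}

def cross_tab(pairs):
    """Return {row: Counter({column: count})} for two nominal variables."""
    table = {}
    for row, col in pairs:
        table.setdefault(row, Counter())[col] += 1
    return table

store_freq = frequency_table(store for store, _ in responses)
table = cross_tab(responses)

# Frequency table: counts and percentages per store.
for store, (n, pct) in sorted(store_freq.items()):
    print(f"{store}: n={n} ({pct:.1f}%)")

# Cross tab: satisfaction counts within each store.
for store in sorted(table):
    print(store, dict(table[store]))
```

The frequency table summarises one variable at a time, while the cross tab counts the combinations of two variables, which is what lets us compare satisfaction across stores.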

In today's retail example, we had the following hypothesis:
H0 - Whichever store the customer visits, he will get the same level of service.
a.       We rejected H0 because the significance value given by the Chi-square test was below 0.05.
We found that the level of dissatisfaction was very high in store 2 (41.2%) and very low in store 3 (44.9%).
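
The Chi-square decision above can be sketched in a few lines of Python. The counts in `observed` are hypothetical (they are not the class data), and instead of computing a significance value the sketch compares the statistic with 5.991, the standard critical value for 2 degrees of freedom at the 0.05 level, which is an equivalent way of making the same decision.

```python
def chi_square_statistic(observed):
    """Pearson chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Count expected in this cell if store and satisfaction were unrelated.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (count - expected) ** 2 / expected
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, dof

# Hypothetical counts. Rows: stores 1 to 3; columns: satisfied, dissatisfied.
observed = [
    [30, 20],
    [20, 30],
    [40, 10],
]
stat, dof = chi_square_statistic(observed)

# Critical value of the chi-square distribution for df = 2 at alpha = 0.05.
CRITICAL_DF2_ALPHA_05 = 5.991
reject_h0 = stat > CRITICAL_DF2_ALPHA_05
print(f"chi-square = {stat:.2f}, df = {dof}, reject H0: {reject_h0}")
```

A statistic above the critical value corresponds to a significance value below 0.05, so H0 (the same level of service in every store) would be rejected for these made-up counts.
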
We also learned two reasons why statistics can go wrong:
1.       The relationship in the data is spurious.
2.       Effects present in the data cancel each other out.
By introducing a third factor through partial correlation, we found that the dissatisfaction was mainly due to problems in the customers' contact with employees.
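
To show how a third factor can change the picture, here is a small Python sketch of the standard first-order partial correlation formula. The three correlation values are hypothetical and chosen only for illustration; they do not come from the retail data.

```python
import math

def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation between x and y after controlling for a third variable z."""
    numerator = r_xy - r_xz * r_yz
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    return numerator / denominator

# Hypothetical pairwise correlations: x and y look related (0.5),
# but both are also related to the third factor z.
r = partial_correlation(r_xy=0.5, r_xz=0.6, r_yz=0.7)
print(f"partial correlation of x and y given z: {r:.2f}")
```

With these made-up values the correlation between x and y falls from 0.5 to roughly 0.14 once z is held constant, which is the kind of pattern that suggests the original relationship was largely spurious.
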
The valid recommendations given were:
1.       Provide training to the employees.
2.       Observe and follow the best practices of the respective stores.
3.       Shift the best-trained employees to the relevant store to increase satisfaction among the customers.
After this exercise, we moved on to second-level analysis, which is also called Cluster Analysis. There are two common ways of doing it.
1.       Hierarchical Clustering
a.       This method is used when there are fewer than 50 objects.
b.      Divisive Clustering: It is also known as top-down clustering. We start at the top with all documents in one cluster. The cluster is split using a flat clustering algorithm. This procedure is applied recursively until each document is in its own singleton cluster. There is evidence that divisive algorithms produce more accurate hierarchies than bottom-up algorithms in some circumstances. Top-down clustering benefits from complete information about the global distribution when making top-level partitioning decisions.
c.       Agglomerative Clustering: It is also known as bottom-up clustering. Bottom-up algorithms treat each document as a singleton cluster at the outset and then successively merge (or agglomerate) pairs of clusters until all clusters have been merged into a single cluster that contains all documents.
d.      The clustering process is done in four steps.
i.      Selection of variables
ii.      Distance measurement: Here, we measure the distance between two objects. For example, if five students are compared on their marks in five subjects, we can cluster them based on the average of the marks scored, or on the basis of the correlation between the marks (if a student who scores high marks in Maths also scores high marks in Statistics, we can say that the two are correlated). There are three types of distance measurement.
1.       Interval – There are 2 types of interval measurement tools.
a.       Euclidean distance – It is the "ordinary" distance between two points that one would measure with a ruler, and is given by the Pythagorean formula.
b.      Block distance (also called city-block or Manhattan distance) – For example, if we need to measure the distance between two buildings, rather than using the Pythagorean formula, we can measure the practical distance (the path actually taken) between them.
2.       Count – This can be determined with the help of the Chi-square measure.
3.       Binary – Out of the measures available, two are very important:
a.       Jaccard - It is a statistic used for comparing the similarity and diversity of sample sets.
b.      Simple Matching
iii.      Clustering criteria: We measure the distance from one cluster to another cluster, and from a cluster to an individual object.
iv.      Dendrogram: It is the pictorial representation of how the clustering happens. It is a visual representation of the spot correlation data. The individual spots are arranged along the bottom of the dendrogram and referred to as leaf nodes. Spot clusters are formed by joining individual spots or existing spot clusters, with the join point referred to as a node. This can be seen in the diagram above. At each dendrogram node we have a right and a left sub-branch of clustered spots. In the following discussion, spot clusters can refer to a single spot or a group of spots. The vertical axis is labelled distance and refers to a distance measure between spots or spot clusters. The height of the node can be thought of as the distance value between the right and left sub-branch clusters.
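
The distance measures and the bottom-up merging described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the marks are hypothetical, the function names are made up for this example, and single linkage (the closest pair of members decides the distance between two clusters) is just one possible clustering criterion, not necessarily the one used in class.

```python
import math

def euclidean(a, b):
    """Straight-line distance given by the Pythagorean formula."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def block(a, b):
    """Block distance: sum of the absolute differences along each axis."""
    return sum(abs(x - y) for x, y in zip(a, b))

def jaccard_similarity(a, b):
    """Binary data: shared 1s divided by positions where either value is 1."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 1.0

def simple_matching(a, b):
    """Binary data: share of positions where the two values agree (1-1 or 0-0)."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)

def agglomerate(points, distance):
    """Bottom-up clustering with single linkage.

    Returns the merges as (left cluster, right cluster, height); the heights
    are the values a dendrogram would show at each node.
    """
    clusters = [(i,) for i in range(len(points))]  # every object starts alone
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(distance(points[p], points[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Hypothetical marks of five students in two subjects.
marks = [(60, 62), (61, 60), (80, 85), (82, 83), (40, 45)]
merges = agglomerate(marks, euclidean)
for left, right, height in merges:
    print(left, "+", right, "at distance", round(height, 2))
```

Each printed line corresponds to one node of the dendrogram: the two clusters being joined are the left and right sub-branches, and the distance is the height of the node. Passing `block` instead of `euclidean` changes only how the distance between two students is measured.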
