Continuing from yesterday's class, we began today by brushing up on
the concepts of scale, ordinal value and nominal value.
We also went over the types of scale, such as:
Interval scale (Likert-type ratings are often treated this way): This is used
when the researcher knows the exact distance between 2 points on the scale.
For example, a person scoring 70 marks is equidistant from a person scoring
60 marks (on the left side) and a person scoring 80 marks (on the right side)
on the scale.
We continued with the first-level analysis. This analysis
is done with the help of two tools:
1. Frequency: It is used to look at nominal data in detail and to describe
the results. The Frequencies options include a table showing counts and
percentages, statistics including percentile values, central tendency,
dispersion and distribution, and charts including bar charts and histograms.
Choosing the Frequencies procedure: From the "Analyze" menu, highlight
"Descriptive Statistics", then move to the submenu and click on "Frequencies".
2. Cross Tab: It helps us identify the relationship between the two variables
that the question is about, and to understand the direction of that
relationship. For example, in the retail experiment that we were doing in
class, we were looking for the relationship between the stores and the
service satisfaction among the customers.
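
To make the two tools concrete, here is a minimal sketch of a frequency table and a cross tab in plain Python. The responses are made-up values, not the class data, and the helper names `frequency_table` and `cross_tab` are only illustrative; in class both were produced from the menus.

```python
from collections import Counter

# Hypothetical survey responses as (store, satisfaction) pairs.
# These values are invented for illustration; they are not the class data.
responses = [
    ("Store 1", "Satisfied"), ("Store 1", "Satisfied"), ("Store 1", "Dissatisfied"),
    ("Store 2", "Dissatisfied"), ("Store 2", "Dissatisfied"), ("Store 2", "Satisfied"),
    ("Store 3", "Satisfied"), ("Store 3", "Satisfied"),
]

def frequency_table(values):
    """Return {value: (count, percent)} for one nominal variable."""
    counts = Counter(values)
    total = sum(counts.values())
    return {v: (n, round(100 * n / total, 1)) for v, n in counts.items()}

def cross_tab(pairs):
    """Return {row_value: Counter({column_value: count})} for two nominal variables."""
    table = {}
    for row, col in pairs:
        table.setdefault(row, Counter())[col] += 1
    return table

store_freq = frequency_table(store for store, _ in responses)
table = cross_tab(responses)

for store, (n, pct) in store_freq.items():
    print(f"{store}: n={n} ({pct}%)")
for store, row in table.items():
    print(store, dict(row))
```

The frequency table describes one variable at a time, while the cross tab counts how the categories of two variables occur together, which is what lets us compare stores.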
In today's retail example, we had the following hypothesis:
H0 - Whichever store the customer visits, he
will get the same level of service.
a. We rejected H0 because the Chi-square test gave a significance
value below 0.05.
We found that the level of dissatisfaction is very high in
store 2 (41.2%) and very low in store 3 (44.9%).
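
A Chi-square test on a cross tab can be sketched as below. The counts are hypothetical (they are not the class figures), and `chi_square` and `p_value_dof2` are illustrative helper names. The p-value shortcut only holds for 2 degrees of freedom, which is what a 3-store by 2-outcome table gives; for other table sizes a statistics package would be needed.

```python
import math

def chi_square(observed):
    """Pearson chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count if store and satisfaction were unrelated (H0).
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (obs - expected) ** 2 / expected
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, dof

def p_value_dof2(stat):
    """Upper-tail p-value of the chi-square distribution; valid only for dof == 2."""
    return math.exp(-stat / 2)

# Hypothetical counts: rows are stores 1 to 3, columns are (satisfied, dissatisfied).
observed = [[40, 10], [25, 25], [45, 5]]
stat, dof = chi_square(observed)
p = p_value_dof2(stat)

print(f"chi-square = {stat:.3f}, dof = {dof}, p = {p:.6f}")
print("reject H0" if p < 0.05 else "fail to reject H0")
```

With these invented counts the p-value falls well below 0.05, so H0 (same service level in every store) would be rejected.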
We also understood the reasons why statistics can go wrong:
1. Because of spurious data.
2. Because effects present in the data cancel each other out.
With the help of a third factor in partial correlation, we found that the
level of dissatisfaction is mainly due to a problem with employee contact.
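
The idea of bringing in a third factor can be illustrated with the first-order partial correlation formula. This is a sketch on invented numbers, with `pearson` and `partial_corr` as illustrative helper names: `x` and `y` look correlated, but only because both depend on `z`, so the relationship disappears once `z` is controlled for.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Invented data: x and y are each "z plus unrelated noise".
z = [-3, -1, 1, 3]
x = [-2, -2, 0, 4]
y = [-4, 2, -2, 4]

print(f"raw correlation of x and y:   {pearson(x, y):.3f}")
print(f"partial correlation given z: {partial_corr(x, y, z):.3f}")
```

A large raw correlation that shrinks towards zero after controlling for the third factor points to that factor as the real driver.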
The recommendations given were:
1. Train the employees.
2. Observe and follow the best practices in the respective stores.
3. Shift the best-trained employees to the relevant store to increase
satisfaction among the customers.
After this exercise, we moved on to the second-level analysis.
It is also called Cluster Analysis. There are 2 common
ways of doing it.
1. Hierarchical Clustering
a. This method is used when there are fewer than 50 objects.
b. Divisive Clustering: It is also known as top-down clustering. We
start at the top with all documents in one cluster. The cluster is split using
a flat clustering algorithm. This procedure is applied recursively until each
document is in its own singleton cluster. There is evidence that divisive
algorithms produce more accurate hierarchies than bottom-up algorithms in some
circumstances. Top-down clustering benefits from complete information about the
global distribution when making top-level partitioning decisions.
c. Agglomerative Clustering: It is also known as bottom-up clustering.
Bottom-up algorithms treat each document as a singleton cluster at the outset
and then successively merge (or agglomerate) pairs of clusters until all
clusters have been merged into a single cluster that contains all documents.
d. The clustering process is done in 4 steps.
i. Selection of variables
ii. Distance measurement: Here, we measure the distance between 2 objects.
For example, if 5 students are compared on their marks in 5 subjects, we can
cluster them based on the average of the marks scored, or on the basis of the
correlation between their marks (if students who score high marks in Maths
also score high marks in Statistics, we can say the two are correlated).
There are 3 types of distance measurement.
1. Interval: There are 2 types of interval measurement tools.
a. Euclidean distance: It is the "ordinary" distance between two points that
one would measure with a ruler, and is given by the Pythagorean formula.
b. Block distance: For example, if we need to measure the distance between
2 buildings, rather than using the Pythagorean formula, we can measure the
practical distance (the path taken along the streets) between them.
2. Count: This can be determined with the help of the Chi-square measure.
3. Binary: Out of the tools available, 2 types are very important:
a. Jaccard: It is a statistic used for comparing the similarity and diversity
of sample sets.
b. Simple Matching
iii. Clustering criteria: We measure the distance from one cluster to another
cluster and from one cluster to another object.
iv. Dendrogram: It is the pictorial representation of how the clustering
happens. It is a visual representation of the spot correlation data. The
individual spots are arranged along the bottom of the dendrogram and referred
to as leaf nodes. Spot clusters are formed by joining individual spots or
existing spot clusters, with the join point referred to as a node. This can be
seen in the diagram above. At each dendrogram node we have a right and left
sub-branch of clustered spots. In the following discussion, spot clusters can
refer to a single spot or a group of spots. The vertical axis is labelled
distance and refers to a distance measure between spots or spot clusters. The
height of the node can be thought of as the distance value between the right
and left sub-branch clusters.
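
The steps above can be sketched in plain Python. This is only an illustration under a few assumptions: the marks are invented, the second binary measure is taken to be the simple matching coefficient, and clusters are joined by the smallest pairwise distance (single linkage), which is just one possible clustering criterion. The heights recorded for each merge are the values a dendrogram would show on its vertical axis.

```python
import math

def euclidean(a, b):
    """Straight-line distance (Pythagorean formula)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def block(a, b):
    """Block distance: sum of absolute differences along each axis."""
    return sum(abs(x - y) for x, y in zip(a, b))

def jaccard(a, b):
    """Jaccard similarity of binary vectors: shared 1s over positions with any 1."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 1.0

def simple_matching(a, b):
    """Share of positions where two binary vectors agree (1-1 or 0-0)."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)

def agglomerate(points, dist=euclidean):
    """Bottom-up clustering; returns merges as (cluster_a, cluster_b, height)."""
    clusters = [(i,) for i in range(len(points))]  # every object starts alone
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(dist(points[p], points[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Hypothetical marks of 4 students in 2 subjects.
marks = [(60, 62), (63, 66), (90, 88), (92, 91)]
merges = agglomerate(marks)
for left, right, height in merges:
    print(f"join {left} + {right} at distance {height:.2f}")
```

Passing `dist=block` to `agglomerate` swaps in the block distance without changing the rest of the procedure.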