What is cluster analysis?
Cluster analysis is an exploratory technique that we can use to visualize patterns in our
project by grouping sources or nodes that share similar words, have similar
attribute values, or are coded similarly by nodes. Cluster analysis
diagrams provide a graphical representation of sources or nodes to make it easy
to see similarities and differences. Sources or nodes in the cluster analysis
diagram that appear close together are more similar than those that are far
apart.
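To make "share similar words" concrete, here is a minimal sketch of one possible word-similarity measure, the Jaccard coefficient (shared distinct words divided by all distinct words). The choice of measure, the function name `jaccard_similarity`, and the sample texts are assumptions made for illustration; the tool that draws your diagrams may measure similarity differently.

```python
def jaccard_similarity(text_a, text_b):
    """Share of distinct words that two texts have in common (0.0 to 1.0)."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty texts are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


# Hypothetical sources, invented for illustration only.
sources = {
    "submission_1": "rising sea levels threaten coastal homes",
    "submission_2": "coastal homes face rising sea levels",
    "submission_3": "soil erosion reduces farm yields",
}

# Compare every pair of sources once.
names = sorted(sources)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        score = jaccard_similarity(sources[first], sources[second])
        print(f"{first} vs {second}: {score:.2f}")
```

Pairs with a higher score would sit closer together in a cluster analysis diagram; pairs that share no words score zero.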
We can use cluster analysis diagrams to visualize:
§ The similarities and differences across our sources: for example, how similar are the
submissions from the various community members?
§ The similarities and differences across our nodes: for example, how similar is the coding
at the nodes rising sea levels, flood control, soil erosion, and land reclamation?
§ The demographic spread of our survey respondents based
on attribute value.
There are
several different types of cluster analysis. The two most commonly used are
K-means clustering and hierarchical clustering.
K-means Clustering
K-means
clustering treats the observations in the data as objects having locations and
distances from each other (note that the distances used in clustering often do
not represent spatial distances). It partitions the objects into K mutually
exclusive clusters so that objects within each cluster are as close to each
other as possible and at the same time, as far from objects in other clusters
as possible. Each cluster is then characterized by its mean, or center point.
Examples:
§ Sequential Threshold method - first determine a cluster center, then
group all objects that are within a predetermined threshold from the center -
one cluster is created at a time
§ Parallel Threshold method - several cluster centers are determined
simultaneously, then objects that are within a predetermined threshold from the
centers are grouped
§ Optimizing Partitioning method - first a non-hierarchical procedure is
run, then objects are reassigned so as to optimize an overall criterion.
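The K-means idea described above (assign each object to its nearest center, then move each center to the mean of its members) can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function name `kmeans`, its parameters, the random choice of starting centers, and the sample points are all assumptions made for this example.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Partition points (tuples of numbers) into k clusters.

    Returns (centers, labels), where labels[i] is the cluster index of points[i].
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k data points picked at random
    labels = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centers.append(mean)
            else:
                new_centers.append(centers[c])  # leave an empty cluster's center alone
        if new_centers == centers:
            break  # centers stopped moving, so the partition is stable
        centers = new_centers
    return centers, labels


# Hypothetical observations forming two loose groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, labels = kmeans(points, k=2)
for point, label in zip(points, labels):
    print(point, "-> cluster", label)
```

Note that K must be chosen before the procedure runs, and that a different starting choice of centers can lead to a different final partition on less clearly separated data.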
Hierarchical Clustering
Hierarchical
clustering is a way to investigate groupings in the data simultaneously over a
variety of scales and distances. It does this by creating a cluster tree with
various levels. Unlike K-means clustering, the tree is not a single set of
clusters. Rather, the tree is a multi-level hierarchy where clusters at one
level are joined as clusters at the next higher level. The most commonly used
(agglomerative) algorithm starts with each case or variable in a separate cluster
and then combines clusters until only one is left. This allows the researcher to
decide what level of clustering is most appropriate for the research.
Examples:
§ Divisive clustering - start by treating all objects as if they
are part of a single large cluster, then divide the cluster into smaller and
smaller clusters
§ Agglomerative clustering - start by treating each object as a
separate cluster, then group them into bigger and bigger clusters
  Examples:
  § Centroid methods - clusters are generated that maximize the
  distance between the centers of clusters (a centroid is the mean value for all
  the objects in the cluster)
  § Variance methods - clusters are generated that minimize the
  within-cluster variance
    Example:
    § Ward’s Procedure - clusters are generated that minimize the
    squared Euclidean distance to the cluster mean
  § Linkage methods - cluster objects based on the distance
  between them
    Examples:
    § Single Linkage method - cluster objects based on the minimum
    distance between them (also called the nearest neighbour rule)
    § Complete Linkage method - cluster objects based on the maximum
    distance between them (also called the furthest neighbour rule)
    § Average Linkage method - cluster objects based on the average
    distance between all pairs of objects, where the two members of each pair
    come from different clusters
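The agglomerative procedure and the three linkage rules listed above can be sketched as follows. This is a simplified illustration under stated assumptions: the function names `cluster_distance` and `agglomerate`, the use of Euclidean distance, and the sample points are all chosen for this example, and the brute-force search is only practical for small data sets.

```python
import math


def cluster_distance(cluster_a, cluster_b, linkage):
    """Distance between two clusters under the chosen linkage rule."""
    pair_distances = [math.dist(a, b) for a in cluster_a for b in cluster_b]
    if linkage == "single":      # nearest neighbour rule
        return min(pair_distances)
    if linkage == "complete":    # furthest neighbour rule
        return max(pair_distances)
    if linkage == "average":     # mean over all between-cluster pairs
        return sum(pair_distances) / len(pair_distances)
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerate(points, linkage="single"):
    """Merge the two closest clusters repeatedly until only one remains.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples.
    """
    clusters = [[p] for p in points]  # every object starts as its own cluster
    history = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]], linkage),
        )
        distance = cluster_distance(clusters[i], clusters[j], linkage)
        history.append((clusters[i], clusters[j], distance))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return history


# Hypothetical one-dimensional observations.
points = [(0.0,), (1.0,), (4.0,), (6.0,)]
for rule in ("single", "complete", "average"):
    print(rule, [round(d, 2) for _, _, d in agglomerate(points, rule)])
```

The merge history is the cluster tree: each recorded distance is the height at which two branches join, so cutting the tree at a chosen distance yields the clusters at that level. On this sample the three rules merge the same objects in the same order but report different distances for the final merge.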
Types of variables used in data processing
Univariate:
refers to data or analysis involving only one variable; in mathematics, the term
also describes an expression, equation, function or polynomial of only one variable.
Bivariate/Multivariate statistics:
a form of statistics encompassing the simultaneous
observation and analysis of more than one outcome variable. Bivariate statistics
is the special case involving exactly two variables. The application of
multivariate statistics is multivariate analysis. Multivariate data can be
plotted using scatter plots.
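As a small illustration of the difference, the sketch below summarizes one variable on its own (univariate) and then measures how two variables move together (bivariate). The data values, the variable names, and the helper `pearson` are hypothetical and chosen only for this example.

```python
import math
import statistics

# Hypothetical survey values, invented for illustration only.
ages = [23, 35, 41, 29, 52]
incomes = [28, 45, 50, 33, 64]  # in thousands

# Univariate: describe a single variable by itself.
mean_age = statistics.mean(ages)
sd_age = statistics.stdev(ages)


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(spread_x * spread_y)


# Bivariate: describe the relationship between two variables.
age_income_r = pearson(ages, incomes)

print(f"mean age: {mean_age}, standard deviation: {sd_age:.2f}")
print(f"correlation between age and income: {age_income_r:.3f}")
```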
Difference between processing and analysis
Data processing:
an intermediary stage of work between data collection and data
analysis. The completed instruments of data collection (interview
schedules, questionnaires, data sheets and field notes) contain a vast mass of data.
They cannot straightaway provide answers to research questions. Like raw
materials, they need processing. Data processing involves classification and
summarisation of raw data in order to make them amenable to analysis.
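A minimal sketch of classification and summarisation, assuming the raw data are free-text questionnaire answers: each answer is first mapped to a clean category, then the categories are counted. The answers and the helper `classify` are hypothetical.

```python
from collections import Counter

# Hypothetical questionnaire answers, invented for illustration only.
raw_responses = ["Agree", "agree ", "Disagree", "AGREE", "neutral", "disagree"]


def classify(response):
    """Map a raw answer to a clean category label."""
    return response.strip().lower()


# Summarise: count how many answers fall into each category.
summary = Counter(classify(r) for r in raw_responses)
for category, count in summary.most_common():
    print(f"{category}: {count}")
```

The resulting counts are the processed data; interpreting what they mean for the research question belongs to the analysis stage.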
Data analysis:
a body of methods that help to
describe facts, detect patterns, develop explanations, and test hypotheses. It
is used in all of the sciences. It is used in business, in administration, and
in policy. Analysis of data is a process of inspecting, cleaning, transforming,
and modelling data with the goal of highlighting useful information, suggesting
conclusions, and supporting decision making.