Tuesday, September 4, 2012

Day 2 team J Ayush Jain


What is cluster analysis?

Cluster analysis is an exploratory technique that we can use to visualize patterns in our project by grouping sources or nodes that share similar words, similar attribute values, or are coded similarly by nodes.  Cluster analysis diagrams provide a graphical representation of sources or nodes to make it easy to see similarities and differences. Sources or nodes in the cluster analysis diagram that appear close together are more similar than those that are far apart.
We can use cluster analysis diagrams to visualize:
§  The similarities and differences across your sources—for example, how similar are the submissions from the various community members?
§  The similarities and differences across your nodes—for example, how similar is the coding at rising sea levels, flood control, soil erosion, and land reclamation?
§  The demographic spread of your survey respondents based on attribute value.
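As a rough illustration of what "sources that share similar words" can mean, the sketch below scores pairs of sources by word overlap. It assumes Jaccard similarity as the measure, which is only one possible choice and not necessarily the one any particular tool uses; the sample submissions and the names `word_set` and `jaccard` are made up for illustration.

```python
def word_set(text):
    """Lower-case a text and return its set of distinct words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity: shared words divided by all distinct words (0 to 1)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical sources, invented for illustration only.
sources = {
    "submission_1": "rising sea levels threaten the coast",
    "submission_2": "sea levels threaten coast farms",
    "submission_3": "soil erosion reduces crop yields",
}

words = {name: word_set(text) for name, text in sources.items()}
names = sorted(words)

# Similarity score for every unordered pair of sources.
similarity = {
    (p, q): jaccard(words[p], words[q])
    for i, p in enumerate(names)
    for q in names[i + 1:]
}

for pair, score in similarity.items():
    print(pair, round(score, 2))
```

Pairs with a higher score would sit closer together in a cluster analysis diagram; pairs with a score of zero share no words at all.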
There are several different types of cluster analysis. The two most commonly used are K-means clustering and hierarchical clustering.
K-means Clustering
K-means clustering treats the observations in the data as objects having locations and distances from each other (note that the distances used in clustering often do not represent spatial distances). It partitions the objects into K mutually exclusive clusters so that objects within each cluster are as close to each other as possible and, at the same time, as far from objects in other clusters as possible. Each cluster is then characterized by its mean, or center point.
Examples of non-hierarchical (partitioning) methods, the family to which K-means belongs:
§  Sequential Threshold method - first determine a cluster center, then group all objects that are within a predetermined threshold from the center - one cluster is created at a time
§  Parallel Threshold method - simultaneously several cluster centers are determined, then objects that are within a predetermined threshold from the centers are grouped
§  Optimizing Partitioning method - first a non-hierarchical procedure is run, then objects are reassigned so as to optimize an overall criterion.
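The K-means description above can be sketched in a few lines. This is a minimal version assuming Euclidean distance, hand-picked starting centers, and the usual assign-then-update iteration (often called Lloyd's algorithm); the function name `kmeans` and the sample points are illustrative, not taken from any particular package.

```python
import math


def kmeans(points, centers, max_iter=100):
    """Assign each point to its nearest center, then move each center
    to the mean of its assigned points; repeat until nothing changes."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every point.
        labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Update step: each center becomes the mean of its cluster.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centers.append(old)  # keep the center of an empty cluster
        if new_centers == centers:  # converged: centers stopped moving
            break
        centers = new_centers
    return labels, centers


# Made-up 2-D observations forming two loose groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centers = kmeans(points, centers=[points[0], points[3]])
print(labels)
print(centers)
```

The result depends on the starting centers, so in practice the procedure is usually run several times from different starting points.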
Hierarchical Clustering
Hierarchical clustering is a way to investigate groupings in the data simultaneously over a variety of scales and distances. It does this by creating a cluster tree with various levels. Unlike the result of K-means clustering, the tree is not a single set of clusters. Rather, the tree is a multi-level hierarchy where clusters at one level are joined as clusters at the next higher level. The most common (agglomerative) algorithm starts with each case or variable in a separate cluster and then combines clusters until only one is left. This allows the researcher to decide what level of clustering is most appropriate for his or her research.
Examples:
§  Divisive clustering - start by treating all objects as if they are part of a single large cluster, then divide the cluster into smaller and smaller clusters
§  Agglomerative clustering - start by treating each object as a separate cluster, then group them into bigger and bigger clusters. Examples:
    §  Centroid methods - clusters are generated that maximize the distance between the centers of clusters (a centroid is the mean value for all the objects in the cluster)
    §  Variance methods - clusters are generated that minimize the within-cluster variance. Example:
        §  Ward’s Procedure - clusters are generated that minimize the sum of squared Euclidean distances from each object to its cluster mean
    §  Linkage methods - cluster objects based on the distance between them. Examples:
        §  Single Linkage method - cluster objects based on the minimum distance between them (also called the nearest neighbour rule)
        §  Complete Linkage method - cluster objects based on the maximum distance between them (also called the furthest neighbour rule)
        §  Average Linkage method - cluster objects based on the average distance between all pairs of objects, where the two members of each pair come from different clusters
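The agglomerative procedure and the three linkage rules listed above can be sketched as follows. This is a small, unoptimized illustration assuming Euclidean distance; the function names `linkage_distance` and `agglomerate` and the sample points are hypothetical.

```python
import math
from itertools import combinations


def linkage_distance(a, b, method):
    """Distance between clusters a and b under the given linkage rule."""
    pair_distances = [math.dist(p, q) for p in a for q in b]
    if method == "single":      # nearest neighbour rule
        return min(pair_distances)
    if method == "complete":    # furthest neighbour rule
        return max(pair_distances)
    if method == "average":     # mean over all cross-cluster pairs
        return sum(pair_distances) / len(pair_distances)
    raise ValueError(f"unknown linkage method: {method}")


def agglomerate(points, method="single"):
    """Start with one cluster per point and repeatedly merge the two
    closest clusters until one is left; return the merge history."""
    clusters = [[p] for p in points]
    history = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage_distance(clusters[ij[0]], clusters[ij[1]], method),
        )
        distance = linkage_distance(clusters[i], clusters[j], method)
        history.append((clusters[i], clusters[j], distance))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history


# Made-up 2-D observations.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.5)]
for left, right, distance in agglomerate(points, "complete"):
    print(left, "+", right, "at distance", round(distance, 2))
```

The merge history is the cluster tree read from the bottom up: cutting it at a chosen distance gives the set of clusters at that level.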

Types of variables used in data processing

Univariate: refers to an expression, equation, function or polynomial of only one variable.
Bivariate/Multivariate statistics: a form of statistics encompassing the simultaneous observation and analysis of more than one outcome variable (exactly two in the bivariate case). The application of multivariate statistics is multivariate analysis. Bivariate data can be plotted using a scatter plot, and multivariate data using a matrix of scatter plots.
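A small sketch of the difference in practice: a univariate summary looks at one variable on its own, while a bivariate measure describes how two variables move together. The survey figures below are invented for illustration, and the Pearson correlation coefficient is assumed here as the bivariate measure.

```python
import math
from statistics import mean, stdev

# Hypothetical survey data, invented for illustration.
age = [23, 35, 41, 52, 60]
income = [21, 34, 39, 55, 58]  # in thousands


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = mean(x), mean(y)
    covariance = sum((a - mx) * (b - my) for a, b in zip(x, y))
    spread_x = sum((a - mx) ** 2 for a in x)
    spread_y = sum((b - my) ** 2 for b in y)
    return covariance / math.sqrt(spread_x * spread_y)


# Univariate: summarize one variable by itself.
print("mean age:", mean(age), "stdev:", round(stdev(age), 2))

# Bivariate: describe the relationship between two variables.
r = pearson(age, income)
print("correlation of age and income:", round(r, 3))
```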

Difference between processing and analysis
Data processing: Data processing is an intermediary stage of work between data collection and data analysis. The completed instruments of data collection (interview schedules, questionnaires, data sheets, and field notes) contain a vast mass of data. They cannot directly provide answers to research questions; like raw materials, they need processing. Data processing involves classifying and summarising raw data to make them amenable to analysis.
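As a minimal sketch of classification and summarisation, the example below maps messy raw questionnaire answers to a few categories and then tallies them. The raw answers, the category names, and the `classify` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical raw questionnaire answers, invented for illustration.
raw_answers = [" Yes", "no", "YES", "n/a", "No ", "yes", ""]


def classify(answer):
    """Map a raw free-text answer to one of three categories."""
    cleaned = answer.strip().lower()
    if cleaned in {"yes", "y"}:
        return "yes"
    if cleaned in {"no", "n"}:
        return "no"
    return "missing"


# Summarisation: count how many answers fall into each category.
summary = Counter(classify(a) for a in raw_answers)
print(summary)
```

The tally, rather than the raw answers, is what the analysis stage then works with.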
Data analysis: Data analysis is a body of methods that helps to describe facts, detect patterns, develop explanations, and test hypotheses. It is used in all of the sciences, as well as in business, administration, and policy. Analysis of data is a process of inspecting, cleaning, transforming, and modelling data with the goal of highlighting useful information, suggesting conclusions, and supporting decision making.
