Monday, September 10, 2012


DAY-4   GROUP:D-Sam  Stephen

Box Plot

A boxplot is a way of summarizing a set of data measured on an interval scale. It is often used in exploratory data analysis. It is a type of graph which is used to show the shape of the distribution, its central value, and variability. The picture produced consists of the most extreme values in the data set (maximum and minimum values), the lower and upper quartiles, and the median.

The median for each dataset is indicated by the black center line, and the first and third quartiles are the edges of the red area; the distance between them is known as the inter-quartile range (IQR). The most extreme values that lie within 1.5 times the IQR of the upper or lower quartile are the ends of the lines extending from the box. Points farther than 1.5 times the IQR from the nearer quartile are plotted individually as asterisks. These points represent potential outliers.

Real datasets often display surprisingly high maximums or surprisingly low minimums, called outliers. John Tukey provided a precise definition for two types of outliers:
  • Outliers are either 3×IQR or more above the third quartile or 3×IQR or more below the first quartile.
  • Suspected outliers are slightly more central versions of outliers: either 1.5×IQR or more above the third quartile or 1.5×IQR or more below the first quartile.
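
Tukey's two rules can be tried out with a few lines of Python. This is only a minimal sketch: the function name tukey_outliers and the sample values are made up for illustration, and the quartiles come from the "inclusive" method of the standard library, which is just one of several conventions (other tools may give slightly different quartiles). It follows the "or more" wording above, so a value exactly on a fence is flagged.

```python
import statistics

def tukey_outliers(data):
    """Split values into Tukey's 'suspected outliers' and 'outliers'."""
    # Assumption: quartiles use the "inclusive" method; conventions vary by tool.
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1

    # Inner fences sit 1.5*IQR beyond the quartiles, outer fences 3*IQR beyond.
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr

    # Outliers: at or beyond an outer fence.
    outliers = [x for x in data if x <= outer_low or x >= outer_high]
    # Suspected outliers: at or beyond an inner fence, but inside the outer fence.
    suspected = [
        x for x in data
        if (outer_low < x <= inner_low) or (inner_high <= x < outer_high)
    ]
    return {"suspected": suspected, "outliers": outliers}

# Made-up sample: most values are small, two are unusually large.
sample = [2, 4, 4, 5, 6, 7, 8, 9, 10, 20, 60]
result = tukey_outliers(sample)
print(result)
```

For this sample the quartiles are 4.5 and 9.5, so the IQR is 5, the upper inner fence is 17 and the upper outer fence is 24.5. The value 20 falls between the two fences (suspected outlier) and 60 lies beyond the outer fence (outlier).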

A boxplot, or box and whisker diagram, powerfully depicts the centers and spreads of given data. The two most important measures of center are the mean and the median. Important measures of spread include the interquartile range and the mean absolute deviation.

The StatCharts boxplot shows a measure of central location (the median), two measures of dispersion (the range and inter-quartile range), the skewness (from the position of the median relative to the quartiles) and potential outliers (marked individually). Boxplots are especially useful when comparing two or more sets of data. The boxplot also shows a second measure of central location (the mean), which is represented with a "+" sign in the box. The notch around the mean, as shown in the figure below, represents the Confidence Interval for the mean. This notch is displayed only when the If Notched option is chosen in the dialog box.

[Figure: Box Plot]

The plot may be drawn either vertically, as in the above diagram, or horizontally.

Interpreting a Boxplot
The boxplot is interpreted as follows:
  • The box itself contains the middle 50% of the data. The upper edge (hinge) of the box indicates the 75th percentile of the data set, and the lower hinge indicates the 25th percentile. The distance between the two hinges is known as the inter-quartile range.
  • The line in the box indicates the median value of the data.
  • If the median line within the box is not equidistant from the hinges, then the data is skewed.
  • The ends of the vertical lines or "whiskers" indicate the minimum and maximum data values, unless outliers are present, in which case each whisker extends to the most extreme data value within 1.5 times the inter-quartile range of its hinge.
  • The points outside the ends of the whiskers are outliers or suspected outliers.
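
The interpretation rules above can be turned into numbers with a short Python sketch. The function name box_stats, the dictionary keys and the sample values are hypothetical, chosen only for this illustration. The hinges are taken to be the quartiles from the standard library's "inclusive" method, which is an assumption, since plotting tools differ slightly in how they compute hinges.

```python
import statistics

def box_stats(data):
    """Compute the values a boxplot displays for one data set."""
    # Assumption: hinges are the "inclusive" quartiles; other tools may differ.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    # Whiskers end at the most extreme data values still inside the fences.
    inside = [x for x in data if low_fence <= x <= high_fence]
    # Anything beyond the whiskers is plotted as an individual point.
    outside = sorted(x for x in data if x < low_fence or x > high_fence)

    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outside": outside,
        # Positive: median sits nearer the lower hinge (longer upper half of the box).
        "median_offset": (q3 - median) - (median - q1),
    }

# Made-up sample with a long upper tail.
stats = box_stats([2, 3, 3, 4, 5, 9, 12, 15, 18, 21, 50])
for name, value in stats.items():
    print(name, value)
```

In this sample the box runs from 3.5 to 16.5 with the median at 9, so the median is closer to the lower hinge, which suggests the data is skewed toward higher values. The whiskers end at 2 and 21, and 50 is drawn as a separate point.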

Boxplot Enhancements
Beyond the basic information, boxplots sometimes are enhanced to convey additional information:
  • The mean and its confidence interval can be shown using a diamond shape in the box.
  • The expected range of the median can be shown using notches in the box.
  • The width of the box can be varied in proportion to the log of the sample size.

Box plots have the following strengths:
  • Graphically display a variable's location and spread at a glance.
  • Provide some indication of the data's symmetry and skewness.
  • Unlike many other methods of data display, boxplots show outliers.
  • By drawing a boxplot for each level of a categorical variable side by side on the same graph, one can quickly compare data sets.
One drawback of boxplots is that they tend to emphasize the tails of a distribution, which are the least certain points in the data set. They also hide many of the details of the distribution. Displaying a histogram in conjunction with the boxplot helps in this regard, and both are important tools for exploratory data analysis.

What is K-means?

k-means clustering is a data mining/machine learning algorithm used to cluster observations into groups of related observations without any prior knowledge of those relationships. The k-means algorithm is one of the simplest clustering techniques and it is commonly used in medical imaging, biometrics and related fields.

The K-means Algorithm

The k-means algorithm is an iterative algorithm that gains its name from its method of operation. The algorithm clusters observations into k groups, where k is provided as an input parameter. It assigns each observation to the cluster whose mean is nearest to it. Each cluster’s mean is then recomputed and the process begins again, repeating until the assignments stop changing.
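
The assign-and-recompute loop can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function name kmeans, the max_iter cap and the sample points are made up here, and the starting means are simply the first k points, which is an assumption (the description above does not say how the initial means are chosen, and real libraries usually pick them more carefully).

```python
import math

def kmeans(points, k, max_iter=100):
    """Cluster points (tuples of numbers) into k groups by repeated
    assignment to the nearest mean and recomputation of the means."""
    # Assumption: start from the first k points as the initial cluster means.
    centroids = [tuple(p) for p in points[:k]]
    assignments = None

    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest mean.
        new_assignments = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Stop once no point changes cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments

        # Update step: recompute each cluster's mean from its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old mean if a cluster ends up empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))

    return centroids, assignments

# Made-up 2-D points forming two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, assignments = kmeans(points, k=2)
print(centroids)
print(assignments)
```

With these points the loop settles after a few passes: the first three points end up in one cluster and the last three in the other. The result can depend on the starting means, which is why the initialization choice is worth stating explicitly.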
