DAY-4
GROUP:D-Sam Stephen
Box Plot
A boxplot is a way of summarizing a set
of data measured on an interval scale. It is often used in exploratory data
analysis. It is a type of graph which is used to show the shape of the
distribution, its central value, and variability. The picture produced consists
of the most extreme values in the data set (maximum and minimum values), the lower and upper
quartiles, and the median.
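The five values that make up the picture can be computed directly. The sketch below is only an illustration: the sample data is made up, the function name is hypothetical, and the "inclusive" quartile method is an assumption, since different tools use slightly different quartile conventions.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    # Assumption: the "inclusive" quartile convention; other tools may
    # give slightly different quartiles for the same data.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


data = [2, 4, 4, 5, 7, 8, 9, 11, 12]  # made-up sample
print(five_number_summary(data))  # (2, 4.0, 7.0, 9.0, 12)
```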
The median for each dataset is
indicated by the black center line, and the first and third quartiles are the
edges of the red box, whose length is the inter-quartile range (IQR). The
lines extending from the box end at the most extreme values that lie within
1.5 times the IQR of the upper or lower quartile. Points more than 1.5 times
the IQR beyond either quartile are plotted
individually as asterisks. These points represent potential outliers.
Real datasets not uncommonly contain
surprisingly high maximums or surprisingly low minimums, called outliers.
John Tukey has provided a precise definition for two types of outliers:
- Outliers are either 3×IQR or more above the third quartile or 3×IQR or more below the first quartile.
- Suspected outliers are slightly more central versions of outliers: either 1.5×IQR or more above the third quartile or 1.5×IQR or more below the first quartile.
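The two definitions above can be turned into a small classifier. This is a minimal sketch rather than a reference implementation: the function name and sample data are hypothetical, and the "inclusive" quartile method is an assumption.

```python
import statistics


def classify_by_tukey(values):
    """Label each value 'outlier', 'suspected outlier' or 'ordinary'."""
    # Assumption: "inclusive" quartiles; other conventions shift the fences.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    labelled = []
    for x in values:
        # Distance beyond the nearer quartile (zero or negative inside the box).
        distance = max(q1 - x, x - q3)
        if distance > 0 and distance >= 3 * iqr:
            label = "outlier"
        elif distance > 0 and distance >= 1.5 * iqr:
            label = "suspected outlier"
        else:
            label = "ordinary"
        labelled.append((x, label))
    return labelled


data = [2, 4, 4, 5, 7, 8, 9, 11, 12, 25, 40]  # made-up sample
for value, label in classify_by_tukey(data):
    if label != "ordinary":
        print(value, label)
```

For this sample the quartiles are 4.5 and 11.5, so the IQR is 7. The value 25 lies 13.5 above the third quartile (at least 1.5×IQR) and 40 lies 28.5 above it (at least 3×IQR).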
A boxplot, or box and whisker diagram, powerfully depicts the centers and spreads of given data. The two most important measures of center are the mean and the median. Important measures of spread include the interquartile range and the mean absolute deviation.
The StatCharts boxplot shows the measure of
central location (the median), two measures of dispersion (the range and
inter-quartile range), the skewness (from the orientation of the median
relative to the quartiles) and potential outliers (marked individually).
Boxplots are especially useful when comparing two or more sets of
data. The boxplot also shows the second measure of central location (the
mean), which is represented with a "+" sign in the box. The notch
around the mean, as shown in the figure below, represents the Confidence
Interval for the mean. This notch is displayed only when the If Notched option
is chosen in the dialog box.
[Figure: Box Plot]
The plot may be drawn either vertically, as in the above diagram, or horizontally.
Interpreting a Boxplot
The boxplot is interpreted as follows:
- The box itself contains the middle 50% of
the data. The upper edge (hinge) of the box indicates the 75th percentile
of the data set, and the lower hinge indicates the 25th percentile. The
range of the middle two quartiles is known as the inter-quartile range.
- The line in the box indicates the median
value of the data.
- If the median line within the box is not
equidistant from the hinges, then the data is skewed.
- The ends of the vertical lines or
"whiskers" indicate the minimum and maximum data values, unless
outliers are present, in which case each whisker ends at the most extreme
data value within 1.5 times the inter-quartile range of its hinge.
- The points outside the ends of the
whiskers are outliers or suspected outliers.
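The reading rules above can be sketched in code. The example below is illustrative only: the function name, the dictionary keys and the sample data are hypothetical, the "inclusive" quartile method is an assumption, and the direction of skew (a longer upper part of the box read as right skew) is a common rule of thumb rather than something stated in the list.

```python
import statistics


def boxplot_parts(values):
    """Compute hinges, median, whisker ends, outside points and a skew hint."""
    # Assumption: "inclusive" quartiles stand in for the hinges.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in values if low_fence <= x <= high_fence]
    outside = sorted(x for x in values if x < low_fence or x > high_fence)
    # Compare the two parts of the box on either side of the median line.
    lower_part, upper_part = median - q1, q3 - median
    if upper_part > lower_part:
        skew_hint = "right"
    elif upper_part < lower_part:
        skew_hint = "left"
    else:
        skew_hint = "none"
    return {
        "hinges": (q1, q3),
        "median": median,
        "whiskers": (min(inside), max(inside)),
        "outside": outside,
        "skew_hint": skew_hint,
    }


sample = [1, 2, 2, 3, 3, 4, 6, 9, 13, 30]  # made-up sample
print(boxplot_parts(sample))
```

Here the hinges are 2.25 and 8.25, the median is 3.5, the whiskers end at 1 and 13, and the value 30 is plotted on its own beyond the upper whisker.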
Boxplot Enhancements
Beyond the basic information, boxplots sometimes are enhanced to convey
additional information:
- The mean and its confidence interval can
be shown using a diamond shape in the box.
- The expected range of the median can be
shown using notches in the box.
- The width of the box can be varied in
proportion to the log of the sample size.
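Two of these enhancements reduce to short formulas. The sketch below assumes one common convention for the notch, median ± 1.57 × IQR / √n; the text above does not say which formula is meant, so treat this as an assumption. The function names and the `scale` parameter are hypothetical.

```python
import math
import statistics


def notch_limits(values):
    """Return (lower, upper) notch limits around the median."""
    # Assumption: notch half-width of 1.57 * IQR / sqrt(n), one widely
    # used convention; other tools use slightly different constants.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    half_width = 1.57 * (q3 - q1) / math.sqrt(len(values))
    return median - half_width, median + half_width


def box_width(sample_size, scale=0.2):
    """Box width proportional to the log of the sample size."""
    # `scale` is an arbitrary drawing constant chosen for illustration.
    return scale * math.log(sample_size)


data = [2, 4, 4, 5, 7, 8, 9, 11, 12]  # made-up sample
print(notch_limits(data))
print(box_width(10), box_width(100))
```

When the notches of two side-by-side boxes do not overlap, that is usually read as a sign that the medians differ, though it is a visual guide and not a formal test.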
Box plots have the following strengths:
- Graphically display a variable's location and spread at a glance.
- Provide some indication of the data's symmetry and skewness.
- Unlike many other methods of data display, boxplots show outliers.
- By drawing a boxplot for each category of a categorical variable
side-by-side on the same graph, one can quickly compare data sets.
One drawback of boxplots is that they tend to emphasize the tails of a
distribution, which are the least certain points in the data set. They also
hide many of the details of the distribution. Displaying a histogram in
conjunction with the boxplot helps in this regard, and both are important tools
for exploratory data analysis.
What is K-means?
k-means clustering is a data mining/machine learning algorithm used to cluster observations into groups of related observations without any prior knowledge of those relationships. The k-means algorithm is one of the simplest clustering techniques and it is commonly used in medical imaging, biometrics and related fields.
The K-means Algorithm
The k-means algorithm is an iterative
algorithm that gains its name from its method of operation. The algorithm
clusters observations into k groups, where k is provided as an input parameter.
It assigns each observation to the cluster whose mean is closest to that
observation. Each cluster’s mean is then recomputed and
the process begins again, repeating until the assignments stop changing.
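The assign-and-recompute loop described above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function name, the `max_iterations` parameter and the sample points are hypothetical, and starting from the first k points is a simplifying assumption (implementations commonly choose the starting means at random).

```python
import math


def k_means(points, k, max_iterations=100):
    """Cluster points (tuples of numbers) into k groups."""
    # Assumption: the first k points serve as the initial cluster means.
    means = [tuple(p) for p in points[:k]]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster with the nearest mean.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, means[c])) for p in points
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the clustering has settled
        assignments = new_assignments
        # Update step: recompute each mean from its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous mean
                means[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return assignments, means


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]  # made-up observations
assignments, means = k_means(points, k=2)
print(assignments)  # [0, 0, 0, 1, 1, 1]
print(means)
```

On this sample the first pass puts (1.5, 2.0) in the second cluster because it is itself a starting mean. The second pass moves it to the first cluster once the means have been recomputed, and the third pass changes nothing, so the loop stops.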