Monday, September 10, 2012

Day 4 - Team E - Siva Sandeep


Outliers
Outliers are cases that have data values that are very different from the data values for the majority of cases in the data set.
Outliers are important because they can change the results of data analysis.
Whether to include or exclude outliers from a data analysis depends on the reason why the case is an outlier and the purpose of the analysis.

Investigating outliers carefully
Often outliers contain valuable information about the process under investigation or the data gathering and recording process. Before considering the possible elimination of these points from the data, one should try to understand why they appeared and whether it is likely similar values will continue to appear. Of course, outliers are often bad data points.

Box Plot: To detect Outliers
Graphs> Legacy> Boxplots
It is a graphical representation of data that shows a data set’s lowest value, highest value, median value, and the size of the first and third quartile. The box plot is useful in analyzing small data sets that do not lend themselves easily to histograms. Because of the small size of a box plot, it is easy to display and compare several box plots in a small space. A box plot is a good alternative or complement to a histogram and is usually better for showing several simultaneous comparisons.

Following are the steps to use Box plot:
Data collection
Depth of median is calculated
Draw and label the axes of the graph.
Draw the box plots: Construct the boxes, insert median points, and attach upper and lower adjacent limits. Identify outliers (values outside the upper and lower adjacent limits) with asterisks.
Analyze the results: A box plot shows the distribution of data. The line between the lowest adjacent limit and the bottom of the box represent one-fourth of the data. One-fourth of the data falls between the bottom of the box and the median, and another one-fourth between the median and the top of the box. The line between the top of the box and the upper adjacent limit represents the final one-fourth of the data observations. Once the pattern of data variation is clear, the next step is to develop an explanation for the variation.

Author - Siva Sandeep

No comments:

Post a Comment