Outliers are data points that are inconsistent with the majority of the data. There are different kinds of outliers, some valid and some not. An example of a valid outlier is the CEO's salary in an income attribute, which is usually much higher than that of the other employees. On the other hand, an AGE attribute with a value of 200 is obviously noise and should be removed as an outlier.
Some of the general methods used for removing outliers are:

Clustering: Related data points are grouped into clusters, and the cluster centers are then used to find the data points that are not close enough to any of them. Those points are rejected as outliers.

Curve-Fitting: This method first uses regression analysis to find a curve that fits the data closely. It then removes all points (outliers) that are sufficiently far away from the fitted curve.

Hypothesis-testing with given model: In this case, certain hypotheses are formulated that the data in the domain must satisfy. Data points that do not satisfy these hypotheses are rejected as outliers.
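
The three methods above can be illustrated with a minimal Python sketch. It is not a definitive implementation: the function names, the 1-D k-means with evenly spread initial centers, the `max_dist` and `n_sigma` thresholds, the straight-line fit, and the 0 to 120 age range are all assumptions chosen for illustration, and the sample data is made up.

```python
from statistics import mean, pstdev


def clustering_outliers(values, k=2, max_dist=2.0, iters=20):
    """1-D k-means, then flag values farther than max_dist from every center."""
    lo, hi = min(values), max(values)
    # Assumed initialisation: k centers spread evenly over the data range (k >= 2).
    centers = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            groups[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return [v for v in values if min(abs(v - c) for c in centers) > max_dist]


def curve_fit_outliers(xs, ys, n_sigma=2.0):
    """Least-squares line fit, then flag points with unusually large residuals."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    # Assumed rule: "sufficiently far" means more than n_sigma residual std devs.
    limit = n_sigma * pstdev(residuals)
    return [(x, y) for x, y, r in zip(xs, ys, residuals) if abs(r) > limit]


def hypothesis_outliers(values, hypothesis):
    """Flag values that violate a hypothesis (a predicate) about the data domain."""
    return [v for v in values if not hypothesis(v)]


# Made-up sample data for each method.
readings = [1.0, 1.2, 0.9, 1.1, 10.0, 10.2, 9.9, 10.1, 5.0]
cluster_out = clustering_outliers(readings)

xs = list(range(10))
ys = [2 * x + 1 for x in xs]
ys[5] = 30  # one point pushed far off the line y = 2x + 1
curve_out = curve_fit_outliers(xs, ys)

ages = [23, 45, 200, 31, -4]
age_out = hypothesis_outliers(ages, lambda a: 0 <= a <= 120)

print(cluster_out, curve_out, age_out)
```

Note that a distance threshold alone can miss an extreme, isolated value: k-means may give such a point its own center, leaving it at distance zero. In practice, very small clusters are often treated as outliers too.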