Authors: Taghi M. Khoshgoftaar, Pierre Rebours
Addresses: Department of Computer Science and Engineering, Florida Atlantic University, 777 Glades Road, Boca Raton 33431, FL, USA; Department of Computer Science and Engineering, Florida Atlantic University, 777 Glades Road, Boca Raton 33431, FL, USA
Abstract: We present two new noise filtering techniques that improve the quality of training datasets by removing data points that are likely to be noisy. In addition, a new measure called 'efficiency paired comparison' is introduced to simplify the comparison between two filters. The filtering techniques are based on the partitioning approach: the training dataset is first split into subsets, and base learners are induced on each of these subsets. The predictions are then combined so that an instance in the training data is identified as noisy if it is misclassified by a certain number of base learners. The first technique, the multiple-partitioning filter, combines several classifiers induced on each subset. The second technique, the iterative-partitioning filter, uses only one base learner but goes through multiple filtering iterations. The amount of noise removed by the techniques is varied by tuning either the filtering level or the number of iterations. Empirical studies using software measurement data from a high-assurance software project assess the efficiencies of our two noise filtering approaches. The empirical results suggest that using several base classifiers, as well as performing several iterations with a conservative filtering scheme, can improve the efficiency of the filtering technique.
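The partitioning scheme described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the round-robin split, the toy nearest-centroid base learner, and all function and parameter names (`partitioning_filter`, `iterative_partitioning_filter`, `filtering_level`, `max_iterations`) are hypothetical choices made for the example. The abstract does not specify the base learners, the splitting strategy, or the stopping rule.

```python
from collections import Counter


def nearest_centroid_learner(rows, labels):
    """Toy base learner (an assumption, not from the paper): nearest class centroid."""
    counts = Counter(labels)
    sums = {}
    for x, y in zip(rows, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
    centroids = {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

    def predict(x):
        # Return the class whose centroid is closest in squared Euclidean distance.
        return min(
            centroids,
            key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
        )

    return predict


def partitioning_filter(rows, labels, learners, n_partitions, filtering_level):
    """Flag instances misclassified by at least `filtering_level` base classifiers.

    One classifier is induced per (subset, learner) pair, so passing several
    learners corresponds loosely to the multiple-partitioning idea.
    Assumes len(rows) >= n_partitions so that no subset is empty.
    """
    n = len(rows)
    errors = [0] * n
    for p in range(n_partitions):
        subset = range(p, n, n_partitions)  # round-robin split (assumed)
        sub_rows = [rows[i] for i in subset]
        sub_labels = [labels[i] for i in subset]
        for learn in learners:
            predict = learn(sub_rows, sub_labels)
            for i in range(n):
                if predict(rows[i]) != labels[i]:
                    errors[i] += 1
    return [i for i in range(n) if errors[i] >= filtering_level]


def iterative_partitioning_filter(rows, labels, learner, n_partitions,
                                  filtering_level, max_iterations):
    """Repeat the filter with a single base learner, removing flagged instances each round."""
    keep = list(range(len(rows)))  # original indices still in the training set
    removed = []
    for _ in range(max_iterations):
        cur_rows = [rows[i] for i in keep]
        cur_labels = [labels[i] for i in keep]
        noisy = partitioning_filter(cur_rows, cur_labels, [learner],
                                    n_partitions, filtering_level)
        if not noisy:
            break  # assumed stopping rule: nothing flagged in this round
        noisy_positions = set(noisy)
        removed.extend(keep[j] for j in noisy)
        keep = [k for j, k in enumerate(keep) if j not in noisy_positions]
    return sorted(removed)


# Two well-separated clusters; the last two instances carry swapped labels.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10), (11, 10), (10, 11), (11, 11),
        (0.5, 0.5), (10.5, 10.5)]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

flagged = partitioning_filter(rows, labels, [nearest_centroid_learner],
                              n_partitions=2, filtering_level=2)
print(flagged)  # [8, 9]
```

In this sketch, raising `filtering_level` toward the total number of induced classifiers makes the filter more conservative (closer to a consensus vote), while lowering it removes more instances. That mirrors the tuning knob the abstract mentions, though the exact thresholds studied in the paper are not given here.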
Keywords: noise detection; noise elimination; data quality; software quality; partitioning filters; filtering level; iterative filtering; data mining.
International Journal of Computer Applications in Technology, 2006 Vol.27 No.4, pp.246 - 258
Published online: 08 Jan 2007