Authors: Tshering Cigay Dorji
Addresses: Bhutan Innovation and Technology Centre, Thimphu TechPark, P.O. Box 633, Thimphu, Bhutan
Abstract: Popular text classification algorithms such as Naïve Bayes, kNN, centroid-based classifiers and support vector machines (SVM) are based on supervised machine learning. They normally use a classical text representation technique in which a 'bag of words' serves as the feature set. This representation leads to the inclusion of unimportant features and the loss of important semantic relationships and inflection information, which reduces accuracy. To address this problem, we propose a new text classification methodology based on field association terms, a set of terms that identify specific document fields. The methodology is compared against Naïve Bayes, kNN, a centroid-based classifier and SVM on a closed dataset of 3180 documents from Wikipedia dumps and an open dataset of 9449 documents from the Reuters RCV1 Corpus, 20-Newsgroup and 4-Universities datasets. The new method outperformed the other algorithms with a precision of 97%, compared with 85% for the centroid-based classifier, 78% for Naïve Bayes, 48% for kNN and 42% for SVM.
Keywords: text categorisation; text classification; field association terms; naive Bayes classifier; k-nearest neighbour; kNN; centroid-based classifier; support vector machines; SVM; text mining.
International Journal of Computer Applications in Technology, 2015 Vol.52 No.2/3, pp.150 - 159
Published online: 26 Sep 2015