Title: Text classification using scores based k-NN approach and term to category relevance weighting scheme

Authors: Ahmed Ben Afia; Hamid Amiri

Addresses: Signal, Image and Information Technology Laboratory (LR-SITI), Department of Electrical Engineering, Tunis National Engineering School (ENIT), Tunis El Manar University, Tunis, Tunisia; Signal, Image and Information Technology Laboratory (LR-SITI), Department of Electrical Engineering, Tunis National Engineering School (ENIT), Tunis El Manar University, Tunis, Tunisia

Abstract: Text categorisation (TC) is the task of deciding which of a set of pre-specified classes a document belongs to. To reach this goal, a TC system must include two basic stages. The first stage is feature extraction using a term weighting scheme. The second stage is classification using a machine learning algorithm. After proposing a new term-to-category relevance weighting scheme, called TF.IDF.TCR, we focus on finding a new algorithm to perform the classification step. The results of our experiments, in which we use several classifiers, show promising performance. On the other hand, using relevance to category to improve a term's discriminating power appears to be inapplicable when classifying an unlabelled document, because the category of such a document is unknown. As a solution, we propose a k-NN-based approach that computes scores in order to resolve the problem of the unknown category.
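
The two-stage pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not define TF.IDF.TCR or the exact scoring rule, so this example is an assumption throughout: it uses plain TF.IDF weights (not the authors' scheme) and a simple score per category, taken as the sum of cosine similarities of the k nearest training documents. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def weigh(tokens, idf):
    """Stage 1: plain TF.IDF weights (assumption: NOT the paper's TF.IDF.TCR)."""
    tf = Counter(tokens)
    # Terms never seen in training have no IDF and are dropped.
    return {t: tf[t] * idf[t] for t in tf if t in idf}


def fit_tfidf(docs):
    """Compute IDF on the training documents and return their weighted vectors."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # The "+ 1.0" smoothing is an illustrative choice, not taken from the source.
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    return [weigh(doc, idf) for doc in docs], idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_score_classify(query_tokens, train_vectors, train_labels, idf, k=3):
    """Stage 2: score each category by the summed similarity of its neighbours."""
    query = weigh(query_tokens, idf)
    neighbours = sorted(
        ((cosine(query, vec), label)
         for vec, label in zip(train_vectors, train_labels)),
        reverse=True,
    )[:k]
    scores = defaultdict(float)
    for similarity, label in neighbours:
        scores[label] += similarity
    best = max(scores, key=scores.get)
    return best, dict(scores)
```

Because the query document is weighted without any reference to its own (unknown) category, this sketch sidesteps the problem the abstract raises; how the authors reintroduce category relevance through their scores is not stated in the abstract.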

Keywords: NLP; natural language processing; text categorisation; feature extraction; term weighting schemes; k-NN; K nearest neighbour; TF.IDF; TF.IDF.TCR; distance measures; text classification; category relevance weighting; unlabelled documents.

DOI: 10.1504/IJSISE.2016.078268

International Journal of Signal and Imaging Systems Engineering, 2016 Vol.9 No.4/5, pp.283 - 290

Received: 23 May 2015
Accepted: 05 Mar 2016

Published online: 10 Aug 2016
