
Title: A parallel ACO algorithm to select terms to categorise longer documents

Authors: M. Janaki Meena; K.R. Chandran; A. Karthik; A. Vijay Samuel

Addresses: Department of CSE, PSG College of Technology, Coimbatore – 641004, Tamilnadu, India; Department of IT, PSG College of Technology, Coimbatore – 641004, Tamilnadu, India; Department of CSE, PSG College of Technology, Coimbatore – 641004, Tamilnadu, India; Department of CSE, PSG College of Technology, Coimbatore – 641004, Tamilnadu, India

Abstract: Text categorisation (TC) is the task of assigning predefined categories to text. The primary step in TC is to transform documents into a representation suitable for machine learning algorithms. Bag of Words is the most popular document representation. Most machine learning algorithms are sensitive to the features fed into them and are misled by the high dimensionality of text. Feature selection (FS) is an important preprocessing step that removes redundant and irrelevant terms from the training corpus. This paper proposes an ant colony optimisation (ACO) algorithm to select features for categorising longer documents whose categories are closely related. The heuristic value for each word is computed from the statistical dependency of the term on a category and its compactness value. The compactness of a term indicates its spread in a document. Experiments were conducted with documents from the 20 Newsgroups and Reuters-21578 benchmarks. The selected features were fed into a naïve Bayes classifier and its performance was analysed. The performance of the classifier was observed to improve with the features selected by the proposed method. The processes involved in the algorithm are time-intensive and demand parallelism. Hence the ACO algorithm was parallelised using the MapReduce programming model. The parallel algorithm was implemented and tested on a cluster of six machines formed using Hadoop.
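
Illustration: the abstract gives no formulas, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes a plain chi-square statistic as a stand-in for the term-to-category dependency (the keywords mention CHIR, a variant that is not reproduced here), a hypothetical compactness measure (the fraction of a document spanned between a term's first and last occurrence), a heuristic formed as the product of the two, and the sum of heuristic values as a placeholder fitness where the paper evaluates a classifier. All function and parameter names are invented for illustration, and the code runs serially with no MapReduce.

```python
import random


def chi_square(term, category, docs):
    """Plain chi-square dependency between a term and a category.

    docs is a list of (tokens, label) pairs. Stand-in for CHIR (assumption).
    """
    n = len(docs)
    a = sum(1 for toks, lab in docs if term in toks and lab == category)
    b = sum(1 for toks, lab in docs if term in toks and lab != category)
    c = sum(1 for toks, lab in docs if term not in toks and lab == category)
    d = n - a - b - c
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def spread(term, tokens):
    """Hypothetical compactness: share of the document between the first
    and last occurrence of the term (0.0 if the term is absent)."""
    positions = [i for i, t in enumerate(tokens) if t == term]
    if not positions:
        return 0.0
    return (positions[-1] - positions[0] + 1) / len(tokens)


def heuristic(term, category, docs):
    """Assumed combination: chi-square times mean spread over documents
    that contain the term."""
    spreads = [spread(term, toks) for toks, _ in docs if term in toks]
    mean_spread = sum(spreads) / len(spreads) if spreads else 0.0
    return chi_square(term, category, docs) * mean_spread


def aco_select(docs, category, k, ants=10, iterations=20,
               alpha=1.0, beta=1.0, rho=0.1, seed=0):
    """Select k terms with a basic ant-colony loop (serial sketch)."""
    rng = random.Random(seed)
    vocab = sorted({t for toks, _ in docs for t in toks})
    # Small epsilon keeps every sampling weight strictly positive.
    eta = {t: heuristic(t, category, docs) + 1e-9 for t in vocab}
    tau = {t: 1.0 for t in vocab}  # pheromone per term
    best, best_score = [], float("-inf")
    for _ in range(iterations):
        for _ in range(ants):
            remaining = list(vocab)
            subset = []
            # Each ant builds a subset by weighted sampling without replacement.
            while remaining and len(subset) < k:
                weights = [tau[t] ** alpha * eta[t] ** beta for t in remaining]
                pick = rng.choices(remaining, weights=weights, k=1)[0]
                subset.append(pick)
                remaining.remove(pick)
            score = sum(eta[t] for t in subset)  # placeholder fitness
            if score > best_score:
                best, best_score = subset, score
        for t in vocab:          # evaporation
            tau[t] *= 1.0 - rho
        for t in best:           # reinforce the best subset found so far
            tau[t] += 1.0
    return sorted(best)


docs = [
    ("goal match team goal".split(), "sport"),
    ("team goal win".split(), "sport"),
    ("market share price".split(), "finance"),
    ("price market team".split(), "finance"),
]
selected = aco_select(docs, "sport", k=2)
```

In this sketch each ant's subset construction is independent of the others within an iteration, which is the kind of work that a MapReduce job could distribute across machines; how the paper actually partitions the computation is not stated in the abstract.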

Keywords: Bag of Words; metaheuristics; ant colony optimisation; ACO; CHIR; parallel algorithms; map reduce; longer documents; text categorisation; machine learning; feature selection; document classification.

DOI: 10.1504/IJCSE.2011.043923

International Journal of Computational Science and Engineering, 2011 Vol.6 No.4, pp.238 - 248

Received: 31 Dec 2010
Accepted: 03 Jul 2011
Published online: 21 Mar 2015 *


