Title: A negative category based approach for Wikipedia document classification

Authors: Meenakshi Sundaram Murugeshan, K. Lakshmi, Saswati Mukherjee

Addresses: Department of Computer Science and Engineering, College of Engineering, Guindy, Anna University, Chennai-600025, India (all three authors)

Abstract: Profile-based methods have been used successfully to classify unstructured text. This paper presents a profile-based method for classifying Wikipedia XML documents, in which the profiles are built using negative category information. Our approach exploits the structure of Wikipedia documents to build two class profiles: one based on the whole content of the documents and the other based on their initial descriptions. We also explore using the terms in section and subsection titles. The effectiveness of the cosine and fractional similarity measures in classifying XML documents is compared. Experiments show that combining the two profile-based classifiers works better than using either classifier alone.
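
The abstract does not spell out how the profiles are constructed or scored, so the following is only a minimal sketch of one possible reading, not the authors' method. It assumes that a class profile is the summed term-frequency vector of that class's training documents, that the "negative" profile of a class is the summed vector of all other classes, and that a document is scored by its cosine similarity to the positive profile minus its cosine similarity to the negative profile. The function names (`term_vector`, `build_profiles`, `classify`) and the subtraction rule are hypothetical. The fractional similarity measure is omitted because its definition is not given here.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term frequencies for a whitespace-tokenized text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_profiles(training):
    """Map each category to a (positive, negative) pair of profiles.

    ASSUMPTION: the positive profile sums the term vectors of the
    category's own documents; the negative profile sums the term
    vectors of every other category's documents.
    """
    positive = {
        category: sum((term_vector(doc) for doc in docs), Counter())
        for category, docs in training.items()
    }
    profiles = {}
    for category in positive:
        negative = Counter()
        for other, vec in positive.items():
            if other != category:
                negative.update(vec)
        profiles[category] = (positive[category], negative)
    return profiles


def classify(text, profiles):
    """Pick the category with the highest positive-minus-negative score.

    ASSUMPTION: subtracting the negative-profile similarity is an
    illustrative scoring rule, not one taken from the paper.
    """
    vec = term_vector(text)
    scores = {
        category: cosine(vec, pos) - cosine(vec, neg)
        for category, (pos, neg) in profiles.items()
    }
    return max(scores, key=scores.get)


# Toy data, invented purely for illustration.
training = {
    "sports": ["football match goal team", "team wins league match"],
    "music": ["guitar band album song", "song tops album chart"],
}
profiles = build_profiles(training)
label = classify("the band released a new album", profiles)
```

Under the same assumptions, a second set of profiles could be built from only the initial description of each document, and the two classifiers' scores combined, for example by averaging.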

Keywords: XML classification; profile creation; negative categories; similarity measures; feature selection; initial descriptions; Wikipedia documents; document classification; unstructured text; cosine; fractional similarity.

DOI: 10.1504/IJKEDM.2010.032582

International Journal of Knowledge Engineering and Data Mining, 2010 Vol.1 No.1, pp.84 - 97

Published online: 08 Apr 2010
