Title: Enhanced and combined centroid-based approach for multi-label genre classification of web pages

Authors: Chaker Jebari

Addresses: Department of Information Technology, College of Applied Sciences, Ibri, Sultanate of Oman

Abstract: This paper proposes an enhanced and combined centroid-based approach to classify web pages by genre. To deal with the complexity of web pages, the proposed approach implements a multi-label classification scheme in which a web page can be assigned to more than one genre. In addition, it implements incremental classification to handle the rapid evolution of web genres. In this scheme, web pages are classified one by one: according to the similarity between the new page and each genre centroid, the approach either adjusts the genre centroid or treats the new page as a noise page and discards it. Moreover, the approach combines three homogeneous centroid-based classifiers: contextual, logical and hyperlink classifiers. These classifiers exploit character n-grams extracted from different sources, namely the URL, title, headings and anchors. Experiments conducted on a known multi-label corpus show that the approach is very fast and outperforms many other multi-label classifiers.
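
The incremental, multi-label step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the cosine similarity measure, the trigram size, the fixed threshold of 0.3, the running-mean centroid update, and all names (char_ngrams, GenreCentroid, classify_incremental) are assumptions made for illustration only. The combination of the contextual, logical and hyperlink classifiers is not shown.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a lower-cased string."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class GenreCentroid:
    """Running mean of the n-gram vectors assigned to one genre (assumed update rule)."""

    def __init__(self):
        self.sums = Counter()
        self.count = 0

    def add(self, vec):
        self.sums.update(vec)
        self.count += 1

    def vector(self):
        if self.count == 0:
            return {}
        return {k: v / self.count for k, v in self.sums.items()}


def classify_incremental(page_text, centroids, threshold=0.3, n=3):
    """Assign one page to every sufficiently similar genre and adjust those centroids.

    Returns the list of matching genres. An empty list means the page is
    treated as noise and discarded, leaving all centroids unchanged.
    The threshold value is a hypothetical choice, not taken from the paper.
    """
    vec = char_ngrams(page_text, n)
    labels = []
    for genre, centroid in centroids.items():
        if cosine(vec, centroid.vector()) >= threshold:
            centroid.add(vec)  # adjust the centroid with the new page
            labels.append(genre)
    return labels
```

In this sketch a page can receive several genre labels because each centroid is tested independently, and a page that matches no centroid changes nothing. In practice the text passed in would be built from the URL, title, headings or anchors, depending on the classifier.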

Keywords: centroid-based classification; combination; genre classification; incremental classification; multi-label classification; web pages; logical classifiers; contextual classifiers; hyperlink classifiers.

DOI: 10.1504/IJMHEUR.2015.074426

International Journal of Metaheuristics, 2015 Vol.4 No.3/4, pp.220 - 243

Available online: 29 Jan 2016
