Title: Enhanced and combined centroid-based approach for multi-label genre classification of web pages
Authors: Chaker Jebari
Addresses: Department of Information Technology, College of Applied Sciences, Ibri, Sultanate of Oman
Abstract: This paper proposes an enhanced and combined centroid-based approach to classifying web pages by genre. To deal with the complexity of web pages, the approach implements a multi-label classification scheme in which a web page can be assigned to more than one genre. To handle the rapid evolution of web genres, it also classifies incrementally: web pages are processed one by one and, depending on the similarity between the new page and each genre centroid, the approach either adjusts the genre centroid or treats the new page as noise and discards it. Moreover, the approach combines three homogeneous centroid-based classifiers: contextual, logical and hyperlink classifiers. These classifiers exploit character n-grams extracted from different sources, namely the URL, title, headings and anchors. Experiments conducted on a known multi-label corpus show that the approach is very fast and outperforms many other multi-label classifiers.
Keywords: centroid-based classification; combination; genre classification; incremental classification; multi-label classification; web pages; logical classifiers; contextual classifiers; hyperlink classifiers.
International Journal of Metaheuristics, 2015 Vol.4 No.3/4, pp.220 - 243
Received: 07 Jan 2015
Accepted: 18 Oct 2015
Published online: 29 Jan 2016
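
Illustrative sketch: the incremental, multi-label centroid scheme summarised in the abstract can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the paper's implementation: the class name `IncrementalCentroidClassifier`, the `seed` helper, the cosine similarity measure, the fixed threshold of 0.3, the trigram size and the additive centroid update are all hypothetical choices, and the combination of the contextual, logical and hyperlink classifiers is not modelled here.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Return a Counter of the character n-grams of the lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class IncrementalCentroidClassifier:
    """Hypothetical single classifier; one centroid per genre."""

    def __init__(self, threshold=0.3, n=3):
        self.threshold = threshold  # assumed similarity cut-off
        self.n = n                  # assumed n-gram length
        self.centroids = {}         # genre -> summed n-gram counts

    def seed(self, genre, text):
        """Initialise or extend a genre centroid from a labelled text."""
        self.centroids.setdefault(genre, Counter()).update(
            char_ngrams(text, self.n))

    def classify(self, text):
        """Return every genre whose centroid is similar enough to the page.

        Matching centroids are adjusted with the new page. An empty
        result means the page is treated as noise and discarded.
        """
        vec = char_ngrams(text, self.n)
        # Score against all centroids first, so an update for one genre
        # cannot influence the decision for another on the same page.
        labels = [genre for genre, centroid in self.centroids.items()
                  if cosine(vec, centroid) >= self.threshold]
        for genre in labels:
            # Summing counts keeps the centroid direction equal to the
            # mean vector, since cosine similarity ignores scale.
            self.centroids[genre].update(vec)
        return labels


clf = IncrementalCentroidClassifier()
clf.seed("blog", "my personal blog post diary")
clf.seed("shop", "buy cheap products online shop cart")
labels = clf.classify("personal blog diary entry")
noise = clf.classify("zzzz qqqq xxxx")
```

In this sketch the input text stands in for whatever source a given classifier reads (URL, title, headings or anchors); a page that matches several centroids receives several genre labels, and a page that matches none leaves every centroid unchanged.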