Title: CWrap: web wrapping using context variables

Authors: Ahmad Pouramini

Addresses: Computer Engineering Department, Sirjan University of Technology, Iran

Abstract: A procedure that extracts data from a data source is called a wrapper. In some applications, identifying the desired data is better served by a wrapping language than by an unsupervised method. In this paper, we propose a novel wrapping language, called CWrap. In this language, various types of features (syntactical, semantic, visual and densitometric) can be employed in the extraction rules to identify the items of interest. Moreover, the context in which the desired items appear is specified using variables called context variables. Context variables enable the user to express different types of contextual dependencies (structural, visual and semantic) in a consistent way. They are set under certain conditions by one rule and are used later to form the contextual conditions for another extraction rule. This allows the user to organise the extraction rules in a hierarchical structure, from general to more specific rules. We also present a visual development toolkit which enables the user to develop and debug a wrapper visually and to assemble it incrementally.
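
The abstract does not show CWrap syntax, so the following is only a minimal Python sketch of the general idea of context variables: one rule sets a variable when its condition holds, and a later, more specific rule uses that variable as a contextual condition. The names `Rule`, `run_rules`, `NODES` and `RULES`, and the sample data, are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rule:
    """A hypothetical extraction rule (illustration only, not CWrap syntax)."""
    name: str
    # Condition over the node text and the current context variables.
    matches: Callable[[str, dict], bool]
    # Context variables to set when the rule fires (optional).
    sets: Optional[Callable[[str], dict]] = None
    # Value to extract when the rule fires (optional).
    extract: Optional[Callable[[str], str]] = None


def run_rules(nodes, rules):
    """Apply the rules to each node in document order, sharing one context."""
    context = {}
    results = []
    for node in nodes:
        for rule in rules:
            if not rule.matches(node, context):
                continue
            if rule.sets:
                context.update(rule.sets(node))
            if rule.extract:
                results.append((rule.name, rule.extract(node)))
            break  # the first matching rule handles the node
    return results


def value_after_colon(text):
    """Return the text after the first colon, stripped of whitespace."""
    return text.split(":", 1)[1].strip()


# Assumed toy input: a flat sequence of text nodes from a page.
NODES = ["Section: Laptops", "Price: 900", "Section: Phones", "Price: 300"]

RULES = [
    # General rule: remember which section we are in.
    Rule(
        name="section",
        matches=lambda text, ctx: text.startswith("Section:"),
        sets=lambda text: {"section": value_after_colon(text)},
    ),
    # Specific rule: extract a price only in the "Laptops" context.
    Rule(
        name="laptop_price",
        matches=lambda text, ctx: text.startswith("Price:")
        and ctx.get("section") == "Laptops",
        extract=value_after_colon,
    ),
]

if __name__ == "__main__":
    print(run_rules(NODES, RULES))
```

In this sketch the second price is skipped because the `section` variable has been reset to another value by the time that node is reached, which mirrors the general-to-specific rule organisation described above.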

Keywords: web mining; web data extraction; information extraction; web wrappers; wrapping languages.

DOI: 10.1504/IJKWI.2016.084756

International Journal of Knowledge and Web Intelligence, 2016 Vol.5 No.4, pp.304 - 318

Received: 09 Jul 2016
Accepted: 11 Jan 2017

Published online: 25 Jun 2017
