Inderscience Publishers

A methodical approach to extracting interesting objects from dynamic web pages
by Ling Liu, David Buttler, James Caverlee, Calton Pu, Jianjun Zhang
International Journal of Web and Grid Services (IJWGS), Vol. 1, No. 2, 2005
Abstract: This paper presents a fully automated object extraction system for web documents. Our methodology consists of a layered framework and a set of algorithms. A distinct feature of our approach is the full automation of both the extraction of data object regions from dynamic web pages and the identification of the correct object-boundary separators. We implemented the methodology in the XWRAPElite object extraction system and evaluated the system using more than 3200 pages from 75 diverse websites. Our experiments show three important and interesting results. First, our algorithms for identifying the minimal object-rich subtree achieve a 96% success rate over all the web pages we have tested. Second, our algorithms for discovering and extracting object separator tags reach a success rate of 95%. Most significantly, the overall system achieves a precision between 96% and 100% (it returns only correct objects) and excellent recall (between 95% and 96%, with very few significant objects left out). The minimal subtree extraction algorithms and the object-boundary identification algorithms are fast, taking about 87 milliseconds per page at an average page size of 30 KB.
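To make the precision and recall figures in the abstract concrete, here is a minimal sketch of how the two measures are conventionally computed for an extractor's output against a hand-labelled set of objects. It is not taken from the paper: the function name `precision_recall`, the sample object identifiers, and the handling of empty sets are all illustrative assumptions.

```python
def precision_recall(extracted, expected):
    """Return (precision, recall) for extracted objects versus a labelled set.

    precision = correct extracted objects / all extracted objects
    recall    = correct extracted objects / all labelled objects
    """
    extracted, expected = set(extracted), set(expected)
    correct = extracted & expected
    # Assumed convention for this sketch: an empty denominator scores 1.0.
    precision = len(correct) / len(extracted) if extracted else 1.0
    recall = len(correct) / len(expected) if expected else 1.0
    return precision, recall


# Hypothetical page with 20 labelled objects; the extractor returns
# 19 of them plus one spurious region (identifiers are made up).
expected = {f"item-{i}" for i in range(20)}
extracted = {f"item-{i}" for i in range(19)} | {"nav-banner"}

p, r = precision_recall(extracted, expected)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this made-up case both values are 0.95: one of the 20 returned objects is wrong, and one of the 20 labelled objects is missed.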
