Title: A methodical approach to extracting interesting objects from dynamic web pages

Authors: Ling Liu, David Buttler, James Caverlee, Calton Pu, Jianjun Zhang

Addresses: College of Computing, Georgia Institute of Technology, USA (all authors)

Abstract: This paper presents a fully automated object extraction system for web documents. Our methodology consists of a layered framework and a set of algorithms. A distinct feature of our approach is the full automation of both the extraction of data object regions from dynamic web pages and the identification of the correct object-boundary separators. We implemented the methodology in the XWRAPElite object extraction system and evaluated the system using more than 3200 pages over 75 diverse websites. Our experiments show three important and interesting results. First, our algorithms for identifying the minimal object-rich subtree achieve a 96% success rate over all the web pages we have tested. Second, our algorithms for discovering and extracting object separator tags reach a success rate of 95%. Most significantly, the overall system achieves a precision between 96% and 100% (it returns only correct objects) and excellent recall (between 95% and 96%, with very few significant objects left out). The minimal subtree extraction algorithms and the object-boundary identification algorithms are fast, taking about 87 milliseconds per page with an average page size of 30 KB.

Keywords: semantic web mining; object extraction; web information retrieval; web documents; document information retrieval; object-boundary separators; data object regions; web search.

DOI: 10.1504/IJWGS.2005.008319

International Journal of Web and Grid Services, 2005 Vol.1 No.2, pp.165 - 195

Published online: 02 Dec 2005
