Title: Automated data extraction from the web with conditional models

Authors: Xuan-Hieu Phan, Susumu Horiguchi, Tu-Bao Ho

Addresses: Graduate School of Information Science, Japan Advanced Institute of Science and Technology (JAIST), 1-1, Asahidai, Nomi, Ishikawa 923-1292, Japan; Graduate School of Information Sciences, Tohoku University, Aoba 6-3-09, Sendai 980-8579, Japan; Graduate School of Knowledge Science, Japan Advanced Institute of Science and Technology (JAIST), 1-1, Asahidai, Nomi, Ishikawa 923-1292, Japan

Abstract: Extracting data from the Web is an important information extraction task. Most existing approaches rely on wrappers, which require human knowledge and user interaction during extraction. This paper proposes conditional models as an alternative solution to this task. Drawing on the strengths of conditional models such as maximum entropy and maximum entropy Markov models, our method offers three major advantages: full automation, the ability to incorporate various non-independent, overlapping features of different hypertext representations, and the ability to deal with missing and disordered data fields. Experimental results on a wide range of e-commerce websites with different layouts show that our method can achieve a satisfactory trade-off between automation and accuracy, and can provide a practical application of automated data extraction from the Web.

Keywords: web mining; information extraction; statistical machine learning; maximum entropy; maximum entropy Markov model; conditional models; data mining; automatic extraction; data extraction; hypertext representations; e-commerce; electronic commerce.

DOI: 10.1504/IJBIDM.2005.008362

International Journal of Business Intelligence and Data Mining, 2005, Vol. 1, No. 2, pp. 194-209

Published online: 08 Dec 2005
