Unstructured data extraction of Chinese expert web page
by Xudong Hong; Tao Shen; Longhua Shen; Zhengtao Yu; Jianyi Guo
International Journal of Wireless and Mobile Computing (IJWMC), Vol. 7, No. 2, 2014

Abstract: Traditional methods for extracting unstructured data from expert pages require substantial human intervention. To address this, this paper proposes a method that automatically detects the data template from the similarities and differences between HTML tags and strings, and uses lattice theory to locate the data grid region that stores the unstructured expert data, from which the data are then extracted. First, with the help of a classifier for Chinese expert entity homepages, an expert web crawler acquires a large number of expert pages. Second, the expert pages are divided into two types, list type and document type, and the unstructured data are extracted from each type separately. Finally, extraction experiments are conducted on the different types of web pages using an improved version of the open-source RoadRunner code. Experimental results show that, in the unsupervised setting, the method extracts unstructured web data from Chinese expert pages effectively.
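
The template-detection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and not RoadRunner's matching algorithm: it substitutes Python's standard difflib.SequenceMatcher for the alignment step, and it assumes two pages generated from the same template. Tokens that align across both pages are treated as template; text tokens that differ are treated as candidate data fields. The names Tokenizer, tokenize, extract_fields and the sample pages are hypothetical.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class Tokenizer(HTMLParser):
    """Flatten a page into a sequence of tag tokens and text tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Attributes are ignored in this sketch; only the tag name is kept.
        self.tokens.append(("tag", f"<{tag}>"))

    def handle_endtag(self, tag):
        self.tokens.append(("tag", f"</{tag}>"))

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.tokens.append(("text", text))


def tokenize(html):
    parser = Tokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def extract_fields(page_a, page_b):
    """Align two pages; return the text tokens found where they differ.

    Each result item is a pair (texts_from_a, texts_from_b) for one
    mismatching region. Matching regions are assumed to be template.
    """
    a, b = tokenize(page_a), tokenize(page_b)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    fields = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            continue  # shared tags and strings: part of the template
        left = [value for kind, value in a[i1:i2] if kind == "text"]
        right = [value for kind, value in b[j1:j2] if kind == "text"]
        if left or right:
            fields.append((left, right))
    return fields


# Hypothetical sample pages sharing one template.
page_a = ("<div><h1>Expert profile</h1><p>Name</p><p>Expert A</p>"
          "<p>Affiliation</p><p>Example University A</p></div>")
page_b = ("<div><h1>Expert profile</h1><p>Name</p><p>Expert B</p>"
          "<p>Affiliation</p><p>Example University B</p></div>")

fields = extract_fields(page_a, page_b)
for left, right in fields:
    print(left, right)
```

In this sketch the labels "Name" and "Affiliation" align in both pages and so fall into the template, while the values that follow them surface as data fields. Locating the data grid region with lattice theory, as the abstract describes, is not covered here.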

Online publication date: Fri, 31-Oct-2014
