Title: Unstructured data extraction of Chinese expert web page

 

Authors: Xudong Hong; Tao Shen; Longhua Shen; Zhengtao Yu; Jianyi Guo

 

Addresses:
The School of Information Engineering and Automation, Key Laboratory of Intelligent Information Processing, Kunming University of Science and Technology, Kunming, Yunnan, China
The School of Material Science and Engineering, Kunming University of Science and Technology, Kunming, Yunnan, China
China Research and Development Academy of Machinery Equipment, Beijing, China
The School of Information Engineering and Automation, Key Laboratory of Intelligent Information Processing, Kunming University of Science and Technology, Kunming, Yunnan, China
The School of Information Engineering and Automation, Key Laboratory of Intelligent Information Processing, Kunming University of Science and Technology, Kunming, Yunnan, China

 

Abstract: Traditional methods for extracting unstructured data from expert web pages require a great deal of human intervention. To address this problem, this paper proposes a method that automatically detects the data template based on the similarities and differences between HTML tags and strings, and uses lattice theory to locate the data grid region that stores the unstructured expert data, thereby giving access to that data. First, with the help of a classifier for Chinese expert entity homepages, an expert web crawler acquires a large number of expert pages. Second, the expert pages are divided into two types, list type and document type, and the unstructured data are extracted from each type separately. Finally, extraction experiments are conducted on the different types of web pages using an improved version of the open-source Roadrunner code. The experimental results show that, in the unsupervised setting, the method extracts unstructured web data from Chinese expert pages effectively.
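
The template-detection idea in the abstract (comparing HTML tags and strings across pages) can be illustrated with a minimal sketch. This is not the authors' implementation and not Roadrunner's actual algorithm: it assumes that two pages generated from the same template can be aligned token by token, and it uses Python's standard difflib as a stand-in for the matching step. The names Tokenizer, tokenize, infer_template and the "#DATA" placeholder are hypothetical, and the lattice-based location of the data grid region is not modelled here.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class Tokenizer(HTMLParser):
    """Flatten a page into a list of (kind, value) tokens: tags and text strings."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Attributes are ignored in this sketch; only the tag name is kept.
        self.tokens.append(("tag", f"<{tag}>"))

    def handle_endtag(self, tag):
        self.tokens.append(("tag", f"</{tag}>"))

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only text nodes
            self.tokens.append(("text", text))


def tokenize(html):
    parser = Tokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def infer_template(page_a, page_b):
    """Align two pages; shared tokens form the template, mismatches become data fields."""
    a, b = tokenize(page_a), tokenize(page_b)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    template, fields = [], []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            # Tokens common to both pages are assumed to belong to the template.
            template.extend(value for _, value in a[i1:i2])
        else:
            # Differing spans are assumed to hold page-specific data.
            template.append("#DATA")
            fields.append((
                [v for k, v in a[i1:i2] if k == "text"],
                [v for k, v in b[j1:j2] if k == "text"],
            ))
    return template, fields
```

Given two expert pages that differ only in the expert's name, infer_template returns the shared tag and label sequence with a "#DATA" slot where the names differ, together with the pair of extracted strings. Real expert pages would need more than two samples and handling of optional and repeated regions, which this sketch leaves out.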

 

Keywords: expert web pages; unsupervised; lattice theory; unstructured data; data extraction; Roadrunner; Chinese experts.

 

DOI: 10.1504/IJWMC.2014.059709

 

Int. J. of Wireless and Mobile Computing, 2014 Vol.7, No.2, pp.132 - 136

 

Submission date: 17 May 2013
Date of acceptance: 03 Jul 2013
Available online: 06 Mar 2014

 

 
