Title: Research on crawling mechanism and policy for crawling product information from mobile internet

Authors: Shu Wang; Jia Chen; Chonghuan Xu

Addresses: School of Computer and Information Engineering, Zhejiang Gongshang University, Hangzhou, China; Ever Maple Food Science and Technology Co., Ltd, Hongsheng Group, Hangzhou, China; School of Business Administration, Contemporary Business and Trade Research Center, Contemporary Business and Collaborative Innovation Research Center, Zhejiang Gongshang University, Hangzhou, China

Abstract: Product information on the mobile internet is growing fast in volume and becoming harder to acquire. Companies tend to deliver product information on well-tuned mobile websites or on websites that are responsive to various mobile devices. Such a site is more of a web app than a traditional website, and we call it a rich internet application (RIA). In RIAs, information is hidden from search engine spiders in the deep web by HTML5, Ajax and other scripting techniques; user interactions are needed to trigger prescribed events in a certain order to reveal the full information we need. In this paper, we identify the crux of the problem as how to provide the mechanism to parse the scripts and manipulate the document object model (DOM), and the policy to trigger user events and run the scrape process. A new mechanism and policy were formulated based on web crawler techniques and studies of Ajax-specific web crawlers. By remodelling web pages, redesigning the architecture of the web crawler and refining the scrape algorithm, we successfully scraped product data from mobile internet RIAs.
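
The idea of triggering user events in a prescribed order to reach hidden content can be illustrated with a minimal sketch. This is not the authors' mechanism or policy: it assumes a hypothetical toy application (`TOY_APP`) in which each state is a DOM snapshot stored as a plain string and each user event maps one state to another, and it uses a simple breadth-first policy with DOM fingerprinting to skip states already seen. All names (`TOY_APP`, `dom_fingerprint`, `crawl`, the event labels) are illustrative assumptions.

```python
from collections import deque
import hashlib

# Hypothetical toy RIA (an assumption for illustration only): each state holds
# a DOM snapshot as a string and the user events that lead to other states.
TOY_APP = {
    "home": {
        "dom": "<ul id='list'></ul>",
        "events": {"tap:load-more": "list"},
    },
    "list": {
        "dom": "<ul id='list'><li>Tea</li></ul>",
        "events": {"tap:item-1": "detail", "tap:back": "home"},
    },
    "detail": {
        "dom": "<div class='product'>Tea, 12.50</div>",
        "events": {"tap:back": "list"},
    },
}


def dom_fingerprint(dom):
    """Hash a DOM snapshot so that revisited states can be detected."""
    return hashlib.sha1(dom.encode("utf-8")).hexdigest()


def crawl(app, start):
    """Breadth-first exploration of states reachable through user events.

    Returns a list of (state, event_path) pairs in discovery order, where
    event_path is the ordered list of events that reveals that state.
    """
    seen = set()
    discovered = []
    queue = deque([(start, [])])
    while queue:
        state, path = queue.popleft()
        fingerprint = dom_fingerprint(app[state]["dom"])
        if fingerprint in seen:
            continue  # this DOM was already scraped via a shorter path
        seen.add(fingerprint)
        discovered.append((state, path))
        # Sort events so the exploration order is deterministic.
        for event, target in sorted(app[state]["events"].items()):
            queue.append((target, path + [event]))
    return discovered


if __name__ == "__main__":
    for state, path in crawl(TOY_APP, "home"):
        print(state, path)
```

In a real crawler the toy dictionary would be replaced by a scriptable browser that executes the page's scripts and exposes the live DOM; the breadth-first policy here is only one possible choice.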

Keywords: crawler; scrape data; mobile internet; rich internet application; RIA; product information.

DOI: 10.1504/IJCSM.2017.088946

International Journal of Computing Science and Mathematics, 2017 Vol.8 No.6, pp.506 - 525

Received: 24 Jun 2016
Accepted: 29 Sep 2016

Published online: 03 Jan 2018