DWSpyder: a new schema extraction method for a deep web integration system Online publication date: Tue, 08-Oct-2019
by Yasser Saissi; Ahmed Zellou; Ali Adri
International Journal of Web Engineering and Technology (IJWET), Vol. 14, No. 2, 2019
Abstract: The deep web is a huge part of the web that is not indexed by search engines. The deep web sources are accessible only through their associated access forms. We wish to use a web integration system to access the deep web sources and all of their information. To implement this web integration system, we need to know the schema description of each web source. The problem resolved in this paper is how to extract the schema describing an inaccessible deep web source. We propose our DWSpyder method as being able to extract the schema describing a deep web source despite its inaccessibility. The DWSpyder method starts with a static analysis of the deep web source access forms in order to extract the first elements of the associated schema description. The second step of our method is a dynamic analysis of these access forms using queries to enrich our schema description. Our DWSpyder method also uses a clustering algorithm to identify the possible values of deep web form fields with undefined sets of values. All of the information extracted is used by DWSpyder to generate automatically deep web source schema descriptions.
Existing subscribers:
Go to Inderscience Online Journals to access the Full Text of this article.
If you are not a subscriber and you just want to read the full contents of this article, buy online access here.Complimentary Subscribers, Editors or Members of the Editorial Board of the International Journal of Web Engineering and Technology (IJWET):
Login with your Inderscience username and password:
Want to subscribe?
A subscription gives you complete access to all articles in the current issue, as well as to all articles in the previous three years (where applicable). See our Orders page to subscribe.
If you still need assistance, please email subs@inderscience.com