Title: DWSpyder: a new schema extraction method for a deep web integration system

Authors: Yasser Saissi; Ahmed Zellou; Ali Adri

Addresses: ENSIAS, Mohammed V University in Rabat, Rabat, Morocco; ENSIAS, Mohammed V University in Rabat, Rabat, Morocco; ENSIAS, Mohammed V University in Rabat, Rabat, Morocco

Abstract: The deep web is the large part of the web that search engines do not index. Deep web sources are accessible only through their associated access forms. We want to use a web integration system to access deep web sources and all of their information. Implementing such a system requires the schema description of each web source. This paper addresses the problem of extracting the schema that describes an inaccessible deep web source. We propose the DWSpyder method, which extracts the schema describing a deep web source despite its inaccessibility. DWSpyder starts with a static analysis of the source's access forms to extract the first elements of the associated schema description. The second step is a dynamic analysis of these access forms, which uses queries to enrich the schema description. DWSpyder also uses a clustering algorithm to identify the possible values of form fields whose value sets are undefined. DWSpyder then uses all of the extracted information to automatically generate deep web source schema descriptions.
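
Illustration: the static analysis step described in the abstract can be pictured with a minimal sketch. It is not the authors' implementation, whose details the abstract does not give. It assumes the access form is available as an HTML string and uses only Python's standard html.parser module. The names FormFieldExtractor and extract_form_schema are hypothetical. Fields that end up with an empty value list correspond to the "undefined sets of values" that the abstract says are handled later by dynamic analysis and clustering.

```python
from html.parser import HTMLParser


class FormFieldExtractor(HTMLParser):
    """Hypothetical helper: collect field names and predefined values from a form."""

    # Input types that trigger actions rather than carry schema attributes.
    SKIPPED_INPUT_TYPES = {"submit", "button", "reset", "image"}

    def __init__(self):
        super().__init__()
        self.fields = {}              # field name -> list of predefined values
        self._current_select = None   # name of the <select> being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("name")
        if tag == "input" and name:
            if attrs.get("type", "text").lower() not in self.SKIPPED_INPUT_TYPES:
                # Free-text inputs expose no value set; keep an empty list.
                self.fields.setdefault(name, [])
        elif tag == "select" and name:
            self._current_select = name
            self.fields.setdefault(name, [])
        elif tag == "option" and self._current_select is not None:
            value = attrs.get("value")
            if value:  # ignore placeholder options with an empty value
                self.fields[self._current_select].append(value)

    def handle_endtag(self, tag):
        if tag == "select":
            self._current_select = None


def extract_form_schema(html):
    """Return a first schema sketch: {field name: predefined values}."""
    parser = FormFieldExtractor()
    parser.feed(html)
    return parser.fields
```

In this sketch, a select field yields its option values directly, while a text field yields an empty list, marking it as a candidate for the query-based enrichment step.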

Keywords: web integration; schema extraction; deep web; clustering.

DOI: 10.1504/IJWET.2019.102872

International Journal of Web Engineering and Technology, 2019 Vol.14 No.2, pp.122-150

Published online: 03 Oct 2019

