Title: Extraction of the contents in the web texts by content-density distribution

Authors: Saori Kitahara; Koya Tamura; Kenji Hatano

Addresses: Graduate School of Culture and Information Science, Doshisha University, 1-3, Tatara Miyakodani, Kyotanabe, Kyoto 610-0394, Japan; UX Department, mixi Inc., 1-2-20, Higashi, Shibuya, Tokyo 150-0011, Japan; Faculty of Culture and Information Science, Doshisha University, 1-3, Tatara Miyakodani, Kyotanabe, Kyoto 610-0394, Japan

Abstract: When searching for information on the internet, users rely on the result snippets of a web search engine to grasp the content of web pages. However, these snippets are often too short for users to determine whether a web page is relevant. To address this problem, we propose a method for grasping the content of each web page and extracting the part of the page related to the query keywords. The method is more effective than conventional snippet-based methods because it regards the content as a set of words in the text of a web page and generates a content-density distribution from both the position and the influence of each word. Our experiments show that the method is useful for grasping the influence of the extracted web text.
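
The abstract does not give the formula behind the content-density distribution, so the following Python sketch is only an illustration of the general idea, not the authors' method. It assumes that each query keyword contributes a Gaussian bump centred at its token position and scaled by a weight standing in for the word's "influence", and that the extracted passage is the fixed-length token window with the largest summed density. The Gaussian kernel, the `bandwidth` and `span` parameters, the `weights` dictionary, and both function names are hypothetical choices made for this example.

```python
import math
import re


def content_density(tokens, weights, bandwidth=3.0):
    """Return one density value per token position.

    Assumption: every token whose lowercase form appears in `weights`
    adds a Gaussian bump centred at its own position, scaled by its
    weight (a stand-in for the word's influence).
    """
    density = [0.0] * len(tokens)
    for pos, tok in enumerate(tokens):
        w = weights.get(tok.lower(), 0.0)
        if w == 0.0:
            continue  # words without a weight contribute nothing
        for i in range(len(tokens)):
            density[i] += w * math.exp(-((i - pos) ** 2) / (2 * bandwidth ** 2))
    return density


def extract_densest_span(text, weights, span=8, bandwidth=3.0):
    """Return the `span`-token window with the largest summed density."""
    tokens = re.findall(r"\w+", text)
    if not tokens:
        return ""
    density = content_density(tokens, weights, bandwidth)
    span = min(span, len(tokens))
    # Score every window start and keep the one with the highest total.
    best_start = max(
        range(len(tokens) - span + 1),
        key=lambda s: sum(density[s:s + span]),
    )
    return " ".join(tokens[best_start:best_start + span])
```

Under these assumptions, a call such as `extract_densest_span(page_text, {"density": 1.0, "snippet": 0.5})` returns the region of `page_text` where the weighted query keywords cluster most tightly, which is the kind of query-related passage the abstract describes extracting.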

Keywords: web information retrieval; web page recognition; content density distribution; knowledge engineering; soft data paradigms; web content; web search engines; query keywords; web text extraction; internet.

DOI: 10.1504/IJKESDP.2011.045723

International Journal of Knowledge Engineering and Soft Data Paradigms, 2011 Vol.3 No.2, pp.108 - 120

Published online: 07 Mar 2015
