Title: Partial retrieval of compressed semi-structured documents

Authors: Ashutosh Gupta, Suneeta Agarwal

Addresses: Department of Computer Science & Information Technology, Institute of Engineering and Technology, MJP Rohilkhand University, Bareilly, India; Department of Computer Science & Engineering, MNNIT, Allahabad, India

Abstract: We describe a compression model for semi-structured documents called the tri-structural contexts model (TSCM). The idea is that separating start tags, attribute name/value pairs and textual words may reduce the entropy. Attributes are combined with their values and stored in a separate container. We focus mainly on semi-static models and test the idea using a word-based tagged code, which allows random access and partial decompression of the compressed collection. Compression time is found to be better than that of scmhuff, and decompression time is observed to be much lower than that of scmhuff and xmlppm. The short partial-decompression time supports using the TSC model to keep semi-structured documents compressed at all times. The algorithm and the proposed model are useful in information retrieval systems.
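
As a rough illustration of why a word-based tagged code permits partial decompression, the sketch below encodes each container with its own frequency-ranked vocabulary and decodes a few tokens starting from an arbitrary byte offset. It is only a minimal sketch under stated assumptions: the byte code (base-128 digits with a tag bit on the final byte), the function names `encode_rank`, `compress` and `decode_from`, and the sample containers are all hypothetical and are not taken from the paper, whose actual code and container layout may differ.

```python
from collections import Counter


def encode_rank(rank):
    """Base-128 codeword whose final byte carries a set high bit (the tag)."""
    digits = [rank % 128]
    rank //= 128
    while rank:
        digits.append(rank % 128)
        rank //= 128
    digits.reverse()
    digits[-1] |= 0x80  # the tag marks the end of the codeword
    return bytes(digits)


def compress(tokens):
    """Semi-static scheme: rank tokens by frequency, then emit one codeword each."""
    vocab = [tok for tok, _ in Counter(tokens).most_common()]
    rank_of = {tok: i for i, tok in enumerate(vocab)}
    data = b"".join(encode_rank(rank_of[tok]) for tok in tokens)
    return vocab, data


def decode_from(vocab, data, start, count):
    """Decode up to `count` tokens from the first codeword boundary at or after `start`."""
    pos = start
    # Resynchronise: a codeword always begins right after a tagged byte.
    while 0 < pos < len(data) and not data[pos - 1] & 0x80:
        pos += 1
    out, value = [], 0
    while pos < len(data) and len(out) < count:
        byte = data[pos]
        value = value * 128 + (byte & 0x7F)
        if byte & 0x80:  # tagged byte: the codeword is complete
            out.append(vocab[value])
            value = 0
        pos += 1
    return out


# Hypothetical sample: three containers, each compressed with its own vocabulary.
containers = {
    "tags": ["book", "title", "author", "book", "title", "author"],
    "attributes": ['id="b1"', 'lang="en"', 'id="b2"', 'lang="en"'],
    "text": "the art of the code by the author".split(),
}
compressed = {name: compress(toks) for name, toks in containers.items()}

# Decode two words from the middle of the text container only.
vocab, data = compressed["text"]
print(decode_from(vocab, data, 3, 2))
```

Because the tag bit identifies codeword boundaries, decoding can begin at any byte offset without touching the preceding data or the other containers, which is the property that makes partial retrieval cheap in this kind of scheme.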

Keywords: text compression; semi-structured documents; word-based tagged code; compression models; information retrieval.

DOI: 10.1504/IJCAT.2010.034524

International Journal of Computer Applications in Technology, 2010 Vol.38 No.4, pp.239 - 249

Published online: 07 Aug 2010
