Title: A study on disk index design for large scale de-duplication storage systems

Authors: Tian-Ming Yang; Dan Feng; Wen-Kuang Chou; Jing-Ning Liu

Addresses: International College, Huanghuai University, Henan, 463000, China; Wuhan National Laboratory for Optoelectronics, College of Computer Science and Technology, HUST, Wuhan, 430074, China; Department of Computer Science and Information Management, Providence University, Taichung, 43301, Taiwan; Wuhan National Laboratory for Optoelectronics, College of Computer Science and Technology, HUST, Wuhan, 430074, China

Abstract: Chunk-based de-duplication storage, which aims to optimise storage or bandwidth usage by eliminating duplicate chunks at the inter-file level, has recently received broad attention in both academia and industry. For a petabyte-scale de-duplication storage system, the metadata storage, especially the disk index, which maps fingerprints to their corresponding chunks in the system, can reach terabyte-scale size. In this paper, we propose a disk-resident hash table to implement the disk index, and we study the probability of hash table overflow both theoretically and through extensive experiments. These studies help us design a space-efficient disk index that not only reduces metadata storage but also improves access performance.
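
The overflow question raised in the abstract can be illustrated with a minimal sketch. It is not the paper's own analysis: it assumes fingerprints hash uniformly into fixed-capacity buckets, so the number of entries in a bucket is approximately Poisson distributed with mean equal to the average load. The function names, parameters, and the sample figures are hypothetical and chosen only for illustration.

```python
import math

def bucket_overflow_probability(load: float, capacity: int) -> float:
    """Approximate P(X > capacity) for X ~ Poisson(load).

    Assumption (not from the paper): uniform hashing, so the number of
    fingerprints landing in one bucket is roughly Poisson with mean `load`.
    The upper tail is summed directly to keep precision for tiny values.
    """
    if load <= 0:
        return 0.0
    log_load = math.log(load)
    total = 0.0
    k = capacity + 1
    while True:
        # Poisson pmf at k, computed in log space to avoid overflow.
        term = math.exp(-load + k * log_load - math.lgamma(k + 1))
        total += term
        # Past the mode the terms shrink; stop once they no longer matter.
        if k > load and term <= total * 1e-17:
            break
        k += 1
    return total

def table_overflow_bound(num_buckets: int, load: float, capacity: int) -> float:
    """Union bound on the chance that at least one bucket overflows."""
    return min(1.0, num_buckets * bucket_overflow_probability(load, capacity))

if __name__ == "__main__":
    # Hypothetical sizing: 2**20 buckets, 100 entries per bucket on average,
    # and room for 160 entries in each bucket.
    print(table_overflow_bound(2**20, 100.0, 160))
```

Under these assumptions, increasing the bucket capacity relative to the average load lowers the overflow probability but leaves more of each bucket empty, which is the space versus overflow trade-off a disk index design has to balance.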

Keywords: storage systems; chunk-based de-duplication; disk index; hash tables; deduping; metadata storage; hash table overflow; access performance.

DOI: 10.1504/IJCSE.2015.067074

International Journal of Computational Science and Engineering, 2015 Vol.10 No.1/2, pp.171 - 180

Received: 20 Jan 2014
Accepted: 10 Mar 2014

Published online: 25 Jan 2015
