Authors: Yoshihiro Oyama; Jun Murakami; Shun Ishiguro; Osamu Tatebe
Addresses: Department of Informatics, The University of Electro-Communications, Chofu, Tokyo, Japan; CREST, Japan Science and Technology Agency, Kawaguchi, Saitama, Japan · Department of Informatics, The University of Electro-Communications, Chofu, Tokyo, Japan · Department of Informatics, The University of Electro-Communications, Chofu, Tokyo, Japan · Department of Computer Science, University of Tsukuba, Tsukuba, Ibaraki, Japan; CREST, Japan Science and Technology Agency, Kawaguchi, Saitama, Japan
Abstract: Many application programs in data-intensive science read and write large files. Such file data consumes significant memory because it is loaded into the page cache. Since memory resources are critically valuable in data-intensive computing, reducing the memory footprint consumed by file data is essential. In this paper, we propose a cache deduplication mechanism with content-defined chunking (CDC) for the Gfarm distributed file system. CDC divides a file into variable-size blocks (chunks) whose boundaries are determined by the contents of the file. The client stores the chunks in the local file system as cache files and reuses them during subsequent file accesses. Deduplicating chunks reduces the amount of data transmitted between clients and servers, as well as storage and memory requirements. The experimental results demonstrate that the proposed mechanism significantly improves the performance of file-read operations and that introducing parallelism reduces the overhead of file-write operations.
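To illustrate the chunking step the abstract describes, the sketch below shows a minimal content-defined chunker based on a simple rolling hash. It is not the paper's implementation: the window size, boundary mask, and chunk-size limits are illustrative assumptions, and production systems typically use Rabin fingerprinting or Buzhash instead of this simplified hash.

```python
# Illustrative CDC sketch (NOT the paper's implementation).
# A chunk boundary is placed wherever the low bits of a rolling hash
# are zero, so boundaries depend on content, not on fixed offsets.

MASK = (1 << 12) - 1   # cut when low 12 hash bits are zero (~4 KiB average)
MIN_CHUNK = 1024       # avoid degenerate tiny chunks
MAX_CHUNK = 16384      # force a cut so chunks stay bounded

def chunk(data: bytes) -> list:
    """Split data into variable-size, content-defined chunks."""
    chunks = []
    start = 0
    h = 0
    for i, b in enumerate(data):
        # Simplified rolling hash: each byte's contribution is shifted
        # out of the 32-bit state after a few dozen bytes, approximating
        # a sliding window (a real chunker removes the outgoing byte
        # explicitly, as in Rabin fingerprinting).
        h = ((h << 1) + b) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

Because boundaries follow content, inserting bytes near the start of a file shifts only the chunks around the edit; later chunks keep their boundaries and hash to the same cache entries, which is what makes chunk-level deduplication effective for the cache described above.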
Keywords: distributed file systems; content-defined chunking; CDC; file cache; cache deduplication; data-intensive science; high-performance computing; deduping.
International Journal of High Performance Computing and Networking, 2016 Vol.9 No.3, pp.190 - 205
Available online: 13 Apr 2016