Authors: Yoshihiro Oyama; Jun Murakami; Shun Ishiguro; Osamu Tatebe
Addresses: Department of Informatics, The University of Electro-Communications, Chofu, Tokyo, Japan; CREST, Japan Science and Technology Agency, Kawaguchi, Saitama, Japan · Department of Informatics, The University of Electro-Communications, Chofu, Tokyo, Japan · Department of Informatics, The University of Electro-Communications, Chofu, Tokyo, Japan · Department of Computer Science, University of Tsukuba, Tsukuba, Ibaraki, Japan; CREST, Japan Science and Technology Agency, Kawaguchi, Saitama, Japan
Abstract: Many application programs in data-intensive science read and write large files. Such file data consumes significant memory because it is loaded into the page cache. Since memory resources are critically valuable in data-intensive computing, reducing the memory footprint consumed by file data is essential. In this paper, we propose a cache deduplication mechanism with content-defined chunking (CDC) for the Gfarm distributed file system. CDC divides a file into variable-size blocks (chunks) whose boundaries are determined by the contents of the file. The client stores the chunks in the local file system as cache files and reuses them during subsequent file accesses. Deduplicating chunks reduces the amount of data transmitted between clients and servers, as well as storage and memory requirements. The experimental results demonstrate that the proposed mechanism significantly improves the performance of file-read operations and that introducing parallelism reduces the overhead of file-write operations.
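To illustrate the chunking step the abstract describes, the sketch below shows a minimal content-defined chunker based on a simple rolling hash. It is not the paper's implementation: the window size, boundary mask, and chunk-size limits are illustrative assumptions, and production systems typically use Rabin fingerprinting or Buzhash instead of this simplified hash.

```python
# Illustrative CDC sketch (NOT the paper's implementation).
# A chunk boundary is placed wherever the low bits of a rolling hash
# are zero, so boundaries depend on content, not on fixed offsets.

MASK = (1 << 12) - 1   # cut when low 12 hash bits are zero (~4 KiB average)
MIN_CHUNK = 1024       # avoid degenerate tiny chunks
MAX_CHUNK = 16384      # force a cut so chunks stay bounded

def chunk(data: bytes) -> list:
    """Split data into variable-size, content-defined chunks."""
    chunks = []
    start = 0
    h = 0
    for i, b in enumerate(data):
        # Simplified rolling hash: each byte's contribution is shifted
        # out of the 32-bit state after a few dozen bytes, approximating
        # a sliding window (a real chunker removes the outgoing byte
        # explicitly, as in Rabin fingerprinting).
        h = ((h << 1) + b) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

Because boundaries follow content, inserting bytes near the start of a file shifts only the chunks around the edit; later chunks keep their boundaries and hash to the same cache entries, which is what makes chunk-level deduplication effective for the cache described above.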
Keywords: distributed file systems; content-defined chunking; CDC; file cache; cache deduplication; data-intensive science; high-performance computing; deduping.
International Journal of High Performance Computing and Networking, 2016 Vol.9 No.3, pp.190 - 205
Available online: 13 Apr 2016