Title: An empirical evaluation of compression techniques for genome sequences

Authors: M. Muthulakshmi; G. Murugeswari; S.P. Raja

Addresses: Department of Computer Science and Engineering, Manonmaniam Sundaranar University, Tirunelveli – 12, Tamil Nadu, India; Department of Computer Science and Engineering, Manonmaniam Sundaranar University, Tirunelveli – 12, Tamil Nadu, India; School of Computer Science and Engineering, Vellore Institute of Technology, Vellore, Tamil Nadu, India

Abstract: Databases of biological sequences are growing at an exponential rate as sequence data are collected for ever more living organisms. Among scientific databases, biological databases stand out, with sizes measured in terabytes. With advances in sequencing technologies, thousands of nucleotide bases from different organisms are sequenced and submitted to databases worldwide each day. Biological sequence data therefore need to be compressed to reduce the space required for storage and to speed up transmission. Three existing sequence compression algorithms are implemented: modified HuffBit, one bit compression and extended American Standard Code for Information Interchange (ASCII) compression. The DNA sequence data are obtained from the National Center for Biotechnology Information (NCBI) database. The main aim of this paper is to compare and evaluate the performance of these existing sequence compression algorithms. Experimental results show that the modified HuffBit compress algorithm outperforms the other two, with an average compression ratio of 3.8.
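
As a rough illustration of the compression ratio metric mentioned in the abstract, the sketch below packs a DNA string at two bits per base and reports the ratio. It is not one of the three algorithms evaluated in the paper: plain 2-bit packing is swapped in as a simple stand-in, and the names `pack_2bit` and `compression_ratio` are hypothetical. The ratio is assumed here to mean original size divided by compressed size, with one ASCII byte per base in the original; the paper may define it differently.

```python
# Illustrative stand-in only: plain 2-bit packing, NOT modified HuffBit,
# one bit compression, or extended ASCII compression from the paper.
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack_2bit(seq: str) -> bytes:
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk so the padding bits sit at the end.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def compression_ratio(seq: str, packed: bytes) -> float:
    """Assumed definition: original bytes (1 per base) / compressed bytes."""
    return len(seq.encode("ascii")) / len(packed)


if __name__ == "__main__":
    seq = "ACGTACGTACGTACGT"
    packed = pack_2bit(seq)
    print(compression_ratio(seq, packed))
```

Under these assumptions, fixed 2-bit packing cannot exceed a ratio of 4 on plain A/C/G/T input. A real decoder would also need the sequence length stored alongside the packed bytes, since the final byte may contain padding.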

Keywords: DNA; extended ASCII; genome sequences; modified HuffBit; one bit.

DOI: 10.1504/IJBM.2021.117870

International Journal of Biometrics, 2021 Vol.13 No.4, pp.447-463

Received: 03 Jun 2020
Accepted: 19 Oct 2020

Published online: 04 Oct 2021
