Title: Sequence similarity using composition method

Authors: Geetika Munjal; Pooja Sharma; Deepti Gaur

Addresses: Department of Computer Science and Engineering, The Northcap University, Gurgaon 122017, India ' Department of Computer Science and Engineering, The Northcap University, Gurgaon 122017, India ' Department of Computer Science and Engineering, The Northcap University, Gurgaon 122017, India

Abstract: Deoxyribonucleic acid (DNA) has an enormous capacity to carry important information in the form of character strings. Sequence analysis is the process of applying a wide range of methods to DNA sequences to understand the structure, features or evolution of these nucleotide strings. The analysis uses mathematical methods to convert the character strings to numerical values, which are then used to measure similarity between sequences. DNA sequences contain only four nucleotides (A, C, G and T), so sequence comparison is essential for extracting information from them. In this paper, various methods for analysing DNA sequences are explored, including the use of entropy, divergence and LZ complexity, along with the role of hybridisation. A hybrid model based on the composition vector and distance methods is proposed to find dissimilarity between sequences, and this hybrid model is tested on sequences of species downloaded from the National Center for Biotechnology Information (NCBI).
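To illustrate the general idea of converting a nucleotide string to numerical values and comparing the results, the following is a minimal Python sketch. It is not the authors' hybrid model: the function names, the choice of k-mer length, and the use of Euclidean distance are assumptions made here for illustration only, since the abstract does not specify them.

```python
from collections import Counter
from itertools import product
import math


def composition_vector(seq, k=2):
    """Relative frequency of every k-mer over A, C, G, T, in a fixed order.

    Hypothetical helper: the abstract does not state which k the authors use.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Only count k-mers made of A, C, G, T (ignores ambiguous symbols like N).
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def shannon_entropy(vec):
    """Shannon entropy (in bits) of a frequency vector."""
    return -sum(p * math.log2(p) for p in vec if p > 0)


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors (an assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


if __name__ == "__main__":
    # Toy sequences, not data from the paper.
    a = composition_vector("ACGTACGTAC", k=2)
    b = composition_vector("AAAACCCCGG", k=2)
    print(shannon_entropy(a), shannon_entropy(b), euclidean_distance(a, b))
```

A larger distance between two composition vectors indicates greater dissimilarity between the corresponding sequences under this simplified measure.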

Keywords: nucleotides; entropy; frequency vector.

DOI: 10.1504/IJDS.2018.090626

International Journal of Data Science, 2018 Vol.3 No.1, pp.19 - 28

Received: 30 Apr 2015
Accepted: 01 Jan 2016

Published online: 25 Mar 2018
