Publisher: Inderscience Publishers

Title: An optimal DNA segmentation based on the MDL principle

Authors: Wojciech Szpankowski, Wenhui Ren, Lukasz Szpankowski

Addresses: Department of Computer Science, Purdue University, West Lafayette, IN 47907, USA; Department of Computer Science, Purdue University, West Lafayette, IN 47907, USA; Cell and Molecular Biology, University of Michigan, Ann Arbor, MI 48104, USA

Abstract: The biological world is highly stochastic and inhomogeneous in its behaviour. There are regions in DNA with a high concentration of G or C bases; stretches of sequences with an abundance of CG dinucleotide (CpG islands); coding regions with strong periodicity-of-three pattern, and so forth. Transitions between these regions of DNA, known also as change points, carry important biological information. Computational methods used to identify these homogeneous regions are called segmentations. Viewing a DNA sequence as a non-stationary process, we apply recent novel techniques of universal source coding to discover stationary (possibly recurrent) segments. In particular, the Stein-Ziv lemma is adopted to find an asymptotically optimal discriminant function that determines whether two DNA segments are generated by the same source assuring exponentially small false positives. Next, we use the Minimum Description Length (MDL) principle to select parameters that lead to a linear-time segmentation algorithm. We apply our algorithm to human chromosome 9 and chromosome 20 to discover coding and noncoding regions, starting positions of genes, as well as the beginning of CpG islands.
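To make the idea of a change point concrete, here is a minimal sketch that scores how different two DNA segments are and scans for the most likely boundary. It is not the authors' method: it swaps in the length-weighted Jensen-Shannon divergence between single-base compositions as a stand-in discriminant, and it runs in quadratic time rather than the linear time reported in the abstract. The names `js_divergence`, `best_change_point`, and `min_len` are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import log2

ALPHABET = "ACGT"


def base_freqs(seq):
    """Empirical frequency of each base in seq, in ALPHABET order."""
    counts = Counter(seq)
    n = len(seq)
    return [counts[b] / n for b in ALPHABET]


def entropy(p):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(x * log2(x) for x in p if x > 0)


def js_divergence(seq_a, seq_b):
    """Length-weighted Jensen-Shannon divergence between base compositions.

    Assumption: this is a simple stand-in for a discriminant function,
    not the one derived in the paper.
    """
    na, nb = len(seq_a), len(seq_b)
    wa, wb = na / (na + nb), nb / (na + nb)
    pa, pb = base_freqs(seq_a), base_freqs(seq_b)
    mix = [wa * x + wb * y for x, y in zip(pa, pb)]
    return entropy(mix) - wa * entropy(pa) - wb * entropy(pb)


def best_change_point(seq, min_len=10):
    """Return (position, divergence) for the split with the largest divergence.

    Every split leaving at least min_len bases on each side is tried, so
    the scan is quadratic in len(seq).
    """
    best = (None, 0.0)
    for i in range(min_len, len(seq) - min_len + 1):
        d = js_divergence(seq[:i], seq[i:])
        if d > best[1]:
            best = (i, d)
    return best


if __name__ == "__main__":
    # Toy sequence: an AT-rich half followed by a GC-rich half.
    toy = "AT" * 50 + "GC" * 50
    print(best_change_point(toy))
```

On the toy sequence the scan picks the boundary between the AT-rich and GC-rich halves. A full segmentation would also need a rule for deciding when a split is significant, which is the role the abstract assigns to the MDL principle.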

Keywords: DNA segmentation; stochastic modelling; universal data compression; MDL principle; minimum description length; model selection; piecewise stationary sequences; change points; DNA sequence; universal source coding; information theory; computational biology; post-sequencing analysis; bioinformatics.

DOI: 10.1504/IJBRA.2005.006899

International Journal of Bioinformatics Research and Applications, 2005 Vol.1 No.1, pp.3 - 17

Published online: 21 Apr 2005
