Title: Efficient and exact maximum likelihood quantisation of genomic features using dynamic programming

Authors: Mingzhou (Joe) Song, Robert M. Haralick, Stephane Boissinot

Addresses: Department of Computer Science, New Mexico State University, Las Cruces, NM 88003, USA; PhD Program in Computer Science, Graduate Center, City University of New York, New York, NY 10016, USA; Department of Biology, Queens College, City University of New York, Flushing, NY 11367, USA

Abstract: An efficient and exact dynamic programming algorithm is introduced to quantise a continuous random variable into a discrete random variable that maximises the likelihood of the quantised probability distribution for the original continuous random variable. Quantisation is often useful before statistical analysis and before building large discrete network models from observations of multiple continuous random variables. The quantisation algorithm is applied to genomic features including the recombination rate distribution across the chromosomes and the non-coding transposable element LINE-1 in the human genome. The association pattern between the recombination rate, obtained by quantisation at genomic locations around LINE-1 elements, and the length groups of LINE-1 elements, also obtained by quantisation of LINE-1 length, is studied. The exact and density-preserving quantisation approach provides an alternative superior to the inexact and distance-based univariate iterative k-means clustering algorithm for discretisation.
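
Illustrative sketch: the abstract does not spell out the algorithm, so the following Python is only a minimal illustration of the general idea (exact dynamic programming over sorted observations to choose bin boundaries that maximise a piecewise-constant density likelihood), not the authors' implementation. The function name ml_quantise, the choice of midpoints between adjacent sorted values as candidate boundaries, and the simple cubic-time recurrence are all assumptions made for this example.

```python
import math


def ml_quantise(values, k):
    """Split the data into k contiguous bins maximising a histogram log-likelihood.

    Assumption: candidate boundaries are the data minimum, the data maximum,
    and the midpoints between adjacent sorted values.
    Returns (boundaries, log_likelihood) with len(boundaries) == k + 1.
    """
    xs = sorted(values)
    n = len(xs)
    if k < 1 or k > n or xs[0] == xs[-1]:
        raise ValueError("need 1 <= k <= n and at least two distinct values")

    # edges[i] is the candidate boundary just to the left of xs[i].
    edges = [xs[0]] + [(xs[i - 1] + xs[i]) / 2 for i in range(1, n)] + [xs[-1]]

    def bin_loglik(i, j):
        # Log-likelihood of xs[i:j] under a constant density count / (n * width).
        width = edges[j] - edges[i]
        if width <= 0:
            return -math.inf  # zero-width bins (duplicate values) are not allowed
        count = j - i
        return count * math.log(count / (n * width))

    # best[b][j]: best log-likelihood of covering xs[:j] with exactly b bins.
    best = [[-math.inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for b in range(1, k + 1):
        for j in range(b, n + 1):
            for i in range(b - 1, j):
                cand = best[b - 1][i] + bin_loglik(i, j)
                if cand > best[b][j]:
                    best[b][j] = cand
                    back[b][j] = i

    if best[k][n] == -math.inf:
        raise ValueError("no valid partition into k bins of positive width")

    # Walk the back-pointers from the last bin to the first.
    cuts = [n]
    for b in range(k, 0, -1):
        cuts.append(back[b][cuts[-1]])
    cuts.reverse()
    return [edges[c] for c in cuts], best[k][n]
```

Because every contiguous partition into k bins is scored through the recurrence, the result is the exact optimum under the stated boundary assumption, in contrast to an iterative k-means clustering whose outcome depends on initialisation. A call such as ml_quantise(rates, 3) would return four boundaries delimiting three levels, where rates is a hypothetical list of observed recombination rates.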

Keywords: maximum likelihood quantisation; discretisation; dynamic programming; recombination rate distribution; transposable elements; LINE-1; retrotransposon; genomic features; bioinformatics; discrete random variables.

DOI: 10.1504/IJDMB.2010.032167

International Journal of Data Mining and Bioinformatics, 2010, Vol. 4, No. 2, pp. 123-141

Published online: 11 Mar 2010
