Title: Assessment of length distributions between non-coding and coding sequences amongst two model organisms

Authors: Rachel Caldwell, Yan-Xia Lin, Ren Zhang

Addresses: School of Biological Sciences, University of Wollongong, Northfields Ave, NSW 2522, Australia; School of Mathematics and Applied Statistics, University of Wollongong, Northfields Ave, NSW 2522, Australia; School of Biological Sciences, University of Wollongong, Northfields Ave, NSW 2522, Australia

Abstract: The availability of genomic DNA and cDNA sequence data has escalated the data mining and genomics era. We aim to investigate the length distributions of the non-coding and coding regions of protein genes of two model organisms, Arabidopsis thaliana and Drosophila melanogaster. A non-linear functional relationship model was applied and strong correlation was found between the Coding Sequence (CDS) and non-coding sequence regions, conditional on the 5' UTR data. Significant differences were found between the protein functional classes and each gene region. Examination of the non-coding and coding regions of these organisms has revealed possible correlations.

Keywords: non-coding sequences; CDS; coding sequences; Arabidopsis thaliana; Drosophila melanogaster; data mining; UTRs; untranslated regions; length distributions; protein genes; model organisms; nonlinear modelling; functional relationships; protein functional classes; gene regions; bioinformatics.

DOI: 10.1504/IJDMB.2010.035899

International Journal of Data Mining and Bioinformatics, 2010 Vol.4 No.5, pp.535 - 552

Published online: 08 Oct 2010
