Title: Effective statistical features for coding and non-coding DNA sequence classification for yeast, C. elegans and human

Authors: Alan Wee-Chung Liew, Yonghui Wu, Hong Yan, Mengsu Yang

Addresses: Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong. ' Agenica Research Pte Ltd, 11 Hospital Drive, 169610, Singapore. ' Department of Computer Engineering and Information Technology, City University of Hong Kong, Kowloon, Hong Kong; School of Electrical and Information Engineering, University of Sydney, NSW 2006, Australia. ' Department of Chemistry and Biology, City University of Hong Kong, Kowloon, Hong Kong

Abstract: This study quantitatively evaluates coding features in terms of their information content for classifying coding and non-coding regions in three species: yeast, C. elegans and human. The results indicate that coding features that are effective for yeast or C. elegans are generally not very effective for human, which has a short average exon length. Through correlation analysis, we identified a subset of human coding features that have high discriminative power but are complementary in their information content. With this subset, a simple kNN classifier achieved a classification accuracy of up to 90%.
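
The two steps named in the abstract, correlation-based selection of complementary features followed by kNN classification, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the feature names `f1` to `f3`, the toy values, the 0.8 correlation threshold, and k = 3 are hypothetical, and ranking features by their correlation with the class label is only a stand-in for the paper's information-content measure.

```python
import math
from collections import Counter

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def select_complementary(columns, labels, max_corr=0.8):
    """Greedy selection: rank features by |correlation with label| (assumed proxy
    for discriminative power), then skip any feature whose |correlation| with an
    already kept feature reaches max_corr (i.e. it is largely redundant)."""
    ranked = sorted(columns, key=lambda name: abs(pearson(columns[name], labels)),
                    reverse=True)
    kept = []
    for name in ranked:
        if all(abs(pearson(columns[name], columns[k])) < max_corr for k in kept):
            kept.append(name)
    return kept

def knn_predict(train_rows, train_labels, query, k=3):
    """Majority vote among the k nearest training rows (Euclidean distance)."""
    nearest = sorted(range(len(train_rows)),
                     key=lambda i: math.dist(train_rows[i], query))[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: 1 = coding, 0 = non-coding.
labels = [1, 1, 1, 0, 0, 0]
columns = {
    "f1": [0.9, 0.8, 0.85, 0.2, 0.1, 0.15],  # discriminative
    "f2": [1.8, 1.6, 1.7, 0.4, 0.2, 0.3],    # rescaled copy of f1, so redundant
    "f3": [0.5, 0.9, 0.4, 0.6, 0.3, 0.2],    # weaker, but not redundant with f1
}

kept = select_complementary(columns, labels)

# Build training rows and a query vector from the kept features only.
train_rows = list(zip(*(columns[name] for name in kept)))
query_features = {"f1": 0.8, "f2": 1.6, "f3": 0.6}
query = [query_features[name] for name in kept]
prediction = knn_predict(train_rows, labels, query, k=3)
```

In this sketch only one of the two redundant features survives selection, alongside the weaker but complementary one; a real evaluation would need held-out sequences and the actual coding statistics studied in the paper.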

Keywords: exon-intron classification; coding statistics; feature selection; information content; DNA sequence classification; correlation analysis; bioinformatics; statistical features; coding features.

DOI: 10.1504/IJBRA.2005.007577

International Journal of Bioinformatics Research and Applications, 2005 Vol.1 No.2, pp.181 - 201

Published online: 06 Aug 2005
