Authors: Yaping Fang; Jun Li
Addresses: Department of Chemistry and Biochemistry, University of Oklahoma, Norman, OK 73019, USA; Department of Chemistry and Biochemistry, University of Oklahoma, Norman, OK 73019, USA
Abstract: Computational prediction of protein-coding genes is a fundamental step in genome annotation. However, accurately predicting eukaryotic genes in silico remains a challenge. By surveying the model genomes, we found that the Spearman's rank correlation coefficient between the number of experimentally verified genes and genome size was 0.96 for all eukaryotes except plants, indicating that the relationship between genome size and the number of coding genes can be expressed as a monotonic function. Regression analysis showed that the relationship between the total number of protein-coding genes and genome size followed a logarithmic equation. We integrated the equation into ab initio gene prediction software to guide gene prediction by constraining the total number of predicted genes. We evaluated the software on three eukaryotic genomes. The results showed that >90% of false positive predictions were removed while >80% of true positives were retained, resulting in much higher specificity.
Keywords: gene prediction; gene count; gene structure; genome size; eukaryotic genome annotation; fungi; metazoans; protein coding; eukaryotic genes.
International Journal of Computational Biology and Drug Design, 2013, Vol. 6, No. 1/2, pp. 157-169
Received: 12 Apr 2012
Accepted: 19 Jul 2012
Published online: 20 Feb 2013
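
The workflow described in the abstract (rank correlation, logarithmic regression, then a cap on the number of predicted genes) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the functional form count = a * ln(size) + b, the function names `spearman`, `fit_log` and `cap_predictions`, the toy numbers, and the idea of enforcing the cap by keeping the highest-scoring predictions. The abstract does not state the fitted coefficients or how the constraint is applied inside the software.

```python
import math


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def fit_log(genome_sizes, gene_counts):
    """Least-squares fit of gene_count = a * ln(genome_size) + b (assumed form)."""
    log_sizes = [math.log(s) for s in genome_sizes]
    n = len(log_sizes)
    mean_x, mean_y = sum(log_sizes) / n, sum(gene_counts) / n
    a = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_sizes, gene_counts))
    a /= sum((u - mean_x) ** 2 for u in log_sizes)
    b = mean_y - a * mean_x
    return a, b


def cap_predictions(predictions, genome_size, a, b):
    """Keep only the top-scoring predictions, up to the fitted gene count.

    `predictions` is a list of (gene_id, score) pairs; ranking by score is a
    hypothetical way to apply the constraint, not the paper's stated method.
    """
    limit = max(0, round(a * math.log(genome_size) + b))
    ranked = sorted(predictions, key=lambda p: p[1], reverse=True)
    return ranked[:limit]


# Toy, made-up values (not data from the paper): sizes in Mb, synthetic counts.
toy_sizes = [10.0, 50.0, 200.0, 1000.0]
toy_counts = [1000.0 * math.log(s) + 500.0 for s in toy_sizes]

rho = spearman(toy_sizes, toy_counts)
a_fit, b_fit = fit_log(toy_sizes, toy_counts)
toy_predictions = [("g1", 0.9), ("g2", 0.2), ("g3", 0.7), ("g4", 0.5)]
kept = cap_predictions(toy_predictions, 100.0, 0.0, 2.0)
print(rho, a_fit, b_fit, kept)
```

Fitting on ln(size) turns the logarithmic model into an ordinary straight-line regression, which is why a closed-form least-squares solution is enough here.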