Title: Protein homology detection with biologically inspired features and interpretable statistical models

Authors: Pai-Hsi Huang, Vladimir Pavlovic

Addresses: Department of Computer Science, Rutgers University, Piscataway, NJ 08854-8019, USA (both authors)

Abstract: Computational classification of proteins using methods such as string kernels and Fisher-SVM has demonstrated great success. However, the resulting models do not offer an immediate interpretation of the underlying biological mechanisms. In this work, we propose a biologically motivated feature set combined with a sparse classifier, based on a small subset of positions and residues in protein sequences, for protein superfamily detection. We show that the performance of our models is comparable to that of state-of-the-art methods on a benchmark dataset. The set of sparse critical features discovered by the models is consistent with confirmed biological findings.

Keywords: sequence classification; homology detection; protein homology; discriminative learning; biologically motivated features; feature selection; data mining; bioinformatics; biological mechanisms; sparse classifier; protein sequences; protein superfamilies.

DOI: 10.1504/IJDMB.2008.019096

International Journal of Data Mining and Bioinformatics, 2008 Vol.2 No.2, pp.157 - 175

Published online: 28 Jun 2008
