Title: Prediction of Protein Secondary Structure with two-stage multi-class SVMs

Authors: Minh N. Nguyen, Jagath C. Rajapakse

Addresses: BioInformatics Research Centre, School of Computer Engineering, Nanyang Technological University, Singapore. BioInformatics Research Centre, School of Computer Engineering, Nanyang Technological University, Singapore; Biological Engineering Division, Massachusetts Institute of Technology, USA; Singapore-MIT Alliance, N2-B2C-15, 50 Nanyang Avenue, Singapore 639798

Abstract: Bioinformatics techniques for Protein Secondary Structure (PSS) prediction mostly depend on the information available in amino acid sequences. In this paper, we propose a two-stage Multi-class Support Vector Machine (MSVM) approach, in which a second MSVM predictor is introduced at the output of the first-stage MSVM to capture the contextual relationships among secondary structure elements and thereby minimise the generalisation error of the prediction. Using position-specific scoring matrices generated by PSI-BLAST, the two-stage MSVM approach achieves Q3 accuracies of 78.0% and 76.3% on the RS126 dataset of 126 non-homologous globular proteins and the CB396 dataset of 396 non-homologous proteins, respectively, which are better than the scores reported on both datasets to date. By using MSVMs, the present prediction scheme achieves significant improvements of 2–6% and 3–15% in Q3 and Sov accuracies, respectively, on the two datasets. On larger blind-test datasets from PSIPRED, CASP4 and EVA, the two-stage MSVM approach achieves Q3 accuracies from 77.0% to 79.5%.
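
The Q3 accuracy quoted in the abstract is conventionally the percentage of residues whose three-state label (helix H, strand E, coil C) is predicted correctly. The sketch below illustrates that definition only; the function name `q3_accuracy` and the two example strings are hypothetical and are not taken from the paper or its datasets.

```python
def q3_accuracy(predicted: str, observed: str) -> float:
    """Return the percentage of residues whose three-state label matches.

    Both arguments are strings over the alphabet H (helix), E (strand),
    C (coil), one character per residue. Hypothetical helper for illustration.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must be non-empty")
    # Count positions where the predicted state equals the observed state.
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Made-up 18-residue example (not real data): two residues are mispredicted.
observed = "CCHHHHHHCCEEEEECCC"
predicted = "CCHHHHHCCCEEEECCCC"
score = q3_accuracy(predicted, observed)
print(f"Q3 = {score:.1f}%")
```

Q3 scores each residue independently, so it does not reflect whether whole segments are recovered; the segment-overlap (Sov) measure also reported in the abstract addresses that and is not implemented here.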

Keywords: protein structure; secondary structure prediction; support vector machines; multi-class SVMs; data mining; bioinformatics; amino acid sequences.

DOI: 10.1504/IJDMB.2007.011612

International Journal of Data Mining and Bioinformatics, 2007 Vol.1 No.3, pp.248 - 269

Published online: 06 Dec 2006
