Authors: Nagamma Patil; Durga Toshniwal; Kumkum Garg
Addresses: Department of Electronics and Computer Engineering, Indian Institute of Technology, Roorkee 247667, India; Department of Electronics and Computer Engineering, Indian Institute of Technology, Roorkee 247667, India; Department of Computer Science and Engineering, Manipal University, Jaipur 302026, India
Abstract: This paper presents a computational system to predict protein structure using N-grams and a wrapper feature selection framework (an N-gram is a subsequence of N characters extracted from a larger sequence). N-gram features are extracted from a dataset consisting of 277 domains: 70 all-α domains, 61 all-β domains, 81 α/β domains and 65 α + β domains. A wrapper feature selection system, GA-SVM, is applied to obtain an optimised feature set. Using the optimised 3,070-feature subset, a classifier model is trained and tested with a Support Vector Machine (SVM). This model achieves an overall accuracy of 88.09%, evaluated by 10-fold cross-validation. This value is 4.7% higher than the accuracy obtained with the initial 6,414 features. Experimental results also show that feature subset selection with the proposed GA-SVM wrapper approach improves classification accuracy compared with other GA-based wrapper approaches and existing protein sequence encoding methods.
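As an illustration of the N-gram definition given in the abstract, the following is a minimal sketch of overlapping N-gram extraction from a sequence string. The function names (extract_ngrams, ngram_counts), the toy sequence, and the choice of N up to 3 are assumptions made for this example only; the abstract does not state which values of N the authors used or how their features were encoded.

```python
from collections import Counter


def extract_ngrams(sequence, n):
    """Return every overlapping subsequence of length n, in order."""
    if n <= 0:
        raise ValueError("n must be positive")
    # A sequence shorter than n yields an empty list.
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def ngram_counts(sequence, max_n):
    """Count all N-grams for N = 1..max_n (max_n is an assumed parameter)."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(extract_ngrams(sequence, n))
    return counts


if __name__ == "__main__":
    toy = "MKVLAAGMKV"  # hypothetical sequence, not taken from the dataset
    print(extract_ngrams(toy, 3))
    print(ngram_counts(toy, 3).most_common(5))
```

Counts of this kind could serve as one feature vector per domain, from which a wrapper method would then select a subset; that step is not shown here.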
Keywords: wrapper feature selection; GAs; genetic algorithms; SVM; support vector machines; protein structure prediction; classification accuracy; protein sequences.
International Journal of Functional Informatics and Personalised Medicine, 2012 Vol.4 No.1, pp.69 - 79
Received: 21 Mar 2012
Accepted: 16 Jul 2012
Published online: 20 Nov 2012