Authors: Yadunath Pathak; Prashant Singh Rana; P.K. Singh; Mukesh Saraswat
Addresses: Computational Intelligence and Data Mining Research Lab, ABV-Indian Institute of Information Technology and Management, Gwalior 474015, Madhya Pradesh, India; Computer Science and Engineering Department, Thapar University, Patiala 147004, Punjab, India; Computational Intelligence and Data Mining Research Lab, ABV-Indian Institute of Information Technology and Management, Gwalior 474015, Madhya Pradesh, India; Jaypee Institute of Information Technology, Noida 201307, Uttar Pradesh, India
Abstract: The physical and chemical properties of a protein help determine the quality of its structure. Here we explore machine learning models that use six physical and chemical properties, namely total empirical energy, secondary structure penalty, total surface area, pair number, residue length and Euclidean distance, to predict the RMSD of a protein structure in the absence of its true native state. The Real Coded Genetic Algorithm is used to determine feature importance, and k-fold cross-validation is used to measure the robustness of the best predictive model. The experiments show that the random forest model outperforms the other machine learning approaches in RMSD prediction. On the testing data, the RMSD predictions achieve a Root Mean Square Error (RMSE) of 0.48, a correlation of 0.90, an R² of 0.82 and an accuracy of 97.02% (with ±2 error). The data set used in the study is available at http://bit.ly/PSP-ML.
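The four evaluation measures reported in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. The function names are hypothetical, the toy values are invented purely for illustration, and two readings are assumptions: that "accuracy (with ±2 error)" means the percentage of predictions falling within 2 units of the true RMSD, and that R² is the coefficient of determination.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between true and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def pearson(y_true, y_pred):
    """Pearson correlation coefficient between true and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def r_squared(y_true, y_pred):
    """Coefficient of determination (assumed definition of R^2)."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def accuracy_within(y_true, y_pred, tol=2.0):
    """Percentage of predictions within +/- tol of the true value
    (assumed reading of 'accuracy with +/-2 error')."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= tol)
    return 100.0 * hits / len(y_true)


if __name__ == "__main__":
    # Toy RMSD values, invented for illustration only (not from the data set).
    true_rmsd = [1.0, 2.0, 3.0, 4.0, 10.0]
    pred_rmsd = [1.5, 2.0, 2.5, 4.0, 7.0]
    print("RMSE:", rmse(true_rmsd, pred_rmsd))
    print("Correlation:", pearson(true_rmsd, pred_rmsd))
    print("R^2:", r_squared(true_rmsd, pred_rmsd))
    print("Accuracy (+/-2):", accuracy_within(true_rmsd, pred_rmsd))
```

In practice these functions would be applied to the held-out predictions of each fold in the k-fold cross-validation and the results averaged across folds.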
Keywords: protein structure prediction; machine learning; random forest; real coded genetic algorithms; total empirical energy; secondary structure penalty; total surface area; pair number; residue length; Euclidean distance; bioinformatics.
International Journal of Data Mining and Bioinformatics, 2016 Vol.14 No.1, pp.71 - 85
Available online: 30 Nov 2015