
Title: Protein structure prediction (RMSD ≤ 5 Å) using machine learning models

Authors: Yadunath Pathak; Prashant Singh Rana; P.K. Singh; Mukesh Saraswat

Addresses: Computational Intelligence and Data Mining Research Lab, ABV-Indian Institute of Information Technology and Management, Gwalior 474015, Madhya Pradesh, India; Computer Science and Engineering Department, Thapar University, Patiala 147004, Punjab, India; Computational Intelligence and Data Mining Research Lab, ABV-Indian Institute of Information Technology and Management, Gwalior 474015, Madhya Pradesh, India; Jaypee Institute of Information Technology, Noida 201307, Uttar Pradesh, India

Abstract: The physical and chemical properties of a protein help determine the quality of its structure. Here we explore machine learning models that use six such properties (total empirical energy, secondary structure penalty, total surface area, pair number, residue length and Euclidean distance) to predict the RMSD of a protein structure in the absence of its true native state. A Real Coded Genetic Algorithm is used to determine feature importance, and k-fold cross-validation is used to measure the robustness of the best predictive model. The experiments show that the random forest model outperforms the other machine learning approaches in RMSD prediction. On the testing data, RMSD prediction achieves a Root Mean Square Error (RMSE) of 0.48, a correlation of 0.90, an R² of 0.82 and an accuracy of 97.02% (with ±2 error). The data set used in the study is available at http://bit.ly/PSP-ML.
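The evaluation measures named in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `regression_metrics` and `k_fold_indices` are hypothetical, R² is assumed to be the coefficient of determination, and "accuracy (with ±2 error)" is assumed to mean the share of predictions that fall within 2 units of the true RMSD. The abstract does not define either quantity, so both readings are assumptions.

```python
import math


def regression_metrics(y_true, y_pred, tolerance=2.0):
    """Return RMSE, Pearson correlation, R^2 and within-tolerance accuracy (%).

    Assumption: accuracy is the percentage of predictions whose absolute
    error is at most `tolerance`; the abstract does not spell this out.
    Inputs must not be constant, otherwise the denominators below are zero.
    """
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")

    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n

    # Residual, total and prediction sums of squares, plus the cross term.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    ss_pred = sum((p - mean_p) ** 2 for p in y_pred)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))

    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= tolerance)

    return {
        "rmse": math.sqrt(ss_res / n),
        "correlation": cov / math.sqrt(ss_tot * ss_pred),
        "r2": 1.0 - ss_res / ss_tot,
        "accuracy_pct": 100.0 * hits / n,
    }


def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, unshuffled folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # The first (n_samples % k) folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size
```

In a k-fold run, a model would be fitted on each `train` index set and `regression_metrics` applied to its predictions on the matching `test` set; the spread of the per-fold scores then gives a rough picture of robustness. Shuffling the samples before splitting is usually advisable and is omitted here for brevity.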

Keywords: protein structure prediction; machine learning; random forest; real coded genetic algorithms; total empirical energy; secondary structure penalty; total surface area; pair number; residue length; Euclidean distance; bioinformatics.

DOI: 10.1504/IJDMB.2016.073361

International Journal of Data Mining and Bioinformatics, 2016 Vol.14 No.1, pp.71 - 85

Received: 20 Jul 2014
Accepted: 04 May 2015
Published online: 30 Nov 2015


