Title: An amino acid property-based method for identifying solenoid proteins

Authors: Senthilnathan Rajendran; Arunachalam Jothi

Addresses: Department of Bioinformatics, School of Chemical and Biotechnology, SASTRA Deemed University, Thanjavur, Tamil Nadu, India; Department of Bioinformatics, School of Chemical and Biotechnology, SASTRA Deemed University, Thanjavur, Tamil Nadu, India

Abstract: Solenoid proteins are proteins that contain repeating structural units. They are associated with many important biological functions and are also key factors in the onset of many human diseases, such as Huntington's disease, mental retardation and inherited ataxias. Detecting solenoid proteins from sequence information alone is a challenging problem. Current methods for identifying solenoid proteins from sequence rely heavily on homology-based approaches. In this work, we propose an alternative method that uses only the amino acid composition and a set of biophysical descriptors to identify solenoid proteins. Four machine learning approaches were used for classification: Naive Bayes (NB), Support Vector Machine (SVM), Bayesian Generalised Linear Models (BGLM) and Random Forest (RF). The four classification models were validated using cross-validation, and the Area Under the Curve (AUC) was above 0.9 for all of them. The entire procedure was performed in the R programming language.
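
As an illustration of two ingredients named in the abstract, the sketch below computes an amino acid composition vector for a sequence and a rank-based AUC for a set of classifier scores. It is a minimal Python sketch under stated assumptions, not the authors' implementation (which was written in R and is not reproduced here). The function names `amino_acid_composition` and `auc`, the restriction to the 20 standard residues, and the toy inputs are all hypothetical choices made for this example. The biophysical descriptors and the four classifiers are not covered.

```python
from collections import Counter

# Assumption: only the 20 standard residues (one-letter codes) are counted.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in a protein sequence."""
    seq = sequence.upper()
    # Non-standard symbols (e.g. X, gaps) are ignored rather than counted.
    counts = Counter(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def auc(scores, labels):
    """Rank-based AUC: the probability that a randomly chosen positive
    example scores higher than a randomly chosen negative one.
    Ties count as one half. Labels are 1 (positive) or 0 (negative)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy inputs, not data from the paper.
    features = amino_acid_composition("AACD")
    print(features["A"], features["C"], features["D"])
    print(auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))
```

In a pipeline of the kind the abstract describes, a 20-element vector like this would be computed per protein and passed, together with other descriptors, to a classifier. The AUC would then be computed on held-out folds during cross-validation.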

Keywords: solenoid proteins; repeats; amino acid composition; biophysical properties; PCA; AUC; SVM; random forest; Naive Bayes; machine learning.

DOI: 10.1504/IJDMB.2020.112853

International Journal of Data Mining and Bioinformatics, 2020 Vol.24 No.3, pp.269 - 289

Received: 03 Oct 2019
Accepted: 14 Sep 2020

Published online: 07 Feb 2021