Title: An empirical study of self-training and data balancing techniques for splice site prediction

Authors: Ana Stanescu; Doina Caragea

Addresses: Department of Computing and Information Sciences, Kansas State University, Manhattan, KS, USA; Department of Computing and Information Sciences, Kansas State University, Manhattan, KS, USA

Abstract: Thanks to Next Generation Sequencing technologies, unlabelled data is now generated easily, while the annotation process remains expensive. Semi-supervised learning represents a cost-effective alternative to supervised learning, as it can improve supervised classifiers by making use of unlabelled data. However, semi-supervised learning has not been studied much for problems with highly skewed class distributions, which are prevalent in bioinformatics. To address this limitation, we carry out a study of a semi-supervised learning algorithm, specifically self-training based on Naïve Bayes, with a focus on data-level approaches for handling imbalanced class distributions. We conduct the study on the problem of predicting splice sites, using datasets in which the ratio of positive to negative examples is 1-to-99. Our results show that, under certain conditions, semi-supervised learning algorithms are a better choice than purely supervised classification algorithms.
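
The combination described in the abstract, self-training on top of a Naïve Bayes classifier with data-level balancing, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`undersample`, `train_nb`, `posterior_positive`, `self_train`), the position-wise nucleotide features, the confidence rule, the parameter values and the toy sequences are all assumptions chosen for illustration, and random under-sampling stands in for whichever balancing variants the study actually compares.

```python
import math
import random
from collections import Counter

ALPHABET = "ACGT"  # assumed feature alphabet: one nucleotide per position


def undersample(labeled, rng):
    """Randomly drop majority-class examples until both classes are equal in size.

    Assumes both classes are present in `labeled`.
    """
    pos = [ex for ex in labeled if ex[1] == 1]
    neg = [ex for ex in labeled if ex[1] == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    return minority + rng.sample(majority, len(minority))


def train_nb(labeled):
    """Count classes and per-position nucleotides for a Naive Bayes model."""
    length = len(labeled[0][0])
    class_counts = Counter(label for _, label in labeled)
    counts = {c: [Counter() for _ in range(length)] for c in (0, 1)}
    for seq, label in labeled:
        for i, ch in enumerate(seq):
            counts[label][i][ch] += 1
    return class_counts, counts


def posterior_positive(model, seq):
    """Return P(label = 1 | seq) using Laplace-smoothed estimates."""
    class_counts, counts = model
    total = sum(class_counts.values())
    log_scores = {}
    for c in (0, 1):
        score = math.log((class_counts[c] + 1) / (total + 2))
        for i, ch in enumerate(seq):
            score += math.log(
                (counts[c][i][ch] + 1) / (class_counts[c] + len(ALPHABET))
            )
        log_scores[c] = score
    top = max(log_scores.values())  # subtract the max for numerical stability
    weights = {c: math.exp(s - top) for c, s in log_scores.items()}
    return weights[1] / (weights[0] + weights[1])


def self_train(labeled, unlabeled, rounds=5, per_round=2, seed=0):
    """Repeatedly pseudo-label the most confident unlabelled sequences."""
    rng = random.Random(seed)
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        # Rebalance the current labelled set before every retraining step.
        model = train_nb(undersample(labeled, rng))
        scored = [(posterior_positive(model, s), s) for s in pool]
        # Confidence is measured as distance of the posterior from 0.5.
        scored.sort(key=lambda item: abs(item[0] - 0.5), reverse=True)
        confident, rest = scored[:per_round], scored[per_round:]
        labeled += [(s, int(p >= 0.5)) for p, s in confident]
        pool = [s for _, s in rest]
    return train_nb(undersample(labeled, rng)), labeled


# Toy usage with made-up 6-mers; positives carry "GT" at positions 2-3.
toy_labeled = [("AAGTAA", 1), ("CCGTCC", 1), ("TTGTTT", 1),
               ("AACAAA", 0), ("CCACCC", 0), ("TTCATT", 0),
               ("GGACGG", 0), ("ACCAAC", 0), ("CACACA", 0)]
toy_unlabeled = ["GGGTGG", "AACCAA", "TCGTCT", "CTAACT"]
toy_model, toy_grown = self_train(toy_labeled, toy_unlabeled, rounds=2)
```

In this sketch the labelled set is rebalanced before each retraining step, so the classifier is never fitted on the skewed distribution directly. Whether balancing is applied before, during or after pseudo-labelling is a design choice, and the sketch fixes one option only for concreteness.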

Keywords: semi-supervised learning; supervised learning; imbalanced data; data balancing; under-sampling; over-sampling; splice sites; self-training; splice site prediction; next generation sequencing; highly skewed class distributions; bioinformatics; naive Bayes.

DOI: 10.1504/IJBRA.2017.082055

International Journal of Bioinformatics Research and Applications, 2017 Vol.13 No.1, pp.40 - 61

Received: 29 Mar 2015
Accepted: 15 May 2016

Published online: 06 Feb 2017
