Title: Ensemble of sparse classifiers for high-dimensional biological data

Authors: Sunghan Kim; Fabien Scalzo; Donatello Telesca; Xiao Hu

Addresses: Department of Engineering, College of Technology and Computer Science, East Carolina University, 1001 E 5th St, Greenville, NC 27858, USA; Neural Systems and Dynamics Laboratory, Department of Neurosurgery, David Geffen School of Medicine, University of California Los Angeles, 18-265 Semel, 10833 Le Conte Avenue, Los Angeles, CA 90095, USA ' Neural Systems and Dynamics Laboratory, Department of Neurosurgery, David Geffen School of Medicine, University of California Los Angeles, 18-265 Semel, 10833 Le Conte Avenue, Los Angeles, CA 90095, USA ' Department of Biostatistics, School of Public Health, University of California Los Angeles, 18-265 Semel, 10833 Le Conte Avenue, Los Angeles, CA 90095, USA ' Neural Systems and Dynamics Laboratory, Department of Neurosurgery, David Geffen School of Medicine, University of California Los Angeles, 18-265 Semel, 10833 Le Conte Avenue, Los Angeles, CA 90095, USA

Abstract: Biological data are often high in dimension while the number of samples is small. In such cases, classification performance can be improved by reducing the dimension of the data, which is referred to as feature selection. A recently proposed feature selection method exploits the sparsity of high-dimensional biological data, in which a small subset of features accounts for most of the variance in the dataset. In this study we propose a new classification method for high-dimensional biological data that performs both feature selection and classification within a single framework. The proposed method combines a sparse linear solution technique with the bootstrap aggregating algorithm. We tested its performance on four public mass spectrometry cancer datasets and compared it with two conventional classification techniques, Support Vector Machines and Adaptive Boosting. The results demonstrate that the proposed method classifies more accurately than these conventional techniques across the cancer datasets.
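
The combination described in the abstract, a sparse linear solver wrapped in bootstrap aggregating, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the solver, so an L1-penalised least-squares fit via iterative soft thresholding stands in for the l0-norm solution named in the keywords. The function names, the parameter values and the synthetic toy data (one informative feature among twenty, roughly unit-scale features) are all assumptions made for illustration.

```python
import random

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (proximal step of the L1 penalty)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def fit_sparse_linear(X, y, lam=0.2, step=0.05, n_iter=200):
    """L1-penalised least squares by iterative soft thresholding.

    Stand-in for a sparse linear solver (assumption, not the paper's solver).
    The fixed step size assumes features on roughly unit scale.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_iter):
        resid = [y[i] - sum(xij * wj for xij, wj in zip(X[i], w)) for i in range(n)]
        grad = [sum(X[i][j] * resid[i] for i in range(n)) / n for j in range(d)]
        w = [soft_threshold(w[j] + step * grad[j], step * lam) for j in range(d)]
    return w

def fit_bagged_ensemble(X, y, n_models=11, seed=0):
    """Fit one sparse linear model per bootstrap resample of the training set."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(fit_sparse_linear([X[i] for i in idx], [y[i] for i in idx]))
    return models

def predict(models, x):
    """Majority vote over the signs of the individual linear scores."""
    votes = 0
    for w in models:
        score = sum(wj * xj for wj, xj in zip(w, x))
        votes += 1 if score >= 0 else -1
    return 1 if votes >= 0 else -1

def make_toy_data(n=40, d=20, seed=1):
    """Synthetic data (hypothetical): only feature 0 carries class information."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = 1 if i % 2 == 0 else -1
        row = [rng.gauss(0.0, 1.0) for _ in range(d)]
        row[0] = 1.5 * label + 0.5 * rng.gauss(0.0, 1.0)
        X.append(row)
        y.append(label)
    return X, y

X_toy, y_toy = make_toy_data()
ensemble = fit_bagged_ensemble(X_toy, y_toy)
train_acc = sum(predict(ensemble, x) == t for x, t in zip(X_toy, y_toy)) / len(y_toy)
print(f"training accuracy on toy data: {train_acc:.2f}")
```

In this sketch feature selection and classification happen in one step: the weights driven to zero by the penalty mark discarded features, and the surviving weights form the classifier. Real mass spectrometry data would call for standardised features, a held-out evaluation and a tuned penalty.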

Keywords: ensemble sparse classifiers; l0-norm solution; feature selection; mass spectrometry; sparse solvers; high-dimensional biological data; data classification; cancer datasets; support vector machines; adaptive boosting; bioinformatics.

DOI: 10.1504/IJDMB.2015.069416

International Journal of Data Mining and Bioinformatics, 2015 Vol.12 No.2, pp.167 - 183

Received: 20 Aug 2012
Accepted: 03 Jun 2013

Published online: 15 May 2015
