Authors: Sujay Saha; Dibyendu Bikash Seal; Anupam Ghosh; Kashi Nath Dey
Addresses: Department of Computer Science and Engineering, Heritage Institute of Technology, Chowbaga Road, Anandapur, Kolkata 700107, India; Department of Computer Science and Engineering, University of Engineering & Management, University Area, Plot No. III - B/5, New Town, Action Area - III, Kolkata 700160, India; Department of Computer Science and Engineering, Netaji Subhash Engineering College, Techno City, PO Panchpota, Garia, Kolkata 700152, India; Department of Computer Science and Engineering, University of Calcutta, Acharya Prafulla Chandra Roy Shiksha Prangan, JD - 2, Sector - III, Salt Lake City, Kolkata 700106, India
Abstract: Over the last few decades, a large amount of research has been carried out on genomic data. Cancer causes cells in specific tissues of the body to undergo uncontrolled division, which results in malignant growth or a tumour. Today, DNA microarray technologies allow us to monitor the expression patterns of thousands of genes simultaneously. Microarray gene expression data are characterised by very high dimensionality (genes) and a relatively small number of samples (observations). To identify which of these thousands of genes are responsible for a disease such as cancer, it is useful to rank the genes. In this paper, we propose a novel gene ranking method based on the Wilcoxon Rank Sum Test (WRST) and a genetic algorithm. WRST is used to reduce dimensionality, and the genetic algorithm is used to find the differentially expressed genes. The final subset of genes is cross-validated using the k-fold LOOCV method (with k varied across datasets) and then used to classify the data with an SVM with a linear kernel. The proposed method is first applied to two relatively new benchmark datasets, the GDS4382 colorectal cancer dataset and the GDS4794 small cell lung cancer dataset. The results show that it can reach up to 100% classification accuracy with very few dominant genes, which indirectly validates the biological and statistical significance of the proposed method. It is then applied to five real-life datasets, and the results are compared with a recent state-of-the-art approach on the basis of percentage accuracy, sensitivity and specificity, among other measures.
Keywords: gene ranking; DNA microarray; Wilcoxon rank sum test; genetic algorithms; SVM classifier; support vector machines; gene expression data; cancer; bioinformatics.
International Journal of Bioinformatics Research and Applications, 2016 Vol.12 No.3, pp.263 - 279
Received: 16 May 2015
Accepted: 22 Mar 2016
Published online: 01 Aug 2016
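
The abstract describes using the Wilcoxon Rank Sum Test to reduce dimensionality before the genetic algorithm runs. The following is a minimal sketch of that filtering step only, not the authors' implementation: it scores each gene by the absolute normal-approximation z-score of the rank-sum statistic (without tie correction of the variance) and keeps the top-scoring genes. The function names, the gene names, the expression values and the `top_k` cut-off are all hypothetical and chosen for illustration; the genetic algorithm, cross-validation and SVM stages are not shown.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_sum_z(group_a, group_b):
    """z-score of the Wilcoxon rank-sum statistic (normal approximation)."""
    n1, n2 = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    w = sum(ranks[:n1])  # sum of ranks belonging to group_a
    mean_w = n1 * (n1 + n2 + 1) / 2.0
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return (w - mean_w) / sd_w


def rank_genes(expression, labels, top_k):
    """Keep the top_k genes with the largest |z| between the two classes.

    expression: dict mapping gene name -> list of values, one per sample
    labels:     list of 0/1 class labels, one per sample
    """
    scores = {}
    for gene, values in expression.items():
        a = [v for v, y in zip(values, labels) if y == 1]
        b = [v for v, y in zip(values, labels) if y == 0]
        scores[gene] = abs(rank_sum_z(a, b))
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical toy data: three genes, six samples (three per class).
labels = [1, 1, 1, 0, 0, 0]
expression = {
    "geneA": [9.1, 8.7, 9.4, 2.0, 2.3, 1.9],
    "geneB": [5.0, 5.2, 4.9, 5.1, 5.3, 4.8],
    "geneC": [1.0, 1.2, 5.0, 7.5, 7.7, 3.0],
}
selected = rank_genes(expression, labels, top_k=2)
```

With real microarray data, a filter of this kind would shrink thousands of genes to a short candidate list, which a search procedure such as a genetic algorithm could then explore. An exact test or a tie-corrected variance may be preferable when samples are as few as in this toy example.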