International Journal of Data Mining and Bioinformatics (6 papers in press)
Deriving Enhanced Geographical Representations via Similarity-based Spectral Analysis: Predicting Colorectal Cancer Survival Curves in Iowa
by Michael Lash, Min Zhang, Xun Zhou, Nick Street, Charles Lynch
Abstract: Neural networks are capable of learning rich, nonlinear feature representations shown to be beneficial in many predictive tasks. In this work, we use such models to explore different geographical feature representations in the context of predicting colorectal cancer survival curves for patients in the state of Iowa, spanning the years 1989 to 2013. Specifically, we compare model performance using "area between the curves" (ABC) to assess (a) whether survival curves can be reasonably predicted for colorectal cancer patients in the state of Iowa, (b) whether geographical features improve predictive performance, (c) whether a simple binary representation or a richer, spectral analysis-elicited representation performs better, and (d) whether spectral analysis-based representations can be improved upon by leveraging geographically-descriptive features. In exploring (d), we devise a similarity-based spectral analysis procedure, which allows for the combination of geographically relational and geographically descriptive features. Our findings suggest that survival curves can be reasonably estimated on average, with predictive performance deviating at the five-year survival mark among all models. We also find that geographical features improve predictive performance, and that better performance is obtained using richer, spectral analysis-elicited features. Furthermore, we find that similarity-based spectral analysis-elicited representations improve upon the original spectral analysis results by approximately 40%.
Keywords: Geographical representations; Spectral analysis; Deep learning; Spectral clustering; Neural networks; Colorectal cancer; Survival curve.
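The abstract above does not spell out how "area between the curves" is computed. One plausible reading, sketched here as an assumption, is the integral of the absolute gap between an observed and a predicted survival curve over a shared time grid. The function name, the time grid and the curve values below are hypothetical and are not taken from the paper.

```python
import numpy as np

def area_between_curves(times, s_true, s_pred):
    """Approximate the area between two survival curves on a shared time grid.

    Assumption: ABC is the integral of |S_true(t) - S_pred(t)| over time,
    estimated here with the trapezoidal rule on the sampled gap.
    """
    times = np.asarray(times, dtype=float)
    gap = np.abs(np.asarray(s_true, dtype=float) - np.asarray(s_pred, dtype=float))
    # Trapezoidal rule written out explicitly: mean gap per interval times width.
    return float(np.sum((gap[1:] + gap[:-1]) / 2.0 * np.diff(times)))

# Hypothetical yearly survival probabilities over five years.
years = [0, 1, 2, 3, 4, 5]
observed = [1.0, 0.90, 0.80, 0.70, 0.60, 0.50]
predicted = [1.0, 0.85, 0.75, 0.70, 0.65, 0.55]

abc = area_between_curves(years, observed, predicted)
print(f"ABC = {abc:.3f}")
```

Under this reading a smaller value means a closer fit, and identical curves give zero. Because the gap is sampled only at the grid points, the estimate is approximate wherever the two curves cross between samples.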
Improving secretory proteins prediction in Mycobacterium tuberculosis using the unbiased dipeptide composition with support vector machine
by Saeed Ahmed, Muhammad Kabir, Muhammad Arif, Zakir Ali, Farman Ali, Zar Nawab Khan Swati
Abstract: Tuberculosis (TB), an infectious disease, remains a significant cause of death from bacterial infection worldwide. Recent biological research reveals that secretory proteins (SPs) are considered paramount antigenic agents in developing drugs and vaccines for the treatment of TB. Due to their biological importance, traditional experimental approaches are used for the identification of secretory proteins in Mycobacterium tuberculosis (MTB). However, these methods for identifying SPs are costly, slow and challenging due to the abundance of unknown sequences generated in the post-genomic era. Therefore, a computational method that predicts SPs with high precision is needed; the presented method incorporates unbiased evolutionary profile and discrete feature spaces with various machine learning algorithms, including support vector machine, k-nearest neighbor, probabilistic neural network, and generalized regression neural network. In addition, the SP training dataset is imbalanced, which causes classification errors; to tackle this problem, the well-known synthetic minority oversampling technique was adopted for resampling. The presented method achieved satisfactory outcomes in terms of accuracy (ACC) 97.0%, sensitivity (Sen) 99.24%, specificity (Spe) 92.53% and Matthews correlation coefficient (MCC) 0.932 using the jackknife test. It is demonstrated that the new model remarkably outperformed the existing state-of-the-art approaches. Our study might provide useful hints to the pharmaceutical industry in designing new drugs for TB treatment in particular, and to the research community in the area of computational biology and bioinformatics in general.
Keywords: Secretory proteins; Mycobacterium tuberculosis; Feature extraction; Oversampling; Support vector machine; Synthetic minority oversampling technique.
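The four metrics reported in the abstract above (ACC, Sen, Spe, MCC) all follow from a binary confusion matrix. The sketch below uses their standard definitions. The function name and the counts are hypothetical, chosen only for illustration, and do not reproduce the paper's figures.

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn) if (tp + fn) else 0.0   # sensitivity (recall on positives)
    spe = tn / (tn + fp) if (tn + fp) else 0.0   # specificity (recall on negatives)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally reported as 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"acc": acc, "sen": sen, "spe": spe, "mcc": mcc}

# Hypothetical counts, not taken from the paper.
metrics = binary_metrics(tp=90, tn=80, fp=20, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

MCC is usually the most informative of the four on imbalanced data, since it stays near zero for a classifier that simply favours the majority class.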
A Review on Biclustering of Gene Expression Microarray Data: Algorithms, Effective Measures and Validations
by Bhawani Sankar Biswal, Anjali Mohapatra, Swati Vipsita
Abstract: Gene expression microarray data analysis, which examines actual expression data to reveal relevant information about genes, proteins, diseases, etc., is a major area of research in pattern discovery. DNA microarrays enable the simultaneous assessment of gene expression levels and are often helpful in the study of gene co-regulation, gene function identification, pathway identification, gene regulatory networks, predictive toxicology, clinical diagnosis, and sequence variance. Popular microarray data mining techniques such as classification, clustering, biclustering, and association analysis rely on various statistical methods and machine learning algorithms. Many of these techniques are completely data-driven and are unable to exploit the significant amount of available biological knowledge. Hence, several types of validation are needed to validate the output and to prove its relevance. The evaluation measures used in a biclustering algorithm play a significant role in generating quality outputs with better accuracy. Selecting a proper measure for the data and the conditions is another major challenge in biclustering, as well as in other data and text mining techniques. This review article gives a brief overview of three factors, i.e., biclustering algorithms, relevant evaluation measures and different types of validations, in the context of the biclustering of gene expression microarray data.
Keywords: Gene expression microarray data; Biclustering; Metric and Non-metric based biclustering algorithms; Inter and Intra-biclusters evaluation functions.
A Corpus-Oriented Perspective on Terminologies of Side Effect and Adverse Reaction in Support of Text Retrieval for Drug Repurposing
by Alex Chengyu Fang, Yemao Liu, Yaping Lu, Jing Cao, Jingbo Xia
Abstract: Text resource selection is a primary concern for efficient and wide-coverage document processing for the extraction of required bio-medical information, while size control and topic relevance are key issues to ensure high-quality output from the retrieval system. In this study, we analyzed terms and terminologies used for adverse reactions and side effects of drugs. Furthermore, we proposed an effective strategy that used an intersection of unions of both 'adverse react' hyponyms and 'side effect' hyponyms to evaluate their semantic relationships, including the similarities and differences, in massive biomedical texts. Our results showed that the hyponyms related to these two superordinates perform differently in their use as signifiers of relevant documents. Our proposed strategy yielded an optimal trade-off for relevant abstract retrieval, and we followed it with empirical drug/gene matching work to test the strategy. The results also confirmed that the strategy was capable of maintaining a good trade-off between text size and content relevance.
Keywords: terminology; Jaccard similarity coefficient; text retrieval; side effects; adverse reaction.
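The "intersection of unions" strategy and the Jaccard similarity coefficient named in this entry can be illustrated with plain set operations. This is a minimal sketch under stated assumptions: each hyponym maps to the set of abstract IDs that mention it, and the term names and IDs below are invented for illustration rather than taken from the paper's corpus.

```python
def jaccard(a, b):
    """Jaccard similarity coefficient |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def intersection_of_unions(hits_a, hits_b):
    """Pool the documents retrieved by each hyponym family, then intersect."""
    union_a = set().union(*hits_a.values())
    union_b = set().union(*hits_b.values())
    return union_a & union_b

# Hypothetical abstract IDs retrieved per hyponym (illustrative only).
adverse_hits = {"adverse event": {1, 2, 3}, "adverse drug reaction": {3, 4}}
side_hits = {"side effect": {2, 3, 5}, "unwanted effect": {4, 6}}

shared = intersection_of_unions(adverse_hits, side_hits)
similarity = jaccard(set().union(*adverse_hits.values()),
                     set().union(*side_hits.values()))
print(sorted(shared), similarity)
```

Intersecting the two pooled sets keeps only documents signalled by both term families, which is one way to limit text size while preserving topical relevance.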
Facial Expression Awareness based on Multiscale Permutation Entropy of EEG
by Xiaofeng Liu, Bin Hu, Xiangwei Zheng, Xiaowei Li
Abstract: Electroencephalogram (EEG) is a comprehensive manifestation of the dynamic activity of human brain neurons and has been proven to have the potential to serve as an effective biomarker for identifying subtle emotion- or cognition-related changes. This paper focuses on facial expression awareness and proposes multi-scale permutation entropy (MPE) of EEG data with the aim of finding a convenient and accurate method for identifying different facial expressions. First, the principle and computational procedure of MPE are introduced. Then, MPE analysis of EEG for facial expression awareness is detailed. Finally, computational analysis is conducted. In the first experiment, the influence of the scale factor on the MPE values is investigated; the entropy value tends to increase with the scale factor when the scale factor is less than 5. In the second experiment, the analysis results show that the MPE of the angry expression EEG is higher than that of the happy expression EEG. Furthermore, we analysed the MPE in the form of a boxplot and found that the two expressions of anger and happiness can be distinguished clearly and that MPE can be used to predict angry and happy expressions based on EEG signals.
Keywords: EEG; permutation entropy; multi-scale permutation entropy; facial expression awareness.
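Multi-scale permutation entropy is commonly computed by coarse-graining a signal with non-overlapping window averages at each scale factor and then taking the normalised permutation entropy of the coarse-grained series. The sketch below follows that common formulation; it is not confirmed to match the paper's exact procedure. The embedding order, the delay and the synthetic test signal are assumptions, and the signal is random noise rather than EEG.

```python
import math
from collections import Counter

import numpy as np

def coarse_grain(x, scale):
    """Average consecutive non-overlapping windows of length `scale`."""
    x = np.asarray(x, dtype=float)
    n = len(x) // scale
    return x[: n * scale].reshape(n, scale).mean(axis=1)

def permutation_entropy(x, order=3, delay=1):
    """Normalised permutation entropy (0 = fully regular, 1 = maximally irregular)."""
    x = np.asarray(x, dtype=float)
    n_patterns = len(x) - (order - 1) * delay
    if n_patterns <= 0:
        raise ValueError("series too short for the chosen order and delay")
    # Count how often each ordinal pattern (rank ordering of a window) occurs.
    counts = Counter(
        tuple(np.argsort(x[i : i + order * delay : delay], kind="stable"))
        for i in range(n_patterns)
    )
    probs = np.array(list(counts.values()), dtype=float) / n_patterns
    return float(-np.sum(probs * np.log(probs)) / math.log(math.factorial(order)))

def multiscale_permutation_entropy(x, scales, order=3, delay=1):
    """Permutation entropy of the coarse-grained series at each scale factor."""
    return [permutation_entropy(coarse_grain(x, s), order, delay) for s in scales]

# Synthetic stand-in signal (assumption: not real EEG data).
rng = np.random.default_rng(0)
signal = rng.standard_normal(2000)
mpe = multiscale_permutation_entropy(signal, scales=[1, 2, 3, 4, 5])
print([round(v, 3) for v in mpe])
```

Each scale shortens the series by the scale factor, so large scales on short recordings leave too few ordinal patterns for a stable estimate.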
Association Test for Rare variants using the Hamming Distance
by Suhyun Hwangbo, Jin-Young Jang, Bermseok Oh, Atsuko Imai-Okazaki, Jurg Ott, Taesung Park
Abstract: The recent development of DNA sequencing technology has given rise to many statistical methods for rare variant association studies (RVASs). However, these methods can lose power in association studies with small samples. In this study, we propose two statistical approaches applicable for RVASs when the sample size is not large. Our approaches are based on the Hamming distance. Existing Hamming distance-based methods mainly analyze common variants. For rare variant data with a small sample size, we extended two existing methods by using the weight based on minor allele frequency. Through simulation studies, we show that our proposed approaches control type 1 error rates and are more powerful even when given very small sample sizes. They also work well regardless of the direction of causal SNP effects. Applying these methods to real data, we confirmed that they identified true causal genes well. Based on the results of this study, we firmly believe that our proposed methods are powerful for small sample data.
Keywords: RVASs; rare variant association studies; hamming distance; MAF; minor allele frequency.
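The abstract above says the proposed approaches weight the Hamming distance by minor allele frequency but does not give the weighting. The sketch below is therefore only an illustration of the general idea: it counts genotype mismatches between two individuals and up-weights rarer variants using 1/sqrt(MAF(1 - MAF)), a weighting assumed here and not confirmed by the source. The genotypes and allele frequencies are hypothetical.

```python
import numpy as np

def weighted_hamming(g1, g2, weights=None):
    """Sum of per-variant weights over positions where two genotype vectors differ.

    With weights=None this is the ordinary Hamming distance.
    """
    g1, g2 = np.asarray(g1), np.asarray(g2)
    if g1.shape != g2.shape:
        raise ValueError("genotype vectors must have the same length")
    mismatch = (g1 != g2).astype(float)
    if weights is None:
        weights = np.ones(len(g1))
    return float(np.sum(np.asarray(weights, dtype=float) * mismatch))

# Hypothetical genotypes coded as minor-allele counts (0, 1 or 2) at four variants.
person_a = [0, 1, 0, 2]
person_b = [0, 0, 0, 2]

# Hypothetical minor allele frequencies; the weight formula is an assumption.
maf = np.array([0.01, 0.02, 0.005, 0.03])
maf_weights = 1.0 / np.sqrt(maf * (1.0 - maf))

plain = weighted_hamming(person_a, person_b)
weighted = weighted_hamming(person_a, person_b, maf_weights)
print(plain, round(weighted, 3))
```

With such weights a mismatch at a rare variant contributes more to the distance than one at a common variant, which is the intuition behind MAF-based weighting in rare variant tests.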