International Journal of Data Mining and Bioinformatics (11 papers in press)
by Yan Yan, Connor Burbridge, Jinhong Shi, Juxin Liu, Anthony Kusalik
Abstract: Many software packages have been developed for genome-wide association studies (GWAS) based on various statistical models. One key factor influencing the statistical reliability of GWAS is the amount of input data used. In this paper we investigate how input data quantity influences the output of four widely used GWAS programs, PLINK, TASSEL, GAPIT, and FaST-LMM, in the context of plant genomes and phenotypes. Both synthetic and real data are used. Evaluation is based on the p- and q-values of output SNPs, and on the Kendall rank correlation between output SNP lists. Results show that for the same GWAS program, different Arabidopsis thaliana datasets demonstrate similar trends of rank correlation with varied input quantity, but differ in the numbers of SNPs passing a given p- or q-value threshold. We also show that variations in the number of replicates influence the p-values of SNPs, but do not strongly affect the rank correlation.
Keywords: GWAS; Genome-Wide Association Study; Arabidopsis thaliana; plant phenomics; plant genomics; PLINK; TASSEL; GAPIT; FaST-LMM; statistical power; input data quantity; epistasis.
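The abstract above compares output SNP lists by Kendall rank correlation. As a minimal sketch of that measure (not the authors' pipeline), the following computes Kendall's tau-a between two rankings of the same SNPs. The function name, the dictionary input format, and the SNP identifiers are illustrative assumptions, and tied ranks are assumed to be absent.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall tau-a between two rankings given as {snp_id: rank} dicts.

    Only SNPs present in both rankings are compared; ranks are assumed
    to contain no ties (an assumption of this sketch).
    """
    snps = sorted(set(rank_a) & set(rank_b))
    n = len(snps)
    if n < 2:
        raise ValueError("need at least two shared SNPs")
    concordant = discordant = 0
    for s, t in combinations(snps, 2):
        # A pair is concordant when both rankings order it the same way.
        product = (rank_a[s] - rank_a[t]) * (rank_b[s] - rank_b[t])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical rankings of four SNPs from two runs of a GWAS program.
run_full = {"snp1": 1, "snp2": 2, "snp3": 3, "snp4": 4}
run_half = {"snp1": 1, "snp2": 3, "snp3": 2, "snp4": 4}
tau = kendall_tau(run_full, run_half)
```

A value near 1 means the two runs rank the shared SNPs almost identically; a value near -1 means the order is reversed.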
Facial Expression Awareness based on Multiscale Permutation Entropy of EEG
by Xiaofeng Liu, Bin Hu, Xiangwei Zheng, Xiaowei Li
Abstract: The electroencephalogram (EEG) is a comprehensive manifestation of the dynamic activity of human brain neurons and has been shown to have the potential to serve as an effective biomarker for identifying subtle emotion- or cognition-related changes. This paper focuses on facial expression awareness and proposes multi-scale permutation entropy (MPE) of EEG data, with the aim of finding a convenient and accurate method for identifying different facial expressions. First, the principle and computational procedure of MPE are introduced. Then, MPE analysis of EEG for facial expression awareness is detailed. Finally, computational analysis is conducted. In the first experiment, the influence of the scale factor on the MPE values is investigated; the entropy value tends to increase with the scale factor when the scale factor is less than 5. In the second experiment, the analysis results show that the MPE of the angry expression EEG is higher than that of the happy expression EEG. Furthermore, we analysed the MPE in the form of a boxplot and found that the two expressions of anger and happiness can be distinguished clearly and that MPE can be used to predict angry and happy expressions based on EEG signals.
Keywords: EEG; permutation entropy; multi-scale permutation entropy; facial expression awareness.
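For readers unfamiliar with MPE, the usual construction can be sketched as follows: coarse-grain the signal by averaging non-overlapping windows whose length is the scale factor, then compute the permutation entropy of each coarse-grained series. This is a generic illustration, not the paper's code; the function names, the embedding order of 3, the unit delay, and the normalisation by log(order!) are assumptions.

```python
import math
from collections import Counter


def coarse_grain(signal, scale):
    """Average non-overlapping windows of length `scale`."""
    n_windows = len(signal) // scale
    return [sum(signal[i * scale:(i + 1) * scale]) / scale
            for i in range(n_windows)]


def permutation_entropy(signal, order=3, normalize=True):
    """Shannon entropy of ordinal patterns of length `order` (delay 1)."""
    n_patterns = len(signal) - order + 1
    if n_patterns < 1:
        raise ValueError("signal is shorter than the embedding order")
    # Each window is mapped to the index order that sorts its values.
    counts = Counter(
        tuple(sorted(range(order), key=lambda k: signal[i + k]))
        for i in range(n_patterns)
    )
    entropy = -sum((c / n_patterns) * math.log(c / n_patterns)
                   for c in counts.values())
    if normalize:
        entropy /= math.log(math.factorial(order))
    return abs(entropy)  # abs() only turns -0.0 into 0.0


def multiscale_permutation_entropy(signal, order=3, max_scale=5):
    """Permutation entropy at scale factors 1..max_scale."""
    return [permutation_entropy(coarse_grain(signal, s), order)
            for s in range(1, max_scale + 1)]
```

Applied to one EEG channel, `multiscale_permutation_entropy(channel)` returns one entropy value per scale factor, which is the kind of curve the first experiment in the abstract examines.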
Association Test for Rare variants using the Hamming Distance
by Suhyun Hwangbo, Jin-Young Jang, Bermseok Oh, Atsuko Imai-Okazaki, Jurg Ott, Taesung Park
Abstract: The recent development of DNA sequencing technology has given rise to many statistical methods for rare variant association studies (RVASs). However, these methods can lose power in association studies with small samples. In this study, we propose two statistical approaches applicable to RVASs when the sample size is not large. Our approaches are based on the Hamming distance. Existing Hamming distance-based methods mainly analyze common variants. For rare variant data with a small sample size, we extended two existing methods by adding a weight based on minor allele frequency. Through simulation studies, we show that our proposed approaches control type 1 error rates and are more powerful even when given very small sample sizes. They also work well regardless of the direction of causal SNP effects. Applying these methods to real data, we confirmed that they identified true causal genes well. Based on the results of this study, we firmly believe that our proposed methods are powerful for small sample data.
Keywords: RVASs; rare variant association studies; Hamming distance; MAF; minor allele frequency.
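To make the idea of a frequency-weighted Hamming distance concrete, here is a small sketch. The abstract does not state the weight function, so the weight 1/sqrt(p(1-p)) used below (a common choice for up-weighting rare variants) is an assumption, as are the function names and the 0/1/2 genotype coding.

```python
import math


def minor_allele_frequency(site_genotypes):
    """MAF at one site from genotypes coded as 0, 1 or 2 allele copies."""
    freq = sum(site_genotypes) / (2 * len(site_genotypes))
    return min(freq, 1 - freq)


def maf_weights(genotype_matrix):
    """One weight per site; rarer variants get larger weights.

    The weight 1/sqrt(p(1-p)) is an assumed choice, not the paper's.
    Monomorphic sites (p == 0) get weight 0.
    """
    n_sites = len(genotype_matrix[0])
    weights = []
    for j in range(n_sites):
        p = minor_allele_frequency([row[j] for row in genotype_matrix])
        weights.append(1 / math.sqrt(p * (1 - p)) if p > 0 else 0.0)
    return weights


def weighted_hamming(genotype_a, genotype_b, weights):
    """Sum of site weights over the sites where two individuals differ."""
    return sum(w for a, b, w in zip(genotype_a, genotype_b, weights)
               if a != b)


# Hypothetical genotype matrix: 4 individuals (rows) by 2 sites (columns).
genotypes = [[0, 0], [0, 1], [0, 1], [0, 0]]
weights = maf_weights(genotypes)
distance = weighted_hamming(genotypes[0], genotypes[1], weights)
```

With all weights equal to 1 the function reduces to the ordinary Hamming distance, so the weighting only changes how much each differing site contributes.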
Function Prediction of Cancer-related LncRNAs Using Heterogeneous Information Network Model
by Sunil Kumar P V, Manju M, Gopakumar G
Abstract: The aberrant expression of lncRNAs is proven to be one of the prime reasons for cancer progression. Recent studies recommend lncRNAs as potential therapeutic targets in cancer. The overexpression of oncogenic lncRNAs causes tumour progression, whereas that of tumour suppressor lncRNAs leads to apoptosis. In this paper, we propose a Support Vector Machine classifier, based on a heterogeneous information network, that classifies lncRNAs as oncogenic or tumour suppressor. Interactions of lncRNAs with other lncRNAs and proteins, along with protein-protein interactions, are used to build the network. The model classified lncRNAs as oncogenic or tumour suppressor with an accuracy of 0.83 and achieved an accuracy of 0.80 in an independent validation. A comparison with recently reported studies shows that the prediction results are in line with them.
Availability: The source code and the sample data are freely available at: bdbl.nitc.ac.in/CanLNCClassify
Keywords: LncRNA; Cancer; Heterogeneous Information Network; Meta-Path; Classification; Support Vector Machine.
Predicting survival outcomes in ovarian cancer using gene expression data
by TaeJin Ahn, Nayeon Kang, Yonggab Kim, Se Ik Kim, Yong-Sang Song, Taesung Park
Abstract: Although ovarian cancer ranks only eighth in cancer incidence among women, it is the most lethal gynaecological malignancy. About 70% of ovarian cancers are high-grade serous ovarian cancer (HGSOC). Early-stage HGSOC has a survival rate of more than 90 percent, but most diagnoses occur after the third stage, such that the overall survival rate worldwide is only about 35 percent. To detect early ovarian cancer, many studies have attempted to identify HGSOC-associated genes, such as BRCA, better known as a breast cancer-related gene. In addition to early diagnosis, prognostic tools and the identification of drug response-related genes are much needed. Consequently, in this study, we endeavoured to identify HGSOC-related genes from RNA-seq data. We further suggest that stable extraction of genes could overcome difficulties regarding the reproducibility of existing RNA-seq data. The scheme we present was used to analyze an ovarian cancer RNA-seq dataset from The Cancer Genome Atlas (TCGA), an open data source. We developed a new gene selection strategy using leave-one-out cross validation (LOOCV). This method showed better performance than a previously developed method when evaluated on the same data set. Using this method, we could also infer biological functions of the selected genes: the outcomes were not commonly affected by one gene; instead, subsets of samples were associated with different subsets of genes. These findings suggest that multiple signaling pathways contribute to ovarian cancer patient survival.
Keywords: RNA-seq; ovarian cancer; The Cancer Genome Atlas (TCGA); survival analysis.
SPARTA: Super-fast Permutation AppRoach To Approximate extremely low p-values
by Sangseob Leem, Dae Ho Lee, Taesung Park
Abstract: The permutation test, a non-parametric method for assessing statistical significance that is now widely used in many disciplines, including bioinformatics, is very useful when the null distribution of a test statistic is unknown or hard to determine. In permutation tests, the precision of the significance estimate depends on the number of permutations, and computation time precludes achieving extremely low p-values. In this paper, we propose a novel strategy for approximating extremely low p-values. Our proposed method consists of three steps: (1) divide the data into subsets and perform permutation tests on the subsets; (2) integrate the p-values by Stouffer's z-score method; and (3) repeat the first and second steps and average the results. We demonstrate and validate our method using simulation studies and two real biological examples. These assessments showed that p-values of about 1.0e-20 and 1.0e-50 could be well estimated by the proposed method in a single day for samples larger than 5,000.
Keywords: permutation test; low p-value; rapid approximation.
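The three steps named in the SPARTA abstract can be sketched roughly as below. This is an illustration under stated assumptions, not the authors' implementation: the test statistic (a one-sided difference in means between two groups), the random equal-sized split, the clamping of p-values just below 1, and the choice to average log10 p-values across repeats are all assumptions, and every function name is hypothetical.

```python
import math
import random
from statistics import NormalDist, mean


def permutation_p_value(x, y, n_perm, rng):
    """One-sided permutation p-value for mean(x) > mean(y)."""
    observed = mean(x) - mean(y)
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if mean(pooled[:len(x)]) - mean(pooled[len(x):]) >= observed:
            hits += 1
    # The +1 terms keep the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


def stouffer_combine(p_values):
    """Combine one-sided p-values with Stouffer's unweighted z-score method."""
    # Clamp below 1 so the inverse normal CDF stays defined (an assumption).
    z_scores = [-NormalDist().inv_cdf(min(p, 1 - 1e-12)) for p in p_values]
    z_combined = sum(z_scores) / math.sqrt(len(z_scores))
    # erfc keeps precision in the far upper tail, unlike 1 - cdf(z).
    return 0.5 * math.erfc(z_combined / math.sqrt(2))


def subset_permutation_p_value(x, y, n_subsets, n_perm, n_repeats, seed=0):
    """Steps (1)-(3): split, test each subset, combine, repeat, average."""
    rng = random.Random(seed)
    log_p_values = []
    for _ in range(n_repeats):
        xs, ys = list(x), list(y)
        rng.shuffle(xs)
        rng.shuffle(ys)
        # Step (1): a random split into n_subsets roughly equal parts.
        subset_ps = [
            permutation_p_value(xs[k::n_subsets], ys[k::n_subsets], n_perm, rng)
            for k in range(n_subsets)
        ]
        # Step (2): integrate the subset p-values.
        log_p_values.append(math.log10(stouffer_combine(subset_ps)))
    # Step (3): average over repeats, here on the log10 scale (an assumption).
    return 10 ** mean(log_p_values)
```

The point of the construction is that each subset test needs only a modest number of permutations, while the combined z-score can reach p-values far below 1/(n_perm + 1).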
sleBioRepo: a gene expression compendium in systemic lupus erythematosus
by Sungjin Park, Seungyoon Nam
Abstract: Systemic lupus erythematosus (SLE, lupus) is an autoimmune disease in which the human immune system attacks the body's own organs and tissues. However, SLE is very difficult to diagnose because it mimics numerous other conditions, and a manually curated database of mRNA expression levels and SLE association studies has yet to be established. Here, we constructed a user-friendly database, sleBioRepo, equipped with a graphical web interface for simple browsing or searching of differential gene expression between control and case groups. Datasets in our database cover 32 Gene Expression Omnibus (GEO) series, consisting of 2,042 samples of diverse body tissues. We also provide demographics and differentially expressed genes for user-selected samples and genes, for the purpose of improving knowledge of this pathology. We assert that our database will serve as a comprehensive resource for mRNA biomarker studies, as well as for prioritizing mRNAs for functional validation in SLE, with likely extension to other diseases.
Keywords: systemic lupus erythematosus; SLE; lupus; curated database.
Fast practical on-line exact single and multiple pattern matching algorithms in highly similar sequences
by Nadia Ben Nsira, Thierry Lecroq, Elise Prieur-Gaston
Abstract: With the advent of high-throughput sequencing technologies, more and more genomic sequences of individuals of the same species are available. These sequences differ only by a very small number of variations. There is thus a strong need for efficient algorithms for performing fast pattern matching in such specific sets of sequences. In this paper we propose efficient practical algorithms that solve the on-line exact pattern matching problem in a set of highly similar DNA sequences. We first present a method for exact single pattern matching when k variations are allowed in a window whose size is equal to the pattern length. We then propose an algorithm for exact multiple pattern matching when only one variation is allowed in a window whose size is equal to the length of the longest pattern. Experimental results show that our algorithms, though not optimal in the worst case, perform well in practice.
Keywords: Computational biology; bioinformatics; algorithm design; pattern matching; string matching; DNA sequences; genomic sequences; Landau-Vishkin algorithm; Aho-Corasick algorithm; similar sequences.
Topological data analysis can extract subgroups with high incidence rates of Type 2 diabetes
by Hyung Sun Kim, Chahngwoo Yi, Yongkang Kim, Uhnmee Park, Woong Kook, Bermseok Oh, Hyuk Kim, Taesung Park
Abstract: Type 2 diabetes (T2D) is now a rapidly increasing worldwide scourge, and the identification of genetic contributors is vital. However, analysing multiple disease-contributing factors and their combined interactions remains quite difficult with traditional approaches. Topological data analysis (TDA) shows what shape a data set can have, facilitating clustering analysis by determining which components are close to each other. Thus, TDA can generate a network from single-nucleotide polymorphism (SNP) data, revealing the genetic relatedness of specific individuals, and can derive multiple ordered subgroups, from one with a low patient concentration to one with a high patient concentration. Since it is widely accepted that T2D pathogenesis is affected by multiple genetic factors, we performed TDA on T2D data from the Korea Association REsource (KARE) project, a population-based, genome-wide association study of the Korean adult population. Since the KARE data contain follow-up information about the incidence of T2D, we compared the T2D status of each individual at baseline with that of 10 years later. For the TDA network-driven subgroups, ordered by prevalence, we compared the T2D incidence rate after 10 years for individuals initially without T2D. As a result, we found that the TDA network-driven, ordered subgroups had significantly increased incidence rates, linearly correlated with prevalence (p-value = 0.006914). Our results demonstrate the usefulness of TDA in identifying both genetic contributors (e.g., SNPs) and their interrelationships in the pathology of complex diseases.
Keywords: type 2 diabetes; Korea Association REsource (KARE); single-nucleotide polymorphism; topological data analysis; network; subgroup analysis.
Exploratory analysis for detecting population structures by iterative pruning based on independent component analysis
by Mira Park, Eunbin Choi, Heonsu Lee, Yongkang Kim, Taesung Park
Abstract: One of the main issues in genome-wide association studies (GWAS) is detecting population stratification. As population stratification can produce spurious associations in GWAS, it is necessary to find hidden structures and assign individuals to subpopulations in advance. We suggest an exploratory approach for population structure analysis based on independent component analysis (ICA). ICA is an unsupervised approach to identifying and separating mixed sources from observed signals with little prior information. Both ICA and principal component analysis (PCA) reduce data into smaller sets of components. However, unlike PCA, ICA can treat non-Gaussian data and use higher moments. To determine the population structure, we first reduce the dimensionality of samples by projecting the data to a lower-dimensional subspace built by ICA. The samples are then bisected using fuzzy clustering. Repeating this procedure until some predetermined stopping criterion is met, we can detect the population structure and assign individuals to subpopulations. Information about the optimal number of subpopulations can also be obtained. We consider negative entropy as a measure of the importance of each component. To assess the proposed method, we analyze simulated genotypic data with different degrees of structure. We compare the proposed method to current PCA-based methods. Real data from the HapMap project are also analyzed for illustration.
Keywords: Independent component analysis; pruning; genome-wide association study; negative entropy; subpopulation.
Protein Fold Recognition Model based on Cubic Lattice
by Farzad Peyravi, Ali Mohammad Latif, Seyed Mohammad Moshtaghioun
Abstract: Proteins are essential for the biological processes in the human body. They can only perform their functions when they fold into their tertiary structure. We propose a novel fold recognition method for protein tertiary structure prediction based on a hidden Markov model and the 3D coordinates of amino acid residues. The method introduces states based on the basis vectors of the Bravais cubic lattice to recognize the fold of proteins. In the overall experiment, the proposed model is more accurate than SAM, 3-HMM optimized, and Markov chain.
Keywords: Protein structure prediction; Tertiary Structure; Fold recognition; Hidden Markov model; Bravais lattice; Cubic lattice.