International Journal of Data Mining and Bioinformatics (7 papers in press)
Regular Issues
Analyzing SEER Cancer Data Using Signed Maximal Frequent Itemset Networks  by Yunuscan Kocak, Tansel Ozyer Abstract: Background: Evaluating patient prognosis is an important factor for predicting
the effects and consequences of diseases. With advances in population-level
data collection and in statistical models, it has become
possible to develop systems capable of analyzing disease prognosis. Powered
by data mining and machine learning techniques, systems can find interesting
properties within a dataset and predict unseen cases. Initial and important steps
in this process are known as feature extraction and feature selection. Feature
extraction is the process of developing new features based on existing features
whereas feature selection is the process of deciding which features will be used
within the model.
Methods: Grouping many features into a single one and understanding
relationships between features have proven to be a good approach for selecting
strong features. In this work, a novel network-based feature extraction method
is presented and tested on two cancer cases, namely (1) Lung and Bronchus
cancer, and (2) Pancreatic cancer. Named the Signed Maximal Frequent Itemset
Network, the proposed method uses maximal frequent itemsets as actors in a
network and extracts features by considering their co-occurrence and structure of
the subgraph. Maximal frequent itemsets are selected as actors to have a compact
representation of the features and a network is created to model the relationship
between these actors.
Results: For performance comparison, extracted features and original features
are tested with several well-established machine learning
algorithms. In both cases, comparatively the best results are obtained when itemsets
are used as features, and combining extracted and original features improves
performance, which is measured with the root mean square error (RMSE). The
reported RMSE is 13.74 for Lung and Bronchus cancer and 7.60 for Pancreatic
cancer. To investigate patterns on prediction, the top 10 maximal
itemsets are selected with the recursive feature elimination method and their
distributions are analyzed.
Conclusions: In most cases, features created from itemsets and extracted
from the network increased the performance of the well-known machine
learning algorithms compared to the original features. Itemset analysis confirmed
existing knowledge. The analysis showed that
survival months are low for cases where information on the disease
was unknown or blank, and higher for cases where chemotherapy was given and
the primary site was labelled, such as head of the pancreas.
Keywords: cancer data analysis; frequent pattern mining; machine learning;
network analysis; signed networks; maximal frequent itemsets; feature selection;
lung cancer; pancreatic cancer.
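
To make the term "maximal frequent itemset" in the abstract above concrete, here is a minimal brute-force sketch. It is not the authors' implementation: the attribute tokens, the toy records, and the support threshold are all hypothetical, and a real SEER-scale analysis would need a dedicated mining algorithm rather than exhaustive enumeration. An itemset is frequent if it occurs in at least `min_support` records, and maximal if no frequent superset exists.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for every itemset seen in >= min_support records."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for candidate in combinations(items, size):
            cset = frozenset(candidate)
            count = sum(1 for t in transactions if cset <= t)
            if count >= min_support:
                frequent[cset] = count
                found_any = True
        if not found_any:
            # Downward closure: no frequent itemset of this size means
            # no larger frequent itemset can exist either.
            break
    return frequent


def maximal_frequent_itemsets(transactions, min_support):
    """Keep only frequent itemsets that have no frequent proper superset."""
    frequent = frequent_itemsets(transactions, min_support)
    return {s for s in frequent if not any(s < other for other in frequent)}


# Hypothetical patient records encoded as attribute=value tokens.
records = [
    frozenset({"chemo=yes", "site=head", "stage=local"}),
    frozenset({"chemo=yes", "site=head"}),
    frozenset({"chemo=yes", "site=head", "stage=distant"}),
    frozenset({"chemo=no", "stage=distant"}),
    frozenset({"chemo=no", "stage=distant"}),
]

maximal = maximal_frequent_itemsets(records, min_support=2)
for itemset in sorted(maximal, key=sorted):
    print(sorted(itemset))
```

In the proposed method these maximal itemsets would then serve as the actors (nodes) of the network; how edges are signed is described in the paper itself and is not reproduced here.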
Multiple-Ensemble Methods for Prediction of Alzheimer Disease  by Ashutosh Mishra Abstract: Alzheimer's disease (AD) is a neurodegenerative disease for which no permanent cure is yet available. However, predicting it at an early stage may extend a person's life span by many years. The main challenge is to detect AD at an early stage and select the features responsible for it. The objective of this study was to predict AD at an early stage and identify the features that facilitate early prediction using ensemble learning. First, we evaluated different machine-learning and deep-learning models on the ADNI dataset. The proposed multiple-ensemble method overcomes the limitations of existing models by applying feature selection for early prediction; the best ensemble model uses the top six selected features and achieves an accuracy of 96.71% with a higher ROC. Our model performed well compared with other machine-learning and deep-learning models. Keywords: Alzheimer Disease (AD); Machine Learning (ML); Ensemble Learning (EL); Deep Learning (DL); Feature Selection.
Diagnostic and prognostic value of HSPD1 in esophageal cancer  by Xin Chen, Can Luo, Yuting Bai, Xi Zhou, Lei Xu, Xiaolan Guo, Qing Wu, Xiaowu Zhong Abstract: HSPD1 is a potential biomarker in many cancers. However, its role in esophageal cancer (ESCA) is poorly understood. Among patients with ESCA, high HSPD1 expression is linked to a poor outcome. As suggested by Cox analysis results combined with the ROC (receiver operating characteristic) curve, HSPD1 is an independent outcome predictor for the ESCA population and has diagnostic value. Moreover, HSPD1 is linked to immune infiltration, genetic alteration and methylation in ESCA, and is also involved in biological processes and pathways such as the chaperonin-containing T-complex, the PI3K/Akt signalling pathway, and the thyroid hormone signalling pathway. According to a final analysis of drug susceptibility, low HSPD1 expression is correlated with resistance to 23 drugs. These findings provide new insights into the probable role of HSPD1 as a predictor in ESCA diagnosis and prognosis. Keywords: HSPD1; Esophageal cancer; Bioinformatics; Prognosis; Diagnosis; Biomarkers. DOI: 10.1504/IJDMB.2021.10048132
Smart Variant Filtering  by Vladimir Kovacevic, Predrag Obradovic Abstract: Variant filtering as a part of the genome reconstruction process is used for identifying falsely called variants. Availability of truth set variants published for several human DNA samples enabled the creation of the machine learning-based Smart Variant Filtering tool and framework for filtering germline variants. Conceptually, the framework consists of selecting an optimal machine learning algorithm, configuration, and set of features, and producing a model used for filtering novel variants. In a direct comparison, we demonstrated that the presented solution outperforms the variant filtering currently used within most secondary DNA analyses. Smart Variant Filtering increases the precision of called single nucleotide variants (removes false positives) by up to 0.2% while keeping the overall f-score higher by 0.12-0.27% than in existing solutions. The precision of calling insertions and deletions is increased by up to 7.8%, while the f-score increase is in the range of 0.1 to 3.2%. Keywords: genomic variant filtering; variant calling; machine learning.
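
The abstract above reports gains in precision and f-score. As a reminder of how these metrics relate, here is a small sketch using the standard definitions. The variant counts are hypothetical and chosen only for illustration; they are not taken from the paper, and the function name `precision_recall_f1` is likewise an assumption.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)  # harmonic mean of precision and recall
    return precision, recall, f1


# Hypothetical call sets: filtering removes false positives (fp drops)
# at the cost of discarding a few true variants (tp drops, fn rises).
before = precision_recall_f1(tp=990, fp=10, fn=10)
after = precision_recall_f1(tp=989, fp=5, fn=11)

print("before: precision=%.4f recall=%.4f f1=%.4f" % before)
print("after:  precision=%.4f recall=%.4f f1=%.4f" % after)
```

In this toy case the filter trades a small loss in recall for a larger gain in precision, so the f-score still rises, which is the kind of trade-off the abstract describes.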
Comorbidities and risk factors impact of COVID-19 in Mexico: A Feature Utility Metrics Approach  by Eduardo Emmanuel Rodríguez López, Daniel Hernández González, Francisco Javier Álvarez Rodríguez, Julio Cesar Ponce Gallegos Abstract: By applying Machine Learning, it is possible to determine the impact of the main comorbidities and risk factors associated with COVID-19 based on an analysis of official Mexican Secretary of Health data. This analysis was performed using Feature Utility Metrics: Mutual Information (MI), Permutation Importance (PI), and Partial Dependence Plot (PDP) with two different learning models (RandomForest and XGBoost), finding similarities between these metrics. According to these models, the main comorbidities and risk factors associated with COVID-19 are Age, Gender, Obesity, Diabetes, and Hypertension. Regarding MI and PI (RandomForest), the main risk factor is Age, while for PI (XGBoost) it is Obesity. Finally, the PDP graph for Age shows that the associated probability of risk of COVID-19 infection increases considerably after 60 years of age. Therefore, it was confirmed that the main comorbidities and risk factors associated with COVID-19 in Mexico are consistent with the diseases and conditions most present in the population. Keywords: Comorbidities; COVID-19 risk factors; mutual information; permutation importance; feature utility metrics. DOI: 10.1504/IJDMB.2021.10048434
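
Of the feature utility metrics named in the abstract above, mutual information is the simplest to compute by hand. The sketch below implements the textbook estimate for two discrete variables, in nats. The variable names and the toy samples are hypothetical and are not drawn from the Mexican Secretary of Health data; the paper's own estimator may differ.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    # Sum of p(x, y) * log(p(x, y) / (p(x) * p(y))) over observed pairs.
    return sum(
        (c / n) * log(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )


# Hypothetical binary samples: 1 = condition present, 0 = absent.
outcome = [0, 0, 1, 1]
informative_feature = [0, 0, 1, 1]    # perfectly aligned with the outcome
uninformative_feature = [0, 1, 0, 1]  # statistically independent of the outcome

mi_informative = mutual_information(informative_feature, outcome)
mi_uninformative = mutual_information(uninformative_feature, outcome)
print(mi_informative, mi_uninformative)
```

A feature that fully determines the outcome attains the outcome's entropy (log 2 nats for a balanced binary target), while an independent feature scores zero; ranking features by this score is the idea behind MI-based feature utility.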
Protein complex prediction based on dense subgraph merging  by Tushar Ranjan Sahoo, Swati Vipsita, Sabyasachi Patra Abstract: Protein complex prediction is an essential task in cell biology for understanding and analyzing protein-protein interaction (PPI) networks, which in turn yields knowledge of many important biological functions. In this article, the authors present a PROtein COmplex Prediction technique based on Dense Subgraph Merging (PROCOP), which considers the inherent organization of proteins and the regions with heavy interactions in PPI networks. The work is intended to isolate the dense regions of the PPI network by a simple neighbourhood search, followed by a merging strategy based on the weighted cluster density. Two or more dense regions are merged iteratively to produce biologically meaningful protein complexes. The predicted protein complexes are evaluated and analyzed using the PPI networks of S. cerevisiae and Homo sapiens. The performance of the proposed algorithm is on par with most of the existing algorithms and outperforms them in terms of evaluation metrics like F-measure and accuracy. Keywords: biological network; protein complex; induced subgraph; subgraph merging; clustering. DOI: 10.1504/IJDMB.2021.10048571
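
The merging step described in the abstract above can be illustrated with a rough sketch. This is not PROCOP itself: the density formula (total edge weight divided by the number of possible node pairs), the greedy merge order, the threshold, and the toy network are all assumptions made for illustration, since the abstract does not specify them.

```python
from itertools import combinations


def weighted_density(nodes, weights):
    """Total edge weight inside `nodes` divided by the number of node pairs."""
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0
    total = sum(weights.get(frozenset(pair), 0.0) for pair in combinations(nodes, 2))
    return 2.0 * total / (n * (n - 1))


def merge_dense_regions(regions, weights, threshold):
    """Greedily merge any two regions whose union stays at or above `threshold`."""
    regions = [set(r) for r in regions]
    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(regions)), 2):
            union = regions[i] | regions[j]
            if weighted_density(union, weights) >= threshold:
                regions[i] = union
                del regions[j]  # safe because i < j
                merged = True
                break  # restart the scan on the updated region list
    return regions


# Hypothetical weighted PPI edges (protein names are placeholders).
edges = {
    ("A", "B"): 1.0, ("A", "C"): 1.0, ("B", "C"): 1.0,
    ("C", "D"): 1.0, ("B", "D"): 0.8, ("A", "D"): 0.9,
    ("E", "F"): 1.0, ("D", "E"): 0.1,
}
weights = {frozenset(pair): w for pair, w in edges.items()}

seeds = [{"A", "B", "C"}, {"B", "C", "D"}, {"E", "F"}]
complexes = merge_dense_regions(seeds, weights, threshold=0.7)
print(complexes)
```

Here the two overlapping seed regions merge because their union remains dense, while the weakly attached pair stays separate; the loop stops once no further merge meets the threshold.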
Integrative Analysis of Molecular Genetic Targets and Pathways in Colorectal Cancer Through Screening Large-Scale Microarray Data  by Elif ONUR, Tuba DENKÇEKEN Abstract: Our aim was to perform comprehensive analyses of mRNAs and miRNAs for the early diagnosis of Colorectal Cancer (CRC) via Principal Component Analysis (PCA)-based Unsupervised Feature Extraction (UFE) and additional bioinformatics approaches. miRNA and mRNA expression profiling studies of CRC were downloaded from GEO. PCA-based UFE was used to identify significant mRNAs and miRNAs. The target genes of the identified miRNAs were determined, and the gene clusters they share with the mRNAs analyzed from GEO were identified. Functional enrichment analysis was conducted with DAVID. A PPI network was established with STRING, and the mRNA-miRNA regulatory network was established with Cytoscape. The identified hub-miRNAs/hub-genes were verified using TCGA. PPI, Cytoscape, and TCGA verification analyses demonstrated that three hub-genes and five hub-miRNAs were significant in CRC. Dysregulation of these may contribute to CRC development, and they may be considered new targets in CRC therapy. Keywords: Bioinformatics; Colorectal cancer; mRNA; miRNA; Microarray; Principal Component Analysis-based Unsupervised Feature Extraction. DOI: 10.1504/IJDMB.2021.10048645