Title: Topological data analysis can extract sub-groups with high incidence rates of Type 2 diabetes

Authors: Hyung Sun Kim; Chahngwoo Yi; Yongkang Kim; Uhnmee Park; Woong Kook; Bermseok Oh; Hyuk Kim; Taesung Park

Addresses: Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, South Korea; The Research Institute of Basic Sciences, College of Natural Sciences, Seoul National University, Seoul, South Korea; Department of Statistics, Seoul National University, Seoul, South Korea; Department of Bioscience and Biotechnology, University of Suwon, Suwon, South Korea; Department of Mathematical Sciences and Institute for Mathematical Data Analytics Research Centre, Seoul National University, Seoul, South Korea; Department of Biochemistry and Molecular Biology, School of Medicine, Kyung Hee University, Seoul, South Korea; Department of Mathematical Sciences, Seoul National University, Seoul, South Korea; Department of Statistics, Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, South Korea

Abstract: Type 2 Diabetes (T2D) is increasing rapidly worldwide, and identifying its genetic contributors is vital. However, the analysis of multiple disease-contributing factors and their combined interactions remains difficult with traditional approaches. Topological Data Analysis (TDA) reveals the shape of a data set and facilitates clustering analysis by determining which components are close to each other. Applied to Single-Nucleotide Polymorphism (SNP) data, TDA can therefore generate a network that reveals the genetic relatedness of individuals, and can derive multiple ordered sub-groups, ranging from one with a low concentration of patients to one with a high concentration. Because it is widely accepted that T2D pathogenesis is affected by multiple genetic factors, we performed TDA on T2D data from the Korea Association REsource (KARE) project, a population-based, genome-wide association study of the Korean adult population. Because the KARE data contain follow-up information on T2D incidence, we compared each individual's T2D status at baseline with that ten years later. For the TDA network-driven sub-groups, ordered by prevalence, we compared the ten-year T2D incidence rate among individuals initially without T2D. We found that the incidence rate increased significantly across the ordered sub-groups and was linearly correlated with prevalence (p-value = 0.006914). Our results demonstrate the usefulness of TDA in identifying both genetic contributors (e.g., SNPs) and their interrelationships in the pathology of complex diseases.
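
The sub-group comparison described in the abstract can be illustrated with a minimal sketch. It assumes each individual has already been assigned to a TDA-derived sub-group and is represented by a hypothetical `(subgroup, t2d_at_baseline, t2d_at_followup)` record. The function names, the toy counts, and the use of a plain Pearson correlation are illustrative assumptions, not the authors' data or their exact statistical test.

```python
from collections import defaultdict
from math import sqrt


def subgroup_rates(records):
    """Return {subgroup: (baseline prevalence, follow-up incidence)}.

    Incidence is computed only among individuals without T2D at baseline.
    """
    total = defaultdict(int)
    baseline_cases = defaultdict(int)
    at_risk = defaultdict(int)
    new_cases = defaultdict(int)
    for group, at_baseline, at_followup in records:
        total[group] += 1
        if at_baseline:
            baseline_cases[group] += 1
        else:
            at_risk[group] += 1
            if at_followup:
                new_cases[group] += 1
    rates = {}
    for group in total:
        prevalence = baseline_cases[group] / total[group]
        incidence = new_cases[group] / at_risk[group] if at_risk[group] else float("nan")
        rates[group] = (prevalence, incidence)
    return rates


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def toy_group(name, n_baseline, n_new, n_healthy):
    """Build hypothetical records for one sub-group (toy data only)."""
    return ([(name, True, True)] * n_baseline
            + [(name, False, True)] * n_new
            + [(name, False, False)] * n_healthy)


# Hypothetical counts, NOT taken from KARE.
records = toy_group("A", 1, 1, 8) + toy_group("B", 3, 2, 5) + toy_group("C", 5, 3, 2)

rates = subgroup_rates(records)
# Order sub-groups by baseline prevalence, as in the abstract.
ordered = sorted(rates, key=lambda g: rates[g][0])
prevalences = [rates[g][0] for g in ordered]
incidences = [rates[g][1] for g in ordered]
r = pearson_r(prevalences, incidences)
print(ordered, round(r, 3))
```

A positive `r` on real data would be consistent with the reported trend. Assessing significance would additionally require a hypothesis test, which this sketch omits.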

Keywords: type 2 diabetes; KARE; Korea association resource; single-nucleotide polymorphism; topological data analysis; network; sub-group analysis.

DOI: 10.1504/IJDMB.2019.099287

International Journal of Data Mining and Bioinformatics, 2019 Vol.22 No.1, pp.44 - 60

Received: 12 Jan 2019
Accepted: 12 Jan 2019

Published online: 24 Apr 2019
