Title: Shape-based retrieval of CNV regions in read coverage data

Authors: Sangkyun Hong; Jeehee Yoon; Dongwan Hong; Unjoo Lee; Baeksop Kim; Sanghyun Park

Addresses: Department of Computer Engineering, Hallym University, 39 Hallymdaehak-gil, Chuncheon-si, Kangwon-do 200 702, Republic of Korea; Department of Computer Engineering, Hallym University, 39 Hallymdaehak-gil, Chuncheon-si, Kangwon-do 200 702, Republic of Korea; Cancer Genomics Branch, National Cancer Center, 323 Ilsan-ro, Ilsandong-gu, Goyang-si, Gyeonggi-do 410 769, Republic of Korea; Department of Electronic Engineering, Hallym University, 39 Hallymdaehak-gil, Chuncheon-si, Kangwon-do 200 702, Republic of Korea; Department of Computer Engineering, Hallym University, 39 Hallymdaehak-gil, Chuncheon-si, Kangwon-do 200 702, Republic of Korea; Department of Computer Science, Yonsei University, 134 Shinchon-dong, Seodaemun-gu, Seoul, 120 749, Republic of Korea

Abstract: This study proposes a novel copy number variation (CNV) detection method, CNV_shape, based on variations in the shape of read coverage data obtained from millions of short reads aligned to a reference sequence. The method applies two transforms, the mean shift transform and the mean slope transform, to extract the shape of a CNV more precisely from real human data, which are susceptible to experimental and biological noise. The mean shift transform produces a preliminary estimate of the CNVs by statistically evaluating moving averages of the read coverage data. The mean slope transform then extracts candidate CNVs by filtering out non-stationary sub-regions from each of the preliminary CNVs estimated by the mean shift transform. Each candidate CNV is merged with its neighbours according to a merging score to yield the final putative CNVs. The merging score is the ratio of the number of positions with non-zero mean shift transform values to the total length of the region spanning two neighbouring candidate CNVs and the interval between them. The method was validated experimentally with simulated data and real human data. Simulated data with coverage ranging from 1× to 10× were generated for various sampling sizes and p-values; five individual human genomes served as the real human data. The results show that relatively small CNVs (> 1 kbp) can be detected from low-coverage (> 1.7×) data. The results also show that CNV_shape achieved performance improvements of 8.18% to 87.90% over conventional methods. These outcomes suggest that the proposed method is very effective in reducing the noise inherent in real data and in detecting CNVs of various sizes and types.
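
The merging score described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the half-open region representation, the function names `merging_score` and `merge_candidates`, the greedy left-to-right merge order, and the threshold of 0.8 are all assumptions made for illustration, since the abstract does not specify them. Only the ratio itself (non-zero mean shift positions divided by the length of the span covering both candidates and the gap between them) follows the text.

```python
from typing import List, Sequence, Tuple

# Assumed representation: a region is a half-open interval [start, end)
# of positions in the read coverage data.
Region = Tuple[int, int]


def merging_score(mean_shift: Sequence[float], left: Region, right: Region) -> float:
    """Ratio of positions with a non-zero mean shift value to the total
    length of the span covering both candidates and the gap between them."""
    start = min(left[0], right[0])
    end = max(left[1], right[1])
    span = mean_shift[start:end]
    if len(span) == 0:
        return 0.0
    nonzero = sum(1 for value in span if value != 0)
    return nonzero / len(span)


def merge_candidates(
    mean_shift: Sequence[float],
    candidates: List[Region],
    threshold: float = 0.8,  # hypothetical cut-off, not taken from the paper
) -> List[Region]:
    """Greedily merge neighbouring candidates whose merging score
    reaches the (assumed) threshold."""
    merged: List[Region] = []
    for region in sorted(candidates):
        if merged and merging_score(mean_shift, merged[-1], region) >= threshold:
            # Extend the previous region to cover this neighbour.
            merged[-1] = (merged[-1][0], max(merged[-1][1], region[1]))
        else:
            merged.append(region)
    return merged


if __name__ == "__main__":
    # Toy mean shift transform output: non-zero where a shift was flagged.
    ms = [0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1]
    print(merge_candidates(ms, [(1, 4), (5, 7), (10, 12)]))
```

In this toy input, the first two candidates are separated by a single zero position, so their score (5 of 6 positions non-zero) passes the assumed threshold and they merge; the third candidate lies beyond a longer gap and stays separate.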

Keywords: CNV detection; copy number variation; next-generation sequencing; shape-based retrieval; bioinformatics; read coverage data; short reads; reference sequences; mean shift transform; mean slope transform; human genomes; simulation; noise reduction.

DOI: 10.1504/IJDMB.2014.060051

International Journal of Data Mining and Bioinformatics, 2014 Vol.9 No.3, pp.254 - 276

Received: 07 Aug 2011
Accepted: 20 Jan 2012

Published online: 21 Oct 2014
