Title: Statistical quality control analysis of high-dimensional omics data

Authors: Yongkang Kim; Gyu-Tae Kim; Min-Seok Kwon; Taesung Park

Addresses: Department of Statistics, Seoul National University, Seoul, Korea ' Department of Statistics, Seoul National University, Seoul, Korea ' Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Korea ' Department of Statistics, Seoul National University, Seoul, Korea; Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Korea

Abstract: Quality control (QC) is an essential pre-processing step for removing unwanted variation from omics data, such as microarray, next-generation sequencing, and mass spectrometry data. QC has become a standard step in identifying important biological 'signatures' of interest. Although several QC analysis tools are now widely used, they usually rely on subjective guidelines to judge the quality of the omics data being assessed. Here, we propose a new, simple QC plot for high-dimensional omics data that identifies samples of poor quality in a more objective manner. The proposed QC plot identifies such samples by comparing between-group and within-group distances across all possible pairs of samples. The distribution of these distances is derived through a permutation procedure, which yields a p-value for each sample. These p-values can then be used as a more objective criterion for determining sample quality. To illustrate the utility of this approach, we applied the proposed QC plot to the MicroArray Quality Control (MAQC) project 1 data.
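
The general idea of turning between/within-group distances into per-sample permutation p-values can be illustrated with a minimal Python sketch. The abstract does not specify the statistic, the distance measure, or the direction of the test, so everything below is an assumption made for illustration and not the authors' method: the function name `sample_qc_pvalues` is hypothetical, Euclidean distance is used, the per-sample score is the mean between-group distance minus the mean within-group distance, and group labels are shuffled to build the null distribution. Under these assumptions a small p-value means a sample sits closer to its own group than expected by chance, so samples with large p-values are the candidates for poor quality.

```python
import numpy as np


def sample_qc_pvalues(X, labels, n_perm=999, seed=0):
    """Per-sample permutation p-values from between/within-group distances.

    Illustrative sketch only (assumed statistic, not the published one).
    X      : (n_samples, n_features) array of omics measurements.
    labels : length-n_samples array of group labels; each group needs
             at least two samples and there must be at least two groups.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    n = X.shape[0]

    # Euclidean distances between all possible pairs of samples.
    # (Builds an n x n x p array, which is fine for a small sketch.)
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))

    def score(lab):
        # Mean distance to other-group samples minus mean distance to
        # same-group samples, computed for every sample at once.
        same = lab[:, None] == lab[None, :]
        np.fill_diagonal(same, False)  # a sample is not its own neighbour
        other = lab[:, None] != lab[None, :]
        within = (D * same).sum(axis=1) / same.sum(axis=1)
        between = (D * other).sum(axis=1) / other.sum(axis=1)
        return between - within

    observed = score(labels)

    # Null distribution: shuffle the group labels and recompute the score.
    exceed = np.zeros(n)
    for _ in range(n_perm):
        exceed += score(rng.permutation(labels)) >= observed - 1e-12

    # Add-one correction keeps every p-value strictly positive.
    return (1 + exceed) / (1 + n_perm)
```

Calling `sample_qc_pvalues(X, labels)` on a samples-by-features matrix returns one p-value per sample; with this assumed scoring, a sample whose profile resembles the other group (for example, a swapped label) receives a p-value near 1.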

Keywords: distance measure; quality control; microarray; omics data.

DOI: 10.1504/IJDMB.2017.087173

International Journal of Data Mining and Bioinformatics, 2017 Vol.18 No.3, pp.210 - 222

Received: 19 Apr 2017
Accepted: 21 Apr 2017

Published online: 06 Oct 2017
