Title: Evaluation of BIC and Cross Validation for model selection on sequence segmentations

Authors: Niina Haiminen, Heikki Mannila

Addresses: HIIT, University of Helsinki and Helsinki University of Technology, P.O. Box 68, FI-00014 University of Helsinki, Finland (both authors)

Abstract: Segmentation is a general data mining technique for summarising and analysing sequential data. Segmentation can be applied, e.g., when studying large-scale genomic structures such as isochores. Choosing the number of segments remains a challenging question. We present extensive experimental studies on two model selection techniques, the Bayesian Information Criterion (BIC) and Cross Validation (CV). We successfully identify segments with different means or variances, and demonstrate the effect of linear trends and outliers, which frequently occur in real data. Results are given for real DNA sequences with respect to changes in their codon, G + C, and bigram frequencies, and for copy-number variation from CGH data.

Keywords: sequence segmentation; model selection; cross validation; BIC; Bayesian information criterion; binary; categorical; genome; likelihood; data mining; bioinformatics; DNA sequences.

DOI: 10.1504/IJDMB.2010.037547

International Journal of Data Mining and Bioinformatics, 2010 Vol.4 No.6, pp.675 - 700

Received: 03 Nov 2007
Accepted: 08 Jun 2008

Published online: 16 Dec 2010
