Title: Semi-supervised clustering algorithm for haplotype assembly problem based on MEC model

Authors: Xin-Shun Xu; Ying-Xin Li

Addresses: School of Computer Science and Technology, Shandong University, Jinan 250101, China; The National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China
Institute of Machine Vision and Machine Intelligence, Beijing Jingwei Textile Machinery New Technology Co., Ltd., No. 8 Yongchang Zhong Road, BDA, Beijing 100176, China

Abstract: Haplotype assembly is the problem of inferring a pair of haplotypes from localized polymorphism data. In this paper, a semi-supervised clustering algorithm, SSK (Semi-Supervised K-means), is proposed for this problem; to our knowledge, it is the first semi-supervised clustering method applied to haplotype assembly. SSK first extracts some positive information, then uses that information to guide k-means in clustering all SNP fragments into two sets, from which the two haplotypes are reconstructed. The performance of SSK is tested on both real and simulated data. The results show that it outperforms several state-of-the-art algorithms under the Minimum Error Correction (MEC) model.
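
To make the pipeline in the abstract concrete, the following is a minimal Python sketch of seeded two-cluster k-means over SNP fragments, scored under MEC. It is not the authors' SSK implementation: the abstract does not say how the positive information is extracted or used, so the sketch simply assumes it arrives as a few fragments with known cluster labels (the hypothetical `seeds` argument). The fragment encoding ('0', '1', '-' for an uncovered site), the function names, and the toy data are all illustrative assumptions.

```python
from typing import Dict, List, Tuple

GAP = "-"  # assumed marker for a SNP site the fragment does not cover


def mismatches(fragment: str, haplotype: str) -> int:
    """Count sites covered by both strings where the alleles differ."""
    return sum(
        1
        for f, h in zip(fragment, haplotype)
        if f != GAP and h != GAP and f != h
    )


def consensus(fragments: List[str], length: int) -> str:
    """Majority vote per site; ties go to '0', uncovered sites stay as gaps."""
    out = []
    for j in range(length):
        ones = sum(1 for f in fragments if f[j] == "1")
        zeros = sum(1 for f in fragments if f[j] == "0")
        if ones == 0 and zeros == 0:
            out.append(GAP)
        else:
            out.append("1" if ones > zeros else "0")
    return "".join(out)


def mec_score(fragments: List[str], h1: str, h2: str) -> int:
    """MEC: total corrections needed so every fragment matches h1 or h2."""
    return sum(min(mismatches(f, h1), mismatches(f, h2)) for f in fragments)


def seeded_two_means(
    fragments: List[str], seeds: Dict[int, int], max_iter: int = 20
) -> Tuple[List[int], str, str]:
    """Two-cluster k-means where seeded fragments keep their given label.

    `seeds` maps a fragment index to cluster 0 or 1 (an assumed stand-in
    for the 'positive information' mentioned in the abstract).
    """
    length = len(fragments[0])
    # Initial centres are the consensus of the seeded fragments only.
    centres = [
        consensus([fragments[i] for i, c in seeds.items() if c == k], length)
        for k in (0, 1)
    ]
    labels: List[int] = []
    for _ in range(max_iter):
        new_labels = []
        for i, f in enumerate(fragments):
            if i in seeds:
                new_labels.append(seeds[i])  # seeds are never reassigned
            else:
                d0 = mismatches(f, centres[0])
                d1 = mismatches(f, centres[1])
                new_labels.append(0 if d0 <= d1 else 1)
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        centres = [
            consensus([f for f, l in zip(fragments, labels) if l == k], length)
            for k in (0, 1)
        ]
    return labels, centres[0], centres[1]


# Toy data: five fragments over six SNP sites; the last has one read error.
fragments = ["0101--", "1010--", "--0101", "--1010", "-1011-"]
labels, h1, h2 = seeded_two_means(fragments, seeds={0: 0, 1: 1})
print(labels, h1, h2, mec_score(fragments, h1, h2))
```

On this toy input the two reconstructed haplotypes are complementary and the single read error accounts for the whole MEC score. Fixing the seeded labels is only one simple way to inject supervision; constraint-based variants are equally plausible readings of the abstract.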

Keywords: semi-supervised clustering; machine learning; k-means; haplotype assembly; bioinformatics; MEC model; minimum error correction; haplotypes.

DOI: 10.1504/IJDMB.2012.049279

International Journal of Data Mining and Bioinformatics, 2012 Vol.6 No.4, pp.429 - 446

Received: 26 Aug 2010
Accepted: 01 Jan 2011

Published online: 17 Dec 2014
