Title: Fast decision tree-based method to index large DNA-protein sequence databases using hybrid distributed-shared memory programming model

Authors: Khalid Mohammad Jaber; Rosni Abdullah; Nur'Aini Abdul Rashid

Addresses: Faculty of Science and Information Technology, Al-Zaytoonah University of Jordan, Amman, Jordan; School of Computer Sciences, Universiti Sains Malaysia, Penang, Malaysia; School of Computer Sciences, Universiti Sains Malaysia, Penang, Malaysia

Abstract: Biological databases have grown significantly in recent years, along with the number of users and the rate of queries, and some databases have now reached terabyte size. There is therefore an increasing need to access these databases as quickly as possible. In this paper, the decision tree indexing model (PDTIM) was parallelised using a hybrid of distributed and shared memory on a resident database, with horizontal and vertical growth through the Message Passing Interface (MPI) and POSIX threads (PThread), to reduce the index building time. The PDTIM was implemented using 1, 2, 4 and 5 processors on 1, 2, 3 and 4 threads respectively. The results show that the hybrid technique improved the speedup compared with a sequential version. The results suggest that the proposed PDTIM is appropriate for large data sets in terms of index building time.

Keywords: indexing approaches; searching algorithms; decision tree; DNA; proteins; bioinformatics; data mining; DNA-protein sequence databases; shared memory programming; sequences; distributed memory; index building; large data sets.

DOI: 10.1504/IJBRA.2014.060765

International Journal of Bioinformatics Research and Applications, 2014 Vol.10 No.3, pp.321 - 340

Received: 08 May 2012
Accepted: 27 Jul 2012

Published online: 24 Oct 2014
