Title: MAIL: mining sequential patterns with wildcards

Authors: Fei Xie; Xindong Wu; Xuegang Hu; Jun Gao; Dan Guo; Yulian Fei; Ertian Hua

Addresses: College of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China; Department of Computer Science and Technology, Hefei Normal University, Hefei 230601, China ' College of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China; Department of Computer Science, University of Vermont, Burlington, VT 05405, USA ' College of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China ' College of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China ' College of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China ' College of Computer Science and Information Engineering, Zhejiang Gongshang University, Hangzhou, China ' College of Computer Science and Information Engineering, Zhejiang Gongshang University, Hangzhou, China

Abstract: Sequential pattern mining is an important research task in many domains, such as biological science. In this paper, we study the problem of mining frequent patterns with wildcards from sequences, where the user can flexibly specify the gap constraints. Given a subject sequence, a minimum support threshold and a gap constraint, we aim to find the frequent patterns whose support in the sequence is no less than the given threshold. We design an efficient mining algorithm, MAIL. Two pattern growth strategies are proposed to improve completeness and time efficiency: one is based on candidate occurrence pruning, and the other uses an occurrence graph. A random data generator is designed to test completeness on artificial data. Experiments on DNA sequences show that MAIL mines four times more patterns than one of its peers and runs, on average, six times faster than another. We also give a concrete example in which our algorithm is applied to DNA sequences to find interesting patterns.
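To make the problem statement concrete, the following is a minimal sketch of how the support of a single pattern might be counted under a gap constraint. It is not the MAIL algorithm and implements neither of its two pattern growth strategies; it is a naive greedy leftmost search. The function names, the `[min_gap, max_gap]` parameterisation of the gap constraint, and the reading of the one-off condition (each sequence position is used by at most one occurrence) are assumptions made for illustration.

```python
from typing import List, Optional, Set


def find_occurrence(seq: str, pattern: str, min_gap: int, max_gap: int,
                    used: Set[int], start: int) -> Optional[List[int]]:
    """Search depth-first for one occurrence of `pattern` beginning at `start`.

    Between consecutive pattern characters, between `min_gap` and `max_gap`
    sequence positions (the wildcards) may be skipped. Positions in `used`
    are unavailable (assumed one-off condition).
    """
    def extend(pos: int, k: int, acc: List[int]) -> Optional[List[int]]:
        if k == len(pattern):
            return acc
        for gap in range(min_gap, max_gap + 1):
            nxt = pos + gap + 1
            if nxt >= len(seq):
                break
            if nxt not in used and seq[nxt] == pattern[k]:
                found = extend(nxt, k + 1, acc + [nxt])
                if found is not None:
                    return found
        return None

    if start in used or seq[start] != pattern[0]:
        return None
    return extend(start, 1, [start])


def greedy_one_off_occurrences(seq: str, pattern: str,
                               min_gap: int, max_gap: int) -> List[List[int]]:
    """Collect occurrences greedily from left to right.

    This is a heuristic: the number of occurrences it returns is a lower
    bound on the best achievable support, not necessarily the maximum.
    """
    if not pattern:
        return []
    used: Set[int] = set()
    occurrences: List[List[int]] = []
    for start in range(len(seq)):
        occ = find_occurrence(seq, pattern, min_gap, max_gap, used, start)
        if occ is not None:
            used.update(occ)  # these positions cannot be reused
            occurrences.append(occ)
    return occurrences


def is_frequent(seq: str, pattern: str, min_gap: int, max_gap: int,
                min_support: int) -> bool:
    """A pattern counts as frequent if its support reaches `min_support`."""
    found = greedy_one_off_occurrences(seq, pattern, min_gap, max_gap)
    return len(found) >= min_support
```

For example, with the sequence `ACACAC`, the pattern `AC` and a gap constraint of 0 to 1 wildcards, this sketch finds the three occurrences at positions (0, 1), (2, 3) and (4, 5), so the pattern is frequent for any support threshold up to 3.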

Keywords: data mining; sequential patterns; pattern mining; bioinformatics; wildcards; one-off condition.

DOI: 10.1504/IJDMB.2013.054690

International Journal of Data Mining and Bioinformatics, 2013 Vol.8 No.1, pp.1 - 23

Received: 02 Oct 2010
Accepted: 29 Apr 2011

Published online: 20 Oct 2014
