


Title: Unstructured data mining: use case for CouchDB


Authors: Richard K. Lomotey; Ralph Deters


Department of Computer Science, University of Saskatchewan, Saskatoon, Canada


Abstract: 'Big data' has changed the status quo on digital content creation, storage and management. While data hoarding over the years has followed a structured storage approach, today's digital content is largely unstructured, which creates the need to adopt different storage techniques. NoSQL database systems have therefore been proposed to accommodate most of the content being generated today. One class of NoSQL database that has received significant enterprise adoption is document-append style storage. The problem, however, is that research and tools that can aid data mining tasks on such NoSQL databases are generally lacking. Even though document-append style storages expose data as web services and over URLs/URIs, building a corresponding data mining tool deviates from the underlying techniques governing web crawlers. Also, existing data mining tools that have been designed for schema-based storages (e.g., RDBMS) are misfits. Hence, our goal in this work is to design a data analytics tool that enables knowledge discovery through information retrieval (i.e., terms) from document-append style storage. Three algorithms for term extraction are tested: the inference-based apriori with a Bayesian component, the hidden Markov model, and the Bernoulli process. Overall, the paper reports the accuracy and speed of each algorithm.


Keywords: data mining; NoSQL databases; Bayesian rule; unstructured data; term extraction; inference based apriori; hidden Markov model; HMM; Bernoulli process; big data; data analytics; knowledge discovery; information retrieval.


DOI: 10.1504/IJBDI.2015.070597


Int. J. of Big Data Intelligence, 2015 Vol.2, No.3, pp.168 - 182


Submission date: 21 Apr 2014
Date of acceptance: 08 Nov 2014
Available online: 11 Jul 2015


