
 

 

Title: Unstructured data mining: use case for CouchDB

 

Authors: Richard K. Lomotey; Ralph Deters

 

Addresses:
Department of Computer Science, University of Saskatchewan, Saskatoon, Canada

 

Abstract: 'Big data' has changed the status quo of digital content creation, storage and management. While data hoarding over the years has followed a structured storage approach, today's digital content is largely unstructured, which creates the need for different storage techniques. NoSQL database systems have therefore been proposed to accommodate most of the content being generated today. One class of NoSQL database that has received significant enterprise adoption is document-append style storage. The problem, however, is that research and tools that can aid data mining tasks on such NoSQL databases are generally lacking. Even though document-append style storage exposes data as web services accessible over URLs/URIs, building a corresponding data mining tool deviates from the underlying techniques governing web crawlers. Also, existing data mining tools designed for schema-based storage (e.g., RDBMS) are misfits. Hence, our goal in this work is to design a data analytics tool that enables knowledge discovery through the retrieval of information (i.e., terms) from document-append style storage. Three algorithms for term extraction are tested: the inference-based apriori with a Bayesian component, the hidden Markov model, and the Bernoulli process. Overall, the paper reports the accuracy and speed of each algorithm.

 

Keywords: data mining; NoSQL databases; Bayesian rule; unstructured data; term extraction; inference based apriori; hidden Markov model; HMM; Bernoulli process; big data; data analytics; knowledge discovery; information retrieval.

 

DOI: 10.1504/IJBDI.2015.070597

 

Int. J. of Big Data Intelligence, 2015 Vol.2, No.3, pp.168 - 182

 

Submission date: 21 Apr 2014
Date of acceptance: 08 Nov 2014
Available online: 11 Jul 2015

 

 
