Authors: Richard K. Lomotey; Ralph Deters
Addresses: Department of Computer Science, University of Saskatchewan, Saskatoon, Canada; Department of Computer Science, University of Saskatchewan, Saskatoon, Canada
Abstract: 'Big data' has changed the status quo on digital content creation, storage and management. While data hoarding over the years has followed the structured-style storage approach, the recent nature of digital content, which is widely unstructured, creates the need to adopt different storage techniques. NoSQL database systems have therefore been proposed to accommodate most of the content being generated today. One such class of NoSQL database that has received significant enterprise adoption is document-append style storage. The problem, however, is that research and tools that can aid data mining tasks on such NoSQL databases are generally lacking. Even though document-append style storages expose data as web services and over URL/URI, building a corresponding data mining tool deviates from the underlying techniques governing web crawlers. Also, existing data mining tools that have been designed for schema-based storages (e.g., RDBMS) are misfits. Hence, our goal in this work is to design a data analytics tool that enables knowledge discovery through information retrieval (i.e., term extraction) from document-append style storage. Three algorithms for term extraction are tested: the inference-based apriori with a Bayesian component, the hidden Markov model, and the Bernoulli process. Overall, the paper evaluates the accuracy and speed of each algorithm.
Keywords: data mining; NoSQL databases; Bayesian rule; unstructured data; term extraction; inference based apriori; hidden Markov model; HMM; Bernoulli process; big data; data analytics; knowledge discovery; information retrieval.
International Journal of Big Data Intelligence, 2015, Vol. 2, No. 3, pp. 168-182
Available online: 11 Jul 2015