Title: Comparing SQL and NoSQL approaches for clustering over big data

Authors: Filipe Assunção; Manuel Levi; Pedro Furtado

Addresses: Department of Informatics Engineering, University of Coimbra, Pólo II – Pinhal de Marrocos, 3030-290 Coimbra, Portugal; Department of Informatics Engineering, University of Coimbra, Pólo II – Pinhal de Marrocos, 3030-290 Coimbra, Portugal; Department of Informatics Engineering, University of Coimbra, Pólo II – Pinhal de Marrocos, 3030-290 Coimbra, Portugal

Abstract: Data mining is the process of discovering patterns in large datasets. With the exponential growth of available information, new machine learning, statistical and other analytics techniques must be developed so that such analysis can run fast enough to be useful. In this study, techniques such as cluster analysis are applied to generated data to perform customer segmentation, and system performance is evaluated by measuring processing time. The data is generated using the Star Schema Benchmark (SSB). Our main goal is to find a scalable solution for running data mining over a decision support benchmark. Four systems are tested: single-node MySQL, MySQL Cluster, Apache Mahout and R. By running MySQL Cluster and Mahout, each distributed across four nodes, the paper compares the performance of k-means executed in parallel. MySQL and R allow this kind of execution to be compared against methods running on a single machine, on both relational and non-relational systems.
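
The abstract does not show how k-means or the timing measurement was implemented, so the following is only a minimal single-machine sketch of the general idea: run k-means over customer records and measure the processing time. Everything specific here is an assumption made for illustration, not taken from the paper: the function names `kmeans` and `make_customers`, the two customer attributes (revenue and order count), the synthetic data (which is not SSB output), and the choice of k = 3.

```python
import random
import time


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids: k random input points
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance).
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # empty cluster: keep old centroid
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return centroids, assignment


def make_customers(n, seed=1):
    """Hypothetical customers as (revenue, order count) around three made-up centres."""
    rng = random.Random(seed)
    centres = [(100.0, 5.0), (500.0, 20.0), (900.0, 60.0)]
    return [
        (rng.gauss(cx, 30.0), rng.gauss(cy, 2.0))
        for cx, cy in (centres[i % 3] for i in range(n))
    ]


if __name__ == "__main__":
    customers = make_customers(3000)
    start = time.perf_counter()  # processing time is the measured quantity
    centroids, labels = kmeans(customers, k=3)
    elapsed = time.perf_counter() - start
    print(f"k-means over {len(customers)} customers took {elapsed:.3f} s")
    for c in sorted(centroids):
        print(f"segment centre: revenue={c[0]:.1f}, orders={c[1]:.1f}")
```

The sketch corresponds only loosely to the single-machine runs mentioned in the abstract. A distributed system would typically spread the assignment step across nodes and combine partial sums for the update step, which is not attempted here.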

Keywords: data mining; Star Schema; benchmarking; scalability evaluation; MySQL; Apache Mahout; SQL; NoSQL; clustering; big data; cluster analysis; customer segmentation.

DOI: 10.1504/IJBPIM.2015.073657

International Journal of Business Process Integration and Management, 2015 Vol.7 No.4, pp.335 - 344

Published online: 15 Dec 2015
