Title: Dataset comparison workflows
Authors: Marko Robnik-Šikonja
Addresses: Faculty of Computer and Information Science, University of Ljubljana, Večna pot 113, 1001 Ljubljana, Slovenia
Abstract: Univariate statistical comparisons are mostly insufficient for assessing the similarity of two datasets from a data science perspective. We present a methodology that estimates the similarity of datasets from the point of view of data mining tasks. For example, it provides information relevant to deciding whether a new or related dataset can be used with an existing supervised or unsupervised model. We propose several workflows that cover (a) statistical properties of generated data, (b) distance-based structural similarity, and (c) predictive similarity of two datasets. We evaluate the proposed workflows on random splits of several datasets and by comparing original datasets with datasets produced by a generator of semi-artificial data. The results show that the proposed workflows can reveal relevant similarity information about datasets needed in many data mining scenarios.
Keywords: data analytics; data mining; machine learning; data similarity; clustering; classification.
International Journal of Data Science, 2018, Vol. 3, No. 2, pp. 126-145
Available online: 27 May 2018