Title: A data cleaning method for heterogeneous attribute fusion and record linkage
Authors: Hui-Juan Zhu; Tong-Hai Jiang; Yi Wang; Li Cheng; Bo Ma; Fan Zhao
Addresses: All authors: The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, No. 40-1, Beijing South Road, Xin Shi Zone, Urumqi 830011, China; Xinjiang Laboratory of Minority Speech and Language Information Processing, Urumqi 830011, China. Hui-Juan Zhu is additionally affiliated with the University of Chinese Academy of Sciences, Beijing 100049, China.
Abstract: In the big data era, massive heterogeneous data are generated from various data sources, and cleaning dirty data is critical for reliable data analysis. Existing rule-based methods are generally developed for a single-data-source environment, and issues such as data standardisation and duplicate detection for attributes of different data types have not been fully studied. To address these challenges, we first introduce a method based on dynamically configurable rules that integrates data detection, modification and transformation. Second, we propose a type-based blocking and a varying window size selection mechanism based on the classic sorted-neighbourhood method (SNM). We present a reference implementation of our method in a real-life data fusion system and validate its effectiveness and efficiency using recall and precision metrics. Experimental results indicate that our method is suitable for scenarios involving multiple data sources with heterogeneous attribute properties.
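For readers unfamiliar with the sorted-neighbourhood idea that the abstract builds on, the following is a minimal sketch of the classic fixed-window variant only, together with the precision and recall metrics mentioned above. It does not reproduce the paper's type-based blocking or varying window size selection, whose details are not given here. The function names, the sample records, the lowercase sorting key, the `difflib`-based similarity and the 0.8 threshold are all illustrative assumptions.

```python
from difflib import SequenceMatcher


def sorted_neighbourhood(records, key, window=3, threshold=0.8):
    """Classic fixed-window sorted-neighbourhood method (illustrative sketch).

    records: list of strings; key: function producing the sorting key.
    Returns a set of (i, j) index pairs with i < j judged to be duplicates.
    """
    # Sort record indices by their sorting key.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    matches = set()
    for pos, i in enumerate(order):
        # Compare each record only with the next (window - 1) records
        # in sorted order, instead of with every other record.
        for j in order[pos + 1 : pos + window]:
            # Assumed similarity measure: difflib's ratio in [0, 1].
            sim = SequenceMatcher(None, records[i], records[j]).ratio()
            if sim >= threshold:
                matches.add((min(i, j), max(i, j)))
    return matches


def precision_recall(found, truth):
    """Pairwise precision and recall of detected duplicate pairs."""
    true_positives = len(found & truth)
    precision = true_positives / len(found) if found else 1.0
    recall = true_positives / len(truth) if truth else 1.0
    return precision, recall


# Hypothetical sample data, not taken from the paper.
records = [
    "john smith, urumqi",
    "li wei, beijing",
    "jon smith, urumqi",
    "li wei, bejing",
    "anna chen, shanghai",
]
truth = {(0, 2), (1, 3)}

found = sorted_neighbourhood(records, key=lambda r: r.lower(), window=3)
precision, recall = precision_recall(found, truth)
print(sorted(found), precision, recall)
```

A fixed window trades recall for speed: a window that is too small misses duplicates whose sorting keys fall far apart, while a large window approaches all-pairs comparison, which is the tension a varying window size is meant to ease.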
Keywords: big data; varying window; data cleaning; record linkage; record similarity; SNM; type-based blocking.
DOI: 10.1504/IJCSE.2019.101341
International Journal of Computational Science and Engineering, 2019 Vol.19 No.3, pp.311 - 324
Received: 03 Aug 2016
Accepted: 03 Feb 2017
Published online: 05 Aug 2019