Title: Ranking the blocking keys for data de-duplication in information systems
Authors: Asif Sohail; Syed Waqar Jaffry
Addresses: Department of Information Technology, University of the Punjab, Allama Iqbal Campus, Shahrahe Quaid-e-Azam, 54000, Lahore, Punjab, Pakistan; Department of Information Technology, University of the Punjab, Allama Iqbal Campus, Shahrahe Quaid-e-Azam, 54000, Lahore, Punjab, Pakistan
Abstract: Data de-duplication is an essential activity in data integration and data cleansing. It identifies and removes the disguised duplicates in a dataset. Blocking is an established technique for reducing the inherent quadratic complexity of de-duplication: it groups potentially matching records into the same block on the basis of a blocking key, so that only records within a block are compared. The results of blocking vary considerably depending on the blocking key employed. Hence, selecting an appropriate blocking key is essential for maximising the efficacy and efficiency of blocking. The proposed technique ranks the attributes of a dataset with respect to their usability as a blocking key. We introduce a novel correlation measure, called R-score, for computing the correlation between the gold rankings and the computed rankings of the blocking keys. The proposed technique is evaluated using benchmark datasets, and the experimental results confirm that it outperforms the existing techniques.
Keywords: data integration; data cleansing; blocking; candidate record pairs; rank correlation; promising attributes; reduction ratio; recall.
DOI: 10.1504/IJBIS.2025.146600
International Journal of Business Information Systems, 2025 Vol.49 No.2, pp.180 - 198
Received: 19 Jul 2021
Accepted: 23 Oct 2021
Published online: 06 Jun 2025
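
Illustration: The abstract's point that blocking results depend on the chosen key can be made concrete with a small sketch. The Python below is not the authors' ranking technique and does not implement R-score. It only applies standard blocking (exact match on one attribute) to a hypothetical five-record dataset and computes the two keyword measures, reduction ratio and recall, using their usual definitions. The records, attribute names, true-match set, and function names are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def block_pairs(records, key):
    """Group record ids by the value of `key`; return candidate pairs within blocks."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[rec[key]].append(rec_id)
    pairs = set()
    for ids in blocks.values():
        # Only records sharing a block are compared with each other.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def reduction_ratio(candidate_pairs, n_records):
    """Fraction of the n*(n-1)/2 possible comparisons that blocking avoids."""
    total_pairs = n_records * (n_records - 1) // 2
    return 1 - len(candidate_pairs) / total_pairs


def recall(candidate_pairs, true_matches):
    """Fraction of true duplicate pairs that survive blocking."""
    return len(candidate_pairs & true_matches) / len(true_matches)


# Hypothetical toy data: records 1 and 2 are duplicates, as are 3 and 4
# (record 4 has a typo in the surname).
records = {
    1: {"surname": "smith", "city": "lahore"},
    2: {"surname": "smith", "city": "lahore"},
    3: {"surname": "khan", "city": "karachi"},
    4: {"surname": "kahn", "city": "karachi"},
    5: {"surname": "ali", "city": "lahore"},
}
true_matches = {(1, 2), (3, 4)}

for key in ("surname", "city"):
    pairs = block_pairs(records, key)
    rr = reduction_ratio(pairs, len(records))
    rc = recall(pairs, true_matches)
    print(f"{key}: reduction ratio={rr:.2f}, recall={rc:.2f}")
```

In this toy setting, blocking on `surname` removes more comparisons but misses the pair with the misspelled surname, while blocking on `city` keeps both true matches at the cost of more candidate pairs. That trade-off is the reason a ranking of candidate blocking keys is useful.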