
Title: Ranking the blocking keys for data de-duplication in information systems

Authors: Asif Sohail; Syed Waqar Jaffry

Addresses: Department of Information Technology, University of the Punjab, Allama Iqbal Campus, Shahrahe Quaid-e-Azam, 54000, Lahore, Punjab, Pakistan; Department of Information Technology, University of the Punjab, Allama Iqbal Campus, Shahrahe Quaid-e-Azam, 54000, Lahore, Punjab, Pakistan

Abstract: Data de-duplication is an essential activity in data integration and data cleansing. It identifies and removes the disguised duplicates in a dataset. Blocking is an established technique for reducing the inherent quadratic complexity of de-duplication. Blocking gathers the potential matching records in the same block on the basis of a blocking key. The results of blocking fluctuate considerably when different blocking keys are employed. Hence, it becomes extremely important to select an appropriate blocking key for maximising the efficacy and efficiency of blocking. The proposed technique ranks the attributes of a dataset with respect to their usability as a blocking key. We have introduced a novel correlation measure called R-score for computing correlation between gold rankings and computed rankings of the blocking keys. The proposed technique is evaluated using benchmark datasets and the experimental results confirm that the proposed technique outperforms the existing techniques.
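As an illustration of why the choice of blocking key matters, the following is a minimal sketch of key-based blocking together with the standard definitions of reduction ratio and recall (pairs completeness). It is not the authors' ranking technique and does not compute the R-score; the toy records, the gold-standard pairs, and the function names `candidate_pairs` and `evaluate` are hypothetical and chosen only for this example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records: (record_id, surname, city). Not taken from the article.
records = [
    (1, "smith", "lahore"),
    (2, "smyth", "lahore"),
    (3, "smith", "karachi"),
    (4, "jones", "lahore"),
]

# Hypothetical gold standard: pairs of record ids known to be duplicates.
true_matches = {(1, 2), (1, 3)}


def candidate_pairs(records, key_index):
    """Group records by the chosen attribute and pair up records within each block."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[key_index]].append(rec[0])
    pairs = set()
    for ids in blocks.values():
        # Only records sharing a block are ever compared.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def evaluate(records, key_index, true_matches):
    """Return (reduction ratio, recall) for one candidate blocking key."""
    n = len(records)
    total_pairs = n * (n - 1) // 2  # size of the full quadratic comparison space
    pairs = candidate_pairs(records, key_index)
    reduction_ratio = 1 - len(pairs) / total_pairs
    recall = len(pairs & true_matches) / len(true_matches)
    return reduction_ratio, recall


for name, idx in [("surname", 1), ("city", 2)]:
    rr, rec = evaluate(records, idx, true_matches)
    print(f"{name}: reduction ratio={rr:.2f}, recall={rec:.2f}")
```

On this toy data both keys recover the same share of true duplicates, but blocking on surname prunes far more of the comparison space than blocking on city. This is the kind of trade-off that a ranking of candidate blocking keys would need to capture.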

Keywords: data integration; data cleansing; blocking; candidate record pairs; rank correlation; promising attributes; reduction ratio; recall.

DOI: 10.1504/IJBIS.2025.146600

International Journal of Business Information Systems, 2025 Vol.49 No.2, pp.180 - 198

Received: 19 Jul 2021
Accepted: 23 Oct 2021
Published online: 06 Jun 2025
