Title: GDC-a-CGI: efficient algorithms for dynamic graph data cleaning and indexing

Authors: D.K. Santhosh Kumar; Demian Antony D'Mello

Addresses: Department of Computer Science and Engineering, Canara Engineering College, Mangaluru, Visvesvaraya Technological University, Belagavi, Karnataka, India; Department of Computer Science and Engineering, Canara Engineering College, Mangaluru, Visvesvaraya Technological University, Belagavi, Karnataka, India

Abstract: In the era of big data, graph data collection and analytics have grown rapidly in numerous fields. Data quality and data access are the two decisive factors in the performance (accuracy and efficiency) of a graph data analytics model. The authors propose a graph data cleaning (GDC) technique, which removes erroneous, messy data and thereby improves data quality. GDC is a dynamic cleaning technique that lets the user update rules and expressions at runtime and supports rule inheritance across domains. In addition to cleaning, GDC verifies and validates the graph data. To address data access, the authors present a cache-based graph indexing (CGI) technique, which is built using the 'CSS-tree' tree structure on the Hadoop distributed framework. CGI is a scalable index construction technique that builds an efficient index for extensive graph datasets. Experiments with different graph datasets show that the proposed GDC and CGI techniques outperform the state of the art.

Keywords: data mining; graph data cleaning; GDC; graph data indexing; big data; Hadoop; graph data analytics; dynamic-cleaning; cache-based indexing.

DOI: 10.1504/IJCSE.2021.119979

International Journal of Computational Science and Engineering, 2021 Vol.24 No.6, pp.598 - 609

Received: 21 Sep 2020
Accepted: 09 Feb 2021

Published online: 04 Jan 2022
