Title: Context-aware automated quality assessment of textual data

Authors: Goutam Mylavarapu; Kannan Ashwin Viswanathan; Johnson Thomas

Addresses: Department of Computer Science and Information Systems, Murray State University, Murray, USA; Department of Computer Science, Oklahoma State University, Stillwater, Oklahoma, USA; Department of Computer Science, Oklahoma State University, Stillwater, Oklahoma, USA

Abstract: Data analysis is a crucial process in the field of data science that extracts useful information from any form of data. With the rapid growth of technology, more and more unstructured data, such as text and images, are being produced in large amounts. Apart from the analytical techniques used, the quality of the data plays a prominent role in accurate analysis. Data quality deteriorates due to poor maintenance and mediocre data generation strategies employed by amateur users. This problem escalates with the advent of big data. In this paper, we propose a quality assessment model for the textual form of unstructured data (TDQA). The context of data plays an important role in determining the quality of the data. Therefore, we automate the process of context extraction in textual data using natural language processing to identify data errors and assess quality.

Keywords: automated data quality assessment; textual data; context-aware; data context; sentiment analysis; lexicon; Doc2Vec; data accuracy; data consistency.

DOI: 10.1504/IJBIDM.2023.130588

International Journal of Business Intelligence and Data Mining, 2023, Vol. 22, No. 4, pp. 451-469

Received: 24 Oct 2021
Accepted: 07 Dec 2021

Published online: 01 May 2023
