Title: Context-aware automated quality assessment of textual data
Authors: Goutam Mylavarapu; Kannan Ashwin Viswanathan; Johnson Thomas
Addresses: Department of Computer Science and Information Systems, Murray State University, Murray, USA; Department of Computer Science, Oklahoma State University, Stillwater, Oklahoma, USA; Department of Computer Science, Oklahoma State University, Stillwater, Oklahoma, USA
Abstract: Data analysis is a crucial process in data science that extracts useful information from any form of data. With the rapid growth of technology, unstructured data such as text and images are being produced in ever larger amounts. Apart from the analytical techniques used, the quality of the data plays a prominent role in accurate analysis. Data quality degrades because of poor maintenance and mediocre data generation strategies employed by amateur users, and this problem escalates with the advent of big data. In this paper, we propose a quality assessment model for the textual form of unstructured data (TDQA). Because the context of data plays an important role in determining its quality, we automate context extraction from textual data using natural language processing in order to identify data errors and assess quality.
Keywords: automated data quality assessment; textual data; context-aware; data context; sentiment analysis; lexicon; Doc2Vec; data accuracy; data consistency.
DOI: 10.1504/IJBIDM.2023.130588
International Journal of Business Intelligence and Data Mining, 2023 Vol.22 No.4, pp.451 - 469
Received: 24 Oct 2021
Accepted: 07 Dec 2021
Published online: 01 May 2023