Article Abstract

Title: Ranking Wikipedia article's data quality by learning dimension distributions
  Author: Jingyu Han, Kejia Chen
  Address: College of Computer Science and Technology, Nanjing University of Posts and Telecommunications, Nanjing 210003, China; College of Computer Science and Technology, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
  Journal: International Journal of Information Quality, 2014, Vol. 3, No. 3, pp. 207-227
  Abstract: Wikipedia is the largest free user-generated knowledge repository, and its data quality has attracted great attention in recent years. Automatic assessment of the data quality of Wikipedia articles is a pressing concern. We observe that each Wikipedia quality class exhibits specific characteristics along different first-class quality dimensions, including accuracy, completeness, consistency and minimality. We propose to extract quality dimension values from an article's content and editing history using a dynamic Bayesian network (DBN) and information extraction techniques. Next, we employ multivariate Gaussian distributions to model the quality dimension distributions for each quality class, and combine multiple trained classifiers to predict an article's quality class, which distinguishes different quality classes effectively and robustly. Experiments demonstrate that our approach performs well.
  Keywords: data quality; Wikipedia; quality dimensions; multivariate Gaussian distribution; ensemble learning; article ranking; dynamic Bayesian network; DBN; information extraction.
  DOI: 10.1504/IJIQ.2014.064056
  Submission date: 14 Apr 2013
  Date of acceptance: 05 Aug 2013
  Available online: 31 Jul 2014
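
The abstract's modelling step, one multivariate Gaussian over the quality dimensions per quality class, can be illustrated with a minimal sketch. This is not the authors' implementation: the DBN-based extraction and the classifier ensemble are omitted, the class labels "high" and "low" are hypothetical, and the four-dimensional scores (standing in for accuracy, completeness, consistency and minimality) are synthetic data generated only for the demonstration.

```python
import numpy as np


def fit_class_gaussians(X, y, ridge=1e-6):
    """Fit one multivariate Gaussian (mean, covariance) per class label."""
    models = {}
    for label in np.unique(y):
        rows = X[y == label]
        mean = rows.mean(axis=0)
        # A small ridge keeps the covariance invertible (an assumption of this sketch).
        cov = np.cov(rows, rowvar=False) + ridge * np.eye(X.shape[1])
        models[str(label)] = (mean, cov)
    return models


def log_density(x, mean, cov):
    """Log of the multivariate normal density at x."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (len(x) * np.log(2 * np.pi) + logdet + maha)


def predict_class(x, models):
    """Return the label whose Gaussian assigns x the highest log-density."""
    return max(models, key=lambda label: log_density(x, *models[label]))


# Synthetic dimension scores for two hypothetical quality classes.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.8, 0.05, size=(50, 4)),  # hypothetical "high" class
    rng.normal(0.3, 0.05, size=(50, 4)),  # hypothetical "low" class
])
y = np.array(["high"] * 50 + ["low"] * 50)

models = fit_class_gaussians(X, y)
print(predict_class(np.array([0.8, 0.8, 0.8, 0.8]), models))
```

Picking the class with the highest log-density amounts to a maximum-likelihood decision with equal class priors; the paper's combination of multiple trained classifiers would sit on top of a step like this.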