Title: Analysis of structured data on Wikipedia

Authors: Johny Moreira; Everaldo Costa Neto; Luciano Barbosa

Addresses: Centro de Informática, Universidade Federal de Pernambuco, Recife, Pernambuco, Brazil; Centro de Informática, Universidade Federal de Pernambuco, Recife, Pernambuco, Brazil; Centro de Informática, Universidade Federal de Pernambuco, Recife, Pernambuco, Brazil

Abstract: Wikipedia is widely used both as a source of information and as a basis for applications built on its content. It consists primarily of unstructured text about entities, but articles can also contain infoboxes, which are sets of structured attributes describing these entities. Owing to their structured nature, infoboxes have proven useful in many applications. In this work, we perform an extensive data analysis of different aspects of Wikipedia's structured data (infoboxes, templates and categories), aiming to uncover data issues and limitations and to guide researchers in the use of these data. We devise a framework to process, index and query the Wikipedia data, and use it to analyse different scenarios such as the popularity of infoboxes, their size distribution and their usage across categories. Some of our findings are: only 54% of Wikipedia articles have infoboxes; infoboxes contain a considerable amount of geographical and temporal information; and infoboxes are highly heterogeneous within the same category.

Keywords: metadata; knowledge management; structured data; data analysis; Wikipedia; infoboxes; indexing strategy; categories; templates; entities.

DOI: 10.1504/IJMSO.2021.117108

International Journal of Metadata, Semantics and Ontologies, 2021 Vol.15 No.1, pp.71-86

Received: 06 Jan 2021
Accepted: 05 May 2021

Published online: 10 Aug 2021
