International Journal of Metadata, Semantics and Ontologies (10 papers in press)
Applying cross-dataset identity reasoning for producing URI embeddings over hundreds of RDF datasets
by Michalis Mountantonakis, Yannis Tzitzikas
Abstract: There is a proliferation of approaches that exploit RDF datasets for creating URI embeddings, i.e., embeddings produced by taking as input URI sequences (instead of simple words or phrases), since they can be of primary importance for several tasks (e.g., machine learning tasks). However, existing techniques exploit only a single dataset or a few datasets for creating URI embeddings. For this reason, we introduce a prototype, called LODVec, which exploits LODsyndesis to enable the creation of URI embeddings from hundreds of datasets simultaneously, after enriching them with the results of cross-dataset identity reasoning. With LODVec, it is feasible to produce URI sequences by following paths of any length (according to a given configuration), and the produced URI sequences are used as input for creating embeddings through the word2vec model. We provide comparative results evaluating the gain of using several datasets for creating URI embeddings, for the tasks of classification and regression, and for finding the entities most similar to a given one.
Keywords: embeddings; cross-dataset identity reasoning; RDF; machine learning; data integration; Linked Data; finding similar entities; classification; regression.
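The path-following step described in the abstract can be sketched as random walks over a toy RDF-style graph; the graph, URIs, and parameters below are hypothetical illustrations, not LODVec's actual data, and the resulting URI sequences would then be fed to a word2vec implementation (e.g., gensim) in place of word tokens.

```python
import random

# Toy RDF-style adjacency: subject URI -> list of (predicate, object) pairs.
# All URIs here are illustrative, not from LODsyndesis.
graph = {
    "ex:Aristotle": [("ex:bornIn", "ex:Stagira"), ("ex:field", "ex:Philosophy")],
    "ex:Stagira": [("ex:locatedIn", "ex:Greece")],
    "ex:Philosophy": [],
    "ex:Greece": [],
}

def uri_walks(graph, start, max_len, n_walks, seed=0):
    """Produce URI sequences by following paths of up to max_len hops."""
    rng = random.Random(seed)
    walks = []
    for _ in range(n_walks):
        node, walk = start, [start]
        for _ in range(max_len):
            edges = graph.get(node, [])
            if not edges:
                break
            pred, obj = rng.choice(edges)
            walk += [pred, obj]  # interleave predicates and objects
            node = obj
        walks.append(walk)
    return walks

# Each walk is a URI sequence usable as a word2vec "sentence".
walks = uri_walks(graph, "ex:Aristotle", max_len=2, n_walks=3)
```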
An ontology-driven perspective on the emotional human reactions to social events
by Danilo Cavaliere, Sabrina Senatore
Abstract: Social media has become a fulcrum for sharing information about everyday-life events: people, companies, and organisations express opinions about new products, political and social situations, football matches, and concerts. Recognising feelings and reactions to events from social networks requires dealing with large amounts of data streams, especially for tweets, to investigate the main sentiments and opinions that justify some reactions. This paper presents an emotion-based classification model to extract feelings from tweets related to an event or a trend, described by a hashtag, and builds an Emotional Concept Ontology to study human reactions to events in context. From the tweet analysis, terms expressing a feeling are selected to build a topological space of emotion-based concepts. The extracted concepts serve to train a multi-class SVM classifier, which performs soft classification aimed at identifying the emotional reactions towards events. An ontology then arranges the classification results, enriched with additional DBpedia concepts. SPARQL queries on the final knowledge base provide specific insights to explain people's reactions towards events. Practical case studies and test results demonstrate the applicability and potential of the approach.
Keywords: sentiment analysis; simplicial complex; SVM; ontology; OWL; SPARQL.
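The multi-class soft classification mentioned in the abstract can be illustrated with scikit-learn's SVC in probability mode; the 2-D features, emotion labels, and data points below are invented for the sketch and are not the paper's feature space or dataset.

```python
from sklearn.svm import SVC

# Hypothetical 2-D emotion features (e.g., scores from emotion lexica),
# five illustrative samples per class.
X = [[0.90, 0.10], [0.85, 0.20], [0.80, 0.10], [0.95, 0.05], [0.90, 0.20],
     [0.10, 0.90], [0.20, 0.85], [0.10, 0.80], [0.05, 0.95], [0.20, 0.90],
     [0.50, 0.50], [0.45, 0.55], [0.55, 0.45], [0.50, 0.60], [0.60, 0.50]]
y = ["joy"] * 5 + ["anger"] * 5 + ["fear"] * 5

clf = SVC(kernel="rbf", probability=True, random_state=0).fit(X, y)

# Soft classification: a probability per emotion class rather than a single
# hard label, so a tweet can carry mixed emotional reactions.
probs = clf.predict_proba([[0.85, 0.15]])[0]
soft = dict(zip(clf.classes_, probs))
```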
Persons, GLAM institutes and collections: an analysis of entity linking based on the COURAGE registry
by Faraj Ghazal, András Micsik
Abstract: Connecting encyclopaedic knowledge graphs by finding and linking nodes that denote the same entity is an important task. Various available automated linking solutions cannot be applied in situations where data is sparse or private, or where a high degree of correctness is expected. Wikidata has grown into a leading linking hub, collecting entity identifiers from various registries and repositories. To get a picture of connectability, we analysed the linking methods and results between the COURAGE registry and Wikidata, VIAF, ISNI, and ULAN. This paper describes our investigations and solutions while mapping and enriching entities in Wikidata. Each possible mapped pair of entities received a numeric reliability score. Using this score-based matching method, we tried to minimise the need for human decisions; hence, we introduced the term human decision window for the mappings where neither acceptance nor refusal can be made automatically and safely. Furthermore, Wikidata has been enriched with related COURAGE entities and bi-directional links between mapped persons, organisations, collections, and collection items. We also describe our findings on the coverage and quality of mappings among the above-mentioned authority databases.
Keywords: linked data; cultural heritage; link discovery; entity linking; authority data; metadata quality; Wikidata; VIAF; ISNI; ULAN.
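The score-based matching with a human decision window can be sketched as a simple three-way threshold rule; the thresholds, entity identifiers, and scores below are illustrative assumptions, not the values used in the paper.

```python
def match_decision(score, accept_at=0.85, reject_at=0.40):
    """Decide on a candidate entity pair from its reliability score.

    Scores at or above accept_at are accepted automatically, scores at or
    below reject_at are refused automatically, and everything in between
    falls into the 'human decision window' and is routed to a curator.
    Thresholds here are illustrative, not the paper's.
    """
    if score >= accept_at:
        return "accept"
    if score <= reject_at:
        return "refuse"
    return "human-decision"

# Hypothetical COURAGE-to-Wikidata candidate pairs with reliability scores.
pairs = {("courage:p1", "wd:Q100"): 0.92,
         ("courage:p2", "wd:Q200"): 0.55,
         ("courage:p3", "wd:Q300"): 0.10}
decisions = {pair: match_decision(s) for pair, s in pairs.items()}
```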
Documenting flooding areas calculation: a PROV approach
by Monica De Martino, Alfonso Quarati, Sergio Rosim, Laércio Massaru Namikawa
Abstract: Flooding events related to waste-lake dam ruptures are among the most threatening natural disasters in Brazil. They must be managed in advance by public institutions through the use of adequate hydrographic and environmental information. Although the Open Data paradigm offers an opportunity to share hydrographic datasets, their actual reuse is still low because of poor metadata quality. Our previous work highlighted a lack of detailed provenance information. This paper presents an Open Data approach to improve the release of hydrographic datasets. We discuss a methodology, based on W3C recommendations, for documenting the provenance of hydrographic datasets, considering the workflow activities related to the study of flood areas caused by waste-lake dam breakdowns. We provide an illustrative example that documents, through the W3C PROV metadata model, the generation of flooding-area maps by integrating land-use classification from Sentinel images with hydrographic datasets produced by the Brazilian National Institute for Space Research.
Keywords: hydrography datasets; Open Data; reusability; provenance workflow; metadata; W3C PROV.
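The kind of PROV documentation the abstract describes can be sketched as a handful of PROV-O statements; the triples below use real PROV-O property names but entirely hypothetical entity URIs, and stdlib tuples stand in for a proper RDF library.

```python
# Minimal PROV-O statements (as plain tuples) documenting a hypothetical
# flooding-map generation activity; the ex: URIs are illustrative only.
PROV = "http://www.w3.org/ns/prov#"

triples = [
    ("ex:floodMap", f"{PROV}wasGeneratedBy", "ex:floodAreaCalc"),
    ("ex:floodAreaCalc", f"{PROV}used", "ex:sentinelLandUse"),
    ("ex:floodAreaCalc", f"{PROV}used", "ex:hydroDataset"),
    ("ex:floodAreaCalc", f"{PROV}wasAssociatedWith", "ex:INPE"),
    ("ex:floodMap", f"{PROV}wasDerivedFrom", "ex:hydroDataset"),
]

def provenance_of(entity, triples):
    """Collect the PROV relations recorded for a given entity."""
    return [(p.rsplit("#", 1)[1], o) for s, p, o in triples if s == entity]

# A consumer can now ask how the flooding map came to be.
lineage = provenance_of("ex:floodMap", triples)
```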
Children's art museum collections as Linked Open Data
by Konstantinos Kotis, Sotiris Angelis, Maria Chondrogianni, Efstathia Marini
Abstract: Museums currently provide web access to their collections. It is argued that it is beneficial for institutions to provide their datasets as Linked Open Data (LOD) in order to achieve cross-referencing, interlinking, and integration with other datasets in the LOD cloud. In this paper, we present the Greek Children's Art Museum (GCAM) linked dataset, along with dataset and vocabulary statistics, as well as lessons learned from the process of transforming the collections to HTML-embedded structured data (using RDFa and microdata encodings) with two main models: the Europeana Data Model and the Schema.org model. The dataset consists of three cultural collections of 121 child artworks (paintings), including detailed descriptions and interlinks to external datasets (DBpedia, Wikidata, WikiArt, MoMA, and others). In addition to presenting the GCAM dataset and the lessons learned from the experimentation of non-ICT experts with the LOD paradigm, the paper introduces a new metric for measuring LD dataset quality in terms of links to and from other datasets, namely the D-h-index. This metric goes beyond dataset quality metrics of linkage dynamics, i.e., the average, min, and max number of links from a dataset's resources to external ones.
Keywords: Linked Open Data; HTML-embedded RDF; RDFa; microdata; museum; artwork.
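An h-index-style linkage metric of the kind the abstract names can be sketched with the standard h-index construction; the exact D-h-index definition is the paper's, so the function and the link counts below are an assumption for illustration.

```python
def d_h_index(link_counts):
    """Sketch of an h-index-style linkage metric: the largest h such that
    the dataset links to/from at least h external datasets with at least
    h links each. This follows the classic h-index construction; the
    paper's D-h-index definition may differ in detail."""
    counts = sorted(link_counts, reverse=True)
    h = 0
    for i, c in enumerate(counts, start=1):
        if c >= i:
            h = i
        else:
            break
    return h

# Hypothetical per-dataset link counts from a museum dataset to five
# external datasets (e.g., DBpedia, Wikidata, WikiArt, MoMA, ...).
links = [25, 8, 5, 3, 1]
dh = d_h_index(links)  # 3 external datasets have at least 3 links each
```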
A fuzzy logic and ontology based approach for improving the CV and job offer matching in recruitment process
by Amine Habous, El Habib Nfaoui
Abstract: The recruitment process is a critical activity for every organisation: it finds the appropriate candidate for a job offer and its employer's work criteria. The competitive nature of the recruitment environment makes hiring new employees very hard for companies, owing to the high number of CVs (resumes) and profiles to process, personal job interests, and the customised requirements and precise skills requested by employers. Time becomes crucial for recruiters' choices; consequently, it might impact the quality of the selection process. In this paper, we propose a retrieval system for automating the recruitment process. It is designed based on natural language processing, machine learning, and fuzzy logic to handle the matching between the job description and the CVs. It also considers the proficiency level of technical skills and offers an estimation of the overall CV/job-offer expertise level. In that way, it overcomes the under-qualification and over-qualification issues in ICT recruitment. Experimental results on ground-truth data from a recruitment company demonstrate that our proposal provides effective results.
Keywords: text mining; natural language processing; feature extraction; metadata weighting; ICT recruitment; fuzzy logic; machine learning.
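The fuzzy handling of skill proficiency can be sketched with simple membership functions over years of experience; the proficiency levels, breakpoints, and formulas below are illustrative assumptions, not the paper's actual fuzzy model.

```python
def proficiency_membership(years):
    """Fuzzy memberships for a skill's proficiency level based on years
    of experience. Levels and breakpoints are illustrative only."""
    def tri(x, a, b, c):
        # Triangular membership peaking at b, zero outside (a, c).
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)
    return {
        "junior": max(0.0, 1.0 - years / 3) if years < 3 else 0.0,
        "mid": tri(years, 1, 4, 7),
        "senior": min(1.0, max(0.0, (years - 5) / 3)),
    }

# A candidate with 4 years of experience is fully "mid", with no
# residual membership in "junior" or "senior" under these breakpoints.
m = proficiency_membership(4)
```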
Analysis of structured data on Wikipedia
by Johny Moreira, Everaldo Costa Neto, Luciano Barbosa
Abstract: Wikipedia has been widely used for information consumption or for implementing solutions using its content. It contains primarily unstructured text about entities, but it can also contain infoboxes, which are structured attributes describing these entities. Owing to their structural nature, infoboxes have been shown useful for many applications. In this work, we perform an extensive data analysis on different aspects of Wikipedia structured data: infoboxes, templates and categories, aiming to uncover data issues and limitations, and guide researchers in the use of these structured data. We devise a framework to process, index and query the Wikipedia data, using it to analyse different scenarios such as the popularity of infoboxes, their size distribution and usage across categories. Some of our findings are: only 54% of Wikipedia articles have infoboxes; there is a considerable amount of geographical and temporal information in infoboxes; and there is great heterogeneity of infoboxes across the same category.
Keywords: metadata; knowledge management; structured data; data analysis; Wikipedia; infoboxes; indexing strategy; categories; templates; entities.
Systematic design and implementation of a semantic assistance system for aero-engine design and manufacturing
by Sonika Gogineni, Jörg Brünnhäußer, Jonas Nickel, Heiko Witte, Rainer Stark, Kai Lindow, Erik Konietzko
Abstract: Data in organisations is often spread across various information and communication technology (ICT) systems, leading to redundancies, lack of overview, and time lost searching for information during daily activities. This paper focuses on addressing these problems by using semantic technologies to design and develop an assistance system on top of existing infrastructure. The focus is on the aero-engine industry, where complex data systems are common and a lot of unstructured data and information becomes available during production. A systematic approach is followed to design the system, which integrates data silos by using a common ontology. This paper highlights the problems being addressed, the approach selected to develop the system, and the implementation of two use cases to support user activities in an aerospace company.
Keywords: knowledge management; assistance system; semantic integration; machine learning; ontologies; industrial implementation; manufacturing; product data management; heterogeneous data; interoperability.
Keyphrase extraction from single textual documents based on semantically defined background knowledge and co-occurrence graphs
by Mauro Dalle Lucca Tosi, Julio Cesar Dos Reis
Abstract: Keyphrase extraction is a fundamental and challenging task that aims to automatically extract a set of keyphrases from textual documents. Keyphrases are short phrases composed of one or more terms that best represent a textual document and its main topics; they assist publishers in indexing documents and readers in identifying the most relevant ones. In this article, we extend our research on C-Rank, an unsupervised approach that automatically extracts keyphrases from single documents. C-Rank uses a concept-linking approach that links concepts shared between single documents and an external background knowledge base. Our approach uses those concepts as candidate keyphrases, which are modelled in a co-occurrence graph. On this basis, keyphrases are extracted relying on heuristics and on their centrality in the graph. We advance our study of C-Rank by evaluating it with different concept-linking approaches: Babelfy and DBpedia Spotlight. The evaluation was performed on five gold-standard datasets composed of distinct types of data: academic articles, academic abstracts, and news articles. Our findings indicate that C-Rank achieves state-of-the-art results in extracting keyphrases from scientific documents, as shown by an experimental comparison with existing unsupervised approaches.
Keywords: keyphrase extraction; complex networks; semantic annotation.
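The co-occurrence-graph step can be sketched in a few lines; the concept annotations below are invented stand-ins for a concept linker's output (e.g., Babelfy or DBpedia Spotlight), and plain degree centrality is used here as a simple stand-in for C-Rank's actual ranking heuristics.

```python
from itertools import combinations
from collections import Counter

# Hypothetical concept annotations per sentence, as a concept linker
# might produce them for a document.
sentences = [
    ["keyphrase_extraction", "graph", "centrality"],
    ["keyphrase_extraction", "background_knowledge"],
    ["graph", "centrality", "background_knowledge"],
]

# Build a co-occurrence graph: concepts appearing in the same sentence
# share an edge, weighted by how often they co-occur.
edges = Counter()
for concepts in sentences:
    for a, b in combinations(sorted(set(concepts)), 2):
        edges[(a, b)] += 1

# Weighted degree centrality as a simple ranking proxy.
degree = Counter()
for (a, b), w in edges.items():
    degree[a] += w
    degree[b] += w

# The most central concepts become the top keyphrase candidates.
candidates = [c for c, _ in degree.most_common(3)]
```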
Links between research artifacts: use cases for digital libraries
by Fidan Limani, Atif Latif, Klaus Tochtermann
Abstract: The generation and availability of links between scholarly resources continue to increase. Initiatives to support them, both in terms of a (standard) representation model and of accompanying infrastructure for collection and exchange, make this emerging artifact interesting to explore. Its role in a more transparent, reproducible, and, ultimately, richer research context makes it a valuable proposition for information infrastructures such as digital libraries. In this paper, we assess the potential of link artifacts for such an environment. We rely on a public link-collection subset of >4.8 M links, which we represent following the Linked Data approach, resulting in a collection of >163.8 M RDF triples. The use cases incorporated in this study demonstrate the usefulness of this artifact. We claim that the adoption of links extends the scholarly data collection and advances the services that a digital library offers to its users.
Keywords: research artifact links; digital library; semantic web.