International Journal of Metadata, Semantics and Ontologies (15 papers in press)
Semantic architectures and dashboard creation processes within the data and analytics framework
by Michele Petito, Francesca Fallucchi, Ernesto William De Luca
Abstract: It is almost twenty years since Tim Berners-Lee, creator of the web and the semantic web, described his idea of the web (Berners-Lee, Hendler, & Lassila, 2001) as an environment in which programs were able to understand the meaning of words and make autonomous decisions. The open data tools currently on the market neither exploit the semantic web nor provide tools for data analysis and visualisation. Most of them are simple open data portals that display a data catalogue, often without even fulfilling the lowest level of the famous five-star model. Current solutions (commercial or free) provide users neither with easy access to data nor with tools for analysing and displaying them. The Data and Analytics Framework (DAF), a project run by the Italian government, was launched at the end of 2017 with the aim of becoming the single platform for solving the problems related to the management of semantic data in public administration (PA). DAF extracts knowledge from the immense amount of data owned by the State. It favours the spread of linked open data (LOD) within PA thanks to the integration of open source business intelligence products and the network of controlled ontologies and vocabularies (OntoPiA). The research outlined in this paper illustrates some of the platform's competing solutions and introduces the five-step process for creating a DAF dashboard, as well as the related data story. The case study created by the authors concerns tourism in Sardinia (a region of Italy) and represents one of the few demonstrations of a real case being tested in DAF.
Keywords: big data; data and analytics framework; data visualisation; dashboard; open data; linked open data.
HSLD: a hybrid similarity measure for linked data resources
by Gabriela Silva, Frederico Durão, Paulo Roberto De Souza
Abstract: The web of data is a set of deeply linked resources that can be instantly read and understood by both humans and machines. A vast amount of RDF data has been published in freely accessible and interconnected datasets, creating the so-called Linked Open Data (LOD) cloud. Such a huge amount of available data, along with the development of semantic web standards, has opened up opportunities for the development of semantic applications. However, most semantic recommender systems use only the link structure between resources to calculate their similarity. In this paper we propose HSLD, a hybrid similarity measure for linked data that exploits the information present in RDF literals in addition to the links between resources. We evaluate the proposed approach in the context of a LOD-based recommender system using data from DBpedia. Experimental results indicate that HSLD increases the precision of the recommendations in comparison with purely link-based baseline methods.
Keywords: recommender systems; linked data; lexical similarity; semantic similarity.
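The abstract does not give the HSLD formula; as an illustration only, a hybrid measure of this kind can be sketched as a weighted combination of a link-based score and a literal-based lexical score. The Jaccard measures, the `alpha` weight and the toy DBpedia-style data below are assumptions for the sketch, not the paper's actual method:

```python
def link_similarity(links_a, links_b):
    """Jaccard similarity over the sets of resources each item links to."""
    if not links_a and not links_b:
        return 0.0
    return len(links_a & links_b) / len(links_a | links_b)

def literal_similarity(text_a, text_b):
    """Jaccard similarity over the token sets of RDF literals (e.g. abstracts)."""
    tokens_a, tokens_b = set(text_a.lower().split()), set(text_b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def hybrid_similarity(res_a, res_b, alpha=0.5):
    """Weighted combination of link-based and literal-based similarity."""
    return (alpha * link_similarity(res_a["links"], res_b["links"])
            + (1 - alpha) * literal_similarity(res_a["abstract"], res_b["abstract"]))

# Toy DBpedia-style resources (hypothetical values).
movie_a = {"links": {"dbr:Ridley_Scott", "dbr:Science_fiction"},
           "abstract": "a crew encounters a hostile alien lifeform"}
movie_b = {"links": {"dbr:James_Cameron", "dbr:Science_fiction"},
           "abstract": "a crew returns to fight the alien lifeform"}
score = hybrid_similarity(movie_a, movie_b)
```

In a recommender setting, items would then be ranked for a user by this combined score rather than by link overlap alone.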
EngMeta: metadata for computational engineering
by Björn Schembera, Dorothea Iglezakis
Abstract: Computational engineering generates knowledge through the analysis and interpretation of research data, which is produced by computer simulation. Supercomputers produce huge amounts of research data. To address a research question, many simulations are run over a large parameter space. Handling this data and keeping an overview therefore becomes a challenge. Data documentation is mostly handled by file and folder names in inflexible file systems, making it almost impossible for data to be findable, accessible and interoperable, and hence reusable. To enable and improve the structured documentation of research data from computational engineering, we developed the metadata model EngMeta. We built this model by incorporating existing standards for general descriptive and technical information and adding metadata fields for discipline-specific information, such as the components and parameters of the simulated target system, and for information about the research process, such as the methods, software and computational environment used. In practical use, EngMeta functions as the descriptive core of an institutional repository. In order to reduce the burden of description on scientists, we have developed an approach for automatically extracting metadata from the output and log files of computer simulations. Through a qualitative analysis, we show that EngMeta fulfils the criteria of a good metadata model. Through a quantitative survey, we show that it meets the needs of engineering scientists. The overall outcome is the metadata model EngMeta in XML/XSD, ready for use in computational engineering. This metadata product is backed by automated metadata extraction and a repository, making discipline-specific research data management possible in computational engineering.
Keywords: research data management; metadata; big data; high performance computing; simulation; computational engineering; metadata extraction; repository.
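The automated extraction described in the abstract can be illustrated as pattern-matching over simulation output. The log format, field names and regular expressions below are invented for this sketch and are not EngMeta's actual extractor:

```python
import re

# Hypothetical log excerpt; real simulation codes each have their own format.
LOG = """\
Solver: molecular-dynamics v2.1
timestep = 0.002 ps
temperature = 300 K
n_particles = 125000
wall time: 86400 s on 512 cores
"""

# One pattern per metadata field; value-plus-unit fields capture two groups.
PATTERNS = {
    "timestep":    r"timestep\s*=\s*([\d.]+)\s*(\S+)",
    "temperature": r"temperature\s*=\s*([\d.]+)\s*(\S+)",
    "particles":   r"n_particles\s*=\s*(\d+)",
}

def extract_metadata(log_text):
    """Pull discipline-specific fields out of a simulation log."""
    meta = {}
    for field, pattern in PATTERNS.items():
        match = re.search(pattern, log_text)
        if match:
            groups = match.groups()
            meta[field] = groups if len(groups) > 1 else groups[0]
    return meta

meta = extract_metadata(LOG)
```

The extracted dictionary would then be serialised into the model's XML/XSD fields instead of living only in file names.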
SWRL reasoning on ontology-based clinical Dengue knowledge base
by Runumi Devi, Deepti Mehrotra, Hajer Baazaoui Zghal
Abstract: Dengue is a widespread mosquito-borne viral illness that may lead to death if not treated promptly and properly. The aim of this study is to propose a semantic rule-based modelling and reasoning approach directed towards formalising the Dengue disease definition in conjunction with operational definitions (semantics) that support clinical and diagnostic reasoning. The operational definitions are incorporated using the Semantic Web Rule Language (SWRL) as logical rules that enhance the expressive capability of the knowledge base. A knowledge base has been designed by formalising diagnostic physical and clinical symptoms, the kinds of clinical tests and the pharmacological treatment applied. The designed knowledge base is extended with the International Classification of Diseases (ICD) ontology to associate Dengue fever with its ICD code. The knowledge base can be reasoned upon for diagnostic classification: it can discover Dengue symptoms and predict the likelihood that a patient suffers from the disease, in addition to offering interoperability. One hundred and fifty-three real patient cases were successfully classified by the reasoner against the operational definitions encoded as SWRL rules.
Keywords: SWRL; ICD-10; DENV; description logic; Dengue fever.
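The paper encodes its operational definitions as SWRL rules over an OWL knowledge base; as a loose illustration of what one such rule expresses, here is a plain-Python analogue. The symptoms, thresholds and rule logic are hypothetical, not the paper's actual definitions or any clinical case definition:

```python
def probable_dengue(patient):
    """Plain-Python analogue of a hypothetical operational definition:
    fever plus at least two accompanying signs suggests probable Dengue.
    Illustrative only; not a clinical rule."""
    warning_signs = ("headache", "retro_orbital_pain", "myalgia",
                     "rash", "low_platelet_count")
    sign_count = sum(1 for sign in warning_signs if patient.get(sign))
    return patient.get("fever_celsius", 0) >= 38.0 and sign_count >= 2

# Hypothetical patient case expressed as attribute/value pairs.
case = {"fever_celsius": 39.2, "headache": True, "myalgia": True}
```

In the paper's setting, the same logic lives in the ontology as a SWRL rule, so a description-logic reasoner can classify patient individuals directly in the knowledge base.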
Layout logical labelling and finding the semantic relationships between citing and cited paper content
by Sergey Parinov, Amir Bakarov, Daniil Vodolazcky
Abstract: Currently, large datasets of in-text citations and citation contexts are becoming available for research and for the development of tools. Using the topic model method to analyse these data, one can characterise thematic relationships between citation contexts from the citing and the cited paper content. However, to build relevant topic models and to compare them accurately for papers linked by citation relationships, we have to know the semantic labels of the PDF papers' layout, such as section titles, paragraph boundaries, etc. Recent achievements in the conversion of papers from PDF into a rich attributed JSON format allow us to develop new approaches to the logical labelling of a paper's layout. This paper presents a reusable method and open source software for the logical labelling of PDF papers, which achieved good recognition quality for layout elements over a set of research papers. Using these semantic labels, we made a precise comparison of topic models built for citing and cited papers and found some level of similarity between them.
Keywords: Cirtec project; in-text citation; citation contexts; research paper layout recognition; logical labelling; hierarchical topic models.
Document-based RDF storage method for parallel evaluation of basic graph pattern queries
by Eleftherios Kalogeros, Manolis Gergatsoulis, Matthew Damigos
Abstract: In this paper, we investigate the problem of efficiently evaluating Basic Graph Pattern (BGP) SPARQL queries over large amounts of RDF data. We propose an effective data model for storing RDF data in a document database with a maximum replication factor of 2 (i.e., in the worst-case scenario, the data graph is doubled in storage size). The proposed storage model is used to evaluate SPARQL queries efficiently, in a distributed manner. Each query is decomposed into a set of generalised star queries: queries that allow both subject-object and object-subject edges from a specific node, called the central node. The proposed data model ensures that no join operations over multiple datasets are required to evaluate generalised star queries. The results of evaluating the generalised star subqueries of a query Q are then combined to compute the answers of Q over the RDF data. The proposed approach has been implemented using MongoDB and Apache Spark.
Keywords: semantic web; parallel processing; query processing; resource description framework; big data applications.
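The decomposition into generalised star queries can be sketched as grouping a BGP's triple patterns around central nodes, accepting both outgoing and incoming edges. The toy patterns and the pre-chosen central nodes below are assumptions; the paper's algorithm selects the central nodes itself:

```python
def decompose_into_stars(bgp, central_nodes):
    """Group the triple patterns of a BGP into generalised star subqueries:
    each star collects every pattern whose subject OR object is its central
    node (illustrative decomposition only)."""
    stars = {node: [] for node in central_nodes}
    for s, p, o in bgp:
        if s in stars:          # outgoing (subject-object) edge
            stars[s].append((s, p, o))
        elif o in stars:        # incoming (object-subject) edge
            stars[o].append((s, p, o))
    return stars

# Toy BGP: films with their directors and actors, plus awards won by directors.
bgp = [("?film", "dbo:director", "?dir"),
       ("?film", "dbo:starring", "?actor"),
       ("?award", "dbo:winner", "?dir")]
stars = decompose_into_stars(bgp, ["?film", "?dir"])
```

Because each star touches a single central node, each subquery can be answered inside one document of the store without cross-dataset joins; only the stars' results need to be joined afterwards.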
Citation content/context data as a source for research cooperation analysis
by Sergey Parinov, Victoria Antonova
Abstract: Currently, many research information systems can provide, for a selected author, two groups of citation relationships: a) the outgoing citations linking the author's papers with the papers he/she cites; and b) the ingoing citations created by others and citing the author's papers. Using these citation relationships, one can build three groups of papers: (1) the papers of the selected author; (2) the papers cited by the author; and (3) the papers citing the author. The authors of papers from these three groups can be presented as a fragment of a research cooperation network, because they use/cite each other's research outputs. Their papers' full texts, and especially the contexts of their in-text citations, contain some information about the character of this research cooperation. We present a concept of research cooperation based on publications, and the current results of the Cirtec project for building research cooperation characteristics. This work is based on the processing of citation content/context data. The results include an online service for authors to monitor the citation content data extractions, and three types of indicators/parameters: co-citation statistics, the spatial distribution of citations over the papers' bodies, and topic models for citation contexts.
Keywords: citation content/context; Cirtec project; research cooperation; scientific knowledge transformation; RePEc; Socionet.
Extending the GLOBDEF framework with support for semantic enhancement of various data formats
by Maria Nisheva-Pavlova, Asen Alexandrov
Abstract: Semantic enhancement links sections of data files with well-described concepts from some knowledge domain. This allows for further automated reasoning about the data and can be especially useful for extracting value from Big Data, where the information is unstructured. Most of the available enhancement tools, however, focus on specific enhancement needs and data types. Using multiple knowledge domains or varied data types often requires detailed prior knowledge about the specific tool and how it is configured. The GLOBDEF framework, presented in an earlier work, aims to process large amounts of data and enhance them automatically. The framework is designed to leverage a variety of external enhancement tools (each with its own knowledge domain) and places no limitations on the format of the enhanced data. To achieve this, the framework chains the enhancement tools into dynamically created pipelines, in which each enhancer is called to do its job independently of the others. In this paper, we present our efforts to extend the framework with support for the enhancement of image data. We demonstrate how the framework behaves on a mixed dataset of texts and images. Finally, we showcase the capabilities of dynamic pipelines by explaining how an image can be semantically enhanced by the simple automated combination of an object recogniser and a text-based automated enhancer.
Keywords: linked open data; semantic annotation; semantic enhancement; metadata; ontology; unstructured data; automatic annotation; enrichment pipeline; semantic interoperability.
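The dynamic-pipeline idea, in which an object recogniser turns an image into text that a text-based enhancer can then annotate, can be sketched as follows. The enhancer interface, the type-based chaining rule and the stand-in outputs are assumptions for the sketch, not GLOBDEF's actual API:

```python
class Enhancer:
    handles = None  # the data type this enhancer accepts

    def enhance(self, item):
        raise NotImplementedError

class ObjectRecogniser(Enhancer):
    handles = "image"

    def enhance(self, item):
        # Stand-in for a real recogniser: emit text labels for the image.
        item["labels"] = ["dog", "park"]          # hypothetical output
        item["text"] = " ".join(item["labels"])
        item["type"] = "text"                     # the item is now text-like
        return item

class TextAnnotator(Enhancer):
    handles = "text"

    def enhance(self, item):
        # Stand-in for a knowledge-domain annotator linking terms to concepts.
        item["concepts"] = [f"kb:{word}" for word in item["text"].split()]
        return item

def build_pipeline(item, enhancers):
    """Dynamically chain enhancers: keep applying any not-yet-used enhancer
    that can handle the item's current type until none applies."""
    applied = []
    while True:
        nxt = next((e for e in enhancers
                    if e.handles == item["type"] and e not in applied), None)
        if nxt is None:
            return item
        item = nxt.enhance(item)
        applied.append(nxt)

result = build_pipeline({"type": "image", "bytes": b"..."},
                        [ObjectRecogniser(), TextAnnotator()])
```

The point of the chaining rule is that neither enhancer knows about the other: the recogniser merely changes the item's effective type, which makes the text annotator applicable next.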
Modelling weightlifting 'training-diet-competition' cycle following a modular and scalable approach
by Piyaporn Tumnark, Paulo Cardoso, Jorge Cabral, Filipe Conceicao
Abstract: Studies in weightlifting have been characterised by unclear results and a paucity of information, mainly owing to the lack of information sharing between athletes, coaches, biomechanists, physiologists and nutritionists. These experts' knowledge is not captured, classified or integrated into an information system for decision-making. An ontology-driven knowledge model for Olympic weightlifting was developed to leverage a better understanding of the weightlifting domain as a whole, bringing together the related knowledge domains of training methodology, weightlifting biomechanics and diet, while modelling the synergy among them. It unifies terminology, semantics and concepts among sport scientists, coaches, nutritionists and athletes to partially obviate the recognised limitations and inconsistencies, leading to the provision of superior coaching and a research environment that promotes better understanding and more conclusive results. The ontology-assisted weightlifting knowledge base consists of 110 classes, 50 object properties, 92 data properties and 167 inheritance relationships, in a total of 1761 axioms, alongside 23 SWRL rules.
Keywords: ontology; nutrition; weightlifting; biomechanics; semantics; reasoning.
Service traceability in SOA-based software systems: a traceability network add-in for BPAOntoSOA framework
by Rana Yousef, Sarah Imtera
Abstract: BPAOntoSOA is a generic framework that generates a service model from a given organisational business process architecture. Service Oriented Architecture (SOA) traceability is essential to facilitate change management and support the reusability of an SOA, and it has wide application in the development and maintenance process. Such a traceability network is not available for the BPAOntoSOA framework. This paper introduces an ontology-based traceability network for the BPAOntoSOA framework that semantically generates trace links between services and business process architectural elements in both forward and backward directions. The proposed traceability approach was evaluated using the postgraduate faculty information system case study in order to assess the framework's behaviour in general. As a continued evaluation effort, a group of parameters was selected to create an evaluation criterion, which was used to compare the BPAOntoSOA trace solution with one of the most closely related traceability frameworks, the STraS traceability framework.
Keywords: traceability; service oriented architecture; ontology.
Special Issue on: Towards an Enriched, Linked, Open and Filtered Metadata Model
Intermediary XML schemas: constraint, templating and interoperability in complex environments
by Richard Gartner
Abstract: This article introduces the methodology of intermediary schemas for complex metadata environments. Metadata in instances conforming to these schemas is not generally intended for dissemination but must usually be transformed by XSLT transformations to generate instances conforming to the referent schemas to which they mediate. The methodology is designed to enhance the interoperability of complex metadata within XML architectures. It incorporates three subsidiary methods: project-specific schemas that act as constrained mediators to over-complex or over-flexible referents (Method 1), templates or conceptual maps from which instances may be generated (Method 2), and serialised maps of instances conforming to their referent schemas (Method 3). The three methods are detailed, and their applications to current research in digital ecosystems, archival description, and digital asset management and preservation are examined. A possible synthesis of the three is also proposed in order to enable the methodology to operate within a single schema, the Metadata Encoding and Transmission Standard (METS).
Keywords: XML; intermediary XML schemas; metadata; interoperability; digital asset management; digital preservation; METS; constraint; templating.
Unique challenges facing linked data implementation for National Educational Television
by Chris Pierce
Abstract: Implementing linked data involves a costly process of converting metadata to an exchange format substantially different from traditional library 'records-based' exchange. To achieve full implementation, it is necessary to navigate a complex process of data modelling, crosswalking and publishing. This paper documents the transition of a dataset of National Educational Television (NET) collection records to a 'data-based' exchange environment of linked data by discussing the challenges faced during the conversion. These challenges include silos such as the Library's media asset management system, the Merged Audiovisual Information System (MAVIS); aligning PBCore with the bibliographic linked data model BIBFRAME; modelling differences in works between archival moving image cataloguing and other domains using Entertainment Identifier Registry IDs (EIDR IDs); and possible alignments with EBUCore (the European Broadcasting Union linked data model) to address gaps between PBCore and BIBFRAME.
Keywords: linked data; MARC21; PBCore; BIBFRAME 2.0; National Educational Television; EIDR; EBUCore; crosswalking; data modelling.
Exploring the utility of metadata record graphs and network analysis for metadata quality evaluation and augmentation
by Mark Phillips, Oksana Zavalina, Hannah Tarver
Abstract: Our study explores the possible uses and effectiveness of network analysis, including metadata record graphs, as a method of evaluating collections of metadata records at scale. This paper presents the results of an experiment applying these methods to records in a university digital library system, as well as to two sub-collections of different sizes and composition. The data includes count- and value-based statistics as well as network metrics for every Dublin Core element in each of the metadata sets. We discuss the benefits and constraints of these metrics based on this analysis and suggest possible future applications.
Keywords: metadata record graphs; metadata quality; metadata linking.
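A metadata record graph of the kind the abstract mentions can be sketched by linking records that share a value in a given element and then computing network metrics such as node degree. The records and the single-element focus below are illustrative assumptions, not the study's dataset:

```python
from collections import defaultdict
from itertools import combinations

# Toy records, each with a set of values for one Dublin Core element.
records = [
    {"id": "rec1", "subject": {"Texas", "history"}},
    {"id": "rec2", "subject": {"Texas", "railroads"}},
    {"id": "rec3", "subject": {"music"}},
]

def record_graph(records, field):
    """Connect two records whenever they share a value in the given element."""
    edges = set()
    for a, b in combinations(records, 2):
        if a[field] & b[field]:
            edges.add(frozenset((a["id"], b["id"])))
    return edges

def degrees(edges):
    """Node degree: how many other records each record connects to."""
    deg = defaultdict(int)
    for edge in edges:
        for node in edge:
            deg[node] += 1
    return dict(deg)

edges = record_graph(records, "subject")
```

Isolated nodes (here, the record with no shared subject) are exactly the kind of signal such graphs surface for quality review: a record that links to nothing may use idiosyncratic or erroneous values.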
CMDIfication process for textbook resources
by Francesca Fallucchi, Ernesto William De Luca
Abstract: Interoperability between heterogeneous resources and services is the key to a correctly functioning digital infrastructure which can provide shared resources in one place. We analyse the establishment of a common infrastructure standard covering metadata, content, and inferred knowledge to allow collaborative work between researchers in the humanities. In this article, we discuss how to provide a CMDI (Component MetaData Infrastructure) profile for textbooks, in order to integrate it into the Common Language Resources and Technology Infrastructure (CLARIN) and thus to make the data available in an open way and according to the FAIR principles. Textbooks are an ideal source of material from which to investigate world views, constructed universes of meaning, and values in relation to social cohesion and the political rationales by which societies seek to legitimate themselves. Furthermore, in this work, we present a digital infrastructure of our textbook-related services and data, which are available and openly accessible for researchers worldwide. The challenge in this paper is mapping different metadata formats for libraries and language resources into a CMDI profile in order to integrate these resources into the CLARIN infrastructure. The integration makes our data more compatible, visible and open for other communities, thus facilitating resource discovery by expanding the amount of searchable metadata. We focus on the CMDIfication process, which fulfils the needs of our related projects. We describe a process of building resources using CMDI description from Text Encoding Initiative (TEI), Metadata Encoding and Transmission Standard (METS) and Dublin Core (DC) metadata, testing it on the textbook resources of the Georg Eckert Institute.
Keywords: component metadata infrastructure; virtual language observatory; CLARIN; textbook; metadata for language resources; digital humanities.
From the web of bibliographic data to the web of bibliographic meaning: structuring, interlinking and validating ontologies on the semantic web
by Helena Simões Patrício, Maria Inês Cordeiro, Pedro Nogueira Ramos
Abstract: Bibliographic datasets have achieved good levels of technical interoperability by observing the principles and good practices of linked data. However, they have a low level of quality from the semantic point of view, owing to many factors: the lack of a common conceptual framework for the diversity of standards often used together, the small number of links between the ontologies underlying datasets, the proliferation of heterogeneous vocabularies, the underuse of semantic mechanisms in data structures, 'ontology hijacking' (Feeney et al., 2018) and point-to-point mappings, as well as the limitations of semantic web languages with respect to the requirements of bibliographic data interoperability. After reviewing these issues, a research direction is proposed to overcome the misalignments found, by means of a reference model and a superontology, using SHACL (Shapes Constraint Language) to address the current limitations of RDF languages.
Keywords: linked open data; bibliographic data; semantic web; SHACL; LOD validation; ontologies; reference model; bibliographic standards.