International Journal of Metadata, Semantics and Ontologies (14 papers in press)
Modelling weightlifting 'training-diet-competition' cycle following a modular and scalable approach
by Piyaporn Tumnark, Paulo Cardoso, Jorge Cabral, Filipe Conceicao
Abstract: Studies in weightlifting have been characterised by unclear results and a paucity of information, mainly owing to the lack of information sharing between athletes, coaches, biomechanists, physiologists and nutritionists. These experts' knowledge is not captured, classified or integrated into an information system for decision-making. An ontology-driven knowledge model for Olympic weightlifting was developed to leverage a better understanding of the weightlifting domain as a whole, bringing together the related knowledge domains of training methodology, weightlifting biomechanics, and nutrition, while modelling the synergy among them. It unifies terminology, semantics, and concepts among sport scientists, coaches, nutritionists, and athletes to partially obviate the recognised limitations and inconsistencies, leading to the provision of superior coaching and a research environment that promotes better understanding and more conclusive results. The ontology-assisted weightlifting knowledge base consists of 110 classes, 50 object properties, 92 data properties, and 167 inheritance relationships, in a total of 1761 axioms, alongside 23 SWRL rules.
Keywords: ontology; nutrition; weightlifting; biomechanics; semantics; reasoning.
An algorithm to generate short sentences in natural language from linked open data based on linguistic templates
by Augusto Lopes Da Silva, Sandro Rigo, Jéssica Moraes
Abstract: The generation of natural language phrases from linked open data can benefit from the significant amount of information available on the web, mostly in RDF format, and from the properties these datasets contain. These properties can represent semantic relationships between concepts that might help in creating sentences in natural language. Nevertheless, research in this field tends not to use this information; we argue that exploiting it could foster the generation of more natural phrases. In this scenario, this research explores these RDF properties for the generation of natural language phrases. The short sentences generated by the algorithm implementation were evaluated for fluency by linguists and native English speakers. The results show that the generated sentences are promising with regard to fluency.
Keywords: linked open data; natural language generation; RDF; ontologies; linguistic templates; fluency.
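As a rough illustration of the kind of template-mediated generation the abstract describes, the sketch below renders RDF triples as short English sentences. The predicate-to-template mapping and all data are invented for illustration, not taken from the paper's algorithm:

```python
# Hypothetical sketch: verbalising RDF triples with linguistic templates.
# The predicate-to-template mapping below is illustrative only.

def verbalise(triple, templates):
    """Render a (subject, predicate, object) triple as a short sentence."""
    s, p, o = triple
    template = templates.get(p)
    if template is None:
        return None  # no linguistic template for this RDF property
    return template.format(s=s, o=o)

templates = {
    "dbo:birthPlace": "{s} was born in {o}.",
    "dbo:author":     "{s} was written by {o}.",
}

sentence = verbalise(("Ada Lovelace", "dbo:birthPlace", "London"), templates)
print(sentence)  # Ada Lovelace was born in London.
```

A real system would additionally inspect the RDF property's semantics to pick among candidate templates, which is the gap the paper targets.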
Towards linked open government data in Canada
by Enayat Rajabi
Abstract: Governments are publishing enormous amounts of open data on the web every day in an effort to increase transparency and reusability. Linking data from multiple sources on the web enables the performance of advanced data analytics, which can lead to the development of valuable services and data products. However, Canada's open government data portals are isolated from one another and remain unlinked to other resources on the web. In this paper, we first expose the statistical datasets in Canadian provincial open data portals as Linked Data, and then integrate them using the RDF Cube vocabulary, thereby making different open data portals available through a single search endpoint. We leverage semantic web technologies to publish open data sets taken from two provincial portals (Nova Scotia and Alberta) as RDF (the Linked Data format), and to connect them to one another. The success of our approach illustrates its high potential for linking open government datasets across Canada, which will in turn enable greater data accessibility and improved search results.
Keywords: open data; RDF cube; linked data; semantic web.
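To give a flavour of the RDF Data Cube representation the abstract mentions, the sketch below serialises statistical records as `qb:Observation` resources in Turtle. The dataset name, dimension properties, and figures are all invented examples, not drawn from the Nova Scotia or Alberta portals:

```python
# Illustrative sketch: emitting RDF Data Cube observations in Turtle.
# Prefixes ex:, qb:, xsd: and all values are invented for illustration.

OBSERVATION = """\
ex:obs{i} a qb:Observation ;
    qb:dataSet ex:populationDataset ;
    ex:refArea "{area}" ;
    ex:refPeriod "{year}"^^xsd:gYear ;
    ex:population {value} .
"""

def to_observations(records):
    """Serialise a list of record dicts as Turtle qb:Observation blocks."""
    return "\n".join(
        OBSERVATION.format(i=i, **rec) for i, rec in enumerate(records)
    )

records = [
    {"area": "Nova Scotia", "year": 2021, "value": 1000000},
    {"area": "Alberta", "year": 2021, "value": 4000000},
]
print(to_observations(records))
```

Modelling each portal's statistics with a shared vocabulary like this is what makes a single search endpoint over several portals feasible.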
Semantic similarity measurement: an intrinsic information content model
by Abhijit Adhikari, Biswanath Dutta, Animesh Dutta, Deepjyoti Mondal
Abstract: Ontology-dependent Semantic Similarity (SS) measurement has emerged as a new research paradigm in finding the semantic strength between any two entities. In this regard, as observed, the information-theoretic intrinsic approach yields better accuracy in correlation with human cognition. The precision of such a technique depends heavily on how accurately we calculate the Information Content (IC) of concepts and on its compatibility with an SS model. In this work, we develop an intrinsic IC model to facilitate better SS measurement. The proposed model has been evaluated using three vocabularies, namely SNOMED CT, MeSH, and WordNet, against a set of benchmark datasets. We compare the results with the state-of-the-art IC models. The results show that the proposed intrinsic IC model yields a high correlation with human assessment. The article also evaluates the compatibility of the proposed IC model and the other existing IC models in combination with a set of state-of-the-art SS models.
Keywords: semantic similarity; knowledge-based systems; ontology; intrinsic information content; natural language processing.
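To illustrate what an intrinsic IC model computes, the sketch below implements one well-known formulation (Seco et al.): IC(c) = 1 - log(hypo(c) + 1) / log(max_nodes), where hypo(c) counts the hyponyms of concept c. The toy taxonomy is invented, and the paper's own model differs in its exact formula:

```python
import math

# Minimal sketch of an intrinsic IC computation over a toy taxonomy.
# Leaves get maximal IC (most specific); the root gets IC 0.

def descendants(taxonomy, concept):
    """Count all transitive hyponyms (descendants) of a concept."""
    children = taxonomy.get(concept, [])
    return len(children) + sum(descendants(taxonomy, c) for c in children)

def intrinsic_ic(taxonomy, concept, max_nodes):
    return 1.0 - math.log(descendants(taxonomy, concept) + 1) / math.log(max_nodes)

taxonomy = {              # parent -> children
    "entity": ["animal", "artifact"],
    "animal": ["dog", "cat"],
    "artifact": ["car"],
}
max_nodes = 6             # total number of concepts in the toy taxonomy

print(intrinsic_ic(taxonomy, "dog", max_nodes))     # 1.0 (leaf concept)
print(intrinsic_ic(taxonomy, "entity", max_nodes))  # 0.0 (root concept)
```

"Intrinsic" here means the IC is derived purely from the ontology's structure, with no external corpus frequencies, which is the property the paper's model shares.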
Automatic classification of digital objects for improved metadata quality of electronic theses and dissertations in institutional repositories
by Lighton Phiri
Abstract: Higher education institutions typically employ Institutional Repositories (IRs) in order to curate and make available Electronic Theses and Dissertations (ETDs). While most of these IRs are implemented with self-archiving functionalities, self-archiving practices are still a challenge. This arguably leads to inconsistencies in the tagging of digital objects with descriptive metadata, potentially compromising searching and browsing of scholarly research output in IRs. This paper proposes an approach to automatically classify ETDs in IRs, using supervised machine learning techniques, by extracting features from the minimum possible input expected from document authors: the ETD manuscript. The experiment results demonstrate the feasibility of automatically classifying IR ETDs and, additionally, ensuring that repository digital objects are appropriately structured. Automatic classification of repository objects has the obvious benefit of improving the searching and browsing of content in IRs and further presents opportunities for the implementation of third-party tools and extensions that could potentially result in effective self-archiving strategies.
Keywords: digital libraries; Dublin Core; OAI-PMH; document classification; automatic classification; digital objects; metadata quality; electronic theses and dissertations; institutional repositories; self-archiving.
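As a sketch of the supervised-classification idea the abstract describes, the code below classifies a manuscript from bag-of-words features with a tiny nearest-centroid learner. The training snippets and class labels are invented; the paper's actual features and classifier may differ:

```python
# Hedged sketch: classifying ETD manuscripts by cosine similarity of
# bag-of-words vectors to per-class centroids (all data invented).
from collections import Counter
import math

def features(text):
    return Counter(text.lower().split())

def cosine(a, b):
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def train(labelled):
    """Build one aggregate word-count centroid per class."""
    centroids = {}
    for label, text in labelled:
        centroids.setdefault(label, Counter()).update(features(text))
    return centroids

def classify(centroids, text):
    vec = features(text)
    return max(centroids, key=lambda lab: cosine(centroids[lab], vec))

training = [
    ("computer science", "algorithm software network data system"),
    ("biology", "cell organism gene species protein"),
]
model = train(training)
print(classify(model, "a study of gene expression in cell cultures"))  # biology
```

The paper's point is that such features can be extracted from the ETD manuscript alone, the minimum input an author must deposit.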
Formalisation and classification of grammar and template-mediated techniques for model and ontology verbalisation
by Zola Mahlaza, C. Maria Keet
Abstract: Computational tools that translate modelling languages into a restricted natural language can improve end-user involvement in modelling. Templates are a popular approach for such a translation and are often paired with computational grammar rules to support grammatical complexity and obtain better quality sentences. There is no explicit specification of the relations used for pairing templates with grammar rules, so it is challenging to compare such templates' suitability for less-resourced languages, where grammar reuse is vital in reducing development effort. In order to enable such comparisons, we devise a model of pairing templates and rules, and assess its applicability by considering 54 existing systems for classification, 16 of them in detail. Our classification shows that most grammar-infused template systems support detachable grammar rules and half of them introduce syntax trees for multilingualism or error checking. Furthermore, out of the 16 considered grammar-infused template systems, most do not currently support any form of aggregation (63%) or the embedding of verb conjugation rules (81%); hence, if such features were required, they would need to be implemented from the ground up.
Keywords: ontology verbalisation; model verbalisation; natural language generation; template classification; grammar-infused templates.
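A grammar-infused template, as surveyed here, pairs a surface template with detachable grammar rules. The sketch below shows the idea with a toy English agreement rule filling a verb slot; both the rule and the template are invented illustrations, far simpler than the systems the paper classifies:

```python
# Illustrative sketch of a grammar-infused template: the verb slot is
# resolved by a detachable grammar rule rather than hard-coded text.

def conjugate(verb, plural):
    """Toy grammar rule: 3rd-person present agreement in English."""
    return verb if plural else verb + "s"

def verbalise(subject, verb, obj, plural_subject=False):
    # Template: "<subject> <verb> <object>." with a grammar-mediated slot.
    return f"{subject} {conjugate(verb, plural_subject)} {obj}."

print(verbalise("Each professor", "supervise", "a thesis"))
# Each professor supervises a thesis.
print(verbalise("All professors", "supervise", "a thesis", plural_subject=True))
# All professors supervise a thesis.
```

Because `conjugate` is separate from the template, it could in principle be swapped for a rule set of another language, which is why detachability matters for less-resourced languages.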
Applying cross-dataset identity reasoning for producing URI embeddings over hundreds of RDF datasets
by Michalis Mountantonakis, Yannis Tzitzikas
Abstract: There is a proliferation of approaches that exploit RDF datasets for creating URI embeddings, i.e., embeddings produced by taking as input URI sequences (instead of simple words or phrases), since they can be of primary importance for several tasks (e.g., machine learning tasks). However, existing techniques exploit either a single or only a few datasets for creating URI embeddings. For this reason, we introduce a prototype, called LODVec, which exploits LODsyndesis to enable the creation of URI embeddings from hundreds of datasets simultaneously, after enriching them with the results of cross-dataset identity reasoning. By using LODVec, it is feasible to produce URI sequences by following paths of any length (according to a given configuration), and the produced URI sequences are used as input for creating embeddings through the word2vec model. We provide comparative results evaluating the gain of using several datasets for creating URI embeddings, for the tasks of classification and regression, and for finding the entities most similar to a given one.
Keywords: embeddings; cross-dataset identity reasoning; RDF; machine learning; data integration; Linked Data; finding similar entities; classification; regression.
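The first step the abstract describes, producing URI sequences by following paths through an RDF graph, can be sketched as below. The tiny graph and walk policy are invented; LODVec additionally enriches the graph with cross-dataset identity reasoning before walking:

```python
# Minimal sketch: generating URI sequences by following outgoing edges
# of an RDF graph. Graph content and ex: URIs are invented examples.
import random

GRAPH = {  # subject URI -> list of (predicate URI, object URI)
    "ex:Athens": [("ex:capitalOf", "ex:Greece")],
    "ex:Greece": [("ex:memberOf", "ex:EU")],
    "ex:EU":     [],
}

def walk(graph, start, length, rng):
    """Follow up to `length` hops from `start`, recording the URIs seen."""
    sequence, node = [start], start
    for _ in range(length):
        edges = graph.get(node, [])
        if not edges:
            break
        pred, obj = rng.choice(edges)
        sequence += [pred, obj]
        node = obj
    return sequence

seq = walk(GRAPH, "ex:Athens", 2, random.Random(42))
print(seq)
# ['ex:Athens', 'ex:capitalOf', 'ex:Greece', 'ex:memberOf', 'ex:EU']
```

Such sequences then play the role of sentences for a word2vec implementation (e.g., gensim's `Word2Vec`), so each URI ends up with a vector shaped by the paths it appears on.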
An ontology-driven perspective on the emotional human reactions to social events
by Danilo Cavaliere, Sabrina Senatore
Abstract: Social media has become a fulcrum for sharing information about everyday-life events: people, companies, and organisations express opinions about new products, political and social situations, football matches, and concerts. The recognition of feelings and reactions to events from social networks requires dealing with great amounts of data streams, especially for tweets, to investigate the main sentiments and opinions that justify some reactions. This paper presents an emotion-based classification model to extract feelings from tweets related to an event or a trend, described by a hashtag, and builds an Emotional Concept Ontology to study human reactions to events in a context. From the tweet analysis, terms expressing a feeling are selected to build a topological space of emotion-based concepts. The extracted concepts serve to train a multi-class SVM classifier that is used to perform soft classification aimed at identifying the emotional reactions towards events. Then, an ontology allows arranging classification results, enriched with additional DBpedia concepts. SPARQL queries on the final knowledge base provide specific insights to explain people's reactions towards events. Practical case studies and test results demonstrate the applicability and potential of the approach.
Keywords: sentiment analysis; simplicial complex; SVM; ontology; OWL; SPARQL.
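"Soft" classification here means each tweet receives a distribution over emotion classes rather than a single label. The sketch below illustrates only that output shape; a keyword scorer with invented emotion lexicons stands in for the paper's trained multi-class SVM:

```python
# Hedged sketch of soft emotion classification: the keyword scorer is a
# stand-in for a trained multi-class SVM; lexicons are invented.

EMOTION_TERMS = {
    "joy":     {"happy", "great", "love"},
    "anger":   {"angry", "hate", "awful"},
    "sadness": {"sad", "miss", "cry"},
}

def soft_classify(tweet):
    """Return a normalised score distribution over emotion classes."""
    tokens = set(tweet.lower().split())
    scores = {emo: len(tokens & terms) for emo, terms in EMOTION_TERMS.items()}
    total = sum(scores.values()) or 1
    return {emo: s / total for emo, s in scores.items()}

dist = soft_classify("so happy to love and miss this concert")
print(dist)  # joy scores highest, with some mass on sadness
```

The resulting distributions are what get arranged in the ontology and queried via SPARQL to explain reactions to an event.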
Persons, GLAM institutes and collections: an analysis of entity linking based on the COURAGE registry
by Faraj Ghazal, András Micsik
Abstract: It is an important task to connect encyclopaedic knowledge graphs by finding and linking the same entity nodes. Various available automated linking solutions cannot be applied in situations where data is sparse or private, or where a high degree of correctness is expected. Wikidata has grown into a leading linking hub collecting entity identifiers from various registries and repositories. To get a picture of connectability, we analysed the linking methods and results between the COURAGE registry and Wikidata, VIAF, ISNI, and ULAN. This paper describes our investigations and solutions while mapping and enriching entities in Wikidata. Each possible mapped pair of entities received a numeric score of reliability. Using this score-based matching method, we tried to minimise the need for human decisions, hence we introduced the term human decision window for the mappings where neither acceptance nor refusal can be made automatically and safely. Furthermore, Wikidata has been enriched with related COURAGE entities and bi-directional links between mapped persons, organisations, collections, and collection items. We also describe the findings on coverage and quality of mapping among the above-mentioned authority databases.
Keywords: linked data; cultural heritage; link discovery; entity linking; authority data; metadata quality; Wikidata; VIAF; ISNI; ULAN.
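The score-based matching with a "human decision window" can be sketched as a three-way threshold decision. The name-similarity score and both thresholds below are illustrative choices, not the paper's actual scoring scheme:

```python
# Sketch: score-based entity matching with a human decision window.
# Pairs above ACCEPT are auto-accepted, below REJECT auto-rejected,
# and everything in between is deferred to a human reviewer.
from difflib import SequenceMatcher

ACCEPT, REJECT = 0.9, 0.5  # illustrative thresholds

def reliability(name_a, name_b):
    """Toy reliability score: normalised string similarity of names."""
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()

def decide(name_a, name_b):
    score = reliability(name_a, name_b)
    if score >= ACCEPT:
        return "accept"
    if score < REJECT:
        return "reject"
    return "human decision window"

print(decide("Ada Lovelace", "Ada Lovelace"))   # accept
print(decide("J. Smith", "John Smith"))         # human decision window
print(decide("Ada Lovelace", "Isaac Newton"))   # reject
```

Narrowing the middle band (e.g., by combining several evidence sources into the score) is what reduces the number of pairs a human must review.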
Data Aggregation Lab: an experimental framework for data aggregation in cultural heritage
by Nuno Freire
Abstract: This paper describes the Data Aggregation Lab software, a system that implements the metadata aggregation workflow of cultural heritage, based on the underlying concepts and technologies of the Web of Data. It provides a framework to support several of our research activities within the Europeana network, such as conducting case studies, providing reference implementations, and supporting technology adoption. Currently, it provides working implementations for metadata aggregation methods with which our research has obtained positive results. These methods explore technologies such as linked data, Schema.org, IIIF, Sitemaps and RDF-related technologies for innovation in data aggregation, data analysis and data conversion focused on cultural heritage data.
Keywords: linked data; IIIF; Schema.org; Sitemaps; RDF; software; cultural heritage; metadata aggregation; harvesting; crawling; semantics; Europeana; research framework; technology adoption.
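One of the aggregation methods mentioned, Sitemap-based discovery of records, can be sketched with the standard library alone. The sitemap content and URLs below are invented examples:

```python
# Illustrative sketch: discovering record URLs from a Sitemap, a first
# step of crawling-based metadata aggregation (example data invented).
import xml.etree.ElementTree as ET

SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/records/1</loc></url>
  <url><loc>https://example.org/records/2</loc></url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def record_urls(sitemap_xml):
    """Extract every <loc> URL from a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text for loc in root.findall("sm:url/sm:loc", NS)]

print(record_urls(SITEMAP))
# ['https://example.org/records/1', 'https://example.org/records/2']
```

An aggregator would then fetch each URL and look for embedded Schema.org or other structured metadata, which is the part the Data Aggregation Lab's case studies evaluate.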
Special Issue on: Data Analytics and Semantic Web
Automatic metadata extraction via image processing using Migne's Patrologia Graeca
by Evagelos Varthis, Sozon Papavlasopoulos, Ilias Giarenis, Marios Poulos
Abstract: A wealth of knowledge is kept in libraries and cultural institutions in various digital forms without, however, the possibility of a simple term search, let alone a substantial semantic search. One such important collection that contains knowledge, accumulated in the passage of ages but which remains largely inaccessible, is the Patrologia Graeca (PG). In this study, a novel approach is proposed which strives to recognise words and automatically generate metadata from large machine-printed corpora, such as Migne's Patrologia Graeca. The proposed framework firstly applies an efficient segmentation process at word level and transforms the word-images of the Greek polytonic script of the Patrologia Graeca into special compact shapes. The contours of these shapes are extracted and compared with the contour of a similarly transformed query word-image. For the comparison, we use Hu's invariant moments for discarding unlikely matches, Shape Context (SC) for the contour similarity, and Pearson's Correlation Coefficient (PCC) for final pruning of the dissimilar words and additional verification. Comparative results are presented by using the Long Short-Term Memory (LSTM) Neural Network (NN) engine of the Tesseract Optical Character Recognition (OCR) system instead of PCC. In addition, an intelligent scenario is proposed for automatic generation of PG metadata by librarians. The described scenario, owing to the simplicity and efficiency it provides, can be applied to massive metadata extraction, building search indexes and consequently semantic enrichment of the Patrologia Graeca.
Keywords: Patrologia Graeca; word spotting; shape context; time series; metadata extraction; semantic enrichment; digital librarian.
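The final pruning step, Pearson's Correlation Coefficient between two word-image representations, can be sketched as below. The word-images are reduced here to invented 1-D column-sum profiles; the paper operates on the transformed shapes themselves:

```python
# Minimal sketch of PCC-based pruning in word spotting: keep a candidate
# only if its profile correlates strongly with the query's (toy data).
import math

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length signals."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

query     = [0, 3, 7, 7, 3, 0]   # column profile of the query word-image
candidate = [0, 2, 6, 8, 3, 1]   # a likely match
unrelated = [7, 0, 1, 7, 0, 6]   # a dissimilar word

print(pearson(query, candidate) > 0.9)   # True: keep the candidate
print(pearson(query, unrelated) > 0.9)   # False: prune it
```

Because PCC is cheap, it is well suited as a last filter after the coarser Hu-moment and Shape Context stages have thinned the candidate list.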
A survey study on Arabic WordNet: opportunities and future research directions
by Abdulmohsen Albesher, Osama Rabie
Abstract: WordNet (WN) plays an essential role in knowledge management and information retrieval, as it allows for a better understanding of word relationships, which leads to more accurate text processing. The success of WN for the English language encouraged researchers to develop WNs for other languages. One of the most common such languages is Arabic. However, the current state of affairs of Arabic WN (AWN) has not been properly studied. Thus, this paper presents a survey study on AWN conducted to explore opportunities and possible future research directions. The results involve the synthesis of over 100 research papers on AWN. These research papers were divided into categories and subcategories.
Keywords: natural language processing; semantics; information retrieval; WordNet; Arabic; Arabic WordNet; AWN.
An ontology-based method for improving quality of process event logs using database bin logs
by Shokoufeh Ghalibafan, Behshid Behkamal, Mohsen Kahani, Mohammad Allahbakhsh
Abstract: The main goal of process mining is discovering models from event logs. The usefulness of these discovered models is directly related to the quality of the event logs. Many researchers have observed serious deficiencies in the quality of event logs and have proposed various solutions to improve the overall quality of discovered process models. Amongst these, only a few have considered the application of a reliable external source to improve the quality of process event logs. Every activity in a process leaves some trace in the process event file. Besides, when a process instance is executed, the corresponding operations are applied to the database and their impacts are stored in the database bin log. Therefore, the database bin log can be used as a reliable, relevant source for improving the quality of an event log. In this paper, we propose a method to repair the event log using the database bin log. We show that database operations can be employed to overcome the inadequacies of event logs, including incorrect and missing data. To this end, we first extract an ontology from each of the event logs and the bin log. Then, we match the extracted ontologies in order to remove inadequacies from the event log. Model validation and evaluation results show the stability of our proposed model, as well as its superiority over related works.
Keywords: data quality; process mining; event log; ontology matching; database bin log.
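The repair step, once event-log and bin-log records have been aligned, can be sketched as below. In this toy version a shared case id plays the role of the paper's ontology matching, and all records are invented:

```python
# Hedged sketch: repairing an event log from a database bin log, which
# is treated as the reliable source (alignment here is by case id; the
# paper aligns the two logs via extracted ontologies).

def repair(event_log, bin_log):
    """Fill missing attributes and fix wrong values using the bin log."""
    indexed = {rec["case"]: rec for rec in bin_log}
    repaired = []
    for event in event_log:
        reference = indexed.get(event["case"], {})
        fixed = dict(event)
        for key, value in reference.items():
            if fixed.get(key) != value:   # missing or incorrect attribute
                fixed[key] = value
        repaired.append(fixed)
    return repaired

event_log = [{"case": 1, "activity": "create order", "amount": None}]
bin_log   = [{"case": 1, "activity": "create order", "amount": 120}]
print(repair(event_log, bin_log))
# [{'case': 1, 'activity': 'create order', 'amount': 120}]
```

Events with no counterpart in the bin log pass through unchanged, so the repair only ever applies evidence the database actually recorded.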
Stress-testing big data to extract smart and interoperable food safety analytics
by Ioanna Polychronou, Giannis Stoitsis, Mihalis Papakonstantinou, Nikos Manouselis
Abstract: One of the significant challenges for the future is to guarantee safe food for all inhabitants of the planet. During the last 15 years, major fraud incidents such as the '2013 horse meat scandal' and the '2008 Chinese milk scandal' have greatly affected the food industry and public health. One way to address this challenge is to increase production, but doing so requires innovative measures to enhance the safety of the food supply chain. For this reason, it is quite important to have the right infrastructure in order to manage data of the food safety sector and provide useful analytics to food safety experts. In this paper, we describe Agroknow's Big Data Platform architecture and examine its scalability for data management and experimentation.
Keywords: big data; stress-testing data; data platform; data management; data experimentation.