International Journal of Metadata, Semantics and Ontologies (11 papers in press)
Applying cross-dataset identity reasoning for producing URI embeddings over hundreds of RDF datasets
by Michalis Mountantonakis, Yannis Tzitzikas
Abstract: There is a proliferation of approaches that exploit RDF datasets for creating URI embeddings, i.e., embeddings that are produced by taking as input URI sequences (instead of simple words or phrases), since they can be of primary importance for several tasks (e.g., machine learning tasks). However, existing techniques exploit either a single or a few datasets for creating URI embeddings. For this reason, we introduce a prototype, called LODVec, which exploits LODsyndesis for enabling the creation of URI embeddings by using hundreds of datasets simultaneously, after enriching them with the results of cross-dataset identity reasoning. By using LODVec, it is feasible to produce URI sequences by following paths of any length (according to a given configuration), and the produced URI sequences are used as input for creating embeddings through the word2vec model. We provide comparative results for evaluating the gain of using several datasets for creating URI embeddings, for the tasks of classification and regression, and for finding the most similar entities to a given one.
Keywords: embeddings; cross-dataset identity reasoning; RDF; machine learning; data integration; Linked Data; finding similar entities; classification; regression.
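The idea of producing URI sequences by following paths, after merging URIs that denote the same entity, can be illustrated with a minimal sketch. This is not LODVec's implementation: the triples, the `ex:`/`db:` URIs, the `SAME_AS` mapping and the function names are all hypothetical, and identity reasoning is reduced here to a simple lookup table.

```python
import random
from collections import defaultdict

# Toy triples from two imaginary datasets (hypothetical URIs).
TRIPLES = [
    ("ex:Aristotle", "ex:bornIn", "ex:Stagira"),
    ("ex:Aristotle", "ex:influenced", "ex:Alexander"),
    ("ex:Stagira", "ex:locatedIn", "ex:Greece"),
    ("db:Aristotle", "db:field", "db:Philosophy"),
]

# Assumed output of identity reasoning: each URI maps to one canonical URI.
SAME_AS = {"db:Aristotle": "ex:Aristotle"}


def canonical(uri):
    """Return the canonical URI for an entity, or the URI itself."""
    return SAME_AS.get(uri, uri)


def build_graph(triples):
    """Index outgoing (predicate, object) edges by canonical subject."""
    graph = defaultdict(list)
    for s, p, o in triples:
        graph[canonical(s)].append((p, canonical(o)))
    return graph


def walk(graph, start, max_hops, rng):
    """Follow up to max_hops random edges, recording predicates and nodes."""
    sequence = [start]
    node = start
    for _ in range(max_hops):
        edges = graph.get(node)
        if not edges:
            break  # dead end: no outgoing edges
        predicate, node = rng.choice(edges)
        sequence += [predicate, node]
    return sequence


graph = build_graph(TRIPLES)
rng = random.Random(0)  # fixed seed so the walks are reproducible
sequences = [walk(graph, "ex:Aristotle", 2, rng) for _ in range(5)]
```

Each resulting sequence alternates entity and predicate URIs and could then be passed as a "sentence" to a word2vec implementation. Note that the merged entity `ex:Aristotle` gains the `db:field` edge, which is the kind of enrichment that using several datasets is meant to provide.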
An ontology-driven perspective on the emotional human reactions to social events
by Danilo Cavaliere, Sabrina Senatore
Abstract: Social media has become a fulcrum for sharing information about everyday-life events: people, companies, and organisations express opinions about new products, political and social situations, football matches, and concerts. The recognition of feelings and reactions to events from social networks requires dealing with large volumes of streaming data, especially tweets, to investigate the main sentiments and opinions that justify some reactions. This paper presents an emotion-based classification model to extract feelings from tweets related to an event or a trend, described by a hashtag, and builds an Emotional Concept Ontology to study human reactions to events in a context. From the tweet analysis, terms expressing a feeling are selected to build a topological space of emotion-based concepts. The extracted concepts serve to train a multi-class SVM classifier that is used to perform soft classification aimed at identifying the emotional reactions towards events. Then, an ontology allows arranging classification results, enriched with additional DBpedia concepts. SPARQL queries on the final knowledge base provide specific insights to explain people's reactions towards events. Practical case studies and test results demonstrate the applicability and potential of the approach.
Keywords: sentiment analysis; simplicial complex; SVM; ontology; OWL; SPARQL.
Persons, GLAM institutes and collections: an analysis of entity linking based on the COURAGE registry
by Faraj Ghazal, András Micsik
Abstract: It is an important task to connect encyclopaedic knowledge graphs by finding and linking the same entity nodes. Various available automated linking solutions cannot be applied in situations where data is sparse or private or a high degree of correctness is expected. Wikidata has grown into a leading linking hub collecting entity identifiers from various registries and repositories. To get a picture of connectability, we analysed the linking methods and results between the COURAGE registry and Wikidata, VIAF, ISNI, and ULAN. This paper describes our investigations and solutions while mapping and enriching entities in Wikidata. Each possible mapped pair of entities received a numeric score of reliability. Using this score-based matching method, we tried to minimise the need for human decisions; hence, we introduced the term 'human decision window' for the mappings where neither acceptance nor refusal can be made automatically and safely. Furthermore, Wikidata has been enriched with related COURAGE entities and bi-directional links between mapped persons, organisations, collections, and collection items. We also describe the findings on coverage and quality of mapping among the above-mentioned authority databases.
Keywords: linked data; cultural heritage; link discovery; entity linking; authority data; metadata quality; Wikidata; VIAF; ISNI; ULAN.
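The notion of a human decision window can be sketched as a pair of thresholds on the reliability score. The threshold values, the score range of 0 to 1, the function name and the candidate pairs below are assumptions made for illustration only; the abstract does not state how the scores are computed or where the window boundaries lie.

```python
def decide(score, reject_below=0.4, accept_from=0.8):
    """Classify a candidate mapping by its reliability score.

    Scores inside [reject_below, accept_from) fall into the
    human decision window and are left for manual review.
    Both thresholds are hypothetical.
    """
    if score >= accept_from:
        return "accept"
    if score < reject_below:
        return "reject"
    return "human decision"


# Hypothetical candidate pairs with made-up scores.
candidates = {
    ("courage:person/1", "wd:Q0000001"): 0.93,
    ("courage:person/2", "wd:Q0000002"): 0.55,
    ("courage:person/3", "wd:Q0000003"): 0.12,
}

decisions = {pair: decide(score) for pair, score in candidates.items()}
needs_review = [pair for pair, d in decisions.items() if d == "human decision"]
```

Narrowing the gap between the two thresholds reduces the manual workload but increases the risk of wrong automatic decisions, which is the trade-off the window is meant to make explicit.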
Data Aggregation Lab: an experimental framework for data aggregation in cultural heritage
by Nuno Freire
Abstract: This paper describes the Data Aggregation Lab software, a system that implements the metadata aggregation workflow of cultural heritage, based on the underlying concepts and technologies of the Web of Data. It provides a framework to support several of our research activities within the Europeana network, such as conducting case studies, providing reference implementations, and supporting technology adoption. Currently, it provides working implementations for metadata aggregation methods with which our research has obtained positive results. These methods explore technologies such as linked data, Schema.org, IIIF, Sitemaps and RDF-related technologies for innovation in data aggregation, data analysis and data conversion focused on cultural heritage data.
Keywords: linked data; IIIF; Schema.org; Sitemaps; RDF; software; cultural heritage; metadata aggregation; harvesting; crawling; semantics; Europeana; research framework; technology adoption.
Documenting flooding areas calculation: a PROV approach
by Monica De Martino, Alfonso Quarati, Sergio Rosim, Laércio Massaru Namikawa
Abstract: Flooding events related to waste-lake dam ruptures are one of the most threatening natural disasters in Brazil. They must be managed in advance by public institutions through the use of adequate hydrographic and environmental information. Although the Open Data paradigm offers an opportunity to share hydrographic datasets, their actual reuse is still low because of metadata quality. Our previous work highlighted a lack of detailed provenance information. This paper presents an Open Data approach to improve the release of hydrographic datasets. We discuss a methodology, based on W3C recommendations, for documenting the provenance of hydrographic datasets, considering the workflow activities related to the study of flood areas caused by the waste-lake breakdowns. We provide an illustrative example that documents, through the W3C PROV metadata model, the generation of flooding area maps by integrating land use classification, from Sentinel images, with hydrographic datasets produced by the Brazilian National Institute for Space Research.
Keywords: hydrography datasets; Open Data; reusability; provenance workflow; metadata; W3C PROV.
Children's art museum collections as Linked Open Data
by Konstantinos Kotis, Sotiris Angelis, Maria Chondrogianni, Efstathia Marini
Abstract: Museums currently provide web access to their collections. It is argued that it is rather beneficial for institutions to provide their datasets as Linked Open Data (LOD) in order to achieve cross-referencing, interlinking and integration with other datasets in the LOD cloud. In this paper, we present the Greek Children's Art Museum (GCAM) linked dataset, along with dataset and vocabulary statistics, as well as lessons learned from the process of transforming the collections to HTML-embedded structured data (using RDFa and microdata encodings), mainly using two different models, i.e., the Europeana Data Model and the Schema.org model. The dataset consisted of three cultural collections of 121 child artworks (paintings), including detailed descriptions and interlinks to external datasets (DBpedia, Wikidata, WikiArt, MoMA and others). The paper, in addition to the presentation of the GCAM dataset and the lessons learned from the experimentation of non-ICT experts with the LOD paradigm, introduces a new metric for measuring LD dataset quality in terms of links to and from other datasets, namely the D-h-index metric. Such a metric goes beyond the dataset's quality metrics of linkage dynamics, i.e., the average, min and max number of links from the dataset's resources to external ones.
Keywords: Linked Open Data; HTML-embedded RDF; RDFa; microdata; museum; artwork.
A fuzzy logic and ontology based approach for improving the CV and job offer matching in recruitment process
by Amine Habous, El Habib Nfaoui
Abstract: The recruitment process is a critical activity for every organisation: it makes it possible to find the appropriate candidate for a job offer and its employer's work criteria. The competitive nature of the recruitment environment makes the task of hiring new employees very hard for companies owing to the high number of CVs (résumés) and profiles to process, the personal job interests, the customised requirements and precise skills requested by employees, etc. Time becomes crucial for recruiters' choices; consequently, it might impact the quality of the selection process. In this paper, we propose a retrieval system for automating the recruitment process. It is designed based on natural language processing, machine learning, and fuzzy logic to handle the matching between the job description and the CVs. It also considers the proficiency level for the technical skills. Moreover, it offers an estimation of the overall CV/job offer expertise level. In that way, it overcomes the under-qualification and over-qualification issue in the ICT recruitment process. Experimental results on ground-truth data from a recruitment company demonstrate that our proposal provides effective results.
Keywords: text mining; natural language processing; feature extraction; metadata weighting; ICT recruitment; fuzzy logic; machine learning.
Special Issue on: Data Analytics and Semantic Web
Automatic metadata extraction via image processing using Migne's Patrologia Graeca
by Evagelos Varthis, Sozon Papavlasopoulos, Ilias Giarenis, Marios Poulos
Abstract: A wealth of knowledge is kept in libraries and cultural institutions in various digital forms without, however, the possibility of a simple term search, let alone a substantial semantic search. One such important collection that contains knowledge, accumulated in the passage of ages but which remains largely inaccessible, is the Patrologia Graeca (PG). In this study, a novel approach is proposed which strives to recognise words and automatically generate metadata from large machine-printed corpora, such as Migne's Patrologia Graeca. The proposed framework first applies an efficient segmentation process at word level and transforms the word-images of the Greek polytonic script of the Patrologia Graeca into special compact shapes. The contours of these shapes are extracted and compared with the contour of a similarly transformed query word-image. For the comparison, we use Hu's invariant moments for discarding unlikely matches, Shape Context (SC) for the contour similarity and Pearson's Correlation Coefficient (PCC) for final pruning of the dissimilar words and additional verification. Comparative results are presented by using the Long Short-Term Memory (LSTM) Neural Network (NN) engine of the Tesseract Optical Character Recognition (OCR) system instead of PCC. In addition, an intelligent scenario is proposed for automatic generation of PG metadata by librarians. The described scenario, owing to the simplicity and efficiency it provides, can be applied to massive metadata extraction, building search indexes and consequently semantic enrichment of the Patrologia Graeca.
Keywords: Patrologia Graeca; word spotting; shape context; time series; metadata extraction; semantic enrichment; digital librarian.
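The final pruning step relies on Pearson's Correlation Coefficient, which for two equal-length sequences x and y is the covariance divided by the product of the standard deviations. The sketch below computes it in plain Python and applies a pruning threshold. How the word-images are turned into numeric sequences is not described in the abstract, so the profile values and the threshold of 0.9 are purely hypothetical.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("sequences must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    denom = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denom == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sum(a * b for a, b in zip(dx, dy)) / denom


def prune(query, candidates, threshold=0.9):
    """Keep the candidates whose correlation with the query reaches threshold."""
    return [name for name, seq in candidates.items()
            if pearson(query, seq) >= threshold]


# Hypothetical one-dimensional profiles standing in for word-image features.
query_profile = [1.0, 3.0, 2.0, 5.0, 4.0]
candidate_profiles = {
    "word_a": [2.0, 6.0, 4.0, 10.0, 8.0],  # same shape, scaled
    "word_b": [5.0, 1.0, 4.0, 2.0, 3.0],   # different shape
}

kept = prune(query_profile, candidate_profiles)
```

Because the coefficient is invariant to scaling and shifting, `word_a` is kept even though its values are twice those of the query, while `word_b` is pruned.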
A survey study on Arabic WordNet: opportunities and future research directions
by Abdulmohsen Albesher, Osama Rabie
Abstract: WordNet (WN) plays an essential role in knowledge management and information retrieval because it allows for a better understanding of word relationships, which leads to more accurate text processing. The success of WN for the English language encouraged researchers to develop WNs for other languages. One of the most common of such languages is Arabic. However, the current state of affairs of Arabic WN (AWN) has not been properly studied. Thus, this paper presents a survey study on AWN conducted to explore opportunities and possible future research directions. The results involve the synthesis of over 100 research papers on AWN. These research papers were divided into categories and subcategories.
Keywords: natural language processing; semantics; information retrieval; WordNet; Arabic; Arabic WordNet; AWN.
An ontology-based method for improving quality of process event logs using database bin logs
by Shokoufeh Ghalibafan, Behshid Behkamal, Mohsen Kahani, Mohammad Allahbakhsh
Abstract: The main goal of process mining is discovering models from event logs. The usefulness of these discovered models is directly related to the quality of the event logs. Many researchers have observed serious deficiencies with regard to the quality of event logs and have proposed various solutions to improve the overall quality of discovered process models. Amongst these, only a few have considered the application of a reliable external source in the improvement of the quality of process event logs. Every activity in a process is known to leave some trace in the process event file. Besides, when a process instance is executed, the corresponding operations are applied to the database and their impacts are stored in the database bin log. Therefore, the database bin log can be used as a reliable, relevant source for improving the quality of an event log. In this paper, we propose a method to repair the event log using the database bin log. We show that database operations can be employed in order to overcome the inadequacies of event logs, including incorrect and missing data. To this end, we first extract an ontology from each of the event logs and the bin log. Then, we match the extracted ontologies in order to remove inadequacies from the event log. Model validation and evaluation results show the stability of our proposed model, as well as its superiority over related works.
Keywords: data quality; process mining; event log; ontology matching; database bin log.
Stress-testing big data to extract smart and interoperable food safety analytics
by Ioanna Polychronou, Giannis Stoitsis, Mihalis Papakonstantinou, Nikos Manouselis
Abstract: One of the significant challenges for the future is to guarantee safe food for all inhabitants of the planet. During the last 15 years, major fraud incidents like the '2013 horse meat scandal' and the '2008 Chinese milk scandal' have greatly affected the food industry and public health. One option for addressing this challenge is to increase production, but to accomplish this, innovative measures must be applied to enhance the safety of the food supply chain. For this reason, it is important to have the right infrastructure in order to manage data of the food safety sector and provide useful analytics to food safety experts. In this paper, we describe Agroknow's Big Data Platform architecture and examine its scalability for data management and experimentation.
Keywords: big data; stress-testing data; data platform; data management; data experimentation.