International Journal of Big Data Intelligence (10 papers in press)
A Survey of Computation Techniques on Time Evolving Graphs
by Shalini Sharma, Jerry Chou
Abstract: Time Evolving Graph (TEG) refers to graphs whose topology or attribute
values change over time due to update events, including edge addition/deletion, vertex
addition/deletion, and attribute changes on vertices or edges. Driven by the Big Data
paradigm, the ability to process and analyze TEGs in a timely fashion is critical in
many application domains, such as social networks, web graphs, road network traffic, etc.
Recently, many research efforts have been made with the aim of addressing the volume
and velocity challenges of dealing with such datasets. However, it remains an
active and challenging research topic. Therefore, in this survey, we summarize the state-
of-the-art computation techniques for TEGs. We collect these techniques from three different
research communities: i) the data mining community for graph analysis; ii) the theory
community for graph algorithms; iii) the computation community for graph computing
frameworks. Based on our study, we also propose our own computing framework, DASH,
for TEGs. We have also performed experiments comparing DASH with Graph
Processing System (GPS). We are optimistic that this paper will help many researchers
understand the various dimensions of problems in TEGs and continue developing the necessary
techniques to resolve these problems more efficiently.
Keywords: Big Data; Time evolving graphs; Computing framework; Algorithm; Data Mining.
Uncovering data stream behavior of automated analytical tasks in edge computing
by Lilian Hernandez, Monica Wachowicz, Robert Barton, Marc Breissinger
Abstract: Massive volumes of data streams are expected to be generated by the Internet of Things (IoT). Due to their dispersed and mobile nature, they need to be processed using automated analytical tasks. The research challenge is to uncover whether the data streams, which are being generated by billions of IoT devices, actually conform to a data flow that is required to perform streaming analytics. In this paper, we propose process discovery and conformance checking techniques of Process Mining in order to expose the flow dependency of IoT data streams between automated analytical tasks running at the edge of a network. Towards this end, we have developed a Petri Net model to ensure the optimal execution of analytical tasks by finding path deviations, bottlenecks, and parallelism. A real-world scenario in smart transit is used to evaluate the full advantage of our proposed model. Uncovering the actual behavior of data flows from IoT devices to edge nodes has allowed us to detect discrepancies that have a negative impact on the performance of automated analytical tasks.
Keywords: streaming analytics; process mining; Petri Net; smart transit; Internet of Things; edge computing.
Combining the Richness of GIS Techniques with Visualization Tools to Better Understand the Spatial Distribution of Data: A Case Study of Chicago City Crime Analysis
by Omar Bani Taha, M. Omair Shafiq
Abstract: This study aims to achieve the following objectives: (1) to explore the benefits of adding a spatial GIS layer of analysis to other existing visualization techniques; (2) to identify and evaluate patterns in selected crime data by analysing Chicago's open dataset and examining the related existing literature on crime trends in this city. The motivations for this study include the magnitude and scale of crime incidents across the world, as well as the need for a better understanding of patterns and prediction of crime trends within the selected geographical location. We conclude that Chicago seems to be on course to have both its lowest violent crime rate since 1972 and its lowest murder frequency since 1967. Chicago has witnessed a sharp drop in most crime types over the last few years compared to previous crime index data. Also, crime in Chicago naturally surges during the summer months and declines during the winter months. Our results align with several previous decades of studies and analyses of Chicago crime, in which the same communities with the highest crime rates still experience the majority of crime. One may go back to the crime patterns reported in studies from the 1930s and find them very typical of today's. The present study confirmed the efficiency of Geographic Information Systems and other visualization techniques as tools for scrutinizing crime in the city of Chicago.
Keywords: spatial analysis; geographic information system (GIS); human-centred data science; visualization tools; traditional qualitative techniques; data visualization; spatial and crime mapping.
Improving collaborative filtering's rating prediction coverage in sparse datasets by exploiting the friend-of-a-friend concept
by Dionisis Margaris, Costas Vassilakis
Abstract: Collaborative filtering computes personalized recommendations by taking into account ratings expressed by users. Collaborative filtering algorithms first identify people having similar tastes, by examining the likeness of already entered ratings. Users with highly similar tastes are termed near neighbours, and recommendations for a user are based on her near neighbours' ratings. However, for a number of users no near neighbours can be found, a problem termed the gray sheep problem. This problem is more intense in sparse datasets, i.e., datasets with a relatively small number of ratings compared to the number of users and items. In this work, we propose an algorithm for alleviating this problem by exploiting the friend-of-a-friend (FOAF) concept. The proposed algorithm, CFfoaf, has been evaluated against eight widely used sparse datasets and under two widely used collaborative filtering correlation metrics, namely the Pearson correlation coefficient and cosine similarity, and has proven particularly effective in increasing the percentage of users for which personalized recommendations can be formulated in the context of sparse datasets, while at the same time improving rating prediction quality.
Keywords: collaborative filtering; recommender systems; sparse datasets; friend-of-a-friend; Pearson correlation coefficient; cosine similarity; evaluation.
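As a rough illustration of the neighbourhood formation this abstract describes, the sketch below computes Pearson similarity over co-rated items and falls back to friend-of-a-friend candidates when direct near neighbours are scarce. The similarity threshold and helper names are illustrative assumptions; this is not the authors' CFfoaf implementation.

```python
from math import sqrt

def pearson(ratings_u, ratings_v):
    """Pearson correlation computed over the items both users have rated."""
    common = set(ratings_u) & set(ratings_v)
    if len(common) < 2:
        return 0.0
    mu_u = sum(ratings_u[i] for i in common) / len(common)
    mu_v = sum(ratings_v[i] for i in common) / len(common)
    num = sum((ratings_u[i] - mu_u) * (ratings_v[i] - mu_v) for i in common)
    den = sqrt(sum((ratings_u[i] - mu_u) ** 2 for i in common)) * \
          sqrt(sum((ratings_v[i] - mu_v) ** 2 for i in common))
    return num / den if den else 0.0

def near_neighbours(user, ratings, threshold=0.5):
    """Users whose similarity to `user` exceeds a (hypothetical) threshold."""
    return {v for v in ratings
            if v != user and pearson(ratings[user], ratings[v]) >= threshold}

def foaf_candidates(user, ratings, threshold=0.5):
    """Friend-of-a-friend fallback: also consider neighbours of neighbours."""
    direct = near_neighbours(user, ratings, threshold)
    indirect = set()
    for v in direct:
        indirect |= near_neighbours(v, ratings, threshold)
    return (direct | indirect) - {user}
```

The FOAF step widens the candidate pool for gray-sheep users: even when a user shares too few co-rated items with anyone, her neighbours' neighbours may still supply usable ratings.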
Improving collaborative filtering's rating prediction accuracy by considering users' dynamic rating variability
by Dionisis Margaris, Costas Vassilakis
Abstract: Users that populate ratings databases follow different marking practices, in the sense that some are stricter while others are more lenient. Similarly, users' rating practices may also differ in rating variability: some users may enter ratings close to their mean, while other users may enter more extreme ratings, close to the limits of the rating scale. While this aspect has recently been addressed through the computation and exploitation of an overall rating variability measure per user, the fact that a user's rating practices may vary along her rating history time axis may render the overall rating variability measure inappropriate for performing the rating prediction adjustment. In this work, we: 1) propose an algorithm that considers two variability metrics per user, the global (overall) one and the local one, with the latter representing the user's variability at prediction time; 2) present alternative methods for computing a user's local variability; and 3) evaluate the performance of the proposed algorithm in terms of rating prediction quality and compare it against the state-of-the-art algorithm that employs a single variability metric in the rating prediction computation process.
Keywords: collaborative filtering; users’ ratings dynamic variability; Pearson correlation coefficient; cosine similarity; evaluation; prediction accuracy.
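The global/local distinction the abstract draws can be sketched as two standard deviations over a user's timestamped rating history: one over the whole history, one over a recent window. The window size and the "most recent ratings" definition of local variability are assumptions for illustration; the paper itself evaluates several alternative local-variability computations.

```python
from statistics import pstdev

def global_variability(history):
    """Standard deviation over all (timestamp, rating) pairs a user entered."""
    return pstdev([r for _, r in history])

def local_variability(history, window=10):
    """Standard deviation over the user's most recent `window` ratings,
    one possible stand-in for variability 'at prediction time'."""
    recent = [r for _, r in sorted(history)[-window:]]
    return pstdev(recent) if len(recent) > 1 else global_variability(history)
```

A prediction algorithm can then scale a neighbour's rating deviation by the ratio of the target user's variability to the neighbour's, using the local value when the user's recent behaviour diverges from her long-term pattern.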
Towards a systematic collect data process
by Iman Tikito, Nissrine Souissi
Abstract: Big Data has become a well-known topic among a large number of researchers in different areas. Actions to improve the data lifecycle in the Big Data context have been conducted in different phases and have focused mainly on problems such as storage, security, analysis, and visualization. In this paper, we focus on improving the collection phase, which makes the other phases more efficient and effective.
We propose a process for resolving the problem of collecting a huge amount of data and, as a result, optimizing the data lifecycle. To do this, we analyze the different data collection processes present in the literature and identify their similarities with the Systematic Literature Review process. We apply our process by mapping the seven characteristics of Big Data to the sub-processes of the proposed data collection process. This mapping provides a guide that lets the customer decide clearly, by answering a set of questions, whether to use the proposed process.
Keywords: Big Data; Data Collect; Data Lifecycle; Systematic Literature Review; Process; SLR.
Real-time Maritime Anomaly Detection: Detecting intentional AIS switch-off
by Ioannis Kontopoulos, Konstantinos Chatzikokolakis, Dimitrios Zissis, Konstantinos Tserpes, Giannis Spiliopoulos
Abstract: Today, most maritime surveillance systems rely on the Automatic Identification System (AIS), which vessels of specific categories are required to carry. Anomaly detection typically refers to the problem of finding patterns in data that do not conform to expected behaviour. AIS switch-off is one such pattern: many vessels turn off their AIS transponders to hide their whereabouts when travelling in waters with frequent piracy attacks or potential illegal activity, thus deceiving either the authorities or pirate vessels. Furthermore, fishing vessels switch off their AIS transponders so that other fishing vessels do not fish in the same area. To the best of our knowledge, limited work has focused on detecting AIS switch-off in real time. We present a system that detects such cases in real time and can handle high-velocity, large-volume streams of AIS messages received from terrestrial base stations. We evaluate the proposed system on a real-world dataset collected from AIS receivers and report the achieved detection accuracy.
Keywords: distributed stream processing; big data; AIS vessel monitoring; anomaly detection.
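A minimal sketch of the detection idea the abstract describes: since AIS transponders normally report at regular intervals, an unusually long silence per vessel suggests an intentional switch-off. The 30-minute threshold and the per-vessel bookkeeping below are illustrative assumptions, not the paper's actual thresholds or stream-processing machinery.

```python
from datetime import datetime, timedelta

# Hypothetical silence threshold; real systems would derive it from the
# expected AIS reporting interval for the vessel's class and speed.
GAP_THRESHOLD = timedelta(minutes=30)

def detect_switch_offs(messages):
    """Given (vessel_id, timestamp) pairs in arrival order, yield
    (vessel_id, gap_start, gap_end) whenever a vessel falls silent
    for longer than GAP_THRESHOLD."""
    last_seen = {}
    for vessel_id, ts in messages:
        prev = last_seen.get(vessel_id)
        if prev is not None and ts - prev > GAP_THRESHOLD:
            yield vessel_id, prev, ts
        last_seen[vessel_id] = ts
```

Keeping only a last-seen timestamp per vessel is what makes this style of check feasible on high-velocity streams: state is O(number of vessels), independent of message volume.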
A survey about legible Arabic fonts for young readers
by Anoual El Kah, Abdelhak Lakhouaja
Abstract: Reading is an interconnected cognitive process comprising recognition and comprehension. The objective of the reading act cannot be achieved unless the text is legible enough to interpret. For that reason, legibility is crucial to the reading mechanism: it affects reading speed and whether graphs are recognised correctly. Based on the fact that fonts and the way text is presented influence children's reading performance and fluency, the current paper investigates different Arabic fonts in order to determine the optimal font for fluent, low-error reading by children, in both printed and on-screen texts. The study recruits 33 third-grade students from Moroccan primary schools and investigates reading fluency and error rates for five Arabic font types. As a result, this paper recommends the use of the Simplified Arabic font to reduce reading errors due to graph presentation in both printed and on-screen texts.
Keywords: reading; legibility; Arabic language; fonts; primary schools; Simplified Arabic.
A Survey on Context-Aware Monitoring in Healthcare with Big Data
by Reeja S R, Murthuza Ali, Rino Cherian
Abstract: Context-aware monitoring is attracting rapt attention in healthcare, driven by emerging technologies such as Big Data, cloud computing, and the IoT, through which healthcare has reached an advanced level. Alongside these emerging technologies, Wireless Sensor Networks and Body Sensor Networks play a prominent role in healthcare: the data they collect is sent to the cloud for better analysis. Context-aware services are built into mobile services and applications so that they can offer contextually needed data to the developer. The word "context" denotes information that is recognised automatically, so that the system responds as anticipated according to users' needs and helps people be aware of their social surroundings based on contextual information. As mobile devices, technologies, applications, and networks multiply in huge numbers, the effort to build a usable application becomes ever more important for success in the industry. As context awareness spreads across technologies such as Big Data, the cloud, the IoT, and machine learning, it is gaining momentum in the market, which helps context-aware technology develop further, making people's lives easier and providing solutions to the appropriate problems.
Keywords: Context-aware monitoring; Big Data; Cloud Computing; IoT; Machine Learning.
To beat or not to beat: uncovering the world social battles in Wikipedia
by Massimo Marchiori, Enrico Bonetti Vieno
Abstract: The online world has deeply changed the rules by which we engage with information. We have at our disposal a huge amount of information, growing every single day, and with it an increasing need to access it wisely. Because of this evolution, a few selected systems have emerged as information centralizers, providing easy and seamless access to information: on the one side we have search engines, which try to compress the number of pages to interact with, and on the other side we have systems like Wikipedia, which try to compress information itself by being an online encyclopaedia. These two systems (search engines and Wikipedia) have together had an enormous success, simplifying the process of information foraging. However, success also brings problems. In the case of Wikipedia, these problems are due to its distributed nature: everybody can access and contribute. As such, the Wikipedia system, playing a primary role in information distribution, has been subject to "attacks", in the form of attempts to manipulate information for the most various reasons. This extra layer of information manipulation is, however, practically invisible to the general public, which only sees the final outcome, usually taking it as a reference source. In this paper we describe the Negapedia system, an attempt to provide the general public with a more complete picture of what is actually going on with information. We describe the challenges and choices that had to be made, coming not only from the standpoint of big data analysis, but also and foremost from the problem of potential information overload, given the general target audience. Along this journey, we also provide some novel insights on the important issue of Wikipedia categorization, analysing the problem of presenting general users with easy and meaningful category information, thus helping users (and scholars) to better tame the multitude of information topics present nowadays in Wikipedia.
Keywords: Social data; big data analysis; Wikipedia; categorization; online information; bias; data science.