Forthcoming articles

International Journal of Big Data Intelligence (IJBDI)

These articles have been peer-reviewed and accepted for publication but are pending final changes, are not yet published and may not appear here in their final order of publication until they are assigned to issues. Therefore, the content conforms to our standards but the presentation (e.g. typesetting and proof-reading) is not necessarily up to the Inderscience standard. Additionally, titles, authors, abstracts and keywords may change before publication. Articles will not be published until the final proofs are validated by their authors.

Forthcoming articles must be purchased for the purposes of research, teaching and private study only. These articles can be cited using the expression "in press". For example: Smith, J. (in press). Article Title. Journal Title.

International Journal of Big Data Intelligence (20 papers in press)

Regular Issues

  • A Survey of Computation Techniques on Time Evolving Graphs
    by Shalini Sharma, Jerry Chou
    Abstract: Time Evolving Graph (TEG) refers to graphs whose topology or attribute values change over time due to update events, including edge addition/deletion, vertex addition/deletion and attribute changes on vertices or edges. Driven by the Big Data paradigm, the ability to process and analyze TEG in a timely fashion is critical in many application domains, such as social networks, web graphs, road network traffic, etc. Recently, many research efforts have been made with the aim of addressing the challenges of volume and velocity that arise from dealing with such datasets. However, it remains an active and challenging research topic. Therefore, in this survey, we summarize the state-of-the-art computation techniques for TEG. We collect these techniques from three different research communities: i) the data mining community for graph analysis; ii) the theory community for graph algorithms; iii) the computation community for graph computing frameworks. Based on our study, we also propose our own computing framework, DASH, for TEG. We have also performed some experiments comparing DASH and Graph Processing System (GPS). We are optimistic that this paper will help many researchers to understand various dimensions of problems in TEG and continue developing the necessary techniques to resolve these problems more efficiently.
    Keywords: Big Data; Time evolving graphs; Computing framework; Algorithm; Data Mining.

  • Uncovering data stream behavior of automated analytical tasks in edge computing
    by Lilian Hernandez, Monica Wachowicz, Robert Barton, Marc Breissinger 
    Abstract: Massive volumes of data streams are expected to be generated by the Internet of Things (IoT). Due to their dispersed and mobile nature, they need to be processed using automated analytical tasks. The research challenge is to uncover whether the data streams, which are being generated by billions of IoT devices, actually conform to a data flow that is required to perform streaming analytics. In this paper, we propose process discovery and conformance checking techniques of Process Mining in order to expose the flow dependency of IoT data streams between automated analytical tasks running at the edge of a network. Towards this end, we have developed a Petri Net model to ensure the optimal execution of analytical tasks by finding path deviations, bottlenecks, and parallelism. A real-world scenario in smart transit is used to evaluate the full advantage of our proposed model. Uncovering the actual behavior of data flows from IoT devices to edge nodes has allowed us to detect discrepancies that have a negative impact on the performance of automated analytical tasks.
    Keywords: streaming analytics; process mining; Petri Net; smart transit; Internet of Things; edge computing.

  • Combining the Richness of GIS Techniques with Visualization Tools to Better Understand the Spatial Distribution of Data: A Case Study of Chicago City Crime Analysis
    by Omar Bani Taha, M. Omair Shafiq
    Abstract: This study aims to achieve the following objectives: (1) to explore the benefits of adding a spatial GIS layer of analysis to other existing visualization techniques; (2) to identify and evaluate the patterns in selected crime data by analysing Chicago's open dataset and examining related existing literature on crime trends in this city. Some of the motivations for this study include the magnitude and scale of crime incidents across the world as well as the need for a better understanding of patterns and prediction of crime trends within the selected geographical location. We conclude that Chicago seems to be on course to have both the lowest violent crime rate since 1972 and the lowest murder frequency since 1967. Chicago has witnessed a vigorous drop in most crime types over the last few years compared to the previous crime index data. Also, Chicago crime naturally upsurges during summer months and declines during winter months. Our results align with several previous decades of studies and analysis of Chicago crime, in which the same communities with the highest crime rates still experience the majority of crime. One may go back and compare with the crime patterns of the 1930s studies and find them very similar. The present study confirmed the efficiency of Geographic Information Systems and other visualization techniques as tools for scrutinizing crime in Chicago city.
    Keywords: spatial analysis; geographic information system (GIS); human-centred data science; visualization tools; traditional qualitative techniques; data visualization; spatial and crime mapping.

  • Improving collaborative filtering's rating prediction coverage in sparse datasets by exploiting the friend of a friend concept
    by Dionisis Margaris, Costas Vassilakis
    Abstract: Collaborative filtering computes personalized recommendations by taking into account ratings expressed by users. Collaborative filtering algorithms first identify people having similar tastes, by examining the likeness of already entered ratings. Users with highly similar tastes are termed near neighbours, and recommendations for a user are based on her near neighbours' ratings. However, for a number of users no near neighbours can be found, a problem termed the gray sheep problem. This problem is more intense in sparse datasets, i.e. datasets with a relatively small number of ratings compared to the number of users and items. In this work, we propose an algorithm for alleviating this problem by exploiting the friend of a friend (FOAF) concept. The proposed algorithm, CFfoaf, has been evaluated against eight widely used sparse datasets and under two widely used collaborative filtering correlation metrics, namely the Pearson Correlation Coefficient and the Cosine Similarity, and has been proven to be particularly effective in increasing the percentage of users for which personalized recommendations can be formulated in the context of sparse datasets, while at the same time improving rating prediction quality.
    Keywords: collaborative filtering; recommender systems; sparse datasets; friend-of-a-friend; Pearson correlation coefficient; cosine similarity; evaluation.

  • Improving collaborative filtering's rating prediction accuracy by considering users' dynamic rating variability
    by Dionisis Margaris, Costas Vassilakis
    Abstract: Collaborative filtering computes personalized recommendations by taking into account ratings expressed by users. Collaborative filtering algorithms first identify people having similar tastes, by examining the likeness of already entered ratings. Users with highly similar tastes are termed near neighbours, and recommendations for a user are based on her near neighbours' ratings. However, for a number of users no near neighbours can be found, a problem termed the gray sheep problem. This problem is more intense in sparse datasets, i.e. datasets with a relatively small number of ratings compared to the number of users and items. In this work, we propose an algorithm for alleviating this problem by exploiting the friend of a friend (FOAF) concept. The proposed algorithm, CFfoaf, has been evaluated against eight widely used sparse datasets and under two widely used collaborative filtering correlation metrics, namely the Pearson Correlation Coefficient and the Cosine Similarity, and has been proven to be particularly effective in increasing the percentage of users for which personalized recommendations can be formulated in the context of sparse datasets, while at the same time improving rating prediction quality.
    Keywords: collaborative filtering; users’ ratings dynamic variability; Pearson correlation coefficient; cosine similarity; evaluation; prediction accuracy.
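
The two similarity metrics named in the collaborative filtering abstracts above, the Pearson Correlation Coefficient and the Cosine Similarity, can be illustrated with a minimal sketch. This is not the authors' CFfoaf algorithm: the function names, the dictionary-based rating representation, the choice to compute both metrics over co-rated items only, and the 0.7 threshold are all assumptions made for illustration. It does show why sparse data can leave a user with no near neighbours at all.

```python
import math


def co_rated(a, b):
    """Items rated by both users; ratings are dicts mapping item -> rating."""
    return sorted(set(a) & set(b))


def pearson(a, b):
    """Pearson correlation over co-rated items, or None when undefined.

    Assumption: means are taken over the co-rated items only.
    """
    items = co_rated(a, b)
    if len(items) < 2:
        return None
    mean_a = sum(a[i] for i in items) / len(items)
    mean_b = sum(b[i] for i in items) / len(items)
    num = sum((a[i] - mean_a) * (b[i] - mean_b) for i in items)
    den_a = math.sqrt(sum((a[i] - mean_a) ** 2 for i in items))
    den_b = math.sqrt(sum((b[i] - mean_b) ** 2 for i in items))
    if den_a == 0 or den_b == 0:
        return None
    return num / (den_a * den_b)


def cosine(a, b):
    """Cosine similarity over co-rated items, or None when there are none."""
    items = co_rated(a, b)
    if not items:
        return None
    num = sum(a[i] * b[i] for i in items)
    den_a = math.sqrt(sum(a[i] ** 2 for i in items))
    den_b = math.sqrt(sum(b[i] ** 2 for i in items))
    if den_a == 0 or den_b == 0:
        return None
    return num / (den_a * den_b)


def near_neighbours(target, others, similarity, threshold):
    """Names of users whose similarity to target is defined and >= threshold."""
    result = []
    for name, ratings in others.items():
        s = similarity(target, ratings)
        if s is not None and s >= threshold:
            result.append(name)
    return result


# Hypothetical toy ratings, not taken from any dataset in the articles.
alice = {"m1": 5, "m2": 3, "m3": 4}
bob = {"m1": 4, "m2": 2, "m3": 3}
carol = {"m9": 5}  # shares no rated item with anyone else

print(near_neighbours(alice, {"bob": bob, "carol": carol}, pearson, 0.7))
print(near_neighbours(carol, {"alice": alice, "bob": bob}, pearson, 0.7))
```

In this toy data, `carol` has no co-rated items with the other users, so neither metric is defined for her and she ends up without near neighbours, which is the situation the abstracts call the gray sheep problem.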

Special Issue on: Big Data Infrastructure and Deep Learning Applications

  • A computational Bayesian approach for estimating density functions based on noise-multiplied data
    by Yan-Xia Lin 
    Abstract: In this big data era, an enormous amount of personal and company information can be easily collected by third parties. Sharing the data with the public and allowing data users to access the data for data mining often bring many benefits to the public. However, sharing the microdata with the public usually causes the issue of data privacy. Protecting data privacy through noise-multiplied data is one of the approaches studied in the literature. This paper introduces the B-M L2014 Approach for estimating the density function of the original data based on noise-multiplied microdata. This paper shows applications of the B-M L2014 Approach and demonstrates that the statistical information of the original data can be retrieved from their noise-multiplied data reasonably while the disclosure risk is under control. The B-M L2014 Approach provides a new data mining technique for big data when data privacy is concerned.
    Keywords: big data mining; data anonymisation; privacy-preserving; microdata confidentiality; noise-multiplied data.
    DOI: 10.1504/IJBDI.2019.10021723

  • New algorithms for inferring gene regulatory networks from time-series expression data on Apache Spark
    by Yasser Abduallah, Jason T.L. Wang 
    Abstract: Gene regulatory networks (GRNs) are crucial to understand the inner workings of the cell and the complexity of gene interactions. Numerous algorithms have been developed to infer GRNs from gene expression data. As the number of identified genes increases and the complexity of their interactions is uncovered, gene networks become cumbersome to test. Furthermore, prodding through experimental results requires an enormous amount of computation, resulting in slow data processing. Therefore, new approaches are needed to analyse copious amounts of experimental data from cellular GRNs. To meet this need, cloud computing is promising as reported in the literature. Here we present two new algorithms for reverse engineering GRNs in a cloud environment. The algorithms, implemented in Spark, employ an information-theoretic approach to infer GRNs from time-series gene expression data. Experimental results show that one of our new algorithms is faster than, yet as accurate as, two existing cloud-based GRN inference methods.
    Keywords: network inference; systems biology; spark; big data; MapReduce; gene regulatory networks; GRN; time-series; gene expression; big data intelligence.
    DOI: 10.1504/IJBDI.2019.10021724

  • A scalable system for executing and scoring K-means clustering techniques and its impact on applications in agriculture
    by Nevena Golubovic, Chandra Krintz, Rich Wolski, Balaji Sethuramasamyraja, Bo Liu 
    Abstract: We present Centaurus - a scalable, open source, clustering service for K-means clustering of correlated, multidimensional data. Centaurus provides users with automatic deployment via public or private cloud resources, model selection (using Bayesian information criterion), and data visualisation. We apply Centaurus to a real-world, agricultural analytics application and compare its results to the industry standard clustering approach. The application uses soil electrical conductivity (EC) measurements, GPS coordinates, and elevation data from a field to produce a 'map' of differing soil zones (so that management can be specialised for each). We use Centaurus and these datasets to empirically evaluate the impact of considering multiple K-means variants and large numbers of experiments. We show that Centaurus yields more consistent and useful clusterings than the competitive approach for use in zone-based soil decision-support applications where a 'hard' decision is required.
    Keywords: K-means clustering; cloud computing.
    DOI: 10.1504/IJBDI.2019.10021277

  • Scalable mining, analysis and visualisation of protein-protein interaction networks
    by Shaikh Arifuzzaman, Bikesh Pandey 
    Abstract: Proteins are linear chain biomolecules that are the basis of functional networks in all organisms. Protein-protein interaction (PPI) networks are networks of protein complexes formed by biochemical events and electrostatic forces. PPI networks can be used to study diseases and discover drugs. The causes of diseases are evident on a protein interaction level. For instance, elevation of interaction edge weights of oncogenes is manifested in cancers. The availability of large datasets and need for efficient analysis necessitate the design of scalable methods leveraging modern high-performance computing (HPC) platforms. In this paper, we design a lightweight framework on a distributed-memory parallel system to study PPI networks. Our framework supports automated analytics based on methods for extracting signed motifs, computing centrality, and finding functional units. We design message passing interface (MPI)-based parallel methods and workflow, scalable to large networks. To the best of our knowledge, these capabilities collectively make our tool novel.
    Keywords: protein interaction; biological networks; network visualisation; massive networks; HPC systems; network mining.
    DOI: 10.1504/IJBDI.2019.10019036

  • Optimising NBA player signing strategies based on practical constraints and statistics analytics
    by Lin Li, Yihang Zhao, Ramya Nagarajan 
    Abstract: In the National Basketball Association (NBA), how to comprehensively measure a player's performance and how to sign talented players with reasonable contracts are always challenging. Due to various practical constraints such as the salary cap and the players' on-court minutes, no team can sign all desired players. To ensure the team's competency on both the offence and defence sides, players' efficiency must be comprehensively evaluated. This research studied the key indicators widely used to measure player efficiency and team performance. Through data analytics, the most frequently referred statistics including player efficiency rating, defence rating, real plus minus, points, rebounds, assists, blocks, steals, etc. were chosen to formulate the prediction of the team winning rate in different schemes. Based on the models trained and tested, two player selection strategies were proposed according to different objectives and constraints. Experimental results show that the developed team winning rate prediction models have high accuracy and the player selection strategies are effective.
    Keywords: optimisation; prediction; regression; linear programming; statistics analytics; constraints.
    DOI: 10.1504/IJBDI.2019.10021725

  • Text visualisation for feature selection in online review analysis
    by Keerthika Koka, Shiaofen Fang 
    Abstract: Opinion spamming is a reality, and it can have unpleasant consequences in the retail industry. While there are several promising research works on distinguishing fake online reviews from genuine ones, few have involved visualisation and visual analytics. The purpose of this work is to show that feature selection through visualisation is at least as powerful as the best automatic feature selection algorithms. This is achieved by applying a radial chart visualisation technique to the classification of online reviews into fake and genuine reviews. The radial chart and the colour overlaps are used to explore the best feature selection through visualisation for classification. Parallel coordinate visualisation of the review data is also explored and compared with the radial chart results. The system gives a structure to each text review based on certain attributes, compares how different or similar the structures of the different or same categories are, and highlights the key features that contribute the most to the classification. Our visualisation technique helps the user gain insights into high-dimensional data by providing means to eliminate the worst features right away, pick some of the best features without statistical aids, and understand the behaviour of the dimensions in different combinations.
    Keywords: text visualisation; feature selection; radial chart; online review analysis.
    DOI: 10.1504/IJBDI.2019.10021726

  • Network traffic driven storage repair
    by Danilo Gligoroski, Katina Kralevska, Rune E. Jensen, Per Simonsen 
    Abstract: Recently we constructed an explicit family of locally repairable and locally regenerating codes. Their existence was proven by Kamath et al. but no explicit construction was given. Our design is based on HashTag codes that can have different sub-packetisation levels. In this work we emphasise the importance of having two ways to repair a node: repair only with local parity nodes or repair with both local and global parity nodes. We say that the repair strategy is network traffic driven since it is in connection with the concrete system and code parameters: the repair bandwidth of the code, the number of I/O operations, the access time for the contacted parts and the size of the stored file. We show the benefits of having repair duality in one practical example implemented in Hadoop. We also give algorithms for efficient repair of the global parity nodes.
    Keywords: vector codes; repair bandwidth; repair locality; exact repair; parity-splitting; global parities; Hadoop.
    DOI: 10.1504/IJBDI.2019.10021727

  • DeepSim: cluster level behavioural simulation model for deep learning
    by Yuankun Shi, Kevin J. Long, Kaushik Balasubramanian, Zhaojuan Bian, Adam Procter, Ramesh Illikkal 
    Abstract: We are witnessing an explosion of AI use cases driving the computer industry, and especially datacentre and server architectures. As Intel faces fierce competition in this emerging technology space, it is critical that architecture definitions and directions are driven with data from proper tools and methodologies, and insights are drawn from end-to-end holistic analysis at the datacentre levels. In this paper, we introduce DeepSim, a cluster-level behavioural simulation model for deep learning. DeepSim, which is based on the Intel CoFluent simulation framework, uses timed behavioural models to simulate complex interworking between compute nodes, networking, and storage at the datacentre level, providing a realistic performance model of real-world image recognition applications based on the popular deep learning framework Caffe. The end-to-end simulation data from DeepSim provides insights which can be used for architecture analysis driving future datacentre architecture directions. DeepSim enables scalable system design, deployment, and capacity planning through accurate performance insights. Results from preliminary scaling studies (e.g., node scaling and network scaling) and what-if analyses (e.g., Xeon with HBM and Xeon Phi with dual OPA) are presented in this paper. The simulation results are correlated well with empirical measurements, achieving an accuracy of 95%.
    Keywords: deep learning; datacentre; behavioural simulation; AlexNet; architecture analysis; performance analysis; server architecture.
    DOI: 10.1504/IJBDI.2019.10021728

  • MapReduce-based fuzzy very fast decision tree for constructing prediction intervals
    by Ojha Manish Kumar, Kumar Ravi, Vadlamani Ravi 
    Abstract: We propose a fuzzy version of the very fast decision tree (VFDT) to construct prediction intervals and compare them with those generated by the traditional VFDT. The proposed fuzzy VFDT is able to capture intrinsic features of VFDT as well as uncertainties available in data. The VFDT and fuzzy VFDT were trained using the lower upper bound estimation (LUBE) method in order to generate prediction intervals. We implemented VFDT, and developed and implemented fuzzy VFDT, using the Apache Hadoop MapReduce framework, where multiple slave nodes build VFDT and fuzzy VFDT models. The developed models were tested on six datasets taken from the web. We conducted sensitivity analysis by studying the influence of the window size of the data stream and the number of bins in discretisation on the final results. Results demonstrated that the proposed MapReduce-based fuzzy VFDT and VFDT can construct high-quality prediction intervals precisely and quickly.
    Keywords: very fast decision tree; VFDT; fuzzy VFDT; MapReduce; prediction interval; big data; Hadoop; stream data.
    DOI: 10.1504/IJBDI.2019.10021731

  • Real-time event search using social stream for inbound tourist corresponding to place and time
    by Ruriko Kudo, Miki Enoki, Akihiro Nakao, Shu Yamamoto, Saneyasu Yamaguchi, Masato Oguchi 
    Abstract: Since the decision was made to hold the Olympic Games in 2020 in Tokyo, the number of foreign tourists visiting the city has been increasing rapidly. Accordingly, tourists have been seeking more sightseeing information. While guidebooks are good for pointing out popular tourist attractions, it is more difficult for tourists to get information on local events and spots that are just becoming popular. We developed a tourist information distribution system that sends information corresponding to places and times. The system extracts event information from social media streams in a per place and time manner and provides it to tourists. In order to extract useful information, we performed event classification using actual Twitter data. We examined how to distribute the events in order to make the system more user-friendly. We also developed an information supplement function using external information.
    Keywords: Twitter; social networking service; SNS; local event; sightseeing information; event name; information extract; external information; Mecab; support vector machine; SVM; random forest.
    DOI: 10.1504/IJBDI.2019.10021733

  • Two-channel convolutional neural network for facial expression recognition using facial parts
    by Hui Wang, Jiang Lu, Lucy Nwosu, Ishaq Unwala 
    Abstract: This paper proposes the design of a facial expression recognition system based on a deep convolutional neural network using facial parts. In this work, a solution for facial expression recognition that uses a combination of algorithms for face detection, feature extraction and classification is discussed. The proposed method builds a two-channel convolutional neural network model in which facial parts are used as inputs: the extracted eyes are used as inputs to the first channel, while the mouth is the input into the second channel. Feature information from both channels converges in a fully connected layer which is used to learn global information from these local features and is then used for expression classification. Experiments are carried out on the Japanese female facial expression dataset and the extended Cohn-Kanade dataset to determine the expression recognition accuracy for the proposed facial expression recognition system based on facial parts. The results achieved show that the system provides state-of-the-art classification accuracy of 97.6% and 95.7%, respectively, when compared to other methods.
    Keywords: facial expression recognition; convolutional neural networks; facial parts.
    DOI: 10.1504/IJBDI.2019.10021734

  • Efficient clustering techniques on Hadoop and Spark
    by Sami Al Ghamdi, Giuseppe Di Fatta 
    Abstract: Clustering is an essential data mining technique that divides observations into groups where each group contains similar observations. K-means is one of the most popular clustering algorithms that has been used for over 50 years. Due to the current exponential growth of the data, it became a necessity to improve the efficiency and scalability of K-means even further to cope with large-scale datasets known as big data. This paper presents K-means optimisations using triangle inequality on two well-known distributed computing platforms: Hadoop and Spark. K-means variants that use triangle inequality usually require caching extra information from the previous iteration, which is a challenging task to achieve on Hadoop. Hence, this work introduces two methods to pass information from one iteration to the next on Hadoop to accelerate K-means. The experimental work shows that the efficiency of K-means on Hadoop and Spark can be significantly improved by using triangle inequality optimisations.
    Keywords: K-means; Hadoop; Spark; MapReduce; efficient clustering; triangle inequality K-means.
    DOI: 10.1504/IJBDI.2019.10018592

  • A hybrid power management schema for multi-tier data centres
    by Aryan Azimzadeh, Babak Maleki Shoja, Nasseh Tabrizi 
    Abstract: Data centres play an important role in the operation and management of IT infrastructures, but their huge power consumption raises an issue of great concern as it relates to global warming. This paper explores the sleep state of data centres' servers under specific conditions such as setup time and identifies an optimal number of servers to potentially increase energy efficiency. We use a dynamic power management policy-based model with the optimal number of servers that is required in each tier while increasing servers' setup time after sleep mode. The reactive approach is used to validate the results and energy efficiency by calculating the average power consumption of each server under specific sleep mode and setup time. Our method uses average power consumption to calculate the normalised-performance-per-watt in order to evaluate the power efficiency. The results indicate that the schema reported in this paper can improve power efficiency in data centres with high setup time servers.
    Keywords: power; green; management; schema; multi-tier.
    DOI: 10.1504/IJBDI.2019.10021740

  • Predicting hospital length of stay using neural networks
    by Thanos Gentimis, Ala' J. Alnaser, Alex Durante, Kyle Cook, Robert Steele 
    Abstract: Accurate prediction of hospital length of stay can provide benefits for hospital resource planning and quality-of-care. We describe the utilisation of neural networks for predicting the length of hospital stay for patients with various diagnoses based on selected administrative and clinical attributes. An all-condition neural network, which can be applied to all patients and is not limited to a specific diagnosis, is trained to predict whether a patient's stay will be long or short, using the median length of stay as the cut-off between long and short, with the prediction made at the time the patient leaves the intensive care unit. In addition, neural networks are trained to predict whether patients with 14 specific common primary diagnoses will have a long or short stay, defined as greater than, or less than or equal to, the median length of stay for that particular condition. Our dataset is drawn from the MIMIC III database. Our prediction accuracy is approximately 80% for the all-condition neural network, and the neural networks for specific conditions generally demonstrated higher accuracy and all clearly out-performed any linear model.
    Keywords: length of stay; health analytics; neural networks; MIMIC III.
    DOI: 10.1504/IJBDI.2019.10019022

  • Towards an automation of the fact-checking in the journalistic web context
    by Edouard Ngor Sarr, Ousmane Sall, Aminata Maiga, Mouhamadou Saliou Diallo 
    Abstract: Can fact-checking be automated? Apparently yes, given the numerous advances noted in the search and analysis of digital data. However, this task, which a priori seemed simple, turns out to be rather demanding. Indeed, automating the checking of facts requires very advanced knowledge of data analysis, web data search, web technologies, image processing and sometimes natural language processing (NLP). Nevertheless, in recent years, numerous studies have been conducted to deepen such analysis. In this article, having revisited the state of the art concerning the question, we identify and diagnose in detail the obstacles before concluding with an explanation of the methods.
    Keywords: fact-checking; data journalism; semantic web.
    DOI: 10.1504/IJBDI.2019.10021741
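
The triangle inequality optimisation mentioned in the abstract "Efficient clustering techniques on Hadoop and Spark" above can be illustrated with a small single-machine sketch. It is not the method of that article, which concerns Hadoop and Spark and caches information between iterations; it shows only one well-known pruning rule for the assignment step: if the distance between the current best centroid c and another centroid c' is at least twice the distance from the point x to c, then c' cannot be closer to x, so the distance from x to c' need not be computed. All function names and the toy data are hypothetical.

```python
import math


def dist(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def assign_brute_force(points, centroids):
    """Reference assignment: compute every point-to-centroid distance."""
    k = len(centroids)
    return [min(range(k), key=lambda j: dist(x, centroids[j])) for x in points]


def assign_with_pruning(points, centroids):
    """Assignment step that skips distances ruled out by the triangle inequality.

    Returns the labels and the number of distance computations skipped.
    """
    k = len(centroids)
    # Centroid-to-centroid distances, computed once per assignment step.
    cc = [[dist(centroids[i], centroids[j]) for j in range(k)] for i in range(k)]
    labels = []
    skipped = 0
    for x in points:
        best = 0
        best_d = dist(x, centroids[0])
        for j in range(1, k):
            # d(c_best, c_j) >= 2 * d(x, c_best) implies d(x, c_j) >= d(x, c_best),
            # so centroid j cannot be strictly closer than the current best.
            if cc[best][j] >= 2 * best_d:
                skipped += 1
                continue
            d = dist(x, centroids[j])
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, skipped


# Hypothetical toy data: three well-separated groups.
points = [(0, 0), (0.5, 0), (10, 10), (10.5, 10), (0, 10)]
centroids = [(0, 0), (10, 10), (0, 10)]

labels, skipped = assign_with_pruning(points, centroids)
print(labels, skipped)
```

The pruned assignment returns the same labels as the brute-force version while computing fewer distances. Variants that carry bounds over from the previous iteration can prune more, which is where passing information between iterations becomes relevant.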