Forthcoming articles

 


International Journal of Big Data Intelligence

 

These articles have been peer-reviewed and accepted for publication in IJBDI, but are pending final changes, are not yet published and may not appear here in their final order of publication until they are assigned to issues. Therefore, the content conforms to our standards but the presentation (e.g. typesetting and proof-reading) is not necessarily up to the Inderscience standard. Additionally, titles, authors, abstracts and keywords may change before publication. Articles will not be published until the final proofs are validated by their authors.

 

Forthcoming articles must be purchased for the purposes of research, teaching and private study only. These articles can be cited using the expression "in press". For example: Smith, J. (in press). Article Title. Journal Title.

 

International Journal of Big Data Intelligence (24 papers in press)

 

Regular Issues

 

  • Improving Straggler Task Performance in a Heterogeneous MapReduce Framework Using Reinforcement Learning
    by Srinivas Naik Nenavath, Atul Negi, V.N. Sastry
    Abstract: MapReduce is one of the most significant distributed and parallel processing frameworks for large-scale data-intensive jobs proposed in recent times. Intelligent scheduling decisions can potentially help in significantly reducing the overall runtime of jobs. It is observed that the total time to completion of a job gets extended because of some slow tasks. Especially in heterogeneous environments, the job completion times do not synchronize. As originally conceived, the MapReduce default scheduler was not very effective at identifying slow tasks. In the literature, the Longest Approximate Time to End (LATE) scheduler extends to heterogeneous environments, but it has limitations in properly estimating the progress of tasks because it takes a static view of task progress. In this paper, we propose a novel Reinforcement Learning based MapReduce scheduler for heterogeneous environments called the MapReduce Reinforcement Learning (MRRL) scheduler. It observes the system state of task execution and suggests speculative re-execution of the slower tasks on available nodes in the heterogeneous cluster without assuming any prior knowledge of the environmental characteristics. The experimental results show consistent improvements in performance compared to the LATE and Hadoop default schedulers for different workloads of the Hi-Bench benchmark suite.
    Keywords: MapReduce; Reinforcement Learning; Speculative Execution; Task Scheduler; Heterogeneous Environments.

  • Algorithms for Fast Estimation of Social Network Centrality Measures
    by Ashok Kumar, R. Chulaka Gunasekara, Kishan Mehrotra, Chilukuri Mohan
    Abstract: Centrality measures are extremely important in the analysis of social networks, with applications such as the identification of the most influential individuals for effective target marketing. Eigenvector centrality and PageRank are among the most useful centrality measures, but computing these measures can be prohibitively expensive for large social networks. This paper explores multiple approaches to reduce the computational effort required to compute relative centrality measures. First, we show that neural networks can be effective in learning and estimating the ordering of vertices in a social network based on these centrality measures. We show that the proposed neural network approach requires far less computational effort, and is faster, than early termination of the power iteration method that can be used for computing the centrality measures. We also show that four features describing the size of the social network and two vertex-specific attributes suffice as inputs to the neural networks, requiring very few hidden neurons. Then we focus on how network sampling can be used to reduce the running times for calculating the ordering of vertices. We introduce the notion of degree centrality based sampling to reduce the running time of the key node identification problem. Finally, we propose the approach of incremental updating of centrality measures in dynamic networks.
    Keywords: Social network; Centrality; Eigenvector centrality; PageRank; Network sampling; Incremental updating.

  • Collective Tweet Analysis for Accurate User Sentiment Analysis - a Case Study with Delhi Assembly Election 2015
    by Lija Mohan, Sudheep Ealyidom
    Abstract: Social media has exploded as a category of online discourse where people create and share content at a massive rate. Because of its ease of use, speed and reach, social media is fast changing the public discourse in society and setting trends and agendas in topics that range from the environment and politics to technology and the entertainment industry. Since social media can also be construed as a form of collective wisdom, the authors decided to investigate its power at predicting real-world outcomes. The objective was to design a Twitter-based sentiment mining approach. We introduced a keyword-aware, user-based collective tweet mining approach to rank the sentiment of each user. To prove the accuracy of the proposed method, we chose an interesting election winner prediction application and observed how the sentiment of people on different political issues at that time was reflected in their votes. A domain thesaurus is built by collecting keywords related to each issue. Since Twitter data is very large, it is difficult to process using traditional architectures. Hence, we introduced a scalable and efficient approach based on the MapReduce programming model to classify the tweets. The experiments were designed to predict the winner of the Delhi Assembly Elections, 2015 by analyzing the sentiments of people on different political issues, and from the analysis that we performed, we correctly predicted that the Aam Aadmi Party had higher support than the existing ruling party, BJP. Thus we introduced a Big Data approach to sentiment analysis on Twitter data, which has widespread applications in today's world.
    Keywords: Twitter Analysis; Collective Tweet Analysis; Sentiment Analysis; Big Data; Hadoop; Map Reduce.

  • Comparison of Hive's Query Optimization Techniques
    by Sikha Bagui, Keerthi Devulapalli
    Abstract: The ever increasing size of data sets in this Big Data era has forced data analytics to be moved from traditional RDBMS systems to distributed technologies like Hadoop. Since data analysts are more familiar with SQL than the MapReduce programming paradigm, HiveQL was built on Hadoop's MapReduce framework. Traditional RDBMS query optimization techniques, which are used in the Rule Based Optimizer (RBO) of Hive, do not perform well in the MapReduce environment. Hence, the Correlation Optimizer (CRO) and Cost Based Optimizer (CBO) were developed. These optimizers perform query optimizations that take the MapReduce execution framework into account. In this work, the three optimizers, RBO, CRO, and CBO, are compared. Queries with common intra-query operations were found to be optimized better with CRO.
    Keywords: Hive; Query Optimization; Correlation Based Optimizer; Rule Based Optimizer; Cost Based Optimizer.

  • Big Data Ensemble Clinical Prediction for Healthcare Data by Using Deep Learning Model
    by Sreekanth Rallapalli, Gondkar R R
    Abstract: Big data has revolutionised the healthcare industry. The volume of electronic health records (EHRs) is growing at an exponential rate. Healthcare data, being unstructured in nature, requires a completely new technology to process it. Clinical applications also need machine learning techniques and data mining methods, which include decision trees and artificial neural networks. Classification algorithms have to be considered for developing predictive models. Ensemble models are gaining popularity over individual models. Ensemble systems can provide better accuracy. In this paper we combine four algorithms: support vector machines, na
    Keywords: algorithm; big data; classification; decision trees; deep learning; electronic health records; EHR; ensemble model; predictive model.
    DOI: 10.1504/IJBDI.2018.10008867
     
  • Resource management for deadline constrained MapReduce jobs for minimizing energy consumption
    by Adam Gregory, Shikharesh Majumdar
    Abstract: Cloud computing has emerged as one of the leading platforms for processing large-scale data intensive applications. Such applications are executed in large clusters and data centers which require a substantial amount of energy. Energy consumption within data centers accounts for a considerable fraction of costs and is a significant contributor to global greenhouse gas emissions. Therefore, minimizing energy consumption in data centers is a critical concern for data center operators, cluster owners, and cloud service providers. In this paper, we devise a novel Energy Aware MapReduce Resource Manager for an open system, called EAMR-RM, that can effectively perform matchmaking and scheduling of MapReduce jobs, each of which is characterized by a Service Level Agreement (SLA) for performance that includes a client specified earliest start time, execution time, and a deadline, with the objective of minimizing data center energy consumption. Performance analysis demonstrates that, for the range of system and workload parameters experimented with, the proposed technique can effectively satisfy SLA requirements while achieving up to a 45% reduction in energy consumption compared to approaches which do not consider energy in resource management decisions.
    Keywords: Resource management on clouds; MapReduce with deadlines; Constraint Programming; Energy management; Big data analytics; Job turnaround time.

  • A Novel Entropy Based Dynamic Data Placement Strategy for Data Intensive Applications in Hadoop Clusters
    by K. Hemant Kumar Reddy, Diptendu Sinha Roy, Vishal Pandey
    Abstract: In the last decade, efficient data analysis of data-intensive applications has become an increasingly important research issue. The popular map-reduce framework has offered an enthralling solution to this problem by means of distributing the work load across interconnected data centers. Hadoop is the most widely used platform for data-intensive applications such as analysis of web logs, detection of global weather patterns, and bioinformatics applications, among others. However, most Hadoop implementations assume that all nodes attached to a cluster are homogeneous, with the same computational capacity, which may reduce map-reduce performance by increasing the overhead of run-time data communication. Moreover, the majority of data placement strategies attempt to place related data close to each other for faster run-time access, but they disregard scenarios where such placement algorithms have to work with data sets which are new, either generated or for different MapReduce jobs. This paper deals with improving map-reduce performance over multi-cluster data sets by means of a novel entropy based data placement strategy (EDPS) that works in three phases and accounts for new data sets. In the first phase, a k-means clustering strategy is employed to extract dependencies among different datasets and group them into data groups. In the second phase, these data groups are placed in different data centers while taking the heterogeneity of virtual machines into account. Finally, the third phase uses an entropy based grouping of the newly generated datasets, where these datasets are grouped with the most similar existing cluster based on their relative entropy. The essence of the entropy based scheme lies in computing expected entropy, a measure of the dissimilarity of MapReduce jobs and their data usage patterns in terms of data blocks stored in HDFS, and finally placing new data among clusters such that entropy is reduced. The experimental results show the efficacy of the proposed three-fold dynamic grouping and data placement policy, which significantly reduces execution time and improves Hadoop performance in heterogeneous clusters with varying server and user applications and parameter sizes.
    Keywords: Dynamic Data Placement Strategy; Hadoop Clusters; MapReduce; k-means clustering; entropy.

  • Hybrid neural network and bi-criteria tabu-machine: comparison of new approaches to maximum clique problem
    by Eduard Babkin, Tatiana Babkina, Alexander Demidovskij
    Abstract: This paper presents two new approaches to solving the classical NP-hard maximum clique problem (MCP), which frequently arises in the domain of information management, including design of database structures and big data processing. In our research, we focus on solving that problem using the paradigm of artificial neural networks. The first approach combines the artificial neural network paradigm and genetic programming. To boost the convergence of the Hopfield neural network (HNN), we propose a specific design of the genetic algorithm as the selection mechanism for terms of the HNN energy function. The second approach incorporates and extends the tabu-search heuristics, improving the performance of the network dynamics of the so-called tabu machine. Introduction of a special penalty function in the tabu machine facilitates better evaluation of the search space. As a result, we demonstrate the proposed approaches on well-known experimental graphs and formulate two hypotheses for further research.
    Keywords: maximum clique problem; MCP; data structures; Hopfield network; genetic algorithm; tabu machine.
    DOI: 10.1504/IJBDI.2018.10008744
     
  • A collective matrix factorisation approach to social recommendation with eWOM propagation effects
    by Ren-Shiou Liu
    Abstract: In recent years, recommender systems have become an important tool for many online retailers to increase sales. Many of these recommender systems predict users' interests in products by using the browsing history or item rating records of users. However, many studies show that, before making a purchase, people often read online reviews and exchange opinions with friends in their social circles. The resulting electronic word-of-mouth (eWOM) has a huge impact on customers' purchase intentions. Nonetheless, most recommender systems in the current literature do not consider eWOM, let alone the effect of its propagation. Therefore, this paper proposes a new recommendation model based on the collective matrix factorisation technique for predicting customer preferences. A series of experiments using data collected from Epinions and Yelp are conducted. The experimental results show that the proposed model significantly outperforms other closely related models by 5%-13% in terms of RMSE and MAE.
    Keywords: recommender systems; matrix factorisation; electronic word-of-mouth; eWOM; collaborative filtering; regularisation.
    DOI: 10.1504/IJBDI.2018.10008752
     
  • Big uncertain data of multiple sensors efficient processing with high order multi-hypothesis: an evidence theoretic approach
    by Hossein Jafari, Xiangfang Li, Lijun Qian, Alexander J. Aved, Timothy S. Kroecker
    Abstract: With the proliferation of IoT, numerous sensors are deployed and big uncertain data are collected due to the different accuracy, sensitivity range, and decay of the sensors. The goal is to process the data and determine the most likely hypothesis among a set of high order multi-hypotheses. In this study, we propose a novel big uncertain sensor fusion framework that takes advantage of evidence theory's capability of representing uncertainty for decision making and effectively dealing with conflict. However, the methods in evidence theory are in general very computationally expensive, so they may not be directly applicable to multiple data sources with a high cardinality of hypotheses. We therefore also propose a Dezert-Smarandache hybrid model that can be applied to applications with a high number of hypotheses while reducing the computational cost. Both synthetic and real data from experiments are used to demonstrate the feasibility of the proposed method for practical situation awareness applications.
    Keywords: Dezert-Smarandache theory; DSmT; Dempster-Shafer theory; DST; internet of things; IOT; comfort zone; uncertain data fusion; multiple sensor; multi-hypothesis.
    DOI: 10.1504/IJBDI.2018.10008754
     

Special Issue on: Big Data Management in Clouds: Opportunities, Issues, Challenges and Solutions

  • Semi-structured data analysis and visualisation using NoSQL
    by Srinidhi Hiriyannaiah, G.M. Siddesh, P. Anoop, K.G. Srinivasa
    Abstract: In the field of computing, huge amounts of data are created every day by scientific experiments, companies and users' activities. These large datasets are labelled as 'big data', presenting new challenges for computer science researchers and professionals in terms of storage, processing and analysis. Traditional relational database systems (RDBMS) supported with conventional searches cannot be effectively used to handle such multi-structured data. NoSQL databases address the challenges of managing big data with RDBMS and facilitate further analysis of the data. In this paper, we introduce a framework that aims at analysing semi-structured data applications using the NoSQL database MongoDB. The proposed framework focuses on the key aspects needed for semi-structured data analytics in terms of data collection, data parsing and data prediction. The layers involved in the framework are the request layer, which handles queries from the user; the input layer, which interfaces the data sources and the analytics layer; and the output layer, which facilitates visualisation of the analytics performed. A performance analysis of MySQL and MongoDB for the select+fetch operations needed for analytics is carried out, in which MongoDB outperforms MySQL. The proposed framework is applied to predicting the performance of, and monitoring, a cluster of servers.
    Keywords: analytics; semi-structured data; big data analytics; server performance monitoring; cluster analytics; MongoDB; NoSQL analytics.
    DOI: 10.1504/IJBDI.2018.10008726
     
  • Computation migration: a new approach to execute big-data bioinformatics workflows
    by Rickey T. P. Nunes, Santosh L. Deshpande
    Abstract: Bioinformatics workflows frequently access various distributed biological data sources and computational analysis tools for data analysis and knowledge discovery. They move large volumes of data from biological data sources to computational analysis tools and follow the traditional data migration approach for workflow execution. However, with the advent of big data in bioinformatics, moving large volumes of data to computation during workflow execution is no longer feasible. Since the size of biological data is continuously growing and is much larger than the size of the computational analysis tools, moving computation to data in a workflow is a better way to handle the growing data. In this paper, we therefore propose a computation migration approach to execute bioinformatics workflows. We move computational analysis tools to data sources during workflow execution and demonstrate with workflow patterns that moving computation instead of data yields high performance gains in terms of data-flow and execution time.
    Keywords: big-data; bioinformatics; workflows; orchestration; computation migration.
    DOI: 10.1504/IJBDI.2018.10008727
     
  • Parallel computing for preserving privacy using k-anonymisation algorithms from big data
    by Sharath Yaji, B. Neelima.B
    Abstract: Many organisations still consider preserving privacy for big data a major challenge. Parallel computation can be used to optimise big data analysis. This paper gives a proposal for parallelising k-anonymisation algorithms through a comparative study and survey. The k-anonymisation algorithms considered are MinGen, DataFly, Incognito and Mondrian. The results show that the parallel versions of the algorithms perform better than their sequential counterparts as data size increases. For a small dataset, MinGen in sequential mode is 71.83% faster than its parallel version. However, in sequential mode DataFly and in parallel mode Incognito performed well. For a large dataset, Incognito in parallel mode is 101.186% faster than its sequential version. However, in sequential mode MinGen and DataFly performed well. In parallel mode Incognito, DataFly and MinGen performed well. The paper acts as a single point of reference for choosing big data mining k-anonymisation algorithms. It also gives direction on applying HPC concepts such as parallelisation to privacy preserving algorithms.
    Keywords: big data; k-anonymisation; privacy preserving; big data analysis; parallel computing in big data.
    DOI: 10.1504/IJBDI.2018.10008733
     

Special Issue on: Data to Decision

  • Predicting baseline for analysis of electricity pricing
    by Taehoon Kim, Jaesik Choi, Dongeun Lee, Alex Sim, C. Anna Spurlock, Annika Todd, Kesheng Wu
    Abstract: To understand the impact of a new pricing structure on residential electricity peak demands, we need a baseline model that captures every factor other than the new price. The gold standard baseline is a randomised control trial; however, control trials are hard to design. The alternative, learning a baseline model from past measurements, fails to make reliable predictions about the daily peak usage values next summer. To overcome these shortcomings, we propose several new methods. Among these methods, the one named LTAP is particularly promising. It accurately predicts future usages of the control group. It also predicts the reductions of the peak demands to remain the same, while previous studies have found the reduction to be diminishing over time. We believe that LTAP is capturing the self-selection bias of the treatment groups better than techniques used in previous studies and are looking for opportunities to confirm this feature.
    Keywords: baseline model; residential electricity consumption; outdoor temperature; gradient tree boosting; GTB; electricity rate scheme.
    DOI: 10.1504/IJBDI.2018.10008133
     
  • Sign language recognition in complex background scene based on adaptive skin colour modelling and support vector machine
    by Tse-Yu Pan, Li-Yun Lo, Chung-Wei Yeh, Jhe-Wei Li, Hou-Tim Liu, Min-Chun Hu
    Abstract: With the advances of wearable cameras, users can record first-person view videos for gesture recognition or even sign language recognition to help deaf or hard of hearing people communicate with others. In this paper, we propose a purely vision-based sign language recognition system which can be used in scenes with complex backgrounds. We design an adaptive skin colour modelling method for hand segmentation so that the hand contour can be derived more accurately even when different users use our system in various light conditions. Four kinds of feature descriptors are integrated to describe the contours and the salient points of hand gestures, and a support vector machine (SVM) is applied to classify hand gestures. Our recognition method is evaluated on two datasets: 1) the CSL dataset collected by ourselves, in which images were captured in three different environments including complex backgrounds; 2) the public ASL dataset, in which images of the same gesture were captured in different lighting conditions. The proposed recognition method achieves acceptable accuracy rates of 100.0% and 94.0% for the CSL and ASL datasets, respectively.
    Keywords: sign language recognition; support vector machine; SVM; human-computer interaction; gesture recognition.
    DOI: 10.1504/IJBDI.2018.10008140
     
  • Sightseeing value estimation by analysing geosocial images
    by Yizhu Shen, Min Ge, Chenyi Zhuang, Qiang Ma
    Abstract: Recommendation of points of interest (POIs) is drawing more attention to meet the growing demands of tourists. Thus, a POI's quality (sightseeing value) needs to be estimated. In contrast to conventional studies that rank POIs on the basis of user behaviour analysis, this paper presents methods to estimate quality by analysing geo-social images. Our approach estimates the sightseeing value from two aspects: 1) nature value; 2) culture value. For the nature value, we extract image features that are related to favourable human perception to verify whether a POI would satisfy tourists in terms of environmental psychology. Three criteria are defined accordingly: coherence, image-ability, and visual-scale. For the culture value, we recognise the main cultural element (i.e., architecture) included in a POI. In the experiments, we applied our methods to real POIs and found that our approach assessed sightseeing value effectively.
    Keywords: points of interest; sightseeing value; geosocial image; human perception; image processing; UCG mining.
    DOI: 10.1504/IJBDI.2017.10006525
     
  • Detecting spam web pages using multilayer extreme learning machine
    by Rajendra Kumar Roul
    Abstract: Web spamming generally pushes some unimportant pages higher in the search results. Detecting and eliminating such spam pages, which mislead search engines trying to obtain high-quality information, is the need of the day. Aiming in this direction, this study focuses on two important aspects of machine learning. First, it proposes a new content-based spam detection technique which identifies nine important features that help to determine whether a page is spam or non-spam. Each feature has an associated value which is calculated by parsing the documents and then performing the required techniques, i.e. the necessary steps to compute its score. These nine important features, along with the class label (spam or non-spam), generate a feature vector for training the classifiers in order to detect the spam pages. Secondly, it highlights the importance of deep learning using a multilayer extreme learning machine in the field of spam page detection. For experimental work, two benchmark datasets (WEBSPAM-UK2002 and WEBSPAM-UK2006) have been used, and the results using multilayer ELM are found to be more promising compared to other established classifiers.
    Keywords: content-based; deep learning; extreme learning machine; multilayer ELM; support vector machine; spam page.
    DOI: 10.1504/IJBDI.2018.10008141
     
  • Ontology-based faceted semantic search with automatic sense disambiguation for bioenergy domain
    by Feroz Farazi, Craig Chapman, Pathmeswaran Raju, Lynsey Melville
    Abstract: WordNet is a lexicon widely known and used as an ontological resource hosting a comparatively large collection of semantically interconnected words. Use of such resources produces meaningful results and improves users' search experience through increased precision and recall. This paper presents our facet-enabled, WordNet-powered semantic search work done in the context of the bioenergy domain. The main hurdle to achieving the expected result was sense disambiguation, further complicated by the occasional fine-grained distinction of meanings of the terms in WordNet. To overcome this issue, this paper proposes a sense disambiguation methodology that uses bioenergy domain related ontologies (extracted from WordNet automatically), the WordNet concept hierarchy and term sense rank.
    Keywords: semantic search; faceted search; faceted semantic search; knowledge base; WordNet; ontology; bioenergy; semantics; domain; domain ontologies.
    DOI: 10.1504/IJBDI.2018.10008142
     
  • An unsupervised service annotation by review analysis
    by Masafumi Yamamoto, Yuguan Xing, Toshihiko Yamasaki, Kiyoharu Aizawa
    Abstract: With the increase in popularity of review sites, users can write reviews on services that they have used in addition to reading reviews by other users. However, the large number of reviews makes it almost impossible for users to read all the reviews in detail. It is even more burdensome to compare multiple services. Thus, useful tools for extracting the unique features of services are necessary so that users can easily and intuitively understand the quality of services and compare them. In this study, we present an unsupervised method for extracting the unique and detailed features of services and the users' opinions on these features. By using the term frequency and inverse document frequency (TF-IDF) algorithm, our method can also extract, in particular, the praised or criticised features of a specific service. We conducted evaluations to show the validity of our method. In addition, we implemented an intuitive graphical user interface.
    Keywords: service annotation; service profiling; review analysis; summarisation.
    DOI: 10.1504/IJBDI.2018.10008149
     
  • Emotion-based topic impact on social media
    by Fernando H. Calderon Alvarado, Yi-Shin Chen
    Abstract: The increasing use of micro-blogging sites has made them very rich data repositories. The information generated is dynamic by nature, tied to temporal conditions and the subjectivity of its users. Everyday life experiences, discussions or events have a direct impact on the behaviours reflected in social networks. It has become important to assess the degree to which these interactions affect a social group. One possibility is to analyse how impactful a topic is according to the behaviour presented on a social network over time. It is then necessary to develop methods that can contribute towards this task. Having identified a topic in social media, we can obtain a general summary of the emotions it is generating over a social group. We then propose a topic impact score which is given to each topic based on how these emotions transition, how long they span and how many users they reach. This lays the groundwork for quantifying how impactful a topic is over a social group, specifically regarding events detected on Twitter.
    Keywords: social impact; influence; social media; emotion analysis; microblogs.
    DOI: 10.1504/IJBDI.2018.10008150
     
  • Document stream classification based on transfer learning using latent topics
    by Masato Shirai, Jianquan Liu, Takao Miura, Yi-Cheng Chen
    Abstract: In this investigation, we propose a classification framework based on transfer learning using a latent intermediate domain for document stream classification. In a document stream, word frequency changes dramatically because of the transition of themes. To classify a document stream, we capture new features and modify the classification criteria during the stream. Transfer learning utilises knowledge extracted from the source domain to analyse the target domain. We extract latent topics from unlabeled documents using a topic model. Our approach connects each domain using latent topics to classify documents. We capture changes in features by updating the intermediate domain over the course of the document stream.
    Keywords: transfer learning; topic model; document stream classification.
    DOI: 10.1504/IJBDI.2018.10008152
     
  • A customised automata algorithm and toolkit for language learning and application   Order a copy of this article
    by Ruoyu Wang, Guoqiang Li, Jianwen Xiang, Hongming Cai 
    Abstract: Automata are abstract computing machines. They play a basic role in computability theory and programming language theory. More recently, in data analytics, data automata have become a formal way to represent pipelines and workflows. However, research involving automata still suffers from redundant work and inconsistent standards. To address this problem, we propose a new toolkit, CAT, which provides a simple and unified framework for automaton construction and customisation. We adopted both structural and behavioural analysis to design its structure. Several calculus algorithms are implemented according to established theoretical results and designed as overloaded operators. To test the correctness and performance of the toolkit, several bare automata were constructed and compared with 'GREP' in Ubuntu Linux. The results showed that CAT meets most of its design goals and offers a more illustrative way of writing code for automata construction and calculation.
    Keywords: automata; customise; C++; big data analytics; semantics; toolkit; L*; automata theory; DFA; NFA; PDA; regular language; context free language; Infer.net; framework.
    DOI: 10.1504/IJBDI.2018.10008173
     
  • A big data-based RF localisation method for unmanned search and rescue   Order a copy of this article
    by Ju Wang, Hongzhe Liu, Hong Bao, Cesar Flores Montoya, James Hinton 
    Abstract: Autonomous mobile robots require efficient big-data methods to process large amounts of real-time sensory data when performing a task. We investigate a novel RF sensing-based method for target localisation in which a large set of sensor data is mined to produce meaningful location information about a target device. The estimated location of the target is then used by the navigation algorithm to execute a movement plan. Using networked RF beacon data, the proposed big data approach alleviates the problem of noisy RF measurements in location estimation. A particle filter algorithm is used to track the location of the target node. The algorithm achieves beyond-the-grid accuracy even when only a coarse RF map is used.
    Keywords: RF mapping; robot localisation; navigation; measurement mining.
    DOI: 10.1504/IJBDI.2018.10008132
     

Special Issue on: Advances in Cyber Security and Privacy of Big Data in Mobile and Cloud Computing

  • Interoperable Identity Management Protocol for Multi-Cloud Platform   Order a copy of this article
    by Tania Chaudhary, Sheetal Kalra 
    Abstract: Multi-cloud adaptive application provisioning promises to solve the data storage problem and leads to interoperability of data within a multi-cloud environment. It also raises concerns about the interoperability of users across these computing domains. Although various standards and techniques have been developed to secure the identity of the cloud consumer, none of them provides a facility both to interoperate and to secure that identity. Thus, there is a need for an efficient authentication protocol that maintains a single unique identity for the cloud consumer and makes it interoperable among various cloud service providers. Elliptic curve cryptography (ECC) based algorithms are the best choice among public key cryptography (PKC) algorithms due to their small key sizes and efficient computation. In this paper, a secure ECC-based mutual authentication protocol for cloud service provider servers, using a smart device and a one-time token, is proposed. The proposed scheme achieves mutual authentication and provides interoperability among multiple cloud service providers. The security analysis of the proposed protocol proves that the protocol is robust against all the security attacks. The formal verification of the proposed protocol is performed using the AVISPA tool, which proves its security in the presence of an intruder.
    Keywords: Authentication; Cloud Computing; Elliptic Curve Cryptography; Multi-Cloud; One Time Token; Smart Device.