International Journal of Big Data Intelligence (16 papers in press)
A Novel Entropy Based Dynamic Data Placement Strategy for Data Intensive Applications in Hadoop Clusters
by K. Hemant Kumar Reddy, Diptendu Sinha Roy, Vishal Pandey
Abstract: In the last decade, efficient analysis of data-intensive applications has become an increasingly important research issue. The popular MapReduce framework offers a compelling solution to this problem by distributing the workload across interconnected data centers. Hadoop is the most widely used platform for data-intensive applications such as analysis of web logs, detection of global weather patterns, and bioinformatics, among others. However, most Hadoop implementations assume that all nodes in a cluster are homogeneous, with the same computational capacity, which may reduce MapReduce performance by adding overhead for run-time data communication. Moreover, the majority of data placement strategies attempt to place related data close together for faster run-time access, but they disregard scenarios in which the placement algorithm has to work with new data sets, either newly generated or belonging to different MapReduce jobs. This paper deals with improving MapReduce performance over multi-cluster data sets by means of a novel entropy-based data placement strategy (EDPS) that works in three phases and accounts for new data sets. In the first phase, a k-means clustering strategy is employed to extract dependencies among different datasets and group them into data groups. In the second phase, these data groups are placed in different data centers, taking the heterogeneity of virtual machines into account. Finally, the third phase uses an entropy-based grouping of the newly generated datasets, in which each dataset is grouped with the most similar existing cluster based on relative entropy. The essence of the entropy-based scheme lies in computing expected entropy, a measure of the dissimilarity of MapReduce jobs and their data usage patterns in terms of data blocks stored in HDFS, and then placing new data among clusters such that entropy is reduced.
The experimental results show the efficacy of the proposed three-fold dynamic grouping and data placement policy, which significantly reduces execution time and improves Hadoop performance in heterogeneous clusters with varying server and user applications and their parameter sizes.
Keywords: Dynamic Data Placement Strategy; Hadoop Clusters; MapReduce; k-means clustering; entropy.
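The abstract above describes assigning a newly generated dataset to the most similar existing group by relative entropy. The following is only a minimal sketch of that idea, not the authors' EDPS implementation. It assumes, purely for illustration, that each dataset or group is summarised by a vector of per-job access counts; the names `relative_entropy`, `closest_group`, `group_a` and `group_b` are hypothetical.

```python
import math


def normalise(counts):
    """Turn raw access counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def relative_entropy(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q) for discrete distributions.

    eps guards against log(0) when q has an empty bin (an assumption of
    this sketch, not something stated in the abstract).
    """
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def closest_group(new_usage, group_usage):
    """Return the group whose usage profile diverges least from the new dataset."""
    p = normalise(new_usage)
    return min(group_usage,
               key=lambda g: relative_entropy(p, normalise(group_usage[g])))


# Hypothetical access counts of three MapReduce jobs per data group.
groups = {"group_a": [8, 1, 1], "group_b": [1, 1, 8]}
new_dataset = [6, 2, 2]
print(closest_group(new_dataset, groups))
```

A lower divergence means the new dataset is used by jobs in roughly the same proportions as the group, which is the intuition behind placing it alongside that group.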
Hybridization of Classifiers for Anomaly Detection in Big Data
by Rasim Alguliyev, Ramiz Aliguliyev, Fargana Abdullayeva
Abstract: Recently, the widespread use of cloud technologies has led to a rapid increase in the scale and complexity of this infrastructure. Degradation and downtime in the performance metrics of these large-scale systems are considered a major problem. The key to addressing these problems is detecting anomalies that can occur in the hardware, software and system state of cloud infrastructure. In this paper, a semi-supervised classification method based on ensemble classifiers is proposed for detecting anomalies in the performance metrics of cloud infrastructure. The proposed method uses the Naive Bayes, J48, SMO, multilayer perceptron, IBK and PART algorithms. To detect anomalous behavior in the performance metrics, public data from Google and Yahoo! are used together with Python 2.7, Matlab, Weka and the Google Cloud SDK Shell. In the experimental study of the model, 0.90 percent detection accuracy is obtained.
Keywords: Anomaly in performance metrics; CPU-usage; memory usage; Naive Bayes; J48 decision tree; semi-supervised algorithms; ensemble classifiers; Google cluster trace.
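The abstract above combines several base classifiers into an ensemble but does not say how their outputs are merged. As a hedged illustration only, the sketch below uses plain majority voting, which is one common combination rule and an assumption here rather than the authors' stated method. The function name `majority_vote` and the labels are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one label per sample.

    predictions: list of lists, one inner list of labels per classifier,
    all of the same length. With an odd number of classifiers and two
    labels, ties cannot occur.
    """
    combined = []
    for sample_votes in zip(*predictions):
        label, _ = Counter(sample_votes).most_common(1)[0]
        combined.append(label)
    return combined


# Hypothetical outputs of three base classifiers on three metric samples.
votes = [
    ["normal", "anomaly", "normal"],
    ["normal", "anomaly", "anomaly"],
    ["anomaly", "anomaly", "normal"],
]
print(majority_vote(votes))
```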
Fast Approaches for Semantic Service Composition in Big Data Environment
by Jun Huang, Yide Zhou, Qiang Duan, Cong-cong Xing
Abstract: The widespread deployment of Web services and the rapid development of big data applications bring new challenges to Web service composition in the context of big data. The large number of Web services processing a huge amount of diverse data, together with the complex and dynamic relationships among the services, requires automatic composition of semantic Web services to be performed quickly, thereby demanding fast and cost-effective service composition algorithms. In this paper, we investigate Web service composition in big data environments by proposing novel composition algorithms with low time complexity. In our proposed approach, we decompose service composition into three stages: construction of parameter expansion graphs, transformation of service dependence graphs, and backtracking search for service compositions. Based on the parameter expansion strategies, we then propose two fast service composition algorithms and analyze their time complexities. We conduct comparative experiments to evaluate the performance of the algorithms and validate their effectiveness using a big semantic service dataset. Our results reveal that the proposed approaches are preferable to a well-known algorithm in terms of execution time and precision.
Keywords: Big data semantics; quality of services; service composition; virtual parameter.
Semi-structured data analysis and visualisation using NoSQL
by Srinidhi Hiriyannaiah, G.M. Siddesh, P. Anoop, K.G. Srinivasa
Abstract: In the field of computing, huge amounts of data are created every day by scientific experiments, companies and users' activities. These large datasets are labelled 'big data' and present new challenges for computer science researchers and professionals in terms of storage, processing and analysis. Traditional relational database management systems (RDBMS) supported by conventional searches cannot effectively handle such multi-structured data. NoSQL databases address the challenges of managing big data with an RDBMS and facilitate further analysis of the data. In this paper, we introduce a framework for analysing semi-structured data applications using the NoSQL database MongoDB. The proposed framework focuses on the key aspects needed for semi-structured data analytics: data collection, data parsing and data prediction. The framework consists of a request layer that handles queries from the user, an input layer that interfaces the data sources and the analytics layer, and an output layer that visualises the analytics performed. A performance analysis of the select+fetch operations needed for analytics is carried out for MySQL and MongoDB, in which MongoDB outperforms MySQL. The proposed framework is applied to performance prediction and monitoring of a cluster of servers.
Keywords: analytics; semi-structured data; big data analytics; server performance monitoring; cluster analytics; MongoDB; NoSQL analytics.
Hybrid neural network and bi-criteria tabu-machine: comparison of new approaches to maximum clique problem
by Eduard Babkin, Tatiana Babkina, Alexander Demidovskij
Abstract: This paper presents two new approaches to solving the classical NP-hard maximum clique problem (MCP), which frequently arises in the domain of information management, including the design of database structures and big data processing. In our research, we focus on solving that problem using the paradigm of artificial neural networks. The first approach combines the artificial neural network paradigm with genetic programming. To speed up the convergence of the Hopfield neural network (HNN), we propose a specific design of the genetic algorithm as the selection mechanism for terms of the HNN energy function. The second approach incorporates and extends the tabu-search heuristic, improving the performance of the network dynamics of the so-called tabu machine. Introducing a special penalty function in the tabu machine facilitates better evaluation of the search space. We demonstrate the proposed approaches on well-known experimental graphs and formulate two hypotheses for further research.
Keywords: maximum clique problem; MCP; data structures; Hopfield network; genetic algorithm; tabu machine.
Computation migration: a new approach to execute big-data bioinformatics workflows
by Rickey T. P. Nunes, Santosh L. Deshpande
Abstract: Bioinformatics workflows frequently access various distributed biological data sources and computational analysis tools for data analysis and knowledge discovery. They move large volumes of data from biological data sources to computational analysis tools and follow the traditional data migration approach for workflow execution. However, with the advent of big data in bioinformatics, moving large volumes of data to computation during workflow execution is no longer feasible. Because the size of biological data is continuously growing and is much larger than the size of the computational analysis tools, moving computation to data in a workflow is a better way to handle the growing data. In this paper, we therefore propose a computation migration approach to executing bioinformatics workflows. We move computational analysis tools to data sources during workflow execution and demonstrate with workflow patterns that moving computation instead of data yields high performance gains in terms of data flow and execution time.
Keywords: big-data; bioinformatics; workflows; orchestration; computation migration.
A collective matrix factorisation approach to social recommendation with eWOM propagation effects
by Ren-Shiou Liu
Abstract: In recent years, recommender systems have become an important tool for many online retailers to increase sales. Many of these recommender systems predict users' interests in products by using the browsing history or item rating records of users. However, many studies show that, before making a purchase, people often read online reviews and exchange opinions with friends in their social circles. The resulting electronic word-of-mouth (eWOM) has a huge impact on customers' purchase intentions. Nonetheless, most recommender systems in the current literature do not consider eWOM, let alone the effect of its propagation. Therefore, this paper proposes a new recommendation model based on the collective matrix factorisation technique for predicting customer preferences. A series of experiments is conducted using data collected from Epinions and Yelp. The experimental results show that the proposed model significantly outperforms other closely related models by 5%-13% in terms of RMSE and MAE.
Keywords: recommender systems; matrix factorisation; electronic word-of-mouth; eWOM; collaborative filtering; regularisation.
Big uncertain data of multiple sensors efficient processing with high order multi-hypothesis: an evidence theoretic approach
by Hossein Jafari, Xiangfang Li, Lijun Qian, Alexander J. Aved, Timothy S. Kroecker
Abstract: With the proliferation of IoT, numerous sensors are deployed and big uncertain data are collected owing to the differing accuracy, sensitivity range, and decay of the sensors. The goal is to process the data and determine the most likely hypothesis among the set of high order multi-hypothesis. In this study, we propose a novel big uncertain sensor fusion framework that takes advantage of evidence theory's capability of representing uncertainty for decision making and of dealing effectively with conflict. However, the methods in evidence theory are in general computationally very expensive, so they may not be directly applicable to multiple data sources with a high cardinality of hypotheses. We therefore propose a Dezert-Smarandache hybrid model that can be applied to applications with a large number of hypotheses at reduced computational cost. Both synthetic and real data from experiments are used to demonstrate the feasibility of the proposed method for practical situation awareness applications.
Keywords: Dezert-Smarandache theory; DSmT; Dempster-Shafer theory; DST; internet of things; IoT; comfort zone; uncertain data fusion; multiple sensor; multi-hypothesis.
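The abstract above builds on evidence theory for fusing uncertain sensor readings. As background only, the sketch below implements the classical Dempster rule of combination from Dempster-Shafer theory; it is not the Dezert-Smarandache hybrid model the authors propose. The two-hypothesis frame ("hot", "cold") and the mass values are invented for illustration.

```python
from itertools import product


def dempster_combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps a frozenset of hypotheses to a mass, with
    masses summing to 1. Mass assigned to empty intersections is treated
    as conflict and normalised away.
    """
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + wa * wb
        else:
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict and cannot be combined")
    return {h: w / (1.0 - conflict) for h, w in combined.items()}


HOT, COLD = frozenset({"hot"}), frozenset({"cold"})
EITHER = HOT | COLD  # mass on the whole frame expresses ignorance

sensor_1 = {HOT: 0.6, EITHER: 0.4}
sensor_2 = {HOT: 0.5, COLD: 0.3, EITHER: 0.2}
fused = dempster_combine(sensor_1, sensor_2)
for hypothesis, mass in fused.items():
    print(sorted(hypothesis), round(mass, 4))
```

Because masses are defined on subsets of the frame, the number of possible focal elements grows exponentially with the number of hypotheses, which is the computational cost the abstract refers to.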
Parallel computing for preserving privacy using k-anonymisation algorithms from big data
by Sharath Yaji, B. Neelima
Abstract: Many organisations still consider preserving privacy for big data a major challenge. Parallel computation can be used to optimise big data analysis. This paper proposes parallelising k-anonymisation algorithms on the basis of a comparative study and survey. The k-anonymisation algorithms considered are MinGen, DataFly, Incognito and Mondrian. The results show that the parallel versions of the algorithms perform better than their sequential counterparts as data size increases. For the small dataset, sequential MinGen is 71.83% faster than its parallel version; DataFly performed well in sequential mode and Incognito in parallel mode. For the large dataset, parallel Incognito is 101.186% faster than its sequential version; MinGen and DataFly performed well in sequential mode, and Incognito, DataFly and MinGen performed well in parallel mode. The paper acts as a single point of reference for choosing k-anonymisation algorithms for big data mining and gives direction for applying HPC concepts such as parallelisation to privacy-preserving algorithms.
Keywords: big data; k-anonymisation; privacy preserving; big data analysis; parallel computing in big data.
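All four algorithms named in the abstract above aim to produce a k-anonymous table. The sketch below is not any of those algorithms; it only checks the k-anonymity property and applies one hand-picked generalisation step, to show what the algorithms search for. The record fields (`age`, `zip`, `diagnosis`) and the helper names are hypothetical.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


def generalise_age(record, width=10):
    """Replace an exact age with a range of the given width, e.g. 23 -> '20-29'."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}


# Invented records; 'age' and 'zip' are treated as quasi-identifiers.
records = [
    {"age": 23, "zip": "560*", "diagnosis": "flu"},
    {"age": 27, "zip": "560*", "diagnosis": "asthma"},
    {"age": 31, "zip": "560*", "diagnosis": "flu"},
    {"age": 38, "zip": "560*", "diagnosis": "diabetes"},
]

print(is_k_anonymous(records, ["age", "zip"], k=2))
generalised = [generalise_age(r) for r in records]
print(is_k_anonymous(generalised, ["age", "zip"], k=2))
```

The grouping step is a count over independent records, which is one reason such checks lend themselves to the data-parallel execution the abstract discusses.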
Improving straggler task performance in a heterogeneous MapReduce framework using reinforcement learning
by Nenavath Srinivas Naik, Atul Negi, V.N. Sastry
Abstract: MapReduce is one of the most significant distributed and parallel processing frameworks for large-scale data-intensive jobs proposed in recent times. Intelligent scheduling decisions can potentially help to significantly reduce the overall runtime of jobs. It is observed that the total completion time of a job is extended by a few slow tasks. Especially in heterogeneous environments, the job completion times do not synchronise. As originally conceived, the default MapReduce scheduler was not very effective at identifying slow tasks. In the literature, the longest approximate time to end (LATE) scheduler targets heterogeneous environments, but it is limited in how well it estimates task progress because it takes a static view of that progress. In this paper, we propose a novel reinforcement learning-based MapReduce scheduler for heterogeneous environments called the MapReduce reinforcement learning (MRRL) scheduler. It observes the system state of task execution and suggests speculative re-execution of the slower tasks on available nodes in the heterogeneous cluster without assuming any prior knowledge of the environmental characteristics. The experimental results show consistent improvements in performance compared to the LATE and Hadoop default schedulers for different workloads of the HiBench benchmark suite.
Keywords: MapReduce; reinforcement learning; speculative execution; task scheduler; heterogeneous environments.
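The abstract above says that LATE takes a static view of task progress. The sketch below shows the time-to-end estimate LATE is known for, time left = (1 - progress score) / progress rate, with the rate taken as progress divided by elapsed time. It is a simplified illustration of that baseline heuristic, not the proposed MRRL scheduler, and it omits LATE's thresholds and speculation cap. Task names and numbers are invented.

```python
def estimated_time_left(progress_score, elapsed_seconds):
    """LATE-style estimate: remaining work divided by the observed average rate.

    Assumes the task keeps progressing at its average rate so far, which
    is the static assumption the abstract criticises.
    """
    if not 0.0 < progress_score <= 1.0 or elapsed_seconds <= 0:
        raise ValueError("need 0 < progress_score <= 1 and elapsed_seconds > 0")
    progress_rate = progress_score / elapsed_seconds
    return (1.0 - progress_score) / progress_rate


def pick_speculation_candidate(tasks):
    """Return the running task with the longest estimated time to end."""
    return max(tasks, key=lambda name: estimated_time_left(*tasks[name]))


# Invented running tasks: name -> (progress score, seconds elapsed).
running = {
    "map_1": (0.9, 90.0),
    "map_2": (0.3, 120.0),
    "map_3": (0.5, 40.0),
}
print(pick_speculation_candidate(running))
```

A task that stalls late in its run still looks healthy under this estimate, since the average rate hides the recent slowdown. That is the kind of case a scheduler observing system state over time could treat differently.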
Algorithms for fast estimation of social network centrality measures
by Ashok Kumar, R. Chulaka Gunasekara, Kishan G. Mehrotra, Chilukuri K. Mohan
Abstract: Centrality measures are extremely important in the analysis of social networks, with applications such as the identification of the most influential individuals for effective target marketing. Eigenvector centrality and PageRank are among the most useful centrality measures, but computing these measures can be prohibitively expensive for large social networks. This paper explores multiple approaches to reducing the computational effort required to compute relative centrality measures. First, we show that small neural networks can be effective in fast estimation of the relative ordering of vertices in a social network based on these centrality measures. Then, we show how network sampling can be used to reduce the running times for calculating the ordering of vertices; degree centrality-based sampling reduces the running time of the key node identification problem. Finally, we propose the approach of incremental updating of centrality measures in dynamic networks.
Keywords: social network; centrality; eigenvector centrality; PageRank; network sampling; incremental updating.
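The abstract above is concerned with the cost of computing PageRank and with the relative ordering of vertices it induces. For reference, the sketch below is the standard power-iteration baseline, not the neural, sampling-based or incremental estimators the authors propose. The damping factor 0.85 is the conventional default; the four-node graph is invented.

```python
def pagerank(adjacency, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank on a dict: node -> list of out-neighbours.

    Every node must appear as a key. The rank of nodes without outgoing
    links is spread uniformly over all nodes.
    """
    nodes = list(adjacency)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        dangling = sum(rank[v] for v in nodes if not adjacency[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for w in adjacency[v]:
                new_rank[w] += damping * rank[v] / len(adjacency[v])
        if sum(abs(new_rank[v] - rank[v]) for v in nodes) < tol:
            return new_rank
        rank = new_rank
    return rank


def ordering(scores):
    """Vertices sorted from most to least central."""
    return sorted(scores, key=scores.get, reverse=True)


# Invented follower graph: an edge a -> b means a links to b.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(ordering(pagerank(graph)))
```

Each iteration touches every edge once, so the cost scales with the number of edges times the number of iterations, which motivates the cheaper estimates of the ordering discussed in the abstract.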
Collective tweet analysis for accurate user sentiment analysis - a case study with Delhi Assembly Election 2015
by Lija Mohan, M. Sudheep Elayidom
Abstract: Social media postings range from the environment and politics to technology and the entertainment industry. Since they can be construed as a form of collective wisdom, the authors decided to investigate their power to predict real-world outcomes. The objective was to design a keyword-aware, user-based collective tweet mining approach to identify the opinion of each user, which proved to be more accurate than sentiment analysis performed on each tweet individually. To make the application scalable, MapReduce programming on a Hadoop distributed processing framework is utilised. From the analysis of the 2015 Delhi Assembly Election case study, we correctly predicted that the Aam Aadmi Party had higher support than the existing ruling party, BJP. We also compared our sentiment analysis algorithm with other existing techniques and found that ours is efficient in terms of space and time complexity, which makes it suitable for other big data applications.
Keywords: Twitter analysis; collective tweet analysis; sentiment analysis; big data; Hadoop; MapReduce.
Comparison of Hive's query optimisation techniques
by Sikha Bagui, Keerthi Devulapalli
Abstract: The ever-increasing size of data sets in this big data era has forced data analytics to move from traditional RDBMS systems to distributed technologies like Hadoop. Since data analysts are more familiar with SQL than with the MapReduce programming paradigm, HiveQL was built on Hadoop's MapReduce framework. Traditional RDBMS query optimisation techniques used in the rule-based optimiser (RBO) of Hive do not perform well in the MapReduce environment; hence, the correlation optimiser (CRO) and cost-based optimisers (CBOs) were developed. These optimisers perform query optimisations taking the MapReduce execution framework into account. In this work, the three optimisers, RBO, CRO, and CBO, are compared. Queries with common intra-query operations are found to be better optimised with CRO.
Keywords: Hive; query optimisation; correlation optimiser; CRO; rule-based optimiser; RBO; cost-based optimiser; CBO.
Big data ensemble clinical prediction for healthcare data by using deep learning model
by Sreekanth Rallapalli, R.R. Gondkar
Abstract: Big data has revolutionised the healthcare industry. Electronic health records (EHRs) are growing at an exponential rate. Healthcare data, being unstructured in nature, requires completely new technology to process. Clinical applications also need machine learning techniques and data mining methods, which include decision trees and artificial neural networks. Classification algorithms have to be considered for developing predictive models. Ensemble models are gaining popularity relative to individual models. Ensemble systems can provide better accuracy. In this paper, we combine four algorithms, support vector machines, naïve Bayes, random forest and deep learning models, to design the ensemble framework. The deep learning model is used to find the predicted labels. The data sets are collected from the MIMIC-III clinical database repository. Results show that the proposed ensemble model provides better accuracy when the deep learning model is included, as deep learning is an efficient method for complex problems and large data sets.
Keywords: algorithm; big data; classification; decision trees; deep learning; electronic health records; EHR; ensemble model; predictive model.
Resource management for deadline constrained MapReduce jobs for minimising energy consumption
by Adam Gregory, Shikharesh Majumdar
Abstract: Cloud computing has emerged as one of the leading platforms for processing large-scale data-intensive applications. Such applications are executed in large clusters and data centres, which require a substantial amount of energy. Energy consumption within data centres accounts for a considerable fraction of costs and is a significant contributor to global greenhouse gas emissions. Therefore, minimising energy consumption in data centres is a critical concern for data centre operators, cluster owners, and cloud service providers. In this paper, we devise a novel energy-aware MapReduce resource manager for an open system, called EAMR-RM, that can effectively perform matchmaking and scheduling of MapReduce jobs with the objective of minimising data centre energy consumption. Each job is characterised by a service level agreement (SLA) for performance that includes a client-specified earliest start time, execution time, and deadline. Performance analysis demonstrates that, for the range of system and workload parameters experimented with, the proposed technique can effectively satisfy SLA requirements while achieving up to a 45% reduction in energy consumption compared to approaches that do not consider energy in resource management decisions.
Keywords: resource management on clouds; MapReduce with deadlines; constraint programming; energy management; big data analytics; job turnaround time; big data; service level agreement.
Special Issue on: Advances in Cyber Security and Privacy of Big Data in Mobile and Cloud Computing
Interoperable Identity Management Protocol for Multi-Cloud Platform
by Tania Chaudhary, Sheetal Kalra
Abstract: Multi-cloud adaptive application provisioning promises to solve the data storage problem and leads to interoperability of data within a multi-cloud environment. This also raises concerns about the interoperability of users among these computing domains. Although various standards and techniques have been developed to secure the identity of the cloud consumer, none of them both provides interoperability and secures that identity. Thus, there is a need to develop an efficient authentication protocol that maintains a single unique identity for the cloud consumer and makes it interoperable among various cloud service providers. Elliptic curve cryptography (ECC) based algorithms are the best choice among public key cryptography (PKC) algorithms due to their small key sizes and efficient computation. In this paper, a secure ECC-based mutual authentication protocol for cloud service provider servers using a smart device and a one-time token is proposed. The proposed scheme achieves mutual authentication and provides interoperability among multiple cloud service providers. The security analysis of the proposed protocol shows that the protocol is robust against all the security attacks. The formal verification of the proposed protocol is performed using the AVISPA tool, which proves its security in the presence of an intruder.
Keywords: Authentication; Cloud Computing; Elliptic Curve Cryptography; Multi-Cloud; One Time Token; Smart Device.