International Journal of Big Data Intelligence (33 papers in press)
A Big Data Based RF Localization Method for Unmanned Search and Rescue
by Ju Wang
Abstract: Autonomous mobile robots require efficient big-data methods to process large amounts of real-time sensory data. We investigate a novel RF-sensing-based method for target localization in which a large set of sensor data is mined to produce meaningful location information about a target device. The estimated location of the target is then used by the navigation algorithm to execute a movement plan. Using networked RF beacon data, the proposed big data approach alleviates the problem of noisy RF measurements in location estimation. A particle filter algorithm is used to track the location of the target node. The algorithm demonstrates beyond-the-grid accuracy even when only a coarse RF map is used.
Keywords: RF mapping, Robot localization, Navigation, Measurement mining
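The abstract does not spell out the particle filter; as an illustration only, the following is a minimal bootstrap particle filter for tracking a target from noisy beacon range readings. The function name, the Gaussian noise model, and the `noise_std`/`motion_std` parameters are assumptions for this sketch, not the authors' design.

```python
import numpy as np

def particle_filter_step(particles, weights, measurement, beacon,
                         noise_std=1.0, motion_std=0.5):
    """One predict/update/resample cycle of a bootstrap particle filter.

    particles: (N, 2) array of candidate target positions
    measurement: a noisy range reading from a beacon located at `beacon`
    """
    n = len(particles)
    # Predict: diffuse particles under a random-walk motion model
    particles = particles + np.random.normal(0.0, motion_std, particles.shape)
    # Update: reweight by the likelihood of the observed range (Gaussian noise)
    predicted = np.linalg.norm(particles - beacon, axis=1)
    weights = weights * np.exp(-0.5 * ((measurement - predicted) / noise_std) ** 2)
    weights = weights / weights.sum()
    # Resample: draw particles in proportion to their weights
    idx = np.random.choice(n, size=n, p=weights)
    return particles[idx], np.full(n, 1.0 / n)
```

The location estimate is then the mean of the particle cloud, which sharpens as readings from several beacons are fused.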
A Big Data Analytics Framework for Border Crossing Transportation
by Haibo Wang, Da Huo, Yaquan Xu
Abstract: In this paper, the authors present a framework for developing a comprehensive system to analyze border crossing transportation using an open-source metadata acquisition and aggregation tool. It is a platform-integration approach based on Hadoop, MapReduce and MongoDB to consolidate databases from both the U.S. and Mexico. We design a data-driven XML schema for tagging data entries that come from different sources in different formats, and implement a package in the open-source software R to aggregate the XML-transformed data along time and space dimensions. The transformed data is then analyzed with a Difference-in-Difference (DiD) estimation model to understand the behavior of border crossing transportation.
Keywords: Big Data Analytics; Border Crossing Transportation; Difference-in-Difference Estimation
Composition and Verification of Student-oriented Courses
by Naseem Ibrahim
Abstract: In the last few years, the popularity of online degrees has increased dramatically. In current online degrees, the school specifies the courses required to obtain a degree, and for each course the instructor specifies the course elements, including the teaching method and assessments. But different students have different capabilities and constraints, while most institutions provide the same courses to all. A student should be able to select the course that best matches his or her capabilities and constraints as long as it satisfies the required course outcomes. To achieve this goal, we propose the use of Service-oriented Architecture (SOA). We introduce an extended service-oriented architecture and an extended service definition that enable the specification and provision of student-oriented courses. We also propose a formal composition approach. To formally verify the result of the composition, we have also introduced a formal verification approach using the model checking tool UPPAAL.
Keywords: Student-oriented; SOA; Context; Service Model; UPPAAL.
S3R: Storage-Sensitive Services Redeployment in the Cloud
by Huining Yan, Yiming Zhang, Huaimin Wang, Bo Ding, Haibo Mi
Abstract: Services redeployment is one of the critical techniques for energy efficiency in cloud data centers. In recent years, cloud providers such as Amazon EC2 and Aliyun ECS and RDS have been offering local storage for cloud services, since it delivers better performance at an identical price. Nevertheless, because it is often assumed that cloud services use shared storage only, most existing work on redeploying cloud services does not consider the problems introduced by local storage, e.g., migrating much more data (the data stored on local storage) and therefore consuming much more migration time and network bandwidth. Meanwhile, instance migration is a costly operation, so the total number of migrated instances must also be taken into account. However, the total migrated data size and the percentage of migrated instances are often not in accord, so a tradeoff must be made between them. To address this problem, this paper proposes S^3R, a storage-sensitive services redeployment approach. A key issue in services redeployment is how to select the servers to be released, i.e., the servers whose instances are redeployed elsewhere, and S^3R focuses on developing such selection strategies. S^3R first builds a tradeoff model to estimate the release cost of each server, i.e., its releasing priority, and then adopts an FFD-based heuristic algorithm to migrate/redeploy service instances. Evaluation results on production traces demonstrate the effectiveness of S^3R.
Keywords: Cloud Computing; Energy Efficiency; Storage-Sensitive; Services Redeployment.
Data Partition Optimization for Column-Family NoSQL databases
by Meng-Ju Hsieh, Li-Yung Ho, Jan-Jan Wu, Pangfeng Liu
Abstract: Data conversion has become an emerging topic in the big data era. To face the challenge of rapid data growth, legacy relational databases often need to be converted into NoSQL column-family databases in order to achieve better scalability. The conversion from SQL to NoSQL databases requires combining small, normalized SQL tables into larger NoSQL tables, a process called denormalization. A challenging issue in data conversion is how to group the denormalized columns of a large table into "families" so as to ensure the performance of query processing. In this paper, we propose an efficient heuristic algorithm, GPA (Graph-based Partition Algorithm), to address this problem. We use the TPC-C and TPC-H benchmarks to demonstrate that the column families produced by GPA are very efficient for large-scale data processing.
Keywords: vertical partition; column partition; column family; NoSQL database.
Improving Straggler Task Performance in a Heterogeneous MapReduce Framework Using Reinforcement Learning
by Srinivas Naik Nenavath, Atul Negi, V.N. Sastry
Abstract: MapReduce is one of the most significant distributed and parallel processing frameworks proposed in recent times for large-scale data-intensive jobs. Intelligent scheduling decisions can potentially help to significantly reduce the overall runtime of jobs. It is observed that the total time to completion of a job is extended by a few slow tasks, and especially in heterogeneous environments the completion times of tasks do not synchronize. As originally conceived, the MapReduce default scheduler was not very effective at identifying slow tasks. In the literature, the Longest Approximate Time to End (LATE) scheduler extends speculative execution to heterogeneous environments, but it has limitations in properly estimating the progress of tasks, as it takes a static view of task progress. In this paper, we propose a novel reinforcement-learning-based MapReduce scheduler for heterogeneous environments called the MapReduce Reinforcement Learning (MRRL) scheduler. It observes the system state of task execution and suggests speculative re-execution of slower tasks on available nodes in the heterogeneous cluster, without assuming any prior knowledge of the environment's characteristics. Experimental results show consistent improvements in performance compared to the LATE and Hadoop default schedulers for different workloads of the HiBench benchmark suite.
Keywords: MapReduce; Reinforcement Learning; Speculative Execution; Task Scheduler; Heterogeneous Environments.
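The LATE heuristic that the paper compares against ranks running tasks by estimated time to end and speculates on the slowest. A simplified sketch of that idea (the task fields, `slow_fraction` cutoff, and function name are illustrative, not Hadoop's actual API):

```python
def speculative_candidates(tasks, now, slow_fraction=0.25):
    """Rank running tasks for speculative re-execution, LATE-style.

    tasks: dicts with 'id', 'progress' in (0, 1], and 'start' timestamps.
    """
    scored = []
    for t in tasks:
        rate = t['progress'] / (now - t['start'])      # progress rate so far
        time_left = (1.0 - t['progress']) / rate       # estimated time to end
        scored.append((t['id'], rate, time_left))
    # Only tasks whose progress rate is below the slow-task cutoff are eligible
    rates = sorted(s[1] for s in scored)
    cutoff = rates[max(0, int(len(rates) * slow_fraction) - 1)]
    eligible = [s for s in scored if s[1] <= cutoff]
    # Speculate first on the eligible task expected to finish last
    eligible.sort(key=lambda s: -s[2])
    return [s[0] for s in eligible]
```

MRRL replaces this static estimate with a learned policy, but the decision being made — which straggler to re-execute where — is the same.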
An Adaptive Memory Tuning Strategy with High Performance for Spark
by Di Chen, Haopeng Chen, Zhipeng Jiang, Yao Zhao
Abstract: With the rapid development of the Internet, people focus more and more on data, which contains much information and is of great value. To gain better performance in data analysis, in-memory computing has become increasingly popular, and Spark is a successful example of improving computing performance through it. However, making full use of memory resources is still a problem for Spark. In this paper, we present an adaptive memory tuning strategy for Spark that selects data compression and serialization settings dynamically to reduce resource usage and speed up data processing. We derive the strategy for selecting the optimal data compression and serialization mathematically. It chooses a proper memory tuning strategy according to resource usage and obtains good performance in applications that persist data frequently.
Keywords: Spark; In-memory Computing; Data Persisting; Data Caching; Memory Tuning.
A Customized Automata Algorithm and Toolkit for Language Learning and Application
by Ruoyu Wang, Guoqiang Li, Jianwen Xiang
Abstract: Automata are abstract computing machines. They play a basic role in computability theory and programming language theory, and they are widely used in programming language compilers as token scanners and syntactic analysers. More recently, in data analytics, data automata have become a formal way to represent pipelines and workflows. In research involving automata, however, there are many situations where a practitioner has to build a new automaton by hand, which causes a lot of redundant work to rebuild the framework of an automaton. Moreover, when many researchers need to present their ideas and discuss new algorithms, it is extremely hard for them to switch among different styles of code, not to mention modify parts of others' programs.
To solve this problem, we propose a new toolkit, CAT, which provides a simple and unified framework for automaton construction and customization. We examined eight prevailing types of automata, decomposed their logical structure, and extracted their common semantics. Behavioural similarities are also compared and taken into consideration to generate a hierarchical framework, in which each of the eight specific types of automata is implemented as a leaf node of the tree-like structure. Several calculus algorithms are implemented according to the theoretical results and exposed as overloaded operators, which simplifies and clarifies the source code written by end users. To test the correctness and performance of this toolkit, several bare automata were constructed and put into calculation, and a simple textual retrieval programme implemented with CAT was compared with the well-known tool ``GREP'' on Ubuntu Linux.
The results show that CAT realises most of its design goals: the calculus is correct and the framework provides a universal solution for various types of automata. CAT also offers a more illustrative way of writing code for automaton construction and calculation.
Keywords: Automata; Customize; C++.
Hybrid Neural Network and Bi-criteria Tabu-machine: comparison of new approaches to Maximum Clique Problem
by Eduard Babkin, Tatiana Babkina, Alexander Demidovskij
Abstract: This paper presents two new approaches to solving a classical NP-hard problem, the maximum clique problem (MCP), which frequently arises in the domain of information management, including the design of database structures and big data processing. In our research we focus on solving that problem using the paradigm of artificial neural networks. The first approach combines the artificial neural network paradigm with genetic programming: to boost the convergence of the Hopfield Neural Network (HNN), we propose a specific design of the genetic algorithm as the selection mechanism for the terms of the HNN energy function. The second approach incorporates and extends the Tabu-search heuristic, improving the performance of the network dynamics of the so-called Tabu machine; the introduction of a special penalty function in the Tabu machine facilitates better exploration of the search space. We demonstrate the proposed approaches on well-known experimental graphs and formulate two hypotheses for further research.
Keywords: Maximum Clique Problem; Data structures; Hopfield Network; Genetic Algorithm; Tabu Machine.
Algorithms for Fast Estimation of Social Network Centrality Measures
by Ashok Kumar, R. Chulaka Gunasekara, Kishan Mehrotra, Chilukuri Mohan
Abstract: Centrality measures are extremely important in the analysis of social networks, with applications such as the identification of the most influential individuals for effective target marketing. Eigenvector centrality and PageRank are among the most useful centrality measures, but computing them can be prohibitively expensive for large social networks. This paper explores multiple approaches to reducing the computational effort required to compute relative centrality measures. First, we show that neural networks can be effective in learning and estimating the ordering of vertices in a social network based on these centrality measures. The proposed neural network approach requires far less computational effort and is faster than early termination of the power iteration method used to compute the centrality measures. We also show that four features describing the size of the social network and two vertex-specific attributes suffice as inputs to the neural networks, which require very few hidden neurons. We then focus on how network sampling can be used to reduce the running time of calculating the ordering of vertices, introducing the notion of degree-centrality-based sampling to speed up the key node identification problem. Finally, we propose an approach for incrementally updating centrality measures in dynamic networks.
Keywords: Social network; Centrality; Eigenvector centrality; PageRank; Network sampling; Incremental updating.
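The expensive baseline that the abstract's neural estimator approximates is power iteration. For readers unfamiliar with it, a minimal dense-matrix version (real implementations use sparse matrix-vector products):

```python
import numpy as np

def eigenvector_centrality(adj, iters=100, tol=1e-8):
    """Power iteration on the adjacency matrix; its fixed point is the
    principal eigenvector, whose entries rank vertices by centrality."""
    n = adj.shape[0]
    x = np.ones(n) / n
    for _ in range(iters):
        x_new = adj @ x                         # one matrix-vector product
        x_new = x_new / np.linalg.norm(x_new)   # renormalise to avoid blow-up
        if np.linalg.norm(x_new - x) < tol:     # early termination on convergence
            break
        x = x_new
    return x_new
```

Each iteration costs one pass over all edges, which is exactly what becomes prohibitive on large social graphs; PageRank differs only in using a damped, column-normalised matrix.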
A Collective Matrix Factorization Approach to Social Recommendation with eWOM Propagation Effects
by Ren-Shiou Liu
Abstract: In recent years, recommender systems have become an important tool for many online retailers to increase sales. Many of these systems predict users' interest in products by using the users' browsing history or item rating records. However, many studies show that, before making a purchase, people often read online reviews and exchange opinions with friends in their social circles. The resulting electronic word-of-mouth (eWOM) has a huge impact on customers' purchase intentions. Nonetheless, most recommender systems in the current literature do not consider eWOM, let alone the effect of its propagation. This paper therefore proposes a new recommendation model based on the collective matrix factorization technique for predicting customer preferences. A series of experiments using data collected from Epinions and Yelp is conducted. The experimental results show that the proposed model significantly outperforms other closely related models, by 5%-13% in terms of RMSE and MAE.
Keywords: recommender systems; matrix factorization; collaborative filtering; electronic word-of-mouth; regularization.
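The collective model itself is not given in the abstract; as a baseline illustration only, here is plain rating-matrix factorization by SGD, the component that collective matrix factorization couples with additional (e.g., social/eWOM) matrices sharing the user factors. All hyperparameters below are arbitrary choices for the sketch.

```python
import numpy as np

def factorize(ratings, n_users, n_items, rank=8, lr=0.02, reg=0.02, epochs=500):
    """Matrix factorisation by SGD over observed (user, item, rating) triples."""
    rng = np.random.default_rng(0)
    U = rng.normal(0, 0.1, (n_users, rank))   # user latent factors
    V = rng.normal(0, 0.1, (n_items, rank))   # item latent factors
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - U[u] @ V[i]                     # error on one observed rating
            U[u] += lr * (err * V[i] - reg * U[u])    # regularised gradient steps
            V[i] += lr * (err * U[u] - reg * V[i])
    return U, V
```

A predicted preference is simply `U[u] @ V[i]`; the collective variant adds further loss terms over a trust or eWOM matrix so that socially connected users end up with similar factors.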
Collective Tweet Analysis for Accurate User Sentiment Analysis - a Case Study with Delhi Assembly Election 2015
by Lija Mohan, Sudheep Elayidom
Abstract: Social media has exploded as a category of online discourse where people create and share content at a massive rate. Because of its ease of use, speed and reach, social media is fast changing public discourse in society and setting trends and agendas in topics ranging from the environment and politics to technology and the entertainment industry. Since social media can also be construed as a form of collective wisdom, the authors decided to investigate its power at predicting real-world outcomes. The objective was to design a Twitter-based sentiment mining system. We introduce a keyword-aware, user-based collective tweet mining approach to rank the sentiment of each user. To evaluate the accuracy of the proposed method, we chose an interesting election winner prediction application and observed how people's sentiment on different political issues at the time was reflected in their votes. A domain thesaurus is built by collecting keywords related to each issue. Since Twitter data is too large to process on traditional architectures, we introduce a scalable and efficient MapReduce-based approach to classify the tweets. The experiments were designed to predict the winner of the Delhi Assembly Elections, 2015 by analyzing people's sentiments on different political issues, and from this analysis we correctly predicted that the Aam Aadmi Party had higher support than the then ruling party, BJP. We thus introduce a big data approach to sentiment analysis on Twitter data, which has widespread applications in today's world.
Keywords: Twitter Analysis; Collective Tweet Analysis; Sentiment Analysis; Big Data; Hadoop; MapReduce.
Big Uncertain Data of Multiple Sensors Efficient Processing with High Order Multi-Hypothesis: An Evidence Theoretic Approach
by Hossein Jafari, Xiangfang Li, Lijun Qian, Alexander Aved, Timothy Kroecker
Abstract: With the proliferation of the IoT, numerous sensors are deployed, and big uncertain data are collected owing to differences in the accuracy, sensitivity range, and decay of the sensors. The goal is to process the data and determine the most plausible hypothesis among a large set of hypotheses. In this study, we propose a novel big uncertain sensor fusion framework that takes advantage of evidence theory's capability to represent uncertainty for decision making and to deal effectively with conflict. However, the methods of evidence theory are in general very computationally expensive, so they cannot be applied directly to multiple data sources with a high cardinality of hypotheses. We therefore propose a Dezert-Smarandache hybrid model that applies to applications with a high number of hypotheses while reducing the computational cost. Both synthetic and real experimental data are used to demonstrate the feasibility of the proposed method for practical situation awareness applications.
Keywords: Dezert-Smarandache Theory (DSmT); Dempster-Shafer Theory (DST); Internet of Things (IoT); Comfort Zone; Uncertain Data Fusion; Multiple Sensor; Multi-Hypothesis.
Large-scale spectral clustering for managing big data in healthcare operations
by Maoqing Liu, Nasser Fard, Keivan Sadeghzadeh
Abstract: Healthcare industries have access to a large volume and variety of data about patients' behaviours, diseases, and treatments. There is a significant need for a data-driven system to discover patterns for a better understanding of the impact of human risk behaviours on numerous diseases. In order to discover and extract interesting knowledge and patterns from such data, a data mining process for discovering knowledge from unprocessed and raw healthcare data is studied. Methods for the analysis of big data, and the role and types of clustering methods, are presented, together with an in-depth analysis of spectral clustering as a superior clustering algorithm for big healthcare data. The spectral clustering algorithm is applied to a large healthcare dataset from the behavioural risk factor surveillance system (BRFSS), partitioning the untrained data into at least four clusters. The MATLAB® R2011b programming environment is utilised as a calculation tool in the experimental design and analysis. The experimental results, the analysis and the implementation process are discussed, and the data processing is presented. A sensitivity analysis for both parameters of the spectral clustering is performed to determine their influence on the clustering results.
Keywords: big data; healthcare; spectral clustering; visualisation.
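The abstract partitions the BRFSS data into at least four clusters; as a reduced illustration of the underlying method (two clusters, normalised Laplacian, sign of the Fiedler vector), the following is standard spectral clustering, not the paper's exact MATLAB pipeline:

```python
import numpy as np

def spectral_bipartition(W):
    """Split a similarity graph into two clusters using the sign of the
    Fiedler vector of the normalised Laplacian L = I - D^{-1/2} W D^{-1/2}."""
    d = W.sum(axis=1)                     # vertex degrees
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L)        # eigenvalues in ascending order
    fiedler = vecs[:, 1]                  # eigenvector of 2nd-smallest eigenvalue
    return (fiedler > 0).astype(int)      # cluster label per vertex
```

For k > 2 clusters, one instead embeds each vertex with the first k eigenvectors and runs k-means on the rows of that embedding.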
Research on encryption strategy in large data environment based on proxy re-encryption
by Qiang Zhan, Jingtao Su, Yuelong Hu
Abstract: Human society has been experiencing an unprecedented revolution with the rapid development of information technology in recent years. In the open and distributed environment of big data, data encryption technology has become an important issue. In order to solve the security problem of big data storage, in this paper we study encryption algorithms for big data and design a proxy re-encryption scheme using the method of random-number encryption. We present the main steps of the proxy re-encryption scheme and design each part of its main encryption algorithms. In this scheme, the big data server cannot obtain any plaintext information. Finally, we implement the proxy re-encryption scheme using the JPBC function library and other techniques, which demonstrates the availability and feasibility of the scheme, and we put forward a network information security encryption solution for the big data environment.
Keywords: big data; information security; proxy re-encryption; bilinear pairing.
Special Issue on: Big Data Analytics, Infrastructure and Applications
Hybrid approach-based support vector machine for electric load forecasting incorporating feature selection
by Malek Sarhani, Abdellatif El Afia, Rdouan Faizi
Abstract: Forecasting future electricity demand is very important for the electric power industry. It has been shown in several research works that machine learning methods are useful for electric load forecasting (ELF), since electric load data are complex and nonlinear in their relationships. On the other hand, it is important to identify irrelevant factors as a preprocessing step for ELF. Our objective in this paper is to investigate the importance of applying feature selection to remove the irrelevant factors of electric load. To this end, we introduce a hybrid machine learning approach that combines support vector machine (SVM) and particle swarm optimisation (PSO) in both continuous and binary forms. Specifically, the binary hybridisation is used for feature selection and the continuous one for ELF. Experimental results demonstrate the feasibility of applying feature selection with the SVM and PSO algorithms without decreasing the performance of the forecasting model for ELF.
Keywords: machine learning; electric load forecasting; ELF; feature selection; big data; support vector machine; SVM; particle swarm optimisation; PSO.
Throughput enhancement of a novel hybrid-MAC protocol for M2M networks
by Pawan Kumar Verma, Rajesh Verma, Arun Prakash, Rajeev Tripathi
Abstract: When M2M devices communicate with each other within a group or cluster without any human intervention, this is called inter-M2M communication. Hence, there is a critical requirement for a scalable medium access control (MAC) protocol to enable multiple M2M devices to access the channel. For this purpose, contention- or reservation-based MAC protocols can be used, but with multiple M2M devices, adaptability and scalability become bottlenecks. Therefore, in this paper we propose a novel hybrid MAC protocol, which mainly consists of a contention interval (CI) and a data transmission interval (DTI). During the CI, all active M2M devices contend for channel access, and the successful devices win time-slots in the DTI. The M2M devices are enabled with another proposed high-throughput MAC (HT-MAC) protocol and share data with each other within each time-slot during the DTI. Simulation results show significant per-time-slot throughput improvement compared to the IEEE 802.11 MAC protocol.
Keywords: machine-to-machine; M2M; contention; ubiquitous; IEEE 802.11 DCF; MAC protocol.
CityPro: city-surveillance collaborative platform
by Mohamed Dbouk, Hamid Mcheick, Ihab Sbeity
Abstract: Day by day, modern cities face a big challenge in terms of public safety and security. City surveillance systems mostly use video surveillance techniques; they incorporate thousands of cameras and rely on high-speed networking infrastructures. Moreover, a city hosts multiple standalone computerised systems that operate independently of each other, e.g., banking systems, customs, and hospitals. These systems generate huge datasets, and the collected data constitute a gigantic mine of scattered information. A smart city is a city that intelligently benefits from such omnipresent systems. This paper presents an integrated platform, a collaborative surveillance system called CityPro, that brings together multiple existing systems in a city. The proposed architecture is intended to protect and monitor people and public infrastructures such as bridges, roads and buildings, and it is designed to deal with and prevent abnormal activities like terrorist attacks. CityPro is expected to operate in live mode using the city's digital infrastructures. At the end of the paper, a typical case study is given, and challenges and future work are discussed.
Keywords: smart cities; digital world; event-driven process; collaborative business process; software architecture; business-intelligence; big-data.
NoSQL databases for big data
by Ahmed Oussous, Fatima-Zahra Benjelloun, Ayoub Ait Lahcen, Samir Belfkih
Abstract: NoSQL solutions have been created to respond to the many issues encountered when dealing with certain applications, e.g., the storage of very large datasets. Traditional RDBMSs ensure data integrity and transaction consistency, but at the cost of a rigid storage schema and complex management. Certainly, data integrity and consistency are required in many cases, as in financial applications, but they are not always needed. The goal of this paper is to establish a precise picture of NoSQL's evolution and mechanisms as well as the advantages and disadvantages of the main NoSQL data models and frameworks. For this purpose, first, a deep comparison between SQL and NoSQL databases is presented, examining criteria such as scalability, performance, consistency, security, analytical capabilities and fault-tolerance mechanisms. Second, the four major types of NoSQL databases are defined and compared: key-value stores, document databases, column-oriented databases and graph databases. Third, for each NoSQL data model we compare the main available technical solutions.
Keywords: NoSQL; key-value databases; document databases; column-oriented databases; graph databases; big data.
Special Issue on: E-Health Systems and Semantic Web
A Graph Traversal Attack on Bloom Filter Based Medical Data Aggregation
by William Mitchell, Rinku Dewri, Ramakrishna Thurimella, Max Roschke
Abstract: We present a novel cryptanalytic method based on graph traversals to show that record linkage using Bloom Filter Encoding does not preserve privacy in a two-party setting. Bloom Filter Encoding is often suggested as a practical approach to medical data aggregation. This attack is stronger than a simple dictionary attack in that it does not assume knowledge of the universe. The attack is very practical and produced accurate results when experimented on large amounts of name-like data derived from a North Carolina voter registration database. We also give theoretical arguments showing that going from bigrams to n-grams, n > 2, does not increase privacy; on the contrary, it actually makes the attack more effective. Finally, some ways to resist this attack are suggested.
Keywords: Bloom filter encoding, privacy-preserving record linkage, medical data aggregation, cryptanalysis, two-party linkage, private record linkage.
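Bloom Filter Encoding, as typically used in privacy-preserving record linkage, hashes each name's bigram set into a fixed-length bit array so that similar names yield similar filters. A minimal sketch of the encoding the attack targets (the hash choice and the `m`/`k` values are illustrative, not the paper's parameters):

```python
import hashlib

def bigrams(name):
    """Split a padded name into overlapping 2-grams, e.g. '_s', 'sm', ..."""
    padded = f"_{name}_"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

def bloom_encode(name, m=100, k=4):
    """Encode a name's bigram set into an m-bit Bloom filter, setting k
    double-hashed positions per bigram."""
    bits = [0] * m
    for g in bigrams(name):
        h1 = int(hashlib.md5(g.encode()).hexdigest(), 16)
        h2 = int(hashlib.sha1(g.encode()).hexdigest(), 16)
        for i in range(k):
            bits[(h1 + i * h2) % m] = 1
    return bits
```

Names sharing bigrams share set bits, which is the similarity that record linkage scores (e.g., via the Dice coefficient) and that the graph-traversal attack exploits to re-identify encoded records.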
Special Issue on: Big Data Visualisation and Analytics
Improving execution speed of incremental runs of MapReduce using provenance
by Anu Mary Chacko, Anish Gupta, S. Madhu, S.D. Madhu Kumar
Abstract: Hadoop MapReduce is an analytic tool used to solve big data problems that are parallelisable. MapReduce jobs need to be rerun frequently as data changes, and many times these changes consist of appending new data to an existing file. So in a rerun, if we can reuse the output of the previous run and limit job execution to the new data, we can reduce the overall job execution time. In the literature, there are schemes that use memoisation, storage of intermediate results, etc., to implement efficient incremental reruns. In this paper, we explain how provenance can be used to implement transparent incremental MapReduce for 'append-only' input files. Our approach requires no additional storage and no modification of the existing Hadoop infrastructure or scheduler. Experimental evaluation of MapReduce on a multinode cluster, with provenance stored in HBase, gave good results for incremental runs when a new file or new data was added.
Keywords: Hadoop MapReduce; provenance; HBase; incremental MapReduce.
Special Issue on: Big Data Management in Clouds Opportunities, Issues, Challenges and Solutions
Semi-structured Data Analysis and Visualization using NoSQL
by Srinidhi Hiriyannaiah, Siddesh G M, K.G. Srinivasa, Anoop P
Abstract: In the field of computing, huge amounts of data are created every day by scientific experiments, companies and user activities. These large datasets are labelled as "big data", presenting new challenges for computer science researchers and professionals in terms of storage, processing and analysis. Traditional relational database systems (RDBMS) supported by conventional searches cannot effectively handle such multi-structured data. NoSQL databases complement RDBMS in managing big data and facilitate further analysis of the data. In this paper, we introduce a framework that aims at analyzing semi-structured data applications using the NoSQL database MongoDB. The proposed framework focuses on the key aspects of semi-structured data analytics: data collection, data parsing and data prediction. The layers involved in the framework are the request layer, which accepts queries from the user; the input layer, which interfaces with the data sources; the analytics layer; and the output layer, which visualizes the analytics performed. A performance analysis of the select+fetch operations needed for analytics is carried out, in which the NoSQL database MongoDB outperforms the MySQL database. The proposed framework is applied to predicting the performance of, and monitoring, a cluster of servers.
Keywords: analytics; semi-structured data; big data analytics; cluster analytics; server performance monitoring; MongoDB; NoSQL analytics.
Computation Migration: A new approach to execute big-data bioinformatics workflows
by Rickey T. P. Nunes, Santosh L. Deshpande
Abstract: Bioinformatics workflows frequently access various distributed biological data sources and computational analysis tools for data analysis and knowledge discovery. They move large volumes of data from biological data sources to computational analysis tools, following the traditional data migration approach to workflow execution. However, with the advent of big data in bioinformatics, moving large volumes of data to the computation during workflow execution is no longer feasible. Considering that the size of biological data is continuously growing and is much larger than the size of the computational analysis tools, moving computation to data in a workflow is a better way to handle the growing data. In this paper, we therefore propose a computation migration approach for executing bioinformatics workflows. We move computational analysis tools to the data sources during workflow execution and demonstrate with workflow patterns that moving computation instead of data yields high performance gains in terms of data flow and execution time.
Keywords: Big-data; Bioinformatics; Workflows; Orchestration; Computation migration.
Parallel Computing For Preserving Privacy Using k-anonymization Algorithms from Big Data
by Sharath Yaji, Neelima.B Reddy
Abstract: For many organizations, preserving privacy for big data is still a major challenge. Big data analysis can be optimized through parallel computation. This paper gives a proposal for parallelizing k-anonymization algorithms through a comparative study and survey. The main k-anonymization algorithms considered for study and comparison are MinGen, Datafly, Incognito and Mondrian. It is noted that as the data size increases, the parallel versions of the algorithms perform better than their sequential counterparts. For small datasets, sequential MinGen is 71.83% faster than its parallel version; overall, however, Datafly performed well in sequential mode and Incognito in parallel mode. For large datasets, parallel Incognito is 101.186% faster than its serial version; overall, MinGen and Datafly performed well in sequential mode, and Incognito, Datafly and MinGen performed well in parallel mode. The paper acts as a single point of reference for choosing big data mining k-anonymization algorithms and gives a direction for applying HPC concepts such as parallelization to privacy-preserving algorithms. As future work, the authors plan to port these algorithms to heterogeneous architectures such as graphics processing units (GPUs) as applicable.
Keywords: Big Data; K–anonymization; Privacy preserving in Big Data analysis; Parallel computing for Big Data.
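The property all four algorithms enforce can be illustrated with a minimal sketch (not taken from the paper; the table, column names and data below are hypothetical): a released table is k-anonymous when every combination of quasi-identifier values is shared by at least k records.

```python
from collections import Counter

def is_k_anonymous(records, quasi_ids, k):
    """Check the basic k-anonymity property: every combination of
    quasi-identifier values must appear in at least k records."""
    groups = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return all(count >= k for count in groups.values())

# Illustrative table: age band and generalized ZIP are quasi-identifiers.
table = [
    {"age": "30-39", "zip": "761**", "disease": "flu"},
    {"age": "30-39", "zip": "761**", "disease": "cold"},
    {"age": "40-49", "zip": "762**", "disease": "flu"},
    {"age": "40-49", "zip": "762**", "disease": "asthma"},
]
print(is_k_anonymous(table, ["age", "zip"], k=2))  # → True
print(is_k_anonymous(table, ["age", "zip"], k=3))  # → False
```

Algorithms such as DataFly and Mondrian differ mainly in how they generalize or partition the quasi-identifiers until a check like this passes.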
Special Issue on: Data to Decision
Predicting Baseline for Analysis of Electricity Pricing
by Taehoon Kim, Jaesik Choi, Dongeun Lee, Alex Sim, C. Anna Spurlock, Annika Todd, Kesheng Wu
Abstract: To understand the impact of a new pricing structure on residential electricity demand, we need a baseline model that captures every factor other than the new price. The gold-standard baseline is a randomized control group; however, a good control group is hard to design, and it can only serve as a baseline for a group, not for any individual household. To overcome these shortcomings, we develop a number of techniques that can predict hourly usage years ahead. To capture the fact that daily electricity demand peaks a few hours after the temperature reaches its peak, existing methods rely on lagged variables, such as the usage from a day ago and a week ago. When making predictions years into the future, the values from a week ago are also in the future and therefore unknown. In this work, we develop a continuous prediction strategy that first forecasts the lagged variables and then uses them in further predictions, but we find that the prediction error increases over time. Based on an observed linear relationship between temperature and aggregate power (LTAP), we design a new prediction method, named LTAP, that avoids this accumulation of prediction errors when forecasting the usage of each household. In our tests, the average usage predicted by LTAP matches the control group for both summers covered by the test data. This suggests that we might be able to use LTAP predictions in future studies.
Keywords: baseline model; residential electricity consumption; outdoor temperature; gradient tree boosting; electricity rate scheme.
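The LTAP idea rests on a linear temperature–power relationship. As a rough illustration (not the authors' actual model; the observations below are invented), an ordinary least-squares fit of aggregate power against outdoor temperature can be computed directly, with no lagged usage variables:

```python
def fit_linear(xs, ys):
    """Ordinary least-squares fit of y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    b = my - a * mx
    return a, b

# Hypothetical hourly observations: outdoor temperature (degrees C)
# versus aggregate power (kW) for a group of households.
temps = [22, 25, 28, 31, 34]
power = [50, 59, 68, 77, 86]
a, b = fit_linear(temps, power)
print(round(a, 2), round(b, 2))  # → 3.0 -16.0
```

Because the fit depends only on temperature, forecasting years ahead needs only a temperature forecast, which is how a model of this shape sidesteps the error accumulation of lagged-variable approaches.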
Sign Language Recognition in Complex Background Scene Based on Adaptive Skin Color Modeling and Support Vector Machine
by Tse-Yu Pan, Li-Yun Lo, Chung-Wei Yeh, Jhe-Wei Li, Hou-Tim Liu, Min-Chun Hu
Abstract: With the advances in wearable cameras, users can record first-person-view videos for gesture recognition, or even sign language recognition, to help deaf or hard-of-hearing people communicate with others. In this paper, we propose a purely vision-based sign language recognition system that can be used in complex background scenes. We design an adaptive skin-color modeling method for hand segmentation so that the hand contour can be derived more accurately, even when different users use our system under various lighting conditions. Four kinds of feature descriptors are integrated to describe the contours and salient points of hand gestures, and a Support Vector Machine (SVM) is applied to classify them. Our recognition method is evaluated on two datasets: (1) the CSL dataset collected by ourselves, in which images were captured in three different environments including complex backgrounds, and (2) the public ASL dataset, in which images of the same gesture were captured under different lighting conditions. The proposed recognition method achieves accuracy rates of 100.0% and 94.0% on the CSL and ASL datasets, respectively.
Keywords: Sign Language Recognition; Support Vector Machine; Human-Computer Interaction; Gesture Recognition.
Detecting Spam Web Pages using Multilayer Extreme Learning Machine
by Rajendra Kumar Roul
Abstract: Web spamming artificially raises the ranking of unimportant pages in search results. Detecting and eliminating such spam pages is an urgent need: they not only mislead search engines, but also become a roadblock to obtaining high-quality information from the Web. Hence, spam page detection has become a vibrant area of research in information retrieval. Aiming in this direction, this study focuses on two important aspects of machine learning. First, it proposes a new content-based spam detection technique that identifies nine important features which help determine whether a page is spam or non-spam. Each feature has an associated value, calculated by parsing the documents and then applying the necessary steps to compute its score. These nine features, along with the class label (spam or non-spam), generate a feature vector for training the classifiers that detect spam pages. Second, it highlights the importance of deep learning using the Multilayer Extreme Learning Machine for spam page detection. For the experimental work, two benchmark datasets (WEBSPAM-UK2006 and WEBSPAM-UK2002) were used, and the results using the Multilayer ELM are more promising than those of other established classifiers.
Keywords: Content based; Deep Learning; Extreme Learning Machine; Multilayer ELM; Support Vector Machine; Spam Page.
An unsupervised service annotation by review analysis
by Masafumi Yamamoto, Yuguan Xing, Toshihiko Yamasaki, Kiyoharu Aizawa
Abstract: With the increasing popularity of online review sites, users can write reviews of services they have used, in addition to reading reviews by other users to gain information about those services. However, the number of reviews for a service may be so large that it is almost impossible for users to read them all in detail, and it is even more burdensome to compare multiple services. Users may also want to know about the unique and relatively praised features of services, yet few existing works address this problem. Thus, useful tools for extracting the unique features of services are needed so that users can easily and intuitively understand the quality of services and compare them. In this study, we present an unsupervised method for extracting the unique and detailed features of services, along with users' opinions on these features. Using only the term frequency (TF), our method would extract only general features (e.g., for restaurants, food and service), and many review sites already show users how these general features are evaluated; users, however, may want to know more detailed features and how those are evaluated. By using the term frequency-inverse document frequency (TF-IDF) algorithm, our proposed method can also extract the particularly praised or criticized features of a specific service. We conducted evaluations from multiple viewpoints to show the validity of our proposed method. In addition, we implemented a graphical user interface to help users intuitively understand the results.
Keywords: Service annotation; service profiling; review analysis; summarization.
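Why TF-IDF surfaces service-specific rather than generic features can be seen in a minimal sketch (not the paper's implementation; the tokenized reviews below are invented): terms that appear in every service's reviews get an inverse-document-frequency of zero, so only distinctive terms score.

```python
import math
from collections import Counter

def tf_idf(target_doc, corpus):
    """Score each term of target_doc by TF-IDF against a corpus of
    tokenized documents; high scores mark distinctive features."""
    tf = Counter(target_doc)
    scores = {}
    for term, freq in tf.items():
        df = sum(1 for doc in corpus if term in doc)  # df >= 1: target is in corpus
        idf = math.log(len(corpus) / df)
        scores[term] = (freq / len(target_doc)) * idf
    return scores

# Hypothetical tokenized reviews for three restaurants.
reviews = [
    ["food", "good", "service", "ramen", "amazing"],
    ["food", "ok", "service", "slow"],
    ["food", "great", "service", "friendly"],
]
scores = tf_idf(reviews[0], reviews)
# Generic features shared by every review score zero ...
print(scores["food"], scores["service"])  # → 0.0 0.0
# ... while service-specific features get positive scores.
print(scores["ramen"] > 0)                # → True
```

In practice each "document" would be the concatenated reviews of one service, so the highlighted terms become that service's unique features.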
Emotion Based Topic Impact on Social Media
by Fernando Calderon Alvarado, Yi-Shin Chen
Abstract: The increasing use of micro-blogging sites has made them very rich data repositories. The information generated is dynamic by nature, tied to temporal conditions and the subjectivity of its users. Everyday life experiences, discussions and events have a direct impact on the behaviors reflected in social networks. It has become important to assess to what degree these interactions affect a social group. One possibility is to analyze how impactful a topic is according to the behavior presented on a social network over time, and methods that contribute to this task are therefore needed. Having identified a topic in social media, we can obtain a general summary of the emotions it generates across a social group. We then propose a Topic Impact score, assigned to each topic based on how these emotions transition, how long they span and how many users they reach. This lays the groundwork for quantifying how impactful a topic is over a social group, specifically for events detected on Twitter.
Keywords: Social Impact; Influence; Social Media; Emotion Analysis.
Document Stream Classification based on Transfer Learning using Latent Topics
by Masato Shirai, Jianquan Liu, Takao Miura
Abstract: In this investigation, we propose a classification framework for document streams based on transfer learning using a latent intermediate domain. In a document stream, word frequencies change dramatically as themes shift, so to classify the stream we must capture new features and modify the classification criteria as it progresses. Transfer learning utilizes knowledge extracted from a source domain to analyze the target domain. We extract latent topics from unlabeled documents using a topic model; our approach connects the domains through these latent topics to classify documents, and captures changing features by updating the intermediate domain as the stream evolves.
Keywords: Transfer Learning; NMTF; Topic Model; Document Classification.
Sightseeing Value Estimation by Analyzing Geosocial Images
by Yizhu Shen, Min Ge, Chenyi Zhuang, Qiang Ma
Abstract: Recommendation of points of interest (POIs) is drawing more attention to meet the growing demands of tourists. Thus, a POI's quality (sightseeing value) needs to be estimated. In contrast to conventional studies that rank POIs on the basis of user-behavior analysis, this paper presents methods to estimate quality by analyzing geo-social images. Our approach estimates sightseeing value from two aspects: (1) nature value and (2) culture value. For the nature value, we extract image features related to favorable human perception to verify whether a POI would satisfy tourists in terms of environmental psychology; three criteria are defined accordingly: coherence, imageability, and visual scale. For the culture value, we recognize the main cultural element (i.e., architecture) included in a POI. In experiments, we applied our methods to real POIs and found that our approach assesses sightseeing value effectively.
Keywords: Points of Interest; Sightseeing value; Geosocial image; human perception; image processing; UGC Mining.
Ontology-based Faceted Semantic Search with Automatic Sense Disambiguation for Bioenergy Domain
by Feroz Farazi, Craig Chapman, Pathmeswaran Raju, Lynsey Melville
Abstract: WordNet is a widely known lexicon used as an ontological resource, hosting a comparatively large collection of semantically interconnected words. Using such resources produces meaningful results and improves users' search experience through increased precision and recall. This paper presents our facet-enabled, WordNet-powered semantic search work done in the context of the bioenergy domain. The main hurdle to achieving the expected results was sense disambiguation, further complicated by the occasionally fine-grained distinction of term meanings in WordNet. To overcome this issue, this paper proposes a sense disambiguation methodology that uses bioenergy-domain ontologies (extracted automatically from WordNet), the WordNet concept hierarchy and term sense rank.
Keywords: semantic search; faceted search; faceted semantic search; Knowledge Base; WordNet; ontology; bioenergy.
Special Issue on: Advances in Cyber Security and Privacy of Big Data in Mobile and Cloud Computing
Interoperable Identity Management Protocol for Multi-Cloud Platform
by Tania Chaudhary, Sheetal Kalra
Abstract: Multi-cloud adaptive application provisioning promises to solve the data storage problem and enables interoperability of data within a multi-cloud environment. This also raises concerns about the interoperability of users across these computing domains. Although various standards and techniques have been developed to secure the identity of the cloud consumer, none of them provides a facility to interoperate while securing that identity. Thus, there is a need for an efficient authentication protocol that maintains a single unique identity for the cloud consumer and makes it interoperable among various cloud service providers. Elliptic curve cryptography (ECC) based algorithms are the best choice among public-key cryptography (PKC) algorithms due to their small key sizes and efficient computation. In this paper, a secure ECC-based mutual authentication protocol for cloud service provider servers, using a smart device and a one-time token, is proposed. The proposed scheme achieves mutual authentication and provides interoperability among multiple cloud service providers. Security analysis shows that the protocol is robust against known security attacks, and formal verification using the AVISPA tool proves its security in the presence of an intruder.
Keywords: Authentication; Cloud Computing; Elliptic Curve Cryptography; Multi-Cloud; One Time Token; Smart Device.
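The challenge-response shape of such a mutual-authentication handshake can be sketched in a simplified form. This is not the paper's protocol: ECC primitives are not in the Python standard library, so an HMAC over a shared credential stands in for the ECC-based computations, and the keys and challenges below are purely illustrative.

```python
import hashlib
import hmac
import secrets

def respond(key, challenge):
    """Answer a challenge with a keyed MAC (a stand-in here for the
    ECC-based computation an actual protocol would use)."""
    return hmac.new(key, challenge, hashlib.sha256).digest()

# Hypothetical credential shared at registration between the
# consumer's smart device and the cloud service provider.
shared_key = secrets.token_bytes(32)

# Server challenges the consumer ...
c1 = secrets.token_bytes(16)
consumer_resp = respond(shared_key, c1)
server_accepts = hmac.compare_digest(consumer_resp, respond(shared_key, c1))

# ... and the consumer challenges the server, making authentication mutual.
c2 = secrets.token_bytes(16)
server_resp = respond(shared_key, c2)
consumer_accepts = hmac.compare_digest(server_resp, respond(shared_key, c2))

# Only after both checks pass is a fresh one-time token issued.
one_time_token = secrets.token_hex(16) if (server_accepts and consumer_accepts) else None
print(one_time_token is not None)  # → True
```

The fresh random challenges are what prevent replay: a captured response is useless for any later handshake, which is also why the proposed scheme pairs authentication with a one-time token.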