Forthcoming articles

 


International Journal of Big Data Intelligence

 

These articles have been peer-reviewed and accepted for publication in IJBDI, but are pending final changes, are not yet published and may not appear here in their final order of publication until they are assigned to issues. Therefore, the content conforms to our standards but the presentation (e.g. typesetting and proof-reading) is not necessarily up to the Inderscience standard. Additionally, titles, authors, abstracts and keywords may change before publication. Articles will not be published until the final proofs are validated by their authors.

 

Forthcoming articles must be purchased for the purposes of research, teaching and private study only. These articles can be cited using the expression "in press". For example: Smith, J. (in press). Article Title. Journal Title.

 


 

International Journal of Big Data Intelligence (30 papers in press)

 

Regular Issues

 

  • Improving execution speed of Incremental runs of MapReduce using Provenance
    by Anu Chacko, Madhu S, Madhu Kumar S D, Anish Gupta 
    Abstract: Hadoop MapReduce is an analytic tool used to solve big data problems that are parallelizable. When the input data changes, the same job is usually rerun to produce fresh results, and often the change consists of appending new data to the existing input file. In such a rerun, if we can reuse the output of the previous run and limit job execution to the new data, we can reduce the overall job execution time. In the literature, some schemes use memoization, storage of intermediate results, and similar techniques to implement efficient incremental reruns. In this paper, we explain how provenance can be used to implement transparent incremental MapReduce for append-only input files (a minimal sketch of the reuse idea follows below). Our approach requires no additional storage and no modification of the existing Hadoop infrastructure or scheduler. Experimental evaluation of MapReduce on a multi-node cluster with provenance stored in HBase gave good results for incremental runs when new files or new data were added.
    Keywords: Hadoop MapReduce; Provenance; HBase; Incremental MapReduce.
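
    The reuse idea can be shown with a toy, single-process word count in which a file offset plays the role of the provenance record. This is a hedged sketch, not the authors' Hadoop-based system; the function and variable names are invented for illustration.

      from collections import Counter

      def incremental_wordcount(path, prev_counts=None, prev_offset=0):
          # Reuse the previous run's counts and scan only the bytes appended
          # since the recorded offset (our stand-in for provenance).
          counts = Counter(prev_counts or {})
          with open(path) as f:
              f.seek(prev_offset)
              for line in f:
                  counts.update(line.split())
              new_offset = f.tell()
          return counts, new_offset

      # First run covers the whole file; after an append, the second run
      # touches only the new tail:
      #   counts, off = incremental_wordcount("input.txt")
      #   counts, off = incremental_wordcount("input.txt", counts, off)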

  • A Big Data Based RF Localization Method for Unmanned Search and Rescue
    by Ju Wang 
    Abstract: Autonomous mobile robots require efficient big-data methods to process large amounts of real-time sensory data. We investigate a novel RF-sensing method for target localization in which a large set of sensor data is mined to produce meaningful location information about a target device. The estimated location of the target is then used by the navigation algorithm to execute a movement plan. Using networked RF beacon data, the proposed big data approach alleviates the problem of noisy RF measurements in location estimation. A particle filter algorithm is used to track the location of the target node (a minimal particle-filter sketch follows below). The algorithm demonstrates beyond-the-grid accuracy even when only a coarse RF map is used.
    Keywords: RF mapping; robot localization; navigation; measurement mining.
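
    A minimal 1D particle filter over RSSI-like range measurements illustrates the tracking step. The motion and measurement models below are invented simplifications of the paper's 2D RF-map formulation.

      import math, random

      def log_distance_rssi(d, tx_power=-40.0, n=2.0):
          # idealized path-loss model: RSSI falls off with log distance
          return tx_power - 10.0 * n * math.log10(max(d, 0.1))

      def pf_step(particles, beacon, z, motion_std=0.5, noise=4.0):
          particles = [p + random.gauss(0.0, motion_std) for p in particles]  # predict
          weights = [math.exp(-((z - log_distance_rssi(abs(p - beacon))) ** 2)
                              / (2 * noise ** 2)) for p in particles]         # weight
          total = sum(weights) or 1.0
          weights = [w / total for w in weights]
          return random.choices(particles, weights=weights, k=len(particles)) # resample

      random.seed(0)
      particles = [random.uniform(0.0, 20.0) for _ in range(500)]
      true_pos, beacon = 7.0, 0.0
      for _ in range(30):
          z = log_distance_rssi(abs(true_pos - beacon)) + random.gauss(0.0, 2.0)
          particles = pf_step(particles, beacon, z)
      print(sum(particles) / len(particles))   # estimate close to 7.0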

  • Research on encryption strategy in large data environment based on proxy re-encryption
    by Qiang Zhan, Jingtao Su, Yuelong Hu 
    Abstract: Human society has been experiencing an unprecedented revolution driven by the rapid development of information technology in recent years. In the open and distributed environment of big data, data encryption has become an important issue. To address the security of big data storage, we study encryption algorithms for big data and design a proxy re-encryption scheme using random-number encryption. We set out the main steps of the proxy re-encryption scheme and design the main encryption algorithm for each of its parts. The scheme ensures that the big data server cannot obtain the plaintext information. Finally, we implement the proxy re-encryption scheme using the JPBC library and related techniques, demonstrating its availability and feasibility (a toy re-encryption sketch follows below). We then put forward a network information security encryption solution for the big data environment.
    Keywords: big data; information security; proxy re-encryption; bilinear pairing.
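
    The paper's construction is pairing-based (hence JPBC) and is not reproduced here. As a rough illustration of what proxy re-encryption achieves, the classic ElGamal-style BBS'98 scheme fits in a few lines over a small Schnorr group; the re-encryption key is derived from both parties' secrets (that scheme is bidirectional), and all parameters are toy values.

      import secrets

      q = 1019                    # prime order of the subgroup
      p = 2 * q + 1               # safe prime, p = 2039
      g = 4                       # generator of the order-q subgroup of Z_p*

      def keygen():
          return secrets.randbelow(q - 1) + 1

      def encrypt(sk_a, m):       # encryption under Alice's key
          r = secrets.randbelow(q - 1) + 1
          return pow(g, sk_a * r, p), (m * pow(g, r, p)) % p

      def rekey(sk_a, sk_b):      # re-encryption key Alice -> Bob
          return (sk_b * pow(sk_a, -1, q)) % q

      def reencrypt(rk, ct):      # proxy transforms; never sees m
          c1, c2 = ct
          return pow(c1, rk, p), c2

      def decrypt(sk, ct):
          c1, c2 = ct
          g_r = pow(c1, pow(sk, -1, q), p)
          return (c2 * pow(g_r, -1, p)) % p

      a, b = keygen(), keygen()
      ct = encrypt(a, 42)
      assert decrypt(b, reencrypt(rekey(a, b), ct)) == 42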

  • A Big Data Analytics Framework for Border Crossing Transportation
    by Haibo Wang, Da Huo, Yaquan Xu 
    Abstract: In this paper, the authors present a framework for developing a comprehensive system to analyze border crossing transportation using an open-source meta-data acquisition and aggregation tool. It is a platform integration approach based on Hadoop, MapReduce and MongoDB to consolidate databases from both the U.S. and Mexico. We design a data-driven XML schema for tagging data entries from different sources with different formats, and implement a package in the open-source software R to aggregate the XML-transformed data along time and space dimensions. The transformed data is then analyzed with a Difference-in-Difference (DiD) estimation model (illustrated below) to understand the behavior of border crossing transportation.
    Keywords: Big Data Analytics; Border Crossing Transportation; Difference-in-Difference Estimation
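
    A two-period DiD estimate reduces to one line of arithmetic. The numbers below are invented; the paper fits its model on aggregated border-crossing series.

      # treated vs. control outcomes before/after an intervention
      def did(treat_pre, treat_post, ctrl_pre, ctrl_post):
          return (treat_post - treat_pre) - (ctrl_post - ctrl_pre)

      # e.g. mean monthly crossings at a treated port of entry versus a
      # comparable control port (hypothetical values):
      effect = did(treat_pre=1000.0, treat_post=1150.0,
                   ctrl_pre=980.0, ctrl_post=1020.0)
      print(effect)   # 110.0 crossings attributable to the intervention,
                      # under the parallel-trends assumption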

  • Composition and Verification of Student-oriented Courses
    by Naseem Ibrahim 
    Abstract: In the last few years, the popularity of online degrees has increased dramatically. In current online degrees, the school specifies the courses required to obtain a degree, and for each course the instructor specifies the course elements, including the teaching method and assessments. But different students have different capabilities and constraints, while most institutions provide the same courses to everyone. A student should be able to select the course that best matches their capabilities and constraints, as long as it satisfies the required course outcomes. To achieve this goal, we propose the use of Service-Oriented Architecture (SOA): we introduce an extended service-oriented architecture and an extended service definition, which enable the specification and provision of student-oriented courses. We also propose a formal composition approach, and to formally verify the result of the composition we introduce a formal verification approach using the model checking tool UPPAAL.
    Keywords: Student-oriented; SOA; Context; Service Model; UPPAAL.

  • S3R: Storage-Sensitive Services Redeployment in the Cloud
    by Huining Yan, Yiming Zhang, Huaimin Wang, Bo Ding, Haibo Mi 
    Abstract: Services redeployment is one of the critical techniques for energy efficiency in cloud data centers. In recent years, cloud providers have been offering local storage for cloud services (e.g., Amazon EC2, Aliyun ECS and RDS), since it delivers better performance at an identical price. Nevertheless, because it is often assumed that cloud services use shared storage only, most existing work on redeploying cloud services does not consider the problems introduced by local storage: much more data (stored locally) must be migrated, consuming much more migration time and network bandwidth. Meanwhile, instance migration is a costly operation, so the total number of migrated instances must also be taken into account. However, the total migrated data size and the percentage of migrated instances are often not in accord, and a tradeoff must be made between them. To address this problem, this paper proposes S3R, a storage-sensitive services redeployment approach. A key issue in services redeployment is selecting the servers to release, i.e., those whose instances are redeployed elsewhere, and S3R focuses on strategies for this selection. S3R first builds a tradeoff model to estimate the release cost of each server, i.e., the releasing priority of the servers, and then adopts an FFD-based heuristic algorithm to migrate/redeploy service instances (a minimal FFD sketch follows below). Evaluation results on production traces demonstrate the effectiveness of S3R.
    Keywords: Cloud Computing; Energy Efficiency; Storage-Sensitive; Services Redeployment.
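
    The packing step can be illustrated with a minimal first-fit-decreasing routine; the release-cost model that orders servers for release is simplified away here, and names and capacities are invented.

      def ffd_redeploy(instances, servers):
          """instances: {name: demand}; servers: {name: free_capacity}."""
          placement = {}
          # largest demand first -- the "decreasing" in FFD
          for inst, demand in sorted(instances.items(), key=lambda kv: -kv[1]):
              for srv in servers:                  # first fit
                  if servers[srv] >= demand:
                      servers[srv] -= demand
                      placement[inst] = srv
                      break
              else:
                  raise RuntimeError(f"no server can host {inst}")
          return placement

      print(ffd_redeploy({"db": 4, "web": 2, "cache": 3},
                         {"s1": 5, "s2": 6}))
      # {'db': 's1', 'cache': 's2', 'web': 's2'}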

  • Large-Scale Spectral Clustering for Managing Big Data in Healthcare Operations
    by Maoqing Liu, Nasser Fard, Keivan Sadeghzadeh 
    Abstract: Healthcare industries have access to a large volume and variety of data about patients' behaviors, diseases, and treatments. There is a significant need for a data-driven system to discover patterns for a better understanding of the impact of human risk behaviors on numerous diseases. In order to discover and extract interesting knowledge and patterns from large amounts of data, a data mining process for discovering knowledge from unprocessed and raw healthcare data is studied. Methods for the analysis of big data, and the role and types of clustering methods, are presented. An in-depth analysis of the spectral clustering method as a superior clustering algorithm for big health care data is presented (a generic sketch follows below). The spectral clustering algorithm is applied to a large health care data set from the Behavioral Risk Factor Surveillance System (BRFSS), partitioning the untrained data into at least four clusters. The MATLAB
    Keywords: big data; healthcare; spectral clustering; visualization.
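
    A generic scikit-learn sketch of the clustering step on synthetic stand-in data (the paper's experiments use BRFSS survey records, not reproduced here):

      import numpy as np
      from sklearn.cluster import SpectralClustering
      from sklearn.datasets import make_blobs

      X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # stand-in data
      labels = SpectralClustering(
          n_clusters=4,            # the study partitions into at least four clusters
          affinity="rbf",          # similarity graph from an RBF kernel
          assign_labels="kmeans",
          random_state=0,
      ).fit_predict(X)
      print(np.bincount(labels))   # cluster sizes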

  • Data Partition Optimization for Column-Family NoSQL databases
    by Meng-Ju Hsieh, Li-Yung Ho, Jan-Jan Wu, Pangfeng Liu 
    Abstract: Data conversion has become an emerging topic in the big data era. To face the challenge of rapid data growth, legacy relational databases often need to be converted into NoSQL column-family databases in order to achieve better scalability. The conversion from SQL to NoSQL databases requires combining small, normalized SQL data tables into larger NoSQL data tables, a process called denormalization. A challenging issue in data conversion is how to group the denormalized columns of a large data table into "families" so as to ensure query processing performance. In this paper, we propose an efficient heuristic algorithm, GPA (Graph-based Partition Algorithm), to address this problem (a simplified greedy variant is sketched below). We use the TPC-C and TPC-H benchmarks to demonstrate that the column families produced by GPA are very efficient for large-scale data processing.
    Keywords: vertical partition; column partition; column family; NoSQL database.
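
    Not the authors' GPA algorithm, but a simplified greedy variant over the same structure: build a column co-access graph from the query workload and grow families from the heaviest edges. The workload is invented.

      from collections import defaultdict
      from itertools import combinations

      queries = [("name", "email"), ("name", "address"), ("order_id", "total")]

      weight = defaultdict(int)              # co-access counts between columns
      for cols in queries:
          for a, b in combinations(sorted(cols), 2):
              weight[(a, b)] += 1

      families, assigned = [], {}
      for (a, b), _ in sorted(weight.items(), key=lambda kv: -kv[1]):
          fa, fb = assigned.get(a), assigned.get(b)
          if fa is None and fb is None:      # start a new family
              families.append({a, b}); assigned[a] = assigned[b] = families[-1]
          elif fa is None:
              fb.add(a); assigned[a] = fb
          elif fb is None:
              fa.add(b); assigned[b] = fa

      print(families)   # e.g. [{'name', 'email', 'address'}, {'order_id', 'total'}]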

  • Improving Straggler Task Performance in a Heterogeneous MapReduce Framework Using Reinforcement Learning
    by Srinivas Naik Nenavath, Atul Negi, V.N. Sastry 
    Abstract: MapReduce is one of the most significant distributed and parallel processing frameworks for large-scale data-intensive jobs proposed in recent times. Intelligent scheduling decisions can potentially help in significantly reducing the overall runtime of jobs. It is observed that the total time to completion of a job gets extended because of some slow tasks, and especially in heterogeneous environments job completion times do not synchronize. As originally conceived, the default MapReduce scheduler was not very effective at identifying slow tasks. In the literature, the Longest Approximate Time to End (LATE) scheduler extends speculative execution to heterogeneous environments, but it has limitations in properly estimating the progress of tasks, taking a static view of task progress. In this paper, we propose a novel reinforcement learning based MapReduce scheduler for heterogeneous environments called the MapReduce Reinforcement Learning (MRRL) scheduler. It observes the system state of task execution and suggests speculative re-execution of slower tasks on available nodes in the heterogeneous cluster, without assuming any prior knowledge of the environment's characteristics (a toy Q-learning sketch follows below). Experimental results show consistent improvements in performance compared to the LATE and Hadoop default schedulers for different workloads of the HiBench benchmark suite.
    Keywords: MapReduce; Reinforcement Learning; Speculative Execution; Task Scheduler; Heterogeneous Environments.
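
    A toy Q-learning update for the wait-versus-speculate decision illustrates the reinforcement learning machinery; the states, actions and reward here are invented simplifications of MRRL's design.

      import random

      actions = ["wait", "speculate"]
      Q = {}                                   # (state, action) -> value
      alpha, gamma, eps = 0.1, 0.9, 0.1

      def choose(state):
          if random.random() < eps:            # epsilon-greedy exploration
              return random.choice(actions)
          return max(actions, key=lambda a: Q.get((state, a), 0.0))

      def update(state, action, reward, next_state):
          best_next = max(Q.get((next_state, a), 0.0) for a in actions)
          old = Q.get((state, action), 0.0)
          Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)

      # states could discretize a task's progress rate ("slow"/"normal");
      # reward could be the observed reduction in job completion time.
      update("slow", "speculate", reward=1.0, next_state="normal")
      print(Q)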

  • An Adaptive Memory Tuning Strategy with High Performance for Spark
    by Di Chen, Haopeng Chen, Zhipeng Jiang, Yao Zhao 
    Abstract: With the rapid development of the Internet, people put more and more focus on data, which contain much information and are of great value. To achieve better performance in data analysis, in-memory computing has become more and more popular, and Spark [1] is a successful example of improving computing performance through it. However, how to make full use of memory resources is still a problem for Spark. In this paper, we present an adaptive memory tuning strategy for Spark that selects data compression and serialization dynamically so as to use fewer resources and process data faster (a simplified decision rule is sketched below). We derive the strategy for selecting the optimal data compression and serialization mathematically. It chooses a proper memory tuning strategy according to resource usage and obtains good performance in applications that persist data frequently.
    Keywords: Spark; In-memory Computing; Data Persisting; Data Caching; Memory Tuning.
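
    As a hedged illustration of the kind of decision rule such a strategy produces, consider choosing among Spark's built-in storage levels from observed resource usage. The thresholds below are invented, not the paper's derived optimum.

      def choose_persist_level(mem_free_frac, cpu_idle_frac):
          if mem_free_frac < 0.05:
              return "MEMORY_AND_DISK_SER"   # spill to disk when RAM is exhausted
          if mem_free_frac < 0.2 and cpu_idle_frac > 0.3:
              return "MEMORY_ONLY_SER"       # serialize (and compress) to save RAM
          return "MEMORY_ONLY"               # plain objects: fastest to access

      print(choose_persist_level(0.15, 0.5))   # -> MEMORY_ONLY_SER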

  • A Customized Automata Algorithm and Toolkit for Language Learning and Application
    by Ruoyu Wang, Guoqiang Li, Jianwen Xiang 
    Abstract: Automata are abstract computing machines. They play a basic role in computability theory and programming language theory, and they are widely used in programming language compilers as token scanners and syntactic analysers. More recently, in data analytics, data automata have become a formal way to represent pipelines and workflows. In research involving automata, however, there are many situations where a practitioner has to build a new automaton by hand, which causes a lot of redundant work rebuilding the framework of an automaton. Moreover, when researchers need to present their ideas and discuss new algorithms, it is extremely hard for them to switch among different styles of code, not to mention modifying parts of others' programs. To solve this problem, we propose a new toolkit, CAT, which provides a simple and unified framework for automaton construction and customization. We examined eight prevailing types of automata and decomposed their logical structure, extracting similar semantics. Behavioural similarities are also compared and taken into consideration to generate a hierarchical framework, with each of the eight specific types of automata implemented as a leaf node in the tree-like structure. Several calculus algorithms are implemented according to the theoretical results and designed as overloaded operators, which simplifies and visualises the source code written by end users (the operator style is sketched below). To test the correctness and performance of the toolkit, several bare automata were constructed and put into calculation, and a simple text retrieval program was implemented with CAT and compared with the well-known tool grep on Ubuntu Linux. The results show that CAT realises most of its design goals: the calculus is correct and the framework provides a universal solution for various types of automata. CAT also presents a more illustrative way of writing code for automata construction and calculation.
    Keywords: Automata; Customize; C++.
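
    CAT itself is a C++ toolkit spanning eight automata types; as a small Python stand-in, a DFA class with an overloaded '&' operator (product construction for intersection) shows the operator-as-calculus style the abstract describes.

      class DFA:
          def __init__(self, states, alphabet, delta, start, accept):
              self.states, self.alphabet = states, alphabet
              self.delta, self.start, self.accept = delta, start, accept

          def accepts(self, word):
              q = self.start
              for ch in word:
                  q = self.delta[(q, ch)]
              return q in self.accept

          def __and__(self, other):            # L(A) & L(B) via product construction
              delta = {((p, q), ch): (self.delta[(p, ch)], other.delta[(q, ch)])
                       for p in self.states for q in other.states
                       for ch in self.alphabet}
              accept = {(p, q) for p in self.accept for q in other.accept}
              states = {(p, q) for p in self.states for q in other.states}
              return DFA(states, self.alphabet, delta,
                         (self.start, other.start), accept)

      # even number of 'a's AND ends with 'b'
      even_a = DFA({0, 1}, {"a", "b"},
                   {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1}, 0, {0})
      ends_b = DFA({0, 1}, {"a", "b"},
                   {(0, "a"): 0, (1, "a"): 0, (0, "b"): 1, (1, "b"): 1}, 0, {1})
      both = even_a & ends_b
      print(both.accepts("aab"), both.accepts("ab"))   # True False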

  • Hybrid Neural Network and Bi-criteria Tabu-machine: comparison of new approaches to Maximum Clique Problem
    by Eduard Babkin, Tatiana Babkina, Alexander Demidovskij 
    Abstract: This paper presents two new approaches to solving a classical NP-hard problem, maximum clique (MCP), which frequently arises in the domain of information management, including the design of database structures and big data processing. In our research we focus on solving this problem using the paradigm of artificial neural networks. The first approach combines the artificial neural network paradigm with genetic programming: to boost the convergence of the Hopfield Neural Network (HNN), we propose a specific design of the genetic algorithm as the selection mechanism for the terms of the HNN energy function. The second approach incorporates and extends the tabu-search heuristic, improving the performance of the network dynamics of the so-called Tabu machine; the introduction of a special penalty function in the Tabu machine facilitates better evaluation of the search space (a generic tabu-search sketch follows below). We demonstrate the proposed approaches on well-known experimental graphs and formulate two hypotheses for further research.
    Keywords: Maximum Clique Problem; Data structures; Hopfield Network; Genetic Algorithm; Tabu Machine.
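
    A compact generic tabu-search loop for maximum clique conveys the search dynamics; the paper's Tabu machine adds a penalty function and neural network dynamics not reproduced here, and the graph below is a toy.

      import random

      def tabu_max_clique(adj, iters=1000, tenure=7, seed=0):
          rng = random.Random(seed)
          best, clique, tabu = set(), set(), {}
          for t in range(iters):
              # candidates adjacent to every vertex of the current clique
              cand = [v for v in adj if v not in clique
                      and clique <= adj[v] and tabu.get(v, -1) < t]
              if cand:
                  clique.add(rng.choice(cand))        # grow the clique
              elif clique:
                  v = rng.choice(sorted(clique))      # shrink; mark vertex tabu
                  clique.remove(v)
                  tabu[v] = t + tenure
              if len(clique) > len(best):
                  best = set(clique)
          return best

      adj = {1: {2, 3, 4, 5}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {1, 2, 3}, 5: {1}}
      print(tabu_max_clique(adj))   # typically {1, 2, 3, 4}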

  • Algorithms for Fast Estimation of Social Network Centrality Measures
    by Ashok Kumar, R. Chulaka Gunasekara, Kishan Mehrotra, Chilukuri Mohan 
    Abstract: Centrality measures are extremely important in the analysis of social networks, with applications such as the identification of the most influential individuals for effective target marketing. Eigenvector centrality and PageRank are among the most useful centrality measures, but computing these measures can be prohibitively expensive for large social networks. This paper explores multiple approaches to reducing the computational effort required to compute relative centrality measures. First, we show that neural networks can be effective in learning and estimating the ordering of vertices in a social network based on these centrality measures. We show that the proposed neural network approach requires far less computational effort, and is faster, than early termination of the power iteration method used to compute the centrality measures (power iteration with early termination is sketched below). We also show that four features describing the size of the social network and two vertex-specific attributes suffice as inputs to the neural networks, requiring very few hidden neurons. We then focus on how network sampling can be used to reduce the running time for calculating the ordering of vertices, introducing the notion of degree centrality based sampling to reduce the running time of the key node identification problem. Finally, we propose an approach for incremental updating of centrality measures in dynamic networks.
    Keywords: Social network; Centrality; Eigenvector centrality; PageRank; Network sampling; Incremental updating.
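
    The baseline the neural estimator is compared against, power iteration with an early-termination tolerance, is a few lines of NumPy; the adjacency matrix below is a toy stand-in.

      import numpy as np

      def eigenvector_centrality(A, tol=1e-6, max_iter=1000):
          x = np.ones(A.shape[0]) / A.shape[0]
          for _ in range(max_iter):
              x_new = A @ x
              x_new /= np.linalg.norm(x_new)
              if np.linalg.norm(x_new - x) < tol:   # early termination
                  break
              x = x_new
          return x_new

      A = np.array([[0, 1, 1, 0],
                    [1, 0, 1, 0],
                    [1, 1, 0, 1],
                    [0, 0, 1, 0]], dtype=float)
      c = eigenvector_centrality(A)
      print(np.argsort(-c))    # vertex ranking; vertex 2 ranks first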

  • A Collective Matrix Factorization Approach to Social Recommendation with eWOM Propagation Effects
    by Ren-Shiou Liu 
    Abstract: In recent years, recommender systems have become an important tool for many online retailers to increase sales. Many of these recommender systems predict users' interests in products by using their browsing history or item rating records. However, many studies show that, before making a purchase, people often read online reviews and exchange opinions with friends in their social circles. The resulting electronic word-of-mouth (eWOM) has a huge impact on customers' purchase intentions. Nonetheless, most recommender systems in the current literature do not consider eWOM, let alone the effect of its propagation. Therefore, this paper proposes a new recommendation model based on the collective matrix factorization technique for predicting customer preferences (a simplified socially regularized factorization is sketched below). A series of experiments using data collected from Epinions and Yelp is conducted. The experimental results show that the proposed model significantly outperforms other closely related models by 5%-13% in terms of RMSE and MAE.
    Keywords: recommender systems; matrix factorization; collaborative filtering; electronic word-of-mouth; regularization.
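
    A simplified stand-in for the collective model: SGD matrix factorization with a social-regularization term that pulls a user's latent factors toward their friends', mimicking eWOM influence. Dimensions, ratings and the friend graph are invented.

      import numpy as np

      rng = np.random.default_rng(0)
      n_users, n_items, k = 4, 5, 3
      ratings = [(0, 1, 4.0), (0, 2, 3.0), (1, 1, 5.0), (2, 3, 2.0)]
      friends = {0: [1], 1: [0], 2: [], 3: []}        # toy social graph
      U = rng.normal(scale=0.1, size=(n_users, k))
      V = rng.normal(scale=0.1, size=(n_items, k))
      lr, lam, beta = 0.02, 0.05, 0.1                 # beta: social weight

      for epoch in range(200):
          for u, i, r in ratings:                     # rating reconstruction
              err = r - U[u] @ V[i]
              U[u] += lr * (err * V[i] - lam * U[u])
              V[i] += lr * (err * U[u] - lam * V[i])
          for u, fs in friends.items():               # social regularization
              if fs:
                  U[u] += lr * beta * (np.mean(U[fs], axis=0) - U[u])

      print(round(U[0] @ V[1], 2))   # reconstructed rating for user 0, item 1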

Special Issue on: Big Data Analytics, Infrastructure and Applications

  • Hybrid approach based support vector machine for electric load forecasting incorporating feature selection
    by Malek Sarhani, Abdellatif El Afia 
    Abstract: Forecasting future electricity demand is very important for the electric power industry. Several research works have shown that machine learning methods are useful for electric load forecasting (ELF), since electric load data are complex and non-linear. On the other hand, it is important to identify and remove irrelevant factors as a preprocessing step for ELF. Our objective in this paper is to investigate the importance of applying a feature selection approach to remove irrelevant factors of electric load. To this end, we introduce a hybrid machine learning approach that combines support vector machine (SVM) and particle swarm optimization (PSO) in both continuous and binary forms: the binary hybridization is used for feature selection (sketched below) and the continuous one for ELF. Experimental results demonstrate the feasibility of applying feature selection with SVM and PSO algorithms without decreasing the performance of the forecasting model.
    Keywords: Machine learning; electric load forecasting; feature selection; big data; support vector machine; particle swarm optimization.
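
    A compact binary-PSO feature-selection loop wrapped around a scikit-learn SVM illustrates the binary hybridization. This is a hedged stand-in: it uses a classification dataset and SVC for brevity, whereas the paper targets load forecasting (regression), and all hyperparameters are illustrative.

      import numpy as np
      from sklearn.datasets import load_breast_cancer
      from sklearn.model_selection import cross_val_score
      from sklearn.svm import SVC

      rng = np.random.default_rng(0)
      X, y = load_breast_cancer(return_X_y=True)
      n_particles, n_feat, iters = 8, X.shape[1], 10
      w, c1, c2 = 0.7, 1.5, 1.5

      pos = rng.integers(0, 2, (n_particles, n_feat))   # bit = feature kept
      vel = rng.normal(size=(n_particles, n_feat))

      def fitness(mask):
          if not mask.any():
              return 0.0
          return cross_val_score(SVC(), X[:, mask.astype(bool)], y, cv=3).mean()

      pbest = pos.copy()
      pbest_fit = np.array([fitness(p) for p in pos])
      gbest = pbest[pbest_fit.argmax()].copy()

      for _ in range(iters):
          r1, r2 = rng.random((2, n_particles, n_feat))
          vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
          pos = (rng.random((n_particles, n_feat)) < 1 / (1 + np.exp(-vel))).astype(int)
          fit = np.array([fitness(p) for p in pos])
          better = fit > pbest_fit
          pbest[better], pbest_fit[better] = pos[better], fit[better]
          gbest = pbest[pbest_fit.argmax()].copy()

      print(gbest.sum(), "features kept; CV accuracy", pbest_fit.max().round(3))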

  • CityPro: City-Surveillance Collaborative Platform
    by Mohamed Dbouk, Hamid Mcheick, Ihab Sbeity 
    Abstract: Day by day, modern cities face a big challenge in terms of public safety and security. City surveillance systems mostly use video surveillance techniques; they incorporate thousands of cameras and rely on high-speed networking infrastructures. Moreover, a city hosts multiple computerized standalone systems that operate independently of each other, e.g. banking systems, customs, and hospitals. These systems generate huge data sets, and the collected data constitute a gigantic mine of scattered information. A smart city is a city that intelligently benefits from such omnipresent systems. This paper presents an integrated platform for gathering the multiple existing systems in a city. The platform consists of a collaborative surveillance system, called CityPro. The proposed architecture is intended to protect and monitor people and public infrastructures, such as bridges, roads and buildings; it is designed to deal with and prevent abnormal activities like terrorist attacks. CityPro is expected to operate in live mode using the city's digital infrastructures. At the end of the paper, a typical case study is given, and challenges and future work are discussed.
    Keywords: Smart cities; Digital world; Event-driven process; Collaborative business process; Software architecture; Business-intelligence; Big-data.

  • NoSQL Databases for Big Data
    by Ahmed Oussous, Fatima-Zahra Benjelloun, Ayoub AIT LAHCEN, Samir Belfkih 
    Abstract: NoSQL solutions have been created to respond to many issues encountered when dealing with certain applications, e.g., the storage of very large data sets. Traditional RDBMSs ensure data integrity and transaction consistency, but at the cost of a rigid storage schema and complex management. Data integrity and consistency are certainly required in many cases, such as financial applications, but they are not always needed. The goal of this paper is to establish a precise picture of NoSQL's evolution and mechanisms, as well as the advantages and disadvantages of the main NoSQL data models and frameworks. For this purpose, first, a deep comparison between SQL and NoSQL databases is presented, examining criteria such as scalability, performance, consistency, security, analytical capabilities and fault-tolerance mechanisms. Second, the four major types of NoSQL databases are defined and compared: key-value stores, document databases, column-oriented databases and graph databases. Third, we compare the main available technical solutions for each NoSQL data model.
    Keywords: NoSQL; Key-Value Databases; Document Databases; Column-Oriented Databases; Graph Databases; Big Data.

  • Throughput Enhancement of a Novel Hybrid-MAC Protocol for M2M Networks
    by Pawan Verma, Rajesh Verma, Arun Prakash, Rajeev Tripathi 
    Abstract: When M2M devices communicate with each other within a group or cluster without any human intervention, this is called inter-M2M communication. For this purpose, there is a critical requirement for a scalable medium access control (MAC) protocol to enable multiple M2M devices to access the channel. Contention- or reservation-based MAC protocols can be used for this task, but with many M2M devices, adaptability and scalability become bottlenecks. Therefore, in this paper, we propose a novel hybrid MAC protocol for a densely deployed M2M network. The protocol mainly consists of a contention interval (CI) and a data transmission interval (DTI). During the CI, all active M2M devices contend for channel access following the p-persistent carrier sense multiple access (CSMA) protocol (a toy simulation follows below); after contention, the successful devices win timeslots in the DTI, following the time division multiple access (TDMA) mechanism. Further, to accommodate more M2M devices within each TDMA time slot, we also propose a high-throughput MAC (HT-MAC) protocol. This protocol works with a single channel and a single transceiver, and facilitates spatial reuse to enable multiple M2M devices to access the channel simultaneously, consequently enhancing the throughput of the M2M network significantly. HT-MAC inserts additional access intervals (AAIs) between the transmission of control packets (RTS/CTS) and data packets (DATA/ACK). When M2M devices communicate using the proposed MAC protocol within each time slot during the DTI, a series of RTS/CTS exchanges between devices in the vicinity of the transmitting or receiving device schedules possible concurrent data transmissions. Simulation results show significant per-TDMA-time-slot throughput improvement compared to the IEEE 802.11 MAC protocol.
    Keywords: M2M; contention; ubiquitous; IEEE 802.11 DCF; MAC protocol.
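
    A toy slotted simulation of the contention interval shows the p-persistent rule: each active device transmits in a slot with probability p, and a slot with exactly one transmitter succeeds and earns a DTI timeslot. All parameters are illustrative.

      import random

      def contention_interval(n_devices, p=0.1, slots=50, seed=1):
          rng = random.Random(seed)
          waiting = set(range(n_devices))
          winners = []
          for _ in range(slots):
              tx = [d for d in waiting if rng.random() < p]
              if len(tx) == 1:                 # success: exactly one transmitter
                  winners.append(tx[0])
                  waiting.discard(tx[0])
              # len(tx) > 1 models a collision; len(tx) == 0 an idle slot
          return winners                       # devices granted DTI timeslots

      print(len(contention_interval(n_devices=30)))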

Special Issue on: E-Health Systems and Semantic Web

  • A Graph Traversal Attack on Bloom Filter Based Medical Data Aggregation
    by William Mitchell, Rinku Dewri, Ramakrishna Thurimella, Max Roschke 
    Abstract: We present a novel cryptanalytic method based on graph traversals to show that record linkage using Bloom filter encoding does not preserve privacy in a two-party setting. Bloom filter encoding (sketched below) is often suggested as a practical approach to medical data aggregation. Our attack is stronger than a simple dictionary attack in that it does not assume knowledge of the universe. The attack is very practical and produced accurate results when tested on large amounts of name-like data derived from a North Carolina voter registration database. We also give theoretical arguments showing that going from bigrams to n-grams, n > 2, does not increase privacy; on the contrary, it actually makes the attack more effective. Finally, some ways to resist this attack are suggested.
    Keywords: Bloom filter encoding; privacy-preserving record linkage; medical data aggregation; cryptanalysis; two-party linkage; private record linkage.
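
    The encoding under attack can be sketched in a few lines: map a name's padded bigrams into a bit array with K hash functions. The parameters here are illustrative, not the configuration analyzed in the paper.

      import hashlib

      M, K = 64, 3                      # filter bits, hash functions

      def bigrams(name):
          padded = f"_{name.lower()}_"
          return {padded[i:i + 2] for i in range(len(padded) - 1)}

      def bloom_encode(name):
          bits = 0
          for gram in bigrams(name):
              for k in range(K):
                  h = hashlib.sha1(f"{k}:{gram}".encode()).digest()
                  bits |= 1 << (int.from_bytes(h[:4], "big") % M)
          return bits

      a, b = bloom_encode("smith"), bloom_encode("smyth")
      overlap = bin(a & b).count("1") / bin(a | b).count("1")
      print(round(overlap, 2))          # Jaccard-style similarity of encodings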

Special Issue on: Big Data Management in Clouds: Opportunities, Issues, Challenges and Solutions

  • Semi-structured Data Analysis and Visualization using NoSQL
    by Srinidhi Hiriyannaiah, Siddesh G M, K.G. Srinivasa, Anoop P 
    Abstract: In the field of computing, huge amounts of data are created every day by scientific experiments, companies and users' activities. These large datasets are labelled "big data", presenting new challenges for computer science researchers and professionals in terms of storage, processing and analysis. Traditional relational database systems (RDBMS) with conventional searches cannot effectively handle such multi-structured data. NoSQL databases complement RDBMS in managing big data and facilitate further analysis of the data. In this paper, we introduce a framework that aims at analyzing semi-structured data applications using the NoSQL database MongoDB. The proposed framework focuses on the key aspects needed for semi-structured data analytics: data collection, data parsing and data prediction. The layers involved in the framework are the request layer, facilitating queries from the user; the input layer, which interfaces the data sources; the analytics layer; and the output layer, facilitating visualization of the analytics performed. A performance analysis of the select+fetch operations needed for analytics is carried out for MySQL and MongoDB, in which the NoSQL database MongoDB outperforms MySQL (a minimal aggregation example follows below). The proposed framework is applied to predicting the performance and monitoring of a cluster of servers.
    Keywords: analytics; semi-structured data; big data analytics; cluster analytics; server performance monitoring; MongoDB; NoSQL analytics.
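
    A minimal PyMongo sketch of the store-then-aggregate pattern the framework relies on; the collection, fields and metrics are invented, and a local MongoDB instance is assumed.

      from pymongo import MongoClient

      client = MongoClient("mongodb://localhost:27017")
      metrics = client["monitoring"]["server_metrics"]
      metrics.drop()                     # keep the toy example repeatable

      metrics.insert_many([
          {"host": "web-1", "cpu": 0.72},
          {"host": "web-1", "cpu": 0.64},
          {"host": "db-1",  "cpu": 0.31},
      ])

      # average CPU per host -- the kind of select+fetch analytics benchmarked
      for row in metrics.aggregate([
          {"$group": {"_id": "$host", "avg_cpu": {"$avg": "$cpu"}}}
      ]):
          print(row["_id"], round(row["avg_cpu"], 2))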

  • Computation Migration: A new approach to execute big-data bioinformatics workflows
    by Rickey T. P. Nunes, Santosh L. Deshpande 
    Abstract: Bioinformatics workflows frequently access various distributed biological data sources and computational analysis tools for data analysis and knowledge discovery. They move large volumes of data from biological data sources to computational analysis tools, following the traditional data migration approach to workflow execution. However, with the advent of big data in bioinformatics, moving large volumes of data to the computation during workflow execution is no longer feasible. Considering that the size of biological data is continuously growing and is much larger than the size of the computational analysis tools, moving computation to the data in a workflow is a better way to handle the growing data (a back-of-the-envelope comparison follows below). In this paper, we therefore propose a computation migration approach to executing bioinformatics workflows. We move computational analysis tools to the data sources during workflow execution and demonstrate with workflow patterns that moving computation instead of data yields high performance gains in terms of data flow and execution time.
    Keywords: Big-data; Bioinformatics; Workflows; Orchestration; Computation migration.
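
    The core argument is quantitative: tool binaries are orders of magnitude smaller than the data they analyze. A back-of-the-envelope comparison makes the point; all sizes and the bandwidth are invented for illustration.

      tool_size_mb = 50            # e.g. an analysis tool package (assumed)
      data_size_mb = 200_000       # a large sequence archive (assumed)
      bandwidth_mb_s = 100         # network throughput (assumed)

      data_migration_s = data_size_mb / bandwidth_mb_s          # move data to tool
      computation_migration_s = tool_size_mb / bandwidth_mb_s   # move tool to data
      print(data_migration_s, computation_migration_s)          # 2000.0 vs 0.5 seconds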

  • Parallel Computing For Preserving Privacy Using k-anonymization Algorithms from Big Data
    by Sharath Yaji, Neelima.B Reddy 
    Abstract: For many organizations, preserving privacy for big data is still a major challenge. Big data analysis can be optimized through parallel computation. This paper proposes parallelizing k-anonymization algorithms, supported by a comparative study and survey. The main k-anonymization algorithms considered for study and comparison are MinGen, Datafly, Incognito and Mondrian. It is noted that as the data size increases, the parallel versions of the algorithms perform better than their sequential counterparts. For small data sets, sequential MinGen is 71.83% faster than its parallel version, although overall Datafly performed well in sequential mode and Incognito in parallel mode. For large data sets, parallel Incognito is 101.186% faster than its serial version; overall, MinGen and Datafly performed well in sequential mode, and Incognito, Datafly and MinGen performed well in parallel mode. The paper acts as a single point of reference for choosing big data mining k-anonymization algorithms (a minimal k-anonymity check is sketched below) and gives a direction for applying HPC concepts such as parallelization to privacy preserving algorithms. As future work, the authors plan to port these algorithms onto heterogeneous architectures such as graphics processing units (GPUs) as applicable.
    Keywords: Big Data; k-anonymization; privacy preserving in Big Data analysis; parallel computing for Big Data.
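
    All four surveyed algorithms enforce the same target property; a minimal check of k-anonymity over quasi-identifier columns shows what is being verified. The records, column names and k are toy values.

      from collections import Counter

      def is_k_anonymous(records, quasi_ids, k):
          # every combination of quasi-identifier values must occur >= k times
          groups = Counter(tuple(r[q] for q in quasi_ids) for r in records)
          return all(count >= k for count in groups.values())

      records = [
          {"zip": "462**", "age": "20-30", "disease": "flu"},
          {"zip": "462**", "age": "20-30", "disease": "cold"},
          {"zip": "479**", "age": "30-40", "disease": "flu"},
      ]
      print(is_k_anonymous(records, ["zip", "age"], k=2))   # False: one group of 1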

Special Issue on: Data to Decision

  • Predicting Baseline for Analysis of Electricity Pricing
    by Taehoon Kim, Jaesik Choi, Dongeun Lee, Alex Sim, C. Anna Spurlock, Annika Todd, Kesheng Wu 
    Abstract: To understand the impact of a new pricing structure on residential electricity demand, we need a baseline model that captures every factor other than the new price. The gold standard baseline is a randomized control group; however, a good control group is hard to design, and it can only serve as a baseline for a group, not for any individual household. To overcome these shortcomings, we develop a number of techniques that can predict hourly usage years ahead. To capture the fact that daily electricity demand peaks a few hours after the temperature reaches its peak, existing methods rely on lagged variables, such as the usage from a day ago and a week ago. When making predictions years into the future, however, the values from a week ago are also in the future and therefore unknown. In this work, we develop a continuous prediction strategy that first forecasts the lagged variables and then uses them in further predictions, but we find that the prediction error increases over time. Based on an observed linear relationship between temperature and aggregate power (LTAP), we design a new prediction method, named LTAP, that avoids this accumulation of prediction errors when forecasting the usage of each household (the linear relationship is sketched below). In our tests, the average usage predicted by LTAP matches the control group for both summers covered by the test data. This suggests that we may be able to use LTAP predictions in future studies.
    Keywords: baseline model; residential electricity consumption; outdoor temperature; gradient tree boosting; electricity rate scheme.
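
    The LTAP method rests on a linear temperature-to-aggregate-power relationship; a least-squares fit of that relationship on invented data shows the shape of the baseline predictor (the full method additionally handles lagged variables and per-household effects).

      import numpy as np

      temp = np.array([24.0, 27.0, 30.0, 33.0, 36.0])     # outdoor temperature, C
      power = np.array([1.1, 1.4, 1.8, 2.1, 2.5])         # aggregate power, kW

      slope, intercept = np.polyfit(temp, power, 1)        # linear fit
      predict = lambda t: slope * t + intercept
      print(round(predict(31.5), 2))                       # baseline usage at 31.5 C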

  • Sign Language Recognition in Complex Background Scene Based on Adaptive Skin Color Modeling and Support Vector Machine
    by Tse-Yu Pan, Li-Yun Lo, Chung-Wei Yeh, Jhe-Wei Li, Hou-Tim Liu, Min-Chun Hu 
    Abstract: With the advances in wearable cameras, users can record first-person-view videos for gesture recognition, or even sign language recognition, to help deaf or hard-of-hearing people communicate with others. In this paper, we propose a purely vision-based sign language recognition system that can be used in complex background scenes. We design an adaptive skin color modeling method for hand segmentation so that the hand contour can be derived accurately even when different users use the system in various lighting conditions (a fixed-threshold baseline is sketched below). Four kinds of feature descriptors are integrated to describe the contours and salient points of hand gestures, and a Support Vector Machine (SVM) is applied to classify them. Our recognition method is evaluated on two datasets: (1) the CSL dataset collected by ourselves, in which images were captured in three different environments, including complex backgrounds; and (2) the public ASL dataset, in which images of the same gesture were captured in different lighting conditions. The proposed recognition method achieves accuracy rates of 100.0% and 94.0% on the CSL and ASL datasets, respectively.
    Keywords: Sign Language Recognition; Support Vector Machine; Human-Computer Interaction; Gesture Recognition.
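
    As a baseline for what the adaptive model improves upon, a fixed-range HSV skin mask can be written in a few lines; the threshold values are common rough defaults, not the paper's adaptive model.

      import numpy as np

      def skin_mask(hsv_image):
          # H in [0, 180) OpenCV-style, S and V in [0, 255]
          h, s, v = hsv_image[..., 0], hsv_image[..., 1], hsv_image[..., 2]
          return (h < 25) & (s > 40) & (s < 180) & (v > 60)

      demo = np.zeros((2, 2, 3), dtype=np.uint8)
      demo[0, 0] = (12, 120, 150)      # one skin-like pixel
      print(skin_mask(demo))           # True only at that pixel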

  • Detecting Spam Web Pages using Multilayer Extreme Learning Machine
    by Rajendra Kumar Roul 
    Abstract: Web spamming artificially raises the ranking of unimportant pages in search results. Detecting and eliminating such spam pages is the need of the day: they not only mislead the search engine, but also become a roadblock to obtaining high-quality information from the Web. Hence, spam page detection has become a vibrant area of research in the field of information retrieval. Aiming in this direction, this study focuses on two important aspects of machine learning. First, it proposes a new content-based spam detection technique that identifies nine important features which help to determine whether a page is spam or non-spam. Each feature has an associated value, calculated by parsing the documents and then performing the steps required to compute its score. These nine features, along with the class label (spam or non-spam), generate a feature vector for training the classifiers to detect spam pages. Second, it highlights the importance of deep learning with the multilayer extreme learning machine for spam page detection (a single-layer ELM is sketched below). For the experimental work, two benchmark datasets (WEBSPAM-UK2006 and WEBSPAM-UK2002) were used, and the results using the multilayer ELM are more promising than those of other established classifiers.
    Keywords: Content based; Deep Learning; Extreme Learning Machine; Multilayer ELM; Support Vector Machine; Spam Page.
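
    A basic single-hidden-layer ELM, the building block the multilayer ELM stacks, is short: random input weights, a nonlinear hidden layer, and output weights solved in one least-squares step. The data here is synthetic.

      import numpy as np

      rng = np.random.default_rng(0)
      X = rng.normal(size=(200, 9))                 # 9 features, as in the abstract
      y = (X[:, 0] + X[:, 1] > 0).astype(float)     # toy spam/non-spam label

      n_hidden = 50
      W = rng.normal(size=(X.shape[1], n_hidden))   # random input weights, never trained
      b = rng.normal(size=n_hidden)
      H = np.tanh(X @ W + b)                        # hidden-layer activations

      beta, *_ = np.linalg.lstsq(H, y, rcond=None)  # one-shot output weights
      pred = (np.tanh(X @ W + b) @ beta > 0.5)
      print("training accuracy:", (pred == y.astype(bool)).mean())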

  • An unsupervised service annotation by review analysis
    by Masafumi Yamamoto, Yuguan Xing, Toshihiko Yamasaki, Kiyoharu Aizawa 
    Abstract: With the increasing popularity of online review sites, users can write reviews of services they have used, in addition to reading reviews by other users to gain information about those services. However, the number of reviews for a service may be so large that it is almost impossible for users to read all of them in detail, and it is even more burdensome to compare multiple services. Users may also want to know about the unique and relatively praised features of services, yet few works solve this problem. Thus, useful tools for extracting the unique features of services are necessary so that users can easily and intuitively understand the quality of services and compare them. In this study, we present an unsupervised method for extracting the unique and detailed features of services and the users' opinions on these features. Using only the term frequency (TF), a method will extract only the general features (e.g., for restaurants, food and service), and many review sites already show users how these general features are evaluated; users, however, may want to know more detailed features and how they are evaluated. By using the term frequency-inverse document frequency (TF-IDF) weighting, our proposed method can extract, in particular, the praised or criticized features of a specific service (sketched below). We conducted evaluations from multiple viewpoints to show the validity of the proposed method, and implemented a graphical user interface to help users intuitively understand the results.
    Keywords: Service annotation; service profiling; review analysis; summarization.
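
    The TF-IDF step can be sketched directly with scikit-learn: terms frequent in one service's reviews but rare across services score highest. The reviews are invented, and the paper's opinion-extraction stage is omitted.

      from sklearn.feature_extraction.text import TfidfVectorizer

      reviews_per_service = [
          "great pasta great wine cozy terrace",        # service 0
          "fast service cheap burgers large portions",  # service 1
          "fresh sushi polite staff quiet rooms",       # service 2
      ]

      vec = TfidfVectorizer()
      tfidf = vec.fit_transform(reviews_per_service)
      terms = vec.get_feature_names_out()

      row = tfidf[0].toarray().ravel()                  # scores for service 0
      top = sorted(zip(row, terms), reverse=True)[:3]
      print([t for _, t in top])                        # its most distinctive terms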

  • Emotion Based Topic Impact on Social Media
    by Fernando Calderon Alvarado, Yi-Shin Chen 
    Abstract: The increasing use of micro-blogging sites has made them very rich data repositories. The information generated is dynamic by nature, tied to temporal conditions and the subjectivity of its users. Everyday life experiences, discussions and events have a direct impact on the behaviors reflected in social networks, and it has become important to assess to what degree these interactions affect a social group. One possibility is to analyze how impactful a topic is according to the behavior presented on a social network over time; it is then necessary to develop methods that contribute towards this task. Having identified a topic in social media, we can obtain a general summary of the emotions it generates over a social group. We then propose a topic impact score, assigned to each topic based on how these emotions transition, how long they span, and how many users they reach. This lays the groundwork for quantifying how impactful a topic is over a social group, specifically regarding events detected on Twitter.
    Keywords: Social Impact; Influence; Social Media; Emotion Analysis; Microblogs.

  • Document Stream Classification based on Transfer Learning using Latent Topics
    by Masato Shirai, Jianquan Liu, Takao Miura 
    Abstract: In this investigation, we propose a classification framework for document streams based on transfer learning using a latent intermediate domain. In a document stream, word frequencies change dramatically because of transitions between themes; to classify the stream, we capture new features and modify the classification criteria as the stream evolves. Transfer learning utilizes knowledge extracted from a source domain to analyze the target domain. We extract latent topics, based on a topic model, from unlabeled documents, and our approach connects the domains using these latent topics to classify documents (a generic topic-bridge sketch follows below). We capture changes in features by updating the intermediate domain during the document stream.
    Keywords: Transfer Learning; NMTF; Topic Model; Document Classification.
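
    A generic sketch of bridging two domains through latent topics, using scikit-learn's LDA as a stand-in topic model (the paper uses NMTF, which differs); the documents, labels and two-topic setting are invented for illustration.

      from sklearn.feature_extraction.text import CountVectorizer
      from sklearn.decomposition import LatentDirichletAllocation
      from sklearn.linear_model import LogisticRegression

      source = ["cheap flights to rome", "hotel booking deal", "flight delayed again"]
      labels = [1, 1, 0]
      target = ["train tickets discount"]

      # topics learned over both domains act as the shared intermediate space
      vec = CountVectorizer().fit(source + target)
      lda = LatentDirichletAllocation(n_components=2, random_state=0)
      topics_src = lda.fit_transform(vec.transform(source))

      clf = LogisticRegression().fit(topics_src, labels)
      print(clf.predict(lda.transform(vec.transform(target))))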

  • Sightseeing Value Estimation by Analyzing Geosocial Images
    by Yizhu Shen, Min Ge, Chenyi Zhuang, Qiang Ma 
    Abstract: Recommendation of points of interest (POIs) is drawing increasing attention to meet the growing demands of tourists, and thus a POI's quality (sightseeing value) needs to be estimated. In contrast to conventional studies that rank POIs on the basis of user behavior analysis, this paper presents methods to estimate quality by analyzing geo-social images. Our approach estimates sightseeing value from two aspects: (1) nature value and (2) culture value. For the nature value, we extract image features related to favorable human perception to verify whether a POI would satisfy tourists in terms of environmental psychology; three criteria are defined accordingly: coherence, imageability, and visual scale. For the culture value, we recognize the main cultural element (i.e., architecture) included in a POI. In the experiments, we applied our methods to real POIs and found that our approach assesses sightseeing value effectively.
    Keywords: Points of Interest; Sightseeing value; Geosocial image; human perception; image processing; UGC mining.

  • Ontology-based Faceted Semantic Search with Automatic Sense Disambiguation for Bioenergy Domain
    by Feroz Farazi, Craig Chapman, Pathmeswaran Raju, Lynsey Melville 
    Abstract: WordNet is a widely known lexicon used as an ontological resource, hosting a comparatively large collection of semantically interconnected words. The use of such resources produces meaningful results and improves users' search experience through increased precision and recall. This paper presents our facet-enabled, WordNet-powered semantic search work in the context of the bioenergy domain. The main hurdle to achieving the expected results was sense disambiguation, further complicated by the occasionally fine-grained distinction of term meanings in WordNet. To overcome this issue, this paper proposes a sense disambiguation methodology that uses bioenergy-related ontologies (extracted automatically from WordNet), the WordNet concept hierarchy, and term sense rank (sense ranking is illustrated below).
    Keywords: semantic search; faceted search; faceted semantic search; Knowledge Base; WordNet; ontology; bioenergy.
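
    Term sense rank is directly available from WordNet, which orders a word's synsets by sense frequency; a small NLTK snippet shows the ordering for an ambiguous, bioenergy-relevant word. It assumes nltk is installed and the wordnet corpus has been downloaded (nltk.download("wordnet")).

      from nltk.corpus import wordnet as wn

      # senses are returned in rank order (most frequent first)
      for rank, synset in enumerate(wn.synsets("plant", pos=wn.NOUN), start=1):
          print(rank, synset.name(), "-", synset.definition())

      # a domain ontology can then prefer the sense whose hypernym path
      # passes through a bioenergy-related concept, breaking ties by rank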