International Journal of Big Data Intelligence (29 papers in press)
A Big Data Based RF Localization Method for Unmanned Search and Rescue
by Ju Wang
Abstract: Autonomous mobile robots require efficient big-data methods to process large amounts of real-time sensory data to perform a task. We investigate a novel RF-sensing-based method for target localization in which a large set of sensor data is mined to produce meaningful location information about a target device. The estimated location of the target is then used by the navigation algorithm to execute a movement plan. Using networked RF beacon data, the proposed big data approach alleviates the problem of noisy RF measurements in location estimation. A particle filter algorithm is used to track the location of the target node. The algorithm demonstrates beyond-the-grid accuracy even when only a coarse RF map is used.
Keywords: RF mapping, Robot localization, Navigation, Measurement mining
Improving Straggler Task Performance in a Heterogeneous MapReduce Framework Using Reinforcement Learning
by Srinivas Naik Nenavath, Atul Negi, V.N. Sastry
Abstract: MapReduce is one of the most significant distributed and parallel processing frameworks for large-scale data-intensive jobs proposed in recent times. Intelligent scheduling decisions can potentially help in significantly reducing the overall runtime of jobs. It is observed that the total time to completion of a job is extended by a few slow tasks. Especially in heterogeneous environments, task completion times do not synchronize. As originally conceived, the default MapReduce scheduler was not very effective at identifying slow tasks. In the literature, the Longest Approximate Time to End (LATE) scheduler extends speculative execution to heterogeneous environments, but it has limitations in properly estimating the progress of tasks because it takes a static view of task progress. In this paper, we propose a novel Reinforcement Learning based MapReduce scheduler for heterogeneous environments called the MapReduce Reinforcement Learning (MRRL) scheduler. It observes the system state of task execution and suggests speculative re-execution of the slower tasks on available nodes in the heterogeneous cluster without assuming any prior knowledge of the environmental characteristics. The experimental results show consistent improvements in performance compared to the LATE and Hadoop default schedulers for different workloads of the Hi-Bench benchmark suite.
Keywords: MapReduce; Reinforcement Learning; Speculative Execution; Task Scheduler; Heterogeneous Environments.
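The abstract above refers to the LATE scheduler's progress estimate. As a rough illustration only, the published LATE heuristic estimates a task's remaining time as (1 - progress score) / progress rate, where the rate is the progress score divided by elapsed time. The sketch below applies that formula to made-up task data; the task names and numbers are hypothetical, and this is not the MRRL scheduler proposed in the paper.

```python
def estimated_time_left(progress, elapsed_seconds):
    """LATE-style estimate: (1 - progress) / (progress / elapsed)."""
    if not 0.0 < progress <= 1.0 or elapsed_seconds <= 0:
        raise ValueError("progress must be in (0, 1] and elapsed time positive")
    rate = progress / elapsed_seconds  # progress made per second so far
    return (1.0 - progress) / rate


# Hypothetical running tasks: name -> (progress score, seconds elapsed).
tasks = {"t1": (0.9, 90.0), "t2": (0.3, 90.0), "t3": (0.6, 60.0)}
time_left = {name: estimated_time_left(p, t) for name, (p, t) in tasks.items()}

# The task with the longest estimated time to end is the speculation candidate.
straggler = max(time_left, key=time_left.get)
```

Because the rate is assumed constant over the task's lifetime, the estimate is static in the sense the abstract criticises: a task that slows down late is detected late.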
Hybrid Neural Network and Bi-criteria Tabu-machine: comparison of new approaches to Maximum Clique Problem
by Eduard Babkin, Tatiana Babkina, Alexander Demidovskij
Abstract: This paper presents two new approaches to solving a classical NP-hard problem of maximum clique (MCP), which frequently arises in the domain of information management, including design of database structures and big data processing. In our research we are focusing on solving that problem using the paradigm of artificial neural networks. The first approach combines the artificial neuro-network paradigm and genetic programming. For boosting the convergence of the Hopfield Neural Network (HNN) we propose a specific design of the genetic algorithm as the selection mechanism for terms of the HNN energy function. The second approach incorporates and extends the Tabu-search heuristics improving performance of network dynamics of so-called Tabu machine. Introduction of a special penalty function in Tabu machine facilitates better evaluation of the search space. As a result, we demonstrate the proposed approaches on well-known experimental graphs and formulate two hypotheses for further research.
Keywords: Maximum Clique Problem; Data structures; Hopfield Network; Genetic Algorithm; Tabu Machine.
Algorithms for Fast Estimation of Social Network Centrality Measures
by Ashok Kumar, R. Chulaka Gunasekara, Kishan Mehrotra, Chilukuri Mohan
Abstract: Centrality measures are extremely important in the analysis of social networks, with applications such as the identification of the most influential individuals for effective target marketing. Eigenvector centrality and PageRank are among the most useful centrality measures, but computing these measures can be prohibitively expensive for large social networks. This paper explores multiple approaches to reduce the computational effort required to compute relative centrality measures. First, we show that neural networks can be effective in learning and estimating the ordering of vertices in a social network based on these centrality measures. We show that the proposed neural network approach requires far less computational effort, and is faster, than early termination of the power iteration method that can be used for computing the centrality measures. We also show that four features describing the size of the social network and two vertex-specific attributes suffice as inputs to the neural networks, requiring very few hidden neurons. Then we focus on how network sampling can be used to reduce the running time for calculating the ordering of vertices. We introduce the notion of degree centrality based sampling to reduce the running time of the key node identification problem. Finally, we propose an approach for incremental updating of centrality measures in dynamic networks.
Keywords: Social network; Centrality; Eigenvector centrality; PageRank; Network sampling; Incremental updating.
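For readers unfamiliar with the baseline mentioned in the abstract above, the following is a minimal sketch of the standard power iteration for PageRank, with a tolerance and an iteration cap that together give the "early termination" behaviour. The graph, the damping factor of 0.85 and the function name are illustrative assumptions, not taken from the paper.

```python
def pagerank(adjacency, damping=0.85, tol=1e-10, max_iter=200):
    """Power iteration for PageRank.

    adjacency maps each node to the list of nodes it links to; every
    link target is assumed to appear as a key as well.
    """
    nodes = sorted(adjacency)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without out-links is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not adjacency[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for w in adjacency[v]:
                new_rank[w] += damping * rank[v] / len(adjacency[v])
        change = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if change < tol:  # stop early once the scores have settled
            break
    return rank


# Hypothetical four-node graph used only for illustration.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(graph)
ordering = sorted(scores, key=scores.get, reverse=True)
```

Each iteration touches every edge once, which is why the cost grows with network size and why cheaper estimates of the vertex ordering are attractive.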
A Collective Matrix Factorization Approach to Social Recommendation with eWOM Propagation Effects
by Ren-Shiou Liu
Abstract: In recent years, recommender systems have become an important tool for many online retailers to increase sales. Many of these recommender systems predict users' interests in products by using the browsing history or item rating records of users. However, many studies show that, before making a purchase, people often read online reviews and exchange opinions with friends in their social circles. The resulting electronic word-of-mouth (eWOM) has a huge impact on customers' purchase intentions. Nonetheless, most recommender systems in the current literature do not consider eWOM, let alone the effect of its propagation. Therefore, this paper proposes a new recommendation model based on the collective matrix factorization technique for predicting customer preferences. A series of experiments using data collected from Epinions and Yelp are conducted. The experimental results show that the proposed model significantly outperforms other closely related models by 5%-13% in terms of RMSE and MAE.
Keywords: recommender systems; matrix factorization; collaborative filtering; electronic word-of-mouth; regularization.
Collective Tweet Analysis for Accurate User Sentiment Analysis - a Case Study with Delhi Assembly Election 2015
by Lija Mohan, Sudheep Ealyidom
Abstract: Social media has exploded as a category of online discourse where people create and share content at a massive rate. Because of its ease of use, speed and reach, social media is fast changing the public discourse in society and setting trends and agendas in topics that range from the environment and politics to technology and the entertainment industry. Since social media can also be construed as a form of collective wisdom, the authors decided to investigate its power at predicting real-world outcomes. The objective was to design a Twitter-based sentiment mining method. We introduced a keyword-aware, user-based collective tweet mining approach to rank the sentiment of each user. To prove the accuracy of the proposed method, we chose an election winner prediction application and observed how the sentiment of people on different political issues at that time was reflected in their votes. A domain thesaurus is built by collecting keywords related to each issue. Since Twitter data is very large, it is difficult to process using a traditional architecture. Hence, we introduced a scalable and efficient MapReduce-based approach to classify the tweets. The experiments were designed to predict the winner of the Delhi Assembly Election 2015 by analyzing the sentiments of people on different political issues; from this analysis, we correctly predicted that the Aam Aadmi Party had higher support than the existing ruling party, the BJP. Thus, we introduced a big data approach to sentiment analysis on Twitter data, which has widespread applications in today's world.
Keywords: Twitter Analysis; Collective Tweet Analysis; Sentiment Analysis; Big Data; Hadoop; Map Reduce.
Big Uncertain Data of Multiple Sensors Efficient Processing with High Order Multi-Hypothesis: An Evidence Theoretic Approach
by Hossein Jafari, Xiangfang Li, Lijun Qian, Alexander Aved, Timothy Kroecker
Abstract: With the proliferation of IoT, numerous sensors are deployed and big uncertain data are collected due to the different accuracy, sensitivity range, and decay of the sensors. The goal is to process the data and determine the most potential hypothesis among the set of high order multi-hypothesis.
In this study, we propose a novel big uncertain sensor fusion framework to take advantage of evidence theory's capability of representing uncertainty for decision making and effectively dealing with conflict.
However, the methods in evidence theory are in general very computationally expensive, thus they may not be directly applied to multiple data sources with high cardinality of hypotheses. Furthermore, we propose a Dezert-Smarandache hybrid model that can apply to applications with high number of hypotheses while the computational cost is reduced.
Both synthetic and real data from experiments are used to demonstrate the feasibility of the proposed method for practical situation awareness applications.
Keywords: Dezert-Smarandache Theory (DSmT); Dempster-Shafer Theory (DST); Internet of Things (IoT); Comfort Zone; Uncertain Data Fusion; Multiple Sensor; Multi-Hypothesis.
Comparison of Hive's Query Optimization Techniques
by Sikha Bagui, Keerthi Devulapalli
Abstract: The ever-increasing size of data sets in this Big Data era has forced data analytics to move from traditional RDBMS systems to distributed technologies like Hadoop. Since data analysts are more familiar with SQL than with the MapReduce programming paradigm, HiveQL was built on Hadoop's MapReduce framework. Traditional RDBMS query optimization techniques, which are used in the Rule Based Optimizer (RBO) of Hive, do not perform well in the MapReduce environment. Hence, the Correlation Optimizer (CRO) and Cost Based Optimizer (CBO) were developed. These optimizers perform query optimizations that take the MapReduce execution framework into account. In this work, the three optimizers, RBO, CRO, and CBO, are compared. Queries with common intra-query operations were found to be optimized better with CRO.
Keywords: Hive; Query Optimization; Correlation Based Optimizer; Rule Based Optimizer; Cost Based Optimizer.
BIG DATA ENSEMBLE CLINICAL PREDICTION FOR HEALTHCARE DATA BY USING DEEP LEARNING MODEL
by Sreekanth Rallapalli, Gondkar R R
Abstract: Big Data has revolutionized healthcare, with data characterized by volume, velocity, variety, veracity, variability, visualization and value. Electronic Health Records (EHRs) are growing at an exponential rate and are stored in enterprise databases or cloud storage. Identifying strong indicators for accurate prediction is a challenging task. Ensemble models are gaining popularity over individual models. Ensemble systems can provide better accuracy than the best single model. In this paper we combine four algorithms Support Vector machines, Na
Keywords: Algorithm; Big Data; Classification; Decision trees; Deep learning; EHR; Ensemble model; Predictive model.
Resource management for deadline constrained MapReduce jobs for minimizing energy consumption
by Adam Gregory, Shikharesh Majumdar
Abstract: Cloud computing has emerged as one of the leading platforms for processing large-scale data-intensive applications. Such applications are executed in large clusters and data centers which require a substantial amount of energy. Energy consumption within data centers accounts for a considerable fraction of costs and is a significant contributor to global greenhouse gas emissions. Therefore, minimizing energy consumption in data centers is a critical concern for data center operators, cluster owners, and cloud service providers. In this paper, we devise a novel Energy Aware MapReduce Resource Manager for an open system, called EAMR-RM, that can effectively perform matchmaking and scheduling of MapReduce jobs with the objective of minimizing data center energy consumption. Each job is characterized by a Service Level Agreement (SLA) for performance that includes a client-specified earliest start time, execution time, and deadline. Performance analysis demonstrates that, for the range of system and workload parameters experimented with, the proposed technique can effectively satisfy SLA requirements while achieving up to a 45% reduction in energy consumption compared to approaches which do not consider energy in resource management decisions.
Keywords: Resource management on clouds; MapReduce with deadlines; Constraint Programming; Energy management; Big data analytics; Job turnaround time.
A big data analytics framework for border crossing transportation
by Haibo Wang, Da Huo, Yaquan Xu
Abstract: In this paper, the authors present a framework on developing a comprehensive system to analyse border crossing transportation using an open-source meta-data acquisition and aggregation tool. It is a platform integration approach based on Hadoop, MapReduce and MongoDB to consolidate databases from both the USA and Mexico. We design data-driven XML schema for tagging the data entries from different sources with different formats, and implement a package using open-source software R to aggregate XML-transformed data into time and space dimensions. Then the transformed data is analysed by a difference-in-difference (DiD) estimation model to understand the behaviour of border crossing transportation.
Keywords: big data analytics; border crossing transportation; difference-in-difference estimation.
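The difference-in-difference (DiD) estimate named in the abstract above can be illustrated with the basic two-group, two-period form: the change in the treated group minus the change in the control group. The sketch below uses invented numbers purely for illustration; the function name and data are assumptions, and the paper's actual model specification is not given in the abstract.

```python
from statistics import mean


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Basic two-period DiD: (treated change) minus (control change)."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical crossing counts before and after some policy change.
effect = did_estimate(
    treated_pre=[100, 110, 90],
    treated_post=[130, 140, 120],
    control_pre=[80, 85, 75],
    control_post=[90, 95, 85],
)
```

Here the treated group rises by 30 and the control group by 10, so the estimated effect is 20; the control group's change stands in for what would have happened without the intervention.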
Composition and verification of student-oriented courses
by Naseem Ibrahim
Abstract: In the last few years, the popularity of online degrees has dramatically increased. In current online degrees, the school specifies the courses required to obtain a degree. For each course, the instructor specifies the course elements including teaching method and assessments. But different students have different capabilities and constraints. Most institutions provide the same courses. A student should be able to select the course that best matches his capabilities and constraints as long as it satisfies the required course outcomes. To achieve this goal, we propose the use of service-oriented architecture (SOA). We introduce an extended service-oriented architecture and an extended service definition, which will enable the specification and provision of student-oriented courses. We also propose a formal composition approach. To formally verify the result of the composition, we have also introduced a formal verification approach using the model checking tool UPPAAL.
Keywords: student-oriented; service-oriented architecture; SOA; context; service model; UPPAAL.
S3R: storage-sensitive services redeployment in the cloud
by Huining Yan, Yiming Zhang, Huaimin Wang, Bo Ding, Haibo Mi
Abstract: Services redeployment is one of the critical techniques for energy efficiency in cloud data centres. In recent years, cloud providers have been providing local storage for cloud services, since it offers a better performance with identified price. Nevertheless, most existing work did not consider the problems introduced by utilising local storage, e.g., migrating much more data, and therefore consuming much more migration time and network bandwidth. Meanwhile, because instance migration is a costly operation, the number of migrated instances must also be considered. However, the data size and the number of instances on servers are often not in accordance, and therefore a tradeoff should be made. To address this problem, this paper proposes S3R, a storage-sensitive services redeployment approach. S3R first builds a tradeoff model to estimate the release cost for each server, and then adopts an FFD-based heuristic algorithm to migrate/redeploy instances. Evaluation results on production traces demonstrate the effectiveness of S3R.
Keywords: cloud computing; energy efficiency; storage-sensitive; services redeployment.
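The abstract above mentions an FFD-based heuristic. As background, a minimal sketch of the generic first-fit decreasing (FFD) bin-packing heuristic is shown below, assuming a single resource dimension and equal server capacities. It is the textbook heuristic, not the S3R algorithm itself, and the function name and sample sizes are hypothetical.

```python
def first_fit_decreasing(sizes, capacity):
    """Place items, largest first, into the first bin with enough room."""
    bins = []  # items placed in each bin
    free = []  # remaining capacity of each bin
    for size in sorted(sizes, reverse=True):
        if size > capacity:
            raise ValueError(f"item of size {size} exceeds capacity {capacity}")
        for i, room in enumerate(free):
            if size <= room:
                bins[i].append(size)
                free[i] -= size
                break
        else:
            # No existing bin fits, so open a new one.
            bins.append([size])
            free.append(capacity - size)
    return bins


# Hypothetical instance sizes packed onto servers of capacity 10.
placement = first_fit_decreasing([4, 8, 1, 4, 2, 1], capacity=10)
```

Sorting in decreasing order places the hardest-to-fit items first, which tends to leave fewer servers partially filled than plain first-fit.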
Data partition optimisation for column-family NoSQL databases
by Meng-Ju Hsieh, Li-Yung Ho, Jan-Jan Wu, Pangfeng Liu
Abstract: Data conversion has become an emerging topic in the Big Data era. To face the challenge of rapid data growth, legacy or existing relational databases need to be converted into NoSQL column-family databases in order to achieve better scalability. The conversion from SQL to NoSQL databases requires combining small, normalised SQL data tables into larger NoSQL data tables, a process called denormalisation. A challenging issue in data conversion is how to group the denormalised columns in a large data table into 'families' in order to ensure the performance of query processing. In this paper, we propose an efficient heuristic algorithm, the graph-based partition algorithm (GPA), to address this problem. We use the TPC-C and TPC-H benchmarks to demonstrate that the column families produced by GPA are very efficient for large-scale data processing.
Keywords: vertical partition; column partition; column family; NoSQL database.
An adaptive memory tuning strategy with high performance for Spark
by Di Chen, Haopeng Chen, Zhipeng Jiang, Yao Zhao
Abstract: With the rapid development of the internet, people focus more and more on data, which contains much information and is of great value. To gain better performance in data analysis, in-memory computing has become more and more popular. Spark (Zaharia et al., 2010) is a successful example of improving computing performance through in-memory computing. However, how to make full use of memory resources is still a problem for Spark. In this paper, we present an adaptive memory tuning strategy for Spark, which dynamically selects data compression and serialisation strategies to reduce resource usage and speed up data processing. We derive the strategy for selecting the optimal data compression and serialisation mathematically. It chooses a proper memory tuning strategy according to resource usage and can obtain good performance in applications that persist data frequently.
Keywords: Spark; in-memory computing; data persisting; data caching; memory tuning.
Special Issue on: E-Health Systems and Semantic Web
A graph traversal attack on Bloom filter-based medical data aggregation
by William Mitchell, Rinku Dewri, Ramakrishna Thurimella, Max Roschke
Abstract: We present a novel cryptanalytic method based on graph traversals to show that record linkage using Bloom filter encoding does not preserve privacy in a two-party setting. Bloom filter encoding is often suggested as a practical approach to medical data aggregation. This attack is stronger than a simple dictionary attack in that it does not assume knowledge of the universe. The attack is very practical and produced accurate results when experimented on large amounts of name-like data derived from a North Carolina voter registration database. We also give theoretical arguments that show that going from bigrams to n-grams, n > 2, does not increase privacy; on the contrary, it actually makes the attack more effective. Finally, some ways to resist this attack are suggested.
Keywords: Bloom filter encoding; BFE; privacy-preserving record linkage; PPRL; medical data aggregation; cryptanalysis; two-party linkage; private record linkage; PRL.
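To make the encoding discussed in the abstract above concrete, the sketch below shows one common form of Bloom filter encoding for record linkage: a name is split into padded bigrams, each bigram sets a few bit positions, and two encodings are compared with the Dice coefficient. The filter length, number of hash functions, padding character and the double-hashing scheme built on SHA-256 and BLAKE2b are all assumptions chosen for illustration; the abstract does not state the exact parameters studied, and this sketch does not implement the attack.

```python
import hashlib


def bigrams(name):
    """Split a name into overlapping 2-character grams, padded with '_'."""
    padded = f"_{name.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(name, m=256, k=4):
    """Return the set of bit positions (out of m) set by the name's bigrams."""
    bits = set()
    for gram in bigrams(name):
        data = gram.encode("utf-8")
        h1 = int(hashlib.sha256(data).hexdigest(), 16)
        h2 = int(hashlib.blake2b(data).hexdigest(), 16)
        for i in range(k):
            bits.add((h1 + i * h2) % m)  # double hashing: k positions per bigram
    return bits


def dice(a, b):
    """Dice coefficient between two sets of bit positions."""
    return 2 * len(a & b) / (len(a) + len(b))


similarity = dice(bloom_encode("smith"), bloom_encode("smyth"))
```

Because the same bigram always sets the same positions, similar names share bits. That regularity is what makes approximate matching possible, and it is also the kind of structure a cryptanalytic attack can try to exploit.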
Special Issue on: Big Data Management in Clouds: Opportunities, Issues, Challenges and Solutions
Semi-structured Data Analysis and Visualization using NoSQL
by Srinidhi Hiriyannaiah, Siddesh G M, K.G. Srinivasa, Anoop P
Abstract: In the field of computing, huge amounts of data are created every day by scientific experiments, companies and users' activities. These large datasets are labelled as "Big data", presenting new challenges for computer science researchers and professionals in terms of storage, processing and analysis. Traditional relational database systems (RDBMS) supported with conventional searches cannot be used effectively to handle such multi-structured data. NoSQL databases address the challenges of managing big data with an RDBMS and facilitate further analysis of the data. In this paper, we introduce a framework that aims at analyzing semi-structured data applications using the NoSQL database MongoDB. The proposed framework focuses on the key aspects needed for semi-structured data analytics in terms of data collection, data parsing and data prediction. The layers involved in the framework are the request layer, which handles queries from the user; the input layer, which interfaces the data sources and the analytics layer; and the output layer, which handles the visualization of the analytics performed. A performance analysis of MySQL and MongoDB is carried out for the select+fetch operations needed for analytics, in which the NoSQL database MongoDB outperforms the MySQL database. The proposed framework is applied to performance prediction and monitoring for a cluster of servers.
Keywords: analytics; semi-structured data; big data analytics; cluster analytics; server performance monitoring; MongoDB; NoSQL analytics.
Computation Migration: A new approach to execute big-data bioinformatics workflows
by Rickey T. P. Nunes, Santosh L. Deshpande
Abstract: Bioinformatics workflows frequently access various distributed biological data sources and computational analysis tools for data analysis and knowledge discovery. They move large volumes of data from biological data sources to computational analysis tools and follow the traditional data migration approach for workflow execution. However, with the advent of big data in bioinformatics, moving large volumes of data to computation during workflow execution is no longer feasible. Considering that the size of biological data is continuously growing and is much larger than the size of the computational analysis tools, moving computation to data in a workflow is a better way to handle the growing data. In this paper, we therefore propose a computation migration approach to execute bioinformatics workflows. We move computational analysis tools to data sources during workflow execution and demonstrate with workflow patterns that moving computation instead of data yields high performance gains in terms of data flow and execution time.
Keywords: Big-data; Bioinformatics; Workflows; Orchestration; Computation migration.
Parallel Computing For Preserving Privacy Using k-anonymization Algorithms from Big Data
by Sharath Yaji, Neelima.B Reddy
Abstract: For many organizations, preserving privacy for Big Data is still a major challenge. Big Data analysis can be optimized through parallel computation. This paper proposes parallelizing k-anonymization algorithms, based on a comparative study and survey. The main k-anonymization algorithms considered for study and comparison are MinGen, Datafly, Incognito and Mondrian. It is noted that as the data size increases, the parallel versions of the algorithms perform better than their sequential counterparts. For small data sets, MinGen is 71.83% faster in sequential mode than in parallel mode; overall, Datafly performed well in sequential mode and Incognito in parallel mode. For large data sets, Incognito is 101.186% faster in parallel mode than in serial mode; overall, MinGen and Datafly performed well in sequential mode, and Incognito, Datafly and MinGen performed well in parallel mode. The paper acts as a single point of reference for choosing k-anonymization algorithms for Big Data mining and points towards applying HPC concepts such as parallelization to privacy-preserving algorithms. As future work, the authors plan to port these algorithms to heterogeneous architectures such as Graphics Processing Units (GPUs), as applicable.
Keywords: Big Data; K–anonymization; Privacy preserving in Big Data analysis; Parallel computing for Big Data.
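As a reminder of the property that all four algorithms in the abstract above aim for, a table is k-anonymous when every combination of quasi-identifier values occurs in at least k records. The sketch below checks that property and applies one simple generalisation step (bucketing ages into ranges). The records, field names and bucket width are hypothetical, and the code does not reproduce MinGen, Datafly, Incognito or Mondrian.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination appears at least k times."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


def generalize_age(record, width=10):
    """Replace an exact age with a range such as '20-29'."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}


# Hypothetical records; 'age' and 'zip' are treated as quasi-identifiers.
raw = [
    {"age": 23, "zip": "130**", "diagnosis": "flu"},
    {"age": 27, "zip": "130**", "diagnosis": "asthma"},
    {"age": 25, "zip": "130**", "diagnosis": "flu"},
    {"age": 41, "zip": "148**", "diagnosis": "diabetes"},
    {"age": 45, "zip": "148**", "diagnosis": "flu"},
]
generalized = [generalize_age(r) for r in raw]

before = is_k_anonymous(raw, ["age", "zip"], k=2)
after = is_k_anonymous(generalized, ["age", "zip"], k=2)
```

The grouping step is independent across groups, which is one reason such checks lend themselves to the parallel execution the abstract discusses.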
Special Issue on: Data to Decision
Predicting Baseline for Analysis of Electricity Pricing
by Taehoon Kim, Jaesik Choi, Dongeun Lee, Alex Sim, C. Anna Spurlock, Annika Todd, Kesheng Wu
Abstract: To understand the impact of a new pricing structure on residential electricity demand, we need a baseline model that captures every factor other than the new price. The gold standard baseline is a randomized control group; however, a good control group is hard to design, and can only serve as a baseline for a group, not for any individual household. To overcome these shortcomings, we develop a number of techniques that can predict hourly usage years ahead. To capture the fact that daily electricity demand peaks a few hours after the temperature reaches its peak, existing methods rely on lagged variables, such as the usage from a day ago and a week ago. When making predictions years into the future, the values from a week ago are also in the future and therefore unknown. In this work, we develop a continuous prediction strategy that first forecasts lagged variables and then uses them in further predictions, but we find that the prediction error increases over time. Based on an observed linear relationship between temperature and aggregate power (LTAP), we design a new prediction method named LTAP that avoids this accumulation of prediction errors when forecasting the usage of each household. In our test, the average usage predicted by LTAP matches the control group for both summers covered by the test data. This suggests that we might be able to use LTAP predictions in future studies.
Keywords: baseline model; residential electricity consumption; outdoor temperature; gradient tree boosting; electricity rate scheme.
Sign Language Recognition in Complex Background Scene Based on Adaptive Skin Color Modeling and Support Vector Machine
by Tse-Yu Pan, Li-Yun Lo, Chung-Wei Yeh, Jhe-Wei Li, Hou-Tim Liu, Min-Chun Hu
Abstract: With the advances in wearable cameras, users can record first-person-view videos for gesture recognition or even sign language recognition to help deaf or hard-of-hearing people communicate with others. In this paper, we propose a purely vision-based sign language recognition system which can be used in scenes with complex backgrounds. We design an adaptive skin color modeling method for hand segmentation so that the hand contour can be derived more accurately even when different users use our system in various lighting conditions. Four kinds of feature descriptors are integrated to describe the contours and the salient points of hand gestures, and a Support Vector Machine (SVM) is applied to classify hand gestures. Our recognition method is evaluated on two datasets: (1) the CSL dataset, collected by ourselves, in which images were captured in three different environments, including complex backgrounds; and (2) the public ASL dataset, in which images of the same gesture were captured in different lighting conditions. The proposed recognition method achieves acceptable accuracy rates of 100.0% and 94.0% for the CSL and ASL datasets, respectively.
Keywords: Sign Language Recognition; Support Vector Machine; Human-Computer Interaction; Gesture Recognition.
Detecting Spam Web Pages using Multilayer Extreme Learning Machine
by Rajendra Kumar Roul
Abstract: Web spamming generally pushes unimportant pages higher in the search results. Detecting and eliminating such spam pages, which not only mislead the search engine but also become a roadblock to obtaining high-quality information from the Web, is the need of the day. Hence, spam page detection has become a vibrant area of research in the field of information retrieval. Aiming in this direction, this study focuses on two important aspects of machine learning. First, it proposes a new content-based spam detection technique which identifies nine important features that help to detect whether a page is spam or non-spam. Each feature has an associated value, which is calculated by parsing the documents and then performing the steps necessary to compute its score. These nine features, along with the class label (spam or non-spam), generate a feature vector for training the classifiers to detect spam pages. Secondly, it highlights the importance of deep learning using the Multilayer extreme learning machine in the field of spam page detection. For the experimental work, two benchmark datasets (WEBSPAM-UK2006 and WEBSPAM-UK2002) have been used, and the results using Multilayer ELM are found to be more promising compared to other established classifiers.
Keywords: Content based; Deep Learning; Extreme Learning Machine; Multilayer ELM; Support Vector Machine; Spam Page.
An unsupervised service annotation by review analysis
by Masafumi Yamamoto, Yuguan Xing, Toshihiko Yamasaki, Kiyoharu Aizawa
Abstract: With the increase in popularity of online review sites, users can write reviews on services that they have used in addition to reading reviews by other users to gain information about the services.
However, the number of reviews about a service may be large, which makes it almost impossible for users to read all the reviews in detail.
It is even more burdensome to compare multiple services.
Users may also want to know about the unique and relatively praised features of services, but there are few works that solve such a problem.
Thus, useful tools for extracting the unique features of services are necessary so that users can easily and intuitively understand the quality of services and compare them.
In this study, we present an unsupervised method for extracting the unique and detailed features of services and the users' opinions on these features.
By using only the term frequency (TF), our method will extract only the general features (e.g., for restaurants, food and service), and many review sites show users how these general features are evaluated.
However, users may want to know more detailed features and how these features are evaluated.
Furthermore, by using the term frequency and inverse document frequency (TF-IDF) algorithm, our proposed method can also extract in particular the praised or criticized features of a specific service.
We conducted evaluations from multiple viewpoints to show the validity of our proposed method.
In addition, we implemented a graphical user interface to help users intuitively understand the results.
Keywords: Service annotation; service profiling; review analysis; summarization.
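The contrast the abstract above draws between TF and TF-IDF can be seen in a small example: terms that appear in the reviews of every service get an IDF of zero, so only terms distinctive to one service keep a high score. The sketch below uses one common TF-IDF variant (relative term frequency times the natural log of N over document frequency) on invented, pre-tokenised review text; the variant, the tokens and the function name are assumptions, not the authors' implementation.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Score each term per document; documents are non-empty token lists."""
    n = len(documents)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in counts.items()}
        )
    return scores


# Hypothetical tokens pooled from the reviews of three restaurants.
reviews = [
    ["food", "service", "rooftop", "rooftop"],
    ["food", "service", "noodles"],
    ["food", "service", "parking"],
]
scores = tf_idf(reviews)
top_term = max(scores[0], key=scores[0].get)
```

In this toy data "food" and "service" occur for every restaurant and score zero, while "rooftop" stands out for the first one, mirroring the general versus unique features described in the abstract.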
Emotion Based Topic Impact on Social Media
by Fernando Calderon Alvarado, Yi-Shin Chen
Abstract: The increasing use of micro-blogging sites has made them very rich data repositories. The information generated is dynamic by nature, tied to temporal conditions and the subjectivity of its users. Everyday life experiences, discussions or events have a direct impact on the behaviors reflected in social networks. It has become important to assess to what degree these interactions affect a social group. One possibility is to analyze how impactful a topic is according to the behavior presented on a social network over time. It is then necessary to develop methods that can contribute towards this task. Having identified a topic in social media, we can obtain a general summary of the emotions it is generating across a social group. We then propose a Topic Impact score, which is given to each topic based on how these emotions transition, how long they span and how many users they reach. This lays the groundwork to quantify how impactful a topic is over a social group, specifically regarding events detected on Twitter.
Keywords: Social Impact; Influence; Social Media; Emotion Analysis.
Document Stream Classification based on Transfer Learning using Latent Topics
by Masato Shirai, Jianquan Liu, Takao Miura
Abstract: In this investigation, we propose a classification framework based on transfer learning that uses a latent intermediate domain for document stream classification.
In a document stream, word frequencies change dramatically as themes shift. To classify a document stream, we capture new features and modify the classification criteria as the stream progresses.
Transfer learning uses knowledge extracted from a source domain to analyze the target domain.
We extract latent topics from unlabeled documents using a topic model. Our approach connects the domains through these latent topics to classify documents, and captures changes in features by updating the intermediate domain over the document stream.
Keywords: Transfer Learning; NMTF; Topic Model; Document Classification.
Sightseeing Value Estimation by Analyzing Geosocial Images
by Yizhu Shen, Min Ge, Chenyi Zhuang, Qiang Ma
Abstract: Recommendation of points of interest (POIs) is drawing more attention to meet the growing demands of tourists. Thus, a POI's quality (sightseeing value) needs to be estimated. In contrast to conventional studies that rank POIs on the basis of user behavior analysis, this paper presents methods to estimate quality by analyzing geo-social images. Our approach estimates the sightseeing value from two aspects: (1) nature value and (2) culture value. For the nature value, we extract image features that are related to favorable human perception to verify whether a POI would satisfy tourists in terms of environmental psychology. Three criteria are defined accordingly: coherence, image-ability, and visual-scale. For the culture value, we recognize the main cultural element (i.e., architecture) included in a POI. In the experiments, we applied our methods to real POIs and found that our approach assessed sightseeing value effectively.
Keywords: Points of Interest; Sightseeing value; Geosocial image; human perception; image processing; UCG Mining.
A Customized Automata Algorithm and Toolkit for Language Learning and Application
by Ruoyu Wang, Guoqiang Li, Jianwen Xiang
Abstract: Automata are abstract computing machines. They play a basic role in computability theory and programming language theory. They are also widely used in programming language compilers as token scanners and syntactic analysers. More recently in data analytics, data automata have become a formal way to represent pipelines and workflows. In research involving automata, however, there are many situations where a practitioner has to build a new automaton from scratch, which leads to a lot of redundant work in rebuilding the framework of an automaton. Moreover, when many researchers need to present their ideas and discuss new algorithms, it is extremely hard for them to switch among different coding styles, not to mention modifying parts of others' programs.
In order to solve that problem, we propose a new toolkit, CAT, which provides a simple and unified framework for automaton construction and customization. We observed eight prevailing types of automata and decomposed their logical structure, extracting similar semantics. In addition, behavioural similarities are compared and taken into consideration to generate a hierarchical framework. Each of the eight specific types of automata is implemented at a leaf node of the tree-like structure. Several calculus algorithms are implemented according to the theoretical results and designed as overloaded operators, which simplifies and visualises the source code written by end users. To test the correctness and performance of this toolkit, several bare automata were constructed and used in calculations. In addition, a simple textual retrieval programme was implemented with CAT and compared with the well-known tool "GREP" on Ubuntu Linux.
The results showed that CAT has realised most of its design purposes: the calculus is correct and the framework provides a universal solution for various types of automata. Also, CAT offers a more illustrative way of writing code for automata construction and calculation.
Keywords: Automata; Customize; C++.
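The idea of exposing automata calculus as overloaded operators can be sketched as follows. This is only an illustration under stated assumptions: it is written in Python rather than C++, the `DFA` class and its fields are hypothetical and are not CAT's API, and the only operations shown are intersection (`&`) and union (`|`) of complete deterministic finite automata via the standard product construction.

```python
from dataclasses import dataclass
from itertools import product

@dataclass
class DFA:
    """Hypothetical complete DFA; delta maps (state, symbol) -> state."""
    states: frozenset
    alphabet: frozenset
    delta: dict
    start: object
    accepting: frozenset

    def accepts(self, word):
        state = self.start
        for symbol in word:
            state = self.delta[(state, symbol)]
        return state in self.accepting

    def _product(self, other, keep):
        # Product construction: run both automata in lockstep on pairs of states.
        if self.alphabet != other.alphabet:
            raise ValueError("alphabets must match")
        states = frozenset(product(self.states, other.states))
        delta = {
            ((p, q), a): (self.delta[(p, a)], other.delta[(q, a)])
            for (p, q) in states
            for a in self.alphabet
        }
        accepting = frozenset(
            (p, q) for (p, q) in states
            if keep(p in self.accepting, q in other.accepting)
        )
        return DFA(states, self.alphabet, delta, (self.start, other.start), accepting)

    def __and__(self, other):  # language intersection
        return self._product(other, lambda x, y: x and y)

    def __or__(self, other):  # language union
        return self._product(other, lambda x, y: x or y)

AB = frozenset("ab")

# Accepts words with an even number of 'a'.
even_a = DFA(frozenset({0, 1}), AB,
             {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1},
             0, frozenset({0}))

# Accepts words ending in 'b'.
ends_b = DFA(frozenset({"x", "y"}), AB,
             {("x", "a"): "x", ("y", "a"): "x", ("x", "b"): "y", ("y", "b"): "y"},
             "x", frozenset({"y"}))

both = even_a & ends_b
either = even_a | ends_b
print(both.accepts("aab"), both.accepts("ab"), either.accepts("ab"))
```

With operators defined this way, user code that combines automata reads like the algebraic expression it implements, which is the readability benefit the abstract attributes to overloaded operators.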
Ontology-based Faceted Semantic Search with Automatic Sense Disambiguation for Bioenergy Domain
by Feroz Farazi, Craig Chapman, Pathmeswaran Raju, Lynsey Melville
Abstract: WordNet is a lexicon widely known and used as an ontological resource hosting a comparatively large collection of semantically interconnected words. Use of such resources produces meaningful results and improves users' search experience through increased precision and recall. This paper presents our facet-enabled, WordNet-powered semantic search work done in the context of the bioenergy domain. The main hurdle to achieving the expected result was sense disambiguation, further complicated by the occasional fine-grained distinction of meanings of the terms in WordNet. To overcome this issue, this paper proposes a sense disambiguation methodology that uses bioenergy domain related ontologies (extracted from WordNet automatically), the WordNet concept hierarchy and term sense rank.
Keywords: semantic search; faceted search; faceted semantic search; Knowledge Base; WordNet; ontology; bioenergy.
Special Issue on: Advances in Cyber Security and Privacy of Big Data in Mobile and Cloud Computing
Interoperable Identity Management Protocol for Multi-Cloud Platform
by Tania Chaudhary, Sheetal Kalra
Abstract: Multi-cloud adaptive application provisioning promises to solve the data storage problem and leads to interoperability of data within a multi-cloud environment. This also raises concerns about the interoperability of users among these computing domains. Although various standards and techniques have been developed to secure the identity of the cloud consumer, none of them provides a facility to interoperate and to secure the identity of the cloud consumer. Thus, there is a need to develop an efficient authentication protocol that maintains a single unique identity of the cloud consumer and makes it interoperable among various cloud service providers. Elliptic curve cryptography (ECC) based algorithms are the best choice among Public Key Cryptography (PKC) algorithms due to their small key sizes and efficient computation. In this paper, a secure ECC based mutual authentication protocol for cloud service provider servers using a smart device and a one-time token is proposed. The proposed scheme achieves mutual authentication and provides interoperability among multiple cloud service providers. The security analysis of the proposed protocol proves that the protocol is robust against all the security attacks. The formal verification of the proposed protocol is performed using the AVISPA tool, which proves its security in the presence of an intruder.
Keywords: Authentication; Cloud Computing; Elliptic Curve Cryptography; Multi-Cloud; One Time Token; Smart Device.