International Journal of Big Data Intelligence (26 papers in press)
Interoperable Identity Management Protocol for Multi-Cloud Platform
by Tania Chaudhary, Sheetal Kalra
Abstract: Multi-cloud adaptive application provisioning promises to solve the data storage problem and leads to interoperability of data within a multi-cloud environment. This also raises concerns about the interoperability of users among these computing domains. Although various standards and techniques have been developed to secure the identity of the cloud consumer, none of them provides the facility both to interoperate and to secure that identity. Thus, there is a need to develop an efficient authentication protocol that maintains a single unique identity for the cloud consumer and makes it interoperable among various cloud service providers. Elliptic curve cryptography (ECC) based algorithms are the best choice among Public Key Cryptography (PKC) algorithms due to their small key sizes and efficient computation. In this paper, a secure ECC-based mutual authentication protocol for cloud service provider servers using a smart device and a one-time token is proposed. The proposed scheme achieves mutual authentication and provides interoperability among multiple cloud service providers. The security analysis of the proposed protocol proves that the protocol is robust against all the security attacks. The formal verification of the proposed protocol is performed using the AVISPA tool, which proves its security in the presence of an intruder.
Keywords: Authentication; Cloud Computing; Elliptic Curve Cryptography; Multi-Cloud; One Time Token; Smart Device.
A Novel Entropy Based Dynamic Data Placement Strategy for Data Intensive Applications in Hadoop Clusters
by K. Hemant Kumar Reddy, Diptendu Sinha Roy, Vishal Pandey
Abstract: In the last decade, efficient data analysis of data-intensive applications has become an increasingly important research issue. The popular MapReduce framework has offered an enthralling solution to this problem by distributing the workload across interconnected data centers. Hadoop is the most widely used platform for data-intensive applications such as analysis of web logs, detection of global weather patterns, and bioinformatics applications, among others. However, most Hadoop implementations assume that every node attached to a cluster is homogeneous, with the same computational capacity, which may reduce MapReduce performance by adding overhead for run-time data communication. Moreover, the majority of data placement strategies attempt to place related data close to each other for faster run-time access, but they disregard scenarios where such placement algorithms have to work with data sets that are new, either newly generated or belonging to different MapReduce jobs. This paper deals with improving MapReduce performance over multi-cluster data sets by means of a novel entropy-based data placement strategy (EDPS) that works in three phases and accounts for new data sets. In the first phase, a k-means clustering strategy is employed to extract dependencies among different datasets and group them into data groups. In the second phase, these data groups are placed in different data centers while taking the heterogeneity of virtual machines into account. Finally, the third phase uses an entropy-based grouping of the newly generated datasets, where these datasets are grouped with the most similar existing cluster based on their relative entropy. The essence of the entropy-based scheme lies in computing expected entropy, a measure of the dissimilarity of MapReduce jobs and their data usage patterns in terms of data blocks stored in HDFS, and finally placing new data among clusters such that entropy is reduced.
The experimental results show the efficacy of the proposed three-fold dynamic grouping and data placement policy, which significantly reduces execution time and improves Hadoop performance in heterogeneous clusters with varying server and user applications and parameter sizes.
Keywords: Dynamic Data Placement Strategy; Hadoop Clusters; MapReduce; k-means clustering; entropy.
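As a rough illustration of the third phase described in the abstract above, the following Python sketch assigns a new dataset to the existing data group whose usage profile has the lowest relative entropy (Kullback-Leibler divergence) with respect to it. The group names, the usage distributions, and the smoothing constant are hypothetical and are not taken from the paper; the paper's own entropy formulation may differ.

```python
import math


def relative_entropy(p, q, eps=1e-9):
    """Kullback-Leibler divergence D(p || q) in bits for two discrete distributions.

    eps is an assumed smoothing constant that avoids division by zero.
    """
    return sum(pi * math.log2(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def most_similar_group(new_usage, group_usage):
    """Return the name of the group whose usage profile diverges least from new_usage."""
    return min(group_usage, key=lambda name: relative_entropy(new_usage, group_usage[name]))


# Hypothetical normalized access frequencies over three MapReduce job types.
groups = {
    "group_a": [0.7, 0.2, 0.1],
    "group_b": [0.1, 0.3, 0.6],
}
new_dataset = [0.6, 0.3, 0.1]

chosen = most_similar_group(new_dataset, groups)
print(chosen)
```

Relative entropy is asymmetric, so a real placement policy would have to decide which distribution plays the role of the reference; the sketch simply treats the new dataset as the reference.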
Hybridization of Classifiers for Anomaly Detection in Big Data
by Rasim Alguliyev, Ramiz Aliguliyev, Fargana Abdullayeva
Abstract: Recently, the widespread use of cloud technologies has led to a rapid increase in the scale and complexity of this infrastructure. Degradation and downtime in the performance metrics of these large-scale systems are considered to be a major problem. The key issue in addressing these problems is to detect anomalies that can occur in the hardware, software and system state of cloud infrastructure. In this paper, a semi-supervised classification method based on ensemble classifiers is proposed for the detection of anomalies in the performance metrics of cloud infrastructure. In the proposed method, the Naive Bayes, J48, SMO, multilayer Perceptron, IBK and PART algorithms are used. To detect anomalous behavior in the performance metrics, the public data of the Google and Yahoo! companies and the Python 2.7, Matlab, Weka and Google Cloud SDK Shell applications are used. In the experimental study of the model, 0.90 percent detection accuracy is obtained.
Keywords: Anomaly in performance metrics; CPU-usage; memory usage; Naive Bayes; J48 decision tree; semi-supervised algorithms; ensemble classifiers; Google cluster trace.
Fast Approaches for Semantic Service Composition in Big Data Environment
by Jun Huang, Yide Zhou, Qiang Duan, Cong-cong Xing
Abstract: The widespread deployment of Web services and the rapid development of big data applications bring new challenges to Web service composition in the context of big data. The large number of Web services processing a huge amount of diverse data, together with the complex and dynamic relationships among the services, requires automatic composition of semantic Web services to be performed quickly, thereby demanding fast and cost-effective service composition algorithms. In this paper, we investigate Web service composition in big data environments by proposing novel composition algorithms with low time complexity. In our proposed approach, we decompose the service composition into three stages: construction of parameter expansion graphs, transformation of service dependence graphs, and backtracking search for service compositions. Based on the parameter expansion strategies, we then propose two fast service composition algorithms, for which we also analyze their time complexities. We conduct comparative experiments to evaluate the performance of the algorithms and validate their effectiveness using a big semantic service dataset. Our results reveal that the proposed approaches are preferable to a well-known algorithm in terms of execution time and precision.
Keywords: Big data semantics; quality of services; service composition; virtual parameter.
An Insight into Mobile Advertising and its Impact on the Resources of Hand-held Devices: A survey
by Abdurhman Albasir, Maazen Alsabaan, Kshirasagar Naik
Abstract: With the rapid advancement of mobile devices, people have become more attached to them than ever. This growth, combined with millions of applications (apps), makes smartphones a favorite means of communication among users. The content available on smartphones, both apps and web content, comes in two versions: (i) free content that is monetized via advertisements (ads); (ii) paid content that is monetized by users' subscription fees. However, on-board resources are limited, and the existence of ads can adversely impact them. These issues have created the need for a good understanding of the mobile advertising eco-system and of how such limited resources should be used efficiently.
This survey paper gives an overview of the mobile advertising eco-system and reviews the work done on the influence of such ads on smartphones' battery life and monthly data usage. It discusses and briefly addresses the open issues and research directions that need to be investigated further. This work is meant to motivate: (i) researchers to investigate the energy and bandwidth issues further and hence come up with more practical solutions; (ii) app and Web developers to consider the serious implications of embedding "expensive" ads in their apps and web pages for the end users' limited resources.
Keywords: Mobile Advertising; handheld devices; Energy Cost; Bandwidth Cost; Energy
A Five-layer Architecture for Big Data Processing and Analytics
by Yixuan Zhu, Bo Tang, Victor Li
Abstract: Big data technologies have attracted much attention in recent years. Academia and industry have reached a consensus that the ultimate goal of big data is to transform big data into real value. In this article, we discuss how to achieve this goal and propose a five-layer architecture for big data processing and analytics (BDPA), including a collection layer, a storage layer, a processing layer, an analytics layer, and an application layer. The five-layer architecture aims to set up a de facto standard for current BDPA solutions, to collect, manage, process, and analyse the vast volume of both static data and online data streams, and to make valuable decisions for all types of industries. Functionalities and challenges of the five layers are illustrated, with the most recent technologies and solutions discussed accordingly. We conclude with the requirements for future BDPA solutions, which may serve as a foundation for the future big data ecosystem.
Keywords: Big data processing and analytics (BDPA); online big data stream; five-layer architecture.
Extended results from the Measurement and Analysis of Safety in a Large City
by Rami Ibrahim, M. Omair Shafiq
Abstract: This paper presents an extended version of our measurement and analysis of data from the city of Los Angeles. More specifically, we analyzed a dataset about crimes that took place in Los Angeles. This dataset was prepared by the Los Angeles Police Department (LAPD) and is updated on a regular basis. It contains approximately 1.5 million records, where each record represents a crime incident in the city. We analyzed multiple features of the dataset, including crime activity (i.e. number of crimes) in terms of year, month, weekday, time of day, area, victim sex, victim age, victim descent, suspect activities and crime seriousness index. In addition, we analyzed the reporting period of a crime incident by calculating the average reporting days (i.e. the number of days the victim took to report a crime incident) in terms of multiple factors. Our analysis uncovers unique characteristics and insights regarding safety measures and crime prevention in the city. This extended version of the paper contains some new results and discussions. These include new graphs for the number of crimes based on suspect activities and crime seriousness index, a new graph for crime distribution based on crime seriousness index, and the average reporting period based on crime seriousness index. We also introduce a section that discusses the potential implications of our analytical results.
Keywords: Data Analysis; Smart City; Crime Statistics; Dataset; Extended version of the original paper.
Perceptions of independent financial advisors on the usefulness of Big Data in the context of decision making in the UK
by Ketty Grishikashvili, Clemens Bechter
Abstract: Big Data Analytics (BDA) is based on massive amounts of structured and unstructured data and should extract meaningful, accurate, and relevant data. However, what is considered meaningful and relevant is a perception of the analyst. In our paper we focus on perceptions and not on the technical implementation of BDA. For this we looked into one of the most advanced service markets: the financial services market in the UK. We chose eight independent financial advisors who have had first-hand experience with large amounts of structured and unstructured data. Results show that there is moderate enthusiasm about the new possibilities of BDA, such as better decision making and better market segmentation. This implies the necessity for easy-to-use tools and training on the side of the financial advisor as well as on the client's side. The paper fills a research gap by analyzing the actual impact of BDA on a real company and trying to measure the value of BDA for a financial organization in the form of improved customer relationships through prescriptive analytics.
Keywords: Big Data; Financial Services; Independent Financial Advisors.
Improving straggler task performance in a heterogeneous MapReduce framework using reinforcement learning
by Nenavath Srinivas Naik, Atul Negi, V.N. Sastry
Abstract: MapReduce is one of the most significant distributed and parallel processing frameworks for large-scale data-intensive jobs proposed in recent times. Intelligent scheduling decisions can potentially help in significantly reducing the overall runtime of jobs. It is observed that the total time to completion of a job gets extended because of some slow tasks. Especially in heterogeneous environments, the job completion times do not synchronise. As originally conceived, the default MapReduce scheduler was not very effective at identifying slow tasks. In the literature, the longest approximate time to end (LATE) scheduler extends scheduling to heterogeneous environments, but it has limitations in properly estimating the progress of tasks because it takes a static view of task progress. In this paper, we propose a novel reinforcement learning-based MapReduce scheduler for heterogeneous environments called the MapReduce reinforcement learning (MRRL) scheduler. It observes the system state of task execution and suggests speculative re-execution of the slower tasks on available nodes in the heterogeneous cluster without assuming any prior knowledge of the environmental characteristics. The experimental results show consistent improvements in performance compared to the LATE and Hadoop default schedulers for different workloads of the Hi-bench benchmark suite.
Keywords: MapReduce; reinforcement learning; speculative execution; task scheduler; heterogeneous environments.
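To make the "static view" of task progress mentioned in the abstract above concrete, here is a minimal Python sketch of the time-left heuristic used by the LATE scheduler as it is commonly described: a task's remaining time is extrapolated from its average progress rate so far. This is not the MRRL scheduler proposed in the paper; the task names, progress scores, and elapsed times are hypothetical.

```python
def estimated_time_left(progress_score, elapsed_seconds):
    """LATE-style estimate: assume the task keeps its average progress rate.

    progress_score is the completed fraction of the task, between 0 and 1.
    """
    if progress_score <= 0:
        return float("inf")  # no progress yet, so no finite estimate
    progress_rate = progress_score / elapsed_seconds
    return (1.0 - progress_score) / progress_rate


# Hypothetical running tasks: name -> (progress score, seconds elapsed).
tasks = {
    "task_1": (0.9, 60.0),
    "task_2": (0.3, 60.0),
    "task_3": (0.6, 60.0),
}

estimates = {name: estimated_time_left(score, elapsed)
             for name, (score, elapsed) in tasks.items()}

# The task with the longest estimated time to end is the speculation candidate.
slowest = max(estimates, key=estimates.get)
print(slowest)
```

Because the estimate depends only on the average rate so far, a task whose speed changes mid-run is misjudged, which is the kind of limitation a learning-based scheduler aims to address.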
Algorithms for fast estimation of social network centrality measures
by Ashok Kumar, R. Chulaka Gunasekara, Kishan G. Mehrotra, Chilukuri K. Mohan
Abstract: Centrality measures are extremely important in the analysis of social networks, with applications such as the identification of the most influential individuals for effective target marketing. Eigenvector centrality and PageRank are among the most useful centrality measures, but computing these measures can be prohibitively expensive for large social networks. This paper explores multiple approaches to reduce the computational effort required to compute relative centrality measures. First, we show that small neural networks can be effective in fast estimation of the relative ordering of vertices in a social network based on these centrality measures. Then, we show how network sampling can be used to reduce the running times for calculating the ordering of vertices; degree centrality-based sampling reduces the running time of the key node identification problem. Finally, we propose the approach of incremental updating of centrality measures in dynamic networks.
Keywords: social network; centrality; eigenvector centrality; PageRank; network sampling; incremental updating.
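For context, the exact computation that the estimation approaches in the abstract above aim to speed up can be sketched as a plain power iteration. The Python example below computes PageRank on a tiny, hypothetical directed network and derives the relative ordering of its vertices. It is a baseline illustration only, not any of the paper's methods (neural estimation, sampling, or incremental updating); the damping factor, tolerance, and graph are assumptions.

```python
def pagerank(adjacency, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank on a dict mapping each vertex to its out-neighbours.

    Every neighbour must itself appear as a key of the dict.
    """
    nodes = list(adjacency)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Rank held by vertices without out-links is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not adjacency[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for w in adjacency[v]:
                new_rank[w] += damping * rank[v] / len(adjacency[v])
        converged = sum(abs(new_rank[v] - rank[v]) for v in nodes) < tol
        rank = new_rank
        if converged:
            break
    return rank


# Hypothetical four-vertex directed network.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}

scores = pagerank(graph)
ordering = sorted(scores, key=scores.get, reverse=True)
print(ordering)
```

Each iteration touches every edge once, so the cost grows with the size of the network and the number of iterations, which is why cheaper estimates of the ordering are attractive for large graphs.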
Collective tweet analysis for accurate user sentiment analysis - a case study with Delhi Assembly Election 2015
by Lija Mohan, M. Sudheep Elayidom
Abstract: Social media postings range from the environment and politics to technology and the entertainment industry. Since this can be construed as a form of collective wisdom, the authors decided to investigate its power at predicting real-world outcomes. The objective was to design a keyword-aware, user-based collective tweet mining approach to identify the opinion of each user, which proved to be more accurate than sentiment analysis performed on each tweet. To make our application scalable, MapReduce programming on a Hadoop distributed processing framework is utilised. From the analysis of the 2015 Delhi Assembly Elections case study, we correctly predicted that the Aam Aadmi Party had higher support than the existing ruling party, BJP. We also compared our sentiment analysis algorithm with other existing techniques and found that ours is efficient in terms of space and time complexity, which makes it suitable for other big data applications.
Keywords: twitter analysis; collective tweet analysis; sentiment analysis; big data; hadoop; Map Reduce.
Comparison of Hive's query optimisation techniques
by Sikha Bagui, Keerthi Devulapalli
Abstract: The ever increasing size of data sets in this big data era has forced data analytics to be moved from traditional RDBMS systems to distributed technologies like Hadoop. Since data analysts are more familiar with SQL than the MapReduce programming paradigm, HiveQL was built on Hadoop's MapReduce framework. Traditional RDBMS query optimisation techniques used in the rule-based optimiser (RBO) of Hive do not perform well in the MapReduce environment, hence, the correlation optimiser (CRO) and cost-based optimisers (CBOs) were developed. These optimisers perform query optimisations taking the MapReduce execution framework into account. In this work, the three optimisers, RBO, CRO, and CBO are compared. Queries with common intra-query operations are found to be better optimised with CRO.
Keywords: Hive; query optimisation; correlation optimiser; CRO; rule-based optimiser; RBO; cost-based optimiser; CBO.
Big data ensemble clinical prediction for healthcare data by using deep learning model
by Sreekanth Rallapalli, R.R. Gondkar
Abstract: Big data has revolutionised the healthcare industry. Electronic health records (EHRs) are growing at an exponential rate. Healthcare data, being unstructured in nature, requires completely new technology to process it. Clinical applications also need machine learning techniques and data mining methods, which include decision trees and artificial neural networks. Classification algorithms have to be considered for developing predictive models. Ensemble models are gaining popularity over individual classifiers because ensemble systems can provide better accuracy. In this paper, four algorithms (support vector machines, naïve Bayes, random forest and deep learning models) are combined to design the ensemble framework. The deep learning model is used to find the predicted labels. The data sets are collected from the MIMIC-III clinical database repository. Results show that the proposed ensemble model provides better accuracy when the deep learning model is included, as deep learning is an efficient method for complex problems and large data sets.
Keywords: algorithm; big data; classification; decision trees; deep learning; electronic health records; EHR; ensemble model; predictive model.
Resource management for deadline constrained MapReduce jobs for minimising energy consumption
by Adam Gregory, Shikharesh Majumdar
Abstract: Cloud computing has emerged as one of the leading platforms for processing large-scale data-intensive applications. Such applications are executed in large clusters and data centres, which require a substantial amount of energy. Energy consumption within data centres accounts for a considerable fraction of costs and is a significant contributor to global greenhouse gas emissions. Therefore, minimising energy consumption in data centres is a critical concern for data centre operators, cluster owners, and cloud service providers. In this paper, we devise a novel energy-aware MapReduce resource manager for an open system, called EAMR-RM, that can effectively perform matchmaking and scheduling of MapReduce jobs with the objective of minimising data centre energy consumption. Each job is characterised by a service level agreement (SLA) for performance that includes a client-specified earliest start time, execution time, and deadline. Performance analysis demonstrates that, for the range of system and workload parameters experimented with, the proposed technique can effectively satisfy SLA requirements while achieving up to a 45% reduction in energy consumption compared to approaches which do not consider energy in resource management decisions.
Keywords: resource management on clouds; MapReduce with deadlines; constraint programming; energy management; big data analytics; job turnaround time; big data; service level agreement.
Special Issue on: DaSAA 2017 Recent advances in Data Sciences and Applications
Unified Framework for Data Management in Multi-Cloud Environment
by Kirthica S, Sabireen H, Rajeswari Sridhar
Abstract: Cloud storage is growing rapidly, as it offers on-demand and highly elastic storage provisioning. The recent emergence of data management in multi-cloud environments opens new challenges in selecting a suitable cloud for storage from many services based on data access patterns. Indeed, it is crucial to avoid the vendor lock-in problem and to increase availability and durability by managing unlimited data in multiple clouds, providing access to various cloud services based on customer requirements.
This work provides an approach to dynamically pick optimal clouds from a heterogeneous multi-cloud environment based on data access patterns. The voluminous data is split into chunks using a data split algorithm, and each chunk is efficiently predicted using a predictor engine. Depending on the predicted chunks, suitable clouds are picked using the proposed cloud picker algorithm, which picks clouds based on QoS attributes and a weighted threshold coefficient. Furthermore, the data is placed into the selected clouds after checking the availability of the clouds. Additionally, the proposed data retention policies, fTUBA and TR^P, are applied to the data along with homomorphic encryption to provide security and to place the highest-priority data into the appropriate cloud. The archived data is placed in a cloud from the set of selected clouds based on the cloud picker algorithm. The results indicate the cost effectiveness of storing huge data in a multi-cloud environment to provide best-of-breed services.
Keywords: Cloud Computing; Data Management; Multi-cloud; Cloud Inter-operation.
Special Issue on: DataCom 2017 Big Data Infrastructure and Deep Learning Applications
A Scalable System for Executing and Scoring K-Means Clustering Techniques and Its Impact on Applications in Agriculture
by Nevena Golubovic, Chandra Krintz, Rich Wolski, Balaji Sethuramasamyraja, Bo Liu
Abstract: We present Centaurus, a scalable, open source clustering service for k-means clustering of correlated, multidimensional data. Centaurus provides users with automatic deployment via public or private cloud resources, model selection (using Bayesian Information Criterion), and data visualization.
We apply Centaurus to a real-world agricultural analytics application and compare its results to the industry standard clustering approach. The application uses soil electrical conductivity measurements, GPS coordinates, and elevation data from a field to produce a map of differing soil zones (so that management can be specialized for each). We use Centaurus and these datasets to empirically evaluate the impact of considering multiple k-means variants and large numbers of experiments. We show that Centaurus yields more consistent and useful clusterings than the competitive approach for use in zone-based soil decision-support applications where a hard decision is required.
Keywords: K-means Clustering; Cloud Computing.
Scalable Mining, Analysis, and Visualization of Protein-Protein Interaction Networks
by Shaikh Arifuzzaman, Bikesh Pandey
Abstract: Proteins are linear chain biomolecules that are the basis of functional networks in all organisms. Protein-protein interaction (PPI) networks are the networks of protein complexes formed by biochemical events and electrostatic forces. PPI networks can be used to study diseases and discover drugs. The causes of diseases are evident on a protein interaction level. For instance, an elevation of interaction edge weights of oncogenes is manifested in cancers. Further, the majority of approved drugs target a particular PPI, and thus studying PPI networks is vital to drug discovery.
The availability of large datasets and the need for efficient analysis necessitate the design of scalable methods leveraging modern high-performance computing (HPC) platforms. In this paper, we design a lightweight framework on a distributed-memory parallel system, which includes scalable algorithmic and analytic techniques to study PPI networks and visualize them. Our study of PPIs is based on network-centric mining and analysis approaches. Since PPI networks are signed (labeled) and weighted, many existing network mining methods that work on simple unweighted networks are unsuitable for studying PPIs. Further, the large volume and variety of such data limit the use of sequential tools or methods. Many existing tools also do not support a convenient workflow that runs from automated data preprocessing to visualizing results and reports for efficient extraction of intelligence from large-scale PPI networks. Our framework supports automated analytics based on a large range of extensible methods for extracting signed motifs, computing centrality, and finding functional units. We design MPI (Message Passing Interface) based parallel methods and workflows, which scale to large networks. The framework is also extensible and sufficiently generic. To the best of our knowledge, all these capabilities collectively make our tool novel.
Keywords: protein-protein interaction; biological networks; network visualization; massive networks; HPC systems; network mining.
Optimizing NBA Player Signing Strategies Based on Practical Constraints and Statistics Analytics
by Lin Li
Abstract: In the National Basketball Association (NBA), how to comprehensively measure a player's performance and how to sign talented players with reasonable contracts are always challenging questions. Due to various practical constraints such as the salary cap and the players' on-court minutes, no team can sign all desired players. To ensure the team's competence on both offense and defense, players' efficiency must be comprehensively evaluated. This research studied the key indicators widely used to measure player efficiency and team performance. Through data analytics, the most frequently referenced statistics, including Player Efficiency Rating, Defense Rating, Real Plus Minus, Points, Rebounds, Assists, Blocks, Steals, etc., were chosen to formulate the prediction of the team winning rate in different schemes. Based on the models trained and tested, two player selection strategies were proposed according to different objectives and constraints. Experimental results show that the developed team winning rate prediction models have high accuracy and the player selection strategies are effective.
Keywords: Optimization; Prediction; Regression; Linear Programming; Sports Data Analytics.
Network Traffic Driven Storage Repair
by Danilo Gligoroski, Katina Kralevska, Rune Jensen, Per Simonsen
Abstract: Recently we constructed an explicit family of locally repairable and locally regenerating codes. Their existence was proven by Kamath et al. but no explicit construction was given. Our design is based on HashTag codes that can have different sub-packetization levels. In this work we emphasize the importance of having two ways to repair a node: repair only with local parity nodes or repair with both local and global parity nodes. We say that the repair strategy is network traffic driven since it is in connection with the concrete system and code parameters: the repair bandwidth of the code, the number of I/O operations, the access time for the contacted parts and the size of the stored file. We show the benefits of having repair duality in one practical example implemented in Hadoop. We also give algorithms for efficient repair of the global parity nodes.
Keywords: Vector codes; Repair bandwidth; Repair locality; Exact repair; Parity-splitting; Global parities; Hadoop.
MapReduce based fuzzy very fast decision tree for constructing prediction intervals
by Ojha Manish Kumar, Kumar Ravi, Vadlamani Ravi
Abstract: Prediction Interval is a methodology to measure the uncertainties in point forecasts and predictions. In this paper, we propose a fuzzy version of the Very Fast Decision Tree (VFDT) to predict the Lower Upper Bound Estimation (LUBE), which is further compared with VFDT. VFDT is a one-pass incremental decision tree learner, which scans each instance only once. The proposed fuzzy VFDT is able to capture the intrinsic features of VFDT as well as the uncertainties present in the data. It outputs fuzzy if-then rules. The VFDT and fuzzy VFDT were trained using the LUBE method. Due to the increasing demand for Cloud Computing and the challenges of Big Data, the traditional decision tree is not an evident option, especially when the volume of data is large. Hence, we implemented VFDT, and developed and implemented Fuzzy VFDT, using the Apache Hadoop MapReduce framework, where multiple slave nodes build a VFDT and fuzzy VFDT model. The developed models were tested on 6 datasets taken from the web. We conducted sensitivity analysis by varying the window size of the data stream and the number of bins in discretization and observing their impact on the final results in all datasets. Experiments with real-world case studies demonstrated that the proposed MapReduce-based Fuzzy VFDT and VFDT can construct high-quality prediction intervals precisely and quickly.
Keywords: VFDT; Fuzzy VFDT; MapReduce; Prediction Interval; Big Data.
A Computational Bayesian Approach for Estimating Density Functions Based on Noise-Multiplied Data
by Yan-Xia Lin
Abstract: In this big data era, an enormous amount of personal and company information can be easily collected by third parties. Sharing the data with the public and allowing data users to access the data for data mining often bring many benefits to the public, policymakers, national economy, and society. However, sharing the micro data with the public usually causes the issue of data privacy. Protecting data privacy and mining statistical information of the original data from protected data are the essential issues in big data.
Protecting data privacy through noise-multiplied data is one of the approaches studied in the literature. This paper introduces the B-M L2014 Approach for estimating the density function of the original data based on noise-multiplied microdata.
This paper shows applications of the B-M L2014 Approach and demonstrates that the statistical information of the original data can be retrieved from their noise-multiplied data reasonably while the disclosure risk is under control if the multiplicative noise used to mask the original data is appropriate. The B-M L2014 Approach provides a new data mining technique for big data when data privacy is concerned.
Keywords: Data mining; Data anonymization; Privacy-preserving.
New algorithms for inferring gene regulatory networks from time-series expression data on Apache Spark
by Yasser Abduallah, Jason T.L. Wang
Abstract: Gene regulatory networks (GRNs) are crucial to understand the inner workings of the cell and the complexity of gene interactions. Numerous algorithms have been developed to infer GRNs from gene expression data. As the number of identified genes increases and the complexity of their interactions is uncovered, gene networks become cumbersome to test. Furthermore, prodding through experimental results requires an enormous amount of computation, resulting in slow data processing. Therefore, new approaches are needed to analyse copious amounts of experimental data from cellular GRNs. To meet this need, cloud computing is promising as reported in the literature. Here we present two new algorithms for reverse engineering GRNs in a cloud environment. The algorithms, implemented in Spark, employ an information-theoretic approach to infer GRNs from time-series gene expression data. Experimental results show that one of our new algorithms is faster than, yet as accurate as, two existing cloud-based GRN inference methods.
Keywords: network inference; systems biology; spark; big data; MapReduce; gene regulatory networks; GRN; time-series; gene expression; big data intelligence.
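To give a flavour of what an information-theoretic score on time-series expression data can look like, the minimal Python sketch below computes the mutual information between a candidate regulator at time t and a target gene at time t + lag, using discretized expression levels. This is a generic illustration in plain Python, not the authors' Spark algorithms; the function names, the binary discretization, and the toy series are assumptions of the example.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete series."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        p_xy = c / n
        p_x = count_x[x] / n
        p_y = count_y[y] / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi


def lagged_mi(regulator, target, lag=1):
    """Score a candidate edge regulator -> target at a time lag (lag >= 1).

    Pairs the regulator's level at time t with the target's level at
    time t + lag and returns their mutual information.
    """
    return mutual_information(regulator[:-lag], target[lag:])


# Toy discretized expression levels (0 = low, 1 = high); hypothetical data.
regulator = [0, 1, 0, 1, 1, 0, 0, 1, 0]
# The target copies the regulator one time step later.
target = [0] + regulator[:-1]

score = lagged_mi(regulator, target, lag=1)
print(f"lagged mutual information: {score:.3f} bits")
```

A high lagged score suggests that the regulator's past level is informative about the target's current level. A real inference method would also need a significance threshold and a way to discount indirect interactions, neither of which is shown here.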
Text Visualization for Feature Selection in Online Review Analysis
by Keerthika Koka, Shiaofen Fang
Abstract: Opinion spamming is a reality, and it can have unpleasant consequences in the retail industry. While there have been several promising research works on distinguishing fake online reviews from genuine ones, few have involved visualization and visual analytics. The purpose of this work is to show that feature selection through visualization is at least as powerful as the best automatic feature selection algorithms. This is achieved by applying a radial chart visualization technique to the classification of online reviews into fake and genuine reviews. The radial chart and its color overlaps are used to explore the best feature selection through visualization for classification. Parallel coordinate visualization of the review data is also explored and compared with the radial chart results. The system gives a structure to each text review based on certain attributes, compares how different or similar the structures of the different or same categories are, and highlights the key features that contribute most to the classification. Our visualization technique helps the user gain insights into the high-dimensional data by providing means to eliminate the worst features right away, pick some of the best features without statistical aids, and understand the behavior of the dimensions in different combinations.
Keywords: multi-dimensional visualization; feature detection; text mining; online review analysis.
DeepSim: Cluster Level Behavioral Simulation Model for Deep Learning
by Yuankun Shi, Kevin Long, Kaushik Balasubramanian, Zhaojuan Bian, Adam Procter, Ramesh Illikkal
Abstract: We are witnessing an explosion of AI use cases driving the computer industry, and especially datacenter and server architectures. As Intel faces fierce competition in this emerging technology space, it is critical that architecture definitions and directions are driven with data from proper tools and methodologies, and that insights are drawn from end-to-end holistic analysis at the datacenter level. In this paper, we introduce DeepSim, a cluster-level behavioral simulation model for deep learning. DeepSim, which is based on the Intel CoFluent simulation framework, uses timed behavioral models to simulate complex interworking between compute nodes, networking, and storage at the datacenter level, providing a realistic performance model of a real-world image recognition application based on the popular deep learning framework Caffe. The end-to-end simulation data from DeepSim provides insights which can be used for architecture analysis driving future datacenter architecture directions. DeepSim enables scalable system design, deployment, and capacity planning through accurate performance insights. Results from preliminary scaling studies (e.g., node scaling and network scaling) and what-if analyses (e.g., Xeon with HBM and Xeon Phi with dual OPA) are presented in this paper. The simulation results correlate well with empirical measurements, achieving an accuracy of 95%.
Keywords: Deep Learning; Datacenter; Behavioral Simulation; AlexNet; Architecture Analysis; Performance Analysis; Server Architecture.
Predicting Hospital Length of Stay Using Neural Networks
by Robert Steele
Abstract: Accurate prediction of hospital length of stay can provide benefits for hospital resource planning and quality-of-care. We describe the utilization of neural networks for predicting the length of hospital stay for patients with various diagnoses based on selected administrative and clinical attributes. An all-condition neural network, which can be applied to all patients and is not limited to a specific diagnosis, is trained to predict whether a patient's stay will be long or short, using the median length of stay as the cut-off between long and short, with the prediction made at the time the patient leaves the intensive care unit. In addition, neural networks are trained to predict whether patients with each of fourteen specific common primary diagnoses will have a long or short stay, defined as greater than, or less than or equal to, the median length of stay for that particular condition. Our dataset is drawn from the MIMIC III database. Our prediction accuracy is approximately 80% for the all-condition neural network; the neural networks for specific conditions generally demonstrated higher accuracy, and all clearly outperformed any linear model.
Keywords: length of stay; health analytics; neural networks; MIMIC III.
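The long/short labelling described in the abstract above can be sketched in a few lines of Python: a stay is labelled long when it exceeds the median length of stay, and short otherwise. This covers only the construction of the classification target, not the neural networks themselves; the function name and the sample values are hypothetical and are not taken from the MIMIC III database.

```python
import statistics


def label_long_stay(lengths_of_stay):
    """Label each stay 1 (long) if it exceeds the median, else 0 (short).

    Returns the median cut-off together with the list of labels. Stays
    equal to the median are labelled short, matching the "less than or
    equal to" wording of the abstract.
    """
    cutoff = statistics.median(lengths_of_stay)
    labels = [1 if los > cutoff else 0 for los in lengths_of_stay]
    return cutoff, labels


# Hypothetical lengths of stay in days.
stays = [2.0, 3.5, 1.0, 7.0, 4.0, 12.0]
cutoff, labels = label_long_stay(stays)
print(cutoff, labels)
```

For the condition-specific networks, the same function would be applied separately to the stays of each primary diagnosis, so that each condition gets its own median cut-off.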
Special Issue on: DataCom 2017 Big Data Infrastructure and Deep Learning Applications
Real-Time Event Search using Social Stream for Inbound Tourist Corresponding to Place and Time
by Ruriko Kudo, Miki Enoki, Akihiro Nakao, Shu Yamamoto, Saneyasu Yamaguchi, Masato Oguchi
Abstract: Since the decision was made to hold the Olympic Games in 2020 in Tokyo, the number of foreign tourists visiting the city has been increasing rapidly. Accordingly, tourists have been seeking more sightseeing information. While guidebooks are good for pointing out popular tourist attractions, it is more difficult for tourists to get information on local events and spots that are just becoming popular. We developed a tourist information distribution system that sends information corresponding to a place and time. The system extracts event information from social media streams on a per-place and per-time basis and provides it to tourists. In order to extract useful information, we performed event classification using actual Twitter data. We examined how to distribute the events in order and make the system more user-friendly. Furthermore, we developed an information supplement function using external information.
Keywords: Twitter; Social media stream; Local event; Information extract.