International Journal of Big Data Intelligence (16 papers in press)
A Survey of Computation Techniques on Time Evolving Graphs
by Shalini Sharma, Jerry Chou
Abstract: A Time Evolving Graph (TEG) is a graph whose topology or attribute values change over time due to update events, including edge addition/deletion, vertex addition/deletion, and attribute changes on vertices or edges. Driven by the Big Data paradigm, the ability to process and analyze TEGs in a timely fashion is critical in many application domains, such as social networks, web graphs, and road network traffic. Recently, many research efforts have aimed to address the challenges of volume and velocity that arise from dealing with such datasets. However, it remains an active and challenging research topic. Therefore, in this survey, we summarize the state-of-the-art computation techniques for TEGs. We collect these techniques from three different research communities: i) the data mining community for graph analysis; ii) the theory community for graph algorithms; iii) the computation community for graph computing frameworks. Based on our study, we also propose our own computing framework, DASH, for TEGs. We have also performed experiments comparing DASH and the Graph Processing System (GPS). We are optimistic that this paper will help many researchers understand the various dimensions of problems in TEGs and continue developing the necessary techniques to resolve these problems more efficiently.
Keywords: Big Data; Time evolving graphs; Computing framework; Algorithm; Data Mining.
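As a minimal sketch of the update events this abstract names, the following Python snippet replays a timestamped event log to build a snapshot of an evolving graph at a given time. The event names, the `apply_event` and `snapshot_at` helpers, and the dictionary layout are illustrative assumptions; they are not taken from DASH, GPS, or any surveyed system.

```python
from typing import Any

# Hypothetical event format (an assumption): (timestamp, kind, payload).
Event = tuple[int, str, tuple[Any, ...]]


def new_graph() -> dict:
    """Empty directed graph: vertex -> attributes, (src, dst) -> attributes."""
    return {"vertices": {}, "edges": {}}


def apply_event(graph: dict, kind: str, payload: tuple) -> None:
    """Apply one update event in place."""
    vertices, edges = graph["vertices"], graph["edges"]
    if kind == "add_vertex":
        (u,) = payload
        vertices.setdefault(u, {})
    elif kind == "del_vertex":
        (u,) = payload
        vertices.pop(u, None)
        # Deleting a vertex also removes every edge incident to it.
        for key in [k for k in edges if u in k]:
            del edges[key]
    elif kind == "add_edge":
        a, b = payload
        vertices.setdefault(a, {})
        vertices.setdefault(b, {})
        edges.setdefault((a, b), {})
    elif kind == "del_edge":
        a, b = payload
        edges.pop((a, b), None)
    elif kind == "set_vertex_attr":
        u, name, value = payload
        vertices[u][name] = value  # KeyError if the vertex does not exist
    elif kind == "set_edge_attr":
        a, b, name, value = payload
        edges[(a, b)][name] = value  # KeyError if the edge does not exist
    else:
        raise ValueError(f"unknown event kind: {kind}")


def snapshot_at(events: list[Event], t: int) -> dict:
    """Replay all events with timestamp <= t, in time order."""
    graph = new_graph()
    for ts, kind, payload in sorted(events, key=lambda ev: ev[0]):
        if ts > t:
            break
        apply_event(graph, kind, payload)
    return graph


events = [
    (1, "add_edge", ("a", "b")),
    (2, "set_vertex_attr", ("a", "weight", 5)),
    (3, "add_edge", ("b", "c")),
    (4, "del_vertex", ("b",)),
]
print(snapshot_at(events, 3))
print(snapshot_at(events, 4))
```

Replaying the whole log for each query is the simplest possible design; it is shown only to make the event model concrete and does not scale to the volume and velocity the abstract is concerned with.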
Special Issue on: DataCom 2017 Big Data Infrastructure and Deep Learning Applications
A Scalable System for Executing and Scoring K-Means Clustering Techniques and Its Impact on Applications in Agriculture
by Nevena Golubovic, Chandra Krintz, Rich Wolski, Balaji Sethuramasamyraja, Bo Liu
Abstract: We present Centaurus, a scalable, open source clustering service for K-means clustering of correlated, multidimensional data. Centaurus provides users with automatic deployment via public or private cloud resources, model selection (using the Bayesian information criterion), and data visualisation. We apply Centaurus to a real-world agricultural analytics application and compare its results to the industry standard clustering approach. The application uses soil electrical conductivity (EC) measurements, GPS coordinates, and elevation data from a field to produce a map of differing soil zones (so that management can be specialised for each). We use Centaurus and these datasets to empirically evaluate the impact of considering multiple K-means variants and large numbers of experiments. We show that Centaurus yields more consistent and useful clusterings than the competitive approach for use in zone-based soil decision-support applications where a hard decision is required.
Keywords: K-means clustering; cloud computing.
Optimizing NBA Player Signing Strategies Based on Practical Constraints and Statistics Analytics
by Lin Li
Abstract: In the National Basketball Association (NBA), comprehensively measuring a player's performance and signing talented players on reasonable contracts are always challenging. Due to various practical constraints such as the salary cap and the players' on-court minutes, no team can sign all desired players. To ensure the team's competence on both offense and defense, each player's efficiency must be comprehensively evaluated. This research studied the key indicators widely used to measure player efficiency and team performance. Through data analytics, the most frequently cited statistics, including Player Efficiency Rating, Defense Rating, Real Plus Minus, Points, Rebounds, Assists, Blocks, Steals, etc., were chosen to formulate the prediction of the team winning rate in different schemes. Based on the models trained and tested, two player selection strategies were proposed according to different objectives and constraints. Experimental results show that the developed team winning rate prediction models have high accuracy and the player selection strategies are effective.
Keywords: Optimization; Prediction; Regression; Linear Programming; Sports Data Analytics.
Network Traffic Driven Storage Repair
by Danilo Gligoroski, Katina Kralevska, Rune Jensen, Per Simonsen
Abstract: Recently we constructed an explicit family of locally repairable and locally regenerating codes. Their existence was proven by Kamath et al. but no explicit construction was given. Our design is based on HashTag codes that can have different sub-packetization levels. In this work we emphasize the importance of having two ways to repair a node: repair only with local parity nodes or repair with both local and global parity nodes. We say that the repair strategy is network traffic driven since it is in connection with the concrete system and code parameters: the repair bandwidth of the code, the number of I/O operations, the access time for the contacted parts and the size of the stored file. We show the benefits of having repair duality in one practical example implemented in Hadoop. We also give algorithms for efficient repair of the global parity nodes.
Keywords: Vector codes; Repair bandwidth; Repair locality; Exact repair; Parity-splitting; Global parities; Hadoop.
MapReduce based fuzzy very fast decision tree for constructing prediction intervals
by Ojha Manish Kumar, Kumar Ravi, Vadlamani Ravi
Abstract: A prediction interval is a way to measure the uncertainty in point forecasts and predictions. In this paper, we propose a fuzzy version of the Very Fast Decision Tree (VFDT) for Lower Upper Bound Estimation (LUBE) and compare it with VFDT. VFDT is a one-pass incremental decision tree learner, which scans each instance only once. The proposed fuzzy VFDT is able to capture the intrinsic features of VFDT as well as the uncertainties present in the data. It outputs fuzzy if-then rules. The VFDT and fuzzy VFDT were trained using the LUBE method. Given the increasing demand for Cloud Computing and the challenges of Big Data, the traditional decision tree is not an obvious choice, especially when the volume of data is large. Hence, we implemented VFDT, and developed and implemented fuzzy VFDT, using the Apache Hadoop MapReduce framework, where multiple slave nodes build VFDT and fuzzy VFDT models. The developed models were tested on 6 datasets taken from the web. We conducted a sensitivity analysis by varying the window size of the data stream and the number of bins in discretization, and observing their impact on the final results in all datasets. Experiments with real-world case studies demonstrated that the proposed MapReduce-based fuzzy VFDT and VFDT can construct high-quality prediction intervals precisely and quickly.
Keywords: VFDT; Fuzzy VFDT; MapReduce; Prediction Interval; Big Data.
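The abstract does not say how interval quality was scored. As a hedged illustration, two measures commonly reported alongside LUBE-style intervals are the coverage probability (often abbreviated PICP) and the normalised average width (often abbreviated PINAW). The function names and sample numbers below are assumptions made for this sketch, not values from the paper.

```python
def picp(targets, lower, upper):
    """Fraction of targets that fall inside their [lower, upper] interval."""
    covered = sum(1 for y, lo, hi in zip(targets, lower, upper) if lo <= y <= hi)
    return covered / len(targets)


def pinaw(targets, lower, upper):
    """Mean interval width divided by the range of the targets."""
    target_range = max(targets) - min(targets)
    mean_width = sum(hi - lo for lo, hi in zip(lower, upper)) / len(targets)
    return mean_width / target_range


# Made-up example data (assumption): four observations and their intervals.
targets = [1.0, 2.0, 3.0, 5.0]
lower = [0.5, 2.5, 2.0, 4.0]
upper = [1.5, 3.5, 4.0, 6.0]

print(picp(targets, lower, upper))   # 0.75: three of four targets are covered
print(pinaw(targets, lower, upper))  # 0.375: mean width 1.5 over range 4.0
```

A good interval predictor keeps coverage high while keeping width low; the two measures pull in opposite directions, which is why they are usually reported together.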
Real-Time Event Search using Social Stream for Inbound Tourist Corresponding to Place and Time
by Ruriko Kudo, Miki Enoki, Akihiro Nakao, Shu Yamamoto, Saneyasu Yamaguchi, Masato Oguchi
Abstract: Since the decision was made to hold the Olympic Games in 2020 in Tokyo, the number of foreign tourists visiting the city has been increasing rapidly. Accordingly, tourists have been seeking more sightseeing information. While guidebooks are good for pointing out popular tourist attractions, it is more difficult for tourists to get information on local events and spots that are just becoming popular. We developed a tourist information distribution system that sends information corresponding to a place and time. The system extracts event information from social media streams for each place and time and provides it to tourists. In order to extract useful information, we performed event classification using actual Twitter data. We examined how to order the distributed events and make the system more user-friendly. Furthermore, we developed an information supplement function using external information.
Keywords: Twitter; Social media stream; Local event; Information extract.
Two-Channel Convolutional Neural Network for Facial Expression Recognition using Facial Parts
by Hui Wang, Jiang Lu, Lucy Nwosu, Ishaq Unwala
Abstract: This paper proposes the design of a Facial Expression Recognition system based on a deep convolutional neural network that uses facial parts. In this work, a solution for facial expression recognition that uses a combination of algorithms for face detection, feature extraction and classification is discussed. The proposed method builds a two-channel convolutional neural network model in which facial parts are used as inputs: the extracted eyes are the input to the first channel, while the mouth is the input to the second channel. Feature information from both channels converges in a fully connected layer, which is used to learn global information from these local features and is then used for expression classification. Experiments are carried out on the Japanese Female Facial Expression dataset and the Extended Cohn-Kanade dataset to determine the expression recognition accuracy of the proposed system. The results show that the system provides state-of-the-art classification accuracy of 97.6% and 95.7%, respectively, when compared to other methods.
Keywords: Facial expression recognition; Convolutional Neural Networks; Facial Parts.
A Computational Bayesian Approach for Estimating Density Functions Based on Noise-Multiplied Data
by Yan-Xia Lin
Abstract: In this big data era, an enormous amount of personal and company information can be easily collected by third parties. Sharing the data with the public and allowing data users to access the data for data mining often bring many benefits to the public, policymakers, national economy, and society. However, sharing the micro data with the public usually causes the issue of data privacy. Protecting data privacy and mining statistical information of the original data from protected data are the essential issues in big data.
Protecting data privacy through noise-multiplied data is one of the approaches studied in the literature. This paper introduces the B-M L2014 Approach for estimating the density function of the original data based on noise-multiplied microdata.
This paper shows applications of the B-M L2014 Approach and demonstrates that the statistical information of the original data can be retrieved from their noise-multiplied data reasonably while the disclosure risk is under control if the multiplicative noise used to mask the original data is appropriate. The B-M L2014 Approach provides a new data mining technique for big data when data privacy is concerned.
Keywords: Data mining; Data anonymization; Privacy-preserving.
New algorithms for inferring gene regulatory networks from time-series expression data on Apache Spark
by Yasser Abduallah, Jason T.L. Wang
Abstract: Gene regulatory networks (GRNs) are crucial to understand the inner workings of the cell and the complexity of gene interactions. Numerous algorithms have been developed to infer GRNs from gene expression data. As the number of identified genes increases and the complexity of their interactions is uncovered, gene networks become cumbersome to test. Furthermore, prodding through experimental results requires an enormous amount of computation, resulting in slow data processing. Therefore, new approaches are needed to analyse copious amounts of experimental data from cellular GRNs. To meet this need, cloud computing is promising as reported in the literature. Here we present two new algorithms for reverse engineering GRNs in a cloud environment. The algorithms, implemented in Spark, employ an information-theoretic approach to infer GRNs from time-series gene expression data. Experimental results show that one of our new algorithms is faster than, yet as accurate as, two existing cloud-based GRN inference methods.
Keywords: network inference; systems biology; spark; big data; MapReduce; gene regulatory networks; GRN; time-series; gene expression; big data intelligence.
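The abstract describes the inference approach only as information-theoretic. One generic building block of such methods is the mutual information between discretised expression profiles, optionally with a time lag between a candidate regulator and its target. The sketch below illustrates that building block only; the function names, the lag convention, and the toy data are assumptions, and it is not the authors' algorithm.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    # Sum p(x, y) * log2(p(x, y) / (p(x) * p(y))) over observed pairs.
    return sum(
        (c / n) * log2(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )


def lagged_mutual_information(regulator, target, lag=1):
    """MI between regulator[t] and target[t + lag], for lag >= 1."""
    if lag < 1 or lag >= len(regulator):
        raise ValueError("lag must be between 1 and len(regulator) - 1")
    return mutual_information(regulator[:-lag], target[lag:])


# Toy discretised profiles (assumption): 0 = low expression, 1 = high.
gene_a = [0, 1, 0, 1, 0]
gene_b = [1, 0, 1, 0, 1]  # follows gene_a one step later
print(lagged_mutual_information(gene_a, gene_b, lag=1))
```

A high lagged score suggests, but does not prove, a regulatory relationship; practical methods add thresholds or pruning steps on top of pairwise scores.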
Scalable Mining, Analysis, and Visualization of Protein-Protein Interaction Networks
by Shaikh Arifuzzaman, Bikesh Pandey
Abstract: Proteins are linear chain biomolecules that are the basis of functional networks in all organisms. Protein-protein interaction (PPI) networks are networks of protein complexes formed by biochemical events and electrostatic forces. PPI networks can be used to study diseases and discover drugs. The causes of diseases are evident on a protein interaction level. For instance, elevation of interaction edge weights of oncogenes is manifested in cancers. The availability of large datasets and need for efficient analysis necessitate the design of scalable methods leveraging modern high-performance computing (HPC) platforms. In this paper, we design a lightweight framework on a distributed-memory parallel system to study PPI networks. Our framework supports automated analytics based on methods for extracting signed motifs, computing centrality, and finding functional units. We design message passing interface (MPI)-based parallel methods and workflow, scalable to large networks. To the best of our knowledge, these capabilities collectively make our tool novel.
Keywords: protein interaction; biological networks; network visualisation; massive networks; HPC systems; network mining.
Text Visualization for Feature Selection in Online Review Analysis
by Keerthika Koka, Shiaofen Fang
Abstract: Opinion spamming is a reality, and it can have unpleasant consequences in the retail industry. While there are several promising research works on distinguishing fake online reviews from genuine ones, few have involved visualization and visual analytics. The purpose of this work is to show that feature selection through visualization is at least as powerful as the best automatic feature selection algorithms. This is achieved by applying a radial chart visualization technique to the classification of online reviews into fake and genuine reviews. The radial chart and its color overlaps are used to explore the best feature selection through visualization for classification. Parallel coordinate visualization of the review data is also explored and compared with the radial chart results. The system gives a structure to each text review based on certain attributes, compares how different or similar the structures of the different or same categories are, and highlights the key features that contribute most to the classification. Our visualization technique helps the user gain insight into the high-dimensional data by providing means to eliminate the worst features right away, pick some of the best features without statistical aids, and understand the behavior of the dimensions in different combinations.
Keywords: multi-dimensional visualization; feature detection; text mining; online review analysis.
DeepSim: Cluster Level Behavioral Simulation Model for Deep Learning
by Yuankun Shi, Kevin Long, Kaushik Balasubramanian, Zhaojuan Bian, Adam Procter, Ramesh Illikkal
Abstract: We are witnessing an explosion of AI use cases driving the computer industry, and especially datacenter and server architectures. As Intel faces fierce competition in this emerging technology space, it is critical that architecture definitions and directions are driven with data from proper tools and methodologies, and that insights are drawn from end-to-end holistic analysis at the datacenter level. In this paper, we introduce DeepSim, a cluster-level behavioral simulation model for deep learning. DeepSim, which is based on the Intel CoFluent simulation framework, uses timed behavioral models to simulate complex interworking between compute nodes, networking, and storage at the datacenter level, providing a realistic performance model of a real-world image recognition application based on the popular deep learning framework Caffe. The end-to-end simulation data from DeepSim provides insights which can be used for architecture analysis driving future datacenter architecture directions. DeepSim enables scalable system design, deployment, and capacity planning through accurate performance insights. Results from preliminary scaling studies (e.g., node scaling and network scaling) and what-if analyses (e.g., Xeon with HBM and Xeon Phi with dual OPA) are presented in this paper. The simulation results correlate well with empirical measurements, achieving an accuracy of 95%.
Keywords: Deep Learning; Datacenter; Behavioral Simulation; AlexNet; Architecture Analysis; Performance Analysis; Server Architecture.
Predicting Hospital Length of Stay Using Neural Networks
by Thanos Gentimis, Ala' J. Alnaser, Alex Durante, Kyle Cook, Robert Steele
Abstract: Accurate prediction of hospital length of stay can provide benefits for hospital resource planning and quality-of-care. We describe the utilisation of neural networks for predicting the length of hospital stay for patients with various diagnoses based on selected administrative and clinical attributes. An all-condition neural network, which can be applied to all patients and is not limited to a specific diagnosis, is trained to predict whether a patient's stay will be long or short, using the median length of stay as the cut-off between long and short, with the prediction made at the time the patient leaves the intensive care unit. In addition, neural networks are trained to predict whether patients with each of 14 specific common primary diagnoses will have a long or short stay, defined as greater than, or less than or equal to, the median length of stay for that particular condition. Our dataset is drawn from the MIMIC III database. Our prediction accuracy is approximately 80% for the all-condition neural network; the neural networks for specific conditions generally demonstrated higher accuracy, and all clearly out-performed any linear model.
Keywords: length of stay; health analytics; neural networks; MIMIC III.
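The median cut-off described in this abstract can be sketched as a labelling step that turns lengths of stay into binary targets, either across all patients or per primary diagnosis. The helper names and the sample values below are assumptions for illustration and are not drawn from MIMIC III.

```python
from collections import defaultdict
from statistics import median


def label_long_stay(lengths_of_stay):
    """Return (cutoff, labels): 1 = long (above the median), 0 = short (at or below)."""
    cutoff = median(lengths_of_stay)
    return cutoff, [1 if los > cutoff else 0 for los in lengths_of_stay]


def label_long_stay_by_diagnosis(records):
    """Label each (diagnosis, length_of_stay) record against its own diagnosis median."""
    by_diagnosis = defaultdict(list)
    for diagnosis, los in records:
        by_diagnosis[diagnosis].append(los)
    cutoffs = {d: median(values) for d, values in by_diagnosis.items()}
    return [1 if los > cutoffs[d] else 0 for d, los in records]


# Made-up lengths of stay in days (assumption).
print(label_long_stay([2, 3, 5, 8, 13]))

# Made-up (diagnosis, days) records (assumption).
records = [("sepsis", 4), ("sepsis", 10), ("sepsis", 7), ("pneumonia", 3), ("pneumonia", 9)]
print(label_long_stay_by_diagnosis(records))
```

Because stays equal to the median count as short, the two classes need not be perfectly balanced when several patients share the median value.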
Efficient Clustering Techniques on Hadoop and Spark
by Sami Al Ghamdi, Giuseppe Di Fatta
Abstract: Clustering is an essential data mining technique that divides observations into groups where each group contains similar observations. K-means is one of the most popular clustering algorithms and has been used for over 50 years. Due to the current exponential growth of data, it has become necessary to improve the efficiency and scalability of K-means even further to cope with large-scale datasets known as big data. This paper presents K-means optimisations using the triangle inequality on two well-known distributed computing platforms: Hadoop and Spark. K-means variants that use the triangle inequality usually require caching extra information from the previous iteration, which is challenging to achieve on Hadoop. Hence, this work introduces two methods to pass information from one iteration to the next on Hadoop to accelerate K-means. The experimental work shows that the efficiency of K-means on Hadoop and Spark can be significantly improved by using triangle inequality optimisations.
Keywords: K-means; Hadoop; Spark; MapReduce; efficient clustering; triangle inequality K-means.
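To make the triangle inequality idea concrete, the sketch below shows one well-known pruning rule for the K-means assignment step: if the distance between the current best centroid c and another centroid c' is at least twice the distance from the point to c, then c' cannot be closer, so its distance need not be computed. This is a single-machine illustration under assumed names; it does not show the cross-iteration caching or the Hadoop and Spark methods that the paper contributes.

```python
from math import dist


def assign_with_triangle_inequality(points, centroids):
    """Assign each point to its nearest centroid, skipping provably farther ones.

    Returns (labels, skipped), where skipped counts avoided distance computations.
    """
    k = len(centroids)
    # Pairwise centroid distances, computed once per assignment pass.
    cc = [[dist(centroids[i], centroids[j]) for j in range(k)] for i in range(k)]
    labels = []
    skipped = 0
    for p in points:
        best = 0
        best_d = dist(p, centroids[0])
        for j in range(1, k):
            # d(c_best, c_j) >= 2 * d(p, c_best) implies d(p, c_j) >= d(p, c_best).
            if cc[best][j] >= 2 * best_d:
                skipped += 1
                continue
            d = dist(p, centroids[j])
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, skipped


def assign_brute_force(points, centroids):
    """Reference assignment that computes every point-to-centroid distance."""
    return [
        min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
        for p in points
    ]


# Toy 2-D data (assumption): two well-separated groups.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
centroids = [(0.5, 0.0), (10.5, 0.0)]
print(assign_with_triangle_inequality(points, centroids))
```

The pruned assignment returns the same labels as the brute-force reference; the saving grows with the number of centroids and with how well separated the clusters are.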
A Hybrid Power Management Schema for Multi-Tier Data Centers
by Aryan Azimzadeh, Babak Maleki Shoja, Nasseh Tabrizi
Abstract: Data centers play an important role in the operation and management of IT infrastructures, but their huge power consumption is of great concern as it relates to global warming. This paper explores the sleep state of data center servers under specific conditions such as setup time and identifies an optimal number of servers that can potentially increase energy efficiency. We use a Dynamic Power Management policy-based model with the optimal number of servers required in each tier while increasing the servers' setup time after sleep mode. The Reactive approach is used to validate the results and energy efficiency by calculating the average power consumption of each server under a specific sleep mode and setup time. Our method uses average power consumption to calculate the Normalized-Performance-Per-Watt in order to evaluate power efficiency. The results indicate that the schema reported in this paper can improve power efficiency in data centers with high setup time servers.
Keywords: Power; Green; Management; Schema; Multi-Tier.
Towards an automation of the fact-checking in the journalistic web context
by Edouard NGOR SARR, SALL Ousmane
Abstract: Can fact-checking be automated? Apparently yes, given the numerous advances noted in the search and analysis of digital data. However, this task, which a priori seemed simple, turns out to be rather demanding. Indeed, automating fact-checking requires very advanced knowledge in data analysis, Web data search, Web technologies, image processing and sometimes natural language processing (NLP). Nevertheless, in recent years, much research has been conducted to deepen such analysis. In this article, having reviewed the state of the art on the question, we identify and diagnose in detail the obstacles before concluding with an explanation of the methods.
Keywords: fact-checking; data journalism; semantic web.