International Journal of Big Data Intelligence (26 papers in press)
Interoperable Identity Management Protocol for Multi-Cloud Platform
by Tania Chaudhary, Sheetal Kalra
Abstract: Multi-cloud adaptive application provisioning promises to solve the data storage problem and leads to interoperability of data within a multi-cloud environment. This also raises concern about the interoperability of users among these computing domains. Although various standards and techniques have been developed to secure the identity of the cloud consumer, none of them provides a facility both to interoperate and to secure that identity. Thus, there is a need to develop an efficient authentication protocol that maintains a single unique identity for the cloud consumer and makes it interoperable among various cloud service providers. Elliptic curve cryptography (ECC) based algorithms are the best choice among Public Key Cryptography (PKC) algorithms due to their small key sizes and efficient computation. In this paper, a secure ECC-based mutual authentication protocol for cloud service provider servers using a smart device and a one-time token is proposed. The proposed scheme achieves mutual authentication and provides interoperability among multiple cloud service providers. The security analysis of the proposed protocol shows that it is robust against the security attacks considered. The formal verification of the proposed protocol is performed using the AVISPA tool, which proves its security in the presence of an intruder.
Keywords: Authentication; Cloud Computing; Elliptic Curve Cryptography; Multi-Cloud; One Time Token; Smart Device.
A Novel Entropy Based Dynamic Data Placement Strategy for Data Intensive Applications in Hadoop Clusters
by K. Hemant Kumar Reddy, Diptendu Sinha Roy, Vishal Pandey
Abstract: In the last decade, efficient data analysis of data-intensive applications has become an increasingly important research issue. The popular map-reduce framework has offered an attractive solution to this problem by distributing the workload across interconnected data centers. Hadoop is the most widely used platform for data-intensive applications such as analysis of web logs, detection of global weather patterns and bioinformatics applications, among others. However, most Hadoop implementations assume that all nodes in a cluster are homogeneous, with the same computational capacity, which may reduce map-reduce performance by adding overhead for run-time data communication. Most data placement strategies attempt to place related data close together for faster run-time access, but they disregard scenarios in which the placement algorithm has to work with data sets that are new, either newly generated or belonging to different MapReduce jobs. This paper deals with improving map-reduce performance over multi-cluster data sets by means of a novel entropy-based data placement strategy (EDPS) that works in three phases and accounts for new data sets. In the first phase, a k-means clustering strategy is employed to extract dependencies among different datasets and group them into data groups. In the second phase, these data groups are placed in different data centers while taking the heterogeneity of virtual machines into account. Finally, the third phase uses an entropy-based grouping of the newly generated datasets, in which each dataset is grouped with the most similar existing cluster based on relative entropy. The essence of the entropy-based scheme lies in computing expected entropy, a measure of the dissimilarity of MapReduce jobs and their data usage patterns in terms of data blocks stored in HDFS, and finally placing new data among clusters such that entropy is reduced.
The experimental results show the efficacy of the proposed three-fold dynamic grouping and data placement policy, which significantly reduces execution time and improves Hadoop performance in heterogeneous clusters with varying server and user applications and parameter sizes.
Keywords: Dynamic Data Placement Strategy; Hadoop Clusters; MapReduce; k-means clustering; entropy.
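As a rough illustration of the third phase described in the abstract above, grouping a new dataset with the most similar existing group by relative entropy might look like the sketch below. The usage profiles (access counts per job), the function names and the smoothing constant are assumptions made for illustration; the abstract does not specify how the authors define or compute these quantities.

```python
import math

def normalize(counts):
    """Turn raw access counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]

def relative_entropy(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q) in bits for discrete distributions.

    eps guards against division by zero when q has an empty bin
    (an illustrative choice, not taken from the paper).
    """
    return sum(pi * math.log2(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0)

def closest_group(new_usage, group_usages):
    """Return the name of the group whose profile diverges least from new_usage."""
    p = normalize(new_usage)
    return min(group_usages,
               key=lambda g: relative_entropy(p, normalize(group_usages[g])))

# Hypothetical profiles: how often three jobs touch each data group.
groups = {"g1": [8, 1, 1], "g2": [1, 1, 8]}
best = closest_group([7, 2, 1], groups)
print(best)
```

In this toy example the new dataset is used mostly by the first job, so it lands in the group with the same dominant job.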
Hybridization of Classifiers for Anomaly Detection in Big Data
by Rasim Alguliyev, Ramiz Aliguliyev, Fargana Abdullayeva
Abstract: Recently, the widespread use of cloud technologies has led to a rapid increase in the scale and complexity of this infrastructure. Degradation and downtime in the performance metrics of these large-scale systems are considered a major problem. The key issue in addressing these problems is to detect anomalies that can occur in the hardware, software and system state of cloud infrastructure. In this paper, a semi-supervised classification method based on ensemble classifiers is proposed for detecting anomalies in the performance metrics of cloud infrastructure. The proposed method uses the Naive Bayes, J48, SMO, multilayer Perceptron, IBK and PART algorithms. To detect anomalous behavior in the performance metrics, public data from Google and Yahoo! and the Python 2.7, Matlab, Weka and Google Cloud SDK Shell applications are used. In the experimental study of the model, a detection accuracy of 0.90 is obtained.
Keywords: Anomaly in performance metrics; CPU-usage; memory usage; Naive Bayes; J48 decision tree; semi-supervised algorithms; ensemble classifiers; Google cluster trace.
Fast Approaches for Semantic Service Composition in Big Data Environment
by Jun Huang, Yide Zhou, Qiang Duan, Cong-cong Xing
Abstract: The widespread deployment of Web services and the rapid development of big data applications bring new challenges to Web service composition in the context of big data. The large number of Web services processing a huge amount of diverse data, together with the complex and dynamic relationships among the services, requires automatic composition of semantic Web services to be performed quickly, thereby demanding fast and cost-effective service composition algorithms. In this paper, we investigate Web service composition in big data environments by proposing novel composition algorithms with low time complexity. In our proposed approach, we decompose service composition into three stages: construction of parameter expansion graphs, transformation of service dependence graphs, and backtracking search for service compositions. Based on the parameter expansion strategies, we then propose two fast service composition algorithms and analyze their time complexities. We conduct comparative experiments to evaluate the performance of the algorithms and validate their effectiveness using a large semantic service dataset. Our results reveal that the proposed approaches are preferable to a well-known algorithm in terms of execution time and precision.
Keywords: Big data semantics; quality of services; service composition; virtual parameter.
An Insight into Mobile Advertising and its Impact on the Resources of Hand-held Devices: A survey
by Abdurhman Albasir, Maazen Alsabaan, Kshirasagar Naik
Abstract: With the rapid advancement of mobile devices, people have become more attached to them than ever. This growth, combined with millions of applications (apps), makes smartphones a favorite means of communication among users. The content available on smartphones, both apps and web content, comes in two versions: (i) free content that is monetized via advertisements (ads); (ii) paid content that is monetized by users' subscription fees. However, on-board resources are limited, and the presence of ads can adversely impact them. These issues have created the need for a good understanding of the mobile advertising eco-system and of how such limited resources should be used efficiently.
This survey paper gives an overview of the mobile advertising eco-system and reviews the work done on the influence of such ads on smartphones' battery life and monthly data usage. It discusses and briefly addresses the open issues and research directions that need to be investigated further. This work is meant to motivate: (i) researchers to investigate the energy and bandwidth issues further and hence come up with more practical solutions; (ii) app and Web developers to consider the serious implications of embedding "expensive" ads in their apps and web pages for the end users' limited resources.
Keywords: Mobile Advertising; handheld devices; Energy Cost; Bandwidth Cost; Energy
A Five-layer Architecture for Big Data Processing and Analytics
by Yixuan Zhu, Bo Tang, Victor Li
Abstract: Big data technologies have attracted much attention in recent years. Academia and industry have reached a consensus that the ultimate goal of big data is to transform it into real value. In this article, we discuss how to achieve this goal and propose a five-layer architecture for big data processing and analytics (BDPA), comprising a collection layer, a storage layer, a processing layer, an analytics layer, and an application layer. The five-layer architecture aims to establish a de facto standard for current BDPA solutions, to collect, manage, process, and analyse the vast volume of both static data and online data streams, and to support valuable decisions for all types of industries. The functionalities and challenges of the five layers are illustrated, with the most recent technologies and solutions discussed accordingly. We conclude with the requirements for future BDPA solutions, which may serve as a foundation for the future big data ecosystem.
Keywords: Big data processing and analytics (BDPA); online big data stream; five-layer architecture.
Extended results from the Measurement and Analysis of Safety in a Large City
by Rami Ibrahim, M. Omair Shafiq
Abstract: This paper presents an extended version of our measurement and analysis of data from the city of Los Angeles. More specifically, we analyzed a dataset of crimes that took place in Los Angeles. This dataset was prepared by the Los Angeles Police Department (LAPD) and is updated on a regular basis. It contains approximately 1.5 million records, where each record represents a crime incident in the city. We analyzed multiple features of the dataset, including crime activity (i.e. number of crimes) in terms of year, month, weekday, time of day, area, victim sex, victim age, victim descent, suspect activities and crime seriousness index. In addition, we analyzed the reporting period of a crime incident by calculating the average reporting days (i.e. the number of days the victim took to report a crime incident) in terms of multiple factors. Our analysis uncovers unique characteristics and insights regarding safety measures and crime prevention in the city. This extended version of the paper contains some new results and discussions. These include new graphs for the number of crimes based on suspect activities and crime seriousness index, a new graph for crime distribution based on crime seriousness index, and the average reporting period based on crime seriousness index. We also introduce a section that discusses the potential implications of our analytical results.
Keywords: Data Analysis; Smart City; Crime Statistics; Dataset; Extended version of the original paper.
Perceptions of independent financial advisors on the usefulness of Big Data in the context of decision making in the UK
by Ketty Grishikashvili, Clemens Bechter
Abstract: Big Data Analytics (BDA) is based on massive amounts of structured and unstructured data and should extract meaningful, accurate, and relevant data. However, what is considered meaningful and relevant is a perception of the analyst. In our paper we focus on perceptions and not on the technical implementation of BDA. For this we looked into one of the most advanced service markets, the financial services market in the UK. We chose eight independent financial advisors who have had first-hand experience with large amounts of structured and unstructured data. Results show that there is moderate enthusiasm about the new possibilities of BDA, such as better decision making and better market segmentation. This implies the necessity for easy-to-use tools and training on the side of the financial advisor as well as on the client's side. The paper fills a research gap by analyzing the actual impact of BDA on a real company and trying to measure the value of BDA for a financial organization in the form of improved customer relationships through prescriptive analytics.
Keywords: Big Data; Financial Services; Independent Financial Advisors.
Availability Modelling and Assurance for A Bigdata Computing
by Nohpill Park
Abstract: This paper proposes a new analytical model to evaluate the availability of bigdata computing, namely map-reduce computing on a Hadoop platform. Map-reduce computing is represented by a queueing model in this work in order to trace the flow of tasks (either map or reduce), from arrival to exit, in the course of computation. The objective of the model is to evaluate the probability that a map-reduce computation is available at an instant of time, referred to as availability. The variables taken into account in this model include the number of map and reduce tasks and the number of servers (referred to as nodes in this paper) engaged, along with a few constants such as task arrival/exit rates and node failure/repair rates. The proposed model provides a comprehensive yet fundamental basis to assure and ultimately optimize the design of map-reduce computing in terms of availability and performance simultaneously. Parametric simulations were conducted and demonstrate the efficacy of the proposed model in assessing the availability and the cost.
Keywords: availability; map-reduce computing; queueing model.
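To make the role of the node failure/repair rates mentioned in the abstract above more concrete, the sketch below uses the textbook steady-state availability of a single repairable node and a k-of-n combination of independent nodes. This is a deliberately simpler stand-in and not the paper's queueing model; the function names and the example rates are assumptions chosen for illustration.

```python
from math import comb

def node_availability(failure_rate, repair_rate):
    """Steady-state availability of one node: repair_rate / (failure_rate + repair_rate)."""
    return repair_rate / (failure_rate + repair_rate)

def k_of_n_availability(n, k, a):
    """Probability that at least k of n independent nodes, each with availability a, are up."""
    return sum(comb(n, i) * a**i * (1 - a)**(n - i) for i in range(k, n + 1))

# Hypothetical rates: one failure and nine repairs per unit time.
a = node_availability(failure_rate=1.0, repair_rate=9.0)
cluster = k_of_n_availability(n=3, k=1, a=a)
print(a, cluster)
```

With these example rates each node is up 90% of the time, and a three-node cluster needing only one live node is unavailable only when all three are down at once.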
Towards an automation of the fact-checking in the journalistic web context
by Edouard NGOR SARR, SALL Ousmane
Abstract: Can fact-checking be automated? Apparently yes, given the numerous advances in the search and analysis of digital data. However, this task, which a priori seemed simple, turns out to be rather constraining. Indeed, automating the checking of facts requires advanced knowledge of data analysis, Web data search, Web technologies, image processing and sometimes natural language processing (NLP). Nevertheless, in recent years much research has been conducted to deepen such analysis. In this article, after reviewing the state of the art on the question, we identify and diagnose in detail the obstacles before concluding with an explanation of the methods.
Keywords: fact-checking; data journalism; semantic web.
A Survey of Computation Techniques on Time Evolving Graphs
by Shalini Sharma, Jerry Chou
Abstract: Time Evolving Graph (TEG) refers to graphs whose topology or attribute values change over time due to update events, including edge addition/deletion, vertex addition/deletion and attribute changes on vertices or edges. Driven by the Big Data paradigm, the ability to process and analyze TEG in a timely fashion is critical in many application domains, such as social networks, web graphs, road network traffic, etc. Recently, many research efforts have been made to address the challenges of volume and velocity in dealing with such datasets. However, it remains an active and challenging research topic. Therefore, in this survey, we summarize the state-of-the-art computation techniques for TEG. We collect these techniques from three different research communities: i) the data mining community for graph analysis; ii) the theory community for graph algorithms; iii) the computation community for graph computing frameworks. Based on our study, we also propose our own computing framework, DASH, for TEG. We have also performed some experiments comparing DASH and Graph Processing System (GPS). We are optimistic that this paper will help many researchers to understand various dimensions of problems in TEG and continue developing the necessary techniques to resolve these problems more efficiently.
Keywords: Big Data; Time evolving graphs; Computing framework; Algorithm; Data Mining.
Special Issue on: DaSAA 2017 Recent advances in Data Sciences and Applications
Unified Framework for Data Management in Multi-Cloud Environment
by Kirthica S, Sabireen H, Rajeswari Sridhar
Abstract: Cloud storage is growing rapidly, as it offers on-demand and highly elastic storage provisioning. The recent emergence of data management in multi-cloud environments opens new challenges in selecting, from many services, a suitable cloud for storage based on the data access patterns. Indeed, it is crucial to avoid the vendor lock-in problem and increase availability and durability by managing unlimited data in multiple clouds and providing access to various cloud services based on customer requirements.
This work provides an approach to dynamically pick optimal clouds from a heterogeneous multi-cloud environment based on the data access patterns. The voluminous data is split into chunks using a data split algorithm, and each chunk is efficiently predicted using a predictor engine. Depending on the predicted chunks, suitable clouds are picked using the proposed cloud picker algorithm, which picks clouds based on QoS attributes and a weighted threshold coefficient. Furthermore, the data is placed into the selected clouds after checking the availability of the clouds. Additionally, the proposed data retention policies, fTUBA and TR^P, are applied to the data along with homomorphic encryption to provide security and place the highest-priority data into the appropriate cloud. The archived data is placed in a cloud from the set of selected clouds based on the cloud picker algorithm. The results indicate the cost effectiveness of storing huge data in a multi-cloud environment to provide best-of-breed services.
Keywords: Cloud Computing; Data Management; Multi-cloud; Cloud Inter-operation.
Special Issue on: DataCom 2017 Big Data Infrastructure and Deep Learning Applications
A Scalable System for Executing and Scoring K-Means Clustering Techniques and Its Impact on Applications in Agriculture
by Nevena Golubovic, Chandra Krintz, Rich Wolski, Balaji Sethuramasamyraja, Bo Liu
Abstract: We present Centaurus, a scalable, open source clustering service for k-means clustering of correlated, multidimensional data. Centaurus provides users with automatic deployment via public or private cloud resources, model selection (using Bayesian Information Criterion), and data visualization.
We apply Centaurus to a real-world, agricultural analytics application and compare its results to the industry standard clustering approach. The application uses soil electrical conductivity measurements, GPS coordinates, and elevation data from a field to produce a map of differing soil zones (so that management can be specialized for each). We use Centaurus and these datasets to empirically evaluate the impact of considering multiple k-means variants and large numbers of experiments. We show that Centaurus yields more consistent and useful clusterings than the competitive approach for use in zone-based soil decision-support applications where a hard decision is required.
Keywords: K-means Clustering; Cloud Computing.
Scalable Mining, Analysis, and Visualization of Protein-Protein Interaction Networks
by Shaikh Arifuzzaman, Bikesh Pandey
Abstract: Proteins are linear chain biomolecules that are the basis of functional networks in all organisms. Protein-protein interaction (PPI) networks are the networks of protein complexes formed by biochemical events and electrostatic forces. PPI networks can be used to study diseases and discover drugs. The causes of diseases are evident on a protein interaction level. For instance, an elevation of interaction edge weights of oncogenes is manifested in cancers. Further, the majority of approved drugs target a particular PPI, and thus studying PPI networks is vital to drug discovery.
The availability of large datasets and the need for efficient analysis necessitate the design of scalable methods leveraging modern high-performance computing (HPC) platforms. In this paper, we design a lightweight framework on a distributed-memory parallel system, which includes scalable algorithmic and analytic techniques to study PPI networks and visualize them. Our study of PPIs is based on network-centric mining and analysis approaches. Since PPI networks are signed (labeled) and weighted, many existing network mining methods that work on simple unweighted networks are unsuitable for studying PPIs. Further, the large volume and variety of such data limit the use of sequential tools or methods. Many existing tools also do not support a convenient workflow, from automated data preprocessing to visualizing results and reports, for efficient extraction of intelligence from large-scale PPI networks. Our framework supports automated analytics based on a large range of extensible methods for extracting signed motifs, computing centrality, and finding functional units. We design MPI (Message Passing Interface) based parallel methods and workflow, which scale to large networks. The framework is also extensible and sufficiently generic. To the best of our knowledge, all these capabilities collectively make our tool novel.
Keywords: protein-protein interaction; biological networks; network visualization; massive networks; HPC systems; network mining.
Optimizing NBA Player Signing Strategies Based on Practical Constraints and Statistics Analytics
by Lin Li
Abstract: In the National Basketball Association (NBA), how to comprehensively measure a player's performance and how to sign talented players with reasonable contracts are always challenging questions. Due to various practical constraints such as the salary cap and the players' on-court minutes, no team can sign all desired players. To ensure the team's competence on both offense and defense, a player's efficiency must be comprehensively evaluated. This research studied the key indicators widely used to measure player efficiency and team performance. Through data analytics, the most frequently referenced statistics, including Player Efficiency Rating, Defense Rating, Real Plus Minus, Points, Rebounds, Assists, Blocks, Steals, etc., were chosen to formulate the prediction of the team winning rate in different schemes. Based on the models trained and tested, two player selection strategies were proposed according to different objectives and constraints. Experimental results show that the developed team winning rate prediction models have high accuracy and the player selection strategies are effective.
Keywords: Optimization; Prediction; Regression; Linear Programming; Sports Data Analytics.
Network Traffic Driven Storage Repair
by Danilo Gligoroski, Katina Kralevska, Rune Jensen, Per Simonsen
Abstract: Recently we constructed an explicit family of locally repairable and locally regenerating codes. Their existence was proven by Kamath et al. but no explicit construction was given. Our design is based on HashTag codes that can have different sub-packetization levels. In this work we emphasize the importance of having two ways to repair a node: repair only with local parity nodes or repair with both local and global parity nodes. We say that the repair strategy is network traffic driven since it is in connection with the concrete system and code parameters: the repair bandwidth of the code, the number of I/O operations, the access time for the contacted parts and the size of the stored file. We show the benefits of having repair duality in one practical example implemented in Hadoop. We also give algorithms for efficient repair of the global parity nodes.
Keywords: Vector codes; Repair bandwidth; Repair locality; Exact repair; Parity-splitting; Global parities; Hadoop.
MapReduce based fuzzy very fast decision tree for constructing prediction intervals
by Ojha Manish Kumar, Kumar Ravi, Vadlamani Ravi
Abstract: Prediction Interval is a methodology to measure the uncertainties in point forecasts and predictions. In this paper, we propose a fuzzy version of the Very Fast Decision Tree (VFDT) to predict the bounds of the Lower Upper Bound Estimation (LUBE) method, and compare it with VFDT. VFDT is a one-pass incremental decision tree learner, which scans each instance only once. The proposed fuzzy VFDT is able to capture the intrinsic features of VFDT as well as the uncertainties present in the data. It outputs fuzzy if-then rules. The VFDT and fuzzy VFDT were trained using the LUBE method. Due to the increasing demands of Cloud Computing and Big Data challenges, the traditional decision tree is not an obvious option, especially when the volume of data is large. Hence, we implemented VFDT, and developed and implemented fuzzy VFDT, using the Apache Hadoop MapReduce framework, where multiple slave nodes build VFDT and fuzzy VFDT models. The developed models were tested on 6 datasets taken from the web. We conducted sensitivity analysis by varying the window size of the data stream and the number of bins in discretization and observing their impact on the final results in all datasets. Experiments with real-world case studies demonstrated that the proposed MapReduce-based fuzzy VFDT and VFDT can construct high-quality prediction intervals precisely and quickly.
Keywords: VFDT; Fuzzy VFDT; MapReduce; Prediction Interval; Big Data.
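For readers unfamiliar with how the quality of a prediction interval is judged, the sketch below computes two measures commonly reported alongside LUBE-style methods: coverage (the fraction of targets falling inside their interval) and normalized mean interval width. The abstract does not state which measures the authors use, so the function names and the sample numbers here are illustrative assumptions.

```python
def picp(targets, lowers, uppers):
    """Prediction interval coverage probability: share of targets inside [lower, upper]."""
    covered = sum(1 for y, lo, up in zip(targets, lowers, uppers) if lo <= y <= up)
    return covered / len(targets)

def nmpiw(targets, lowers, uppers):
    """Mean interval width divided by the range of the targets."""
    target_range = max(targets) - min(targets)
    mean_width = sum(up - lo for lo, up in zip(lowers, uppers)) / len(targets)
    return mean_width / target_range

# Hypothetical targets and interval bounds.
y = [1.0, 2.0, 3.0, 5.0]
lower = [0.5, 2.5, 2.0, 4.0]
upper = [1.5, 3.5, 4.0, 6.0]
print(picp(y, lower, upper), nmpiw(y, lower, upper))  # 0.75 0.375
```

A good interval estimator pushes coverage up while keeping width down, which is why the two numbers are usually read together.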
Real-Time Event Search using Social Stream for Inbound Tourist Corresponding to Place and Time
by Ruriko Kudo, Miki Enoki, Akihiro Nakao, Shu Yamamoto, Saneyasu Yamaguchi, Masato Oguchi
Abstract: Since the decision was made to hold the Olympic Games in 2020 in Tokyo, the number of foreign tourists visiting the city has been increasing rapidly. Accordingly, tourists have been seeking more sightseeing information. While guidebooks are good for pointing out popular tourist attractions, it is more difficult for tourists to get information on local events and spots that are just becoming popular. We developed a tourist information distribution system that sends information corresponding to a place and time. The system extracts event information from social media streams on a per-place and per-time basis and provides it to tourists. In order to extract useful information, we performed event classification using actual Twitter data. We examined how to order the distributed events and make the system more user-friendly. Furthermore, we developed an information supplement function using external information.
Keywords: Twitter; Social media stream; Local event; Information extract.
Two-Channel Convolutional Neural Network for Facial Expression Recognition using Facial Parts
by Hui Wang, Jiang Lu, Lucy Nwosu, Ishaq Unwala
Abstract: This paper proposes the design of a Facial Expression Recognition system based on a deep convolutional neural network using facial parts. In this work, a solution for facial expression recognition that uses a combination of algorithms for face detection, feature extraction and classification is discussed. The proposed method builds a two-channel convolutional neural network model in which facial parts are used as inputs: the extracted eyes are the input to the first channel, while the mouth is the input to the second channel. Feature information from both channels converges in a fully connected layer, which is used to learn global information from these local features and is then used for expression classification. Experiments are carried out on the Japanese Female Facial Expression dataset and the Extended Cohn-Kanade dataset to determine the expression recognition accuracy of the proposed system. The results show that, compared to other methods, the system provides state-of-the-art classification accuracy of 97.6% and 95.7%, respectively.
Keywords: Facial expression recognition; Convolutional Neural Networks; Facial Parts.
Efficient Clustering Techniques on Hadoop and Spark
by Sami Al Ghamdi, Giuseppe Di Fatta
Abstract: Clustering is an essential data mining technique that divides observations into groups where each group contains similar observations. K-Means is one of the most popular clustering algorithms and has been used for over fifty years. Due to the current exponential growth of data, it has become necessary to improve the efficiency and scalability of K-Means even further to cope with large-scale datasets known as Big Data. This paper presents K-Means optimisations using the triangle inequality on two well-known distributed computing platforms: Hadoop and Spark. K-Means variants that use the triangle inequality usually require caching extra information from the previous iteration, which is challenging to achieve on Hadoop. Hence, this work introduces two methods to pass information from one iteration to the next on Hadoop to accelerate K-Means. The experimental work shows that the efficiency of K-Means on Hadoop and Spark can be significantly improved by using triangle inequality optimisations.
Keywords: K-Means; Hadoop; Spark; MapReduce; Efficient Clustering; Triangle Inequality K-Means.
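The basic idea behind triangle inequality pruning in K-Means can be sketched on a single machine as follows: if the distance between the current best centroid and another centroid is at least twice the distance from the point to the current best, the other centroid cannot be closer, so its distance need not be computed. The sketch below shows only this one test; it does not carry bounds across iterations and is not the Hadoop or Spark implementation from the paper. All names are illustrative assumptions.

```python
import math

def assign(point, centroids, cc):
    """Return (index of nearest centroid, number of point-centroid distances computed).

    cc[i][j] holds the precomputed distance between centroids i and j.
    """
    best = 0
    best_d = math.dist(point, centroids[0])
    computed = 1
    for j in range(1, len(centroids)):
        # Triangle inequality: d(c_best, c_j) >= 2 * d(x, c_best)
        # implies d(x, c_j) >= d(x, c_best), so centroid j can be skipped.
        if cc[best][j] >= 2 * best_d:
            continue
        d = math.dist(point, centroids[j])
        computed += 1
        if d < best_d:
            best, best_d = j, d
    return best, computed

centroids = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
cc = [[math.dist(a, b) for b in centroids] for a in centroids]
idx, n_dist = assign((1.0, 1.0), centroids, cc)
print(idx, n_dist)  # 0 1
```

For the point (1, 1), both remaining centroids are pruned after a single distance computation. The centroid-to-centroid table costs k² distances per iteration but is shared by all points.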
A Computational Bayesian Approach for Estimating Density Functions Based on Noise-Multiplied Data
by Yan-Xia Lin
Abstract: In this big data era, an enormous amount of personal and company information can be easily collected by third parties. Sharing the data with the public and allowing data users to access the data for data mining often bring many benefits to the public, policymakers, national economy, and society. However, sharing the micro data with the public usually causes the issue of data privacy. Protecting data privacy and mining statistical information of the original data from protected data are the essential issues in big data.
Protecting data privacy through noise-multiplied data is one of the approaches studied in the literature. This paper introduces the B-M L2014 Approach for estimating the density function of the original data based on noise-multiplied microdata.
This paper shows applications of the B-M L2014 Approach and demonstrates that the statistical information of the original data can be retrieved from their noise-multiplied data reasonably while the disclosure risk is under control if the multiplicative noise used to mask the original data is appropriate. The B-M L2014 Approach provides a new data mining technique for big data when data privacy is concerned.
Keywords: Data mining; Data anonymization; Privacy-preserving.
New algorithms for inferring gene regulatory networks from time-series expression data on Apache Spark
by Yasser Abduallah, Jason T.L. Wang
Abstract: Gene regulatory networks (GRNs) are crucial to understand the inner workings of the cell and the complexity of gene interactions. Numerous algorithms have been developed to infer GRNs from gene expression data. As the number of identified genes increases and the complexity of their interactions is uncovered, gene networks become cumbersome to test. Furthermore, prodding through experimental results requires an enormous amount of computation, resulting in slow data processing. Therefore, new approaches are needed to analyse copious amounts of experimental data from cellular GRNs. To meet this need, cloud computing is promising as reported in the literature. Here we present two new algorithms for reverse engineering GRNs in a cloud environment. The algorithms, implemented in Spark, employ an information-theoretic approach to infer GRNs from time-series gene expression data. Experimental results show that one of our new algorithms is faster than, yet as accurate as, two existing cloud-based GRN inference methods.
Keywords: network inference; systems biology; spark; big data; MapReduce; gene regulatory networks; GRN; time-series; gene expression; big data intelligence.
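The abstract above says only that the algorithms are information-theoretic. One common building block in such GRN inference methods is the mutual information between a candidate regulator's expression at time t and a target gene's expression at time t + lag, computed on discretized values. The sketch below illustrates that building block under this assumption; the function names and the binary toy series are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information in bits between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def lagged_mi(regulator, target, lag=1):
    """Mutual information between regulator[t] and target[t + lag]."""
    return mutual_information(regulator[:-lag], target[lag:])

# Hypothetical discretized expression levels (0 = low, 1 = high);
# gene_b repeats gene_a's state one time step later.
gene_a = [0, 1, 0, 1, 0, 1, 0, 1]
gene_b = [1, 0, 1, 0, 1, 0, 1, 0]
score = lagged_mi(gene_a, gene_b, lag=1)
print(score)
```

A high lagged score suggests, but does not prove, a regulatory edge from the first gene to the second; practical methods add thresholds or significance tests on top of such scores.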
Text Visualization for Feature Selection in Online Review Analysis
by Keerthika Koka, Shiaofen Fang
Abstract: Opinion spamming is a reality, and it can have unpleasant consequences in the retail industry. While several promising studies have addressed distinguishing fake online reviews from genuine ones, few have involved visualization and visual analytics. The purpose of this work is to show that feature selection through visualization is at least as powerful as the best automatic feature selection algorithms. This is achieved by applying a radial chart visualization technique to the classification of online reviews as fake or genuine. The radial chart and its color overlaps are used to explore the best feature selection through visualization for classification. Parallel coordinate visualization of the review data is also explored and compared with the radial chart results. The system gives a structure to each text review based on certain attributes, compares how similar or different these structures are within and across categories, and highlights the key features that contribute most to the classification. Our visualization technique helps the user gain insights into the high-dimensional data by providing means to eliminate the worst features right away, pick some of the best features without statistical aids, and understand the behavior of the dimensions in different combinations.
Keywords: multi-dimensional visualization; feature detection; text mining; online review analysis.
DeepSim: Cluster Level Behavioral Simulation Model for Deep Learning
by Yuankun Shi, Kevin Long, Kaushik Balasubramanian, Zhaojuan Bian, Adam Procter, Ramesh Illikkal
Abstract: We are witnessing an explosion of AI use cases driving the computer industry, and especially datacenter and server architectures. As Intel faces fierce competition in this emerging technology space, it is critical that architecture definitions and directions are driven with data from proper tools and methodologies, and that insights are drawn from end-to-end holistic analysis at the datacenter level. In this paper, we introduce DeepSim, a cluster-level behavioral simulation model for deep learning. DeepSim, which is based on the Intel CoFluent simulation framework, uses timed behavioral models to simulate complex interworking between compute nodes, networking, and storage at the datacenter level, providing a realistic performance model of a real-world image recognition application based on the popular deep learning framework Caffe. The end-to-end simulation data from DeepSim provides insights that can be used for architecture analysis driving future datacenter architecture directions. DeepSim enables scalable system design, deployment, and capacity planning through accurate performance insights. Results from preliminary scaling studies (e.g., node scaling and network scaling) and what-if analyses (e.g., Xeon with HBM and Xeon Phi with dual OPA) are presented in this paper. The simulation results correlate well with empirical measurements, achieving an accuracy of 95%.
Keywords: Deep Learning; Datacenter; Behavioral Simulation; AlexNet; Architecture Analysis; Performance Analysis; Server Architecture.
Predicting Hospital Length of Stay Using Neural Networks
by Robert Steele
Abstract: Accurate prediction of hospital length of stay can provide benefits for hospital resource planning and quality of care. We describe the utilization of neural networks for predicting the length of hospital stay for patients with various diagnoses based on selected administrative and clinical attributes. An all-condition neural network, which can be applied to all patients and is not limited to a specific diagnosis, is trained to predict whether a patient's stay will be long or short, using the median length of stay as the cut-off between long and short. The prediction is made at the time the patient leaves the intensive care unit. In addition, neural networks are trained to predict whether patients with fourteen specific common primary diagnoses will have a long or short stay, defined as greater than, or less than or equal to, the median length of stay for that particular condition. Our dataset is drawn from the MIMIC III database. Our prediction accuracy is approximately 80% for the all-condition neural network. The neural networks for specific conditions generally demonstrated higher accuracy, and all clearly outperformed any linear model.
Keywords: length of stay; health analytics; neural networks; MIMIC III.
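The median cut-off described in the abstract turns length of stay into a binary target. A minimal sketch of that labeling step follows, assuming lengths of stay are given in days. The function name and the sample values are hypothetical, and the sketch does not cover the neural networks themselves.

```python
import statistics

def label_long_stay(lengths_of_stay):
    """Return the median cut-off and one boolean label per stay.

    A stay is labeled long (True) when it is strictly greater than the
    median, and short (False) when it is less than or equal to the median.
    """
    cutoff = statistics.median(lengths_of_stay)
    labels = [los > cutoff for los in lengths_of_stay]
    return cutoff, labels

# Hypothetical lengths of stay in days for one diagnosis group.
stays = [2, 3, 5, 8, 13]
cutoff, labels = label_long_stay(stays)
```

For the condition-specific networks, the same function would be applied separately to the stays of each diagnosis group, so that each group gets its own cut-off.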
A Hybrid Power Management Schema for Multi-Tier Data Centers
by Aryan Azimzadeh, Babak Maleki Shoja, Nasseh Tabrizi
Abstract: Data centers play an important role in the operation and management of IT infrastructures, but their huge power consumption is an issue of great concern as it relates to global warming. This paper explores the sleep state of data center servers under specific conditions such as setup time and identifies an optimal number of servers that can potentially increase energy efficiency. We use a Dynamic Power Management policy-based model with the optimal number of servers required in each tier while increasing server setup time after sleep mode. The Reactive approach is used to validate the results and energy efficiency by calculating the average power consumption of each server under a specific sleep mode and setup time. Our method uses average power consumption to calculate the Normalized-Performance-Per-Watt in order to evaluate power efficiency. The results indicate that the schema reported in this paper can improve power efficiency in data centers with high setup time servers.
Keywords: Power; Green; Management; Schema; Multi-Tier.