International Journal of Big Data Intelligence (7 papers in press)
Modeling the dynamics of acoustic gaps between speakers during Business-to-Business sales calls
by Anat Lerner, Vered Silber-Varod, Nehoray Carmi, Yonathan Guttel, Omri Allouche
Abstract: The value of Conversation Intelligence as a means of gaining deeper insight into authentic conversations is now common ground between researchers and the business community. The rapid development of big data processing algorithms and technology enables companies to process massive amounts of data and metadata about the conversation flow, combining content, vocal features and even body gestures.
This study is based on the analysis of 358 Business-to-Business (B2B) sales calls at the Discovery stage. We propose a model to capture the dynamics of acoustic gaps between the sales representatives and customers by relying solely on the acoustic signal. To model the conversations, we extract a basic set of features from the acoustic signal: speech proportion, fundamental frequency (F0), intensity, harmonics-to-noise ratio (HNR), jitter and shimmer. We focus on the differences between four groups of conversations defined by the speakers' gender pairing (Female-Female, Male-Male, Male-Female, Female-Male). We found significant differences in the behavioral patterns of the dynamics between these four groups. The study demonstrates that using delta metrics to assess the interactions leads to new insights.
Keywords: Conversation Intelligence; conversation modelling; acoustic features; speech data; sales calls.
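One way to picture a delta metric of the kind the abstract mentions is as the per-window difference between the two speakers' values for a single acoustic feature. The sketch below is illustrative only: the function name `delta_series`, the windowing, and the F0 values are hypothetical and are not taken from the paper.

```python
def delta_series(rep_values, customer_values):
    """Per-window gap between speakers: representative minus customer."""
    if len(rep_values) != len(customer_values):
        raise ValueError("both speakers need one value per window")
    return [r - c for r, c in zip(rep_values, customer_values)]


# Hypothetical mean F0 (Hz) per conversation window for each speaker.
rep_f0 = [180.0, 175.0, 170.0, 168.0]
customer_f0 = [120.0, 130.0, 140.0, 150.0]

gaps = delta_series(rep_f0, customer_f0)
# A shrinking gap over the call suggests the speakers are converging.
is_converging = abs(gaps[-1]) < abs(gaps[0])
print(gaps, is_converging)
```

The same function could be applied to any of the listed features (intensity, HNR, jitter, shimmer), assuming they are first aggregated per window.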
An efficient algorithm for reducing the flow of real time data stream with least sampling error
by Devesh Kumar Lal, Ugrasen Suman
Abstract: The nature of a data stream can be determined only after the whole data set has been scanned during real-time data processing. However, processing an entire data stream at once is impractical in real-time stream processing. Thus, a large fixed-size window of the data stream is processed at any given time. The load that these windows place on the processing node is mitigated by reducing the flow rate of the data stream. The Heuristic Clustering Windowing (HCW) approach and the Partial Blind Window (PBW) approach are proposed for reducing the flow of the data stream with the least sampling error. These approaches combine systematic sampling with a clustering mechanism. A clustering approach is applied to the n/4x portion of the data stream, whereas systematic sampling handles the 3n/4x portion. These approaches help reduce the flow of data streams with minimum latency.
Keywords: Clustering approach; data stream; data processing; data sampling; real time big data; systematic sampling.
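The systematic-sampling half of this idea can be sketched in a few lines. This is a generic illustration under stated assumptions, not the HCW or PBW algorithm from the paper: the names `systematic_sample` and `sampling_error`, the window contents, and the choice of the mean as the statistic being preserved are all hypothetical.

```python
from statistics import mean


def systematic_sample(window, step, offset=0):
    """Keep every `step`-th record of a window, starting at `offset`."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return window[offset::step]


def sampling_error(window, sample):
    """Absolute difference between the window mean and the sample mean."""
    return abs(mean(window) - mean(sample))


# Hypothetical fixed-size window of 100 numeric readings.
window = list(range(100))
sample = systematic_sample(window, step=4)

# The flow is reduced to a quarter of the records; the error shows
# how far the reduced stream drifts from the original window.
print(len(sample), sampling_error(window, sample))
```

Changing `offset` shifts which records are kept, which is one simple way to see how sensitive the sampling error is to the starting position.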
Scalable Big Data Modeling
by Jayesh Patel
Abstract: In the Information age, data integration has become easier than ever. Enterprises integrate a wide range of data sources to enrich big data lakes. Enterprise big data lakes make data consumption simpler and faster for all stakeholders. Often, however, stakeholders struggle to narrow down the data they need for analysis and effective decision-making. As more data arrives from ever-growing data sources, users are flooded with a variety of data. Data models ease the delivery of insights to enterprise users by applying data cleansing, aggregation, and business rules. As data models in big data grow, queries and analysis require processing large volumes of data and big joins, which leads to long response and processing times. Data modeling in big data platforms needs attention to effectively cleanse, organize, and store big data to ensure timely availability of enterprise insights. Because scale is a critical aspect of a big data platform, big data should be modeled so that accessibility and delivery of insights are not affected when the scale goes up. This paper presents best practices to model structured and semi-structured data in the big data platform.
Keywords: Enterprise Big Data Models; Scalable Modeling; Big Data Lake; Dimensional Models; Big Joins; Hadoop; Spark.
Improved big data stock index prediction using deep learning with CNN and GRU
by Abhishek Verma
Abstract: Stock index prediction has been a challenging problem due to the difficult-to-model complexities of the stock market. More recently, deep learning approaches have become an important method for modelling complex relationships in time-series data. In this paper, we propose novel deep learning models that combine multiple pipelines of convolutional neural networks and uni-directional or bi-directional gated recurrent units. The proposed models improve on previously published models in both prediction performance and execution time on a large-scale S&P 500 dataset. We present several variations of multiple and single pipeline deep learning models based on different CNN kernel sizes and numbers of GRU units.
Keywords: Stock prediction; S&P500; CNN; GRU; Deep learning.
Cache-Collision Side-Channel Analysis and Attacks Against AES-GCM
by Xiaoming Li, James Huang
Abstract: Data security is an important issue in Big Data applications. Even though the same problem has been extensively studied within other contexts, Big Data applications present several unique challenges. Among them is the volume of data that typical Big Data applications process. Because so much data passes through, the sheer volume provides far more opportunities for a potential attacker to observe and identify patterns in computation and data. In this paper, we reveal that the data/computation patterns derived from the observation of a large volume of data can be associated with certain sensitive information, specifically the key used in the AES-GCM algorithm, one of the foundation algorithms in data security. The paper presents a software-based (i.e. no assumptions on the target hardware implementation platform) cache-collision timing attack against the well-known authenticated encryption scheme AES-GCM. The attack can be successful if enough data (plaintext-ciphertext pairs) are processed and the hash key H is used for generating look-up tables in the software implementation. Such a data-pattern-driven attack can potentially compromise the security of Big Data applications. We present an attack model and an implementation of the attack based on OpenSSL, a widely used library that provides security-related functions for many applications. In most cases, our attack methodology is able to converge and extract the hidden key. We also discuss potential countermeasures against similar attacks.
Keywords: Data Pattern; Cache Collision; AES-GCM.
Special Issue on: SNTA2019 Systems and Network Telemetry and Analytics
Network Traffic Performance Analysis from Passive Measurements using Gradient Boosting Machine Learning
by Astha Syal, Alina Lazar, Jinoh Kim, Alex Sim, Kesheng Wu
Abstract: Effective monitoring and analysis of network traffic are vital for scientific computing, since scientific applications often require moving massive data from one site to another. A body of statistical and machine learning techniques has been introduced for network traffic monitoring and analysis, but it remains a highly challenging task for several reasons, such as the unavailability of label information, the complexity of real-time analysis, the generalization property of machine learning models, and so forth. In this paper, we present a novel method that identifies the continuous time windows of low throughput for the purpose of network performance analysis and anomaly detection, in order to facilitate data transfers for high-performance scientific computing. The presented method is based on supervised learning techniques with an adaptive labeling function that automatically determines whether a time window is slow or normal. The presented method is validated on real large datasets collected from several data transfer nodes (DTNs) located in Science DMZ. Our experimental results show that the proposed method is able to quickly predict time windows of low-performing network transfers that require attention from network engineers.
Keywords: Network traffic; TCP performance; UMAP; classification; Tstat; supervised machine learning; accuracy; cross-validation.
Predicting WAN Traffic Volumes using Fourier and Multivariate SARIMA Approaches
by Bashir Mohammed, Mariam Kiran, Nandini Krishnaswamy
Abstract: Network traffic has been a vital research issue that has attracted huge attention both in the network operations research domain and the industry. It has become crucial to develop techniques to better understand and predict the behavior and performance of networks. Understanding how network links are used and how data moves can help network operations improve link utilization and capacity. In this paper, we tackle the need to understand traffic patterns across a large network topology by developing statistical models that use a multivariate data model, incorporating seasonality, peak frequencies, and link relationships to improve future predictions. Using Fourier Transforms to extract seasons and peak frequencies from individual traces, we perform seasonality tests and ARIMA measures to determine optimal parameters to use in our prediction model. Our study shows that network traffic is non-stationary and possesses seasonality, making the SARIMA prediction approach the most suitable for network traffic prediction. We also develop a multivariate model on real network traces that are collected from multiple time-zones of a large R&E WAN. Our results indicate improved prediction accuracy, with better RMSE and smaller confidence intervals, using a multivariate approach rather than univariate approaches. Our work provides key insights into studying network traffic, creating a deeper understanding of prediction methods, necessary for future research in network capacity management and planning.
Keywords: Traffic forecasting; Multi-variate Time-series analysis; Fourier Transforms; SARIMA.
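Extracting a peak frequency from a traffic trace with a Fourier transform can be illustrated with a plain discrete Fourier transform. This is a minimal sketch, not the authors' pipeline: the function name `dominant_period`, the synthetic hourly trace, and the idea of passing the result on as a seasonal period are assumptions made for illustration.

```python
import cmath
import math


def dominant_period(series):
    """Return the period (in samples) of the strongest non-zero DFT frequency."""
    n = len(series)
    avg = sum(series) / n
    centered = [x - avg for x in series]  # remove the zero-frequency component
    best_k, best_power = 0, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    if best_k == 0:
        raise ValueError("series is constant; no dominant frequency")
    return n / best_k


# Synthetic trace: four days of hourly samples with a daily cycle.
trace = [10 + 3 * math.sin(2 * math.pi * t / 24) for t in range(96)]
period = dominant_period(trace)
print(period)
```

A period recovered this way is a natural candidate for the seasonal length of a seasonal ARIMA model, although real traces are noisier and usually need the seasonality tests the abstract describes.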