International Journal of Data Mining, Modelling and Management (22 papers in press)
A comprehensive review of deep learning for natural language processing
by Amal Bouraoui, Salma Jamoussi, Abdelmajid Ben Hamadou
Abstract: Deep learning has attracted considerable attention across many Natural Language Processing (NLP) domains. Deep learning models aim to learn embeddings of data with multiple levels of abstraction through multiple layers for either labeled structured input data or unlabeled unstructured input data. Currently, two research trends have emerged in building higher level embeddings. On one hand, a strong trend in deep learning leads towards increasingly powerful and complex models. On the other hand, multi-purpose sentence representation based on simple sums or averages of word vectors was recently shown to be effective. Furthermore, improving the performance of deep learning methods with attention mechanisms has become a research hotspot in the last four years. In this paper, we seek to provide a comprehensive review of recent studies in building Neural Network (NN) embeddings that have been applied to NLP tasks. We provide a walk-through of deep learning evolution and a description of a variety of its architectures. We present and compare the performance of several deep learning models on standard datasets for different NLP tasks. We also present some deep learning challenges for natural language processing.
Keywords: Deep Learning; Word Embedding; Sentence Embedding; Attention Mechanism; Compositional Models; Convolutional NNs; Recurrent/Recursive NNs; Multi-purpose Sentence Embedding; Natural Language Processing.
Time-Series Gradient Boosting Tree for Stock Price Prediction
by Kei Nakagawa, Kenichi Yoshida
Abstract: We propose a time-series gradient boosting tree for a data set with time-series and cross-sectional attributes.
Our time-series gradient boosting tree has weak learners with time-series and cross-sectional attributes in their internal nodes, and splits examples based on similarity between a pair of time-series or impurity between cross-sectional attributes.
Dissimilarity between a pair of time-series is defined by the dynamic time warping method.
In other words, the decision tree is constructed based on whether the shape of a time-series is similar or not similar to its past shape.
We conducted an empirical analysis using major world indices and confirmed that our time-series gradient boosting tree is superior to prior research methods in terms of both profitability and accuracy.
Keywords: Dynamic Time Warping Method; Time-series Decision Tree; Time-series Gradient Boosting Tree; Stock Price Prediction.
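The dissimilarity measure named in the abstract above, dynamic time warping (DTW), can be sketched as follows. This is a minimal textbook dynamic-programming version with an absolute-difference local cost; the function name and the toy series are assumptions for illustration, and the paper's own variant (windowing, normalisation, local cost) is not specified here.

```python
import math

def dtw_distance(a, b):
    """Textbook DTW between two numeric sequences (absolute-difference cost)."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of: step in a only, step in b only, step in both
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

# Hypothetical toy series: the second repeats one value, which warping absorbs.
same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
shifted = dtw_distance([0, 0], [1, 1])
```

Because the alignment may stretch one series against the other, two series with the same shape but different lengths can have zero distance, which is what makes the measure suitable for comparing a price pattern with its past shapes.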
SUGGESTION AND SOLUTION OF A MATHEMATICAL MODEL FOR DETERMINING EFFECTIVE
by Kenan Mengüç, Tarik Küçükdeniz
Abstract: As obtaining data gets easier and cheaper with the help of technological achievements, data-based analytics and management have become an essential part of planning and decision making to achieve success in the sports industry. The study finds offensive routes for a team game using high-security data produced with technology. An analysis of a sports team match was performed using seasonal data. A mathematical model has been developed for this analysis, discussing the effectiveness of the routes the model offers. This article aims to find the safe, efficient route for organizing the football on the field. In addition, the study also offers an experimental proposal for this purpose.
Keywords: Match strategy; tactics; optimization; transshipment problem.
Emotions recognition in synchronic textual CSCL situations
by Germán Lescano, Rosanna Costaguta, Analia Amandi
Abstract: Computer-Supported Collaborative Learning (CSCL) is a useful practice for teaching learners to work in groups and to acquire collaborative skills. Evaluating the collaborative process can be burdensome for teachers because it implies analyzing a large number of interactions. One issue to consider is socio-affective interactions, which are important to recognize because of their influence on the learning process. In this work, we propose an approach to recognize affective states of Spanish-speaking students in synchronic textual CSCL situations. Through experimentation, we analyze the emotions manifested by university computer science students when they worked in groups in synchronic textual CSCL situations, and we evaluate the proposed approach using tools and libraries available in the market for sentiment analysis. Using the proposed approach, we developed classifiers to recognize subjectivity, sentiments and emotions. The sentiment classification model developed was compared with pre-built models regarding the rate of correct classifications. Results show that resources available in the market help the process of developing classifiers of sentiments and emotions for CSCL environments using traditional machine learning techniques. Providing CSCL environments with a tool to recognize socio-affective interactions can help teachers evaluate this dimension of the collaborative process.
Keywords: Computer-Supported Collaborative Learning; Socio-Affective Interactions; Affective Computing.
Developing a Machine Learning Framework to Determine the Spread of COVID-19 in the United States using Meteorological, Social, and Demographic Factors
by Akash Gupta, Amir Gharehgozli
Abstract: Coronavirus disease of 2019 (COVID-19) became a pandemic within a matter of a few months after the outbreak in December 2019 in Wuhan, China. We study the impact of weather factors including temperature and pollution on the spread of COVID-19. We also include social and demographic variables such as per capita Gross Domestic Product (GDP) and population density. Adapting the theory from the field of epidemiology, we develop a framework to build analytical models to predict the spread of COVID-19. In the proposed framework, we employ machine learning methods including linear regression, linear kernel support vector machine (SVM), radial kernel SVM, polynomial kernel SVM, and decision tree. Given the non-linear nature of the problem, the radial kernel SVM performs the best and explains 95% more variation than the existing methods. In line with the literature, our study indicates that population density is the critical factor in determining the spread. The univariate analysis shows that higher temperature, air pollution, and population density can increase the spread. On the other hand, a higher per capita GDP can decrease the spread.
Keywords: COVID-19; disease spread; social and demographic factors; machine learning; epidemiology; predictive modeling.
Data Analytics for Gross Domestic Product using Random Forest and Extreme Gradient Boosting Approaches: An Empirical Study
by Elsayed Habib Elamir
Abstract: Gross domestic product per capita may be considered one of the most important measures of social well-being, and all nations attempt to raise it to contribute to their population's happiness and prosperity and to strengthen their standing in international relations. This study aims to use the random forest and extreme gradient boosting approaches to forecast and analyze gross domestic product per capita using World Bank development indicator data at the country level over the period 2010 to 2017. Comprehensive comparisons are carried out using the years before 2017 as training data and the year 2017 as testing data. The root mean square error and the coefficient of determination are used to judge among the different models. The random forest and extreme gradient boosting achieve accuracies of 97.8% and 98.1%, respectively, as measured by the coefficient of determination. The results suggest that investment in education, labor, health, and industry, as well as decreases in inflation, interest, and unemployment, are necessary to enhance gross domestic product per capita. Encouraging results are given by a two-way interaction measure that is useful in explaining co-dependencies in the model behavior. The strongest interactions are between trade and technology and between technology and education, followed by consumption and health, for the extreme gradient boosting method.
Keywords: bagging; boosting; business analytics; forecast; GDP; machine learning.
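The two evaluation metrics named in the abstract above can be sketched as follows. This is a minimal illustration using the standard textbook definitions of root mean square error and the coefficient of determination; the function names and toy numbers are assumptions for the example and are not taken from the paper.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical toy values: predicting the mean everywhere gives R^2 = 0.
observed = [1.0, 2.0, 3.0]
mean_only = [2.0, 2.0, 2.0]
baseline_r2 = r_squared(observed, mean_only)
baseline_rmse = rmse(observed, mean_only)
```

A model that always predicts the mean of the observed values scores zero on the coefficient of determination, so a reported value near 98% means the model explains almost all of the variance left unexplained by that baseline.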
Methodology for Comparing Text Corpora via Topic Model
by Fedor Krasnov, Mikhail Shvartsman, Alexander Dimentov
Abstract: The authors of this paper developed a methodological approach for comparative analysis of patents' content. The approach, named T4C, is based on topic modeling and machine learning methodology. The authors were able to identify the ownership of a patent in a particular country with an accuracy of 97.5% using supervised machine learning methods. When studying the dependence of patents on time, the authors were able to identify the patent belonging to a specific period with an accuracy of 85% for a specific country. The authors have developed a visual presentation of a thematic correlation between groups of patents. It should also be noted that in terms of the patent description text composition, Chinese patents are fundamentally different from US patents. The results presented in this study were used to manage the patenting process at GazpromNeft STC.
Keywords: Topic Modeling; Text Classification; ARTM; PLSA; Random Forest; Text Collections Comparison.
Non-linear Gradient-based Feature Selection for Precise Prediction of Diseases
by Sadaf Kabir, Leily Farrokhvar
Abstract: Developing accurate predictive models can profoundly help health care providers improve the quality of their services. However, medical data often contain several variables, and not all of the data equally contribute towards the prediction. The existence of irrelevant and redundant features in a dataset can unnecessarily increase computational cost and complexity while deteriorating the performance of the predictive model. In this study, we employ the gradient-based prediction attribution as a general tool to identify important features in differentiable predictive models, such as neural networks and linear regression. Built upon this approach, we analyze single-stage and multi-stage scenarios for feature selection using ten medical datasets. Through extensive experiments, we demonstrate that the combination of the gradient-based approach with neural networks provides a powerful non-linear technique to identify important features contributing to the prediction. In particular, non-linear gradient-based feature selection achieves competitive results or significant improvements over previously reported results on all datasets.
Keywords: Machine learning; feature selection; neural networks; logistic regression; disease prediction models; health care data.
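The general idea of gradient-based attribution mentioned in the abstract above can be illustrated on the simplest differentiable classifier. The sketch below is not the authors' method: it assumes a logistic model p = sigmoid(w·x + b) with hypothetical weights and samples, and scores each feature by the mean absolute gradient of the prediction with respect to that input.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def input_gradient(weights, bias, x):
    """Gradient of p = sigmoid(w.x + b) with respect to each input feature."""
    p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    # dp/dx_k = p * (1 - p) * w_k
    return [p * (1.0 - p) * w for w in weights]

def feature_importance(weights, bias, samples):
    """Mean absolute input gradient per feature, averaged over all samples."""
    totals = [0.0] * len(weights)
    for x in samples:
        for k, g in enumerate(input_gradient(weights, bias, x)):
            totals[k] += abs(g)
    return [t / len(samples) for t in totals]

# Hypothetical model and data: the second feature has zero weight,
# so its attribution score is zero and it ranks last.
weights, bias = [2.0, 0.0, -1.0], 0.1
samples = [[0.5, 1.0, -0.3], [1.2, -0.7, 0.8], [-0.4, 0.2, 0.1]]
scores = feature_importance(weights, bias, samples)
ranking = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
```

For a neural network the same scores would come from backpropagating the output to the inputs, which is where the non-linearity described in the abstract enters; features with the lowest scores are the candidates for removal.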
Synergistic Effects Between Data Corpora Properties and Machine Learning Performance in Data Pipelines
by Roberto Bertolini, Stephen Finch
Abstract: To analyze data, a computationally feasible pipeline must be developed for data modeling. Corpora properties affect performance variability of machine learning (ML) techniques in pipelines; however, this has not been thoroughly investigated using simulation methodologies. A Monte Carlo study is used to compare differences in the area under the curve (AUC) metric for large-n-small-p-corpora examining: (1) the choice of ML algorithm, (2) size of the training database, (3) measurement error, (4) class imbalance magnitude, and (5) missing data pattern. Our simulations are consistent with established results under which these algorithms and corpora properties perform best, while providing insights into their synergistic effects. Measurement error negatively impacted pipeline performance across all corpora factors and ML algorithms. A larger training corpus ameliorated the decrease in predictive efficacy resulting from measurement error, class imbalance magnitudes, and missing data patterns. We discuss the implications of these findings for designing pipelines to enhance prediction performance.
Keywords: Data Pipeline; Interaction/Synergistic Effects; Monte Carlo Simulation; Machine Learning; Binary Classification.
Prediction of air pollution and analyze its effects on pollution dispersion of PM10 in Egypt using machine learning algorithms.
by Wael K. Hanna, Rasha Elstohy, Nouran Radwan
Abstract: Air pollution is considered one of the serious threats in Egypt. According to a study in the journal Environmental Science & Technology Letters, air pollution is one of the main factors responsible for shortening Egyptians' lives by 1.85 years. The main cause of air pollution in Egypt is PM10, which comes from industrial processes. PM10 concentrations exceed daily average concentrations during 98% of the measurement period. In this paper, we apply machine learning classification algorithms to build the most accurate model for predicting air pollution and analysing its effects on pollution dispersion of PM10. The proposed classification model begins with air quality data collection and pre-processing, followed by a classification process to discover the main relevant features for prediction. Experimental results show a good performance of the proposed air quality model. Random Forest, Na
Keywords: Air pollution; PM10; Classification model and machine learning algorithms.
Detecting and Exploiting Symmetries in Sequential Pattern Mining
by Ikram Nekkache, Said Jabbour, Nadjet Kamel, Lakhdar SAIS
Abstract: In this paper, we introduce a new framework for discovering and using symmetries in sequential pattern mining tasks. Symmetries are permutations between items that leave the sequential database invariant. Symmetries present several potential benefits. They can be seen as a new kind of structural pattern expressing regularities and similarities between items. As symmetries induce a partition of the sequential patterns into equivalence classes, exploiting them would improve the pattern enumeration process while reducing the size of the output. To this end, we first address the problem of symmetry discovery from a database of sequences. Then, we show how Apriori-like algorithms can be enhanced by dynamic integration of the detected symmetries. We also provide a second symmetry-breaking approach that eliminates symmetries in a preprocessing step by reformulating the sequential database of transactions. Our experiments clearly show that several sequential pattern mining datasets contain such symmetry-based regularities. We also experimentally demonstrate that using such symmetries would result in a significant reduction of the search space on some datasets.
Keywords: Data Mining; sequential pattern mining; symmetries.
Comparison of Harmony Search Derivatives for Artificial Neural Network Parameter Optimization: Stock Price Forecasting
by Mehmet Ozcalici, Ayse Tugba Dosdogru, Asli Boru Ipek, Mustafa Gocken
Abstract: This study aims to forecast the next day's stock price as accurately as possible using Harmony Search (HS) and its variants (Improved Harmony Search (IHS), Global-Best Harmony Search (GHS), Self-Adaptive Harmony Search (SAHS), and Intelligent Tuned Harmony Search (ITHS)) together with an Artificial Neural Network (ANN). The advantages of the proposed models are that the useful information in the original stock data is found by input variable selection and, simultaneously, the most suitable number of hidden neurons in the hidden layer is discovered to mitigate the overfitting/underfitting problem in ANN. The results have shown that forecasts made by HS-ANN, IHS-ANN, GHS-ANN, SAHS-ANN, and ITHS-ANN tend to achieve hit rates above 89%, which is considerably better than previously proposed forecasting models in the literature. Hence, ANN models provide more valuable forecasting results for investors to hedge against potential risk in stock markets.
Keywords: stock price forecasting; artificial neural network; harmony search and its variants.
Recommendation System for Improving Churn Rate based on Action Rules and Sentiment Mining
by Yuehua Duan, Zbigniew Ras
Abstract: It is well recognized that customers are one of the most valuable assets to a company. Therefore, it is of significant value for companies to reduce the customer outflow. In this paper, we focus on identifying the customers with a high chance of attrition and provide valid and trustworthy recommendations to improve the customer churn rate. To this end, we designed and implemented a recommender system that can provide actionable recommendations to improve customer churn rate. We used both transaction and survey data from the heavy equipment repair and service sector from 2011 to 2017. This data was collected by a consulting company based in Charlotte, North Carolina. In the survey data, customers give their thoughts, feelings, expectations and complaints in free-form text. We applied aspect-based sentiment analysis on the review text data to gain insightful knowledge on customers' attitudes toward the service. Action rule mining and a meta-action triggering mechanism are used to recognize the actionable strategies to help with reducing customer churn.
Keywords: Action Rule Mining; Meta-actions; Aspect-based Sentiment Analysis; Recommender System; Reduct.
ONTOLOGY AND WEB USAGE MINING FOR WEB SITE MAINTENANCE
by Khaled Benali
Abstract: The search for information in the classical web is based essentially on the structure of the documents, and this makes the exploitation of the content almost impossible by the machines. In contrast, in the Semantic Web, machines can access resources through the semantic representation of content. In this regard, two domains, namely the web mining and the semantic web are closely linked: on the one hand, web-mining techniques help in the construction of the semantic web; on the other hand, the semantic web helps extract new knowledge. The present article discusses the problem of implementing Web Usage Mining in the semantic web for information retrieval from the web using ontology. Therefore, we present an approach that uses ontology and Web Usage Mining techniques for website maintenance. This work can help novice researchers start working in the field of web mining in the Semantic Web easily. Our approach will be tested on the ontology of a university website, which will be built and then enriched based on the extracted patterns on the Site Logs using an algorithm for the extraction of frequent itemsets. This approach aims to produce all the pages that are often accessible at the same time and throughout the same session to maintain the websites.
Keywords: Apriori; knowledge; Log File; Ontology; Web Usage Mining; Semantic Web and Website Maintenance.
OPTIMIZING DATA QUALITY OF A DATA WAREHOUSE USING DATA PURGATION PROCESS
by Neha Gupta
Abstract: Data act as fuel for any science and technology operation and, due to the rapid growth of data collection and storage services, maintaining the quality of the data collected and stored is a major challenge. There are various data formats available and they are specifically categorized into three groups, i.e., Structured, Semi-structured and Unstructured. Different data mining techniques are utilized to gather, refine and investigate the data, which further raises the issue of data quality administration. The process of improving the quality of data without much alteration is known as data purgation. Data purgation occurs when the data is subject to Extract, Transform and Load (ETL) methodology in order to maintain and improve the data quality. Metadata is the most important factor that affects the quality of the collected data. The data may contain unnecessary information and may have inappropriate symbols which can be defined as dummy values, cryptic values or missing values. The present work has improved the Expectation-Maximization algorithm with dot product to handle cryptic data; the DBSCAN method with Gower metrics has been implemented to address dummy values; Ward's algorithm with Minkowski distance has been applied to improve the results of contradicting data; and the K-means algorithm along with Euclidean distance metrics has been applied to handle missing values in a dataset. These distance metrics have improved the data quality and also helped in providing consistent data to be loaded into a data warehouse. The above-mentioned algorithms have been modified with the feature of scanning the database once and calculating the minimum support, thereby increasing efficiency as well as accuracy. The implementation of the algorithms has been tested on various datasets of different sizes with more than 1000 records. The proposed algorithms have helped in maintaining the accuracy, integrity, consistency and non-redundancy of data in a timely manner.
Keywords: Data Warehouse (DW); Data Quality (DQ); Extract; Transform and Load (ETL); Data Purgation (DP).
A Deep-Learning Approach to Game Bot Identification via Behavioural Features Analysis in Complex Massively-Cooperative Environments
by Alfredo Cuzzocrea, Fabio Martinelli, Francesco Mercaldo
Abstract: The importance of the video game market has been continuously growing in recent years due to the continuous increase in the number of players. To maintain and increase enthusiasm in video game players, the games are continuously updated and other major innovations are expected in the coming years. Thus, a community of players interested in the so-called Massively Multiplayer Online Role-Playing Games (MMORPGs) has developed. Players soon introduced the possibility of obtaining some kind of gain from competitions. However, some players have tried to obtain advantages and easy winnings by introducing game bots into the games. In order to maintain fairness among players, it is important to detect the presence of game bots during video games so that they can be expelled from the games. This paper describes an approach to distinguish human players from game bots based on behavioral analysis. In other words, the approach detects when player behavior is abnormal compared to normal human player behavior. Behavioral features extracted during running games are analyzed by supervised Machine Learning (ML) and Deep Learning (DL) algorithms. For detecting game bots, the considered algorithms are first trained with labeled features and then used to classify previously unseen features. In this paper, the performance of our game bot detection approach is obtained experimentally. The dataset we use for training and classification is extracted from the logs generated during online video game matches.
Keywords: Game Bot Detection; Complex Massively-Cooperative Environments; Machine Learning; Deep Learning.
Application of rule-based data mining in extracting the rules from the number of patients and climatic factors in instantaneous to long-term spectrum
by Sima Hadadian, Zahra Naji-Azimi, Nasser Motahari Farimani, Behrouz Minaei-Bidgoli
Abstract: Predicting the number of patients helps managers to allocate resources in hospitals efficiently. In this research, the relationship between the number of patients and the temperature, relative humidity, wind speed, air pressure, and air pollution in instantaneous, short-, medium- and long-term indices was investigated. Genetic algorithm and ID3 decision tree have been used for feature selection, and classification based on multidimensional association rule mining algorithm has been applied for rule mining. The data have been collected for 19 months from a pediatric hospital whose wards are Nephrology, Hematology, Emergency, and PICU. The results show that in the long-term index, all climatic factors are correlated with the number of patients in all wards. Also, several if-then rules have been obtained, indicating the relationship between climate factors in four indices and the number of patients in each hospital ward. According to the if-then rules, optimal planning can be done for resource allocation in the hospital.
Keywords: climatic factors; the number of patients; Classification Based on Multidimensional Association Rule Mining; Genetic Algorithm; ID3 Decision Tree.
Detecting cyberbullying in Spanish texts throughout deep learning techniques
by Paul Cumba, Diego Riofrio, Verónica Rodríguez, Joe Carrión
Abstract: Recently collected data suggests that it is possible to automatically detect events that may negatively affect the most vulnerable parts of our society, by using any communication technology like social networks or messaging applications. This research consolidates and prepares a corpus of Spanish bullying expressions taken from Twitter in order to use them as input to train a convolutional neural network through deep learning techniques. As a result of this training, a predictive model was created, which can identify Spanish cyberbullying expressions such as insults, racism, homophobic attacks, and so on.
Keywords: cyberbullying; deep learning; convolutional neural network; Spanish; social networks.
Adaptable Address Parser with Active Learning
by You-Xuan Lin
Abstract: Address parsing, decomposing address strings into semantically meaningful components, is a way to convert unstructured or semi-structured address data into structured data. Flexibility and variability in real-world address formats make parser development a nontrivial task. Even after all the time and effort dedicated to obtaining a capable parser, updating or even re-training is required for out-of-domain data and extra costs will be incurred. To minimize the cost of model building and updating, this study experiments with active learning for model training and adaptation. Models composed of character-level embedding and Recurrent Neural Networks are trained to parse addresses in Taiwan. Results show that with active learning, 420 additional instances in the training data are sufficient for a model to adapt itself to unfamiliar data while its competence in the original domain is retained. This suggests that active learning is helpful for model adaptation when data labelling is expensive and restricted.
Keywords: address parsing; record linkage; active learning; model adaptation; recurrent neural network; address in Taiwan.
Capturing Uncertainties through Log Analysis Using DevOps
by Rajeev Kumar Gupta, Arti Jain, Ruchika Kumar, R.K. Pateriya
Abstract: DevOps is an advancement of agile processes that is mainly used to improve the coordination between development and operations teams. Continuous practices lie at the core of DevOps and ensure efficient pipelines and high-quality delivery of software. Applying such practices in the face of asynchronous work, business dynamics, compliance, and ever-changing client needs can yield high-performing and reliable final products. This research work is an attempt to propose a simplified solution, guidelines and tool support for developing and maintaining the quality of continuous practices that are used in a DevOps project. The system automates the correlation among various telemetry data to enrich log analysis and reduce manual effort. The proposed system performs in-depth analysis of logs and promotes quality assessments and feedback to developers, which in turn helps in deeper problem diagnosis of the telemetry data. In this work, an empirical study is carried out to gain conceptual clarity on integrated pipeline architecture and to address how automation in continuous monitoring accelerates and extends the feedback loop in the system.
Keywords: Agile; DevOps; Log analysis; Telemetry Data; SDLC.
An Optimization Approach for Determining the Efficiency of Vital Medical Devices in Intensive Care Units with COVID-19 Patients Using Apriori Algorithm
by Abasat Mirzaei, Fatemeh Hoseini, Mehrshad Lalinia
Abstract: Improving strategic management in preparing and equipping hospital intensive care units (ICUs), and ensuring the availability of medical devices, plays an important role in understanding consumer behavior and needs. This cross-sectional study was performed in the intensive care unit of Farhikhtegan Hospital, Tehran, Iran for a period of six months. During these six months, 10 vital medical devices were used 5497 times. These devices include: Ventilator, Oxygen Cylinder, Infusion Pump, Electrocardiography Machine, Vital Signs Monitor, Oxygen Flowmeter, Wavy Mattress, Ultrasound Sonography Machine, Ultrasound Echocardiography Machine, Dialysis Machine. Applying the Apriori algorithm to medical device usage in the ICU with COVID-19 patients showed that four devices (ventilator, oxygen cylinder, vital signs monitor, oxygen flowmeter) are the most used and are the basic needs of patients. These devices are positively correlated with each other, with confidence above 80% and support of 73%. To validate the results, we used the ECLAT (Equivalence Class Clustering and bottom-up Lattice Traversal) algorithm on our dataset.
Keywords: Medical Equipment; COVID-19; Hospital; Apriori Algorithm; Technology Management; Health Care Equipment; Medical Devices; Data Mining; Medical Data; Association Rule; ECLAT algorithm.
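The support and confidence figures quoted in the abstract above are the standard association-rule measures that Apriori and ECLAT rely on. The sketch below shows how they are computed, assuming each patient record is a set of devices used; the function names and the four toy records are made up for illustration and are not the study's data.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical usage records, one set of devices per patient stay.
records = [
    {"ventilator", "oxygen cylinder"},
    {"ventilator"},
    {"ventilator", "oxygen cylinder", "vital signs monitor"},
    {"vital signs monitor"},
]
s_vent = support(records, {"ventilator"})
s_pair = support(records, {"ventilator", "oxygen cylinder"})
c_rule = confidence(records, {"ventilator"}, {"oxygen cylinder"})
```

Apriori prunes candidate itemsets whose support falls below a chosen threshold before computing rule confidence, which is why both measures are reported together.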
Big Data Visual Exploration as a Recommendation Problem
by Moustafa Sadek Kahil, Abdelkrim Bouramoul, Makhlouf Derdour
Abstract: Big Data visual exploration can arguably be considered a recommendation problem. The proximity between the two essentially concerns their purpose: selecting, from a huge amount of data, the items that are most valuable according to specific criteria, and eventually presenting them to users. On the other hand, recommendation problems have recently been solved mostly using Neural Networks (NN). The present paper proposes three alternative solutions to improve Big Data visual exploration based on recommendation using Matrix Factorization (MF), namely conventional, Alternating Least Squares (ALS)-based, and NN-based methods. They concern generating the implicit data used to build recommendations and providing the most valuable data patterns according to the user profiles. The first two solutions are developed using Apache Spark, while the third one was developed using TensorFlow 2. A comparison based on results is done to show the most efficient one. The results show their applicability and effectiveness.
Keywords: Big Data Visualization; Recommendation Systems; Collaborative Filtering; Content-based Filtering; Matrix Factorization; Alternating Least Square; Machine Learning; Neural Networks.