Forthcoming and Online First Articles

International Journal of Data Mining, Modelling and Management (IJDMMM)

Forthcoming articles have been peer-reviewed and accepted for publication but are pending final changes, are not yet published and may not appear here in their final order of publication until they are assigned to issues. Therefore, the content conforms to our standards but the presentation (e.g. typesetting and proof-reading) is not necessarily up to the Inderscience standard. Additionally, titles, authors, abstracts and keywords may change before publication. Articles will not be published until the final proofs are validated by their authors.

Forthcoming articles must be purchased for the purposes of research, teaching and private study only. These articles can be cited using the expression "in press". For example: Smith, J. (in press). Article Title. Journal Title.


Online First articles are published online here, before they appear in a journal issue. Online First articles are fully citeable, complete with a DOI. They can be cited, read, and downloaded. Online First articles are published as Open Access (OA) articles to make the latest research available as early as possible.

Articles marked with the Open Access icon are Online First articles. They are freely available and openly accessible to all, without any restrictions except those stated in their respective CC licences.


International Journal of Data Mining, Modelling and Management (15 papers in press)

Regular Issues

  • Optimizing Data Quality of a Data Warehouse Using Data Purgation Process   Order a copy of this article
    by Neha Gupta 
    Abstract: Data acts as fuel for any science and technology operation, and with the rapid growth of data collection and storage services, maintaining the quality of the data collected and stored is a major challenge. Data formats fall into three broad groups: structured, semi-structured and unstructured. Different data mining techniques are used to gather, refine and investigate data, which in turn raises the issue of data quality administration. The process of improving data quality without much alteration is known as data purgation; it occurs when data is subjected to the Extract, Transform and Load (ETL) methodology in order to maintain and improve data quality. Metadata is the most important factor affecting the quality of collected data. The data may contain unnecessary information and inappropriate symbols, such as dummy values, cryptic values or missing values. The present work improves the Expectation-Maximization algorithm with the dot product to handle cryptic data; the DBSCAN method with the Gower metric is implemented to detect dummy values; Ward's algorithm with the Minkowski distance is applied to improve the results for contradicting data; and the K-means algorithm with the Euclidean distance metric is applied to handle missing values in a dataset. These distance metrics improve data quality and help provide consistent data for loading into a data warehouse. The above-mentioned algorithms have been modified to scan the database once and calculate the minimum support, thereby increasing efficiency as well as accuracy. The implementations have been tested on various datasets of different sizes, each with more than 1,000 records. The proposed algorithms help maintain the accuracy, integrity, consistency and non-redundancy of data in a timely manner.
    Keywords: Data Warehouse (DW); Data Quality (DQ); Extract; Transform and Load (ETL); Data Purgation (DP).
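The K-means-with-Euclidean-distance treatment of missing values mentioned in the abstract can be sketched roughly as follows. This is an illustrative toy example, not the authors' implementation: the function names, centroids and records are all hypothetical, and the centroids are assumed to come from a prior clustering run.

```python
import math

def euclidean(a, b, dims):
    """Euclidean distance restricted to the given dimensions."""
    return math.sqrt(sum((a[d] - b[d]) ** 2 for d in dims))

def impute_missing(records, centroids):
    """Assign each record to the nearest centroid using only its
    observed dimensions, then fill missing values (None) from it."""
    filled = []
    for rec in records:
        observed = [d for d, v in enumerate(rec) if v is not None]
        nearest = min(centroids, key=lambda c: euclidean(rec, c, observed))
        filled.append([v if v is not None else nearest[d]
                       for d, v in enumerate(rec)])
    return filled

centroids = [[1.0, 1.0], [10.0, 10.0]]     # e.g. from a prior K-means run
records = [[1.2, None], [None, 9.5]]
print(impute_missing(records, centroids))  # [[1.2, 1.0], [10.0, 9.5]]
```

Each incomplete record is matched to a cluster using only the values it actually has, so imputation stays consistent with the cluster structure of the complete data.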

  • A Deep-Learning Approach to Game Bot Identification via Behavioural Features Analysis in Complex Massively-Cooperative Environments   Order a copy of this article
    by Alfredo Cuzzocrea, Fabio Martinelli, Francesco Mercaldo 
    Abstract: The importance of the video game market has grown continuously in recent years due to the steady increase in the number of players. To maintain and increase players' enthusiasm, games are continuously updated, and further major innovations are expected in the coming years. A community of players has thus developed around the so-called Massively Multiplayer Online Role-Playing Games (MMORPGs). Players soon introduced the possibility of obtaining some kind of gain from competitions; however, some players have tried to gain advantages through easy winnings by introducing game bots into the games. To maintain fairness among players, it is important to detect the presence of game bots during play so that they can be expelled from the games. This paper describes an approach to distinguishing human players from game bots based on behavioural analysis; in other words, the approach detects when player behaviour is abnormal compared to normal human player behaviour. Behavioural features extracted during running games are analysed by supervised Machine Learning (ML) and Deep Learning (DL) algorithms: the considered algorithms are first trained on labelled features and then used to classify previously unseen features. The performance of our game bot detection approach is obtained experimentally, using a dataset for training and classification extracted from the logs generated during online video game matches.
    Keywords: Game Bot Detection; Complex Massively-Cooperative Environments; Machine Learning; Deep Learning.

  • Application of rule-based data mining in extracting the rules from the number of patients and climatic factors in instantaneous to long-term spectrum   Order a copy of this article
    by Sima Hadadian, Zahra Naji-Azimi, Nasser Motahari Farimani, Behrouz Minaei-Bidgoli 
    Abstract: Predicting the number of patients helps managers to allocate resources in hospitals efficiently. In this research, the relationship between the number of patients with the temperature, relative humidity, wind speed, air pressure, and air pollution in instantaneous, short-, medium- and long-term indices was investigated. Genetic algorithm and ID3 decision tree have been used for feature selection, and classification based on multidimensional association rule mining algorithm has been applied for rule mining. The data have been collected for 19 months from a pediatric hospital whose wards are Nephrology, Hematology, Emergency, and PICU. The results show that in the long-term index, all climatic factors are correlated with the number of patients in all wards. Also, several if-then rules have been obtained, indicating the relationship between climate factors in four indices with the number of patients in each hospital ward. According to if-then rules, optimal planning can be done for resource allocation in the hospital.
    Keywords: climatic factors; the number of patients; Classification Based on Multidimensional Association Rule Mining; Genetic Algorithm; ID3 Decision Tree.

  • Adaptable Address Parser with Active Learning   Order a copy of this article
    by You-Xuan Lin 
    Abstract: Address parsing, decomposing address strings into semantically meaningful components, is a means of converting unstructured or semi-structured address data into structured form. The flexibility and variability of real-world address formats make parser development a nontrivial task. Even after all the time and effort dedicated to obtaining a capable parser, updating or even re-training is required for out-of-domain data, incurring extra cost. To minimise the cost of model building and updating, this study experiments with active learning for model training and adaptation. Models composed of character-level embeddings and Recurrent Neural Networks are trained to parse addresses in Taiwan. Results show that, with active learning, adding 420 instances to the training data is sufficient for a model to adapt itself to unfamiliar data while retaining its competence in the original domain. This suggests that active learning is helpful for model adaptation when data labelling is expensive and restricted.
    Keywords: address parsing; record linkage; active learning; model adaptation; recurrent neural network; address in Taiwan.
    DOI: 10.1504/IJDMMM.2023.10051856
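The core of an active learning loop like the one this paper experiments with is a query strategy that picks the instances the current model is least sure about. The minimal sketch below shows least-confident uncertainty sampling with a hypothetical toy model; it is not the paper's method, and all names here are illustrative.

```python
def least_confident(pool, predict_proba, k):
    """Pick the k pool instances whose top predicted probability is lowest,
    i.e. the instances the model is least confident about."""
    scored = [(max(predict_proba(x)), x) for x in pool]
    scored.sort(key=lambda t: t[0])
    return [x for _, x in scored[:k]]

# Toy two-class model: "confidence" is distance from the 0.5 boundary.
def predict_proba(x):
    p = min(max(x, 0.0), 1.0)
    return [p, 1.0 - p]

pool = [0.9, 0.48, 0.1, 0.55]
print(least_confident(pool, predict_proba, 2))  # [0.48, 0.55]
```

The selected instances would then be labelled by a human and appended to the training set before the model is re-trained, which is how a small number of extra instances (420 in the paper) can drive adaptation.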
     
  • Capturing Uncertainties through Log Analysis Using DevOps   Order a copy of this article
    by Rajeev Kumar Gupta, Arti Jain, Ruchika Kumar, R.K. Pateriya 
    Abstract: DevOps is an advancement of agile processes, mainly used to improve coordination between development and operations teams. Continuous practices lie at the core of DevOps, ensuring efficient pipelines and high-quality software delivery. Using such practices amid asynchronous workflows, changing business compliance requirements and ever-changing client needs can yield high-performance, reliable final products. This research work proposes a simplified solution, guidelines and tool support for developing and maintaining the quality of the continuous practices used in DevOps projects. The system automates the correlation among various telemetry data to enrich log analysis and reduce manual effort. The proposed system performs in-depth analysis of logs and provides quality assessment and feedback to developers, which in turn helps in deeper problem diagnosis of the telemetry data. In this work, an empirical study is carried out to gain conceptual clarity on the integrated pipeline architecture and to address how automation in continuous monitoring accelerates and extends the feedback loop in the system.
    Keywords: Agile; DevOps; Log analysis; Telemetry Data; SDLC.

  • An Optimization Approach for Determining the Efficiency of Vital Medical Devices in Intensive Care Units with COVID-19 Patients Using Apriori Algorithm   Order a copy of this article
    by Abasat Mirzaei, Fatemeh Hoseini, Mehrshad Lalinia 
    Abstract: In improving strategic management in hospitals, the preparation and equipping of intensive care units (ICUs) and the availability of medical devices play an important role in understanding consumer behaviour and need. This cross-sectional study was performed in the intensive care unit of Farhikhtegan Hospital, Tehran, Iran over a period of six months, during which 10 vital medical devices were used 5,497 times. These devices are: ventilator, oxygen cylinder, infusion pump, electrocardiography machine, vital signs monitor, oxygen flowmeter, wavy mattress, ultrasound sonography machine, ultrasound echocardiography machine and dialysis machine. Applying the Apriori algorithm to device usage in the ICU with COVID-19 patients showed that four devices, the ventilator, oxygen cylinder, vital signs monitor and oxygen flowmeter, are the most used and constitute the basic needs of patients. These devices are positively correlated with each other, with confidence over 80% and support of 73%. To validate the results, we applied the ECLAT (Equivalence Class Clustering and bottom-up Lattice Traversal) algorithm to our dataset.
    Keywords: Medical Equipment; COVID-19; Hospital; Apriori Algorithm; Technology Management; Health Care Equipment; Medical Devices; Data Mining; Medical Data; Association Rule; ECLAT algorithm.
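The support and confidence figures quoted in the abstract are the standard association-rule measures that Apriori is built on. The sketch below computes them directly on a tiny, invented set of device-usage transactions; the device sets and numbers are illustrative only, not the study's data.

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Conditional frequency of the consequent given the antecedent:
    support(antecedent ∪ consequent) / support(antecedent)."""
    return support(transactions, set(antecedent) | set(consequent)) / \
           support(transactions, antecedent)

# Hypothetical device co-usage records, one set per ICU stay
transactions = [
    {"ventilator", "oxygen cylinder", "vital signs monitor"},
    {"ventilator", "oxygen cylinder", "oxygen flowmeter"},
    {"ventilator", "oxygen cylinder"},
    {"infusion pump", "vital signs monitor"},
]
print(support(transactions, {"ventilator", "oxygen cylinder"}))       # 0.75
print(confidence(transactions, {"ventilator"}, {"oxygen cylinder"}))  # 1.0
```

Apriori itself only adds a pruning step on top of these counts: any itemset whose subset is infrequent is skipped, since support can never increase when items are added.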

  • Big Data Visual Exploration as a Recommendation Problem   Order a copy of this article
    by Moustafa Sadek Kahil, Abdelkrim Bouramoul, Makhlouf Derdour 
    Abstract: Big Data visual exploration can, we believe, be considered a recommendation problem. The proximity essentially concerns their purpose: both consist in selecting, from huge amounts of data, the items that are most valuable according to specific criteria, and eventually presenting them to users. Recommendation systems, in turn, have recently been addressed mostly using Neural Networks (NN). The present paper proposes three alternative solutions to improve Big Data visual exploration based on recommendation using Matrix Factorization (MF), namely: conventional, Alternating Least Squares (ALS)-based, and NN-based methods. This involves generating the implicit data used to build recommendations and providing the most valuable data patterns according to user profiles. The first two solutions were developed using Apache Spark, while the third was developed using TensorFlow 2. A comparison of the results is carried out to identify the most efficient method, and the results show the applicability and effectiveness of the proposed solutions.
    Keywords: Big Data Visualization; Recommendation Systems; Collaborative Filtering; Content-based Filtering; Matrix Factorization; Alternating Least Square; Machine Learning; Neural Networks.
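Alternating least squares, one of the MF variants the paper compares, fixes one factor and solves the other in closed form, then alternates. The rank-1, fully observed case below is a deliberately minimal sketch of the idea, not the paper's Spark implementation; matrix and names are invented for illustration.

```python
def als_rank1(R, iters=20):
    """Rank-1 alternating least squares on a fully observed matrix R:
    approximate R[i][j] ≈ u[i] * v[j], solving each factor in closed form."""
    m, n = len(R), len(R[0])
    u, v = [1.0] * m, [1.0] * n
    for _ in range(iters):
        # Fix v, solve u_i = sum_j R_ij * v_j / sum_j v_j^2  (least squares)
        vv = sum(x * x for x in v)
        u = [sum(R[i][j] * v[j] for j in range(n)) / vv for i in range(m)]
        # Fix u, solve v_j symmetrically
        uu = sum(x * x for x in u)
        v = [sum(R[i][j] * u[i] for i in range(m)) / uu for j in range(n)]
    return u, v

R = [[2.0, 4.0], [1.0, 2.0]]   # exactly rank 1, so ALS recovers it exactly
u, v = als_rank1(R)
approx = [[u[i] * v[j] for j in range(2)] for i in range(2)]
print(approx)
```

Production systems (e.g. Spark's ALS) extend this to rank k, add regularisation, and sum only over observed entries, but each update remains a small closed-form least-squares solve.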

  • Identification of relevant features influencing movie reviews using sentiment analysis   Order a copy of this article
    by Isha Gupta, Indranath Chatterjee, Neha Gupta 
    Abstract: Sentiment analysis is a systematic text mining research area that examines individuals' behaviour, approach and viewpoint. This paper analyses viewers' sentiments towards movies released during the pandemic. The study employs sentiment analysis techniques on movie reviews accessed in real time from the Internet Movie Database (IMDb). The paper's main objective is to identify the potential words that contribute to the biases of reviews and influence viewers overall. The proposed methodology employs the Valence Aware Dictionary for Sentiment Reasoning (VADER) for sentiment analysis of overall reviews, followed by its application to various movie genres. Finally, we apply Pearson's correlation analysis to find the association between words across genres. The paper also calculates the sentiment scores of reviews using different sentiment analysis models. Our results showed a minimum of 17% of features common across genres, revealing sets of highly distinct influential words that may be vital for understanding the nature of the language used for a particular kind of movie.
    Keywords: Sentiment Analysis; Feature Selection (FS); sentiment scores; IMDb reviews; adjective and adverb features.
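The Pearson correlation used above to associate word usage across genres is straightforward to compute from raw frequency vectors. The sketch below is illustrative only: the word-frequency vectors for the two genres are invented, not taken from the paper's IMDb data.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical frequencies of four shared adjectives in two genres
drama  = [12, 5, 8, 1]
comedy = [10, 6, 7, 2]
print(round(pearson(drama, comedy), 3))  # close to 1: similar usage profiles
```

A coefficient near 1 indicates the two genres use the shared words in similar proportions; values near 0 would flag genre-specific vocabulary.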

  • Churn Prediction in Telecommunication Sector with Machine Learning Methods   Order a copy of this article
    by Ayse SENYÜREK, Selçuk ALP 
    Abstract: The aim of this study is to construct a model that predicts which subscribers are likely to cancel their subscriptions in the telecommunication sector. In this context, the work covers data selection, data preparation, the machine learning methods used, and the performance criteria and measurement processes. Potential churn subscribers were estimated using logistic regression, artificial neural networks, random forests and boosting. Examination of the results shows that the boosting method gives more accurate and successful results than the other methods. The most important factors causing customer churn were the period remaining until the end of the contract, tenure, the operator preferred by close relatives, and the quality of the network.
    Keywords: Churn analysis; Telecommunication; CRM; Machine learning.

  • Mining Association Rules for Classification Using Frequent Generator Itemsets in arules Package   Order a copy of this article
    by Makhlouf Ledmi, Mohammed El Habib Souidi, Michael Hahsler, Abdeldjalil Ledmi, Chafia Kara-Mohamed 
    Abstract: Mining frequent itemsets is an attractive research area in data mining whose main aim is to reveal useful relationships among data. Consequently, several open-source development platforms are continuously being developed to facilitate users' adoption of new data mining tasks. Among these platforms, the R language is one of the most popular tools. In this paper, we propose an extension of the arules package that adds the option of mining frequent generator itemsets. We discuss in detail how generators can be used for a classification task through an application example related to COVID-19.
    Keywords: frequent generator itemsets; FGIs; classification; association rules; data mining; R language.
    DOI: 10.1504/IJDMMM.2023.10050487
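A generator itemset, the notion this arules extension mines, is an itemset none of whose proper subsets has the same support. The check below is a minimal, brute-force sketch of the definition in Python (the paper's implementation is in R's arules); the transactions are invented for illustration.

```python
from itertools import combinations

def support_count(transactions, itemset):
    """Number of transactions containing every item in the itemset."""
    return sum(set(itemset) <= t for t in transactions)

def is_generator(transactions, itemset):
    """An itemset is a generator if no proper subset (including the empty
    set) has the same support count."""
    s = support_count(transactions, itemset)
    return all(support_count(transactions, sub) != s
               for r in range(len(itemset))
               for sub in combinations(itemset, r))

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}]
print(is_generator(transactions, ("a",)))      # False: {} has the same support
print(is_generator(transactions, ("b", "c")))  # True: every subset is more frequent
```

Generators are the minimal itemsets of their support equivalence classes, which is why they make compact left-hand sides for classification rules.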
     
  • Detection of Terrorism’s Apologies on Twitter using a New Bi-lingual Dataset   Order a copy of this article
    by Khaled BEDJOU, Faical Azouaou 
    Abstract: A lot of terrorist apology content is shared on social media without being detected; therefore, the automatic and immediate detection of such content is essential for people's safety. In this paper, we propose a language-independent process to detect and classify apologies for terrorism on Twitter into three classes (apology, no apology, and neutral). We tested the process on a bi-lingual (Arabic and English) dataset of 12,155 manually annotated tweets. We conducted two sets of experiments, one with imbalanced data and the other with oversampled data, comparing the classification performance of four machine learning algorithms (RF, DT, KNN, and NB) and five deep learning algorithms (GRU, SimpleRNN, LSTM, BiLSTM, and BERT). Our comparative study concluded that BERT achieves better classification performance than the others, with an accuracy of 0.84 for Arabic and 0.81 for English on imbalanced data, and 0.88 for Arabic and 0.91 for English on oversampled data.
    Keywords: terrorism’s apology; social network analysis; Twitter; NLP; sentiment analysis; machine learning; deep learning; transfer learning.
    DOI: 10.1504/IJDMMM.2023.10051983
     
  • An ABC approach for depression signs on social networks posts   Order a copy of this article
    by Amina MADANI, Fatima Boumahdi, Anfel Boukenaoui, Mohamed Chaouki Kritli, Asma Ghribi, Fatma Limani, Hamza Hentabli 
    Abstract: Mental illness is considered one of the most prominent plagues of today's world. In this paper, we aim to address one of mental health's biggest issues: depression. Harnessing the potential of social media platforms, our ABC approach combines three deep learning models: an autoencoder, a BiLSTM and a CNN. We test our approach and discuss our experiments on three datasets of Reddit posts provided by the 2019, 2020 and 2021 editions of the Conference and Labs of the Evaluation Forum (CLEF).
    Keywords: depression signs; social networks; deep learning; convolutional neural network; CNN; BiLSTM; autoencoder.
    DOI: 10.1504/IJDMMM.2023.10051990
     
  • A constraint programming approach for quantitative frequent pattern mining   Order a copy of this article
    by Mohammed El Amine LAGHZAOUI, Yahia LEBBAH 
    Abstract: Itemset mining was the first pattern mining problem studied in the literature, due to its simplicity but also its practical applications. Recently, it has been tackled efficiently in the constraint programming community, which has succeeded in proposing declarative environments that accommodate various mining constraints. Most itemset mining studies have considered only boolean datasets, where each transaction either contains an item or does not. In practical applications, however, items appear in transactions with quantities; for instance, sales are made by purchasing quantities of products. The usual approach to handling a quantitative dataset is to transform it into a boolean dataset in which each quantitative item becomes as many boolean items as it has quantity values. In this paper, we propose an extension of the current efficient constraint programming approach for itemset mining that takes quantitative items into account, in order to find patterns with their quantities directly on the original quantitative dataset. The contribution is twofold. First, we facilitate the modelling of mining problems through a new constraint that expresses frequency and closedness constraints on quantitative items directly on the quantitative dataset, without any transformation. Second, we propose a new filtering algorithm that completely handles the behaviour of the frequency and closedness constraint on a quantitative dataset. Experiments performed on standard benchmark datasets with numerous mining constraints show that our approach finds more informative quantitative patterns and achieves better running times than quantitative approaches based on classical boolean patterns.
    Keywords: Itemset Mining; Quantitative Database; Closed itemset Mining; Constraint Programming.
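The key idea above, evaluating frequency directly on the quantitative dataset instead of booleanising it, can be illustrated with a tiny support computation over transactions that carry quantities. This is a hand-rolled sketch of the counting semantics only, not the paper's constraint programming model; the sales data is invented.

```python
def quantitative_support(transactions, item, min_qty):
    """Fraction of transactions containing `item` with quantity >= min_qty,
    computed directly on the quantitative dataset (no boolean expansion)."""
    hits = sum(t.get(item, 0) >= min_qty for t in transactions)
    return hits / len(transactions)

# Hypothetical sales transactions: item -> purchased quantity
sales = [
    {"bread": 2, "milk": 1},
    {"bread": 5},
    {"milk": 3, "bread": 1},
]
print(quantitative_support(sales, "bread", 2))  # 2/3 of transactions
```

The boolean transformation would instead create one item per (product, quantity) pair and count memberships; counting on the original quantities avoids that blow-up, which is what the proposed constraint exploits.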

  • A New Process for Healthcare Big Data Warehouse Integration   Order a copy of this article
    by Nouha Arfaoui 
    Abstract: The healthcare domain generates huge amounts of data from different, heterogeneous clinical data sources using various devices, to ensure good management of hospital performance. Because of the quantity and structural complexity of the data, we use a big healthcare data warehouse, first for storage and later for decision making. To achieve our goal, we propose a new process that deals with this type of data. It starts by unifying the different data, then extracts it, loads it into the big healthcare data warehouse, and finally makes the necessary transformations. For the first step, an ontology is used; it is the best solution to the problem of data source heterogeneity. We also use Hadoop and its ecosystem, including Hive, MapReduce and HDFS, to accelerate processing through parallelism, exploiting the performance of ELT to ensure the
    Keywords: big healthcare data warehouse; BHDW; Hive; Hadoop; MapReduce; ontology; big data; ELT; ETL.
    DOI: 10.1504/IJDMMM.2023.10052446
     
  • Hierarchical ++: Improving the Hierarchical Clustering Algorithm   Order a copy of this article
    by WALLACE A PINHEIRO, Ana Bárbara S. Pinheiro 
    Abstract: Hierarchical clustering is a widely used grouping strategy. However, this technique often yields poorer results than other approaches, such as K-means clustering. In addition, many algorithms try to correct hierarchical failures by refactoring intermediate cluster-merging actions, which may worsen performance. In this work, we propose a new set of procedures that alter the hierarchical technique to improve its results. The idea is to do it right the first time, avoiding the refactoring of previous steps. These modifications involve the concept of golden boxes, based on initial points named seeds, which indicate groups that must remain disconnected. To assess our strategy, we compare the results of several approaches: traditional hierarchical clustering (single-link, complete-link, average, weighted, centroid and median), K-means, K-means++ and the proposed method, named Hierarchical++. An experimental evaluation indicates that our proposal far surpasses the compared strategies.
    Keywords: clustering; grouping; similarity; golden boxes; complex distributions; dendrograms.