Forthcoming and Online First Articles

International Journal of Data Mining, Modelling and Management (IJDMMM)

Forthcoming articles have been peer-reviewed and accepted for publication but are pending final changes, are not yet published and may not appear here in their final order of publication until they are assigned to issues. Therefore, the content conforms to our standards but the presentation (e.g. typesetting and proof-reading) is not necessarily up to the Inderscience standard. Additionally, titles, authors, abstracts and keywords may change before publication. Articles will not be published until the final proofs are validated by their authors.

Forthcoming articles must be purchased for the purposes of research, teaching and private study only. These articles can be cited using the expression "in press". For example: Smith, J. (in press). Article Title. Journal Title.

Articles marked with this shopping trolley icon are available for purchase - click on the icon to send an email request to purchase.

Online First articles are published online here, before they appear in a journal issue. Online First articles are fully citeable, complete with a DOI. They can be cited, read, and downloaded. Online First articles are published as Open Access (OA) articles to make the latest research available as early as possible.

Articles marked with this Open Access icon are Online First articles. They are freely available and openly accessible to all without any restriction except the ones stated in their respective CC licenses.

Register for our alerting service, which notifies you by email when new issues are published online.

We also offer RSS feeds which provide timely updates of tables of contents, newly published articles and calls for papers.

International Journal of Data Mining, Modelling and Management (18 papers in press)

Regular Issues

  • An Optimization Approach for Determining the Efficiency of Vital Medical Devices in Intensive Care Units with COVID-19 Patients Using Apriori Algorithm   Order a copy of this article
    by Abasat Mirzaei, Fatemeh Hoseini, Mehrshad Lalinia 
    Abstract: Improving strategic management in hospitals, including preparing and equipping intensive care units (ICUs) and ensuring the availability of medical devices, plays an important role in understanding consumer behavior and needs. This cross-sectional study was performed in the intensive care unit of Farhikhtegan Hospital, Tehran, Iran over a period of six months. During these six months, 10 vital medical devices were used 5497 times. These devices are: ventilator, oxygen cylinder, infusion pump, electrocardiography machine, vital signs monitor, oxygen flowmeter, wavy mattress, ultrasound sonography machine, ultrasound echocardiography machine, and dialysis machine. Applying the Apriori algorithm to medical device usage in the ICU with COVID-19 patients showed that four devices (ventilator, oxygen cylinder, vital signs monitor, and oxygen flowmeter) are the most used and are the basic needs of patients. These devices are positively correlated with each other, with confidence over 80% and support of 73%. To validate the results, we applied the ECLAT (Equivalence Class Clustering and bottom-up Lattice Traversal) algorithm to our dataset.
    Keywords: Medical Equipment; COVID-19; Hospital; Apriori Algorithm; Technology Management; Health Care Equipment; Medical Devices; Data Mining; Medical Data; Association Rule; ECLAT algorithm.

  • Big Data Visual Exploration as a Recommendation Problem   Order a copy of this article
    by Moustafa Sadek Kahil, Abdelkrim Bouramoul, Makhlouf Derdour 
    Abstract: It is believed that Big Data visual exploration can be considered a recommendation problem. The proximity between the two lies essentially in their purpose: selecting, from a huge amount of data, the items that are most valuable according to specific criteria, and eventually presenting them to users. Meanwhile, recommendation systems have recently been addressed mostly using Neural Networks (NN). The present paper proposes three alternative solutions to improve Big Data visual exploration based on recommendation using Matrix Factorization (MF), namely conventional, Alternating Least Squares (ALS)-based, and NN-based methods. The work concerns generating the implicit data used to build recommendations and providing the most valuable data patterns according to the user profiles. The first two solutions are developed using Apache Spark, while the third is developed using TensorFlow2. A comparison based on results is carried out to identify the most efficient one. The results show their applicability and effectiveness.
    Keywords: Big Data Visualization; Recommendation Systems; Collaborative Filtering; Content-based Filtering; Matrix Factorization; Alternating Least Square; Machine Learning; Neural Networks.

  • Identification of relevant features influencing movie reviews using sentiment analysis   Order a copy of this article
    by Isha Gupta, Indranath Chatterjee, Neha Gupta 
    Abstract: Sentiment analysis is a line of systematic text mining research that examines individuals' behavior, approach, and viewpoint. This paper analyzes viewers' sentiments towards the movies released during the pandemic. This study applies sentiment analysis techniques to movie reviews accessed in real time from the Internet Movie Database (IMDb). The paper's main objective is to identify the potential words that contribute to the biases of the reviews and influence viewers overall. The proposed methodology employs sentiment analysis based on the Valence Aware Dictionary for Sentiment Reasoning on the overall reviews, followed by application to various movie genres. Finally, we apply Pearson's correlation analysis to find the association between the words among the genres. The paper also calculates the sentiment scores of reviews using different sentiment analysis models. Our results showed a minimum of 17% of features in common genre-wise. It reveals sets of the most distinct influential words, which may be vital for understanding the nature of the language used for a particular kind of movie.
    Keywords: Sentiment Analysis; Feature Selection (FS); sentiment scores; IMDb reviews; adjectives & adverbs features.
    DOI: 10.1504/IJDMMM.2023.10052818
  • Churn Prediction in Telecommunication Sector with Machine Learning Methods   Order a copy of this article
    by Ayse SENYÜREK, Selçuk ALP 
    Abstract: The aim of this study is to construct a model that predicts which subscribers are likely to cancel their subscriptions in the telecommunication sector. In this context, the study covers data selection, data preprocessing, the application of machine learning methods, and performance criteria and measurement processes. Potential churn subscribers were estimated using logistic regression, artificial neural network, random forest and boosting methods. The results of the study show that the boosting method gives more accurate and successful results than the other methods. The most important factors causing customer churn were the period remaining until the end of the contract, tenure, which operator close relatives preferred, and the quality of the network.
    Keywords: Churn analysis; Telecommunication; CRM; Machine learning.

  • Mining Association Rules for Classification Using Frequent Generator Itemsets in arules Package   Order a copy of this article
    by Makhlouf Ledmi, Mohammed El Habib Souidi, Michael Hahsler, Abdeldjalil Ledmi, Chafia Kara-Mohamed 
    Abstract: Mining frequent itemsets is an attractive research activity in data mining whose main aim is to provide useful relationships among data. Consequently, several open-source development platforms are continuously developed to facilitate the users’ exploitation of new data mining tasks. Among these platforms, the R language is one of the most popular tools. In this paper, we propose an extension of the arules package that adds the option of mining frequent generator itemsets. We discuss in detail how generators can be used for a classification task through an application example related to COVID-19.
    Keywords: frequent generator itemsets; FGIs; classification; association rules; data mining; R language.
    DOI: 10.1504/IJDMMM.2023.10050487
  • Detection of Terrorism’s Apologies on Twitter using a New Bi-lingual Dataset   Order a copy of this article
    by Khaled BEDJOU, Faical Azouaou 
    Abstract: A lot of terrorist apology content is being shared on social media without being detected. Therefore, the automatic and immediate detection of this content is essential for people’s safety. In this paper, we propose a language-independent process to detect and classify terrorism’s apologies on Twitter into three classes (apology, no apology, and neutral). We tested the process on a bi-lingual (Arabic and English) dataset of 12,155 manually annotated tweets. We conducted two sets of experiments, one with imbalanced data and the other with oversampled data. We compared the classification performances of four machine learning algorithms (RF, DT, KNN, and NB) and five deep learning algorithms (GRU, SimpleRNN, LSTM, BiLSTM, and BERT). Our comparative study concluded that BERT achieves better classification performance than the others, with an accuracy of 0.84 for Arabic and 0.81 for English on imbalanced data, and 0.88 for Arabic and 0.91 for English on oversampled data.
    Keywords: terrorism’s apology; social network analysis; Twitter; NLP; sentiment analysis; machine learning; deep learning; transfer learning.
    DOI: 10.1504/IJDMMM.2023.10051983
  • An ABC approach for depression signs on social networks posts   Order a copy of this article
    by Amina MADANI, Fatima Boumahdi, Anfel Boukenaoui, Mohamed Chaouki Kritli, Asma Ghribi, Fatma Limani, Hamza Hentabli 
    Abstract: Mental health problems are considered among today’s most prominent plagues. In this paper, we aim to address one of the biggest mental health issues, which is depression. Using the potential of social media platforms, our ABC approach is based on a combination of different deep learning models, namely an autoencoder, BiLSTM and CNN. We test our approach and discuss our experiments on three datasets of Reddit posts provided by the 2019, 2020 and 2021 Conference and Labs of the Evaluation Forum (CLEF).
    Keywords: depression signs; social networks; deep learning; convolutional neural network; CNN; BiLSTM; autoencoder.
    DOI: 10.1504/IJDMMM.2023.10051990
  • A constraint programming approach for quantitative frequent pattern mining   Order a copy of this article
    by Mohammed El Amine LAGHZAOUI, Yahia LEBBAH 
    Abstract: Itemset mining is the first pattern mining problem studied in the literature, due to its simplicity but also its practical applications. Recently, it has been efficiently tackled in the constraint programming community, which has succeeded in proposing declarative environments that make it possible to consider various mining constraints. Most itemset mining studies have considered only boolean datasets, where each transaction either contains an item or does not. In practical applications, items appear in transactions with certain quantities. For instance, sales are made by purchasing quantities of products. The usual approach to handling a quantitative dataset is to transform it into a boolean dataset in which each quantitative item is transformed into as many boolean items as the number of its quantities. In this paper, we propose an extension of the current efficient constraint programming approach for itemset mining that takes quantitative items into account, in order to find patterns with their quantities directly on the original quantitative dataset. The contribution is twofold. Firstly, we facilitate the modelling of mining problems through a new constraint, which makes it possible to express frequency and closeness constraints on quantitative items directly on the quantitative dataset, without any transformation. Secondly, we propose a new filtering algorithm to handle completely the behaviour of the frequency and closeness constraint on a quantitative dataset. Experiments performed on standard benchmark datasets with numerous mining constraints show that our approach finds more informative quantitative patterns, with better running time than quantitative approaches based on classical boolean patterns.
    Keywords: Itemset Mining; Quantitative Database; Closed itemset Mining; Constraint Programming.

  • A New Process for Healthcare Big Data Warehouse Integration   Order a copy of this article
    by Nouha Arfaoui 
    Abstract: The healthcare domain generates a huge amount of data from different and heterogeneous clinical data sources, using different devices, to ensure good management of hospital performance. Because of the quantity and complex structure of the data, we use a big healthcare data warehouse, first for storage and later for decision making. To achieve our goal, we propose a new process that deals with this type of data. It starts by unifying the different data, then extracts it, loads it into the big healthcare data warehouse and finally makes the necessary transformations. For the first step, an ontology is used, as it is the best solution to the problem of data source heterogeneity. We also use Hadoop and its ecosystem, including Hive, MapReduce and HDFS, to accelerate processing through parallelism, exploiting the performance of ELT to ensure the
    Keywords: big healthcare data warehouse; BHDW; Hive; Hadoop; MapReduce; ontology; big data; ELT; ETL.
    DOI: 10.1504/IJDMMM.2023.10052446
  • Hierarchical ++: Improving the Hierarchical Clustering Algorithm   Order a copy of this article
    by WALLACE A PINHEIRO, Ana Bárbara S. Pinheiro 
    Abstract: Hierarchical grouping is a widely used grouping strategy. However, this technique often yields poorer results than other approaches, such as K-means clustering. In addition, many algorithms try to correct hierarchical failures by refactoring intermediate clustering combination actions, which may worsen performance. In this work, we propose a new set of procedures that alter the hierarchical technique to improve its results. The idea is to do it right the first time, avoiding the refactoring of previous steps. These modifications involve the concept of golden boxes, based on initial points named seeds, which indicate groups that must remain disconnected. To assess our strategy, we compare the results of several approaches: Traditional Hierarchical Clustering (single-link, complete-link, average, weighted, centroid, and median), K-means, K-means++, and the proposed method, named Hierarchical++. An experimental evaluation indicates that our proposal far surpasses the compared strategies.
    Keywords: clustering; grouping; similarity; golden boxes; complex distributions; dendrograms.

  • A Comparative Study of Supervised/Unsupervised Machine Learning Algorithms with Feature Selection Approaches to Predict Student Performance   Order a copy of this article
    by Alaa Khalaf Hamoud, Ali Salah Alasady, Wid Akeel Awadh, Jassim Mohammed Dahr, Mohammed B. M. Kamel, Aqeel Majeed Humadi, Ihab Ahmed Najm 
    Abstract: The field of educational data mining (EDM) is one of the fastest-growing fields and aims to improve the performance of students, academic staff, and institutions overall. Implementing data mining algorithms almost always requires a feature selection process to find the most correlated features and improve accuracy. In this paper, a comparative study is performed on the use of supervised/unsupervised algorithms in predicting students’ performance. The student's grade is classified using different families of supervised and unsupervised algorithms such as decision trees, clustering, and neural networks. These algorithms were examined over the questionnaire dataset before and after feature selection to measure the effect of feature selection on the accuracy of the results. The results showed that the random forest decision tree outperformed the other supervised/unsupervised algorithms. The results also showed that, after removing the less correlated attributes from the dataset, performance improved for most of the algorithms.
    Keywords: educational data mining; EDM; students’ performance; supervised algorithms; unsupervised algorithms; feature selection.
    DOI: 10.1504/IJDMMM.2023.10055032
  • A Novel Taxonomy of Natural Disasters based on Casualty and Consequence using Hierarchical Clustering   Order a copy of this article
    by Donald D. Atsa'am, Frank Adusei-Mensah, Oluwafemi S. Balogun, Temidayo O. Omotehinwa, Oluwaseun S. Dada, Richard Osei Agjei, Samuel Nii Odoi Devine
    Abstract: Post-disaster management requires a proportional deployment of human and material resources. The number of resources required to manage a disaster cannot be known without first evaluating the extent of casualty and consequence. This study proposed a taxonomy for classifying natural disasters based on casualty and consequence. Using secondary data on global disasters from 1900 to 2021, the hierarchical cluster analysis technique was deployed for taxonomy formation. The learning algorithm evaluated the similarities in numbers of deaths, injuries, and the cost of damaged property caused by disasters. Three clusters were extracted which sub-grouped historical disasters based on similarities in casualty and consequence. Further, a taxonomy that defines the ranges of what constitutes low, average, and high deaths/injuries/damage was established. Classifying a future disaster with this taxonomy prior to the deployment of resources for rescue, resettlement, compensation, and other disaster management operations will guide efficient resource allocation on a case-by-case basis.
    Keywords: Disaster taxonomy; natural disasters; casualty and consequence; post-disaster management; hierarchical cluster analysis.
    DOI: 10.1504/IJDMMM.2023.10055078
  • K-Means and DBSCAN for Look Alike-Sound Alike Medicines Issue   Order a copy of this article
    by Souad Moufok, Anas Mouattah, Khalid Hachemi 
    Abstract: The goal of this study is to analyse the application of data mining techniques in clustering drug names based on their spelling similarity, in order to reduce the occurrence of dispensing errors caused by look-alike sound-alike medicine confusion, which is considered one of the most common causes of dispensing errors. Two unsupervised data mining methods, k-means and DBSCAN, were used in conjunction with two similarity measures, Bisim and Levenshtein. The results of the study showed that the approach is effective in identifying potentially confusable medicines, with Bisim-based k-means clustering being favored with a silhouette score of 0.5.
    Keywords: Look Alike Sound Alike; Data Mining; Medication Errors; Dispensing Errors; Lasa; K-means; DBSCAN.

  • A deep-learning approach to game bot identification via behavioural features analysis in complex massively-cooperative environments   Order a copy of this article
    by Alfredo Cuzzocrea, Fabio Martinelli, Francesco Mercaldo 
    Abstract: In the so-called massively multiplayer online role-playing games (MMORPGs), malicious players have the possibility of obtaining some kind of gain from competitions, via easy victories achieved thanks to the introduction of game bots in the games. In order to maintain fairness among players, it is important to detect the presence of game bots during video games so that they can be expelled from the games. This paper describes an approach to distinguish human players from game bots based on behavioural analysis. This is implemented via supervised machine learning (ML) and deep learning (DL) algorithms. In order to detect game bots, the considered algorithms are first trained with labelled features and then used to classify previously unseen features. In this paper, the performance of our game bot detection approach is evaluated experimentally. The dataset we use for training and classification is extracted from logs generated during online video game matches of a real-life MMORPG.
    Keywords: game bot detection; complex massively-cooperative environments; machine learning; deep learning; massively multiplayer online role-playing games; MMORPGs.

  • Application of rule-based data mining in extracting the rules from the number of patients and climatic factors in instantaneous to long-term spectrum   Order a copy of this article
    by Sima Hadadian, Zahra Naji-Azimi, Nasser Motahari Farimani, Behrouz Minaei-Bidgoli 
    Abstract: Predicting the number of patients helps managers to allocate resources in hospitals efficiently. In this research, the relationship between the number of patients and the temperature, relative humidity, wind speed, air pressure, and air pollution in instantaneous, short-, medium- and long-term indices was investigated. A genetic algorithm and an ID3 decision tree have been used for feature selection, and a classification algorithm based on multidimensional association rule mining has been applied for rule mining. The data were collected over 19 months from a pediatric hospital whose wards are nephrology, hematology, emergency, and PICU. The results show that in the long-term index, all climatic factors are correlated with the number of patients in all wards. Also, several if-then rules have been obtained, indicating the relationship between climate factors in the four indices and the number of patients in each hospital ward. Based on these if-then rules, optimal planning can be done for resource allocation in the hospital.
    Keywords: temperature; relative humidity; wind speed; air pressure; air pollution; patients; hospital; association rule mining; classification; genetic algorithm; ID3 decision tree.

  • Capturing uncertainties through log analysis using DevOps   Order a copy of this article
    by Rajeev Kumar Gupta, Arti Jain, Ruchika Kumar, R.K. Pateriya 
    Abstract: DevOps is an advancement of agile processes which is mainly used to improve the coordination between development and operation teams. Continuous practices lie at the core of DevOps and ensure efficient pipelines and high-quality delivery of software. Applying such practices in sync with business dynamics and the ever-changing needs of clients can yield high-performance and reliable final products. This research work is an attempt to propose a simplified solution, guidelines and tool support for developing and maintaining the quality of the continuous practices used in a DevOps project. The system automates the correlation among various telemetry data to enrich log analysis and reduce manual effort. The proposed system performs in-depth analysis of logs and promotes quality assessments and feedback to developers, which in turn helps in deeper problem diagnosis of the telemetry data. In this work, an empirical study is carried out to gain conceptual clarity on integrated pipeline architecture and to address how automation in continuous monitoring accelerates and extends the feedback loop in the system.
    Keywords: agile; DevOps; log analysis; telemetry data; software development life cycle; SDLC.

  • Adaptable address parser with active learning   Order a copy of this article
    by You-Xuan Lin 
    Abstract: Address parsing, the decomposition of address strings into semantically meaningful components, is a means of converting unstructured or semi-structured address data into structured data. Flexibility and variability in real-world address formats make parser development a non-trivial task. Even after all the time and effort dedicated to obtaining a capable parser, updating or even re-training is required for out-of-domain data, and extra costs will be incurred. To minimise the cost of model building and updating, this study experiments with active learning for model training and adaptation. Models composed of character-level embedding and recurrent neural networks are trained to parse addresses in Taiwan. Results show that with active learning, 420 additional instances in the training data are sufficient for a model to adapt itself to unfamiliar data while its competence in the original domain is retained. This suggests that active learning is helpful for model adaptation when data labelling is expensive and restricted.
    Keywords: address parsing; record linkage; active learning; model adaptation; recurrent neural network; RNN; address in Taiwan.
    DOI: 10.1504/IJDMMM.2023.10051856
  • Optimising data quality of a data warehouse using data purgation process   Order a copy of this article
    by Neha Gupta 
    Abstract: The rapid growth of data collection and storage services has impacted the quality of the data. The data purgation process helps in maintaining and improving data quality when the data is subject to the extract, transform and load (ETL) methodology. Metadata may contain unnecessary information, which can be defined as dummy values, cryptic values or missing values. The present work improves the EM algorithm with a dot product to handle cryptic data; implements the DBSCAN method with the Gower metric to handle dummy values; applies Ward's algorithm with Minkowski distance to improve the results for contradicting data; and applies the K-means algorithm with the Euclidean distance metric to handle missing values in a dataset. These distance metrics have improved the data quality and also helped in providing consistent data to be loaded into a data warehouse. The proposed algorithms have helped in maintaining the accuracy, integrity, consistency and non-redundancy of data in a timely manner.
    Keywords: data warehouse; DW; data quality; DQ; extract, transform and load; ETL; data purgation; DP.
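
Several of the abstracts above report association-rule measures such as support and confidence (for example, the Apriori-based study of ICU medical devices). As a reader aid, the following is a minimal Python sketch of how these two measures are conventionally defined. The toy transactions, the item names and the function names `support` and `confidence` are illustrative assumptions; they are not taken from any of the listed articles and do not reproduce their data or results.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    sup_a = support(transactions, antecedent)
    if sup_a == 0:
        return 0.0  # the rule never fires, so its confidence is reported as 0
    both = frozenset(antecedent) | frozenset(consequent)
    return support(transactions, both) / sup_a


# Hypothetical usage records: each set lists the devices used in one episode.
transactions = [
    frozenset({"ventilator", "oxygen_cylinder", "monitor"}),
    frozenset({"ventilator", "oxygen_cylinder"}),
    frozenset({"ventilator", "monitor"}),
    frozenset({"infusion_pump"}),
]

pair_support = support(transactions, {"ventilator", "oxygen_cylinder"})
rule_confidence = confidence(transactions, {"ventilator"}, {"oxygen_cylinder"})

print(f"support = {pair_support:.2f}")
print(f"confidence = {rule_confidence:.2f}")
```

In this toy data, the pair appears in two of the four records, and the ventilator appears in three, so the rule "ventilator implies oxygen cylinder" holds in two of those three cases. Algorithms such as Apriori and ECLAT differ in how they enumerate candidate itemsets efficiently, not in how these measures are defined.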