International Journal of Data Mining, Modelling and Management (IJDMMM) Inderscience Publishers - linking academia, business and industry through research

Forthcoming and Online First Articles

Forthcoming articles have been peer-reviewed and accepted for publication but are pending final changes, are not yet published and may not appear here in their final order of publication until they are assigned to issues. Therefore, the content conforms to our standards but the presentation (e.g. typesetting and proof-reading) is not necessarily up to the Inderscience standard. Additionally, titles, authors, abstracts and keywords may change before publication. Articles will not be published until the final proofs are validated by their authors.

Forthcoming articles must be purchased for the purposes of research, teaching and private study only. These articles can be cited using the expression "in press". For example: Smith, J. (in press). Article Title. Journal Title.

Online First articles are published online here, before they appear in a journal issue. Online First articles are fully citable, complete with a DOI. They can be cited, read, and downloaded. Online First articles are published as Open Access (OA) articles to make the latest research available as early as possible.

Articles marked with this Open Access icon are Online First articles. They are freely available and openly accessible to all without any restriction except the ones stated in their respective CC licenses.

International Journal of Data Mining, Modelling and Management (20 papers in press)

Regular Issues

Analysing and Forecasting COVID-19 Vaccination - Evidence from a Native American Community in North Carolina, USA
by Xin Zhang, Zhixin Kang, Guanlin Gao, Xinyan Shi
Abstract: This study examines the determining factors of vaccination decisions for adults and children in a historical tribal region and evaluates the predictive power of various machine learning models. COVID-19 vaccination data were investigated, although the proposed method may be used for evaluating other vaccination data. We administered a survey and collected cross-sectional data (e.g., socio-demographics, COVID-19 testing behaviours, vaccination status, and people's knowledge about, attitude toward, and belief in the vaccines), developed new features and built predictive models (e.g., random forest, neural network, and decision tree), and evaluated their performance against the benchmark logistic regression models. The results show that people who tested more frequently, believed vaccination is a social responsibility, and were provided with paid leave by their employers are more likely to be fully vaccinated and to vaccinate their children. Our results also show that not all machine learning models outperform the logistic regression model.
Keywords: COVID-19 vaccination intention; feature design and evaluation; vaccination forecasting; machine learning; Bayesian-correlation; model evaluation.
DOI: 10.1504/IJDMMM.2025.10066364

Multi-Document Text Summarisation using DL-BiLSTM Model with Hybrid Algorithms
by Jyotirmayee Rautaray, Sangram Panigrahi, Ajit Kumar Nayak
Abstract: With the overwhelming amount of information available online, it becomes challenging for users to access relevant data. Automated techniques are essential to effectively filter and extract valuable information from vast datasets. Recently, text summarisation has emerged as a key method for distilling relevant content from lengthy documents. This work introduces a novel deep learning-based approach for multi-document text summarisation. The proposed system begins with pre-processing tasks such as stop word removal, sentence and paragraph chunking, stemming, and lemmatisation. Textual phrases are transformed into vector space models using TF-ISF, and sentence scores are evaluated. A deep learning-based bidirectional long short-term memory model is employed for summarisation. Additionally, cat swarm optimisation and Aquila optimisers refine the DL model's parameters. The approach is validated using the DUC 2002, DUC 2003, and DUC 2005 datasets, demonstrating superior performance across various metrics including ROUGE scores, BLEU scores, cohesion, sensitivity, positive predictive value, and readability when compared to other summarisation methods.
Keywords: multi-document text summarisation; MDTS; BiLSTM; term frequency-inverse sentence frequency; deep learning; Aquila optimiser; cat swarm optimisation; CSO; natural language processing; NLP.
DOI: 10.1504/IJDMMM.2025.10066438

Identifying Immoral Posts on Social Media Platforms: a Review
by Bibi Saqia, Khairullah Khan, Atta Ur Rahman
Abstract: Social media has become an integral part of our lives, connecting people across different parts of the world. Recently, there has been an increasing concern over the proliferation of immoral content on social media platforms. The ease and speed of communication on social media have made it a popular platform for people to express their opinions, but it has also led to the spread of harmful and immoral content. Hate speech, cyberbullying, and other forms of immoral behaviour are common on social media platforms, which can have serious consequences for the individuals involved and the wider community. Current literature reviews have normally fixated on a specific class of immoral posts, such as hate speech. According to the study, no review has been dedicated to the overall categories of immoral post identification. This paper describes a systematic literature review of computational approaches, resources, challenges, and research gaps concerning the overall categories of immoral post identification.
Keywords: immoral posts; social media; cyberbullying; hate speech; challenges and issues.
DOI: 10.1504/IJDMMM.2025.10066845

Sentiment Analysis of Danish Health Care Industries' Financial Text
by Rudra Pratap Deb Nath, Emil Bækdahl, Magnus Brogaard Larsen, Jakob Skallebæk, Jesper Juul Severinsen
Abstract: Sentiment analysis enables organisations to gain insights into market trends and customer opinions expressed in textual format. It quantifies textual opinions by classifying them as positive, negative, or neutral. We present a system for performing sentiment analysis on Danish texts related to the Danish healthcare industry. The system is composed of two components: a domain-specific sentiment lexicon (DSSL) generator and a dependency tree-based sentence analyser (DTSA). To generate the DSSL, we use company stock prices to automatically label the sentiments of financial news articles based on the point-wise mutual information method and achieve performance improvements compared to existing general sentiment lexicons. Our DTSA is based on a data structure called a dependency tree, which describes how words in a text are connected. Depending on the types of connections between the words, we apply different rules to compute a sentiment value. This approach, in conjunction with the DSSL, performs best in three-class sentence classification compared to systems using different sentiment lexicons and/or sentiment analysis components. We achieve an accuracy of 53% and the best F1 scores.
Keywords: Sentiment Analysis; Danish Text Mining; Business Intelligence; Knowledge Discovery; Natural Language Processing; ETL.
DOI: 10.1504/IJDMMM.2025.10066891

Lung Disease Classification using Deep Learning 1-D Convolutional Neural Network
by J. Viji Gripsy, Divya T
Abstract: Healthcare plays a crucial role in human life, particularly in the early diagnosis of diseases such as lung cancer, which affects people worldwide. Early detection of lung cancer can significantly improve treatment outcomes. This paper proposes a 1-D CNN deep learning architecture to classify patients into low, medium, and high-risk categories for lung cancer. The model achieves 97% training accuracy and 96.33% test accuracy, outperforming existing classification algorithms in accuracy, precision, recall, F1-score, and AUC. These results highlight the effectiveness of the proposed architecture in the early diagnosis of lung cancer.
Keywords: lung disease; classification; 1-D convolutional neural network; 1-D CNN; prediction.
DOI: 10.1504/IJDMMM.2025.10066898

Sentiment Analysis on Customers' Review in Indonesian Marketplace using Natural Language Processing (a Case Study of Organic Face Mask)
by Nur Izzaty, Adelia Shinta, Riski Arifin, Sri Rahmawati
Abstract: The increasing development of technology has transformed customer purchasing behaviour from offline to online through marketplaces. One of the most popular marketplaces in Indonesia is Shopee, where the best-selling skincare product is the organic face mask. This study aims to analyse the sentiment of customer reviews using natural language processing (NLP) and term frequency-inverse document frequency (TF-IDF). The results revealed that, of the 882 reviews extracted, 89.7% were classified as positive (ratings 4 and 5) and the remaining 10.3% as negative (ratings 1 and 2). The sentiments were visualised using a word cloud. The positive reviews included terms such as 'very good', 'quickly absorbed', and 'convenient', while the negative reviews included 'disappointed', 'delivery', and 'acne'. In summary, the performance metrics used to evaluate the classification model showed that the model accuracy reached 95%.
Keywords: customers review; natural language processing; NLP; sentiment analysis; term frequency-inverse document frequency; TF-IDF; skincare; organic face mask.
DOI: 10.1504/IJDMMM.2025.10066900

Automated Big Data Quality Assessment using Knowledge Graph Embeddings
by Hadi Fadlallah, Chamoun Rima Kilany, Mitri Haber, Ali Jaber
Abstract: This paper introduces a knowledge-based approach to automate data quality assessment, addressing the limitations of traditional methods that overlook contextual data characteristics. By using knowledge graph embeddings, it predicts missing connections between a dataset's context and relevant quality rules within a knowledge graph. This integration of diverse representations enables a context-specific data quality assessment plan tailored to each scenario. The approach enhances understanding of the dataset's context, surpassing traditional strict matching methods. Numerical edge attributes are applied to assign weights to predicted quality measurements, providing a comprehensive assessment. The solution is evaluated using AmpliGraph on a radiation sensors dataset from the Lebanese Atomic Energy Commission (LAEC-CNRS), demonstrating its effectiveness in generating a robust data quality assessment plan. The results obtained from this evaluation demonstrate the capability of our solution to generate a comprehensive data quality assessment plan for the given input dataset.
Keywords: Data quality assessment; Data context; Big data; Machine learning; Knowledge graph embeddings; Automation.
DOI: 10.1504/IJDMMM.2025.10067404

Knowledge Discovery for Anthropometric Measures using Data Mining Techniques
by Ali Chegini, Alireza Dehghan, Roozbeh Ghousi
Abstract: The article presents a novel application of data mining techniques to anthropometric measures to uncover hidden relationships and associations between different body measurements. The anthropometric data consists of 111 samples with 15 features including basic demographics, body mass index (BMI), and ten anthropometric measures. The research utilises the CRISP-DM methodology to form an applicable data analytics framework for anthropometric measurements. Various data mining methods were applied, including regression analysis to predict stature, clustering algorithms (K-means and hierarchical clustering) to segment the data, classification techniques (SVM and decision trees) to categorise BMI status, and association rule mining to uncover patterns between body dimensions and BMI category. The results demonstrated strong correlations of anthropometric dimensions with stature and weight, with three distinct physical trait profile clusters emerging from the K-means algorithm. The findings can facilitate ergonomic design and promote health assessments and personalised interventions.
Keywords: ergonomics; human factors; anthropometry; CRISP-DM; machine learning.
DOI: 10.1504/IJDMMM.2025.10067602

Recognition of Critical Built-up Areas Located on High-hill Slope Regions using Decision Tree Technique
by B. G. Kodge
Abstract: In mountainous places, structures are being built for residential or commercial uses without the necessary safety precautions. Every year, landslides, torrential downpours, severe snowfall, earthquakes, volcanic eruptions, and floods cause buildings to collapse. The bulk of them are found in high-hill slope areas with loose soil types, close to river flows and other sorts of water sources. These incidents have claimed thousands of lives. This paper deals with the automatic identification of critical buildings (residential/commercial) located in mountainous areas that are on high-hill slopes, close to river flows, and have loose soil types and high variations in land elevation contours. This study uses primary data such as built-up/residential areas and water body areas, which are extracted from sample land use and land cover (LULC) using image classification techniques, along with other important data such as slope maps and land elevation contour maps, which are generated from a digital elevation model (DEM). In addition, supplementary data such as river maps, soil maps and other base maps are also collected. All the data are integrated and taken into consideration for the identification and extraction of critical residential/built-up areas using a spatial data mining technique.
Keywords: critical residential area identification; LULC; DEM; image segmentation; decision tree; spatial data mining; SDM.
DOI: 10.1504/IJDMMM.2026.10068077

Profiling Cryptocurrency Influencers on Social Media: a Comparative Study using SetFit and DistilBERT
by Rebeh Imane Ammar Aouchiche, Fatima Boumahdi, Mohamed Abdelkarim Remmide, Amina Guendouz
Abstract: Nowadays, in a world dominated by social media, the content people share can have significant effects, particularly in the domain of cryptocurrency, where investors often turn to online advice. The instability of the cryptocurrency market is well known, and some social media individuals wield considerable influence over this market through their posts. Our study focuses on categorizing these cryptocurrency influencers based on their English tweets, with the challenge of limited data availability. Two transformer-based models, Sentence Transformer Fine-tuning (SetFit) and Distilled BERT (DistilBERT), were used to classify cryptocurrency influencers across three subtasks: profiling authors by their degree of influence, main interests, and message intent. These models were evaluated on a Twitter-based dataset from PAN2023. The results show that SetFit achieved the best performance with a 0.82 F1-score, followed closely by DistilBERT with a 0.80 F1-score.
Keywords: Social media; Author Profiling; Cryptocurrency influencers; DistilBert; SetFit; Few-shot-learning.
DOI: 10.1504/IJDMMM.2026.10068138

A Web-Based Plagiarism Detection Method for Student Reports using Intrinsic Analysis
by Maryam Elamine, Lamia Hadrich Belguith
Abstract: With the advent of complex language models and the massive amount of data available on the Web, students have had an easier time committing plagiarism. This research describes a web-based system for identifying plagiarism in student reports using intrinsic analysis. To detect plagiarism, we use a combination of stylistic and semantic features as well as a similarity matching technique. We experimented with a dataset of scientific papers mostly published in French, the predominant language in our institutions. Our plagiarism detection method examines the writing style of suspect documents, locates relevant sources on the internet, and compares them to the suspicious documents using external text matching. The preliminary results are promising, with our intrinsic and extrinsic methods reaching an F-score of 40.3% and 89% accuracy, respectively.
Keywords: Online plagiarism detection; intrinsic analysis; writing style analysis; semantic analysis; text-matching; plagiarism in education.
DOI: 10.1504/IJDMMM.2025.10068457

Satellite Image Classification using Deep Learning Model-ResNet
by Pranali Kosamkar, Vrushali Kulkarni, Abdulrahim Shaikh, Geetika Agarwal, Inderjeet Balotia
Abstract: Data mining frameworks and artificial intelligence (AI) have played a key part in all decision-making scenarios. Due to the significant expenses associated with creating training and testing datasets, we need to deal with a number of issues, including object recognition, classification, and semantic segmentation in images of low spatial resolution. In this paper, we first review machine learning and deep learning based models for satellite health monitoring systems. We then build a deep learning model for satellite image classification. The dataset used is the Satellite Image Classification Dataset-RSI-CB256. Two variants, ResNet-12 and ResNet-18, were tested on the dataset. ResNet-18 showed over 0.94 accuracy after 5 epochs, and ResNet-12 showed 0.92 accuracy after training for 10 epochs. The results show that the ResNet CNN architecture is a better choice for satellite image classification than other available models such as FCNN and RCNN (with F-RCNN).
Keywords: Deep Learning; ResNet; Data Mining; Artificial Intelligence; Machine Learning; satellite Image; Remote Sensing.
DOI: 10.1504/IJDMMM.2026.10068596

Exploring Solar Activity Dynamics: Nonparametric Change Point Analysis of Sunspot and Umbra Areas
by Sushovon Jana, Chandranath Pal
Abstract: Solar observational studies are crucial for understanding the Sun's behaviour, its impact on space weather, and its influence on Earth's climate. Central to this research is sunspot data analysis, a key indicator of solar activity and magnetic field variations. The study of solar differential rotation has been fundamental, with pioneering work revealing that faster equatorial rotation influences the Sun's magnetic field and activity cycle. Sunspot areas, meticulously documented by observatories like the Royal Greenwich Observatory and KoSO, have been critical for analysing long-term solar activity trends. The integration of machine learning has significantly advanced sunspot data analysis, enhancing space weather forecasting and the understanding of solar phenomena. This paper employs change point analysis on KoSO sunspot and umbra area data to detect significant shifts over time, utilising nonparametric methods for their computational efficiency. Results show deviations from normality, positive trends, and significant autocorrelation in the data. The PELT algorithm reveals several significant shifts, dividing the period into distinct segments with varying statistical characteristics. These findings align with known solar cycles and highlight the importance of advanced statistical techniques in understanding solar activity.
Keywords: Sunspot; Summary statistics; Change point analysis; Nonparametric.
DOI: 10.1504/IJDMMM.2026.10068653

Machine Learning Pipeline with an Optimal Feature Set in the Stage-wise Diagnosis of Hepatitis C Virus
by Shirina Samreen
Abstract: The proposed research aims at timely and accurate diagnosis of Hepatitis C Virus using a novel dataset. For this purpose, numerous experiments are conducted using various machine learning models employing preprocessing techniques like feature engineering and data augmentation along with multiple heterogeneous classifiers. In addition to detecting the onset of the disease, the proposed method also detects the stage of the disease to comprehend the severity for an appropriate follow-up treatment to prevent further damage to the health of the patient. Each experiment comprises various combinations of feature engineering approaches along with multiple heterogeneous classifiers. It was found that the machine learning pipeline employing the feature engineering approach of recursive feature elimination with a Support Vector Classifier as the estimator and a stacking ensemble classifier provides the best score for all performance metrics, with an F1-score of 0.95, accuracy of 95.2 and mean square error of 0.06.
Keywords: Machine Learning; Multi-class Classification; Feature Engineering; Imbalanced Dataset; Synthetic Minority Oversampling Technique; Recursive Feature Elimination; F1-Score; Mean Square Error.
DOI: 10.1504/IJDMMM.2026.10068989

Entity Resolution: a Novel Graph Embedding Approach Using RandomDeep
by Nour Mekki, Djamel Berrabah, Abdelhamid Malki
Abstract: The exponential growth of digital information necessitates robust methods for entity resolution to ensure data quality and integration across datasets. This paper presents three novel node embedding algorithms for entity resolution in graph databases: RandomDeep, Refined embedding, and Combined embedding. RandomDeep integrates Iterative Deepening Depth First Search with deep learning to capture structural and semantic characteristics. Refined embedding enhances initial Graph Convolutional Network (GCN) embeddings through random walk-based refinement. Combined embedding merges outputs from complementary algorithms to produce versatile representations adaptable to diverse graph structures. A two-stage graph summarization technique supports this approach: initially as a blocking method to reduce computational complexity, and later during merging to consolidate redundant nodes. Evaluation datasets (DBLP-Scholar, Amazon-Google, Cora, and Yellow-Yelp) demonstrate the methods' effectiveness, with Area Under Cover Precision and Recall values ranging from 0.50 to 0.97 and F-measure values between 0.67 and 0.94. These results showcase accurate, efficient entity resolution in graph databases.
Keywords: Entity Resolution; graph databases; node embedding; graph summarization; data quality.
DOI: 10.1504/IJDMMM.2026.10069148

Context-Specific Multi-Class Data Analytics for Improving Online Conversation through Deep Learning
by Dhanasekaran K, Nadana Ravishankar, Goyal S. B, Sardar M. N. Islam
Abstract: Social networks have emerged as a platform for disseminating information rapidly to friends, relatives, and the public. An effective text classification strategy can improve the effectiveness of online discussion. This has been a great motivation behind text analytics research. Several text classification approaches have been developed to enhance information extraction performance and address its challenges. However, traditional text data analytics are based on limited contextual and static resources and require effective intelligent techniques for automatically extracting features from the container. To address these issues, we proposed and developed a unique context-specific Multi-Class Data Analytics architecture based on Deep Learning. This approach improved the performance of data analytics and mainly focused on extracting various types of information that describe several attributes to improve the online conversation. The experimental results showed that the proposed multi-class data analytics provide promising results in terms of classification accuracy, validation accuracy, validation loss, precision, recall, and F1-measure in support of text classification for information extraction.
Keywords: Convolutional neural network; Data analytics; Information extraction; Clustering; Deep learning.
DOI: 10.1504/IJDMMM.2026.10069923

MoDA-TL - Monitoring Domestic Animals using Convolutional Neural Networks and Transfer Learning
by Alex A. Do Amaral, Raimundo V. Costa Filho, Mário W. De L. Moreira
Abstract: In recent years, computer vision has made significant advances, expanding its knowledge and applications in various fields. An important example is the use of this technology to improve the recognition of different types of animals. This paper proposes an intelligent surveillance system that can individually identify each animal in a specific location and clearly indicate dangerous or unsuitable areas during monitoring, ensuring the safety of both people and the animals being monitored. In this context, deep learning algorithms, such as convolutional neural networks (CNN), are used to produce machine learning models capable of detecting and identifying objects in digital images. The study utilises the You Only Look Once (YOLO) version 8 model and achieves 99.5% accuracy in animal recognition, demonstrating its effectiveness in monitoring. Additionally, a comparison between a model trained from random weight initialisation and another based on transfer learning reveals that the latter outperforms across various metrics, showing 99.5% accuracy, 99.3% recall, 99.5% mAP50, and 77.5% mAP50-95. These results highlight the advantage of transfer learning in optimising performance.
Keywords: Artificial Intelligence; Deep Learning; Neural Networks; Computer Vision; Image Recognition.
DOI: 10.1504/IJDMMM.2026.10070032

Hybrid Kernel Support Vector Penalised Regression Model for Forewarning Pest Incidence using Weather Variables
by Naranammal Narayanasamy, Krishna S. R. Priya
Abstract: Crop pest incidence and development are impacted by environmental factors. Therefore, a weather-based machine learning model will be an effective scientific measure for forewarning pests. In many cases, however, the raw data is complex and has the problems of nonlinearity and multicollinearity, so the development of a robust model is much needed to forecast complex data. The present study is an attempt to develop hybrid models such as kernel support vector ridge and kernel support vector elastic net regression (KSVENR) to forewarn crop pests of cotton. Weekly pest incidence data of sucking pests such as aphids, jassid, thrips and whitefly from 2015-16 to 2022-23 have been used for the study. The results reveal that the KSVENR model outperformed other penalised models by 43%, 42%, 40% and 33% for forewarning pest incidence of aphids, jassid, thrips and whitefly, respectively. The proposed model would be a good tool for forecasting nonlinear data with multicollinearity.
Keywords: Time series; Modelling; Forecasting; Nonlinear; Multicollinearity; Data Analysis; Machine Learning; Hybrid Model.
DOI: 10.1504/IJDMMM.2026.10070953

ATESA: Audio Text Emotion & Sentiment Analyser- a Sentiment & Emotion Analysis Tool based on Deep Learning Methods
by Pallavi Shukla, Rakesh Kumar, Vijay Dwivedi, Ashutosh Singh
Abstract: Sentiment analysis (SA) identifies sentiments in text, reviews, tweets, audio, images, and videos. Sentiment integrates emotion and thinking, with emotions being temporary while sentiments last longer. Emotion recognition and sentiment polarity analysis are gaining popularity in natural language processing due to their ability to mine social media data. This study applies machine learning (ML) classifiers such as random forest, logistic regression, support vector machine, and decision tree to classify text and speech as positive, negative, or neutral. Additionally, it explores available sentiment analysis tools and introduces the audio text emotion and sentiment analyser (ATESA). ATESA leverages ensemble-oriented classification techniques using deep learning, specifically bidirectional long-short-term memory recurrent neural networks (Bi-LSTM-RNN). It processes text, Twitter data, and speech converted into text. Experimental results show that ATESA achieves 92% accuracy, outperforming other algorithms.
Keywords: Sentiment Analysis Tool; Bi-LSTM; RNN; TFIDF; Deep Learning.
DOI: 10.1504/IJDMMM.2026.10071047

Advancements in Mental Health Diagnosis: Leveraging Delta Feature Extraction Framework and PWSA Ensemble for Motion Data Analysis
by S. Annapoorani, Lakshmi M.
Abstract: Depression affects over 350 million people globally and can become a serious health issue, especially when prolonged and ranging from mild to severe. Physical activity data offers a cost-effective and accessible approach to aid in diagnosing mental illnesses. This study introduces the Delta feature extraction framework (D-FEF), which extracts delta series and relevant features from original time series data, subsequently selecting a significant feature set. A probabilistic weighted selection algorithm (PWSA) with SMOTE generates multiple hypotheses using training data based on modified distributions, creating an ensemble of classifiers to predict healthy controls, depressive disorder, and schizophrenia. The PWSA classifier, utilising the D-FEF feature selection process, achieved 92.94% accuracy, outperforming all other tested methods. The technique's performance was evaluated on mental health datasets, including Depresjon and Psykose, and compared against state-of-the-art approaches. The proposed D-FEF and PWSA methodology demonstrates promising results for the classification of mental health conditions using physical activity data.
Keywords: Actigraphy data; mental health; feature engineering; feature selection; ensemble machine learning algorithm.
DOI: 10.1504/IJDMMM.2026.10072023
