Template-Type: ReDIF-Article 1.0 Author-Name: Elsayed A.H. Elamir Author-X-Name-First: Elsayed A.H. Author-X-Name-Last: Elamir Title: Data analytics for gross domestic product using random forest and extreme gradient boosting approaches: an empirical study Abstract: This study uses the random forest and extreme gradient boosting approaches to forecast and analyse gross domestic product per capita using country-level data from the World Bank development indicators over the period 2010 to 2017. The comparisons are carried out using the years before 2017 as training data and the year 2017 as testing data. The root mean square error and the coefficient of determination are used to compare the different models. The random forest and extreme gradient boosting achieve coefficients of determination of 97.8% and 98.1%, respectively. The results suggest that investment in education, labour, health, and industry, together with reductions in inflation, interest, and unemployment, is necessary to enhance gross domestic product per capita. Motivating results are given by a two-way interaction measure that is useful in explaining co-dependencies in the model behaviour. The strongest interactions are trade-technology and technology-education, followed by consumption-health. Journal: Int. J. of Data Mining, Modelling and Management Pages: 269-286 Issue: 3 Volume: 14 Year: 2022 Keywords: bagging; boosting; business analytics; forecast; gross domestic product; GDP; machine learning. File-URL: http://www.inderscience.com/link.php?id=125258 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:ids:ijdmmm:v:14:y:2022:i:3:p:269-286 Template-Type: ReDIF-Article 1.0 Author-Name: Fedor Krasnov Author-X-Name-First: Fedor Author-X-Name-Last: Krasnov Author-Name: Mikhail Shvartsman Author-X-Name-First: Mikhail Author-X-Name-Last: Shvartsman Author-Name: Alexander Dimentov Author-X-Name-First: Alexander Author-X-Name-Last: Dimentov Title: Comparing text corpora via topic modelling Abstract: A method is developed for conducting comparative analysis of the content of full-text patent collections. Named T4C, the approach is based on topic modelling and machine learning and extends comparative text mining. The idea of T4C was inspired by the possibility of extracting precise topics from a joint collection of texts and then analysing the parts of the collection by topic. Different aspects of the meta-information of the full-text patent collection are considered. The ownership of a patent in a particular country can be identified with an accuracy of 97.5% by using supervised machine learning. By studying how patents vary with time, those belonging to a specific period can be identified with an accuracy of 85% for a given country. Also developed is a visual representation of the thematic correlation between groups of patents. In terms of the text composition of patent descriptions, Chinese patents differ fundamentally from US patents. The T4C method is valid for structured medium-sized collections of texts in English. The experimental results are used to manage the patenting process at GazpromNeft STC. Journal: Int. J. of Data Mining, Modelling and Management Pages: 203-216 Issue: 3 Volume: 14 Year: 2022 Keywords: topic modelling; text classification; ARTM; additive regularisation of topic models; PLSA; random forest; comparing text collections. File-URL: http://www.inderscience.com/link.php?id=125259 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:ids:ijdmmm:v:14:y:2022:i:3:p:203-216 Template-Type: ReDIF-Article 1.0 Author-Name: Sadaf Kabir Author-X-Name-First: Sadaf Author-X-Name-Last: Kabir Author-Name: Leily Farrokhvar Author-X-Name-First: Leily Author-X-Name-Last: Farrokhvar Title: Nonlinear gradient-based feature selection for precise prediction of diseases Abstract: Developing accurate predictive models can profoundly help healthcare providers improve the quality of their services. However, medical data often contain several variables, and not all the data equally contribute towards the prediction. The existence of irrelevant and redundant features in a dataset can unnecessarily increase computational cost and complexity while deteriorating the performance of the predictive model. In this study, we employ the gradient-based prediction attribution as a general tool to identify important features in differentiable predictive models, such as neural networks (NN) and linear regression. Built upon this approach, we analyse single-stage and multi-stage scenarios for feature selection using ten medical datasets. Through extensive experiments, we demonstrate that the combination of the gradient-based approach with NN provides a powerful nonlinear technique to identify important features contributing to the prediction. In particular, nonlinear gradient-based feature selection achieves competitive results or significant improvements over previously reported results on all datasets. Journal: Int. J. of Data Mining, Modelling and Management Pages: 248-268 Issue: 3 Volume: 14 Year: 2022 Keywords: machine learning; feature selection; neural networks; logistic regression; disease prediction models; healthcare data. File-URL: http://www.inderscience.com/link.php?id=125260 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:ids:ijdmmm:v:14:y:2022:i:3:p:248-268 Template-Type: ReDIF-Article 1.0 Author-Name: Roberto Bertolini Author-X-Name-First: Roberto Author-X-Name-Last: Bertolini Author-Name: Stephen J. Finch Author-X-Name-First: Stephen J. Author-X-Name-Last: Finch Title: Synergistic effects between data corpora properties and machine learning performance in data pipelines Abstract: To analyse data, a computationally feasible pipeline must be developed for data modelling. Corpora properties affect performance variability of machine learning (ML) techniques in pipelines; however, this has not been thoroughly investigated using simulation methodologies. A Monte Carlo study is used to compare differences in the area under the curve (AUC) metric for large-n-small-p corpora examining: 1) the choice of ML algorithm; 2) size of the training database; 3) measurement error; 4) class imbalance magnitude; 5) missing data pattern. Our simulations are consistent with established results under which these algorithms and corpora properties perform best, while providing insights into their synergistic effects. Measurement error negatively impacted pipeline performance across all corpora factors and ML algorithms. A larger training corpus ameliorated the decrease in predictive efficacy resulting from measurement error, class imbalance magnitudes, and missing data patterns. We discuss the implications of these findings for designing pipelines to enhance prediction performance. Journal: Int. J. of Data Mining, Modelling and Management Pages: 217-233 Issue: 3 Volume: 14 Year: 2022 Keywords: data pipeline; interaction/synergistic effects; Monte Carlo simulation; machine learning; binary classification; area under the curve; AUC. File-URL: http://www.inderscience.com/link.php?id=125261 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:ids:ijdmmm:v:14:y:2022:i:3:p:217-233 Template-Type: ReDIF-Article 1.0 Author-Name: Paúl Cumba-Armijos Author-X-Name-First: Paúl Author-X-Name-Last: Cumba-Armijos Author-Name: Diego Riofrío-Luzcando Author-X-Name-First: Diego Author-X-Name-Last: Riofrío-Luzcando Author-Name: Verónica Rodríguez-Arboleda Author-X-Name-First: Verónica Author-X-Name-Last: Rodríguez-Arboleda Author-Name: Joe Carrión-Jumbo Author-X-Name-First: Joe Author-X-Name-Last: Carrión-Jumbo Title: Detecting cyberbullying in Spanish texts through deep learning techniques Abstract: Recently collected data suggest that it is possible to automatically detect events that may negatively affect the most vulnerable parts of our society when they occur over any communication technology, such as social networks or messaging applications. This research consolidates and prepares a corpus of Spanish bullying expressions taken from Twitter in order to use them as input to train a convolutional neuronal network through deep learning techniques. As a result of this training, a predictive model was created, which can identify Spanish cyberbullying expressions such as insults, racism, homophobic attacks, and so on. Journal: Int. J. of Data Mining, Modelling and Management Pages: 234-247 Issue: 3 Volume: 14 Year: 2022 Keywords: cyberbullying; deep learning; convolutional neuronal network; Spanish; social networks. File-URL: http://www.inderscience.com/link.php?id=125265 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:ids:ijdmmm:v:14:y:2022:i:3:p:234-247 Template-Type: ReDIF-Article 1.0 Author-Name: Essam Alhroob Author-X-Name-First: Essam Author-X-Name-Last: Alhroob Author-Name: Mohammed Falah Mohammed Author-X-Name-First: Mohammed Falah Author-X-Name-Last: Mohammed Author-Name: Fadhl Hujainah Author-X-Name-First: Fadhl Author-X-Name-Last: Hujainah Author-Name: Osama Nayel Al Sayaydeh Author-X-Name-First: Osama Nayel Al Author-X-Name-Last: Sayaydeh Author-Name: Ngahzaifa Ab Ghani Author-X-Name-First: Ngahzaifa Ab Author-X-Name-Last: Ghani Title: Investigation of contraction process issue in fuzzy min-max models Abstract: The fuzzy min-max (FMM) network is one of the most powerful neural networks. It combines a neural network and fuzzy sets into a unified framework to address pattern classification problems. The FMM consists of three main learning processes, namely, hyperbox contraction, hyperbox expansion and hyperbox overlap tests. Despite its various learning processes, the contraction process is considered as one of the major challenges in the FMM that affects the classification process. Thus, this study aims to investigate the FMM contraction process precisely to highlight its usage consequences during the learning process. Such investigation can assist practitioners and researchers in obtaining a better understanding about the consequences of using the contraction process on the network performance. Findings of this study indicate that the contraction process used in FMM can affect network performance in terms of misclassification and incapability in handling the membership ambiguity of the overlapping regions. Journal: Int. J. of Data Mining, Modelling and Management Pages: 1-14 Issue: 1 Volume: 14 Year: 2022 Keywords: pattern classification; fuzzy min-max; FMM models; contraction process. File-URL: http://www.inderscience.com/link.php?id=122034 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:ids:ijdmmm:v:14:y:2022:i:1:p:1-14 Template-Type: ReDIF-Article 1.0 Author-Name: Taiseer Abdalla Elfadil Eisa Author-X-Name-First: Taiseer Abdalla Elfadil Author-X-Name-Last: Eisa Title: Plagiarism detection of figure images in scientific publications Abstract: Plagiarism is stealing others' work using their words directly or indirectly without a credit citation. Copying others' ideas is another type of plagiarism that may occur in many areas but the most serious one is the academic plagiarism. Therefore, technical solutions are urgently required for automatic detection of idea plagiarism. Detection of figure plagiarism is a particularly challenging field of research, because not only the text analytics but also graphic features need to be analysed. This paper investigates the issues of idea and figure plagiarism and proposes a detection method which copes with both text and structure change. The procedure depends on finding similar semantic meanings between figures by applying image processing and semantic mapping techniques. The figures were compared using the representation of shape features based on detailed comparisons between the components of figures. This is an improvement over existing methods, which only compare the numbers and types of shapes inside figures. Journal: Int. J. of Data Mining, Modelling and Management Pages: 15-29 Issue: 1 Volume: 14 Year: 2022 Keywords: plagiarism detection; figure plagiarism detection; idea plagiarism detection; academic plagiarism; image processing; semantic mapping techniques; content-based algorithms. File-URL: http://www.inderscience.com/link.php?id=122036 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:ids:ijdmmm:v:14:y:2022:i:1:p:15-29 Template-Type: ReDIF-Article 1.0 Author-Name: Nuhu Yusuf Author-X-Name-First: Nuhu Author-X-Name-Last: Yusuf Author-Name: Mohd Amin Mohd Yunus Author-X-Name-First: Mohd Amin Mohd Author-X-Name-Last: Yunus Author-Name: Norfaradilla Wahid Author-X-Name-First: Norfaradilla Author-X-Name-Last: Wahid Author-Name: Aida Mustapha Author-X-Name-First: Aida Author-X-Name-Last: Mustapha Author-Name: Nazri Mohd Nawi Author-X-Name-First: Nazri Mohd Author-X-Name-Last: Nawi Author-Name: Noor Azah Samsudin Author-X-Name-First: Noor Azah Author-X-Name-Last: Samsudin Title: Arabic text semantic-based query expansion Abstract: Query expansions are being used in many search applications for retrieving relevant documents. Although retrieving the relevant documents is important for search users, the complexity of Arabic morphology remains a challenge. As such, many irrelevant documents are still retrieved in the ranked results. To address this challenge, this paper proposes a new searching method for Arabic text semantic-based query expansion. The proposed method combines Arabic word synonyms and ontology to expand the query with additional terms. Specifically, the proposed method combines lexical words within the ranking algorithm and is then improved with ontology links to expand the query. The performance of Arabic text semantic-based query expansion was evaluated in terms of average precision, mean average precision and mean reciprocal rank. Experiments on Quran datasets show that the proposed method using the Arabic text semantic-based query expansion approach outperforms the previous methods that use another dataset, called the Tafsir dataset. The proposed method achieved 15.44% mean average precision. Journal: Int. J. of Data Mining, Modelling and Management Pages: 30-40 Issue: 1 Volume: 14 Year: 2022 Keywords: Arabic text; semantic search; query expansion; lexical words; ontology; ranking algorithms. 
File-URL: http://www.inderscience.com/link.php?id=122037 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:ids:ijdmmm:v:14:y:2022:i:1:p:30-40 Template-Type: ReDIF-Article 1.0 Author-Name: Talal Almutiri Author-X-Name-First: Talal Author-X-Name-Last: Almutiri Author-Name: Faisal Saeed Author-X-Name-First: Faisal Author-X-Name-Last: Saeed Title: A hybrid feature selection method combining Gini index and support vector machine with recursive feature elimination for gene expression classification Abstract: Microarray datasets suffer from the curse of dimensionality because of a large number of genes and a low number of samples; this high dimensionality leads to computational cost and complexity. Consequently, feature selection (FS) is the process of choosing informative genes that could help in improving the effectiveness of classification. In this study, a hybrid feature selection method is proposed, which combines the Gini index and support vector machine with recursive feature elimination (GI-SVM-RFE), calculates a weight for each gene and recursively selects only ten genes to be the informative genes. To measure the impact of the proposed method, the experiments include four scenarios: baseline without feature selection, GI feature selection, SVM-RFE feature selection, and combining GI with SVM-RFE. In this paper, 11 microarray datasets were used. The proposed method showed an improvement in classification accuracy when compared with previous studies. Journal: Int. J. of Data Mining, Modelling and Management Pages: 41-62 Issue: 1 Volume: 14 Year: 2022 Keywords: classification; feature selection; gene expression; Gini index; microarray; recursive feature elimination. File-URL: http://www.inderscience.com/link.php?id=122038 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:ids:ijdmmm:v:14:y:2022:i:1:p:41-62 Template-Type: ReDIF-Article 1.0 Author-Name: Hema Dubey Author-X-Name-First: Hema Author-X-Name-Last: Dubey Author-Name: Nilay Khare Author-X-Name-First: Nilay Author-X-Name-Last: Khare Title: Fast parallel computation of PageRank scores with improved convergence time Abstract: PageRank is a prominent link-based approach used by many search engines to rank their search results. The PageRank algorithm iteratively calculates the PageRank of web pages until convergence is reached. The computational cost of this algorithm is very high for very large web graphs. To overcome this drawback, in this paper we propose a fast parallel computation of PageRank which uses a standard deviation technique to normalise the PageRank score of each web page. The proposed work is tested on standard datasets taken from the Stanford large network dataset collection, on a machine with a multicore architecture using the CUDA programming paradigm. We observed from the experiments that the proposed fast parallel PageRank algorithm needs fewer iterations to converge than the existing parallel PageRank method. We also determined that there is a speed-up of about 2 to 10 for nine different standard datasets for the proposed algorithm over the existing algorithm. Journal: Int. J. of Data Mining, Modelling and Management Pages: 63-88 Issue: 1 Volume: 14 Year: 2022 Keywords: PageRank; normalisation; standard deviation; parallel computation; graphics processing unit; GPU; compute unified device architecture; CUDA. File-URL: http://www.inderscience.com/link.php?id=122039 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:ids:ijdmmm:v:14:y:2022:i:1:p:63-88 Template-Type: ReDIF-Article 1.0 Author-Name: Wael K. Hanna Author-X-Name-First: Wael K. 
Author-X-Name-Last: Hanna Author-Name: Rasha Elstohy Author-X-Name-First: Rasha Author-X-Name-Last: Elstohy Author-Name: Nouran M. Radwan Author-X-Name-First: Nouran M. Author-X-Name-Last: Radwan Title: Prediction of air pollution and analysis of its effects on the pollution dispersion of PM10 in Egypt using machine learning algorithms Abstract: Air pollution is considered one of the serious threats in Egypt. According to a study in the journal Environmental Science & Technology Letters, air pollution is one of the main factors responsible for shortening Egyptians' lives by 1.85 years. The main cause of air pollution in Egypt is PM10, which comes from industrial processes. PM10 concentrations exceed daily average concentrations during 98% of the measurement period. In this paper, we apply machine learning classification algorithms to build the most accurate model for predicting air pollution and analysing its effects on the pollution dispersion of PM10. The proposed classification model begins with air quality data collection and pre-processing, followed by a classification process to discover the main relevant features for prediction. Experimental results show good performance of the proposed air quality model. Random forest and naïve Bayes algorithms achieved accuracy of almost 82%, while JRip and the fuzzy classifier achieved lower accuracy of 65% and 76%, respectively. Journal: Int. J. of Data Mining, Modelling and Management Pages: 358-371 Issue: 4 Volume: 14 Year: 2022 Keywords: air pollution; PM10; classification model; machine learning algorithms; Egypt. File-URL: http://www.inderscience.com/link.php?id=126662 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:ids:ijdmmm:v:14:y:2022:i:4:p:358-371 Template-Type: ReDIF-Article 1.0 Author-Name: Ikram Nekkache Author-X-Name-First: Ikram Author-X-Name-Last: Nekkache Author-Name: Said Jabbour Author-X-Name-First: Said Author-X-Name-Last: Jabbour Author-Name: Nadjet Kamel Author-X-Name-First: Nadjet Author-X-Name-Last: Kamel Author-Name: Lakhdar Sais Author-X-Name-First: Lakhdar Author-X-Name-Last: Sais Title: Detecting and exploiting symmetries in sequential pattern mining Abstract: In this paper, we introduce a new framework for discovering and using symmetries in sequential pattern mining tasks. Symmetries are permutations between items that leave the sequential database invariant. Symmetries present several potential benefits. They can be seen as a new kind of structural pattern expressing regularities and similarities between items. As symmetries induce a partition of the sequential patterns into equivalence classes, exploiting them improves the pattern enumeration process while reducing the size of the output. To this end, we first address the problem of symmetry discovery from a database of sequences. Then, we show how Apriori-like algorithms can be enhanced by dynamic integration of the detected symmetries. Secondly, we provide a symmetry-breaking approach that eliminates symmetries in a pre-processing step by reformulating the sequential database of transactions. Our experiments clearly show that several sequential pattern mining datasets contain such symmetry-based regularities. We also experimentally demonstrate that using such symmetries results in a significant reduction of the search space on some datasets. Journal: Int. J. of Data Mining, Modelling and Management Pages: 309-334 Issue: 4 Volume: 14 Year: 2022 Keywords: data mining; sequential pattern mining; symmetries. File-URL: http://www.inderscience.com/link.php?id=126663 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:ids:ijdmmm:v:14:y:2022:i:4:p:309-334 Template-Type: ReDIF-Article 1.0 Author-Name: Mehmet Özçalıcı Author-X-Name-First: Mehmet Author-X-Name-Last: Özçalıcı Author-Name: Ayşe Tuğba Dosdoğru Author-X-Name-First: Ayşe Tuğba Author-X-Name-Last: Dosdoğru Author-Name: Aslı Boru İpek Author-X-Name-First: Aslı Boru Author-X-Name-Last: İpek Author-Name: Mustafa Göçken Author-X-Name-First: Mustafa Author-X-Name-Last: Göçken Title: Comparison of harmony search derivatives for artificial neural network parameter optimisation: stock price forecasting Abstract: This study has been conducted on forecasting, as accurately as possible, the next day's stock price using harmony search (HS) and its variants [improved harmony search (IHS), global-best harmony search (GHS), self-adaptive harmony search (SAHS), and intelligent tuned harmony search (ITHS)] together with an artificial neural network (ANN). The advantage of the proposed models is that the useful information in the original stock data is found by input variable selection and, simultaneously, the most suitable number of hidden neurons in the hidden layer is discovered to mitigate the overfitting/underfitting problem in ANN. The results have shown that forecasts made by HS-ANN, IHS-ANN, GHS-ANN, SAHS-ANN, and ITHS-ANN demonstrate a tendency to achieve hit rates above 89%, which is considerably better than previously proposed forecasting models in the literature. Hence, ANN models provide more valuable forecasting results for investors to hedge against potential risk in stock markets. Journal: Int. J. of Data Mining, Modelling and Management Pages: 335-357 Issue: 4 Volume: 14 Year: 2022 Keywords: stock price forecasting; artificial neural network; harmony search and its variants. File-URL: http://www.inderscience.com/link.php?id=126664 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:ids:ijdmmm:v:14:y:2022:i:4:p:335-357 Template-Type: ReDIF-Article 1.0 Author-Name: Yuehua Duan Author-X-Name-First: Yuehua Author-X-Name-Last: Duan Author-Name: Zbigniew W. Ras Author-X-Name-First: Zbigniew W. Author-X-Name-Last: Ras Title: Recommendation system for improving churn rate based on action rules and sentiment mining Abstract: It is well recognised that customers are one of the most valuable assets to a company. Therefore, it is of significant value for companies to reduce the customer outflow. In this paper, we focus on identifying the customers with high chance of attrition and provide valid and trustworthy recommendations to improve their customer churn rate. To this end, we designed and implemented a recommender system that can provide actionable recommendations to improve customer churn rate. We used both transaction and survey data from heavy equipment repair and service sector from 2011 to 2017. This data was collected by a consulting company based in Charlotte, North Carolina. In the survey data, customers give their thoughts, feelings, expectations and complaints by freeform text. We applied aspect-based sentiment analysis on the review text data to gain insightful knowledge on customers' attitudes toward the service. Action rule mining and meta-action triggering mechanism are used to recognise the actionable strategies to help with reducing customer churn. Journal: Int. J. of Data Mining, Modelling and Management Pages: 287-308 Issue: 4 Volume: 14 Year: 2022 Keywords: action rule mining; meta-actions; aspect-based sentiment analysis; recommender system; reduct. File-URL: http://www.inderscience.com/link.php?id=126665 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:ids:ijdmmm:v:14:y:2022:i:4:p:287-308 Template-Type: ReDIF-Article 1.0 Author-Name: Khaled Benali Author-X-Name-First: Khaled Author-X-Name-Last: Benali Title: Ontology and web usage mining for website maintenance Abstract: Web mining and the semantic web are closely linked: on the one hand, web-mining techniques help in the construction of the semantic web; on the other hand, the semantic web helps extract new knowledge. The present article presents an approach that uses ontology and web usage mining techniques for website maintenance. This work can help novice researchers start working based on the patterns extracted from the site logs, using an algorithm to maintain the website. Journal: Int. J. of Data Mining, Modelling and Management Pages: 372-400 Issue: 4 Volume: 14 Year: 2022 Keywords: apriori; knowledge; log file; semantic web; ontology; web usage mining; WUM; website maintenance. File-URL: http://www.inderscience.com/link.php?id=126666 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:ids:ijdmmm:v:14:y:2022:i:4:p:372-400 Template-Type: ReDIF-Article 1.0 Author-Name: Amal Bouraoui Author-X-Name-First: Amal Author-X-Name-Last: Bouraoui Author-Name: Salma Jamoussi Author-X-Name-First: Salma Author-X-Name-Last: Jamoussi Author-Name: Abdelmajid Ben Hamadou Author-X-Name-First: Abdelmajid Ben Author-X-Name-Last: Hamadou Title: A comprehensive review of deep learning for natural language processing Abstract: Deep learning has attracted considerable attention across many natural language processing (NLP) domains. Deep learning models aim to learn embeddings of data with multiple levels of abstraction through multiple layers for either labelled structured input data or unlabelled unstructured input data. Currently, two research trends have emerged in building higher level embeddings. On one hand, a strong trend in deep learning leads towards increasingly powerful and complex models. 
On the other hand, multi-purpose sentence representation based on simple sums or averages of word vectors was recently shown to be effective. Furthermore, improving the performance of deep learning methods by attention mechanism has become a research hotspot in the last four years. In this paper, we seek to provide a comprehensive review of recent studies in building neural network (NN) embeddings that have been applied to NLP tasks. We provide a walk-through of deep learning evolution and a description of a variety of its architectures. We present and compare the performance of several deep learning models on standard datasets about different NLP tasks. We also present some deep learning challenges for natural language processing. Journal: Int. J. of Data Mining, Modelling and Management Pages: 149-182 Issue: 2 Volume: 14 Year: 2022 Keywords: deep learning; word embedding; sentence embedding; attention mechanism; compositional models; convolutional neural networks; CNNs; recurrent/recursive NNs; multi-purpose sentence embedding; natural language processing; NLP. File-URL: http://www.inderscience.com/link.php?id=123356 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:ids:ijdmmm:v:14:y:2022:i:2:p:149-182 Template-Type: ReDIF-Article 1.0 Author-Name: Kei Nakagawa Author-X-Name-First: Kei Author-X-Name-Last: Nakagawa Author-Name: Kenichi Yoshida Author-X-Name-First: Kenichi Author-X-Name-Last: Yoshida Title: Time-series gradient boosting tree for stock price prediction Abstract: We propose a time-series gradient boosting tree for a dataset with time-series and cross-sectional attributes. Our time-series gradient boosting tree has weak learners with time-series and cross-sectional attributes in its internal node, and split examples based on similarity between a pair of time-series or impurity between cross-sectional attributes. 
Dissimilarity between a pair of time-series is defined by the dynamic time warping method. In other words, the decision tree is constructed based on the shape that the time-series is similar or not similar to its past shape. We conducted an empirical analysis using major world indices and confirmed that our time-series gradient boosting tree is superior to prior research methods in terms of both profitability and accuracy. Journal: Int. J. of Data Mining, Modelling and Management Pages: 110-125 Issue: 2 Volume: 14 Year: 2022 Keywords: dynamic time warping method; time-series decision tree; time-series gradient boosting tree; stock price prediction. File-URL: http://www.inderscience.com/link.php?id=123357 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:ids:ijdmmm:v:14:y:2022:i:2:p:110-125 Template-Type: ReDIF-Article 1.0 Author-Name: Kenan Mengüç Author-X-Name-First: Kenan Author-X-Name-Last: Mengüç Author-Name: Tarık Küçükdeniz Author-X-Name-First: Tarık Author-X-Name-Last: Küçükdeniz Title: Suggestion and solution of a mathematical model for determining effective routes in football Abstract: As obtaining data gets easier and cheaper with the help of technological achievements, data-based analytics and management have become an essential part of planning and decision making to achieve success in the sports industry. The study finds offensive routes for a team game using high-security data produced with technology. An analysis of a sport's team match was performed using seasonal data. A mathematical model has been developed for this analysis, discussing the effectiveness of the routes the model offers. This article aims to find the safe, efficient route for organising the football on the field. In addition, the study also offers an experimental proposal for this purpose. Journal: Int. J. 
of Data Mining, Modelling and Management Pages: 126-148 Issue: 2 Volume: 14 Year: 2022 Keywords: match strategy; tactics; optimisation; transshipment problem. File-URL: http://www.inderscience.com/link.php?id=123358 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:ids:ijdmmm:v:14:y:2022:i:2:p:126-148 Template-Type: ReDIF-Article 1.0 Author-Name: Germán Lescano Author-X-Name-First: Germán Author-X-Name-Last: Lescano Author-Name: Rosanna Costaguta Author-X-Name-First: Rosanna Author-X-Name-Last: Costaguta Author-Name: Analía Amandi Author-X-Name-First: Analía Author-X-Name-Last: Amandi Title: Emotions recognition in synchronic textual CSCL situations Abstract: Computer-supported collaborative learning (CSCL) is a useful practice for teaching learners to work in groups and to acquire collaborative skills. Evaluating the collaborative process can be burdensome for teachers because it requires analysing many interactions. One issue to consider is socio-affective interactions, due to their influence on the learning process. In this work, we propose an approach to recognise affective states of Spanish-speaking students in synchronic textual CSCL situations. Through experimentation, we analyse the emotions manifested by university computer science students when they worked in groups in these situations, and we evaluate the proposed approach using tools and libraries available in the market to perform sentiment analysis. The results obtained are promising. Providing CSCL environments with a tool to recognise socio-affective interactions can be useful in helping teachers evaluate this dimension of the collaborative process. Journal: Int. J. of Data Mining, Modelling and Management Pages: 183-202 Issue: 2 Volume: 14 Year: 2022 Keywords: computer-supported collaborative learning; CSCL; socio-affective interactions; affective computing. 
File-URL: http://www.inderscience.com/link.php?id=123359 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:ids:ijdmmm:v:14:y:2022:i:2:p:183-202 Template-Type: ReDIF-Article 1.0 Author-Name: Akash Gupta Author-X-Name-First: Akash Author-X-Name-Last: Gupta Author-Name: Amir Gharehgozli Author-X-Name-First: Amir Author-X-Name-Last: Gharehgozli Title: Developing a machine learning framework to determine the spread of COVID-19 in the USA using meteorological, social, and demographic factors Abstract: Coronavirus disease of 2019 (COVID-19) has become a pandemic in the matter of a few months, since the outbreak in December 2019 in Wuhan, China. We study the impact of weather factors including temperature and pollution on the spread of COVID-19. We also include social and demographic variables such as per capita gross domestic product (GDP) and population density. Adapting the theory from the field of epidemiology, we develop a framework to build analytical models to predict the spread of COVID-19. In the proposed framework, we employ machine learning methods including linear regression, linear kernel support vector machine (SVM), radial kernel SVM, polynomial kernel SVM, and decision tree. Given the nonlinear nature of the problem, the radial kernel SVM performs the best and explains 95% more variation than the existing methods. In line with the literature, our study indicates the population density is the critical factor to determine the spread. The univariate analysis shows that a higher temperature, air pollution, and population density can increase the spread. On the other hand, a higher per capita GDP can decrease the spread. Journal: Int. J. of Data Mining, Modelling and Management Pages: 89-109 Issue: 2 Volume: 14 Year: 2022 Keywords: COVID-19; disease spread; social and demographic factors; machine learning; epidemiology; predictive modelling. 
File-URL: http://www.inderscience.com/link.php?id=123360 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:ids:ijdmmm:v:14:y:2022:i:2:p:89-109