Template-Type: ReDIF-Article 1.0 Author-Name: Sachin Deshmukh Author-X-Name-First: Sachin Author-X-Name-Last: Deshmukh Author-Name: Seema Sant Author-X-Name-First: Seema Author-X-Name-Last: Sant Author-Name: Neerja Kashive Author-X-Name-First: Neerja Author-X-Name-Last: Kashive Title: Modelling attrition to know why your employees leave or stay Abstract: Today's environmental factors influence every aspect of business, be it marketing, finance, operations or human resources. Talent shortage has become a global issue for organisations. One of the major challenges faced by any organisation is the increase in the level of employee attrition. The current study builds a predictive model using logistic regression to understand the specific factors that lead to attrition. This paper also compares the factors responsible for attrition in two time periods, the first from 1996 to 2008 (Holtom's model) and the second from 2009 to 2016, to find whether any changes have taken place in employees' expectations, which, if not fulfilled, may lead to attrition. An analysis of an IT organisation's data reveals that the factors responsible for attrition in the second period have changed compared to the first period. Journal: Int. J. of Data Mining, Modelling and Management Pages: 231-253 Issue: 3 Volume: 13 Year: 2021 Keywords: attrition; predictive model; logistic regression. File-URL: http://www.inderscience.com/link.php?id=118018 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:ids:ijdmmm:v:13:y:2021:i:3:p:231-253 Template-Type: ReDIF-Article 1.0 Author-Name: Chia-Hao Chiu Author-X-Name-First: Chia-Hao Author-X-Name-Last: Chiu Author-Name: Yun-Cheng Tsai Author-X-Name-First: Yun-Cheng Author-X-Name-Last: Tsai Author-Name: Ho-Lin Chen Author-X-Name-First: Ho-Lin Author-X-Name-Last: Chen Title: Long text to image converter for financial reports Abstract: In this study, we propose a novel article analysis method. This method converts the article classification problem into an image classification problem by projecting texts into images and then applying CNN models for classification. We call the method the long text to image converter (LTIC). The features are extracted automatically from the generated images, hence there is no need for an explicit step of embedding the words or characters into numeric vector representations, which saves preprocessing and experimentation time. This study uses the financial domain as an example. In a company's financial report, there is a chapter that describes the company's financial trends. The content has many financial terms used to infer the company's current and future financial position. The LTIC achieved an excellent confusion matrix and test data accuracy, with results indicating an 80% accuracy rate. The proposed LTIC produced excellent results during practical application, achieving excellent performance in classifying the corporate financial reports under review. The return on simulated investment is 46%. In addition to tangible returns, the LTIC method reduces the time required for article analysis and can provide article classification references in a short period to facilitate the decisions of the researcher. Journal: Int. J. of Data Mining, Modelling and Management Pages: 211-230 Issue: 3 Volume: 13 Year: 2021 Keywords: article analysis; convolutional neural network; CNN; financial analysis; long text to image converter; LTIC. 
File-URL: http://www.inderscience.com/link.php?id=118019 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:ids:ijdmmm:v:13:y:2021:i:3:p:211-230 Template-Type: ReDIF-Article 1.0 Author-Name: Maira Alejandra Pulgarín Rodríguez Author-X-Name-First: Maira Alejandra Pulgarín Author-X-Name-Last: Rodríguez Author-Name: Bárbara Maricely Fierro Chong Author-X-Name-First: Bárbara Maricely Fierro Author-X-Name-Last: Chong Author-Name: Erica María Ossa Taborda Author-X-Name-First: Erica María Ossa Author-X-Name-Last: Taborda Title: E-learning process through text mining for academic literacy Abstract: The aim of this paper is to present the results of research carried out in a virtual faculty of education at a private university in Colombia. It consists of the characterisation of students' reading and writing comprehension abilities for academic literacy. The study verifies the effectiveness of an e-learning platform implemented for all the programs in the faculty. The university established a methodological procedure for text mining in order to identify specific keywords in different text typologies for specialised areas. The platform allows professors and students to develop expertise in their disciplines, using text mining as an interdisciplinary strategy to build knowledge and improve quality in their professional context. Journal: Int. J. of Data Mining, Modelling and Management Pages: 283-298 Issue: 3 Volume: 13 Year: 2021 Keywords: text mining; terminological work; cognitive processes; e-learning; academic literacy; reading comprehension; academic writing. File-URL: http://www.inderscience.com/link.php?id=118020 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:ids:ijdmmm:v:13:y:2021:i:3:p:283-298 Template-Type: ReDIF-Article 1.0 Author-Name: Muning Chang Author-X-Name-First: Muning Author-X-Name-Last: Chang Title: Association rules in mobile game operation Abstract: Mobile games now play a significant role in the gaming industry as the internet continues to develop. Due to the economic and cultural value of mobile games, it is very important for gaming companies to maintain and further improve product quality to remain competitive in the industry. The operation team plays a key role in maintaining product profitability after a game is released. This paper analyses the gaming data collected during operation and proposes operation strategies accordingly. A correlation coefficient algorithm suitable for time sequences is proposed, in which association is defined by the similarity between data. The level of association between two time sequences is reflected in the probability of the occurrence of such association. Based on this discovery, we analyse a popular mobile game in depth to explore the correlation between the number of users online, the number of new players, and the retention rate. The study found that there are two fatigue periods, at approximately days 30 and 120, when there is a high likelihood of user loss, which is important to consider in strategic planning for game operation. Journal: Int. J. of Data Mining, Modelling and Management Pages: 254-267 Issue: 3 Volume: 13 Year: 2021 Keywords: mobile games; association rules; sequence correlation; operation optimisation. File-URL: http://www.inderscience.com/link.php?id=118023 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:ids:ijdmmm:v:13:y:2021:i:3:p:254-267 Template-Type: ReDIF-Article 1.0 Author-Name: Paravee Maneejuk Author-X-Name-First: Paravee Author-X-Name-Last: Maneejuk Author-Name: Chalerm Jaitang Author-X-Name-First: Chalerm Author-X-Name-Last: Jaitang Author-Name: Woraphon Yamaka Author-X-Name-First: Woraphon Author-X-Name-Last: Yamaka Title: A multivariate copula-based SUR probit model: application to insolvency probability of enterprises Abstract: The purpose of this study is to introduce a more flexible joint distribution for a probit model with more than two equations, the so-called SUR probit model. The main idea of the suggested method is to use a multivariate copula to link the errors of the equations in the SUR probit model. We conduct a simulation study to assess the performance of the model and then apply it to a real economic problem, namely the insolvency probability of small and medium enterprises in Thailand. This study considers three economic sectors and speculates on the dependencies among them. The copula-based SUR probit model shows better performance in both the simulation and application studies. In addition, it is found to be suitable for explaining the causal effect of companies' financial statements on their insolvency probability, and challenging results for Thai enterprises are brought out. Journal: Int. J. of Data Mining, Modelling and Management Pages: 268-282 Issue: 3 Volume: 13 Year: 2021 Keywords: multivariate copula; multivariate probit model; small and medium enterprises; financial statements; insolvency probability. File-URL: http://www.inderscience.com/link.php?id=118025 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:ids:ijdmmm:v:13:y:2021:i:3:p:268-282 Template-Type: ReDIF-Article 1.0 Author-Name: Woraphon Yamaka Author-X-Name-First: Woraphon Author-X-Name-Last: Yamaka Author-Name: Pichayakone Rakpho Author-X-Name-First: Pichayakone Author-X-Name-Last: Rakpho Author-Name: Paravee Maneejuk Author-X-Name-First: Paravee Author-X-Name-Last: Maneejuk Title: Hedging agriculture commodities futures with histogram data: a Markov switching volatility and correlation model Abstract: In this study, a bivariate flexible Markov switching dynamic copula GARCH model is developed for histogram-valued data to calculate the optimal portfolio weight and optimal hedge ratio. This model extends the Markov switching dynamic copula GARCH model by allowing all estimated parameters to be regime dependent. The histogram data is constructed from five-minute wheat spot and futures returns. We compare our proposed model with other bivariate GARCH models through AIC, BIC, and hedge effectiveness. The empirical results show that our model is slightly better than the conventional methods in terms of the lowest AIC and BIC, and the highest hedge effectiveness. This indicates that our proposed model is quite effective in reducing risks in portfolio returns. Journal: Int. J. of Data Mining, Modelling and Management Pages: 299-315 Issue: 3 Volume: 13 Year: 2021 Keywords: hedging strategy; Markov switching; time-varying dependence; histogram data; wheat. File-URL: http://www.inderscience.com/link.php?id=118026 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:ids:ijdmmm:v:13:y:2021:i:3:p:299-315 Template-Type: ReDIF-Article 1.0 Author-Name: Lamiche Chaabane Author-X-Name-First: Lamiche Author-X-Name-Last: Chaabane Title: An enhanced cooperative method to solve multiple-sequence alignment problem Abstract: In this research study, we propose a novel cooperative approach called dynamic simulated particle swarm optimisation (DSPSO), based on metaheuristics and the pairwise dynamic programming (DP) procedure, to find an approximate solution to the multiple-sequence alignment (MSA) problem. The developed approach applies the particle swarm optimisation (PSO) algorithm to explore the search space globally and the simulated annealing (SA) technique to improve the quality of the population leader in order to overcome the local optimum problem. After that, the dynamic programming technique is integrated as an improvement mechanism to raise the quality of the worst solution and to increase the convergence speed of the proposed approach. Simulation results on BAliBASE benchmarks have shown the ability of the proposed method to produce good-quality alignments compared to those given by other existing methods in the literature. Journal: Int. J. of Data Mining, Modelling and Management Pages: 1-16 Issue: 1/2 Volume: 13 Year: 2021 Keywords: cooperative approach; multiple-sequence alignment; MSA; DSPSO; particle swarm optimisation; PSO; SA; DP; BAliBASE benchmarks. File-URL: http://www.inderscience.com/link.php?id=112907 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:ids:ijdmmm:v:13:y:2021:i:1/2:p:1-16 Template-Type: ReDIF-Article 1.0 Author-Name: Ismaïl Biskri Author-X-Name-First: Ismaïl Author-X-Name-Last: Biskri Author-Name: Mohamed Hassani Author-X-Name-First: Mohamed Author-X-Name-Last: Hassani Title: A formal theoretical framework for a flexible classification process Abstract: The classification process is a complex technique that connects language, text, information and knowledge theories with computational formalisation, statistical and symbolic approaches, standard and non-standard logics, etc. This process should always be under the control of the user, according to their subjectivity, their knowledge and the purpose of their analysis. It therefore becomes important to create platforms that support the design of classification tools, their management, and their adaptation to new needs and experiments. In recent years, several platforms for mining data, including textual data, in which classification is the main functionality have emerged. However, they lack flexibility and formal foundations. We propose in this paper a formal model with strong logical foundations based on applicative type systems. Journal: Int. J. of Data Mining, Modelling and Management Pages: 17-36 Issue: 1/2 Volume: 13 Year: 2021 Keywords: classification; flexibility; applicative systems; operators/operands; combinatory logics; inferential calculus; compositionality; processing chains; modules; discovery process; collaborative intelligent science. File-URL: http://www.inderscience.com/link.php?id=112908 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:ids:ijdmmm:v:13:y:2021:i:1/2:p:17-36 Template-Type: ReDIF-Article 1.0 Author-Name: Abdelkrime Aries Author-X-Name-First: Abdelkrime Author-X-Name-Last: Aries Author-Name: Djamel Eddine Zegour Author-X-Name-First: Djamel Eddine Author-X-Name-Last: Zegour Author-Name: Walid Khaled Hidouci Author-X-Name-First: Walid Khaled Author-X-Name-Last: Hidouci Title: Graph-based cumulative score using statistical features for multilingual automatic text summarisation Abstract: Multilingual summarisation has received more attention in recent years. Many approaches can be used to achieve this, among them statistical and graph-based approaches. Our idea is to combine these two approaches into a new extractive text summarisation method. Surface statistical features are used to calculate a primary score for each sentence. The graph is used to select candidate sentences and to calculate a final score for each sentence based on its primary score and those of its neighbours in the graph. We propose four variants for calculating the cumulative score of a sentence. Also, since the order of sentences is an important aspect of summary readability, we propose further algorithms to generate the summary based not only on final scores but also on sentence connections in the graph. The method is tested using the MultiLing'15 workshop's MSS corpus and the ROUGE metric. It is evaluated against some well-known methods and gives promising results. Journal: Int. J. of Data Mining, Modelling and Management Pages: 37-64 Issue: 1/2 Volume: 13 Year: 2021 Keywords: automatic text summarisation; ATS; graph-based summarisation; statistical features; multilingual summarisation; extractive summarisation. File-URL: http://www.inderscience.com/link.php?id=112909 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:ids:ijdmmm:v:13:y:2021:i:1/2:p:37-64 Template-Type: ReDIF-Article 1.0 Author-Name: Tayeb Kenaza Author-X-Name-First: Tayeb Author-X-Name-Last: Kenaza Title: An ontology-based modelling and reasoning for alerts correlation Abstract: SIEM is a modern and powerful security tool thanks to the several functions it provides to take advantage of collected data, such as normalisation and aggregation. The most important function is event correlation, through which security operators can get a quick and precise picture of threats and attacks in real time. The quality of that picture depends on the efficiency of the adopted reasoning approach to put together pieces of information provided by several analysers. In this paper, we propose a semantic approach based on description logics (DLs), a powerful tool for knowledge representation and reasoning. Indeed, an ontology provides a comprehensive environment to represent information for intrusion detection and allows information to be easily maintained or extended. We implemented a rule-based engine for alert correlation based on the proposed ontology, and two attack scenarios are carried out to show the usefulness of our approach. Journal: Int. J. of Data Mining, Modelling and Management Pages: 65-80 Issue: 1/2 Volume: 13 Year: 2021 Keywords: information security; intrusion detection; security information and event management system; SIEM; alert correlation; rules-based reasoning; ontology; ontology web language; OWL; Semantic Web Rule Language; SWRL. File-URL: http://www.inderscience.com/link.php?id=112913 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:ids:ijdmmm:v:13:y:2021:i:1/2:p:65-80 Template-Type: ReDIF-Article 1.0 Author-Name: Meriem Bahi Author-X-Name-First: Meriem Author-X-Name-Last: Bahi Author-Name: Mohamed Batouche Author-X-Name-First: Mohamed Author-X-Name-Last: Batouche Title: Convolutional neural network with stacked autoencoders for predicting drug-target interaction and binding affinity Abstract: The prediction of novel drug-target interactions (DTIs) is critically important for drug repositioning, as it can lead researchers to find new indications for existing drugs and reduce the cost and time of the de novo drug development process. To explore new ways for this innovation, we propose two novel methods, named SCA-DTIs and SCA-DTA, to predict drug-target interactions and drug-target binding affinities (DTAs), respectively, based on a convolutional neural network (CNN) with stacked autoencoders (SAE). Initialising a CNN's weights with filters of trained stacked autoencoders yields superior performance. Moreover, to boost the performance of DTI prediction, we propose a new method called RNDTIs to generate reliable negative samples. Tests on different benchmark datasets show that the proposed method can achieve an excellent prediction performance with an accuracy of more than 99%. These results demonstrate the potential of the proposed models for DTI and DTA prediction, thereby improving the drug repurposing process. Journal: Int. J. of Data Mining, Modelling and Management Pages: 81-113 Issue: 1/2 Volume: 13 Year: 2021 Keywords: stacked autoencoders; SAE; convolutional neural network; CNN; semi-supervised learning; deep learning; drug repositioning; drug-target interaction; DTI; binding affinity. File-URL: http://www.inderscience.com/link.php?id=112914 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:ids:ijdmmm:v:13:y:2021:i:1/2:p:81-113 Template-Type: ReDIF-Article 1.0 Author-Name: Mostefa Zafer Author-X-Name-First: Mostefa Author-X-Name-Last: Zafer Author-Name: Mustapha Reda Senouci Author-X-Name-First: Mustapha Reda Author-X-Name-Last: Senouci Author-Name: Mohamed Aissani Author-X-Name-First: Mohamed Author-X-Name-Last: Aissani Title: Efficient deployment approach of wireless sensor networks on 3D terrains Abstract: Ensuring the coverage of a region of interest (RoI) when deploying a wireless sensor network (WSN) is an objective that depends on several factors, such as the detection capability of the sensor nodes used and the topography of the RoI. To address the topography challenges, in this paper, we propose a new WSN deployment approach based on the idea of partitioning the RoI into sub-regions with relatively simple topography, then allocating to each constructed sub-region the necessary number of sensor nodes and finding their appropriate positions to maximise coverage quality. The performance evaluation of this approach, coupled with three different deployment methods, namely the deployment method based on simulated annealing (DMSA), the greedy deployment method (GDM), and the random deployment method (RDM), has revealed its relevance, since it helped to significantly improve the coverage quality of the RoI. Journal: Int. J. of Data Mining, Modelling and Management Pages: 114-136 Issue: 1/2 Volume: 13 Year: 2021 Keywords: wireless sensor networks; WSNs; 3D terrains; deployment; coverage. File-URL: http://www.inderscience.com/link.php?id=112915 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:ids:ijdmmm:v:13:y:2021:i:1/2:p:114-136 Template-Type: ReDIF-Article 1.0 Author-Name: Lamia Berkani Author-X-Name-First: Lamia Author-X-Name-Last: Berkani Title: Recommendation of items using a social-based collaborative filtering approach and classification techniques Abstract: With the large amount of data generated every day in social networks, the use of classification techniques becomes a necessity. Clustering-based approaches reduce the search space by clustering similar users or items together. We focus in this paper on personalised item recommendation in a social context. Our approach combines, in different ways, the social filtering algorithm and the traditional user-based collaborative filtering algorithm. The social information is formalised by social-behaviour metrics such as the friendship, commitment and trust degrees of users. Moreover, two classification techniques are used: an unsupervised technique applied initially to all users and a supervised technique applied to newly added users. Finally, the proposed approach has been evaluated using different existing datasets. The obtained results show the contribution of integrating social information into collaborative filtering and the added value of using classification techniques in the different algorithms in terms of recommendation accuracy. Journal: Int. J. of Data Mining, Modelling and Management Pages: 137-159 Issue: 1/2 Volume: 13 Year: 2021 Keywords: item recommendation; collaborative filtering; social filtering; supervised classification; unsupervised classification. File-URL: http://www.inderscience.com/link.php?id=112919 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:ids:ijdmmm:v:13:y:2021:i:1/2:p:137-159 Template-Type: ReDIF-Article 1.0 Author-Name: Karima Sid Author-X-Name-First: Karima Author-X-Name-Last: Sid Author-Name: Mohamed Batouche Author-X-Name-First: Mohamed Author-X-Name-Last: Batouche Title: Distributed heterogeneous ensemble learning on Apache Spark for ligand-based virtual screening Abstract: Virtual screening is one of the most common computer-aided drug design techniques; it applies computational tools and methods to large libraries of molecules to identify drug candidates. Ensemble learning is a recent paradigm introduced to improve machine learning results in terms of predictive performance and robustness. It has been successfully applied in ligand-based virtual screening (LBVS) approaches. Applying ensemble learning to huge molecular libraries is computationally expensive, so the distribution and parallelisation of the task, using sophisticated frameworks such as Apache Spark, have become a significant step. In this paper, we propose a new approach, HEnsL_DLBVS, for heterogeneous ensemble learning distributed on Spark to improve large-scale LBVS results. To handle the problem of imbalanced big training datasets, we propose a novel hybrid technique. We generate new training datasets to evaluate the approach. Experimental results confirm the effectiveness of our approach, with satisfactory accuracy and superiority over homogeneous models. Journal: Int. J. of Data Mining, Modelling and Management Pages: 160-191 Issue: 1/2 Volume: 13 Year: 2021 Keywords: virtual screening; big data; computer-aided drug design; CADD; Apache Spark; machine learning; drug discovery; ensemble learning; imbalanced datasets; Spark MLlib; ligand-based virtual screening; LBVS. File-URL: http://www.inderscience.com/link.php?id=112920 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:ids:ijdmmm:v:13:y:2021:i:1/2:p:160-191 Template-Type: ReDIF-Article 1.0 Author-Name: Noussaiba Benadjimi Author-X-Name-First: Noussaiba Author-X-Name-Last: Benadjimi Author-Name: Khaled-Walid Hidouci Author-X-Name-First: Khaled-Walid Author-X-Name-Last: Hidouci Title: Hash-processing of universal quantification-like queries dealing with requirements and prohibitions Abstract: This paper focuses on flexible universal quantification-like queries handling positive and negative preferences (requirements or prohibitions) simultaneously. We emphasise the performance improvement of the considered operator by proposing new variants of the classical hash-division algorithm. The issue of ranking answers is also dealt with. We target in-memory database systems (also called main-memory database systems) with a very large volume of data; in these systems, all the data is primarily stored in the RAM of a computer. We have introduced a parallel implementation of the operator that takes into account the data skew issue. Our empirical analysis of both the sequential and parallel versions shows the relevance of our approach. The experiments demonstrate that the new processing of the mixed operator in a main-memory database achieves better performance than the conventional approaches, and becomes faster through parallelism. Journal: Int. J. of Data Mining, Modelling and Management Pages: 192-210 Issue: 1/2 Volume: 13 Year: 2021 Keywords: universal quantification queries; relational division; relational anti-division; main-memory databases; flexible division; hash-division. File-URL: http://www.inderscience.com/link.php?id=112921 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:ids:ijdmmm:v:13:y:2021:i:1/2:p:192-210 Template-Type: ReDIF-Article 1.0 Author-Name: Seyyed Mohammad Mirtaghian Rudsari Author-X-Name-First: Seyyed Mohammad Mirtaghian Author-X-Name-Last: Rudsari Author-Name: Naji Gharibi Author-X-Name-First: Naji Author-X-Name-Last: Gharibi Title: Application of structural equation modelling in Iranian tourism researches: challenges and guidelines Abstract: The main purpose of this study is to identify and analyse the challenges in using structural equation modelling (SEM) in tourism research in Iran. The paper examines how Iranian scholars have used the technique, drawing on a sample of 172 papers published in the top five tourism journals published in Farsi (i.e., Persian). The results indicate that there is often a lack of discussion of sample size, normality of distribution, effect analysis, and the role of coefficients of determination; in addition, selective and arbitrary reporting of fit indices is not uncommon. The paper also emphasises the role of theory in constructing such models. Journal: Int. J. of Data Mining, Modelling and Management Pages: 364-387 Issue: 4 Volume: 13 Year: 2021 Keywords: structural equation modelling; SEM; covariance-based SEM; partial least squares; challenges and misuse; Iranian tourism research. File-URL: http://www.inderscience.com/link.php?id=119627 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:ids:ijdmmm:v:13:y:2021:i:4:p:364-387 Template-Type: ReDIF-Article 1.0 Author-Name: Konstantin Savenkov Author-X-Name-First: Konstantin Author-X-Name-Last: Savenkov Author-Name: Vladimir Gorbachenko Author-X-Name-First: Vladimir Author-X-Name-Last: Gorbachenko Author-Name: Anatoly Solomakha Author-X-Name-First: Anatoly Author-X-Name-Last: Solomakha Title: New perspectives on deep neural networks in decision support in surgery Abstract: The paper considers the development of a neural network system for predicting complications after acute appendicitis operations. A neural network of deep architecture has been developed. As the training set, a dataset developed by the authors from real clinical data was used. A feature selection method based on the interquartile range of the F1-score is proposed to select significant features. For preliminary processing of the training data, an overcomplete autoencoder is used. The overcomplete autoencoder converts the selected features into a higher-dimensional space, which, according to Cover's theorem, facilitates separating the features corresponding to a complication from those that do not. To overcome overfitting of the network, the dropout method was used. The neural network is implemented using the Keras and TensorFlow libraries. The trained neural network showed high diagnostic metrics on the test dataset. Journal: Int. J. of Data Mining, Modelling and Management Pages: 317-336 Issue: 4 Volume: 13 Year: 2021 Keywords: neural networks; features selection; learning neural networks; overfitting; overcomplete autoencoder; medical diagnostics. File-URL: http://www.inderscience.com/link.php?id=119628 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:ids:ijdmmm:v:13:y:2021:i:4:p:317-336 Template-Type: ReDIF-Article 1.0 Author-Name: Satish M. Srinivasan Author-X-Name-First: Satish M. 
Author-X-Name-Last: Srinivasan Author-Name: Ruchika Chari Author-X-Name-First: Ruchika Author-X-Name-Last: Chari Author-Name: Abhishek Tripathi Author-X-Name-First: Abhishek Author-X-Name-Last: Tripathi Title: Modelling and visualising emotions in Twitter feeds Abstract: Predictive analytics on Twitter feeds is becoming a popular field for research. A tweet holds a wealth of information on how individuals express and communicate their feelings and emotions within their social network. Large-scale mining of tweets helps capture not only an individual's emotions but also those of a larger group. In this study, an emotion-based classification scheme is proposed. By training the naïve Bayes multinomial and random forest classifiers on different training datasets, emotion classification was performed on a test dataset containing tweets related to the 2016 US presidential election. By classifying the tweets in the test dataset into one of the four basic emotion types (anger, happiness, sadness and surprise) and determining people's sentiments, we have tried to portray the flux in the emotional landscape of the people towards the presidential candidates in the 2016 US election. Journal: Int. J. of Data Mining, Modelling and Management Pages: 337-350 Issue: 4 Volume: 13 Year: 2021 Keywords: emotion classification; Twitter data analysis; US presidential election; supervised classifier; random forest; naïve Bayes multinomial; NBM. File-URL: http://www.inderscience.com/link.php?id=119629 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:ids:ijdmmm:v:13:y:2021:i:4:p:337-350 Template-Type: ReDIF-Article 1.0 Author-Name: Po-Jen Chuang Author-X-Name-First: Po-Jen Author-X-Name-Last: Chuang Author-Name: Yun-Sheng Tu Author-X-Name-First: Yun-Sheng Author-X-Name-Last: Tu Title: Pursuing efficient data stream mining by removing long patterns from summaries Abstract: Frequent pattern mining is a useful data mining technique. It can help extract frequently used patterns from massive internet data streams for significant applications and analyses. To improve mining accuracy and reduce the needed processing time, this paper proposes a new approach that removes less-used long patterns from the pattern summary to preserve space for more frequently used short patterns, in order to enhance the performance of existing frequent pattern mining algorithms. Extensive simulation runs are carried out to evaluate the performance of the proposed approach. The results show that our approach can strengthen mining performance by effectively bringing down the required run time and substantially increasing mining accuracy. Journal: Int. J. of Data Mining, Modelling and Management Pages: 388-409 Issue: 4 Volume: 13 Year: 2021 Keywords: data streams; frequent pattern mining; pattern summary; length skip; performance evaluation. File-URL: http://www.inderscience.com/link.php?id=119630 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. 
Handle: RePEc:ids:ijdmmm:v:13:y:2021:i:4:p:388-409 Template-Type: ReDIF-Article 1.0 Author-Name: Nourelhouda Yahi Author-X-Name-First: Nourelhouda Author-X-Name-Last: Yahi Author-Name: Hacene Belhadef Author-X-Name-First: Hacene Author-X-Name-Last: Belhadef Author-Name: Mathieu Roche Author-X-Name-First: Mathieu Author-X-Name-Last: Roche Title: Investigating the impact of preprocessing on document embedding: an empirical comparison Abstract: Digital representation of text documents is a crucial task in machine learning and natural language processing (NLP). It aims to transform unstructured text documents into mathematically-computable elements. In recent years, several methods have been proposed and implemented to encode text documents into fixed-length feature vectors. This operation is known as document embedding and has become an interesting and open area of research. Paragraph vector (Doc2vec) is one of the most widely used document embedding methods and has gained a good reputation thanks to its good results. To overcome its limits, Doc2vec was extended by the document through corruption (Doc2vecC) technique. To gain a deeper view of these two methods, this work presents a study of the impact of morphosyntactic text preprocessing on both document embedding methods. We have done this analysis by applying the most-used text preprocessing techniques, such as cleaning, stemming and lemmatisation, and their different combinations. The experimental analysis on the Microsoft Research Paraphrase dataset (MSRP) reveals that the preprocessing techniques serve to improve classifier accuracy, and that the stemming method outperforms the other techniques. Journal: Int. J. of Data Mining, Modelling and Management Pages: 351-363 Issue: 4 Volume: 13 Year: 2021 Keywords: natural language preprocessing; document embedding; paragraph vector; document through corruption; text preprocessing; semantic similarity. 
File-URL: http://www.inderscience.com/link.php?id=119631 File-Format: text/html File-Restriction: Access to full text is restricted to subscribers. Handle: RePEc:ids:ijdmmm:v:13:y:2021:i:4:p:351-363