International Journal of Data Mining, Modelling and Management (31 papers in press)
Application of Structural Equation Modeling in Iranian Tourism Researches: Challenges and Guidelines
by Seyyed Mohammad Mirtaghian Rudsari, Najmeh Gharibi
Abstract: The main purpose of this study is to identify and analyze the challenges in using Structural Equation Modeling (SEM) in tourism research in Iran. The paper examines how Iranian scholars have used the technique, using a sample of 172 papers published in the top five tourism journals published in Farsi (i.e. Persian). The results indicate that discussion of sample size, normality of distribution, effect analysis, and the role of coefficients of determination is often lacking, and that selective and arbitrary reporting of fit indices is not uncommon. The paper also emphasizes the role of theory in constructing such models.
Keywords: Structural Equation Modeling (SEM); Covariance Based SEM; Partial Least Squares SEM; Challenges and Misuse; Iranian Tourism Research.
New perspectives on deep neural networks in decision support in surgery
by Konstantin Savenkov, Vladimir Gorbachenko, Anatoly Solomakha
Abstract: The paper considers the development of a neural network system for predicting complications after acute appendicitis operations. A neural network with a deep architecture has been developed. The training set was built by the authors from real clinical data. To select significant features, a method based on the interquartile range of the F1-score is proposed. For preliminary processing of the training data, an overcomplete autoencoder is proposed. The overcomplete autoencoder maps the selected features into a higher-dimensional space, which, according to Cover's theorem, facilitates separating cases with complications from cases without complications. To reduce overfitting of the network, dropout was used. The neural network is implemented using the Keras and TensorFlow libraries. The trained network showed high diagnostic metrics on the test data set.
Keywords: neural networks; features selection; learning neural networks; overfitting; overcomplete autoencoder; medical diagnostics.
Modelling and Visualizing Emotions in Twitter Feeds
by Satish M. Srinivasan, Abhishek Tripathi
Abstract: Predictive analytics on Twitter feeds is becoming a popular field of research. A tweet holds a wealth of information on how an individual expresses and communicates their feelings and emotions within their social network. Large-scale collection, cleaning, and mining of tweets will help capture not only an individual's emotions but also the emotions of a larger group. However, capturing a large volume of tweets and identifying the emotions expressed in them is a very challenging task. In this study, an emotion-based classification scheme is proposed. Initially, a synthetic dataset is built by randomly picking instances from different training datasets. Using this newly constructed dataset, the classifiers are trained (model building). Finally, emotions are predicted on the test datasets using the generated models. By training the Na
Keywords: emotion classification; twitter data analysis; US presidential election; supervised classifier; Random Forest; Naïve Bayes Multinomial.
Pursuing Efficient Data Stream Mining by Removing Long Patterns from Summaries
by Po-Jen Chuang, Yun-Sheng Tu
Abstract: Frequent pattern mining is a useful data mining technique. It can help extract frequently used patterns from massive Internet data streams for significant applications and analyses. To improve mining accuracy and reduce processing time, this paper proposes a new approach that removes less frequently used long patterns from the pattern summary to preserve space for more frequently used short patterns, in order to enhance the performance of existing frequent pattern mining algorithms. Extensive simulation runs are carried out to check the performance of the proposed approach. The results show that our approach can strengthen mining performance by effectively reducing the required run time and substantially increasing mining accuracy.
Keywords: Data Streams; Frequent Pattern Mining; Pattern Summary; Length Skip; Performance Evaluation.
Investigating the Impact of Preprocessing on Document Embedding: An Empirical Comparison
by Nourelhouda Yahi, Hacene Belhadef, Mathieu Roche
Abstract: Digital representation of text documents is a crucial task in machine learning and Natural Language Processing (NLP). It aims to transform unstructured text documents into mathematically computable elements. In recent years, several methods have been proposed and implemented to encode text documents into fixed-length feature vectors. This operation is known as Document Embedding, and it has become an interesting and open area of research. Paragraph Vector (Doc2vec) is one of the most widely used document embedding methods and has gained a good reputation thanks to its results. To overcome its limits, Doc2vec was extended by the Document through Corruption (Doc2vecC) technique. To gain a deeper view of these two methods, this work presents a study of the impact of morphosyntactic text preprocessing on them. We carried out this analysis by applying the most commonly used text preprocessing techniques, such as Cleaning, Stemming and Lemmatisation, and their different combinations. Experimental analysis on the Microsoft Research Paraphrase dataset (MSRP) reveals that the preprocessing techniques improve classifier accuracy, and that Stemming methods outperform the other techniques.
Keywords: Natural Language Preprocessing; Document Embedding; Paragraph Vector; Document through Corruption; Text Preprocessing; Semantic Similarity.
A comprehensive review of deep learning for natural language processing
by Amal Bouraoui, Salma Jamoussi, Abdelmajid Ben Hamadou
Abstract: Deep learning has attracted considerable attention across many Natural Language Processing (NLP) domains. Deep learning models aim to learn embeddings of data with multiple levels of abstraction through multiple layers for either labeled structured input data or unlabeled unstructured input data. Currently, two research trends have emerged in building higher level embeddings. On one hand, a strong trend in deep learning leads towards increasingly powerful and complex models. On the other hand, multi-purpose sentence representation based on simple sums or averages of word vectors was recently shown to be effective. Furthermore, improving the performance of deep learning methods by attention mechanism has become a research hotspot in the last four years. In this paper, we seek to provide a comprehensive review of recent studies in building Neural Network (NN) embeddings that have been applied to NLP tasks. We provide a walk-through of deep learning evolution and a description of a variety of its architectures. We present and compare the performance of several deep learning models on standard datasets about different NLP tasks. We also present some deep learning challenges for natural language processing.
Keywords: Deep Learning; Word Embedding; Sentence Embedding; Attention Mechanism; Compositional Models; Convolutional NNs; Recurrent/Recursive NNs; Multi-purpose Sentence Embedding; Natural Language Processing.
Time-Series Gradient Boosting Tree for Stock Price Prediction
by Kei Nakagawa, Kenichi Yoshida
Abstract: We propose a time-series gradient boosting tree for a data set with time-series and cross-sectional attributes.
Our time-series gradient boosting tree uses weak learners whose internal nodes hold time-series and cross-sectional attributes, and splits examples based on the similarity between a pair of time series or the impurity of cross-sectional attributes.
Dissimilarity between a pair of time-series is defined by the dynamic time warping method.
In other words, the decision tree is constructed according to whether the shape of a time series is similar or dissimilar to its past shape.
We conducted an empirical analysis using major world indices and confirmed that our time-series gradient boosting tree is superior to prior research methods in terms of both profitability and accuracy.
Keywords: Dynamic Time Warping method; Time-series Decision Tree; Time-series Gradient Boosting Tree; Stock Price Prediction.
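As an illustration of the dissimilarity measure named in this abstract, the following is a minimal sketch of the textbook dynamic time warping distance. It assumes an absolute-difference local cost and no warping window; the function name `dtw_distance` is hypothetical, and the paper's own variant and its use inside the boosting tree are not specified here.

```python
def dtw_distance(a, b):
    """Textbook dynamic time warping distance between two numeric sequences.

    Assumption: the local cost is |a[i] - b[j]| and no window constraint is used.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


if __name__ == "__main__":
    # The second series repeats one value, so warping aligns it at zero cost.
    print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

Because the alignment may stretch either series, two series with the same shape but different pacing get a small distance, which is the property a shape-based split relies on.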
SUGGESTION AND SOLUTION OF A MATHEMATICAL MODEL FOR DETERMINING EFFECTIVE
by Kenan Mengüç, Tarik Küçükdeniz
Abstract: As obtaining data gets easier and cheaper with the help of technological achievements, data-based analytics and management have become an essential part of planning and decision making to achieve success in the sports industry. The study finds offensive routes for a team game using high-security data produced with technology. An analysis of a sports team match was performed using seasonal data. A mathematical model has been developed for this analysis, discussing the effectiveness of the routes the model offers. This article aims to find the safe, efficient route for organizing the football on the field. In addition, the study also offers an experimental proposal for this purpose.
Keywords: Match strategy; tactics; optimization; transshipment problem.
Emotions recognition in synchronic textual CSCL situations
by Germán Lescano, Rosanna Costaguta, Analia Amandi
Abstract: Computer-Supported Collaborative Learning (CSCL) is a useful practice for teaching learners to work in groups and acquire collaborative skills. Evaluating the collaborative process can be burdensome for teachers because it requires analyzing a large number of interactions. One issue to consider is socio-affective interactions, which are important to recognize because of their influence on the learning process. In this work, we propose an approach to recognize the affective states of Spanish-speaking students in synchronic textual CSCL situations. Through experimentation, we analyzed the emotions manifested by university computer science students when they worked in groups in synchronic textual CSCL situations, and we evaluated the proposed approach using tools and libraries available in the market for sentiment analysis. Using the proposed approach, we developed classifiers to recognize subjectivity, sentiments and emotions. The sentiment classification model developed was compared with pre-built models regarding the rate of correct classifications. Results show that resources available in the market help the process of developing sentiment and emotion classifiers for CSCL environments using traditional machine learning techniques. Providing CSCL environments with a tool to recognize socio-affective interactions can help teachers evaluate this dimension of the collaborative process.
Keywords: Computer-Supported Collaborative Learning; Socio-Affective Interactions; Affective Computing.
Developing a Machine Learning Framework to Determine the Spread of COVID-19 in the United States using Meteorological, Social, and Demographic Factors
by Akash Gupta, Amir Gharehgozli
Abstract: Coronavirus disease of 2019 (COVID-19) became a pandemic within a few months of the outbreak in December 2019 in Wuhan, China. We study the impact of weather factors, including temperature and pollution, on the spread of COVID-19. We also include social and demographic variables such as per capita Gross Domestic Product (GDP) and population density. Adapting theory from the field of epidemiology, we develop a framework to build analytical models to predict the spread of COVID-19. In the proposed framework, we employ machine learning methods including linear regression, linear kernel support vector machine (SVM), radial kernel SVM, polynomial kernel SVM, and decision tree. Given the non-linear nature of the problem, the radial kernel SVM performs the best and explains 95% more variation than the existing methods. In line with the literature, our study indicates that population density is the critical factor in determining the spread. The univariate analysis shows that higher temperature, air pollution, and population density can increase the spread. On the other hand, a higher per capita GDP can decrease the spread.
Keywords: COVID-19; disease spread; social and demographic factors; machine learning; epidemiology; predictive modeling.
Data Analytics for Gross Domestic Product using Random Forest and Extreme Gradient Boosting Approaches: An Empirical Study
by Elsayed Habib Elamir
Abstract: Gross domestic product per capita may be considered one of the most substantial measures of social well-being, as all nations attempt to raise their gross domestic product per capita to improve their population's welfare and prosperity and to strengthen their standing in international relations. This study aims to use the random forest and extreme gradient boosting approaches to forecast and analyze gross domestic product per capita using country-level data from the World Bank development indicators over the period 2010 to 2017. Comprehensive comparisons are carried out using the years before 2017 as training data and the year 2017 as testing data. The root mean square error and the coefficient of determination are used to compare the different models. The random forest and extreme gradient boosting achieve coefficients of determination of 97.8% and 98.1%, respectively. The results suggest that investment in education, labor, health, and industry, as well as a decrease in inflation, interest, and unemployment, is necessary to enhance gross domestic product per capita. Encouraging results are given by a two-way interaction measure that is useful in explaining co-dependencies in the model behavior. For the extreme gradient boosting method, the strongest interactions are trade-technology and technology-education, followed by consumption-health.
Keywords: bagging; boosting; business analytics; forecast; GDP; machine learning.
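The evaluation protocol described in this abstract (train on the years before 2017, test on 2017, compare models by root mean square error and coefficient of determination) can be sketched as follows. This is an illustrative sketch only: the row layout with a `"year"` key and the function names are assumptions, not the author's code, and the models themselves are omitted.

```python
import math


def split_by_year(rows, test_year=2017):
    """Split rows (dicts with a hypothetical 'year' key) into train and test sets."""
    train = [r for r in rows if r["year"] < test_year]
    test = [r for r in rows if r["year"] == test_year]
    return train, test


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    observed = [1.0, 2.0, 3.0]
    predicted = [1.1, 1.9, 3.2]
    print(rmse(observed, predicted), r_squared(observed, predicted))
```

A model that always predicts the mean of the observed values scores a coefficient of determination of zero, which gives a baseline for reading figures such as those reported above.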
Methodology for Comparing Text Corpora via Topic Model
by Fedor Krasnov, Mikhail Shvartsman, Alexander Dimentov
Abstract: The authors of this paper developed a methodological approach for comparative analysis of patents' content. The approach, named T4C, is based on topic modeling and machine learning. The authors were able to identify the country to which a patent belongs with an accuracy of 97.5% using supervised machine learning methods. When studying the dependence of patents on time, the authors were able to identify the period to which a patent belongs with an accuracy of 85% for a specific country. The authors have developed a visual presentation of the thematic correlation between groups of patents. It should also be noted that, in terms of the composition of the patent description text, Chinese patents are fundamentally different from US patents. The results presented in this study were used to manage the patenting process at GazpromNeft STC.
Keywords: Topic Modeling; Text Classification; ARTM; PLSA; Random Forest; Text Collections Comparison.
Non-linear Gradient-based Feature Selection for Precise Prediction of Diseases
by Sadaf Kabir, Leily Farrokhvar
Abstract: Developing accurate predictive models can profoundly help health care providers improve the quality of their services. However, medical data often contain several variables, and not all of the data contribute equally to the prediction. The existence of irrelevant and redundant features in a dataset can unnecessarily increase computational cost and complexity while degrading the performance of the predictive model. In this study, we employ gradient-based prediction attribution as a general tool to identify important features in differentiable predictive models, such as neural networks and linear regression. Building on this approach, we analyze single-stage and multi-stage scenarios for feature selection using ten medical datasets. Through extensive experiments, we demonstrate that combining the gradient-based approach with neural networks provides a powerful non-linear technique to identify important features contributing to the prediction. In particular, non-linear gradient-based feature selection achieves competitive results or significant improvements over previously reported results on all datasets.
Keywords: Machine learning; feature selection; neural networks; logistic regression; disease prediction models; health care data.
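The idea of gradient-based attribution in a differentiable model can be illustrated with a small sketch. It assumes a logistic model as the simplest differentiable stand-in for the neural networks in the abstract, and it scores each feature by the mean absolute gradient of the predicted probability with respect to that input; the function names and this particular scoring rule are assumptions for illustration, not the authors' method.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gradient_importance(weights, bias, rows):
    """Mean absolute input gradient per feature for p = sigmoid(w . x + b).

    For this model, dp/dx_j = p * (1 - p) * w_j, so the gradient is available
    in closed form; a neural network would use automatic differentiation.
    """
    scores = [0.0] * len(weights)
    for x in rows:
        p = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
        slope = p * (1.0 - p)
        for j, w in enumerate(weights):
            scores[j] += abs(slope * w)
    return [s / len(rows) for s in scores]


def rank_features(scores):
    """Feature indices ordered from most to least important."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


if __name__ == "__main__":
    scores = gradient_importance([2.0, 0.0, -1.0], 0.0, [[0, 0, 0], [1, 1, 1]])
    print(rank_features(scores))
```

Features whose score is near zero have little local influence on the prediction and are candidates for removal in a selection stage.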
Synergistic Effects Between Data Corpora Properties and Machine Learning Performance in Data Pipelines
by Roberto Bertolini, Stephen Finch
Abstract: To analyze data, a computationally feasible pipeline must be developed for data modeling. Corpora properties affect performance variability of machine learning (ML) techniques in pipelines; however, this has not been thoroughly investigated using simulation methodologies. A Monte Carlo study is used to compare differences in the area under the curve (AUC) metric for large-n-small-p-corpora examining: (1) the choice of ML algorithm, (2) size of the training database, (3) measurement error, (4) class imbalance magnitude, and (5) missing data pattern. Our simulations are consistent with established results under which these algorithms and corpora properties perform best, while providing insights into their synergistic effects. Measurement error negatively impacted pipeline performance across all corpora factors and ML algorithms. A larger training corpus ameliorated the decrease in predictive efficacy resulting from measurement error, class imbalance magnitudes, and missing data patterns. We discuss the implications of these findings for designing pipelines to enhance prediction performance.
Keywords: Data Pipeline; Interaction/Synergistic Effects; Monte Carlo Simulation; Machine Learning; Binary Classification.
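The AUC metric compared in this study can be computed directly from its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below is a plain pairwise implementation under that definition, with ties counted as one half; the function name is hypothetical and the simulation design of the paper is not reproduced.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    Labels are assumed to be 0 (negative) or 1 (positive).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

Because the metric depends only on the ordering of scores across the two classes, it remains defined under class imbalance as long as each class has at least one case.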
Prediction of air pollution and analyze its effects on pollution dispersion of PM10 in Egypt using machine learning algorithms.
by Wael K. Hanna, Rasha Elstohy, Nouran Radwan
Abstract: Air pollution is considered one of the serious threats in Egypt. According to a study in the journal Environmental Science & Technology Letters, air pollution is one of the main factors responsible for shortening Egyptians' lives by 1.85 years. The main cause of air pollution in Egypt is PM10, which comes from industrial processes. PM10 concentrations exceed daily average concentrations during 98% of the measurement period. In this paper, we apply machine learning classification algorithms to build the most accurate model for predicting air pollution and analysing its effects on pollution dispersion of PM10. The proposed classification model begins with air quality data collection and pre-processing, followed by a classification process to discover the main relevant features for prediction. Experimental results show a good performance of the proposed air quality model. Random Forest, Na
Keywords: Air pollution; PM10; Classification model and machine learning algorithms.
Detecting and Exploiting Symmetries in Sequential Pattern Mining
by Ikram Nekkache, Said Jabbour, Nadjet Kamel, Lakhdar SAIS
Abstract: In this paper, we introduce a new framework for discovering and using symmetries in sequential pattern mining tasks. Symmetries are permutations between items that leave the sequential database invariant. Symmetries present several potential benefits. They can be seen as a new kind of structural pattern expressing regularities and similarities between items. As symmetries induce a partition of the sequential patterns into equivalence classes, exploiting them makes it possible to improve the pattern enumeration process while reducing the size of the output. To this end, we first address the problem of symmetry discovery from a database of sequences. We then show how Apriori-like algorithms can be enhanced by dynamic integration of the detected symmetries. Next, we provide a second symmetry-breaking approach that eliminates symmetries in a preprocessing step by reformulating the sequential database of transactions. Our experiments clearly show that several sequential pattern mining datasets contain such symmetry-based regularities. We also demonstrate experimentally that using such symmetries results in a significant reduction of the search space on some datasets.
Keywords: Data Mining; sequential pattern mining; symmetries.
Comparison of Harmony Search Derivatives for Artificial Neural Network Parameter Optimization: Stock Price Forecasting
by Mehmet Ozcalici, Ayse Tugba Dosdogru, Asli Boru Ipek, Mustafa Gocken
Abstract: This study aims to forecast the next day's stock price as accurately as possible using Harmony Search (HS) and its variants (Improved Harmony Search (IHS), Global-Best Harmony Search (GHS), Self-Adaptive Harmony Search (SAHS), and Intelligent Tuned Harmony Search (ITHS)) together with an Artificial Neural Network (ANN). The advantages of the proposed models are that the useful information in the original stock data is found by input variable selection and, simultaneously, the most suitable number of hidden neurons in the hidden layer is determined to mitigate the overfitting/underfitting problem in ANN. The results show that forecasts made by HS-ANN, IHS-ANN, GHS-ANN, SAHS-ANN, and ITHS-ANN tend to achieve hit rates above 89%, which is considerably better than previously proposed forecasting models in the literature. Hence, ANN models provide more valuable forecasting results for investors to hedge against potential risk in stock markets.
Keywords: stock price forecasting; artificial neural network; harmony search and its variants.
Recommendation System for Improving Churn Rate based on Action Rules and Sentiment Mining
by Yuehua Duan, Zbigniew Ras
Abstract: It is well recognized that customers are one of the most valuable assets to a company. Therefore, it is of significant value for companies to reduce the customer outflow. In this paper, we focus on identifying the customers with high chance of attrition and provide valid and trustworthy recommendations to improve their customer churn rate. To this end, we designed and implemented a recommender system that can provide actionable recommendations to improve customer churn rate. We used both transaction and survey data from heavy equipment repair and service sector from 2011 to 2017. This data was collected by a consulting company based in Charlotte, North Carolina. In the survey data, customers give their thoughts, feelings, expectations and complaints by free-form text. We applied aspect-based sentiment analysis on the review text data to gain insightful knowledge on customers' attitudes toward the service. Action rule mining and meta-action triggering mechanism are used to recognize the actionable strategies to help with reducing customer churn.
Keywords: Action Rule Mining; Meta-actions; Aspect-based Sentiment Analysis; Recommender System; Reduct.
ONTOLOGY AND WEB USAGE MINING FOR WEB SITE MAINTENANCE
by Khaled Benali
Abstract: The search for information in the classical web is based essentially on the structure of the documents, and this makes the exploitation of the content almost impossible by the machines. In contrast, in the Semantic Web, machines can access resources through the semantic representation of content. In this regard, two domains, namely the web mining and the semantic web are closely linked: on the one hand, web-mining techniques help in the construction of the semantic web; on the other hand, the semantic web helps extract new knowledge. The present article discusses the problem of implementing Web Usage Mining in the semantic web for information retrieval from the web using ontology. Therefore, we present an approach that uses ontology and Web Usage Mining techniques for website maintenance. This work can help novice researchers start working in the field of web mining in the Semantic Web easily. Our approach will be tested on the ontology of a university website, which will be built and then enriched based on the extracted patterns on the Site Logs using an algorithm for the extraction of frequent itemsets. This approach aims to produce all the pages that are often accessible at the same time and throughout the same session to maintain the websites.
Keywords: Apriori; knowledge; Log File; Ontology; Web Usage Mining; Semantic Web and Website Maintenance.
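The frequent-itemset step mentioned in this abstract (finding pages that are often visited together within the same session) can be sketched with a small Apriori-style routine. The session data, page names, and function name below are hypothetical, and the sketch uses an absolute support count; it is not the author's implementation.

```python
from itertools import combinations


def frequent_itemsets(sessions, min_support):
    """Return {frozenset(pages): count} for page sets found in >= min_support sessions."""
    sessions = [frozenset(s) for s in sessions]
    # Level 1: count single pages.
    counts = {}
    for session in sessions:
        for page in session:
            key = frozenset([page])
            counts[key] = counts.get(key, 0) + 1
    current = {k: c for k, c in counts.items() if c >= min_support}
    result = dict(current)
    size = 2
    while current:
        # Join step: merge frequent (size-1)-sets into candidate sets of the next size.
        candidates = {a | b for a, b in combinations(list(current), 2) if len(a | b) == size}
        # Prune step: every (size-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, size - 1))
        }
        counts = {c: sum(1 for s in sessions if c <= s) for c in candidates}
        current = {c: n for c, n in counts.items() if n >= min_support}
        result.update(current)
        size += 1
    return result


if __name__ == "__main__":
    logs = [
        ["home", "courses", "contact"],
        ["home", "courses"],
        ["home", "staff"],
        ["courses", "contact"],
    ]
    for pages, count in frequent_itemsets(logs, min_support=2).items():
        print(sorted(pages), count)
```

Page sets that recur across sessions, such as a pair of pages commonly opened together, are the kind of pattern that could inform link placement during site maintenance.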
OPTIMIZING DATA QUALITY OF A DATA WAREHOUSE USING DATA PURGATION PROCESS
by Neha Gupta
Abstract: Data act as fuel for any science and technology operation and, due to the rapid growth of data collection and storage services, maintaining the quality of the data collected and stored is a major challenge. There are various data formats available, and they are categorized into three groups, i.e., Structured, Semi-structured and Unstructured. Different data mining techniques are utilized to gather, refine and investigate the data, which further raises the issue of data quality administration. The process of improving the quality of data without much alteration is known as data purgation. Data purgation occurs when the data is subject to the Extract, Transform and Load (ETL) methodology in order to maintain and improve the data quality. Metadata is the most important factor that affects the quality of the collected data. The data may contain unnecessary information and may have inappropriate symbols, which can be defined as dummy values, cryptic values or missing values. The present work has improved the Expectation-Maximization algorithm with dot product to handle cryptic data, the DBSCAN method with Gower metrics has been implemented to ensure dummy values, Ward's algorithm with Minkowski distance has been applied to improve the results of contradicting data, and the K-means algorithm along with Euclidean distance metrics has been applied to handle missing values in a dataset. These distance metrics have improved the data quality and also helped in providing consistent data to be loaded into a data warehouse. The above-mentioned algorithms have been modified with the feature of scanning the database once and calculating the minimum support, thereby increasing the efficiency as well as accuracy. The implementation of the algorithms has been tested on various datasets of different sizes with more than 1000 records. The proposed algorithms have helped in maintaining the accuracy, integrity, consistency, and non-redundancy of data in a timely manner.
Keywords: Data Warehouse (DW); Data Quality (DQ); Extract; Transform and Load (ETL); Data Purgation (DP).
Special Issue on: ICBBD 2019 Business, Big Data and Decision Sciences
Modelling Attrition to Know Why Your Employees Leave or Stay
by Sachin Deshmukh, Seema Sant, Neerja Kashive
Abstract: Today's environmental factors influence every aspect of business, be it Marketing, Finance, Operations or Human Resources policies. Increased globalization and technological developments have resulted in fierce competition among companies. Talent shortage has become a global issue for organizations. One of the major challenges faced by any organization is the increase in the level of employee attrition. Attrition up to a certain limit is good for any organization, as it makes it possible to inject new blood and ideas which can help in developing competitive advantage. But attrition beyond a certain limit can prove unhealthy, as talented employees may go elsewhere in search of greener pastures. Data Analytics is used as an effective tool to delve into the problem of attrition. Predictive models are being used to understand the factors responsible for attrition and also to predict the probabilities of employees leaving the organization for some reason. The current study has tried to build a predictive model using logistic regression and to understand the specific factors that lead to attrition. This paper also attempts to compare factors responsible for attrition in two time periods, the first period from 1996 to 2008 (Holtom's Model) and the second period from 2009 to 2016, to find whether any changes have taken place in employees' expectations, which, if not fulfilled, may lead to attrition. An analysis of an IT organization's data reveals that the factors responsible for attrition in the second period have changed compared to the first period.
Keywords: Attrition; Predictive Model; Logistic Regression.
Long Text to Image Converter for Financial Reports
by Chia-Hao Chiu, Yun-Cheng Tsai, Ho-Lin Chen
Abstract: In this study, we propose a novel article analysis method. This method converts the article classification problem into an image classification problem by projecting texts into images and then applying CNN models for classification. We call the method the Long Text to Image Converter (LTIC). The features are extracted automatically from the generated images, hence there is no need for any explicit step of embedding the words or characters into numeric vector representations. This saves the time otherwise spent experimenting with pre-processing. This study uses the financial domain as an example. In companies' financial reports, there is a chapter that describes the company's financial trends. The content has many financial terms used to infer the company's current and future financial position. The LTIC achieved excellent convolution matrix and test data accuracy. The results indicated an 80% accuracy rate. The proposed LTIC produced excellent results during practical application. The LTIC achieved excellent performance in classifying corporate financial reports under review. The return on simulated investment is 46%. In addition to tangible returns, the LTIC method reduced the time required for article analysis and is able to provide article classification references in a short period to facilitate the decisions of the
Keywords: Article Analysis; Convolutional Neural Network; Financial Analysis; Long Text to Image Converter.
E-Learning process through text mining for academic literacy
by Maira Alejandra Pulgarin Roriguez, Bárbara Maricely Fierro Chong, Erica María Ossa Taborda
Abstract: This paper presents the results of research carried out in a Virtual Faculty of Education at a private university in Colombia. It consists of the characterization of students' abilities in reading and writing comprehension for academic literacy. The study aims to verify the effectiveness of an E-learning platform implementation for all the programs incorporated in the Faculty. In accordance with its policies, the University has a structured methodological procedure for text mining through specific keywords applicable to different text typologies in specialized areas. This platform allows professors and students to develop expertise in disciplines using text mining as an interdisciplinary strategy to build knowledge and improve quality in their professional context.
Keywords: Text mining; terminological work; cognitive processes; E-learning; academic literacy; reading comprehension; academic writing.
Association Rules in Mobile Game Operation
by Muning Chang
Abstract: Mobile games are now playing a significant role in the gaming industry as the Internet continues to develop. Due to the economic and cultural value of mobile games, it is very important for gaming companies to maintain and further improve product quality to remain competitive in the industry. The operation team plays a key role in maintaining product profitability after the games are released.
This paper analyzes the gaming data collected during operation and proposes operation strategies accordingly. A correlation coefficient algorithm suitable for time sequences is proposed, in which the association is defined by the similarity between data. The level of association between two time sequences is reflected in the probability of the occurrence of such association. Based on this discovery, we can analyze the next popular mobile game in depth to explore the correlation between the number of users online, the number of new players, and the retention rate. The study found that there are two fatigue periods, at approximately day 30 and day 120, when there is a high likelihood of user loss, which is important to consider in the strategic planning of game operation.
Keywords: Mobile Games; Association Rules; Sequence Correlation; Operation Optimization.
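The abstract above does not specify the authors' correlation coefficient algorithm, so the following is only a minimal sketch that substitutes the standard Pearson coefficient as a similarity measure between two daily time sequences. The function name `pearson` and the sample series `online_users` and `new_players` are hypothetical and are not taken from the paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences.

    This is a stand-in for the paper's (unspecified) coefficient.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("sequences must have equal length of at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of centered cross-products and centered squares.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical daily operation metrics (illustrative values only).
online_users = [120, 135, 150, 160, 158, 170, 181]
new_players = [30, 34, 39, 41, 40, 45, 48]

if __name__ == "__main__":
    print(f"association: {pearson(online_users, new_players):.3f}")
```

A value near 1 or -1 would indicate that the two operation metrics move together or in opposition; how the paper converts similarity into a probability of association is not stated in the abstract and is not modeled here.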
A Multivariate Copula-based SUR Probit Model: Application to Insolvency Probability of Enterprises
by Woraphon Yamaka, Paravee Maneejuk
Abstract: The purpose of this study is to introduce a more flexible joint distribution for a Probit model with more than two equations, the so-called SUR Probit model. The main idea of the suggested method is to use a multivariate copula to link the errors of the equations in the SUR Probit model. We conduct a simulation study to assess the performance of the model and then apply the model to a real economic problem, namely the insolvency probability of small and medium enterprises in Thailand. This study considers three economic sectors and posits some dependencies among them. The copula-based SUR Probit model shows better performance in both the simulation and the application study. In addition, it is found to be suitable for explaining the causal effect of the companies' financial statements on their insolvency probability, and challenging results for Thai enterprises are brought out.
Keywords: Multivariate Copula; Multivariate Probit Model; Small and Medium Enterprises; Financial Statements; Insolvency Probability.
Hedging Agriculture Commodities Futures with Histogram data: A Markov Switching Volatility and Correlation model
by Woraphon Yamaka, Pichayakone Rakpho, Paravee Maneejuk
Abstract: In this study, the bivariate flexible Markov Switching Dynamic Copula GARCH model is developed for histogram-valued data to calculate the optimal portfolio weight and optimal hedge. This model is an extension of the Markov Switching Dynamic Copula GARCH in which all estimated parameters are allowed to be regime dependent. The histogram data is constructed from the 5-minute wheat spot and futures returns. We compare our proposed model with other bivariate GARCH models through AIC, BIC, and hedge effectiveness. The empirical results show that our model is slightly better than the conventional methods in terms of the lowest AIC and BIC and the highest hedge effectiveness. This indicates that our proposed model is quite effective in reducing risks in portfolio returns.
Keywords: Hedging strategy; Markov Switching; Time-varying dependence; Histogram data; Wheat.
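To make the notion of hedge effectiveness in the abstract above concrete, here is a minimal sketch of the static, textbook minimum-variance hedge: the hedge ratio is Cov(spot, futures) / Var(futures), and hedge effectiveness is the proportional variance reduction of the hedged position. This is a simplifying assumption for illustration; it is not the regime-switching, time-varying copula GARCH estimator the paper develops, and all function names are hypothetical.

```python
def _mean(values):
    return sum(values) / len(values)


def _covariance(x, y):
    """Sample covariance (n - 1 denominator) of two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have equal length of at least 2")
    mx, my = _mean(x), _mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def min_variance_hedge_ratio(spot, futures):
    """Static hedge ratio h = Cov(spot, futures) / Var(futures)."""
    var_f = _covariance(futures, futures)
    if var_f == 0:
        raise ValueError("futures returns have zero variance")
    return _covariance(spot, futures) / var_f


def hedge_effectiveness(spot, futures, hedge_ratio):
    """HE = 1 - Var(spot - h * futures) / Var(spot)."""
    hedged = [s - hedge_ratio * f for s, f in zip(spot, futures)]
    var_s = _covariance(spot, spot)
    if var_s == 0:
        raise ValueError("spot returns have zero variance")
    return 1.0 - _covariance(hedged, hedged) / var_s


if __name__ == "__main__":
    # Hypothetical return series (illustrative values only).
    spot = [0.010, -0.020, 0.015, 0.004, -0.007]
    futures = [0.008, -0.018, 0.012, 0.006, -0.005]
    h = min_variance_hedge_ratio(spot, futures)
    print(h, hedge_effectiveness(spot, futures, h))
```

A hedge effectiveness closer to 1 means the futures position removes more of the spot variance, which is the sense in which the abstract compares models.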
Special Issue on: IRICT 2019 Advances in Computational Intelligence and Data Science
Investigation of Contraction Process Issue in Fuzzy Min-Max Models
by Essam Alhroob, Mohammed Falah Mohammed, Fadhl Hujainah, Osama Nayel Al Sayaydeh, Ngahzaifa Ab Ghani
Abstract: The fuzzy min-max (FMM) network is one of the most powerful neural networks. It combines a neural network and fuzzy sets into a unified framework to address pattern classification problems. The FMM consists of three main learning processes, namely, hyperbox contraction, hyperbox expansion and hyperbox overlap tests. Despite its various learning processes, the contraction process is considered as one of the major challenges in the FMM that affects the classification process. Thus, this study aims to investigate the FMM contraction process precisely to highlight its usage consequences during the learning process. Such investigation can assist practitioners and researchers in obtaining a better understanding about the consequences of using the contraction process on the network performance. Findings of this study indicate that the contraction process used in FMM can affect network performance in terms of misclassification and incapability in handling the membership ambiguity of the overlapping regions.
Keywords: Pattern classification; Fuzzy min-max; FMM models; Contraction process.
Plagiarism Detection of Figure Images in Scientific Publications
by Taiseer Eisa
Abstract: Plagiarism is stealing others' work by using their words directly or indirectly without credit or citation. Copying others' ideas is another type of plagiarism that may occur in many areas, but the most serious is academic plagiarism. Therefore, technical solutions are urgently required for automatic detection of idea plagiarism. Detection of figure plagiarism is a particularly challenging field of research, because not only the text but also the graphic features need to be analyzed. This paper investigates the issues of idea and figure plagiarism and proposes a detection method which copes with both text and structure change. The procedure depends on finding similar semantic meanings between figures by applying image processing and semantic mapping techniques. The figures were compared using the representation of shape features based on detailed comparisons between the components of figures. This is an improvement over existing methods, which only compare the numbers and types of shapes inside figures.
Keywords: Plagiarism detection; figure plagiarism detection; idea plagiarism detection; academic plagiarism; structure change; text change; semantic meanings; image processing; semantic mapping techniques; scientific publications; content based algorithms.
Arabic Text Semantic-Based Query Expansion
by Nuhu Yusuf, Mohd Amin Mohd Yunus, Norfaradilla Wahid, Aida Mustapha, Nazri Mohd Nawi, Noor Azah Samsudin
Abstract: Query expansions are being used in many search applications for retrieving relevant documents. Although retrieving the relevant documents is important for search users, the complexity of Arabic morphology remains a challenge. As a result, many irrelevant documents are still retrieved in the ranked results. To address this challenge, this paper proposes a new searching method for Arabic text semantic-based query expansion. The proposed method combines Arabic word synonyms and ontology to expand the query with additional terms. Specifically, the proposed method combines lexical words within the ranking algorithm and then uses ontology links to expand the query. The performance of Arabic text semantic-based query expansion was evaluated in terms of average precision, mean average precision and mean reciprocal rank. Experiments on Quran datasets show that the proposed method using the Arabic Text Semantic-Based Query Expansion approach outperforms the previous methods that used another dataset, called the Tafsir dataset. The proposed method achieved 15.44% mean average precision.
Keywords: Arabic Text; Semantic Search; Query Expansion; Lexical Words; Ontology; Ranking Algorithms.
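The evaluation measures named in the abstract above (average precision, mean average precision and mean reciprocal rank) have standard definitions, sketched below. The document identifiers and the helper names are hypothetical; the paper's own evaluation setup (collections, relevance judgments, cutoffs) is not described in the abstract and is not reproduced here.

```python
def average_precision(ranked, relevant):
    """Mean of precision@k over the ranks k at which a relevant doc appears,
    normalized by the total number of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / k
    return 0.0


def mean_over_queries(metric, runs):
    """Average a per-query metric over (ranked_list, relevant_set) pairs."""
    if not runs:
        return 0.0
    return sum(metric(ranked, relevant) for ranked, relevant in runs) / len(runs)


if __name__ == "__main__":
    # Hypothetical results for two queries (identifiers are illustrative).
    runs = [
        (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
        (["d9", "d7", "d8"], {"d7"}),
    ]
    print("MAP:", mean_over_queries(average_precision, runs))
    print("MRR:", mean_over_queries(reciprocal_rank, runs))
```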
A Hybrid Feature Selection Method Combining Gini Index and Support Vector Machine with Recursive Feature Elimination for Gene Expression Classification
by Talal Almutiri, Faisal Saeed
Abstract: Microarray datasets suffer from the curse of dimensionality because they contain a large number of genes and a small number of samples; this high dimensionality increases computational cost and complexity. Feature selection (FS), the process of choosing informative genes, can therefore help improve the effectiveness of classification. In this study, a hybrid feature selection method is proposed that combines the Gini Index and Support Vector Machine with Recursive Feature Elimination (GI-SVM-RFE), calculates a weight for each gene, and recursively selects only ten genes as the informative genes. To measure the impact of the proposed method, the experiments include four scenarios: baseline without feature selection, GI feature selection, SVM-RFE feature selection, and combining GI with SVM-RFE. Eleven microarray datasets were used. The proposed method showed an improvement in classification accuracy when compared with previous studies.
Keywords: Classification; Feature Selection; Gene Expression; Gini Index; Microarray; Recursive Feature Elimination.
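As an illustration of the Gini Index stage only, the sketch below scores each gene by the largest reduction in Gini impurity achievable with a single threshold split on its expression values, then ranks genes by that score. This is one common way to use the Gini index for filtering and is an assumption here; the abstract does not give the authors' exact weighting, and the SVM-RFE stage is not shown. All names and the toy data are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum(p_c^2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_gini_gain(values, labels):
    """Largest impurity reduction over all threshold splits of one feature."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    n = len(pairs)
    parent = gini_impurity(labels)
    best = 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * gini_impurity(left)
                    + len(right) * gini_impurity(right)) / n
        best = max(best, parent - weighted)
    return best


def rank_genes(expression, labels):
    """Return gene names sorted from most to least informative."""
    scores = {g: best_gini_gain(v, labels) for g, v in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical expression values for four samples (illustrative only).
    labels = ["normal", "normal", "tumor", "tumor"]
    expression = {"gene_a": [1.0, 2.0, 8.0, 9.0], "gene_b": [1.0, 8.0, 9.0, 2.0]}
    print(rank_genes(expression, labels))
```

In a hybrid pipeline of the kind the abstract describes, a filter score like this would typically shortlist genes before a wrapper method such as SVM-RFE refines the selection.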
Fast Parallel Computation of PageRank Scores with Improved Convergence Time
by Hema Dubey, Nilay Khare
Abstract: PageRank is a prominent link-based approach used by many search engines to rank their search results. The PageRank algorithm iteratively calculates the PageRank of web pages until convergence is reached. The computational cost of this algorithm is very high for very large web graphs. To overcome this drawback, in this paper we propose a fast parallel computation of PageRank which uses a standard deviation technique to normalize the PageRank score of each web page. The proposed method is evaluated on standard datasets taken from the Stanford Large Network Dataset Collection, on a machine with a multicore architecture using the CUDA programming paradigm. We observed from the experiments that the proposed fast Parallel PageRank algorithm needs fewer iterations to converge than the existing Parallel PageRank method. We also determined that the proposed algorithm achieves a speedup of about 2 to 10 times over the existing algorithm on nine different standard datasets.
Keywords: PageRank; Normalization; Standard Deviation; Parallel Computation; GPU; CUDA.
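For readers unfamiliar with the iterate-until-convergence scheme the abstract above refers to, the following is a minimal sequential sketch of the classic PageRank power iteration. It is not the authors' parallel CUDA implementation and does not include their standard-deviation normalization; the damping factor 0.85, the tolerance, and the toy graph are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank on a dict {page: [pages it links to]}.

    Returns (ranks, iterations_used). Pages with no out-links (dangling
    pages) spread their rank uniformly over all pages.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    if n == 0:
        return {}, 0
    rank = {v: 1.0 / n for v in nodes}
    iterations = 0
    for iterations in range(1, max_iter + 1):
        dangling = sum(rank[v] for v in nodes if not links.get(v))
        # Teleportation term plus the uniformly redistributed dangling mass.
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            targets = links.get(v, [])
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
        # L1 change between successive iterates is the convergence test.
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank, iterations


if __name__ == "__main__":
    # Hypothetical four-page web graph.
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    ranks, used = pagerank(graph)
    print(used, ranks)
```

The number of iterations returned is the quantity the abstract reports reducing; a parallel version would distribute the per-page updates inside each iteration across GPU threads.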