International Journal of Data Mining, Modelling and Management (28 papers in press)
Application of Structural Equation Modeling in Iranian Tourism Researches: Challenges and Guidelines
by Seyyed Mohammad Mirtaghian Rudsari, Najmeh Gharibi
Abstract: The main purpose of this study is to identify and analyze the challenges in using Structural Equation Modeling (SEM) in tourism research in Iran. The paper examines how Iranian scholars have used the technique, based on a sample of 172 papers published in the top five tourism journals published in Farsi (i.e. Persian). The results indicate that papers often lack discussion of sample size, normality of distribution, effect analysis, and the role of coefficients of determination, and that selective and arbitrary reporting of fit indices is not uncommon. The paper also emphasizes the role of theory in constructing such models.
Keywords: Structural Equation Modeling (SEM); Covariance Based SEM; Partial Least Squares SEM; Challenges and Misuse; Iranian Tourism Research.
New perspectives on deep neural networks in decision support in surgery
by Konstantin Savenkov, Vladimir Gorbachenko, Anatoly Solomakha
Abstract: The paper considers the development of a neural network system for predicting complications after acute appendicitis operations. A neural network with a deep architecture has been developed. The training set was compiled by the authors from real clinical data. To select significant features, a method based on the interquartile range of the F1-score is proposed. For preliminary processing of the training data, an overcomplete autoencoder is proposed. The overcomplete autoencoder maps the selected features into a space of higher dimension, which, according to Cover's theorem, facilitates separating cases with complications from cases without complications. To reduce overfitting of the network, dropout was used. The neural network is implemented using the Keras and TensorFlow libraries. The trained neural network showed high diagnostic metrics on the test data set.
Keywords: neural networks; features selection; learning neural networks; overfitting; overcomplete autoencoder; medical diagnostics.
Modelling and Visualizing Emotions in Twitter Feeds
by Satish M. Srinivasan, Abhishek Tripathi
Abstract: Predictive analytics on Twitter feeds is becoming a popular field of research. A tweet holds a wealth of information on how individuals express and communicate their feelings and emotions within their social network. Large-scale collection, cleaning, and mining of tweets will help capture not only an individual's emotions but also the emotions of a larger group. However, capturing a large volume of tweets and identifying the emotions expressed in them is a very challenging task. In this study, an emotion-based classification scheme is proposed. Initially, a synthetic dataset is built by randomly picking instances from different training datasets. Using this newly constructed dataset, the classifiers are trained (model building). Finally, emotions are predicted on the test datasets using the generated models. By training the Na
Keywords: emotion classification; twitter data analysis; US presidential election; supervised classifier; Random Forest; Naïve Bayes Multinomial.
Pursuing Efficient Data Stream Mining by Removing Long Patterns from Summaries
by Po-Jen Chuang, Yun-Sheng Tu
Abstract: Frequent pattern mining is a useful data mining technique. It can help extract frequently used patterns from massive Internet data streams for significant applications and analyses. To improve mining accuracy and reduce the required processing time, this paper proposes a new approach that removes less frequently used long patterns from the pattern summary to preserve space for more frequently used short patterns, in order to enhance the performance of existing frequent pattern mining algorithms. Extensive simulation runs are carried out to check the performance of the proposed approach. The results show that our approach can strengthen mining performance by effectively reducing the required run time and substantially increasing mining accuracy.
Keywords: Data Streams; Frequent Pattern Mining; Pattern Summary; Length Skip; Performance Evaluation.
Investigating the Impact of Preprocessing on Document Embedding: An Empirical Comparison
by Nourelhouda Yahi, Hacene Belhadef, Mathieu Roche
Abstract: Digital representation of text documents is a crucial task in machine learning and Natural Language Processing (NLP). It aims to transform unstructured text documents into mathematically computable elements. In recent years, several methods have been proposed and implemented to encode text documents into fixed-length feature vectors. This operation is known as Document Embedding, and it has become an interesting and open area of research. Paragraph Vector (Doc2vec) is one of the most used document embedding methods and has gained a good reputation thanks to its good results. To overcome its limits, Doc2vec was extended by the Document through Corruption (Doc2vecC) technique. To gain a deeper view of these two methods, this work presents a study on the impact of morphosyntactic text preprocessing on them. We carried out this analysis by applying the most-used text preprocessing techniques, such as Cleaning, Stemming and Lemmatisation, and their different combinations. Experimental analysis on the Microsoft Research Paraphrase dataset (MSRP) reveals that the preprocessing techniques improve classifier accuracy, and that Stemming methods outperform the other techniques.
Keywords: Natural Language Preprocessing; Document Embedding; Paragraph Vector; Document through Corruption; Text Preprocessing; Semantic Similarity.
A comprehensive review of deep learning for natural language processing
by Amal Bouraoui, Salma Jamoussi, Abdelmajid Ben Hamadou
Abstract: Deep learning has attracted considerable attention across many Natural Language Processing (NLP) domains. Deep learning models aim to learn embeddings of data with multiple levels of abstraction through multiple layers for either labeled structured input data or unlabeled unstructured input data. Currently, two research trends have emerged in building higher level embeddings. On one hand, a strong trend in deep learning leads towards increasingly powerful and complex models. On the other hand, multi-purpose sentence representation based on simple sums or averages of word vectors was recently shown to be effective. Furthermore, improving the performance of deep learning methods by attention mechanism has become a research hotspot in the last four years. In this paper, we seek to provide a comprehensive review of recent studies in building Neural Network (NN) embeddings that have been applied to NLP tasks. We provide a walk-through of deep learning evolution and a description of a variety of its architectures. We present and compare the performance of several deep learning models on standard datasets about different NLP tasks. We also present some deep learning challenges for natural language processing.
Keywords: Deep Learning; Word Embedding; Sentence Embedding; Attention Mechanism; Compositional Models; Convolutional NNs; Recurrent/Recursive NNs; Multi-purpose Sentence Embedding; Natural Language Processing.
Time-Series Gradient Boosting Tree for Stock Price Prediction
by Kei Nakagawa, Kenichi Yoshida
Abstract: We propose a time-series gradient boosting tree for a data set with time-series and cross-sectional attributes. Our time-series gradient boosting tree has weak learners with time-series and cross-sectional attributes in their internal nodes, and splits examples based on the similarity between a pair of time-series or the impurity of cross-sectional attributes. Dissimilarity between a pair of time-series is defined by the dynamic time warping method. In other words, the decision tree is constructed according to whether the shape of a time-series is similar or dissimilar to its past shape. We conducted an empirical analysis using major world indices and confirmed that our time-series gradient boosting tree is superior to prior research methods in terms of both profitability and accuracy.
Keywords: Dynamic Time Warping method; Time-series Decision Tree; Time-series Gradient Boosting Tree; Stock Price Prediction.
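The dynamic time warping dissimilarity mentioned in the abstract above can be sketched as follows. This is a minimal illustration of the textbook DTW recurrence with an absolute-difference local cost, not the authors' implementation. The names `dtw_distance` and `split_by_shape`, the reference series, and the threshold are hypothetical and chosen only for this example.

```python
import math


def dtw_distance(a, b):
    """Textbook O(len(a) * len(b)) dynamic time warping distance.

    Uses |x - y| as the local cost; other local costs are possible.
    """
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def split_by_shape(series_list, reference, threshold):
    """Hypothetical node split: series within `threshold` of the
    reference shape go left, the rest go right."""
    left, right = [], []
    for series in series_list:
        if dtw_distance(series, reference) <= threshold:
            left.append(series)
        else:
            right.append(series)
    return left, right


if __name__ == "__main__":
    rising = [1, 2, 3]
    falling = [3, 2, 1]
    print(split_by_shape([rising, falling], reference=[1, 2, 2, 3], threshold=1.0))
```

Because warping lets one series repeat a point, `[1, 2, 3]` and `[1, 2, 2, 3]` have a distance of zero even though their lengths differ, which is what makes the measure a comparison of shape.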
SUGGESTION AND SOLUTION OF A MATHEMATICAL MODEL FOR DETERMINING EFFECTIVE
by Kenan Mengüç, Tarik Küçükdeniz
Abstract: As obtaining data gets easier and cheaper with the help of technological achievements, data-based analytics and management have become an essential part of planning and decision making to achieve success in the sports industry. The study finds offensive routes for a team game using high-security data produced with technology. An analysis of a sports team match was performed using seasonal data. A mathematical model has been developed for this analysis, discussing the effectiveness of the routes the model offers. This article aims to find the safe, efficient route for organizing the football on the field. In addition, the study also offers an experimental proposal for this purpose.
Keywords: Match strategy; tactics; optimization; transshipment problem.
Emotions recognition in synchronic textual CSCL situations
by Germán Lescano, Rosanna Costaguta, Analia Amandi
Abstract: Computer-Supported Collaborative Learning (CSCL) is a useful practice for teaching learners to work in groups and acquire collaborative skills. Evaluating the collaborative process can be burdensome for teachers because it requires analyzing a large number of interactions. One issue to consider is socio-affective interactions, which are important to recognize because of their influence on the learning process. In this work, we propose an approach to recognize affective states of Spanish-speaking students in synchronic textual CSCL situations. Through experimentation, we analyze the emotions manifested by university computer science students when they worked in groups in synchronic textual CSCL situations, and we evaluate the proposed approach using tools and libraries available in the market for sentiment analysis. Using the proposed approach, we developed classifiers to recognize subjectivity, sentiments and emotions. The sentiment classification model developed was compared with pre-built models regarding the rate of correct classifications. Results show that resources available in the market help the process of developing classifiers of sentiments and emotions for CSCL environments using traditional machine learning techniques. Providing CSCL environments with a tool to recognize socio-affective interactions can help teachers evaluate this dimension of the collaborative process.
Keywords: Computer-Supported Collaborative Learning; Socio-Affective Interactions; Affective Computing.
Developing a Machine Learning Framework to Determine the Spread of COVID-19 in the United States using Meteorological, Social, and Demographic Factors
by Akash Gupta, Amir Gharehgozli
Abstract: Coronavirus disease of 2019 (COVID-19) became a pandemic within a few months of the outbreak in December 2019 in Wuhan, China. We study the impact of weather factors including temperature and pollution on the spread of COVID-19. We also include social and demographic variables such as per capita Gross Domestic Product (GDP) and population density. Adapting the theory from the field of epidemiology, we develop a framework to build analytical models to predict the spread of COVID-19. In the proposed framework, we employ machine learning methods including linear regression, linear kernel support vector machine (SVM), radial kernel SVM, polynomial kernel SVM, and decision tree. Given the non-linear nature of the problem, the radial kernel SVM performs the best and explains 95% more variation than the existing methods. In line with the literature, our study indicates that population density is the critical factor in determining the spread. The univariate analysis shows that higher temperature, air pollution, and population density can increase the spread. On the other hand, a higher per capita GDP can decrease the spread.
Keywords: COVID-19; disease spread; social and demographic factors; machine learning; epidemiology; predictive modeling.
Data Analytics for Gross Domestic Product using Random Forest and Extreme Gradient Boosting Approaches: An Empirical Study
by Elsayed Habib Elamir
Abstract: Gross domestic product per capita may be considered one of the most important measures of social well-being, and all nations attempt to raise it to contribute to their populations' well-being and prosperity and to strengthen their standing in international relations. This study aims to use the random forest and extreme gradient boosting approaches to forecast and analyze gross domestic product per capita using World Bank development indicators at the country level over the period 2010 to 2017. Comprehensive comparisons are carried out using the years before 2017 as training data and the year 2017 as testing data. The root mean square error and the coefficient of determination are used to judge among the different models. The random forest and extreme gradient boosting achieve accuracies of 97.8% and 98.1%, respectively, in terms of the coefficient of determination. The results suggest that investment in education, labor, health, and industry, as well as decreases in inflation, interest, and unemployment, are necessary to enhance gross domestic product per capita. Motivating results are given by a two-way interaction measure that is useful in explaining co-dependencies in the model behavior. The strongest interactions are between trade-technology and technology-education, followed by consumption-health, in terms of the extreme gradient boosting method.
Keywords: bagging; boosting; business analytics; forecast; GDP; machine learning.
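The two evaluation measures named in the abstract above have standard definitions, sketched below for reference. This is only an illustration of the textbook formulas, RMSE = sqrt(mean((y - ŷ)²)) and R² = 1 - SS_res / SS_tot. The function names and the sample values are hypothetical and are not taken from the paper.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all actual values are equal")
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Made-up values purely for illustration.
    actual = [1.0, 2.0, 3.0]
    predicted = [1.1, 1.9, 3.2]
    print(rmse(actual, predicted), r_squared(actual, predicted))
```

A model that always predicts the mean of the observed values scores an R² of 0, and a perfect model scores 1, so R² is read relative to that baseline rather than as a classification accuracy.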
Methodology for Comparing Text Corpora via Topic Model
by Fedor Krasnov, Mikhail Shvartsman, Alexander Dimentov
Abstract: The authors of this paper developed a methodological approach for comparative analysis of patents' content. The approach, named T4C, is based on topic modeling and machine learning. The authors were able to identify the country to which a patent belongs with an accuracy of 97.5% using supervised machine learning methods. When studying the dependence of patents on time, the authors were able to identify the period to which a patent belongs with an accuracy of 85% for a specific country. The authors have developed a visual presentation of the thematic correlation between groups of patents. It should also be noted that, in terms of the composition of the patent description text, Chinese patents are fundamentally different from US patents. The results presented in this study were used to manage the patenting process at GazpromNeft STC.
Keywords: Topic Modeling; Text Classification; ARTM; PLSA; Random Forest; Text Collections Comparison.
Non-linear Gradient-based Feature Selection for Precise Prediction of Diseases
by Sadaf Kabir, Leily Farrokhvar
Abstract: Developing accurate predictive models can profoundly help health care providers improve the quality of their services. However, medical data often contain several variables, and not all of the data equally contribute towards the prediction. The existence of irrelevant and redundant features in a dataset can unnecessarily increase computational cost and complexity while deteriorating the performance of the predictive model. In this study, we employ the gradient-based prediction attribution as a general tool to identify important features in differentiable predictive models, such as neural networks and linear regression. Built upon this approach, we analyze single-stage and multi-stage scenarios for feature selection using ten medical datasets. Through extensive experiments, we demonstrate that the combination of the gradient-based approach with neural networks provides a powerful non-linear technique to identify important features contributing to the prediction. In particular, non-linear gradient-based feature selection achieves competitive results or significant improvements over previously reported results on all datasets.
Keywords: Machine learning; feature selection; neural networks; logistic regression; disease prediction models; health care data.
Synergistic Effects Between Data Corpora Properties and Machine Learning Performance in Data Pipelines
by Roberto Bertolini, Stephen Finch
Abstract: To analyze data, a computationally feasible pipeline must be developed for data modeling. Corpora properties affect performance variability of machine learning (ML) techniques in pipelines; however, this has not been thoroughly investigated using simulation methodologies. A Monte Carlo study is used to compare differences in the area under the curve (AUC) metric for large-n-small-p-corpora examining: (1) the choice of ML algorithm, (2) size of the training database, (3) measurement error, (4) class imbalance magnitude, and (5) missing data pattern. Our simulations are consistent with established results under which these algorithms and corpora properties perform best, while providing insights into their synergistic effects. Measurement error negatively impacted pipeline performance across all corpora factors and ML algorithms. A larger training corpus ameliorated the decrease in predictive efficacy resulting from measurement error, class imbalance magnitudes, and missing data patterns. We discuss the implications of these findings for designing pipelines to enhance prediction performance.
Keywords: Data Pipeline; Interaction/Synergistic Effects; Monte Carlo Simulation; Machine Learning; Binary Classification.
Prediction of air pollution and analyze its effects on pollution dispersion of PM10 in Egypt using machine learning algorithms.
by Wael K. Hanna, Rasha Elstohy, Nouran Radwan
Abstract: Air pollution is considered one of the serious threats in Egypt. According to a study in the journal Environmental Science & Technology Letters, air pollution is one of the main factors responsible for shortening Egyptians' lives by 1.85 years. The main cause of air pollution in Egypt is PM10, which comes from industrial processes. PM10 concentrations exceed daily average concentrations during 98% of the measurement period. In this paper, we apply machine learning classification algorithms to build the most accurate model for air pollution prediction and to analyse its effects on pollution dispersion of PM10. The proposed classification model begins with air quality data collection and pre-processing, followed by a classification process to discover the main relevant features for prediction. Experimental results show a good performance of the proposed air quality model. Random Forest, Na
Keywords: Air pollution; PM10; Classification model and machine learning algorithms.
Detecting and Exploiting Symmetries in Sequential Pattern Mining
by Ikram Nekkache, Said Jabbour, Nadjet Kamel, Lakhdar SAIS
Abstract: In this paper, we introduce a new framework for discovering and using symmetries in sequential pattern mining tasks. Symmetries are permutations between items that leave the sequential database invariant. Symmetries present several potential benefits. They can be seen as a new kind of structural pattern expressing regularities and similarities between items. As symmetries induce a partition of the sequential patterns into equivalence classes, exploiting them makes it possible to improve the pattern enumeration process while reducing the size of the output. To this end, we first address the problem of symmetry discovery from a database of sequences. Then, we show how Apriori-like algorithms can be enhanced by dynamic integration of the detected symmetries. Finally, we provide a second symmetry-breaking approach that eliminates symmetries in a preprocessing step by reformulating the sequential database of transactions. Our experiments clearly show that several sequential pattern mining datasets contain such symmetry-based regularities. We also experimentally demonstrate that using such symmetries results in a significant reduction of the search space on some datasets.
Keywords: Data Mining; sequential pattern mining; symmetries.
Comparison of Harmony Search Derivatives for Artificial Neural Network Parameter Optimization: Stock Price Forecasting
by Mehmet Ozcalici, Ayse Tugba Dosdogru, Asli Boru Ipek, Mustafa Gocken
Abstract: This study forecasts the next day's stock price as accurately as possible using Harmony Search (HS) and its variants (Improved Harmony Search (IHS), Global-Best Harmony Search (GHS), Self-Adaptive Harmony Search (SAHS), and Intelligent Tuned Harmony Search (ITHS)) together with an Artificial Neural Network (ANN). The advantages of the proposed models are that the useful information in the original stock data is found by input variable selection and, simultaneously, the most suitable number of hidden neurons in the hidden layer is discovered to mitigate overfitting/underfitting problems in the ANN. The results show that forecasts made by HS-ANN, IHS-ANN, GHS-ANN, SAHS-ANN, and ITHS-ANN tend to achieve hit rates above 89%, which is considerably better than previously proposed forecasting models in the literature. Hence, ANN models provide more valuable forecasting results for investors to hedge against potential risk in stock markets.
Keywords: stock price forecasting; artificial neural network; harmony search and its variants.
Recommendation System for Improving Churn Rate based on Action Rules and Sentiment Mining
by Yuehua Duan, Zbigniew Ras
Abstract: It is well recognized that customers are one of the most valuable assets to a company. Therefore, it is of significant value for companies to reduce the customer outflow. In this paper, we focus on identifying the customers with high chance of attrition and provide valid and trustworthy recommendations to improve their customer churn rate. To this end, we designed and implemented a recommender system that can provide actionable recommendations to improve customer churn rate. We used both transaction and survey data from heavy equipment repair and service sector from 2011 to 2017. This data was collected by a consulting company based in Charlotte, North Carolina. In the survey data, customers give their thoughts, feelings, expectations and complaints by free-form text. We applied aspect-based sentiment analysis on the review text data to gain insightful knowledge on customers' attitudes toward the service. Action rule mining and meta-action triggering mechanism are used to recognize the actionable strategies to help with reducing customer churn.
Keywords: Action Rule Mining; Meta-actions; Aspect-based Sentiment Analysis; Recommender System; Reduct.
ONTOLOGY AND WEB USAGE MINING FOR WEB SITE MAINTENANCE
by Khaled Benali
Abstract: The search for information in the classical web is based essentially on the structure of the documents, and this makes the exploitation of the content almost impossible by the machines. In contrast, in the Semantic Web, machines can access resources through the semantic representation of content. In this regard, two domains, namely the web mining and the semantic web are closely linked: on the one hand, web-mining techniques help in the construction of the semantic web; on the other hand, the semantic web helps extract new knowledge. The present article discusses the problem of implementing Web Usage Mining in the semantic web for information retrieval from the web using ontology. Therefore, we present an approach that uses ontology and Web Usage Mining techniques for website maintenance. This work can help novice researchers start working in the field of web mining in the Semantic Web easily. Our approach will be tested on the ontology of a university website, which will be built and then enriched based on the extracted patterns on the Site Logs using an algorithm for the extraction of frequent itemsets. This approach aims to produce all the pages that are often accessible at the same time and throughout the same session to maintain the websites.
Keywords: Apriori; knowledge; Log File; Ontology; Web Usage Mining; Semantic Web and Website Maintenance.
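The idea of extracting pages that are often accessed together in the same session can be sketched with a small level-wise (Apriori-style) frequent itemset search. This is a minimal illustration under stated assumptions, not the author's algorithm. The function name `frequent_itemsets`, the page paths, and the `min_support` value are hypothetical, and sessions are assumed to be already reconstructed from the site logs.

```python
from itertools import combinations


def frequent_itemsets(sessions, min_support):
    """Return {itemset: support} for every set of pages that appears
    together in at least `min_support` sessions (level-wise search)."""
    sessions = [frozenset(s) for s in sessions]

    def support(itemset):
        # Number of sessions containing every page in the itemset.
        return sum(1 for s in sessions if itemset <= s)

    pages = {page for s in sessions for page in s}
    level = {frozenset([p]) for p in pages
             if support(frozenset([p])) >= min_support}
    result = {}
    k = 1
    while level:
        for itemset in level:
            result[itemset] = support(itemset)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        }
        level = {c for c in candidates if support(c) >= min_support}
    return result


if __name__ == "__main__":
    # Hypothetical sessions from a university website log.
    sessions = [
        {"/home", "/courses", "/exams"},
        {"/home", "/courses"},
        {"/home", "/staff"},
        {"/courses", "/exams"},
    ]
    for pages, count in frequent_itemsets(sessions, min_support=2).items():
        print(sorted(pages), count)
```

The prune step is what keeps the search tractable: a set of pages is only counted if all of its smaller subsets were already found to be frequent.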
OPTIMIZING DATA QUALITY OF A DATA WAREHOUSE USING DATA PURGATION PROCESS
by Neha Gupta
Abstract: Data act as fuel for any science and technology operation, and due to the rapid growth of data collection and storage services, maintaining the quality of the data collected and stored is a major challenge. There are various data formats available, and they are categorized into three groups: structured, semi-structured and unstructured. Different data mining techniques are utilized to gather, refine and investigate the data, which in turn raises the issue of data quality management. The process of improving the quality of data without much alteration is known as data purgation. Data purgation occurs when the data is subject to the Extract, Transform and Load (ETL) methodology in order to maintain and improve the data quality. Metadata is the most important factor that affects the quality of the collected data. The data may contain unnecessary information and inappropriate symbols, which can be defined as dummy values, cryptic values or missing values. The present work improves the Expectation-Maximization algorithm with a dot product to handle cryptic data; the DBSCAN method with the Gower metric has been implemented to handle dummy values; Ward's algorithm with Minkowski distance has been applied to improve the results for contradicting data; and the K-means algorithm with the Euclidean distance metric has been applied to handle missing values in a dataset. These distance metrics have improved the data quality and also helped in providing consistent data to be loaded into a data warehouse. The above-mentioned algorithms have been modified to scan the database once and calculate the minimum support, thereby increasing efficiency as well as accuracy. The implementation of the algorithms has been tested on various datasets of different sizes with more than 1000 records. The proposed algorithms have helped in maintaining the accuracy, integrity, consistency, and non-redundancy of data in a timely manner.
Keywords: Data Warehouse (DW); Data Quality (DQ); Extract; Transform and Load (ETL); Data Purgation (DP).
A Deep-Learning Approach to Game Bot Identification via Behavioural Features Analysis in Complex Massively-Cooperative Environments
by Alfredo Cuzzocrea, Fabio Martinelli, Francesco Mercaldo
Abstract: The importance of the video game market has been continuously growing in recent years due to the continuous increase in the number of players. To maintain and increase enthusiasm among video game players, the games are continuously updated, and other major innovations are expected in the coming years. Thus, a community of players interested in the so-called Massively Multiplayer Online Role-Playing Games (MMORPGs) has developed. Players soon introduced the possibility of obtaining some kind of gain from competitions. However, some players have tried to obtain advantages and easy winnings by introducing game bots into the games. In order to maintain fairness among players, it is important to detect the presence of game bots during video games so that they can be expelled from the games. This paper describes an approach to distinguish human players from game bots based on behavioral analysis. In other words, the approach detects when player behavior is abnormal compared to normal human player behavior. Behavioral features extracted during running games are analyzed by supervised Machine Learning (ML) and Deep Learning (DL) algorithms. To detect game bots, the considered algorithms are first trained with labeled features and then used to classify previously unseen features. In this paper, the performance of our game bot detection approach is evaluated experimentally. The dataset we use for training and classification is extracted from the logs generated during online video game matches.
Keywords: Game Bot Detection; Complex Massively-Cooperative Environments; Machine Learning; Deep Learning.
Application of rule-based data mining in extracting the rules from the number of patients and climatic factors in instantaneous to long-term spectrum
by Sima Hadadian, Zahra Naji-Azimi, Naser Moatahari-Farimani, Behrouz Minaei-Bidgoli
Abstract: Predicting the number of patients helps managers to allocate resources in hospitals efficiently. In this research, the relationship between the number of patients and temperature, relative humidity, wind speed, air pressure, and air pollution was investigated using instantaneous, short-, medium- and long-term indices. A genetic algorithm and the ID3 decision tree have been used for feature selection, and classification based on a multidimensional association rule mining algorithm has been applied for rule mining. The data were collected over 19 months from a pediatric hospital whose wards are Nephrology, Hematology, Emergency, and PICU. The results show that in the long-term index, all climatic factors are correlated with the number of patients in all wards. Also, several if-then rules have been obtained, indicating the relationship between climatic factors in the four indices and the number of patients in each hospital ward. Based on these if-then rules, optimal planning can be done for resource allocation in the hospital.
Keywords: climatic factors; the number of patients; Classification Based on Multidimensional Association Rule Mining; Genetic Algorithm; ID3 Decision Tree.
Detecting cyberbullying in Spanish texts throughout deep learning techniques
by Paul Cumba, Diego Riofrio, Verónica Rodríguez, Joe Carrión
Abstract: Recently collected data suggest that it is possible to automatically detect events that may negatively affect the most vulnerable parts of our society by using communication technologies such as social networks or messaging applications. This research consolidates and prepares a corpus of Spanish bullying expressions taken from Twitter in order to use them as input to train a convolutional neuronal network through deep learning techniques. As a result of this training, a predictive model was created, which can identify Spanish cyberbullying expressions such as insults, racism, homophobic attacks, and so on.
Keywords: cyberbullying; deep learning; convolutional neuronal network; Spanish; social networks.
Special Issue on: IRICT 2019 Advances in Computational Intelligence and Data Science
Investigation of Contraction Process Issue in Fuzzy Min-Max Models
by Essam Alhroob, Mohammed Falah Mohammed, Fadhl Hujainah, Osama Nayel Al Sayaydeh, Ngahzaifa Ab Ghani
Abstract: The fuzzy min-max (FMM) network is one of the most powerful neural networks. It combines a neural network and fuzzy sets into a unified framework to address pattern classification problems. The FMM consists of three main learning processes, namely, hyperbox contraction, hyperbox expansion and hyperbox overlap tests. Despite its various learning processes, the contraction process is considered as one of the major challenges in the FMM that affects the classification process. Thus, this study aims to investigate the FMM contraction process precisely to highlight its usage consequences during the learning process. Such investigation can assist practitioners and researchers in obtaining a better understanding about the consequences of using the contraction process on the network performance. Findings of this study indicate that the contraction process used in FMM can affect network performance in terms of misclassification and incapability in handling the membership ambiguity of the overlapping regions.
Keywords: Pattern classification; Fuzzy min-max; FMM models; Contraction process.
Plagiarism Detection of Figure Images in Scientific Publications
by Taiseer Eisa
Abstract: Plagiarism is stealing others' work by using their words directly or indirectly without citing the source. Copying others' ideas is another type of plagiarism that may occur in many areas, but the most serious is academic plagiarism. Therefore, technical solutions are urgently required for the automatic detection of idea plagiarism. Detection of figure plagiarism is a particularly challenging field of research, because graphic features need to be analyzed in addition to text. This paper investigates the issues of idea and figure plagiarism and proposes a detection method that copes with both text and structure change. The procedure depends on finding similar semantic meanings between figures by applying image processing and semantic mapping techniques. The figures were compared using a representation of shape features based on detailed comparisons between the components of figures. This is an improvement over existing methods, which only compare the numbers and types of shapes inside figures.
Keywords: Plagiarism detection; figure plagiarism detection; idea plagiarism detection; academic plagiarism; structure change; text change; semantic meanings; image processing; semantic mapping techniques; scientific publications; content based algorithms.
Arabic Text Semantic-Based Query Expansion
by Nuhu Yusuf, Mohd Amin Mohd Yunus, Norfaradilla Wahid, Aida Mustapha, Nazri Mohd Nawi, Noor Azah Samsudin
Abstract: Query expansion is used in many search applications to retrieve relevant documents. Although retrieving relevant documents is important for search users, the complexity of Arabic morphology remains a challenge, and many irrelevant documents are still retrieved in the ranked results. To address this challenge, this paper proposes a new searching method for Arabic text semantic-based query expansion. The proposed method combines Arabic word synonyms and ontology to expand the query with additional terms. Specifically, it combines lexical words within the ranking algorithm and then uses ontology links to expand the query. The performance of Arabic text semantic-based query expansion was evaluated in terms of average precision, mean average precision and mean reciprocal rank. Experiments on Quran datasets show that the proposed method outperforms previous methods that used another dataset, the Tafsir dataset. The proposed method achieved 15.44% mean average precision.
Keywords: Arabic Text; Semantic Search; Query Expansion; Lexical Words; Ontology; Ranking Algorithms.
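The evaluation measures named in the abstract above (average precision, mean average precision and mean reciprocal rank) can be illustrated with a minimal Python sketch. It uses one common convention (average precision normalized by the number of relevant documents, each document appearing at most once in a ranking) and may differ from the authors' exact evaluation protocol. The function names and document identifiers are hypothetical.

```python
def average_precision(ranked, relevant):
    """Average of precision@k taken at each rank k that holds a relevant document."""
    hits = 0
    total = 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k  # precision at rank k
    # Assumption: normalize by the number of relevant documents.
    return total / len(relevant) if relevant else 0.0


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / k
    return 0.0


def mean_over_queries(metric, runs):
    """Average a per-query metric over (ranked_list, relevant_set) pairs."""
    return sum(metric(ranked, relevant) for ranked, relevant in runs) / len(runs)
```

Mean average precision is then `mean_over_queries(average_precision, runs)` and mean reciprocal rank is `mean_over_queries(reciprocal_rank, runs)`, where `runs` holds one ranked list and one relevant set per query.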
A Hybrid Feature Selection Method Combining Gini Index and Support Vector Machine with Recursive Feature Elimination for Gene Expression Classification
by Talal Almutiri, Faisal Saeed
Abstract: Microarray datasets suffer from the curse of dimensionality because they contain a large number of genes but few samples, and this high dimensionality increases computational cost and complexity. Feature selection (FS) addresses this by choosing informative genes that can improve the effectiveness of classification. In this study, a hybrid feature selection method is proposed that combines the Gini Index and Support Vector Machine with Recursive Feature Elimination (GI-SVM-RFE); it calculates a weight for each gene and recursively selects only ten genes as the informative genes. To measure the impact of the proposed method, the experiments include four scenarios: a baseline without feature selection, GI feature selection, SVM-RFE feature selection, and GI combined with SVM-RFE. Eleven microarray datasets were used. The proposed method improved classification accuracy compared with previous studies.
Keywords: Classification; Feature Selection; Gene Expression; Gini Index; Microarray; Recursive Feature Elimination.
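As an illustration of the Gini Index filter stage only, the following minimal Python sketch scores each feature by the largest reduction in Gini impurity obtainable from a single threshold split and keeps the top-ranked features. This is a common formulation, assumed here because the abstract does not state how the authors compute the index. The SVM-RFE stage is not shown. The function names are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum(p_c^2) over the class proportions p_c."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_gini_gain(values, labels):
    """Largest impurity reduction from one threshold split on a single feature."""
    parent = gini_impurity(labels)
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * gini_impurity(left)
                    + len(right) * gini_impurity(right)) / n
        best = max(best, parent - weighted)
    return best


def rank_features(matrix, labels, top_k):
    """matrix: list of samples, each a list of feature (gene expression) values.

    Returns the indices of the top_k features by Gini gain, plus all scores.
    """
    n_features = len(matrix[0])
    scores = [best_gini_gain([row[j] for row in matrix], labels)
              for j in range(n_features)]
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order[:top_k], scores
```

In a hybrid pipeline of the kind the abstract describes, the features retained by such a filter would then be passed to a recursive elimination step. How the two stages are actually coupled in GI-SVM-RFE is not specified in the abstract.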
Fast Parallel Computation of PageRank Scores with Improved Convergence Time
by Hema Dubey, Nilay Khare
Abstract: PageRank is a prominent link-based approach used by many search engines to rank their search results. The PageRank algorithm iteratively recomputes the PageRank of web pages until the scores converge. The computational cost of this algorithm is very high for very large web graphs. To overcome this drawback, this paper proposes a fast parallel computation of PageRank that uses a standard deviation technique to normalize the PageRank score of each web page. The proposed method was tested on standard datasets from the Stanford Large Network Dataset Collection, on a multicore machine using the CUDA programming paradigm. The experiments show that the proposed fast parallel PageRank algorithm needs fewer iterations to converge than the existing parallel PageRank method. Across nine standard datasets, the proposed algorithm achieves a speed-up of about 2 to 10 times over the existing algorithm.
Keywords: PageRank; Normalization; Standard Deviation; Parallel Computation; GPU; CUDA.
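The iterative computation that the abstract above refers to can be sketched as a plain, sequential power iteration in Python. This is the textbook formulation, not the authors' method: it includes neither their standard-deviation normalization nor CUDA parallelism, since the abstract gives no details of either. The function name, the damping factor of 0.85, the tolerance and the handling of dangling nodes are assumptions.

```python
def pagerank(out_links, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank on a graph given as {node: [successor, ...]}.

    Returns the rank of every node and the number of iterations performed.
    """
    nodes = sorted(set(out_links) | {v for vs in out_links.values() for v in vs})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    iteration = 0
    for iteration in range(1, max_iter + 1):
        # Assumption: rank held by dangling nodes (no out-links) is spread uniformly.
        dangling = sum(rank[u] for u in nodes if not out_links.get(u))
        new_rank = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for u in nodes:
            targets = out_links.get(u, [])
            for v in targets:
                new_rank[v] += damping * rank[u] / len(targets)
        # Stop once the L1 change between successive score vectors is below tol.
        delta = sum(abs(new_rank[u] - rank[u]) for u in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank, iteration
```

The returned iteration count is the quantity that the abstract reports as reduced by the proposed normalization. Each sweep over the edges is also the step that a GPU implementation would parallelize.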