International Journal of Data Mining, Modelling and Management (35 papers in press)
Performance of Authorship Attribution Classifiers with Short Texts: Application of Religious Arabic Fatwas
by Mohammed Al-Sarem, Abdel-Hamid Emara, Ahmed Abdel Wahab
Abstract: Although authorship attribution is a well-known problem in the authorship analysis domain, research on Arabic texts is still limited. Moreover, the performance of attribution methods on training sets of short textual documents has not been well examined even in other languages, such as English, Chinese, Spanish and Dutch. Therefore, this work examines the performance of attribution classifiers in the context of short Arabic textual documents. The experimental part of this work is conducted with well-known classifiers, namely the C4.5 decision tree method, the naive Bayes model, the k-NN method, the Markov model, SMO and Burrows' Delta method. We experiment with various feature combinations. The results show that combining word-based lexical features with structural features yields the best accuracy, so we use this combination as a baseline for further investigation. We also examine the effect of adding n-gram features; the results indicate that some classifiers improve while others do not. In addition, the results show that the naive Bayes method gives the highest accuracy among all the attribution classifiers.
Keywords: Authorship Attribution; Stylometric Features; Attribution Classifiers; JGAAP tool; Arabic Language.
Extracting useful reply-posts for text forum threads summarisation using quality features and classification methods
by Akram Osman, Naomie Salim
Abstract: Text forum threads contain a large amount of information furnished by users discussing a specific topic. At times, certain thread reply-posts are entirely off-topic, deviating from the main discussion; this negatively affects users' willingness to continue replying to the discussion. Thus, a user may prefer to read a selection of reply-posts that provides a short summary of the topic under discussion. The objective of this paper is to choose quality reply-posts on the topic raised in the initial post, which also serve as a brief summary. We offer an exhaustive examination of the conversational patterns of threads on the basis of 12 quality features. These features help ensure the selection of relevant reply-posts for the thread summary. Experimental outcomes obtained on two datasets show that the presented techniques considerably enhance performance in selecting initial-post/reply pairs for text forum thread summarisation.
Keywords: information retrieval; initial-post replies pairs; text data; text forum threads; text forum threads summarisation; text summarisation; thread retrieval.
Phish Webpage Classification Using Hybrid Algorithm of Machine Learning and Statistical Induction Ratios
by Hiba Zuhair, Ali Selamat
Abstract: Although conventional machine learning-based anti-phishing techniques outperform their competitors in phishing detection, they are still evaded by zero-hour phish webpages due to the constraints of their phishing induction. Therefore, phishing induction must be boosted with the extraction of new features, the selection of robust subsets of decisive features, and the active learning of classifiers on a big webpage stream. In this paper, we propose a hybrid feature-based classification algorithm (HFBC) for decisive phish webpage classification. HFBC hybridizes two statistical criteria, Optimized Feature Occurrence (OFC) and Phishing Induction Ratio (PIR), with the induction settings of the most salient machine learning algorithms, Naïve Bayes.
Keywords: phish webpage; machine learning; optimized feature occurrence; phishing induction ratio; hybrid feature-based classifier.
Application of hyperconvergent platform for Big Data in exploring regional innovation systems
by Alexey Finogeev, Leyla Gamidullaeva, Sergey Vasin
Abstract: The authors developed a decentralized hyperconvergent analytical platform for the collection and processing of Big Data in order to explore the monitoring processes of distributed objects in the regions on the basis of a multi-agent approach. The platform is intended for the modular integration of tools for searching, collecting, processing and mining big data from cyber-physical and cyber-social objects. The results of the intellectual analysis are used to assess integrated criteria for the effectiveness of innovation systems and for distributed monitoring and forecasting of the dynamics of the influence of various factors on technological and socio-economic processes. The work analyzes convergent and hyperconvergent systems and substantiates the necessity of creating a multi-agent decentralized platform for Big Data collection and analytical processing. The article proposes the principles of a streaming architecture for integrated analytical data processing to resolve the problems of searching, parallel processing, data mining and uploading of information into cloud storage. The paper also considers the main components of the hyperconvergent analytical platform, and a new concept of a distributed ETLM (Extraction, Transformation, Loading, Mining) system is introduced.
Keywords: innovation system; convergence; convergent platform; hyperconvergent system; intellectual analysis; Big Data; multi-agent approach; ETLM (Extraction; Transformation; Loading; Mining) system.
A quest for better anomaly detectors
by Mehdi Soleymani
Abstract: Anomaly detection is a popular method for detecting exceptional observations, which are very rare. It is frequently used in medical diagnosis, fraud detection, and similar applications. In this article, we revisit some popular algorithms for anomaly detection and investigate why the quest for a better algorithm for identifying anomalies continues. We propose a new algorithm which, unlike other popular algorithms, does not look for outliers directly but finds them by iteratively removing the inliers (the opposite of outliers). We present an extensive simulation study comparing the performance of the proposed algorithm with that of its competitors.
Keywords: Anomaly detection; Algorithm; k-Nearest Neighbour.
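The core idea of the entry above, finding outliers by iteratively removing inliers rather than hunting for outliers directly, can be sketched in a few lines. This is a toy illustration only, not the authors' algorithm; the function names, the centroid-distance notion of "inlier" and the parameter defaults are illustrative assumptions.

```python
# Toy sketch of anomaly detection by iterative inlier removal: each round
# drops the points closest to the current centroid (the inliers), and the
# survivors are treated as outlier candidates. Not the paper's algorithm.
from statistics import mean

def peel_inliers(points, rounds=2, keep_ratio=0.5):
    """Return outlier candidates left after `rounds` of inlier removal."""
    survivors = list(points)
    for _ in range(rounds):
        if len(survivors) <= 1:
            break
        c = mean(survivors)                      # centroid of current set
        # keep the points farthest from the centroid; drop the inliers
        survivors.sort(key=lambda x: abs(x - c), reverse=True)
        survivors = survivors[: max(1, int(len(survivors) * keep_ratio))]
    return survivors

data = [10.1, 9.9, 10.0, 10.2, 9.8, 10.05, 9.95, 55.0]  # one clear outlier
print(peel_inliers(data))
```

After two peeling rounds the clear outlier 55.0 survives as a candidate while the bulk of the data around 10 has been removed; a real detector would add a final test to separate true outliers from the remaining borderline points.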
The bootstrap procedure in classification problems
by Borislava Vrigazova, Ivan Ivanov
Abstract: In classification problems, cross validation draws random samples from the dataset in order to improve the model's ability to properly classify new observations into the respective class. Research articles from various fields show that, when applied to regression problems, the bootstrap can improve either the prediction ability of the model or its ability for feature selection. The purpose of our research is to show that the bootstrap, as a model selection procedure in classification problems, can outperform cross validation. We compare the performance measures of cross validation and the bootstrap on a set of classification problems and analyze their practical advantages and disadvantages. We show that the bootstrap procedure can accelerate execution time compared to cross validation while preserving the accuracy of the classification model. This advantage of the bootstrap is particularly important for big datasets, as the time needed to fit the model can be reduced without decreasing the model's performance.
Keywords: logistic regression; decision tree; k-nearest neighbor; the bootstrap; cross validation.
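As a concrete, library-free illustration of the comparison described above, the sketch below contrasts k-fold cross-validation with an out-of-bag bootstrap estimate of classification accuracy. The nearest-centroid "model", the dataset and the parameter choices are toy assumptions, not the paper's experimental setup.

```python
# Compare two model-assessment procedures on a toy 1-D, two-class problem:
# k-fold cross-validation vs. bootstrap resampling with out-of-bag testing.
import random

def nearest_centroid_fit(X, y):
    """Per-class centroids for 1-D features."""
    return {lab: sum(x for x, l in zip(X, y) if l == lab) /
                 sum(1 for l in y if l == lab)
            for lab in set(y)}

def accuracy(cents, X, y):
    preds = [min(cents, key=lambda c: abs(x - cents[c])) for x in X]
    return sum(p == t for p, t in zip(preds, y)) / len(y)

def kfold_cv(X, y, k=5):
    scores = []
    for i in range(k):
        fold = set(range(i, len(X), k))          # every k-th index held out
        tr = [j for j in range(len(X)) if j not in fold]
        cents = nearest_centroid_fit([X[j] for j in tr], [y[j] for j in tr])
        scores.append(accuracy(cents, [X[j] for j in fold], [y[j] for j in fold]))
    return sum(scores) / k

def bootstrap_estimate(X, y, b=20, seed=0):
    rng = random.Random(seed)
    n, scores = len(X), []
    for _ in range(b):
        idx = [rng.randrange(n) for _ in range(n)]   # resample with replacement
        oob = [j for j in range(n) if j not in idx]  # out-of-bag test set
        if not oob or len({y[j] for j in idx}) < 2:
            continue                                  # need a usable split
        cents = nearest_centroid_fit([X[j] for j in idx], [y[j] for j in idx])
        scores.append(accuracy(cents, [X[j] for j in oob], [y[j] for j in oob]))
    return sum(scores) / len(scores)

# Two well-separated classes, so both estimates should be near 1.0.
X = [0.1, 0.2, 0.3, 0.4, 0.5, 5.1, 5.2, 5.3, 5.4, 5.5]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(kfold_cv(X, y), bootstrap_estimate(X, y))
```

Note that each bootstrap replicate refits on n resampled points and tests on the points left out of the sample, which is what allows the procedure to trade the fixed k refits of cross-validation for a tunable number of replicates.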
Weighted LSTM for Intrusion Detection and Data Mining to Prevent Attacks
by Meryem Amar, Bouabid El Ouahidi
Abstract: The usage of cloud opportunities brings not only resource and storage availability but also puts customers' privacy at stake. Cloud services are carried out using web applications that generate log files at every level of the computing infrastructure, and these files contain valuable information for tracking malicious behaviors and identifying attackers. Unfortunately, they scale up to high Velocity, big Volume and Variant formats.
Deep Learning is robust in analyzing high-dimensional databases, dynamically selecting relevant features and detecting intrusions with greater accuracy and a reduced loss error rate. This paper first proposes a Data Preparation Treatment (DPT) method that structures diverse input log files, anticipates missing features, and performs a weighted conversion of categorical features to ease the discrimination of normal behaviors from malicious ones. It then leverages the strength of a Weighted Long Short-Term Memory (WLSTM) Deep Learning algorithm to mine network traffic predictors with respect to past events while minimizing vanishing gradient values. The solution also proposes a data mining approach, based on a Decision Trees model, to prevent the occurrence of a set of consecutive attacks.
The results prove the robustness of the proposed architecture: it achieved 98% accuracy in detecting attacks and reduced the False Alarm Rate to only 1.47%. For contextual malicious behaviors, the accuracy attained 97% and the loss was 22%.
Keywords: Cloud security breaches; Intrusion detection; Weight of Evidence; Deep Learning; Long Short-Term Memory (LSTM).
A New Quantitative Method for Simplifying Complex Fuzzy Cognitive Maps
by Mamoon Obiedat, Ali Al-yousef, Mustafa Banikhalaf, Khairallah Al Talafha
Abstract: The fuzzy cognitive map (FCM) is a qualitative soft computing approach that addresses uncertain human perceptions of diverse real-world problems. The map depicts a problem in the form of problem nodes and cause-effect relationships among them. Complex problems often produce complex maps that may be difficult to understand or use for prediction, and therefore such maps need to be simplified. Previous studies applied subjective simplification/condensation processes, grouping similar variables into one variable in a qualitative manner. This paper proposes a quantitative method for simplifying FCMs. It uses the spectral clustering technique to group related variables into new clusters without human intervention. Initially, improvements were added to this clustering technique to properly handle FCM matrix data. Then, the proposed method was evaluated on an application dataset to validate its appropriateness for FCM simplification. The results showed that the method successfully classified the dataset into meaningful clusters.
Keywords: Soft computing; fuzzy cognitive map model; complex problems; FCM simplification; spectral clustering; topological overlap matrix; decision support systems.
Proposal and Study of Statistical Features for String Similarity Computation and Classification
by Erick Rodrigues, Aura Conci, Esteban Clua, Panos Liatsis
Abstract: Strings are usually compared using language-related information such as taxonomies and dictionaries, which can be challenging. In this work, adaptations of features commonly applied in the field of visual computing, the Co-Occurrence Matrix (COM) and the Run-Length Matrix (RLM), are used to compute the similarity of strings in general (words, phrases, codes and texts). The proposed features do not rely on language-related information; they are purely statistical and can be used in any context, with any language or textual structure. Other statistical measures commonly employed in the field, such as Longest Common Subsequence, Maximal Consecutive Longest Common Subsequence, Mutual Information and Edit Distances, are evaluated and compared to the proposed features. The devised experiments consist of training and testing classifiers on features extracted from pairs of strings. Various classification algorithms are evaluated, including neural networks, function- or rule-based classifiers and decision trees. The proposed features provide promising results. In the first, synthetic set of experiments, the Co-Occurrence and Run-Length features outperform the remaining state-of-the-art statistical groups of features; in 3 out of 4 cases, the RLM and COM features were statistically significantly better than the second-best group of features, which is based on distances (p-value < 0.001). The second set of experiments uses a real text plagiarism dataset; in this case, the RLM features obtained the best results.
Keywords: word comparison; string similarity; classification; statistical features; text mining; ocr; text plagiarism; text entailment; supervised learning.
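To make the idea of purely statistical string features concrete, the sketch below builds a character co-occurrence count (a 1-D analogue of the visual-computing co-occurrence matrix mentioned above) and compares two strings by cosine similarity of those counts. The exact features, offsets and classifiers used in the paper differ; this is an illustrative assumption.

```python
# Describe a string by counting ordered pairs of characters a fixed offset
# apart (a 1-D co-occurrence "matrix"), then compare two strings by cosine
# similarity of those counts. No dictionaries or language knowledge needed.
from collections import Counter
from math import sqrt

def cooccurrence(s, offset=1):
    """Count ordered pairs of characters `offset` positions apart."""
    return Counter(zip(s, s[offset:]))

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

sim_close = cosine(cooccurrence("data mining"), cooccurrence("data mining!"))
sim_far = cosine(cooccurrence("data mining"), cooccurrence("zzzzqqqq"))
print(round(sim_close, 3), round(sim_far, 3))
```

Nearly identical strings share almost all character pairs and score close to 1, while unrelated strings share none and score 0; a classifier would then be trained on vectors of such similarity features rather than on the raw strings.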
Application of Structural Equation Modeling in Iranian Tourism Researches: Challenges and Guidelines
by Seyyed Mohammad Mirtaghian Rudsari, Najmeh Gharibi
Abstract: The main purpose of this study is to identify and analyze the challenges in using Structural Equation Modeling (SEM) in tourism research in Iran. The paper examines how Iranian scholars have used the technique, using a sample of 172 papers published in the top five tourism journals published in Farsi (i.e., Persian). The results indicate that there is often a lack of discussion of sample size, normality of distribution, effect analysis and the role of coefficients of determination; additionally, selective and arbitrary reporting of fit indices is not uncommon. The paper also emphasizes the role of theory in constructing such models.
Keywords: Structural Equation Modeling (SEM); Covariance Based SEM; Partial Least Squares SEM; Challenges and Misuse; Iranian Tourism Research.
New perspectives on deep neural networks in decision support in surgery
by Konstantin Savenkov, Vladimir Gorbachenko, Anatoly Solomakha
Abstract: The paper considers the development of a neural network system for predicting complications after acute appendicitis operations. A neural network with a deep architecture has been developed. As a learning set, a dataset developed by the authors from real clinic data was used. To select significant features, a method based on the interquartile range of the F1-score is proposed. For preliminary processing of the training data, it is proposed to use an overcomplete autoencoder. The overcomplete autoencoder converts the selected features into a space of higher dimension, which, according to Cover's theorem, facilitates separating cases corresponding to complications from those that do not. To overcome overfitting of the network, the dropout method was used. The neural network is implemented using the Keras and TensorFlow libraries. The trained neural network showed high diagnostic metrics on the test dataset.
Keywords: neural networks; feature selection; learning neural networks; overfitting; overcomplete autoencoder; medical diagnostics.
Modelling and Visualizing Emotions in Twitter Feeds
by Satish M. Srinivasan, Abhishek Tripathi
Abstract: Predictive analytics on Twitter feeds is becoming a popular field of research. A tweet holds a wealth of information on how an individual expresses and communicates their feelings and emotions within their social network. Large-scale collection, cleaning and mining of tweets will help in capturing not only an individual's emotions but also the emotions of a larger group. However, capturing a large volume of tweets and identifying the emotions expressed in them is a very challenging task. In this study, an emotion-based classification scheme is proposed. Initially, a synthetic dataset is built by randomly picking instances from different training datasets. Using this newly constructed dataset, the classifiers are trained (model building). Finally, emotions are predicted on the test datasets using the generated models, built by training the Naïve Bayes Multinomial and Random Forest classifiers.
Keywords: emotion classification; twitter data analysis; US presidential election; supervised classifier; Random Forest; Naïve Bayes Multinomial.
Pursuing Efficient Data Stream Mining by Removing Long Patterns from Summaries
by Po-Jen Chuang, Yun-Sheng Tu
Abstract: Frequent pattern mining is a useful data mining technique. It can help dig out frequently used patterns from massive Internet data streams for significant applications and analyses. To improve mining accuracy and reduce the required processing time, this paper proposes a new approach that removes rarely used long patterns from the pattern summary to preserve space for more frequently used short patterns, thereby enhancing the performance of existing frequent pattern mining algorithms. Extensive simulation runs are carried out to evaluate the proposed approach. The results show that it can strengthen mining performance by effectively reducing the required run time and substantially increasing mining accuracy.
Keywords: Data Streams; Frequent Pattern Mining; Pattern Summary; Length Skip; Performance Evaluation.
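The space-management idea of the entry, evicting rarely used long patterns to keep room for frequent short ones, can be sketched with a capacity-bounded summary. The eviction rule and class names here are illustrative assumptions, not the authors' exact algorithm.

```python
# Toy fixed-capacity pattern summary for stream mining: when full, evict
# the least-frequent pattern, breaking ties in favour of removing the
# LONGEST pattern, so frequent short patterns tend to survive.

class PatternSummary:
    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = {}            # pattern (tuple of items) -> frequency

    def add(self, pattern):
        pattern = tuple(pattern)
        if pattern not in self.counts and len(self.counts) >= self.capacity:
            # least frequent first; among equals, longer patterns go first
            victim = min(self.counts, key=lambda p: (self.counts[p], -len(p)))
            del self.counts[victim]
        self.counts[pattern] = self.counts.get(pattern, 0) + 1

s = PatternSummary(capacity=3)
for p in [("a",), ("a",), ("a", "b"), ("a", "b", "c", "d"), ("e",)]:
    s.add(p)
print(sorted(s.counts))
```

When the new pattern ("e",) arrives and the summary is full, the tie between the two count-1 patterns is resolved against the four-item pattern, which is evicted first.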
Data Modeling for Large-Scale Social Media Analytics: Design Challenges and Lessons Learned
by Ahmet Arif Aydin, Kenneth M. Anderson
Abstract: We live in a world of big data; organizations collect, store, and analyze large volumes of data for various purposes. The five Vs of big data (volume, velocity, variety, veracity and value) introduce new challenges for software developers to handle when performing data processing and analysis. Indeed, data modeling is one of the most challenging and critical aspects of big data because it determines how large volumes of data will be structured and stored; these decisions, in turn, impact how that data can be processed and analyzed. In this paper, we report on our work designing a data model for storing and analyzing large volumes of Twitter data in support of a research domain known as crisis informatics. In this work, we build on the data model provided by columnar NoSQL data stores and identify a set of column families that can efficiently index, sort, and store large Twitter data sets and support flexible data analysis. In particular, our column families are designed to achieve fast and efficient batch data processing of the stored Twitter data. We provide an evaluation of these claims and discuss additional issues that will be the focus of our future work.
Keywords: Data modeling; social media analytics; big data analytics; NoSQL.
Investigating the Impact of Preprocessing on Document Embedding: An Empirical Comparison
by Nourelhouda Yahi, Hacene Belhadef
Abstract: Digital representation of text documents is a crucial task in machine learning and Natural Language Processing (NLP). It aims to transform unstructured text documents into mathematically-computable elements. In recent years, several methods have been proposed and implemented to encode text documents into fixed-length feature vectors, an operation known as document embedding, which has become an interesting and open area of research. Paragraph Vector (Doc2vec) is one of the most used document embedding methods and has gained a good reputation thanks to its good results. To overcome its limits, Doc2vec was extended with the Document through Corruption (Doc2vecC) technique. To gain a deeper view of these two methods, this work presents a study of the impact of morphosyntactic text preprocessing on both document embedding methods. We performed this analysis by applying the most-used text preprocessing techniques, such as Cleaning, Stemming and Lemmatisation, and their different combinations. Experimental analysis on the Microsoft Research Paraphrase (MSRP) dataset reveals that the preprocessing techniques improve classifier accuracy, and that Stemming methods outperform the other techniques.
Keywords: Natural Language Preprocessing; Document Embedding; Paragraph Vector; Document through Corruption; Text Preprocessing; Semantic Similarity.
A comprehensive review of deep learning for natural language processing
by Amal Bouraoui, Salma Jamoussi, Abdelmajid Ben Hamadou
Abstract: Deep learning has attracted considerable attention across many Natural Language Processing (NLP) domains. Deep learning models aim to learn embeddings of data with multiple levels of abstraction through multiple layers, for either labeled structured input data or unlabeled unstructured input data. Currently, two research trends have emerged in building higher-level embeddings. On one hand, a strong trend in deep learning leads towards increasingly powerful and complex models. On the other hand, multi-purpose sentence representations based on simple sums or averages of word vectors were recently shown to be effective. Furthermore, improving the performance of deep learning methods with attention mechanisms has become a research hotspot in the last four years. In this paper, we seek to provide a comprehensive review of recent studies in building Neural Network (NN) embeddings applied to NLP tasks. We provide a walk-through of deep learning's evolution and a description of a variety of its architectures. We present and compare the performance of several deep learning models on standard datasets for different NLP tasks. We also present some deep learning challenges for natural language processing.
Keywords: Deep Learning; Word Embedding; Sentence Embedding; Attention Mechanism; Compositional Models; Convolutional NNs; Recurrent/Recursive NNs; Multi-purpose Sentence Embedding; Natural Language Processing.
Special Issue on: IFIP CIIA 2018 Advanced Research in Computational Intelligence
Recommendation of Items Using a Social-based Collaborative Filtering Approach and Classification Techniques
by Lamia Berkani
Abstract: With the development of Web 2.0 and social media, the study of social-based recommender systems has emerged. The social relationships among users can improve the accuracy of recommendation. However, with the large amount of data generated every day in social networks, the use of classification techniques becomes a necessity: clustering-based approaches reduce the search space by grouping similar users or items together. We focus in this paper on personalized item recommendation in a social context. Our approach combines, in different ways, a social filtering algorithm and the traditional user-based collaborative filtering algorithm. The social information is formalized by social-behavior metrics such as the friendship, commitment and trust degrees of users. Moreover, two classification techniques are used: an unsupervised technique applied initially to all users and a supervised technique applied to newly added users. Finally, the proposed approach has been tested on different existing datasets. The obtained results show the contribution of integrating social information into collaborative filtering and the added value of using classification techniques with the different algorithms in terms of recommendation accuracy.
Keywords: Item Recommendation; Collaborative filtering; Social filtering; Supervised Classification; Unsupervised classification.
Distributed Heterogeneous Ensemble Learning on Apache Spark for Ligand Based Virtual Screening
by Karima Sid, Mohamed Batouche
Abstract: Virtual screening is one of the most common Computer-Aided Drug Design techniques; it applies computational tools and methods to large libraries of molecules to extract drug candidates. Ensemble learning is a recent paradigm introduced to improve machine-learning results in terms of predictive performance and robustness, and it has been successfully applied in ligand-based virtual screening (LBVS) approaches. However, applying ensemble learning to huge molecular libraries is computationally expensive, so distributing and parallelizing the task with sophisticated frameworks such as Apache Spark has become a significant step. In this paper, we propose a new heterogeneous ensemble learning approach, HEnsL_DLBVS, distributed on Spark, to improve large-scale LBVS results. To handle the problem of imbalanced big training datasets, we propose a novel hybrid technique. We generate new training datasets to evaluate the approach. Experimental results confirm the effectiveness of our approach, with satisfactory accuracy and superiority over homogeneous models.
Keywords: virtual screening; big data; computer-aided drug design; Apache Spark; machine learning; drug discovery; ensemble learning; imbalanced datasets; Spark MLlib; ligand-based virtual screening.
Hash-Processing of Universal Quantification-Like Queries Dealing with Requirements and Prohibitions
by Noussaiba Benadjimi, Khaled Walid Hidouci
Abstract: This paper focuses on flexible universal quantification-like queries that handle positive and negative preferences (requirements and prohibitions) simultaneously. We emphasize the performance improvement of the considered operator by proposing new variants of the classical Hash-Division algorithm. The issue of ranking answers is also dealt with. Our work targets in-memory database systems (also called main-memory database systems) with very large volumes of data; in these systems, all the data is primarily stored in the RAM of a computer. We have introduced a parallel implementation of the operator that takes the data skew issue into account. Our empirical analysis of both the sequential and parallel versions shows the relevance of our approach: the new processing of the mixed operator in a main-memory database achieves better performance than the conventional ones and becomes faster through parallelism.
Keywords: Universal quantification queries; Relational division; Relational anti-division; Main-memory databases; Flexible division; Hash-division.
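The entry builds on the classical hash-division algorithm for relational division ("which dividend keys are related to all divisor values?"). The sketch below shows only that textbook baseline in plain Python; the paper's variants add preferences, prohibitions, answer ranking and parallelism on top of it.

```python
# Classical hash-division for relational division: hash the divisor once,
# keep one bitmap per dividend key marking which divisor values it has
# been seen with, and return the keys whose bitmaps are fully set.

def hash_division(dividend, divisor):
    """dividend: iterable of (key, value); divisor: set of required values.
    Returns the set of keys related to every divisor value."""
    slot = {v: i for i, v in enumerate(divisor)}    # divisor hash table
    bitmaps = {}                                     # key -> seen-value bitmap
    for key, value in dividend:
        if value in slot:                            # ignore irrelevant values
            bitmaps.setdefault(key, [False] * len(slot))[slot[value]] = True
    return {k for k, bm in bitmaps.items() if all(bm)}

# Which students are enrolled in ALL required courses?
enrolled = [("amy", "db"), ("amy", "os"), ("bob", "db"),
            ("cal", "os"), ("cal", "db"), ("cal", "ai")]
required = {"db", "os"}
print(hash_division(enrolled, required))
```

A single pass over the dividend suffices, which is what makes the algorithm attractive for the large main-memory tables targeted by the paper.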
An Enhanced Cooperative Method to Solve Multiple Sequence Alignment Problem
by Lamiche Chaabane
Abstract: In this research study, we propose a novel cooperative approach called dynamic simulated particle swarm optimization (DSPSO), based on metaheuristics and a pairwise dynamic programming (DP) procedure, to find an approximate solution to the multiple sequence alignment (MSA) problem. The developed approach applies the particle swarm optimization (PSO) algorithm to explore the search space globally and the simulated annealing (SA) technique to improve the quality of the population leader in order to escape local optima. The dynamic programming technique is then integrated as an improvement mechanism to raise the quality of the worst solution and to increase the convergence speed of the proposed approach. Simulation results on BAliBASE benchmarks have shown the ability of the proposed method to produce good-quality alignments compared to those given by other existing methods in the literature.
Keywords: Cooperative approach; multiple sequence alignment; DSPSO; PSO; SA; DP; BAliBASE benchmarks.
A Formal Theoretical Framework for a Flexible Classification Process
by Ismail Biskri
Abstract: The classification process is a complex technique that connects language, text, information and knowledge theories with computational formalization, statistical and symbolic approaches, standard and non-standard logics, etc. This process should always be under the control of the user, according to their subjectivity, their knowledge and the purpose of their analysis. It thus becomes important to create platforms that support the design of classification tools, their management, and their adaptation to new needs and experiments. In recent years, several platforms for mining data, including textual data, in which classification is the main functionality have emerged. However, they lack flexibility and formal foundations. We propose in this paper a formal model with strong logical foundations based on applicative type systems.
Keywords: Classification; Flexibility; Applicative Systems; Operators/Operands; Combinatory Logics; Inferential Calculus; Compositionality; Processing Chains; Modules; Discovery Process; Collaborative Intelligent Science.
Graph-Based Cumulative Score Using Statistical Features for Multilingual Automatic Text Summarization
by Abdelkrime Aries, Djamel Eddine Zegour, Walid Khaled Hidouci
Abstract: Multilingual summarization has begun to receive more attention in recent years. Many approaches can be used to achieve it, among them statistical and graph-based approaches. Our idea is to combine these two approaches into a new extractive text summarization method. Surface statistical features are used to calculate a primary score for each sentence. The graph is used to select candidate sentences and to calculate a final score for each sentence based on its primary score and those of its neighbors in the graph. We propose four variants for calculating the cumulative score of a sentence. Since the order of sentences is an important aspect of summary readability, we also propose algorithms that generate the summary based not just on final scores but on sentence connections in the graph. The method is tested using the MSS corpus of the MultiLing'15 workshop and the ROUGE metric. It is evaluated against some well-known methods and gives promising results.
Keywords: Automatic text summarization; Graph-based summarization; Statistical features; Multilingual summarization; Extractive summarization.
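The two-stage scoring described above can be sketched as follows: a primary statistical score per sentence, then a cumulative score mixing in the primary scores of the sentence's graph neighbours. The word-frequency feature, the Jaccard threshold and the 0.5/0.5 weighting are illustrative assumptions, not the paper's four variants.

```python
# Two-stage sentence scoring: (1) primary score from a surface statistical
# feature, (2) cumulative score mixing a sentence's own score with the
# average score of its neighbours in a vocabulary-overlap graph.
from collections import Counter

def tokenize(s):
    return s.lower().split()

def primary_scores(sentences):
    # surface feature: average document-frequency of a sentence's words
    freq = Counter(w for s in sentences for w in tokenize(s))
    return [sum(freq[w] for w in tokenize(s)) / len(tokenize(s))
            for s in sentences]

def neighbours(sentences, threshold=0.2):
    # connect sentences whose vocabularies overlap enough (Jaccard)
    sets = [set(tokenize(s)) for s in sentences]
    adj = [[] for _ in sentences]
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            if len(sets[i] & sets[j]) / len(sets[i] | sets[j]) >= threshold:
                adj[i].append(j)
                adj[j].append(i)
    return adj

def cumulative_scores(sentences, alpha=0.5):
    prim = primary_scores(sentences)
    adj = neighbours(sentences)
    # an isolated sentence simply keeps its primary score
    return [alpha * prim[i] + (1 - alpha) *
            (sum(prim[j] for j in adj[i]) / len(adj[i]) if adj[i] else prim[i])
            for i in range(len(sentences))]

sents = ["the cat sat on the mat",
         "the cat ate fish",
         "dogs bark loudly"]
scores = cumulative_scores(sents)
print([round(x, 3) for x in scores])
```

The two related sentences reinforce each other through the graph and outrank the isolated one; an extractive summarizer would then pick the top-scoring sentences, ordered by their connections in the graph.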
An ontology-based modelling and reasoning for alert correlation
by Tayeb Kenaza
Abstract: A good defense strategy utilizes multiple solutions such as firewalls, IDSs, antivirus software, AAA servers, VPNs, etc. However, these tools can easily generate hundreds of thousands of events per day. A security information and event management system (SIEM for short) is a centralized solution that collects information from these tools and uses correlation techniques to build a reliable picture of the underlying monitored system. The SIEM is a modern and powerful security tool thanks to several functions, such as normalization and aggregation, that make the collected data useful. It provides security operators with a dashboard and helps them in forensic analysis when an incident is reported. Its most important function is event correlation, through which security operators can get a precise and quick picture of threats and attacks in real time. The quality of that picture depends on the efficiency of the reasoning approach adopted to put together the pieces of information provided by several analyzers. However, most proprietary SIEMs use their own data representations and correlation techniques, which are not always favorable to sharing knowledge or to incremental or collaborative reasoning. In this paper, we propose a semantic approach based on Description Logics (DLs), a powerful tool for knowledge representation and reasoning. Indeed, an ontology provides a comprehensive environment to represent information for intrusion detection and allows easy maintenance of the information and the addition of new knowledge. We implemented a rule-based engine for alert correlation based on the proposed ontology, and two attack scenarios are carried out to show the usefulness of the proposed approach.
Keywords: Intrusion detection; Alert correlation; Rule-based reasoning; Ontology; OWL.
Convolutional Neural Network with Stacked Autoencoders for Predicting Drug-Target Interaction and Binding Affinity
by Meriem Bahi, Mohamed Batouche
Abstract: The prediction of novel drug-target interactions (DTIs) is critically important for drug repositioning, as it can lead researchers to find new indications for existing drugs and to reduce the cost and time of the de novo drug development process. To explore new ways towards this goal, we propose two novel methods, named SCA-DTIs and SCA-DTA, to predict drug-target interactions and drug-target binding affinities (DTA), respectively, based on a Convolutional Neural Network (CNN) with Stacked Autoencoders (SAE). Initializing a CNN's weights with the filters of trained stacked autoencoders yields superior performance. Moreover, to boost the performance of DTI prediction, we propose a new method called RNDTIs to generate reliable negative samples. Tests on different benchmark datasets show that the proposed method can achieve excellent prediction performance, with an accuracy of more than 99%. These results demonstrate the potential of the proposed model for DTI and DTA prediction, thereby improving the drug repurposing process.
Keywords: Stacked Autoencoders; Convolutional Neural Network; Semi-Supervised Learning; Deep Learning; Drug Repositioning; Drug-Target Interaction; Binding Affinity.
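The key trick of seeding a CNN with autoencoder-learned filters can be sketched in a few lines. A plain linear autoencoder stands in for the paper's stacked autoencoders, and the patch size, filter count and data are invented for illustration:

```python
import numpy as np

def train_linear_autoencoder(X, n_hidden, lr=0.01, epochs=200, seed=0):
    """Gradient-descent linear autoencoder; returns the encoder weights."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, (n_hidden, d))   # encoder
    W2 = rng.normal(0.0, 0.1, (d, n_hidden))   # decoder
    for _ in range(epochs):
        H = X @ W1.T                 # encode
        R = H @ W2.T                 # reconstruct
        dR = 2.0 * (R - X) / n       # gradient of mean squared error
        W2 -= lr * (dR.T @ H)
        W1 -= lr * ((dR @ W2).T @ X)
    return W1

def conv2d_valid(image, kernel):
    """Naive 'valid' 2D cross-correlation."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.empty((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Learn 4 filters of size 3x3 from random patches, then use them to
# initialise a first convolutional layer instead of random weights.
rng = np.random.default_rng(1)
patches = rng.normal(size=(500, 9))            # 500 flattened 3x3 patches
filters = train_linear_autoencoder(patches, n_hidden=4).reshape(4, 3, 3)

image = rng.normal(size=(8, 8))
feature_maps = np.stack([conv2d_valid(image, f) for f in filters])
print(feature_maps.shape)                      # (4, 6, 6)
```

In the full method, the filters would come from nonlinear stacked autoencoders trained on real drug/target representations, and training would then continue end-to-end.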
Efficient Deployment Approach of Wireless Sensor Networks on 3D Terrains
by Mostefa Zafer, Mustapha Reda Senouci, Mohamed Aissani
Abstract: Ensuring the coverage of a Region of Interest (RoI) when deploying a Wireless Sensor Network (WSN) is an objective that depends on several factors, such as the detection capability of the used sensor nodes and the topography of the RoI. To address the topography challenges, in this paper we propose a new WSN deployment approach based on the idea of partitioning the RoI into sub-regions with relatively simple topography, then allocating to each constructed sub-region the necessary number of sensor nodes and finding their appropriate positions to maximize the coverage quality. The performance evaluation of this approach, coupled with three different deployment methods named DMSA (Deployment Method based on Simulated Annealing), GDM (Greedy Deployment Method), and RDM (Random Deployment Method), has revealed its relevance, since it helped to significantly improve the coverage quality of the RoI.
Keywords: Wireless Sensor Networks; 3D terrains; Deployment; Coverage.
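A greedy placement step of the kind GDM performs can be sketched on a toy 3D terrain. The grid, relief function, sensing range and sensor budget below are all invented for illustration:

```python
import numpy as np

# Toy terrain: a 10x10 grid of target points whose z-coordinate follows
# a smooth surface, standing in for a sub-region with simple topography.
xs, ys = np.meshgrid(np.arange(10.0), np.arange(10.0))
zs = np.sin(xs / 3.0) * np.cos(ys / 3.0)
targets = np.column_stack([xs.ravel(), ys.ravel(), zs.ravel()])

def greedy_deploy(targets, sensing_range, n_sensors):
    """Place each sensor where it covers the most still-uncovered targets."""
    covered = np.zeros(len(targets), dtype=bool)
    chosen = []
    for _ in range(n_sensors):
        gains = [np.sum(~covered &
                        (np.linalg.norm(targets - c, axis=1) <= sensing_range))
                 for c in targets]          # candidate positions = grid points
        best = int(np.argmax(gains))
        chosen.append(best)
        covered |= np.linalg.norm(targets - targets[best], axis=1) <= sensing_range
    return chosen, covered.mean()

chosen, ratio = greedy_deploy(targets, sensing_range=3.0, n_sensors=5)
print(len(chosen), round(ratio, 2))
```

Partitioning the RoI first keeps each such sub-problem small and its terrain simple, which is what makes per-sub-region allocation tractable.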
Special Issue on: ICBBD 2019 Business, Big Data and Decision Sciences
Modelling Attrition to Know Why Your Employees Leave or Stay
by Sachin Deshmukh, Seema Sant, Neerja Kashive
Abstract: Today's environmental factors influence every aspect of business, be it Marketing, Finance, Operations or Human Resources policies. Increased globalization and technological developments have resulted in fierce competition among companies, and talent shortage has become a global issue for organizations. One of the major challenges faced by any organization is the increase in the level of employee attrition. Attrition up to a certain limit is good for any organization, as it enables it to inject new blood and ideas that can help in developing competitive advantage. But attrition beyond a certain limit can prove unhealthy, as talented employees may go elsewhere in search of greener pastures. Data analytics is an effective tool to delve into the problem of attrition: predictive models are being used to understand the factors responsible for attrition and to predict the probability that an employee will leave the organization. The current study builds a predictive model using logistic regression to understand the specific factors that lead to attrition. This paper also compares the factors responsible for attrition in two time periods, from 1996 to 2008 (Holtom's model) and from 2009 to 2016, to find whether any changes have taken place in employees' expectations which, if not fulfilled, may lead to attrition. An analysis of an IT organization's data reveals that the factors responsible for attrition in the second period have changed compared with the first period.
Keywords: Attrition; Predictive Model; Logistic Regression.
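The paper's modelling step can be sketched as plain gradient-descent logistic regression on synthetic HR data. The two features and their effect sizes below are invented for illustration, not taken from the study:

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, epochs=500):
    """Unregularised logistic regression fitted by gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted attrition probability
        g = p - y                                # gradient of the log-loss
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

# Synthetic employees: attrition becomes likelier as satisfaction and
# tenure (both standardised) decrease.
rng = np.random.default_rng(0)
n = 400
tenure = rng.normal(0.0, 1.0, n)
satisfaction = rng.normal(0.0, 1.0, n)
X = np.column_stack([tenure, satisfaction])
logit = -0.5 * tenure - 1.0 * satisfaction
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(float)

w, b = fit_logistic(X, y)
print(w.round(2))   # both coefficients should come out negative
```

The fitted signs and magnitudes are what the study interprets: a strongly negative coefficient flags a factor whose decline drives employees to leave.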
Long Text to Image Converter for Financial Reports
by Chia-Hao Chiu, Yun-Cheng Tsai, Ho-Lin F
Abstract: In this study, we propose a novel article analysis method. The method converts the article classification problem into an image classification problem by projecting texts into images and then applying CNN models for classification. We call the method the Long Text to Image Converter (LTIC). The features are extracted automatically from the generated images, so there is no need for an explicit step of embedding the words or characters into numeric vector representations, which saves pre-processing and experimentation time. This study uses the financial domain as an example: a company's financial report contains a chapter describing the company's financial trends, with many financial terms used to infer the company's current and future financial position. The LTIC achieved an excellent confusion matrix and test accuracy, with results indicating an 80% accuracy rate, and it produced excellent results during practical application, classifying corporate financial reports under review. The return on simulated investment is 46%. In addition to tangible returns, the LTIC method reduces the time required for article analysis and can provide article classification references in a short period to facilitate decision making.
Keywords: Article Analysis; Convolutional Neural Network; Financial Analysis; Long Text to Image Converter.
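One simple way such a text-to-image projection could look is to map character codes to grayscale pixels; the mapping and image size below are assumptions for illustration, not the paper's actual encoding:

```python
import numpy as np

def text_to_image(text, size=32):
    """Project a text onto a size x size grayscale image, row-major.

    Characters beyond size*size are truncated; shorter texts are zero-padded.
    """
    codes = [ord(c) % 256 for c in text[: size * size]]
    codes += [0] * (size * size - len(codes))
    return np.array(codes, dtype=np.uint8).reshape(size, size)

img = text_to_image("The company reported improved operating margins this quarter.")
print(img.shape, img.dtype)
```

The resulting arrays can be fed directly to a standard image-classification CNN, which is what lets LTIC skip explicit word or character embeddings.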
E-Learning process through text mining for academic literacy
by Maira Alejandra Pulgarin Roriguez, Bárbara Maricely Fierro Chong, Erica María Ossa Taborda
Abstract: This paper presents the results of research carried out in a virtual Faculty of Education at a private university in Colombia. It consists of the characterization of students' reading and writing comprehension abilities for academic literacy, and it verifies the effectiveness of implementing an E-learning platform for all the programs of the Faculty. In line with university policy, there is a methodological procedure for text mining through specific keywords applicable to different text typologies in specialized areas. The platform allows professors and students to develop expertise in their disciplines, using text mining as an interdisciplinary strategy to build knowledge and improve quality in their professional context.
Keywords: Text mining; terminological work; cognitive processes; E-learning; academic literacy; reading comprehension; academic writing.
Association Rules in Mobile Game Operation
by Muning Chang
Abstract: Mobile games now play a significant role in the gaming industry as the Internet continues to develop. Given the economic and cultural value of mobile games, it is very important for gaming companies to maintain and further improve product quality to remain competitive in the industry, and the operation team plays a key role in maintaining product profitability after a game is released.
This paper analyzes gaming data collected during operation and proposes operation strategies accordingly. A correlation coefficient algorithm suitable for time sequences is proposed, in which the association is defined by the similarity between data: the level of association between two time sequences is reflected in the probability of the occurrence of such an association. On this basis, we analyze a recent popular mobile game in depth to explore the correlation between the number of users online, the number of new players, and the retention rate. The study found that there are two fatigue periods, at approximately days 30 and 120, when there is a high likelihood of user loss, which is important to consider in strategic planning for game operation.
Keywords: Mobile Games; Association Rules; Sequence Correlation; Operation Optimization.
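The paper's similarity-based coefficient is not reproduced here, but the idea of associating two operational time sequences can be sketched with a plain (optionally lagged) Pearson correlation over simulated metrics; the decay rates and shapes below are made up:

```python
import numpy as np

def lagged_correlation(x, y, lag=0):
    """Pearson correlation between x[t] and y[t + lag]."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if lag > 0:
        x, y = x[:-lag], y[lag:]
    elif lag < 0:
        x, y = x[-lag:], y[:lag]
    return float(np.corrcoef(x, y)[0, 1])

# Simulated operation metrics over 200 days: acquisitions decay,
# and the online count shares the trend plus a weekly cycle.
days = np.arange(200)
new_players = 1000.0 * np.exp(-days / 60.0)
online = 5000.0 * np.exp(-days / 60.0) + 50.0 * np.sin(days / 7.0)

r = lagged_correlation(new_players, online)
print(round(r, 3))
```

Scanning `lag` over a window (e.g. -14 to 14 days) is one way to locate delayed effects such as the fatigue periods the study reports around days 30 and 120.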
A Multivariate Copula-based SUR Probit Model: Application to Insolvency Probability of Enterprises
by Woraphon Yamaka, Paravee Maneejuk
Abstract: The purpose of this study is to introduce a more flexible joint distribution for a Probit model with more than two equations, the so-called SUR Probit model. The main idea of the suggested method is to use a multivariate copula to link the errors of the equations in the SUR Probit model. We conduct a simulation study to assess the performance of the model and then apply it to a real economic problem: the insolvency probability of small and medium enterprises in Thailand. The study considers three economic sectors and speculates on the dependencies among them. The copula-based SUR Probit model shows better performance in both the simulation and the application study. In addition, it is found to be suitable for explaining the causal effect of companies' financial statements on their insolvency probability, and it brings out challenging results for Thai enterprises.
Keywords: Multivariate Copula; Multivariate Probit Model; Small and Medium Enterprises; Financial Statements; Insolvency Probability.
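The structure being estimated can be illustrated by simulating a three-equation probit system with correlated errors. Gaussian coupling is used here as the simplest special case of a copula link, and the coefficients and correlation matrix are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Error correlation across the three equations (the role the copula plays).
R = np.array([[1.0, 0.6, 0.3],
              [0.6, 1.0, 0.4],
              [0.3, 0.4, 1.0]])
L = np.linalg.cholesky(R)
eps = rng.normal(size=(n, 3)) @ L.T        # correlated latent errors

x = rng.normal(size=(n, 3))                # one regressor per equation
beta = np.array([0.5, -0.8, 1.2])          # hypothetical coefficients
y = (x * beta + eps > 0).astype(int)       # binary outcome per equation

outcome_corr = np.corrcoef(y.T)
print(outcome_corr.round(2))
```

Fitting the model runs this logic in reverse: the betas and the dependence structure are chosen so the implied joint binary distribution matches the observed insolvency outcomes across sectors; a non-Gaussian copula would allow asymmetric or tail dependence that this Gaussian sketch cannot express.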
Hedging Agriculture Commodities Futures with Histogram data: A Markov Switching Volatility and Correlation model
by Woraphon Yamaka, Pichayakone Rakpho, Paravee Maneejuk
Abstract: In this study, a flexible bivariate Markov Switching Dynamic Copula GARCH model is developed for histogram-valued data to calculate the optimal portfolio weight and the optimal hedge ratio. The model extends the Markov Switching Dynamic Copula GARCH by allowing all estimated parameters to be regime dependent. The histogram data are constructed from 5-minute wheat spot and futures returns. We compare our proposed model with other bivariate GARCH models through AIC, BIC, and hedge effectiveness. The empirical results show that our model is slightly better than the conventional methods, with the lowest AIC and BIC and the highest hedge effectiveness, indicating that the proposed model is quite effective in reducing risks in portfolio returns.
Keywords: Hedging strategy; Markov Switching; Time-varying dependence; Histogram data; Wheat.
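The quantities being optimised have a simple static counterpart: the minimum-variance hedge ratio and the hedge-effectiveness measure, shown here on simulated spot and futures returns rather than the paper's regime-switching copula-GARCH estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.normal(0.0, 1.0, 1000)             # futures returns
s = 0.8 * f + rng.normal(0.0, 0.3, 1000)   # spot returns, correlated with futures

C = np.cov(s, f)
h = C[0, 1] / C[1, 1]                      # minimum-variance hedge ratio Cov(s,f)/Var(f)
he = 1.0 - np.var(s - h * f) / np.var(s)   # hedge effectiveness: variance reduction
print(round(h, 2), round(he, 2))
```

In the proposed model, the covariance and variance in the ratio become time-varying and regime dependent, so the hedge ratio itself switches with the market state.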
Special Issue on: IRICT 2019 Advances in Computational Intelligence and Data Science
Investigation of Contraction Process Issue in Fuzzy Min-Max Models
by Essam Alhroob, Mohammed Falah Mohammed, Fadhl Hujainah, Osama Nayel Al Sayaydeh, Ngahzaifa Ab Ghani
Abstract: The fuzzy min-max (FMM) network is one of the most powerful neural networks. It combines a neural network and fuzzy sets into a unified framework to address pattern classification problems. The FMM consists of three main learning processes, namely hyperbox contraction, hyperbox expansion and hyperbox overlap tests. Despite its various learning processes, the contraction process is considered one of the major challenges in the FMM that affect the classification process. Thus, this study investigates the FMM contraction process precisely, to highlight the consequences of its use during the learning process. Such an investigation can assist practitioners and researchers in obtaining a better understanding of the consequences of using the contraction process on network performance. The findings of this study indicate that the contraction process used in FMM can affect network performance in terms of misclassification and incapability in handling the membership ambiguity of overlapping regions.
Keywords: Pattern classification; Fuzzy min-max; FMM models; Contraction process.
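The contraction step under study can be illustrated in one dimension. The sketch below implements just one of the overlap cases from Simpson's original FMM (partial overlap split at the midpoint); the full algorithm distinguishes several cases per dimension:

```python
def contract_1d(a, b):
    """Resolve overlap between 1-D hyperboxes a=(min, max) and b=(min, max).

    Handles the case a_min < b_min < a_max < b_max by splitting the
    overlapping interval at its midpoint, as in Simpson's contraction rule.
    """
    a_min, a_max = a
    b_min, b_max = b
    if a_min < b_min < a_max < b_max:
        mid = (b_min + a_max) / 2.0
        return (a_min, mid), (mid, b_max)
    return a, b                      # other cases left untouched in this sketch

a, b = contract_1d((0.2, 0.6), (0.4, 0.9))
print(a, b)
```

The study's point is visible even here: samples that fell in the original overlap (say 0.45) are reassigned to exactly one box after contraction, discarding the membership ambiguity that the overlap expressed.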
Plagiarism Detection of Figure Images in Scientific Publications
by Taiseer Eisa
Abstract: Plagiarism is stealing others' work by using their words directly or indirectly without a credit citation. Copying others' ideas is another type of plagiarism that may occur in many areas, but the most serious is academic plagiarism. Therefore, technical solutions are urgently required for automatic detection of idea plagiarism. Detection of figure plagiarism is a particularly challenging field of research, because not only the text but also graphic features need to be analyzed. This paper investigates the issues of idea and figure plagiarism and proposes a detection method that copes with both text and structure changes. The procedure depends on finding similar semantic meanings between figures by applying image processing and semantic mapping techniques. Figures are compared using a representation of shape features, based on detailed comparisons between the components of the figures. This is an improvement over existing methods, which only compare the numbers and types of shapes inside figures.
Keywords: Plagiarism detection; figure plagiarism detection; idea plagiarism detection; academic plagiarism; structure change; text change; semantic meanings; image processing; semantic mapping techniques; scientific publications; content based algorithms.
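A component-level comparison of the kind described can be sketched once each figure has been reduced to (shape type, label) components; the extraction itself (image processing and semantic mapping) is assumed to have happened already, and the similarity measure here is a simple multiset Jaccard, not the paper's method:

```python
from collections import Counter

def figure_similarity(components_a, components_b):
    """Multiset Jaccard similarity over (shape type, label) components."""
    ca, cb = Counter(components_a), Counter(components_b)
    inter = sum((ca & cb).values())          # shared components, with multiplicity
    union = sum((ca | cb).values())
    return inter / union if union else 1.0

flowchart_a = [("rect", "input"), ("diamond", "valid?"), ("rect", "process"), ("oval", "end")]
flowchart_b = [("rect", "input"), ("diamond", "valid?"), ("rect", "transform"), ("oval", "end")]
print(figure_similarity(flowchart_a, flowchart_b))  # 3 shared of 5 distinct -> 0.6
```

Comparing labelled components rather than raw shape counts is what lets such a method survive structural edits like renaming or reordering a single box.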
Arabic Text Semantic-Based Query Expansion
by Nuhu Yusuf, Mohd Amin Mohd Yunus, Norfaradilla Wahid, Aida Mustapha, Nazri Mohd Nawi, Noor Azah Samsudin
Abstract: Query expansion is used in many search applications to retrieve relevant documents. Although retrieving relevant documents is important for search users, the complexity of Arabic morphology remains a challenge, and many irrelevant documents are still retrieved in the ranked results. To address this challenge, this paper proposes a new searching method for Arabic text semantic-based query expansion. The proposed method combines Arabic word synonyms and an ontology to expand the query with additional terms: lexical words are combined within the ranking algorithm and then improved with ontology links to expand the query. The performance of the Arabic text semantic-based query expansion was evaluated in terms of average precision, mean average precision and mean reciprocal rank. Experiments on Quran datasets show that the proposed approach outperforms previous methods that used another dataset, called the Tafsir dataset. The proposed method achieved a mean average precision of 15.44%.
Keywords: Arabic Text; Semantic Search; Query Expansion; Lexical Words; Ontology; Ranking Algorithms.
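The synonym-expansion step can be sketched with a toy lexicon. English stand-in terms are used here because the actual Arabic synonym lists and ontology are not given in the abstract:

```python
# Hypothetical synonym map; the paper would draw these from an Arabic
# lexicon and follow ontology links for further related terms.
synonyms = {"mercy": ["compassion", "grace"], "light": ["illumination"]}

def expand_query(query, synonym_map):
    """Append each query term's synonyms to the term list."""
    terms = query.lower().split()
    expanded = list(terms)
    for t in terms:
        expanded.extend(synonym_map.get(t, []))
    return expanded

print(expand_query("divine mercy", synonyms))
# ['divine', 'mercy', 'compassion', 'grace']
```

The expanded term list is then scored by the ranking algorithm, so documents matching a synonym can be retrieved even when they never contain the original surface form.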
A Hybrid Feature Selection Method Combining Gini Index and Support Vector Machine with Recursive Feature Elimination for Gene Expression Classification
by Talal Almutiri, Faisal Saeed
Abstract: Microarray datasets suffer from the curse of dimensionality because of their large number of genes and low number of samples, and this high dimensionality leads to computational cost and complexity. Consequently, feature selection (FS), the process of choosing informative genes, can help improve the effectiveness of classification. In this study, a hybrid feature selection method is proposed that combines the Gini Index and a Support Vector Machine with Recursive Feature Elimination (GI-SVM-RFE): it calculates a weight for each gene and recursively selects only ten genes as the informative genes. To measure the impact of the proposed method, the experiments include four scenarios: a baseline without feature selection, GI feature selection, SVM-RFE feature selection, and combining GI with SVM-RFE. Eleven microarray datasets were used. The proposed method showed an improvement in classification accuracy when compared with previous studies.
Keywords: Classification; Feature Selection; Gene Expression; Gini Index; Microarray; Recursive Feature Elimination.
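The hybrid idea, scoring genes by both a Gini-style impurity gain and a linear model's weight magnitudes, then recursively dropping the weakest, can be sketched on synthetic data. A perceptron stands in for the SVM, and the scoring combination is an assumption of this sketch, not the paper's exact formula:

```python
import numpy as np

def gini_gain(x, y):
    """Gini impurity decrease from a median split on one feature."""
    def impurity(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)
    mask = x <= np.median(x)
    if mask.all() or not mask.any():
        return 0.0
    n = len(y)
    child = (mask.sum() / n) * impurity(y[mask]) \
          + ((~mask).sum() / n) * impurity(y[~mask])
    return impurity(y) - child

def perceptron_weights(X, y, epochs=20, lr=0.1):
    """|w| from a simple perceptron, standing in for the SVM weight vector."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):          # y in {-1, +1}
            if yi * (xi @ w + b) <= 0:
                w += lr * yi * xi
                b += lr * yi
    return np.abs(w)

def gi_rfe(X, y, n_keep=10):
    """Recursively drop the feature with the lowest combined GI + weight score."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        Xs = X[:, remaining]
        gi = np.array([gini_gain(Xs[:, j], y) for j in range(Xs.shape[1])])
        wv = perceptron_weights(Xs, y)
        score = gi / (gi.max() + 1e-12) + wv / (wv.max() + 1e-12)
        remaining.pop(int(np.argmin(score)))
    return remaining

rng = np.random.default_rng(0)
y = np.repeat([-1, 1], 30)
X = rng.normal(size=(60, 40))    # 60 samples, 40 "genes"
X[:, 5] += y                     # make genes 5 and 17 informative
X[:, 17] += 2 * y
selected = gi_rfe(X, y, n_keep=10)
print(selected)
```

Combining the two scores is what distinguishes the hybrid from plain SVM-RFE: a gene must look discriminative both univariately (Gini) and in the multivariate decision boundary to survive elimination.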