International Journal of Data Mining, Modelling and Management
These articles have been peer-reviewed and accepted for publication but are pending final changes, are not yet published and may not appear here in their final order of publication until they are assigned to issues. Therefore, the content conforms to our standards but the presentation (e.g. typesetting and proof-reading) is not necessarily up to the Inderscience standard. Additionally, titles, authors, abstracts and keywords may change before publication. Articles will not be published until the final proofs are validated by their authors.
Forthcoming articles must be purchased for the purposes of research, teaching and private study only. These articles can be cited using the expression "in press". For example: Smith, J. (in press). Article Title. Journal Title.
International Journal of Data Mining, Modelling and Management (23 papers in press)
Abstract: The Web of Things (WoT) uses Web technologies to connect embedded objects to each other and to deliver services to stakeholders. The context of these interactions (the situation) is a key source of information, which can sometimes be uncertain. In this paper, we focus on the development of intelligent web services. The main requirements for an intelligent service are to deal with context diversity, to represent context semantically and to reason with uncertain information. From this perspective, we propose a framework for intelligent services to deal with various contexts, to respond reactively to real-time situations and to predict future situations proactively. For the semantic representation of context, we use PR-OWL, a probabilistic ontology based on Multi-Entity Bayesian Networks. PR-OWL is flexible enough to represent complex and uncertain contexts. We validate our framework with an intelligent plant watering use case to show its reasoning capabilities.
Keywords: Smart Web Service; the Web of Things; Context Reasoning; Proactive; Reactive; Multi-Entity Bayesian Networks; PR-OWL.
Intrusion detection using classification techniques: a comparative study
by Imad Bouteraa, Makhlouf Derdour, Ahmed Ahmim
Abstract: Today's highly connected world suffers from the increase and variety of cyber-attacks. To mitigate those threats, researchers have continuously explored different methods for intrusion detection in recent years. In this paper, we study the use of data mining techniques for intrusion detection. The research compares the performance of classification techniques for intrusion detection. To this end, we include 74 classification techniques in this comparative study. The study shows that no technique outperforms the others in all situations. However, some classification methods lead to promising results and give clues for further combinations.
Keywords: Data mining; Classification; Network Security; Intrusion detection; KDD99.
An Insight into Application of Big Data Analytics in Health Care
by Sravani Nalluri, Sasikala R
Abstract: The main aim of this paper is to comprehend different aspects of big data, to gain insight into the current research trends in the application of big data in health care and to identify the different aspects of health care where it can be applied. This paper presents a brief analysis of applications of big data in health care. The main focus is on the aspects of health where big data is being used, the collection of data and the tools employed for big data analytics. In addition, the paper addresses the types of machine learning algorithms that were used in health care and the statistics used to compare the performance of these algorithms. Most of the health care data was collected from the University of California machine learning repository, from hospitals and from government agencies. Most of the researchers focused only on the prediction of diseases, emergency department visits or disease outbreaks with the help of the HADOOP and WEKA tools. Support vector machines, artificial neural networks, naive Bayes and decision trees were the algorithms commonly used for the prediction of diseases. The performance of the algorithms was compared statistically using accuracy. In our view, more research needs to be done on the application of big data analytics in other domains of health rather than just the prediction of disease.
Keywords: Big data; Hadoop; Machine learning algorithms; Healthcare; Map-reduce; Chronic diseases; Accuracy rate; Prevention; Analytics.
Grey Relational Classification Algorithm for Software Fault Proneness with SOM Clustering
by Aarti Aarti, Geeta Sikka, Renu Dhir
Abstract: Estimation by human judgment to deal with the inherent uncertainty of software gives a vague and imprecise solution. To cope with this challenge, we propose a new hybrid analogy model based on the integration of GRA (grey relational analysis) classification with self-organizing map (SOM) clustering. In this paper, a new classification approach is proposed to distribute the data into similar groups. The attributes are selected based on GRC values. In the proposed approach, the similarity measure between the reference project and the cluster head is computed to determine the cluster to which the target project belongs. The fault-proneness of the reference project is estimated based on the regression equation of the selected cluster. The proposed algorithm gives users the flexibility to select n features for both continuous and categorical attributes. In this study, two scenarios based on the integration of the proposed classification with regression are proposed. Experimental results indicate that the proposed methodology can be used for the prediction of faults and produces plausible results when compared with the results of multilayer-perceptron, logistic regression, bagging, na
Keywords: Self organizing map (SOM); grey relational analysis (GRA); unsupervised classification; fault-proneness; object-oriented (OO).
Overlapping Community Detection With A Novel Hybrid Metaheuristic Optimization Algorithm
by Imane Messaoudi, Nadjet Kamel
Abstract: Social networks are ubiquitous in our daily life. Due to the rapid development of information and electronic technology, social networks are becoming more and more complex in terms of size and content. It is of paramount significance to analyze the structures of social networks in order to unveil the myth beneath complex social networks. Network community detection is recognized as a fundamental tool for social network analytics. As a consequence, numerous community detection methods have been proposed in the literature. In a real-world social network, an individual may possess multiple memberships, while the existing community detection methods are mainly designed for non-overlapping situations. With regard to this, this paper proposes a hybrid metaheuristic method to detect overlapping communities in social networks. In the proposed method, the overlapping community detection problem is formulated as an optimization problem and a novel bat optimization algorithm is designed to solve the established optimization model. To enhance the search ability of the proposed algorithm, a local search operator based on tabu search is introduced. To validate the effectiveness of the proposed algorithm, experiments on benchmark and real-world social networks are carried out. The experiments indicate that the proposed algorithm is promising for overlapping community detection.
Keywords: Overlapping Community; Modified Density; Tabu Search; Bat Algorithm; Link Clustering.
Bees Colonies For Detecting Communities Evolution Using Data WareHouse
by Yasmine Chaabani, Jalel Akaichi
Abstract: The analysis of social networks and their evolution has gained much interest in recent years. In fact, few methods have revealed and tracked meaningful communities over time. These methods also dealt efficiently with the structure and topic evolution of networks. In this paper, we propose a novel technique to track dynamic communities and their evolution behaviour. The main objective of our approach, which uses the Artificial Bee Colony (ABC) algorithm, is to trace the evolution of communities and to optimize our objective function to keep a proper partitioning. Moreover, we use a data warehouse as the mind of the bees to store the information on the structure of the different communities at every timestamp. The experimental results showed that the proposed method is efficient in discovering dynamic communities and tracking their evolution.
Keywords: Social Network; Community Detection; Bees Colonies.
A support Architecture to MDA Contribution for Data Mining
by Fatima MESKINE, Safia Nait-Bahloul
Abstract: The data mining process is the sequence of tasks applied to data in order to discover relations between them and so obtain knowledge. However, the data mining process lacks a formal specification that allows it to be modeled independently of platforms. MDA (Model Driven Architecture) is an approach for the development of software systems, based on the use of models to improve their productivity. Several research works have aligned the MDA approach with data mining on data warehouses, to specify the data mining process at a very high level of abstraction. In our work, we propose a support architecture that allows these works to be positioned at different abstraction levels on the basis of several criteria, with the aim of identifying the strengths of each level in terms of modelling and of gaining clear visibility of the MDA contribution to data mining.
Keywords: Data mining; Model Driven Architecture; Data warehouses; UML Profiles; Data Multidimensional Model; Transformation.
Emotion Mining From Text for Actionable Recommendations Detailed Survey
by Jaishree Ranganathan, Angelina Tzacheva
Abstract: In the era of Web 2.0, people express their opinions, feelings and thoughts about topics including political and cultural events, natural disasters, products and services, through mediums such as blogs, forums, and micro-blogs like Twitter. Also, a large amount of text is generated through e-mail which contains the writer's feeling or opinion; for instance, customer care service e-mail. The texts generated through such platforms are a rich source of data which can be mined in order to gain useful information about user opinion or feeling, which in turn can be utilized in specific applications such as marketing, sale predictions, political surveys, health care, student-faculty culture, e-learning platforms, and social networks. This process of identifying and extracting information about the attitude of a speaker or writer about a topic, polarity, or emotion in a document is called Sentiment Analysis. There is a variety of sources for extracting sentiment, such as speech, music and facial expression. Due to the rich source of information available in the form of text data, this paper focuses on sentiment analysis and emotion mining from text, as well as discovering actionable patterns. The actionable patterns may suggest ways to alter the user's sentiment or emotion to a more positive or desirable state.
Keywords: Actionable Pattern Mining; Data Mining; Text Mining; Sentiment Analysis.
Performance of Authorship Attribution Classifiers with Short Texts: Application of Religious Arabic Fatwas
by Mohammed Al-Sarem, Abdel-Hamid Emara, Ahmed Abdel Wahab
Abstract: Although authorship attribution is a well-known problem in the authorship analysis domain, research on Arabic contexts is still limited. In addition, the performance of attribution methods on training sets of short textual documents has not been examined well in other languages either, such as English, Chinese, Spanish and Dutch. Therefore, this work aims at examining the performance of attribution classifiers in the context of short Arabic textual documents. The experimental part of this work is conducted with well-known classifiers, namely: the Decision Tree C4.5 method, the Naive Bayes model, the K-NN method, the Markov Model, SMO and the Burrows Delta method. We experiment with various feature combinations. The results show that combining the word-based lexical features with the structural features yields the best accuracy. To this end, we use this combination as a baseline for further investigation. We also examine the effect of combining the n-gram features. The results indicate that some classifiers show an improvement while the others do not. In addition, the results show that the naive Bayes method gives the highest accuracy among all the attribution classifiers.
Keywords: Authorship Attribution; Stylometric Features; Attribution Classifiers; JGAAP tool; Arabic Language.
A survey of Term Weighting Schemes for Text Classification
by Abdullah Alsaeedi
Abstract: Text document classification approaches are designed to categorise documents into predefined classes. These approaches have two main components: document representation models and term-weighting methods. The high dimensionality of feature space has always been a major problem in text classification methods. To resolve high dimensionality issues and to improve the accuracy of text classification, various feature selection approaches were presented in the literature. Besides which, several term-weighting schemes were introduced that can be utilised for feature selection methods. This work surveys and investigates various term (feature) weighting approaches that have been presented in the text classification context.
Keywords: Document frequency; Supervised term weighting; Text classification; Unsupervised term weighting.
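As an illustration of what an unsupervised term-weighting scheme computes, the following minimal Python sketch implements the classical TF-IDF weighting. The abstract above does not say which schemes the survey covers, so TF-IDF is chosen here as an assumption; the function name `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenised document.

    Illustrative assumption: weight = (relative term frequency) * ln(N / df),
    the textbook TF-IDF form; it is not taken from the surveyed paper.
    """
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights


# Hypothetical toy corpus, already tokenised.
docs = [
    ["data", "mining", "text"],
    ["text", "classification"],
    ["data", "warehouse"],
]
w = tf_idf(docs)
```

In this toy corpus, "mining" occurs in one document and "data" in two, so "mining" receives the higher weight in the first document: terms that are rare across the collection count as more discriminative.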
Extracting useful reply-posts for text forum threads summarisation using quality features and classification methods
by Akram Osman, Naomie Salim
Abstract: Text forum threads contain a large amount of information furnished by users who discuss a specific topic. At times, certain thread reply-posts are entirely off-topic, thereby deviating from the main discussion. This negatively affects the user's preference to continue replying to the discussion. Thus, the user may prefer to read certain selected reply-posts that provide a short summary of the topic of the discussion. The objective of the paper is to choose quality reply-posts regarding a topic considered in the initial-post, which also serve as a brief summary. We offer an exhaustive examination of the conversational patterns of the threads on the basis of 12 quality features for analysis. These features can ensure the selection of relevant reply-posts for the thread summary. Experimental outcomes obtained using two datasets show that the presented techniques considerably enhanced the performance in selecting initial-post replies pairs for text forum threads summarisation.
Keywords: information retrieval; initial-post replies pairs; text data; text forum threads; text forum threads summarisation; text summarisation; thread retrieval.
Phish Webpage Classification Using Hybrid Algorithm of Machine Learning and Statistical Induction Ratios
by Hiba Zuhair, Ali Selamat
Abstract: Although the conventional machine learning-based anti-phishing techniques outperform their competitors in phishing detection, they are still targeted by zero-hour phish webpages due to their constraints of phishing induction. Therefore, phishing induction must be boosted up with the extraction of new features, the selection of robust subsets of decisive features, the active learning of classifiers on a big webpage stream. In this paper, we propose a hybrid feature-based classification algorithm (HFBC) for decisive phish webpage classification. HFBC hybridizes two statistical criteria Optimized Feature Occurrence (OFC) and Phishing Induction Ratio (PIR) with the induction settings of the most salient machine learning algorithms, Na
Keywords: phish webpage; machine learning; optimized feature occurrence; phishing induction ratio; hybrid feature-based classifier.
Online Detection of Blood-Spot Eggs Based on Levenberg-Marquardt Spectral Amplitude Space Conversion and BPNN
by Zhihui Zhu, Linfeng Wu, Huaixin Yu
Abstract: This paper aims to improve the generalization ability and discrimination accuracy during online blood-spot egg detection. To this end, this paper establishes a linear converter based on Levenberg-Marquardt (LM) algorithm to transform spectral signals. First, the transmittance in the 500~599nm band was chosen for our research. After the moving average filtering, the LM algorithm was employed to transform the space of spectral amplitude. Then, the converted amplitude signal was applied to establish the back propagation neural network (BPNN) model. The results show that the model established from the LM converted signal withstood impact of different batches and speeds on the discrimination accuracy, and improved the model accuracy to more than 96%. This study provides a feasible method for online nondestructive detection of blood-spot eggs.
Keywords: Blood-spot eggs; Online detection; Levenberg-Marquardt; Space conversion; Back propagation neural network (BPNN).
Application of hyperconvergent platform for Big Data in exploring regional innovation systems
by Alexey Finogeev, Leyla Gamidullaeva, Sergey Vasin
Abstract: The authors developed a decentralized hyperconvergent analytical platform for the collection and processing of Big Data in order to explore the monitoring processes of distributed objects in the regions on the basis of multi-agent approach. The platform is intended for modular integration of tools for searching, collecting, processing and big data mining from cyber-physical and cyber-social objects. The results of the intellectual analysis are used to assess the integrated criteria for the effectiveness of innovation systems of distributed monitoring and forecasting the dynamics of the influence of various factors on technological and socio-economic processes. The work analyzes convergent and hyperconvergent systems, substantiates the necessity of creating a multi-agent decentralized platform for Big Data collection and analytical processing. The article proposes the principles of streaming architecture for the data integration analytical processing to resolve the problems of searching, parallel processing, data mining and uploading of information into a cloud storage. The paper also considers the main components of the hyperconvergent analytical platform. A new concept of distributed ETLM (Extraction, Transformation, Loading, Mining) system is considered.
Keywords: innovation system; convergence; convergent platform; hyperconvergent system; intellectual analysis; Big Data; multi-agent approach; ETLM (Extraction; Transformation; Loading; Mining) system.
Special Issue on: IFIP CIIA 2018 Advanced Research in Computational Intelligence
by Lamia Berkani
Abstract: With the development of Web 2.0 and social media, the study of social-based recommender systems has emerged. The social relationships among users can improve the accuracy of recommendation. However, with the large amount of data generated every day in social networks, the use of classification techniques becomes a necessity. The clustering-based approaches reduce the search space by clustering similar users or items together. In this paper, we focus on personalized item recommendation in a social context. Our approach combines in different ways the social filtering algorithm and the traditional user-based collaborative filtering algorithm. The social information is formalized by social-behavior metrics such as the friendship, commitment and trust degrees of users. Moreover, two classification techniques are used: an unsupervised technique applied initially to all users and a supervised technique applied to newly added users. Finally, the proposed approach has been evaluated using different existing datasets. The obtained results show the contribution of integrating social information into collaborative filtering and the added value of using the classification techniques with the different algorithms in terms of recommendation accuracy.
Keywords: Item Recommendation; Collaborative filtering; Social filtering; Supervised Classification; Unsupervised classification.
Distributed Heterogeneous Ensemble Learning on Apache Spark for Ligand Based Virtual Screening
by Karima Sid, Mohamed Batouche
Abstract: Virtual screening is one of the most common Computer-Aided Drug Design techniques that apply computational tools and methods on large libraries of molecules to extract the drugs. Ensemble learning is a recent paradigm launched to improve machine-learning results in terms of predictive performance and robustness. It has been successfully applied in ligand-based virtual screening (LBVS) approaches. Applying ensemble learning on huge molecular libraries is computationally expensive. Hence, the distribution and parallelization of the task have become a significant step by using sophisticated frameworks such as Apache Spark. In this paper, we propose a new approach HEnsL_DLBVS, for heterogeneous ensemble learning, distributed on Spark to improve the large-scale LBVS results. To handle the problem of imbalanced big training datasets, we propose a novel hybrid technique. We generate new training datasets to evaluate the approach. Experimental results confirm the effectiveness of our approach with satisfactory accuracy and its superiority over homogeneous models.
Keywords: virtual screening; big data; computer-aided drug design; Apache Spark; machine learning; drug discovery; ensemble learning; imbalanced datasets; Spark MLlib; ligand-based virtual screening.
Hash-Processing of Universal Quantification-like Queries dealing with Requirements and Prohibitions.
by Noussaiba Benadjimi, Khaled Walid HIDOUCI
Abstract: This paper focuses on flexible universal quantification-like queries that simultaneously handle positive and negative preferences (requirements or prohibitions). We emphasize the performance improvement of the considered operator by proposing new variants of the classical Hash-Division algorithm. The issue of ranking answers is also dealt with. Our work targets in-memory database systems (also called main-memory database systems) with a very large volume of data. In these systems, all the data is primarily stored in the RAM of a computer. We have introduced a parallel implementation of the operator that takes into account the data skew issue. Our empirical analysis of both the sequential and parallel versions shows the relevance of our approach. The results demonstrate that the new processing of the mixed operator in a main-memory database achieves better performance than the conventional ones, and becomes faster through parallelism.
Keywords: Universal quantification queries; Relational division; Relational anti-division; Main-memory databases; Flexible division; Hash-division.
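For readers unfamiliar with the classical hash-division algorithm that the abstract above takes as its starting point, here is a minimal Python sketch of plain relational division with a divisor table and per-candidate bitmaps. It illustrates only the classical idea; the paper's new variants, its handling of requirements and prohibitions, and its parallel implementation are not reproduced. The function name `hash_division` and the sample data are hypothetical.

```python
def hash_division(dividend, divisor):
    """Return the candidates paired with every value in the divisor.

    dividend: iterable of (candidate, value) pairs.
    divisor:  iterable of values that a candidate must all be paired with.
    """
    # Divisor table: map each distinct divisor value to a bit position.
    bit_of = {}
    for value in divisor:
        bit_of.setdefault(value, len(bit_of))

    pairs = list(dividend)
    if not bit_of:
        # Division by an empty divisor: every candidate qualifies.
        return {candidate for candidate, _ in pairs}
    full_mask = (1 << len(bit_of)) - 1

    # Quotient table: one bitmap per candidate, marking matched divisor values.
    seen = {}
    for candidate, value in pairs:
        bit = bit_of.get(value)
        if bit is not None:  # values outside the divisor are ignored
            seen[candidate] = seen.get(candidate, 0) | (1 << bit)

    # A candidate is in the quotient when all divisor bits are set.
    return {c for c, mask in seen.items() if mask == full_mask}


# Hypothetical example: which people have all the required skills?
has_skill = [
    ("alice", "sql"), ("alice", "python"),
    ("bob", "sql"),
    ("carol", "python"), ("carol", "sql"), ("carol", "go"),
]
required = ["sql", "python"]
result = hash_division(has_skill, required)
```

Both tables are built in a single pass over their inputs, which is why the approach suits data held entirely in main memory.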
An Enhanced Cooperative Method to Solve Multiple Sequence Alignment Problem
by Lamiche Chaabane
Abstract: In this research study, we propose a novel cooperative approach called dynamic simulated particle swarm optimization (DSPSO), which is based on metaheuristics and the pairwise dynamic programming (DP) procedure, to find an approximate solution for the multiple sequence alignment (MSA) problem. The developed approach applies the particle swarm optimization (PSO) algorithm to explore the search space globally and the simulated annealing (SA) technique to improve the quality of the population leader in order to overcome the local optimum problem. The dynamic programming technique is then integrated as an improver mechanism in order to improve the quality of the worst solution and to increase the convergence speed of the proposed approach. Simulation results on BaliBASE benchmarks have shown the potential of the proposed method to produce good quality alignments compared with those given by other existing methods in the literature.
Keywords: Cooperative approach; multiple sequence alignment; DSPSO; PSO; SA; DP; BaliBASE benchmarks.
A Formal Theoretical Framework for a Flexible Classification Process
by Ismail Biskri
Abstract: The classification process is a complex technique that connects language, text, information and knowledge theories with computational formalization, statistical and symbolic approaches, standard and non-standard logics, etc. This process should always be under the control of the user according to his subjectivity, his knowledge and the purpose of his analysis. It becomes important to create platforms to support the design of classification tools, their management, and their adaptation to new needs and experiments. In recent years, several platforms for data digging, including textual data, where classification is the main functionality, have emerged. However, they lack flexibility and formal foundations. In this paper, we propose a formal model with strong logical foundations based on applicative type systems.
Keywords: Classification; Flexibility; Applicative Systems; Operators/Operands; Combinatory Logics; Inferential Calculus; Compositionality; Processing Chains; Modules; Discovery Process; Collaborative Intelligent Science.
Graph-based Cumulative Score Using statistical features for multilingual automatic text summarization
by Abdelkrime Aries, Djamel Eddine Zegour, Walid Khaled Hidouci
Abstract: Multilingual summarization has received more attention in recent years. Many approaches can be used to achieve it, among them statistical and graph-based approaches. Our idea is to combine these two approaches into a new extractive text summarization method. Surface statistical features are used to calculate a primary score for each sentence. The graph is used to select some candidate sentences and to calculate a final score for each sentence based on its primary score and those of its neighbors in the graph. We propose four variants for calculating the cumulative score of a sentence. Also, the order of sentences is an important aspect of summary readability. We propose some other algorithms for generating the summary based not just on final scores but also on sentence connections in the graph. The method is tested using the MultiLing'15 workshop's MSS corpus and the ROUGE metric. It is evaluated against some well-known methods and gives promising results.
Keywords: Automatic text summarization; Graph-based summarization; Statistical features; Multilingual summarization; Extractive summarization.
An ontology-based modelling and reasoning for alert correlation
by Tayeb Kenaza
Abstract: A good defense strategy utilizes multiple solutions such as firewalls, IDS, antivirus, AAA servers, VPN, etc. However, these tools can easily generate hundreds of thousands of events per day. A security information and event management system (SIEM for short) is a centralized solution that collects information from these tools and uses correlation techniques to build a reliable picture of the underlying monitored system. SIEM is a modern and powerful security tool thanks to several functions that make use of the collected data, such as normalization and aggregation of data. It provides security operators with a dashboard and helps them in the forensic analysis when an incident is reported. Indeed, the most important function is event correlation, through which security operators can get a precise and quick picture of threats and attacks in real time. The quality of that picture depends on the efficiency of the reasoning approach adopted to put together pieces of information provided by several analyzers. However, most proprietary SIEMs use their own data representation and their own correlation techniques, which are not always favorable to sharing knowledge or to incremental or collaborative reasoning. In this paper, we propose a semantic approach based on Description Logics (DLs), which are a powerful tool for knowledge representation and reasoning. Indeed, an ontology provides a comprehensive environment to represent information for intrusion detection and allows easy maintenance of information and the addition of new information. We implemented a rule-based engine for alert correlation based on the proposed ontology, and two attack scenarios are carried out to show the usefulness of the semantic approach.
Keywords: Intrusion detection; Alert correlation; Rules based reasoning; Ontology; OWL.
Convolutional Neural Network with Stacked Autoencoders for Predicting Drug-Target Interaction and Binding Affinity
by Meriem Bahi, Mohamed Batouche
Abstract: The prediction of novel drug-target interactions (DTIs) is critically important for drug repositioning, as it can lead researchers to find new indications for existing drugs and reduce the cost and time of the de novo drug development process. In order to explore new ways for this innovation, we propose two novel methods, named SCA-DTIs and SCA-DTA, to predict drug-target interactions and drug-target binding affinities (DTA) respectively, based on a Convolutional Neural Network (CNN) with Stacked Autoencoders (SAE). Initializing a CNN's weights with the filters of trained stacked autoencoders yields superior performance. Moreover, to boost the performance of DTI prediction, we propose a new method called RNDTIs to generate reliable negative samples. Tests on different benchmark datasets show that the proposed method can achieve an excellent prediction performance with an accuracy of more than 99%. These results demonstrate the potential of the proposed model for DTI and DTA prediction, thereby improving the drug repurposing process.
Keywords: Stacked Autoencoders; Convolutional Neural Network; Semi-Supervised Learning; Deep Learning; Drug Repositioning; Drug-Target Interaction; Binding Affinity.
Efficient Deployment Approach of Wireless Sensor Networks on 3D Terrains
by Mostefa Zafer, Mustapha Reda Senouci, Mohamed Aissani
Abstract: Ensuring the coverage of a Region of Interest (RoI) when deploying a Wireless Sensor Network (WSN) is an objective that depends on several factors, such as the detection capability of the sensor nodes used and the topography of the RoI. To address the topography challenges, in this paper, we propose a new WSN deployment approach based on the idea of partitioning the RoI into sub-regions with relatively simple topography, then allocating to each constructed sub-region the necessary number of sensor nodes and finding their appropriate positions to maximize the coverage quality. The performance evaluation of this approach, coupled with three different deployment methods named DMSA (Deployment Method based on Simulated Annealing), GDM (Greedy Deployment Method), and RDM (Random Deployment Method), has revealed its relevance, since it helped to significantly improve the coverage quality of the RoI.
Keywords: Wireless Sensor Networks; 3D terrains; Deployment; Coverage.