International Journal of Data Mining, Modelling and Management
These articles have been peer-reviewed and accepted for publication but are pending final changes, are not yet published and may not appear here in their final order of publication until they are assigned to issues. Therefore, the content conforms to our standards but the presentation (e.g. typesetting and proof-reading) is not necessarily up to the Inderscience standard. Additionally, titles, authors, abstracts and keywords may change before publication. Articles will not be published until the final proofs are validated by their authors.
Forthcoming articles must be purchased for the purposes of research, teaching and private study only. These articles can be cited using the expression "in press". For example: Smith, J. (in press). Article Title. Journal Title.
International Journal of Data Mining, Modelling and Management (19 papers in press)
Special Issue on: IFIP CIIA 2018 Advanced Research in Computational Intelligence
Abstract: The Web of Things (WoT) uses Web technologies to connect embedded objects to each other and to deliver services to stakeholders. The context of these interactions (the situation) is a key source of information, which can sometimes be uncertain. In this paper, we focus on the development of intelligent web services. The main requirements for an intelligent service are to deal with context diversity, to represent context semantically and to reason with uncertain information. From this perspective, we propose a framework for intelligent services to deal with various contexts, to respond reactively to real-time situations and to predict future situations proactively. For the semantic representation of context, we use PR-OWL, a probabilistic ontology based on Multi-Entity Bayesian Networks. PR-OWL is flexible enough to represent complex and uncertain contexts. We validate our framework with an intelligent plant watering use case to show its reasoning capabilities.
Keywords: Smart Web Service; the Web of Things; Context Reasoning; Proactive; Reactive; Multi-Entity Bayesian Networks; PR-OWL.
An Enhanced Cooperative Method to Solve Multiple Sequence Alignment Problem
by Lamiche Chaabane
Abstract: In this research study, we propose a novel cooperative approach called dynamic simulated particle swarm optimization (DSPSO), which is based on metaheuristics and the pairwise dynamic programming (DP) procedure, to find an approximate solution to the multiple sequence alignment (MSA) problem. The developed approach applies the particle swarm optimization (PSO) algorithm to explore the search space globally and the simulated annealing (SA) technique to improve the quality of the population leader in order to overcome the local optimum problem. The dynamic programming technique is then integrated as an improvement mechanism to raise the quality of the worst solution and to increase the convergence speed of the proposed approach. Simulation results on BaliBASE benchmarks show the potential of the proposed method to produce good-quality alignments compared with those produced by existing methods in the literature.
Keywords: Cooperative approach; multiple sequence alignment; DSPSO; PSO; SA; DP; BaliBASE benchmarks.
A Formal Theoretical Framework for a Flexible Classification Process
by Ismail Biskri
Abstract: The classification process is a complex technique that connects language, text, information and knowledge theories with computational formalization, statistical and symbolic approaches, standard and non-standard logics, etc. This process should always be under the control of the user according to his subjectivity, his knowledge and the purpose of his analysis. It becomes important to create platforms to support the design of classification tools, their management, and their adaptation to new needs and experiments. In recent years, several platforms for data mining, including textual data, in which classification is the main functionality have emerged. However, they lack flexibility and formal foundations. We propose in this paper a formal model with strong logical foundations based on applicative type systems.
Keywords: Classification; Flexibility; Applicative Systems; Operators/Operands; Combinatory Logics; Inferential Calculus; Compositionality; Processing Chains; Modules; Discovery Process; Collaborative Intelligent Science.
Graph-based Cumulative Score Using statistical features for multilingual automatic text summarization
by Abdelkrime Aries, Djamel Eddine Zegour, Walid Khaled Hidouci
Abstract: Multilingual summarization has received more attention in recent years. Many approaches can be used to achieve it, among them statistical and graph-based approaches. Our idea is to combine these two approaches into a new extractive text summarization method. Surface statistical features are used to calculate a primary score for each sentence. The graph is used to select candidate sentences and to calculate a final score for each sentence based on its primary score and those of its neighbors in the graph. We propose four variants for calculating the cumulative score of a sentence. In addition, the order of sentences is an important aspect of summary readability. We propose further algorithms that generate the summary based not just on final scores but also on sentence connections in the graph. The method is tested using the MultiLing'15 workshop's MSS corpus and the ROUGE metric. It is evaluated against some well-known methods and gives promising results.
Keywords: Automatic text summarization; Graph-based summarization; Statistical features; Multilingual summarization; Extractive summarization.
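As a rough illustration of the idea in the abstract above (a statistical primary score per sentence, then a final score that accumulates the primary scores of graph neighbors), the following minimal sketch may help. It is not the authors' method: the mean word-frequency feature, the Jaccard similarity graph, the `threshold` value and all function names are assumptions chosen only to keep the example short.

```python
from collections import Counter


def primary_scores(sentences):
    """Assumed statistical feature: mean corpus-wide frequency of a sentence's words."""
    freq = Counter(word for sent in sentences for word in sent)
    return [sum(freq[w] for w in sent) / len(sent) if sent else 0.0
            for sent in sentences]


def jaccard(a, b):
    """Word-overlap similarity used here to decide whether two sentences are linked."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cumulative_scores(sentences, threshold=0.2):
    """Final score = own primary score + primary scores of graph neighbors.

    Two sentences are neighbors when their Jaccard similarity reaches
    `threshold` (a hypothetical choice, not taken from the paper).
    """
    primary = primary_scores(sentences)
    n = len(sentences)
    final = []
    for i in range(n):
        neighbors = [j for j in range(n)
                     if j != i and jaccard(sentences[i], sentences[j]) >= threshold]
        final.append(primary[i] + sum(primary[j] for j in neighbors))
    return final
```

Sentences are passed as lists of tokens; a sentence with no sufficiently similar neighbor keeps its primary score unchanged, which is one simple way to favor sentences that are both individually salient and well connected.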
by Nilay Khare, Hema Dubey
Abstract: Brin and Page proposed PageRank in 1998, and it remains a prevailing link analysis technique used by web search engines to rank their search results. Computing PageRank values efficiently and quickly for very large web graphs is an essential concern for search engines today. Identifying and dealing with spam web pages is another important concern in web browsing. In this research article, an efficient and faster parallel PageRank algorithm is proposed, which harnesses the power of graphics processing units (GPUs). In the proposed algorithm, the PageRank scores are distributed non-uniformly among the web pages, so it is also capable of coping with spam web pages. The experiments are performed on standard datasets available in the Stanford Large Network Dataset Collection. The proposed parallel PageRank algorithm achieves a speed-up of about 1.1 to 1.7 over the existing parallel PageRank algorithm.
Keywords: GPU; CUDA; Parallel PageRank Technique; Spam Web Pages.
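For readers unfamiliar with the underlying computation, the sketch below shows the classic sequential PageRank power iteration on a small adjacency dictionary. It is only a CPU baseline under stated assumptions: it distributes rank uniformly over out-links and does not reproduce the GPU parallelisation or the non-uniform score distribution described in the abstract. The function name, the `damping` default of 0.85 and the handling of dangling nodes are illustrative choices.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank on a dict {node: [outgoing neighbors]}."""
    nodes = sorted(set(links) | {v for outs in links.values() for v in outs})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(max_iter):
        # Rank held by dangling nodes (no out-links) is spread over all nodes.
        dangling = sum(rank[u] for u in nodes if not links.get(u))
        new_rank = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for u in nodes:
            outs = links.get(u, [])
            if outs:
                share = damping * rank[u] / len(outs)
                for v in outs:
                    new_rank[v] += share
        delta = sum(abs(new_rank[u] - rank[u]) for u in nodes)
        rank = new_rank
        if delta < tol:  # stop once the scores have stabilised
            break
    return rank
```

Each iteration touches every edge once, which is the part that parallel implementations typically distribute across threads.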
Tuning Parameters via a new Rapid, Accurate and Parameter-less Method Using Meta-Learning
by Alireza Hekmatinia, Ali Mohammadi Shanghooshabad, Mohammad Mahdi Motevali, Mehrdad Almasi
Abstract: Dealing with a large parameter space in optimization and data mining tasks is extremely time consuming because the parameter space grows exponentially with the number of parameters. Besides the considerable amount of time it takes, the tuning method itself needs to be tuned, since such methods have at least one parameter of their own. Here a new rapid and parameter-less method is presented to tune algorithms on diverse datasets and achieve high-quality results in a short time. For a quick overview of the methods available in this area, a taxonomy of parameter selection approaches is also presented. The method uses prior knowledge, in the form of meta-features, to guess a point close to the optimum in the parameter space of the target algorithm (here, the Support Vector Machine algorithm). To prepare the prior knowledge, 282 meta-features are introduced and a Genetic Algorithm (GA) is then applied to determine the best meta-features for the target algorithm. The best meta-feature set is the combination of meta-features that creates the greatest differentiation between datasets. The best meta-features are then used to tune the target algorithm on unseen datasets. In the experiments, the 15 best meta-features are selected from the 282 by applying the GA over 30 datasets. Finally, using the extracted meta-features, the SVM parameters are tuned over 5 unseen datasets. The results show that, in less than 0.19 minutes on average, the method obtains approximately the same classification rates as other methods while dramatically reducing the time consumed.
Keywords: Parameter Tuning; Meta-Learning; Parameter-less Methods; Data Mining; Support Vector Machines.
Analyzing Sentiments based on Multi Feature Combination with Supervised Learning
by Monalisha Ghosh, Goutam Sanyal
Abstract: Sentiment analysis or opinion mining has become an open research domain after the proliferation of the Internet and Web 2.0 social media. Feature generation and selection are consequential for text mining, as a high-dimensional feature set can affect the performance of sentiment analysis. This paper investigates the inability of the widely used feature selection methods (IG, Chi-Square, Gini Index), individually as well as in combination, on four machine learning classification algorithms. Initially, we transform the review datasets into feature vectors of unigram features along with bi-tagged features based on POS patterns. Next, information gain (IG), chi-squared (χ2) and minimum redundancy maximum relevancy (mRMR) feature selection methods are applied to obtain an optimal feature subset for further processing. These features are then given as input to multiple machine learning classifiers, namely, support vector machine (SVM), Multinomial Na
Keywords: Sentiment analysis; Opinion mining; text classification; Feature selection method; Machine learning algorithms; optimal feature vector.
A new network-based approach to investigating neurological disorders
by Francesco Cauteruccio, Paolo Lo Giudice, Giorgio Terracina, Domenico Ursino, Nadia Mammone, Francesco Carlo Morabito
Abstract: In this paper, we present a new network-based approach to help experts investigate neurological disorders in which the connections among brain areas play a key role. Our approach receives the EEG of a patient and associates a network with it, with nodes that represent electrodes and edges that denote the disconnection degree of the corresponding brain areas, measured by means of a new string-based metric. It then performs suitable projections on this network, depending on the neurological disorder under investigation, and computes on them the values of a new coefficient, called the connection coefficient. These values can be employed to help neurologists in their analyses. We show how our approach can be employed for three different disorders, namely Creutzfeldt-Jakob Disease, Childhood Absence Epilepsy and Alzheimer's Disease.
Keywords: Network Analysis; Connection Coefficient; Clique; Consensus Multi-Parameterized Edit Distance; Electroencephalogram; Neurological Disorders.
Intrusion detection using classification techniques: a comparative study
by Imad Bouteraa, Makhlouf Derdour, Ahmed Ahmim
Abstract: Today's highly connected world suffers from the increase and variety of cyber-attacks. To mitigate those threats, researchers have been continuously exploring different methods for intrusion detection in recent years. In this paper, we study the use of data mining techniques for intrusion detection. The research intends to compare the performance of classification techniques for intrusion detection. To reach this goal, we involve 74 classification techniques in this comparative study. The study shows that no technique outperforms the others in all situations. However, some classification methods lead to promising results and give clues for further combinations.
Keywords: Data mining; Classification; Network Security; Intrusion detection; KDD99.
An Insight into Application of Big Data Analytics in Health Care
by Sravani Nalluri, Sasikala R
Abstract: The main aim of this paper is to comprehend different aspects of big data, to gain insight into current research trends in the application of big data in health care and to identify the different aspects of health care where it can be applied. The paper gives a brief analysis of applications of big data in health care. The main focus is on the aspects of health where big data is being used, the collection of data and the tools employed for big data analytics. In addition, the paper addresses the types of machine learning algorithms that were used in health care and the statistics used to compare the performance of these algorithms. Most of the health care data was collected from the University of California machine learning repository, from hospitals and from government agencies. Most researchers focused only on the prediction of diseases, emergency department visits or disease outbreaks with the help of the Hadoop and WEKA tools. Support vector machines, artificial neural networks, naive Bayes and decision trees were the commonly used algorithms for the prediction of diseases. The performance of the algorithms was compared statistically using accuracy. In our view, more research needs to be done on the application of big data analytics in other domains of health rather than just the prediction of disease.
Keywords: Big data; Hadoop; Machine learning algorithms; Healthcare; Map-reduce; Chronic diseases; Accuracy rate; Prevention; Analytics.
Grey Relational Classification Algorithm for Software Fault Proneness with SOM Clustering
by Aarti Aarti, Geeta Sikka, Renu Dhir
Abstract: Estimation by human judgment to deal with the inherent uncertainty of software gives a vague and imprecise solution. To cope with this challenge, we propose a new hybrid analogy model based on the integration of grey relational analysis (GRA) classification with self-organizing map (SOM) clustering. In this paper, a new classification approach is proposed to distribute the data into similar groups. The attributes are selected based on GRC values. In the proposed approach, the similarity measure between the reference project and each cluster head is computed to determine the cluster to which the target project belongs. The fault-proneness of the reference project is estimated based on the regression equation of the selected cluster. The proposed algorithm gives users the flexibility to select n features for both continuous and categorical attributes. In this study, two scenarios based on the integration of the proposed classification with regression have been proposed. Experimental results are significant, indicating that the proposed methodology can be used for the prediction of faults and produces conceivable results when compared with the results of multilayer-perceptron, logistic regression, bagging, na
Keywords: Self organizing map (SOM); grey relational analysis (GRA); unsupervised classification; fault-proneness; object-oriented (OO).
Overlapping Community Detection With A Novel Hybrid Metaheuristic Optimization Algorithm
by Imane Messaoudi, Nadjet Kamel
Abstract: Social networks are ubiquitous in our daily life. Due to the rapid development of information and electronic technology, social networks are becoming more and more complex in terms of size and content. It is of paramount significance to analyze the structures of social networks in order to unveil the myth beneath complex social networks. Network community detection is recognized as a fundamental tool for social network analytics. As a consequence, numerous community detection methods have been proposed in the literature. In a real-world social network, an individual may possess multiple memberships, while the existing community detection methods are mainly designed for non-overlapping situations. In this regard, this paper proposes a hybrid metaheuristic method to detect overlapping communities in social networks. In the proposed method, the overlapping community detection problem is formulated as an optimization problem and a novel bat optimization algorithm is designed to solve the established optimization model. To enhance the search ability of the proposed algorithm, a local search operator based on tabu search is introduced. To validate the effectiveness of the proposed algorithm, experiments on benchmark and real-world social networks are carried out. The experiments indicate that the proposed algorithm is promising for overlapping community detection.
Keywords: Overlapping Community; Modified Density; Tabu Search; Bat Algorithm; Link Clustering.
Bees Colonies For Detecting Communities Evolution Using Data WareHouse
by Yasmine Chaabani, Jalel Akaichi
Abstract: The analysis of social networks and their evolution has gained much interest in recent years. In fact, few methods have revealed and tracked meaningful communities over time. These methods also dealt efficiently with the structure and topic evolution of networks. In this paper, we propose a novel technique to track dynamic communities and their evolution behaviour. The main objective of our approach, which uses the Artificial Bee Colony (ABC) algorithm, is to trace the evolution of communities and to optimize our objective function so as to keep a proper partitioning. Moreover, we use a data warehouse as the mind of the bees to store information on the different community structures at every timestamp. The experimental results show that the proposed method is efficient in discovering dynamic communities and tracking their evolution.
Keywords: Social Network; Community Detection; Bees Colonies.
A support Architecture to MDA Contribution for Data Mining
by Fatima MESKINE, Safia Nait-Bahloul
Abstract: The data mining process is the sequence of tasks applied to data in order to discover relations among them and derive knowledge. However, the data mining process lacks a formal specification that allows it to be modeled independently of platforms. MDA (Model Driven Architecture) is an approach for the development of software systems, based on the use of models to improve their productivity. Several research works have sought to align the MDA approach with data mining on data warehouses, in order to specify the data mining process at a very high level of abstraction. In our work, we propose a support architecture that positions these works at different abstraction levels on the basis of several criteria, with the aim of identifying the strengths of each level in terms of modelling and of giving clear visibility of the MDA contribution to data mining.
Keywords: Data mining; Model Driven Architecture; Data warehouses; UML Profiles; Data Multidimensional Model; Transformation.
Emotion Mining From Text for Actionable Recommendations Detailed Survey
by Jaishree Ranganathan, Angelina Tzacheva
Abstract: In the era of Web 2.0, people express their opinions, feelings and thoughts about topics including political and cultural events, natural disasters, products and services, through mediums such as blogs, forums, and micro-blogs like Twitter. A large amount of text is also generated through e-mail that contains the writer's feeling or opinion; for instance, customer care service e-mail. The texts generated through such platforms are a rich source of data which can be mined in order to gain useful information about user opinion or feeling, which in turn can be utilized in specific applications such as marketing, sales predictions, political surveys, health care, student-faculty culture, e-learning platforms, and social networks. This process of identifying and extracting information about the attitude of a speaker or writer about a topic, polarity, or emotion in a document is called Sentiment Analysis. There is a variety of sources for extracting sentiment, such as speech, music and facial expression. Due to the rich source of information available in the form of text data, this paper focuses on sentiment analysis and emotion mining from text, as well as discovering actionable patterns. The actionable patterns may suggest ways to alter the user's sentiment or emotion to a more positive or desirable state.
Keywords: Actionable Pattern Mining; Data Mining; Text Mining; Sentiment Analysis.
A survey of Term Weighting Schemes for Text Classification
by Abdullah Alsaeedi
Abstract: Text document classification approaches are designed to categorise documents into predefined classes. These approaches have two main components: document representation models and term-weighting methods. The high dimensionality of feature space has always been a major problem in text classification methods. To resolve high dimensionality issues and to improve the accuracy of text classification, various feature selection approaches were presented in the literature. Besides which, several term-weighting schemes were introduced that can be utilised for feature selection methods. This work surveys and investigates various term (feature) weighting approaches that have been presented in the text classification context.
Keywords: Document frequency; Supervised term weighting; Text classification; Unsupervised term weighting.
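To make the notion of a term-weighting scheme concrete, here is a minimal sketch of TF-IDF, a common unsupervised scheme built on document frequency. It is used purely as an illustration; the abstract does not say which schemes the survey covers, and the function name, the length-normalised term frequency and the unsmoothed logarithmic IDF are assumptions of this example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenised document.

    weight = (term count / document length) * log(N / document frequency)
    """
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({term: (count / len(doc)) * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return weights
```

A term that occurs in every document receives weight zero under this variant, which is why such schemes can also serve as a crude feature selection step: low-weight terms carry little information for separating classes.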
Special Issue on: IRICT 2018 Advances in Data Analytics and Business Intelligence
by Mohammed Al-Sarem
Abstract: Although authorship attribution is a well-known problem in the authorship analysis domain, research on Arabic contexts is still limited. In addition, the performance of attribution methods on training sets of short textual documents has not been examined well in other languages, such as English, Chinese, Spanish and Dutch. Therefore, this work aims to examine the performance of attribution classifiers in the context of short Arabic textual documents. The experimental part of this work is conducted with well-known classifiers, namely the Decision Tree C4.5 method, the Naive Bayes model, the K-NN method, the Markov Model, SMO and the Burrows Delta method. We experiment with various feature combinations. The results show that combining the word-based lexical features with the structural features yields the best accuracy. To this end, we use this combination as a baseline for further investigation. We also examine the effect of combining the n-gram features. The results indicate that some classifiers show an improvement while others do not. In addition, the results show that the naive Bayes method gives the highest accuracy among all the attribution classifiers.
Keywords: Authorship Attribution; Stylometric Features; Attribution Classifiers; JGAAP tool; Arabic Language.
Extracting useful reply-posts for text forum threads summarisation using quality features and classification methods
by Akram Osman, Naomie Salim
Abstract: Text forum threads contain a large amount of information furnished by users who discuss a specific topic. At times, certain thread reply-posts are entirely off-topic, thereby deviating from the main discussion. This negatively affects the users' preference to continue replying to the discussion. Thus, a user may prefer to read certain selected reply-posts that provide a short summary of the topic of the discussion. The objective of the paper is to choose quality reply-posts regarding the topic considered in the initial-post, which also serve as a brief summary. We offer an exhaustive examination of the conversational patterns of the threads on the basis of 12 quality features for analysis. These features can ensure the selection of relevant reply-posts for the thread summary. Experimental outcomes obtained using two datasets show that the presented techniques considerably enhance the performance in selecting initial-post replies pairs for text forum threads summarisation.
Keywords: information retrieval; initial-post replies pairs; text data; text forum threads; text forum threads summarisation; text summarisation; thread retrieval.
Phish Webpage Classification Using Hybrid Algorithm of Machine Learning and Statistical Induction Ratios
by Hiba Zuhair, Ali Selamat
Abstract: Although conventional machine learning-based anti-phishing techniques outperform their competitors in phishing detection, they are still targeted by zero-hour phish webpages due to their constraints of phishing induction. Therefore, phishing induction must be boosted with the extraction of new features, the selection of robust subsets of decisive features, and the active learning of classifiers on a big webpage stream. In this paper, we propose a hybrid feature-based classification algorithm (HFBC) for decisive phish webpage classification. HFBC hybridizes two statistical criteria, Optimized Feature Occurrence (OFC) and Phishing Induction Ratio (PIR), with the induction settings of the most salient machine learning algorithms, Na
Keywords: phish webpage; machine learning; optimized feature occurrence; phishing induction ratio; hybrid feature-based classifier.