International Journal of Data Mining, Modelling and Management (16 papers in press)
Hybrid Feature Selection Methods for High-Dimensional Multi-class Datasets
by Amit Saxena, Vimal Dubey, John Wang
Abstract: Hybrid methods are very important for feature selection when classifying high-dimensional datasets. In this paper, we propose two hybrid methods that combine filter-based feature selection, a genetic algorithm, and sequential random search. The first proposed method hybridizes information gain and a genetic algorithm. The features are first ranked by information gain, and a user-defined number of features is then selected from the ranked list. A genetic algorithm is applied to these selected features to find an optimal feature subset, using two types of fitness function: single-objective and multi-objective. The second feature selection model hybridizes information gain and sequential random K-nearest neighbor (SRKNN). In this method, information gain is again used to rank the features, and a user-defined number of top-ranked features is selected. A set of binary populations (over all the features selected by the user) is generated, and a sequential search method is applied to each population to maximize classification accuracy. These methods are applied to 21 high-dimensional multi-class datasets. The results show that the first method performs well on some datasets and the second method on others. The results obtained by the proposed methods are compared with results reported for other methods.
Keywords: Intelligent Mining; High-Dimensional dataset; Genetic Algorithm; Filter Approach; Information Gain; Classification.
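The filter step shared by both hybrid methods above (rank features by information gain, then keep a user-defined number of top-ranked features) can be sketched as follows. This is a minimal illustration for discrete features only, not the authors' implementation; the function names `entropy`, `information_gain` and `top_k_features` and the toy data are assumptions made for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Label entropy minus the expected label entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def top_k_features(columns, labels, k):
    """Return the indices of the k features with the highest information gain."""
    gains = [information_gain(col, labels) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda i: gains[i], reverse=True)
    return ranked[:k]


# Hypothetical toy data: feature 0 determines the class, feature 1 is uninformative.
columns = [
    ["a", "a", "b", "b"],
    ["x", "y", "x", "y"],
]
labels = [0, 0, 1, 1]
selected = top_k_features(columns, labels, k=1)
```

In the methods described above, the selected indices would then be handed to the genetic algorithm or to the sequential search; that wrapper stage is not shown here.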
A Lexical-Semantics based Method for Multi label Text Categorization Using Word Net
by Shweta Taneja, Rajni Jindal
Abstract: Text categorization is an emerging area in the field of text mining. Text documents possess a huge number of features due to their unstructured nature. In this paper, an algorithm for Multi label Categorization of text documents based on the concepts of Lexical and Semantics using Word Net (MC-LSW) is proposed. The proposed algorithm is based on the lexical units (tokens) and the semantics of a language. It aims at minimizing the number of tokens used for categorizing text documents. MC-LSW uses Word Net to extract the semantic information of tokens. The proposed algorithm is implemented and tested on five text datasets and is compared with existing multi label categorization algorithms. MC-LSW shows more efficient and promising results in terms of space and time complexity than the existing methods. It also improves accuracy and precision and reduces Hamming loss.
Keywords: Multi label Text Categorization; Lexical Analysis; Semantic Analysis; Word Net.
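For readers unfamiliar with the Hamming loss mentioned in the abstract above, the following sketch computes it for multi-label predictions as the fraction of label slots that are predicted incorrectly. It is a generic illustration, not code from the paper; the function name, the label names and the toy documents are assumptions.

```python
def hamming_loss(true_sets, predicted_sets, num_labels):
    """Fraction of (document, label) pairs on which prediction and truth disagree."""
    # The symmetric difference holds the labels that are wrongly added or wrongly missed.
    mismatches = sum(len(t ^ p) for t, p in zip(true_sets, predicted_sets))
    return mismatches / (len(true_sets) * num_labels)


# Hypothetical example: two documents, three possible labels.
true_sets = [{"sports"}, {"politics", "economy"}]
predicted_sets = [{"sports", "economy"}, {"politics", "economy"}]
loss = hamming_loss(true_sets, predicted_sets, num_labels=3)
```

Here one of the six label slots is wrong (the extra "economy" on the first document), so lower values indicate better multi-label predictions.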
GACC: Genetic Algorithm Based Categorical Data Clustering For Large Datasets
by Abha Sharma, R.S. Thakur
Abstract: Many operators of the Genetic Algorithm (GA), such as crossover operators, fitness functions and mutation, are discussed in the literature. A range of GA based clustering methods have been proposed to obtain optimal solutions. In this paper, the most recent GA based hard and fuzzy clustering methods designed specifically for categorical data are discussed. In general, all GA based clustering algorithms generate the initial population randomly, which may produce biased results. This paper proposes the GACC algorithm with a new population initialization criterion. In this population creation mechanism, the usual random selection of chromosomes is replaced with more refined and distinct clusters as chromosomes. With this mechanism, the user also does not initialize the population size. Experimental results show better clustering for purely categorical datasets. The work closes with some open challenges and ways to improve the clustering of categorical data.
Keywords: Categorical data; Genetic Algorithm; Genetic operators; Population; Population size.
The Dynamics of Wikipedia Article Revisions: An Analysis of Revision Activities and Patterns
by Zhongming Ma, Jie Tao, Jing Hu
Abstract: To study the dynamics of revision activities of Wikipedia articles, we define 14 revision actions, annotate 6,950 revisions from 20 articles in four quality ranks (C, B, GA, and FA), and analyze revisions and revision actions in 10 consecutive time periods. We identify four revision patterns: (1) revision actions at the sentence and link levels appear at similar paces; (2) the numbers of revision actions at the sentence and link levels grow comparatively evenly with the article's age prior to the last time period; (3) the pace of media and reference-level actions tends to lag behind that of sentence and link-level actions; (4) before being promoted to the GA or FA rank, articles nominated for that rank exhibit a significant rise in the number of revisions and revision actions. This pattern is validated with a larger set of 533 articles.
Keywords: Crowdsourcing; Collaborative Processes; Wikipedia; Revision Behaviors.
A Social Network Analysis based approach to extracting knowledge patterns about innovation geography from patent databases
by Domenico Ursino, Massimiliano Ferrara, Diego Fosso, Roberto Mavilia, Davide Lanata
Abstract: Patents have been one of the main topics investigated in several fields of scientific literature. Currently, data about patents is rapidly increasing, and the adoption of Data Mining and Big-Data-centered approaches to investigating them appears compulsory. Among these approaches, Social Network Analysis (SNA) is extremely promising. In this paper, we propose an SNA-based approach to extracting knowledge patterns about patent inventors and their collaborations. Our approach is extremely general and can be exploited to investigate patents of any country. It allows the analysis of some issues that have not been considered in the past, such as the presence of "power inventors" in a country, the existence of a backbone and of possible cliques among them, and the influence and benefits of power inventors on their co-inventors and, more generally, on the R&D activities of their country. All these issues represent innovation geography knowledge patterns that can be extracted thanks to our approach.
Keywords: Patents; Knowledge Pattern Extraction; Social Network Analysis; Power Inventors; Innovation Geography.
An adaptive and interactive recommendation model for consumers behaviors prediction
by Mohamed Ramzi Haddad, Hajer Baazaoui
Abstract: Recommendation algorithms aim at predicting customers' interests and purchases using different ideas and hypotheses. Consequently, system designers need to choose the recommendation approach that is most suitable with regard to their products' nature and consumers' behaviors within the application field. In this paper, we propose an adaptive recommendation model based on statistical modeling in order to assist consumers facing choice overload by predicting their interests and consumption behaviors. We also propose a dynamic variant of the model that takes into account the time-value of recommendations in interactive online recommendation scenarios. Our proposal underwent a twofold evaluation. On the one hand, we conducted an offline comparative study on the MovieLens recommendation dataset in order to assess our model's performance with regard to several widely adopted recommendation techniques. On the other hand, the model was evaluated within a real-time online news recommendation platform to highlight its adaptability and efficiency in a highly interactive application domain. The experimental results show that our proposal is able to outperform existing approaches by unifying their main ideas and by using all the available data sources when inferring and predicting users' interests. The online experimentation of the proposed model demonstrates its scalability and low complexity, which make it a potential candidate for real-time large-scale recommendations.
Keywords: adaptive recommendation model; interactive recommendation; Continuous recommendation; Consumer Behavior modeling and prediction.
Interval graph mining
by Amina Kemmar, Yahia Lebbah, Samir Loudni
Abstract: Frequent subgraph mining is a difficult data mining problem that aims to find the exact set of frequent subgraphs in a database of graphs. Current subgraph mining approaches make use of canonical encoding, which is one of their key operations. It is well known that canonical encodings have an exponential time complexity. Consequently, mining all frequent patterns for large and dense graphs is computationally expensive. In this paper, we propose an interval approach to handle canonicity, leading to two encodings, lower and upper, with polynomial time complexity, which tightly enclose the exact set of frequent subgraphs. These two encodings lead to an interval graph mining algorithm in which two minings are launched in parallel: a lower mining (resp. upper mining) using the lower (resp. upper) encoding. The interval graph mining approach has been implemented within the state-of-the-art Gaston miner. Experiments performed on synthetic and real graph databases coming from stock market and biological datasets show that our interval graph mining is effective on dense graphs.
Keywords: Graph mining; interval approach; frequent subgraph discovery; graph encoding; subgraph isomorphism; graph isomorphism.
Association rules mining using cuckoo search algorithm
by Rasha Mohammed, Mehdi Duaimi
Abstract: Association rules mining (ARM) is a fundamental and widely used data mining technique for extracting useful information from data. Traditional ARM algorithms suffer from degraded computational efficiency because they mine too many association rules that are not appropriate for a given user. Recent research in ARM investigates the use of metaheuristic algorithms that look for only a subset of high-quality rules. In this paper, a modified discrete cuckoo search algorithm for association rules mining (DCS-ARM) is proposed for this purpose. The effectiveness of our algorithm is tested against a set of well-known transactional databases. Results indicate that the proposed algorithm outperforms the existing metaheuristic methods.
Keywords: data mining; ARM; association rules mining; DCS; discrete cuckoo search; metaheuristic algorithm.
An Efficient Context-aware Agglomerative Fuzzy Clustering framework for Plagiarism Detection
by Anirban Chakrabarty, Sudipta Roy
Abstract: Plagiarism refers to the act of copying content without acknowledging the original source. Though several commercial tools for plagiarism detection exist, detecting plagiarism remains tricky and challenging due to the rise in the volume of online publications. Existing plagiarism detection methods use paraphrasing, sentence and keyword matching, but such techniques have not been very effective. In this work, a framework for fuzzy based plagiarism detection is proposed using a context-aware agglomerative clustering approach with an improved time complexity. The work aims at retrieving key concepts at the word, sentence and paragraph levels by integrating semantic features in a novel optimization function to detect plagiarism effectively. The notion of fuzzy clustering has been applied to improve the robustness and consistency of results for clustering multi-disciplinary papers. The experimental analysis is supported by a comparison with other contemporary techniques, which indicates the superiority of the proposed approach for plagiarism detection.
Keywords: Fuzzy clustering; Context similarity; Plagiarism detection; Spanning tree; Agglomerative clustering; validity index; constrained objective function.
Mining Hidden Opinions from Objective Sentences
by Farek Lazhar
Abstract: Sentiment Analysis and Opinion Mining is a very popular and active research area in natural language processing. It deals with structured and unstructured data to identify and extract people's opinions, sentiments and emotions from many sources of subjectivity such as product reviews, blogs, social networks, etc. All existing feature-level Opinion Mining approaches detect subjective sentences and eliminate objective ones before extracting explicit features and their related positive or negative polarities. However, objective sentences can carry implicit opinions, and a lack of attention to such sentences can adversely affect the obtained results. In this paper, we propose a classification-based approach to extract implicit opinions from objective sentences. Firstly, we apply a rule-based approach to extract explicit feature-opinion pairs from subjective sentences. Secondly, in order to build a classification model, a training corpus is constructed based on the extracted explicit feature-opinion pairs and subjective sentences. Lastly, mining implicit feature-opinion pairs from objective sentences is formulated as a text classification problem using the model previously built. Tested on customer reviews in three different domains, the approach yields experimental results that show the effectiveness of mining opinions from objective sentences.
Keywords: Opinion Mining; Hidden Opinion; Objectivity; Subjectivity; Supervised Learning.
EFP-Tree: An Efficient FP-Tree for Incremental Mining of Frequent Patterns
by Razieh Davashi, Mohammad-Hossein Nadimi-Shahraki
Abstract: Frequent pattern mining from dynamic databases with many incremental updates is a significant research issue in data mining. After incremental updates, the validity of the frequent patterns changes. A simple way to handle this is to rerun the mining algorithms from scratch, which is very costly. To solve this problem, researchers have introduced incremental mining approaches. In this article, an efficient FP-tree named EFP-tree is proposed for incremental mining of frequent patterns. For the original database, it is constructed like an FP-tree by using an auxiliary list, without any reconstruction. For incremental updates, the EFP-tree is reconstructed once, which reduces the number of tree reconstructions, the number of reconstructed branches and the search space. The experimental results show that using the EFP-tree can reduce the number of reconstructed branches and the runtime in both static and incremental mining, and enhance scalability compared to the well-known tree structures CanTree, CP-tree, SPO-tree, and GM-tree on both dense and sparse datasets.
Keywords: Data mining; Dynamic databases; Frequent pattern; Incremental mining; FP-tree.
Topical Document Clustering: Two-Stage Post Processing Technique
by Poonam Goyal, N. Mehala, Divyansh Bhatia, Navneet Goyal
Abstract: Clustering documents is an essential step in improving the efficiency and effectiveness of information retrieval systems. Topical clustering and soft clustering are techniques for shaping document clusters into coherent and natural clusters. In this paper, we propose a two-phase Split-Merge (SM) algorithm, which can be applied to topical clusters obtained from existing query-context aware document clustering algorithms to produce soft topical document clusters. The proposed technique is a post-processing technique that combines the advantages of document-pivot and feature-pivot based topical document clustering approaches. In the SM algorithm, the split phase splits the topical clusters by relating them to the topics obtained by disambiguating the web search results, and converts the clusters into homogeneous soft clusters. In the merge phase, similar topical document clusters are merged using the similarity among topical clusters, following the feature-pivot approach. The proposed algorithm is tested on the outcome of two hierarchical query-context aware document clustering algorithms on different datasets, including the TREC session track 2011 dataset. The proposed SM algorithm improves the quality of clustering appreciably in all the experiments conducted over various datasets. We have also applied an incremental model to update the obtained topical document clusters as the data stream progresses. The model periodically incorporates new information efficiently and maintains a cluster quality comparable to that of the static model.
Keywords: Topical clustering; Query clustering; Query context; Document clustering; Incremental clustering; soft clustering.
Support vector machines for credit risk assessment with imbalanced datasets
by Sihem Khemakhem, Younes Boujelbene
Abstract: Support Vector Machines (SVM) have limited performance in credit scoring because the datasets are imbalanced: the number of unpaid loans is lower than the number of paid loans. In this work, we developed an SVM model with more kernels on a set of imbalanced data and suggested two data resampling alternatives: Random Over Sampling (ROS) and Synthetic Minority Oversampling Technique (SMOTE). The aim of this work is to explore the relevance of resampling data with the SVM technique for an accurate credit risk prediction rate in the face of the class imbalance constraint. The performance criteria chosen to evaluate the suggested technique were accuracy, sensitivity, specificity, type I error, type II error, G-mean and the area under the receiver operating characteristic curve (AUC). Significant empirical results obtained from an experimental study of a real imbalanced database of loans granted by a Tunisian bank demonstrated that the sampling strategies improve SVM performance, leading to better prediction of the creditworthiness of borrowers.
Keywords: credit scoring; Support Vector Machines; imbalanced data; Synthetic Minority Oversampling Technique; Random Over Sampling.
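As an illustration of two ingredients named in the abstract above, the sketch below shows random over sampling (duplicating minority-class samples until the classes are balanced) and the G-mean criterion (the geometric mean of sensitivity and specificity). It is a generic sketch under stated assumptions, not the authors' code: the function names, the encoding of unpaid loans as class 1, and the toy data are hypothetical, and `g_mean` assumes both classes occur in `y_true`.

```python
import math
import random


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(items) for items in by_class.values())
    out_x, out_y = [], []
    for y, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for x in items + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity (recall on positives) and specificity (recall on negatives)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)


# Hypothetical toy data: four paid loans (class 0) and one unpaid loan (class 1).
samples = [[1.0], [2.0], [3.0], [4.0], [5.0]]
labels = [0, 0, 0, 0, 1]
balanced_x, balanced_y = random_oversample(samples, labels)

# One of two unpaid loans is missed; both paid loans are classified correctly.
score = g_mean([1, 1, 0, 0], [1, 0, 0, 0])
```

Unlike plain accuracy, G-mean drops to zero when a classifier ignores the minority class entirely, which is why it is a common choice for imbalanced data. SMOTE, which synthesizes new minority samples by interpolation rather than duplication, is not shown.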
ABCD: Agent Based Model for Document Classification
by Abdurrahman Nasr
Abstract: Document classification is the task of analyzing, identifying and categorizing a collection of documents into their annotated classes based on their contents. Classification of news articles is a popular application of document categorization, in addition to several other applications in industry, business, media, and government. This paper presents ABCD, an Agent Based Classifier for Documents. ABCD is autonomous, in that it depends on software agents for collecting and distributing documents, and smart, in that it exploits machine learning techniques to train the underlying classifier. As such, the system consists of two essential components, namely the agent component and the classification component, which integrate to form the proposed model. The agent component consists of six software agents: one is responsible for document collection, while the rest (one for each news topic) are responsible for distributing the classified documents to the subscribers. The classification component recognizes the incoming documents and assigns them to predefined categories. To be comprehensive and to facilitate comparative results, five statistical classifiers are exploited. These classifiers are based on Na
Keywords: Software Agent; Supervised Learning; Random Forest; Document Classification; Unimodal document; Multi-agent system.
To Identify the Usage of Clustering Techniques for Improving Search Result of a Website
by Shashi Mehrotra, Shruti Kohli
Abstract: Clustering has drawn much attention from the research community due to its advantages and wide applications. However, clustering is a challenging problem, as many factors play a significant role. The same algorithm may generate different output if there is a change in parameters, presentation order or similarity measure. The search option is used extensively on almost every website. Grouping the search results into various folders will improve web browsing, and this can be achieved through clustering. Clustering web elements facilitates data analysis in various ways. In this paper, we present well-known clustering algorithms and identify their different usages for web elements. The paper also discusses some significant work done in this field.
Keywords: Clustering algorithm; distance measure; web analytics; complexity.
Machine Learning for Water bodies Identification from Satellite Images
by Konstantinos Kontos, Manolis Maragoudakis
Abstract: Examining satellite images of residential areas, and more particularly of bodies of water such as swimming pools, is of great interest in the field of image mining. First, unobstructed water consumption for pool operation can reduce water supplies, especially during the summer months, which can affect the water sources available for firefighting. Moreover, pools may serve as potential mosquito habitats, especially if they are surrounded by dense vegetation. To this end, this paper presents an efficient classification system for identifying swimming pools from satellite images. A new method of trainable segmentation is presented for feature extraction and for the creation of the example set. In this study, a Support Vector Machine algorithm is used to reduce the feature set to the most appropriate features. The proposed method was tested on different areas of Greece, and an overall accuracy of 99.82% was achieved by using an ensemble algorithm.
Keywords: Satellite Images; Feature Extraction; Image Processing; Pool Detection; Trainable Segmentation; Data Mining; SVM Algorithms; Decision Trees; Image Classification; Image Mining; Adaboost.