International Journal of Data Mining, Modelling and Management (12 papers in press)
Hybrid Feature Selection Methods for High-Dimensional Multi-class Datasets
by Amit Saxena, Vimal Dubey, John Wang
Abstract: Hybrid methods are important for feature selection when classifying high-dimensional datasets. In this paper, we propose two hybrid methods that combine filter-based feature selection, a genetic algorithm, and sequential random search. The first method is a hybridization of information gain and a genetic algorithm. The features are first ranked by information gain, and a user-defined number of top-ranked features is selected. A genetic algorithm is then applied to these selected features to find an optimal feature subset, using two types of fitness function: single-objective and multi-objective. The second feature selection model is a hybridization of information gain and sequential random K-nearest neighbor (SRKNN). In this method, information gain is again used to rank the features, and a user-defined number of top-ranked features is selected. A binary population (covering all features selected by the user) is generated, and a sequential search method is applied to each member of the population to maximize classification accuracy. These methods are applied to 21 high-dimensional multi-class datasets. The results show that the first method performs well on some datasets and the second method performs well on others. The results obtained by the proposed methods are compared with results reported for other methods.
Keywords: Intelligent Mining; High-Dimensional dataset; Genetic Algorithm; Filter Approach; Information Gain; Classification.
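The filter step shared by both methods in the abstract above (rank features by information gain, then keep a user-defined number of top-ranked features) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes discrete feature values, and the names `entropy`, `information_gain`, and `top_k_features` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Label entropy minus the expected label entropy after splitting on a discrete feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def top_k_features(rows, labels, k):
    """Return the indices of the k features with the highest information gain."""
    n_features = len(rows[0])
    gains = [information_gain([r[j] for r in rows], labels) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: gains[j], reverse=True)
    return ranked[:k]


# Toy data (assumed for illustration): feature 0 separates the classes perfectly,
# feature 1 carries no information, feature 2 is partially informative.
rows = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)]
labels = ["a", "a", "b", "b"]
print(top_k_features(rows, labels, 2))  # [0, 2]
```

The indices returned by `top_k_features` would then define the reduced search space handed to the genetic algorithm or to the sequential search.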
A Lexical-Semantics based Method for Multi label Text Categorization Using Word Net
by Shweta Taneja, Rajni Jindal
Abstract: Text categorization is an emerging area in the field of text mining. Text documents possess a huge number of features due to their unstructured nature. In this paper, an algorithm for Multi label Categorization of text documents based on the concepts of Lexical and Semantics using Word Net (MC-LSW) is proposed. The algorithm is based on the lexical (token) and semantic concepts of a language and aims at minimizing the number of tokens used for categorizing text documents. MC-LSW uses Word Net to extract the semantic information of tokens. The algorithm is implemented, tested on five text datasets, and compared with existing multi label categorization algorithms. MC-LSW is more efficient than the existing methods in terms of space and time complexity. It also improves accuracy and precision and reduces Hamming loss.
Keywords: Multi label Text Categorization; Lexical Analysis; Semantic Analysis; Word Net.
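The abstract above reports a reduction in Hamming loss. As a reminder of what that measure captures in the multi-label setting, here is a minimal sketch assuming each document's labels are given as a Python set; the function name and the toy labels are hypothetical, and this is not the evaluation code used in the paper.

```python
def hamming_loss(true_sets, pred_sets, all_labels):
    """Fraction of (document, label) slots where prediction and ground truth disagree."""
    total_slots = len(true_sets) * len(all_labels)
    # The symmetric difference holds labels that are missing or wrongly added.
    wrong = sum(len(set(t) ^ set(p)) for t, p in zip(true_sets, pred_sets))
    return wrong / total_slots


# Toy example (assumed): two documents, three possible labels.
all_labels = {"sports", "politics", "tech"}
true_sets = [{"sports"}, {"politics", "tech"}]
pred_sets = [{"sports", "tech"}, {"politics"}]
print(hamming_loss(true_sets, pred_sets, all_labels))  # 2 wrong slots out of 6
```

A lower value is better: 0 means every label of every document was predicted exactly.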
GACC: Genetic Algorithm Based Categorical Data Clustering For Large Datasets
by Abha Sharma, R.S. Thakur
Abstract: Many operators of the Genetic Algorithm (GA) are discussed in the literature, such as crossover operators, fitness functions, and mutation. A range of GA-based clustering methods have been proposed to obtain optimal solutions. In this paper, the most recent GA-based hard and fuzzy clustering methods designed specifically for categorical data are discussed. In general, all GA-based clustering algorithms generate the initial population randomly, which may produce biased results. This paper proposes the GACC algorithm with new population initialization criteria. In this population creation mechanism, the usual random selection of chromosomes is replaced with more refined and distinct clusters as chromosomes. With this mechanism, the user also does not initialize the population size. Experimental results show better clustering for pure categorical datasets. The work closes with some open challenges and ways to improve the clustering of categorical data.
Keywords: Categorical data; Genetic Algorithm; Genetic operators; Population; Population size.
The Dynamics of Wikipedia Article Revisions: An Analysis of Revision Activities and Patterns
by Zhongming Ma, Jie Tao, Jing Hu
Abstract: To study the dynamics of revision activities of Wikipedia articles, we define 14 revision actions, annotate 6,950 revisions from 20 articles in four quality ranks (C, B, GA, and FA), and analyze revisions and revision actions in 10 consecutive time periods. We identify four revision patterns: (1) revision actions at the sentence and link levels appear at a similar pace; (2) the numbers of revision actions at the sentence and link levels grow comparatively evenly with the article's age prior to the last time period; (3) media- and reference-level actions tend to lag behind sentence- and link-level actions; (4) before being promoted to the GA or FA rank, articles nominated for that rank exhibit a significant rise in the number of revisions and revision actions. This pattern is validated with a larger set of 533 articles.
Keywords: Crowdsourcing; Collaborative Processes; Wikipedia; Revision Behaviors.
A Social Network Analysis based approach to extracting knowledge patterns about innovation geography from patent databases
by Domenico Ursino, Massimiliano Ferrara, Diego Fosso, Roberto Mavilia, Davide Lanata
Abstract: Patents have been one of the main topics investigated in several fields of scientific literature. Currently, data about patents is rapidly increasing, and the adoption of Data Mining and Big-Data-centered approaches to investigating them appears compulsory. Among these approaches, Social Network Analysis (SNA) is extremely promising. In this paper, we propose an SNA-based approach to extracting knowledge patterns about patent inventors and their collaborations. Our approach is extremely general and can be exploited to investigate the patents of any country. It allows the analysis of some issues that have not been considered in the past, such as the presence of "power inventors" in a country, the existence of a backbone and of possible cliques among them, and the influence and benefits of power inventors on their co-inventors and, more generally, on the R&D activities of their country. All these issues represent innovation geography knowledge patterns that can be extracted thanks to our approach.
Keywords: Patents; Knowledge Pattern Extraction; Social Network Analysis; Power Inventors; Innovation Geography.
An adaptive and interactive recommendation model for consumers behaviors prediction
by Mohamed Ramzi Haddad, Hajer Baazaoui
Abstract: Recommendation algorithms aim at predicting customers' interests and purchases using different ideas and hypotheses. Consequently, system designers need to choose the recommendation approach that is most suitable with regard to their products' nature and consumers' behaviors within the application field. In this paper, we propose an adaptive recommendation model based on statistical modeling in order to assist consumers facing choice overload by predicting their interests and consumption behaviors. We also propose a dynamic variant of the model that takes into account the recommendations' time-value during interactive online recommendation scenarios. Our proposal underwent a twofold evaluation. On the one hand, we conducted an offline comparative study on the MovieLens recommendation dataset in order to assess our model's performance with regard to several widely adopted recommendation techniques. On the other hand, the model was evaluated within a real-time online news recommendation platform to highlight its adaptability and efficiency in a highly interactive application domain. The experimental results show that our proposal is able to outperform existing approaches by unifying their main ideas and by using all the available data sources when inferring and predicting users' interests. The online experimentation with the proposed model demonstrates its scalability and low complexity, which make it a potential candidate for real-time large-scale recommendations.
Keywords: adaptive recommendation model; interactive recommendation; continuous recommendation; consumer behavior modeling and prediction.
Interval graph mining
by Amina Kemmar, Yahia Lebbah, Samir Loudni
Abstract: Frequent subgraph mining is a difficult data mining problem that aims to find the exact set of frequent subgraphs in a database of graphs. Current subgraph mining approaches rely on canonical encoding as one of their key operations. It is well known that canonical encodings have an exponential time complexity. Consequently, mining all frequent patterns in large and dense graphs is computationally expensive. In this paper, we propose an interval approach to handle canonicity, leading to two encodings, a lower and an upper encoding, with polynomial time complexity, which tightly enclose the exact set of frequent subgraphs. These two encodings lead to an interval graph mining algorithm in which two mining runs are launched in parallel, a lower mining (resp. upper mining) using the lower (resp. upper) encoding. The interval graph mining approach has been implemented within the state-of-the-art Gaston miner. Experiments performed on synthetic and real graph databases drawn from stock market and biological datasets show that our interval graph mining is effective on dense graphs.
Keywords: Graph mining; interval approach; frequent subgraph discovery; graph encoding; subgraph isomorphism; graph isomorphism.
Association rules mining using cuckoo search algorithm
by Rasha Mohammed, Mehdi Duaimi
Abstract: Association rules mining (ARM) is a fundamental and widely used data mining technique for extracting useful information from data. Traditional ARM algorithms lose computational efficiency by mining too many association rules that are not relevant to a given user. Recent research in ARM investigates metaheuristic algorithms that search for only a subset of high-quality rules. In this paper, a modified discrete cuckoo search algorithm for association rules mining (DCS-ARM) is proposed for this purpose. The effectiveness of our algorithm is tested against a set of well-known transactional databases. Results indicate that the proposed algorithm outperforms the existing metaheuristic methods.
Keywords: data mining; ARM; association rules mining; DCS; discrete cuckoo search; metaheuristic algorithm.
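The abstract above speaks of "high-quality rules" without naming its quality measure. Two standard measures for an association rule A -> B are support and confidence, and a minimal sketch of both is given below. It is an assumption that a metaheuristic such as DCS-ARM scores candidate rules with measures of this kind; the function names and the toy transactions are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    support_a = support(transactions, antecedent)
    if support_a == 0:
        return 0.0  # the rule never fires, so report zero confidence
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support_a


# Toy transactional database (assumed for illustration).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
print(support(transactions, {"bread", "milk"}))       # 0.5
print(confidence(transactions, {"bread"}, {"milk"}))  # 0.5 / 0.75
```

An exhaustive miner enumerates every rule above fixed thresholds on these measures, whereas a metaheuristic search evaluates only the candidate rules it visits.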
An Efficient Context-aware Agglomerative Fuzzy Clustering framework for Plagiarism Detection
by Anirban Chakrabarty, Sudipta Roy
Abstract: Plagiarism refers to the act of copying content without acknowledging the original source. Although several commercial tools for plagiarism detection exist, detecting plagiarism remains tricky and challenging due to the rise in the volume of online publications. Existing plagiarism detection methods use paraphrasing, sentence, and keyword matching, but such techniques have not been very effective. In this work, a framework for fuzzy-based plagiarism detection is proposed using a context-aware agglomerative clustering approach with an improved time complexity. The work aims at retrieving key concepts at the word, sentence, and paragraph levels by integrating semantic features in a novel optimization function to detect plagiarism effectively. The notion of fuzzy clustering has been applied to improve the robustness and consistency of results for clustering multi-disciplinary papers. The experimental analysis is supported by a comparison with other contemporary techniques, which indicates the superiority of the proposed approach for plagiarism detection.
Keywords: Fuzzy clustering; Context similarity; Plagiarism detection; Spanning tree; Agglomerative clustering; validity index; constrained objective function.
Mining Hidden Opinions from Objective Sentences
by Farek Lazhar
Abstract: Sentiment Analysis and Opinion Mining is a very popular and active research area in natural language processing. It deals with structured and unstructured data to identify and extract people's opinions, sentiments, and emotions from many sources of subjectivity such as product reviews, blogs, and social networks. All existing feature-level Opinion Mining approaches detect subjective sentences and eliminate objective ones before extracting explicit features and their related positive or negative polarities. However, objective sentences can carry implicit opinions, and a lack of attention to such sentences can adversely affect the obtained results. In this paper, we propose a classification-based approach to extract implicit opinions from objective sentences. Firstly, we apply a rule-based approach to extract explicit feature-opinion pairs from subjective sentences. Secondly, in order to build a classification model, a training corpus is constructed from the extracted explicit feature-opinion pairs and subjective sentences. Lastly, mining implicit feature-opinion pairs from objective sentences is formulated as a text classification problem using the previously built model. Tested on customer reviews in three different domains, the approach yields experimental results that show the effectiveness of mining opinions from objective sentences.
Keywords: Opinion Mining; Hidden Opinion; Objectivity; Subjectivity; Supervised Learning.
EFP-Tree: An Efficient FP-Tree for Incremental Mining of Frequent Patterns
by Razieh Davashi, Mohammad-Hossein Nadimi-Shahraki
Abstract: Frequent pattern mining from dynamic databases with many incremental updates is a significant research issue in data mining. After incremental updates, the validity of the frequent patterns changes. A simple way to handle this is to rerun the mining algorithm from scratch, which is very costly. To solve this problem, researchers have introduced incremental mining approaches. In this article, an efficient FP-tree named EFP-tree is proposed for incremental mining of frequent patterns. For the original database, it is constructed like an FP-tree, using an auxiliary list and without any reconstruction. For incremental updates, the EFP-tree is reconstructed once, which reduces the number of tree reconstructions, the number of reconstructed branches, and the search space. The experimental results show that the EFP-tree reduces reconstructed branches and runtime in both static and incremental mining, and enhances scalability compared to the well-known tree structures CanTree, CP-tree, SPO-tree, and GM-tree on both dense and sparse datasets.
Keywords: Data mining; Dynamic databases; Frequent pattern; Incremental mining; FP-tree.
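For readers unfamiliar with the structure that EFP-tree builds on, the following is a minimal sketch of how a plain FP-tree is constructed from a static database. It illustrates the standard FP-tree only, not the EFP-tree, its auxiliary list, or its incremental update step; the names `Node` and `build_fp_tree` and the toy data are hypothetical.

```python
from collections import Counter


class Node:
    """One FP-tree node: an item, its count, and its children keyed by item."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_count):
    """Build a prefix tree of transactions with items ordered by descending frequency."""
    # First pass: count in how many transactions each item occurs.
    freq = Counter(item for t in transactions for item in set(t))
    frequent = {item for item, c in freq.items() if c >= min_count}
    root = Node(None)
    # Second pass: insert each transaction, keeping only frequent items.
    for t in transactions:
        # Ties in frequency are broken alphabetically so the tree is deterministic.
        items = sorted((i for i in set(t) if i in frequent), key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = Node(item, node)
            node = node.children[item]
            node.count += 1
    return root


# Toy database (assumed): "b" occurs in all four transactions, so it becomes the shared prefix.
db = [["a", "b"], ["b", "c"], ["a", "b", "c"], ["b"]]
tree = build_fp_tree(db, min_count=2)
print(tree.children["b"].count)  # 4
```

Because the item order depends on global frequencies, new transactions can change that order, which is why incremental variants have to reconstruct parts of the tree.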
Topical Document Clustering: Two-Stage Post Processing Technique
by Poonam Goyal, N Mehala, Divyansh Bhatia, Navneet Goyal
Abstract: Clustering documents is an essential step in improving the efficiency and effectiveness of information retrieval systems. Topical clustering and soft clustering are techniques for shaping document clusters into coherent and natural clusters. In this paper, we propose a two-phase Split-Merge (SM) algorithm, which can be applied to topical clusters obtained from existing query-context aware document clustering algorithms to produce soft topical document clusters. The proposed technique is a post-processing technique that combines the advantages of document-pivot and feature-pivot based topical document clustering approaches. In the SM algorithm, the split phase splits the topical clusters by relating them to the topics obtained by disambiguating the web search results, and converts the clusters into homogeneous soft clusters. In the merge phase, similar topical document clusters are merged using the similarity among topical clusters computed by the feature-pivot approach. The proposed algorithm is tested on the output of two hierarchical query-context aware document clustering algorithms on different datasets, including the TREC session track 2011 dataset. The SM algorithm improves the quality of clustering appreciably in all the experiments conducted over the various datasets. We have also applied an incremental model to update the obtained topical document clusters as the data stream progresses. The model periodically incorporates new information efficiently and maintains a cluster quality comparable to that of the static model.
Keywords: Topical clustering; Query clustering; Query context; Document clustering; Incremental clustering; soft clustering