International Journal of Data Mining, Modelling and Management (12 papers in press)
An Efficient Context-aware Agglomerative Fuzzy Clustering framework for Plagiarism Detection
by Anirban Chakrabarty, Sudipta Roy
Abstract: Plagiarism refers to the act of copying content without acknowledging the original source. Although several commercial plagiarism detection tools exist, detecting plagiarism remains tricky and challenging due to the growing volume of online publications. Existing plagiarism detection methods rely on paraphrase, sentence and keyword matching, but such techniques have not been very effective. In this work, a framework for fuzzy plagiarism detection is proposed using a context-aware agglomerative clustering approach with improved time complexity. The work aims to retrieve key concepts at the word, sentence and paragraph levels by integrating semantic features into a novel optimization function to detect plagiarism effectively. Fuzzy clustering is applied to improve the robustness and consistency of results when clustering multi-disciplinary papers. The experimental analysis is supported by comparisons with other contemporary techniques, which indicate the superiority of the proposed approach for plagiarism detection.
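The paper's constrained objective is not given in the abstract; as a generic illustration of the fuzzy clustering it builds on, a standard fuzzy c-means membership update can be sketched as follows (the data points, centroids and fuzzifier `m` are illustrative assumptions, not the authors' formulation):

```python
def fuzzy_memberships(points, centroids, m=2.0):
    """Standard fuzzy c-means membership update, not the paper's
    constrained objective: u[i][j] = 1 / sum_k (d_ij / d_ik)^(2/(m-1))."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    u = []
    for p in points:
        d = [max(dist(p, c), 1e-12) for c in centroids]  # avoid division by zero
        row = []
        for j in range(len(centroids)):
            denom = sum((d[j] / d[k]) ** (2.0 / (m - 1.0)) for k in range(len(d)))
            row.append(1.0 / denom)
        u.append(row)
    return u

pts = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0)]
cents = [(0.5, 0.0), (10.0, 10.0)]
U = fuzzy_memberships(pts, cents)
```

Each point receives a degree of membership in every cluster rather than a hard assignment, which is what makes the results more robust for multi-disciplinary papers.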
Keywords: Fuzzy clustering; Context similarity; Plagiarism detection; Spanning tree; Agglomerative clustering; validity index; constrained objective function.
Mining Hidden Opinions from Objective Sentences
by Farek Lazhar
Abstract: Sentiment analysis and opinion mining is a very popular and active research area in natural language processing. It deals with structured and unstructured data to identify and extract people's opinions, sentiments and emotions from many sources of subjectivity such as product reviews, blogs, social networks, etc. Existing feature-level opinion mining approaches detect subjective sentences and eliminate objective ones before extracting explicit features and their positive or negative polarities. However, objective sentences can carry implicit opinions, and the lack of attention given to such sentences can adversely affect the obtained results. In this paper, we propose a classification-based approach to extract implicit opinions from objective sentences. Firstly, we apply a rule-based approach to extract explicit feature-opinion pairs from subjective sentences. Secondly, to build a classification model, a training corpus is constructed from the extracted explicit feature-opinion pairs and subjective sentences. Lastly, mining implicit feature-opinion pairs from objective sentences is formulated as a text classification problem using the previously built model. Tested on customer reviews in three different domains, experimental results show the effectiveness of mining opinions from objective sentences.
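The final step, classifying objective sentences with a model trained on explicitly opinionated sentences, can be sketched with a minimal multinomial Naive Bayes classifier (the training sentences, labels and test sentence below are toy examples, not the paper's corpus or its exact classifier):

```python
import math
from collections import Counter, defaultdict

# Toy training sentences labelled via their explicit feature-opinion pairs.
train = [
    ("the battery life is excellent", "positive"),
    ("great screen and fast processor", "positive"),
    ("the camera is terrible", "negative"),
    ("poor battery and slow response", "negative"),
]

def fit(examples):
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for text, label in examples:
        words = text.split()
        word_counts[label].update(words)
        class_counts[label] += 1
        vocab.update(words)
    return word_counts, class_counts, vocab

def predict(text, model):
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for label in class_counts:
        lp = math.log(class_counts[label] / total)
        n = sum(word_counts[label].values())
        for w in text.split():
            lp += math.log((word_counts[label][w] + 1) / (n + len(vocab)))  # Laplace smoothing
        if lp > best_lp:
            best, best_lp = label, lp
    return best

model = fit(train)
# An objective sentence carrying an implicit (negative) opinion:
label = predict("the battery died after one hour", model)
```

The objective sentence never states an opinion word, yet the model learned from explicit opinions pushes it toward the negative class, which is the intuition behind mining hidden opinions.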
Keywords: Opinion Mining; Hidden Opinion; Objectivity; Subjectivity; Supervised Learning.
Topical Document Clustering: Two-Stage Post Processing Technique
by Poonam Goyal, N. Mehala, Divyansh Bhatia, Navneet Goyal
Abstract: Clustering documents is an essential step in improving the efficiency and effectiveness of information retrieval systems. Topical clustering and soft clustering are techniques for shaping document clusters into coherent and natural clusters. In this paper, we propose a two-phase Split-Merge (SM) algorithm, which can be applied to topical clusters obtained from existing query-context aware document clustering algorithms to produce soft topical document clusters. The proposed technique is a post-processing technique that combines the advantages of document-pivot and feature-pivot topical document clustering approaches. In the SM algorithm, the split phase splits topical clusters by relating them to topics obtained by disambiguating the web search results, converting the clusters into homogeneous soft clusters. In the merge phase, similar topical document clusters are merged using the similarity among topical clusters computed by the feature-pivot approach. The proposed algorithm is tested on the output of two hierarchical query-context aware document clustering algorithms on different datasets, including the TREC 2011 session track dataset. The proposed SM algorithm appreciably improves the quality of clustering in all the experiments conducted over various data sets. We have also applied an incremental model to update the topical document clusters as the data stream progresses. The model periodically incorporates new information efficiently and maintains cluster quality comparable to that of the static model.
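The merge phase, combining topical clusters whose term profiles are similar, can be illustrated with a greedy merge over Jaccard similarity of top terms (the similarity measure, threshold and data here are illustrative stand-ins for the paper's feature-pivot similarity):

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def merge_similar_clusters(clusters, threshold=0.5):
    """Greedily merge clusters whose topic-term sets exceed a Jaccard
    threshold. `clusters` maps cluster id -> (documents, topic terms).
    An illustrative sketch, not the paper's exact merge criterion."""
    ids = sorted(clusters)
    merged, consumed = {}, set()
    for i in ids:
        if i in consumed:
            continue
        docs, terms = set(clusters[i][0]), set(clusters[i][1])
        for j in ids:
            if j <= i or j in consumed:
                continue
            if jaccard(terms, clusters[j][1]) >= threshold:
                docs |= set(clusters[j][0])
                terms |= set(clusters[j][1])
                consumed.add(j)
        merged[i] = (docs, terms)
    return merged

clusters = {
    0: ({"d1", "d2"}, {"python", "pandas", "dataframe"}),
    1: ({"d3"}, {"python", "pandas", "numpy"}),
    2: ({"d4"}, {"football", "league"}),
}
result = merge_similar_clusters(clusters, threshold=0.4)
```

Clusters 0 and 1 share most topic terms and are merged, while the unrelated cluster 2 survives unchanged, which mirrors the intent of merging similar topical clusters.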
Keywords: Topical clustering; Query clustering; Query context; Document clustering; Incremental clustering; soft clustering.
Support vector machines for credit risk assessment with imbalanced datasets
by Sihem Khemakhem, Younes Boujelbene
Abstract: Support Vector Machines (SVM) have limited performance on credit scoring problems because of imbalanced data sets, in which unpaid loans are far fewer than paid ones. In this work, we developed SVM models with several kernels on an imbalanced data set and suggested two data resampling alternatives: Random Over-Sampling (ROS) and the Synthetic Minority Oversampling Technique (SMOTE). The aim of this work is to explore the relevance of resampling data with the SVM technique for an accurate credit risk prediction rate in the face of the class imbalance constraint. The performance criteria chosen to evaluate the suggested technique were accuracy, sensitivity, specificity, type I error, type II error, G-mean and the area under the receiver operating characteristic curve (AUC). Significant empirical results obtained from an experimental study of a real imbalanced database of loans granted by a Tunisian bank demonstrated the performance improvement brought by sampling strategies in SVM, leading to better prediction accuracy of the creditworthiness of borrowers.
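The core SMOTE idea, synthesising new minority-class (unpaid-loan) samples by interpolating between a minority point and one of its minority-class nearest neighbours, can be sketched as follows (toy feature vectors and parameters; production implementations such as imbalanced-learn add further refinements):

```python
import random

def smote_oversample(minority, n_new, k=2, seed=0):
    """Generate synthetic minority samples by interpolating each sampled
    point toward one of its k nearest minority-class neighbours.
    A sketch of the core SMOTE step only."""
    rng = random.Random(seed)

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    synthetic = []
    for _ in range(n_new):
        p = rng.choice(minority)
        neighbours = sorted((q for q in minority if q is not p),
                            key=lambda q: dist(p, q))[:k]
        q = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(x + gap * (y - x) for x, y in zip(p, q)))
    return synthetic

unpaid = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1)]  # toy minority class
new_points = smote_oversample(unpaid, n_new=5)
```

Because each synthetic point lies on a segment between two real minority points, the oversampled region stays inside the minority class's neighbourhood, which is what gives SMOTE an edge over naive random duplication (ROS).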
Keywords: credit scoring; Support Vector Machines; imbalanced data; Synthetic Minority Oversampling Technique; Random Over Sampling.
ABCD: Agent Based Model for Document Classification
by Abdurrahman Nasr
Abstract: Document classification is the task of analyzing, identifying and categorizing a collection of documents into annotated classes based on their contents. Classification of news articles is a popular application of document categorization, alongside several other applications in industry, business, media, and government. This paper presents ABCD, an Agent-Based Classifier for Documents. ABCD is autonomous, relying on software agents to collect and distribute documents, and smart, exploiting machine learning techniques to train the underlying classifier. The system consists of two essential components, the agent component and the classification component, which integrate to form the proposed model. The agent component consists of six software agents: one collects documents, while the rest (one per news topic) distribute the classified documents to subscribers. The classification component recognizes incoming documents and assigns them to predefined categories. To be comprehensive and to facilitate comparative results, five statistical classifiers are exploited, including Naïve Bayes and Random Forest.
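The collect-classify-distribute loop of the agent component can be sketched as follows (class names, topics and the stub classifier are illustrative assumptions; the paper's agents and trained classifiers are its own):

```python
class TopicAgent:
    """Toy distribution agent for one news topic: keeps a subscriber
    list and forwards every document routed to it."""

    def __init__(self, topic):
        self.topic = topic
        self.subscribers = []
        self.outbox = []  # (subscriber, document) deliveries

    def deliver(self, doc):
        for s in self.subscribers:
            self.outbox.append((s, doc))

def route(document, classify, agents):
    # The collector agent hands the document to the classifier, whose
    # predicted topic selects the agent responsible for distribution.
    topic = classify(document)
    agents[topic].deliver(document)
    return topic

agents = {t: TopicAgent(t) for t in ("sports", "politics")}
agents["sports"].subscribers.append("alice@example.com")

# Stub classifier standing in for the trained statistical classifiers.
classify = lambda doc: "sports" if "match" in doc else "politics"
topic = route("final match report", classify, agents)
```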
Keywords: Software Agent; Supervised Learning; Random Forest; Document Classification; Unimodal document; Multi-agent system.
To Identify the Usage of Clustering Techniques for Improving Search Result of a Website
by Shashi Mehrotra, Shruti Kohli, Aditi Sharan
Abstract: Clustering has drawn much attention from the research community due to its advantages and wide range of applications. However, clustering is a challenging problem, as many factors play a significant role: the same algorithm may generate different output if the parameters, presentation order or similarity measure change. The search option is used heavily on almost every website, and grouping search results into folders, which can be achieved through clustering, improves web browsing. Clustering web elements facilitates data analysis in various ways. In this paper, we present well-known clustering algorithms and identify their different usages for web elements. The paper discusses some significant work done in this field.
Keywords: Clustering algorithm; distance measure; web analytics; complexity.
Machine Learning for Water bodies Identification from Satellite Images
by Konstantinos Kontos, Manolis Maragoudakis
Abstract: Examining satellite images of residential areas, and more particularly of bodies of water such as swimming pools, is of great interest in the field of image mining. First, unrestricted water consumption for pool operation can reduce water supplies, especially during the summer months, which can in turn affect water sources for firefighting. Moreover, pools may serve as potential mosquito habitats, especially when surrounded by dense vegetation. In this direction, this paper presents an efficient classification system for identifying swimming pools in satellite images. A new trainable segmentation method is presented for feature extraction and for creating the example set. In this study, a Support Vector Machine algorithm is used to reduce the feature set to the most appropriate features. The proposed method was tested on different areas of Greece, and an overall accuracy of 99.82% was achieved using an ensemble algorithm.
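The final ensemble decision can be illustrated with a simple majority vote over per-pixel classifiers (the stub classifiers and the single 'blueness' feature are toy assumptions; the paper's pipeline of trainable segmentation, SVM feature reduction and boosting is far richer):

```python
def majority_vote(classifiers, sample):
    """Combine binary pool/non-pool votes from several classifiers,
    returning the most common vote. A minimal ensemble step only."""
    votes = [clf(sample) for clf in classifiers]
    return max(set(votes), key=votes.count)

# Stub classifiers over a single illustrative 'blueness' feature.
clfs = [lambda x: x > 0.5, lambda x: x > 0.6, lambda x: x > 0.4]
is_pool = majority_vote(clfs, 0.55)
```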
Keywords: Satellite Images; Feature Extraction; Image Processing; Pool Detection; Trainable Segmentation; Data Mining; SVM Algorithms; Decision Trees; Image Classification; Image Mining; Adaboost.
Improving the Efficacy of Clustering By Using Far Enhanced Clustering Algorithm
by Bikram Keshari Mishra, Amiya Kumar Rath
Abstract: Several aspects of clustering, an imperative tool in data mining, are the subject of active research. Basically, the focus is on finding near-optimal cluster centers and determining the best possible groups into which objects naturally fall. With this in mind, we have emphasized finding a technique that not only determines near-optimal initial centroids but also groups data points into their respective clusters far more efficiently than several novel approaches. In this paper, we examine four clustering algorithms, namely K-Means, FEKM, ECM and the proposed FECA, implemented on varying data sets. Subsequently, we used several internal cluster validity indices, namely Dunn's index, the Davies-Bouldin index, the Silhouette Coefficient, the C index and the Calinski index, for quantitative evaluation of the clustering results. The simulation results were compared, and as expected the quality of clustering produced by FECA is far more satisfactory than that of the others. Almost every validity index gives encouraging results for FECA, implying good cluster formation. Further experiments show that the proposed algorithm also produces the minimum quantization error for almost all the data sets used.
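One of the evaluation measures mentioned, quantization error, can be sketched in one common form, the mean distance from each point to its nearest centroid, lower being better (the points and centroids below are toy data; the clustering algorithms themselves are not reproduced):

```python
def quantization_error(points, centroids):
    """Mean distance from each point to its nearest centroid, a common
    form of the quantization error used to compare clusterings."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    total = sum(min(dist(p, c) for c in centroids) for p in points)
    return total / len(points)

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = quantization_error(pts, [(0, 0.5), (10, 10.5)])  # well-placed centroids
bad = quantization_error(pts, [(5, 5)])                 # single poor centroid
```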
Keywords: Cluster Analysis; Cluster Validation; Optimal Centroid; K-Means; FEKM; FECA; ECM.
Prototype based Classification and Error Analysis under Bootstrapping Strategy
by Doosung Hwang, Youngju Son
Abstract: A prototype-based classification is proposed that selects handfuls of class data for learning rules and prediction. A class point is considered a prototype if it forms a hypersphere that represents a part of the class area, as measured by a distance metric and class labels. The prototype selection algorithm, formulated as a set covering optimisation, selects a number of within-class points that is as small as possible while preserving class covering regions for the unknown data distribution. The upper bound of the error is analysed to compare the effectiveness of the prototype-based classification with the Bayes classifier. Under a bootstrapping strategy and the 0/1 loss, the bias and variance components are derived from the generalization error without assuming the unknown distribution of a given problem. This analysis provides a way to evaluate prototype-based models and to select the optimal model estimate for any standard classifier. The experiments show that the proposed approach is very competitive compared to the nearest neighbour and Bayes classifiers, and efficient in choosing prototypes in terms of class covering regions, data size and computation time.
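The set-covering formulation can be sketched with a greedy selection: each candidate covers the same-class points within some radius, and we repeatedly pick the candidate covering the most uncovered points (a fixed radius and toy data are assumptions; the paper derives hypersphere sizes from the class structure):

```python
def greedy_prototypes(points, labels, radius):
    """Greedy set-cover selection of class prototypes: pick the point
    covering the most still-uncovered same-class points until every
    point is covered. A simplified sketch of the set-covering idea."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    n = len(points)
    covers = {i: {j for j in range(n)
                  if labels[j] == labels[i] and dist(points[i], points[j]) <= radius}
              for i in range(n)}
    uncovered, prototypes = set(range(n)), []
    while uncovered:
        best = max(range(n), key=lambda i: len(covers[i] & uncovered))
        prototypes.append(best)
        uncovered -= covers[best]
    return prototypes

pts = [(0, 0), (0.5, 0), (1, 0), (5, 5), (5.5, 5)]
lbl = ["a", "a", "a", "b", "b"]
protos = greedy_prototypes(pts, lbl, radius=1.0)
```

Five points collapse to two prototypes, one per class region, which is the compression the selection algorithm is after.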
Keywords: class prototype; set covering optimization; greedy method; nearest neighbor; error analysis.
Tree-based Text Stream Clustering with Application to Spam Mail Classification
by Phimphaka Taninpong, Sudsanguan Ngamsuriyaroj
Abstract: This paper proposes a new text clustering algorithm based on a tree structure. The main idea is that the sub-tree rooted at a specific node represents a document cluster. Our clustering algorithm is a single-pass algorithm that traverses down the tree to search for all clusters without having to predefine their number. Thus, it fits our objectives of producing document clusters with high cohesion while keeping the number of clusters to a minimum. Moreover, an incremental learning process is performed after a new document is inserted into the tree, and the clusters are rebuilt to accommodate the new information. In addition, we applied the proposed clustering algorithm to spam mail classification, and the experimental results show that the tree-based text clustering spam filter gives higher accuracy and specificity than Cobweb clustering, Naive Bayes and KNN.
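The single-pass property, each document either joins a sufficiently similar existing cluster or starts a new one, so the number of clusters is never fixed in advance, can be sketched with a flat leader-style pass (the token-overlap similarity, threshold and toy mails are illustrative; the paper organises clusters in a tree and rebuilds sub-trees incrementally):

```python
def single_pass_cluster(docs, threshold=0.3):
    """Single-pass clustering: each document joins the most similar
    existing cluster if similarity clears `threshold`, else it starts
    a new cluster. A flat sketch of the single-pass idea only."""
    def sim(a, b):
        a, b = set(a.split()), set(b.split())
        return len(a & b) / (len(a) * len(b)) ** 0.5 if a and b else 0.0

    clusters = []  # each cluster is a list of documents
    for d in docs:
        best, best_s = None, 0.0
        for c in clusters:
            s = max(sim(d, member) for member in c)
            if s > best_s:
                best, best_s = c, s
        if best is not None and best_s >= threshold:
            best.append(d)
        else:
            clusters.append([d])
    return clusters

mails = ["win free prize now", "claim your free prize",
         "meeting agenda attached", "agenda for the meeting"]
groups = single_pass_cluster(mails)
```

One scan over the mail stream yields a spam-like group and a work-like group without ever specifying a cluster count, which is the property the tree-based algorithm exploits for spam filtering.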
Keywords: clustering; text clustering; tree-based clustering; spam; spam classification; text classification.
A New Method for Behavioral-Based Malware Detection Using Reinforcement Learning
by Mansour Esmaeilpour
Abstract: Malware, the abbreviation for malicious software, is a comprehensive term for software deliberately created to perform unauthorized and often harmful actions. Viruses, backdoors, keyloggers, Trojans, password-stealing software, spyware and adware are examples of malware. Calling something a virus or Trojan was once sufficient, but as methods of infection evolved, these terms no longer adequately described all types of malicious programs. This research focuses on clustering malware according to its features. To counter the dangers of malware, applications have been created to track it down. This paper presents a new method for malware detection using reinforcement learning. The results demonstrate that the proposed method can detect malware more accurately.
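One way to cast detection as reinforcement learning is a one-step flag/allow decision rewarded for correctness; a tabular Q-learning sketch follows (the discretised behavioural features, reward design and toy data are assumptions for illustration, not the paper's formulation):

```python
import random
from collections import defaultdict

def train_q_detector(samples, episodes=2000, alpha=0.5, epsilon=0.1, seed=1):
    """Tabular Q-learning for a one-step flag/allow decision per sample.
    States are discretised feature tuples; reward is +1 for a correct
    decision and -1 otherwise. A toy casting of detection as RL."""
    rng = random.Random(seed)
    q = defaultdict(float)  # (state, action) -> estimated value
    actions = ("flag", "allow")
    for _ in range(episodes):
        state, is_malware = rng.choice(samples)
        if rng.random() < epsilon:            # epsilon-greedy exploration
            action = rng.choice(actions)
        else:
            action = max(actions, key=lambda a: q[(state, a)])
        reward = 1.0 if (action == "flag") == is_malware else -1.0
        q[(state, action)] += alpha * (reward - q[(state, action)])  # one-step update
    return q

# (writes_registry, opens_network) -> malware label; toy behavioural data.
data = [((1, 1), True), ((1, 0), True), ((0, 0), False), ((0, 1), False)]
Q = train_q_detector(data)
decide = lambda s: max(("flag", "allow"), key=lambda a: Q[(s, a)])
```

After training, the greedy policy flags the behaviour patterns that were rewarded for flagging and allows the rest.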
Keywords: Antivirus (AVS); Malware; Reinforcement Learning.
Efficient Spatial Query Processing for KNN Queries using well organized Net-Grid Partition Indexing Approach
by Geetha Kannan, Kannan Arputharaj
Abstract: In recent years, most applications running on mobile devices with Global Positioning System (GPS) support provide Location Based Services. However, queries sent from mobile devices to obtain such services take a long time to process because of the size of the spatial data. To solve this problem, this paper proposes an efficient indexing method for effective query processing in mobile computing environments, introducing a new Net-Grid based Partition Indexing approach. This indexing method increases the efficiency of query retrieval in mobile network environments. Since existing mobile network applications access spatial objects node by node when processing a query, query retrieval over spatial databases suffers from long processing times. The proposed work addresses this by partitioning the road network into a large number of rectangular regions called grids and creating indexes based on these grids. Experimental results using the proposed Net-Grid based partition index show that the proposed model provides fast and accurate processing of spatial queries compared with the existing approaches used in database systems.
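The benefit of grid partitioning for KNN queries, only cells near the query are inspected, can be sketched with a uniform grid searched in growing rings (cell size, the one-extra-ring stopping heuristic and the toy points are illustrative assumptions; the paper's Net-Grid index over a road network is more elaborate):

```python
def build_grid(points, cell):
    """Index 2-D points by integer grid cell, a minimal stand-in for a
    grid partition index."""
    grid = {}
    for p in points:
        key = (int(p[0] // cell), int(p[1] // cell))
        grid.setdefault(key, []).append(p)
    return grid

def knn(grid, cell, q, k):
    """Collect candidates from grid cells in growing square rings around
    the query, so far-away cells are never touched for nearby hits."""
    cx, cy = int(q[0] // cell), int(q[1] // cell)
    candidates, ring, extra = [], 0, 1
    while ring <= 1000:  # guard for sparse grids
        for dx in range(-ring, ring + 1):
            for dy in range(-ring, ring + 1):
                if max(abs(dx), abs(dy)) != ring:  # visit only the new ring
                    continue
                for p in grid.get((cx + dx, cy + dy), []):
                    d = ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
                    candidates.append((d, p))
        if len(candidates) >= k:
            if extra == 0:
                break
            extra -= 1  # scan one extra ring as a simple safety margin
        ring += 1
    return [p for _, p in sorted(candidates)[:k]]

pts = [(0.5, 0.5), (1.5, 0.5), (5.5, 5.5), (0.6, 0.4)]
grid = build_grid(pts, 1.0)
res = knn(grid, 1.0, (0.5, 0.4), 2)
```

The query touches only the cells around (0, 0), so the distant point at (5.5, 5.5) is never examined, which is the saving a partition index buys for mobile query processing.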
Keywords: Cache mechanism; KNN Queries; Location Based Services; Mobile Environments; Partition Index; Query Processing; Spatial data Management; Spatial Networks; Spatial Query; Wireless data broadcast.