International Journal of Data Mining, Modelling and Management (19 papers in press)
ABCD: Agent Based Model for Document Classification
by Abdurrahman Nasr
Abstract: Document classification is the task of analyzing, identifying and categorizing a collection of documents into their annotated classes based on their contents. Classification of news articles is a popular application of document categorization, alongside several other applications in industry, business, media and government. This paper presents ABCD, an Agent Based Classifier for Documents. ABCD is autonomous, relying on software agents to collect and distribute documents, and smart, exploiting machine learning techniques to train the underlying classifier. As such, the system consists of two essential components, the agent component and the classification component, which integrate to form the proposed model. The agent component consists of six software agents: one collects documents, while the remaining five (one per news topic) distribute the classified documents to subscribers.
The classification component recognizes incoming documents and assigns them to predefined categories. To be comprehensive and to facilitate comparative results, five statistical classifiers are exploited, including Naïve Bayes.
Keywords: Software Agent; Supervised Learning; Random Forest; Document Classification; Unimodal document; Multi-agent system.
To Identify the Usage of Clustering Techniques for Improving Search Result of a Website
by Shashi Mehrotra, Shruti Kohli, Aditi Sharan
Abstract: Clustering has drawn much attention from the research community due to its advantages and wide applications. However, clustering is a challenging problem, as many factors play a significant role: the same algorithm may generate different output if the parameters, presentation order or similarity measure change. The search option is used extensively on almost every website. Grouping search results into folders can improve web browsing, and that can be achieved through clustering.
Clustering web elements facilitates data analysis in various ways. In this paper, we present well-known clustering algorithms and identify their different uses for web elements. The paper discusses some significant work done in this field.
Keywords: Clustering algorithm; distance measure; web analytics; complexity.
Machine Learning for Water bodies Identification from Satellite Images
by Konstantinos Kontos, Manolis Maragoudakis
Abstract: Examining satellite images of residential areas, and more particularly of bodies of water such as swimming pools, is of great interest in the field of image mining. Unchecked water consumption for pool operation can reduce water supplies, especially during summer months, which can in turn affect water sources for firefighting. Moreover, pools may serve as potential mosquito habitats, especially if they are surrounded by dense vegetation. Towards this direction, this paper presents an efficient classification system for identifying swimming pools in satellite images. A new trainable segmentation method is presented for feature extraction and for creating the example set. In this study, a Support Vector Machine algorithm is used to reduce the feature set to the most appropriate features. The proposed method was tested on different areas of Greece, and an overall accuracy of 99.82% was achieved using an ensemble algorithm.
Keywords: Satellite Images; Feature Extraction; Image Processing; Pool Detection; Trainable Segmentation; Data Mining; SVM Algorithms; Decision Trees; Image Classification; Image Mining; Adaboost.
Improving the Efficacy of Clustering By Using Far Enhanced Clustering Algorithm
by Bikram Keshari Mishra, Amiya Kumar Rath
Abstract: Several research directions arise when the subject of discussion is clustering of objects, an imperative tool in data mining. Basically, the focus is on finding near-optimal cluster centers and determining the best possible groups into which the objects fall. Keeping this in mind, we have emphasized finding a technique that not only determines near-optimal initial centroids but also groups the data points into their respective clusters far more efficiently than several novel approaches. In this paper, we examine four clustering algorithms, namely K-Means, FEKM, ECM and the proposed FECA, implemented on varying data sets. Subsequently, we used internal cluster validity indices, namely Dunn's index, the Davies-Bouldin index, the Silhouette Coefficient, the C index and the Calinski index, for quantitative evaluation of the clustering results obtained. The simulation results were compared and, as expected, the quality of clustering produced by FECA was found to be far more satisfactory than that of the others. Almost every validity index used gives encouraging results for FECA, implying good cluster formation. Further experiments show that the proposed algorithm also produces the minimum quantization error for almost all the data sets used.
Keywords: Cluster Analysis; Cluster Validation; Optimal Centroid; K-Means; FEKM; FECA and ECM.
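The validity indices above all reward compact, well-separated clusters. As an illustration (not the authors' implementation), the Dunn index, the ratio of the smallest between-cluster distance to the largest within-cluster diameter, can be sketched in pure Python:

```python
import math

def dunn_index(clusters):
    """Dunn index: min inter-cluster distance / max intra-cluster diameter.
    Higher values indicate more compact, better-separated clusters."""
    def dist(a, b):
        return math.dist(a, b)

    # Largest within-cluster diameter.
    diameters = [
        max((dist(p, q) for p in c for q in c), default=0.0)
        for c in clusters
    ]
    max_diam = max(diameters)

    # Smallest distance between points of different clusters.
    min_sep = min(
        dist(p, q)
        for i, ci in enumerate(clusters)
        for cj in clusters[i + 1:]
        for p in ci for q in cj
    )
    return min_sep / max_diam

# Two tight, well-separated clusters score higher than two loose adjacent ones.
tight = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]
loose = [[(0, 0), (4, 0)], [(5, 0), (9, 0)]]
print(dunn_index(tight))  # → 10.0
print(dunn_index(loose))  # → 0.25
```

The other indices (Davies-Bouldin, Silhouette, C, Calinski) follow the same pattern of comparing within-cluster scatter against between-cluster separation.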
Prototype based Classification and Error Analysis under Bootstrapping Strategy
by Doosung Hwang, Youngju Son
Abstract: A prototype based classification is proposed to select handfuls of class data for learning rules and prediction. A class point is considered a prototype if it forms a hypersphere that represents a part of the class area, measured by a distance metric and class labels. The prototype selection algorithm, formulated as a set covering optimisation, selects as few within-class points as possible while preserving class covering regions for the unknown data distribution. The upper bound of the error is analysed to compare the effectiveness of the prototype based classification with the Bayes classifier. Under a bootstrapping strategy and the 0/1 loss, the bias and variance components are derived from the generalisation error without assuming the unknown distribution of a given problem. This analysis provides a way to evaluate prototype based models and to select the optimal model estimate for any standard classifier. The experiments show that the proposed approach is very competitive when compared to the nearest neighbour and Bayes classifiers, and efficient in choosing prototypes in terms of class covering regions, data size and computation time.
Keywords: class prototype; set covering optimisation; greedy method; nearest neighbour; error analysis.
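Set covering optimisation of the kind described here is typically solved with a greedy heuristic: repeatedly pick the candidate whose covering region captures the most still-uncovered same-class points. A simplified sketch, with a hypothetical fixed covering radius rather than the paper's hypersphere construction:

```python
def greedy_prototypes(points, labels, radius):
    """Greedy set-cover heuristic: repeatedly pick the point whose
    same-class neighbourhood covers the most still-uncovered points."""
    def covers(i):
        # Indices of same-class points within `radius` of point i.
        return {
            j for j, p in enumerate(points)
            if labels[j] == labels[i]
            and sum((a - b) ** 2 for a, b in zip(points[i], p)) <= radius ** 2
        }

    cover_sets = {i: covers(i) for i in range(len(points))}
    uncovered = set(range(len(points)))
    prototypes = []
    while uncovered:
        best = max(cover_sets, key=lambda i: len(cover_sets[i] & uncovered))
        prototypes.append(best)
        uncovered -= cover_sets[best]
    return prototypes

# Three class-0 points in one region, two class-1 points in another:
# two prototypes suffice to cover everything.
pts = [(0, 0), (0.5, 0), (1, 0), (10, 0), (10.5, 0)]
labs = [0, 0, 0, 1, 1]
print(greedy_prototypes(pts, labs, radius=1.0))  # → [0, 3]
```

The greedy choice gives the classic logarithmic approximation guarantee for set cover while keeping the prototype count small, which matches the data-size and computation-time efficiency the abstract reports.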
Tree-based Text Stream Clustering with Application to Spam Mail Classification
by Phimphaka Taninpong, Sudsanguan Ngamsuriyaroj
Abstract: This paper proposes a new text clustering algorithm based on a tree structure. The main idea of the clustering algorithm is that a sub-tree at a specific node represents a document cluster. Our clustering algorithm is a single-pass algorithm that traverses down the tree to search for all clusters without having to predefine the number of clusters. Thus, it fits our objectives of producing document clusters with high cohesion while keeping the number of clusters to a minimum. Moreover, an incremental learning process is performed after a new document is inserted into the tree, and the clusters are rebuilt to accommodate the new information. In addition, we applied the proposed clustering algorithm to spam mail classification, and the experimental results show that the tree-based text clustering spam filter gives higher accuracy and specificity than Cobweb clustering, Naive Bayes and KNN.
Keywords: clustering; text clustering; tree-based clustering; spam; spam classification; text classification.
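The single-pass idea can be illustrated with a flat (non-tree) sketch: each incoming document joins the most similar existing cluster, or starts a new one, in one scan and without a predefined cluster count. This is a hypothetical simplification using Jaccard overlap of token sets; the paper's tree structure organises such clusters hierarchically:

```python
def single_pass_cluster(docs, threshold=0.3):
    """Single-pass clustering: assign each document to the most similar
    existing cluster (by Jaccard overlap with the cluster's token pool),
    or start a new cluster if no similarity reaches `threshold`."""
    def jaccard(a, b):
        return len(a & b) / len(a | b)

    clusters = []   # list of lists of document indices
    pools = []      # union of tokens per cluster
    for i, doc in enumerate(docs):
        tokens = set(doc.lower().split())
        scores = [jaccard(tokens, pool) for pool in pools]
        if scores and max(scores) >= threshold:
            k = scores.index(max(scores))
            clusters[k].append(i)
            pools[k] |= tokens
        else:
            clusters.append([i])
            pools.append(tokens)
    return clusters

docs = [
    "cheap meds buy now",
    "buy cheap meds today",
    "meeting agenda for monday",
    "monday meeting agenda attached",
]
print(single_pass_cluster(docs))  # → [[0, 1], [2, 3]]
```

In the spam-filtering application, a new mail would be routed to whichever cluster (spam-like or ham-like) it lands in.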
A New Method for Behavioral-Based Malware Detection Using Reinforcement Learning
by Mansour Esmaeilpour
Abstract: Malware, an abbreviation for malicious software, is a comprehensive term for software deliberately created to perform unauthorized and often harmful actions. Viruses, backdoors, keyloggers, Trojans, password-stealing software, spyware and adware are examples of malware. Once, calling something a virus or a Trojan was enough; however, as methods of contamination developed, those terms no longer covered all types of malicious programs. This research focuses on clustering malware according to its features. To avoid the dangers of malware, applications have been created to track it down. This paper presents a new method for malware detection using reinforcement learning. The results demonstrate that the proposed method can detect malware more accurately.
Keywords: Antivirus (AVS); Malware; Reinforcement Learning.
Efficient Spatial Query Processing for KNN Queries using well organized Net-Grid Partition Indexing Approach
by Geetha Kannan, Kannan Arputharaj
Abstract: In recent years, most applications use mobile devices with Geographical Positioning System support to provide Location Based Services. However, the queries sent through mobile devices to obtain such services take a long time to process due to the size of the spatial data. To solve this problem, this paper proposes an efficient indexing method for effective query processing in mobile computing environments by introducing a new Net-Grid based Partition Indexing approach. This indexing method increases the efficiency of query retrieval in mobile network environments. Since existing mobile network applications access spatial objects node by node when processing a query, query retrieval over spatial databases becomes a major bottleneck, consuming considerable processing time. The proposed work addresses this problem by partitioning the road network into a large number of rectangular regions called grids and by creating indexes based on these grids. The experimental results show that the proposed Net-Grid based partition index approach provides fast retrieval with high accuracy in processing spatial queries when compared with existing approaches used in database systems.
Keywords: Cache mechanism; KNN Queries; Location Based Services; Mobile Environments; Partition Index; Query Processing; Spatial data Management; Spatial Networks; Spatial Query; Wireless data broadcast.
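The grid-partition idea can be sketched as follows: index each point by the rectangular cell it falls into, then answer a KNN query by expanding outward ring by ring from the query's cell, so only nearby partitions are ever touched. This is a simplified 2-D illustration with a hypothetical one-ring safety margin, not the proposed Net-Grid index itself:

```python
import math

def build_grid(points, cell):
    """Index 2-D points by the square grid cell they fall into."""
    grid = {}
    for i, (x, y) in enumerate(points):
        grid.setdefault((int(x // cell), int(y // cell)), []).append(i)
    return grid

def knn_query(points, grid, q, k, cell):
    """Collect candidate points ring by ring around the query's cell,
    add one safety ring, then rank candidates by true distance."""
    cx, cy = int(q[0] // cell), int(q[1] // cell)

    def ring(r):  # indices stored in cells at Chebyshev distance r
        return [
            i
            for gx in range(cx - r, cx + r + 1)
            for gy in range(cy - r, cy + r + 1)
            if max(abs(gx - cx), abs(gy - cy)) == r
            for i in grid.get((gx, gy), [])
        ]

    limit = max(max(abs(gx - cx), abs(gy - cy)) for gx, gy in grid)
    candidates, r = [], 0
    while r <= limit and len(candidates) < k:
        candidates += ring(r)
        r += 1
    candidates += ring(r)  # a closer point may sit just across a cell border
    candidates.sort(key=lambda i: math.dist(points[i], q))
    return candidates[:k]

points = [(1, 1), (2, 8), (9, 2), (8, 9), (1.5, 1.5)]
grid = build_grid(points, cell=5)
print(knn_query(points, grid, (1, 2), 2, 5))  # → [4, 0]
```

The speed-up comes from pruning: most cells are never inspected, which is the same motivation behind partitioning the road network into grid regions.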
Analysis of a Performability Model for the BRT System
by Renata Dantas, Jamilson Dantas, Gabriel Alves, Paulo Maciel
Abstract: Large cities have increasing mobility problems due to the large number of vehicles on the streets, which results in traffic jams and a waste of time and resources. One alternative for improving traffic is to prioritize the public transportation system. Several metropolises around the world are adopting Bus Rapid Transit (BRT) systems, since they present compelling results from a cost-benefit perspective. Evaluating metrics such as performance, reliability and performability aids in the planning, monitoring and optimization of BRT systems. This paper presents hierarchical models, using CTMC modeling techniques, to assess performance and performability metrics. The results show that the models identify the peak intervals in which a vehicle is most likely to arrive at its destination in a shorter time, as well as the probability of the vehicle being affected by a failure in each interval. It was also possible to establish a basis for replicating the model in different scenarios to enable new comparative studies.
Keywords: Bus Rapid Transit (BRT); CTMC; Performability analysis.
Can Market Indicators Forecast The Port Throughput?
by Aylin Çalışkan, Burcu Karaöz
Abstract: The main aim of this study is to forecast the likelihood of port throughput increasing or decreasing from month to month, with selected market indicators as input variables. A further aim is to determine whether Artificial Neural Network (ANN) and Support Vector Machine (SVM) algorithms are capable of accurately predicting the movement of port throughput. To this end, Turkish ports were chosen as the research environment. The monthly average exchange rates of the U.S. dollar, the Euro and gold (against the Turkish Lira), together with crude oil prices, were used as market indicators in the prediction models. The experimental results reveal that the model with these market indicators successfully forecasts the direction of movement of port throughput, with an accuracy rate of 90.9% for ANN and 84.6% for SVM. The model developed in this research may help managers develop short-term logistics plans for operational processes, and may help researchers adapt the model to other research areas.
Keywords: port throughput; predicting; forecasting in shipping; ANN; SVM.
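The direction-of-movement formulation amounts to a binary labelling of month-over-month changes, with the indicator vector of month t predicting the label for month t+1. A sketch with hypothetical figures; any ANN or SVM library could then be trained on the resulting pairs:

```python
def direction_labels(throughput):
    """1 if throughput rose relative to the previous month, else 0."""
    return [int(later > earlier)
            for earlier, later in zip(throughput, throughput[1:])]

def make_dataset(indicators, throughput):
    """Pair month t's indicator vector with the t -> t+1 direction label."""
    labels = direction_labels(throughput)
    return list(zip(indicators[:-1], labels))

# Hypothetical monthly figures: [USD rate, EUR rate, gold price, crude oil price].
indicators = [
    [8.5, 9.2, 510.0, 82.0],
    [8.7, 9.4, 515.0, 85.0],
    [8.6, 9.3, 508.0, 80.0],
]
throughput = [100_000, 104_000, 101_000]  # hypothetical monthly TEU

dataset = make_dataset(indicators, throughput)
print([label for _, label in dataset])  # → [1, 0]
```

Framing the task as up/down classification rather than level regression is what lets plain accuracy (90.9% / 84.6%) serve as the evaluation metric.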
Deciphering Published Articles in Cyber Terrorism: A Latent Dirichlet Allocation Algorithm Application
by Las Johansen Caluza
Abstract: The emerging issue of cyberterrorism is a serious problem causing disturbance in cyberspace. To unravel the underlying issues surrounding cyber terrorism, it is imperative to look into the available documents in NATO's repository. Articles were extracted using web-mining techniques, and topic modeling was performed using NLP. This study employed the Latent Dirichlet Allocation algorithm, an unsupervised machine learning method, to generate latent themes from the text corpus. Five underlying themes were identified from the results. Finally, the analysis revealed a profound understanding of cyber terrorism as a pragmatic menace to cyberspace, spread worldwide through black propaganda, recruitment, computer and network hacking, economic sabotage and other activities. As a result, countries around the world, including NATO and its allies, have continuously improved their capabilities against cyber terrorism.
Keywords: Topic modeling; LDA; cyber terrorism; unsupervised machine learning; NLP; web mining; sequential exploratory design; Gibbs sampling; cyberspace.
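LDA with Gibbs sampling, as named in the keywords, can be sketched with a minimal collapsed Gibbs sampler. The toy corpus below is hypothetical and the code is an illustration of the algorithm, not the study's pipeline:

```python
import random
from collections import Counter

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Minimal collapsed Gibbs sampler for LDA. Returns per-topic word
    counts, from which the latent themes can be read off."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})  # vocabulary size

    # z[d][i]: topic assigned to word i of document d; count tables follow.
    z = [[rng.randrange(n_topics) for _ in d] for d in docs]
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [Counter() for _ in range(n_topics)]
    topic_total = [0] * n_topics
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # Remove the current assignment, then resample the topic.
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + V * beta)
                    for k in range(n_topics)
                ]
                t = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1
    return topic_word

docs = [
    ["cyber", "attack", "network", "hacking"],
    ["hacking", "network", "cyber", "attack"],
    ["propaganda", "recruitment", "media", "propaganda"],
    ["media", "recruitment", "propaganda", "media"],
]
topics = lda_gibbs(docs, n_topics=2)
for t in topics:
    print([w for w, _ in t.most_common(3)])
```

On a real corpus the top words per topic are then interpreted by hand into themes, which is how the study's five themes would be read off.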
An Innovative and Efficient Method for Twitter Sentiment Analysis
by Hima Suresh
Abstract: Sentiment analysis is one of the most accomplished fields in the data mining area. Specifically, sentiment analysis centres on analysing attitudes and opinions relating to a particular topic of interest using machine learning approaches, lexicon based approaches or hybrid approaches. There is strong interest in developing automated systems that can identify and classify sentiments in text. An efficient approach to predicting sentiments would allow us to extract opinions from web content and to predict online public choices, which could prove valuable for tracking changes in the sentiment of Twitter users. This paper presents a model to analyse brand impact using real data gathered from the microblog Twitter over a period of 14 months, and also reviews existing methods and approaches in sentiment analysis. Twitter-based information gathering enables collecting direct responses from the target audience, providing valuable insight into public sentiment when predicting opinion of a particular product. The experimental results show that the proposed method for Twitter sentiment analysis outperforms the compared approaches, with an accuracy of 86.8%.
Keywords: Sentiment Analysis; Machine Learning Approach; Lexicon Based Approach; Supervised Learning.
A new development of an adaptive Xbar-R control chart under a fuzzy environment
by H. Sabahno, S.M. Mousavi, A. Amiri
Abstract: Adaptive control charts have been shown to perform better than classical control charts, since some or all of their parameters adapt to previous process information. Fuzzy classical control charts have been considered by many researchers over the last two decades; however, fuzzy adaptive control charts have not been investigated. In this paper, we introduce a new adaptive fuzzy control chart that allows all of the chart's parameters to adapt based on the process state in the previous sample. The warning limits are also redefined in the fuzzy environment. We utilize the fuzzy mode defuzzification technique to design the decision procedure in the proposed fuzzy adaptive control chart. Finally, an illustrative example presents the application of the proposed control chart.
Keywords: Xbar-R control charts; Adaptive control charts; Fuzzy uncertainty; Trapezoidal fuzzy numbers.
Human Activity Recognition based on Interaction Modelling
by Subetha T, Chitrakala S
Abstract: Human Activity Recognition aims at automatically recognizing and interpreting the activities of humans from videos. Among these activities, identifying interactions between humans with minimal computation time and a reduced misclassification rate is a cumbersome task. Hence, an interaction-based Human Activity Recognition system is proposed in this paper that utilizes silhouette features to identify and classify interactions between humans. The main issues that affect performance at the various stages of activity recognition are sudden illumination changes, detection of static humans, loss of spatio-temporal features while extracting silhouettes, data discrimination, data variance, crowding, and computational complexity. To address these issues, three new algorithms are proposed: the weight-based updating Gaussian Mixture Model (wu-GMM), Spatial Dissemination-based Contour Silhouettes (SDCS), and Weighted Constrained Dynamic Time Warping (WCDTW). Experiments are conducted on benchmark datasets such as the Gaming dataset and the Kinect Interaction dataset. The results demonstrate that the proposed system recognizes interaction-based human activities with a reduced misclassification rate and minimal processing time compared to the existing motion-pose geometric descriptor (MPGD) representation, for activities such as right punch, left punch, defence, and so on. The proposed Human Activity Recognition system finds applications in sports event analysis, video surveillance, content-based video retrieval, robotics, and others.
Keywords: Human Activity Recognition; weight-based updating Gaussian Mixture Model; Spatial Dissemination-based Contour Silhouettes; Weighted Constrained Dynamic Time Warping; Dynamic Time Warping; reduced variance-t Stochastic Neighbor Embedding.
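WCDTW builds on standard dynamic time warping, which aligns two sequences that share a shape but differ in pacing. A minimal DTW with an optional Sakoe-Chiba window (illustrating the "constrained" part; the paper's weighting scheme is not reproduced here) can be sketched as:

```python
def dtw(a, b, window=None):
    """Dynamic time warping distance between two 1-D sequences, with an
    optional Sakoe-Chiba window constraining how far the alignment path
    may stray from the diagonal."""
    n, m = len(a), len(b)
    w = max(window or max(n, m), abs(n - m))  # window must span the length gap
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

print(dtw([1, 2, 3], [1, 2, 2, 3]))  # → 0.0 (same shape, different pacing)
print(dtw([1, 2, 3], [4, 5, 6]))     # → 9.0
```

In activity recognition the sequences would be per-frame silhouette feature vectors rather than scalars, but the alignment recurrence is the same.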
Using implicitly and explicitly rated online customer reviews to build opinionated Arabic lexicons
by Mohammad DAOUD
Abstract: Creating an opinionated lexicon is an important step towards a reliable social media analysis system. In this article we propose an approach, and describe an experiment, to build an Arabic polarized lexical database by analysing implicitly and explicitly rated online customer reviews. These reviews are written in Modern Standard Arabic and the Palestinian/Jordanian dialect. Therefore, the produced lexicon comprises casual slang and dialectal entries used by the online community, which is useful for sentiment analysis of informal social media microblogs. We extracted 28,000 entries by processing 15,100 reviews and by expanding the initial lexicon through Google Translate. We calculated an implicit rating for every review, derived from its text, to address the problem of ambiguous opinions in certain online posts, where the text of the review does not match the given (explicit) rating. Each entry was given a polarity tag and a confidence score. High confidence scores increased the precision of the polarization process, and explicit ratings increased the coverage and confidence of the polarity assignments.
Keywords: polarized lexicon; social media analysis; opinion mining; term extraction; sentiment analysis.
Mining hidden opinions from objective sentences
by Farek Lazhar
Abstract: Sentiment analysis and opinion mining is a very popular and active research area in natural language processing. It deals with structured and unstructured data to identify and extract people's opinions, sentiments and emotions from many sources of subjectivity, such as product reviews, blogs and social networks. Existing feature-level opinion mining approaches detect subjective sentences and eliminate objective ones before extracting explicit features and their positive or negative polarities. However, objective sentences can carry implicit opinions, and a lack of attention to such sentences can adversely affect the results. In this paper, we propose a classification-based approach to extract implicit opinions from objective sentences. Firstly, we apply a rule-based approach to extract explicit feature-opinion pairs from subjective sentences. Secondly, to build a classification model, we construct a training corpus from the extracted explicit feature-opinion pairs and subjective sentences. Lastly, mining implicit feature-opinion pairs from objective sentences is formulated as a text classification problem using the model previously built. Tested on customer reviews in three different domains, the experimental results show the effectiveness of mining opinions from objective sentences.
Keywords: opinion mining; hidden opinion; objectivity; subjectivity; supervised learning.
Topical document clustering: two-stage post processing technique
by Poonam Goyal, N. Mehala, Divyansh Bhatia, Navneet Goyal
Abstract: Clustering documents is an essential step in improving the efficiency and effectiveness of information retrieval systems. We propose a two-phase split-merge (SM) algorithm, which can be applied to topical clusters obtained from existing query-context-aware document clustering algorithms to produce soft topical document clusters. SM is a post-processing technique that combines the advantages of document-pivot and feature-pivot topical document clustering approaches. The split phase splits the topical clusters by relating them to topics obtained by disambiguating web search results, converting them into homogeneous soft clusters. In the merge phase, similar clusters are merged using the feature-pivot approach. SM is tested on the output of two hierarchical query-context-aware document clustering algorithms on different datasets, including the TREC 2011 session-track dataset. The obtained topical clusters are also updated incrementally as the data stream progresses. The proposed algorithm appreciably improves the quality of clustering in all the experiments conducted.
Keywords: topical clustering; query clustering; query context; document clustering; incremental clustering; soft clustering.
Support vector machines for credit risk assessment with imbalanced datasets
by Sihem Khemakhem, Younes Boujelbene
Abstract: Support vector machines (SVM) have limited performance in credit scoring due to imbalanced data sets, in which the number of unpaid loans is much lower than that of paid loans. In this work, we developed an SVM model with several kernels on a set of imbalanced data and evaluated two data resampling alternatives: random over sampling (ROS) and the synthetic minority oversampling technique (SMOTE). The aim of this work is to explore the relevance of resampling data with the SVM technique for accurate credit risk prediction under the class imbalance constraint. The performance criteria chosen to evaluate the suggested technique were accuracy, sensitivity, specificity, type I error, type II error, G-mean and the area under the receiver operating characteristic curve (AUC). Significant empirical results from an experimental study of a real imbalanced database of loans granted by a Tunisian bank demonstrated the performance improvement achieved by sampling strategies in SVM, leading to better prediction accuracy of the creditworthiness of borrowers.
Keywords: credit scoring; support vector machines; SVM; synthetic minority oversampling technique; SMOTE; random over sampling; ROS; credit risk assessment; imbalanced datasets; performance criteria; Tunisian bank; creditworthiness prediction accuracy.
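Of the two resampling alternatives, random oversampling is the simpler: minority-class rows are duplicated at random until the classes balance, whereas SMOTE synthesises new points by interpolating between minority-class neighbours. A sketch with hypothetical loan data; the balanced set would then be fed to an SVM from any library:

```python
import random

def random_oversample(X, y, seed=0):
    """Random oversampling (ROS): duplicate randomly chosen minority-class
    samples until every class matches the majority-class size."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for xi in rows + extra:
            X_out.append(xi)
            y_out.append(label)
    return X_out, y_out

# Hypothetical loan records: 6 paid (class 0) vs 2 unpaid (class 1).
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [0.9], [1.0]]
y = [0, 0, 0, 0, 0, 0, 1, 1]
X_bal, y_bal = random_oversample(X, y)
print(y_bal.count(0), y_bal.count(1))  # → 6 6
```

Balancing before training is what shifts the SVM's decision boundary towards the rare "unpaid" class, which is why the abstract's sensitivity and G-mean criteria improve under resampling.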
An efficient context-aware agglomerative fuzzy clustering framework for plagiarism detection
by Anirban Chakrabarty, Sudipta Roy
Abstract: Plagiarism refers to the act of copying content without acknowledging the original source. Though several commercial tools exist for plagiarism detection, plagiarism remains tricky and challenging due to the rising volume of online publications. Existing plagiarism detection methods use paraphrasing, sentence and keyword matching, but such techniques have not been very effective. In this work, a framework for fuzzy plagiarism detection is proposed using a context-aware agglomerative clustering approach with improved time complexity. The work aims at retrieving key concepts at the word, sentence and paragraph levels by integrating semantic features into a novel optimisation function to detect plagiarism effectively. The notion of fuzzy clustering is applied to improve the robustness and consistency of results when clustering multi-disciplinary papers. The experimental analysis is supported by comparison with other contemporary techniques, which indicates the superiority of the proposed approach for plagiarism detection.
Keywords: fuzzy clustering; context similarity; plagiarism detection; spanning tree; agglomerative clustering; validity index; constrained objective function.