International Journal of Data Mining, Modelling and Management (18 papers in press)
Analysis of a Performability Model for the BRT System
by Renata Dantas, Jamilson Dantas, Gabriel Alves, Paulo Maciel
Abstract: Large cities face increasing mobility problems due to the large number of vehicles on the streets, which results in traffic jams and a waste of time and resources. One alternative for improving traffic is to prioritise the public transportation system. Several metropolises around the world are adopting Bus Rapid Transit (BRT) systems, since they present compelling results from a cost-benefit perspective. Evaluating metrics such as performance, reliability, and performability aids in the planning, monitoring, and optimisation of BRT systems. This paper presents hierarchical models, based on CTMC modelling techniques, to assess metrics such as performance and performability. The results show that these models identify the peak intervals during which a vehicle is more likely to reach its destination in a shorter time, in addition to giving the probability of a vehicle being affected by a failure in each interval. It was also possible to establish a basis for replicating the model in different scenarios, enabling new comparative studies.
Keywords: Bus Rapid Transit (BRT); CTMC; Performability analysis.
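As a rough illustration of the kind of computation behind a CTMC performability model, the sketch below solves the steady-state distribution of a toy three-state BRT vehicle chain (operational, degraded, failed). The states and all transition rates are hypothetical illustrations, not values from the paper.

```python
# Minimal CTMC steady-state sketch for a toy BRT availability model.
# States: 0 = operational, 1 = degraded, 2 = failed.
# All transition rates (per hour) are hypothetical, not from the paper.

def steady_state(Q):
    """Solve pi @ Q = 0 with sum(pi) = 1 by Gaussian elimination.

    Q is an n x n generator matrix (each row sums to zero).
    """
    n = len(Q)
    # Transpose Q so each row is one balance equation; replace the last
    # (linearly dependent) equation with the normalisation sum(pi) = 1.
    A = [[Q[i][j] for i in range(n)] for j in range(n)]
    A[n - 1] = [1.0] * n
    b = [0.0] * (n - 1) + [1.0]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(A[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (b[r] - s) / A[r][r]
    return x

# Hypothetical generator: degradation 0.1/h, failure 0.02/h (0.05/h when
# degraded), recovery from degraded 1.0/h, repair 0.5/h.
Q = [[-0.12, 0.10, 0.02],
     [ 1.00, -1.05, 0.05],
     [ 0.50, 0.00, -0.50]]
pi = steady_state(Q)
availability = pi[0] + pi[1]  # probability the vehicle is not failed
```

A performability measure would then weight each state's probability by a reward (e.g. passenger throughput per state) in the same way.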
Can Market Indicators Forecast Port Throughput?
by Aylin Çalışkan, Burcu Karaöz
Abstract: The main aim of this study is to forecast the likelihood of port throughput increasing or decreasing from month to month, using selected market indicators as input variables. A further aim is to determine whether Artificial Neural Network (ANN) and Support Vector Machine (SVM) algorithms are capable of accurately predicting the movement of port throughput. To this end, Turkish ports were chosen as the research environment. The monthly average exchange rates of the US dollar, the euro, and gold (against the Turkish lira), together with crude oil prices, were used as market indicators in the prediction models. The experimental results reveal that the model with these market indicators successfully forecasts the direction of movement of port throughput, with an accuracy of 90.9% for ANN and 84.6% for SVM. The model developed in this research may help managers develop short-term logistics plans for operational processes and may help researchers adapt the model to other research areas.
Keywords: port throughput; predicting; forecasting in shipping; ANN; SVM.
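The direction-of-movement task the abstract describes is a binary classification over indicator changes. The sketch below stands in a tiny from-scratch logistic regression for the paper's ANN/SVM models; the monthly indicator values and up/down labels are synthetic illustrations, not the Turkish-port data.

```python
# Direction-of-movement sketch: classify whether throughput rises (1) or
# falls (0) next month from market-indicator changes. Synthetic data; a
# minimal logistic regression stands in for the paper's ANN/SVM models.

import math

def train_logistic(X, y, lr=0.5, epochs=500):
    """Batch gradient descent on logistic loss (no intercept, for brevity)."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted P(up)
            for j, xj in enumerate(xi):
                grad[j] += (p - yi) * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

def predict(w, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) > 0 else 0

# Hypothetical monthly changes: [USD rate, gold price, crude oil price];
# label 1 = throughput up next month, 0 = down.
X = [[0.02, -0.01, 0.03], [0.01, 0.00, 0.02], [-0.02, 0.02, -0.03],
     [-0.01, 0.01, -0.02], [0.03, -0.02, 0.04], [-0.03, 0.03, -0.01]]
y = [1, 1, 0, 0, 1, 0]
w = train_logistic(X, y)
preds = [predict(w, xi) for xi in X]
directional_accuracy = sum(p == t for p, t in zip(preds, y)) / len(y)
```

The study's 90.9% / 84.6% figures correspond to this `directional_accuracy` measure computed on held-out months rather than on the training set.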
Deciphering Published Articles in Cyber Terrorism: A Latent Dirichlet Allocation Algorithm Application
by Las Johansen Caluza
Abstract: Cyber terrorism is an emerging and potentially fatal problem causing disturbance in cyberspace. To unravel the underlying issues in cyber terrorism, it is imperative to look into the documents available in NATO's repository. Articles were extracted using web-mining techniques, and topic modelling was performed using natural language processing (NLP). This study employed the Latent Dirichlet Allocation (LDA) algorithm, an unsupervised machine learning method, to generate latent themes from the text corpus. Five underlying themes were identified from the results. Finally, the analysis revealed a profound understanding of cyber terrorism as a pragmatic menace in cyberspace, manifested through the worldwide spread of black propaganda, recruitment, computer and network hacking, economic sabotage, and other activities. In response, countries around the world, including NATO and its allies, have continuously improved their capabilities against cyber terrorism.
Keywords: Topic modeling; LDA; cyber terrorism; unsupervised machine learning; NLP; web mining; sequential exploratory design; Gibbs sampling; cyberspace.
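The LDA-with-Gibbs-sampling pipeline named in the keywords can be sketched compactly. The collapsed Gibbs sampler below is a standard textbook formulation; the toy "documents" and hyperparameter values are illustrative assumptions, not the study's NATO corpus.

```python
# Compact collapsed Gibbs sampler for LDA over tokenised documents.
# Toy corpus and hyperparameters are illustrative, not the paper's data.

import random

def lda_gibbs(docs, K, iters=200, alpha=0.1, beta=0.01, seed=0):
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    wid = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    ndk = [[0] * K for _ in docs]      # doc-topic counts
    nkw = [[0] * V for _ in range(K)]  # topic-word counts
    nk = [0] * K                       # topic totals
    z = []                             # topic assignment per token
    for d, doc in enumerate(docs):     # random initialisation
        zs = []
        for w in doc:
            t = rng.randrange(K)
            zs.append(t)
            ndk[d][t] += 1; nkw[t][wid[w]] += 1; nk[t] += 1
        z.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]            # remove token's current assignment
                ndk[d][t] -= 1; nkw[t][wid[w]] -= 1; nk[t] -= 1
                # full conditional p(z = t | all other assignments)
                ps = [(ndk[d][t2] + alpha) * (nkw[t2][wid[w]] + beta)
                      / (nk[t2] + V * beta) for t2 in range(K)]
                r = rng.random() * sum(ps)
                acc = 0.0
                for t in range(K):     # sample from the conditional
                    acc += ps[t]
                    if r <= acc:
                        break
                z[d][i] = t
                ndk[d][t] += 1; nkw[t][wid[w]] += 1; nk[t] += 1
    return z, ndk, nkw, vocab

docs = [["cyber", "attack", "network", "hacking"],
        ["propaganda", "recruitment", "media", "propaganda"],
        ["network", "attack", "cyber", "security"],
        ["media", "recruitment", "propaganda", "online"]]
z, ndk, nkw, vocab = lda_gibbs(docs, K=2)
```

After sampling, the rows of `nkw` (topic-word counts) are inspected to label the latent themes, which is how the study's five themes would be read off.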
An Innovative and Efficient Method for Twitter Sentiment Analysis
by Hima Suresh
Abstract: Sentiment analysis is one of the most active fields of research in data mining. Specifically, sentiment analysis centres on analysing attitudes and opinions relating to a particular topic of interest, using machine learning approaches, lexicon-based approaches, or hybrid approaches. There is strong motivation to develop automated systems that can identify and classify the sentiment expressed in text. An efficient approach to predicting sentiment would allow us to extract opinions from web content and to predict online public choices, which could prove valuable for tracking changes in the sentiment of Twitter users. This paper presents a proposed model for analysing brand impact using real data gathered from the microblog Twitter over a period of 14 months, and also reviews existing methods and approaches in sentiment analysis. Twitter-based information gathering enables collecting direct responses from the target audience and provides valuable insight into public sentiment when predicting opinion about a particular product. The experimental results show that the proposed method for Twitter sentiment analysis outperforms the compared approaches, with an accuracy of 86.8%.
Keywords: Sentiment Analysis; Machine Learning Approach; Lexicon Based Approach; Supervised Learning.
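Of the approach families the abstract lists, the lexicon-based one is the simplest to sketch: sum term polarities, flipping a term's sign after a negation word. The tiny lexicon and example tweets below are illustrative assumptions, not the paper's 14-month corpus or method.

```python
# Minimal lexicon-based sentiment scoring sketch. The lexicon, negation
# list, and example texts are illustrative assumptions.

LEXICON = {"great": 1, "love": 1, "good": 1,
           "bad": -1, "awful": -1, "hate": -1}
NEGATIONS = {"not", "never", "no"}

def score_text(text):
    """Sum lexicon polarities, flipping a term preceded by a negation."""
    tokens = text.lower().split()
    score = 0
    for i, tok in enumerate(tokens):
        if tok in LEXICON:
            flip = -1 if i > 0 and tokens[i - 1] in NEGATIONS else 1
            score += flip * LEXICON[tok]
    return score

def classify(text):
    s = score_text(text)
    return "positive" if s > 0 else "negative" if s < 0 else "neutral"
```

A machine learning approach would instead learn the term weights from labelled tweets, and a hybrid approach combines both, which is the design space the paper's review covers.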
A new development of an adaptive Xbar-R control chart under a fuzzy environment
by H. Sabahno, S.M. Mousavi, A. Amiri
Abstract: Adaptive control charts have been shown to perform better than classical control charts, because some or all of their parameters adapt to previous process information. Fuzzy classical control charts have been considered by several researchers over the last two decades; however, fuzzy adaptive control charts have not been investigated. In this paper, we introduce a new adaptive fuzzy control chart that allows all of the chart's parameters to adapt to the process state observed in the previous sample. The warning limits are also redefined for the fuzzy environment. We use the fuzzy-mode defuzzification technique to design the decision procedure of the proposed fuzzy adaptive control chart. Finally, an illustrative example presents the application of the proposed control chart.
Keywords: Xbar-R control charts; Adaptive control charts; Fuzzy uncertainty; Trapezoidal fuzzy numbers.
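For orientation, the crisp (non-fuzzy, non-adaptive) Xbar-R limits the paper builds on look like this; in the paper's setting the sample statistics become trapezoidal fuzzy numbers and the parameters adapt between samples, but the limit formulas keep this classical shape. The subgroup measurements below are hypothetical; the constants are the standard Shewhart values for subgroups of size 5.

```python
# Crisp Xbar-R control limits as a baseline sketch. Subgroup data are
# hypothetical; A2, D3, D4 are the standard constants for n = 5.

A2, D3, D4 = 0.577, 0.0, 2.114

def xbar_r_limits(subgroups):
    xbars = [sum(s) / len(s) for s in subgroups]
    ranges = [max(s) - min(s) for s in subgroups]
    xbb = sum(xbars) / len(xbars)    # grand mean (X-double-bar)
    rbar = sum(ranges) / len(ranges) # average range (R-bar)
    return {"xbar": (xbb - A2 * rbar, xbb, xbb + A2 * rbar),  # (LCL, CL, UCL)
            "r": (D3 * rbar, rbar, D4 * rbar)}

subgroups = [[10.1, 9.9, 10.0, 10.2, 9.8],
             [10.0, 10.1, 9.9, 10.0, 10.1],
             [9.8, 10.2, 10.0, 9.9, 10.1]]
limits = xbar_r_limits(subgroups)
```

The adaptive variant would additionally insert warning limits inside the control limits and tighten or relax the sample size, sampling interval, and limit coefficients depending on which zone the previous sample fell into.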
Human Activity Recognition based on Interaction Modelling
by Subetha T, Chitrakala S
Abstract: Human activity recognition aims at automatically recognising and interpreting human activities from videos. Among such activities, identifying interactions between humans with minimal computation time and a low misclassification rate is a difficult task. Hence, this paper proposes an interaction-based human activity recognition system that uses silhouette features to identify and classify interactions between humans. The main issues affecting performance at the various stages of activity recognition are sudden illumination changes, detection of static humans, loss of spatio-temporal features during silhouette extraction, data discrimination, data variance, crowding, and computational complexity. To address these issues, three new algorithms are proposed: a weight-based updating Gaussian Mixture Model (wu-GMM), Spatial Dissemination-based Contour Silhouettes (SDCS), and Weighted Constrained Dynamic Time Warping (WCDTW). Experiments are conducted on benchmark datasets, namely the Gaming dataset and the Kinect Interaction dataset. The results demonstrate that the proposed system recognises interaction-based human activities with a lower misclassification rate and less processing time than the existing motion-pose geometric descriptor (MPGD) representation for activities such as right punch, left punch, and defence. The proposed human activity recognition system finds applications in sports event analysis, video surveillance, content-based video retrieval, robotics, and other areas.
Keywords: Human Activity Recognition; weight-based updating Gaussian Mixture Model; Spatial Dissemination-based Contour Silhouettes; Weighted Constrained Dynamic Time Warping; Dynamic Time Warping; reduced variance-t Stochastic Neighbor Embedding.
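The WCDTW component can be sketched as dynamic time warping with per-step weights and a Sakoe-Chiba band constraint. The weights, band width, and the one-dimensional toy "silhouette feature" tracks below are illustrative assumptions; the paper's matcher operates on richer silhouette features.

```python
# Band-constrained, step-weighted DTW sketch in the spirit of WCDTW.
# Weights, band width and the toy feature tracks are illustrative.

def wcdtw(a, b, band=2, w_match=1.0, w_ins=1.0, w_del=1.0):
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        # Sakoe-Chiba band: only align frames within `band` of the diagonal
        for j in range(max(1, i - band), min(m, i + band) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = min(w_match * cost + D[i - 1][j - 1],  # match step
                          w_ins * cost + D[i][j - 1],        # insertion
                          w_del * cost + D[i - 1][j])        # deletion
    return D[n][m]

# Toy 1-D feature tracks for three interaction clips.
punch_a = [0, 1, 3, 5, 3, 1, 0]
punch_b = [0, 1, 2, 5, 4, 1, 0]
defense = [0, 0, 1, 1, 1, 0, 0]
```

Classification then assigns a test clip the label of its lowest-distance template, so the two punch tracks should score closer to each other than to the defence track.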
Using implicitly and explicitly rated online customer reviews to build opinionated Arabic lexicons
by Mohammad Daoud
Abstract: Creating an opinionated lexicon is an important step towards a reliable social media analysis system. In this article, we propose an approach, and describe an experiment, to build an Arabic polarised lexical database by analysing implicitly and explicitly rated online customer reviews. These reviews are written in Modern Standard Arabic and in the Palestinian/Jordanian dialect. The produced lexicon therefore comprises casual slang and dialectal entries used by the online community, which is useful for sentiment analysis of informal social media microblogs. We extracted 28,000 entries by processing 15,100 reviews and by expanding the initial lexicon through Google Translate. We calculated an implicit rating for every review, derived from its text, to address the problem of ambiguous opinions in certain online posts, where the text of the review does not match the given (explicit) rating. Each entry was given a polarity tag and a confidence score. High confidence scores increased the precision of the polarisation process, while explicit ratings increased the coverage and confidence of the polarity assignments.
Keywords: polarized lexicon; social media analysis; opinion mining; term extraction; sentiment analysis.
Sample Selection Algorithms for Credit Risk Modelling through Data Mining Techniques
by Eftychios Protopapadakis, Dimitrios Niklis, Michalis Doumpos, Anastasios Doulamis, Constantin Zopounidis
Abstract: Credit risk assessment is a very challenging and important problem in the domain of financial risk management. The development of reliable credit rating/scoring models is of paramount importance in this area. There are different algorithms and approaches for constructing such models to classify credit applicants (firms or individuals) into risk classes. Reliable sample selection is crucial for this task. The aim of this paper is to examine the effectiveness of sample selection schemes in combination with different classifiers for constructing reliable default prediction models. We consider different algorithms to select representative cases and handle class imbalances. Empirical results are reported for a data set of Greek companies from the commercial sector.
Keywords: Credit risk modeling; Data mining; Sampling; Classification.
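One of the simpler schemes in the family the paper evaluates is random undersampling of the majority class, sketched below. The toy "firms" and the 4:1 imbalance are illustrative assumptions, not the Greek commercial-sector data set, and the paper compares several more sophisticated selection algorithms alongside this kind of baseline.

```python
# Random-undersampling sketch for class rebalancing. Toy data only.

import random

def undersample(X, y, seed=0):
    """Drop majority-class cases at random until classes are balanced."""
    rng = random.Random(seed)
    classes = sorted(set(y))
    idx = {c: [i for i, yi in enumerate(y) if yi == c] for c in classes}
    n_min = min(len(v) for v in idx.values())
    keep = []
    for c in classes:
        chosen = idx[c] if len(idx[c]) == n_min else rng.sample(idx[c], n_min)
        keep.extend(chosen)
    keep.sort()  # preserve original ordering of the retained cases
    return [X[i] for i in keep], [y[i] for i in keep]

# 8 non-defaulting firms (label 0) vs 2 defaulting firms (label 1).
X = [[i, i % 3] for i in range(10)]
y = [0] * 8 + [1] * 2
Xb, yb = undersample(X, y)
```

A default-prediction classifier is then trained on the balanced `(Xb, yb)` sample rather than the raw imbalanced data.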
A Flexible Architecture for the Pre-Processing of Solar Satellite Image Time Series Data: The SETL Architecture
by Carlos Roberto Silveira Junior, Marilde Terezinha Prado Santos, Marcela Xavier Ribeiro
Abstract: Satellite Image Time Series (SITS) is a challenging domain for Knowledge Discovery in Databases due to its characteristics: each image contains several sunspots, and each sunspot is associated with sensor data comprising the radiation level and the sunspot classification. Each image also carries time parameters and sunspot coordinates, i.e., spatiotemporal data. Several challenges of the SITS domain arise during the Extract, Transform and Load (ETL) process. In this paper, we propose an architecture called SITS ETL (SETL) that extracts the visual characteristics of each sunspot and associates them with the sunspot's sensor data, taking the spatiotemporal relations into account. SETL brings flexibility and extensibility to work with challenging domains such as SITS because it integrates textual, visual, and spatiotemporal characteristics at the sunspot-record level. Furthermore, we obtained acceptable performance results according to a domain expert and broadened the range of applicable data mining algorithms compared to the state of the art.
Keywords: Satellite Image Time Series; Spatiotemporal Extract, Transform and Load process; Temporal Series of Solar Image processing.
Fast Parallel PageRank Technique for Detecting Spam
by Nilay Khare, Hema Dubey
Abstract: Brin and Larry proposed PageRank in 1998, which appears as a
prevailing link analysis technique used by web search engines to rank its search
results list. Computation of PageRank values in an efficient and faster manner
for very immense web graph is truly an essential concern for search engines
today. To identify the spam web pages and also deal with them is yet another
important concern in web browsing. In this research article, an efficient and faster
parallel PageRank algorithm is proposed, which harnesses the power of graphics
processing units (GPUs). In proposed algorithm, the PageRank scores are nonuniformly
distributes among the web pages, so it is also competent of coping with
spam web pages. The experiments are performed on standard datasets available
in Stanford Large Network Dataset Collection. There is a speed up of about 1.1
to 1.7 for proposed parallel PageRank algorithm over existing parallel PageRank
Keywords: GPU; CUDA; Parallel PageRank Technique; Spam Web Pages.
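As a serial reference point for the GPU algorithm the abstract describes, the power-iteration PageRank below uses the usual textbook formulation (damping factor 0.85, uniform teleport, dangling mass spread uniformly). The tiny web graph is an illustrative assumption, not one of the Stanford datasets.

```python
# Serial power-iteration PageRank sketch; toy graph, textbook parameters.

def pagerank(links, d=0.85, tol=1e-10, max_iter=1000):
    """links: dict node -> list of outgoing neighbours."""
    nodes = sorted(links)
    n = len(nodes)
    pr = {u: 1.0 / n for u in nodes}
    for _ in range(max_iter):
        # rank mass of dangling nodes (no out-links) is spread uniformly
        dangling = sum(pr[u] for u in nodes if not links[u])
        new = {u: (1 - d) / n + d * dangling / n for u in nodes}
        for u in nodes:
            for v in links[u]:
                new[v] += d * pr[u] / len(links[u])  # share u's rank
        if sum(abs(new[u] - pr[u]) for u in nodes) < tol:
            return new
        pr = new
    return pr

web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
pr = pagerank(web)
```

The GPU version parallelises the inner rank-distribution loop over edges or vertices; the paper's spam handling additionally replaces the uniform teleport vector with a non-uniform one that down-weights suspect pages.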
Prototype-based classification and error analysis under bootstrapping strategy
by Doosung Hwang, Youngju Son
Abstract: A prototype-based classification is proposed that selects small sets of class data for rule learning and prediction. A class point is considered a prototype if it forms a hypersphere that represents part of a class region, as measured by a distance metric and the class label. The prototype selection algorithm, formulated as a set covering optimisation, selects as few within-class points as possible while preserving the class covering regions for the unknown data distribution. The upper bound of the error is analysed to compare the effectiveness of the prototype-based classification with the Bayes classifier. Under a bootstrapping strategy and the 0/1 loss, the bias and variance components are derived from the generalisation error without assuming the unknown distribution of a given problem. This analysis provides a way to evaluate prototype-based models and to select the optimal model estimate for any standard classifier. The experiments show that the proposed approach is very competitive compared to the nearest neighbour and Bayes classifiers, and efficient in choosing prototypes in terms of class covering regions, data size, and computation time.
Keywords: class prototype; set covering optimisation; greedy method; nearest neighbour; error analysis.
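The set-covering-with-greedy-selection idea can be sketched in one dimension: each candidate prototype covers the class points inside its ball, and the greedy method repeatedly picks the candidate covering the most still-uncovered points. The 1-D feature values and the fixed radius below are illustrative assumptions; the paper sizes each hypersphere from the data and nearest other-class points.

```python
# Greedy set-cover prototype selection sketch (1-D, fixed radius).

def covered(points, center, radius):
    """Indices of points inside the ball around `center`."""
    return {i for i, p in enumerate(points) if abs(p - center) <= radius}

def greedy_prototypes(points, radius):
    """Pick centers greedily until every class point is covered."""
    uncovered = set(range(len(points)))
    protos = []
    while uncovered:  # each point covers itself, so this terminates
        best = max(range(len(points)),
                   key=lambda c: len(covered(points, points[c], radius)
                                     & uncovered))
        gain = covered(points, points[best], radius) & uncovered
        protos.append(points[best])
        uncovered -= gain
    return protos

# One class's 1-D feature values; radius chosen by hand for illustration.
cls = [0.0, 0.2, 0.4, 3.0, 3.1, 6.0]
protos = greedy_prototypes(cls, radius=0.5)
```

Prediction then works like nearest-neighbour classification, but against the few selected prototypes instead of the full training set.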
A new method for behavioural-based malware detection using reinforcement learning
by Sepideh Mohammadkhani, Mansour Esmaeilpour
Abstract: Malware, an abbreviation of malicious software, is a comprehensive term for software that is deliberately created to perform an unauthorised and often harmful function. Viruses, backdoors, keyloggers, Trojans, password-stealing software, spyware, and adware are examples of malware. Previously, calling something a virus or a Trojan was sufficient; however, as methods of infection developed, the terms virus and the other existing malware definitions no longer covered all types of malicious programs. This research focuses on clustering malware according to its features. To counter the dangers of malware, applications have been created to track it down. This paper presents a new method for malware detection using reinforcement learning. The results demonstrate that the proposed method detects malware more accurately.
Keywords: antivirus; AVS; malware; reinforcement learning.
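To make the reinforcement learning framing concrete, the toy sketch below trains a Q-learning agent to decide whether to allow or flag a sample given a coarse behaviour state. The two-state environment, the reward scheme, and all hyperparameters are illustrative assumptions, not the paper's feature set or method.

```python
# Toy Q-learning sketch: learn an allow/flag policy over behaviour states.
# Environment, rewards and hyperparameters are illustrative assumptions.

import random

ACTIONS = ["allow", "flag"]

def reward(state, action):
    """+1 for allowing benign (state 0) or flagging malicious (state 1)."""
    correct = "allow" if state == 0 else "flag"
    return 1.0 if action == correct else -1.0

def train(episodes=300, lr=0.2, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in (0, 1) for a in ACTIONS}
    for t in range(episodes):
        s = t % 2  # alternate benign / malicious samples
        if rng.random() < epsilon:           # explore
            a = rng.choice(ACTIONS)
        else:                                # exploit
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        r = reward(s, a)
        # one-step (terminal) Q-learning update
        Q[(s, a)] += lr * (r - Q[(s, a)])
    return Q

Q = train()
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in (0, 1)}
```

In a realistic setting the states would be derived from observed behavioural features (API calls, file and network activity) and the reward from analyst or sandbox feedback.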
Efficient spatial query processing for KNN queries using well organised net-grid partition indexing approach
by K. Geetha, A. Kannan
Abstract: In recent years, many applications use mobile devices with geographical positioning system support to provide location-based services. However, queries sent through mobile devices to obtain such services take a long time to process due to the size of the spatial data. To solve this problem, an efficient indexing method for providing effective query processing services in mobile computing environments is proposed. This indexing method increases the efficiency of query retrieval in mobile network environments. Since existing mobile network applications access spatial objects node by node when processing a query, query retrieval has become the main bottleneck in spatial databases, consuming considerable processing time. Experimental results using the proposed net-grid-based partition index approach show that the proposed model provides fast retrieval with high accuracy when processing spatial queries.
Keywords: cache mechanism; KNN queries; location-based services; LBS; mobile environments; partition index; query processing; spatial data management; spatial networks; spatial query; wireless data broadcast.
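The grid-partition indexing idea behind the approach can be sketched as follows: points are bucketed into fixed-size cells, and a KNN query expands outward from the query's cell ring by ring until no unvisited cell can hold a closer point. The cell size, the toy points, and k are illustrative assumptions; the paper's index adds network-aware partitioning and caching on top of this basic scheme.

```python
# Grid-partition KNN sketch: bucket points into cells, expand rings.

import math
from collections import defaultdict

class GridIndex:
    def __init__(self, points, cell=1.0):
        self.cell = cell
        self.points = points
        self.grid = defaultdict(list)
        for i, (x, y) in enumerate(points):
            self.grid[(int(x // cell), int(y // cell))].append(i)

    def knn(self, q, k):
        if k > len(self.points):
            raise ValueError("k exceeds number of indexed points")
        qx, qy = int(q[0] // self.cell), int(q[1] // self.cell)
        cand, r = [], 0
        while True:
            # gather candidates from the ring of cells at Chebyshev radius r
            for cx in range(qx - r, qx + r + 1):
                for cy in range(qy - r, qy + r + 1):
                    if max(abs(cx - qx), abs(cy - qy)) == r:
                        cand.extend(self.grid.get((cx, cy), []))
            if len(cand) >= k:
                cand.sort(key=lambda i: math.dist(q, self.points[i]))
                kth = math.dist(q, self.points[cand[k - 1]])
                # stop once no unvisited cell can hold a closer point:
                # any point in ring r+1 is at least r * cell away from q
                if r * self.cell >= kth:
                    return cand[:k]
            r += 1

pts = [(0.2, 0.3), (1.5, 1.4), (2.6, 0.1), (0.4, 2.8), (1.1, 0.9)]
idx = GridIndex(pts, cell=1.0)
nearest = idx.knn((1.0, 1.0), k=2)
```

The pay-off is that a query touches only the handful of cells near it instead of scanning every spatial object, which is the node-by-node cost the abstract identifies as the bottleneck.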
Tree-based text stream clustering with application to spam mail classification
by Phimphaka Taninpong, Sudsanguan Ngamsuriyaroj
Abstract: This paper proposes a new text clustering algorithm based on a tree structure. The main idea of the clustering algorithm is that a sub-tree at a specific node represents a document cluster. Our clustering algorithm is a single-pass scanning algorithm which traverses down the tree to search for all clusters without having to predefine the number of clusters. Thus, it meets our objectives of producing document clusters with high cohesion while keeping the number of clusters to a minimum. Moreover, an incremental learning process is performed after a new document is inserted into the tree, and the clusters are rebuilt to accommodate the new information. In addition, we applied the proposed clustering algorithm to spam mail classification, and the experimental results show that the tree-based text clustering spam filter gives higher accuracy and specificity than cobweb clustering, naïve Bayes, and KNN.
Keywords: clustering; data mining; text clustering; text mining; text stream; tree-based clustering; spam; spam classification; text classification.
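The single-pass, no-predefined-cluster-count idea can be sketched without the tree index: each incoming document joins the most similar existing cluster or opens a new one. The similarity threshold and the toy mail snippets below are illustrative assumptions, and the paper's tree traversal and incremental rebuilds are not reproduced here.

```python
# Single-pass text-stream clustering sketch (flat centroids standing in
# for the paper's tree index). Threshold and toy mails are illustrative.

import math
from collections import Counter

def cosine(a, b):
    num = sum(a[t] * b[t] for t in a if t in b)
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def stream_cluster(docs, threshold=0.3):
    clusters = []  # each cluster: {"centroid": Counter, "members": [idx]}
    for i, doc in enumerate(docs):
        vec = Counter(doc.lower().split())
        best, best_sim = None, 0.0
        for c in clusters:
            sim = cosine(vec, c["centroid"])
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None and best_sim >= threshold:
            best["centroid"].update(vec)  # fold the new doc into the cluster
            best["members"].append(i)
        else:
            clusters.append({"centroid": vec, "members": [i]})
    return clusters

mails = ["win free prize now", "free prize claim now",
         "meeting agenda for monday", "monday meeting notes"]
clusters = stream_cluster(mails)
```

For spam filtering, each resulting cluster is labelled spam or ham from its members, and a new mail is classified by the cluster it falls into.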
Special Issue on: Big Data Engineering: Recent Advances in Intelligent Methods, Methodologies and Techniques
Allegories for Database Modeling
by Bartosz Zieliński, Paweł Masłanka, Ścibor Sobieski
A Grammar-based Approach for XML Schema Extraction and Heterogeneous Document Integration
by Prudhvi Janga, Karen C. Davis
Towards a Comparative Evaluation of Text-Based Specification Formalisms and Diagrammatic Notations
by Kobamelo Moremedi, John Andrew Van Der Poll
Effective and Efficient Distributed Management of Big Clinical Data: A Framework
by Alfredo Cuzzocrea, Giorgio Mario Grasso, Massimiliano Nolich