International Journal of Business Intelligence and Data Mining (67 papers in press)
An Effective Preprocessing Algorithm for Model Building in Collaborative Filtering based Recommender System
by Srikanth T, M. Shashi
Abstract: Recommender systems suggest interesting items to online users based on the ratings they have given to other items, which are maintained globally as a rating matrix. The rating matrix is often sparse and very large because a large number of users rate only a few items among the many alternatives. Sparsity and scalability are the main obstacles to accurate predictions in recommender systems. This paper focuses on a model-building approach to collaborative filtering-based recommender systems that uses low-rank matrix approximation algorithms to achieve scalability and accuracy while dealing with sparse rating matrices. A novel preprocessing methodology is proposed to counter the data sparsity problem by making the sparse rating matrix denser before extracting latent factors that characterise the users and items in a low-dimensional space. The quality of predictions made either directly or indirectly through user clustering was investigated and found to be competitive with existing collaborative filtering methods in terms of reduced MAE and increased NDCG values on benchmark datasets.
Keywords: Recommender System; Collaborative Filtering; Dimensionality Reduction; Pre-Processing; Sparsity; Scalability; Matrix Factorization.
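The two evaluation measures named in the abstract above, MAE and NDCG, can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names are hypothetical, and the linear-gain, log2-discount form of DCG is an assumption (other gain formulations exist).

```python
import math


def mae(actual, predicted):
    """Mean absolute error between true and predicted ratings."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def dcg(relevances):
    """Discounted cumulative gain with linear gain and log2 discount (assumed form)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(relevances_in_predicted_order):
    """DCG of the predicted ranking divided by the DCG of the ideal ranking."""
    ideal = dcg(sorted(relevances_in_predicted_order, reverse=True))
    if ideal == 0:
        return 0.0  # no relevant items, so the ranking cannot be scored
    return dcg(relevances_in_predicted_order) / ideal
```

A lower MAE means the predicted ratings are closer to the true ones, while an NDCG closer to 1 means the recommended ordering is closer to the ideal ordering by true rating.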
AGS: A Precise and Efficient AI Based Hybrid Software Effort Estimation Model
by Vignaraj Vikraman, S. Srinivasan
Abstract: Predicting the amount of effort needed to develop software is a tedious process for software companies. Hence, predicting software development effort remains a complex issue that attracts extensive research. The success of a software development process depends considerably on proper estimation of the effort required to develop that software. Effective software effort estimation techniques enable project managers to schedule software life cycle activities properly. The main objective of this paper is to propose a novel approach in which an artificial intelligence (AI)-based technique, called the AGS algorithm, is used for software effort estimation. AGS is a hybrid method combining three techniques, namely: adaptive neuro fuzzy inference system (ANFIS), genetic algorithm and satin bower bird optimisation (SBO) algorithm. The performance of the proposed method is assessed using a standard real-time benchmark dataset with many attributes. The major metrics used in the performance evaluation are correlation coefficient (CC), kilo lines of code (KLoC) and complexity of the software. The experimental results show that the prediction accuracy of the proposed model is better than that of the existing algorithmic models.
Keywords: Software Effort Estimation; AI; ANFIS; Lines of code (LoC); Genetic Algorithm (GA); Satin Bower Bird Optimiser (SBO); Correlation Co-efficient (CC); Kilo Lines of Code (KLoC); Software Complexity.
Using bagging to enhance clustering procedures for planar shapes
by Elaine Cristina De Assis, Renata Souza, Getulio José Amorim Do Amaral
Abstract: Partitional clustering algorithms find a partition maximizing or minimizing some numerical criterion. Statistical shape analysis is used to make decisions observing the shape of objects. The shape of an object is the remaining information when the effects of location, scale and rotation are removed. This paper introduces clustering algorithms suitable for planar shapes. Four numerical criteria are adapted to each algorithm. In order to escape from local optima and reach a better clustering, these algorithms are performed in the framework of Bagging procedures. Simulation studies are carried out to validate the proposed methods, and two real-life data sets are also considered. Clustering quality is assessed by the corrected Rand index. The results show that the proposed algorithms are effective under the different clustering criteria, and that combining the Bagging method with the clustering algorithms provides substantial gains in the quality of the clusters.
Keywords: Statistical Shape Analysis; Partitional Clustering Methods; Bagging Procedure.
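The corrected Rand index used as the quality measure in the abstract above can be sketched as below. This assumes the usual Hubert and Arabie chance-corrected form (often called the adjusted Rand index); the function name is hypothetical and the code is an illustration rather than the authors' implementation.

```python
from collections import Counter
from math import comb


def corrected_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two partitions of the same objects."""
    n = len(labels_a)
    if n < 2:
        return 1.0  # fewer than two objects: no pairs to disagree on
    # Pairs of objects placed together in both partitions.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each partition separately.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (together_both - expected) / (maximum - expected)
```

A value of 1 indicates identical partitions up to relabelling, values near 0 indicate agreement no better than chance, and negative values indicate agreement worse than chance.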
EFFICIENT TEXT DOCUMENT CLUSTERING WITH NEW SIMILARITY MEASURES
by Lakshmi R, S. Baskar
Abstract: In this paper, two new similarity measures, namely distance of term frequency-based similarity measure (DTFSM) and presence of common terms-based similarity measure (PCTSM), are proposed to compute the similarity between two documents for improving the effectiveness of text document clustering. The effectiveness of the proposed similarity measures is evaluated on the Reuters-21578 and WebKB datasets for clustering the documents using K-means and K-means++ clustering algorithms. The results obtained by using the proposed DTFSM and PCTSM are significantly better than those of other measures for document clustering in terms of accuracy, entropy, recall and F-measure. It is evident that the proposed similarity measures not only improve the effectiveness of text document clustering, but also reduce the complexity of similarity measures based on the number of required operations during text document clustering.
Keywords: Document Clustering; Similarity Measures; Accuracy; Entropy; Recall; F-Measure; K-means clustering Algorithm.
Analysis and Prediction of Heart Disease Aid of Various Data Mining Techniques: A Survey
by V. Poornima, D. Gladis
Abstract: In recent times, health diseases are expanding gradually because of inherited. Particularly, heart disease has become more common nowadays, i.e., the lives of individuals are at risk. The data mining strategies specifically decision tree, Na
Keywords: Data mining; Heart Disease Prediction; performance measure; Fuzzy; clustering.
Signal-Flow Graph Analysis and Implementation of Novel Power Tracking Algorithm Using Fuzzy Logic Controller
by S. VENKATESAN, Manimaran Saravanan, Subramanian Venkatnarayanan, Senior Member IEEE
Abstract: This paper discusses the merits of a novel modified perturb and observe (P&O) maximum power point tracker (MPPT) algorithm for a stand-alone solar PV system using an interleaved LUO converter with a fuzzy logic controller (FLC). The merits of the FLC-based system are compared with those of the existing system. Analytical expressions for the proposed converter are derived through a signal flow graph. The proposed interleaved LUO converter-based PV system with fuzzy controller considerably reduces ripple content, and the proposed MPPT algorithm creates less hunting around the maximum power point. Simulations at different illumination levels are carried out using MATLAB/Simulink. The system is also experimentally verified with a typical 40 W solar PV panel. The results confirm the superiority of the proposed system with fuzzy controller.
Keywords: Fuzzy Logic Controller; Interleaved LUO Converter; Maximum Power Point Tracking (MPPT); Modified P&O algorithm; Photovoltaic (PV) system.
SoLoMo Cities: Socio-Spatial City Formation Detection and Evolution Tracking Approach
by Sara Elhishi, Mervat Abu-Elkheir, Ahmed Aboul-Fotouh
Abstract: The tremendous growth of telecommunication devices, coupled with the huge number of social media users, has revealed a new kind of development that is turning our cities into information-rich smart platforms. We analyse the role of LBSN check-ins using social community detection methods to extract city-structured communities, which we call SoLoMo cities, using a modified version of the Louvain algorithm. We then track these communities' evolution patterns through a pairwise consecutive matching process to detect behavioural events that change a city's communities. The findings of the experiments on the Brightkite dataset can be summarised as follows: online users' check-in activities reveal a set of well-formed physical land spaces of the city's communities, and the concentration of online social interactions and the formation of those cities are positively correlated with a percentage of 89%. Finally, we were able to track the evolution of the discovered communities by detecting three community behaviour events: survive, grow and shrink.
Keywords: location-based social networks; LBSN; social; spatial analysis; community detection; evolution; tracking; Brightkite.
Discovery of Rare Association Rules in the Distribution of Lawsuits in the Federal Justice System of Southern Brazil
by Lucia Gruginskie, Guilherme Vaccaro, Leonardo Chiwiakwosky, Attilla Blesz Jr
Abstract: In the context of data mining, infrequent association rules may be beneficial for analysing rare or extreme cases with very low support values and high confidence. In researching risky situations or allocating specific resources, such rules may have a much greater impact than rules with high support value. The objective of this study is to obtain association rules from the database of lawsuits filed in the Federal Court of Southern Brazil in 2016, including both frequent and rare rules. By finding these rules, especially rare ones, the information collected can assist in the decision-making process, in this case, such as training clerks or establishing specialised courts.
Keywords: Association Rules; Rare Rules; Distribution of lawsuits; Brazilian Federal Justice; Data mining.
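The notion of a rare rule in the abstract above (very low support, high confidence) can be illustrated with a short sketch. The function names and the threshold defaults are hypothetical choices for illustration; they are not taken from the study, and no data from the lawsuit database is used.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    items = set(itemset)
    return sum(items <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule is undefined
    return support(transactions, set(antecedent) | set(consequent)) / base


def is_rare_rule(transactions, antecedent, consequent,
                 max_support=0.05, min_confidence=0.8):
    """True when the rule is infrequent but reliable (thresholds are assumed)."""
    rule_support = support(transactions, set(antecedent) | set(consequent))
    return (0 < rule_support <= max_support
            and confidence(transactions, antecedent, consequent) >= min_confidence)
```

A frequent-rule miner would discard such a rule for failing a minimum support threshold, which is why rare rules need an upper bound on support combined with a lower bound on confidence.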
Integral Verification and Validation for Knowledge Discovery Procedure Models
by Anne Antonia Scheidler, Markus Rabe
Abstract: This paper explains why knowledge discovery in databases (KDD) procedure models lack verification and validation (V&V) mechanisms and introduces an approach for integral V&V. Based on a generic model for knowledge discovery, a structure named 'KDD triangle model' is presented. This model has a modular design and can be adapted for other KDD procedure models. This has the benefit of allowing existing projects to improve their quality assurance in knowledge discovery. In this paper, the different phases of the developed triangle model for KDD are discussed. One special focus is on the phase results and related testing mechanisms. This paper also describes possible V&V techniques for the developed integral V&V mechanism to ensure direct applicability of the model.
Keywords: knowledge discovery in databases; data mining; procedure model; verification and validation; quality assurance.
A Multiclass Classification Approach for Incremental Entity Resolution on Short Textual Data
by Denilson Pereira, João A. Silva
Abstract: Several web applications maintain data repositories containing references to thousands of real-world entities originating from multiple sources, and they continually receive new data. Identifying the distinct entities and associating the correct references to each one is a problem known as entity resolution. The challenge is to solve the problem incrementally, as the data arrive, especially when those data are described by a single textual attribute. In this paper, we propose a new approach for incremental entity resolution. The method we have implemented, called AssocIER, uses an ensemble of multiclass classifiers with self-training and detection of novel classes. We have evaluated our method on various real-world datasets and scenarios, comparing it with a traditional entity resolution approach. The results show that AssocIER is effective and efficient at resolving unstructured data in collections with a large number of entities and features, and is able to detect hundreds of novel classes.
Keywords: Entity Resolution; Associative Classification; Incremental Learning; Novel Class Detection; Self-training.
Method for Improvement of Transparency: Use of Text Mining Techniques for Reclassification of Governmental Expenditures Records in Brazil
by Gustavo De Oliveira Almeida, Kate Revoredo, Claudia Cappelli, Cristiano Maciel
Abstract: Many countries have transparency laws requiring availability of data. However, data is often available but not transparent. We present the case of the Transparency Portal of the Brazilian Federal Government and discuss the limitations of public acquisitions data stored in free text format. We employed text-mining techniques to reclassify descriptive texts of measurement units related to products and services. The solution presented in KNIME and JAVA aggregated measurements in the original sample (n = 69,372 with 78% reduction in number of descriptions, 94% of items classified) and in a cross validation sample (n = 105,266 with 88% reduction, classifying 78% of items). In addition, we tested computational time for processing of texts for a wide range of data input sizes, suggesting the stability and scalability of the solution to process larger datasets. Finally, we produced analyses identifying probable input errors, suppliers and purchasing units with abnormal transactions, and factors affecting procurement prices. We present suggestions for future research and improvements.
Keywords: e-government; data mining; open government; text mining; transparency; KNIME; knowledge discovery; techniques; Brazil.
Data Mining in Credit Insurance Information System for Bank Loans Risk Management in Developing Countries
by Fouad J. Al Azzawi
Abstract: Credit risk insurance is a critical task today, since loans are taken by everyone and everywhere and it is quite difficult to accurately estimate the losses incurred when those loans are not repaid. This work proposes an information system module for the banking system that improves risk management by distributing losses on a fair basis while accepting the maximum number of loan requests. By insuring the risk associated with stumbled (non-performing) loans, the bank partially or completely shifts losses under this contract to the insurance company, thus minimising its own losses. The proposed module can determine the price at which the bank can buy such an insurance policy. It could also motivate developing countries to update their current insurance market strategy and outsource part of the state's insurance functions to an independent insurance industry. Data mining techniques and mathematical induction have been used to implement this model successfully. An optimal classification module for predicting risky loan requests has been successfully employed. A new mathematical model has been developed for calculating the cost of an insurance policy in a crisis economy.
Keywords: Data mining; Credit insurance; information systems; Bank loans; risk management; developing countries.
CARs-RP: Lasso Based Class Association Rules Pruning
by AZMI Mohamed, Abdelaziz Berrado
Abstract: Classification based on association rules is attracting more and more interest in research and practice. In many contexts, rules are mined from sparse data in high-dimensional spaces, which leads to a large number of rules with considerable containment and overlap. Pruning is often used in the search for an optimal subset of rules. This paper introduces a method for class association rules (CARs) pruning. It learns weights for a set of CARs by maximising the likelihood function subject to a constraint on the sum of the absolute values of the weights. The pruning strength is controlled by a shrinkage parameter λ. The suggested method allows the user to choose the appropriate subset of CARs. This is achieved through a trade-off between the accuracy and the complexity of the resulting classifier, which is controlled by changing λ. Experimental analysis shows that the introduced method builds more concise classifiers with accuracy comparable to that of other methods.
Keywords: class association rules; pruning; regularization; weighting; associative classification.
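The Lasso-style pruning described in the abstract above can be sketched as an L1-penalised logistic regression over rule indicators, where a rule whose weight is shrunk to exactly zero is pruned. Everything below is an assumption made for illustration: a binary class, no intercept, the penalised (Lagrangian) form of the constrained problem, a plain proximal gradient solver, and hypothetical function names. It is not the authors' optimiser.

```python
import math


def soft_threshold(value, amount):
    """Proximal operator of the L1 penalty: shrink towards zero by `amount`."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0


def fit_l1_logistic(rows, labels, lam, step=0.1, iterations=500):
    """rows[i][j] is 1 if rule j fires on instance i; labels[i] is 0 or 1."""
    n, n_rules = len(rows), len(rows[0])
    weights = [0.0] * n_rules
    for _ in range(iterations):
        gradient = [0.0] * n_rules
        for x, y in zip(rows, labels):
            z = sum(w * xi for w, xi in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))
            for j, xi in enumerate(x):
                gradient[j] += (p - y) * xi / n
        # Gradient step on the mean negative log-likelihood, then L1 shrinkage.
        weights = [soft_threshold(w - step * g, step * lam)
                   for w, g in zip(weights, gradient)]
    return weights


def surviving_rules(weights):
    """Indices of rules whose weight was not shrunk to exactly zero."""
    return [j for j, w in enumerate(weights) if w != 0.0]
```

Raising `lam` prunes more rules and yields a smaller classifier, while lowering it keeps more rules; sweeping `lam` traces the accuracy versus complexity trade-off mentioned in the abstract.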
PPM-HC: a Method for Helping Project Portfolio Management Based on Topic Hierarchy Learning
by Ricardo M. Marcacini, Ricardo A. M. Pinto, Flavia Bernardini
Abstract: The projects categorisation is a crucial step in the project portfolio management (PPM). Categorising projects allows the organisation to identify categories with a lack or excess of projects, according to its strategic objectives. In this work, we present a new method for project portfolio management based on hierarchical clustering (PPM-HC) to organise the projects at several levels of abstraction. In the PPM-HC, similar projects are allocated to the same clusters and subclusters. PPM-HC automatically learns an understandable topic hierarchy from the project portfolio dataset, thereby facilitating the (human) task of exploring, analysing and prioritising the projects of the organisation. We also proposed a card sorting-based technique which allows the evaluation of the projects categorisation using an intuitive visual map. We carried out an experimental evaluation based on a benchmark dataset and we also presented a real-world case study. The results show that the proposed PPM-HC method is promising.
Keywords: Project Portfolio Management; Projects Categorization; Topic Hierarchy Learning; Hierarchical Clustering.
An efficient approach for Defect Detection in Texture analysis using Improved Support Vector Machine
by Manimozhi I., Janakiraman S.
Abstract: Texture defect detection can be defined as the process of determining the location and size of the collection of pixels in a textured image which deviate in their intensity values or spatial arrangement in comparison to a background texture. The detection of abnormalities is a very challenging problem in computer vision. In this work, we have designed a method for detecting defects in pattern texture analysis. Initially, features are extracted from the input image using the gray level co-occurrence matrix (GLCM) and gray level run-length matrix (GLRLM). The extracted features are then fed to the classification stage, where classification is done by an improved support vector machine (ISVM). In the proposed pattern analysis, the traditional support vector machine is improved by means of kernel methods. In the final stage, the classified features are segmented using the modified fuzzy C means algorithm (MFCM).
Keywords: Texture defect detection; preprocessing; Gray Level Co-occurrence matrix; Gray Level Run-Length Matrix; Improved Support Vector Machine; modified fuzzy c means algorithm.
A DYNAMIC REPLICATIVE K-MEANS WITH SELF-COMPILING PARTICLE SWARM INTELLIGENCE FOR DATASET CLASSIFICATION
by A. M. Viswa Bharathy
Abstract: The classification techniques proposed so far are not sufficiently intelligent to classify datasets beyond two-level classification. To multi-classify network datasets, more hybrid algorithms are needed. In this paper we propose a hybrid technique that combines a modified K-means algorithm called dynamic replicative K-means (DRKM) with self-compiling particle swarm intelligence (SCPSI). The dataset we have chosen for the experiment is KDD Cup 99. The DRKM-SCPSI performs better in terms of detection rate (DR), false positive rate (FPR) and accuracy, as is visible from the results presented.
Keywords: anomaly; detection; intrusion; K-Means; PSI.
PORTFOLIO SELECTION WITH SUPPORT VECTOR REGRESSION: MULTIPLE KERNELS COMPARISON
by Pedro Alexandre Henrique, Pedro Albuquerque, Peng Yao Hao, Sarah Sabino
Abstract: This study aimed to verify whether the use of support vector regression (SVR) makes the portfolio's return exceed the market. For this purpose, SVR was applied with 15 different kernel functions to select the best stocks for each quarter, calculating the quarterly portfolio return and the cumulative return over the period. Subsequently, the returns of these portfolios were compared with the returns of a market benchmark. White's (2000) test was applied to avoid the data-snooping effect in assessing the statistical significance of the portfolios developed by the training strategies. The portfolio selected by SVR with inverse multiquadric kernel presented the highest cumulative return of 374.40% and a value at risk (VaR) of 6.87%. The results of this study corroborate the superiority hypothesis of the innovative method of Support Vector Regression in the formation of portfolios, thus constituting a robust predictive method capable of coping with high-dimensionality interactions.
Keywords: Statistical Learning Theory; Optimization Theory; Financial Econometrics; Support Vector Machine; Kernel methods.
Worldwide Gross Revenue Prediction for Bollywood Movies using Hybrid Ensemble Model
by Alina Zaidi, Siddhaling Urolagin
Abstract: Prediction of revenue before a movie is released can be very beneficial for stakeholders and investors in the movie industry. Even though Indian cinema is a booming industry, the literature in the field of movie revenue prediction is more inclined towards non-Indian movies. In this study we built a novel hybrid prediction model to predict worldwide gross for Bollywood movies. A Bollywood movie dataset consisting of 674 movies is prepared by downloading movie-related features from IMDb and YouTube movie trailers. K-means clustering is performed on the movie dataset and two major clusters are identified. Important features specific to the clusters are selected. The proposed hybrid prediction model segregates movies into two clusters and employs a prediction model for each cluster. The prediction models we tested included various basic machine learning models and ensemble models. The ensemble model that combined predictions from support vector regression, neural network and ridge regression gave us the best result for both clusters and we chose it as our final model. We obtain an overall MAE of 0.0272 and R² of 0.80 after 10-fold cross validation.
Keywords: Bollywood; Movie Revenue Prediction; Box office; Regression; Ensemble; Feature Selection; Machine Learning; Scikit-Learn.
Health Data Warehouses: Reviewing Advanced Solutions for Medical Knowledge Discovery
by Norah Alghamdi
Abstract: The implementation of data warehouses and decision support systems that utilise the capabilities of information retrieval and knowledge discovery tools in the healthcare field has allowed for the enhancement of the healthcare offered. In this work, we present a review of recent data warehouses and decision support systems in the healthcare domain with their significance, and applications of evidence-based medicine, electronic health records, and nursing. Given the growing trend in their implementation in healthcare services, research, and education, we present here the most recent publications that employ these tools to produce suitable decisions for patients or health providers. For all the reviewed publications, we have intensively explored their problems, suggested solutions, utilised methods, and findings. We have also highlighted the strengths of the existing approaches and identified potential drawbacks including data correctness, completeness, consistency, and integration to provide proper medical decision-making.
Keywords: Data warehouses; Data Mining; Health Data; Medical Records; Quality; Knowledge Discovery; OLAP.
Survey on-demand: A versatile scientific article automated inquiry method using text mining applied to Asset Liability Management
by Pedro Henrique Albuquerque, Igor Nascimento, Peng Yao Hao
Abstract: We propose a methodology that automatically relates the content of text documents to lexical items. The model estimates whether an article addresses a specific research object based on the relevant words in its abstract and title, using text mining and partial least square discriminant analysis. The model is accurate, and its adjustment and validation indicators are either superior or equal to those of other models in the literature on text classification. In comparison to existing methods, our method offers highly interpretable outcomes and allows flexible measurements of word frequency. The proposed solution may aid scholars in searching for theoretical references by suggesting scientific articles based on similarities in the vocabulary used. Applied to the finance area, our framework has indicated that approximately 10% of the publications in the selected journals address the subject of asset liability management. Moreover, we highlight the journals with the largest number of publications over time and the key words about the subject using only freely accessible information.
Keywords: dimensionality reduction; discriminant analysis; text classification; partial least square; bibliometrics.
Clustering Student Instagram accounts using Author-Topic Model Based
by Nur Rakhmawati, Faiz NF, Irmasari Hafidz, Indra Raditya, Pande Dinatha, Andrianto Suwignyo
Abstract: This study proposes a topic model to cluster a group of high school teenagers' Instagram accounts in Surabaya, Indonesia, using the author-topic model method. We collected 235 valid Instagram accounts (133 female and 102 male students) and gathered a total of 3,346 captions of Instagram posts from 18 senior high schools. Our major finding is the set of topics that define their Instagram posts or captions; there are seven topics, namely: feeling, Surabaya events, photography, artists, vacation, religion and music. Through the process, the lowest perplexity comes from 90 iterations, which suggests six groups of topics. The six topics are concluded based on the lowest perplexity value and labelled according to the words included in the topic. The topic of photography is discussed by six schools. Photography-artists and vacation are discussed by three schools, while feeling, religion and music are discussed by two and one school respectively.
Keywords: Topic Modelling; Senior High School Students; Author-Topic Models.
The approach of using ontology as pre-knowledge source for semi-supervised labelled topic model by applying text dependency graph
by Phu Pham, Phuc Do
Abstract: Discovering multiple topics from text is an important task in text mining. In the past, supervised approaches have failed to explore multiple topics in text. Topic modelling approaches, such as LSI, pLSI and LDA, are considered unsupervised methods that support discovering distributions of multiple topics in text documents. The labelled LDA (LLDA) model is a supervised method which integrates human-labelled topics with the given text corpus during the process of modelling topics. However, in real applications, we may not have enough high-quality knowledge to properly assign topics to all documents before applying LLDA. In this paper, we present two approaches that take advantage of the dependency graph-of-words (GOW) in text analysis. The GOW approach uses a frequent sub-graph mining (FSM) technique to extract graph-based concepts from text. Our first approach uses graph-based concepts to construct a domain-specific ontology; it is called the GC2Onto model. In our second approach, the graph-based concepts are also applied to improve the quality of traditional LLDA; it is called the LLDA-GOW model. We combine the GC2Onto and LLDA-GOW models to improve multiple topic identification as well as other text mining tasks.
Keywords: topic identification; labelled topic modelling; LDA; labelled LDA; ontology-driven topic labelling; dependency graph.
RFID BI Mobility and Producer to Consumer Traceability Architecture
by Andre Claude Bayomock Linwa
Abstract: Radio frequency identifier (RFID) emerged in 2000 as an intelligent remote object identification technology. RFID helps track object position and relevant information using radio frequency technology (Bouet and dos Santos, 2008; Pais, 2010). Its application in industry greatly increases the consistency and accuracy of inventory management by capturing, in real time, observed object attributes for traceability and quality control purposes. In order to provide traceability and quality control services, RFID applications should offer two main services: business intelligence (BI) and mobility management. The RFID BI service provides production traceability (QoS metrics related to manufacturing processes), and the RFID mobility service maintains accurate RFID tag locations. In this paper, a generic RFID BI mobility data model is defined. In the proposed data model, RFID product information generated by a supply chain organisation is translated or migrated from a producer to a consumer. This migration generates two distinct types of RFID mobility: internal (inside buildings) and external.
Keywords: Mobility Management; RFID; Business Intelligence BI; Data Models; Business Processes; QoS; Mobile Networks; GPS; Events; Mobility Subscription.
A comparison of cluster algorithms as applied to unsupervised surveys
by Kathleen C. Garwood, Arpit Dhobale
Abstract: When considering answering important questions with data, unsupervised data offers extensive insight opportunity and unique challenges. This study considers student survey data with a specific goal of clustering students into like groups with underlying concept of identifying different poverty levels. Fuzzy logic is considered during the data cleaning and organising phase helping to create a logical dependent variable for analysis comparison. Using multiple data reduction techniques, the survey was reduced and cleaned. Finally, multiple clustering techniques (k-means, k-modes and hierarchical clustering) are applied and compared. Though each method has strengths, the goal was to identify which was most viable when applied to survey data and specifically when trying to identify the most impoverished students.
Keywords: Fuzzy logic; cluster analysis; unsupervised learning; survey analysis; decision support system; k-means; k-modes; hierarchical clustering.
Discovery of inconsistent generalized coherent rules
by Anuradha Radhakrishnan, Rajkumar N, Rathi Gopalakrishnan, Soosaimichael PrinceSahayaBrighty
Abstract: Mining multiple-level association rules over a predefined taxonomy of hierarchies paves the way for generalised rule mining using interestingness measures like support and confidence. Coherent rule mining identifies significant rules in a database without using interestingness measures. In this paper we propose a new mining algorithm called generalised inconsistent coherent rule mining (GICRM) for mining a new form of generalised coherent rules called inconsistent coherent rules. The discovered rules are called inconsistent because the correlation of the rules changes from one level of the taxonomy to another. The rules are mined from a structured dataset with a predefined taxonomy. The inconsistent rules mined would be noteworthy from a business point of view for taking strategic decisions in market basket analysis.
Keywords: GICRM; multiple-level; generalized inconsistent coherent rule; taxonomy.
Time and Structural Anomalies Detection in Business Processes Using Process Mining
by Elham Saeedi, Faramarz Safi-Esfahani
Abstract: Information systems are increasingly being integrated into operational processes and, as a result, many events are recorded by information systems. Lack of compatibility between the process model and the observed behaviour is one of the challenges in constructing the process model in process mining. This lack of compatibility could be present in both the structure (sequence of tasks) and the time spent in each task. In this paper, a hybrid approach for detecting structural and time anomalies via process mining is proposed. A dataset from Iran Insurance Company is used for performing a case study. The proposed method has detected 98.5% of structure anomalies and 96.3% of time anomalies, which is one of the main achievements of this paper. A second standard dataset, referred to as dataset 2, is used to further examine the proposed method. The proposed method has demonstrated a better performance compared with the baseline approach.
Keywords: Process mining; conformance checking; workflow mining; structural anomaly; time anomaly; flexible model; Insurance anomaly; anomaly detection; process model; control-flow perspective.
Analysis of road accident data and determining affecting factors by using regression models and decision tree
by Hanieh GharehGozlu
Abstract: This study analyses road accident data with the aim of predicting the probability of road accidents leading to death and determining the affecting factors. Regression models including logit, probit, complementary log-log and Gompertz, and decision trees based on the CART algorithm, were used to analyse the actual data of the rail road police centre of the country. The results show that the logit regression model is superior to the other models from the perspective of the scales of the health indicator. Also, the variables of day of week, age, shoulder path, road side, road type, road position, maximum speed, belt safety, specific safety equipment, vehicle type and vehicle manufacturer country are among the variables that significantly affect the probability of road deaths, and can be controlled by controlling their levels.
Keywords: Road accidents; Regression models; Decision tree model; Accuracy indicator scales.
A Review of Market Basket Analysis on Business Intelligence and Data Mining
by Nilam Nur Amir Sjarif, Nurulhuda Firdaus Mohd Azmi, Siti Sophiayati Yuhaniz, Doris Hooi-Ten Wong
Abstract: Business intelligence (BI) is an information-driven arrangement that covers an assortment of tools, technologies, applications, procedures and methodologies that enable the mining of useful knowledge and information from operational data resources. Hidden patterns or trends derived from the tremendous volume of information contribute to informed and strategic decision making. Market basket analysis (MBA) is one of the regularly utilised data mining techniques in BI to help business organisations achieve a competitive advantage. Although the adoption of MBA as a data mining technique in BI tools is common in e-commerce, papers that survey BI and MBA are limited. This paper gives a broad picture of the current state of BI and the application of MBA as a BI technique. Literature related to BI and MBA from different sources such as digital libraries and Google Scholar is explored. The survey serves to some degree as a guide or platform for researchers and practitioners for future improvement.
Keywords: Market Basket Analysis; Business Intelligence; Data Mining.
Stock Price Forecasting and News Sentiment Analysis Model using Artificial Neural Network
by Sriram K. V, Somesh Yadav, Ritesh Singh Suhag
Abstract: The stock market is highly volatile, and the prediction of stock prices has always been an area of interest to many statisticians and researchers. This study is an attempt to predict the prices of stocks using an Artificial Neural Network (ANN). Three models have been built: one for the future prediction of stock prices based on previous trends, a second for prediction of the next day's closing price based on today's opening price, and a third that analyzes the sentiment of news articles and gives scores based on the news impact. The ANN is trained with historical data using the R Studio platform and is then used to predict future values. Our experimental results for various stock prices showed that the model is effective using ANN.
Keywords: Stock Pricing; Forecasting; Artificial Neural Network; News sentiment; Opening price; Closing price; R Studio; Data analytics.
Associative Classification Model for Forecasting Stock Market Trends
by Everton Castelão Tetila, Bruno Brandoli Machado, Jose F. Rorigues-Jr, Nícolas Alessando De Souza Belete, Diego A. Zanoni, Thayliny Zardo, Michel Constantino, Hemerson Pistori
Abstract: This paper proposes an associative classification model based on three technical indicators to forecast future trends of the stock market. Our methodology assessed the performance of nine technical indicators, using a portfolio of ten stocks and a twelve-year time series. The experimental results showed that the use of a set of technical indicators leads to higher classification rates compared to the use of a single technical indicator, reaching an accuracy of 88.77%. The proposed approach also uses a multidimensional data cube that allows automatic updating of stock market asset values, which is essential to keep the forecast up to date. The results indicate that our approach can support investors and analysts operating in the stock market.
Keywords: stock market trends; technical indicators; associative classification; data mining; business intelligence.
Mining the Productivity Data of Garment Industry
by Abdullah Al Imran, Md Shamsur Rahim, Tanvir Ahmed
Abstract: The garment industry is one of the key examples of the industrial globalization of this modern era. It is a highly labour-intensive industry with lots of manual processes. Satisfying the huge global demand for garment products is mostly dependent on the production and delivery performance of the employees in the garment manufacturing companies. So, it is highly desirable among the decision makers in the garment industry to track, analyse and predict the productivity performance of the working teams in their factories. This study explores the application of state-of-the-art data mining techniques for analysing industrial data, revealing meaningful insights, and predicting the productivity performance of the working teams in a garment company. As part of our exploration, we have applied 8 different data mining techniques with 6 evaluation metrics. Our experimental results show that the Tree Ensemble model and Gradient Boosted Tree model are the best performing models in the application scenario.
Keywords: Data Mining; Productivity Prediction; Pattern Mining; Classification; Garment Industry; Industrial Engineering.
GENERAL CRIME FROM THE DATA MINING POINT OF VIEW. A SYSTEMATIC LITERATURE REVIEW
by Maria Antonia Walteros Alcazar, Nicolas Aguirre Yacup, Sandra P. Castillo Landinez, Pablo E. Caicedo Rodríguez
Abstract: In recent decades, crime has become an issue of great concern to nations, which is why there is significant progress in the development of investigations in different areas. The literature review considers the data mining techniques applied to crime research, through the analysis of four thematic axes: countries, data sources, data mining techniques and software employed in different articles. The analysis used a systematic methodology to examine 111 articles published between 2008 and 2018, selected from almost 70 journals. The articles of this review are focused on different types of crime. The findings indicated that the USA is the most active country analysing crimes using data mining techniques; also, the most common sources are open data websites and crime studies. Studies addressing general crime are more frequent than those that cover a specific type of crime, the algorithm mainly used in studies is clustering followed by classification, and the most widely used software is WEKA.
Keywords: Data Mining DM; Crime; Criminal Patterns; Law Enforcement; Data Mining Techniques; Algorithms; Review; Knowledge Discovery; Literature Review LR.
XML web quality analysis by employing MFCM clustering technique and KNN classification
by M. Gopianand, P. Jaganathan
Abstract: The great accomplishment of web search engines is keyword search, which is the most popular search representation for regular consumers. It permits the consumer to create queries without knowledge of the query language and the database schema, so it is also considered a user-friendly method. The quality of the XML web has to be accurate if exact queries are to be answered. Here we propose a method to assess the quality of the XML web by analysing the keywords present in the XML web based on the respective keyword search. In our proposed method we collect a number of XML documents, which are clustered based on the keyword depending on the type of XML files. Modified fuzzy C means (MFCM) is used for clustering. Once the clustering based on the respective keyword is done, we classify the XML web based on the quality of the data by utilising a KNN classifier.
Keywords: XML web; K nearest neighbour; error value; classification accuracy; feature vectors.
A technique for semantic annotation and retrieval of e-learning objects
by A. Balavivekanandhan
Abstract: The primary objective of my research is to design and develop a semantic annotation and retrieval model for e-learning documents. In the training phase, documents from different domains are taken and the informative words from each document are obtained based on balanced mutual information and the frequency of contents in each document. We then use the informative words to identify the superordinates and the objects. The superordinates, the informative words and the objects from each document give the relation and properties of each document. The relation and properties of each document are then used to cluster the documents. In the testing phase, we give a query or a document as input to the system to retrieve the relevant documents. If a document is given as input, the relation and properties of that document are first identified and then used to retrieve the relevant documents.
Keywords: e-learning; document clustering; balanced mutual information; one-way matching; cluster-based matching.
Online products recommendation system using genetic kernel fuzzy C-means and probabilistic neural network
by E. Manohar, D. Shalini Punithavathani
Abstract: Purchasers' reviews play a significant role in purchasing decisions for online shopping, as a customer desires to obtain the opinions of other purchasers about online products. However, selecting the most appropriate product from the best website is a challenging problem for online users. Accordingly, this paper proposes a hybrid recommendation system for identifying customer preferences and recommending the most appropriate product. To do this, first the dataset is collected and prepared in the pre-processing step. Genetic kernel fuzzy C-means (GAKFCM) is used for usage cluster formation after the pre-processing step. Different features are extracted from each cluster based on user interest level. The user interest levels are used as features for the classifier to support user knowledge discovery. Based upon the user interest level, the product recommendation is done using a probabilistic neural network (PNN). The simulation results show a high precision rate, which clearly indicates that the proposed method is very useful and appealing.
Keywords: website; web-log; ranking; rating; review; products; genetic kernel fuzzy C-means; GAKFCM; probabilistic neural network; PNN.
Hybridising neural network and pattern matching under dynamic time warping for time series prediction
by Thanh Son Nguyen
Abstract: Pattern matching-based forecasting models are attractive due to their simplicity and the ability to predict complex nonlinear behaviours. The Euclidean measure is the most commonly used metric for pattern matching in time series. However, its weakness is that it is sensitive to distortion in the time axis, which can influence forecasting results. The dynamic time warping (DTW) measure is introduced as a solution to the weakness of the Euclidean distance metric. In addition, artificial neural networks (ANNs) have been widely used in time series forecasting. They have been used to capture complex relationships with a variety of patterns. In this work, we propose an improved hybrid method which is an affine combination of a neural network model and a DTW-based pattern matching model for time series prediction. This method can take full advantage of the individual strengths of the two models to create a more effective approach for time series prediction. Experimental results show that our proposed method outperforms the neural network model and the DTW-based pattern matching method used separately in time series prediction.
Keywords: time series; pattern matching; artificial neural network; ANN; time series prediction; dynamic time warping; DTW; k-nearest neighbour.
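As an illustration of the two ingredients named in the abstract above, the following minimal sketch shows the classic dynamic-programming DTW distance and an affine combination of two forecasts. It is not the authors' implementation: the function names, the absolute-difference local cost and the weight `alpha` are assumptions made for this example.

```python
import math


def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW with absolute difference as local cost."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # a[i-1] also matched to an earlier point of b
                cost[i][j - 1],      # b[j-1] also matched to an earlier point of a
                cost[i - 1][j - 1],  # one-to-one step
            )
    return cost[n][m]


def affine_combination(pred_nn, pred_dtw, alpha):
    """Weights alpha and (1 - alpha) sum to one; alpha is a hypothetical weight."""
    return alpha * pred_nn + (1.0 - alpha) * pred_dtw


# A series and a time-stretched copy: Euclidean distance is undefined here
# (different lengths), while DTW aligns them at zero cost.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
print(affine_combination(10.0, 20.0, 0.25))
```

In a hybrid of this kind, `alpha` would typically be fitted on held-out data; how the paper chooses it is not stated in the abstract.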
Discrete Weibull regression for modelling football outcomes
by Alessandro Barbiero
Abstract: We propose the use of the discrete Weibull distribution for modelling football match results, as an alternative to existing Poisson and generalised Poisson models. The numbers of goals scored by the two teams playing a football match are regarded as a pairwise observation and are modelled first through two independent discrete Weibull variables, and then through two dependent discrete Weibull variables, using a copula approach that accommodates non-null correlation. The parameters of the bivariate discrete Weibull distributions are assumed to depend on covariates such as the attack and defence abilities of the two teams and the 'home effect'. Several discrete Weibull regression models are proposed and then applied to the 2015-2016 Italian Serie A. Although the interpretation of parameters is less immediate than in the case of bivariate Poisson models, these models represent a suitable alternative, which can also be applied in fields other than sport data analysis.
Keywords: count data; count regression model; Frank copula; Poisson distribution; sport analytics.
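For orientation, the building blocks mentioned in the abstract and keywords above can be written out as follows. The abstract does not say which discrete Weibull variant is used; the type I form is assumed here, and the symbols $q$, $\beta$ and $\theta$ are notation chosen for this sketch rather than taken from the paper.

```latex
% Type I discrete Weibull (assumed variant), with 0 < q < 1 and \beta > 0
P(X = x) = q^{x^{\beta}} - q^{(x+1)^{\beta}}, \qquad x = 0, 1, 2, \dots
\qquad
F(x) = P(X \le x) = 1 - q^{(x+1)^{\beta}}

% Frank copula with dependence parameter \theta \neq 0
C_{\theta}(u, v) = -\frac{1}{\theta}
  \ln\!\left( 1 + \frac{(e^{-\theta u} - 1)(e^{-\theta v} - 1)}{e^{-\theta} - 1} \right)

% Joint probability of a score (x, y) for discrete margins F_1, F_2,
% with the convention F_i(-1) = 0
P(X = x, Y = y) = C_{\theta}\bigl(F_1(x), F_2(y)\bigr) - C_{\theta}\bigl(F_1(x-1), F_2(y)\bigr)
  - C_{\theta}\bigl(F_1(x), F_2(y-1)\bigr) + C_{\theta}\bigl(F_1(x-1), F_2(y-1)\bigr)
```

With $\beta = 1$ the type I form reduces to the geometric distribution, and $P(X = 0) = 1 - q$ in all cases, so $q$ is directly the probability that a team scores at least one goal.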
Fuzzy-based review rating prediction in e-commerce
by P. Velvizhy, A. Pravi, M. Selvi, S. Ganapathy, A. Kannan
Abstract: Opinion mining is an ongoing research area in e-commerce which aims at analyzing the people's opinions, sentiments and emotions. Moreover, the existing e-commerce systems allow the users to share their feedback in the form of textual reviews regarding the products and services. It also allows the consumers to give ratings for products that help in future recommendation of products. In this research work, a computational framework for efficiently predicting the consumer review ratings on the products has been proposed. The proposed framework integrates dimensionality reduction, genetic algorithm, fuzzy c-means and adaptive neuro-fuzzy inference techniques to overcome the limitations of the existing systems. Experiments have been conducted in this work using Amazon dataset consisting of reviews for different products. This system provides better performance and prediction accuracy for review ratings when it is compared with the related work.
Keywords: sentiment analysis; review ratings prediction; dimensionality reduction; genetic algorithm; data mining; fuzzy c-means.
REFERS: refined and effective fuzzy e-commerce recommendation system
by Sankar Pariserum Perumal, Ganapathy Sannasi, Kannan Arputharaj
Abstract: Online shopping culture is gaining traction globally and some of the biggest beneficiaries of this e-commerce shift are Amazon, eBay, etc. Recommendation systems guide online users in a personalised manner to choose what they want and their interest on each product present in the catalogue list. In such a scenario, the existing systems need complete information for making recommendations, which is not always possible in real applications. Therefore, a novel refined and effective fuzzy e-commerce recommendation system has been proposed in this paper that combines the benefits of difference in importance within the rating factors by a single user and new similarity measure approach that aims at improved recommendation list to the e-commerce user. The proposed methodology has been implemented using a new similarity measure on experimental datasets and the refined scores for such e-commerce website-based unlocked mobile phones are compared in this work against classic similarity measures.
Keywords: fuzzy recommendation system; degree of similarity measure; rating factor importance; collective expert rating.
A novel dynamic approach to identifying suspicious customers in money transactions
by Abdul Khalique Shaikh, Amril Nazir
Abstract: Money laundering activity has a negative impact on the development of the national economy. Anti-money laundering (AML) solutions within financial institutions help to control it in a suitable way. However, one of the fundamental challenges in AML solutions is to identify genuinely suspicious transactions. To identify these types of transactions, existing research uses pre-defined rules and statistical approaches that help to detect suspicious transactions. However, due to the fixed and predetermined rules, it is highly probable that a normal customer can be identified as a suspicious customer. To overcome the above limitations, a novel dynamic approach to identifying suspicious customers in money transactions is proposed that is based on dynamic analysis of customer profile features to identify suspicious transactions. The experiment has been executed with real bank customers and their transaction data, and the results of the experiment provide promising outcomes in terms of accuracy.
Keywords: anti-money laundering; AML; suspicious transactions; money transaction; dynamic AML analysis; data analysis.
Fibonacci retracement pattern recognition for forecasting foreign exchange market
by Mohd Fauzi Ramli, Ahmad Kadri Junoh, Mahyun Ab Wahab, Wan Zuki Azman Wan Muhamad
Abstract: Fibonacci retracement forecasts future movements in foreign exchange (forex) rates through inductive analysis of previous movements. Fibonacci ratios are used to forecast the retracement levels of 0.382, 0.500 and 0.618 and to determine the current trend, which provides the mathematical foundation for the Elliott wave theory. The K-nearest neighbour (KNN) and linear discriminant analysis (LDA) algorithms are the pattern recognition methods for nonlinear feature mining of Elliott wave patterns. Results show that LDA is better than KNN, with a classification accuracy of 99.43%. Among the three levels of Fibonacci retracement, the 38.2% level shows the best forecasting for the Great Britain Pound to US Dollar major currency pair, using mean absolute error (MAE), root mean square error (RMSE) and Pearson correlation coefficient (r) as the statistical measurements, which are 0.001884, 0.000019 and 0.992253 for the uptrend and 0.001685, 0.000019 and 0.998806 for the downtrend.
Keywords: forex; forecast; Fibonacci retracement; Elliott wave; golden ratio.
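The three retracement levels quoted in the abstract above can be computed from a swing low and swing high as in the sketch below. It follows the common charting convention (retracing downward from the high after an uptrend, upward from the low after a downtrend), which is an assumption here, not a detail given in the abstract; the function name and the example prices are hypothetical.

```python
RATIOS = (0.382, 0.500, 0.618)


def retracement_levels(low, high, uptrend=True):
    """Map each Fibonacci ratio to its retracement price level.

    After an uptrend the price is expected to pull back down from the high;
    after a downtrend it is expected to pull back up from the low.
    """
    span = high - low
    if uptrend:
        return {r: high - r * span for r in RATIOS}
    return {r: low + r * span for r in RATIOS}


# Hypothetical swing points for a currency pair.
levels_up = retracement_levels(1.2000, 1.3000, uptrend=True)
levels_down = retracement_levels(1.2000, 1.3000, uptrend=False)
for ratio in RATIOS:
    print(ratio, round(levels_up[ratio], 4), round(levels_down[ratio], 4))
```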
ScrAnViz: a tool for analytics and visualisation of unstructured data
by Sriraghav Kameswaran, V.S. Felix Enigo
Abstract: Existing big data visualisation tools are meant for visualising structured data. But surveys show that about 80-90% of potentially usable business information is in unstructured format. Analysing unstructured data is challenging due to the lack of structure and relational form. In this paper, we propose a tool called ScrAnViz that can structure data, perform analysis and provide visualisation, thereby helping business people and end users in decision making. An attribute-based opinion mining algorithm has been developed and implemented. Performance analysis shows that the algorithm reduces the search time by a factor of three compared with traditional document-level sentiment analysis systems.
Keywords: unstructured data; data analytics; sentiment analysis; opinion mining; data visualisation.
Implementation of multi node Hadoop virtual cluster on open stack cloud environments
by S. Karthikeyan, R. Manimegalai
Abstract: Computing now plays a vital role in information technology and many other fields. Cloud computing is one of the biggest milestones among next-generation technologies and is booming in the IT field and business sectors. In day-to-day life, data is being generated in enormous amounts, on the order of terabytes (TB), petabytes (PB) and zettabytes (ZB). Hadoop Map Reduce is a popular distributed computing paradigm for processing data-intensive jobs in the cloud. Completion time goals, or deadlines, set by users for Map Reduce jobs are becoming crucial in existing cloud-based data processing environments like Hadoop. This paper proposes a real-time implementation of a multi-node Hadoop virtual cluster on open stack cloud environments. It processes huge data sets in parallel on different virtual machines (VMs) and compares the average execution time for virtual clusters with different numbers of nodes and inputs of various sizes.
Keywords: cloud; data intensive; Hadoop; Map Reduce; open stack-cluster; virtualisation.
Impact of clustering on quality of recommendation in cluster-based collaborative filtering: an empirical study
by Monika Singh, Monica Mehrotra
Abstract: In-memory nearest neighbour computation is a typical approach for collaborative filtering (CF) due to its high recommendation accuracy. However, this approach does not scale: its performance declines with the rapid increase in the number of users and items in archetypal merchandising applications. One of the popular techniques to mitigate the scalability issue is cluster-based collaborative filtering (CBCF), which uses a clustering approach to group the most similar users/items from the complete dataset. In this work we present a detailed analysis of the impact of clustering on the CF approach. Specifically, we study how the extent of clustering impacts collaborative filtering systems in terms of quality of predictions, quality of recommendations, throughput and coverage. Based on the empirical results obtained from two datasets, Movielens100K and Jester, we conclude that with an increasing number of clusters, the quality of predictions, the quality of recommendations and the throughput are enhanced, but the coverage provided by clustered subsystems declines.
Keywords: recommender systems; collaborative filtering; CF; clustering; prediction; nearest neighbours; clustering-based collaborative filtering; CBCF; average recommendation time; coverage; quality of predictions and quality of recommendations.
Mining big data streams using business analytics tools: a bird's eye view on MOA and SAMOA
by P.M. Arunkumar, S. Kannimuthu
Abstract: Big data is evolving as a prominent field in the modern computing era. Big data analytics and its impact on extracting business intelligence are becoming indispensable for a plethora of applications. The non-proprietary software revolution paved the way for the evolution of tools like Weka, RapidMiner, Orange and R. Traditional data mining techniques hardly adapt to the requirements of rapid data analysis. Data stream processing algorithms that handle large volumes of data face greater challenges in real time. Big data mining requires further improvement of traditional tools to address the challenges of massive data processing. This paper highlights the importance of data stream mining and explores two important open source frameworks, namely massive online analysis (MOA) and scalable advanced massive online analysis (SAMOA). The implications of both tools augur well for further deliberations in the big data research community. Business information system (BIS) models can reach unprecedented heights with the proliferation of these business analytics tools.
Keywords: big data; data mining; data streams; massive online analysis; MOA; business intelligence.
Weighted neuro-fuzzy hybrid algorithm for channel equalisation in time varying channel
by Zeeshan Ahmad Abbasi, Zainul Abdin Jaffery
Abstract: In MIMO-OFDM communication systems, accurate channel estimation and equalisation play a major role. In this paper, we use a weighted neuro-fuzzy hybrid (WNFH) channel estimation algorithm for channel equalisation. The pilot is designed based on a combination of a neural network and a fuzzy logic system. Scaled conjugate gradient (SCG) is combined with the group search optimiser (GSO) algorithm, and the neural network is trained using this hybrid training algorithm. In the transmitter section, the proposed system contains quadrature amplitude modulation (QAM) and a transmitter. The minimum mean-square error (MMSE) estimation design accounts for the channel prediction error to improve the performance of symbol detection. The calculated pilot sequences reduce the MMSE of channel estimation and show great superiority in the MIMO-OFDM system. Experimental outcomes show that the channel estimation is effective.
Keywords: MIMO-OFDM; group search optimiser; GSO; scaled conjugate gradient; SCG; channel estimation.
Decision tree classifier for university single rate tuition fee system
by Taufik Fuadi Abidin, Samsul Rizal, Teuku Mohamad Iqbalsyah, Rizal Wahyudi
Abstract: The regulation about single rate tuition fee for undergraduate study at state universities in Indonesia was enacted in 2013. The tuition fee is calculated based on the needs of each academic program and the regional cost index. The fee is grouped into several categories and set differently for each university. For Syiah Kuala University, located in Banda Aceh, Indonesia, the tuition fee is grouped into five different categories. This paper describes the construction of J48 decision tree classifier and evaluates its performance during training and testing phases when compared to ID3 and Naive Bayes classifiers to determine the category. The results show that the J48 decision tree classifier outperforms the other two classifiers in both phases. In the training phase, the F-measure and ROC for the J48 decision tree classifier are 0.889 and 0.973, respectively, and in the testing phase, the F-measure and ROC are 0.911 and 0.987, respectively.
Keywords: decision tree classifier; multi-class classification; university single rate tuition fee system.
Analysing thyroid disease using density-based clustering technique
by Khushboo Chandel, Veenita Kunwar, A. Sai Sabitha, Abhay Bansal, Tanupriya Choudhury
Abstract: Data mining in medicine has been used to predict unknown patterns in health data and to obtain diagnostic results. The healthcare industry generates large amounts of complex data about patients, diseases and treatments. Data mining in healthcare provides benefits like detecting fraud, making medical facilities available to patients at low cost, ensuring high quality patient care and making healthcare policies. Disease detection has become essential due to the increasing number of health issues occurring day by day. Thyroid disease has become one such concern, with numerous cases being detected yearly. It causes improper functioning of the thyroid gland. In this paper, a clustering technique has been used to detect and understand factors influencing thyroid disease. The DBSCAN algorithm has been used as it can handle clusters of varying shapes and sizes and is noise resistant. PCA has also been applied to find patterns in high-dimensional data and to reduce dimensionality. The experimental setup has been implemented in RapidMiner.
Keywords: data mining; clustering; thyroid disease; DBSCAN; principal component analysis.
A collaborative content-based movie recommender system
by Bolanle Adefowoke Ojokoh, Oluwatosin Olatunbosun Aboluje, Tobore Igbe
Abstract: In this paper, Pearson's correlation coefficient is employed for collaborative filtering due to its ability to manipulate numerical data as well as determine the linear relationship among existing users. Its steps involve a user-user representation, similarity generation and prediction generation, with the goal of producing a predicted opinion of the active user about a specific item. The concept of parental control is also incorporated for enhancement. Evaluation of the system was done using precision, recall, F-measure, discounted cumulative gain (DCG), ideal discounted cumulative gain (IDCG), normalised discounted cumulative gain (nDCG) and mean absolute error (MAE). Three hundred forty-six datasets were used, out of which 126 were gathered from local video shops and 220 were extracted from the internet movie database (IMDb). These were used for the experiments, and the results generated through mining of data obtained from profiles and ratings of system users show that the average ranking quality of the collaborative filtering algorithm is 95.9%.
Keywords: recommendation; collaborative filtering; correlation coefficient; evaluation; movies.
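The similarity-generation and prediction-generation steps described in the abstract above can be sketched with the textbook user-based formulation: Pearson correlation over co-rated items, followed by a mean-centred weighted prediction. This is a minimal sketch under those assumptions, not the authors' system; the rating data and function names are hypothetical.

```python
import math


def pearson(ratings_a, ratings_b):
    """Pearson correlation between two users over their co-rated items."""
    common = set(ratings_a) & set(ratings_b)
    if len(common) < 2:
        return 0.0
    mean_a = sum(ratings_a[i] for i in common) / len(common)
    mean_b = sum(ratings_b[i] for i in common) / len(common)
    num = sum((ratings_a[i] - mean_a) * (ratings_b[i] - mean_b) for i in common)
    den = math.sqrt(sum((ratings_a[i] - mean_a) ** 2 for i in common)) * math.sqrt(
        sum((ratings_b[i] - mean_b) ** 2 for i in common)
    )
    return num / den if den else 0.0


def predict(ratings, active, item):
    """Active user's mean plus the similarity-weighted deviations of neighbours."""
    mean_active = sum(ratings[active].values()) / len(ratings[active])
    num = den = 0.0
    for other, other_ratings in ratings.items():
        if other == active or item not in other_ratings:
            continue
        w = pearson(ratings[active], other_ratings)
        mean_other = sum(other_ratings.values()) / len(other_ratings)
        num += w * (other_ratings[item] - mean_other)
        den += abs(w)
    return mean_active if den == 0 else mean_active + num / den


# Hypothetical ratings on a 1-5 scale; u1 has not rated movie "d".
ratings = {
    "u1": {"a": 5, "b": 3, "c": 4},
    "u2": {"a": 4, "b": 2, "c": 3, "d": 5},
    "u3": {"a": 1, "b": 5, "d": 2},
}
print(pearson(ratings["u1"], ratings["u2"]))
print(predict(ratings, "u1", "d"))
```

Because the prediction is not clipped, it can fall slightly outside the rating scale; a production system would normally clamp it to the valid range.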
A statistical approach to investigate the alternatives of love in Moulana's Divan
by Mohammad Reza Mahmoudi, Ali Abasalizadeh, Marzieh Rahmati
Abstract: Conceptual metaphor is the systematic mapping of conceptual domains onto each other. Love is the most important axis of the mystical path. In this paper, all the lines in Moulana's Divan are studied, and the different words used as alternatives of love are determined and classified into 11 areas. Then the chi-square goodness of fit test is used to investigate and compare, separately, the frequencies of the different areas and words used as alternatives of love. Finally, based on clustering methods, these alternatives are clustered into three groups (high frequency, medium frequency, and low frequency). The results indicate that the word 'fire' and the area 'human' are used most often as alternatives of love.
Keywords: conceptual metaphor; love; Moulana; statistics; data mining; text mining.
Optimal region growing and multi-kernel SVM for fault detection in electrical equipments using infrared thermography images
by C. Shanmugam, E. Chandira Sekaran
Abstract: Infrared thermography (IRT) has played an essential part in observing and examining thermal defects of electrical equipment without interrupting operation, which is of vital importance for the reliability of electrical systems. This paper determines whether electrical parts are faulty or non-faulty with the help of a segmentation and classification model. Features are calculated from the input thermal images, regions of interest (ROI) are segmented using the optimal region growing (ORG) technique, and faults are classified using a multi-kernel support vector machine (MKSVM). In the tests, the classification performance for different input features is assessed. To enhance the segmentation performance, an optimisation procedure, whale optimisation (WO), is used. Before classification, the features extracted from the electrical components are fused into a single vector for each image using a feature level fusion (FLF) procedure. Classification performance indices, including sensitivity, specificity and accuracy, are used to identify the most appropriate input features and the best arrangement of classifiers. The performance of the SVM is compared with that of a neural network. The comparison results demonstrate that our technique can achieve superior performance, with an accuracy of 98.21%.
Keywords: infrared thermography; IRT; feature extraction; support vector machine; SVM; optimisation; classification and fault detection.
Topic-driven top-k similarity search by applying constrained meta-path based in content-based schema-enriched heterogeneous information network
by Phu Pham, Phuc Do
Abstract: In this paper, we propose a model of TopCPathSim in order to address the problem related to 'topic-driven' similarity searching based on 'constrained meta-path' (or also called 'restricted meta-path') between same-typed objects within the content-based heterogeneous information networks (HINs). The topic distributions over content-based objects such as: paper/article on the bibliographic network or user's comments/reviews on the social networks, etc. are obtained by using the LDA topic model. We conduct the experiments on the real DBLP, Aminer and ACM datasets which demonstrate the effectiveness of our proposed model. Throughout experiments, our proposed model gains about 73.56% in accuracy. The output results also show that the combination of probabilistic topic model with constrained meta-path is promising to leverage the output quality of topic-oriented similarity searching in content-based HINs.
Keywords: constrained meta-path; content-based heterogeneous information network; topic-driven similarity search; LDA; topic modelling.
Location-based personalised recommendation systems for the tourists in India
by Madhusree Kuanr, Sachi Nandan Mohanty
Abstract: This study examines the collaborative filtering in recommender system by categorising users according to their choices of place, food, local item purchase, etc. The proposed system will store the opinions of the local users about the sites, foods and products for purchase available in those sites. It uses collaborative filtering technique to find the similar users to a given querying user. The system recommends the best sites along with good foods and products available on those sites according to the recent data. Two hundred (male = 110, female = 90) married individuals from Bhubaneswar, Odisha (India) participated in this survey. Cosine similarity is used in the proposed system to find the similar users of a given input query user. The results revealed that collaborative filtering is the more reliable technique for personalised recommender systems. Experimental results show performance of the proposed system in terms of precision, recall and F-measure values.
Keywords: collaborative filtering; recommender systems; user profile generation; India.
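The cosine-similarity step mentioned in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the user names, the three-dimensional rating vectors (place, food, local product) and the 0-5 scale are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero profile is treated as similar to nobody
    return dot / (norm_a * norm_b)


def most_similar_users(query, profiles, k=2):
    """Return the names of the k stored users closest to the query vector."""
    scored = [(cosine_similarity(query, vec), name)
              for name, vec in profiles.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Hypothetical ratings for (place, food, local product).
profiles = {
    "user_a": [5, 4, 0],
    "user_b": [1, 0, 5],
    "user_c": [4, 5, 1],
}
query = [5, 5, 0]
neighbours = most_similar_users(query, profiles, k=2)
```

In a recommender of this kind, the items rated highly by the returned neighbours would then be candidates for recommendation to the querying user.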
Deep learning framework for early detection of intrusion in virtual environment
by G. Madhu Priya, S. Mercy Shalinie, P. Mohana Priya
Abstract: Today's business enterprises adopt cloud-based services as their architectural design. An intelligence technique incorporated into the architecture gives massive tangible and intangible benefits in terms of performance and reliability. Such cloud-based business architecture faces many threats to its availability. The DDoS attack is the most prominent threat, as its impact is greater in the virtual resource-based cloud infrastructure. Therefore, there is a need for a business intelligence-based framework that detects the attack early by monitoring the virtual network traffic. The proposed intelligence framework uses a deep learning framework, the continuous discriminative-deep belief network (CD-DBN). CD-DBN dynamically captures attack patterns from the network data, analyses the data and detects intrusions into the cloud. The observed results show that the earlier detection approach guarantees the availability of cloud services to legitimate users and enhances cloud resource usage.
Keywords: deep learning; restricted Boltzmann machine; deep belief network; cloud environment; virtualisation; hypervisor; intrusion detection; availability threat; DDoS attack; SysBench benchmark suite.
Efficient search for top-k discords in streaming time series
by Bui Cong Giao, Duong Tuan Anh
Abstract: The problem of anomaly detection in streaming time series has received much attention recently. The problem addresses finding the most anomalous subsequence (discord) over a time-series stream, which might arrive at high speed. Finding the top-k discords is more useful than finding only the most unusual subsequence, since users can choose among the top-k discords instead of being given only one. Hence, an efficient method of searching for top-k discords in streaming time series is proposed in the paper. The method uses a lower bound threshold, a lower bounding technique on a common dimensionality reduction transform, and a state-of-the-art technique for computing the distance between two time-series subsequences to prune off unnecessary distance calculations. The three techniques are arranged in a cascading fashion to speed up the method. Furthermore, the proposed method can return a set of top-k discords on the fly. The experimental results show that the proposed method can acquire quality discords nearly identical to those obtained by HOT SAX, a well-known method of anomaly detection. Remarkably, our proposed method demonstrates a fast response in handling time-series streams at high speed.
Keywords: anomaly detection; discord; streaming time series.
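To make the notion of a top-k discord in the abstract above concrete, here is a naive brute-force baseline on a static series. It is only a sketch under simplifying assumptions: it uses plain Euclidean distance without normalisation, it has none of the pruning or streaming machinery the abstract describes, and the function names and sample series are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length subsequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k_discords(series, m, k):
    """Return up to k (start, distance) pairs for length-m subsequences whose
    nearest non-overlapping neighbour is farthest away."""
    n = len(series) - m + 1
    subs = [series[i:i + m] for i in range(n)]
    scored = []
    for i in range(n):
        best = math.inf
        for j in range(n):
            if abs(i - j) >= m:  # ignore trivial matches that overlap i
                best = min(best, euclidean(subs[i], subs[j]))
        if best != math.inf:
            scored.append((best, i))
    # Largest nearest-neighbour distance first; ties broken by earlier start.
    scored.sort(key=lambda t: (-t[0], t[1]))
    chosen = []
    for dist, i in scored:
        # Keep the reported discords mutually non-overlapping.
        if all(abs(i - start) >= m for start, _ in chosen):
            chosen.append((i, dist))
        if len(chosen) == k:
            break
    return chosen


# Hypothetical periodic series with a single spike at index 7.
series = [0, 1, 0, 1, 0, 1, 0, 5, 0, 1, 0, 1, 0, 1, 0, 1]
discords = top_k_discords(series, m=2, k=1)
```

This baseline performs a quadratic number of distance computations, which is the cost that lower bounds and early pruning are meant to avoid.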
Stability analysis of feature ranking techniques in the presence of noise: a comparative study
by Iman Ramezani, Mojtaba Khorram Niaki, Milad Dehghani, Mostafa Rezapour
Abstract: Noisy data is one of the common problems associated with real-world data, and may affect the performance of data models, the consequent decisions and the performance of feature ranking techniques. In this paper, we show how stability changes when different feature ranking methods are applied in the presence of attribute noise and class noise. We use Kendall's Tau rank correlation and Spearman rank correlation to evaluate the stability of various feature ranking methods, and quantify the degree of agreement between the ordered list of features created by a filter on a clean dataset and its outputs on the same dataset corrupted with different combinations of noise levels. According to the results of the Kendall and Spearman measures, Gini index (GI) and information gain (IG) have the best performances, respectively. Nevertheless, both the Kendall and Spearman results show that ReliefF (RF) is the most sensitive method (the worst performer).
Keywords: attribute noise; class noise; filter-based feature ranking; threshold-based feature ranking; stability; Kendall's Tau rank correlation; Spearman rank correlation.
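The two agreement measures named in the abstract above can be sketched in a few lines. This is an illustrative implementation that assumes rankings without ties; the two example rankings (a "clean" and a "noisy" ordering of five features) are hypothetical.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau for two tie-free rankings of the same items."""
    n = len(rank_a)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def spearman_rho(rank_a, rank_b):
    """Spearman's rho via the squared rank-difference formula (no ties)."""
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical ranks of five features on clean data and on noisy data.
clean_ranks = [1, 2, 3, 4, 5]
noisy_ranks = [2, 1, 3, 5, 4]
tau = kendall_tau(clean_ranks, noisy_ranks)   # 8 concordant, 2 discordant pairs
rho = spearman_rho(clean_ranks, noisy_ranks)  # sum of squared differences is 4
```

A value close to 1 for either measure indicates that the ranking changed little under noise, that is, the ranking method is stable.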
Anomaly detection for elderly home care
by Kurnianingsih, Lukito Edi Nugroho, Widyawan, Lutfan Lazuardi, Anton Satria Prabuwono, Mahardhika Pratama
Abstract: In this paper, we propose a model for detecting anomalies in elderly home care. Two scenarios are investigated in detecting anomalies: 1) the elderly person's vital signs and their surrounding environment; 2) the mobility patterns of the elderly. We evaluated our proposed model by employing the isolation forest, which detects anomalies using an isolation approach on a random forest of decision trees. We compare isolation forest on unlabelled data with statistical methods on labelled data. Subsequently, to show the reliability of the isolation concept, we compare it with a distance measure concept. The experiment shows that isolation forest has higher detection accuracy and lower prediction error for two attributes in the first scenario, skin temperature and heart rate, whereas, in the second scenario, multi-covariance determinant has slightly better accuracy than isolation forest (3.9% difference in accuracy) and fewer prediction errors.
Keywords: anomaly detection; isolation forest; elderly home care.
A simple transform domain-based low level primitives preserving texture synthesis
by S. Anuvelavan, M. Ganesh, P. Ganesan
Abstract: In this work, a new patch-based texture synthesis scheme with orthogonal polynomials model coefficients is presented. The proposed scheme has four phases. In the first phase, a block matching technique that identifies the best match to synthesise in the larger output image is designed in terms of ordered orthogonal polynomials model coefficients. In the case of a successful block match, called a patch-hit, the proposed scheme finds candidate blocks with a triangular search in the next phase. In the patch selection phase, the proposed scheme considers only a subset of the orthogonal polynomials model coefficients of the blocks for synthesis, which consumes less memory and time. The synthesised output is smoothed in the final phase, preserving the low level contents between the synthesised patches. The performance of the proposed scheme is measured with energy, contrast, correlation, homogeneity and entropy between the original and synthesised images, and is also compared with existing texture synthesis schemes. The results are encouraging.
Keywords: texture synthesis; orthogonal polynomials; patch-hit; candidate block; patch selection.
Using diverse set of features to design a content-based video retrieval system optimised by gravitational search algorithm
by Sadagopan Padmakala, Ganapathy Sankar Anandha Mala, K.M. Anandkumar
Abstract: This paper describes a content-based video retrieval (CBVR) approach using four varieties of features and 12 distance measurements, which is optimised by the gravitational search algorithm (GSA). Initially, the CBVR technique extracts five kinds of features, such as colour, texture, shape, image and audio features, that belong to each frame. It then applies particular distance measurements to each sort of feature to compute the similarity between the query frame and the remaining frames in the database. In this paper, we use GSA to find the nearly optimal combination of the features and their respective similarity measurements. Finally, the videos matching the query are retrieved from the video database. For experimentation, we used two types of databases: sports video and UCF sports action datasets. The experimental results demonstrate that the proposed CBVR method shows better performance than other existing methods.
Keywords: video retrieval; distance measurements; colour; texture; shape; audio; content-based video retrieval; CBVR; similarity; combinations.
g*-closed sets in intuitionistic fuzzy topological spaces
by T. Gandhimathi, M. Rameshkumar
Abstract: This paper is devoted to the study of intuitionistic fuzzy topological spaces. In this paper we introduce and study the concepts of intuitionistic fuzzy g*-closed sets and intuitionistic fuzzy g*-open sets in intuitionistic fuzzy topological spaces. We show that intuitionistic fuzzy g*-closed sets lie between intuitionistic fuzzy g-closed sets and intuitionistic fuzzy g-closed sets. We obtain some characterisations and several preservation theorems of the spaces.
Keywords: intuitionistic fuzzy topology; intuitionistic fuzzy g*-closed sets; intuitionistic fuzzy g*-open sets.
Multi-document-based text summarisation through deep learning algorithm
by G. Padmapriya, K. Duraiswamy
Abstract: The proposed approach uses a deep learning algorithm to retrieve an effective text summary for a set of documents. The proposed system consists of two phases: a training phase and a testing phase. The training phase exploits three different algorithms to make the text summarisation process effective. As in any training phase, the proposed training phase works with known data and attributes. After that, the testing phase is implemented to test the efficiency of the proposed approach. For experimentation, we used four document sets selected from DUC (2002). The experimental evaluation showed the expected results: an average precision of 78%, an average recall of 1 and an average f-measure of 84%.
Keywords: particle swarm optimisation; text summarisation; deep learning algorithm.
Sentimental event detection from Arabic tweets
by Mohammad Daoud, Daoud Daoud
Abstract: This article presents and evaluates an approach to detect sentimental events from Twitter Arabic data streams. Sentimental events attract strongly opinionated responses from the online community; therefore, we aim at detecting the association of a topic with a positive or a negative sentiment at a particular time. To achieve that, we build sentimental time series in which the frequencies of that association (between topics and sentiment) are recorded. We then use several algorithms to locate possible events. Events in positive timelines are considered positive, and events in negative timelines are considered negative. Our approaches use the Shannon diversity index and hill climbing peak finding. We tested our proposed algorithms on the domain of football (soccer) news. The results showed good precision and recall, taking mainstream media as a reference. The success of such an experiment can open the door to many useful applications, including reputation and brand monitoring systems for various domains and languages.
Keywords: event detection; sentiment analysis; social media analysis; diversity analysis; data mining.
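The two building blocks named in the abstract above, the Shannon diversity index and hill climbing peak finding, can be sketched as follows. How the authors combine them is not stated in the abstract, so this only illustrates each technique in isolation; the function names and the sample frequency series are hypothetical.

```python
import math


def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln p_i) over non-zero category counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def hill_climb_peak(series, start):
    """Walk uphill from `start` until neither neighbour is strictly higher;
    returns the index of the local maximum that was reached."""
    i = start
    while True:
        left = series[i - 1] if i > 0 else -math.inf
        right = series[i + 1] if i < len(series) - 1 else -math.inf
        if right > series[i] and right >= left:
            i += 1
        elif left > series[i]:
            i -= 1
        else:
            return i


# Hypothetical per-interval counts of topic/sentiment associations.
frequencies = [1, 3, 7, 4, 2, 5, 9, 6]
first_peak = hill_climb_peak(frequencies, start=0)
second_peak = hill_climb_peak(frequencies, start=4)
even_split = shannon_index([5, 5])  # two equally frequent categories
```

Each step of the climb moves to a strictly larger value, so the search always terminates at a local peak, which in a sentiment timeline would mark a candidate event.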
An efficient feature extraction for biometric authentication
by P. Betty, D. Mohana Geetha, I. Jeena Jacob
Abstract: Biometric authentication has gained significance due to its high uniqueness and performance. Quick and convenient authentication is required because of its widespread demand. Feature extraction is the primary and most important task for effective authentication. The dissimilar chrominance texture pattern (DiCTP) technique is used in this paper for effective feature extraction. Patterns of two sequences are generated from the inter-channel information of the image, which extracts the coloured texture information of the input. Unique information is generated from the RGB and BRG planes of the image, which produces a part of the diversified chromatic feature vectors. The local binary pattern (LBP) code is generated and added to the feature vector, which helps to incorporate the greyscale information of the image. The experimental results are obtained using the CASIA Face Image Database Version 5 (DB1) and the Indian Face database (DB2), and show considerable improvements over the existing methodology.
Keywords: biometric authentication; dissimilar chrominance texture pattern; content-based image retrieval.
Link prediction in multilayer networks
by Deepak Malik, Anurag Singh
Abstract: Link prediction in large networks has gained popularity in recent years. Researchers have proposed various methods for finding missing links. These methods, which include common neighbour, Jaccard coefficient, etc., are based on the proximity of the nodes and on the topological features of the network. They have limitations, as they treat all common neighbours of a pair of nodes equally. A new method is proposed, common neighbour's common neighbour (CNCN), which captures the different behaviour of the common nodes of a pair of nodes. Its performance is better than that of the existing methods in a single layer network. Link prediction is also useful in multiplex networks, where it is more useful than in a single layer network, as several layers may give more information about a node than a single layer. Two methods are proposed using dynamic and static weights.
Keywords: common neighbours; complex network; link prediction.
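The two baseline proximity scores named in the abstract above, common neighbour and Jaccard coefficient, can be sketched as follows. The proposed CNCN measure is not implemented here because the abstract does not define it; the helper names and the small example graph are hypothetical.

```python
def build_graph(edges):
    """Build an undirected adjacency map {node: set(neighbours)} from edges."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def common_neighbours(graph, u, v):
    """Number of nodes adjacent to both u and v."""
    return len(graph[u] & graph[v])


def jaccard_coefficient(graph, u, v):
    """Shared neighbours divided by the size of the combined neighbourhood."""
    union = graph[u] | graph[v]
    if not union:
        return 0.0
    return len(graph[u] & graph[v]) / len(union)


# Hypothetical single-layer network.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
graph = build_graph(edges)
cn_score = common_neighbours(graph, "a", "d")         # shared: b and c
jaccard_score = jaccard_coefficient(graph, "a", "d")  # 2 shared of 3 in union
```

Both scores count every shared neighbour with the same weight, which is the limitation the abstract points out.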
ComRank: community-based ranking approach for heterogeneous information network analysis and mining
by Phu Pham, Phuc Do
Abstract: In this paper, we propose the ComRank model to address the problem of ranking objects of a specific type over the topic-driven communities generated in information networks. The topic-driven communities are generated by applying LDA latent topic modelling. Our proposed ComRank model directly generates ranking results for objects of a specific type in the different network communities. We apply our approach to construct a scholastic recommendation system, which supports researchers in finding appropriate citations or potential co-authors while doing scientific research. The ComRank model is tested with the real-world dataset of the DBLP bibliographic network. The experimental results demonstrate that our proposed model can generate meaningful ranking results within the detected topic-driven communities.
Keywords: information network; heterogeneous network; community detection; community-based ranking; meta-path-based ranking.
Building acoustic model for phoneme recognition using PSO-DBN
by B.R. Laxmi Sree, M.S. Vijaya
Abstract: Deep neural networks have shown their power in numerous classification problems, including speech recognition. This paper proposes to enhance the power of the deep belief network (DBN) further by pre-training the neural network using particle swarm optimisation (PSO). The objective of this work is to build an efficient acoustic model with deep belief networks for phoneme recognition with much better computational complexity. Using PSO for pre-training the network drastically reduces the training time of the DBN and also decreases the phoneme error rate (PER) of the acoustic model built to classify the phonemes. Three variations of PSO, namely the basic PSO, second generation PSO (SGPSO) and the new model PSO (NMPSO), are applied in pre-training the DBN to analyse their performance on phoneme classification. It is observed that the basic PSO performs comparatively better than the other PSO variants considered in this work, most of the time.
Keywords: phoneme recognition; deep neural networks; particle swarm optimisation; acoustic model; Tamil speech recognition; deep learning; deep belief networks.
High dimensional sentiment classification of product reviews using evolutionary computation
by Sonu Lal Gupta, Anurag Singh Baghel
Abstract: Feature selection is an important process in text classification. In general, traditional feature selection approaches are based on exhaustive search and hence become inefficient due to the large search space. Further, this task becomes more challenging as the number of features increases. Recently, evolutionary computation (EC)-based search techniques have received a lot of attention for solving the feature selection problem in high-dimensional feature spaces. This paper proposes a particle swarm optimisation (PSO)-based feature selection approach which is capable of generating the desired number of high-quality features from a large feature space. The proposed algorithm is tested on a large dataset and compared with several existing state-of-the-art algorithms used for feature selection. The accuracy of the underlying classifier is used as the measure of performance. Our results demonstrate that the proposed PSO-based feature selection approach outperforms the other traditional feature selection algorithms for all the considered classifiers.
Keywords: sentiment classification; feature selection; particle swarm optimisation; PSO; evolutionary computation; support vector machine; SVM; naïve Bayes; NB; mutual information; MI; chi-square; CHI.