International Journal of Business Intelligence and Data Mining (73 papers in press)
An Effective Preprocessing Algorithm for Model Building in Collaborative Filtering based Recommender System
by Srikanth T, M. Shashi
Abstract: Recommender systems suggest interesting items to online users based on the ratings they have expressed for other items, maintained globally as the rating matrix. The rating matrix is often sparse and very large because a large number of users rate only a few items among the large number of alternatives. Sparsity and scalability are the challenging issues in achieving accurate predictions in recommender systems. This paper focuses on a model-building approach to collaborative filtering-based recommender systems using low-rank matrix approximation algorithms to achieve scalability and accuracy while dealing with sparse rating matrices. A novel preprocessing methodology is proposed to counter the data sparsity problem by making the sparse rating matrix denser before extracting latent factors to appropriately characterise the users and items in a low-dimensional space. The quality of predictions made either directly or indirectly through user clustering was investigated and found to be competitive with existing collaborative filtering methods in terms of reduced MAE and increased NDCG values on benchmark datasets.
Keywords: Recommender System; Collaborative Filtering; Dimensionality Reduction; Pre-Processing; Sparsity; Scalability; Matrix Factorization.
Building Acoustic Model for Phoneme Recognition using PSO-DBN
by B.R. Laxmi Sree, M.S. Vijaya
Abstract: Deep neural networks have shown their power in numerous classification problems, including speech recognition. This paper proposes to enhance the power of the deep belief network (DBN) further by pre-training the neural network using particle swarm optimisation (PSO). The objective of this work is to build an efficient acoustic model with deep belief networks for phoneme recognition with much better computational complexity. Using PSO to pre-train the network drastically reduces the training time of the DBN and also decreases the phoneme error rate (PER) of the acoustic model built to classify the phonemes. Three variations of PSO, namely the basic PSO, second generation PSO (SGPSO) and the new model PSO (NMPSO), are applied in pre-training the DBN to analyse their performance on phoneme classification. It is observed that the basic PSO performs comparatively better than the other PSO variants considered in this work, most of the time.
Keywords: Phoneme Recognition; Deep Neural Networks; Particle Swarm Optimisation; Acoustic Model; Tamil Speech Recognition; Deep Learning; Deep Belief Networks.
Efficient search for top-k discords in streaming time series
by Giao Bui Cong, Duong Tuan Anh
Abstract: The problem of anomaly detection in streaming time series has received much attention recently. The problem addresses finding the most anomalous subsequence (discord) over a time-series stream, which might arrive at high speed. Finding the top-k discords is more useful than finding only the most unusual subsequence, since users can choose among the top-k discords instead of being given only one. Hence, an efficient search method for top-k discords in streaming time series is proposed in the paper. The method uses a lower bound threshold, a lower bounding technique on a common dimensionality reduction transform, and a state-of-the-art technique for computing the distance between two time-series subsequences to prune off unnecessary distance calculations. The three techniques are arranged in a cascading fashion to speed up the method. Furthermore, the proposed method can return a set of top-k discords on the fly. The experimental results show that the proposed method can acquire quality discords nearly identical to those obtained by HOT SAX, a well-known method of anomaly detection. Remarkably, our proposed method demonstrates a fast response in handling time-series streams at high speed.
Keywords: anomaly detection; discord; streaming time series.
Mining Big data streams using Business analytics tools: A bird
by Arunkumar PM, S. Kannimuthu
Abstract: Big data has evolved into a prominent field in the modern computing era. Big data analytics and its impact on extracting business intelligence are becoming indispensable for a plethora of applications. The non-proprietary software revolution paved the way for the evolution of tools like Weka, RapidMiner, Orange and R. Traditional data mining techniques hardly adapt to the requirements of rapid data analysis. Data stream processing algorithms that handle a multitude of data pose an even greater challenge in real time. Big data mining requires further improvement of traditional tools to address the challenges of massive data processing. This paper highlights the importance of data stream mining and explores two important open source frameworks, namely massive online analysis (MOA) and scalable advanced massive online analysis (SAMOA). The implications of both tools augur well for further deliberations in the big data research community. Business information system (BIS) models can reach unprecedented heights with the proliferation of these business analytics tools.
Keywords: Big Data; Data mining; Data streams; Massive online analysis; Business Intelligence.
A novel dynamic approach to identifying suspicious customers in money transactions
by Abdul Khalique Shaikh, Amril Nazir
Abstract: Money laundering has a negative impact on the development of the national economy. Anti-money laundering (AML) solutions within financial institutions help to control it. However, one of the fundamental challenges in AML solutions is to identify genuinely suspicious transactions. To identify these transactions, existing research uses pre-defined rules and statistical approaches. However, because the rules are fixed and predetermined, it is highly probable that a normal customer will be identified as suspicious. To overcome these limitations, a novel dynamic approach to identifying suspicious customers in money transactions is proposed, based on dynamic analysis of customer profile features to identify suspicious transactions. The experiment was executed with real bank customers and their transaction data, and the results are promising in terms of accuracy.
Keywords: AML; anti-money laundering; suspicious transactions; money transaction; dynamic AML analysis; data analysis.
Anomaly detection for elderly home care
by Kurnianingsih Kurnianingsih, Lukito Edi Nugroho, Widyawan Widyawan, Lutfan Lazuardi, Anton Satria Prabuwono, Mahardhika Pratama
Abstract: In this paper, we propose a model for detecting anomalies in elderly home care. Two scenarios are investigated in detecting anomalies: 1) the elderly person's vital signs and their surrounding environment; 2) the mobility patterns of the elderly. We evaluated our proposed model by employing the isolation forest, which detects anomalies using an isolation approach on a random forest of decision trees. We compare isolation forest on unlabelled data with statistical methods on labelled data. Subsequently, to show the reliability of the isolation concept, we compare it with a distance measure concept. The experiment shows that isolation forest has higher detection accuracy and lower prediction error for two attributes in the first scenario, skin temperature and heart rate, whereas, in the second scenario, multi-covariance determinant has slightly better accuracy than isolation forest (3.9% difference in accuracy) and fewer prediction errors.
Keywords: anomaly detection; isolation forest; elderly home care.
Multi-Document Based Text Summarization Through Deep Learning Algorithm
by G. PadmaPriya, K. Duraiswamy
Abstract: The proposed approach applies a deep learning algorithm to retrieve an effective text summary for a set of documents. The proposed system consists of two phases: a training phase and a testing phase. The training phase exploits three different algorithms to make the text summarisation process effective. Like any training phase, the proposed training phase uses known data and attributes. After that, the testing phase is implemented to test the efficiency of the proposed approach. For experimentation, we used four document sets selected from DUC (2002). The experimental evaluation yielded an average precision of 78%, an average recall of 1 and an average F-measure of 84%.
Keywords: Particle Swarm Optimisation; Text Summarization; Deep Learning Algorithm.
Online Products Recommendation System using Genetic Kernel Fuzzy C-Means and Probabilistic Neural Network
by Manohar E, D. Shalini Punithavathani
Abstract: Purchasers' reviews play a significant role in online shopping decisions, as customers want to know the opinions of other purchasers about online products. However, selecting the most appropriate product from the best website is a challenging problem for online users. Accordingly, this paper proposes a hybrid recommendation system for identifying customer preferences and recommending the most appropriate product. To do this, first the dataset is collected and prepared in the pre-processing step. Genetic kernel fuzzy C-means (GAKFCM) is used for usage cluster formation after the pre-processing step. Different features are extracted from each cluster based on user interest level. The user interest levels are used as features for the classifier to extract user knowledge. Based upon the user interest level, the product recommendation is made using a probabilistic neural network (PNN). The simulation results show a high precision rate, which indicates that the proposed method is useful.
Keywords: website; web-log; ranking; rating; review; products; Genetic Kernel Fuzzy C-Means; probabilistic neural network.
Hybridising Neural Network and Pattern Matching under Dynamic Time Warping for Time Series Prediction
by Thanh Son Nguyen
Abstract: Pattern matching-based forecasting models are attractive due to their simplicity and their ability to predict complex nonlinear behaviours. The Euclidean measure is the most commonly used metric for pattern matching in time series. However, its weakness is that it is sensitive to distortion in the time axis, which can influence forecasting results. The dynamic time warping (DTW) measure is introduced as a solution to this weakness of the Euclidean distance metric. In addition, artificial neural networks (ANNs) have been widely used in time series forecasting to capture complex relationships with a variety of patterns. In this work, we propose an improved hybrid method which is an affine combination of a neural network model and a DTW-based pattern matching model for time series prediction. This method can take full advantage of the individual strengths of the two models to create a more effective approach for time series prediction. Experimental results show that our proposed method outperforms the neural network model and the DTW-based pattern matching method used separately in time series prediction.
Keywords: time series; pattern matching; artificial neural network; time series prediction; dynamic time warping; k-nearest neighbour.
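As an illustration of the two ingredients named in this abstract, the following is a minimal sketch of the textbook DTW distance and of an affine combination of two forecasts. The function names, the absolute-difference local cost and the weight `alpha` are assumptions made for this example; the abstract does not specify how the authors compute either component.

```python
def dtw_distance(a, b):
    """Textbook dynamic time warping distance between two numeric sequences.

    Uses absolute difference as the local cost (an assumption of this sketch)
    and no warping-window constraint.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a[i-1] repeats against b
                                 cost[i][j - 1],      # b[j-1] repeats against a
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def affine_combination(nn_forecast, dtw_forecast, alpha):
    """Combine two forecasts with weights alpha and (1 - alpha), which sum to 1."""
    return alpha * nn_forecast + (1 - alpha) * dtw_forecast
```

For example, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is zero because the repeated value can be absorbed by warping, whereas a point-by-point Euclidean comparison cannot even be applied to sequences of different length.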
REFERS: Refined & Effective Fuzzy E-commerce Recommendation System
by Sankar Pariserum Perumal, Ganapathy Sannasi, Kannan Arputharaj
Abstract: Online shopping culture is gaining traction globally and some of the biggest beneficiaries of this e-commerce shift are Amazon, eBay, etc. Recommendation systems guide online users in a personalised manner to choose what they want according to their interest in each product in the catalogue list. In such a scenario, the existing systems need complete information for making recommendations, which is not always possible in real applications. Therefore, a novel refined and effective fuzzy e-commerce recommendation system is proposed in this paper that combines differences in importance among the rating factors of a single user with a new similarity measure approach, aiming at an improved recommendation list for the e-commerce user. The proposed methodology has been implemented using a new similarity measure on experimental datasets, and the refined scores for unlocked mobile phones on such e-commerce websites are compared in this work against classic similarity measures.
Keywords: Fuzzy recommendation system; degree of similarity measure; rating factor importance; collective expert rating.
Decision tree classifier for university single rate tuition fee system
by Taufik F. Abidin, Samsul Rizal
Abstract: The regulation on the single rate tuition fee for undergraduate study at state universities in Indonesia was enacted in 2013. The tuition fee is calculated based on the needs of each academic program and the regional cost index. The fee is grouped into several categories and set differently for each university. For Syiah Kuala University, located in Banda Aceh, Indonesia, the tuition fee is grouped into five different categories. This paper describes the construction of a J48 decision tree classifier to determine the category and evaluates its performance during the training and testing phases against ID3 and Naive Bayes classifiers. The results show that the J48 decision tree classifier outperforms the other two classifiers in both phases. In the training phase, the F-measure and ROC for the J48 decision tree classifier are 0.889 and 0.973, respectively, and in the testing phase, the F-measure and ROC are 0.911 and 0.987, respectively.
Keywords: Decision tree classifier; multi-class classification; university single rate tuition fee system.
Using Diverse Set of Features to Design a Content-Based Video Retrieval System Optimized by Gravitational Search Algorithm
by S. Padmakala, Ganapathy Sankar Anandha Mala, K.M. Anandkumar
Abstract: This paper describes a content based video retrieval (CBVR) approach using four varieties of features and 12 distance measurements, which is optimized by the gravitational search algorithm (GSA). Initially, the CBVR technique extracts five kinds of features, such as color, texture, shape, image and audio features, that belong to each frame. It then applies particular distance measurements for each kind of feature to compute the similarity between the query frame and the remaining frames in the database. In this paper, we have used GSA to find the nearly optimal combination of the features and their respective similarity measurements. Finally, the videos matching the query are retrieved from the video database. For experimentation, we used two databases: a sports video dataset and the UCF sports action dataset. The experimental results demonstrate that the proposed CBVR method shows better performance than other existing methods.
Keywords: video retrieval; distance measurements; color; texture; shape; audio; CBVR; similarity; combinations.
Weighted Neuro-Fuzzy Hybrid Algorithm for Channel Equalization in Time Varying Channel
by Zeeshan A Abbasi, Zainul Abdin Jaffery
Abstract: In MIMO-OFDM communication systems, accurate channel estimation and equalisation play a major role. In this paper, we use a weighted neuro-fuzzy hybrid (WNFH) channel estimation algorithm for channel equalisation. The pilot is designed based on a combination of a neural network and a fuzzy logic system. Scaled conjugate gradient (SCG) is combined with the group search optimiser (GSO) algorithm, and the neural network is trained using this hybrid training algorithm. In the transmitter section, the proposed system contains quadrature amplitude modulation (QAM) and a transmitter. The minimum mean-square error (MMSE) estimation design takes the channel prediction error into account to improve the performance of symbol detection. The calculated pilot sequences reduce the MMSE of channel estimation and show great superiority in the MIMO-OFDM system. Experimental results support the proposed channel estimation.
Keywords: MIMO-OFDM; Group Search Optimizer; Scaled Conjugate Gradient; Channel Estimation.
Discrete Weibull regression for modeling football outcomes
by Alessandro Barbiero
Abstract: We propose the use of the discrete Weibull distribution for modeling football match results, as an alternative to existing Poisson and generalized Poisson models. The numbers of goals scored by the two teams playing a football match are regarded as a pairwise observation and are modelled first through two independent discrete Weibull variables, and then through two dependent discrete Weibull variables, using a copula approach that accommodates non-null correlation. The parameters of the bivariate discrete Weibull distributions are assumed to depend on covariates such as the attack and defense abilities of the two teams and the 'home effect'. Several discrete Weibull regression models are proposed and then applied to the 2015-2016 Italian Serie A. Although the interpretation of parameters is less immediate than in the case of bivariate Poisson models, these models represent a suitable alternative, which can also be applied in fields other than sport data analysis.
Keywords: count data; count regression model; Frank copula; Poisson distribution; sport analytics.
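For readers unfamiliar with the distribution, the sketch below evaluates the probability mass function of the type I discrete Weibull distribution, P(X = x) = q^(x^beta) - q^((x+1)^beta) for x = 0, 1, 2, ... with 0 < q < 1 and beta > 0. That the paper uses exactly this parameterisation is an assumption; the function name and the parameter values in the usage note are illustrative only, and the covariate-dependent regression and the copula are not shown.

```python
def discrete_weibull_pmf(x, q, beta):
    """P(X = x) for a type I discrete Weibull variable (assumed parameterisation).

    x    : non-negative integer count (e.g. goals scored)
    q    : parameter in (0, 1)
    beta : shape parameter > 0; beta = 1 gives the geometric distribution
    """
    if not 0.0 < q < 1.0 or beta <= 0.0:
        raise ValueError("require 0 < q < 1 and beta > 0")
    if x < 0:
        return 0.0
    # Survival function is S(x) = q ** (x ** beta); the pmf is S(x) - S(x + 1).
    return q ** (x ** beta) - q ** ((x + 1) ** beta)
```

With `q = 0.5` and `beta = 1.0` the function returns 0.5 for zero goals and 0.25 for one goal, matching a geometric distribution; other values of `beta` make the tail lighter or heavier than geometric.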
Implementation of Multi Node Hadoop Virtual Cluster on Open Stack Cloud Environments
by Karthikeyan Saminathan, R. Manimegalai
Abstract: Nowadays computing plays a vital role in information technology and all other fields. Cloud computing is one of the biggest milestones among leading next-generation technologies and is booming in the IT field and business sectors. In our day-to-day life, enormous amounts of data are generated, on the scale of terabytes (TB), petabytes (PB) and zettabytes (ZB). Hadoop Map Reduce is the popular distributed computing paradigm for processing data-intensive jobs in the cloud. Completion time goals or deadlines of Map Reduce jobs set by users are becoming crucial in existing cloud-based data processing environments like Hadoop. This paper proposes a real-time implementation of a single node Hadoop cluster on an Open Stack private cloud, handles huge data sets in parallel virtual machines and compares the average execution time for inputs of different sizes.
Keywords: Cloud; Data intensive; Hadoop; Map Reduce; Open Stack; Cluster.
ScrAnViz: A Tool for Analytics and Visualization of Unstructured Data
by Sriraghav Kameswaran, V.S. Felix Enigo
Abstract: Existing big data visualization tools are meant for visualizing structured data. However, surveys show that about 80-90% of potentially usable business information is in unstructured format. Analyzing unstructured data is challenging due to its lack of structure and relational form. In this paper, we propose a tool called ScrAnViz that can structure data, perform analysis and provide visualization, thereby helping business people and end users in decision making. An attribute based opinion mining algorithm has been developed and implemented. Performance analysis shows that the algorithm reduces the search time by a factor of three compared with traditional document level sentiment analysis systems.
Keywords: Unstructured data; Data Analytics; Sentiment Analysis; Opinion Mining; Data visualization.
Link prediction in multilayer networks
by Deepak Malik, Anurag Singh
Abstract: Link prediction in large networks has gained popularity in recent years. Researchers have proposed various methods for finding the missing links. These methods include common neighbour, Jaccard coefficient, etc., based on the proximity of the nodes. These methods have limitations as they treat all common neighbours of a pair of nodes equally. A new method is proposed, common neighbours common neighbour (CNCN). Its performance is better than the existing methods in a single layer network. These methods are based on the topological features of the network. The proposed method captures the different behaviour of the common nodes of a pair of nodes. Link prediction is also useful in multiplex networks. Link predictions in multiplex networks are more useful than in a single layer network, as several layers may give more information about a node than a single layer. Two methods are proposed using dynamic and static weights.
Keywords: common neighbours; complex network; link prediction.
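The two baseline scores named in this abstract can be sketched as follows. This illustrates only the common neighbour count and the Jaccard coefficient, not the proposed CNCN method, whose definition is not given here. The adjacency-set representation and the small example graph are assumptions of this sketch.

```python
def common_neighbours(adj, u, v):
    """Number of nodes adjacent to both u and v."""
    return len(adj[u] & adj[v])


def jaccard_coefficient(adj, u, v):
    """Shared neighbours divided by the size of the combined neighbourhood."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


# Hypothetical undirected graph stored as node -> set of neighbours.
example_graph = {
    "a": {"c", "d"},
    "b": {"c", "d", "e"},
    "c": {"a", "b"},
    "d": {"a", "b"},
    "e": {"b"},
}
```

Here nodes `a` and `b` are not linked but share the neighbours `c` and `d`, so both scores rank the pair as a likely missing link. Each shared neighbour contributes equally, which is the limitation the abstract points out.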
Fuzzy Based Review Rating Prediction in E-Commerce
by P. Velvizhy, A. Pravi, M. Selvi, S. Ganapathy, A. Kannan
Abstract: Opinion mining is an ongoing research area in e-commerce which aims at analyzing people's opinions, sentiments and emotions. Existing e-commerce systems allow users to share their feedback on products and services in the form of textual reviews. They also allow consumers to give ratings for products, which help in future product recommendation. In this research work, a computational framework for efficiently predicting consumer review ratings of products is proposed. The proposed framework integrates Dimensionality Reduction, Genetic Algorithm, Fuzzy C-Means and Adaptive Neuro-Fuzzy Inference techniques to overcome the limitations of the existing systems. Experiments have been conducted using an Amazon dataset consisting of reviews for different products. This system provides better performance and prediction accuracy for review ratings when compared with related work.
Keywords: sentiment analysis; review ratings prediction; dimensionality reduction; genetic algorithm; data mining; fuzzy c means.
A Technique for Semantic Annotation and Retrieval of E-Learning Objects
by Balavivekanandhan A
Abstract: The primary objective of my research is to design and develop a semantic annotation and retrieval model for e-learning documents. In the training phase, documents from different domains are taken and the informative words from each document are obtained based on balanced mutual information and the frequency of contents in each document. We then use the informative words to identify the superordinates and the objects. The superordinates, the informative words and the objects from each document give the relations and properties of each document, which are then used to cluster the documents. In the testing phase, we give a query or a document as input to the system to retrieve the relevant documents. If a document is given as input, the relations and properties of that document are first identified and then used to retrieve the relevant documents.
Keywords: e-learning; document clustering; balanced mutual information; one way matching; cluster based matching.
A Collaborative Content-Based Movie Recommender System
by Bolanle Ojokoh, Oluwatosin Olatunbosun Aboluje, Tobore Igbe
Abstract: In this paper, Pearson's correlation coefficient is employed for collaborative filtering due to its ability to handle numerical data as well as determine linear relationships among existing users. Its steps involve a user-user representation, similarity generation and prediction generation, with the goal of producing a predicted opinion of the active user about a specific item. The concept of parental control is also incorporated as an enhancement. Evaluation of the system was done using precision, recall, F-measure, discounted cumulative gain (DCG), idealised discounted cumulative gain (IDCG), normalised discounted cumulative gain (nDCG) and mean absolute error (MAE). Three hundred forty-six datasets were used, out of which 126 were gathered from local video shops and 220 were extracted from the internet movie database (IMDb). These were used for the experiments, and the results generated through mining of data obtained from profiles and ratings of system users show that the average ranking quality of the collaborative filtering algorithm is 95.9%.
Keywords: Movies; Recommendation; Collaborative Filtering; Information Filtering; Correlation Coefficient; Evaluation.
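The similarity generation and prediction generation steps mentioned in this abstract can be sketched as below. This is a minimal illustration that assumes ratings are stored as nested dictionaries and uses the widely known mean-centred weighted average for prediction; the paper's exact prediction formula, neighbour selection and parental control logic are not stated in the abstract and are not reproduced here.

```python
from math import sqrt


def pearson(ratings_a, ratings_b):
    """Pearson correlation between two users over the items both have rated."""
    common = [item for item in ratings_a if item in ratings_b]
    if len(common) < 2:
        return 0.0
    mean_a = sum(ratings_a[i] for i in common) / len(common)
    mean_b = sum(ratings_b[i] for i in common) / len(common)
    num = sum((ratings_a[i] - mean_a) * (ratings_b[i] - mean_b) for i in common)
    den = (sqrt(sum((ratings_a[i] - mean_a) ** 2 for i in common))
           * sqrt(sum((ratings_b[i] - mean_b) ** 2 for i in common)))
    return num / den if den else 0.0


def predict(ratings, active, item):
    """Predict the active user's rating of item from correlated users' ratings."""
    active_ratings = ratings[active]
    mean_active = sum(active_ratings.values()) / len(active_ratings)
    num = den = 0.0
    for user, user_ratings in ratings.items():
        if user == active or item not in user_ratings:
            continue
        weight = pearson(active_ratings, user_ratings)
        mean_user = sum(user_ratings.values()) / len(user_ratings)
        # Each neighbour contributes its deviation from its own mean rating.
        num += weight * (user_ratings[item] - mean_user)
        den += abs(weight)
    return mean_active + num / den if den else mean_active
```

For instance, if one user rates three films 1, 2 and 3 and a second user rates the same films 2, 4 and 6, their correlation is 1, so the second user's rating of a fourth film shifts the first user's prediction by the full deviation from the second user's mean.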
Location based Personalized Recommendation systems for the Tourists in India
by Madhusree Kuanr, Sachi Nandan Mohanty
Abstract: This study examines collaborative filtering in recommender systems by categorising users according to their choices of place, food, local item purchase, etc. The proposed system will store the opinions of local users about the sites, and about the foods and products for purchase available at those sites. It uses a collaborative filtering technique to find users similar to a given querying user. The system recommends the best sites, along with good foods and products available at those sites, according to recent data. Two hundred (male = 110, female = 90) married individuals from Bhubaneswar, Odisha (India) participated in this survey. Cosine similarity is used in the proposed system to find users similar to a given input query user. The results revealed that collaborative filtering is the more reliable technique for personalised recommender systems. Experimental results show the performance of the proposed system in terms of precision, recall and F-measure values.
Keywords: collaborative filtering; recommender systems; user profile
Stability analysis of feature ranking techniques in the presence of noise: a comparative study
by Iman Ramezani, Mojtaba Khorram Niaki, Milad Dehghani, Mostafa Rezapour
Abstract: Noisy data is one of the common problems associated with real-world data, and may affect the performance of data models, consequent decisions and the performance of feature ranking techniques. In this paper, we show how stability changes when different feature ranking methods are exposed to attribute noise and class noise. We use Kendall's Tau rank correlation and Spearman rank correlation to evaluate the stability of various feature ranking methods, and quantify the degree of agreement between the ordered list of features created by a filter on a clean dataset and its outputs on the same dataset corrupted with different combinations of noise levels. According to the results of the Kendall and Spearman measures, Gini index (GI) and information gain (IG) have the best performances, respectively. Nevertheless, the results of both the Kendall and Spearman measures show that ReliefF (RF) has the most sensitive (the worst) performance.
Keywords: attribute noise; class noise; filter-based feature ranking; threshold-based feature ranking; stability; Kendall's Tau rank correlation; Spearman rank correlation.
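The agreement between a ranking obtained on clean data and one obtained on noisy data can be quantified with Kendall's Tau, sketched below for two rankings of the same features without ties. The function name and the example feature labels are assumptions of this sketch; the paper's experimental protocol is not reproduced.

```python
from itertools import combinations


def kendall_tau(ranking_clean, ranking_noisy):
    """Kendall's Tau between two orderings of the same items (no ties).

    Returns 1.0 for identical orderings and -1.0 for fully reversed ones.
    """
    if len(ranking_clean) < 2 or set(ranking_clean) != set(ranking_noisy):
        raise ValueError("rankings must contain the same items, at least two")
    position = {feature: i for i, feature in enumerate(ranking_noisy)}
    concordant = discordant = 0
    # combinations() yields pairs (x, y) with x ranked above y in ranking_clean.
    for x, y in combinations(ranking_clean, 2):
        if position[x] < position[y]:
            concordant += 1
        else:
            discordant += 1
    n = len(ranking_clean)
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Swapping only the last two of three features, as in `kendall_tau(["f1", "f2", "f3"], ["f1", "f3", "f2"])`, leaves two of the three pairs in agreement and gives a value of one third. A ranking method whose value stays close to 1 as noise increases would count as stable in this sense.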
Topic-driven top-k similarity search by applying constrained meta-path based in content-based schema-enriched heterogeneous information network
by Phu Pham, Phuc Do
Abstract: In this paper, we propose the TopCPathSim model to address the problem of topic-driven similarity searching based on constrained meta-paths (also called restricted meta-paths) between same-typed objects within content-based heterogeneous information networks (HINs). The topic distributions over content-based objects, such as papers/articles in bibliographic networks or user comments/reviews in social networks, are obtained by using the LDA topic model. We conduct experiments on the real DBLP, Aminer and ACM datasets, which demonstrate the effectiveness of our proposed model. Throughout the experiments, our proposed model gains about 73.56% in accuracy. The output results also show that combining a probabilistic topic model with constrained meta-paths is promising for improving the output quality of topic-oriented similarity searching in content-based HINs.
Keywords: constrained meta-path; content-based heterogeneous information network; topic-driven similarity search; LDA; topic modelling.
Deep learning framework for early detection of intrusion in Virtual Environment
by Madhu Priya G, S. Mercy Shalinie, P. Mohana Priya
Abstract: Today's business enterprises adopt cloud based services as their architectural design. Intelligence techniques incorporated into the architecture give massive tangible and intangible benefits in terms of performance and reliability. Such cloud based business architecture faces many threats to its availability. The DDoS attack is the most prominent threat, as its impact is greater in virtual resource based cloud infrastructure. Therefore, there is a need for a Business Intelligence based framework to detect the attack early by monitoring the virtual network traffic. The proposed intelligence framework uses a deep learning framework, Continuous Discriminative-Deep Belief Network (CD-DBN). CD-DBN dynamically captures attack patterns from the network data, analyzes the data and detects intrusion into the cloud. The observed results show that the earlier detection approach guarantees the availability of cloud services to legitimate users and enhances cloud resource usage.
Keywords: Deep Learning; Restricted Boltzmann Machine; Deep Belief Network; Cloud Environment; Virtualization; Hypervisor; Intrusion Detection; Availability threat; DDoS attack; SysBench benchmark suite.
Analysing Thyroid Disease using Density Based Clustering Technique
by Tanupriya Choudhury, Veenita Kunwar, A. Sai Sabitha, Abhay Bansal, Tanupriya Choudhury
Abstract: Data mining in medicine has been used to predict unknown patterns in health data and to obtain diagnostic results. The healthcare industry generates large amounts of complex data about patients, diseases and treatments. Data mining in healthcare provides benefits like detecting fraud, making medical facilities available to patients at low cost, ensuring high quality patient care and making healthcare policies. Disease detection has become essential due to the increasing number of health issues occurring day by day. Thyroid disease has become one such concern, with numerous cases being detected yearly. It causes improper functioning of the thyroid gland. In this paper, a clustering technique has been used to detect and understand factors influencing thyroid disease. The DBSCAN algorithm has been used as it can handle clusters of varying shapes and sizes and is noise resistant. PCA has also been applied to find patterns in high-dimensional data and to reduce dimensionality. The experimental setup has been implemented in RapidMiner.
Keywords: Data mining; Clustering; Thyroid disease; DBSCAN; Principal component analysis.
A Simple Transform Domain Based Low Level Primitives Preserving Texture Synthesis
by S. Anuvelavan, M. Ganesh, P. Ganesan
Abstract: In this work, a new patch-based texture synthesis scheme with orthogonal polynomials model coefficients is presented. The proposed scheme has four phases. In the first phase, a block matching technique that identifies the best match to synthesise in the larger output image is designed in terms of ordered orthogonal polynomials model coefficients. In case of a successful block match, called a patch-hit, the proposed scheme finds candidate blocks with triangular search in the next phase. In the patch selection phase, the proposed scheme considers a subset of orthogonal polynomials model coefficients among the blocks for the purpose of synthesis, which consumes less memory and time. This synthesised output is smoothed in the final phase by preserving the low level contents between the synthesised patches. The performance of the proposed scheme is measured with energy, contrast, correlation, homogeneity and entropy between the original and synthesised images and is also compared with existing texture synthesis schemes. The results are encouraging.
Keywords: Texture Synthesis; Orthogonal Polynomials; Patch-Hit; Candidate Block; Patch Selection.
Optimal Region growing and Multi-kernel SVM for fault detection in Electrical Equipments using Infrared Thermography Images
by C. Shanmugam, E. Chandira Sekaran
Abstract: Infrared thermography (IRT) has played an essential part in
observing and examining thermal defects of electrical equipment without
interruption, which is of vital importance for the reliability of electrical
equipment. This paper analyses whether electrical parts are faulty or
non-faulty with the help of a segmentation and classification model. Features
are calculated from the input thermal images, regions of interest (ROI) are
segmented using the optimal region growing (ORG) technique, and faults are
classified using a multi kernel support vector machine (MKSVM). In the tests,
the classification performance for different input features is assessed. To
enhance the segmentation performance, the whale optimisation (WO) procedure
is used. Before classification, the extracted electrical components are fused
into a fused vector for all images using a feature level fusion (FLF)
procedure. The multi kernel classification performance indices, including
sensitivity, specificity and accuracy, are used to identify the most
appropriate input feature and the best arrangement of classifiers. The
performance of the SVM is compared with that of a neural network. The
comparison results demonstrate that our technique achieves superior
performance, with an accuracy of 98.21%.
Keywords: Feature extraction; Whale optimisation; Support vector machine; optimisation; Classification and fault detection; Infrared thermography.
ComRank: community-based ranking approach for heterogeneous information network analysis and mining
by Phu Pham, Phuc Do
Abstract: In this paper, we propose the ComRank model to address the
problem of ranking a specific type of object over the generated topic-driven
communities in information networks. The topic-driven communities are
generated by applying the latent topic modelling of LDA. Our proposed
ComRank model directly generates ranking results for a specific type of object
in the different network communities. We apply our approach to construct a
scholastic recommendation system, which supports researchers in finding
appropriate citations or potential authors to cooperate with while doing
scientific research. The ComRank model is tested with the real-world dataset of the DBLP
bibliographic network. The experimental results demonstrated that our
proposed model can generate the meaningful ranking results within detected
Keywords: information network; heterogeneous network; bibliographic network; community detection; community-based ranking; path-based ranking.
AGS: A Precise and Efficient AI Based Hybrid Software Effort Estimation Model
by Vignaraj Vikraman, S. Srinivasan
Abstract: To predict the amount of effort to develop software is a tedious
process for software companies. Hence, predicting the software development
effort remains a complex issue drawing in extensive research consideration.
The success of software development process considerably depends on proper
estimation of effort required to develop that software. Effective software effort
estimation techniques enable project managers to schedule software life cycle
activities properly. The main objective of this paper is to propose a novel
approach in which an artificial intelligence (AI)-based technique, called AGS
algorithm, is used to determine the software effort estimation. AGS is a hybrid
method combining three techniques, namely: adaptive neuro fuzzy inference
system (ANFIS), genetic algorithm and satin bower bird optimisation (SBO)
algorithm. The performance of the proposed method is assessed using a well
standard dataset with real-time benchmark with many attributes. The major
metrics used in the performance evaluation are correlation coefficient (CC),
kilo lines of code (KLoC) and complexity of the software. The experimental
result shows that the prediction accuracy of the proposed model is better than
the existing algorithmic models.
Keywords: Software Effort Estimation; AI; ANFIS; Lines of code (LoC); Genetic Algorithm (GA); Satin Bower Bird Optimiser (SBO); Correlation Co-efficient (CC); Kilo Lines of Code (KLoC); Software Complexity.
High dimensional sentiment classification of product reviews using evolutionary computation
by Sonu Lal Gupta, Anurag Singh Baghel
Abstract: Feature selection is an important process in text classification. In
general, traditional feature selection approaches are based on exhaustive search
hence become inefficient due to a large search space. Further, this task becomes
more challenging as the number of features increases. Recently, evolutionary
computation (EC)-based search techniques have received a lot of attention in
solving feature selection problem in high-dimensional feature space. This paper
proposes a particle swarm optimisation (PSO)-based feature selection approach
which is capable of generating the desired number of high-quality features from
a large feature space. The proposed algorithm is tested on a large dataset and
compared with several existing state-of-the-art algorithms used for feature
selection. The accuracy of the underlying classifier has been considered as a
measure of performance. Our obtained results demonstrated that the proposed
PSO-based feature selection approach outperforms the other traditional feature
selection algorithms in all the considered classifiers.
Keywords: sentiment classification; feature selection; particle swarm
optimisation; PSO; evolutionary computation; support vector machine; SVM;
naïve Bayes; NB; mutual information; MI; chi-square; CHI.
Using bagging to enhance clustering procedures for planar shapes
by Elaine Cristina De Assis, Renata Souza, Getulio José Amorim Do Amaral
Abstract: Partitional clustering algorithms find a partition maximizing or minimizing some numerical criterion. Statistical shape analysis is used to make decisions by observing the shape of objects. The shape of an object is the information that remains when the effects of location, scale and rotation are removed. This paper introduces clustering algorithms suitable for planar shapes. Four numerical criteria are adapted to each algorithm. In order to escape from local optima and reach a better clustering, these algorithms are performed in the framework of Bagging procedures. Simulation studies are carried out to validate the proposed methods, and two real-life data sets are also considered. The experiment quality is assessed by the corrected Rand index. The results show the effectiveness of the proposed algorithms under different clustering criteria, and combining the Bagging method with the clustering algorithms provided substantial gains in the quality of the clusters.
Keywords: Statistical Shape Analysis; Partitional Clustering Methods; Bagging Procedure.
Impact of Clustering on quality of Recommendation in Cluster based Collaborative Filtering: an Empirical Study
by MONIKA SINGH, Monica Mehrotra
Abstract: In-memory nearest neighbour computation is a typical approach for
collaborative filtering (CF) due to its high recommendation accuracy. However,
this approach fails on scalability: its performance declines with the rapid
increase in the number of users and items in archetypal merchandising
applications. One of the popular techniques to attenuate the
scalability issue is cluster-based collaborative filtering (CBCF), which uses
a clustering approach to group the most similar users/items from the complete dataset. In
this work we present a detailed analysis of the impact of clustering in CF
approach. Specifically, we study how the extent of clustering impacts
collaborative filtering systems in terms of quality of predictions, quality of
recommendations, throughput and coverage. Based on the empirical results
obtained from two datasets, Movielens100K and Jester, we conclude that with
increasing number of clusters the quality of predictions, the quality of
recommendations and the throughput are enhanced but the coverage provided
by clustered subsystems declines.
Keywords: Recommender Systems; Collaborative Filtering; Clustering; Prediction; Nearest neighbors; Clustering based collaborative filtering; Average recommendation time; Coverage; Quality of predictions and Qua.
EFFICIENT TEXT DOCUMENT CLUSTERING WITH NEW SIMILARITY MEASURES
by Lakshmi R, S. Baskar
Abstract: In this paper, two new similarity measures, namely distance of term
frequency-based similarity measure (DTFSM) and presence of common
terms-based similarity measure (PCTSM), are proposed to compute the
similarity between two documents for improving the effectiveness of text
document clustering. The effectiveness of the proposed similarity measures is
evaluated on the Reuters-21578 and WebKB datasets for clustering the documents
using K-means and K-means++ clustering algorithms. The results obtained by
using the proposed DTFSM and PCTSM are significantly better than other
measures for document clustering in terms of accuracy, entropy, recall and
F-measure. It is evident that the proposed similarity measures not only improve
the effectiveness of the text document clustering, but also reduce the
complexity of similarity measures based on the number of required operations
during text document clustering.
Keywords: Document Clustering; Similarity Measures; Accuracy; Entropy; Recall; F-Measure; K-means clustering Algorithm.
XML web quality analysis by employing MFCM clustering Technique and KNN classification
by M. Gopianand, P. Jaganathan
Abstract: The great accomplishment of web search engines is keyword search, which is the most popular search paradigm for regular consumers. It permits the consumer to create queries without knowledge of the query language or the database schema, so it is also considered a user-friendly method. The quality of the XML web has to be accurate if exact queries are to be answered. Here we propose a method to assess the quality of the XML web by analysing the keywords present in the XML web based on the respective keyword search. In our proposed method, we collect a number of XML documents and cluster them based on the keyword, depending on the type of XML files. Modified fuzzy C means (MFCM) is used for clustering. Once the clustering based on the respective keyword is done, we classify the XML web based on the quality of the data using a KNN classifier.
Keywords: XML web; K nearest neighbor; Error value; Classification accuracy; feature vectors.
Analysis and Prediction of Heart Disease Aid of Various Data Mining Techniques: A Survey
by V. Poornima, D. Gladis
Abstract: In recent times, health diseases are expanding gradually because of inherited. Particularly, heart disease has turned out to be the more typical nowadays, i.e., life of individuals is at hazard. The data mining strategies specifically decision tree, Na
Keywords: Data mining; Heart Disease Prediction; performance measure; Fuzzy; and clustering.
Signal-Flow Graph Analysis and Implementation of Novel Power Tracking Algorithm Using Fuzzy Logic Controller
by S. VENKATESAN, Manimaran Saravanan, Subramanian Venkatnarayanan, Senior Member IEEE
Abstract: This paper discusses the merits of a novel modified perturb and observe (P&O) maximum power point tracker (MPPT) algorithm for a stand-alone solar PV system using an interleaved LUO converter with a fuzzy logic controller (FLC). The merits of the FLC-based system are compared with those of the existing system. Analytical expressions of the proposed converter are derived through a signal flow graph. The proposed interleaved LUO converter based PV system with fuzzy controller considerably reduces the ripple content, and the proposed MPPT algorithm creates less hunting around the maximum power point. Simulations at different illumination levels are carried out using MATLAB/Simulink. The system is also experimentally verified with a typical 40 W solar PV panel. The results confirm the superiority of the proposed system with fuzzy controller.
Keywords: Fuzzy Logic Controller; Interleaved LUO Converter; Maximum Power Point Tracking (MPPT); Modified P&O algorithm; Photovoltaic(PV) system.
SoLoMo Cities: Socio-Spatial City Formation Detection and Evolution Tracking Approach
by Sara Elhishi, Mervat Abu-Elkheir, Ahmed Aboul-Fotouh
Abstract: The tremendous growth of telecommunication devices coupled with
the huge number of social media users has revealed a new kind of development
that is turning our cities into information-rich smart platforms. We analyse the
role of LBSN check-ins using social community detection methods to extract
city-structured communities, which we call SoLoMo cities, using a modified
version of the Louvain algorithm. We then track these communities' evolution
patterns through a pairwise consecutive matching process to detect behavioural
events that change a city's communities. The findings of the experiments on the
Brightkite dataset can be summarised as follows: online users' check-in
activities reveal a set of well-formed physical land spaces of a city's
communities, and the concentration of online social interactions and the formation
of those cities are positively correlated with a percentage of 89%. Finally, we
were able to track the evolution of the discovered communities through
detecting three community behaviour events: survive, grow and shrink.
Keywords: location-based social networks; LBSN; social; spatial analysis; community detection; evolution; tracking; Brightkite.
AN EFFICIENT FEATURE EXTRACTION FOR BIOMETRIC AUTHENTICATION
by Betty P, Mohanageetha D, Jeena Jacob
Abstract: Biometric authentication has received greater significance due to its high uniqueness and performance. The ability of quick and convenient authentication is required due to its widespread demand. Extraction of feature is the primary and important task for effective authentication. Dissimilar chrominance texture pattern (DiCTP) technique is used in this paper for effective feature extraction. Patterns of two sequences are generated from the inter channel information of the image which extracts the coloured texture information of the input. Unique information is generated from RGB and BRG planes of the image which produces a part of diversified chromatic feature vectors. The local binary pattern (LBP) code is generated and added along with the feature vector, which aids to inculcate the greyscale information of the image. The experimental results are formulated using the CASIA Face Image Database Version 5 (DB1) and Indian Face database (DB2) which give considerable improvements over the existing methodology.
Keywords: Biometric Authentication; Dissimilar Chrominance Texture Pattern; Content Based Image Retrieval.
Discovery of Rare Association Rules in the Distribution of Lawsuits in the Federal Justice System of Southern Brazil
by Lucia Gruginskie, Guilherme Vaccaro, Leonardo Chiwiakwosky, Attilla Blesz Jr
Abstract: In the context of data mining, infrequent association rules may be beneficial for analysing rare or extreme cases with very low support values and high confidence. In researching risky situations or allocating specific resources, such rules may have a much greater impact than rules with high support value. The objective of this study is to obtain association rules from the database of lawsuits filed in the Federal Court of Southern Brazil in 2016, including both frequent and rare rules. By finding these rules, especially rare ones, the information collected can assist in the decision-making process, in this case, such as training clerks or establishing specialised courts.
Keywords: Association Rules; Rare Rules; Distribution of lawsuits; Brazilian Federal Justice; Data mining.
Integral Verification and Validation for Knowledge Discovery Procedure Models
by Anne Antonia Scheidler, Markus Rabe
Abstract: This paper explains why the knowledge discovery in databases (KDD) procedure models lack verification and validation (V&V) mechanisms and introduces an approach for integral V&V. Based on a generic model for knowledge discovery, a structure named 'KDD triangle model' is presented. This model has a modular design and can be adapted for other KDD procedure models. This has the benefit of allowing existing projects to improve their quality assurance in knowledge discovery. In this paper, the different phases of the developed triangle model for KDD are discussed. One special focus is on the phase results and related testing mechanisms. This paper also describes possible V&V techniques for the developed integral V&V mechanism to ensure direct applicability of the model.
Keywords: knowledge discovery in databases; data mining; procedure model; verification and validation; quality assurance.
A Multiclass Classification Approach for Incremental Entity Resolution on Short Textual Data
by Denilson Pereira, João A. Silva
Abstract: Several web applications maintain data repositories containing references to thousands of real-world entities originating from multiple sources, and they continually receive new data. Identifying the distinct entities and associating the correct references to each one is a problem known as entity resolution. The challenge is to solve the problem incrementally, as the data arrive, especially when those data are described by a single textual attribute. In this paper, we propose a new approach for incremental entity resolution. The method we have implemented, called AssocIER, uses an ensemble of multiclass classifiers with self-training and detection of novel classes. We have evaluated our method in various real-world datasets and scenarios, comparing it with a traditional entity resolution approach. The results show that AssocIER is effective and efficient to solve unstructured data in collections with a large number of entities and features, and is able to detect hundreds of novel classes.
Keywords: Entity Resolution; Associative Classification; Incremental Learning; Novel Class Detection; Self-training.
Method for Improvement of Transparency: Use of Text Mining Techniques for Reclassification of Governmental Expenditures Records in Brazil
by Gustavo De Oliveira Almeida, Kate Revoredo, Claudia Cappelli, Cristiano Maciel
Abstract: Many countries have transparency laws requiring availability of data. However, often data is available but not transparent. We present the Transparency Portal of Brazilian Federal Government case and discuss limitations of public acquisitions data stored in free text format. We employed text-mining techniques to reclassify descriptive texts of measurement units related to products and services. The solution presented in KNIME and JAVA aggregated measurements in the original (n = 69,372 with 78% reduction in number of descriptions, 94% items classified) and in cross validation sample (n = 105,266 with 88% reduction, classifying 78% of items). In addition, we tested computational time for processing of texts for a wide range of data input sizes, suggesting the stability and scalability of the solution to process larger datasets. Finally, we produced analysis identifying probable input errors, suppliers and purchasing units with abnormal transactions and factors affecting procurement prices. We present suggestions for future research and improvements.
Keywords: e-government; data mining; open government; text mining; transparency; KNIME; knowledge discovery; techniques; Brazil.
Data Mining in Credit Insurance Information System for Bank Loans Risk Management in Developing Countries
by Fouad J. Al Azzawi
Abstract: The task of credit risk insurance in our time is critical since loans
are taken by everyone and everywhere and it is quite difficult to accurately
estimate the possible losses that are incurred by failing to pay those loans.
This work proposes an information system module for the banking system to
improve the risk management operation that distributes losses on some fair
basis, as well as accepting the maximum number of loan requests. Insuring the
risk associated with stumbled loans, the bank will partially or completely shift
losses under this contract to the insurance company, thus minimising its own
losses. The proposed module could find out for what price the bank can buy
such an insurance policy. The proposed module could also be a valuable
motivation for different developing countries to update the strategy of their
current insurance market and outsource part of the state's insurance functions to
an independent insurance industry. Data mining techniques and mathematical
induction have been used to implement this model successfully. An optimal
classification module for predicting risky loan requests has been
successfully employed. A new mathematical model has been developed for
calculating the cost of an insurance policy in a crisis economy.
Keywords: Data mining; Credit insurance; information systems; Bank loans; risk management; developing countries.
Fibonacci Retracement Pattern Recognition for Forecasting Foreign Exchange Market
by Mohd Fauzi Ramli, AHMAD KADRI JUNOH, Mahyun Ab Wahab, Wan Zuki Azman Wan Muhamad
Abstract: Fibonacci retracement implies a forecast of future movements in
foreign exchange (forex) rates based on inductive analysis of previous movements.
Fibonacci ratios are used to forecast the retracement levels of 0.382, 0.500 and
0.618 and to determine the current trend, which provides the mathematical
foundation for the Elliott wave theory. K-nearest neighbour (KNN) and linear
discriminant analysis (LDA) algorithms are the pattern recognition methods used for
nonlinear feature mining of Elliott wave patterns. Results show that LDA is
better than KNN in terms of classification accuracy, at 99.43%.
Among the three levels of Fibonacci retracement, the 38.2% level gives the
best forecast for the Great Britain Pound to US Dollar major currency
pair, using mean absolute error (MAE), root mean square error (RMSE) and
Pearson correlation coefficient (r) as the statistical measurements, which are
0.001884, 0.000019 and 0.992253 for uptrend and 0.001685, 0.000019 and
0.998806 for downtrend.
Keywords: forex; forecast; fibonacci retracement; elliott wave; golden ratio.
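As an illustration of the retracement levels named in the abstract above, the following minimal sketch computes the 0.382, 0.500 and 0.618 price levels for a single price swing. It is not the authors' forecasting method: the function name `retracement_levels`, its parameters and the example prices are hypothetical, and the convention that an uptrend retraces downward from the swing high is the usual charting convention, assumed here.

```python
def retracement_levels(swing_low, swing_high,
                       ratios=(0.382, 0.500, 0.618), uptrend=True):
    """Return {ratio: price} for a swing between two prices.

    Hypothetical helper for illustration only; names and defaults
    are assumptions, not taken from the paper.
    """
    if swing_high <= swing_low:
        raise ValueError("swing_high must exceed swing_low")
    span = swing_high - swing_low
    if uptrend:
        # After a rise, the price is expected to pull back from the high.
        return {r: swing_high - r * span for r in ratios}
    # After a fall, the price is expected to bounce up from the low.
    return {r: swing_low + r * span for r in ratios}


# Example with made-up prices for one currency pair.
levels = retracement_levels(1.2000, 1.3000)
down_levels = retracement_levels(1.2000, 1.3000, uptrend=False)
```

Each level is simply the swing extreme moved back by the chosen fraction of the swing height, so the 0.500 level always sits at the midpoint of the swing.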
CARs-RP: Lasso Based Class Association Rules Pruning
by AZMI Mohamed, Abdelaziz Berrado
Abstract: Classification based on association rules is attracting more and more interest in research and practice. In many contexts, rules are mined from sparse data in high-dimensional spaces, which leads to a large number of rules with considerable containment and overlap. Pruning is often used in the search for an optimal subset of rules. This paper introduces a method for class association rules (CARs) pruning. It learns weights for a set of CARs by maximising the likelihood function subject to a constraint on the sum of the absolute values of the weights. The pruning strength is controlled by a shrinkage parameter λ. The suggested method allows the user to choose the appropriate subset of CARs. This is achieved based on a trade-off between the accuracy and complexity of the resulting classifier, which is controlled by changing λ. Experimental analysis shows that the introduced method allows building more concise classifiers with accuracy comparable to other methods.
Keywords: class association rules; pruning; regularization; weighting; associative classification.
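The constrained likelihood described in the abstract above has the general shape of a lasso problem. The block below is only a sketch of that generic form, not the paper's exact formulation: the symbols $w_j$ (weight of rule $j$), $\ell$ (log-likelihood), $t$ (constraint bound) and $\lambda$ (shrinkage parameter) are assumed notation.

```latex
% Constrained form: maximise the log-likelihood under an L1 budget t.
\hat{w} = \arg\max_{w} \; \ell(w)
\quad \text{subject to} \quad \sum_{j=1}^{p} |w_j| \le t

% Equivalent penalised (Lagrangian) form with shrinkage parameter \lambda \ge 0.
\hat{w} = \arg\max_{w} \; \Big[ \ell(w) - \lambda \sum_{j=1}^{p} |w_j| \Big]
```

In the penalised form, a larger $\lambda$ drives more weights exactly to zero, and rules whose weight is zero can be dropped, which is how such a penalty can act as a pruning control.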
A statistical approach to investigate the alternatives of love in Moulana's Divan
by Mohammad Reza Mahmoudi, Ali Abbasalizadeh, Marzieh Rahmati
Abstract: Conceptual metaphor is the systematic mapping of conceptual domains onto each other. Love is the most important axis of the mystical path. In this paper, all the lines in Moulana's Divan are studied, and the different words used as alternatives of love are determined and classified into 11 areas. Then the chi-square goodness of fit test is used to investigate and compare, separately, the frequencies of the different areas and words used as alternatives of love. Finally, based on clustering methods, these alternatives are clustered into three groups (high frequency, medium frequency and low frequency). The results indicate that the word 'fire' and the area 'human' are the most frequently used alternatives of love.
Keywords: Conceptual Metaphor Love; Moulana; Statistics; Data Mining; Text Mining.
PPM-HC: a Method for Helping Project Portfolio Management Based on Topic Hierarchy Learning
by Ricardo M. Marcacini, Ricardo A. M. Pinto, Flavia Bernardini
Abstract: The projects categorisation is a crucial step in the project portfolio management (PPM). Categorising projects allows the organisation to identify categories with a lack or excess of projects, according to its strategic objectives. In this work, we present a new method for project portfolio management based on hierarchical clustering (PPM-HC) to organise the projects at several levels of abstraction. In the PPM-HC, similar projects are allocated to the same clusters and subclusters. PPM-HC automatically learns an understandable topic hierarchy from the project portfolio dataset, thereby facilitating the (human) task of exploring, analysing and prioritising the projects of the organisation. We also proposed a card sorting-based technique which allows the evaluation of the projects categorisation using an intuitive visual map. We carried out an experimental evaluation based on a benchmark dataset and we also presented a real-world case study. The results show that the proposed PPM-HC method is promising.
Keywords: Project Portfolio Management; Projects Categorization; Topic Hierarchy Learning; Hierarchical Clustering.
An efficient approach for Defect Detection in Texture analysis using Improved Support Vector Machine
by Manimozhi I., Janakiraman S.
Abstract: Texture defect detection can be defined as the process of determining the location and size of the collection of pixels in a textured image which deviate in their intensity values or spatial arrangement in comparison to a background texture. The detection of abnormalities is a very challenging problem in computer vision. We have designed a method for detecting defects in patterned textures. Initially, features are extracted from the input image using the gray level co-occurrence matrix (GLCM) and gray level run-length matrix (GLRLM). Then the extracted features are fed to the classification stage, where classification is done by an improved support vector machine (ISVM). In the proposed pattern analysis, the traditional support vector machine is improved by means of kernel methods. In the final stage, the classified features are segmented using the modified fuzzy C means algorithm (MFCM).
Keywords: Texture defect detection; preprocessing; Gray Level Co-occurrence matrix; Gray Level Run-Length Matrix; Improved Support Vector Machine; modified fuzzy c means algorithm.
A DYNAMIC REPLICATIVE K-MEANS WITH SELF-COMPILING PARTICLE SWARM INTELLIGENCE FOR DATASET CLASSIFICATION
by A. M. Viswa Bharathy
Abstract: The classification techniques proposed so far are not sufficiently intelligent in classifying data sets beyond two-level classification. To multi-classify network data sets, we need more hybrid algorithms. In this paper we propose a hybrid technique that combines a modified K-means algorithm called dynamic replicative K-means (DRKM) with self-compiling particle swarm intelligence (SCPSI). The dataset we have chosen for the experiment is KDD Cup 99. The DRKM-SCPSI performs better in terms of detection rate (DR), false positive rate (FPR) and accuracy, as is visible from the results presented.
Keywords: anomaly; detection; intrusion; K-Means; PSI.
PORTFOLIO SELECTION WITH SUPPORT VECTOR REGRESSION: MULTIPLE KERNELS COMPARISON
by Pedro Alexandre Henrique, Pedro Albuquerque, Peng Yao Hao, Sarah Sabino
Abstract: This study aimed to verify whether the use of support vector regression (SVR) makes the portfolio's return exceed the market. For this purpose, SVR was applied with 15 different kernel functions to select the best stocks for each quarter, calculating the quarterly portfolio return and cumulative return over the period. Subsequently, the returns of these portfolios were compared with the returns of a market benchmark. White's (2000) test was applied to avoid the data-snooping effect in assessing the statistical significance of the portfolios developed by the training strategies. The portfolio selected by SVR with the inverse multiquadric kernel presented the highest cumulative return of 374.40% and a value at risk (VaR) of 6.87%. The results of this study corroborate the superiority hypothesis of the innovative method of support vector regression in the formation of portfolios, thus constituting a robust predictive method capable of coping with high dimensionality interactions.
Keywords: Statistical Learning Theory; Optimization Theory; Financial Econometrics; Support Vector Machine; Kernel methods.
Worldwide Gross Revenue Prediction for Bollywood Movies using Hybrid Ensemble Model
by Alina Zaidi, Siddhaling Urolagin
Abstract: Prediction of revenue before a movie is released can be very beneficial for stakeholders and investors in the movie industry. Even though Indian cinema is a booming industry, the literature in the field of movie revenue prediction is more inclined towards non-Indian movies. In this study we built a novel hybrid prediction model to predict worldwide gross for Bollywood movies. A Bollywood movie dataset consisting of 674 movies is prepared by downloading movie-related features from IMDb and YouTube movie trailers. K-means clustering is performed on the movie dataset and two major clusters are identified. Important features specific to the clusters are selected. The proposed hybrid prediction model segregates movies into the two clusters and employs a prediction model for each cluster. The prediction models we tested included various basic machine learning models and ensemble models. The ensemble model that combined predictions from support vector regression, neural network and ridge regression gave us the best result for both clusters, and we chose it as our final model. We obtain an overall MAE of 0.0272 and R2 of 0.80 after 10-fold cross validation.
Keywords: Bollywood; Movie Revenue Prediction; Box office; Regression; Ensemble; Feature Selection; Machine Learning; Scikit-Learn.
Health Data Warehouses: Reviewing Advanced Solutions for Medical Knowledge Discovery
by Norah Alghamdi
Abstract: The implementation of a data warehouse and a decision support system by utilising the capabilities of information retrieval and knowledge discovery tools in the healthcare fields, has allowed for the enhancement in the offered healthcare. In this work, we present a review of recent data warehouses and decision support systems in the healthcare domain with their significance, and applications of evidence-based medicine, electronic health records, and nursing. Given the growing trend on their implementation in healthcare services, researches, and education, we present here the most recent publications that employ these tools to produce suitable decisions for patients or health providers. For all the reviewed publications, we have intensively explored their problems, suggested solutions, utilised methods, and their findings. We have also highlighted the strength of the existing approaches and identified potential drawbacks including data correctness, completeness, consistency, and integration to provide proper medical decision-making.
Keywords: Data warehouses; Data Mining; Health Data; Medical Records; Quality; Knowledge Discovery; OLAP.
Survey on-demand: A versatile scientific article automated inquiry method using text mining applied to Asset Liability Management
by Pedro Henrique Albuquerque, Igor Nascimento, Peng Yao Hao
Abstract: We propose a methodology that automatically relates the content of text documents to lexical items. The model estimates whether an article addresses a specific research object based on the relevant words in its abstract and title, using text mining and partial least square discriminant analysis. The model is efficient in accuracy, and the adjustment and validation indicators are either superior or equal to those of the other models in the literature on text classification. In comparison to existing methods, our method offers highly interpretable outcomes and allows flexible measurements of word frequency. The proposed solution may aid scholars in the process of searching for theoretical references, suggesting scientific articles based on the similarities in the vocabulary used. Applied to the finance area, our framework has indicated that approximately 10% of the publications in the selected journals address the subject of asset liability management. Moreover, we highlight the journals with the largest number of publications over time and the key words about the subject using only freely accessible information.
Keywords: dimensionality reduction; discriminant analysis; text classification; partial least square; bibliometrics.
Clustering Student Instagram accounts using Author-Topic Model Based
by Nur Rakhmawati, Faiz NF, Irmasari Hafidz, Indra Raditya, Pande Dinatha, Andrianto Suwignyo
Abstract: This study proposes a topic model to cluster the Instagram accounts of a group of high school teenagers in Surabaya, Indonesia, using the author-topic model method. We collected 235 valid Instagram accounts (133 female, 102 male students) and gathered a total of 3,346 captions of Instagram posts from 18 senior high schools. Our major finding is the set of topics that define their Instagram posts or captions, seven topics namely: feeling, Surabaya events, photography, artists, vacation, religion and music. Through the process, the lowest perplexity came from 90 iterations, which suggests six groups of topics. The six topics are concluded based on the lowest perplexity value and labelled according to the words included in each topic. The topic of photography is discussed by six schools. Photography-artists and vacation are discussed by three schools, while feeling, religion and music are discussed by two and one school respectively.
Keywords: Topic Modelling; Senior High School Students; Author-Topic Models.
The approach of using ontology as pre-knowledge source for semi-supervised labelled topic model by applying text dependency graph
by Phu Pham, Phuc Do
Abstract: Discovering multiple topics from text is an important task in text mining. In the past, supervised approaches have failed to explore multiple topics in text. Topic modelling approaches such as LSI, pLSI and LDA are considered unsupervised methods that support discovering distributions of multiple topics in text documents. The labelled LDA (LLDA) model is a supervised method that integrates human-labelled topics with the given text corpus during topic modelling. However, in real applications, we may not have enough high-quality knowledge to properly assign topics to all documents before applying LLDA. In this paper, we present two approaches that take advantage of the dependency graph-of-words (GOW) in text analysis. The GOW approach uses the frequent sub-graph mining (FSM) technique to extract graph-based concepts from text. Our first approach, called the GC2Onto model, uses graph-based concepts to construct a domain-specific ontology. In our second approach, called the LLDA-GOW model, the graph-based concepts are also applied to improve the quality of traditional LLDA. We combine the GC2Onto and LLDA-GOW models to support multiple topic identification as well as other text mining tasks.
Keywords: topic identification; labelled topic modelling; LDA; labelled LDA; ontology-driven topic labelling; dependency graph.
RFID BI Mobility and Producer to Consumer Traceability Architecture
by Andre Claude Bayomock Linwa
Abstract: Radio frequency identifier (RFID) emerged in 2000 as an intelligent means of remote object identification. RFID helps track object position and relevant information using radio frequency technology (Bouet and dos Santos, 2008; Pais, 2010). Its application in industry greatly increases inventory management consistency and accuracy by capturing observed object attributes in real time for traceability and quality control purposes. In order to provide traceability and quality control services, RFID applications should offer two main services: business intelligence (BI) and mobility management. The RFID BI service provides production traceability services (QoS metrics related to manufacturing processes). The RFID mobility service maintains accurate RFID tag locations. In this paper, a generic RFID BI mobility data model is defined. In the proposed data model, RFID product information generated by a supply chain organisation is translated or migrated from a producer to a consumer. This migration generates two distinct types of RFID mobility: internal (inside buildings) and external.
Keywords: Mobility Management; RFID; Business Intelligence BI; Data Models; Business Processes; QoS; Mobile Networks; GPS; Events; Mobility Subscription.
Sentimental Event Detection from Arabic Tweets
by Mohammad Daoud, Daoud Daoud
Abstract: This article presents and evaluates an approach to detect sentimental events from Arabic Twitter data streams. Sentimental events attract strongly opinionated responses from the online community; therefore, we aim at detecting the association of a topic with a positive or a negative sentiment at a particular time. To achieve that, we build sentimental time series where the frequencies of that association (between topics and sentiment) are recorded. We then use several algorithms to locate possible events. Events in positive timelines are considered positive, and similarly for negative events. Our approaches use the Shannon diversity index and hill-climbing peak finding. We tested our proposed algorithms on the domain of football (soccer) news. The results showed good precision and recall, taking mainstream media as a reference. The success of this experiment can open the door for many useful applications including reputation and brand monitoring systems for various domains and languages.
Keywords: event detection; sentiment analysis; social media analysis; diversity analysis; data mining.
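The abstract above names two building blocks, the Shannon diversity index and hill-climbing peak finding, without saying how they are combined. The following is only a minimal sketch of each block in isolation, assuming a sentiment time series stored as a plain list of counts; the function names and sample data are hypothetical and are not taken from the paper.

```python
import math


def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over the non-zero counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def hill_climb_peak(series, start):
    """Move from `start` to a strictly higher neighbour until none exists.

    Returns the index of the local peak that was reached.
    """
    i = start
    while True:
        left = series[i - 1] if i > 0 else float("-inf")
        right = series[i + 1] if i < len(series) - 1 else float("-inf")
        if right > series[i] and right >= left:
            i += 1
        elif left > series[i]:
            i -= 1
        else:
            return i


# Hypothetical counts of positive mentions of one topic per time slot.
positive_counts = [2, 3, 9, 4, 1, 5, 12, 6]
first_peak = hill_climb_peak(positive_counts, 0)
second_peak = hill_climb_peak(positive_counts, 4)
even_split = shannon_index([5, 5])
```

Different starting points can reach different local peaks, so a detector built on this idea would typically restart the climb from several positions.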
A comparison of cluster algorithms as applied to unsupervised surveys
by Kathleen C. Garwood, Arpit Dhobale
Abstract: When important questions are to be answered with data, unsupervised data offers extensive opportunities for insight along with unique challenges. This study considers student survey data with the specific goal of clustering students into like groups, with the underlying aim of identifying different poverty levels. Fuzzy logic is considered during the data cleaning and organising phase, helping to create a logical dependent variable for comparing the analyses. Using multiple data reduction techniques, the survey was reduced and cleaned. Finally, multiple clustering techniques (k-means, k-modes and hierarchical clustering) are applied and compared. Though each method has strengths, the goal was to identify which was most viable when applied to survey data, specifically when trying to identify the most impoverished students.
Keywords: Fuzzy logic; cluster analysis; unsupervised learning; survey analysis; decision support system; k-means; k-modes; hierarchical clustering.
Discovery of inconsistent generalized coherent rules
by Anuradha Radhakrishnan, Rajkumar N, Rathi Gopalakrishnan, Soosaimichael PrinceSahayaBrighty
Abstract: Mining multiple-level association rules over a predefined taxonomy of hierarchies paves the way for generalised rule mining using interestingness measures such as support and confidence. Coherent rule mining identifies significant rules in a database without using interestingness measures. In this paper, we propose a new mining algorithm called generalised inconsistent coherent rule mining (GICRM) for mining a new form of generalised coherent rules called inconsistent coherent rules. The discovered rules are called inconsistent because the correlation of the rules changes from one level of the taxonomy to another. The rules are mined from a structured dataset with a predefined taxonomy. The inconsistent rules mined would be noteworthy from a business point of view for taking strategic decisions in market basket analysis.
Keywords: GICRM; multiple-level; generalized inconsistent coherent rule; taxonomy.
Time and Structural Anomalies Detection in Business Processes Using Process Mining
by Elham Saeedi, Faramarz Safi-Esfahani
Abstract: Information systems are increasingly being integrated into operational processes and, as a result, many events are recorded by information systems. Lack of compatibility between the process model and the observed behaviour is one of the challenges in constructing the process model in process mining. This lack of compatibility can be present in both the structure (sequence of tasks) and the time spent on each task. In this paper, a hybrid approach for detecting structural and time anomalies via process mining is proposed. A dataset from Iran Insurance Company is used to perform a case study. The proposed method detected 98.5% of structural anomalies and 96.3% of time anomalies, which is one of the main achievements of this paper. A second standard dataset, referred to as dataset 2, is used to further examine the proposed method. The proposed method demonstrated better performance than the baseline approach.
Keywords: Process mining; conformance checking; workflow mining; structural anomaly; time anomaly; flexible model; Insurance anomaly; anomaly detection; process model; control-flow perspective.
g*-CLOSED SETS IN INTUITIONISTIC FUZZY TOPOLOGICAL SPACES
by Gandhi Mathi
Abstract: This paper is devoted to the study of intuitionistic fuzzy topological spaces. We introduce the concept of intuitionistic fuzzy g*-closed sets in intuitionistic fuzzy topological spaces and study some of their basic properties. We also introduce the concept of intuitionistic fuzzy g*-open sets in intuitionistic fuzzy topological spaces and derive several basic properties. We show that intuitionistic fuzzy g*-closed sets lie between intuitionistic fuzzy ?-closed sets and intuitionistic fuzzy g-closed sets. We also introduce applications of intuitionistic fuzzy g*-closed sets, namely the intuitionistic fuzzy $T_{1/2}^{*}$ space and ${}^{*}T_{1/2}$ space. We obtain some characterizations and several preservation theorems of intuitionistic fuzzy topological spaces.
Keywords: Intuitionistic fuzzy topology; Intuitionistic fuzzy g*-closed sets; Intuitionistic fuzzy g*-open sets.
Analysis of road accident data and determining affecting factors by using regression models and decision tree
by Hanieh GharehGozlu
Abstract: This study analyses road accident data with the aim of predicting the probability of road accidents leading to death and determining the affecting factors. Regression models including logit, probit, complementary log-log and Gompertz, as well as decision trees based on the CART algorithm, were used to analyse actual data from the rail road police centre of the country. The results show that the logit regression model is superior to the other models from the perspective of the scales of the health indicator. Also, the variables day of week, age, shoulder path, road side, road type, road position, maximum speed, belt safety, specific safety equipment, vehicle type and vehicle manufacturer country are among the variables that significantly affect the probability of road deaths, and can be controlled by controlling their levels.
Keywords: Road accidents; Regression models; Decision tree model; Accuracy indicator scales.
A Review of Market Basket Analysis on Business Intelligence and Data Mining
by Nilam Nur Amir Sjarif, Nurulhuda Firdaus Mohd Azmi, Siti Sophiayati Yuhaniz, Doris Hooi-Ten Wong
Abstract: Business intelligence (BI) is a data-driven approach that covers a variety of tools, technologies, applications, processes and methodologies that enable the mining of useful knowledge and information from operational data assets. Hidden patterns or trends derived from the large volume of data contribute to informed and strategic decision making. Market basket analysis (MBA) is one of the data mining techniques regularly used in BI to help business organisations achieve competitive advantage. Although the adoption of MBA as a data mining technique in BI tools is common in e-commerce, papers that survey BI and MBA are limited. This paper gives a broad picture of the current state of BI and the application of MBA as a BI technique. Literature related to BI and MBA from different sources such as digital libraries and Google Scholar is explored. The survey serves to some degree as a guide or platform for researchers and practitioners for future improvement.
Keywords: Market Basket Analysis; Business Intelligence; Data Mining.
Stock Price Forecasting and News Sentiment Analysis Model using Artificial Neural Network
by Sriram K. V, Somesh Yadav, Ritesh Singh Suhag
Abstract: The stock market is highly volatile, and the prediction of stock prices has always been an area of interest to many statisticians and researchers. This study is an attempt to predict stock prices using an Artificial Neural Network (ANN). Three models have been built: one for the future prediction of stock prices based on previous trends, the second for prediction of the next day's closing price based on today's opening price, and the third one analyzes the sentiment of news articles and gives scores based on the news impact. The ANN is trained with historical data using the R-studio platform and is then used to predict future values. Our experimental results for various stock prices showed that the ANN-based model is effective.
Keywords: Stock Pricing; Forecasting; Artificial Neural Network; News sentiment; Opening price; Closing price; R Studio; Data analytics.
Associative Classification Model for Forecasting Stock Market Trends
by Everton Castelão Tetila, Bruno Brandoli Machado, Jose F. Rorigues-Jr, Nícolas Alessando De Souza Belete, Diego A. Zanoni, Thayliny Zardo, Michel Constantino, Hemerson Pistori
Abstract: This paper proposes an associative classification model based on three technical indicators to forecast future trends of the stock market. Our methodology assessed the performance of nine technical indicators, using a portfolio of ten stocks and a twelve-year time series. The experimental results showed that using a set of technical indicators leads to higher classification rates than using a single technical indicator, reaching an accuracy of 88.77%. The proposed approach also uses a multidimensional data cube that allows automatic updating of stock market asset values, which is essential to keep the forecast updated. The results indicate that our approach can support investors and analysts operating in the stock market.
Keywords: stock market trends; technical indicators; associative classification; data mining; business intelligence.
An automated ontology learning for benchmarking classifier models through gain-based relative-non-redundant feature selection: a case-study with erythemato-squamous disease
by Sivasankari Sivasubramanian, Shomona Gracia Jacob
Abstract: Erythemato-squamous disease (ESD) is one of the complex diseases in the dermatology field. Its diagnosis is challenging due to common morphological features and often leads to inconsistent results. Besides, diagnosis has been done on the basis of visible symptoms in accordance with the expertise of the physician. Hence, ontology construction for prediction of erythemato-squamous disease through data mining techniques was believed to yield a clear representation of the relationships between the disease, symptoms and course of treatment. However, the classification accuracy needed to be high in order to obtain a precise ontology. This required identifying the optimal set of features for predicting ESD. This paper proposes the Gain based Relative-Non-Redundant Attribute selection approach for diagnosis of ESD. This methodology yielded 98.1% classification accuracy with the Adaboost algorithm using J48 as the base classifier. The feature selection approach revealed an optimal feature set comprising 19 selected features.
Keywords: ontology; feature selection; classifier; web ontology language; gain base; erythemato-squamous.
Grey wolf optimiser-based feature selection for feature-level multi-focus image fusion
by K. Sujatha, D. Shalini Punithavathani, J. Janet, S. Venkatalakshmi
Abstract: This paper proposes optimal ensemble-individual-features (OEIF) for multi-focus image fusion by combining the decision information of individual features. The proposed system consists of three stages. In the first stage, different types of features such as spatial, texture and frequency features are extracted from every block of the blurred input images. In the second stage, a grey wolf optimiser (GWO)-based feature validation method is proposed to find suitable features from the source images. This method is based on an iterative process in which each individual represents a candidate solution for validating or invalidating the features. In the final stage, the ensemble decision based on optimal individual features is utilised to fuse the blurred images. We show that the OEIF method is better than the noisy feature-based individual pixel-level and feature-level fusion methods on different multi-focus images, and that the proposed OGWO-based method achieves better visual quality than other methods.
Keywords: multi-focus image fusion; grey wolf optimiser; feature validation; spatial; texture; frequency.
Secure hash algorithm-based multiple tenant user security over green cloud environment
by N.R. Ram Mohan, S. Padmalal, B. Chitra
Abstract: This paper proposes a green cloud multi-tenant trust authentication with secure hash algorithm-3 (GreenCloud-MTASHA3) scheme to eliminate unauthorised tenant access. The GreenCloud-MTASHA3 scheme provides security over multiple tenant requests with reference to the confidentiality, integrity and availability rate. Confidentiality refers to limiting unauthorised tenants' access to green cloud data using the additive homomorphic privacy property in the proposed scheme. An encryption function based on the additive homomorphic privacy property is developed to improve the privacy preserving level. To attain integrity between the tenant requests and the green cloud server machine, an encrypted trust data management process is carried out in the GreenCloud-MTASHA3 scheme. The trustworthiness of each tenant request is measured to maintain a consistent security level with minimal computational time. The proposed scheme attains the confidentiality, integrity and availability rate on the communicating task. Experiments are conducted on factors such as secure computation confidence, authorised tenant computational time and the space taken to store encrypted data.
Keywords: confidentiality; secure hash algorithm; multi-tenant; computational time; integrity; privacy level; cryptographic system; green cloud; security.
Application of a hybrid data mining model to identify the main predictive factors influencing hospital length of stay
by Ahmed Belderrar, Abdeldjebar Hazzab
Abstract: Length of hospital stay is one of the most appropriate measures that can be used for managing hospital resources and assisting hospital admissions. The main predictive factors associated with the length of stay are critical requirements and should be identified to build a reliable prediction model for hospital stays. A hybrid integration approach consisting of a fuzzy radial basis function neural network and hierarchical genetic algorithms was proposed. The proposed approach was applied to a dataset collected from a variety of intensive care units. We achieved an acceptable forecast accuracy level of more than 80.50%. We found 14 common predictive factors. Most notably, we consistently found that demographic characteristics, hospital features, medical events and comorbidities strongly correlate with the length of stay. The proposed approach can be used as an effective tool for healthcare providers and can be extended to other hospital predictions.
Keywords: data mining; hospital management; hospital stay; hybrid prediction model; predictive factors.
Unsupervised key frame selection using information theory and colour histogram difference
by Janya Sainui, Masashi Sugiyama
Abstract: Key frame selection is one of the important research issues in video content analysis, as it helps effective video browsing and retrieval as well as efficient storage. Key frames should typically be as different from each other as possible but, at the same time, cover the entire content of the video. However, the existing methods still lose some meaningful frames due to an inaccurate evaluation of the differences between frames. To address this issue, we propose a novel method of key frame selection which combines an information theoretic measure, called quadratic mutual information (QMI), with the colour histogram difference. These two criteria are used to produce an appropriate frame difference measure. Through experiments, we demonstrate that the proposed key frame selection method achieves greater coverage of the entire video content with minimum redundancy of key frames compared with the competing approaches.
Keywords: key frame selection; similarity measure; information theory; quadratic mutual information; QMI; colour histogram difference.
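Of the two criteria named in the abstract above, the colour histogram difference is the simpler to illustrate. The following is a minimal sketch assuming frames are given as lists of (r, g, b) pixels in the range 0 to 255 and that the difference is taken as the L1 distance between normalised joint histograms; the bin count, the distance choice and all names are assumptions, and the QMI term is not covered.

```python
def colour_histogram(frame, bins=4):
    """Normalised joint RGB histogram of a frame (list of (r, g, b) pixels)."""
    hist = [0.0] * (bins ** 3)
    step = 256 / bins
    for r, g, b in frame:
        # Clamp each channel's bin index so 255 always lands in the last bin.
        ri = min(int(r / step), bins - 1)
        gi = min(int(g / step), bins - 1)
        bi = min(int(b / step), bins - 1)
        hist[ri * bins * bins + gi * bins + bi] += 1.0
    n = len(frame)
    return [h / n for h in hist] if n else hist


def histogram_difference(frame_a, frame_b, bins=4):
    """L1 distance between two normalised histograms, in the range [0, 2]."""
    ha = colour_histogram(frame_a, bins)
    hb = colour_histogram(frame_b, bins)
    return sum(abs(x - y) for x, y in zip(ha, hb))


# Tiny hypothetical frames of four pixels each.
red = [(255, 0, 0)] * 4
blue = [(0, 0, 255)] * 4
mixed = [(255, 0, 0)] * 2 + [(0, 0, 255)] * 2
```

A histogram discards pixel positions, so two frames with the same colours in different layouts score as identical, which may be one reason to pair it with a second measure.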
Accurate recognition of ancient handwritten Tamil characters from palm prints for the Siddha medicine systems
by E.K. Vellingiriraj, P. Balasubrmanie
Abstract: Recognition of ancient Tamil characters is a complex task because sufficient training information is not available. Various researchers have attempted to perform accurate recognition of ancient Tamil characters. In our preceding work, a hybrid multi-neural learning based prediction and recognition system (HMNL-PRS) was introduced for the prediction process, but it suffers from inaccurate recognition. In this research work, this is overcome by proposing the Brahmi character prediction and conversion system (BC-PCS) methodology. Here, the modified graph based segmentation algorithm (MGSA) is used to segment the characters. Statistical and structural features are then extracted, based on which classification is done using a hybridised support vector machine based fuzzy neural network. The proposed work is implemented in the MATLAB simulation environment, and it is confirmed that it gives better results than the preceding methodology in terms of recognition rate.
Keywords: Brahmi characters; accurate recognition; segmentation; graph based approach; classification.
A utility-based approach for business intelligence to discover beneficial itemsets with or without negative profit in retail business industry
by C. Sivamathi, S. Vijayarani
Abstract: Utility mining is defined as the discovery of high utility itemsets from large databases. It can be applied in business intelligence for business decision-making such as arranging products on shelves, catalogue design, customer segmentation and cross-selling. In this work, a novel algorithm, MAHUIM (matrix approach for high utility itemset mining), is proposed to reveal high utility itemsets from a transaction database. The proposed algorithm uses a dynamic matrix structure. The algorithm scans the database only once and does not generate candidate itemsets. The algorithm calculates the minimum threshold value automatically, without requesting it from the user. The proposed algorithm is compared with existing algorithms such as HUI-Miner, D2HUP and EFIM. For handling negative utility values, the MANHUIM algorithm is proposed and compared with HUINIV. For performance analysis, four benchmark datasets, namely Connect, Foodmart, Chess and Mushroom, are used. The results show that the proposed algorithms are more efficient than the existing ones.
Keywords: utility mining; high utility itemsets; individual item utility; transaction utility; automatic threshold selection; profitable transactions; pruned items.
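For readers unfamiliar with the term, the utility of an itemset is commonly defined as the sum, over every transaction containing the whole itemset, of purchased quantity times unit profit for each of its items. The sketch below illustrates that standard definition only, including an item with negative profit; it is not the MAHUIM or MANHUIM algorithm, and the data and names are hypothetical.

```python
def itemset_utility(itemset, transactions, unit_profit):
    """Sum quantity * unit profit of `itemset` over transactions containing it."""
    items = set(itemset)
    total = 0
    for quantities in transactions:
        # A transaction counts only if it contains every item of the itemset.
        if items.issubset(quantities):
            total += sum(quantities[i] * unit_profit[i] for i in items)
    return total


# Hypothetical unit profits; "bag" is given away at a loss (negative profit).
unit_profit = {"bread": 1, "milk": 2, "bag": -1}

# Each transaction maps an item to the quantity purchased.
transactions = [
    {"bread": 2, "milk": 1},
    {"bread": 1, "milk": 3, "bag": 1},
    {"milk": 2},
]

bread_milk = itemset_utility({"bread", "milk"}, transactions, unit_profit)
milk_only = itemset_utility({"milk"}, transactions, unit_profit)
milk_bag = itemset_utility({"milk", "bag"}, transactions, unit_profit)
```

An itemset is then called high utility when this value meets a minimum threshold, which the abstract says the proposed algorithm derives automatically.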
Benchmarking tree-based least squares twin support vector machine classifiers
by Mayank Arya Chandra, S.S. Bedi
Abstract: The least square twin support vector machine is an emerging learning method applied to classification problems. This paper presents a tree-based least square twin support vector machine (T-LSTWSVM) for classification. The classification procedure depends on the correlation of input features as well as output features. UCI benchmark data sets are used to evaluate the test set performance of T-LSTWSVM classifiers with multiple kernel functions such as linear, polynomial and radial basis function (RBF) kernels. The method is applied to two main types of classification problem: binary and multi-class. The evaluation and accuracy are calculated in terms of a distance metric. It was observed that the tree-based method performed excellently on multi-class classification problems.
Keywords: binary tree; classification; hyper plane; kernel function; machine learning; support vector machine; SVM; least square twin SVM.