International Journal of Business Intelligence and Data Mining (101 papers in press)
An Effective Preprocessing Algorithm for Model Building in Collaborative Filtering based Recommender System
by Srikanth T, M. Shashi
Abstract: Recommender systems suggest interesting items to online users based on the ratings they have expressed for other items, which are maintained globally as the rating matrix. The rating matrix is often sparse and very large, because a large number of users rate only a few items among the many alternatives. Sparsity and scalability are challenging issues in achieving accurate predictions in recommender systems. This paper focuses on a model-building approach to collaborative filtering-based recommender systems, using low-rank matrix approximation algorithms to achieve scalability and accuracy while dealing with sparse rating matrices. A novel preprocessing methodology is proposed to counter the data sparsity problem by making the sparse rating matrix denser before extracting latent factors that appropriately characterise the users and items in a low-dimensional space. The quality of predictions made either directly or indirectly through user clustering was investigated and found to be competitive with existing collaborative filtering methods in terms of reduced MAE and increased NDCG values on benchmark datasets.
Keywords: Recommender System; Collaborative Filtering; Dimensionality Reduction; Pre-Processing; Sparsity; Scalability; Matrix Factorization.
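To make the idea of extracting latent factors from a sparse rating matrix concrete, the following is a minimal sketch of a generic matrix factorisation trained by stochastic gradient descent, with MAE as the evaluation measure. It is not the authors' algorithm and does not include their preprocessing step; the function names, the toy ratings and all hyperparameters are assumptions chosen for illustration.

```python
import random

def train_mf(ratings, n_users, n_items, rank=2, lr=0.02, reg=0.02,
             epochs=300, seed=0):
    """Fit rating ~ mu + P[u] . Q[i] on observed (user, item, rating) triples."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    P = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - (mu + sum(p * q for p, q in zip(P[u], Q[i])))
            for f in range(rank):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step with L2 regularisation on both factors.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return mu, P, Q

def predict(model, u, i):
    mu, P, Q = model
    return mu + sum(p * q for p, q in zip(P[u], Q[i]))

def mae(model, ratings):
    """Mean absolute error over a list of (user, item, rating) triples."""
    return sum(abs(r - predict(model, u, i)) for u, i, r in ratings) / len(ratings)

# Hypothetical sparse ratings: 4 users, 3 items, 8 of 12 cells observed.
toy_ratings = [(0, 0, 5), (0, 1, 4), (1, 0, 4), (1, 2, 1),
               (2, 1, 5), (2, 2, 2), (3, 0, 1), (3, 2, 5)]
```

Only the observed cells enter the loss here, which is one common way to handle sparsity; the abstract instead describes making the matrix denser before factorisation.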
Trajectory tracking of the robot end-effector for the minimally invasive surgeries
by Jose De Jesus Rubio, Panuncio Cruz, Enrique Garcia, Cesar Felipe Juarez, David Ricardo Cruz, Jesus Lopez
Abstract: Surgical technology has been widely investigated with the
aim of achieving a more efficient way of working in medicine. Consequently,
robots with small tools have been incorporated into many kinds of surgery
to obtain the following improvements: the patient recovers faster, the
surgery is minimally invasive, and the robot can reach hidden parts of the body. In
this article, an adaptive strategy for the trajectory tracking of the robot
end-effector is addressed; it consists of a proportional derivative technique plus
an adaptive compensation. The proportional derivative technique is employed
to achieve the trajectory tracking. The adaptive compensation is employed to
approximate some unknown dynamics. The robot described in this
study is employed in minimally invasive surgeries.
Keywords: Trajectory tracking; robot; minimally invasive surgery.
Multi Label Learning Approaches for Multi Species Avifaunal Occurrence Modelling: A Case Study of South Eastern Tamil Nadu
by Appavu Alias Balamurugan, P.K.A. Chitra, S. Geetha
Abstract: Many multi-label problem transformation (PT) and algorithm
adaptation (AA) methods need to be explored to find good candidates for
avifaunal occupancy modelling. This research compared eight commonly used
state-of-the-art PT and AA multi-label methods. The data were created by
collecting records from January 2014 to December 2014 from the e-bird repository for the
study area, Madurai district, south-eastern Tamil Nadu. The analysis shows that
classifier chain (CC) and multi-label naive Bayes (MLNB) are good
candidates for avifaunal data. MLNB performed best, with a Hamming loss of 0.019 and
90% average precision. To the best of our knowledge, this is the first use of
MLNB for avifaunal data. The MLNB results indicate
that, of the 143 species observed, six had a high occurrence rate and 68
had a low occurrence rate.
Keywords: Species distribution models; multi species; multi label Learning; Multi Label Naive Bayes; Central part of southern Tamil Nadu.
Analytics on Talent Search Examination Data
by Anagha Vaidya, Vyankat Munde, Shailaja Shirwaikar
Abstract: Learning analytics and educational data mining have greatly
supported the process of assessing and improving the quality of education.
While learning analytics has a longer development cycle, educational data
mining suffers from the inadequacy of data captured through learning
processes. The data captured from the examination process can be suitably
extended to perform some descriptive and predictive analytics. This paper
demonstrates the possibility of actionable analytics on the data collected from
a talent search examination process by adding some data pre-processing
steps. The analytics provide some insight into the learners' characteristics
and demonstrate how analytics on examination data can be a major support
for improving quality in the education field.
Keywords: Learning Analytics; Educational Data Mining; clustering; linear modelling.
CBRec: a book recommendation system for children using the matrix factorisation and content-based filtering approaches
by Yiu-Kai Ng
Abstract: Promoting good reading habits among children is essential, given
the enormous influence of reading on students' development as learners and
members of society. Unfortunately, very few children's websites or online
applications recommend books to children, even though they can play a
significant role in encouraging children to read. The few popular
book websites that do suggest books to children base their suggestions on the
popularity or rankings of books; they are not customised/personalised for each individual
user and are likely to recommend books that users do not want or like. We have
integrated the matrix factorisation approach and the content-based approach,
in addition to predicting the grade levels of books, to recommend books for
children. Recent research has demonstrated that a hybrid approach,
which combines different filtering approaches, is more effective in making
recommendations. An empirical study has verified the effectiveness of
our proposed children's book recommendation system.
Keywords: Book recommendation; matrix factorisation; content analysis; children.
Enhancing Purchase Decision using Multi-word Target Bootstrapping with Part-of-Speech Pattern Recognition Algorithm
by M. Pradeepa Sivaramakrishnan, C. Deisy
Abstract: In this research work, multi-word target related terms are extracted
automatically from customer reviews for sentiment analysis. We use the LIDF
measure and propose a novel measure, TCumass, in the iterative
multi-word target (IMWT) bootstrapping algorithm. In addition, a part-of-speech
pattern recognition (PPR) algorithm is proposed to identify the
appropriate target and emotional words from multi-word target related terms.
This article aims to bring out both implicit and explicit targets with their
corresponding polarities in an unsupervised manner. We propose two models,
namely MWTB without PPR and MWTB with PPR, and compare them
with the existing
multi-aspect bootstrapping (MAB) algorithm. The experiments were carried out
on different data sets, and the performance was evaluated using
different measures. The results show that the MWTB with PPR
model performs well, identifying precise targets and emotional words.
Keywords: Bootstrapping; emotional polarity; multi-word target; Part-of-Speech (POS); sentiment analysis.
Inferring the Level of Visibility from Hazy Images
by Alexander A. S. Gunawan, Heri Prasetyo, Indah Werdiningsih, Janson Hendryli
Abstract: In our research, we would like to exploit crowdsourced photos from
social media to create low-cost fire disaster sensors. The main problem is to
analyse how hazy the environment looks. Therefore, we provide a brief
survey of methods dealing with the visibility level of hazy images. The methods
are divided into two categories: the single-image approach and the learning-based
approach. The survey begins by discussing the single-image approach. This
approach is represented by a visibility metric based on contrast-to-noise ratio
(CNR) and a similarity index between a hazy image and its dehazed image. This
is followed by a survey of the learning-based approach using two contrasting
approaches: 1) based on the theoretical foundation of light transmission,
combined with the depth image using a new deep learning method; 2) based on a
black-box method employing convolutional neural networks (CNN) on hazy
Keywords: Hazy image; visibility level; single image approach; learning based approach; social media.
The Mediation Roles of Purchase intention and brand trust in Relationship between social marketing activities and brand loyalty
by Nasrin Yazdanian, Saman Ronagh, Parya Laghaei, Fatemeh Mostafshar
Abstract: The rise of social media significantly challenges the way firms
manage the introduction of their brands. The literature on social media
marketing activities (SMMA) has grown, especially in the field of luxury
marketing. Building on the basis of Web 2.0, social media applications have
simplified and facilitated extraordinary growth in customer interaction in
modern times. The objective of this study is to examine the
factors that influence Iranian luxury brand customers' attitudes toward
purchase intention and brand loyalty. A questionnaire was used to collect
data from a sample of 114 luxury brand customers on social media in Tehran,
the capital and a metropolitan city of Iran. Structural equation modelling was applied
to examine the impact of social media marketing activities on brand loyalty.
The mediating role of purchase intention and brand trust is also considered. The
results indicated that entertainment does not have a positive impact on purchase
intention, brand trust and brand loyalty. The results of this research enable
luxury brand managers to forecast the future purchasing behaviour of their
customers and provide a guide to managing their strategies and marketing
activities in a competitive environment.
Keywords: Luxury brands; Social Media Marketing Activities; brand trust; loyalty; purchase intention.
Application of a hybrid data mining model to identify the main predictive factors influencing hospital length of stay
by Ahmed Belderrar, Abdeldjebar Hazzab
Abstract: Length of hospital stay is one of the most appropriate measures that can be used for managing hospital resources and assisting hospital admissions. The main predictive factors associated with the length of stay are critical requirements and should be identified to build a reliable prediction model for hospital stays. A hybrid integration approach consisting of a fuzzy radial basis function neural network and hierarchical genetic algorithms is proposed. The proposed approach was applied to a data set collected from a variety of intensive care units. We achieved an acceptable forecast accuracy of more than 80.50%. We found 14 common predictive factors. Most notably, we consistently found that demographic characteristics, hospital features, medical events and comorbidities correlate strongly with the length of stay. The proposed approach can be used as an effective tool for healthcare providers and can be extended to other hospital predictions.
Keywords: data mining; hospital management; length of hospital stay; hybrid prediction model; predictive factors.
Genetic Algorithm based Intelligent Multiagent Architecture for Extracting Information from Hidden Web Databases
by Weslin D, T. Joshva Devadas
Abstract: Although an enormous amount of information is available on the
web, only a very small portion of it is visible to users.
Because so much information is not visible, traditional search engines
cannot index or access all the information present on the web. The main challenge
in mining relevant information from a huge hidden web database is to
identify the entry points for accessing the hidden web databases. Existing web
crawlers cannot retrieve all information from the hidden web databases. To
retrieve all the relevant information from the hidden web, this paper proposes
an architecture that uses a genetic algorithm and intelligent agents for accessing
hidden web databases. The proposed architecture is termed the genetic algorithm
based intelligent multi-agent system (GABIAS). The experimental results show
that the proposed architecture provides better precision and recall than
existing web crawlers.
Keywords: Genetic Algorithm (GA); Hidden Web; Intelligent Agent; Web Crawler.
Efficient Clustering Technique for K-Anonymization with Aid of Optimal KFCM
by Chitra Ganabathi G., P. Uma Maheswari
Abstract: The k-anonymity model is a simple and practical approach for data
privacy preservation. To minimise the information loss due to anonymisation, it
is crucial to group similar data together and then anonymise each group
individually. This paper therefore proposes a novel clustering method for
applying the k-anonymity model effectively. The clustering is done by
an optimal kernel-based fuzzy c-means clustering algorithm (KFCM). In
KFCM, the original Euclidean distance in the FCM is replaced by a
kernel-induced distance. Here, the objective function of the kernel fuzzy
c-means clustering algorithm is optimised with the help of a modified grey wolf
optimisation algorithm (MGWO). Based on that, the collected data is grouped
in an effective manner. The performance of the proposed technique is evaluated
in terms of information loss and the time taken to group the available data. The
proposed technique will be implemented in MATLAB.
Keywords: Privacy preservation; k-anonymity; Kernel Fuzzy C-Means; Grey wolf optimization; information loss.
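The replacement of the Euclidean distance by a kernel-induced distance can be illustrated with a short sketch. The abstract does not say which kernel is used; the Gaussian kernel, the sigma parameter and the function names below are assumptions. For any kernel K, the squared distance in feature space is K(x, x) + K(v, v) - 2 K(x, v), which for a Gaussian kernel simplifies to 2 (1 - K(x, v)).

```python
import math

def gaussian_kernel(x, v, sigma=1.0):
    """Gaussian (RBF) kernel; the kernel choice is an assumption of this sketch."""
    sq_euclid = sum((a - b) ** 2 for a, b in zip(x, v))
    return math.exp(-sq_euclid / sigma ** 2)

def kernel_distance_sq(x, v, sigma=1.0):
    """Squared distance between x and v after the implicit feature mapping."""
    return (gaussian_kernel(x, x, sigma)
            + gaussian_kernel(v, v, sigma)
            - 2.0 * gaussian_kernel(x, v, sigma))
```

Unlike the Euclidean distance, this quantity is bounded above by 2 for the Gaussian kernel, so distant outliers pull a cluster centre less strongly.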
Optimal Decision Tree Fuzzy Rule Based Classifier (ODT-FRC) For Heart Disease Prediction Using Improved Cuckoo Search Algorithm
by Subhashini Narayan, Jagadeesh Gobal
Abstract: Heart disease is a major cause of anomaly in developed countries
and one of the basic diseases in developing countries. There is therefore a
need to introduce a decision support system for predicting heart disease
in a patient. The clinical decision support system comprises three
stages: preprocessing, decision rule generation with rule
weighting, and classification. Initially, the Cleveland, Hungarian and
Switzerland data are loaded from the database during
preprocessing. In this stage, a feature dimension reduction method is
applied to reduce the feature space using orthogonal
locality preserving projection (OLPP). The
cuckoo search algorithm, fuzzy logic and a decision tree classifier are then
combined into a hybrid classifier: the fuzzy and decision tree algorithms are
combined with the cuckoo search (CS) algorithm, which guides them
towards accurate classification.
Keywords: preprocessing; cuckoo search; fuzzy; decision tree; classification.
A Novel Attribute Based Dynamic Clustering with Schedule Based Rotation Method (ADC-SBR) for Outlier Detection
by Karthikeyan .G, P. Balasubramanie
Abstract: Detection of outliers in bank transactions has gained popularity in
recent years. Existing outlier detection techniques are unable to process
high volumes of data. Hence, to address this issue, an efficient attribute
based dynamic clustering-schedule based rotation (ADC-SBR) method is
proposed. The similarity between transactions within a cluster is estimated
using a Jaccard coefficient based labelling approach, and the optimal cluster head
is chosen by the similarity-based cluster head selection (SbCHS) method.
The outlier detection is performed at two levels. The node-level outlier
detection is performed using a linear regression model, and the cluster-level
outlier detection is performed by deviation-based ranking. A dataset of
bank transactions compiled by the authors is used for the experimental analysis. The suggested method
is implemented in Apache Spark and is compared with existing algorithms on
several metrics. The comparison shows that the proposed method outperforms
existing algorithms on all metrics.
Keywords: Attribute based Dynamic Clustering (ADC) - Schedule based Rotation (SBR); Jaccard coefficient; Linear Regression method; Deviation based ranking; Similarity based Cluster Head Selection (SbCHS).
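As a brief illustration of the Jaccard coefficient mentioned in the abstract, the sketch below compares two transactions represented as sets of attribute values. That representation, the function name and the example values are assumptions; the abstract does not describe how ADC-SBR encodes transactions.

```python
def jaccard(a, b):
    """Jaccard coefficient |A & B| / |A | B| of two collections treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty sets are considered identical
    return len(a & b) / len(a | b)

# Hypothetical transactions described by categorical attribute values.
txn_1 = {"atm", "cash", "night"}
txn_2 = {"atm", "cash", "day"}
similarity = jaccard(txn_1, txn_2)  # 2 shared values out of 4 distinct values
```

A value of 1 means the two transactions share all attribute values and 0 means they share none.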
Mining Multilingual and Multiscript Twitter Data: Unleashing the Language and Script Barrier
by Bidhan Sarkar, Nilanjan Sinhababu, Manob Roy, Pijush Kanti Dutta Pramanik, Prasenjit Choudhury
Abstract: Micro-blogging sites like Twitter have become an opinion hub
where views on diverse topics are expressed. Interpreting, comprehending and
analysing this emotion-rich information can unearth many valuable insights.
The job is trivial if the tweets are in English. Lately, however, the increasing use of native
languages for communication has posed a great challenge in social media
mining. Things become more complicated when people use the Roman script to
write non-English languages. India, being a country with a diverse collection of
scripts and languages, faces this problem acutely. We have developed a
system that automatically identifies and classifies native tweets, irrespective of
the script used. By converting all tweets to English, we eliminate the script vs
language problem. The new approach we formulated consists of script
identification, language analysis, and clustered mining. Considering English
and the top two Indian languages, we found that the proposed framework gives
better precision than the prevailing approaches.
Keywords: Twitter Mining; Language Classification; Script Identification; Indic language; Preprocessing; Naive Bayes; Support Vector Machine; LDA.
An Automated Ontology Learning for benchmarking classifier models through Gain-Based Relative-Non-Redundant (GBRNR) Feature Selection : A case-study with Erythemato
by S. Sivasankari, Shomona Gracia Jacob
Abstract: Erythemato-squamous disease (ESD) is one of the complex diseases
in the dermatology field. Its diagnosis is challenging because of common
morphological features and often leads to inconsistent results. Moreover,
diagnosis has been based on visible symptoms and depends on
the expertise of the physician. Hence, ontology construction for the prediction
of Erythemato-squamous disease through data mining techniques was expected
to yield a clear representation of the relationships between the disease,
its symptoms and the course of treatment. However, the classification accuracy
needs to be high in order to obtain a precise ontology. This requires
identifying the optimal set of features needed to predict ESD. This
paper proposes the Gain based Relative-Non-Redundant Attribute selection
approach for the diagnosis of ESD. This methodology yielded 98.1% classification
accuracy with the Adaboost algorithm using J48 as the base classifier. The
feature selection approach revealed an optimal feature set comprising 19
Keywords: Ontology; Feature Selection; Classifier; Web Ontology Language; Gain Base; Erythemato-Squamous.
Optimal Page Ranking System For Web Page Personalization Using MKFCM And GSA
by Pranitha P., M.A.H. Farquad, G. Narshimha
Abstract: In this personalised web search (PWS) approach, we utilise a kernel-based
FCM for clustering web pages. For effective personalised web search, queries
are optimised using GSA with respect to clustered query sessions. In offline
processing, the input information taken from web pages visited by the consumer
is first preprocessed and transformed into numerical matrices. These matrices
are clustered with the help of the kernel-based FCM method. A vector is then produced
for the consumer query, and the minimum-distance centroid values are detected and
given as input to the GSA algorithm, which generates links to the top N
web pages from the cluster. In online processing, the user query is taken as input,
web pages are retrieved from Google, Bing and Yahoo, and their content
and snippets are extracted. Finally, the sum of contents and snippets is computed and
the web pages are ranked in descending order.
Keywords: Kernel-based Fuzzy c-means; Clustering; offline; online; preprocessing; Google; Bing; Yahoo.
Privacy Preserving-Aware Over Big Data in Clouds Using GSA and Map Reduce Framework
by Sekar K., Mokkala Padmavathamma
Abstract: This paper proposes a privacy-preserving approach
over big data in clouds using GSA and the MapReduce framework. It consists of
two modules: a MapReduce (MR) module and an evaluation module. In the MR
module, a convolution process is applied to the dataset to create a new kernel
matrix. When the convolution process is done correctly, the utility and privacy
information of the data is well secured. Once the convolution process is over,
the privacy-preserving framework over big data in cloud systems is carried out
by the evaluation module. In the evaluation module, the neural network is
trained with the Gravitational Search Algorithm with Scaled Conjugate
Gradient (GSA-SCG) algorithm, which improves the utility of the private
data. Finally, the reduced private data are stored with the cloud service provider
(CSP). The MapReduce framework, which is in charge of anonymising the original
data sets as per privacy requirements, protects the private data.
Keywords: Map reduce; privacy preserving; big data; Cloud service provider; cloud system; GSA; convolution; entropy.
Secure Hash Algorithm based Multiple Tenant User Security over Green Cloud Environment
by Ram Mohan, S. Padmalal, B. Chitra
Abstract: This paper proposes a green cloud multi-tenant trust authentication
with secure hash algorithm-3 (GreenCloud-MTASHA3) scheme to eliminate
unauthorised tenant access. The GreenCloud-MTASHA3 scheme provides
security over multiple tenant requests with reference to the confidentiality,
integrity and availability rates. Confidentiality refers to limiting unauthorised
tenants' access to green cloud data using the additive homomorphic privacy
property in the proposed scheme. An encryption function based on the additive
homomorphic privacy property is developed to improve the level of privacy preservation.
To attain integrity between the tenant requests and the green cloud
server machine in the GreenCloud-MTASHA3 scheme, an encrypted trust data
management process is carried out. The trustworthiness of each tenant request is
measured to maintain a consistent level of security with minimal
computational time. The proposed scheme attains confidentiality, integrity
and availability in communication tasks. Experiments are conducted on
factors such as secure computation confidence, authorised tenant computational
time and the space taken to store encrypted data.
Keywords: Green Cloud; Security; Confidentiality; Secure Hash Algorithm; Computational Time; Multi-Tenant; Integrity; Privacy Level; Cryptographic System.
Frequent Pattern Mining for Parameterised Automatic Variable Key based cryptosystems
by Shaligram Prajapat
Abstract: A huge amount of information is exchanged electronically in most
enterprises and organisations. In particular, in all financial and e-business
set-ups, the amount of data stored or exchanged over public
networks among a variety of computing devices is growing enormously. Securing this
volume of data is challenging. This paper provides a framework for securing information
exchange using parametric approaches with the AVK approach, and investigates the
strength of this cryptosystem using mining algorithms on a symmetric key-based
cryptosystem. This work demonstrates the application of association rules as one of the
components of a cryptic mining system used to process encrypted data for
extracting useful patterns and associations. The degree of identified patterns
may be useful for ranking the degree of safety and the class of a cryptic algorithm
during the auditing of security algorithms.
Keywords: Mining algorithms; symmetric key cryptography; AVK.
A hybrid framework for Job Scheduling on Cloud using Firefly and BAT algorithm
by Hariharan B., Dassan Paul Raj
Abstract: Cloud computing is an emerging field that requires more
algorithms and techniques for its various processes. Here, we
consider the job scheduling process in a cloud computing platform, which
needs a good algorithm to schedule the jobs requested by various users of
the cloud computing environment. Requests can come from any platform, so
scheduling is indispensable when a number of users need particular
jobs. In this research, we aim to develop a hybrid algorithm for job
scheduling in the cloud computing environment. Accordingly, multiple criteria will
be taken into account for scheduling various jobs located on various servers. The job
scheduling will then be done using a hybrid optimisation algorithm.
Additionally, different jobs with different constraints will be considered, and the
cloud computing environment is simulated with the help of the CloudSim tool.
Keywords: Cloud Computing; Firefly Algorithm; BAT algorithm; Job Scheduling; FF-BAT Algorithm.
Accurate Recognition of Ancient Handwritten Tamil Characters from Palm Prints for the Siddha Medicine Systems
by Vellingiriraj EK, P. Balasubrmanie
Abstract: The recognition of ancient Tamil characters is a complex task
because sufficient training information is not available. Various
researchers have attempted to perform accurate recognition of ancient Tamil
characters. In our preceding work, a hybrid multi-neural learning based
prediction and recognition system (HMNL-PRS) was introduced for the
prediction process, but it suffers from inaccurate recognition. In the proposed
research work, this is overcome by the Brahmi character prediction
and conversion system (BC-PCS) methodology. Here, the modified graph
based segmentation algorithm (MGSA) is used to segment the characters.
Statistical and structural features are then extracted, based on which
classification is done using a hybridised support vector machine based fuzzy
neural network. The proposed
research work is implemented in the MATLAB simulation environment, and it is
confirmed that it yields better results than the preceding research
methodology in terms of recognition rate.
Keywords: Brahmi characters; accurate recognition; segmentation; graph based approach; Classification.
Benchmarking Tree based Least Squares Twin Support Vector Machine Classifiers
by Mayank C, S.S. Bedi
Abstract: The least squares twin support vector machine is an emerging learning method applied to classification problems. This paper presents a tree-based least squares twin support vector machine (T-LSTWSVM) for classification. The classification procedure depends on the correlation of the input features as well as the output features. UCI benchmark data sets are used to evaluate the test set performance of T-LSTWSVM classifiers with multiple kernel functions, namely linear, polynomial and radial basis function (RBF) kernels. The method is applied to two main types of classification problems: binary-class problems and multi-class problems. The evaluation and accuracy are calculated in terms of a distance metric. It was observed that the tree-based method performed excellently on multi-class classification problems.
Keywords: Binary Tree; Classification; Hyper plane; Kernel Function; Machine Learning; Support Vector Machine (SVM); Least Square Twin SVM.
A Utility Based Approach for Business Intelligence to Discover Beneficial Itemsets With or Without Negative Profit in Retail Business Industry
by C. Sivamathi, S. Vijayarani
Abstract: Utility mining is defined as the discovery of high utility itemsets from large databases. It can be applied in business intelligence for business decision-making such as arranging products on shelves, catalogue design, customer segmentation and cross-selling. In this work, a novel algorithm, MAHUIM (matrix approach for high utility itemset mining), is proposed to reveal high utility itemsets from a transaction database. The proposed algorithm uses a dynamic matrix structure. The algorithm scans the database only once and does not generate candidate itemsets. The algorithm calculates the minimum threshold value automatically, without requiring it from the user. The proposed algorithm is compared with existing algorithms such as HUI-Miner, D2HUP and EFIM. For handling negative utility values, the MANHUIM algorithm is proposed and compared with HUINIV. For performance analysis, four benchmark datasets (Connect, Foodmart, Chess and Mushroom) are used. The results show that the proposed algorithms are more efficient than the existing ones.
Keywords: Utility mining; High utility itemset mining; individual item utility; transaction utility; Minimum utility threshold; Negative utility; Pruning strategy; Profitable transactions.
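For readers unfamiliar with utility mining, the sketch below computes the utility of an itemset under the standard definition: the sum, over every transaction containing the whole itemset, of purchased quantity times unit profit for each item in the itemset. It is a brute-force illustration of that definition only, not the MAHUIM or MANHUIM algorithm; the function name, the toy database and the profit values (including the negative profit for item "c") are assumptions.

```python
def itemset_utility(itemset, transactions, unit_profit):
    """Total utility of `itemset` across all transactions that contain it.

    transactions: list of dicts mapping item -> purchased quantity.
    unit_profit:  dict mapping item -> profit per unit (may be negative).
    """
    total = 0
    for txn in transactions:
        if all(item in txn for item in itemset):
            total += sum(txn[item] * unit_profit[item] for item in itemset)
    return total

# Hypothetical retail database; item "c" is sold at a loss.
toy_db = [
    {"a": 1, "b": 2},
    {"a": 2, "c": 1},
    {"a": 1, "b": 1, "c": 3},
]
toy_profit = {"a": 5, "b": 2, "c": -1}
```

An itemset is called a high utility itemset when this total reaches a minimum utility threshold, which the abstract says MAHUIM derives automatically.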
Automated Optimal Test Data Generation for OCL Specification Using Harmony Search Algorithm
by A. Jali
Abstract: Exploring software testing possibilities at an early stage of the software life cycle is increasingly necessary to avoid the propagation of defects to subsequent phases. This requirement demands techniques that can generate automated test cases in the initial phases of software development. Thus, we propose a novel framework for automated test data generation using formal specifications written in object constraint language (OCL). We also define a novel fitness function named exit-predicate-wise branch coverage (EPWBC) to evaluate the generated test data. Another focus of the proposed approach is to optimise the test case generation process by applying the harmony search (HS) algorithm. The experimental results indicate that the proposed framework outperforms other OCL-based test case generation techniques. Furthermore, it has been inferred that OCL-based testing combined with the HS algorithm forms an excellent combination that produces more test coverage and an optimal test suite, thereby improving the quality of a system.
Keywords: specification-based testing; OCL; object constraint language; HS; harmony search; EPWBC; exit-predicate-wise branch coverage; Optimal Test Case Generation.
Characteristic of Enterprise Collaboration System and Its Implementation Issues in Business Management
by Tanvi Bhatia, Sudhanshu Joshi, Sadhna Sharma, Durgesh Samadhiya, Rajiv Ratn Shah
Abstract: Collaboration is an extremely useful area for most enterprise systems, particularly within Web 2.0 and Enterprise 2.0. In an enterprise collaboration system (ECS), collaboration helps to achieve the desired goal by unifying the completed tasks of employees or people working on a similar or the same task. Thus, collaboration systems have received significant attention. The ECS provides consistent and off-the-shelf support to processes and management within organisations. Management techniques of the ECS may be useful to a community which manages ECS systems for collaboration. In this context, this paper focuses on the enterprise collaboration system and answers critical questions related to ECS, including: 1) what does collaboration really mean for an enterprise system; 2) how can collaboration help to improve the internal processes and management of the system; 3) how is it helpful in improving interactions with customers and partners?
Keywords: Enterprise Collaboration System; Web 2.0; Enterprise 2.0; Management Techniques; Enterprise System.
Unsupervised Key Frame Selection using Information Theory and Color Histogram Difference
by Janya Sainui, Masashi Sugiyama
Abstract: Key frame selection is one of the important research issues in video content analysis, as it helps effective video browsing and retrieval as well as efficient storage. Key frames should typically be as different from each other as possible but, at the same time, cover the entire content of the video. However, existing methods still lose some meaningful frames due to an inaccurate evaluation of the differences between frames. To address this issue, we propose a novel method of key frame selection which combines an information theoretic measure, called quadratic mutual information (QMI), with the colour histogram difference. These two criteria are used to produce an appropriate frame difference measure. Through experiments, we demonstrate that the proposed key frame selection method achieves better coverage of the entire video content with minimum redundancy of key frames compared with the competing approaches.
Keywords: Key frame selection; Similarity measure; Information theory; Quadratic mutual information; Color histogram difference.
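As a rough illustration of the colour histogram difference mentioned in the abstract above, the sketch below compares two frames by their normalised histograms. The frame representation (a non-empty flat list of 8-bit values for one channel), the bin count and the function names are assumptions made for this example. The QMI term and the way the paper combines the two criteria are not reproduced here.

```python
def histogram(pixels, bins=8, max_value=256):
    """Normalised histogram of 8-bit values for one channel (pixels must be non-empty)."""
    counts = [0] * bins
    width = max_value / bins
    for p in pixels:
        counts[min(int(p / width), bins - 1)] += 1
    total = len(pixels)
    return [c / total for c in counts]


def histogram_difference(frame_a, frame_b, bins=8):
    """Half the L1 distance between histograms: 0 = identical, 1 = disjoint."""
    ha = histogram(frame_a, bins)
    hb = histogram(frame_b, bins)
    return 0.5 * sum(abs(a - b) for a, b in zip(ha, hb))
```

A frame whose difference from the last selected key frame exceeds some threshold could then be treated as a key frame candidate; the threshold itself would have to be tuned per video.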
Building Acoustic Model for Phoneme Recognition using PSO-DBN
by B.R. Laxmi Sree, M.S. Vijaya
Abstract: Deep neural networks have shown their power in numerous classification problems, including speech recognition. This paper proposes to enhance the power of the deep belief network (DBN) further by pre-training the neural network using particle swarm optimisation (PSO). The objective of this work is to build an efficient acoustic model with deep belief networks for phoneme recognition with much better computational complexity. Using PSO for pre-training drastically reduces the training time of the DBN and also decreases the phoneme error rate (PER) of the acoustic model built to classify the phonemes. Three variations of PSO, namely the basic PSO, second generation PSO (SGPSO) and the new model PSO (NMPSO), are applied in pre-training the DBN to analyse their performance on phoneme classification. It is observed that the basic PSO performs comparatively better than the other PSO variants considered in this work, most of the time.
Keywords: Phoneme Recognition; Deep Neural Networks; Particle Swarm Optimisation; Acoustic Model; Tamil Speech Recognition; Deep Learning; Deep Belief Networks.
Efficient search for top-k discords in streaming time series
by Giao Bui Cong, Duong Tuan Anh
Abstract: The problem of anomaly detection in streaming time series has received much attention recently. The problem addresses finding the most anomalous subsequence (discord) over a time-series stream, which might arrive at high speed. Finding the top-k discords is more useful than finding only the most unusual subsequence, since users can then choose among the top-k discords instead of being given only one. Hence, an efficient method of searching for top-k discords in streaming time series is proposed in this paper. The method uses a lower bound threshold, a lower bounding technique on a common dimensionality reduction transform, and a state-of-the-art technique for computing the distance between two time-series subsequences to prune off unnecessary distance calculations. The three techniques are arranged in a cascading fashion to speed up the method. Furthermore, the proposed method can return a set of top-k discords on the fly. The experimental results show that the proposed method can acquire quality discords nearly identical to those obtained by HOT SAX, a well-known method of anomaly detection. Remarkably, our proposed method demonstrates a fast response in handling time-series streams at high speed.
Keywords: anomaly detection; discord; streaming time series.
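To make the notion of a top-k discord concrete, here is a minimal brute-force sketch over a static series. It is not the pruned, streaming method the abstract describes: it only shows the quantity being searched for, namely the subsequences whose nearest non-overlapping neighbour is farthest away. The function names and the choice of plain Euclidean distance are assumptions for this example.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k_discords(series, m, k):
    """Return up to k (start_index, nn_distance) pairs, most anomalous first.

    A subsequence's score is the distance to its nearest non-overlapping
    neighbour; reported discords are also kept mutually non-overlapping.
    """
    n = len(series) - m + 1
    scores = []
    for i in range(n):
        nearest = math.inf
        for j in range(n):
            if abs(i - j) >= m:  # skip trivial (overlapping) matches
                d = euclidean(series[i:i + m], series[j:j + m])
                nearest = min(nearest, d)
        if nearest < math.inf:
            scores.append((i, nearest))
    scores.sort(key=lambda s: s[1], reverse=True)
    picked = []
    for i, d in scores:
        if all(abs(i - p) >= m for p, _ in picked):
            picked.append((i, d))
        if len(picked) == k:
            break
    return picked
```

The double loop costs a quadratic number of distance computations, which is exactly the work that lower bounds and early pruning aim to avoid.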
Mining Big data streams using Business analytics tools: A bird
by Arunkumar PM, S. Kannimuthu
Abstract: Big data has evolved into a prominent field in the modern computing era. Big data analytics and its impact on extracting business intelligence are becoming indispensable for a plethora of applications. The non-proprietary software revolution paved the way for the evolution of tools like Weka, RapidMiner, Orange and R. Traditional data mining techniques hardly adapt to the requirements of rapid data analysis. Data stream processing algorithms that handle large volumes of data pose a greater challenge in real time. Big data mining requires further improvement of traditional tools to address the challenges of massive data processing. This paper highlights the importance of data stream mining and explores two important open source frameworks, namely massive online analysis (MOA) and scalable advanced massive online analysis (SAMOA). The implications of both tools augur well for further deliberations in the big data research community. Business information system (BIS) models can reach unprecedented heights with the proliferation of these business analytics tools.
Keywords: Big Data; Data mining; Data streams; Massive online analysis; Business Intelligence.
A novel dynamic approach to identifying suspicious customers in money transactions
by Abdul Khalique Shaikh, Amril Nazir
Abstract: Money laundering has a negative impact on the development of the national economy. Anti-money laundering (AML) solutions within financial institutions help to control it. However, one of the fundamental challenges in an AML solution is to identify genuinely suspicious transactions. To identify these transactions, existing research uses pre-defined rules and statistical approaches. However, because the rules are fixed and predetermined, it is highly probable that a normal customer will be identified as a suspicious customer. To overcome this limitation, a novel dynamic approach to identifying suspicious customers in money transactions is proposed, based on dynamic analysis of customer profile features. The experiment has been executed with data on real bank customers and their transactions, and the results are promising in terms of accuracy.
Keywords: AML; anti-money laundering; suspicious transactions; money transaction; dynamic AML analysis; data analysis.
Anomaly detection for elderly home care
by Kurnianingsih Kurnianingsih, Lukito Edi Nugroho, Widyawan Widyawan, Lutfan Lazuardi, Anton Satria Prabuwono, Mahardhika Pratama
Abstract: In this paper, we propose a model for detecting anomalies in elderly home care. Two scenarios are investigated in detecting anomalies: 1) the elderly person's vital signs and their surrounding environment; 2) the mobility patterns of the elderly. We evaluated our proposed model by employing the isolation forest, which detects anomalies using an isolation approach on a random forest of decision trees. We compare isolation forest on unlabelled data with statistical methods on labelled data. Subsequently, to show the reliability of the isolation concept, we compare it with a distance measure concept. The experiment shows that isolation forest has higher detection accuracy and lower prediction error for two attributes in the first scenario, skin temperature and heart rate, whereas in the second scenario, multi-covariance determinant has slightly better accuracy than isolation forest (3.9% difference in accuracy) and fewer prediction errors.
Keywords: anomaly detection; isolation forest; elderly home care.
Multi-Document Based Text Summarization Through Deep Learning Algorithm
by G. PadmaPriya, K. Duraiswamy
Abstract: The proposed approach applies a deep learning algorithm to retrieve an effective text summary for a set of documents. The proposed system consists of two phases: a training phase and a testing phase. The training phase exploits three different algorithms to make the text summarisation process effective. Like every training phase, the proposed training phase uses known data and attributes. After that, the testing phase is implemented to test the efficiency of the proposed approach. For experimentation, we used four document sets selected from DUC (2002). The experimental evaluation showed an average precision of 78%, an average recall of 1 and an average F-measure of 84%.
Keywords: Particle Swarm Optimisation; Text Summarization; Deep Learning Algorithm.
Grey-Wolf Optimizer Based Feature Selection for Feature-Level Multi-Focus Image Fusion
by Sujatha K, D. Shalini Punithavathani, J. Janet, S. Venkatalakshmi
Abstract: This paper proposes optimal ensemble-individual-features (OEIF) for multi-focus image fusion by combining the decision information of individual features. The proposed system consists of three stages. In the first stage, different types of features, such as spatial, texture and frequency features, are extracted from every block of the blurred input images. In the second stage, a grey wolf optimiser (GWO)-based feature validation method is proposed to find suitable features from the source images. This method is based on an iterative process, in which each individual represents a candidate solution for validating/invalidating the features. In the final stage, the ensemble decision based on the optimal individual features is used to fuse the blurred images. We show that the OEIF method compares favourably with the noisy feature-based individual pixel-level and feature-level fusion methods on different multi-focus images, and that the proposed OGWO-based method achieves better visual quality than other methods.
Keywords: Multi-focus image fusion; grey wolf optimiser; feature validation; spatial; texture; frequency.
Online Products Recommendation System using Genetic Kernel Fuzzy C-Means and Probabilistic Neural Network
by Manohar E, D. Shalini Punithavathani
Abstract: Purchasers' reviews play a significant role in online shopping decisions, as customers want to learn the opinions of other purchasers about online products. However, selecting the most appropriate product from the best website is a challenging problem for online users. Accordingly, this paper proposes a hybrid recommendation system for identifying customer preferences and recommending the most appropriate product. To do this, the dataset is first collected and prepared in the pre-processing step. Genetic kernel fuzzy C-means (GAKFCM) is then used for usage cluster formation. Different features are extracted from each cluster based on user interest level. The user interest levels are used as features for the classifier to extract user knowledge. Based upon the user interest level, the product recommendation is made using a probabilistic neural network (PNN). The simulation results show a high precision rate, which indicates that the proposed method is effective.
Keywords: website; web-log; ranking; rating; review; products; Genetic Kernel Fuzzy C-Means; probabilistic neural network.
Hybridising Neural Network and Pattern Matching under Dynamic Time Warping for Time Series Prediction
by Thanh Son Nguyen
Abstract: Pattern matching-based forecasting models are attractive due to their simplicity and their ability to predict complex nonlinear behaviours. The Euclidean measure is the most commonly used metric for pattern matching in time series. However, its weakness is that it is sensitive to distortion along the time axis, which can influence forecasting results. The dynamic time warping (DTW) measure is introduced as a solution to this weakness of the Euclidean distance metric. In addition, artificial neural networks (ANNs) have been widely used in time series forecasting to capture complex relationships with a variety of patterns. In this work, we propose an improved hybrid method which is an affine combination of a neural network model and a DTW-based pattern matching model for time series prediction. This method can take full advantage of the individual strengths of the two models to create a more effective approach for time series prediction. Experimental results show that our proposed method outperforms the neural network model and the DTW-based pattern matching method used separately in time series prediction.
Keywords: time series; pattern matching; artificial neural network; time series prediction; dynamic time warping; k-nearest neighbour.
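The two ingredients named in the abstract above, a DTW distance and an affine combination of two forecasts, can be sketched as follows. This is the textbook dynamic-programming form of DTW with an absolute-difference local cost. The function names and the weight `alpha` are assumptions for illustration, and how the paper chooses the weight is not stated in the abstract.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping with |x - y| local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor cells.
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]


def affine_combination(nn_forecast, dtw_forecast, alpha):
    """Blend two forecasts; the weights alpha and (1 - alpha) sum to one."""
    return alpha * nn_forecast + (1 - alpha) * dtw_forecast
```

Because DTW may repeat points along the time axis, a pattern and a locally stretched copy of it can have distance zero, whereas a point-by-point Euclidean comparison is not even defined for sequences of different lengths.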
REFERS: Refined & Effective Fuzzy E-commerce Recommendation System
by Sankar Pariserum Perumal, Ganapathy Sannasi, Kannan Arputharaj
Abstract: Online shopping culture is gaining traction globally, and some of the biggest beneficiaries of this e-commerce shift are Amazon, eBay, etc. Recommendation systems guide online users in a personalised manner to choose what they want according to their interest in each product present in the catalogue list. In such a scenario, the existing systems need complete information for making recommendations, which is not always possible in real applications. Therefore, a novel refined and effective fuzzy e-commerce recommendation system is proposed in this paper that combines the benefits of differences in importance among the rating factors of a single user with a new similarity measure approach, aiming at an improved recommendation list for the e-commerce user. The proposed methodology has been implemented using the new similarity measure on experimental datasets, and the refined scores for e-commerce website-based unlocked mobile phones are compared in this work against classic similarity measures.
Keywords: Fuzzy recommendation system; degree of similarity measure; rating factor importance; collective expert rating.
Decision tree classifier for university single rate tuition fee system
by Taufik F. Abidin, Samsul Rizal
Abstract: The regulation about single rate tuition fee for undergraduate study at state universities in Indonesia was enacted in 2013. The tuition fee is calculated based on the needs of each academic program and the regional cost index. The fee is grouped into several categories and set differently for each university. For Syiah Kuala University, located in Banda Aceh, Indonesia, the tuition fee is grouped into five different categories. This paper describes the construction of J48 decision tree classifier and evaluates its performance during training and testing phases when compared to ID3 and Naive Bayes classifiers to determine the category. The results show that the J48 decision tree classifier outperforms the other two classifiers in both phases. In the training phase, the F-measure and ROC for the J48 decision tree classifier are 0.889 and 0.973, respectively, and in the testing phase, the F-measure and ROC are 0.911 and 0.987, respectively.
Keywords: Decision tree classifier; multi-class classification; university single rate tuition fee system.
Using Diverse Set of Features to Design a Content-Based Video Retrieval System Optimized by Gravitational Search Algorithm
by S. Padmakala, Ganapathy Sankar Anandha Mala, K.M. Anandkumar
Abstract: This paper describes a content-based video retrieval (CBVR) approach using four varieties of features and 12 distance measurements, which is optimized by the gravitational search algorithm (GSA). Initially, the CBVR technique extracts five kinds of features, such as color, texture, shape, image and audio features, that belong to each frame. It then applies particular distance measurements for every sort of feature to compute the similarity between the query frame and the remaining frames in the database. In this paper, we have used GSA to find the nearly optimal combination between the features and their respective similarity measurements. Finally, the videos matching the query are retrieved from the video database. For experimentation, we used two types of databases: sports video and UCF sports action datasets. The experimental results demonstrate that the proposed CBVR method shows better performance than other existing methods.
Keywords: video retrieval; distance measurements; color; texture; shape; audio; CBVR; similarity; combinations.
Weighted Neuro-Fuzzy Hybrid Algorithm for Channel Equalization in Time Varying Channel
by Zeeshan A Abbasi, Zainul Abdin Jaffery
Abstract: In MIMO-OFDM communication systems, accurate channel estimation and equalisation play a major role. In this paper, we use a weighted neuro-fuzzy hybrid (WNFH) channel estimation algorithm for channel equalisation. The pilot is designed based on a combination of a neural network and a fuzzy logic system. The neural network is trained using a hybrid training algorithm that combines scaled conjugate gradient (SCG) with the group search optimiser (GSO) algorithm. In the transmitter section, the proposed system contains quadrature amplitude modulation (QAM) and a transmitter. The minimum mean-square error (MMSE) estimation design takes the channel prediction error into account to improve the performance of symbol detection. The designed pilot sequences reduce the MMSE of channel estimation and show great superiority in the MIMO-OFDM system. Experimental results show that the channel estimation is effective.
Keywords: MIMO-OFDM; Group Search Optimizer; Scaled Conjugate Gradient; Channel Estimation.
Discrete Weibull regression for modeling football outcomes
by Alessandro Barbiero
Abstract: We propose the use of the discrete Weibull distribution for modeling football match results, as an alternative to existing Poisson and generalized Poisson models. The numbers of goals scored by the two teams playing a football match are regarded as a pairwise observation and are modeled first through two independent discrete Weibull variables, and then through two dependent discrete Weibull variables, using a copula approach that accommodates non-null correlation. The parameters of the bivariate discrete Weibull distributions are assumed to depend on covariates such as the attack and defense abilities of the two teams and the 'home effect'. Several discrete Weibull regression models are proposed and then applied to the 2015-2016 Italian Serie A. Although the interpretation of parameters is less immediate than in the case of bivariate Poisson models, these models represent a suitable alternative, which can also be applied in fields other than sport data analysis.
Keywords: count data; count regression model; Frank copula; Poisson distribution; sport analytics.
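For readers unfamiliar with the discrete Weibull distribution, the sketch below evaluates its probability mass function under the common type I parameterisation, P(X = x) = q^(x^beta) - q^((x+1)^beta) for x = 0, 1, 2, ..., with 0 < q < 1 and beta > 0. Whether the paper uses exactly this parameterisation is an assumption, and the function names and parameter values are illustrative only. Only the independent case is shown; the copula-based dependent model and the regression on covariates are not reproduced.

```python
def discrete_weibull_pmf(x, q, beta):
    """P(X = x) = q**(x**beta) - q**((x + 1)**beta) for x = 0, 1, 2, ..."""
    if x < 0:
        return 0.0
    return q ** (x ** beta) - q ** ((x + 1) ** beta)


def independent_score_probability(home_goals, away_goals, home_params, away_params):
    """Joint probability of a score line when the two goal counts are independent.

    home_params and away_params are (q, beta) tuples.
    """
    return (discrete_weibull_pmf(home_goals, *home_params)
            * discrete_weibull_pmf(away_goals, *away_params))
```

With beta = 1 the distribution reduces to the geometric distribution, which gives a quick sanity check on the formula.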
Prediction of Process Parameters in Electrical Discharge Machining Using Response Surface Methodology and ANN: An Experimental Study
by T.M. Chenthil Jegan, R. Chitra, V.S. Thangarasu
Abstract: In the present work, the process parameters of Electro Discharge Machining are predicted by Response Surface Methodology and Artificial Neural Network (ANN) for AA6061. AA6061 is extensively used in aircraft and aerospace applications. In order to reduce the depletion of the material during machining, prediction of appropriate machining parameters is essential. Current, Pulse On, Pulse Off and Flushing Pressure are considered as input parameters for prediction. Experiments were conducted with those parameters at five different levels, and data on the process responses were collected for optimization. The material removal rate and surface roughness measured for each experimental run were compared and used to fit a quadratic mathematical model in Response Surface Methodology. An ANN with the back propagation algorithm was used to develop the relationship between the input parameters and the predominant output responses. The performance of the developed model is analyzed using ANOVA and regression plots. The results showed that the ANN model is better for empirical modelling.
Keywords: EDM; Design of Experiments; Response Surface Methodology; Artificial Neural Network; Material Removal Rate; Surface Roughness.
Implementation of Multi Node Hadoop Virtual Cluster on Open Stack Cloud Environments
by Karthikeyan Saminathan, R. Manimegalai
Abstract: Nowadays, computing plays a vital role in information technology and all other fields. Cloud computing is one of the biggest milestones among the leading next-generation technologies and is booming in the IT field and business sectors. In day-to-day life, data is being generated in enormous amounts, on the order of terabytes (TB), petabytes (PB) and zettabytes (ZB). Hadoop Map Reduce is a popular distributed computing paradigm for processing data-intensive jobs in the cloud. Completion time goals or deadlines of Map Reduce jobs set by users are becoming crucial in existing cloud-based data processing environments like Hadoop. This paper proposes a real-time implementation of a single node Hadoop cluster on an Open Stack private cloud that handles huge data sets in parallel virtual machines, and compares its average execution time for inputs of different sizes.
Keywords: Cloud; Data intensive; Hadoop; Map Reduce; Open Stack; Cluster.
Research on Aircraft Landing Schedule using Opposition Based Genetic Algorithm with Cauchy Mutation
by C. Nithyanandam, Gabriel Mohankumar
Abstract: Optimal scheduling of airport runway operations plays a significant role in air transportation. Arrival runways are a crucial resource in the air traffic system. Arrival delays have an immense impact on airline operations as well as on cost. An important task is the planning of airport operations such as the arrival and departure of aircraft. This paper describes a technique that addresses the execution time as well as the penalty cost of every aircraft. The experiments demonstrate that aircraft should land on the runway only when there is no congestion on the particular path, and that problems arise when congestion does occur. To eliminate these problems, neural networks and genetic algorithms with Cauchy mutation are used to remove the congestion that occurs on the runway, and the proposed technique also reduces the penalty cost to be charged.
Keywords: Artificial Neural Network (ANN); Aircraft Selection; Aircraft Landing Problem; Opposition Genetic Algorithms with Cauchy Mutation; Runway Selection; Scheduling.
ScrAnViz: A Tool for Analytics and Visualization of Unstructured Data
by Sriraghav Kameswaran, V.S. Felix Enigo
Abstract: Existing big data visualization tools are meant for visualizing structured data. However, surveys show that about 80-90% of potentially usable business information is in unstructured format. Analyzing unstructured data is challenging due to its lack of structure and relational form. In this paper, we propose a tool called ScrAnViz that can structure data, perform analysis and provide visualization, thereby helping business people and end users in decision making. An attribute-based opinion mining algorithm has been developed and implemented. Performance analysis shows that the algorithm reduces the search time by a factor of three compared with traditional document-level sentiment analysis systems.
Keywords: Unstructured data; Data Analytics; Sentiment Analysis; Opinion Mining; Data visualization.
Link prediction in multilayer networks
by Deepak Malik, Anurag Singh
Abstract: Link prediction in large networks has gained popularity in recent years. Researchers have proposed various methods for finding missing links, including common neighbour, Jaccard coefficient, etc., based on the proximity of the nodes. These methods are based on the topological features of the network. They have limitations, as they treat all common neighbours of a pair of nodes as equal. A new method, common neighbours common neighbour (CNCN), is proposed, which captures the different behaviour of the common nodes for a pair of nodes. Its performance is better than that of the existing methods in a single layer network. Link prediction is also useful in multiplex networks, where it is more useful than in a single layer network, as several layers may give more information about a node than a single layer. Two methods are proposed using dynamic and static weights.
Keywords: common neighbours; complex network; link prediction.
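The two baseline scores named in the abstract above, common neighbour and Jaccard coefficient, can be sketched as follows for an undirected graph stored as a dictionary of neighbour sets. The graph representation and function names are assumptions for this example, and the proposed CNCN score is not reproduced because the abstract does not define it.

```python
def common_neighbours(graph, u, v):
    """Number of nodes adjacent to both u and v."""
    return len(graph[u] & graph[v])


def jaccard_coefficient(graph, u, v):
    """Shared neighbours divided by all neighbours of either node."""
    union = graph[u] | graph[v]
    return len(graph[u] & graph[v]) / len(union) if union else 0.0


# Hypothetical toy network: a-b, a-c, b-c, b-d, c-d (no edge a-d yet).
toy_graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c"},
}
```

Both scores count every shared neighbour identically, which is the limitation the abstract points to.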
Fuzzy Based Review Rating Prediction in E-Commerce
by P. Velvizhy, A. Pravi, M. Selvi, S. Ganapathy, A. Kannan
Abstract: Opinion Mining is an ongoing research area in E-commerce which aims at analyzing the people's opinions, sentiments and emotions. Moreover, the existing E-commerce systems allow the users to share their feedback in the form of textual reviews regarding the products and services. It also allows the consumers to give ratings for products that help in future recommendation of products. In this research work, a computational framework for efficiently predicting the consumer review ratings on the products has been proposed. The proposed framework integrates Dimensionality Reduction, Genetic Algorithm, Fuzzy C-Means and Adaptive Neuro-Fuzzy Inference techniques to overcome the limitations of the existing systems. Experiments have been conducted in this work using Amazon dataset consisting of reviews for different products. This system provides better performance and prediction accuracy for review ratings when it is compared with the related work.
Keywords: sentiment analysis; review ratings prediction; dimensionality reduction; genetic algorithm; data mining; fuzzy c means.
A Technique for Semantic Annotation and Retrieval of E-Learning Objects
by Balavivekanandhan A
Abstract: The primary objective of my research is to design and develop a semantic annotation and retrieval model for e-learning documents. In the training phase, documents from different domains are taken and the informative words from each document are obtained based on balanced mutual information and the frequency of contents in each document. We then use the informative words to identify the superordinates and the objects. The superordinates, the informative words and the objects from each document give the relations and properties of that document, which are then used to cluster the documents. In the testing phase, we give a query or a document as input to the system to retrieve the relevant documents. If a document is given as input, the relations and properties of that document are first identified and then used to retrieve the relevant documents.
Keywords: e-learning; document clustering; balanced mutual information; one way matching; cluster based matching.
A Collaborative Content-Based Movie Recommender System
by Bolanle Ojokoh, Oluwatosin Olatunbosun Aboluje, Tobore Igbe
Abstract: In this paper, Pearson's correlation coefficient is employed for collaborative filtering due to its ability to handle numerical data as well as to determine linear relationships among existing users. Its steps involve a user-user representation, similarity generation and prediction generation, with the goal of producing a predicted opinion of the active user about a specific item. The concept of parental control is also incorporated as an enhancement. Evaluation of the system was done using precision, recall, F-measure, discounted cumulative gain (DCG), idealised discounted cumulative gain (IDCG), normalised discounted cumulative gain (nDCG) and mean absolute error (MAE). Three hundred forty-six datasets were used, out of which 126 were gathered from local video shops and 220 were extracted from the Internet Movie Database (IMDb). These were used for the experiments, and the results generated through mining of data obtained from profiles and ratings of system users show that the average ranking quality of the collaborative filtering algorithm is 95.9%.
Keywords: Movies; Recommendation; Collaborative Filtering; Information Filtering; Correlation Coefficient; Evaluation.
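The three steps listed in the abstract above (user-user representation, similarity generation, prediction generation) are sketched below in their standard textbook form: Pearson correlation over co-rated items, followed by a mean-centred weighted average. The data layout (a dictionary of per-user rating dictionaries) and the function names are assumptions, and the paper's own formulas and its parental control step may differ.

```python
import math


def pearson(ru, rv):
    """Pearson correlation over the items both users have rated."""
    common = [item for item in ru if item in rv]
    if len(common) < 2:
        return 0.0
    mu = sum(ru[i] for i in common) / len(common)
    mv = sum(rv[i] for i in common) / len(common)
    num = sum((ru[i] - mu) * (rv[i] - mv) for i in common)
    den = math.sqrt(sum((ru[i] - mu) ** 2 for i in common)
                    * sum((rv[i] - mv) ** 2 for i in common))
    return num / den if den else 0.0


def predict(ratings, active, item):
    """Mean-centred, similarity-weighted prediction for one (user, item) pair."""
    mean_active = sum(ratings[active].values()) / len(ratings[active])
    num = den = 0.0
    for user, user_ratings in ratings.items():
        if user == active or item not in user_ratings:
            continue
        sim = pearson(ratings[active], user_ratings)
        mean_user = sum(user_ratings.values()) / len(user_ratings)
        num += sim * (user_ratings[item] - mean_user)
        den += abs(sim)
    # Fall back to the active user's mean when no neighbour is informative.
    return mean_active + num / den if den else mean_active
```

The prediction is not clipped to the rating scale here; a real system would usually clamp it to the valid range.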
Location based Personalized Recommendation systems for the Tourists in India
by Madhusree Kuanr, Sachi Nandan Mohanty
Abstract: This study examines collaborative filtering in a recommender system by categorising users according to their choices of place, food, local item purchase, etc. The proposed system stores the opinions of local users about the sites, and about the foods and products available for purchase at those sites. It uses a collaborative filtering technique, based on cosine similarity, to find the users most similar to a given querying user. The system recommends the best sites, along with the good foods and products available at those sites, according to recent data. Two hundred (male = 110, female = 90) married individuals from Bhubaneswar, Odisha (India) participated in this survey. The results revealed that collaborative filtering is the more reliable technique for personalised recommender systems. Experimental results show the performance of the proposed system in terms of precision, recall and F-measure values.
Keywords: collaborative filtering; recommender systems; user profile
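The cosine similarity step mentioned in the abstract above can be sketched as follows. Representing each user as a fixed-length vector of preference scores, and the names used here, are assumptions for this example rather than the paper's actual user profile.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length preference vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar_users(profiles, query, top_n=2):
    """Rank stored users by cosine similarity to the querying user's vector."""
    scored = [(name, cosine_similarity(vec, query)) for name, vec in profiles.items()]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:top_n]
```

Recommendations for the querying user would then be drawn from the sites, foods and products rated highly by the top-ranked users.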
Stability analysis of feature ranking techniques in the presence of noise: a comparative study
by Iman Ramezani, Mojtaba Khorram Niaki, Milad Dehghani, Mostafa Rezapour
Abstract: Noisy data is one of the common problems associated with real-world data, and may affect the performance of data models, the consequent decisions and the performance of feature ranking techniques. In this paper, we show how the stability of different feature ranking methods changes in the presence of attribute noise and class noise. We use Kendall's Tau rank correlation and Spearman rank correlation to evaluate the stability of various feature ranking methods, and quantify the degree of agreement between the ordered list of features created by a filter on a clean dataset and its outputs on the same dataset corrupted with different combinations of noise levels. According to the results of the Kendall and Spearman measures, Gini index (GI) and information gain (IG) have the best performances, respectively. Nevertheless, the results of both the Kendall and Spearman measures show that ReliefF (RF) is the most sensitive method, with the worst performance.
Keywords: attribute noise; class noise; filter-based feature ranking; threshold-based feature ranking; stability; Kendall's Tau rank correlation; Spearman rank correlation.
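The two agreement measures used in the abstract above can be sketched for the simplest case of two rankings of the same features with no ties. The tie-free restriction and the function names are assumptions for this example; the paper's exact variants are not stated in the abstract.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a for two rankings of the same items without ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        s = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    pairs = len(rank_a) * (len(rank_a) - 1) / 2
    return (concordant - discordant) / pairs


def spearman_rho(rank_a, rank_b):
    """Spearman's rho from squared rank differences (no ties)."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

Here `rank_a[i]` and `rank_b[i]` are the positions of feature `i` in the rankings obtained on the clean and the noisy dataset. A value of 1 means the ranking is unchanged by noise, and lower values indicate a less stable ranking method.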
Topic-driven top-k similarity search by applying constrained meta-path based in content-based schema-enriched heterogeneous information network
by Phu Pham, Phuc Do
Abstract: In this paper, we propose a model called TopCPathSim to address the problem of topic-driven similarity searching based on constrained meta-paths (also called restricted meta-paths) between same-typed objects within content-based heterogeneous information networks (HINs). The topic distributions over content-based objects, such as papers/articles on a bibliographic network or user comments/reviews on social networks, are obtained by using the LDA topic model. We conduct experiments on the real DBLP, Aminer and ACM datasets, which demonstrate the effectiveness of our proposed model. Throughout the experiments, our proposed model gains about 73.56% in accuracy. The output results also show that the combination of a probabilistic topic model with constrained meta-paths is promising for improving the output quality of topic-oriented similarity searching in content-based HINs.
Keywords: constrained meta-path; content-based heterogeneous information network; topic-driven similarity search; LDA; topic modelling.
Deep learning framework for early detection of intrusion in Virtual Environment
by Madhu Priya G, S. Mercy Shalinie, P. Mohana Priya
Abstract: Today's business enterprises adopt cloud-based services as their architectural design. Intelligence techniques incorporated into the architecture give massive tangible and intangible benefits in terms of performance and reliability. Such cloud-based business architecture faces many threats to its availability. The DDoS attack is the most prominent threat, as its impact is greater in virtual resource-based cloud infrastructure. Therefore, there is a need for a business intelligence-based framework to detect the attack early by monitoring the virtual network traffic. The proposed intelligence framework uses a deep learning framework, the Continuous Discriminative-Deep Belief Network (CD-DBN). CD-DBN dynamically captures attack patterns from the network data, analyzes the data and detects intrusion into the cloud. The observed results show that the earlier detection approach guarantees the availability of cloud services to legitimate users and enhances cloud resource usage.
Keywords: Deep Learning; Restricted Boltzmann Machine; Deep Belief Network; Cloud Environment; Virtualization; Hypervisor; Intrusion Detection; Availability threat; DDoS attack; SysBench benchmark suite.
Analysing Thyroid Disease using Density Based Clustering Technique
by Tanupriya Choudhury, Veenita Kunwar, A. Sai Sabitha, Abhay Bansal, Tanupriya Choudhury
Abstract: Data mining in medicine has been used to discover unknown patterns
in health data and to obtain diagnostic results. The healthcare industry generates
large amounts of complex data about patients, diseases and treatments. Data
mining in healthcare provides benefits such as detecting fraud, making medical
facilities available to patients at low cost, ensuring high-quality patient care and
shaping healthcare policies. Disease detection has become essential due to the
increasing number of health issues occurring day by day. Thyroid disease has become one
such concern, with numerous cases being detected yearly. It causes improper
functioning of the thyroid gland. In this paper, a clustering technique has been
used to detect and understand factors influencing thyroid disease. The DBSCAN
algorithm has been used as it can handle clusters of varying shapes and sizes
and is noise resistant. PCA has also been applied to find patterns in
high-dimensional data and to reduce dimensionality. The experimental setup has been
implemented in RapidMiner.
Keywords: Data mining; Clustering; Thyroid disease; DBSCAN; Principal component analysis.
A Simple Transform Domain Based Low Level Primitives Preserving Texture Synthesis
by S. Anuvelavan, M. GANESH, P. Ganesan
Abstract: In this work, a new patch-based texture synthesis scheme with
orthogonal polynomials model coefficients is presented. The proposed scheme
has four phases. In the first phase, a block matching technique that identifies the
best match to synthesise in the larger output image is designed in terms
of ordered orthogonal polynomials model coefficients. In case of a successful
block match, called a patch-hit, the proposed scheme finds candidate blocks
with triangular search in the next phase. In the patch selection phase, the
proposed scheme considers a subset of the orthogonal polynomials model
coefficients among the blocks for the purpose of synthesis, which consumes
less memory and time. The synthesised output is smoothed in the final
phase by preserving the low-level contents between the synthesised patches.
The performance of the proposed scheme is measured with energy, contrast,
correlation, homogeneity and entropy between the original and synthesised
images and is also compared with existing texture synthesis schemes. The
results are encouraging.
Keywords: Texture Synthesis; Orthogonal Polynomials; Patch-Hit; Candidate Block; Patch Selection.
Optimal Region growing and Multi-kernel SVM for fault detection in Electrical Equipments using Infrared Thermography Images
by C. Shanmugam, E. Chandira Sekaran
Abstract: Infrared thermography (IRT) plays an essential part in
observing and examining thermal defects of electrical equipment without
interruption, which is of vital importance for the reliability of electrical equipment.
This paper determines whether electrical parts are faulty or non-faulty with the help
of a segmentation and classification model. Features are calculated from the
input thermal images, regions of interest (ROI) are segmented using the
optimal region growing (ORG) technique, and faults are classified using a multi
kernel support vector machine (MKSVM). In the tests, the classification
performance of different input features is assessed. To enhance the
performance of the segmentation, the whale optimisation (WO) procedure
is used. Before classification, the extracted features of the electrical
components are fused into a single vector for each image using the feature level
fusion (FLF) procedure. The multi-kernel classification performance indices,
including sensitivity, specificity and accuracy, are used to identify the most
appropriate input feature and the best arrangement of classifiers. The
performance of the SVM is compared with that of a neural network. The comparison
results demonstrate that our technique achieves superior
performance, with an accuracy of 98.21%.
Keywords: Feature extraction; Whale optimisation; Support vector machine; optimisation; Classification and fault detection; Infrared thermography.
ComRank: community-based ranking approach for heterogeneous information network analysis and mining
by Phu Pham, Phuc Do
Abstract: In this paper, we propose the ComRank model to address the
problem of ranking objects of a specific type over the generated topic-driven
communities in information networks. The topic-driven communities are
generated by applying LDA latent topic modelling. Our proposed
ComRank model directly generates ranking results for objects of a specific type
in the different network communities. We apply our approach to construct a
scholarly recommendation system, which supports researchers in finding
appropriate citations or potential authors to cooperate with while conducting
scientific research. The ComRank model is tested with the real-world DBLP
bibliographic network dataset. The experimental results demonstrate that our
proposed model can generate meaningful ranking results within detected
Keywords: information network; heterogeneous network; bibliographic network; community detection; community-based ranking; path-based ranking.
AGS: A Precise and Efficient AI Based Hybrid Software Effort Estimation Model
by Vignaraj Vikraman, S. Srinivasan
Abstract: Predicting the amount of effort required to develop software is a tedious
process for software companies. Hence, predicting software development
effort remains a complex issue that draws extensive research attention.
The success of a software development process depends considerably on proper
estimation of the effort required to develop that software. Effective software effort
estimation techniques enable project managers to schedule software life cycle
activities properly. The main objective of this paper is to propose a novel
approach in which an artificial intelligence (AI)-based technique, called the AGS
algorithm, is used to estimate software effort. AGS is a hybrid
method combining three techniques, namely: adaptive neuro fuzzy inference
system (ANFIS), genetic algorithm and satin bower bird optimisation (SBO)
algorithm. The performance of the proposed method is assessed using a
standard dataset with a real-time benchmark and many attributes. The major
metrics used in the performance evaluation are correlation coefficient (CC),
kilo lines of code (KLoC) and complexity of the software. The experimental
results show that the prediction accuracy of the proposed model is better than
that of the existing algorithmic models.
Keywords: Software Effort Estimation; AI; ANFIS; Lines of code (LoC); Genetic Algorithm (GA); Satin Bower Bird Optimiser (SBO); Correlation Co-efficient (CC); Kilo Lines of Code (KLoC); Software Complexity.
High dimensional sentiment classification of product reviews using evolutionary computation
by Sonu Lal Gupta, Anurag Singh Baghel
Abstract: Feature selection is an important process in text classification. In
general, traditional feature selection approaches are based on exhaustive search
and hence become inefficient due to a large search space. Further, this task becomes
more challenging as the number of features increases. Recently, evolutionary
computation (EC)-based search techniques have received a lot of attention for
solving the feature selection problem in high-dimensional feature spaces. This paper
proposes a particle swarm optimisation (PSO)-based feature selection approach
which is capable of generating the desired number of high-quality features from
a large feature space. The proposed algorithm is tested on a large dataset and
compared with several existing state-of-the-art algorithms used for feature
selection. The accuracy of the underlying classifier has been considered as the
measure of performance. Our results demonstrate that the proposed
PSO-based feature selection approach outperforms the other traditional feature
selection algorithms for all the considered classifiers.
Keywords: sentiment classification; feature selection; particle swarm
optimisation; PSO; evolutionary computation; support vector machine; SVM;
naïve Bayes; NB; mutual information; MI; chi-square; CHI.
Using bagging to enhance clustering procedures for planar shapes
by Elaine Cristina De Assis, Renata Souza, Getulio José Amorim Do Amaral
Abstract: Partitional clustering algorithms find a partition maximizing or minimizing some numerical criterion. Statistical shape analysis is used to make decisions by observing the shape of objects. The shape of an object is the information that remains when the effects of location, scale and rotation are removed. This paper introduces clustering algorithms suitable for planar shapes. Four numerical criteria are adapted to each algorithm. In order to escape from local optima and reach a better clustering, these algorithms are performed in the framework of Bagging procedures. Simulation studies are carried out to validate the proposed methods, and two real-life data sets are also considered. The quality of the results is assessed by the corrected Rand index. The application of the proposed algorithms showed their effectiveness under different clustering criteria, and combining the Bagging method with the clustering algorithms provided substantial gains in the quality of the clusters.
Keywords: Statistical Shape Analysis; Partitional Clustering Methods; Bagging Procedure.
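The corrected Rand index mentioned in the abstract above is commonly computed from the contingency table of two partitions. The following is a minimal sketch of that standard formula, not the authors' code; the function name `corrected_rand_index` and the toy label lists are illustrative assumptions.

```python
from collections import Counter
from math import comb


def corrected_rand_index(labels_a, labels_b):
    """Corrected (adjusted) Rand index between two partitions of the same items.

    Illustrative helper: 1.0 means identical partitions (up to relabelling),
    values near 0 mean agreement no better than chance.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: partitions trivially agree

    # Contingency counts: items per (cluster in A, cluster in B) cell.
    cells = Counter(zip(labels_a, labels_b))
    rows = Counter(labels_a)
    cols = Counter(labels_b)

    # Number of item pairs placed together in each cell, row and column.
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / total_pairs  # agreement expected by chance
    maximum = (sum_rows + sum_cols) / 2           # largest attainable agreement
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both partitions
    return (sum_cells - expected) / (maximum - expected)


# Toy labels (made up): same grouping with swapped names, then a crossed grouping.
same = corrected_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = corrected_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the index corrects for chance, it can be negative when two partitions agree less than random labelings would, as in the crossed example.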
Impact of Clustering on quality of Recommendation in Cluster based Collaborative Filtering: an Empirical Study
by MONIKA SINGH, Monica Mehrotra
Abstract: In-memory nearest neighbour computation is a typical approach for
collaborative filtering (CF) due to its high recommendation accuracy. However,
this approach scales poorly: its performance declines with the
rapid increase in the number of users and items in typical
merchandising applications. One of the popular techniques to mitigate the
scalability issue is cluster-based collaborative filtering (CBCF), which uses
a clustering approach to group the most similar users/items from the complete dataset. In
this work we present a detailed analysis of the impact of clustering on the CF
approach. Specifically, we study how the extent of clustering impacts
collaborative filtering systems in terms of quality of predictions, quality of
recommendations, throughput and coverage. Based on the empirical results
obtained from two datasets, Movielens100K and Jester, we conclude that with
an increasing number of clusters the quality of predictions, the quality of
recommendations and the throughput are enhanced, but the coverage provided
by clustered subsystems declines.
Keywords: Recommender Systems; Collaborative Filtering; Clustering; Prediction; Nearest neighbors; Clustering based collaborative filtering; Average recommendation time; Coverage; Quality of predictions and Qua.
Efficient Text Document Clustering with New Similarity Measures
by Lakshmi R, S. Baskar
Abstract: In this paper, two new similarity measures, namely distance of term
frequency-based similarity measure (DTFSM) and presence of common
terms-based similarity measure (PCTSM), are proposed to compute the
similarity between two documents for improving the effectiveness of text
document clustering. The effectiveness of the proposed similarity measures is
evaluated on the Reuters-21578 and WebKB datasets for clustering the documents
using K-means and K-means++ clustering algorithms. The results obtained by
using the proposed DTFSM and PCTSM are significantly better than other
measures for document clustering in terms of accuracy, entropy, recall and
F-measure. It is evident that the proposed similarity measures not only improve
the effectiveness of the text document clustering, but also reduce the
complexity of similarity measures based on the number of required operations
during text document clustering.
Keywords: Document Clustering; Similarity Measures; Accuracy; Entropy; Recall; F-Measure; K-means clustering Algorithm.
XML web quality analysis by employing MFCM clustering Technique and KNN classification
by M. Gopianand, P. Jaganathan
Abstract: The great accomplishment of web search engines is keyword search, which is the most popular search method for regular consumers. It permits the consumer to create queries without knowledge of the query language or the database schema, so it is also considered a user-friendly method. The XML web has to be of high quality if queries are to be answered exactly. Here we propose a method to assess the quality of the XML web by analysing the keywords present in the XML web based on the respective keyword search. In our proposed method, we collect a number of XML documents and cluster them by keyword, depending on the type of XML files. Modified fuzzy C means (MFCM) is used for clustering. Once the clustering based on the respective keyword is done, we classify the XML web based on the quality of the data using a KNN classifier.
Keywords: XML web; K nearest neighbor; Error value; Classification accuracy; feature vectors.
Analysis and Prediction of Heart Disease Aid of Various Data Mining Techniques: A Survey
by V. Poornima, D. Gladis
Abstract: In recent times, diseases have been increasing gradually because of hereditary factors. In particular, heart disease has become more common nowadays, i.e., the lives of individuals are at risk. The data mining strategies specifically decision tree, Na
Keywords: Data mining; Heart Disease Prediction; performance measure; Fuzzy; and clustering.
Signal-Flow Graph Analysis and Implementation of Novel Power Tracking Algorithm Using Fuzzy Logic Controller
by S. VENKATESAN, Manimaran Saravanan, Subramanian Venkatnarayanan, Senior Member IEEE
Abstract: This paper discusses the merits of a novel modified perturb and observe (P&O) maximum power point tracker (MPPT) algorithm for a stand-alone solar PV system using an interleaved LUO converter with a fuzzy logic controller (FLC). The merits of the FLC-based system are compared with the existing system. Analytical expressions of the proposed converter are derived through a signal flow graph. The proposed interleaved LUO converter based PV system with fuzzy controller considerably reduces ripple content, and the proposed MPPT algorithm produces less hunting around the maximum power point. Simulations at different illumination levels are carried out using MATLAB/Simulink. The system is also experimentally verified with a typical 40 W solar PV panel. The results confirm the superiority of the proposed system with fuzzy controller.
Keywords: Fuzzy Logic Controller; Interleaved LUO Converter; Maximum Power Point Tracking (MPPT); Modified P&O algorithm; Photovoltaic (PV) system.
SoLoMo Cities: Socio-Spatial City Formation Detection and Evolution Tracking Approach
by Sara Elhishi, Mervat Abu-Elkheir, Ahmed Aboul-Fotouh
Abstract: The tremendous growth of telecommunication devices coupled with
the huge number of social media users has revealed a new kind of development
that is turning our cities into information-rich smart platforms. We analyse the
role of LBSN check-ins using social community detection methods to extract
city structured communities, which we call SoLoMo cities, using a modified
version of the Louvain algorithm. We then track these communities' evolution
patterns through a pairwise consecutive matching process to detect behavioural
events that change the city's communities. The findings of the experiments on the
Brightkite dataset can be summarised as follows: online users' check-in
activities reveal a set of well-formed physical land spaces of the city's
communities, and the concentration of online social interactions and the formation
of those cities are positively correlated, at 89%. Finally, we
were able to track the evolution of the discovered communities through
detecting three community behaviour events: survive, grow and shrink.
Keywords: location-based social networks; LBSN; social; spatial analysis; community detection; evolution; tracking; Brightkite.
An Efficient Feature Extraction for Biometric Authentication
by Betty P, Mohanageetha D, Jeena Jacob
Abstract: Biometric authentication has gained significance due to its high uniqueness and performance. Quick and convenient authentication is required due to its widespread demand. Feature extraction is the primary and most important task for effective authentication. The dissimilar chrominance texture pattern (DiCTP) technique is used in this paper for effective feature extraction. Patterns of two sequences are generated from the inter-channel information of the image, which extracts the coloured texture information of the input. Unique information is generated from the RGB and BRG planes of the image, which produces a part of the diversified chromatic feature vectors. The local binary pattern (LBP) code is generated and added to the feature vector, which helps incorporate the greyscale information of the image. The experimental results are obtained using the CASIA Face Image Database Version 5 (DB1) and the Indian Face database (DB2), and show considerable improvements over the existing methodology.
Keywords: Biometric Authentication; Dissimilar Chrominance Texture Pattern; Content Based Image Retrieval.
Discovery of Rare Association Rules in the Distribution of Lawsuits in the Federal Justice System of Southern Brazil
by Lucia Gruginskie, Guilherme Vaccaro, Leonardo Chiwiakwosky, Attilla Blesz Jr
Abstract: In the context of data mining, infrequent association rules may be beneficial for analysing rare or extreme cases with very low support values and high confidence. In researching risky situations or allocating specific resources, such rules may have a much greater impact than rules with high support value. The objective of this study is to obtain association rules from the database of lawsuits filed in the Federal Court of Southern Brazil in 2016, including both frequent and rare rules. By finding these rules, especially rare ones, the information collected can assist in the decision-making process, in this case, such as training clerks or establishing specialised courts.
Keywords: Association Rules; Rare Rules; Distribution of lawsuits; Brazilian Federal Justice; Data mining.
Integral Verification and Validation for Knowledge Discovery Procedure Models
by Anne Antonia Scheidler, Markus Rabe
Abstract: This paper explains why knowledge discovery in databases (KDD) procedure models lack verification and validation (V&V) mechanisms and introduces an approach for integral V&V. Based on a generic model for knowledge discovery, a structure named 'KDD triangle model' is presented. This model has a modular design and can be adapted for other KDD procedure models. This has the benefit of allowing existing projects to improve their quality assurance in knowledge discovery. In this paper, the different phases of the developed triangle model for KDD are discussed. One special focus is on the phase results and related testing mechanisms. This paper also describes possible V&V techniques for the developed integral V&V mechanism to ensure direct applicability of the model.
Keywords: knowledge discovery in databases; data mining; procedure model; verification and validation; quality assurance.
A Multiclass Classification Approach for Incremental Entity Resolution on Short Textual Data
by Denilson Pereira, João A. Silva
Abstract: Several web applications maintain data repositories containing references to thousands of real-world entities originating from multiple sources, and they continually receive new data. Identifying the distinct entities and associating the correct references to each one is a problem known as entity resolution. The challenge is to solve the problem incrementally, as the data arrive, especially when those data are described by a single textual attribute. In this paper, we propose a new approach for incremental entity resolution. The method we have implemented, called AssocIER, uses an ensemble of multiclass classifiers with self-training and detection of novel classes. We have evaluated our method in various real-world datasets and scenarios, comparing it with a traditional entity resolution approach. The results show that AssocIER is effective and efficient in handling unstructured data in collections with a large number of entities and features, and is able to detect hundreds of novel classes.
Keywords: Entity Resolution; Associative Classification; Incremental Learning; Novel Class Detection; Self-training.
Method for Improvement of Transparency: Use of Text Mining Techniques for Reclassification of Governmental Expenditures Records in Brazil
by Gustavo De Oliveira Almeida, Kate Revoredo, Claudia Cappelli, Cristiano Maciel
Abstract: Many countries have transparency laws requiring availability of data. However, often data is available but not transparent. We present the Transparency Portal of Brazilian Federal Government case and discuss limitations of public acquisitions data stored in free text format. We employed text-mining techniques to reclassify descriptive texts of measurement units related to products and services. The solution presented in KNIME and JAVA aggregated measurements in the original (n = 69,372 with 78% reduction in number of descriptions, 94% items classified) and in cross validation sample (n = 105,266 with 88% reduction, classifying 78% of items). In addition, we tested computational time for processing of texts for a wide range of data input sizes, suggesting the stability and scalability of the solution to process larger datasets. Finally, we produced analysis identifying probable input errors, suppliers and purchasing units with abnormal transactions and factors affecting procurement prices. We present suggestions for future research and improvements.
Keywords: e-government; data mining; open government; text mining; transparency; KNIME; knowledge discovery; techniques; Brazil.
Data Mining in Credit Insurance Information System for Bank Loans Risk Management in Developing Countries
by Fouad J. Al Azzawi
Abstract: The task of credit risk insurance is critical in our time, since loans
are taken by everyone and everywhere, and it is quite difficult to accurately
estimate the possible losses incurred when those loans are not repaid.
This work proposes an information system module for the banking system to
improve the risk management operation, distributing losses on a fair
basis while accepting the maximum number of loan requests. By insuring the
risk associated with stumbled loans, the bank partially or completely shifts
losses under this contract to the insurance company, thus minimising its own
losses. The proposed module can determine the price at which the bank can buy
such an insurance policy. The proposed module could also be a valuable
motivation for developing countries to update the strategy of their
current insurance markets and outsource part of the state's insurance functions to
an independent insurance industry. Data mining techniques and mathematical
induction have been used to implement this model successfully. An optimal
classification module for predicting risky loan requests has been
successfully employed. A new mathematical model has been developed for
calculating the cost of an insurance policy in a crisis economy.
Keywords: Data mining; Credit insurance; information systems; Bank loans; risk management; developing countries.
Fibonacci Retracement Pattern Recognition for Forecasting Foreign Exchange Market
by Mohd Fauzi Ramli, AHMAD KADRI JUNOH, Mahyun Ab Wahab, Wan Zuki Azman Wan Muhamad
Abstract: Fibonacci retracement forecasts future movements in
foreign exchange (forex) rates through inductive analysis of previous movements.
Fibonacci ratios are used to forecast the retracement levels of 0.382, 0.500 and
0.618 and to determine the current trend, and they provide the mathematical
foundation for the Elliott wave theory. K-nearest neighbour (KNN) and linear
discriminant analysis (LDA) algorithms are the pattern recognition methods used for
nonlinear feature mining of Elliott wave patterns. Results show that LDA is
better than KNN in terms of classification accuracy, which is 99.43%.
Among the three Fibonacci retracement levels, the 38.2% level gives the
best forecasts for the Great Britain Pound to US Dollar pair, a major
currency pair, using mean absolute error (MAE), root mean square error (RMSE) and
the Pearson correlation coefficient (r) as the statistical measurements, which are
0.001884, 0.000019 and 0.992253 for the uptrend and 0.001685, 0.000019 and
0.998806 for the downtrend.
Keywords: forex; forecast; fibonacci retracement; elliott wave; golden ratio.
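As an illustration of the 0.382, 0.500 and 0.618 retracement levels named in the abstract above, the sketch below computes the corresponding price levels for a single swing. It follows the usual textbook convention rather than the authors' pipeline; the function name `retracement_levels` and the swing prices are hypothetical.

```python
# The three Fibonacci retracement ratios named in the abstract.
RATIOS = (0.382, 0.500, 0.618)


def retracement_levels(swing_low, swing_high, uptrend=True):
    """Return {ratio: price} for one price swing (illustrative helper).

    In an uptrend the pullback is measured down from the swing high;
    in a downtrend the rebound is measured up from the swing low.
    """
    if swing_high <= swing_low:
        raise ValueError("swing_high must be greater than swing_low")
    span = swing_high - swing_low
    if uptrend:
        return {r: swing_high - r * span for r in RATIOS}
    return {r: swing_low + r * span for r in RATIOS}


# Hypothetical swing prices, not taken from the paper's data.
up_levels = retracement_levels(1.2000, 1.3000, uptrend=True)
down_levels = retracement_levels(1.2000, 1.3000, uptrend=False)
```

A forecasting study would then compare observed prices against these levels, for example with MAE or RMSE as in the abstract.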
CARs-RP: Lasso Based Class Association Rules Pruning
by AZMI Mohamed, Abdelaziz Berrado
Abstract: Classification based on association rules is attracting more and more interest in research and practice. In many contexts, rules are mined from sparse data in high-dimensional spaces, which leads to a large number of rules with considerable containment and overlap. Pruning is often used to search for an optimal subset of rules. This paper introduces a method for pruning class association rules (CARs). It learns weights for a set of CARs by maximising the likelihood function subject to a constraint on the sum of the absolute values of the weights. The pruning strength is controlled by a shrinkage parameter λ. The suggested method allows the user to choose the appropriate subset of CARs. This is achieved through a trade-off between the accuracy and complexity of the resulting classifier, which is controlled by changing λ. Experimental analysis shows that the introduced method makes it possible to build more concise classifiers with accuracy comparable to other methods.
Keywords: class association rules; pruning; regularization; weighting; associative classification.
A statistical approach to investigate the alternatives of love in Moulana's Divan
by Mohammad Reza Mahmoudi, Ali Abbasalizadeh, Marzieh Rahmati
Abstract: Conceptual metaphor is the systematic mapping of conceptual domains onto each other. Love is the most important axis of the mystical path. In this paper, all the lines in Moulana's Divan are studied, and the different words used as alternatives of love are determined and classified into 11 areas. Then the chi-square goodness of fit test is used to investigate and compare the frequencies of the different areas and words used as alternatives of love, separately. Finally, based on clustering methods, these alternatives are grouped into three clusters (high frequency, medium frequency, and low frequency). The results indicate that the word 'fire' and the area 'human' are used most often as alternatives of love.
Keywords: Conceptual Metaphor Love; Moulana; Statistics; Data Mining; Text Mining.
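The chi-square goodness of fit test used in the abstract above compares observed category frequencies with expected ones. The sketch below computes the test statistic and degrees of freedom under the assumption of equal expected frequencies unless others are supplied; the function name `chi_square_gof` and the counts are hypothetical and are not the paper's data.

```python
def chi_square_gof(observed, expected=None):
    """Chi-square goodness-of-fit statistic and degrees of freedom.

    Illustrative helper: if `expected` is omitted, every category is
    assumed equally likely (a uniform null hypothesis).
    """
    if len(observed) < 2:
        raise ValueError("need at least two categories")
    total = sum(observed)
    if expected is None:
        expected = [total / len(observed)] * len(observed)
    if len(expected) != len(observed):
        raise ValueError("observed and expected must have the same length")
    # Sum of squared deviations, each scaled by its expected count.
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, len(observed) - 1


# Hypothetical word counts for three categories (made up for illustration).
statistic, dof = chi_square_gof([10, 20, 30])
```

The statistic is then compared with a chi-square critical value for the given degrees of freedom; with 2 degrees of freedom the 5% critical value is about 5.991, so a statistic above that would reject equal frequencies.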
PPM-HC: a Method for Helping Project Portfolio Management Based on Topic Hierarchy Learning
by Ricardo M. Marcacini, Ricardo A. M. Pinto, Flavia Bernardini
Abstract: The projects categorisation is a crucial step in the project portfolio management (PPM). Categorising projects allows the organisation to identify categories with a lack or excess of projects, according to its strategic objectives. In this work, we present a new method for project portfolio management based on hierarchical clustering (PPM-HC) to organise the projects at several levels of abstraction. In the PPM-HC, similar projects are allocated to the same clusters and subclusters. PPM-HC automatically learns an understandable topic hierarchy from the project portfolio dataset, thereby facilitating the (human) task of exploring, analysing and prioritising the projects of the organisation. We also proposed a card sorting-based technique which allows the evaluation of the projects categorisation using an intuitive visual map. We carried out an experimental evaluation based on a benchmark dataset and we also presented a real-world case study. The results show that the proposed PPM-HC method is promising.
Keywords: Project Portfolio Management; Projects Categorization; Topic Hierarchy Learning; Hierarchical Clustering.
An efficient approach for Defect Detection in Texture analysis using Improved Support Vector Machine
by Manimozhi I., Janakiraman S.
Abstract: Texture defect detection can be defined as the process of determining the location and size of the collection of pixels in a textured image which deviate in their intensity values or spatial arrangement in comparison to a background texture. The detection of abnormalities is a very challenging problem in computer vision. We have designed a method for detecting defects through pattern texture analysis. Initially, features are extracted from the input image using the gray level co-occurrence matrix (GLCM) and the gray level run-length matrix (GLRLM). Then the extracted features are fed to the classification stage. Here the classification is done by an improved support vector machine (ISVM). In the proposed pattern analysis, the traditional support vector machine is improved by means of kernel methods. In the final stage, the classified features are segmented using the modified fuzzy C means algorithm (MFCM).
Keywords: Texture defect detection; preprocessing; Gray Level Co-occurrence matrix; Gray Level Run-Length Matrix; Improved Support Vector Machine; modified fuzzy c means algorithm.
A Dynamic Replicative K-Means with Self-Compiling Particle Swarm Intelligence for Dataset Classification
by A. M. Viswa Bharathy
Abstract: The classification techniques proposed so far are not sufficiently intelligent in classifying data sets beyond two-level classification. Multi-class classification of network data calls for more hybrid algorithms. In this paper we propose a hybrid technique combining a modified K-means algorithm called dynamic replicative K-means (DRKM) with self-compiling particle swarm intelligence (SCPSI). The dataset we have chosen for the experiment is KDD Cup 99. DRKM-SCPSI performs better in terms of detection rate (DR), false positive rate (FPR) and accuracy, as is visible from the results presented.
Keywords: anomaly; detection; intrusion; K-Means; PSI.
Portfolio Selection with Support Vector Regression: Multiple Kernels Comparison
by Pedro Alexandre Henrique, Pedro Albuquerque, Peng Yao Hao, Sarah Sabino
Abstract: This study aimed to verify whether the use of support vector regression (SVR) makes the portfolio's return exceed the market. For this purpose, SVR was applied with 15 different kernel functions to select the best stocks for each quarter, calculating the quarterly portfolio return and the cumulative return over the period. Subsequently, the returns of these portfolios were compared with the returns of a market benchmark. White's (2000) test was applied to avoid the data-snooping effect in assessing the statistical significance of the portfolios developed by the training strategies. The portfolio selected by SVR with the inverse multiquadric kernel presented the highest cumulative return of 374.40% and a value at risk (VaR) of 6.87%. The results of this study corroborate the superiority hypothesis of the innovative method of Support Vector Regression in the formation of portfolios, thus constituting a robust predictive method capable of coping with high dimensionality interactions.
Keywords: Statistical Learning Theory; Optimization Theory; Financial Econometrics; Support Vector Machine; Kernel methods.
Worldwide Gross Revenue Prediction for Bollywood Movies using Hybrid Ensemble Model
by Alina Zaidi, Siddhaling Urolagin
Abstract: Prediction of revenue before a movie is released can be very beneficial for stakeholders and investors in the movie industry. Even though Indian cinema is a booming industry, the literature in the field of movie revenue prediction is more inclined towards non-Indian movies. In this study we built a novel hybrid prediction model to predict worldwide gross for Bollywood movies. A Bollywood movies dataset consisting of 674 movies is prepared by downloading movie-related features from IMDb and YouTube movie trailers. K-means clustering is performed on the movie dataset and two major clusters are identified. Important features specific to the clusters are selected. The proposed hybrid prediction model segregates movies into two clusters and employs a prediction model for each cluster. The prediction models we tested included various basic machine learning models and ensemble models. The ensemble model that combined predictions from support vector regression, neural network and ridge regression gave us the best result for both clusters and we chose it to be our final model. We obtain an overall MAE of 0.0272 and R2 of 0.80 after 10-fold cross validation.
Keywords: Bollywood; Movie Revenue Prediction; Box office; Regression; Ensemble; Feature Selection; Machine Learning; Scikit-Learn.
Health Data Warehouses: Reviewing Advanced Solutions for Medical Knowledge Discovery
by Norah Alghamdi
Abstract: The implementation of data warehouses and decision support systems that utilise the capabilities of information retrieval and knowledge discovery tools in healthcare has enhanced the healthcare offered. In this work, we present a review of recent data warehouses and decision support systems in the healthcare domain, with their significance and their applications in evidence-based medicine, electronic health records, and nursing. Given the growing trend in their implementation in healthcare services, research, and education, we present here the most recent publications that employ these tools to produce suitable decisions for patients or health providers. For all the reviewed publications, we have intensively explored their problems, suggested solutions, utilised methods, and findings. We have also highlighted the strengths of the existing approaches and identified potential drawbacks, including data correctness, completeness, consistency, and integration, that affect proper medical decision-making.
Keywords: Data warehouses; Data Mining; Health Data; Medical Records; Quality; Knowledge Discovery; OLAP.
Survey on-demand: A versatile scientific article automated inquiry method using text mining applied to Asset Liability Management
by Pedro Henrique Albuquerque, Igor Nascimento, Peng Yao Hao
Abstract: We propose a methodology that automatically relates the content of text documents to lexical items. The model estimates whether an article addresses a specific research object based on the relevant words in its abstract and title, using text mining and partial least square discriminant analysis. The model is accurate, and its adjustment and validation indicators are either superior or equal to those of other models in the literature on text classification. In comparison to existing methods, our method offers highly interpretable outcomes and allows flexible measurements of word frequency. The proposed solution may aid scholars in the process of searching for theoretical references by suggesting scientific articles based on similarities in the vocabulary used. Applied to the finance area, our framework has indicated that approximately 10% of the publications in the selected journals address the subject of asset liability management. Moreover, we highlight the journals with the largest number of publications over time and the key words about the subject using only freely accessible information.
Keywords: dimensionality reduction; discriminant analysis; text classification; partial least square; bibliometrics.
Clustering Student Instagram accounts using Author-Topic Model Based
by Nur Rakhmawati, Faiz NF, Irmasari Hafidz, Indra Raditya, Pande Dinatha, Andrianto Suwignyo
Abstract: This study proposes a topic model to cluster the Instagram accounts of a group of high school teenagers in Surabaya, Indonesia, using the author-topic model method. We collect 235 valid Instagram accounts (133 female, 102 male students). We gather a total of 3,346 captions of Instagram posts from 18 senior high schools. Our major finding is the set of topics that define their Instagram posts or captions, seven topics namely: feeling, Surabaya events, photography, artists, vacation, religion and music. Through the process, the lowest perplexity comes from 90 iterations, which suggests six groups of topics. The six topics are concluded based on the lowest perplexity value and labelled according to the words included in the topic. The topic of photography is discussed by six schools. Photography-artists and vacation are discussed by three schools, while feeling, religion and music are discussed by two and one school respectively.
Keywords: Topic Modelling; Senior High School Students; Author-Topic Models.
The approach of using ontology as pre-knowledge source for semi-supervised labelled topic model by applying text dependency graph
by Phu Pham, Phuc Do
Abstract: Discovering multiple topics from text is an important task in text mining. In the past, supervised approaches failed to explore multiple topics in text. Topic modelling approaches such as LSI, pLSI and LDA are considered unsupervised methods which support the discovery of distributions of multiple topics in text documents. The labelled LDA (LLDA) model is a supervised method which integrates human-labelled topics with the given text corpus during the process of modelling topics. However, in real applications, we may not have enough high-quality knowledge to properly assign topics to all documents before applying LLDA. In this paper, we present two approaches which take advantage of the dependency graph-of-words (GOW) in text analysis. The GOW approach uses the frequent sub-graph mining (FSM) technique to extract graph-based concepts from text. Our first approach uses graph-based concepts to construct a domain-specific ontology; it is called the GC2Onto model. In our second approach, the graph-based concepts are also applied to improve the quality of traditional LLDA; it is called the LLDA-GOW model. We combine the GC2Onto and LLDA-GOW models to leverage multiple topic identification as well as other mining tasks in text.
Keywords: topic identification; labelled topic modelling; LDA; labelled LDA; ontology-driven topic labelling; dependency graph.
RFID BI Mobility and Producer to Consumer Traceability Architecture
by Andre Claude Bayomock Linwa
Abstract: Radio frequency identifier (RFID) emerged in 2000 as an intelligent remote object identification technology. RFID helps track object position and relevant information using radio frequency technology (Bouet and dos Santos, 2008; Pais, 2010). Its application in industry greatly increases inventory management consistency and accuracy by capturing observed object attributes in real time for traceability and quality control purposes. In order to provide traceability and quality control services, RFID applications should offer two main services: business intelligence (BI) and mobility management. The RFID BI service provides production traceability services (QoS metrics related to manufacturing processes), and the RFID mobility service maintains accurate RFID tag locations. In this paper, a generic RFID BI mobility data model is defined. In the proposed data model, RFID product information generated by a supply chain organisation is translated or migrated from a producer to a consumer. This migration generates two distinct types of RFID mobility: internal (inside buildings) and external.
Keywords: Mobility Management; RFID; Business Intelligence BI; Data Models; Business Processes; QoS; Mobile Networks; GPS; Events; Mobility Subscription.
Sentimental Event Detection from Arabic Tweets
by Mohammad Daoud, Daoud Daoud
Abstract: This article presents and evaluates an approach to detect sentimental events from Twitter Arabic data streams. Sentimental events attract strongly opinionated responses from the online community; therefore, we aim at detecting the association of a topic with a positive or a negative sentiment at a particular time. To achieve that, we build sentimental time series in which the frequencies of that association (between topics and sentiment) are recorded. We then use several algorithms to locate possible events. Events in positive timelines are considered positive, and similarly for negative events. Our approaches use the Shannon diversity index and hill climbing peak finding. We tested our proposed algorithms on the domain of football (soccer) news. The results showed good precision and recall, considering mainstream media as a reference. The success of such an experiment can open the door for many useful applications, including reputation and brand monitoring systems for various domains and languages.
Keywords: event detection; sentiment analysis; social media analysis; diversity analysis; data mining.
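The two building blocks named in this abstract can be illustrated with a minimal sketch. It uses the textbook definitions of the Shannon diversity index and of hill climbing on a one-dimensional series; how the authors combine them is not stated in the abstract, so the function names and the sample series below are purely hypothetical.

```python
import math


def shannon_index(counts):
    """Shannon diversity H = -sum(p * ln p) over the non-zero category counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def hill_climb_peak(series, start):
    """Walk uphill from index `start` until neither neighbour is higher."""
    i = start
    while True:
        left = series[i - 1] if i > 0 else float("-inf")
        right = series[i + 1] if i < len(series) - 1 else float("-inf")
        if left <= series[i] and right <= series[i]:
            return i  # local maximum: a candidate event position
        # Step towards the higher neighbour (always strictly uphill).
        i = i - 1 if left > right else i + 1


# Hypothetical per-interval counts of positive mentions of one topic.
positive_mentions = [2, 3, 9, 4, 1, 5, 6, 2]
first_peak = hill_climb_peak(positive_mentions, 0)
second_peak = hill_climb_peak(positive_mentions, 4)
```

Starting the climb from several indices yields several local maxima; a real detector would presumably also apply a threshold before reporting a peak as an event.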
A comparison of cluster algorithms as applied to unsupervised surveys
by Kathleen C. Garwood, Arpit Dhobale
Abstract: When answering important questions with data, unsupervised data offers extensive opportunities for insight as well as unique challenges. This study considers student survey data with the specific goal of clustering students into like groups, with the underlying concept of identifying different poverty levels. Fuzzy logic is considered during the data cleaning and organising phase, helping to create a logical dependent variable for analysis comparison. Using multiple data reduction techniques, the survey was reduced and cleaned. Finally, multiple clustering techniques (k-means, k-modes and hierarchical clustering) are applied and compared. Though each method has strengths, the goal was to identify which was most viable when applied to survey data, specifically when trying to identify the most impoverished students.
Keywords: Fuzzy logic; cluster analysis; unsupervised learning; survey analysis; decision support system; k-means; k-modes; hierarchical clustering.
Discovery of inconsistent generalized coherent rules
by Anuradha Radhakrishnan, Rajkumar N, Rathi Gopalakrishnan, Soosaimichael PrinceSahayaBrighty
Abstract: Mining multiple-level association rules over a predefined taxonomy of hierarchies paves the way for generalised rule mining using interestingness measures like support and confidence. Coherent rule mining identifies significant rules in a database without using interestingness measures. In this paper we propose a new mining algorithm called generalised inconsistent coherent rule mining (GICRM) for mining a new form of generalised coherent rules called inconsistent coherent rules. The discovered rules are called inconsistent because the correlation of the rules changes from one level of the taxonomy to another. The rules are mined from a structured dataset with a predefined taxonomy. The inconsistent rules mined would be noteworthy from a business point of view for making strategic decisions in market basket analysis.
Keywords: GICRM; multiple-level; generalized inconsistent coherent rule; taxonomy.
Time and Structural Anomalies Detection in Business Processes Using Process Mining
by Elham Saeedi, Faramarz Safi-Esfahani
Abstract: Information systems are increasingly being integrated into operational processes and, as a result, many events are recorded by information systems. Lack of compatibility between the process model and the observed behaviour is one of the challenges in constructing the process model in process mining. This lack of compatibility could be present in both the structure (sequence of tasks) and the time spent on each task. In this paper, a hybrid approach for detecting structural and time anomalies via process mining is proposed. A dataset from Iran Insurance Company is used for performing a case study. The proposed method detected 98.5% of structural anomalies and 96.3% of time anomalies, which is one of the main achievements of this paper. A second, standard dataset, referred to as dataset 2, is used to further examine the proposed method. The proposed method has demonstrated better performance than the baseline approach.
Keywords: Process mining; conformance checking; workflow mining; structural anomaly; time anomaly; flexible model; Insurance anomaly; anomaly detection; process model; control-flow perspective.
g*-CLOSED SETS IN INTUITIONISTIC FUZZY TOPOLOGICAL SPACES
by Gandhi Mathi
Abstract: This paper is devoted to the study of intuitionistic fuzzy topological spaces. We introduce the concept of intuitionistic fuzzy g*-closed sets in intuitionistic fuzzy topological spaces and study some of their basic properties. We also introduce the concept of intuitionistic fuzzy g*-open sets in intuitionistic fuzzy topological spaces and derive several basic properties. We show that intuitionistic fuzzy g*-closed sets lie between intuitionistic fuzzy ?-closed sets and intuitionistic fuzzy g-closed sets. We also introduce applications of intuitionistic fuzzy g*-closed sets, namely the intuitionistic fuzzy $T_{1/2}^{*}$ space and ${}^{*}T_{1/2}$ space. We obtain some characterizations and several preservation theorems of intuitionistic fuzzy topological spaces.
Keywords: Intuitionistic fuzzy topology; Intuitionistic fuzzy g*-closed sets; Intuitionistic fuzzy g*-open sets.
The integration of a newly defined N-gram concept and vector space model for documents ranking
by Mostafa A. Salama, Wafaa Salah
Abstract: The vector space model (VSM) is used to measure the similarity between documents according to the frequency of the words they have in common. Furthermore, the N-gram concept is integrated into the VSM to take into consideration the relation between common consecutive words in the documents. This approach does not consider the context and semantic dependency between non-consecutive words in the same sentence. Accordingly, the approach proposed here presents a new definition of the N-gram concept as N non-consecutive words located in the same sentence, and utilises this definition in the VSM to enhance the measurement of the semantic similarity between documents. This approach measures and visualises the correlation between words that commonly occur together within the same sentence, to enrich the analysis of domain experts. The results of the experimental work show the robustness of the proposed approach against the current ranking models.
Keywords: N-gram; vector space model; VSM; text mining.
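To make the redefined N-gram concrete, the following sketch counts every set of N distinct words that share a sentence and compares two documents by cosine similarity over those counts. It is an illustration under stated assumptions rather than the authors' implementation: word order is ignored, repeated words within a sentence are counted once, sentences are split on `.`, `!` and `?`, and the names `sentence_ngrams` and `cosine` are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations


def sentence_ngrams(text, n=2):
    """Count unordered n-word sets that co-occur within the same sentence."""
    grams = Counter()
    for sentence in re.split(r"[.!?]+", text.lower()):
        # Assumption: each distinct word is used once per sentence.
        words = sorted(set(re.findall(r"[a-z0-9]+", sentence)))
        grams.update(combinations(words, n))
    return grams


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


doc_a = sentence_ngrams("Data mining finds patterns.")
doc_b = sentence_ngrams("Patterns emerge when mining data.")
doc_c = sentence_ngrams("The weather is sunny today.")
similar = cosine(doc_a, doc_b)
unrelated = cosine(doc_a, doc_c)
```

Here the pair ("data", "patterns") is counted in both of the first two documents even though the words are never adjacent, which is the property a consecutive-word N-gram would miss.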
Master node fault tolerance in distributed big data processing clusters
by Ivan Gankevich, Yuri Tipikin, Vladimir Korkhov, Vladimir Gaiduchok, Alexander Degtyarev, A. Bogdanov
Abstract: Distributed computing clusters are often built with commodity hardware, which leads to periodic failures of processing nodes due to the relatively low reliability of such hardware. While worker node fault tolerance is straightforward, fault tolerance of the master node poses a bigger challenge. In this paper, master node failure handling is based on the concept of master and worker roles that can be dynamically re-assigned to cluster nodes, along with maintaining a backup of the master node state on one of the worker nodes. In this case no special component is needed to monitor the health of the cluster, while master node failures can be resolved except in cases of simultaneous failure of the master and the backup. We present an experimental evaluation of the implementation of the technique and show benchmarks demonstrating that a failure of the master does not affect the running job, and that a failure of the backup results in re-computation of only the last job step.
Keywords: parallel computing; big data processing; distributed computing; backup node; state transfer; delegation; cluster computing; fault-tolerance; high-availability; hierarchy.
Students performance prediction using hybrid classifier technique in incremental learning
by Roshani Ade
Abstract: Performance in higher education is a turning point in academics for all students. This academic performance is influenced by many factors; therefore, it is essential to develop a predictive data mining model for students' performance so as to distinguish between high learners and slow learners. Knowledge is hidden in educational datasets and is extractable through data mining techniques. In our paper, we used a hybrid classifier approach for the prediction of students' performance using fuzzy ARTMAP and Bayesian ARTMAP classifiers. Sensitivity analysis was performed and irrelevant inputs were eliminated. The performance measures used to evaluate the techniques include the Matthews correlation coefficient (MCC), accuracy rate, true positives, false positives and the percentage of correctly classified instances. The combined result gives good accuracy for predicting students' performance with this approach. Thus, an enhanced prediction method for students is obtained.
Keywords: hybrid classifier; incremental learning; fuzzy ARTMAP; Matthews correlation coefficient; MCC.
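Of the measures listed in this abstract, the Matthews correlation coefficient is the least self-explanatory, so a short sketch of its standard binary-classification definition may help. The confusion-matrix counts in the example are invented for illustration and are not results from the paper; returning 0.0 when the denominator is zero is a common convention assumed here.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: an undefined MCC (empty row or column) maps to 0.0.
    return numerator / denominator if denominator else 0.0


def accuracy(tp, tn, fp, fn):
    """Fraction of correctly classified instances."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical counts for a "slow learner" vs "high learner" classifier.
mcc = matthews_corrcoef(tp=6, tn=3, fp=1, fn=2)
acc = accuracy(tp=6, tn=3, fp=1, fn=2)
```

MCC ranges from -1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction), and unlike plain accuracy it stays informative when the two classes are unbalanced.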
Combined local colour curvelet and mesh pattern for image retrieval system
by Yesubai Rubavathi Charles
Abstract: This manuscript presents a content-based image retrieval system using new textural features, namely a colour local curvelet (CLC) based textural descriptor and the colour local mesh pattern (CLMP), with the aim of improving the performance of the image retrieval system. The proposed methods are able to utilise the distinctive details obtained from spatial coloured textural patterns of various spectral components within a particular local image region. Furthermore, to gain the benefit of the harmonising effect of joint colour texture information, opponent colour textural features, which capture the texture patterns of spatial interactions among spectral planes, are also integrated into the creation of CLC and CLMP. Extensive comparative experiments have been conducted on two benchmark databases, i.e., Corel-1k and MIT VisTex. Retrieval results show that image retrieval using colour local texture features yields better precision and recall than retrieval approaches using either colour or texture features alone.
Keywords: content-based image retrieval system; curvelet transform; local mesh pattern; local colour curvelets; LCC; local colour mesh pattern; LCMP.
A survey on time series motif discovery
by Cao Duy Truong, Duong Tuan Anh
Abstract: Time series motifs are repeated subsequences in a long time series. Discovering time series motifs is an important task in time series data mining and this problem has received significant attention from researchers in data mining communities. In this paper, we intend to provide a comprehensive survey of the techniques applied for time series motif discovery. The survey also briefly describes a set of applications of time series motif in various domains as well as in high-level time series data mining tasks. We hope that this article can provide a broad and deep understanding of the time series motif discovery field.
Keywords: time series; motif discovery; window-based; segmentation-based; motif applications.
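As a concrete reference point for what motif discovery computes, here is a naive brute-force sketch that returns the closest pair of non-overlapping subsequences of a fixed length. It is an assumed baseline for illustration only, not a technique attributed to the survey; it uses raw Euclidean distance, whereas published methods often z-normalise each subsequence first and use far more efficient search strategies.

```python
import math


def find_motif_pair(series, m):
    """Return (distance, i, j) for the closest pair of non-overlapping length-m windows."""
    best = (float("inf"), -1, -1)
    n_windows = len(series) - m + 1
    for i in range(n_windows):
        # Starting j at i + m excludes trivial matches between overlapping windows.
        for j in range(i + m, n_windows):
            dist = math.sqrt(
                sum((series[i + k] - series[j + k]) ** 2 for k in range(m))
            )
            if dist < best[0]:
                best = (dist, i, j)
    return best


# Hypothetical series in which the shape [0, 1, 3, 1] occurs twice.
readings = [0, 1, 3, 1, 0, 5, 7, 0, 1, 3, 1, 9]
motif = find_motif_pair(readings, 4)
```

The double loop compares a quadratic number of window pairs, which is why faster approaches matter for long series.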
Fuzzy-based automated interruption testing model for mobile applications
by A. Malini, K. Sundarakantham, C. Mano Prathibhan, A. Bhavithrachelvi
Abstract: Testing of mobile applications during the occurrence of interrupts is termed interrupt testing. Interrupts can occur either internally within the mobile device or from external factors or systems. Interruptions in a smartphone may decrease the performance of mobile applications. In this paper, an automated interruption testing model is proposed to analyse the responsiveness of mobile applications during interrupts. This model monitors the applications installed on mobile devices and evaluates the overall performance of mobile applications during interrupts using fuzzy logic. An enhanced MobiFuzzy evaluation system (MFES) is proposed that is used to dynamically analyse the test results and identify the information required for tuning the application. Fuzzy logic will help developers or testers tune application performance by automatically categorising the impact level of performance parameters on the overall performance of the application.
Keywords: mobile application testing; interrupt testing; application tracker; performance testing.
Modelling and simulation of ANFIS-based MPPT for PV system with modified SEPIC converter
by M. Senthil Kumar, P.S. Manoharan, R. Ramachandran
Abstract: This paper presents the modelling and simulation of an artificial neuro-fuzzy inference system (ANFIS)-based maximum power point tracking (MPPT) algorithm for a PV system with a modified SEPIC converter. Conventional MPPT methods have the major drawbacks of high oscillations at the maximum power point and low efficiency due to the uncertain nature of solar radiation and temperature. These problems can be solved by the proposed adaptive ANFIS-based MPPT. The proposed work involves ANFIS and a modified single ended primary inductor converter (SEPIC) to extract maximum power from the PV panel. The results obtained from the proposed methodology are compared with other MPPT algorithms such as perturb and observe (P&O), incremental conductance (INC) and radial basis function network (RBFN). The improvement in the voltage rating of the modified SEPIC is compared with the conventional SEPIC converter. The results confirm the superiority of the proposed system.
Keywords: maximum power point tracking; MPPT; modified SEPIC; artificial neuro-fuzzy inference system; ANFIS; radial basis function network; RBFN.
Fuzzy c-means clustering and elliptic curve cryptography using privacy preserving in cloud
by Sasidevi Jayaraman, Sugumar Rajendran, Shanmuga Priya P
Abstract: Cloud computing is the distribution of computing devices, which reduces the cost of IT infrastructure. In the proposed approach, the databases are subjected to a clustering method to generate the intermediate datasets. The information gain of these datasets is used to pick the sensitive data for the encryption and decryption procedure; the selection of sensitive data depends on the entry value. The information gain is integrated to obtain an upper bound constraint on the combined privacy leakage. The sensitive data is passed to the elliptic curve cryptography (ECC) system, which encrypts it to preserve privacy. An encrypted data storage system is utilised to protect cloud data. Encrypting every intermediate dataset is neither efficient nor cost-effective. The experimental results show that the privacy-preserving cost of intermediate datasets can be significantly reduced by our method compared with existing ones in which all datasets are encrypted.
Keywords: cloud computing; intermediate datasets; privacy preserving; encryption; decryption; cryptography; clustering.
Improved artificial neural network with aid of artificial bee colony for medical data classification
by Balasaheb Tarle, Sudarson Jena
Abstract: The ultimate aim of the proposed method is to establish a model for the classification of medical data. Various methods have been applied to health-related data to predict future healthcare usage, including predicting a person's spending and illness-related issues for patients. In order to achieve promising results in medical data classification, we utilise orthogonal local preserving projection and an optimal classifier. Initially, pre-processing is applied to extract useful information and to convert raw medical datasets into suitable samples. Here, orthogonal local preserving projection (OLPP) is used to reduce the feature dimension. Once the feature reduction is performed, the prediction is done by the optimal classifier, in which the artificial bee colony algorithm is used with a neural network. The effectiveness of our proposed method is measured in terms of accuracy, sensitivity and specificity. Here, the Switzerland dataset achieves the maximum accuracy of 95.935%.
Keywords: orthogonal local preserving projection; OLPP; classifier; neural network; artificial bee colony algorithm.
An effective feature selection for heart disease prediction with aid of hybrid kernel SVM
by T. Keerthika, K. Premalatha
Abstract: In today's world, cardiovascular disease is the most lethal disease. It strikes a person so suddenly that there is hardly any time for treatment. Diagnosing patients correctly and in a timely manner is therefore the most challenging task for the medical fraternity. In order to reduce the risk of heart disease, a prediction system based on effective feature selection and classification is proposed. Efficient feature selection is applied to the high-dimensional medical data; the fish swarm optimisation algorithm is used to select the features. The selected features from the medical dataset are then fed to the hybrid kernel SVM (HKSVM) for classification. The performance of the proposed technique is evaluated by accuracy, sensitivity, specificity, precision, recall and f-measure. Experimental results indicate that the proposed classification framework outperforms existing methods, with an accuracy of 96.03% on the Cleveland dataset, compared with 91.41% for the existing SVM method and 62.25% for the optimal rough fuzzy classifier.
Keywords: hybrid kernel support vector machine; HKSVM; feature selection; fish swarm optimisation; support vector machines; SVM; optimal rough fuzzy; Cleveland; Hungarian; Switzerland.
The complexity of cluster-connectivity of wireless sensor networks
by H.K. Dai, H.C. Su
Abstract: Wireless sensor networks consist of sensor devices with limited computational capabilities and memory, operating with bounded energy resources; hence, network optimisation and algorithmic development for minimising the total energy or power while maintaining the connectivity of the underlying network are crucial for their design and maintenance. We consider a generalised system model of wireless sensor networks whose node set is decomposed into multiple clusters, and show that the decision and the associated minimisation problems of the connectivity of clustered wireless sensor networks appear to be computationally intractable, being complete and hard, respectively, for the non-deterministic polynomial-time complexity class. An approximation algorithm is devised to minimise the number of end nodes of inter-cluster edges within a factor of 2 of the optimum for the cluster-connectivity.
Keywords: wireless sensor network; connectivity; spanning tree; non-deterministic polynomial-time complexity class; approximation algorithm.
A fast clustering approach for large multidimensional data
by Hajar Rehioui, Abdellah Idrissi
Abstract: Density-based clustering is a strong family of clustering methods. The strength of this family is its ability to classify data of arbitrary shapes and to omit noise. Among them is density-based clustering (DENCLUE), one of the well-known and powerful density-based clustering methods. DENCLUE is based on the concept of the hill climbing algorithm. In order to find the clusters, DENCLUE has to reach a set of points called density attractors. Despite its advantages, DENCLUE remains sensitive to growth in the size and dimensionality of the data, because a density attractor is calculated for each point in the input data. In this paper, to overcome this shortcoming of DENCLUE, we propose an efficient approach. This approach replaces the concept of the density attractor with a new concept, 'the hyper-cube representative'. The experimental results, obtained from several datasets, show that our approach finds a trade-off between clustering performance and fast response time. In this way, the proposed clustering methods work efficiently for large multidimensional data.
Keywords: large data; dimensional data; clustering; density-based clustering; DENCLUE.