International Journal of Data Mining, Modelling and Management (19 papers in press)
EFP-Tree: An Efficient FP-Tree for Incremental Mining of Frequent Patterns
by Razieh Davashi, Mohammad-Hossein Nadimi-Shahraki
Abstract: Frequent pattern mining from dynamic databases where there are many incremental updates is a significant research issue in data mining. After incremental updates, the validity of the frequent patterns changes. A simple way to handle this is to rerun the mining algorithms from scratch, which is very costly. To solve this problem, researchers have introduced the incremental mining approach. In this article, an efficient FP-tree named EFP-tree is proposed for incremental mining of frequent patterns. For the original database, it is constructed like an FP-tree by using an auxiliary list, without any reconstruction. Consistently, for incremental updates, the EFP-tree is reconstructed once and therefore reduces the number of tree reconstructions, reconstructed branches and the search space. The experimental results show that using the EFP-tree can reduce reconstructed branches and the runtime in both static and incremental mining and enhance scalability compared to the well-known tree structures CanTree, CP-tree, SPO-tree, and GM-tree on both dense and sparse datasets.
Keywords: Data mining; Dynamic databases; Frequent pattern; Incremental mining; FP-tree.
Human Activity Recognition based on Interaction Modelling
by Subetha T, Chitrakala S
Abstract: Human Activity Recognition aims at recognizing and interpreting the activities of humans automatically from videos. Among the activities of humans, identifying the interactions between humans within minimal computation time and with a reduced misclassification rate is a cumbersome task. Hence, an interaction-based Human Activity Recognition system is proposed in this paper that utilizes silhouette features to identify and classify the interactions between humans. The main issues that affect the performance in various stages of activity recognition are sudden illumination changes, detection of static humans, lack of inhibiting spatio-temporal features while extracting silhouettes, data discrimination, data variance, the crowding problem, and computational complexity. To address the preceding issues, three new algorithms named weight-based updating Gaussian Mixture Model (wu-GMM), Spatial Dissemination-based Contour Silhouettes (SDCS), and Weighted Constrained Dynamic Time Warping (WCDTW) are proposed. Experiments are conducted with benchmark datasets such as the Gaming dataset and the Kinect Interaction dataset. The results demonstrate that the proposed system recognizes the interaction-based activity of humans with a reduced misclassification rate and minimal processing time compared to the existing motion-pose geometric descriptor representation (MPGD) for various activities like the right punch, left punch, defense, and so on. The proposed Human Activity Recognition system finds its applications in sports event analysis, video surveillance, content-based video retrieval, robotics, and others.
Keywords: Human Activity Recognition; weight-based updating Gaussian Mixture Model; Spatial Dissemination-based Contour Silhouettes; Weighted Constrained Dynamic Time Warping; Dynamic Time Warping; reduced variance-t Stochastic Neighbor Embedding.
Using implicitly and explicitly rated online customer reviews to build opinionated Arabic lexicons
by Mohammad DAOUD
Abstract: Creating an opinionated lexicon is an important step towards a reliable social media analysis system. In this article we propose an approach and describe an experiment to build an Arabic polarized lexical database by analysing online implicitly and explicitly rated customer reviews. These reviews are written in Modern Standard Arabic and the Palestinian/Jordanian dialect. Therefore, the produced lexicon comprises casual slang and dialectal entries used by the online community, which is useful for sentiment analysis of informal social media microblogs. We extracted 28000 entries by processing 15100 reviews and by expanding the initial lexicon through Google Translate. We calculated an implicit rating for every review, driven by its text, to address the problem of ambiguous opinions in certain online posts, where the text of the review does not match the given rating (the explicit rating). Each entry was given a polarity tag and a confidence score. High confidence scores have increased the precision of the polarization process. Explicit rating has increased the coverage and confidence of polarity.
Keywords: polarized lexicon; social media analysis; opinion mining; term extraction; sentiment analysis.
Sample Selection Algorithms for Credit Risk Modeling through Data Mining Techniques
by Eftychios Protopapadakis, Dimitrios Niklis, Michalis Doumpos, Anastasios Doulamis, Constantin Zopounidis
Abstract: Credit risk assessment is a very challenging and important problem in the domain of financial risk management. The development of reliable credit rating/scoring models is of paramount importance in this area. There are different algorithms and approaches for constructing such models to classify credit applicants (firms or individuals) into risk classes. Reliable sample selection is crucial for this task. The aim of this paper is to examine the effectiveness of sample selection schemes in combination with different classifiers for constructing reliable default prediction models. We consider different algorithms to select representative cases and handle class imbalances. Empirical results are reported for a data set of Greek companies from the commercial sector.
Keywords: Credit risk modeling; Data mining; Sampling; Classification.
A Flexible Architecture for the Pre-Processing of Solar Satellite Image Time Series Data: The SETL Architecture
by Carlos Roberto Silveira Junior, Marilde Terezinha Prado Santos, Marcela Xavier Ribeiro
Abstract: Satellite Image Time Series (SITS) is a challenging domain for Knowledge Discovery in Databases due to its characteristics: each image has several sunspots, and each sunspot is associated with sensor data composed of the radiation level and the sunspot classifications. Each image also has time parameters and sunspot coordinates, that is, spatiotemporal data. Several challenges of the SITS domain are faced during the Extract, Transform and Load (ETL) process. In this paper, we propose an architecture called SITS ETL (SETL) that extracts the visual characteristics of each sunspot and associates them with the sunspot's sensor data, considering the spatiotemporal relations. SETL brings flexibility and extensibility to working with challenging domains such as SITS because it integrates textual, visual and spatiotemporal characteristics at the sunspot-record level. Furthermore, we obtained acceptable performance results according to a domain expert and increased the possibility of using different data mining algorithms compared to the state of the art.
Keywords: Satellite Image Time Series; Spatiotemporal Extract, Transform, and Load process; Temporal Series of Solar Image processing.
Fast Parallel PageRank Technique for Detecting Spam
by Nilay Khare, Hema Dubey
Abstract: Brin and Page proposed PageRank in 1998, and it remains a prevailing link analysis technique used by web search engines to rank their search results. Computing PageRank values efficiently and quickly for very large web graphs is an essential concern for search engines today. Identifying spam web pages and dealing with them is another important concern in web browsing. In this research article, an efficient and faster parallel PageRank algorithm is proposed, which harnesses the power of graphics processing units (GPUs). In the proposed algorithm, the PageRank scores are non-uniformly distributed among the web pages, so it is also capable of coping with spam web pages. The experiments are performed on standard datasets available in the Stanford Large Network Dataset Collection. There is a speed up of about 1.1 to 1.7 for the proposed parallel PageRank algorithm over existing parallel PageRank
Keywords: GPU; CUDA; Parallel PageRank Technique; Spam Web Pages.
Tuning Parameters via a new Rapid, Accurate and Parameter-less Method Using Meta-Learning
by Alireza Hekmatinia, Ali Mohammadi Shanghooshabad, Mohammad Mahdi Motevali, Mehrdad Almasi
Abstract: Dealing with a large parameter space in optimization and data mining tasks is extremely time consuming because the parameter space grows exponentially with the number of parameters. Besides the considerable amount of time it takes, the tuning method itself needs to be tuned, since such methods themselves have at least one parameter. Here a new rapid and parameter-less method is presented to tune algorithms on diverse datasets and achieve high quality results in a short time. Also, for a quick overview of the methods available in this area, a taxonomy of parameter selection approaches is presented. The method uses prior knowledge, in the form of meta-features, to guess a point close to the optimal point in the parameter space of the target algorithm (here, the Support Vector Machine algorithm is used). To prepare the prior knowledge, 282 meta-features are introduced and then a Genetic Algorithm (GA) is applied to determine the best meta-features for the target algorithm. The best meta-feature set is the combination of meta-features that creates the greatest differentiation between various datasets. The best meta-features are then used to tune the target algorithm on unseen datasets. In the experiments, the 15 best meta-features are selected from the 282 by using the GA over 30 datasets. Finally, by using the extracted meta-features, the SVM's parameters are tuned over 5 unseen datasets. The results show that, in less than 0.19 minutes on average, the method obtains approximately the same classification rates as other methods, while the time consumed is dramatically reduced.
Keywords: Parameter Tuning; Meta-Learning; Parameter-less Methods; Data Mining; Support Vector Machines.
Analyzing Sentiments based on Multi Feature Combination with Supervised Learning
by Monalisha Ghosh, Goutam Sanyal
Abstract: Sentiment analysis or opinion mining has become an open research domain after the proliferation of the Internet and Web 2.0 social media. Feature generation and selection are consequential for text mining, as a high dimensional feature set can affect the performance of sentiment analysis. This paper investigates the inability of the widely used feature selection methods (IG, Chi-Square, Gini Index) individually, as well as their combined approach, on four machine learning classification algorithms. Initially, we transform the review datasets into feature vectors of unigram features along with bi-tagged features based on POS patterns. Next, Information gain (IG), Chi squared (χ2) and minimum redundancy maximum relevancy (mRMR) feature selection methods are applied to obtain an optimal feature subset for further functionality. These features are then given as input to multiple machine learning classifiers, namely, Support vector machine (SVM), Multinomial Na
Keywords: Sentiment analysis; Opinion mining; text classification; Feature selection method; Machine learning algorithms; optimal feature vector.
A new network-based approach to investigating neurological disorders
by Francesco Cauteruccio, Paolo Lo Giudice, Giorgio Terracina, Domenico Ursino, Nadia Mammone, Francesco Carlo Morabito
Abstract: In this paper, we present a new network-based approach to helping experts investigate neurological disorders in which the connections among brain areas play a key role. Our approach receives the EEG of a patient and associates a network with it, with nodes that represent electrodes and with edges that denote the disconnection degree of the corresponding brain areas, measured by means of a new string-based metric. Then, it performs some suitable projections on this network, depending on the neurological disorder to investigate. After this, it computes the values of a new coefficient, called the connection coefficient, on them. These values can be employed to help neurologists in their analyses. We show how our approach can be employed for three different disorders, namely Creutzfeldt-Jakob Disease, Childhood Absence Epilepsy and Alzheimer's Disease.
Keywords: Network Analysis; Connection Coefficient; Clique; Consensus Multi-Parameterized Edit Distance; Electroencephalogram; Neurological Disorders.
Intrusion detection using classification techniques: a comparative study
by Imad Bouteraa, Makhlouf Derdour, Ahmed Ahmim
Abstract: Today's highly connected world suffers from the increase and variety of cyber-attacks. To mitigate those threats, researchers have been continuously exploring different methods for intrusion detection in recent years. In this paper, we study the use of data mining techniques for intrusion detection. The research intends to compare the performance of classification techniques for intrusion detection. To reach this goal, we involve 74 classification techniques in this comparative study. The study shows that no technique outperforms the others in all situations. However, some classification methods lead to promising results and give clues for further combinations.
Keywords: Data mining; Classification; Network Security; Intrusion detection; KDD99.
An Insight into Application of Big Data Analytics in
by Sravani Nalluri, Sasikala R
Abstract: The main aim of this paper is to comprehend different aspects of big data, to gain insight into the current research trends in the application of big data in health care, and to identify the different aspects of health care where it can be applied.
In this paper a brief analysis was done of applications of big data in health care. The main focus is on the aspects of health where big data is being used, the collection of data, and the tools employed for big data analytics. In addition, the paper also addresses the types of machine learning algorithms that were used in health care and which statistics were used to compare the performance of these algorithms.
Most of the health care data was collected from the University of California machine learning repository, from hospitals and from government agencies. Most of the researchers focused only on the prediction of diseases, emergency department visits, or disease outbreaks with the help of the Hadoop and WEKA tools. Support vector machines, artificial neural networks, Naive Bayes and decision trees were the commonly used algorithms for the prediction of diseases. The performance of the algorithms was compared statistically using accuracy. In my perspective, more research needs to be done on the application of big data analytics in other domains of health rather than just the prediction of disease.
Keywords: Big data; Hadoop; Machine learning algorithms; Healthcare; Map-reduce; Chronic diseases; Accuracy rate; Prevention; Analytics.
Grey Relational Classification Algorithm for Software Fault Proneness with SOM Clustering
by Aarti Aarti, Geeta Sikka, Renu Dhir
Abstract: Estimation by human judgment to deal with the inherent uncertainty of software gives a vague and imprecise solution. To cope with this challenge, we propose a new hybrid analogy model based on the integration of GRA (grey relational analysis) classification with self-organizing map (SOM) clustering. In this paper, a new classification approach is proposed to distribute the data into similar groups. The attributes are selected based on GRC values. In the proposed approach, the similarity measure between the reference project and the cluster head is computed to determine the cluster to which the target project belongs. The fault-proneness of the reference project is estimated based on the regression equation of the selected cluster. The proposed algorithm gives resilience to users to select n features for both continuous and categorical attributes. In this study, two scenarios based on the integration of the proposed classification with regression have been proposed. Experimental results show significant results indicating that the proposed methodology can be used for the prediction of faults and produces conceivable results when compared with the results of multilayer-perceptron, logistic regression, bagging, na
Keywords: Self organizing map (SOM); grey relational analysis (GRA); unsupervised classification; fault-proneness; object-oriented (OO).
Overlapping Community Detection With A Novel Hybrid Metaheuristic Optimization Algorithm
by Imane Messaoudi, Nadjet Kamel
Abstract: Social networks are ubiquitous in our daily life. Due to the rapid development of information and electronic technology, social networks are becoming more and more complex in terms of size and content. It is of paramount significance to analyze the structures of social networks in order to unveil the myth beneath complex social networks. Network community detection is recognized as a fundamental tool for social network analytics. As a consequence, numerous community detection methods have been proposed in the literature. In a real-world social network, an individual may possess multiple memberships, while the existing community detection methods are mainly designed for non-overlapping situations. With regard to this, this paper proposes a hybrid metaheuristic method to detect overlapping communities in social networks. In the proposed method, the overlapping community detection problem is formulated as an optimization problem and a novel bat optimization algorithm is designed to solve the established optimization model. To enhance the search ability of the proposed algorithm, a local search operator based on tabu search is introduced. To validate the effectiveness of the proposed algorithm, experiments on benchmark and real-world social networks are carried out. The experiments indicate that the proposed algorithm is promising for overlapping community detection
Keywords: Overlapping Community; Modified Density; Tabu Search; Bat Algorithm; Link Clustering.
Bees Colonies For Detecting Communities Evolution Using Data WareHouse
by Yasmine Chaabani, Jalel Akaichi
Abstract: The analysis of social networks and their evolution has gained much interest in recent years. In fact, few methods have revealed and tracked meaningful communities over time. These methods have also dealt efficiently with the structure and topic evolution of networks. In this paper, we propose a novel technique to track dynamic communities and their evolution behaviour. The main objective of our approach, which uses the Artificial Bee Colony (ABC), is to trace the evolution of communities and to optimize our objective function to keep a proper partitioning. Moreover, we use a data warehouse as a mind of bees to store the information on the different community structures at every timestamp. The experimental results showed that the proposed method is efficient in discovering dynamic communities and tracking their evolution.
Keywords: Social Network; Community Detection; Bees Colonies.
A support Architecture to MDA Contribution for Data Mining
by Fatima MESKINE, Safia Nait-Bahloul
Abstract: The data mining process is the sequence of tasks applied to data in order to discover relations between them and so obtain knowledge. However, the data mining process lacks a formal specification that allows it to be modeled independently of platforms. MDA (Model Driven Architecture) is an approach for the development of software systems, based on the use of models to improve their productivity. Several research works have been elaborated to align the MDA approach with data mining on data warehouses, to specify the data mining process at a very high level of abstraction. In our work, we propose a support architecture that allows positioning these research works at different abstraction levels, on the basis of several criteria, with the aim of identifying the strengths of each level in terms of modelling, and of having a clear view of the MDA contribution to data mining.
Keywords: Data mining; Model Driven Architecture; Data warehouses; UML Profiles; Data Multidimensional Model; Transformation.
Special Issue on: Big Data Engineering Recent Advances in Intelligent Methods, Methodologies and Techniques
Allegories for Database Modeling
by Bartosz Zielinski, Paweł Maslanka, Scibor Sobieski
A Grammar-based Approach for XML Schema Extraction and Heterogeneous Document Integration
by Prudhvi Janga, Karen C. Davis
Towards a Comparative Evaluation of Text-Based Specification Formalisms and Diagrammatic Notations
by Kobamelo Moremedi, John Andrew Van Der Poll
Effective and Efficient Distributed Management of Big Clinical Data: A Framework
by Alfredo Cuzzocrea, Giorgio Mario Grasso, Massimiliano Nolich