International Journal of Data Science (17 papers in press)
An Enhance DE algorithm for analysis in data set
by Dharmpal Singh, J.Paul Choudhury, Mallika De
Abstract: Differential evolution (DE) is a simple, powerful optimization algorithm that has been widely used to solve constrained optimization problems, multi-objective global optimization problems, and other complex real-world applications. However, choosing the best mutation and search strategies, long training times, and low classification accuracy have been observed to be difficulties for specific problems. Furthermore, users have to know the appropriate encoding schemes and evolutionary operators, as well as suitable parameter settings, to ensure the success of the algorithm. Otherwise, the time-consuming trial-and-error tuning of parameters and operators may lead to demanding computational costs.
To minimize these drawbacks, an enhanced DE is proposed to improve searching ability and efficiently guide the evolution of the population toward the global optimum.
It has further been observed that mutation and crossover play an important role in DE optimization, and that several functions are available for them, which may lead to different results for the same data set.
Therefore, an effort has been made here to suggest a crossover and mutation strategy that leads to faster and more efficient evolution of the population toward the global optimum.
Keywords: Data mining; association rule; data preprocessing; factor analysis; fuzzy logic; neural network; particle swarm optimization and artificial bee colony DEA.
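As a point of reference for the mutation and crossover steps mentioned in the abstract above, the following is a minimal sketch of the classic DE/rand/1/bin scheme. It is not the enhanced variant the authors propose; the function name `differential_evolution`, the parameter defaults (`f`, `cr`, `pop_size`) and the `sphere` test objective are all assumptions chosen for illustration.

```python
import random

def differential_evolution(objective, bounds, pop_size=20, f=0.8, cr=0.9,
                           generations=200, seed=0):
    """Classic DE/rand/1/bin (illustrative; not the paper's enhanced DE)."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [objective(ind) for ind in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Mutation: combine three distinct individuals other than i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            mutant = [pop[a][d] + f * (pop[b][d] - pop[c][d]) for d in range(dim)]
            # Binomial crossover: take each coordinate from the mutant with
            # probability cr; j_rand guarantees at least one mutant coordinate.
            j_rand = rng.randrange(dim)
            trial = []
            for d in range(dim):
                use_mutant = rng.random() < cr or d == j_rand
                value = mutant[d] if use_mutant else pop[i][d]
                lo, hi = bounds[d]
                trial.append(min(max(value, lo), hi))  # clip to the search box
            # Greedy selection: keep the trial only if it is no worse.
            trial_score = objective(trial)
            if trial_score <= scores[i]:
                pop[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]

def sphere(x):
    """Hypothetical test objective with its global optimum at the origin."""
    return sum(v * v for v in x)

best_x, best_score = differential_evolution(sphere, [(-5.0, 5.0)] * 3)
```

Swapping the mutation line or the crossover rule is where alternative strategies would plug in, which is why the same data set can give different results under different operator choices.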
Summarization of subspace clusters based on Similarity connectedness
by B. Jaya Lakshmi, M. Shashi, K.B. Madhuri
Abstract: Subspace clustering is an emerging area which explores clusters of objects in various subspaces. Existing subspace clustering algorithms such as SUBCLU and CLIQUE are computationally expensive, as they generate a large number of possibly redundant subspace clusters, limiting the interpretability of the results. The problem gets even worse as the dimensionality of the dataset increases. This calls for an efficient summarization framework that generates a limited number of interesting subspace clusters. The authors propose a new framework for generating low dimensional subspaces, clustering them based on similarity, and merging them to form the corresponding subspace clusters, subsuming the information content of the low dimensional member clusters. A novel algorithm, Similarity connectedness based Clustering on subspace Clusters (SCoC), is proposed to form natural groupings of lower dimensional subspace clusters. The concept of similarity connectedness is introduced to group and merge the subspace clusters formed in different lower dimensional subspaces, leaping through the lattice of dimensions. The resulting compact and summarized high dimensional subspace clusters can easily be interpreted for making sound decisions. The SCoC algorithm is thoroughly tested on various benchmark datasets and is found to outperform PCoC and SUBCLU in terms of both cluster quality and execution time.
Keywords: Subspace clusters; summarization; similarity; similarity connectedness; similarity threshold; groups of subspace clusters.
Community Detection in Dynamic Networks with Spark
by Priyangika Piyasinghe, Morris Chang
Abstract: Detecting the evolution of communities within dynamically changing networks is important for understanding the latent structure of complex large graphs. In this paper, we present an algorithm to detect communities in real time in dynamically changing networks. We demonstrate the proposed methodology through a case study in peer-to-peer botnet detection. Such botnets are one of the major threats in network security, serving as the infrastructure responsible for various cyber crimes. Our method considers the online community structure from time to time and improves efficiency while maintaining the same level of community detection accuracy over time. Experimental evaluation of an Apache Spark implementation of the method showed that execution time improves over a dynamic version of the Girvan-Newman community detection algorithm while achieving a higher accuracy level.
Keywords: Dynamic Networks; Community Detection; Girvan-Newman algorithm; Large Graphs; Spark.
Performance Analysis of NARX Neural Network Back Propagation Algorithm by Various Training Functions for Time Series Data
by Ashok Kumar Durairaj, Murugan Solaiyappan
Abstract: This study investigates various training functions with the Nonlinear Auto Regressive eXogenous Neural Network (NARXNN) for forecasting the closing index of the stock market. An iterative approach adjusts the number of hidden neurons of a NARXNN model. This approach systematically constructs different NARXNN models, from simple to complex architectures, with different training functions, and finds the optimum NARXNN model. The effectiveness of the proposed approach is assessed on one-step-ahead forecasts of the Bombay Stock Exchange (BSE100) closing stock index of the Indian stock market. The approach identifies the optimum neuron count in the hidden layer for every training function with NARXNN, which reduces the NN structure and training time and increases the convergence speed. The experimental results reveal that the neuron count in the hidden layer cannot be identified by rule of thumb.
Keywords: NARX Neural Network; Time Series Data; Training Functions; Stock Index; Forecasting; Performance Analysis; Indian Stock Market.
INCORPORATING SECURITY AND INTEGRITY INTO THE MINING PROCESS OF HYBRID WEIGHTED-HASHT APRIORI ALGORITHM USING HADOOP
by Sumithra Radhakrishnan
Abstract: This paper discusses the leading association rule mining algorithms, the weighted and hash tree apriori algorithms, on a distributed cloud platform, their enhancement as a hybrid weighted-hashT apriori algorithm, and its implementation on a Eucalyptus platform. The work then addresses the integrity and security of data during the mining process. The algorithm is tested in a cloud environment using the Eucalyptus platform with VMware Workstation and the Hadoop Distributed File System. The work also evaluates how the distributed implementation outperforms standalone implementations of the weighted and hash tree apriori algorithms as well as their distributed implementations. The work further studies the effectiveness of using Eucalyptus Hadoop nodes and the performance changes that result from using the security protocol to protect data in the mining process.
Keywords: Data mining; Weighted apriori; HashT; Hadoop; Cloud; Data Integrity; Data Security; Eucalyptus; Apriori; Distributed mining.
Managing data using an ontology for enterprise decision-making: A case of The World Bank
by Tengku Adil Tengku Izhar, Torab Torabi, Trieu Minh Nhut Le
Abstract: People have access to more data in a single day than most people had access to in the previous decade. This data is created in many forms, and it highlights the development of big data. Big data has transformed the way organizations across industries implement new approaches to handling huge amounts of data. It means changes in skills, structures, technologies and architectures. Organizations rely on this data to achieve specific business priorities. The challenge is how to capture this data and turn it into useful information for specific organizational activities, because determining relevant data is key to delivering the value of information and knowledge from massive data collections. In this paper, we describe big data in the information spectrum in order to identify relevant data in large collections of big data and to assist information professionals with useful information for the decision-making process. We show how this approach provides conceptually simple yet powerful results that can be used to evaluate big data in organizations. We illustrate the relationship between big data and the information spectrum using an ontology. The relationship implies a strong tie to organizational goals, and it involves the management of knowledge that is useful for some purpose and creates value for the organization in light of those goals. A case study is conducted using data from the World Bank. The results of the case study demonstrate how we incorporate big data and the information spectrum using an ontology to provide a platform for extracting value from large datasets.
Keywords: Big Data; information professionals; information spectrum; ontologies; organizational goals; The World Bank.
A Novel Ensemble decision tree classifier using hybrid feature selection measures for Parkinson's disease prediction
by Bala Brahmeswara Kadaru, Raja Srinivasa Reddy B
Abstract: Parkinson's disease and Alzheimer's disease are among the most critical health issues today. In neurology, Parkinson's disease affects the dopamine receptors of the central nervous system. Dopamine is a type of G-protein that helps in the process of neural transmission. The disease affects the movement of patients. Many patients share most of the common symptoms, although a few distinct symptoms are also recorded. Dopamine cells degenerate progressively in this disease, which leads to a rapid growth in severity. An extensive amount of research has been done over the years on predicting Parkinson's disease at an early stage. To date, there is no significant approach that provides optimized prediction performance. Alzheimer's disease is another neurological disease, which leads to dementia in most cases. It gradually decreases mental ability, starting with short-term memory loss and ending with more critical conditions. Machine learning approaches are promising for the prediction of these diseases. In this paper, we present novel ensemble-based feature selection measures and decision tree models to predict Parkinson's disease. Experimental results show that the proposed model has high computational accuracy and true positive rate compared to traditional feature selection measures and ensemble decision trees.
Keywords: Feature selection measures; Ensemble Decision Tree; Disease prediction.
Review on propagation of secure data, prevention of attacks and Routing in Mobile Ad-hoc Networks (MANETs)
by Gautam Borkar
Abstract: Wireless communication is a significant part of modern technology for transmitting packets from a source node to a destination node. In the current development of wireless communications, MANETs play a major part. In this paper, we present a detailed review of the algorithms and systems used to address issues such as security, authentication and routing. We explain three categories of issues that arise while broadcasting packets, comparing each with past developments. To obtain precise solutions to issues such as authentication, protection and security, a vast number of protocols, routing strategies and algorithms have been used; however, it is very challenging to find an optimal and efficient technique that can be used universally. In this paper, we present an overview of existing techniques and then examine the work done by various researchers in the field of MANETs.
Keywords: Wireless networks; MANET; Communication.
Privacy preserving solution to prevent Classification Inference Attacks in Online Social Network
by Agrima Srivastava, Geethakumari G
Abstract: To improve their business solutions, data holders often release social network data and its structure to third parties. This data undergoes node and attribute anonymization before its release. This, however, does not protect users from inference attacks that an untrusted third party or an adversary could carry out by analyzing the structure of the graph. Therefore, it is necessary to anonymize not only the nodes and their attributes but also the edge sets in the released social network graph. Anonymization involves perturbing the actual data, which results in utility loss. Utility and privacy are inversely related, so ensuring both is a challenging task. In this work, we have proposed, implemented and verified an efficient utility-based privacy-preserving solution to prevent third-party inference attacks on an online social network graph.
Keywords: Privacy; Online Social Networks; Privacy Preserving Data Publishing; Utility; Network Classification.
An improved algorithm to handle noise objects in the process of clustering
by Hasanthi Pathberiya, Chandima Tilakaratne, Liwan Liyanage
Abstract: Cluster analysis is an approach to unsupervised learning. It seeks to recognize hidden grouping structure in a set of objects using a predefined set of rules. Objects with unusual characteristics add noise to the data space. As a result, complexities and misinterpretations in clustering structures arise. This study proposes a novel iterative approach to eliminate the effect of noise objects in the process of deriving clusters of data. The performance of the proposed approach is tested on partitioning, hierarchical and neural network based clustering algorithms using both simulated and standard data sets supplemented with noise. An improvement in the quality of the clustering structure resulting from the proposed approach is observed, compared with that of conventional clustering algorithms.
Keywords: Clustering algorithms; Handling noise data; Mining methods and algorithms; k-means; Ward’s method; Self organizing map.
Survey on Iterative and Incremental Approaches in Distributed Computing Environment
by Afaf Bin Saadon, Hoda Mokhtar
Abstract: Iterative computation has become increasingly needed for a large and important class of applications such as machine learning and data mining. These iterative applications typically apply computations over large-scale datasets. So it is desirable to develop efficient distributed frameworks to process data iteratively. On the other hand, data keeps growing over time as new entries are added and existing entries are deleted or modified. This incremental nature of data makes the previously computed results of iterative applications stale and inaccurate over time. It is hence necessary to periodically refresh the computation so that the new changes can be quickly reflected in the computed results. This paper presents the existing distributed systems that support iterative and incremental computations on large-scale datasets. It describes the main optimizations and features of these systems and identifies their limitations.
Keywords: Big data; Distributed systems; Iterative computation; Incremental processing.
Continuous Skyline Queries in Distributed Environment
by Ibrahim Gomaa, Hoda Mokhtar
Abstract: With the expanding number of communications from different mobile applications that acquire location information, the demand for continuous skyline queries has increased. Continuous skyline queries, unlike traditional skyline queries, which consider static attributes only, consider both dynamic and static attributes. In addition, the rapid growth in information, the extremely fast increase in data volume, and mobile applications that deal with such volumes of data (such as check-in recommendation, information services, applications that focus on moving objects in road networks, and navigation services) have driven the need to adopt new processing environments that are suitable for storing, processing, and maintaining huge amounts of data. In this paper, we present a number of efficient algorithms for processing continuous skyline queries on large datasets using the MapReduce framework. We propose three algorithms, namely PCSQ-MR, PDCSQ-MR and EPCSQ-MR, to compute the skyline query for a moving object. The main idea of the proposed algorithms is to compute the skyline query only once, at the starting position, and then update the result as the query point moves, rather than computing the skyline from scratch every time. In addition, experiments are conducted that demonstrate the accuracy, performance and efficiency of the proposed algorithms.
Keywords: Continuous query processing; moving object; parallel computation; skyline queries; big data management.
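To make the notion of a skyline with one dynamic attribute concrete, here is a minimal single-machine sketch. It simply recomputes the skyline at each query position, which is the baseline the abstract above aims to avoid; it does not reproduce PCSQ-MR, PDCSQ-MR or EPCSQ-MR and does not use MapReduce. The names `dominates` and `skyline_at` and the `hotels` data are hypothetical, and smaller values are assumed to be better for every attribute.

```python
from math import dist

def dominates(p, q):
    """p dominates q if p is no worse in every attribute and strictly
    better in at least one (smaller is better)."""
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))

def skyline_at(query, objects):
    """Skyline of (x, y, price) objects for a query point.

    The dynamic attribute is the distance to the query point; the static
    attribute is the price.
    """
    scored = [(dist(query, (x, y)), price) for x, y, price in objects]
    return [obj for obj, s in zip(objects, scored)
            if not any(dominates(t, s) for t in scored)]

# Hypothetical data: (x, y, price).
hotels = [(0, 0, 100), (1, 0, 80), (5, 0, 50), (5, 5, 90)]

at_start = skyline_at((0, 0), hotels)
after_move = skyline_at((5, 5), hotels)
```

With this data the skyline changes as the query point moves from (0, 0) to (5, 5), because the distance attribute of every object changes while the prices stay fixed.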
Selection of K in K-means Clustering using GA and VMA
by Sanjay Chakraborty, Subham Raj, Shreya Garg
Abstract: Cluster analysis has been widely used in several disciplines, such as statistics, software engineering, biology, psychology and other social sciences, to identify natural groups in large amounts of data. K-means is one of the most popular clustering algorithms. In spite of several advances, the K-means clustering algorithm suffers from drawbacks such as sensitivity to the initial cluster centers and getting stuck in local optima. A poor initial guess of the cluster centers leads to bad clustering results, and this is one of the major drawbacks of the K-means algorithm. In this paper, a new strategy is proposed in which the K-means algorithm is blended with a Genetic Algorithm (GA) and the Volume metric algorithm (VMA) to predict the best initial cluster centers, which is not possible with the K-means algorithm alone. The paper concludes with an analysis of the results of using the proposed measure to determine the number of clusters for the K-means algorithm on different well-known data sets from the UCI machine learning repository.
Keywords: Clustering; Cluster centers; K-means; Genetic Algorithm; Volume metric algorithm.
A retrospective data analysis of Legionella pneumophila diagnostic procedures and their impact on patients management: The experience of a rapid point-of-care test
by Eliona Gkika, Dimosthenis Chochlakis, Yannis Tselentis, Constantin Zopounidis, Vassilis Kouikoglou, Kitsos Gkikas, Anna Psaroulaki
Abstract: This study aims at a comparative assessment of a conventional test and a point-of-care test (POCT) for the diagnosis of Legionella pneumophila, considering laboratory and clinical performance (test turnaround time (TAT), antibiotic treatment, and diagnostic efficiency) as well as economic criteria.
A retrospective analysis was undertaken using data from the microbiology laboratories of two hospitals in Crete, Greece. We focused on hospitalized patients with clinical evidence of pneumonia and positive test for L. pneumophila (confirmed cases). Hospital A adopts a conventional serological diagnosis based on an indirect fluorescent-antibody technique (IFA) and Hospital B uses a urinary antigen test (UAT), which is a rapid POCT.
The mean TAT was 4.45 days (range 0 to 21) for the conventional IFA test and 0.11 days (range 0 to 64) for UAT. A total of 24 laboratory-positive cases (11 inpatients, 13 outpatients) were identified out of 905 analyzed samples taken from 751 people. Infection was more prevalent in men, with a mean age of 61.77 years (SD = 20.03; range 5 to 92). The mean daily hospitalization cost of confirmed cases was 127.45 for Hospital A (two inpatients with costs of 91.00 and 163.90) and 79.86 for Hospital B (nine inpatients with costs ranging from 60.00 to 135.00). The mean antibiotic treatment cost per patient in Hospital A was much higher than in Hospital B. Provision of a rapid laboratory diagnosis of L. pneumophila could significantly decrease time to diagnosis, improve treatment and consequently reduce the associated hospitalization charges.
Keywords: Legionella pneumophila; Point of care testing; turnaround time; length of stay; cost reduction.
ANALYSIS OF WEATHER DATA USING VARIOUS REGRESSION ALGORITHMS
by Jahnavi Y
Abstract: Weather forecasting is a vital application in meteorology and has been one of the most challenging problems around the world. Data mining is a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions. This paper focuses on weather analysis using various regression algorithms in data mining. Available regression algorithms include Linear Regression, Nonlinear Regression, Classification And Regression Tree, Multilayer Perceptron Neural Network and Support Vector Machine. In this work, Linear Regression, Classification And Regression Tree, Multilayer Perceptron Neural Network and Support Vector Machine are used. For weather analysis, primary atmospheric parameters such as average temperature, average pressure and relative humidity are considered. The performance of the regression algorithms is measured using root mean square error, mean absolute error, relative absolute error and root relative square error. The experiments show that the error rate of Linear Regression is higher than that of Classification And Regression Tree, and the error rate of Classification And Regression Tree is higher than that of Multilayer Perceptron. Support Vector Machine performs better than Multilayer Perceptron, Classification And Regression Tree and Linear Regression. For root relative square error, Classification And Regression Tree has a higher rate on both training data and test data.
Keywords: Multilayer Perceptron; Classification And Regression Tree; Support Vector Machine.
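The four evaluation measures named in the abstract above can be sketched as follows, using their commonly used definitions: the relative measures normalize the error by that of a baseline that always predicts the mean of the actual values. The function name `regression_errors` and the temperature values are hypothetical and are not taken from the paper.

```python
from math import sqrt

def regression_errors(actual, predicted):
    """Return MAE, RMSE, RAE and RRSE for paired actual/predicted values."""
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_err = [abs(p - a) for a, p in zip(actual, predicted)]
    sq_err = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    # Errors of the baseline that always predicts the mean of the actuals.
    abs_dev = sum(abs(a - mean_actual) for a in actual)
    sq_dev = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "mae": sum(abs_err) / n,             # mean absolute error
        "rmse": sqrt(sum(sq_err) / n),       # root mean square error
        "rae": sum(abs_err) / abs_dev,       # relative absolute error
        "rrse": sqrt(sum(sq_err) / sq_dev),  # root relative square error
    }

# Hypothetical average temperatures (actual vs. predicted).
temps_actual = [20.0, 22.0, 24.0, 26.0]
temps_predicted = [21.0, 22.0, 23.0, 28.0]
errors = regression_errors(temps_actual, temps_predicted)
```

A relative measure below 1 means the model does better than the mean-only baseline, which makes RAE and RRSE comparable across parameters with different units.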
Analysis of Co-authorship Network Based on Some Betweenness Centrality Concepts
by Divya Sindhu Lekha, Kannan Balakrishnan, Sunil Kumar R
Abstract: Reliant components of a network are the connector nodes which aid in establishing a strongly connected network. The betweenness centrality of a node captures its connecting capability well. We suggest some new betweenness centrality measures which could be useful in analysing the structural connectivity of a network. In this paper, we study the behaviour of collaboration in a co-authorship network, namely the NetScience network, from the perspective of these measures. We analyse the network from a micro perspective, where we consider small groups of scientists doing research in a common subdiscipline. We show that each group is formed by the influence of only one or two highly collaborating authors. We also speculate that, even though these authors are highly influential in smaller groups, they do not contribute notably to the overall research of the main discipline.
Keywords: Complex networks; Network centrality; Graph theory; Betweenness center; Collaboration network; Co-authorship network.
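For readers unfamiliar with the baseline measure, the sketch below computes standard shortest-path betweenness centrality on an undirected, unweighted graph using Brandes' algorithm. It is not one of the new measures suggested in the abstract above, and the tiny `coauthors` graph is hypothetical rather than drawn from the NetScience network.

```python
from collections import deque

def betweenness(adj):
    """Standard betweenness for an undirected, unweighted graph given as
    an adjacency dict {node: [neighbours]} (Brandes' algorithm)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s: count shortest paths (sigma) and record predecessors.
        sigma = {v: 0 for v in adj}
        depth = {v: -1 for v in adj}
        preds = {v: [] for v in adj}
        sigma[s], depth[s] = 1, 0
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if depth[w] < 0:
                    depth[w] = depth[v] + 1
                    queue.append(w)
                if depth[w] == depth[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies in reverse BFS order.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted once from each endpoint.
    return {v: c / 2 for v, c in score.items()}

# Hypothetical group: author "b" has written with "a", "c" and "d".
coauthors = {"a": ["b"], "b": ["a", "c", "d"], "c": ["b"], "d": ["b"]}
centrality = betweenness(coauthors)
```

In this toy group every shortest path between two other authors passes through "b", mirroring the situation where a group is held together by one highly collaborating author.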
An Application of the Logic of Explanatory Power in Rough Set Analysis: Implications for the Classification of Decision Rules
by Anthony T. Odoemena
Abstract: This paper uses the logic of explanatory power to address the question of uncertain decision rule classification and interpretation in rough set data analysis. A set theoretic configuration of the measure of explanatory power is introduced. The usefulness of the measure is then examined in the context of two data sets, one related to car evaluation and the other related to the provision of extra educational supports. It is found that the explanatory power measure has some interesting properties that enhance the informativeness and interpretation of non-deterministic decision rules. The result of the numerical analysis shows that the explanatory power index is unique. The index can also facilitate the establishment of an objective threshold that determines whether the explanatory relevance of the premise in a given decision rule is positive, negative, or neutral.
Keywords: Rough sets; explanatory power; data analysis; decision rules.