Forthcoming articles

 


International Journal of Data Science

 

These articles have been peer-reviewed and accepted for publication in IJDS, but are pending final changes, are not yet published and may not appear here in their final order of publication until they are assigned to issues. Therefore, the content conforms to our standards but the presentation (e.g. typesetting and proof-reading) is not necessarily up to the Inderscience standard. Additionally, titles, authors, abstracts and keywords may change before publication. Articles will not be published until the final proofs are validated by their authors.

 

Forthcoming articles must be purchased for the purposes of research, teaching and private study only. These articles can be cited using the expression "in press". For example: Smith, J. (in press). Article Title. Journal Title.

 


 

International Journal of Data Science (28 papers in press)

 

Regular Issues

 

  • Framework for finding maximal association rules in mobile web service environment using soft set
    by Krishna Kumar Mohbey, G.S. Thakur 
    Abstract: Electronic commerce is very popular nowadays. It is a fast and convenient way to transfer information and communicate with people. E-commerce uses various web services to perform specific tasks. When a particular user accesses web services, the accesses are stored sequentially in a database as web service sequences. Association rules are used to correlate different web services for knowledge prediction. In this paper, we design a framework for generating maximal association rules of accessed web service sequences using soft sets. Soft sets use binary values for their standard representation. The framework converts web service sequences into a Boolean-valued information system using the concept of coexisting attributes in a sequence. We define the concept of maximal association rules between attribute sets; maximal support and confidence are also defined using soft sets. Experimental results show that the proposed soft set based framework produces rules identical to those of other maximal association rule and rough set based approaches.
    Keywords: Web services sequence; Maximal association rule; Soft set; Co-exist services; Boolean value system.

  • Secondary Protein Structure Prediction Combining Protein Structural Class, Relative Surface Accessibility, and Contact Number
    by Imad Rahal 
    Abstract: With huge amounts of molecular data produced from ever-increasing numbers of genomic and proteomic studies, predicting the secondary structure of proteins from amino acid sequences has become a common expectation among scientists. Several studies in the literature have demonstrated that the accuracy of such predictions can be drastically improved by incorporating additional types of protein data into the prediction process; however, no work has studied the effect of incorporating multiple types of protein data simultaneously. In this work, we report our findings from an extensive experimental study that uses neural networks designed to study the effect of using different combinations of protein data on the accuracy of predicting secondary protein structures. Overall, our experimental results indicate that accuracy improves the most when incorporating contact number, relative surface accessibility or any combination that includes at least one of the two into the prediction process.
    Keywords: protein structure prediction; neural networks; machine learning; scientific data mining; data science; bioinformatics; protein structural class; relative surface accessibility; protein contact number.

  • Hurst Exponent, Fractals and Neural Networks for Forecasting Financial Asset Returns in Brazil
    by Joao Nunes De Mendonça Neto, Luiz Paulo Lopes Fávero, Renata Turola Takamatsu 
    Abstract: Our aim is to verify the existence of a relationship between long-term memory in fractal time series and the prediction error of financial asset returns obtained by Artificial Neural Networks (ANN). We expect that fractal time series with larger memory can achieve predictions with lower error, since the correlation between elements of the series favors the quality of ANN prediction. As a long-term memory measure, the Hurst exponent of each time series was calculated, and the Root Mean Square Error (RMSE) produced by the ANN in each time series was used to measure the prediction error. The Hurst exponent was computed with the rescaled range analysis (R/S) algorithm. The ANN architecture used Time Lagged Feedforward Neural Networks (TLFN), with a backpropagation supervised learning process and gradient descent for error minimization. Brazilian financial assets traded at BM&FBovespa, specifically shares of public companies and real estate investment funds, were considered.
    Keywords: Hurst Exponent; Fractals; Artificial Neural Networks; Time Series Forecasting; Financial Assets.

  • Sequence Similarity Using Composition Method
    by Geetika Munjal, Pooja Sharma, Deepti Gaur 
    Abstract: DNA has an enormous capacity to carry very important information in the form of character strings. Sequence analysis is the process of applying a wide range of methods to DNA sequences to understand the structure, features or evolution of these nucleotide strings. The analysis uses mathematical methods to convert the character strings to numerical values, and these numerical values are used to find similarity between the sequences. DNA sequences contain only four nucleotides, A, C, G and T, but to extract information from these sequences, sequence comparison becomes essential. In this paper, various methods to analyze DNA sequences, including the use of entropy, divergence, LZ complexity, and the role of hybridization, are explored. A hybrid model based on the composition vector and distance methods is proposed to find dissimilarity between sequences, and this hybrid model is tested on sequences of species downloaded from NCBI.
    Keywords: Nucleotides; Entropy; Frequency vector.

  • On the Poisson distribution applicability to the Japanese seismic activity
    by Antoine Bossard 
    Abstract: The Japanese isles are located on the Pacific Ring of Fire and thus face intense seismic activity. Earthquakes are recorded on a daily basis, and forecasting (strong) earthquakes is a critical and actively researched topic. In this study, we investigate the applicability of the Poisson distribution to earthquake forecasting, specifically on the Japanese territory. We analyse recent seismic data with the aim of identifying the parameters of the Poisson distribution that yield the best forecasts, and of deducing patterns and relations between, for instance, seismic intensities and geographical locations. We conduct several experiments on the gathered data to eventually discuss the most promising conditions for the Poisson distribution in this forecasting context.
    Keywords: probabilistic inference; data analysis; sensor network; earthquake; K-NET; KiK-net.

  • Design and Implementation of Non-Perfect Reconstruction Biorthogonal Wavelets for Edge Detection of X-ray Images
    by P.M.K. Prasad, G. Sasi Bhushana Rao, M.N.V.S.S. Kumar, K. Chiranjeevi 
    Abstract: X-ray bone images are used extensively by medical practitioners to detect minute fractures, as they are painless and economical compared with other imaging modalities. Edge detection of X-ray bone images is very useful to medical practitioners because it provides important information for diagnosis, which in turn enables better treatment decisions for patients. This paper proposes the design and implementation of a non-perfect reconstruction (NPR) biorthogonal wavelet for the edge detection of X-ray images, which helps medical practitioners to analyse the images. The perfect reconstruction condition of the biorthogonal wavelet is relaxed for effective edge detection. The low-pass decomposition filter of the non-perfect reconstruction biorthogonal wavelet has even symmetry about the zero location, so the edge position is more accurate in multiscale image edge detection; the high-pass decomposition filter has odd symmetry about the 1/2 location and its sign is monotonic. These properties ensure that the proposed wavelet is well suited to image edge detection and support the validity of the detected edges. The NPR Zbo6.5 wavelet performs well in detecting edges with better quality. The simulation results show that the non-perfect reconstruction biorthogonal wavelet is effective and accurate, and that it is superior to the perfect reconstruction (PR) biorthogonal wavelet for edge detection of X-ray images. Performance metrics such as the ratio of edge pixels to the size of an image (REPS), peak signal-to-noise ratio (PSNR) and computation time are compared for various biorthogonal wavelets.
    Keywords: filter-banks; non-perfect reconstruction; symmetry; biorthogonal; edge detection.

  • Damage Identification of Composite Beam Structure Using Fuzzy Logic Based Model
    by Deepak K. Agarwalla 
    Abstract: Damage identification of beam structures has been in practice for the last few decades. The methodologies adopted have been upgraded over time depending on the complexity of the damage or crack and the desired accuracy. The use of artificial intelligence (AI) techniques has also been considered by many researchers. In the current research, damage detection of a glass fiber reinforced composite cantilever beam subjected to vibration has been carried out. A fuzzy-based model using triangular, trapezoidal and Gaussian membership functions separately has been developed to predict the damage characteristics, i.e. relative damage position and relative damage severity. The inputs required for the fuzzy-based model, i.e. the first three relative natural frequencies and the first three mode shape differences, have been determined by finite element analysis of the damaged cantilever beam subjected to natural vibration. An experimental setup has been used to justify the robustness of the proposed technique for damage identification.
    Keywords: damage, artificial intelligence techniques, glass fiber reinforced composite cantilever beam, fuzzy model, triangular membership function, trapezoidal membership function, Gaussian membership function, relative natural frequency, mode shape difference, relative damage position, relative damage severity

  • Data set comparison workflows
    by Marko Robnik-Sikonja 
    Abstract: To assess the similarity of two data sets from the point of view of data science, univariate statistical comparisons are mostly insufficient. We present a methodology which estimates the similarity of data sets from the point of view of data mining tasks. For example, we provide relevant information for deciding whether a new or related data set can be used with an existing supervised or unsupervised model. Another example is testing whether an artificially generated data set is appropriate for tuning a model's parameters. We propose several workflows which cover a) statistical properties of generated data, b) distance based structural similarity, and c) predictive similarity of two data sets. We evaluate the proposed workflows on random splits of several data sets and by comparing original data sets with data sets produced by a generator of semi-artificial data. The results show that the proposed workflows can reveal relevant similarity information about data sets needed in many data mining scenarios.
    Keywords: data analysis, data mining, machine learning, data similarity, clustering, classification

  • Classification Diversity Measurement
    by Anthony Scime 
    Abstract: Interesting classification rules can be determined by a number of measures. When searching a domain for a characterization of unique, different, but important data, an appropriate measurement is diversity. Diversity as a measure of a classification rule is based on the relative distinctness of the rule from the other rules in the rule-set. The diversity measure is the sum of the inverse of the commonness of a rule's items. In this paper, diversity is derived from the simplest classification trees using techniques from statistics and information retrieval, and demonstrated using sample datasets.
    Keywords: Classification data mining; Diversity; Interestingness measurement

  • An Enhance DE algorithm for analysis in data set
    by Dharmpal Singh, J. Paul Choudhury, Mallika De 
    Abstract: Differential evolution (DE) is a simple, powerful optimization algorithm, which has been widely used to solve constrained optimization problems, multi-objective global optimizations, and other complex real-world applications. However, it has been observed that choosing the best mutation and search strategies, long training times and lower classification accuracy remain difficulties for specific problems. Furthermore, one has to know the appropriate encoding schemes and evolutionary operators, as well as suitable parameter settings, to ensure the success of the algorithm. Otherwise, the time-consuming trial-and-error tuning of parameters and operators may lead to demanding computational costs. To minimize these drawbacks, an enhanced DE has been proposed to improve searching ability and efficiently guide the evolution of the population toward the global optimum. It has been further observed that mutation and crossover play an important role in DE optimization, and that the several functions available for them may lead to different results for the same data set. Therefore, an effort has been made here to suggest a crossover and mutation strategy that leads to a faster and more efficient evolution of the population toward the global optimum.
    Keywords: Data mining; association rule; data preprocessing; factor analysis; fuzzy logic; neural network; particle swarm optimization and artificial bee colony DEA.

  • Using Bayesian Inference to Measure the Proximity of Flow Cytometry Data
    by Sherief Abdallah 
    Abstract: Measuring the proximity between two patients is a crucial element in most data mining tasks. For example, to predict whether a patient has cancer, we need to compare the patient to other patient(s) based on the data. In this paper, we focus on patient data that are derived from Flow Cytometry (FCM). FCM is a widely used technique in health-related fields, including cancer diagnosis and HIV monitoring. Measuring and quantifying the proximity between two patients based on the FCM data is challenging. Not only does each file contain thousands of features (representing different cells), but also (and more importantly) the features are unordered. Furthermore, the data of a single patient can be divided over multiple FCS files due to technical limitations of FCM machines. We propose in this paper the use of Bayesian inference, along with Binning, to represent and measure the proximity between two patients using FCM data. We verify the effectiveness of our approach by comparing the performance of several classification algorithms in predicting leukemia cases.
    Keywords: Flow Cytometry; Data Mining; Leukemia; Bayesian Inference.

  • Summarization of subspace clusters based on Similarity connectedness
    by B. Jaya Lakshmi, M. Shashi, K.B. Madhuri 
    Abstract: Subspace clustering is an emerging area which explores clusters of objects in various subspaces. Existing subspace clustering algorithms such as SUBCLU and CLIQUE are computationally expensive, as they generate a large number of possibly redundant subspace clusters, limiting the interpretability of the results. The problem gets even worse as the dimensionality of the dataset increases. This calls for an efficient summarization framework that generates a limited number of interesting subspace clusters. The authors propose a new framework for generating low dimensional subspaces, clustering them based on similarity and merging them to form the corresponding subspace clusters, subsuming the information content of the low dimensional member clusters. A novel algorithm, Similarity connectedness based Clustering on subspace Clusters (SCoC), is proposed to form natural groupings of lower dimensional subspace clusters. The concept of similarity connectedness is introduced to group and merge the subspace clusters formed in different lower dimensional subspaces, leaping through the lattice of dimensions. The resulting compact and summarized high dimensional subspace clusters can be easily interpreted for making sound decisions. The SCoC algorithm is thoroughly tested on various benchmark datasets and is found to outperform PCoC and SUBCLU in terms of both cluster quality and execution time.
    Keywords: Subspace clusters; summarization; similarity; similarity connectedness; similarity threshold; groups of subspace clusters.

  • A Review of Data Mining Algorithms On Hadoop's MapReduce
    by Sikha Bagui, Sean Spratlin 
    Abstract: This paper is a review of the most frequently used data mining algorithms on Hadoop's MapReduce. We describe the algorithms with respect to their implementation and performance on Hadoop's MapReduce. We also discuss the similarities and differences between MapReduce's parallel or distributed implementations and the original standard sequential implementations.
    Keywords: Hadoop; MapReduce; Classification; Clustering; KNN; SVM; Regression; Association Rule Mining.

  • Community Detection in Dynamic Networks with Spark
    by Priyangika Piyasinghe, Morris Chang 
    Abstract: Detecting the evolution of communities within dynamically changing networks is important for understanding the latent structure of complex large graphs. In this paper, we present an algorithm to detect real-time communities in dynamically changing networks. We demonstrate the proposed methodology through a case study in peer-to-peer botnet detection; such botnets are one of the major threats in network security because they serve as the infrastructure for various cyber crimes. Our method considers the online community structure from time to time and improves efficiency while maintaining the same level of accuracy of community detection over time. Experimental evaluation of an Apache Spark implementation of the method showed that the execution time improves over a dynamic version of the Girvan-Newman community detection algorithm while achieving a higher accuracy level.
    Keywords: Dynamic Networks; Community Detection; Girvan-Newman algorithm; Large Graphs; Spark.

  • Performance Analysis of NARX Neural Network Back Propagation Algorithm by Various Training Functions for Time Series Data
    by Ashok Kumar Durairaj, Murugan Solaiyappan 
    Abstract: This study investigates various training functions with the Nonlinear Auto Regressive eXogenous Neural Network (NARXNN) for forecasting the closing index of the stock market. An iterative approach adjusts the number of hidden neurons of a NARXNN model. This approach systematically constructs different NARXNN models, from simple to complex architectures, with different training functions and finds the optimum NARXNN model. The effectiveness of the proposed approach is assessed for a step-ahead forecast of the Bombay Stock Exchange (BSE100) closing stock index of the Indian stock market. The approach identifies the optimum neuron count in the hidden layer for every training function with NARXNN, which reduces the NN structure and training time and increases the convergence speed. The experimental results reveal that neuron counts in the hidden layer cannot be identified by rule of thumb.
    Keywords: NARX Neural Network; Time Series Data; Training Functions; Stock Index; Forecasting; Performance Analysis; Indian Stock Market.

  • INCORPORATING SECURITY AND INTEGRITY INTO THE MINING PROCESS OF HYBRID WEIGHTED-HASHT APRIORI ALGORITHM USING HADOOP
    by Sumithra Radhakrishnan 
    Abstract: This paper discusses the leading association rule mining algorithms, the weighted and hash tree apriori algorithms, on a distributed cloud platform, their enhancement as a hybrid weighted-hashT apriori algorithm, and its implementation on a Eucalyptus platform. The work then addresses the integrity and security of data during the mining process. The algorithm is tested in a cloud environment using the Eucalyptus platform with VMware Workstation and the Hadoop Distributed File System. The work also evaluates how the distributed implementation outperforms standalone implementations of the weighted and hash tree apriori algorithms as well as their distributed implementations. The work further studies the effectiveness of using Eucalyptus Hadoop nodes and the performance changes that result from using the security protocol to protect data in the mining process.
    Keywords: Data mining; Weighted apriori; HashT; Hadoop; Cloud; Data Integrity; Data Security; Eucalyptus; Apriori; Distributed mining.

  • Managing data using an ontology for enterprise decision-making: A case of The World Bank
    by Tengku Adil Tengku Izhar, Torab Torabi, Trieu Minh Nhut Le 
    Abstract: People have access to more data in a single day than most people had access to in the previous decade. This data is created in many forms, and it highlights the development of big data. Big data has transformed the way organizations across industries implement new approaches to handling huge amounts of data. This means changes in skills, structures, technologies and architectures. Organizations rely on this data to achieve specific business priorities. The challenge is how to capture this data and analyse it into useful information for specific organizational activities, because determining relevant data is key to delivering the value of information and knowledge from massive data collections. In this paper, we describe big data in the information spectrum to identify relevant data from large collections of big data and to assist information professionals with useful information for the decision-making process. We show how this approach provides conceptually simple yet powerful results that can be used to evaluate big data in organizations. We illustrate the relationship between big data and the information spectrum using an ontology. The relationship implies a strong tie to organizational goals, and it involves the management of knowledge that is useful for some purpose and which creates value for the organization in light of those goals. A case study is conducted using data from the World Bank. The results of the case study demonstrate how we incorporate big data and the information spectrum using an ontology to provide a platform to extract value from large datasets.
    Keywords: Big Data; information professionals; information spectrum; ontologies; organizational goals; The World Bank.

  • A Novel Ensemble decision tree classifier using hybrid feature selection measures for Parkinson's disease prediction
    by Bala Brahmeswara Kadaru, RAJA SRINIVASA REDDY B 
    Abstract: Parkinson's disease and Alzheimer's disease are among the most critical health issues today. In neurology, Parkinson's disease affects the dopamine receptors of the central nervous system. Dopamine is a type of G-protein that helps in the process of neural transmission. The disease affects the movement of patients. Many patients share most of the common symptoms, whereas a few distinct symptoms are also recorded. Dopamine cells degenerate progressively in this disease, which leads to a rapid growth in severity. Extensive research has been carried out over the years on predicting Parkinson's disease at an early stage. To date, no approach provides optimized prediction performance. Alzheimer's disease is another neurological disease, which generally leads to dementia in most cases. It gradually decreases mental ability, beginning with short-term memory loss and ending with more critical conditions. Machine learning approaches are promising for the prediction of these diseases. In this paper, we present novel ensemble-based feature selection measures and decision tree models to predict Parkinson's disease. Experimental results show that the proposed model has higher computational accuracy and true positive rate than traditional feature selection measures and ensemble decision trees.
    Keywords: Feature selection measures; Ensemble Decision Tree; Disease prediction.

  • Review on propagation of secure data, prevention of attacks and Routing in Mobile Ad-hoc Networks (MANETs)
    by Gautam Borkar 
    Abstract: Wireless communication is a significant part of modern technology for transmitting packets from a source node to a destination node. In the current development of wireless communications, MANETs play a major part. In this paper, we present a detailed review of the algorithms and systems used for solving issues such as security, authentication and routing. We explain three different categories of issues that occur while broadcasting packets, comparing each with past developments. To obtain precise solutions to issues such as authentication, protection and security, a vast number of protocols, routing strategies and algorithms have been used; however, it is very challenging to find an optimal and efficient technique that can be used globally. We present an overview of different existing techniques and then critically analyse the work done by various researchers in the field of MANETs.
    Keywords: Wireless networks; MANET; Communication.

  • Privacy preserving solution to prevent Classification Inference Attacks in Online Social Network
    by Agrima Srivastava, Geethakumari G 
    Abstract: In order to improve their business solutions, data holders often release social network data and its structure to third parties. This data undergoes node and attribute anonymization before its release. This, however, does not protect users from inference attacks that an untrusted third party or an adversary could carry out by analyzing the structure of the graph. Therefore, it is necessary not only to anonymize the nodes and their attributes but also to anonymize the edge sets in the released social network graph. Anonymizing involves perturbing the actual data, which results in utility loss. Utility and privacy are inversely related, so ensuring both is a challenging task. In this work, we have proposed, implemented and verified an efficient utility based privacy preserving solution to prevent third party inference attacks on an online social network graph.
    Keywords: Privacy; Online Social Networks; Privacy Preserving Data Publishing; Utility; Network Classification.

  • An improved algorithm to handle noise objects in the process of clustering
    by Hasanthi Pathberiya, Chandima Tilakaratne, Liwan Liyanage 
    Abstract: Cluster analysis is considered an approach to unsupervised learning. It aims to recognize hidden grouping structure in a set of objects using a predefined set of rules. Objects with unusual characteristics add noise to the data space. As a result, complexities and misinterpretations arise in clustering structures. This study proposes a novel iterative approach to eradicate the effect of noise objects in the process of deriving clusters of data. The performance of the proposed approach is tested on partitioning, hierarchical and neural network based clustering algorithms using both simulated and standard data sets supplemented with noise. An improvement in the quality of the clustering structure resulting from the proposed approach is observed, compared with that of conventional clustering algorithms.
    Keywords: Clustering algorithms; Handling noise data; Mining methods and algorithms; k-means; Ward’s method; Self organizing map.

  • Survey on Iterative and Incremental Approaches in Distributed Computing Environment
    by Afaf Bin Saadon, Hoda Mokhtar 
    Abstract: Iterative computation has become increasingly needed for a large and important class of applications such as machine learning and data mining. These iterative applications typically apply computations over large-scale datasets. So it is desirable to develop efficient distributed frameworks to process data iteratively. On the other hand, data keeps growing over time as new entries are added and existing entries are deleted or modified. This incremental nature of data makes the previously computed results of iterative applications stale and inaccurate over time. It is hence necessary to periodically refresh the computation so that the new changes can be quickly reflected in the computed results. This paper presents the existing distributed systems that support iterative and incremental computations on large-scale datasets. It describes the main optimizations and features of these systems and identifies their limitations.
    Keywords: Big data; Distributed systems; Iterative computation; Incremental processing.

  • Continuous Skyline Queries in Distributed Environment
    by Ibrahim Gomaa, Hoda Mokhtar 
    Abstract: With the expanding number of communications from different mobile applications that acquire location information, the demand for continuous skyline queries has increased. Continuous skyline queries, unlike traditional skyline queries which consider static attributes only, consider both dynamic and static attributes. In addition, the rapid growth in data volume, together with the mobile applications that deal with such volumes of data (check-in recommendation, information services, applications that focus on moving objects in road networks, and navigation services), has driven the need to adopt new processing environments that are suitable for storing, processing, and maintaining huge amounts of data. In this paper, we present a number of efficient algorithms for processing continuous skyline queries on large datasets using the MapReduce framework. We propose three algorithms, namely PCSQ-MR, PDCSQ-MR and EPCSQ-MR, to compute the skyline query for a moving object. The main idea of the proposed algorithms is to compute the skyline query only once at the starting position and then update the result as the query point moves, rather than recomputing the skyline from scratch each time. In addition, experiments are conducted that demonstrate the accuracy, performance and efficiency of the proposed algorithms.
    Keywords: Continuous query processing; moving object; parallel computation; skyline queries; big data management.

  • Selection of K in K-means Clustering using GA and VMA
    by Sanjay Chakraborty, Subham Raj, Shreya Garg 
    Abstract: Cluster analysis has been widely used in several disciplines, such as statistics, software engineering, biology, psychology and other social sciences, to identify natural groups in large amounts of data. K-means is one of the most popular clustering algorithms. Despite several advances, the K-means clustering algorithm still suffers from drawbacks such as sensitivity to the initial cluster centers and a tendency to get stuck in local optima. Poor initial guesses of the cluster centers lead to bad clustering results, and this is one of the major drawbacks of the K-means algorithm. In this paper, a new strategy is proposed in which we blend the K-means algorithm with a Genetic Algorithm (GA) and the Volume metric algorithm (VMA) to predict the best initial cluster centers, which K-means alone cannot do. The paper concludes with an analysis of the results of using the proposed measure to determine the number of clusters for the K-means algorithm on several well-known datasets from the UCI machine learning repository.
    Keywords: Clustering; Cluster centers; K-means; Genetic Algorithm; Volume metric algorithm.

  • A comparison of classification methods in automated taxa identification of benthic macroinvertebrates
    by Henry Joutsijoki, Martti Juhola 
    Abstract: In this research, we examined the automated taxa identification of benthic macroinvertebrates. Benthic macroinvertebrates play an important role in biomonitoring and can be used in water quality assessments. Their identification is usually performed by highly trained experts, but this approach is costly; automating the identification process could reduce costs and make wider biomonitoring possible. The automated taxa identification of benthic macroinvertebrates reduces to an image classification problem. We applied 11 different classification methods to an image dataset of eight taxonomic groups of benthic macroinvertebrates, and extensive experimental tests were performed. The best results, accuracies of around 94%, were achieved with quadratic discriminant analysis (QDA), the radial basis function network and the multi-layer perceptron (MLP). On the basis of these results, the automated taxa identification of benthic macroinvertebrates is possible with high accuracy.
    Keywords: benthic macroinvertebrates; classification; machine learning; water quality.
    DOI: 10.1504/IJDS.2017.10009003
     
  • Correlated gamma frailty models for bivariate survival data based on reversed hazard rate
    by David D. Hanagal, Arvind Pandey 
    Abstract: Frailty models are used in survival analysis to account for unobserved heterogeneity in individual risks of disease and death. To analyse bivariate data on related survival times (e.g., matched pairs experiments, twin or family data), shared frailty models have been suggested. Shared frailty models are used despite their limitations. To overcome their disadvantages, correlated frailty models may be used. In this paper, we introduce gamma correlated frailty models based on the reversed hazard rate (RHR) with three different baseline distributions, namely the generalised log-logistic type I, the generalised log-logistic type II and the modified inverse Weibull. We introduce a Bayesian estimation procedure using the Markov Chain Monte Carlo (MCMC) technique to estimate the parameters involved in these models. We present a simulation study to compare the true values of the parameters with the estimated values. We also apply the proposed models to the Australian twin dataset and suggest a better model.
    Keywords: Bayesian estimation; correlated gamma frailty; generalised log-logistic distribution type I; generalised log-logistic type II; modified inverse Weibull distribution.
    DOI: 10.1504/IJDS.2017.10009004
     
  • Record linkage in organisations: a review and directions for future research
    by Tengku Adil Tengku Izhar, Torab Torabi, M. Ishaq Bhatti 
    Abstract: Record linkage is the task of identifying related data in large datasets across different data sources. Although record linkage has been applied in many areas, there is limited discussion in the literature that gives an overview of recent developments addressing record linkage in the scope of organisational goals. This paper classifies recent developments in record linkage as an approach to understanding the dependencies of organisational data in relation to organisational goals. We reviewed recent literature based on this classification to identify recent developments in record linkage. The results show that no study has evaluated record linkage in the scope of organisational data that relate to organisational goals. This paper serves as a first step towards developing the dependency relationship between organisational data and organisational goals.
    Keywords: record linkage; data goal dependency; data linkage; organisational goals; literature review.
    DOI: 10.1504/IJDS.2017.10009005
     
  • Bayesian estimation of Lomax distribution under type-II hybrid censored data using Lindley's approximation method
    by Sanjay Kumar Singh, Umesh Singh, Abhimanyu Singh Yadav 
    Abstract: In this paper, we discuss the estimation procedure for the two-parameter Lomax distribution under a Type-II hybrid censoring scheme. Maximum likelihood estimation (MLE) and Bayes estimation of the parameters and reliability characteristics are considered. Lindley's approximation technique is used to obtain the Bayes estimates. The performance of the Bayes estimators is compared with that of the corresponding maximum likelihood estimators (MLEs) in terms of their mean square error (MSE). Finally, a real dataset is used to illustrate the methodology.
    Keywords: Lomax distribution; hybrid censoring; Lindley's approximation technique.
    DOI: 10.1504/IJDS.2017.10009048