International Journal of Data Analysis Techniques and Strategies (37 papers in press)
Solving a Multi-Objective Redundancy Allocation Problem under Opportunistic Maintenance Strategy
by Hadi Mokhtari, Ali Salmasnia, Mohsen Afsahi, Ali Ghorbanian
Abstract: The redundancy allocation problem concerns increasing a system's reliability through the parallel arrangement of each subsystem's components. In recent years, various approaches have been proposed to solve this problem, although most pursue only the maximization of system reliability. Meanwhile, increasing the number of identical components in each subsystem leads to a higher system cost. Therefore, this paper presents an approach based on design of experiments (DOE) and data envelopment analysis (DEA) to determine the number of redundant components by considering both the reliability and the cost of the system. Since most continuous production systems have high setup costs, the proposed approach uses the opportunistic maintenance policy to apply most restorative operations at each system halt.
Keywords: Redundancy allocation problem, multi-objective optimization, experimental design, data envelopment analysis, opportunistic maintenance policy
REVIEW ON FACTORS AFFECTING CUSTOMER CHURN IN TELECOM SECTOR
by Vishal Mahajan, Richa Misra, Renuka Mahajan
Abstract: The communications sector is emerging with new technologies and with wireless and wireline services. The industry's success requires a better understanding of customer requirements and superior quality of service and models. Customer churn has a huge impact on companies and is the prime focus area for companies that want to remain competitive and profitable. Hence, significant research has been undertaken worldwide to understand the dynamics of customer churn. This paper reviews around 75 recent journal articles (from the year 2000 onward) to identify the various churn factors and their complex relationships in the existing telecom churn literature. It discusses in detail which factors were identified in the various studies, the sample sizes used, and the methods used by different researchers. The gaps identified in previous studies are also discussed. A model of churn factors identified from the study is proposed to serve as a roadmap for building upon existing churn management techniques.
Keywords: Customer Churn, Telecom, Churn Management, Customer Switching, Customer demographics, Churn behaviour, Churn determinants
VARIABLE SELECTION IN LINEAR REGRESSION IN THE PRESENCE OF OUTLIERS
by Tejaswi Kamble, Dattatraya Kashid
Abstract: The majority of variable selection methods are based on the ordinary least squares (OLS) parameter estimation method. The performance of these variable selection methods is not satisfactory in the presence of outlier observations in the data. Only a few variable selection methods based on other parameter estimation methods, such as the M-estimator, have been proposed by researchers.
In this paper, we propose a variable selection method that uses the sum of transformed residuals based on the M-estimator in the presence of outlier observation(s). The performance of the proposed method is evaluated on real and simulated data.
Keywords: Variable Selection, Outlier, M-estimator, Sum of Transformed Residuals.
Improving blog spam filters via machine learning
by Weiwen Yang, Linchi Kwok
Abstract: As an important platform of electronic commerce, blogs can greatly influence internet users' purchasing decisions. Spam, however, can substantially reduce blogs' positive impact on electronic commerce. This paper introduces SK, an alternative algorithm combining supervised learning (SVM) and unsupervised learning (K-means++) to detect blog spam. If either classifies a blog as spam, then the blog is assigned to the spam category. Feature selection includes term frequency, inverse document frequency, binary representation, stop words, outgoing links, advertiser content, and burst with keywords. The accuracy of each model was tested and compared in experiments with 3,000 blog pages from the University of Maryland and 3,560 internet blogs. Findings suggest that combining the SVM algorithm and K-means++ clustering can increase the accuracy of spam filtering by about 7% compared with using just one of these methods. Strengths and weaknesses of various spam-filtering methods are discussed, providing considerations for businesses when choosing a spam filter.
Keywords: Spam filter; SVM; K-means++; Machine learning; Neural network
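The combination rule stated in the abstract above (a blog is treated as spam if either model flags it) can be sketched as follows. This is only a minimal illustration in Python, not the authors' implementation: the function name `combine_spam_votes` and the sample flags are hypothetical, and the per-model predictions are assumed to be already available as booleans.

```python
from typing import List, Sequence


def combine_spam_votes(svm_flags: Sequence[bool],
                       cluster_flags: Sequence[bool]) -> List[bool]:
    """Label a blog as spam when either model flags it (logical OR)."""
    if len(svm_flags) != len(cluster_flags):
        raise ValueError("both models must score the same blogs")
    return [a or b for a, b in zip(svm_flags, cluster_flags)]


# Hypothetical per-blog predictions from the two models (illustrative only).
svm_flags = [True, False, False, True]
cluster_flags = [False, False, True, True]
combined = combine_spam_votes(svm_flags, cluster_flags)
```

An OR rule of this kind can only add spam labels relative to either model alone, so it tends to trade fewer missed spam pages for more false alarms.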
Recurrent Neural Networks to Model Input-Output Relationships of Metal Inert Gas (MIG) Welding Process
by Dilip Kumar Pratihar, Geet Lahoti
Abstract: The mechanical strength of a weld bead depends on its geometric parameters, such as bead height, width and penetration, which in turn depend on input process parameters, namely welding speed, arc voltage, wire feed rate, gas flow rate, nozzle-to-plate distance, torch angle, etc. Recurrent neural networks were used to conduct both forward and reverse mappings using three approaches. The first approach trained an Elman network by updating its connecting weights using a back-propagation algorithm. In the second approach, a real-coded genetic algorithm was used along with the back-propagation algorithm to tune the network. The third approach used only a real-coded genetic algorithm to optimize the network. In forward mapping, the third approach was found to outperform the others, but in reverse mapping, the first and second approaches were seen to perform better than the third. The performances of these approaches were found to be data dependent.
Keywords: Recurrent neural networks; Genetic algorithms; MIG welding process; Forward mapping; Reverse mapping.
A Feature Based Selection Technique for Reduction of Large Scale Data
by Ritu Chauhan, Harleen Kaur
Abstract: Rapid development in the public healthcare domain has forced numerous organizations to construct and maintain large scale databases or data warehouses. However, knowledge prediction should be an automated process that discovers hidden information in large scale databases. Past studies suggest that a minimal set of interesting variables can yield qualified information while preserving the information in the data. In addition, large scale databases usually contain redundant and irrelevant features, which have proven to be a major setback for efficient and effective data analysis. This paper provides an integrated approach that uses a machine learning technique together with conventional statistical techniques to extract information from large scale databases. The approach has two phases. The first phase emphasizes retrieval of feature subsets using the MODTree filtering technique from discretized datasets, applied to real datasets from the Substance Abuse and Mental Health Data Archive (SAMHDA) collected from different states of the United States. The second phase applies statistical techniques to potential targets to discover interesting information in the reduced datasets. We present a novel perspective that uses feature selection and statistical techniques to determine knowledge from large scale databases.
Keywords: Data Mining, Feature Selection, Abusive Substance, Statistical Analysis, Alcohol Consumption.
Recommendation of Hashtags in Social Twitter Network
by K.C. Cheung, Tommy Cheung
Abstract: The development of microblogging services has resulted in the growth of short-text social networking on the Internet, which opens the door to many useful applications such as reputation management and marketing. With millions of tweets generated each day, Twitter is one of the largest microblogging sites, and it allows users to use hashtags to categorize tweets and to facilitate the search for tweets that share the same tag. By using a popular or appropriate hashtag in tweets, users can reach a large set of target followers. In this paper, we propose a novel hidden topic model for content-based hashtag recommendation. By ranking the occurrence probability of hashtags for a given topic, a set of hashtag candidates is selected for further analysis. The proposed method is demonstrated with tweets collected from Twitter's API for 20 consecutive periods. The advantage of our model is that it combines topic distribution and term selection probability for hashtag recommendation.
Keywords: Hashtag recommendation; Topic models; Twitter; Short-text classification.
Using Graphical Techniques from Discriminant Analysis to Understand and Interpret Cluster Solutions
by Courtney McKim, James Bovaird, Chaorong Wu
Abstract: Clustering is a common form of exploratory analysis in the social and behavioral sciences and education. There are many clustering algorithms available to researchers, and each algorithm assigns membership slightly differently. This paper compares five classification algorithms (SPSS TwoStep, k-Means, hierarchical (nearest and furthest neighbor), and finite mixture model). Results show the highest agreement between the finite mixture model and the two-step clustering algorithm, as well as between k-means and two-step. Hierarchical (nearest neighbor) does not have high agreement with k-means and the mixture model. Once researchers decide on a clustering algorithm and reach a solution, they often have a hard time interpreting the clusters. This study suggests using discriminant analysis to interpret cluster solutions. It lets researchers see the interpretation visually, and it provides the number of functions and which measures load on which function, giving more information about the clusters.
Keywords: clustering; discriminant analysis; k-means; two-step; finite mixture model
Solving Time Series Classification Problems Using Combined of Support Vector Machine and Neural Network
by Mohammed Alweshah, Hanadi Tayyeb, Mohammed Ababneh, Hasan Rashaideh
Abstract: The major aim of classification is to extract categories of inputs according to their characteristics. The literature contains several methods that aim to solve the time series classification problem, such as the Artificial Neural Network (ANN) and the Support Vector Machine (SVM). Time series classification is a supervised learning method that maps the input to the output using historical data. The primary objective is to discover interesting patterns hidden in the data. In this study, we use a new method called SVNN which combines the SVM and ANN classification techniques to solve the time series data classification problem. The proposed SVNN is applied to six benchmark UCR time series datasets. The results show that the proposed method outperforms the ANN and SVM on all datasets. Further comparison with other approaches in the literature also shows that the SVNN is able to maximize accuracy. It is believed that combining classification techniques can give better results in terms of accuracy and better solutions for time series classification.
Keywords: Support Vector Machine, artificial neural networks, time series problems
Estimating the Time of a Step Change in the Multivariate-attribute Process Mean Using ANN and MLE
by Amirhossein Amiri, Mohammad Reza Maleki, Fatemeh Sogandi
Abstract: In this paper, we consider correlated multivariate-attribute quality characteristics and provide two methods including a modular method based on artificial neural network (ANN) as well as Maximum Likelihood Estimation (MLE) method to estimate the time of change in the parameters of the process mean. We evaluate the performance of the estimators in terms of some criteria in change point estimation and compare them through simulation studies. The results show that the proposed ANN based model outperforms the MLE approach under most step shifts in the mean vector of the multivariate-attribute process.
Keywords: Step-change point estimation, Multivariate-attribute quality characteristics, Artificial Neural Network (ANN), Maximum Likelihood Estimation (MLE)
An Efficient method for Batch Updates in OPTICS Cluster Ordering
by Dhruv Kumar, Poonam Goyal, Navneet Goyal
Abstract: DBSCAN is one of the popular density-based clustering algorithms, but requires re-clustering the entire data when the input parameters are changed. OPTICS overcomes this limitation. In this paper, we propose a batch-wise incremental OPTICS algorithm which performs efficient insertion and deletion of a batch of points in a hierarchical cluster ordering, which is the output of OPTICS. Only a couple of algorithms are available in the literature on incremental versions of OPTICS. This can be attributed to the sequential access patterns of OPTICS. The existing incremental algorithms address the problem of incrementally updating the hierarchical cluster ordering for point-wise insertion/deletion, but these algorithms are only good for infrequent updates. The proposed incremental OPTICS algorithm performs batch-wise insertions/deletions and is suitable for frequent updates. It produces exactly the same hierarchical cluster ordering as that of classical OPTICS. Real datasets have been used for experimental evaluation of the proposed algorithm and results show remarkable performance improvement over the classical and other existing incremental OPTICS algorithms.
Keywords: OPTICS; Incremental Clustering; Batch Updates; Density-based Clustering
THE CONSUMER CHOICE BETWEEN THE PRIVATE DOCTORS AND THE HEALTH CARE CLINICS
by A.K.S. Sukumaran
Abstract: There are not many studies on health care from the point of view of health care consumers. The study did not reveal a substantial difference between the two health care service providers in the eyes of consumers on the basis of their preference for the health care attributes included in the study. Higher income groups of consumers did not attach importance to the cost of health care or to the time spent by the doctors with them. Similarly, the eldest consumers were not much worried about the cost of health care. The youngest consumers preferred a convenient location. Middle-aged consumers aligned with neither the youngest nor the eldest in their preference for convenient location, friendly staff, and quick appointments. The study concludes that consumers cannot be segmented into doctor consumers and clinic consumers, but they can be segmented on the basis of the demographic characteristics of income and age.
Keywords: Health care; private doctor; health care clinic; consumer choice; MANOVA; Multiple Discriminant Analysis; Neural networks.
Identifying and prioritization entrepreneurial behavior factors using fuzzy AHP Approach
by Elham Keshavarz
Abstract: The purpose of this research is to understand how organizations have sustained their growth by applying entrepreneurial behavior factors. Given the thematic nature of the research model, expert opinion in the oil industry was sought. The statistical population comprised the following oil industry companies: (1) the oil pipeline and telecommunication company, (2) the oil products distribution company, and (3) the National Gas Company in Semnan and Khorasan provinces. Thirty experts who were interested in advancing the discussion participated in the study. The main tools used for gathering data were company records and a questionnaire. The sub-criteria of structural factors, underlying factors and behavior factors were ranked with respect to criteria related to different levels of entrepreneurial behavior in the oil industry using the fuzzy analytic hierarchy process (FAHP). The results of the fuzzy AHP method indicate that structural factors are more important than underlying factors and behavior factors. Within the structural factors, entrepreneurial organization structure is found to be more important than the other factors.
Keywords: Entrepreneur; Entrepreneurship benefits; entrepreneurial behavior; FAHP.
When Will the 2015 Millennium Development Goal Of Infant Mortality Rate Be Finally Realized? - Projections for 21 OECD Countries through 2050 -
by Yu Sang Chang, Jinsoo Lee, Hyuk Ju Kwon
Abstract: According to the United Nations Children's Fund (UNICEF), the number of global infant deaths for those under the age of one year was down from 8.4 million in 1990 to 5.4 million in 2010. However, the declining trend of the infant mortality rate varies significantly from country to country, based on the vastly different environmental elements they face. This paper attempts to predict the future infant mortality rate of 21 OECD countries through 2015 and 2050 using the experience curve model, and compares the results with two other well-known projections, by the United Nations Population Division and the U.S. Census Bureau, in the context of the Millennium Development Goal targets. The results from all three projections indicate that only one or two countries will meet the two-thirds reduction target of the 2015 Millennium Development Goal. By 2050, four to eighteen countries will still not be able to meet the target. Therefore, each country may need to undertake a comprehensive review of its policies and programs of infant mortality control to generate many alternative plans for major improvement.
Keywords: Child health; Health policy; Infant mortality rate; Experience curve model; Millennium Development Goals.
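The two-thirds reduction target mentioned in the abstract above reduces to simple arithmetic, sketched here in Python. The function names and the sample rates are hypothetical and are not taken from the paper; the baseline is assumed to be the 1990 rate, as in the Millennium Development Goal framework.

```python
def mdg_target(baseline_rate: float, reduction: float = 2.0 / 3.0) -> float:
    """Rate that remains after cutting the baseline by the given fraction."""
    return baseline_rate * (1.0 - reduction)


def meets_target(baseline_rate: float, projected_rate: float,
                 reduction: float = 2.0 / 3.0) -> bool:
    """True when the projected rate is at or below the target rate."""
    return projected_rate <= mdg_target(baseline_rate, reduction)


# Hypothetical rates per 1,000 live births (illustrative values only).
baseline_1990 = 9.0
on_track = meets_target(baseline_1990, projected_rate=2.8)
off_track = meets_target(baseline_1990, projected_rate=3.5)
```

With a baseline of 9.0 the target is about 3.0 per 1,000, so the first hypothetical projection meets it and the second does not.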
Clustering and Latent Semantic Indexing Aspects of the Nonnegative Matrix Factorization
by Andri Mirzal
Abstract: This paper provides theoretical support for the clustering aspect of nonnegative matrix factorization (NMF). By utilizing the Karush-Kuhn-Tucker optimality conditions, we show that the NMF objective is equivalent to a graph clustering objective, so the clustering aspect of NMF has a solid justification. Unlike previous approaches, which either ignore the nonnegativity constraints or assume absolute orthonormality of the coefficient matrix in order to derive the equivalence, our approach takes the nonnegativity constraints into account and makes no assumption about the orthonormality of the coefficient matrix. Thus, not only is the stationary point used in deriving the equivalence guaranteed to lie in NMF's feasible region, but the result is also more realistic, since NMF does not produce an orthonormal matrix. Furthermore, because the clustering capability of a matrix decomposition technique may imply its latent semantic indexing (LSI) aspect, we also study the LSI aspect of NMF.
Keywords: bound-constrained optimization; clustering method; nonnegative matrix factorization; Karush-Kuhn-Tucker conditions; latent semantic indexing; singular value decomposition.
Particle Swarm Optimized Fuzzy Method for Prediction of Water Table Elevation Fluctuation
by Shilpa Jain, Dinesh Bisht, Praksh C. Mathpal
Abstract: Particle Swarm Optimization (PSO) is a powerful population-based evolutionary computation technique inspired by the simulation of social behavior in bird flocking and fish schooling. PSO has been applied successfully to a wide range of applications, such as scheduling, Artificial Neural Network (ANN) training, control strategy determination and ingredient mix optimization. Fuzzy logic can easily cope with vagueness and uncertainty in time series data. It was applied to the prediction of water table elevation in our earlier work, and the results were quite promising. However, optimizing the length of the fuzzy intervals was a major constraint for researchers. In this paper, the optimal length of the fuzzy intervals in the universe of discourse is selected using particle swarm optimization. The results obtained by applying this combined approach to the prediction of water table elevation are better than those of the previous method.
Keywords: Fuzzy logic; Particle Swarm Optimization; Mean Square Error; Water table; Forecasting.
An approach for high utility pattern mining
by Malaya Dutta Borah, Rajni Jindal
Abstract: Mining high utility patterns has become prominent because it provides semantic significance (utility/weighted patterns) associated with items in a transaction. Data analysis and the respective strategies for mining high utility patterns are important in real world scenarios. Recent research has focused on high utility pattern mining using tree-based data structures, which suffer from greater computation time because they generate multiple tree branches. To cope with these problems, this work proposes a novel binary tree-based data structure with Average Maximum Utility (AvgMU) and a mining algorithm to mine high utility patterns from incremental data, which reduces tree constructions and computation time. The proposed algorithms are implemented using synthetic and real datasets and compared with state-of-the-art tree-based algorithms. Experimental results show that the proposed work performs better in terms of running time, scalability and memory consumption than the other algorithms compared in this research work.
Keywords: high utility pattern mining; frequent pattern mining; tree based data structure; incremental mining; data analysis; Average Maximum Utility.
Outlier Detection using Weighted Holoentropy with Hyperbolic Tangent Function
by Manasi V. Harshe, Rajesh H. Kulkarni
Abstract: Numerous research works have been carried out in the literature to detect outliers, which are often called anomalies. An outlier is an observation that appears to deviate markedly from other observations in the sample. Outlier detection can usually be considered a pre-processing step for locating, in a data set, those objects that do not conform to well-defined notions of expected behaviour. Several methods have been investigated for outlier detection in categorical data sets. In previous work, holoentropy was used for outlier detection with a weightage function based on the reverse sigmoid function. In the proposed method, a logistic sigmoid function related to the hyperbolic tangent is used as the weightage function for finding the outlier data point(s). The advantage of this weightage function is that it can differentiate or distribute the outlier data points more effectively than the reverse sigmoid function. The method is implemented in four phases. In the first phase, the data is read in programmatically and dynamic entropy computation is done. The second phase consists of data point extraction, probability computation and dynamic entropy computation using the logistic sigmoid function related to the hyperbolic tangent. In the third phase, the dynamic entropy values of all the data points are sorted and the top N points are selected as outlier data point(s). Finally, the accuracy is computed to evaluate whether the proposed method detects the outlier data point(s) correctly.
Keywords: Outlier; holoentropy; weightage function.
Using imputation algorithms when missing values appear in the test data in contrast with the training data
by NargesSadat Bathaeian
Abstract: Real datasets suffer from the problem of missing data. Imputation is a common solution to that problem. Most research works apply imputation algorithms to training data. Therefore, the output variable of the samples might influence the imputation model. This paper aims to compare different imputation algorithms when they are applied to test data and when they are applied to training data. In this research, first, the relations between the output variable and different imputation algorithms are described. Second, six different types of imputation algorithms are applied to training data as well as test data. The chosen datasets are publicly available and cover both classification and regression tasks. Missing values are injected into them artificially. The experimental results show that the performance of all algorithms declines when the output variables are eliminated. In particular, the decreases for algorithms that use trees or k nearest neighbors for imputation in classification datasets are not negligible. Nevertheless, algorithms based on random forests decline less and show better results compared with the other five types of algorithms.
Keywords: missing values; imputation algorithms; regression; kNN; MICE; random forest; tree; EM.
Reliability, Availability and Maintainability-RAM analysis of Cake production lines: a Case Study
by Panagiotis Tsarouhas
Abstract: In this study, reliability, availability and maintainability analysis was conducted for a cake production line by applying statistical techniques to failure data. Data were collected from the line and analyzed over a period of seventeen months. The reliability, availability and maintainability analysis of the failure data was carried out to provide an estimate of the current operation management and to improve the line efficiency. It was found that: (a) the availability of the cake production line was 95.44% and dropped to 93.15% because equipment failures cause an additional production gap in the line, (b) the two machines with the most frequent failures and lowest availabilities are the forming/dosing machine and the wrapping machine, (c) the worst maintainability occurs at the cooling tower and at the oven, and (d) the best-fitting distributions for the failure data of the cake production line, and their parameters, were identified.
Keywords: Cake production line; Reliability; Maintainability; Performance evaluation; Quality; Field failure and repair data.
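For readers less familiar with the term, availability figures of the kind quoted in the abstract above are commonly computed as the share of time a system is operational. The Python sketch below uses the standard steady-state formula MTBF / (MTBF + MTTR). It is an assumption that this is the definition used in the paper, and the function name and input values are hypothetical.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures
    (MTBF) and mean time to repair (MTTR), both in the same unit."""
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("times must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical values, not taken from the paper.
line_availability = availability(mtbf_hours=95.0, mttr_hours=5.0)
```

Under these hypothetical inputs the line is available 95% of the time; longer repair times or more frequent failures lower the figure.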
A Comparative Study of Classifier Ensembles for Detecting Inactive Learner in University
by Bayu Adhi Tama
Abstract: Predicting undesirable learner behaviors is an important issue in distance learning systems as well as in conventional universities. This paper benchmarks ensembles of weak classifiers (decision tree, random forest, logistic regression, and CART) against single classifier models for predicting inactive students. Two real-world datasets were obtained from a distance learning system and a computer science college in Indonesia. To evaluate the performance of the classifier ensembles, several performance metrics were used, including average accuracy, precision, recall, fall-out, F1, and area under the ROC curve (AUC). Our experiments reveal that classifier ensembles outperform single classifiers on all evaluation metrics. This study contributes to the literature a comparative study of ensemble learners in the purview of educational data mining.
Keywords: Classifier ensemble; Educational data mining; Distance learning; Benchmark.
A Lexicon Based Term Weighting Scheme for Emotion Identification of Tweets
by Lovelyn Rose, Raman Venkatesan, Girish Pasupathy, Swaradh Peedikathodiyil
Abstract: Analyzing human emotions helps to give a better understanding of human behavior. People exhibit emotions in the textual content they post on social media. Detecting emotions in tweets is a huge challenge because of the 140-character limit and the extensive use of Twitter language with evolving terms and slang. This paper uses various preprocessing techniques, forms a feature vector using lexicons, and classifies tweets into Paul Ekman's basic emotions, namely happy, sad, anger, fear, disgust and surprise, using machine learning. The Twitter language is systematically converted to identifiable English words by extensive preprocessing that uses the rich dictionaries available for emoticons, interjections and slang and that handles punctuation marks and hashtags. The feature vector is created by combining words from the NRC Emotion lexicon, WordNet-Affect and an online thesaurus. Feature vectors are assigned weights based on the presence of the tweet term in the feature vector and the presence of punctuation and negation, and are used to train and classify the tweets using Naive Bayes, Support Vector Machines (SVM) and Random Forests. Experimental results show that the addition of the new No_Emotion class has considerably reduced the misclassification of facts. The use of lexicon features and a novel weighting scheme has produced a considerable gain in accuracy, with Random Forest achieving a maximum accuracy of almost 73%.
Keywords: Emotion classification; Twitter; Feature selection; Random forest; SVM; Naive Bayes.
A Long Memory Property of Economic and Financial News Flows
by Sergei Sidorov, Alexey Faizliev, Vladimir Balash
Abstract: One of the tools for examining processes and time series with self-similarity is the long-range correlation exponent (the Hurst exponent). Many methods have been developed in recent years for estimating the long-range correlation exponent from experimental time series. In this paper we estimate the Hurst exponent by different methods using news analytics time series. We use the most common methods for estimating the Hurst exponent: fluctuation analysis, detrended fluctuation analysis and detrending moving average analysis. In line with some previous studies, the empirical results show the presence of long-range correlations in the time series of news intensity data. In particular, the paper shows that the behavior of long-range dependence for time series of news intensity in the recent period from January 1, 2015 to September 22, 2015 did not change in comparison with the period from September 1, 2010 to October 29, 2010. Moreover, the change of news analytics provider and the consideration of more recent data did not significantly affect the estimates of the Hurst exponent. The results show that the self-similarity property is a stable characteristic of the news flow of information that serves the financial industry and stock markets.
Keywords: long-range correlation; detrended fluctuation analysis; time series; auto-correlation.
A Comprehensive Comparison of Algorithms for the Statistical Modeling of Non-Monotone Relationships via Isotonic Regression of Transformed Data
by Simone Fiori
Abstract: The paper treats the problem of non-linear, non-monotonic regression of bivariate datasets by means of a statistical regression method known from the literature. In particular, the present paper introduces two new regression methods and reports a comprehensive comparison of the performance of the best two previous methods, the two new methods introduced here and as many as ten standard regression methods known from the specialized literature. The comparison is performed over nine different datasets, ranging from electrocardiogram data to text analysis data, by means of four figures of merit that include regression precision as well as runtime.
Keywords: Non-monotone non-linear data-fitting; Data transformation; Isotonic regression; Statistical regression.
Factor-based structural equation modeling: Going beyond PLS and composites
by Ned Kock
Abstract: Partial least squares (PLS) methods offer many advantages for path modeling, such as fast convergence to solutions and relaxed requirements in terms of sample size and multivariate normality. However, they do not deal with factors, but with composites. As a result, they typically underestimate path coefficients and overestimate loadings. Given these, it is difficult to fully justify their use for confirmatory factor analyses or factor-based structural equation modeling (SEM). We addressed this problem through the development of a new method that generates estimates of the true composites and factors, potentially placing researchers in a position where they can obtain consistent estimates of a wide range of model parameters in SEM analyses. A Monte Carlo experiment suggests that this new method represents a solid step in the direction of achieving this ambitious goal.
Keywords: Partial Least Squares; Structural Equation Modeling; Measurement Error; Path Bias; Variation Sharing; Monte Carlo Simulation.
Special Issue on: Big Data Analysis in the Real Estate, Construction and Business Sectors
A new initialization method for k-means algorithm in the clustering problem: Data analysis
by Abolfazl Kazemi, Ghazaleh Khodabandehlouie
Abstract: Clustering is one of the most important tasks in exploratory data analysis. One of the simplest and most widely used clustering algorithms is k-means, which was proposed in 1955. The k-means algorithm is conceptually simple and easy to implement, as evidenced by hundreds of publications over the last fifty years that extend it in various ways. Unfortunately, because of its nature, the algorithm is very sensitive to the initial placement of the cluster centers. To address this problem, many initialization methods (IMs) have been proposed. In this paper, we first provide a historical overview of these methods. We then present two new non-random initialization methods for the k-means algorithm. Finally, we analyze experimental results on real datasets and evaluate the performance of the IMs with the TOPSIS multi-criteria decision making method. We show that well-known k-means IMs often perform poorly and that strong alternative approaches exist.
Keywords: Clustering; K-means algorithm; Cluster center initialization; Sum of squared error criterion; Data analysis.
A Watchdog Approach Name Matching Algorithm for Big Data Risk Intelligence
by Anusuya Kirubakaran
Abstract: Even though the modern world is ruled by data and preventive measures are in place to keep data quality high, risk intelligence teams are challenged by one risk analysis task in particular: record linkage across heterogeneous data from multiple data sources. The difficulty stems from the high ratio of non-standard and poor-quality data in big data systems, caused by the variety of data formats across regions, data platforms, data storage systems, data migration, etc. With record linkage in mind, this paper addresses the complications in the name matching process irrespective of spelling, structure and phonetic variations. Name matching succeeds when the algorithm can handle names with discrepancies due to naming conventions, cross-language translation, operating system transformation, data migration, batch feeds, typos and other external factors. We discuss the varieties of name representation in data sources and methods to parse names and find the maximum probability of a name match, comparable to watchdog security, with high accuracy and a reduced false negative rate. The proposed methods can be applied to financial-sector risk intelligence analysis such as Know Your Customer (KYC), Anti-Money Laundering (AML), Customer Due Diligence (CDD), Anti-Terrorism, Watch List Screening and Fraud Detection.
Keywords: Hybrid Name Matching; String Similarity Measure; Data Matching; Risk Intelligence.
Using Improved Genetic Algorithm under Uncertain Circumstance of Site Selection of O2O Customer Returns
by Hongying Sun
Abstract: Online-to-Offline (O2O) e-commerce supports online purchase and offline servicing. How to deal with service, especially customer returns, is a great concern of O2O e-commerce. In recent years, with the growth of online shopping in China, O2O has become a popular new mode of e-commerce application. Buying online and returning offline are becoming a dominant shopping mode. Customer returns should be collected and treated in a more cost-efficient manner. Although many studies have addressed the problem of determining the number and location of centralized return centers where returned products are collected, sorted and consolidated into a large shipment destined for manufacturers' repair facilities, few studies focus on O2O e-commerce. To this end, this paper proposes an integer programming model that minimizes construction costs coupled with operating charges by optimizing the sites of reverse logistics for customer returns. To lower storage costs, physical stores and their geographical sites should be far away from residential areas. In addition, this paper designs an improved genetic algorithm for solving two-stage heredity under random circumstances, since the model builds a multilayer reverse logistics network for recycling customer returns. The paper seeks the best policy to transport customer returns to the collecting point and then forward them to the remanufacturing center in factories. Both the simulation and the numerical examples demonstrate the effectiveness and feasibility of the improved genetic algorithm.
Keywords: reverse logistics; site selection; improved genetic algorithm; O2O E-commerce.
Data Analysis on Big Data: Improving the Map and Shuffle Phases in Hadoop Map Reduce
by J. V. N. Lakshmi
Abstract: Big Data Analytics is now a key ingredient for success in many business organizations, scientific and engineering disciplines and government endeavors. Data management has become a challenging issue for network-centric applications that need to process large amounts of data. Systems require advanced tools to analyse these data sets. Map Reduce, an efficient parallel programming model, and Hadoop are used for large-scale data analysis. However, Map Reduce still suffers from performance problems. Map Reduce uses a shuffle phase as an individual shuffle service component with an efficient I/O policy. The Map phase requires an improvement in its performance because this phase's output acts as the input to the next phase. Its result determines the efficiency, so the Map phase needs intermediate checkpoints that regularly monitor all the splits generated by intermediate phases. This is becoming more important with the increasing complexity of user requirements and the computation involved. The Map Reduce model is designed such that processing must wait until all maps have accomplished their given tasks. This acts as a barrier to effective resource utilization. This paper implements shuffle as a service component to decrease the overall execution time of jobs, monitors the Map phase through skew handling and increases resource utilization in a cluster.
Keywords: Map Reduce; Hadoop; Shuffle; Big Data; Data Analytics; HDFS; Parallel computing.
A fuzzy based automatic prediction system for quality evaluation of conceptual data warehouse models
by Naveen Dahiya, Vishal Bhatnagar, Manjeet Singh
Abstract: In this paper, we present an automatic system based on fuzzy logic to predict the understanding time of conceptual data warehouse models. The system takes as input the values of quality metrics for a model and gives understanding time as output. The metrics used for quality evaluation have been proposed and validated by Manuel Serrano. The results of the automatic system are compared with the results of actual data collected manually. The predicted results are highly significant, supporting the validity and efficiency of the designed automatic system.
Keywords: Fuzzy logic; quality metrics; data warehouse; understanding time; conceptual models.
Multi criteria Decision Support for Feature Selection in Network Anomaly Detection System
by Seelammal C., Vimala Devi K.
Abstract: The growth of computer networks from LAN to cloud, virtualization and mobility keeps the Intrusion Detection System (IDS) a critical component of network security infrastructure. The tremendous growth and usage of the internet raise concerns about how to protect and communicate digital information safely. The market for next-generation security solutions is rapidly evolving and constantly changing to accommodate today's threats. Many intrusion detection techniques, methods and algorithms have been implemented to detect these novel attacks. However, no clear feature set or uncertainty bounds have been established as a baseline for dynamic environments. The main objective of this paper is to determine and provide the best feature selection for next-generation dynamic environments using Multi Criteria Decision Making and decision tree learning, with emphasis on optimization (contingency of weight allocation) of the constructed trees and on handling large data sets.
Keywords: Intrusion detection; Multi Criteria; Classification; Anomaly; Data mining; Feature Selection; Machine Learning.
Special Issue on: Applications of Risk Analysis and Analytics in Engineering, Economics and Healthcare
On Effects of Asymmetric Information on Non-Life Insurance Prices under Competition
by Hansjoerg Albrecher, Dalit Daily-Amir
Abstract: We extend a game-theoretic model of Dutang et al. (2013) for non-life insurance pricing under competition among insurance companies and investigate the effects of asymmetric information on the equilibrium premium. We study Bayesian Nash equilibria as well as Bayesian Stackelberg equilibria and illustrate the sensitivity of equilibrium prices to various forms and magnitudes of information asymmetry through some numerical examples.
Keywords: Non-Life Insurance Pricing; Premium; Non-Cooperative Game Theory; Asymmetric Information; Nash Equilibrium; Stackelberg equilibrium; Price Sensitivity
Cost risk analysis and learning curve in the military shipbuilding sector
by Abderrahmane Sokri, Ahmed Ghanmi
Abstract: The learning curve shows how unit costs can be expected to fall over time. It has been demonstrated that learning is a major cost risk driver in defence acquisition projects. It can be affected by changes in processes, resource availability, and worker interest. This paper examines the risk that military ship builders may not realize expected production efficiencies. A probabilistic risk approach is used to portray the learning curve risk and estimate the corresponding cost contingency. A case study using a military shipbuilding project is presented and discussed to illustrate the methodology.
Keywords: Risk analysis; cost contingency; learning curve; military; shipbuilding.
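As a rough illustration of the learning curve mentioned in the abstract above, the sketch below uses Wright's log-linear model, in which unit cost falls by a fixed percentage each time cumulative output doubles. The choice of this model, the function names and all numbers are assumptions made for illustration; the abstract does not state which learning-curve form or contingency method the authors use.

```python
import math


def unit_cost(first_unit_cost: float, unit_number: int, learning_rate: float) -> float:
    """Cost of the n-th unit under Wright's log-linear learning curve (assumed model)."""
    if unit_number < 1:
        raise ValueError("unit_number must be >= 1")
    if not 0.0 < learning_rate <= 1.0:
        raise ValueError("learning_rate must be in (0, 1]")
    # A learning rate of 0.8 means each doubling of output cuts unit cost to 80%.
    exponent = math.log2(learning_rate)
    return first_unit_cost * unit_number ** exponent


def total_cost(first_unit_cost: float, units: int, learning_rate: float) -> float:
    """Sum of unit costs for units 1..units."""
    return sum(unit_cost(first_unit_cost, n, learning_rate) for n in range(1, units + 1))


if __name__ == "__main__":
    # Hypothetical figures: 8 ships, first ship costs 100 (arbitrary units).
    planned = total_cost(100.0, 8, 0.80)       # learning realised as planned
    pessimistic = total_cost(100.0, 8, 0.90)   # learning slower than planned
    print(f"planned total:     {planned:.1f}")
    print(f"pessimistic total: {pessimistic:.1f}")
    print(f"cost gap:          {pessimistic - planned:.1f}")
```

The gap between the two totals gives an informal sense of the exposure if expected production efficiencies are not realised. A probabilistic treatment would place a distribution on the learning rate instead of comparing two fixed values.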
Uncertainty in Basic Short-Term Macroeconomic Models with Angel-Daemon Games
by Joaquim Gabarro, Maria Serna
Abstract: We propose the use of an angel-daemon framework to perform an uncertainty analysis of short-term macroeconomic models. The angel-daemon framework defines a strategic game where two agents, the angel and the daemon, act selfishly. These games are defined over an uncertainty profile which presents a short and macroscopic description of a perturbed situation. The Nash equilibria on these games provide stable strategies in perturbed situations, giving a natural estimation of uncertainty. We apply the framework to the uncertainty analysis of linear versions of the IS-LM and the IS-MP models.
Keywords: Uncertainty profiles; strategic games; zero-sum games; angel-daemon games; IS-LM model; IS-MP model.
Clustering and Hitting Times of Threshold Exceedances and Applications
by Natalia Markovich
Abstract: We investigate exceedances of the process over a sufficiently high threshold. The exceedances determine the risk of hazardous events like climate catastrophes, huge insurance claims, the loss and delay in telecommunication networks. Due to dependence such exceedances tend to occur in clusters. The cluster structure of social networks is caused by dependence (social relationships and interests) between nodes and possibly heavy-tailed distributions of the node degrees. A minimal time to reach a large node determines the first hitting time. We derive an asymptotically equivalent distribution and a limit expectation of the first hitting time to exceed the threshold u_n as the sample size n tends to infinity. The results can be extended to the second and, generally, to the kth (k > 2) hitting times. Applications in large-scale networks such as social, telecommunication and recommender systems are discussed.
Keywords: first hitting time; rare events; exceedance over threshold; cluster of exceedances; extremal index; application.
RISK AWARE INTELLIGENT SYSTEM FOR INSIDER THREAT DETECTION
by Sarala Ramkumar, Zayaraz Godandapani, Vijayalakshmi Vivekanandan
Abstract: Information security risk assessment is mostly performed with more focus on external threats to the information assets than on internal threats or insiders. Insiders are people who have or had authorized access to an organization's network or system. Insider attacks are caused by insiders with privileged access rights to the information assets. Traditional security controls used in organizations, like encryption and policy-based access control, fail to identify insider activity. Fighting insider threats is a tough task for organizations, since it is important to balance granting the required privileges to users of information in an organization against identifying malicious access by those users. This paper proposes an intelligent risk-aware decision technique that employs quantitative risk assessment and qualitative decision making to identify insiders in an organization and the intensity of their attack.
Keywords: insiders; behavior based trust; context based access control; fuzzy decision making; information security risk assessment.
Execution time distributions in embedded safety-critical systems using extreme value theory
by Jaume Abella, Joan Del Castillo, Francisco Cazorla, Maria Padilla
Abstract: Several techniques have been proposed to upper-bound the worst-case execution time behaviour of programs in the domain of critical real-time embedded systems. These computing systems have strong requirements regarding the guarantees that the longest execution time a program can take is bounded. Some of those techniques use extreme value theory (EVT) as their main prediction method.
In this paper EVT is used to estimate a high quantile for different types of execution time distributions observed for a set of representative programs for the analysis of automotive applications. A major challenge appears when the data set seems to be heavy tailed, because this contradicts the previous assumption of embedded safety-critical systems.
A methodology based on the coefficient of variation is introduced for a threshold selection algorithm to determine the point above which the distribution can be considered to follow a generalised Pareto distribution. This methodology also provides an estimation of the extreme value index and high quantile estimates. We have applied these methods to execution time observations collected from the execution of 16 representative automotive benchmarks to predict an upper bound on the maximum execution time of these programs. Several comparisons with alternative approaches are discussed.
Keywords: worst-case execution times; extreme value theory; generalised Pareto distribution; threshold exceedances; high quantiles.
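To make the link between the coefficient of variation and the generalised Pareto distribution (GPD) in the abstract above more concrete, the sketch below relies on a standard moment identity: for a GPD with shape parameter xi < 1/2, the squared coefficient of variation equals 1 / (1 - 2 xi). The function names and the sample data are hypothetical, and this is not the authors' threshold selection algorithm, whose details the abstract does not give.

```python
import statistics


def exceedances(sample, threshold):
    """Amounts by which observations exceed the threshold."""
    return [x - threshold for x in sample if x > threshold]


def coefficient_of_variation(values):
    """Sample standard deviation divided by the sample mean."""
    return statistics.stdev(values) / statistics.mean(values)


def gpd_shape_from_cv(cv):
    """Moment-based shape estimate from CV^2 = 1 / (1 - 2*xi), valid for xi < 1/2.

    cv < 1 gives xi < 0 (bounded tail), cv == 1 gives xi == 0 (exponential tail),
    and cv > 1 gives xi > 0 (heavy tail).
    """
    if cv <= 0:
        raise ValueError("cv must be positive")
    return 0.5 * (1.0 - 1.0 / cv ** 2)


if __name__ == "__main__":
    # Hypothetical execution times (arbitrary units), not data from the paper.
    times = [10.2, 10.4, 10.5, 10.9, 11.3, 11.4, 12.0, 12.8, 13.1, 15.6]
    for threshold in (10.0, 11.0, 12.0):
        exc = exceedances(times, threshold)
        cv = coefficient_of_variation(exc)
        print(f"threshold={threshold}: n={len(exc)}, cv={cv:.3f}, xi={gpd_shape_from_cv(cv):.3f}")
```

Scanning candidate thresholds and watching where the coefficient of variation of the exceedances stabilises is one informal way to use this identity. A value near 1 is consistent with an exponential tail, while values above 1 point to a heavy tail.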