International Journal of Data Analysis Techniques and Strategies (39 papers in press)
Clustering and Latent Semantic Indexing Aspects of the Nonnegative Matrix Factorization
by Andri Mirzal
Abstract: This paper provides theoretical support for the clustering aspect of nonnegative matrix factorization (NMF). Using the Karush-Kuhn-Tucker optimality conditions, we show that the NMF objective is equivalent to a graph clustering objective, so the clustering aspect of NMF has a solid justification. Unlike previous approaches, which either ignore the nonnegativity constraints or assume absolute orthonormality of the coefficient matrix in order to derive the equivalence, our approach takes the nonnegativity constraints into account and makes no assumption about orthonormality of the coefficient matrix. Thus, not only is the stationary point used in deriving the equivalence guaranteed to lie in NMF's feasible region, but the result is also more realistic, since NMF does not produce an orthonormal matrix. Furthermore, because the clustering capability of a matrix decomposition technique may imply a latent semantic indexing (LSI) aspect, we also study the LSI aspect of NMF.
Keywords: bound-constrained optimization; clustering method; nonnegative matrix factorization; Karush-Kuhn-Tucker conditions; latent semantic indexing; singular value decomposition.
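To make the link between NMF and clustering concrete, the following is a minimal sketch, not the paper's derivation. It assumes the standard multiplicative update rules for the Frobenius-norm objective and assigns each column to the factor with the largest coefficient. The function names (`nmf`, `cluster_columns`), the toy matrix and all parameter values are illustrative assumptions.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, iters=200, seed=0, eps=1e-9):
    """Factor a nonnegative matrix V (n x m) into W (n x rank) and H (rank x m).

    Uses multiplicative updates, which keep W and H nonnegative
    as long as the initial factors are positive.
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return w, h


def cluster_columns(h):
    """Label each column of V by the row of H holding its largest coefficient."""
    return [max(range(len(h)), key=lambda k: h[k][j]) for j in range(len(h[0]))]


if __name__ == "__main__":
    # Hypothetical toy matrix with two column groups.
    toy = [[5.0, 4.0, 0.0, 0.0],
           [4.0, 5.0, 0.0, 0.0],
           [0.0, 0.0, 3.0, 4.0],
           [0.0, 0.0, 4.0, 3.0]]
    w, h = nmf(toy, rank=2)
    print(cluster_columns(h), frobenius_error(toy, w, h))
```

Note that the rows of H produced this way are nonnegative but not orthonormal, which is the situation the abstract describes.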
Particle Swarm Optimized Fuzzy Method for Prediction of Water Table Elevation Fluctuation
by Shilpa Jain, Dinesh Bisht, Praksh C. Mathpal
Abstract: Particle Swarm Optimization (PSO) is a powerful population-based evolutionary computation technique inspired by simulations of the social behavior of bird flocking and fish schooling. PSO has been applied successfully to a wide range of applications such as scheduling, Artificial Neural Network (ANN) training, control strategy determination and ingredient mix optimization. Fuzzy logic can easily cope with vagueness and uncertainty in time series data. It was applied to the prediction of water table elevation in our earlier work, and the results were quite promising. However, optimizing the length of the fuzzy intervals was a major constraint for researchers. In this paper, the optimal length of the fuzzy intervals in the universe of discourse is selected using particle swarm optimization. The results obtained by applying this combined approach to the prediction of water table elevation are better than those of the previous method.
Keywords: Fuzzy logic; Particle Swarm Optimization; Mean Square Error; Water table; Forecasting.
An approach for high utility pattern mining
by Malaya Dutta Borah, Rajni Jindal
Abstract: Mining high utility patterns has become prominent because it provides semantic significance (utility/weighted patterns) associated with items in a transaction. Data analysis and the corresponding strategies for mining high utility patterns are important in real-world scenarios. Recent research has focused on high utility pattern mining using tree-based data structures, which suffer from greater computation time because they generate multiple tree branches. To cope with these problems, this work proposes a novel binary-tree-based data structure with Average Maximum Utility (AvgMU) and a mining algorithm to mine high utility patterns from incremental data, which reduces tree constructions and computation time. The proposed algorithms are implemented and evaluated on synthetic and real datasets and compared with state-of-the-art tree-based algorithms. Experimental results show that the proposed work performs better in terms of running time, scalability and memory consumption than the other algorithms compared in this research work.
Keywords: high utility pattern mining; frequent pattern mining; tree based data structure; incremental mining; data analysis; Average Maximum Utility.
Outlier Detection using Weighted Holoentropy with Hyperbolic Tangent Function
by Manasi V. Harshe, Rajesh H. Kulkarni
Abstract: Numerous research works in the literature address the detection of outliers, which are often called anomalies. An outlier is an observation that appears to deviate markedly from other observations in the sample. Outlier detection can usually be considered a pre-processing step for locating, in a data set, those objects that do not conform to well-defined notions of expected behaviour. Several methods have been investigated for outlier detection in categorical data sets. In previous work, holoentropy was used for outlier detection with a weightage function based on the reverse sigmoid function. In the proposed method, a logistic sigmoid function related to the hyperbolic tangent is used as the weightage function for finding the outlier data point(s). The advantage of this weightage function is that it can differentiate or distribute the outlier data points more effectively than the reverse sigmoid function. The method is implemented in four phases. In the first phase, the data are read in programmatically and the dynamic entropy is computed. The second phase consists of data point extraction, probability computation and dynamic entropy computation using the logistic sigmoid function related to the hyperbolic tangent. In the third phase, the dynamic entropies of all data points are sorted and the top N points are selected as outlier data point(s). Finally, the accuracy is computed to evaluate whether the proposed method detects the outlier data point(s) correctly.
Keywords: Outlier; holoentropy; weightage function.
Using imputation algorithms when missing values appear in the test data in contrast with the training data
by NargesSadat Bathaeian
Abstract: Real datasets suffer from the problem of missing data. Imputation is a common solution to that problem. Most research works apply imputation algorithms to training data, so the output variable of the samples might influence the imputation model. This paper aims to compare different imputation algorithms when they are applied to test data and when they are applied to training data. In this research, first, the relations between the output variable and different imputation algorithms are described. Second, six different types of imputation algorithms are applied to training data as well as test data. The chosen datasets are globally available and cover both classification and regression tasks, and missing values are injected into them artificially. Experimental results show that the performance of all algorithms decreases when the output variable is eliminated. In particular, the decreases for algorithms that use trees or k nearest neighbors for imputation in classification datasets are not negligible. Nevertheless, algorithms based on random forests decline less and show better results compared with the other five types of algorithms.
Keywords: missing values; imputation algorithms; regression; kNN; MICE; random forest; tree; EM.
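The setting described above, where the output variable is unavailable at imputation time, can be illustrated with a minimal sketch. It uses plain column-mean imputation, which is an assumption chosen for brevity and not necessarily one of the six algorithm types compared in the paper. The helper names and the toy rows are hypothetical.

```python
def fit_mean_imputer(train_rows):
    """Learn per-column means from training features only.

    Missing entries are represented as None. The output variable is
    deliberately not part of train_rows, so it cannot influence the model.
    """
    n_cols = len(train_rows[0])
    means = []
    for c in range(n_cols):
        observed = [row[c] for row in train_rows if row[c] is not None]
        means.append(sum(observed) / len(observed))
    return means


def impute(rows, means):
    """Replace each missing entry with the mean learned for its column."""
    return [[means[c] if value is None else value
             for c, value in enumerate(row)]
            for row in rows]


if __name__ == "__main__":
    # Hypothetical feature rows; labels are kept out of the imputation step.
    train = [[1.0, None], [3.0, 4.0], [None, 8.0]]
    test = [[None, 5.0]]
    means = fit_mean_imputer(train)
    print(impute(train, means))
    print(impute(test, means))
```

Because the imputer is fitted on features alone, the same fitted means can be applied to test rows whose output value is unknown.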
Reliability, Availability and Maintainability-RAM analysis of Cake production lines: a Case Study
by Panagiotis Tsarouhas
Abstract: In this study, a reliability, availability and maintainability analysis was conducted for a cake production line by applying statistical techniques to failure data. Data were collected from the line and analysed over a period of seventeen months. The reliability, availability and maintainability of the line were determined from the failure data to provide an estimate of the current operation management and to improve line efficiency. It was found that: (a) the availability of the cake production line was 95.44% and dropped to 93.15% because equipment failures cause an additional production gap in the line, (b) the two machines with the most frequent failures and lowest availabilities are the forming/dosing machine and the wrapping machine, (c) the worst maintainability occurs at the cooling tower and at the oven, and (d) the best-fitting distributions for the failure data of the cake production line, and their parameters, were identified.
Keywords: Cake production line; Reliability; Maintainability; Performance evaluation; Quality; Field failure and repair data.
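As a rough illustration of how availability figures of this kind are usually computed, the sketch below uses the textbook definitions: availability as MTBF / (MTBF + MTTR) for one machine, and the product of machine availabilities for machines in series. Whether the study uses exactly these formulas is an assumption, and the numbers in the demo are invented, not taken from the paper.

```python
def availability(times_between_failures, repair_times):
    """Steady-state availability of one machine: MTBF / (MTBF + MTTR)."""
    mtbf = sum(times_between_failures) / len(times_between_failures)
    mttr = sum(repair_times) / len(repair_times)
    return mtbf / (mtbf + mttr)


def series_availability(machine_availabilities):
    """Availability of a line whose machines must all be up at once."""
    result = 1.0
    for a in machine_availabilities:
        result *= a
    return result


if __name__ == "__main__":
    # Invented figures in hours, for illustration only.
    forming = availability([90.0, 110.0], [5.0, 15.0])
    wrapping = availability([200.0, 180.0], [4.0, 6.0])
    print(forming, wrapping, series_availability([forming, wrapping]))
```

The series formula shows why line availability is lower than that of any single machine: every additional machine multiplies in a factor below one.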
A Comparative Study of Classifier Ensembles for Detecting Inactive Learner in University
by Bayu Adhi Tama
Abstract: Predicting undesirable learner behavior is an important issue in distance learning systems as well as in conventional universities. This paper benchmarks ensembles of weak classifiers (decision tree, random forest, logistic regression, and CART) against single classifier models for predicting inactive students. Two real-world datasets were obtained from a distance learning system and a computer science college in Indonesia. To evaluate the performance of the classifier ensembles, several performance metrics were used: average accuracy, precision, recall, fall-out, F1, and area under the ROC curve (AUC). Our experiments reveal that classifier ensembles outperform single classifiers on all evaluation metrics. This study contributes to the literature a comparative study of ensemble learners in the purview of educational data mining.
Keywords: Classifier ensemble; Educational data mining; Distance learning; Benchmark.
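For readers less familiar with the evaluation metrics named above, here is a small sketch that derives accuracy, precision, recall, fall-out and F1 from the four cells of a binary confusion matrix. The function name and the counts in the demo are illustrative assumptions. AUC is omitted because it needs ranked scores, not just counts.

```python
def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, fn, tn):
    """Compute common metrics from confusion-matrix counts.

    tp/fp/fn/tn are the true-positive, false-positive, false-negative
    and true-negative counts for the positive ("inactive") class.
    """
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)          # true-positive rate
    fall_out = safe_div(fp, fp + tn)        # false-positive rate
    accuracy = safe_div(tp + tn, tp + fp + fn + tn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "fall_out": fall_out,
        "f1": f1,
    }


if __name__ == "__main__":
    # Invented counts, for illustration only.
    print(binary_metrics(tp=40, fp=10, fn=10, tn=40))
```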
A Lexicon Based Term Weighting Scheme for Emotion Identification of Tweets
by Lovelyn Rose, Raman Venkatesan, Girish Pasupathy, Swaradh Peedikathodiyil
Abstract: Analyzing human emotions helps to provide a better understanding of human behavior. People exhibit emotions in the textual content they post on social media. Detecting emotions in tweets is a huge challenge because of the 140-character limit and the extensive use of Twitter language with evolving terms and slang. This paper uses various preprocessing techniques, forms a feature vector using lexicons and classifies tweets into Paul Ekman's basic emotions, namely happy, sad, anger, fear, disgust and surprise, using machine learning. The Twitter language is systematically converted to identifiable English words by extensive preprocessing using the rich dictionaries available for emoticons, interjections and slang, and by handling punctuation marks and hashtags. The feature vector is created by combining words from the NRC Emotion lexicon, WordNet-Affect and an online thesaurus. Feature vectors are assigned weights based on the presence of the tweet term in the feature vector and the presence of punctuation and negation, and are used to train and classify the tweets using Naive Bayes, Support Vector Machines (SVM) and Random Forests. Experimental results show that the addition of the new No_Emotion class considerably reduces the misclassification of facts. The use of lexicon features and a novel weighting scheme produces a considerable gain in accuracy, with Random Forest achieving a maximum accuracy of almost 73%.
Keywords: Emotion classification; Twitter; Feature selection; Random forest; SVM; Naive Bayes.
A Long Memory Property of Economic and Financial News Flows
by Sergei Sidorov, Alexey Faizliev, Vladimir Balash
Abstract: One of the tools for examining the processes and time series with self-similarity is the long-range correlation exponent (the Hurst exponent). Many methods have been developed for estimating the long-range correlation exponent using experimental time series over the last years. In this paper we estimate the Hurst exponent parameter obtained by different methods using news analytics time series. We exploit the most commonly used methods for estimating the Hurst exponents: fluctuation analysis, the detrended fluctuation analysis and the detrending moving average analysis. Following some previous studies, empirical results show the presence of long-range correlations for the time series of news intensity data. In particular, the paper shows that the behavior of long range dependence for time series of news intensity in the recent period from January 1, 2015 to September 22, 2015 did not change in comparison to the period from September 1, 2010 to October 29, 2010. Moreover, the change of the news analytics provider and the consideration of more recent data did not significantly affect estimates of the Hurst exponent. The results show that the self-similarity property is a stable characteristic of the news flow of information which serves the financial industry and stock markets.
Keywords: long-range correlation; detrended fluctuation analysis; time series; auto-correlation.
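One of the estimators named in the abstract, detrended fluctuation analysis, can be sketched in a few lines. This is a minimal first-order (linear detrending) version under simplifying assumptions: non-overlapping windows, a hand-picked list of scales, and synthetic Gaussian noise in place of news intensity data. The function names and parameters are hypothetical and this is not the authors' implementation.

```python
import math
import random


def linear_fit(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def dfa_exponent(series, scales):
    """Estimate the DFA scaling exponent of a series.

    Steps: integrate the mean-centred series into a profile, remove a
    linear trend inside each window of length s, measure the RMS
    fluctuation F(s), then fit log F(s) against log s.
    """
    mean = sum(series) / len(series)
    profile, running = [], 0.0
    for value in series:
        running += value - mean
        profile.append(running)

    log_scales, log_fluct = [], []
    for s in scales:
        n_windows = len(profile) // s
        xs = list(range(s))
        squared = 0.0
        for w in range(n_windows):
            segment = profile[w * s:(w + 1) * s]
            slope, intercept = linear_fit(xs, segment)
            squared += sum((y - (slope * x + intercept)) ** 2
                           for x, y in zip(xs, segment))
        fluctuation = math.sqrt(squared / (n_windows * s))
        log_scales.append(math.log(s))
        log_fluct.append(math.log(fluctuation))

    exponent, _ = linear_fit(log_scales, log_fluct)
    return exponent


if __name__ == "__main__":
    rng = random.Random(1)
    noise = [rng.gauss(0.0, 1.0) for _ in range(4096)]
    # Uncorrelated noise should give an exponent close to 0.5.
    print(dfa_exponent(noise, [16, 32, 64, 128, 256]))
```

An exponent noticeably above 0.5 on real data would point to the long-range correlations the abstract reports; values near 0.5 indicate no long memory.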
A Comprehensive Comparison of Algorithms for the Statistical Modeling of Non-Monotone Relationships via Isotonic Regression of Transformed Data
by Simone Fiori
Abstract: The paper treats the problem of non-linear, non-monotonic regression of bivariate datasets by means of a statistical regression method known from the literature. In particular, the present paper introduces two new regression methods and illustrates the results of a comprehensive comparison of the performance of the best two previous methods, the two new methods introduced here and as many as ten standard regression methods known from the specialized literature. The comparison is performed over nine different datasets, ranging from electrocardiogram data to text analysis data, by means of four figures of merit that include regression precision as well as runtime.
Keywords: Non-monotone non-linear data-fitting; Data transformation; Isotonic regression; Statistical regression.
Factor-based structural equation modeling: Going beyond PLS and composites
by Ned Kock
Abstract: Partial least squares (PLS) methods offer many advantages for path modeling, such as fast convergence to solutions and relaxed requirements in terms of sample size and multivariate normality. However, they do not deal with factors, but with composites. As a result, they typically underestimate path coefficients and overestimate loadings. Given these, it is difficult to fully justify their use for confirmatory factor analyses or factor-based structural equation modeling (SEM). We addressed this problem through the development of a new method that generates estimates of the true composites and factors, potentially placing researchers in a position where they can obtain consistent estimates of a wide range of model parameters in SEM analyses. A Monte Carlo experiment suggests that this new method represents a solid step in the direction of achieving this ambitious goal.
Keywords: Partial Least Squares; Structural Equation Modeling; Measurement Error; Path Bias; Variation Sharing; Monte Carlo Simulation.
Democracy and Economic growth
by Rita Yi Man Li
Abstract: Many nations consider democracy to be an important social value. Nevertheless, does it mean that countries with more democracy are often wealthier? What are the relationships between economic growth and democracy? This research includes 167 countries to study the issue. We employ the data of the democracy index, corruption perception index, inflation, population, number of internet users, balance of trade, foreign direct investment, etc. We have also included sub-indices such as the electoral process and pluralism, functioning of government, political participation, culture, and civil liberties. An innovative part of the paper is how the corruption perception index has been included in our analysis. Besides, principal component analysis is applied to study the relationship between democracy and economic growth. We conclude that it takes democracy a very long time to affect the macroeconomy. The fast pace of change in democracy even harms the macroeconomy. If the economy reaches a well-developed stage, the economy will gradually transform into a democratic city automatically in the absence of any external pressure.
Keywords: democracy; economic growth; corruption perception index; liberalisation.
A Novel Single Scan Distributed Pattern Mining Algorithm (SSDPMA) for Frequent Pattern Identification
by Sheik Yousuf .T, Indra Devi M
Abstract: In data mining, the extraction of frequent patterns from large databases is still a challenging and difficult task because of drawbacks such as high response time and communication cost. To alleviate these issues, a new algorithm, namely the Single Scan Distributed Pattern Mining Algorithm (SSDPMA), is proposed in this paper for frequent pattern mining. The frequent patterns are extracted in a single scan of the database. The database is then split into multiple files, which are shared among multiple Virtual Machines (VMs) that store the distinct records and compute their weights. Then, the support, confidence and threshold values are estimated. If the limit is greater than the given data, the frequent data are mined by using the proposed SSDPMA algorithm. The experimental results evaluate the performance of the proposed system in terms of response time, message size, execution time, run time and memory usage.
Keywords: Data Mining; Frequent Pattern Mining; Single Scan Distributed Pattern Mining Algorithm (SSDPMA); Virtual Machine (VM); File Split Algorithm; Item sets; Infrequent Items and Connect 4 Dataset.
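The support and confidence values mentioned in the abstract follow standard definitions, which the sketch below illustrates on a toy transaction list. It also shows counting single items in one pass over the data. This is not SSDPMA itself: the distribution across virtual machines and the weighting step are not reproduced, and all names and data are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of the itemset."""
    wanted = set(itemset)
    hits = sum(1 for t in transactions if wanted <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


def frequent_items(transactions, min_support):
    """Find frequent single items with one scan over the transactions."""
    counts = {}
    for t in transactions:            # the only pass over the data
        for item in set(t):
            counts[item] = counts.get(item, 0) + 1
    n = len(transactions)
    return {item for item, c in counts.items() if c / n >= min_support}


if __name__ == "__main__":
    # Invented transactions, for illustration only.
    data = [["a", "b"], ["a", "c"], ["a", "b", "c"], ["b"]]
    print(support(data, ["a", "b"]))
    print(confidence(data, ["a"], ["b"]))
    print(frequent_items(data, min_support=0.75))
```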
An Effective Feature Selection Method Based On Maximum Class Separability For Fault Diagnosis of Ball Bearing
by Tawfik Thelaidjia, Abdelkrim Moussaoui, Salah Chenikher
Abstract: The paper deals with the development of a novel feature selection approach for bearing fault diagnosis that overcomes the drawbacks of the Distance Evaluation Technique (DET), one of the well-established feature selection approaches. Its drawbacks are that its effectiveness is affected by noise and that salient features are selected regardless of the classification system. To overcome these shortcomings, an optimal discrete wavelet transform (DWT) is first used to decompose the bearing vibration signal at different decomposition depths to enhance the signal-to-noise ratio. Then, a combination of DET with the binary particle swarm optimization (BPSO) algorithm, with a criterion based on scatter matrices employed as the objective function, is suggested to improve classification performance and to reduce computation time. Finally, a Support Vector Machine is used to automate the identification of different bearing conditions. The obtained results demonstrate the effectiveness of the suggested method.
Keywords: Binary Particle Swarm Optimization; Discrete Wavelet Transform; Distance Evaluation Technique; Feature Selection; Scatter Matrices.
Review on Recent Developments in Frequent Itemset Based Document Clustering, its Research Trends and Applications
by Dharmendra Singh Rajput
Abstract: Document data is growing at an exponential rate. It is heterogeneous, dynamic and highly unstructured in nature. These characteristics pose new challenges and opportunities for the development of models and approaches for document clustering. Various methods have been adopted for the development of these models, each with its own advantages and disadvantages. The primary focus of this study is the analysis of existing methods and approaches for document clustering based on frequent itemsets. Subsequently, this research direction facilitates the exploration of the emerging trends for each extension, together with applications. This paper summarizes more than 90 recent research papers (published after 1990) that appeared in reputed sources such as IEEE Transactions, ScienceDirect, SpringerLink and ACM, along with a few fundamental authoritative articles.
Keywords: Document Clustering; Association Rule Mining; Unstructured Data; Uncertain Data.
A method to rank the efficient units based on cross efficiency matrix without involving the zero weights
by Marziye Mirzadeh Tahroodi, Ali Payan
Abstract: One of the basic objections to previous Cross Efficiency (CE) models is the possibility that weights equal zero. This occurs for the inputs and outputs in the efficient solutions of CE models. Input and output weights that equal zero play no role in computing the CE score. In this paper, to overcome this problem, an approach is offered that prevents the optimal weights from equalling zero in the CE method. This new method can be extended to all CE models. Based on the offered method, a zero-one mixed linear programming problem is proposed to obtain a set of non-zero weights among the optimal solutions of the preliminary CE model. Next, the zero-one mixed linear programming problem is transformed into an equivalent linear program. Then the efficient units are ranked according to a consistent CE matrix. An example is given to explain the model and show its advantage.
Keywords: ranking; cross efficiency; zero weights; preference matrix; fuzzy preference relation; zero-one mixed linear programming problem.
Enhancing the Involvement of Decision-Makers in Data Mart Design
by Fahmi Bargui, Hanene Benabdallah, Jamel Feki
Abstract: The design phase of a data warehousing project remains difficult for both decision-makers and requirements analysts. In this paper, we tackle this difficulty through two contributions. First, we propose a natural-language-based and goal-oriented template for requirements specification that includes all concepts of the decision-making process. The use of familiar concepts and natural language makes our template more accessible and helps decision-makers to validate the specified requirements, which avoids producing a data mart that does not meet their needs. Second, we propose a decision-making ontology that provides for a systematic decomposition of decision-making goals, which allows new requirements to emerge. This automatic requirements elicitation helps analysts to overcome their lack of domain knowledge, which avoids producing erroneous requirements.
Keywords: Decision Support System; Data Warehouse; Data Mart; Requirements Engineering; Multidimensional modeling; Goal-oriented Requirements Engineering; Automatic Reasoning; Ontology.
A New Feature Subset Selection Model Based on Migrating Birds Optimization
by Naoual El Aboudi, Laila Benhlima
Abstract: Feature selection represents a fundamental preprocessing phase in machine learning as well as data mining applications. It reduces the dimensionality of feature space by dismissing irrelevant and redundant features, which leads to better classification accuracy and less computational cost.
This paper presents a new wrapper feature subset selection model based on a recently designed optimization technique called migrating birds optimization (MBO). The initialisation of MBO is explored to study its implications for the model's behavior by experimenting with different initialisation strategies. A neighborhood based on information gain is designed to improve search effectiveness.
The performance of the proposed model named MBO-FS is compared with some state-of-the-art methods regarding the task of feature selection on 11 UCI datasets. Simulation results show that MBO-FS method achieves promising classification accuracy using a smaller feature set.
Keywords: Feature selection; Migrating birds optimization; Classification.
Feature Selection Methods for Document Clustering: A Comparative Study and a Hybrid Solution
by Asmaa BENGHABRIT, Brahim OUHBI, Bouchra FRIKH, El Moukhtar ZEMMOURI, Hicham BEHJA
Abstract: The proliferation of the web makes exploring and using the huge amount of available unstructured text documents challenging, which drives the need for document clustering. Hence, improving the performance of this mechanism through feature selection seems worth investigating. This paper therefore proposes an efficient way to benefit from feature selection for document clustering. We first present a review and comparative study of feature selection methods in order to identify the efficient ones. Then we propose sequential and hybrid modes of combining statistical and semantic techniques in order to benefit from the crucial information that each of them provides for document clustering. Extensive experiments demonstrate the benefit of the proposed combination approaches. The performance of document clustering is highest when the measures based on the Chi-square statistic and mutual information are linearly combined. Doing so avoids the unwanted correlation that the sequential approach creates between the two treatments.
Keywords: Document clustering; Feature selection; Statistical and semantic analysis; Chi-square statistic; Mutual Information; K-means algorithm.
Stellar mass black hole optimization for utility mining
by Kannimuthu Subramanian, Premalatha Kandhasamy
Abstract: The major challenges in mining high utility itemsets from transaction databases are the exponential search space and the database-dependent minimum utility threshold. The search space is very large because of the large number of distinct items and the size of the database. Data analysts need to specify appropriate minimum utility thresholds for their data mining tasks even though they may have no knowledge of their databases. To overcome these problems, the Stellar mass Black hole Optimization (SBO) method is proposed to mine Top-K high utility itemsets (HUIs) from the transaction database without specifying a minimum utility threshold. To assess the performance of SBO, the experimental results are compared with those of a genetic algorithm (GA).
Keywords: Data Mining; Genetic Algorithm; Stellar mass Black hole optimization; High Utility Itemsets; Utility mining.
Memetic Particle Swarm Optimization for missing value imputation
by Sivaraj Rajappan, Devi Priya Rangasamy
Abstract: Incomplete values in databases are a major concern for data analysts, and many methods have been devised to handle them in different missing-data scenarios. Researchers are increasingly using evolutionary algorithms for this purpose. In this paper, a memetic-algorithm-based approach is proposed that integrates the principles of Particle Swarm Optimization (PSO) and Simulated Annealing (SA), a local search method. A novel initialization strategy for PSO is also proposed in order to seed good particles into the population. Simulated Annealing prevents PSO from converging prematurely and helps it to reach the global optimum. PSO exhibits explorative behavior and SA exhibits exploitative behavior, which makes them a suitable combination for a memetic algorithm implementation. The proposed algorithm is applied to different datasets to estimate the missing values, and its imputation accuracy and execution time are found to be better than those of other standard methods.
Keywords: Memetic Algorithm; tournament selection; Bayesian probability; simulated annealing.
Enhanced Auto Associative Neural Network using feed forward neural network: An Approach to improve performance of fault detection and analysis
by Subhas Meti
Abstract: Biosensors play a significant role in many present-day applications, ranging from military applications to the healthcare sector. However, their practicality and robustness in real-time scenarios are still a matter of concern. The primary issues are the prediction of sensor data, noise estimation, channel estimation and, most importantly, fault detection and analysis. In this paper, an enhancement is applied to the Auto Associative Neural Network (AANN) by considering cascade feed-forward propagation. The residual noise is also computed, along with fault detection and analysis of the sensor data. Experimental results show a significant reduction in the MSE compared with the conventional AANN. The regression-based correlation coefficient also improves in the proposed method compared with the conventional AANN.
Keywords: WBAN; Fault Detection and Analysis; Feed Forward Neural Network; Enhanced AANN; Residual Noise.
A comparative study of unsupervised image clustering systems
by Safa Bettoumi
Abstract: The purpose of clustering algorithms is to give sense and extract value from large sets of structured and unstructured data. Thus, clustering is present in all science areas that use automatic learning. Therefore, we present in this paper a comparative study and an evaluation of different clustering methods proposed in the literature such as prototype based clustering, fuzzy and probabilistic clustering, hierarchical clustering and density based clustering. We present also an analysis of advantages and disadvantages of these clustering methods based essentially on experimentation. Extensive experiments are conducted on three real-world high dimensional datasets to evaluate the potential and the effectiveness of seven well-known methods in terms of accuracy, purity and normalized mutual
Keywords: Unsupervised Clustering; Density Based Clustering; Partitioning Clustering; Fuzzy and Probabilistic Clustering; Hierarchical Clustering.
Sentiment Analysis Based Framework for Assessing Internet Telemedicine Videos
by ARUNKUMAR PM, CHANDRAMATHI S, KANNIMUTHU S
Abstract: Telemedicine services delivered through the Internet and mobile devices need effective medical video delivery systems. This work describes a novel framework for assessing Internet-based telemedicine videos using Sentiment Analysis. The dataset comprises more than one thousand text comments by medical experts, collected from various medical animation videos in the YouTube repository. The proposed framework deploys machine learning classifiers such as Bayes net, KNN, C4.5 decision tree, SVM (Support Vector Machine) and SVM-PSO (SVM with Particle Swarm Optimization) to infer Opinion Mining outputs. The results show that the SVM-PSO classifier performs better in assessing the reviews of medical video content, with more than 80% accuracy. The precision and recall obtained with the SVM-PSO algorithm are 87.8% and 85.57% respectively, which underlines its superiority over the other classifiers. The concepts of Sentiment Analysis can be applied effectively to web-based user comments on medical videos, and the end results can be highly valuable for enhancing the reputation of Telemedicine education across the globe.
Keywords: Machine Learning; Telemedicine; Medical videos.
Data Mining Classification Techniques - Comparison for Better Accuracy in Prediction of Cardiovascular Disease
by Richa Sharma
Abstract: Cardiovascular disease is a broad term that includes strokes and any disorder of the system that has the heart at its center; it is a critical cause of mortality every year across the globe. Data mining offers a variety of techniques and algorithms that help to draw interesting conclusions, and mining healthcare data helps to predict disease. This study aims at knowledge discovery from a heart disease dataset and analyzes several data mining classification techniques for better accuracy and a lower error rate. The datasets for the experiments are chosen from the UCI Machine Learning Repository and analyzed with two data mining tools, WEKA and Tanagra. The analysis is done using the 10-fold cross-validation technique, Na
Keywords: Data mining; Classification techniques; Machine learning Tools; Cardiovascular disease; KNN; Naïve Bayes; C-PLS; Decision Tree.
Real Time Data Warehouse: Health Care Use Case
by Hanen Bouali
Abstract: Recently, advances in hardware technology have allowed experts to automatically record transactions and other pieces of information of everyday life at a rapid rate. Systems that execute complex events over real-time streams of RFID readings encoded an event. Hence, in the healthcare context, applications are increasingly interconnected and can impose a massive event load to be processed. Furthermore, existing systems suffer from a lack of support for heterogeneity and dynamism. Consequently, as a result of RFID technology and many other sensors, streaming data has brought another dimension to data querying and data mining research. This is due to the fact that, in a data stream, only a time window is available. In contrast to traditional data sources, data streams present new characteristics such as being continuous, high-volume, open-ended and subject to concept drift. To analyse complex queries over event streams, a data warehouse seems to be the answer. However, the classical data warehouse does not incorporate the specificity of event streams, owing to the complexity of their components, which are spatial, temporal, semantic and real-time. For these reasons, we focus in this paper on presenting the conceptual modelling of the real-time data warehouse by defining a new dimensionality and stereotype for the classical data warehouse to adapt it to event streams. Then, to prove the efficiency of our real-time data warehouse, we adapt the general pattern model to a medical pregnancy care unit, which shows promising results.
Keywords: data warehouse; data analysis; real time; healthcare.
Enhancement of SentiWordNet using Contextual Valence Shifters
by Poornima Mehta, Satish Chandra
Abstract: Sentence structure has a considerable impact on the sentiment polarity of a sentence. In the presence of Contextual Valence Shifters such as conjunctions, conditionals and intensifiers, some parts of the sentence are more relevant than others in determining its polarity. In this work we have used Valence Shifters in sentences to enhance the sentiment lexicon SentiWordNet for a given document set. They have also been used to improve sentiment analysis at the document level. In recent years, microblogging services like Twitter have become an important data source for sentiment analysis. Tweets, being restricted to 140 characters, are short and therefore contain slang, grammatical errors, spelling mistakes and informal expressions. The method is aimed at noisy and unstructured data like tweets, on which computationally intensive tools like dependency parsers are not very successful. Our proposed system works better on both noisy (Stanford and Airlines datasets of Twitter) and structured (Movie review) datasets.
Keywords: Sentiment Analysis; SentiWordNet; Valence Shifters; Micro-blogs; Discourse; Twitter; Lexicon Enhancement.
Bayesian Feature Construction for the Improvement of Classification Performance
by Manolis Maragoudakis
Abstract: In this paper we address the problem of increasing the validity of classification, not through approaches that improve the ability of a particular machine learning algorithm to construct a precise classification model, but through a wider encoding of the training data: specifically, by creating additional features so that hidden aspects of the subject areas that model the available data are revealed to a higher degree. We suggest a novel feature construction algorithm based on the ability of Bayesian networks to re-enact the conditional independence assumptions of features, bringing forth properties of their interrelation that are not clear when a classifier is given the data in their initial form. The effect of the additional features is shown through experimental measurement across a wide range of domains and a large number of classification algorithms, where the improvement in classification performance is evident.
Keywords: Machine learning; Knowledge engineering methodologies; Pattern analysis; Statistical Pattern Recognition.
A novel ensemble classifier by combining Sampling and Genetic algorithm to combat multiclass imbalanced problems
by Archana Purwar, Sandeep Singh
Abstract: Handling data sets with imbalanced classes is an exigent problem in machine learning and data mining. Though a lot of work has been done by many researchers on two-class imbalanced problems, multiclass problems still need to be explored. Most existing imbalanced learning techniques have proved to be inappropriate for multiclass problems, or even to have a negative effect. To the best of our knowledge, no one has used a combination of sampling (with and without replacement) and a genetic algorithm to solve the multiclass imbalanced problem. In this paper, we propose a sampling and genetic algorithm based ensemble classifier (SA-GABEC) to handle imbalanced classes. SA-GABEC tries to locate the best subset of classifiers for a given sample that are precise in their predictions and create an acceptable diversity in the feature subspace. These subsets of classifiers are fused together to give better predictions than a single classifier. Moreover, this paper also proposes a modified SA-GABEC, which performs feature selection before applying sampling and outperforms SA-GABEC. To demonstrate the adequacy of our proposed classifiers, we have validated them using two assessment metrics, recall and extended G-mean. Further, we have compared the results with existing approaches such as GAB-EPA, Adaboost and Bagging.
Keywords: Feature extraction; diversity; genetic algorithm; ensemble learning; multiclass imbalanced problems.
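As an illustration of the evaluation metrics named in this abstract, the following is a minimal sketch that assumes "extended G-mean" means the geometric mean of the per-class recalls, a common multiclass generalization of G-mean. The abstract does not state the definition the paper uses, and the function names and sample labels are hypothetical.

```python
import math
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Return {class: recall} for every class that occurs in y_true."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / support[c] for c in support}

def extended_g_mean(y_true, y_pred):
    """Geometric mean of per-class recalls (assumed definition, see lead-in)."""
    recalls = list(per_class_recall(y_true, y_pred).values())
    return math.prod(recalls) ** (1.0 / len(recalls))

# Hypothetical three-class example with one rare class "c".
y_true = ["a", "a", "a", "a", "b", "b", "c"]
y_pred_ok = ["a", "a", "a", "b", "b", "a", "c"]    # every class partly recovered
y_pred_miss = ["a", "a", "a", "a", "b", "b", "a"]  # rare class "c" never predicted

score_ok = extended_g_mean(y_true, y_pred_ok)
score_miss = extended_g_mean(y_true, y_pred_miss)
```

Under this definition a classifier that never predicts the rare class scores zero even though it labels six of the seven samples correctly, which is why such a metric is often preferred over plain accuracy for imbalanced data.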
Dynamics of the Network Economy: A Content Analysis of the Search Engine Trends and Correlate Results Using Word Clusters
by Murat Yaslioglu
Abstract: The network economy is a relatively untouched area, and strategic approaches to the dynamics of this new economy are quite limited. The network economy is about networks, so we asked what better medium there could be for collecting insights than the biggest network itself. Thus, it was decided to follow up the information on the internet, including every kind of documentation. To do so, a deep relation analysis using trends was first conducted to find the topics related to the new economy's dynamics: network effect, network externalities, interoperability, big data and open standards. Social media was also investigated, since it is considered the marketplace where the network economy applies. After the relation analysis, the correlates of the aforementioned keywords were analysed. Finally, all the clean top results on the web were collected with Linux command line tools into various very large text files. The content of these files was analysed with the Nvivo qualitative analysis tool to form clusters. Drawing on the broad information at hand, an extensive discussion of each result is written. It is believed that this new research approach will also guide future research on various subjects.
Keywords: Network economy; network effect; network externalities; interoperability; big data; open standards; network strategy; methodology; analytics; word clusters; search engines.
Special Issue on: Big Data Analysis in the Real Estate, Construction and Business Sectors
A new initialization method for k-means algorithm in the clustering problem: Data analysis
by Abolfazl Kazemi, Ghazaleh Khodabandehlouie
Abstract: Clustering is one of the most important tasks in exploratory data analysis. One of the simplest and most widely used clustering algorithms is K-means, which was proposed in 1955. The K-means algorithm is conceptually simple and easy to implement, as evidenced by hundreds of publications over the last fifty years that extend k-means in various ways. Unfortunately, because of its nature, this algorithm is very sensitive to the initial placement of the cluster centers. To address this problem, many initialization methods (IMs) have been proposed. In this work, we first provide a historical overview of these methods. Then we present two new non-random initialization methods for the k-means algorithm. Finally, we analyze the experimental results on real datasets and evaluate the performance of the IMs with the TOPSIS multi-criteria decision making method. We show that popular k-means IMs often perform poorly and that there are in fact strong alternative approaches.
Keywords: Clustering; K-means algorithm; Cluster center initialization; Sum of squared error criterion; Data analysis.
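To make the idea of a non-random initialization concrete, here is a minimal sketch of the well-known maximin (farthest-first) heuristic together with the sum of squared error criterion listed in the keywords. This is not one of the two new methods the paper proposes, which the abstract does not describe. The choice of the first seed (the point farthest from the data mean), the function names and the sample points are assumptions made for illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def maximin_init(points, k):
    """Deterministic farthest-first seeding.

    Start from the point farthest from the data mean (an assumed choice),
    then repeatedly add the point whose distance to its nearest already
    chosen center is largest.
    """
    dim = len(points[0])
    mean = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    centers = [max(points, key=lambda p: sq_dist(p, mean))]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    return centers

def sse(points, centers):
    """Sum of squared errors: each point is charged to its nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

# Hypothetical toy data with three visible groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (20.0, 0.0)]
centers = maximin_init(points, 3)
initial_sse = sse(points, centers)
```

Because no random numbers are involved, repeated runs give the same seeds, so the standard k-means iterations that would follow this step become reproducible.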
A Watchdog Approach Name Matching Algorithm for Big Data Risk Intelligence
by Anusuya Kirubakaran
Abstract: Even though the modern world is ruled by data and preventive measures are in place to keep data quality high, risk intelligence teams are challenged by one risk analysis task in particular: record linkage on heterogeneous data from multiple data sources. The difficulty stems from the high ratio of non-standard and poor-quality data in big data systems, caused by the variety of data formats across regions, data platforms, data storage systems, data migration, etc. With record linkage in mind, in this paper we address the complications of the name matching process irrespective of spelling, structure and phonetic variations. Name matching succeeds when the algorithm can handle names with discrepancies due to naming conventions, cross-language translation, operating system transformation, data migration, batch feeds, typos and other external factors. In this paper, we discuss the varieties of name representation in data sources and methods to parse names and find the maximum probability of a name match, comparable to watchdog security, with high accuracy and a reduced false negative rate. The proposed methods can be applied to financial-sector risk intelligence analysis such as Know Your Customer (KYC), Anti-Money Laundering (AML), Customer Due Diligence (CDD), Anti-Terrorism, Watch List Screening and Fraud Detection.
Keywords: Hybrid Name Matching; String Similarity Measure; Data Matching; Risk Intelligence.
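The kinds of discrepancy listed in this abstract (accents, punctuation, token order, typos) can be illustrated with a minimal sketch built only on the Python standard library. It is a generic normalize-then-compare baseline, not the paper's watchdog algorithm, and the function names and sample names are hypothetical.

```python
import unicodedata
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase, strip accents and punctuation, and sort the tokens."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop combining marks so that "é" compares equal to "e".
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace anything that is not a letter or digit with a space.
    cleaned = "".join(ch if ch.isalnum() else " " for ch in unaccented.lower())
    # Sorting tokens makes "Last, First" and "First Last" identical.
    return " ".join(sorted(cleaned.split()))

def name_similarity(a, b):
    """Similarity in [0, 1] between two names after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

same_person = name_similarity("García, José", "Jose Garcia")
typo_variant = name_similarity("Jon Smith", "John Smith")
different = name_similarity("Jon Smith", "Maria Lopez")
```

A screening pipeline would compare such scores against a threshold; choosing that threshold trades false negatives against false positives, which is the balance the abstract refers to.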
Using Improved Genetic Algorithm under Uncertain Circumstance of Site Selection of O2O Customer Returns
by Hongying Sun
Abstract: Online-to-Offline (O2O) e-commerce supports online purchase and offline servicing. How to deal with service, especially customer returns, is a great concern of O2O e-commerce. In recent years, with the growth of online shopping in China, O2O has become a popular new mode of e-commerce application. Buying online and returning offline is becoming a dominant shopping mode. Customer returns should be collected and treated in a more cost-efficient manner. Although many studies have addressed the problem of determining the number and location of centralized return centers, where returned products are collected, sorted and consolidated into a large shipment destined for manufacturers' repair facilities, few studies focus on O2O e-commerce. To this end, this paper proposes an integer programming model that minimizes construction cost coupled with operating charges by optimizing the sites of reverse logistics for customer returns. To lower storage costs, physical stores and their geographical sites should be far away from residential areas. In addition, this paper designs an improved genetic algorithm for solving two-stage heredity under random circumstances, in that the model builds up a multilayer reverse logistics network for recycling customer returns. The paper seeks the best policy for transporting customer returns to the collecting point and forwarding them to the remanufacturing center in factories. Both the simulation and numerical examples demonstrate the effectiveness and feasibility of the improved genetic algorithm.
Keywords: reverse logistics; site selection; improved genetic algorithm; O2O E-commerce.
Data Analysis on Big Data: Improving the Map and Shuffle Phases in Hadoop Map Reduce
by J. V. N. Lakshmi
Abstract: Big Data Analytics is now a key ingredient for success in many business organizations, scientific and engineering disciplines and government endeavors. Data management has become a challenging issue for network-centric applications that need to process large data sets, and systems require advanced tools to analyse these data sets. As an efficient parallel computing programming model, Map Reduce with Hadoop is used for large-scale data analysis. However, Map Reduce still suffers from performance problems. Map Reduce uses a shuffle phase, which can act as an individual shuffle service component with an efficient I/O policy. The Map phase requires an improvement in its performance, as this phase's output acts as the input to the next phase. Its result reveals the efficiency, so the Map phase needs intermediate checkpoints that regularly monitor all the splits generated by intermediate phases. This is becoming more important with the increasing complexity of user requirements and of the computation involved. The MapReduce model is designed in such a way that there is a need to wait until all maps accomplish their given tasks. This acts as a barrier to effective resource utilization. This paper implements shuffle as a service component to decrease the overall execution time of jobs, monitors the map phase through skew handling, and increases resource utilization in a cluster.
Keywords: Map Reduce; Hadoop; Shuffle; Big Data; Data Analytics; HDFS; Parallel computing.
A fuzzy based automatic prediction system for quality evaluation of conceptual data warehouse models
by Naveen Dahiya, Vishal Bhatnagar, Manjeet Singh
Abstract: In this paper, we present an automatic system based on fuzzy logic to predict the understanding time of conceptual data warehouse models. The system takes the values of quality metrics for a model as input and gives the understanding time as output. The metrics used for quality evaluation were proposed and validated by Manuel Serrano. The results of the automatic system are compared with the results of data collected manually. The predicted results are highly significant and support the validity and efficiency of the designed automatic system.
Keywords: Fuzzy logic; quality metrics; data warehouse; understanding time; conceptual models.
Multi criteria Decision Support for Feature Selection in Network Anomaly Detection System
by Seelammal C., Vimala Devi K.
Abstract: The growth of computer networks from LAN to cloud, virtualization and mobility keeps the Intrusion Detection System (IDS) a critical component of network security infrastructure. The tremendous growth and usage of the internet raise concerns about how to protect and communicate digital information safely. The market for next-generation security solutions is rapidly evolving and constantly changing to accommodate today's threats. Many intrusion detection techniques, methods and algorithms have been implemented to detect these novel attacks. But there is no clear feature set or uncertainty bounds established as a baseline for dynamic environments. The main objective of this paper is to determine and provide the best feature selection for next-generation dynamic environments using Multi Criteria Decision Making and decision tree learning, with emphasis on optimization (contingency of weight allocation) of the constructed trees and handling of large data sets.
Keywords: Intrusion detection; Multi Criteria; Classification; Anomaly; Data mining; Feature Selection; Machine Learning.
GPU Based Reduce Approach for Faculty Performance Evaluation using Classification Technique in Opinion Mining
by Brojo Kishore Mishra, Abhaya Kumar Sahoo, Chittaranjan Pradhan
Abstract: In today's competitive market, the education system plays a main role in creating better students. To create better students, the main focus is on the quality of teaching. That quality can be achieved through good coordination between faculty and students. To improve the quality of teaching, faculty performance should be measured by feedback analysis, so that education quality can be enhanced. Here we used opinion mining, for which a large amount of data is available in the form of reviews, opinions, feedback, remarks, observations, comments, explanations and clarifications. We collected feedback about faculty from students through a feedback form. To measure the performance of faculty, we used a classification technique from opinion mining. We applied this technique on a GPU architecture using the CUDA-C programming model as well as the map reduce programming model. We then compared the GPU reduce approach with the map reduce approach to find which gives faster results. This paper uses the GPU architecture for CUDA-C programming and the Hadoop framework for map reduce programming for faster computation of faculty performance evaluation.
Keywords: Classification; CUDA-C; Education System; Feedback; GPU; Hadoop; Map Reduce; Opinion Mining.
The risk of ecoinnovation introduction at the enterprises
by Pawel Bartoszczuk
Abstract: The goal of this paper is to present the risk of eco-innovation implementation at enterprises. Eco-innovation is a rather modern term and can be a way to solve environmental problems that emerge as consequences of economic growth. Like any innovation, eco-innovation has several types and can therefore result in a new or significantly improved product (good or service), a process, or new marketing or organisational methods. Eco-innovation should be seen as an integral part of innovation efforts across all sectors of the economy. European countries observe many barriers to the implementation of eco-innovation, mainly associated with high investment risk and limited interest.
Keywords: ecoinnovation; environment; ecological risk; economy; enterprise.
Data Analytics on Census Data to predict the Income and Economic Hierarchy
by Srinivasa Kg, Sharath R, Krishna Chaitanya S, Nirupam K N, Sowmya BJ
Abstract: The US Census Bureau conducts the American Community Survey, generating a massive dataset with millions of data points. The rich dataset contains detailed information on approximately 3.5 million households, about who they are and how they live, including ancestry, education, work, transportation, internet use and residency. This enormous dataset invites a closer look at the population to derive insight. The persistent need to expose the subtleties of economic issues is the motivation for drawing meaningful conclusions in the income domain. Hence the focus is on bringing out unique insights into the financial status of the people living in the country. These conclusions might aid wiser decisions regarding the economic growth of the country. Using relevant attributes, demographic graphs are plotted to support the conclusions drawn. Classification into various economic classes is also done using well-known classifiers.
Keywords: Demographic graphs; Benford’s law; income; K-means clustering; Naive Bayes classifier.