International Journal of Data Analysis Techniques and Strategies (32 papers in press)
Feature Selection Methods for Document Clustering: A Comparative Study and a Hybrid Solution
by Asmaa BENGHABRIT, Brahim OUHBI, Bouchra FRIKH, El Moukhtar ZEMMOURI, Hicham BEHJA
Abstract: The proliferation of the web makes exploring and using the huge amount of available unstructured text documents challenging, which drives the need for document clustering. Improving the performance of clustering through feature selection therefore seems worth investigating. This paper proposes an efficient way to benefit from feature selection for document clustering. We first present a review and comparative study of feature selection methods in order to identify the efficient ones. We then propose sequential and hybrid modes of combining statistical and semantic techniques, in order to benefit from the crucial information that each provides for document clustering. Extensive experiments demonstrate the benefit of the proposed combination approaches. The performance of document clustering is highest when the measures based on the Chi-square statistic and mutual information are linearly combined. Doing so avoids the unwanted correlation that the sequential approach creates between the two treatments.
Keywords: Document clustering; Feature selection; Statistical and semantic analysis; Chi-square statistic; Mutual Information; K-means algorithm.
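The linear combination mentioned in the abstract above can be illustrated with a minimal sketch. It assumes that per-term Chi-square and mutual information scores have already been computed; the min-max normalisation, the weight `alpha`, the function names and the example scores are all hypothetical choices for illustration, not the authors' actual formulation.

```python
def min_max(scores):
    """Rescale a dict of term -> score to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {term: 0.0 for term in scores}
    return {term: (s - lo) / (hi - lo) for term, s in scores.items()}


def combine_linear(chi2, mi, alpha=0.5):
    """Weighted sum of normalised chi-square and mutual-information scores.

    alpha is an assumed mixing weight: 1.0 uses chi-square only, 0.0 uses MI only.
    """
    c, m = min_max(chi2), min_max(mi)
    return {term: alpha * c[term] + (1 - alpha) * m[term] for term in chi2}


def top_k(scores, k):
    """Return the k terms with the highest combined score."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical per-term scores, made up for this example.
chi2 = {"cluster": 9.0, "the": 1.0, "kernel": 5.0}
mi = {"cluster": 0.40, "the": 0.10, "kernel": 0.50}

combined = combine_linear(chi2, mi, alpha=0.5)
selected = top_k(combined, 2)
print(selected)
```

Normalising before summing keeps one measure from dominating only because of its scale; other normalisations would serve the same purpose.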
Stellar mass black hole optimization for utility mining
by Kannimuthu Subramanian, Premalatha Kandhasamy
Abstract: The major challenges in mining high utility itemsets (HUIs) from transaction databases are the exponential search space and the database-dependent minimum utility threshold. The search space is very large because of the large number of distinct items and the size of the database. Data analysts need to specify appropriate minimum utility thresholds for their data mining tasks even though they may have no knowledge of their databases. To overcome these problems, a Stellar mass Black hole optimization (SBO) method is proposed to mine the Top-K HUIs from a transaction database without specifying a minimum utility threshold. To assess the performance of SBO, the experimental results are compared with those of a genetic algorithm (GA).
Keywords: Data Mining; Genetic Algorithm; Stellar mass Black hole optimization; High Utility Itemsets; Utility mining.
Memetic Particle Swarm Optimization for missing value imputation
by Sivaraj Rajappan, Devi Priya Rangasamy
Abstract: Incomplete values in databases are a major concern for data analysts, and many methods have been devised to handle them in different missing-data scenarios. Researchers are increasingly using evolutionary algorithms for handling them. In this paper, a memetic algorithm based approach is proposed which integrates the principles of Particle Swarm Optimization (PSO) and Simulated Annealing (SA), a local search method. A novel initialization strategy for PSO is also proposed in order to seed good particles into the population. Simulated Annealing prevents PSO from premature convergence and helps it reach the global optimum. PSO exhibits explorative behavior and SA exhibits exploitative behavior, which makes them the right combination for a memetic algorithm implementation. The proposed algorithm is applied to different datasets to estimate the missing values, and the imputation accuracy and execution time are found to be better than those of other standard methods.
Keywords: Memetic Algorithm; tournament selection; Bayesian probability; simulated annealing.
Enhanced Auto Associative Neural Network using feed forward neural network: An Approach to improve performance of fault detection and analysis
by Subhas Meti
Abstract: Biosensors play a significant role in many present-day applications, ranging from military applications to healthcare. However, their practicality and robustness in real-time scenarios are still a matter of concern. The primary issues are prediction of sensor data, noise estimation, channel estimation and, most importantly, fault detection and analysis. In this paper, the Auto Associative Neural Network (AANN) is enhanced by considering cascade feed forward propagation. The residual noise is also computed along with fault detection and analysis of the sensor data. Experimental results show a significant reduction in the MSE compared to the conventional AANN. The regression-based correlation coefficient is also improved in the proposed method compared to the conventional AANN.
Keywords: WBAN; Fault Detection and Analysis; Feed Forward Neural Network; Enhanced AANN; Residual Noise.
A comparative study of unsupervised image clustering systems
by Safa Bettoumi
Abstract: The purpose of clustering algorithms is to make sense of and extract value from large sets of structured and unstructured data. Thus, clustering is present in all science areas that use automatic learning. Therefore, we present in this paper a comparative study and an evaluation of different clustering methods proposed in the literature, such as prototype based clustering, fuzzy and probabilistic clustering, hierarchical clustering and density based clustering. We also present an analysis of the advantages and disadvantages of these clustering methods based essentially on experimentation. Extensive experiments are conducted on three real-world high dimensional datasets to evaluate the potential and the effectiveness of seven well-known methods in terms of accuracy, purity and normalized mutual
Keywords: Unsupervised Clustering; Density Based Clustering; Partitioning Clustering; Fuzzy and Probabilistic Clustering; Hierarchical Clustering.
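Purity, one of the evaluation measures named in the abstract above, can be sketched in a few lines. This uses the usual textbook definition (the fraction of objects that belong to the majority class of their cluster); the function name and the toy labels are hypothetical, and the study's exact evaluation protocol is not stated in the abstract.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, class_labels):
    """Fraction of objects that carry the majority class label of their cluster."""
    members = defaultdict(list)
    for cluster, label in zip(cluster_ids, class_labels):
        members[cluster].append(label)
    # For each cluster, count how many objects carry its most frequent label.
    majority_total = sum(
        Counter(labels).most_common(1)[0][1] for labels in members.values()
    )
    return majority_total / len(class_labels)


# Toy example: two clusters of three objects, each with one mislabelled object.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["cat", "cat", "dog", "dog", "dog", "cat"]
score = purity(clusters, labels)
print(round(score, 3))
```

Purity rises trivially as the number of clusters grows, which is one reason it is usually reported together with other measures.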
Sentiment Analysis Based Framework for Assessing Internet Telemedicine Videos
by ARUNKUMAR PM, CHANDRAMATHI S, KANNIMUTHU S
Abstract: Telemedicine services delivered through the Internet and mobile devices need effective medical video delivery systems. This work describes a novel framework for assessing Internet based telemedicine videos using Sentiment Analysis. The dataset comprises more than one thousand text comments by medical experts collected from various medical animation videos in the YouTube repository. The proposed framework deploys machine learning classifiers such as Bayes net, KNN, C4.5 decision tree, SVM (Support Vector Machine) and SVM-PSO (SVM with Particle Swarm Optimization) to infer Opinion Mining outputs. The results show that the SVM-PSO classifier performs better in assessing the reviews of medical video content, with more than 80% accuracy. The Precision and Recall values obtained with the SVM-PSO algorithm are 87.8% and 85.57% respectively, which underlines its superiority over the other classifiers. The concepts of Sentiment Analysis can be applied effectively to web based user comments on medical videos, and the end results can be highly critical for enhancing the reputation of Telemedicine education across the globe.
Keywords: Machine Learning; Telemedicine; Medical videos.
Data Mining Classification Techniques - Comparison for Better Accuracy in Prediction of Cardiovascular Disease
by Richa Sharma
Abstract: Cardiovascular disease is a broad term that includes strokes and any disorder of the system that has the heart at its center. It is a critical cause of mortality every year across the globe. Data mining offers a variety of techniques and algorithms that help to draw interesting conclusions, and mining in healthcare helps to predict disease. This study aims at knowledge discovery from a heart disease dataset and analyzes several data mining classification techniques for better accuracy and a lower error rate. The datasets for the experiments are chosen from the UCI Machine Learning Repository. They are analyzed with two different data mining tools, i.e. WEKA and Tanagra. The analyses are done using the 10 fold cross validation technique, Na
Keywords: Data mining; Classification techniques; Machine learning Tools; Cardiovascular disease; KNN; Naïve Bayes; C-PLS; Decision Tree.
Real Time Data Warehouse: Health Care Use Case
by Hanen Bouali
Abstract: Recently, advances in hardware technology have allowed experts to automatically record transactions and other pieces of information of everyday life at a rapid rate. Systems execute complex events over real-time streams of RFID readings encoded as events. Hence, in the healthcare context, applications are increasingly interconnected and can impose a massive event load to be processed. Furthermore, existing systems suffer from a lack of support for heterogeneity and dynamism. Consequently, as a result of RFID technology and many other sensors, streaming data has brought another dimension to data querying and data mining research. This is due to the fact that, in a data stream, only a time window is available. In contrast to traditional data sources, data streams present new characteristics: they are continuous, high-volume and open-ended, and exhibit concept drift. To analyse complex queries over event streams, the data warehouse seems to be the answer. However, the classical data warehouse does not incorporate the specificity of event streams, due to the complexity of their components, which are spatial, temporal, semantic and real time. For these reasons, we focus in this paper on presenting the conceptual modelling of the real time data warehouse by defining a new dimensionality and stereotype for the classical data warehouse to adapt it to event streams. Then, to prove the efficiency of our real time data warehouse, we adapt the general pattern model to a pregnancy care medical unit, which shows promising results.
Keywords: data warehouse; data analysis; real time; healthcare.
Enhancement of SentiWordNet using Contextual Valence Shifters
by Poornima Mehta, Satish Chandra
Abstract: Sentence structure has a considerable impact on the sentiment polarity of a sentence. In the presence of Contextual Valence Shifters like conjunctions, conditionals and intensifiers some parts of the sentence are more relevant to determine the sentence polarity. In this work we have used Valence Shifters in sentences to enhance the sentiment lexicon SentiWordNet in a given document set. They have also been used to improve the sentiment analysis at document level. In the near past, microblogging services like Twitter have become an important data source for sentiment analysis. Tweets, being restricted to 140 characters are short and therefore have slangs, are grammatically incorrect, have spelling mistakes and have informal expressions. The method is aimed at noisy and unstructured data like tweets on which computationally intensive tools like dependency parsers are not very successful. Our proposed system works better on both noisy (Stanford and Airlines datasets of Twitter) and structured (Movie review) datasets.
Keywords: Sentiment Analysis; SentiWordNet; Valence Shifters; Micro-blogs; Discourse; Twitter; Lexicon Enhancement.
Bayesian Feature Construction for the Improvement of Classification Performance
by Manolis Maragoudakis
Abstract: In this paper we address the problem of increasing the validity of the classification process, not through approaches that improve the ability to construct a precise classification model with some Machine Learning algorithm, but through a wider encoding of the training data. More specifically, we create additional features so that the hidden aspects of the subject areas that model the available data are revealed to a higher degree. We suggest the use of a novel feature construction algorithm, which is based on the ability of Bayesian networks to re-enact the conditional independence assumptions of features, bringing forth properties concerning their interrelation that are not clear when a classifier is provided with the data in their initial form. The effect of the additional features is shown through experimental measurement in a wide domain area using a large number of classification algorithms, where the improvement in classification performance is evident.
Keywords: Machine learning; Knowledge engineering methodologies; Pattern analysis; Statistical Pattern Recognition.
A novel ensemble classifier by combining Sampling and Genetic algorithm to combat multiclass imbalanced problems
by Archana Purwar, Sandeep Singh
Abstract: Handling data sets with imbalanced classes is an exigent problem in the area of machine learning and data mining. Though a lot of work has been done by many researchers in the literature on two-class imbalanced problems, multiclass problems still need to be explored. Most existing imbalanced learning techniques have proved to be inappropriate for multiclass problems, or even to produce a negative effect. To the best of our knowledge, no one has used a combination of sampling (with and without replacement) and a genetic algorithm to solve the multiclass imbalanced problem. In this paper, we propose a sampling and Genetic algorithm based ensemble classifier (SA-GABEC) to handle imbalanced classes. SA-GABEC tries to locate the best subset of classifiers for a given sample that are precise in their predictions and can create an acceptable diversity in the feature subspace. These subsets of classifiers are fused together to give better predictions than a single classifier. Moreover, this paper also proposes a modified SA-GABEC, which performs feature selection before applying sampling and outperforms SA-GABEC. To demonstrate the adequacy of our proposed classifiers, we have validated them using two assessment metrics, recall and extended G-mean. Further, we have compared the results with existing approaches such as GAB-EPA, Adaboost and Bagging.
Keywords: Feature extraction; diversity; genetic algorithm; ensemble learning; multiclass imbalanced problems.
Dynamics of the Network Economy: A Content Analysis of the Search Engine Trends and Correlate Results Using Word Clusters
by Murat Yaslioglu
Abstract: The network economy is a relatively untouched area, and strategic approaches to the dynamics of this new economy are quite limited. The network economy is about networks, so we asked what better medium there could be for collecting insights than the biggest network itself. Thus, it was decided to follow up the information on the internet, including every kind of documentation. To do so, a deep relation analysis using trends was first conducted to find the topics related to the new economy's dynamics: network effect, network externalities, interoperability, big data and open standards. Additionally, social media was also investigated, since it is considered the marketplace where the network economy applies. After the relation analysis, the correlates of the aforementioned keywords were analysed. Finally, all the clean top results on the web were collected with the help of Linux command line tools into various very large text files. These files were analysed for their content with the help of the Nvivo qualitative analysis tool to form clusters. Based on the broad information at hand, an extensive discussion of each result is provided. It is believed that this new research approach will also guide much future research on various subjects.
Keywords: Network economy; network effect; network externalities; interoperability; big data; open standards; network strategy; methodology; analytics; word clusters; search engines.
Testing a File Carving Tool Using Realistic Datasets Generated with Openness by a User Level File System
by Srinivas Kaparthi, Venugopa T
Abstract: A file carver views a used hard disk as a storage medium containing raw data. From the user's point of view, the same hard disk is a storage medium containing files. File carving has applications in data recovery and digital forensics. It analyzes the raw data and reassembles file fragments, without using the files' metadata, to reconstruct the actual files present on the disk. During the development phase of a file carver, it is inappropriate to use a used hard disk as an input medium, because the file system does not provide openness regarding file fragmentation and the location of data on the disk. In this paper, we propose a method that provides realistic data sets with openness, which can be used to test carving tools. The realistic property of the data sets is achieved by implementing a file system at user level. A large file is used to mimic a hard disk in this process. The large file, on the hard disk, is handled by the host file system. The same large file, mimicking a test hard disk, is handled by a file system at user level. Openness is achieved because the file system at user level acts as a white box, while the file system at kernel level acts as a black box. The large file thus generated is a realistic data set generated with openness and can be used as an input for verifying the correctness of a file carving tool during its development phase.
Keywords: File carver; file system; meta data; digital forensics; data recovery.
Fiber Optic Angle Rate Gyroscope Performance Evaluation in Terms of Allan Variance
by Jianbo Hu
Abstract: Based on an analysis of the error sources of the Fiber Optic Angle Rate Gyroscope (FOARG), the Allan parameters are obtained by calculating the Allan variances. The relationship between the Allan variance and the accuracy of the FOARG is given. Because the output of a certain type of FOARG exhibits high noise, large fluctuations in value and notable errors, a data-processing algorithm based on averaging and smoothing is proposed. Several Matlab blocks, such as data sampling, averaging and smoothing, are designed to process the dynamic and static data of this type of FOARG and to evaluate its performance with the Allan variance.
Keywords: Fiber Optic Gyroscope; data-process; Allan variance.
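For readers unfamiliar with the Allan variance used in the abstract above, a minimal sketch of the standard non-overlapping estimator follows. The samples are split into clusters of m consecutive readings, each cluster is averaged, and the variance is half the mean squared difference between successive cluster averages. This is the textbook estimator written in Python, not the authors' Matlab blocks; the function name and the test data are hypothetical.

```python
def allan_variance(samples, m):
    """Non-overlapping Allan variance for clusters of m consecutive samples.

    With K full clusters and cluster means y_1..y_K, this returns
    sum((y_{k+1} - y_k)^2) / (2 * (K - 1)).
    """
    n_clusters = len(samples) // m
    if n_clusters < 2:
        raise ValueError("need at least two full clusters of size m")
    means = [sum(samples[i * m:(i + 1) * m]) / m for i in range(n_clusters)]
    squared_diffs = [(b - a) ** 2 for a, b in zip(means, means[1:])]
    return 0.5 * sum(squared_diffs) / len(squared_diffs)


# A steadily drifting toy signal: the variance grows with the cluster size.
signal = [1.0, 2.0, 3.0, 4.0]
print(allan_variance(signal, 1))
print(allan_variance(signal, 2))
```

Evaluating the function over a range of cluster sizes and plotting the square root against averaging time gives the usual Allan deviation curve from which noise terms are read off.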
A Novel Integrated Principal Component Analysis and Support vector Machines based diagnostic system for detection of Chronic Kidney disease
by Aditya Khamparia, Babita Pandey
Abstract: The alarming growth of chronic kidney disease has become a major issue in our nation. Kidney disease does not have a specific target, but individuals with conditions such as obesity, cardiovascular disease and diabetes are all at increased risk. Yet there is little awareness of kidney disease and kidney failure, which affect an individual's health. Therefore, there is a need for an advanced diagnostic system that improves the health condition of individuals. The intent of the proposed work is to combine an emerging data reduction technique, principal component analysis (PCA), with a supervised classification technique, the support vector machine (SVM), for the examination of kidney disease from which patients have been suffering. A variety of statistical reasoning and probabilistic measures, such as accuracy and recall, were used in the proposed work to examine the validity of the dataset and the obtained results. Experimental results showed that SVM with a Gaussian radial basis kernel achieved higher precision and performed better than other models in terms of diagnostic accuracy rates.
Keywords: Principal component analysis; Support vector machine; classification; kidney disease; kernel; feature extraction.
Hybrid Fuzzy Logic and Gravitational Search Algorithm based Multiple Filters for Image Restoration
by A. Senthilselvi, Sukumar S
Abstract: Image restoration is a noise removal approach that removes noise from a noisy image and restores the image. It has been widely used in various fields, such as computer vision and medical imaging. In this paper, we consider standard test images for noise removal experimentation. The images are mainly corrupted by impulse noise. We present multiple image filters for the removal of impulse noise from test images. The approach uses fuzzy logic (FL) to design a noise detector (ND) optimized by the gravitational search algorithm (GSA), and uses a median filter (MF) for restoration. The proposed multiple filters first use the FL approach to detect whether each pixel of a test image is noise-corrupted or not. If a pixel is considered noise-corrupted, the multiple filters restore it with the MF. Otherwise, it remains unchanged. We first split the image into a number of windows and apply the multiple filters to each window. The filter output is used for rule generation, and the optimal rules are selected using GSA. Then, the optimal rules are given to the fuzzy logic system to detect the noisy pixels. For experimentation, we used five types of standard test images. The experiments are carried out using different noise levels and different methods. The performance is measured in terms of PSNR, MSE and visual quality.
Keywords: Image restoration; impulse noise; fuzzy logic; multiple filters; median filter; standard test images; gravitational search algorithm.
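The detect-then-restore scheme described in the abstract above (replace a pixel with the median only if it is flagged as noisy, otherwise leave it unchanged) can be sketched as follows. The detector here is a deliberately simple stand-in that flags only the extreme grey levels 0 and 255; it is an assumption made for illustration and is not the GSA-optimised fuzzy logic detector of the paper. All names and the toy image are hypothetical.

```python
from statistics import median


def is_impulse(value, low=0, high=255):
    """Assumed stand-in detector: flag only salt-and-pepper extremes."""
    return value == low or value == high


def restore(image):
    """Replace flagged pixels with the median of their 3x3 neighbourhood.

    The window is clipped at the image border and includes the centre pixel.
    Pixels that are not flagged are copied through unchanged.
    """
    rows, cols = len(image), len(image[0])
    out = [row[:] for row in image]
    for r in range(rows):
        for c in range(cols):
            if not is_impulse(image[r][c]):
                continue
            window = [
                image[i][j]
                for i in range(max(0, r - 1), min(rows, r + 2))
                for j in range(max(0, c - 1), min(cols, c + 2))
            ]
            out[r][c] = int(median(window))
    return out


# Toy 3x3 image with one "salt" pixel in the centre.
noisy = [
    [10, 10, 10],
    [10, 255, 10],
    [10, 10, 12],
]
clean = restore(noisy)
print(clean)
```

Filtering only the flagged pixels preserves detail in uncorrupted regions, which is the motivation for pairing a detector with the median filter rather than filtering every pixel.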
Measuring Pearson's correlation coefficient of fuzzy numbers with different membership functions under weakest t-norm
by Mohit Kumar
Abstract: In statistical theory, the correlation coefficient has been widely used to assess a possible linear association between two variables and is often calculated in a crisp environment. In this study, a simplified and effective method is presented to compute Pearson's correlation coefficient of fuzzy numbers with different membership functions using approximate fuzzy arithmetic operations based on the weakest triangular norm (t-norm). Unlike in previous research studies, the correlation coefficient computed in this paper is a fuzzy number rather than a crisp number. The proposed method is illustrated by computing the correlation coefficient between technology level and management achievement from a sample of 15 machinery firms in Taiwan. The correlation coefficient computed by the proposed method has less uncertainty, and the obtained results are more exact. The computed results have also been compared with existing approaches.
Keywords: Pearson’s correlation coefficient; fuzzy number; weakest t-norm arithmetic.
PHASE DEPENDENT BREAKDOWN IN BULK ARRIVAL QUEUEING SYSTEM WITH VACATION BREAK-OFF
by NIRANJAN S.P, CHANDRASEKARAN V.M, INDHIRA K
Abstract: This paper investigates the dual control policy of a bulk arrival queueing model with phase dependent breakdown and vacation interruption. In this model, the service process is split into two phases, called first essential service and second essential service. The occurrence of breakdown during first essential service differs from that during second essential service. When the server fails during first essential service, the service process is interrupted and the server is sent to the repair station immediately. In contrast, when the server fails during second essential service, the service is not interrupted; it continues for the current batch with some technical precautionary arrangements. The server is repaired after the service completion, during the renewal period. On completion of second essential service, if the queue length is less than a, the server leaves for vacation. The server has to do preparatory work to initiate service after vacation. During vacation, if the queue length reaches the value a, the server breaks off the vacation and performs preparatory work to start first essential service. If the vacation period ends and the queue length is still less than a, the server remains dormant (idle) until the queue length reaches the value a. For this system, the probability generating function of the queue size is obtained using the supplementary variable technique. Various performance measures are also derived, with appropriate numerical solutions. Additionally, a cost model is presented to minimize the total average cost of the system.
Keywords: Phase dependent breakdown; vacation break-off; dual control policy; bulk arrival; batch service; cost optimization; renewal time; supplementary variable technique.
Implementation of an Efficient FPGA Architecture for Capsule Endoscopy Processor Core using Hyper Analytic Wavelet based Image Compression Technique
by Abdul Jaleel N, Vijayakumar P
Abstract: Wireless capsule endoscopy (WCE) is a state-of-the-art technology for receiving images of the human intestine for medical diagnostics. This paper proposes the implementation of an efficient FPGA architecture for a capsule endoscopy processor core. The main part of this processor is image compression, for which we propose an algorithm called the Hyper analytic Wavelet Transform (HWT). The HWT is quasi shift-invariant; it has good directional selectivity and a reduced degree of redundancy. Huffman coding is also used to reduce the number of bits required to represent a string of symbols. This paper also provides a Forward Error Correction (FEC) scheme based on Low Density Parity Check (LDPC) codes to reduce the Bit Error Rate (BER) of the transmitted data. Compared to similar existing works, this paper proposes an efficient architecture.
Keywords: Wireless capsule endoscopy (WCE); Hyper analytic Wavelet Transform (HWT); Huffman coding; Low Density Parity Check codes (LDPC); Forward Error Correction (FEC); quasi shift-invariant; Bit Error Rate (BER).
Data Aggregation to Better Understand the Impact of Computerization on Employment
by James Otto, Chaodong Han
Abstract: Data reduction methods are called for to address challenges presented by big data. Correlation of two variables may be less clear if data are organized at disaggregate levels in regression analysis. In this study, we apply data aggregation to regression analysis in the context of a study forecasting the impact of computerization on jobs and wages. We show that data grouped by the ranked independent variable, versus random or other grouping schemes, provides a clearer pattern of the employment impacts of computerization probability on job categories. The coefficient estimates are more consistent for groupings based on a ranked independent variable than those provided by random grouping of the same independent variable. The improved estimations can have positive policy implications.
Keywords: Data reduction methods; Impact of computerization; Computerization probability; Automation; Data grouping schemes; Statistical regression; Data aggregation; Ranked regression; Information reduction.
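The grouping scheme described in the abstract above can be sketched as follows: sort the observations by the independent variable, cut them into consecutive groups, average each group, and fit an ordinary least squares line to the group means. The data, the number of groups and the function names are hypothetical; the abstract does not specify the study's data or group sizes.

```python
def ols(xs, ys):
    """Simple least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def group_by_rank(xs, ys, n_groups):
    """Sort pairs by x, split into consecutive groups, return the group means.

    Any remainder observations are placed in the last group.
    """
    pairs = sorted(zip(xs, ys))
    size = len(pairs) // n_groups
    group_x, group_y = [], []
    for g in range(n_groups):
        start = g * size
        end = (g + 1) * size if g < n_groups - 1 else len(pairs)
        chunk = pairs[start:end]
        group_x.append(sum(p[0] for p in chunk) / len(chunk))
        group_y.append(sum(p[1] for p in chunk) / len(chunk))
    return group_x, group_y


# Hypothetical noise-free data on the line y = 2x + 1.
xs = [0.1 * i for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]

gx, gy = group_by_rank(xs, ys, 4)
slope, intercept = ols(gx, gy)
print(round(slope, 6), round(intercept, 6))
```

With noise-free data the grouped fit recovers the original line exactly. With noisy data, averaging within ranked groups smooths the noise while keeping the spread of the independent variable across groups, whereas random grouping tends to pull the group means of x together.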
Inference in Mixed Linear Models with four variance components - Sub-D and Sub-DI
by Adilson Silva, Miguel Fonseca, Antonio Monteiro
Abstract: This work addresses the new estimators for variance components in mixed linear models, Sub-D and its improved version Sub-DI, developed and tested by Silva (2017). Both estimators were deduced and tested in mixed linear models with two and three variance components; the authors gave the corresponding formulations for models with an arbitrary number of variance components, but no one had ever tested their performance in models with more than three variance components. Here we aim to give the explicit formulations for both Sub-D and Sub-DI in models with four variance components, as well as a numerical example testing their performance. Tables containing the results of the numerical example are given.
Keywords: Orthogonal Matrices; Variance Components; Sub-D; Sub-DI; Mixed Linear Models.
Special Issue on: DAC9 Theory and Applications of Correspondence Analysis and Classification
Comparison of hierarchical clustering methods for binary data from molecular markers
by Emmanouil D. Pratsinakis, Symela Ntoanidou, Alexios Polidoros, Christos Dordas, Panagiotis Madesis, Ilias Elefterohorinos, George Menexes
Abstract: Data from molecular markers used for constructing dendrograms, which are based on genetic distances between different plant species, are encoded as binary data (0: absence of the band on the agarose gel, 1: presence of the band on the agarose gel). For the construction of the dendrograms, the most commonly used linkage methods are UPGMA (Unweighted Pair Group Method with Arithmetic mean) and Neighbor-Joining, in combination with multiple distances (mainly the Euclidean distance, squared or not). It seems that in this scientific field the gold standard clustering method (combination of distance and linkage method) is UPGMA in combination with the squared Euclidean distance. In this study, a review is presented of the distances and the linkage methods used with binary data. Furthermore, an evaluation of the linkage methods and the corresponding appropriate distances (a comparison of 162 clustering methods) is attempted using binary data resulting from molecular markers applied to five populations of the wild mustard species Sinapis arvensis. The validity of the various cluster solutions was tested using external criteria (geographical area and herbicide resistance). The results showed that the squared Euclidean distance, in combination with the UPGMA linkage method, is not a panacea for dendrogram construction in the biological sciences based on binary data derived from molecular markers. Thirty-six other hierarchical clustering methods could be used. In addition, the Benz
Keywords: Dendrograms; proximities; linkage methods; Benz.
Assessment of the awareness of Cypriot Accounting Firms level concerning Cyber Risk. An exploratory analysis
by Stratos Moschidis, Efstratios Livanis, Athanasios Thanopoulos
Abstract: Technology development has made a decisive contribution to the digitization of businesses, which makes it easier for them to work more efficiently. However, in recent years, data leakages have shown an increasing trend. To investigate the level of awareness among Cypriot accountancy firms about cyber-related risks, we use data from a recent survey of Cypriot professional accountants who are members of the Institute of Certified Public Accountants of Cyprus (ICPAC). The categorical nature of the data and the purpose of our research led us to use methods of multidimensional statistical analysis. The emergence of pronounced differences between accounting companies on this issue, as we will present, is particularly interesting.
Keywords: cyber risk; multiple correspondence analysis; Cypriot accounting firms; exploratory statistics.
Sequential dimension reduction and clustering of mixed-type data
by Angelos Markos, Odysseas Moschidis, Theodore Chadjipantelis
Abstract: Real data sets usually involve a number of variables that are heterogeneous in nature. Clustering of a set of objects described by a mixture of continuous and categorical variables can be a challenging task, because it requires to take into account the relationships between variables that are of different measurement levels. In the context of data reduction, an effective class of methods combine dimension reduction of the variables with clustering of the objects in the reduced space. In this paper, we focus on three methods for sequential dimension reduction and clustering of mixed-type data. The first step of each approach involves the application of Principal Component Analysis or Correspondence Analysis on a suitably transformed matrix to retain as much variance as possible in as few dimensions as possible. In the second step, a partitioning or hierarchical clustering algorithm is applied to the object scores in the reduced space. The common theoretical underpinnings of the three approaches are highlighted. The results of a benchmarking study on simulated data show that sequential dimension reduction and clustering methods outperform alternative approaches, especially when categorical variables are more informative than continuous with regard to the underlying cluster structure. Strengths and limitations of the methods are also demonstrated on a real data set with nominal, ordinal and continuous variables.
Keywords: Cluster Analysis; Dimension Reduction; Correspondence Analysis; Principal Component Analysis; Mixed-type Data.
Special Issue on: LOPAL'2018 Advances and Applications in Optimisation and Learning Algorithms
Bayesian Consensus Clustering with LIME for Security in Big Data
by Balamurugan Selvarathinam
Abstract: Malware creates considerable noise in the current data era. Security concerns grow every day as intruders create new malware. Malware protection remains one of the trending areas of research on the Android platform. Malware is routed through SMS/MMS in the subscriber's network. Once read, an SMS is forwarded to other users. This affects the device once intruders gain access to the device data. Theft of device and user data also includes credit card credentials, login credentials and card information, depending on the user data stored on the Android device. This paper addresses how the various kinds of malware in SMS can be detected, in order to protect mobile users from potential risks arising from multiple data sources. Using a single data source is not very effective for spam detection, as a single data source will not contain all the latest malware and spam. This work uses two methods, namely Bayesian Consensus Clustering (BCC) for spam clustering and LIME for classification of malware. The significance of these methods is their ability to work with unstructured data from different sources. After the two-step classification, a set of unique malware is identified, and all further malware is grouped according to its category.
Keywords: Bayesian Consensus Clustering; LIME; Classification; Big Data security.
Efficient Data Clustering Algorithm Designed Using Heuristic Approach
by POONAM NANDAL, DEEPA BURA, Meeta Singh
Abstract: Retrieving information from the large amount of information available in a database is a major issue these days. Relevant information is extracted from the voluminous information available on the web using various techniques such as Natural Language Processing, Lexical Analysis, Clustering, Categorization, etc. In this paper, we discuss the clustering methods used for clustering large amounts of data, using different features to classify the data. In today's era, various problem-solving techniques make use of a heuristic approach for designing and developing efficient algorithms. In this paper, we propose a clustering technique that uses a heuristic function to select the centroid so that the clusters formed meet the needs of the user. The heuristic function designed in this paper is based on conceptually similar data points, so that they are grouped into accurate clusters. The k-means clustering algorithm, which is widely used to cluster data, is also a focus of this paper. It has been empirically found that the clusters formed and the data points which belong to a cluster are close to human analysis as compared to existing
Keywords: Clustering; Natural Language Processing; k-means; Concept; Heuristic.
Semantic Integration of Traditional and Heterogeneous Data Sources (UML, XML and RDB) in OWL 2 Triplestore
by Oussama EL Hajjamy, Hajar Khallouki, Larbi Alaoui, Mohamed Bahaj
Abstract: With the success of the internet and the growth in the amount of data on the web, the exchange of information from various heterogeneous and classical data sources has become a critical need. In this context, researchers must propose integration solutions that allow applications to access several data sources simultaneously. From this perspective, it is necessary to find a solution for integrating data from classical data sources (UML, XML and RDB) into richer systems based on ontologies using the semantic web language OWL. In this work, we propose a semi-automatic approach for integrating classical data sources via a global schema located in database management systems for RDF or OWL data, called triplestores. The goal is to combine several classical and heterogeneous data sources according to the same schema and unified semantics. Our contribution is subdivided into three axes. The first aims to establish an automatic solution that converts classical data sources such as UML, XML and relational databases (RDB) to local ontologies based on the OWL2 language. The second consists of semantically aligning local ontologies based on syntactic, semantic and structural similarity measurement techniques in order to increase the probability of finding real correspondences and real differences. Finally, the third aims to merge the pre-existing local ontologies into a global ontology based on the alignment found in the previous step. A tool based on our approach has also been developed and tested to demonstrate the power of our strategy and validate the theoretical concept.
Keywords: data integration; UML; XML; RDB; semantic web; OWL2; triplestore; aligning ontologies; merge ontologies.
A Novel Homophone-Based Text Compression for Secure Transmission
by Baritha Begum
Abstract: The Internet has been widely used for communication in recent years. In the last decade, there has been an increase in the amount of data transmitted via the Internet, representing text, images, speech, video, sound and computer data. Hence there is a need for efficient compression algorithms that can be used effectively within the existing network bandwidth. Data secrecy is one of the most important concerns in the security of any network. The Homophone based Encryption with Compression (HEC) algorithm proposed here is a viable way to maintain data confidentiality. The HEC algorithm reduces the quantum of data used for representation. Homophones are words that sound alike but have different meanings and spellings. The proposed scheme enhances the security and compression ratio of the input information in three steps. First, each input word is transformed into an existing homophone using a built-in dictionary, which enhances security. Then compression is performed by BWT, modified RLE and Huffman coding. Finally, array reduction based encryption cum compression is applied, which further increases the compression ratio and improves security. This scheme has been tested with a number of text files from standard corpora. The results indicate that the HEC scheme achieves a higher compression ratio and security than many widely used dictionary and statistical based compression schemes.
Keywords: Homophone; data Compression; Encryption; Data security; Bits per character; Compression ratio; unicity distance; compression efficiency.
Improving Social Media Engagements on paid and nonpaid advertisements: A Data Mining Approach
by Jen-peng Huang, Genesis Sembiring Depari
Abstract: The purpose of this research is to develop a strategy to increase the number of social media engagements on Facebook for both paid and nonpaid publications through a data mining approach. Several Facebook post characteristics were weighted in order to rank the importance of the input variables. The performance of Support Vector Machine, Deep Learning, and Random Forest, along with dynamic parameters, was compared in order to obtain a robust algorithm for assessing the importance of several input factors. Random Forest was found to be the most powerful algorithm, with 79% accuracy, and was therefore used to analyze the importance of input factors in order to increase the number of engagements of social media posts. We found that total page likes (the number of page followers) of a company's Facebook page is the most important factor for obtaining more social media engagements for both paid and nonpaid publications. To show that engagements are also beneficial for reaching more people, we examined the correlation of shares, likes, comments and other post characteristics with reaching more people through company Facebook pages. In the final part, we propose managerial implications on how to increase the number of engagements in social media for both paid and nonpaid publications.
Keywords: Social Media; Data Mining; Paid Advertisement; Non-Paid Advertisement; Social Media Engagements.
Evaluating information criteria in latent class analysis: Application to identify classes of Breast Cancer data set
by Abdallah Abarda, Mohamed Dakkon, Khawla Asmi, Youssef Bentaleb
Abstract: In recent studies, latent class analysis (LCA) modeling has been proposed as a convenient alternative to standard classification methods. It has become a popular tool for clustering respondents into homogeneous subgroups based on their responses to a set of categorical variables. The absence of a commonly accepted statistical indicator for deciding the number of classes in the population under study is one of the major unresolved issues in the application of LCA. The number of classes constituting the profiles of a given population is often determined using the likelihood ratio test; however, its use is not theoretically correct. To overcome this problem, we propose an alternative to the classical latent class model selection methods, based on information criteria. This article aims to investigate the performance of information criteria for selecting latent class analysis models. Nine information criteria are compared under various sample sizes and model dimensionalities. We also propose an application of information criteria (ICs) to select the best model for a breast cancer data set.
Keywords: Latent class analysis; Model selection; Information criteria.
Sentiment classification of review data using sentence significance score optimization
by Ketan Kumar Todi, Muralikrishna SN, Ashwath Rao B
Abstract: A significant amount of work has been done in the field of sentiment analysis of textual data using the concepts and techniques of Natural Language Processing (NLP). In this work, unlike the existing techniques, we present a novel method in which we consider the significance of the sentences in formulating the opinion. The sentences in a review may correspond to different aspects, which are often irrelevant in deciding whether the sentiment on a topic is positive or negative. Thus, we assign a sentence significance score to evaluate the overall sentiment of the review. We employ a clustering mechanism followed by a neural network approach to determine the optimal significance score for the review. The proposed supervised method shows a higher accuracy than the state-of-the-art techniques. We further determine the subjectivity of sentences and establish a relationship between the subjectivity of sentences and the significance score. We experimentally show that the significance scores found by the proposed method correspond to identifying the subjective and objective sentences in reviews. Sentences with a low significance score correspond to objective sentences, and sentences with a high significance score correspond to subjective sentences.
Keywords: Aspect; Sentiment Classification; Clustering; Neural Network; Optimization; Significance score.
Towards Knowledge Warehousing: Application to Smart Housing
by Hadjer Moulai, Habiba Drias
Abstract: The terms data, information and knowledge should not be treated as synonyms in any context. In fact, a hierarchical order exists between these entities, where data becomes information and information becomes knowledge. Massive amounts of data are analysed every day in order to extract valuable knowledge to support decision making. However, the size of the extracted knowledge compromises the speed with which it can be reasoned over and exploited. In this paper, we propose the paradigm of knowledge warehousing to store and analyse large amounts of knowledge through online knowledge processing and knowledge mining techniques. Our proposal is supported by an original knowledge warehouse framework and a case study for smart housing technology. A multi-agent system built on a knowledge warehouse architecture is illustrated, in which each agent has a knowledge base about its assigned task. The paradigm is expected to be applicable to other knowledge tasks and domains as well.
Keywords: knowledge warehouse; knowledge management; knowledge mining; warehousing technology; smart housing; agent technology.