International Journal of Data Analysis Techniques and Strategies (24 papers in press)
Data Aggregation to Better Understand the Impact of Computerization on Employment
by James Otto, Chaodong Han
Abstract: Data reduction methods are called for to address challenges presented by big data. The correlation between two variables may be obscured when data are analyzed at disaggregate levels in regression analysis. In this study, we apply data aggregation to regression analysis in the context of forecasting the impact of computerization on jobs and wages. We show that grouping data by the ranked independent variable, rather than by random or other grouping schemes, provides a clearer pattern of the employment impacts of computerization probability on job categories. Coefficient estimates are more consistent for groupings based on the ranked independent variable than for random groupings of the same independent variable. The improved estimates can have positive policy implications.
Keywords: Data reduction methods; Impact of computerization; Computerization probability; Automation; Data grouping schemes; Statistical regression; Data aggregation; Ranked regression; Information reduction.
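As a rough illustration of the grouping scheme this abstract describes (not the authors' data or code; the variables and parameter choices below are hypothetical), one can compare a regression on group means formed by the ranked independent variable against one formed by random grouping:

```python
import numpy as np

def grouped_fit(x, y, n_groups, rng=None):
    """Fit y = a + b*x after aggregating points into groups.

    With rng=None, points are sorted by the independent variable x
    (ranked grouping); otherwise they are randomly permuted (random
    grouping). Each group is replaced by its (mean x, mean y) pair
    before an ordinary least-squares fit.
    """
    order = np.argsort(x) if rng is None else rng.permutation(len(x))
    xs, ys = x[order], y[order]
    gx = [chunk.mean() for chunk in np.array_split(xs, n_groups)]
    gy = [chunk.mean() for chunk in np.array_split(ys, n_groups)]
    slope, intercept = np.polyfit(gx, gy, 1)
    return slope, intercept

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 600)                     # e.g. computerization probability
y = 2.0 - 1.5 * x + rng.normal(0, 0.5, 600)    # synthetic "employment" response

b_ranked, _ = grouped_fit(x, y, 20)            # grouped by ranked x
b_random, _ = grouped_fit(x, y, 20, rng=rng)   # random grouping
```

Ranked grouping spreads the group means of x over its full range, so the slope is estimated with far less variance than under random grouping, where all group means cluster near the overall mean.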
Inference in Mixed Linear Models with four variance components - Sub-D and Sub-DI
by Adilson Silva, Miguel Fonseca, Antonio Monteiro
Abstract: This work examines the new estimators for variance components in mixed linear models, Sub-D and its improved version Sub-DI, developed and tested by Silva (2017). Both estimators were derived and tested in mixed linear models with two and three variance components; the authors gave the corresponding formulations for models with an arbitrary number of variance components, but their performance had never been tested in models with more than three variance components. Here we give the explicit formulations of both Sub-D and Sub-DI in models with four variance components, together with a numerical example testing their performance. Tables containing the results of the numerical example are provided.
Keywords: Orthogonal Matrices; Variance Components; Sub-D; Sub-DI; Mixed Linear Models.
Detecting Text in License Plates Using a Novel MSER-Based Method
by Admi Mohamed, E.L. Fkihi Sanaa, Faizi Rdouan
Abstract: A new license plate detection method is proposed in this paper. The proposed approach consists of three steps: the first step removes some detail from the input image by converting it to a gray-level image and inverting it (negative), and then uses MSER to extract text candidate regions. The second step is based on a dynamic grouped DBSCAN algorithm for fast classification of the connected regions, and on the intersections of the outer tangents of circles for filtering regions with the same orientation. Finally, in the third step, a geometrical and statistical character filter is used to eliminate false detections. Experimental results show that our approach performs better and achieves higher detection rates than the method proposed by Xu-Cheng Yin (2014).
Keywords: Text detection; MSER; circle overlapping; DBSCAN; License plate detection.
Microarray Cancer Classification using Feature Extraction based Ensemble Learning Method
by ANITA BAI, SWATI HIRA
Abstract: Microarray cancer datasets generally contain many features and a small number of samples, so we first need to reduce redundant features to allow faster convergence. To address this issue, we propose a novel feature extraction based ensemble classification technique using support vector machines (SVMs), which classifies microarray cancer data and helps to build intelligent systems for early cancer detection. The novelty of the proposed approach lies in how the cancer data are classified: a) we extract information by reducing the size of a larger dataset using various feature selection techniques, such as principal component analysis (PCA), chi-square, genetic algorithms (GA) and F-score; b) we classify the extracted information into normal and malignant classes using a majority voting SVM ensemble. In the SVM ensemble we use different kernels: linear, polynomial, radial basis function (RBF) and sigmoid. The results of the individual kernels are combined using a majority voting approach. The effectiveness of the algorithm is validated on six benchmark cancer datasets, viz. Colon, Ovarian, Leukaemia, Breast, Lung and Prostate.
Keywords: Cancer classification; Support vector machine; PCA; GA; F-Score; Chi-square.
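A minimal sketch of the two steps the abstract outlines, using scikit-learn (the synthetic data, the choice of PCA as the extractor, and the component count are illustrative assumptions, not the paper's setup):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in for a microarray matrix: many features, few samples.
X, y = make_classification(n_samples=100, n_features=500,
                           n_informative=20, random_state=0)

# Step (a): feature extraction -- PCA is one of the selectors the paper lists.
# Step (b): hard (majority) voting over SVMs with different kernels.
ensemble = make_pipeline(
    StandardScaler(),
    PCA(n_components=20, random_state=0),
    VotingClassifier(
        estimators=[(k, SVC(kernel=k)) for k in ("linear", "poly", "rbf", "sigmoid")],
        voting="hard",   # each kernel votes; the majority class wins
    ),
)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
ensemble.fit(X_tr, y_tr)
acc = ensemble.score(X_te, y_te)
```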
Rough set-based attribute reduction and decision rule formulation for marketing data
by Murchhana Tripathy, Anita Panda, Santilata Champati
Abstract: Using classical rough set theory, this study addresses the attribute reduction problem, followed by decision rule formulation, for marketing data that contain both inconsistent and repeated records. We propose an algorithm that first uses the concepts of core and reduct and then cross-checks both using the significance of the attributes, in order to formulate more accurate and correct rules. For borderline cases, the support and confidence of a rule are used to determine whether to select or exclude it. To illustrate the method, we use marketing data from twenty-three Indian cosmetic companies. We also conduct a sensitivity analysis of the obtained results to gain insight into the profitability of the companies.
Keywords: Discernibility Matrix; Core; Reduct; Significance of Attributes; Decision Rules; Marketing; Sensitivity Analysis.
A study of the effect of Customer Citizenship Behaviour on Service Quality, Purchase Intentions and Customer Satisfaction
by Thomas Fotiadis
Abstract: Customer Citizenship Behaviour is a determinative factor of consumer behaviour. It shapes beliefs about the Service Quality offered by the enterprise and scales the magnitude of customer satisfaction. This paper investigates customers' behaviour in the light of their intentions to provide information and feedback to the enterprise, to support it in their social circles, to advertise it through word of mouth, to communicate and interact with other customers and exchange views, and to detect problems that may emerge due to, for example, delays or shortages of certain products. Additionally, the paper surveys the degree to which the aforementioned constituents affect the perceived quality of the services rendered, purchase intention and Customer Satisfaction. The Implicative Statistical Analysis technique was used to analyze the survey data. Results show that the feedback and interaction provided by customers shape Purchase Intention, and that these parameters together determine the perceived Service Quality.
Keywords: Consumer behaviour; Customer Citizenship Behaviour; Customer Satisfaction; Purchase Intention; Service Quality.
A Novel Centroids Initialization for K-means Clustering in the Presence of Benign Outliers
by Amin Karami
Abstract: K-means is one of the most important and widely applied clustering algorithms in learning systems. However, it is sensitive to centroid initialization, which makes the algorithm unstable. The performance and stability of K-means may be further degraded when benign outliers (i.e., long-term independent data points) appear in the data. In this paper, we develop a novel algorithm that optimizes K-means performance in the presence of benign outliers. We first identify the benign outliers and run K-means on them; K-means is then run over all data points to relocate the clusters' centroids, providing high accuracy. Experimental results on several benchmark and synthetic data sets confirm that the proposed method significantly outperforms several existing approaches on the applied performance metrics.
Keywords: Clustering; K-means; Centroid Initialization; Benign Outlier.
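The two-stage idea in the abstract (seed K-means on the outliers, then re-run over all points) can be sketched as follows; the distance-based outlier rule and the z threshold are assumptions for illustration, not the paper's exact criterion:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def outlier_seeded_kmeans(X, n_clusters, z=2.0, random_state=0):
    """Sketch of an outlier-aware initialization (not the paper's exact rule).

    1. Flag points far from the overall centroid (z standard deviations
       above the mean distance) as candidate 'benign outliers'.
    2. Run K-means on those points to obtain initial centroids.
    3. Re-run K-means on all points, seeded with those centroids.
    """
    d = np.linalg.norm(X - X.mean(axis=0), axis=1)
    outliers = X[d > d.mean() + z * d.std()]
    if len(outliers) < n_clusters:   # fall back to the usual k-means++ seeding
        return KMeans(n_clusters=n_clusters, random_state=random_state).fit(X)
    seed = KMeans(n_clusters=n_clusters, random_state=random_state).fit(outliers)
    return KMeans(n_clusters=n_clusters, init=seed.cluster_centers_, n_init=1).fit(X)

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
model = outlier_seeded_kmeans(X, n_clusters=3)
```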
Improving the predictive ability of multivariate calibration models using Support Vector Data Description
by Walid Gani
Abstract: Outlier detection is a crucial step in building multivariate calibration models and enhancing their predictive ability. However, traditional outlier detection methods often suffer from important drawbacks, mainly their reliance on assumptions about the data distribution and their unsuitability for real-life applications. This paper investigates the use of Support Vector Data Description (SVDD) for outlier detection and proposes a multivariate calibration strategy that combines partial least squares (PLS) and SVDD. To assess the proposed calibration strategy, an experimental study predicting four chemical properties of diesel fuels is conducted. The results show that the predictive ability of PLS-SVDD is better than that of a classical strategy combining PLS with the T^2 method.
Keywords: multivariate calibration; outlier; SVDD; PLS; T^2 method.
Improving Sentiment Analysis Using Preprocessing Techniques and Lexical Patterns
by Stefano Cagnoni, Laura Ferrari, Paolo Fornacciari, Monica Mordonini, Laura Sani, Michele Tomaiuolo
Abstract: Sentiment Analysis has recently gained considerable attention, since the classification of the emotional content of a text (online reviews, blog messages, etc.) may have a relevant impact on market research, political science and many other fields. In this paper, we focus on the importance of the text preprocessing phase and propose a new technique, termed Lexical Pattern-based Feature Weighting (LPFW), which improves sentence-level Sentiment Analysis by increasing the relevance of the features contained in particular lexical patterns. The approach is evaluated on two sentiment classification datasets. We show that a systematic optimization of the preprocessing filters is important for obtaining good classification accuracy, and that LPFW is effective in different application domains and with different training set sizes.
Keywords: Sentiment Analysis; POS Tagging; Natural Language Processing.
A Methodical Evaluation of Classifiers in Predicting Academic Performance for a Multi-Class Approach
by A. Princy Christy, Rama N
Abstract: Predictive analytics has gained importance in recent years, as it helps to proactively identify factors that contribute to the success or failure of an event in the relevant field. Students' academic achievements can be predicted early by employing algorithms and analyzing relevant data, thereby devising solutions to improve performance. In this process, choosing the right algorithm is crucial, since the performance of algorithms varies depending on the distribution of the data and the way each algorithm is tuned to handle it. To enhance the performance of the algorithms, their hyper-parameters were tuned. Several multi-class classifiers were examined and the prediction accuracy of each resulting model was compared. Depending on their classification accuracy, the models were used to predict student performance, using micro and macro averaging because of the multi-class setting. The results show that ensemble classifiers performed better than their individual counterparts.
Keywords: Multi-class; Classification; Prediction; Performance metrics; XGBoost; Random Forest Classifier; Feature importance; Grid Search; Macro-average; Micro-average.
Implementation of an efficient FPGA architecture for capsule endoscopy processor core using hyper analytic wavelet-based image compression technique
by N. Abdul Jaleel, P. Vijaya Kumar
Abstract: Wireless capsule endoscopy (WCE) is a state-of-the-art technology for receiving images of the human intestine for medical diagnostics. This paper proposes the implementation of an efficient FPGA architecture for a capsule endoscopy processor core. The main part of this processor is image compression, for which we propose an algorithm called the hyper analytic wavelet transform (HWT). The HWT is quasi shift-invariant, has good directional selectivity and a reduced degree of redundancy. Huffman coding is also used to reduce the number of bits required to represent a string of symbols. The paper also provides a forward error correction (FEC) scheme based on low density parity check (LDPC) codes to reduce the bit error rate (BER) of the transmitted data. Compared to similar existing works, the proposed architecture is more efficient.
Keywords: wireless capsule endoscopy; WCE; hyper analytic wavelet transform; HWT; Huffman coding; low density parity check codes; LDPC; forward error correction; FEC; quasi shift-invariant; bit error rate; BER.
Special Issue on: DAC9 Theory and Applications of Correspondence Analysis and Classification
Comparison of hierarchical clustering methods for binary data from molecular markers
by Emmanouil D. Pratsinakis, Symela Ntoanidou, Alexios Polidoros, Christos Dordas, Panagiotis Madesis, Ilias Eleftherohorinos, George Menexes
Abstract: Data from molecular markers used for constructing dendrograms, which are based on genetic distances between different plant species, are encoded as binary data. For dendrogram construction, the most commonly used linkage method is UPGMA in combination with the squared Euclidean distance; in this scientific field it appears to be the 'golden standard' clustering method. In this study, a review of clustering methods used with binary data is presented. Furthermore, an evaluation of the linkage methods and the corresponding appropriate distances (a comparison of 163 clustering methods) is attempted using binary data resulting from molecular markers applied to five populations of the wild mustard species Sinapis arvensis. The various cluster solutions were validated using external criteria. The results showed that the 'golden standard' is not a panacea for dendrogram construction based on binary data derived from molecular markers: thirty-seven other hierarchical clustering methods could be used.
Keywords: dendrograms; proximities; linkage methods; Benzécri's chi-squared distance; correspondence analysis; categorical binary data; ISSR markers; Sinapis arvensis.
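The kind of comparison the abstract describes can be sketched with SciPy; the toy band-presence matrix below and the choice of Jaccard/complete linkage as the alternative are illustrative assumptions, not the paper's marker data or its 163-method protocol:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
# Toy stand-in for a band-presence matrix from molecular markers:
# 10 individuals x 12 binary loci, with two planted groups.
X = np.vstack([
    (rng.random((5, 12)) < 0.2).astype(int) | np.array([1] * 6 + [0] * 6),
    (rng.random((5, 12)) < 0.2).astype(int) | np.array([0] * 6 + [1] * 6),
])

# The field's 'golden standard': UPGMA on squared Euclidean distances.
upgma = linkage(pdist(X, "sqeuclidean"), method="average")

# One of many alternatives for binary data: complete linkage on Jaccard.
alt = linkage(pdist(X.astype(bool), "jaccard"), method="complete")

labels_upgma = fcluster(upgma, t=2, criterion="maxclust")
labels_alt = fcluster(alt, t=2, criterion="maxclust")
```

Swapping the distance string and linkage method is all it takes to enumerate many of the method combinations such a benchmark evaluates.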
Assessment of the awareness level of Cypriot accounting firms concerning cyber risk: an exploratory analysis
by Stratos Moschidis, Efstratios Livanis, Athanasios C. Thanopoulos
Abstract: Technological development has made a decisive contribution to the digitisation of businesses, making it easier for them to work more efficiently. In recent years, however, data leakages have shown an increasing trend. To investigate the level of awareness of cyber-related risks among Cypriot accountancy firms, we use data from a recent survey of professional accountants who are members of the Institute of Certified Public Accountants of Cyprus (ICPAC). The categorical nature of the data and the purpose of our research led us to methods of multidimensional statistical analysis. Particularly interesting is the emergence of marked differences between accounting firms in relation to this issue, as we will present.
Keywords: cyber risk; multiple correspondence analysis; MCA; Cypriot accounting firms; exploratory statistics.
Sequential dimension reduction and clustering of mixed-type data
by Angelos Markos, Odysseas Moschidis, Theodore Chadjipantelis
Abstract: Clustering a set of objects described by a mixture of continuous and categorical variables can be a challenging task. In the context of data reduction, an effective class of methods combines dimension reduction with clustering in the reduced space. In this paper, we review three approaches for sequential dimension reduction and clustering of mixed-type data. The first step of each approach involves the application of principal component analysis to a suitably transformed matrix. In the second step, a partitioning or hierarchical clustering algorithm is applied to the object scores in the reduced space. The common theoretical underpinnings of the three approaches are highlighted. The results of a benchmarking study show that sequential dimension reduction and clustering is an effective strategy, especially when the categorical variables are more informative than the continuous ones with regard to the underlying cluster structure. Strengths and limitations are also demonstrated on a real mixed-type dataset.
Keywords: cluster analysis; dimension reduction; correspondence analysis; principal component analysis; PCA; mixed-type data.
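The two-step tandem scheme in the abstract can be sketched with scikit-learn; the mixed-type toy data, the one-hot/standardize transformation, and the K-means choice are illustrative assumptions standing in for the specific transformed-matrix PCA methods the paper reviews:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 200
# Hypothetical mixed-type data: one continuous and one categorical variable,
# where the categorical variable carries the cluster structure.
group = rng.integers(0, 2, n)
df = pd.DataFrame({
    "income": rng.normal(50 + 5 * group, 10, n),          # continuous
    "sector": np.where(group == 0, "public", "private"),  # categorical
})

# Step 1: transform and reduce -- standardize continuous columns, one-hot
# encode categorical ones, then apply PCA to the transformed matrix.
# Step 2: cluster the object scores in the reduced space.
pipe = make_pipeline(
    ColumnTransformer([
        ("num", StandardScaler(), ["income"]),
        ("cat", OneHotEncoder(), ["sector"]),
    ]),
    PCA(n_components=2),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
labels = pipe.fit_predict(df)
```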
A comparative evaluation of dissimilarity-based and model-based clustering in science education research: the case of children's mental models of the Earth
by Dimitrios Stamovlasis, Julie Vaiopoulou, George Papageorgiou
Abstract: In the present work, two different classification methods, a dissimilarity-based clustering (DBC) approach and model-based latent class analysis (LCA), were used to analyse responses to a questionnaire designed to measure children's mental representation of the Earth. The work contributes to an ongoing debate in cognitive psychology and science education research between two antagonistic theories on the nature of children's knowledge, that is, the coherent versus the fragmented knowledge hypothesis. Methodology-wise, the problem concerns the classification of response patterns into distinct clusters, which correspond to specific hypothesised mental models. DBC employs the partitioning around medoids (PAM) approach and selects the final cluster solution based on average silhouette width, cluster stability and interpretability. LCA, a model-based clustering method, achieves a taxonomy by employing the conditional probabilities of responses. A brief presentation and comparison of the two methods is provided first, and issues of clustering philosophy are discussed. Both PAM and LCA managed to detect only the cluster corresponding to the coherent scientific model, along with an artificial segment added on purpose to the empirical data. Despite obvious deviations in cluster-membership assignment, the two methods ultimately provide sound findings as far as the tested hypotheses are concerned, converging to identical conclusions.
Keywords: mental model; latent class analysis; partitioning around medoids; dissimilarity-based clustering; coherent mental model hypothesis; fragmented knowledge hypothesis; science education; model-based clustering.
Special Issue on: LOPAL'2018 Advances and Applications in Optimisation and Learning Algorithms
Bayesian Consensus Clustering with LIME for Security in Big Data
by Balamurugan Selvarathinam
Abstract: Malware creates huge noise in the current data era. New security questions arise every day as intruders create new malware. Malware protection remains one of the trending areas of research on the Android platform. Malware is routed through SMS/MMS in the subscriber's network; once read, an SMS is forwarded to other users. This impacts the device once the intruders gain access to the device data. Device and user data theft includes credit card credentials, login credentials and card information stored on the Android device. This paper works towards detecting the various kinds of malware in SMS, in order to protect mobile users from potential risks, using multiple data sources. A single data source is not very effective for spam detection, as it will not contain all the latest malware and spam. This work uses two methods: BCC for spam clustering and LIME for malware classification. The significance of these methods is their ability to work with unstructured data from different sources. After the two-step classification, a set of unique malware is identified, and all further malware is grouped according to its category.
Keywords: Bayesian Consensus Clustering; LIME; Classification; Big Data security.
Efficient Data Clustering Algorithm Designed Using Heuristic Approach
by POONAM NANDAL, DEEPA BURA, Meeta Singh
Abstract: Information retrieval from the large amount of information available in a database is a major issue these days. Extracting relevant information from the voluminous information available on the web is done using various techniques such as natural language processing, lexical analysis, clustering and categorization. In this paper, we discuss the clustering methods used for clustering large amounts of data using different features to classify the data. In today's era, various problem-solving techniques make use of heuristic approaches for designing and developing efficient algorithms. We propose a clustering technique that uses a heuristic function to select the centroids, so that the clusters formed match the needs of the user. The heuristic function designed in this paper is based on conceptually similar data points, so that they are grouped into accurate clusters. The k-means clustering algorithm, which is widely used to cluster data, is also a focus of this paper. It has been empirically found that the clusters formed, and the data points belonging to each cluster, are closer to human analysis than those of existing techniques.
Keywords: Clustering; Natural Language Processing; k-means; Concept; Heuristic.
Semantic Integration of Traditional and Heterogeneous Data Sources (UML, XML and RDB) in OWL 2 Triplestore
by Oussama EL Hajjamy, Hajar Khallouki, Larbi Alaoui, Mohamed Bahaj
Abstract: With the success of the internet and the expansion of the amount of data on the web, the exchange of information between various heterogeneous and classical data sources has become a critical need. In this context, researchers must propose integration solutions that allow applications to access several data sources simultaneously. From this perspective, it is necessary to find a solution for integrating data from classical data sources (UML, XML and RDB) into richer systems based on ontologies using the semantic web language OWL. In this work, we propose a semi-automatic approach for integrating classical data sources via a global schema located in a database management system for RDF or OWL data, called a triplestore. The goal is to combine several classical and heterogeneous data sources under the same schema and a unified semantics. Our contribution is subdivided into three axes. The first aims to establish an automatic solution that converts classical data sources such as UML, XML and relational databases (RDB) to local ontologies based on the OWL2 language. The second consists of semantically aligning the local ontologies using syntactic, semantic and structural similarity measurement techniques, in order to increase the probability of finding real correspondences and real differences. Finally, the third axis merges the pre-existing local ontologies into a global ontology based on the alignment found in the previous step. A tool based on our approach has also been developed and tested to demonstrate the power of our strategy and validate the theoretical concept.
Keywords: data integration; UML; XML; RDB; semantic web; OWL2; triplestore; aligning ontologies; merging ontologies.
Improving Social Media Engagements on Paid and Non-Paid Advertisements: A Data Mining Approach
by Jen-peng Huang, Genesis Sembiring Depari
Abstract: The purpose of this research is to develop a strategy to improve the number of social media engagements on Facebook for both paid and non-paid publications through a data mining approach. Several Facebook post characteristics were weighted in order to rank the importance of the input variables. The performance of Support Vector Machines, Deep Learning and Random Forests, along with dynamic parameters, was compared in order to obtain a robust algorithm for assessing the importance of the input factors. Random Forest was found to be the most powerful algorithm, with 79% accuracy, and was therefore used to analyze the importance of the input factors for improving the number of engagements of social media posts. We found that total page likes (the number of page followers) of a company's Facebook page is the most important factor for obtaining more social media engagements, for both paid and non-paid publications. To show that engagements are also beneficial for reaching more people, we also examined the correlation of shares, likes, comments and other post characteristics with the reach of company Facebook pages. Finally, we propose managerial implications for improving the number of engagements in social media for both paid and non-paid publications.
Keywords: Social Media; Data Mining; Paid Advertisement; Non-Paid Advertisement; Social Media Engagements.
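The variable-importance ranking the abstract relies on can be sketched with a random forest's impurity-based importances; the feature names and the synthetic dependence on page likes below are hypothetical, chosen only to mirror the paper's reported finding:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 400
# Hypothetical post-level features (names are illustrative, not the paper's).
df = pd.DataFrame({
    "page_likes": rng.normal(10_000, 2_000, n),
    "post_hour": rng.integers(0, 24, n),
    "paid": rng.integers(0, 2, n),
})
# Make engagement depend mostly on page likes, echoing the reported result.
high_engagement = (df["page_likes"] + rng.normal(0, 500, n) > 10_000).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(df, high_engagement)
ranking = dict(zip(df.columns, rf.feature_importances_))  # sums to 1.0
```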
Evaluating information criteria in latent class analysis: application to identifying classes in a breast cancer data set
by Abdallah Abarda, Mohamed Dakkon, Khawla Asmi, Youssef Bentaleb
Abstract: In recent studies, latent class analysis (LCA) modelling has been proposed as a convenient alternative to standard classification methods. It has become a popular tool for clustering respondents into homogeneous subgroups based on their responses to a set of categorical variables. The absence of a commonly accepted statistical indicator for deciding the number of classes is one of the major unresolved issues in the application of LCA. Determining the number of classes constituting the profiles of a given population is often done using the likelihood ratio test, although its use is not theoretically correct. To overcome this problem, we propose an alternative to the classical latent class model selection methods, based on information criteria. This article investigates the performance of information criteria for selecting latent class analysis models. Nine information criteria are compared under various sample sizes and model dimensionalities. We also apply the information criteria to select the best model for a breast cancer data set.
Keywords: Latent class analysis; Model selection; Information criteria.
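Information-criterion selection of the number of classes can be illustrated with a finite mixture model; the Gaussian mixture and BIC below are stand-ins for the paper's latent class models and its nine criteria (the data are synthetic, with three planted classes):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic stand-in data with three underlying classes. A Gaussian mixture
# is used only to illustrate IC-based selection; the paper's models are
# latent class models for categorical data.
X = np.vstack([rng.normal(m, 1.0, size=(150, 2)) for m in (0.0, 5.0, 10.0)])

# Fit candidate models with 1..6 classes and record the BIC of each.
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 7)}
best_k = min(bics, key=bics.get)   # BIC: lower is better
```

The same loop works with any criterion that trades off fit against the number of parameters (AIC, CAIC, adjusted BIC, and so on).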
Sentiment classification of review data using sentence significance score optimization
by Ketan Kumar Todi, Muralikrishna SN, Ashwath Rao B
Abstract: A significant amount of work has been done in the field of sentiment analysis of textual data using the concepts and techniques of Natural Language Processing (NLP). In this work, unlike existing techniques, we present a novel method in which we consider the significance of the sentences in formulating the opinion. Often the sentences in a review correspond to different aspects that are irrelevant in deciding whether the sentiment on a topic is positive or negative. Thus, we assign a sentence significance score to evaluate the overall sentiment of the review. We employ a clustering mechanism followed by a neural network approach to determine the optimal significance score for the review. The proposed supervised method shows higher accuracy than state-of-the-art techniques. We further determine the subjectivity of sentences and establish a relationship between the subjectivity of sentences and the significance score. We show experimentally that the significance scores found by the proposed method identify the subjective and objective sentences in reviews: sentences with low significance scores correspond to objective sentences, and sentences with high significance scores correspond to subjective sentences.
Keywords: Aspect; Sentiment Classification; Clustering; Neural Network; Optimization; Significance score.
Towards Knowledge Warehousing: Application to Smart Housing
by Hadjer Moulai, Habiba Drias
Abstract: The terms data, information and knowledge should not be treated as synonyms in any context. In fact, a hierarchical order exists between these entities, in which data become information and information becomes knowledge. Massive amounts of data are analysed every day in order to extract valuable knowledge to support decision making. However, the size of the extracted knowledge compromises the speed with which it can be reasoned over and exploited. In this paper, we propose the knowledge warehousing paradigm to store and analyse large amounts of knowledge through online knowledge processing and knowledge mining techniques. Our proposal is supported by an original knowledge warehouse framework and a case study on smart housing technology. A multi-agent system built on a knowledge warehouse architecture is illustrated, in which each agent has a knowledge base about its assigned task. The paradigm is expected to be applicable to other knowledge tasks and domains as well.
Keywords: knowledge warehouse; knowledge management; knowledge mining; warehousing technology; smart housing; agent technology.
Road signs recognition: state-of-the-art and perspectives
by Btissam Bousarhane, Saloua Bensiali, Driss Bouzidi
Abstract: Traffic accidents are a global problem that enormously affects many countries. Morocco is one of the countries that pays a heavy price each year in terms of lost human lives and economic costs. Making cars safer is a crucial element of saving lives on roads. In cases of inattention or distraction, drivers need a high-performance system capable of assisting and alerting them when a road sign appears in their field of vision. To create such systems, we first need to know the specificities of traffic signs and the major difficulties that still hinder their recognition, which is the object of the first and second sections of this paper. We should also study the different methods proposed by researchers to overcome each of these challenges; this study helps us identify the strengths and weaknesses of each method, as presented in the third section (classical vs. machine learning approaches). Evaluation metrics and criteria for proving the effectiveness of these approaches are another important element presented in section three. Improving the existing methods is crucial to ensure the effectiveness of the recognition process, especially through deep learning algorithms and optimization techniques, as discussed in the last section of this paper.
Keywords: Road signs recognition; detection; classification; tracking; machine learning; deep learning; evaluation datasets; evaluation metrics; hardware optimization; algorithmic optimization; CNN.
Combining Planning and Learning for Context Aware Service Composition
by Tarik Fissaa, Mahmoud Elhamlaoui, Hatim Guermah, Hatim Hafiddi, Mahmoud NASSAR
Abstract: The computing vision introduced by Mark Weiser in the early 90s defined the basis of what is now called ubiquitous computing. This new discipline results from the convergence of powerful, small and affordable computing devices with the networking technologies that connect them all together. Ubiquitous computing has thus brought a new generation of service-oriented architectures (SOA) based on context-aware services. These architectures provide users with personalised and adapted behaviours by composing multiple services according to their contexts. In this context, the objective of this paper is to propose an approach for context-aware semantic-based service composition. Our contributions are built around the following axes: (i) a semantic-based context model and a context-aware semantic composite service specification, (ii) an architecture for context-aware semantic-based service composition using artificial intelligence planning, and (iii) an intelligent mechanism based on reinforcement learning for context-aware selection, in order to deal with the dynamic and uncertain character of modern ubiquitous environments.
Keywords: Context Awareness; Ontology; Service Composition; Semantic Web; AI Planning; Reinforcement Learning.