International Journal of Data Analysis Techniques and Strategies (19 papers in press)
Data Aggregation to Better Understand the Impact of Computerization on Employment
by James Otto, Chaodong Han
Abstract: Data reduction methods are called for to address challenges presented by big data. In regression analysis, the correlation between two variables may be less clear if data are organized at disaggregate levels. In this study, we apply data aggregation to regression analysis in the context of a study forecasting the impact of computerization on jobs and wages. We show that data grouped by the ranked independent variable, versus random or other grouping schemes, provide a clearer pattern of the employment impacts of computerization probability on job categories. The coefficient estimates are more consistent for groupings based on a ranked independent variable than for random groupings of the same independent variable. The improved estimations can have positive policy implications.
Keywords: Data reduction methods; Impact of computerization; Computerization probability; Automation; Data grouping schemes; Statistical regression; Data aggregation; Ranked regression; Information reduction.
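The grouping idea in this abstract can be illustrated with a minimal sketch on synthetic data. Everything below is an assumption made for illustration (the data-generating model, the sample size, the number of groups and the helper names `ols` and `group_means`); it is not the authors' dataset or procedure. It compares a regression on group means when groups are formed by ranking the independent variable against one where groups are formed at random.

```python
import random
import statistics

def ols(xs, ys):
    """Return (slope, intercept, r_squared) of a simple least-squares fit."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    slope = sxy / sxx
    return slope, my - slope * mx, (sxy * sxy) / (sxx * syy)

def group_means(pairs, n_groups):
    """Split pairs into n_groups consecutive chunks and average x and y per chunk."""
    size = len(pairs) // n_groups
    chunks = [pairs[i * size:(i + 1) * size] for i in range(n_groups)]
    xs = [statistics.fmean([x for x, _ in c]) for c in chunks]
    ys = [statistics.fmean([y for _, y in c]) for c in chunks]
    return xs, ys

# Hypothetical data: y depends linearly on x, plus heavy noise.
rng = random.Random(0)
pairs = []
for _ in range(1000):
    x = rng.random()
    pairs.append((x, 2.0 * x + rng.gauss(0.0, 1.0)))

ranked = sorted(pairs, key=lambda p: p[0])  # groups formed by ranked x
shuffled = pairs[:]
rng.shuffle(shuffled)                       # groups formed at random

slope_ranked, _, r2_ranked = ols(*group_means(ranked, 20))
slope_random, _, r2_random = ols(*group_means(shuffled, 20))
print(f"ranked groups: slope={slope_ranked:.2f}, R^2={r2_ranked:.2f}")
print(f"random groups: slope={slope_random:.2f}, R^2={r2_random:.2f}")
```

In this sketch, ranking keeps the spread of the independent variable across groups while averaging away noise within each group; random groups all have nearly the same mean of x, so the pattern is much harder to see.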
Inference in Mixed Linear Models with four variance components - Sub-D and Sub-DI
by Adilson Silva, Miguel Fonseca, Antonio Monteiro
Abstract: This work addresses the new estimators for variance components in mixed linear models, Sub-D and its improved version Sub-DI, developed and tested by Silva (2017). Both estimators were derived and tested in mixed linear models with two and three variance components; the authors gave the corresponding formulations for models with an arbitrary number of variance components, but their performance had never been tested in models with more than three variance components. Here we aim to give the explicit formulations for both Sub-D and Sub-DI in models with four variance components, as well as a numerical example testing their performance. Tables containing the results of the numerical example are given.
Keywords: Orthogonal Matrices; Variance Components; Sub-D; Sub-DI; Mixed Linear Models.
Detecting Text in License Plates using a novel MSER based Method
by Admi Mohamed, E.L. Fkihi Sanaa, Faizi Rdouan
Abstract: A new license plate detection method is proposed in this paper. The proposed approach consists of three steps. The first step removes some details from the input image by converting it to a gray-level image and inverting it (negative), and then uses MSER to extract text in candidate regions. The second step is based on a dynamic grouped DBSCAN algorithm for fast classification of the connected regions, and on the outer tangents of circle intersections for filtering regions with the same orientation. Finally, in the third step, a geometrical and statistical character filter is used to eliminate false detections. Experimental results show that our approach achieves better detection than that proposed by Xu-Cheng Yin (2014).
Keywords: Text detection; MSER; circle overlapping; DBSCAN; License plate detection.
Microarray Cancer Classification using Feature Extraction based Ensemble Learning Method
by ANITA BAI, SWATI HIRA
Abstract: Microarray cancer datasets generally contain many features with a small number of samples, so we first need to reduce redundant features to allow faster convergence. To address this issue, we propose a novel feature extraction based ensemble classification technique using support vector machines (SVM), which classifies microarray cancer data and helps to build intelligent systems for early cancer detection. The proposed approach classifies cancer data as follows: a) We extract information by reducing the size of the larger dataset using various feature selection techniques, such as principal component analysis (PCA), chi-square, genetic algorithm (GA) and F-Score. b) We classify the extracted information into normal and malignant classes using a majority voting ensemble SVM. In the SVM ensemble based approach we use different SVM kernels: linear, polynomial, radial basis function (RBF), and sigmoid. The results of the individual kernels are combined using a majority voting approach. The effectiveness of the algorithm is validated on six benchmark cancer datasets, viz. Colon, Ovarian, Leukaemia, Breast, Lung and Prostate, using ensemble SVM classification.
Keywords: Cancer classification; Support vector machine; PCA; GA; F-Score; Chi-square.
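The majority voting step described in this abstract can be sketched as follows. This is an illustrative sketch only: the per-kernel predictions are hypothetical placeholders rather than outputs of trained SVMs, the function name `majority_vote` is an assumption, and the abstract does not say how ties are broken (here the label seen first among the tied labels wins).

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label per sample by majority vote.

    Ties are broken in favour of the label encountered first (an assumption).
    """
    combined = []
    for sample_votes in zip(*predictions_per_model):
        label, _ = Counter(sample_votes).most_common(1)[0]
        combined.append(label)
    return combined

# Hypothetical predictions from four SVM kernels for four samples.
kernel_predictions = {
    "linear":     ["normal", "malignant", "malignant", "normal"],
    "polynomial": ["normal", "malignant", "normal", "normal"],
    "rbf":        ["malignant", "malignant", "malignant", "normal"],
    "sigmoid":    ["normal", "normal", "malignant", "normal"],
}

ensemble = majority_vote(list(kernel_predictions.values()))
print(ensemble)  # ['normal', 'malignant', 'malignant', 'normal']
```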
Rough set-based attribute reduction and decision rule formulation for marketing data
by Murchhana Tripathy, Anita Panda, Santilata Champati
Abstract: Using the classical Rough Set Theory concept, this study addresses the attribute reduction problem followed by decision rule formulation for marketing data that contains both inconsistent and repeated data. Based on the method followed in the work, we propose an algorithm which initially uses the concepts of core and reduct and then cross-checks both by using the significance of the attributes to formulate more accurate rules. For the borderline cases, it is proposed to use the support and confidence of the rule to determine whether to select the rule or to exclude it. To show the working of the method discussed, we use the marketing data of twenty-three Indian cosmetic companies. We also conduct a sensitivity analysis of the obtained results to gain insight into the profitability of the companies.
Keywords: Discernibility Matrix; Core; Reduct; Significance of Attributes; Decision Rules; Marketing; Sensitivity Analysis.
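The support and confidence used for borderline rules can be computed as in the minimal sketch below. It uses the standard definitions (support: fraction of all records matching both condition and decision; confidence: fraction of records matching the condition that also match the decision). The attribute names, values and records are hypothetical and are not taken from the paper's marketing data.

```python
def rule_support_confidence(records, condition, decision):
    """Return (support, confidence) of the rule 'condition => decision'."""
    def matches(record, pattern):
        return all(record.get(key) == value for key, value in pattern.items())

    covered = [r for r in records if matches(r, condition)]   # condition holds
    correct = [r for r in covered if matches(r, decision)]    # decision also holds
    support = len(correct) / len(records) if records else 0.0
    confidence = len(correct) / len(covered) if covered else 0.0
    return support, confidence

# Hypothetical decision table; the third record is inconsistent with the first two.
records = [
    {"ad_spend": "high", "price": "low", "profit": "yes"},
    {"ad_spend": "high", "price": "low", "profit": "yes"},
    {"ad_spend": "high", "price": "low", "profit": "no"},
    {"ad_spend": "low", "price": "high", "profit": "no"},
]

support, confidence = rule_support_confidence(
    records,
    condition={"ad_spend": "high", "price": "low"},
    decision={"profit": "yes"},
)
print(support, confidence)
```

A rule with low confidence on inconsistent records, as here, is the kind of borderline case where a threshold on these two values could decide whether the rule is kept.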
A study of the effect of Customer Citizenship Behaviour on Service Quality, Purchase Intentions and Customer Satisfaction
by Thomas Fotiadis
Abstract: Customer Citizenship Behaviour constitutes a determinative factor of consumer behaviour. It shapes beliefs relating to the Service Quality offered by the enterprise and determines the degree of customer satisfaction. This paper investigates customers' behaviour in the light of their intentions to provide information and feedback to the enterprise, to support it in their social circles, to advertise it through word of mouth, to communicate and interact with other customers and to exchange views, and to detect problems that may emerge due to, for example, delays or shortages in certain products. Additionally, the paper surveys the degree to which the aforementioned constituents affect the perceived quality of the services rendered, the purchase intention and Customer Satisfaction. The Implicative Statistical Analysis technique was used to analyze the data of the survey. Results show that feedback and interaction provided by customers shape Purchase Intention and that these parameters together determine the perceived Service Quality.
Keywords: Consumer behaviour; Customer Citizenship Behaviour; Customer Satisfaction; Purchase Intention; Service Quality.
A Novel Centroids Initialization for K-means Clustering in the Presence of Benign Outliers
by Amin Karami, Shafiq Urrehman, Mustansar Ali Ghazanfar
Abstract: K-means is one of the most important and widely applied clustering algorithms in learning systems. However, it is sensitive to centroid initialization, which makes the K-means algorithm unstable. The performance and the stability of the K-means algorithm may be degraded if benign outliers (i.e., long-term independence data points) appear in the data. In this paper, we develop a novel algorithm to optimize K-means performance in the presence of benign outliers. We first identify the benign outliers and execute K-means across them; K-means then runs over all data points to re-locate the clusters' centroids, providing high accuracy. The experimental results over several benchmarking and synthetic data sets confirm that the proposed method significantly outperforms some existing approaches, with better accuracy on the applied performance metrics.
Keywords: Clustering; K-means; Centroid Initialization; Benign Outlier.
Improving the predictive ability of multivariate calibration models using Support Vector Data Description
by Walid Gani
Abstract: Outlier detection is a crucial step in building multivariate calibration models and enhancing their predictive ability. However, traditional outlier detection methods often suffer from important drawbacks, mainly their reliance on assumptions about the data model distribution and their unsuitability for real-life applications. This paper investigates the use of Support Vector Data Description (SVDD) for the detection of outliers and proposes a multivariate calibration strategy which combines partial least squares (PLS) and SVDD. To assess the proposed calibration strategy, an experimental study aiming to predict four chemical properties of diesel fuels is conducted. The results show that the predictive ability of PLS-SVDD is better than that of a classical strategy which combines PLS and the T^2 method.
Keywords: multivariate calibration; outlier; SVDD; PLS; T^2 method.
Improving Sentiment Analysis Using Preprocessing Techniques and Lexical Patterns
by Stefano Cagnoni, Laura Ferrari, Paolo Fornacciari, Monica Mordonini, Laura Sani, Michele Tomaiuolo
Abstract: Sentiment Analysis has recently gained considerable attention, since the classification of the emotional content of a text (online reviews, blog messages etc.) may have a relevant impact on market research, political science and many other fields. In this paper, we focus on the importance of the text preprocessing phase, proposing a new technique we termed Lexical Pattern-based Feature Weighting (LPFW), that allows one to improve sentence-level Sentiment Analysis by increasing the relevance of the features contained in particular lexical patterns. This approach has been evaluated on two sentiment classification datasets. We show that a systematic optimization of the preprocessing filters is important for obtaining good classification accuracy. Also, we show that LPFW is effective in different application domains and with different training set sizes.
Keywords: Sentiment Analysis; POS Tagging; Natural Language Processing.
A Methodical Evaluation of Classifiers in Predicting Academic Performance for a Multi-Class Approach
by A. Princy Christy, Rama N
Abstract: Predictive analytics has gained importance in recent years as it helps to proactively identify factors that contribute to the success or failure of an event in a relevant field. Academic achievements of students can be predicted early by employing algorithms and analyzing relevant data, thereby devising solutions to improve performance. In this process, choosing the right algorithm is crucial, since the performance of an algorithm varies depending on the distribution of the data and the way the algorithm is tuned to handle the data. In order to enhance the performance of the algorithms, their hyper-parameters were tuned. Many multi-class classifiers were examined and the prediction accuracy of each resulting model was compared. Depending on their classification accuracy, the models developed were used to predict the performance of the students. Because of the multi-class setting, this was done using micro and macro averaging. The results show that ensemble classifiers performed better than their individual counterparts.
Keywords: Multi-class; Classification; Prediction; Performance metrics; XGBoost; Random Forest Classifier; Feature importance; Grid Search; Macro-average; Micro-average.
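The difference between micro and macro averaging mentioned in this abstract can be shown with a small sketch for precision. The labels and predictions below are hypothetical, and the function name `micro_macro_precision` is an assumption. A class that is never predicted is given precision 0.0 here, which is one common convention.

```python
def micro_macro_precision(y_true, y_pred):
    """Return (micro, macro) averaged precision for a multi-class problem."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in labels}  # true positives per predicted class
    fp = {c: 0 for c in labels}  # false positives per predicted class
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[pred] += 1
        else:
            fp[pred] += 1
    # Macro: average the per-class precisions, every class weighted equally.
    per_class = [
        tp[c] / (tp[c] + fp[c]) if (tp[c] + fp[c]) else 0.0 for c in labels
    ]
    macro = sum(per_class) / len(labels)
    # Micro: pool the counts over all classes first, then take one ratio.
    micro = sum(tp.values()) / (sum(tp.values()) + sum(fp.values()))
    return micro, macro

# Hypothetical imbalanced example: class "B" is never predicted correctly.
y_true = ["A", "A", "A", "A", "B", "C"]
y_pred = ["A", "A", "A", "A", "A", "C"]

micro, macro = micro_macro_precision(y_true, y_pred)
print(f"micro={micro:.3f} macro={macro:.3f}")  # micro=0.833 macro=0.600
```

Micro averaging is dominated by the frequent class, while macro averaging penalizes the missed minority class, which is why both are often reported for multi-class problems.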
Special Issue on: LOPAL'2018 Advances and Applications in Optimisation and Learning Algorithms
Bayesian Consensus Clustering with LIME for Security in Big Data
by Balamurugan Selvarathinam
Abstract: Malware creates a great deal of noise in the current data era. Security concerns grow every day as intruders create new malware. Malware protection remains one of the trending areas of research on the Android platform. Malware is routed through SMS/MMS in the subscribers' network. An SMS, once read, is forwarded to other users. This affects the device once the intruders access the device data. Device data theft and user data theft also include credit card credentials, login credentials and card information, depending on the user data stored on the Android device. This paper examines how the various kinds of malware in SMS can be detected to protect mobile users from potential risks from multiple data sources. Using a single data source is not very effective for spam detection, as a single data source will not contain all the latest malware and spam. This work uses two methods, namely BCC for spam clustering and LIME for classification of malware. The significance of these methods is their ability to work with unstructured data from different sources. After the two-step classification, a set of unique malware is identified, and all further malware is grouped according to its category.
Keywords: Bayesian Consensus Clustering; LIME; Classification; Big Data security.
Efficient Data Clustering Algorithm Designed Using Heuristic Approach
by POONAM NANDAL, DEEPA BURA, Meeta Singh
Abstract: Information retrieval from the large amount of information available in a database is a major issue these days. Relevant information is extracted from the voluminous information available on the web using various techniques such as Natural Language Processing, Lexical Analysis, Clustering, Categorization, etc. In this paper, we discuss the clustering methods used for clustering large amounts of data using different features to classify the data. In today's era, various problem-solving techniques make use of a heuristic approach for designing and developing efficient algorithms. In this paper, we propose a clustering technique that uses a heuristic function to select the centroid so that the clusters formed meet the needs of the user. The heuristic function designed in this paper is based on conceptually similar data points, so that they are grouped into accurate clusters. The k-means clustering algorithm, which is widely used to cluster data, is also a focus of this paper. It has been empirically found that the clusters formed and the data points which belong to a cluster are close to human analysis as compared to existing
Keywords: Clustering; Natural Language Processing; k-means; Concept; Heuristic.
Semantic Integration of Traditional and Heterogeneous Data Sources (UML, XML and RDB) in OWL 2 Triplestore
by Oussama EL Hajjamy, Hajar Khallouki, Larbi Alaoui, Mohamed Bahaj
Abstract: With the success of the internet and the expansion of the amount of data on the web, the exchange of information from various heterogeneous and classical data sources has become a critical need. In this context, researchers must propose integration solutions that allow applications to simultaneously access several data sources. From this perspective, it is necessary to find a solution for integrating data from classical data sources (UML, XML and RDB) into richer systems based on ontologies using the semantic web language OWL. In this work, we propose a semi-automatic approach for integrating classical data sources via a global schema located in a database management system for RDF or OWL data, called a triplestore. The goal is to combine several classical and heterogeneous data sources according to the same schema and unified semantics. Our contribution is subdivided into three axes. The first aims to establish an automatic solution that converts classical data sources such as UML, XML and relational databases (RDB) to local ontologies based on the OWL2 language. The second consists of semantically aligning local ontologies based on syntactic, semantic and structural similarity measurement techniques in order to increase the probability of finding real correspondences and real differences. Finally, the third aims to merge the pre-existing local ontologies into a global ontology based on the alignment found in the previous step. A tool based on our approach has also been developed and tested to demonstrate the power of our strategy and validate the theoretical concept.
Keywords: data integration; UML; XML; RDB; semantic web; OWL2; triplestore; aligning ontologies; merge ontologies.
Improving Social Media Engagements on paid and nonpaid advertisements: A Data Mining Approach
by Jen-peng Huang, Genesis Sembiring Depari
Abstract: The purpose of this research is to develop a strategy to improve the number of social media engagements on Facebook, for both paid and nonpaid publications, through a data mining approach. Several Facebook post characteristics were weighted in order to rank the importance of the input variables. The performance of Support Vector Machine, Deep Learning, and Random Forest, along with dynamic parameters, was compared in order to obtain a robust algorithm for assessing the importance of several input factors. Random Forest was found to be the most powerful algorithm, with 79% accuracy, and was therefore used to analyze the importance of input factors in order to improve the number of engagements of social media posts. We found that total page likes (the number of page followers) of a company Facebook page is the most important factor for obtaining more social media engagements for both paid and nonpaid publications. In order to show that engagements are also beneficial for reaching more people, we also examined the correlation of shares, likes, comments and other post characteristics with reaching more people through company Facebook pages. In the final part, we also propose managerial implications on how to improve the number of engagements in social media for both paid and nonpaid publications.
Keywords: Social Media; Data Mining; Paid Advertisement; Non-Paid Advertisement; Social Media Engagements.
Evaluating information criteria in latent class analysis: Application to identify classes of Breast Cancer data set
by Abdallah Abarda, Mohamed Dakkon, Khawla Asmi, Youssef Bentaleb
Abstract: In recent studies, latent class analysis (LCA) modeling has been proposed as a convenient alternative to standard classification methods. It has become a popular tool for clustering respondents into homogeneous subgroups based on their responses on a set of categorical variables. The absence of a commonly accepted statistical indicator for deciding the number of classes in the study population represents one of the major unresolved issues in the application of LCA. Determining the number of classes constituting the profiles of a given population is often done using the likelihood ratio test; however, its use is not theoretically correct. To overcome this problem, we propose an alternative to the classical latent class model selection methods based on information criteria. This article aims to investigate the performance of information criteria for selecting latent class analysis models. Nine information criteria are compared under various sample sizes and model dimensionalities. We also propose an application of ICs to select the best model for a breast cancer data set.
Keywords: Latent class analysis; Model selection; Information criteria.
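Selecting the number of latent classes with information criteria can be sketched as below. The abstract does not name its nine criteria; AIC and BIC are used here only as two widely known examples. The log-likelihood values, sample size and item structure are hypothetical, and the function names are assumptions. The parameter count is that of an unrestricted latent class model: (K - 1) class proportions plus, for each class, (r_j - 1) response probabilities per item.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln n - 2 ln L (lower is better)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

def lca_param_count(n_classes, categories_per_item):
    """Free parameters of an unrestricted latent class model."""
    return (n_classes - 1) + n_classes * sum(c - 1 for c in categories_per_item)

# Hypothetical fits: five binary items, 500 respondents,
# maximized log-likelihood for 2, 3 and 4 classes.
items = [2, 2, 2, 2, 2]
n_obs = 500
log_likelihoods = {2: -1250.0, 3: -1235.0, 4: -1232.0}

scores = {}
for k, ll in log_likelihoods.items():
    p = lca_param_count(k, items)
    scores[k] = (aic(ll, p), bic(ll, p, n_obs))

best_aic = min(scores, key=lambda k: scores[k][0])
best_bic = min(scores, key=lambda k: scores[k][1])
print(best_aic, best_bic)  # 3 2
```

In this made-up example the two criteria disagree, because BIC penalizes additional parameters more heavily; disagreements of this kind are what motivates comparing criteria across sample sizes and model sizes.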
Sentiment classification of review data using sentence significance score optimization
by Ketan Kumar Todi, Muralikrishna SN, Ashwath Rao B
Abstract: A significant amount of work has been done in the field of sentiment analysis in textual data using the concepts and techniques of Natural Language Processing (NLP). In this work, unlike the existing techniques, we present a novel method wherein we consider the significance of the sentences in formulating the opinion. Often, the sentences in a review correspond to different aspects, many of which are irrelevant in deciding whether the sentiment on a topic is positive or negative. Thus, we assign a sentence significance score to evaluate the overall sentiment of the review. We employ a clustering mechanism followed by a neural network approach to determine the optimal significance score for the review. The proposed supervised method shows a higher accuracy than the state-of-the-art techniques. We further determine the subjectivity of sentences and establish a relationship between the subjectivity of sentences and the significance score. We experimentally show that the significance scores found by the proposed method correspond to identifying the subjective and objective sentences in reviews. Sentences with a low significance score correspond to objective sentences, and sentences with a high significance score correspond to subjective sentences.
Keywords: Aspect; Sentiment Classification; Clustering; Neural Network; Optimization; Significance score.
Towards Knowledge Warehousing: Application to Smart Housing
by Hadjer Moulai, Habiba Drias
Abstract: The terms data, information and knowledge should not be treated as synonyms in any context. In fact, a hierarchical order exists between these entities, in which data become information and information becomes knowledge. Massive amounts of data are analysed every day in order to extract valuable knowledge to support decision making. However, the size of the extracted knowledge compromises the speed of reasoning over it and of exploiting it. In this paper, we propose the paradigm of knowledge warehousing to store and analyse large amounts of knowledge through online knowledge processing and knowledge mining techniques. Our proposal is supported by an original knowledge warehouse framework and a case study on smart housing technology. A multi-agent system built on a knowledge warehouse architecture is illustrated, where each agent has a knowledge base about its assigned task. The paradigm is expected to be applicable to other knowledge tasks and domains as well.
Keywords: knowledge warehouse; knowledge management; knowledge mining; warehousing technology; smart housing; agent technology.
Road signs recognition: State-of-the-art and perspectives
by Btissam Bousarhane, Saloua Bensiali, Driss Bouzidi
Abstract: Traffic accidents represent a global problem that severely affects many countries. Morocco is one of the countries that pay a heavy price each year in terms of loss of human lives and economic costs. Making cars safer is a crucial element of saving lives on roads. In case of inattention or distraction, drivers need an efficient system that is capable of assisting and alerting them when a road sign appears in their field of vision. To create such a system, we first need to know the specific characteristics of traffic signs and the major difficulties that their recognition still faces, which are the subject of the first and second sections of this paper. We should also study the different methods proposed by researchers to overcome each of these challenges. This study will help us to identify the strengths and weaknesses of each method, as proposed in the third section (Classical vs. Machine learning approaches). Evaluation metrics and criteria for proving the effectiveness of these approaches also represent an important element, which section three of this article presents. Improving the existing methods is crucial to ensure the effectiveness of the recognition process, especially by using deep learning algorithms and optimization techniques, as discussed in the last section of this paper.
Keywords: Road signs recognition; detection; classification; tracking; machine learning; deep learning; evaluation datasets; evaluation metrics; hardware optimization; algorithmic optimization; CNN.
Combining Planning and Learning for Context Aware Service Composition
by Tarik Fissaa, Mahmoud Elhamlaoui, Hatim Guermah, Hatim Hafiddi, Mahmoud NASSAR
Abstract: The computing vision introduced by Mark Weiser in the early 90s defined the basis of what is now called ubiquitous computing. This new discipline results from the convergence of powerful, small and affordable computing devices with networking technologies that connect them all together. Thus, ubiquitous computing has brought a new generation of service-oriented architectures (SOA) based on context-aware services. These architectures provide users with personalized and adapted behaviors by composing multiple services according to their contexts. In this context, the objective of this paper is to propose an approach for context-aware semantic based services composition. Our contributions are built around the following axes: (i) a semantic based context modeling and context-aware semantic composite service specification, (ii) an architecture for context-aware semantic based services composition using Artificial Intelligence planning, (iii) an intelligent mechanism based on reinforcement learning for context-aware selection in order to deal with the dynamicity and uncertain character of modern ubiquitous environments.
Keywords: Context Awareness; Ontology; Service Composition; Semantic Web; AI Planning; Reinforcement Learning.