International Journal of Data Analysis Techniques and Strategies (16 papers in press)
Microarray Cancer Classification using Feature Extraction based Ensemble Learning Method
by ANITA BAI, SWATI HIRA
Abstract: Microarray cancer datasets generally contain many features but only a small number of samples, so redundant features must first be reduced to allow faster convergence. To address this issue, we propose a novel feature extraction based ensemble classification technique using support vector machines (SVM), which classifies microarray cancer data and helps to build intelligent systems for early cancer detection. The novelty of the proposed approach lies in classifying cancer data as follows: a) We extract information by reducing the size of the larger dataset using various feature selection techniques, such as principal component analysis (PCA), chi-square, genetic algorithm (GA) and F-Score. b) We classify the extracted information into two classes, normal and malignant, using a majority voting ensemble SVM. In the SVM ensemble based approach we use different SVM kernels, namely linear, polynomial, radial basis function (RBF), and sigmoid. The results of the individual kernels are combined using a majority voting approach. The effectiveness of the algorithm is validated on six benchmark cancer datasets, viz. Colon, Ovarian, Leukaemia, Breast, Lung and Prostate, using ensemble SVM classification.
Keywords: Cancer classification; Support vector machine; PCA; GA; F-Score; Chi-square.
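Illustrative note: the majority voting step described in the abstract above can be sketched as follows. This is a minimal sketch, not the authors' implementation. The function name `majority_vote`, the label strings and the per-kernel predictions are hypothetical, and the tie-breaking rule (the label seen first wins) is an assumption, since the abstract does not say how ties among four kernels are resolved.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label per sample by majority vote."""
    n_samples = len(predictions_per_model[0])
    if any(len(p) != n_samples for p in predictions_per_model):
        raise ValueError("all models must predict the same number of samples")
    combined = []
    # zip(*...) regroups the predictions so each tuple holds all votes for one sample
    for sample_votes in zip(*predictions_per_model):
        # most_common(1) gives the most frequent label; on a tie, the label
        # encountered first wins (an assumption made for this sketch)
        label, _ = Counter(sample_votes).most_common(1)[0]
        combined.append(label)
    return combined


# Hypothetical predictions from four SVM kernels on four samples
kernel_predictions = {
    "linear": ["malignant", "normal", "normal", "malignant"],
    "polynomial": ["malignant", "normal", "malignant", "malignant"],
    "rbf": ["normal", "normal", "normal", "malignant"],
    "sigmoid": ["malignant", "malignant", "normal", "normal"],
}
ensemble_labels = majority_vote(list(kernel_predictions.values()))
print(ensemble_labels)
```

With an even number of voters a tie is possible, so a practical system would need an explicit rule for that case, for example weighting kernels by validation accuracy.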
Rough set-based attribute reduction and decision rule formulation for marketing data
by Murchhana Tripathy, Anita Panda, Santilata Champati
Abstract: Using the classical Rough Set Theory concept, this study addresses the attribute reduction problem followed by decision rule formulation for marketing data that contains both inconsistent and repeated data. Based on the method followed in the work, we propose an algorithm which initially uses the concepts of core and reduct and then cross-checks both using the significance of the attributes to formulate more accurate rules. For borderline cases, we propose using the support and confidence of the rule to determine whether to select or exclude it. To show the working of the method discussed, we use the marketing data of twenty-three Indian cosmetic companies. We also conduct a sensitivity analysis of the obtained results to gain insight into the profitability of the companies.
Keywords: Discernibility Matrix; Core; Reduct; Significance of Attributes; Decision Rules; Marketing; Sensitivity Analysis.
Improving the predictive ability of multivariate calibration models using Support Vector Data Description
by Walid Gani
Abstract: Outlier detection is a crucial step in building multivariate calibration models and enhancing their predictive ability. However, traditional outlier detection methods often suffer from important drawbacks, mainly their reliance on assumptions about the data model distribution and their unsuitability for real-life applications. This paper investigates the use of Support Vector Data Description (SVDD) for the detection of outliers and proposes a multivariate calibration strategy which combines partial least squares (PLS) and SVDD. To assess the proposed calibration strategy, an experimental study aiming to predict four chemical properties of diesel fuels is conducted. The results show that the predictive ability of PLS-SVDD is better than that of a classical strategy which combines PLS and the T^2 method.
Keywords: multivariate calibration; outlier; SVDD; PLS; T^2 method.
Improving Sentiment Analysis Using Preprocessing Techniques and Lexical Patterns
by Stefano Cagnoni, Laura Ferrari, Paolo Fornacciari, Monica Mordonini, Laura Sani, Michele Tomaiuolo
Abstract: Sentiment Analysis has recently gained considerable attention, since the classification of the emotional content of a text (online reviews, blog messages etc.) may have a relevant impact on market research, political science and many other fields. In this paper, we focus on the importance of the text preprocessing phase, proposing a new technique we termed Lexical Pattern-based Feature Weighting (LPFW), that allows one to improve sentence-level Sentiment Analysis by increasing the relevance of the features contained in particular lexical patterns. This approach has been evaluated on two sentiment classification datasets. We show that a systematic optimization of the preprocessing filters is important for obtaining good classification accuracy. Also, we show that LPFW is effective in different application domains and with different training set sizes.
Keywords: Sentiment Analysis; POS Tagging; Natural Language Processing.
A METHODICAL EVALUATION OF CLASSIFIERS IN PREDICTING ACADEMIC PERFORMANCE FOR A MULTI-CLASS APPROACH
by A. Princy Christy, Rama N
Abstract: Predictive analytics has gained importance in recent years as it helps to proactively identify factors that contribute to the success or failure of an event in a relevant field. Academic achievements of students can be predicted early by employing algorithms and analyzing relevant data, thereby devising solutions to improve performance. In this process, choosing the right algorithm is crucial since the performance of algorithms varies depending on the distribution of the data and the way each algorithm is tuned to handle it. In order to enhance the performance of the algorithms, their hyper-parameters were tuned. Many multi-class classifiers were examined and the prediction accuracy of each model developed with them was compared. Depending on their classification accuracy, the models developed were used to predict the performance of the students. This was done using micro and macro averaging because of the multi-class setting. The results show that ensemble classifiers performed better than their individual counterparts.
Keywords: Multi-class; Classification; Prediction; Performance metrics; XGBoost; Random Forest Classifier; Feature importance; Grid Search; Macro-average; Micro-average.
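Illustrative note: the difference between the micro and macro averaging mentioned in the abstract above can be shown with precision on a small multi-class example. This is a minimal sketch under stated assumptions. The function name `micro_macro_precision`, the class labels and the data are hypothetical, and treating a class with no predictions as precision 0.0 is a convention chosen for this sketch.

```python
def micro_macro_precision(y_true, y_pred):
    """Return (micro, macro) precision for single-label multi-class predictions."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    labels = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in labels}  # true positives per predicted class
    fp = {c: 0 for c in labels}  # false positives per predicted class
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[p] += 1
        else:
            fp[p] += 1
    # Macro: average the per-class precisions, so every class counts equally
    per_class = [
        tp[c] / (tp[c] + fp[c]) if (tp[c] + fp[c]) else 0.0 for c in labels
    ]
    macro = sum(per_class) / len(labels)
    # Micro: pool the counts over all classes, so every sample counts equally
    micro = sum(tp.values()) / (sum(tp.values()) + sum(fp.values()))
    return micro, macro


# Hypothetical grades: "high" dominates, "low" is rare and never predicted
y_true = ["high", "high", "high", "high", "medium", "low"]
y_pred = ["high", "high", "high", "medium", "medium", "high"]
micro, macro = micro_macro_precision(y_true, y_pred)
print(f"micro={micro:.3f} macro={macro:.3f}")
```

On imbalanced classes the two values diverge: the macro average is pulled down by the rare class that is never predicted, while the micro average mostly reflects the dominant class.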
Ranking Enterprise Reputation in the Digital Age: A Survey of Traditional Methods and the Need For More Agile Approaches
by Canan Corlu, Anita Goyal, David Lopez-Lopez, Rocio De La Torre, Angel Juan
Abstract: Different data sources and analytical methodologies can be used to establish a ranking of enterprises according to several performance measures, including their reputation (as perceived by consumers), financial health, and future growth potential. Such a ranking can be extremely useful for third-party enterprises interested in creating alliances, outsourcing some activities, or simply contracting services offered by external firms. These rankings are already becoming popular in sectors such as higher education, where universities worldwide are analyzed according to several dimensions and sorted by different international and national rankings. This paper reviews well-established methodological approaches that have been employed to generate such rankings. As shown in our review, these techniques have typically been applied to reduced sets of large enterprises, which are usually indexed in stock exchange markets and from which abundant financial data can be obtained. We then discuss the need to extend these ranking practices to large sets of small and medium enterprises, which do not usually provide publicly available data. In view of the present digital age, we support the following concepts: (i) citation indicators such as those generated by search engines can be employed to automate the fast generation of rankings; and (ii) when properly validated, these agile rankings can be used as proxies for a reputation ranking.
Keywords: ranking enterprises; digital age; decision sciences; data analytics; management science.
Sentiment Analysis: A Review and Framework Foundations
by Bousselham EL HADDAOUI, Raddouane Chiheb, Rdouan Faizi, Abdellatif El Afia
Abstract: The rise of social media as a platform for opinion expression and social interactions has motivated the need for an automated data analysis technique for business value extraction with optimal investment considerations. In this respect, Sentiment Analysis (SA) has become the de facto approach to investigate generated data and retrieve information such as sentiments and emotions, discussed topics, etc. via traditional machine learning and modern neural network-based algorithms. Current techniques achieve reasonable accuracy scores, but their performance depends on the context of application, and most implementations are complex, non-reusable components. Our literature review shows a lack of research studies that unify existing systems under a common framework for SA tasks. This paper also highlights the trending movement of neural network approaches and pinpoints recent research studies for SA sub-tasks. An SA framework design proposition is presented, based on key research projects and enhanced with other promising works.
Keywords: Sentiment Analysis; Social Media; Text Preprocessing; Machine Learning; Framework.
Special Issue on: LOPAL'2018 Advances and Applications in Optimisation and Learning Algorithms
Bayesian Consensus Clustering with LIME for Security in Big Data
by Balamurugan Selvarathinam
Abstract: Malware creates a great deal of noise in the current data era. Security concerns grow every day as intruders create new malware. Malware protection remains one of the trending areas of research on the Android platform. Malware is routed through SMS/MMS in the subscriber's network. The SMS, once read, is forwarded to other users. This impacts the device once the intruders access the device data. Device data theft and user data theft also include credit card credentials, login credentials and card information, based on the user's data stored in the Android device. This paper works towards detecting the various malwares in SMS to protect mobile users from potential risks from multiple data sources. Using a single data source is not very effective for spam detection, as a single data source will not contain all the updated malwares and spams. This work uses two methods, namely BCC for spam clustering and LIME for classification of malwares. The significance of these methods is their ability to work with unstructured data from different sources. After the two-step classification a set of unique malwares is identified, and all further malwares are grouped according to their category.
Keywords: Bayesian Consensus Clustering; LIME; Classification; Big Data security.
Efficient Data Clustering Algorithm Designed Using Heuristic Approach
by POONAM NANDAL, DEEPA BURA, Meeta Singh
Abstract: Information retrieval from the large amount of information available in a database is a major issue these days. Relevant information is extracted from the voluminous information available on the web using various techniques such as Natural Language Processing, Lexical Analysis, Clustering, Categorization etc. In this paper, we discuss the clustering methods used for clustering large amounts of data using different features to classify the data. In today's era, various problem-solving techniques make use of a heuristic approach for designing and developing efficient algorithms. In this paper, we propose a clustering technique that uses a heuristic function to select the centroid so that the clusters formed match the needs of the user. The heuristic function designed in this paper is based on conceptually similar data points, so that they are grouped into accurate clusters. The k-means clustering algorithm, which is widely used to cluster data, is also a focus of this paper. It has been empirically found that the clusters formed and the data points which belong to a cluster are close to human analysis as compared to existing
Keywords: Clustering; Natural Language Processing; k-means; Concept; Heuristic.
Semantic Integration of Traditional and Heterogeneous Data Sources (UML, XML and RDB) in OWL 2 Triplestore
by Oussama EL Hajjamy, Hajar Khallouki, Larbi Alaoui, Mohamed Bahaj
Abstract: With the success of the internet and the expansion of the amount of data on the web, the exchange of information from various heterogeneous and classical data sources has become a critical need. In this context, researchers must propose integration solutions that allow applications to simultaneously access several data sources. From this perspective, it is necessary to find a solution for integrating data from classical data sources (UML, XML and RDB) into richer systems based on ontologies using the semantic web language OWL. In this work, we propose a semi-automatic approach for integrating classical data sources via a global schema located in a database management system for RDF or OWL data, called a triplestore. The goal is to combine several classical and heterogeneous data sources according to the same schema and unified semantics. Our contribution is subdivided into three axes. The first aims to establish an automatic solution that converts classical data sources such as UML, XML and relational databases (RDB) to local ontologies based on the OWL2 language. The second consists of semantically aligning local ontologies based on syntactic, semantic and structural similarity measurement techniques in order to increase the probability of finding real correspondences and real differences. Finally, the third aims to merge the pre-existing local ontologies into a global ontology based on the alignment found in the previous step. A tool based on our approach has also been developed and tested to demonstrate the power of our strategy and validate the theoretical concept.
Keywords: data integration; UML; XML; RDB; semantic web; OWL2; triplestore; aligning ontologies; merge ontologies.
Improving Social Media Engagements on paid and nonpaid advertisements: A Data Mining Approach
by Jen-peng Huang, Genesis Sembiring Depari
Abstract: The purpose of this research is to develop a strategy to improve the number of social media engagements on Facebook for both paid and nonpaid publications through a data mining approach. Several Facebook post characteristics were weighted in order to rank the importance of the input variables. The performance of Support Vector Machine, Deep Learning, and Random Forest, along with dynamic parameters, was compared in order to obtain a robust algorithm for assessing the importance of several input factors. Random Forest was found to be the most powerful algorithm, with 79% accuracy, and was therefore used to analyze the importance of input factors in order to improve the number of engagements of social media posts. We found that the total page likes (number of page followers) of a company Facebook page is the most important factor for gaining more social media engagements for both paid and nonpaid publications. In order to show that engagements are also beneficial for reaching more people, we also examined the correlation of shares, likes, comments and other post characteristics with reaching more people through company Facebook pages. In the final part, we also propose managerial implications on how to improve the number of engagements in social media for both paid and nonpaid publications.
Keywords: Social Media; Data Mining; Paid Advertisement; Non-Paid Advertisement; Social Media Engagements.
Evaluating information criteria in latent class analysis: Application to identify classes of Breast Cancer data set
by Abdallah Abarda, Mohamed Dakkon, Khawla Asmi, Youssef Bentaleb
Abstract: In recent studies, latent class analysis (LCA) modeling has been proposed as a convenient alternative to standard classification methods. It has become a popular tool for clustering respondents into homogeneous subgroups based on their responses to a set of categorical variables. The absence of a commonly accepted statistical indicator for deciding the number of classes in the population under study represents one of the major unresolved issues in the application of LCA. Determining the number of classes constituting the profiles of a given population is often done using the likelihood ratio test; however, its use is not theoretically correct. To overcome this problem, we propose an alternative to the classical latent class model selection methods, based on information criteria. This article aims to investigate the performance of information criteria for selecting latent class analysis models. Nine information criteria are compared under various sample sizes and model dimensionalities. We also propose an application of ICs to select the best model for a breast cancer data set.
Keywords: Latent class analysis; Model selection; Information criteria.
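Illustrative note: selecting the number of latent classes with information criteria, as described in the abstract above, can be sketched with two widely used criteria, AIC and BIC. This is a minimal sketch. The abstract does not name its nine criteria, so the choice of AIC and BIC is an assumption, and the log-likelihoods, parameter counts and sample size below are hypothetical values, not results from the paper.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln n - 2 ln L (lower is better)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical fitted models: number of classes -> (log-likelihood, free parameters)
candidates = {2: (-1250.0, 11), 3: (-1215.0, 17), 4: (-1211.0, 23)}
n_obs = 500  # hypothetical sample size

# Pick the number of classes that minimizes each criterion
best_by_aic = min(candidates, key=lambda k: aic(*candidates[k]))
best_by_bic = min(candidates, key=lambda k: bic(*candidates[k], n_obs))
print(best_by_aic, best_by_bic)
```

Both criteria trade goodness of fit against model size. BIC penalizes extra parameters more heavily than AIC once the sample size exceeds about eight observations, so the two can disagree on larger models.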
Sentiment classification of review data using sentence significance score optimization
by Ketan Kumar Todi, Muralikrishna SN, Ashwath Rao B
Abstract: A significant amount of work has been done in the field of sentiment analysis of textual data using the concepts and techniques of Natural Language Processing (NLP). In this work, unlike the existing techniques, we present a novel method in which we consider the significance of the sentences in formulating the opinion. The sentences in a review may often correspond to different aspects that are irrelevant in deciding whether the sentiment on a topic is positive or negative. Thus, we assign a sentence significance score to evaluate the overall sentiment of the review. We employ a clustering mechanism followed by a neural network approach to determine the optimal significance score for the review. The proposed supervised method shows a higher accuracy than the state-of-the-art techniques. We further determine the subjectivity of sentences and establish a relationship between the subjectivity of sentences and the significance score. We experimentally show that the significance scores found by the proposed method correspond to identifying the subjective and objective sentences in reviews. Sentences with a low significance score correspond to objective sentences, and sentences with a high significance score correspond to subjective sentences.
Keywords: Aspect; Sentiment Classification; Clustering; Neural Network; Optimization; Significance score.
Towards Knowledge Warehousing: Application to Smart Housing
by Hadjer Moulai, Habiba Drias
Abstract: The terms data, information and knowledge should not be treated as synonyms in any context. In fact, a hierarchical order exists between these entities, where data become information and information becomes knowledge. Massive amounts of data are analysed every day in order to extract valuable knowledge to support decision making. However, the size of the extracted knowledge compromises the speed of reasoning over it and of exploiting it. In this paper, we propose the paradigm of knowledge warehousing to store and analyse large amounts of knowledge through online knowledge processing and knowledge mining techniques. Our proposal is supported by an original knowledge warehouse framework and a case study on smart housing technology. A multi-agent system built on a knowledge warehouse architecture is illustrated, where each agent has a knowledge base about its assigned task. The paradigm is expected to be applicable to other knowledge tasks and domains as well.
Keywords: knowledge warehouse; knowledge management; knowledge mining; warehousing technology; smart housing; agent technology.
Road signs recognition: State-of-the-art and perspectives
by Btissam Bousarhane, Saloua Bensiali, Driss Bouzidi
Abstract: Traffic accidents represent a global problem that severely affects many countries. Morocco is one of the countries that pay a heavy price each year in terms of loss of human lives and economic costs. Making cars safer is a crucial element of saving lives on roads. In case of inattention or distraction, drivers need an effective system that is capable of assisting and alerting them when a road sign appears in their field of vision. To create such systems, we first need to know the specificity of traffic signs and the major difficulties that their recognition still faces, which is the object of the first and second sections of this paper. We should also study the different methods proposed by researchers to overcome each of these challenges. This study will help us to identify the strengths and weaknesses of each method, as proposed in the third section (Classical vs. Machine learning approaches). Evaluation metrics and criteria for proving the effectiveness of these approaches also represent an important element, which section three of this article presents. Improving the existing methods is crucial to ensure the effectiveness of the recognition process, especially by using deep learning algorithms and optimization techniques, as discussed in the last section of this paper.
Keywords: Road signs recognition; detection; classification; tracking; machine learning; deep learning; evaluation datasets; evaluation metrics; hardware optimization; algorithmic optimization; CNN.
Combining Planning and Learning for Context Aware Service Composition
by Tarik Fissaa, Mahmoud Elhamlaoui, Hatim Guermah, Hatim Hafiddi, Mahmoud NASSAR
Abstract: The computing vision introduced by Mark Weiser in the early 90s defined the basis of what is now called ubiquitous computing. This new discipline results from the convergence of powerful, small and affordable computing devices with networking technologies that connect them all together. Thus, ubiquitous computing has brought a new generation of service-oriented architectures (SOA) based on context-aware services. These architectures provide users with personalized and adapted behaviors by composing multiple services according to their contexts. In this context, the objective of this paper is to propose an approach for context-aware semantic based services composition. Our contributions are built around the following axes: (i) a semantic based context modeling and context-aware semantic composite service specification, (ii) an architecture for context-aware semantic based services composition using Artificial Intelligence planning, (iii) an intelligent mechanism based on reinforcement learning for context-aware selection in order to deal with the dynamicity and uncertain character of modern ubiquitous environments.
Keywords: Context Awareness; Ontology; Service Composition; Semantic Web; AI Planning; Reinforcement Learning.