Template-Type: ReDIF-Article 1.0
Author-Name: Murchhana Tripathy
Author-X-Name-First: Murchhana
Author-X-Name-Last: Tripathy
Author-Name: Anita Panda
Author-X-Name-First: Anita
Author-X-Name-Last: Panda
Author-Name: Santilata Champati
Author-X-Name-First: Santilata
Author-X-Name-Last: Champati
Title: Rough set-based attribute reduction and decision rule formulation for marketing data
Abstract:
Using classical rough set theory, this study addresses the attribute reduction problem, followed by decision rule formulation, for marketing data that contains both inconsistent and repeated records. Based on the method followed in the work, we propose an algorithm which initially uses the concepts of core and reduct and then cross-checks both by using the significance of the attributes to formulate more accurate rules. For the borderline cases, it is proposed to use the support and confidence of the rule to determine whether to select the rule or to exclude it. To show the working of the method discussed, we use the marketing data of 23 Indian cosmetic companies. We also conduct a sensitivity analysis of the obtained results to gain insight into the profitability of the companies.
Journal: Int. J. of Data Analysis Techniques and Strategies
Pages: 186-206
Issue: 3
Volume: 13
Year: 2021
Keywords: discernibility matrix; core; reduct; significance of attributes; decision rules; marketing; sensitivity analysis.
File-URL: http://www.inderscience.com/link.php?id=118016
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:ids:injdan:v:13:y:2021:i:3:p:186-206

Template-Type: ReDIF-Article 1.0
Author-Name: Walid Gani
Author-X-Name-First: Walid
Author-X-Name-Last: Gani
Title: Improving the predictive ability of multivariate calibration models using support vector data description
Abstract:
Outlier detection is a crucial step in building multivariate calibration models and enhancing their predictive ability. However, traditional outlier detection methods often suffer from important drawbacks, mainly their reliance on assumptions about the data model distribution and their unsuitability for real-life applications. This paper investigates the use of support vector data description (SVDD) for the detection of outliers and proposes a multivariate calibration strategy that combines partial least squares (PLS) and SVDD. For the assessment of the proposed calibration strategy, an experimental study aiming to predict four properties of diesel fuel is conducted. The results show that the predictive ability of PLS-SVDD is better than the predictive ability of a classical strategy that combines PLS and the T² method.
Journal: Int. J. of Data Analysis Techniques and Strategies
Pages: 227-243
Issue: 3
Volume: 13
Year: 2021
Keywords: multivariate calibration; outlier; support vector data description; SVDD; partial least squares; PLS; T² method.
File-URL: http://www.inderscience.com/link.php?id=118021
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:ids:injdan:v:13:y:2021:i:3:p:227-243

Template-Type: ReDIF-Article 1.0
Author-Name: Stefano Cagnoni
Author-X-Name-First: Stefano
Author-X-Name-Last: Cagnoni
Author-Name: Laura Ferrari
Author-X-Name-First: Laura
Author-X-Name-Last: Ferrari
Author-Name: Paolo Fornacciari
Author-X-Name-First: Paolo
Author-X-Name-Last: Fornacciari
Author-Name: Monica Mordonini
Author-X-Name-First: Monica
Author-X-Name-Last: Mordonini
Author-Name: Laura Sani
Author-X-Name-First: Laura
Author-X-Name-Last: Sani
Author-Name: Michele Tomaiuolo
Author-X-Name-First: Michele
Author-X-Name-Last: Tomaiuolo
Title: Improving sentiment analysis using preprocessing techniques and lexical patterns
Abstract:
Sentiment analysis has recently gained considerable attention, since the classification of the emotional content of a text (online reviews, blog messages etc.) may have a relevant impact on market research, political science and many other fields. In this paper, we focus on the importance of the text preprocessing phase, proposing a new technique we termed lexical pattern-based feature weighting (LPFW) that allows one to improve sentence-level sentiment analysis by increasing the relevance of the features contained in particular lexical patterns. This approach has been evaluated on two sentiment classification datasets. We show that a systematic optimisation of the preprocessing filters is important for obtaining good classification accuracy. Also, we show that LPFW is effective in different application domains and with different training set sizes.
Journal: Int. J. of Data Analysis Techniques and Strategies
Pages: 171-185
Issue: 3
Volume: 13
Year: 2021
Keywords: sentiment analysis; natural language processing; POS tagging; feature weighting; word stemming; bag-of-words representation; tf-idf; Penn Treebank Tagset; support vector machines; naïve Bayes multinomial classifier.
File-URL: http://www.inderscience.com/link.php?id=118022
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:ids:injdan:v:13:y:2021:i:3:p:171-185

Template-Type: ReDIF-Article 1.0
Author-Name: A. Princy Christy
Author-X-Name-First: A. Princy
Author-X-Name-Last: Christy
Author-Name: N. Rama
Author-X-Name-First: N.
Author-X-Name-Last: Rama
Title: A methodical evaluation of classifiers in predicting academic performance for a multi-class approach
Abstract:
Predictive analytics has gained importance in recent years as it helps to proactively identify factors that contribute to the success or failure of an event in the relevant field. Academic achievements of students can be predicted early by employing algorithms and analysing relevant data, thereby devising solutions to improve performance. In this process, choosing the right algorithm is crucial, since the performance of algorithms varies depending on the distribution of the data and the way each algorithm is tuned to handle it. In order to enhance the performance of the algorithms, their hyper-parameters were tuned. Many multi-class classifiers were examined and the prediction accuracy of each model developed by employing them was compared. Depending on their classification accuracy, the models developed were used to predict the performance of the students. This was done by using micro and macro averaging because of the multi-class features. The results show that ensemble classifiers performed better than their individual counterparts.
Journal: Int. J. of Data Analysis Techniques and Strategies
Pages: 207-226
Issue: 3
Volume: 13
Year: 2021
Keywords: multi-class; classification; prediction; performance metrics; XGBoost; random forest classifier; feature importance; grid search; macro-average; micro-average.
File-URL: http://www.inderscience.com/link.php?id=118024
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:ids:injdan:v:13:y:2021:i:3:p:207-226

Template-Type: ReDIF-Article 1.0
Author-Name: S. Balamurugan
Author-X-Name-First: S.
Author-X-Name-Last: Balamurugan
Author-Name: M. Thangaraj
Author-X-Name-First: M.
Author-X-Name-Last: Thangaraj
Title: Bayesian consensus clustering with LIME for security in big data
Abstract:
Malware creates huge noises in the current data era. The security query arises every day with new malwares created by the intruders. Malware protection remains one of the trending areas of research in the Android platform. Malwares are routed through the SMS/MMS in the subscriber's network. The SMS, once read, is forwarded to other users. This will impact the device once the intruders access the device data. Device data theft and user data theft also include credit card credentials, login credentials and card information based on the users' data stored in the Android device. This paper works towards how the various malwares in the SMS can be detected to protect mobile users from potential risks from multiple data sources. Using a single data source will not be very effective for spam detection, as a single data source will not contain all the updated malwares and spams. This work uses two methods, namely BCC for spam clustering and LIME for classification of malwares. The significance of these methods is their ability to work with unstructured data from different sources. After the two-step classification, a set of unique malwares is identified, and all further malwares are grouped according to their category.
Journal: Int. J. of Data Analysis Techniques and Strategies
Pages: 15-35
Issue: 1/2
Volume: 13
Year: 2021
Keywords: Bayesian consensus clustering; BCC; large iterative multi-tier ensemble; LIME; ensemble; classification; clustering; data security.
File-URL: http://www.inderscience.com/link.php?id=114665
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:ids:injdan:v:13:y:2021:i:1/2:p:15-35

Template-Type: ReDIF-Article 1.0
Author-Name: Poonam Nandal
Author-X-Name-First: Poonam
Author-X-Name-Last: Nandal
Author-Name: Deepa Bura
Author-X-Name-First: Deepa
Author-X-Name-Last: Bura
Author-Name: Meeta Singh
Author-X-Name-First: Meeta
Author-X-Name-Last: Singh
Title: Efficient data clustering algorithm designed using a heuristic approach
Abstract:
Information retrieval from a large amount of information available in a database is a major issue these days. The relevant information extraction from the voluminous information available on the web is being done using various techniques like natural language processing, lexical analysis, clustering, categorisation, etc. In this paper, we have discussed the clustering methods used for clustering large amounts of data using different features to classify the data. In today's era, various problem-solving techniques make use of a heuristic approach for designing and developing various efficient algorithms. In this paper, we have proposed a clustering technique using a heuristic function to select the centroid so that the clusters formed are as per the need of the user. The heuristic function designed in this paper is based on conceptually similar data points so that they are grouped into accurate clusters. The k-means clustering algorithm, which is widely used to cluster data, is also a focus of this paper. It has been empirically found that the clusters formed and the data points which belong to a cluster are closer to human analysis than those of existing clustering algorithms.
Journal: Int. J. of Data Analysis Techniques and Strategies
Pages: 3-14
Issue: 1/2
Volume: 13
Year: 2021
Keywords: clustering; natural language processing; k-means; concept; heuristic; Euclidean distance; 2D algorithm; information retrieval; Manhattan distance; density concept.
File-URL: http://www.inderscience.com/link.php?id=114666
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:ids:injdan:v:13:y:2021:i:1/2:p:3-14

Template-Type: ReDIF-Article 1.0
Author-Name: Oussama El Hajjamy
Author-X-Name-First: Oussama El
Author-X-Name-Last: Hajjamy
Author-Name: Hajar Khallouki
Author-X-Name-First: Hajar
Author-X-Name-Last: Khallouki
Author-Name: Larbi Alaoui
Author-X-Name-First: Larbi
Author-X-Name-Last: Alaoui
Author-Name: Mohamed Bahaj
Author-X-Name-First: Mohamed
Author-X-Name-Last: Bahaj
Title: Semantic integration of traditional and heterogeneous data sources (UML, XML and RDB) in OWL2 triplestore
Abstract:
With the success of the internet and the expansion of the amount of data in the web, the exchange of information from various heterogeneous and classical data sources becomes a critical need. In this context, researchers must propose integration solutions that allow applications to simultaneously access several data sources. In this perspective, we propose a semi-automatic integration approach of classical data sources via a global schema located in database management systems of RDF or OWL data, called triplestore. Our contribution is subdivided into three axes: 1) an automatic mapping solution that converts classical data sources such as UML, XML and RDB to local ontologies based on OWL2 language; 2) an alignment system of local ontologies based on syntactic, semantic and structural similarity measurement techniques in order to increase the probability of having real correspondences and real differences; 3) a fusion system of pre-existing local ontologies into a global ontology based on the alignment found in the previous step.
Journal: Int. J. of Data Analysis Techniques and Strategies
Pages: 36-58
Issue: 1/2
Volume: 13
Year: 2021
Keywords: semantic integration; UML; XML; RDB; semantic web; OWL2; RDF; triplestore; ontologies; mapping; alignment; fusion.
File-URL: http://www.inderscience.com/link.php?id=114667
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:ids:injdan:v:13:y:2021:i:1/2:p:36-58

Template-Type: ReDIF-Article 1.0
Author-Name: Jen-Peng Huang
Author-X-Name-First: Jen-Peng
Author-X-Name-Last: Huang
Author-Name: Genesis Sembiring Depari
Author-X-Name-First: Genesis Sembiring
Author-X-Name-Last: Depari
Title: Improving social media engagements on paid and non-paid advertisements: a data mining approach
Abstract:
The purpose of this research is to develop a strategy to improve the number of social media engagements on Facebook, both for paid and non-paid publications, through a data mining approach. Several Facebook post characteristics were weighted in order to rank the importance of the input variables. The performance of three machine learning algorithms, along with dynamic parameters, was compared in order to obtain a robust algorithm for assessing the importance of several input factors. Random forest was found to be the most powerful algorithm, with 79% accuracy, and was therefore used to analyse the importance of input factors in order to improve the number of engagements of social media posts. Eventually, total page likes (number of page followers) of a company Facebook page were found to be the most important factor for obtaining more social media engagements, both for paid and non-paid publications. We also propose a managerial implication on how to improve the number of engagements in company social media.
Journal: Int. J. of Data Analysis Techniques and Strategies
Pages: 88-106
Issue: 1/2
Volume: 13
Year: 2021
Keywords: social media; data mining; paid advertisement; non-paid advertisement; social media engagements.
File-URL: http://www.inderscience.com/link.php?id=114668
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:ids:injdan:v:13:y:2021:i:1/2:p:88-106

Template-Type: ReDIF-Article 1.0
Author-Name: Abdallah Abarda
Author-X-Name-First: Abdallah
Author-X-Name-Last: Abarda
Author-Name: Mohamed Dakkon
Author-X-Name-First: Mohamed
Author-X-Name-Last: Dakkon
Author-Name: Khawla Asmi
Author-X-Name-First: Khawla
Author-X-Name-Last: Asmi
Author-Name: Youssef Bentaleb
Author-X-Name-First: Youssef
Author-X-Name-Last: Bentaleb
Title: Evaluating information criteria in latent class analysis: application to identify classes of breast cancer dataset
Abstract:
In recent studies, latent class analysis (LCA) modelling has been proposed as a convenient alternative to standard classification methods. It has become a popular tool for clustering respondents into homogeneous subgroups based on their responses on a set of categorical variables. The absence of a commonly accepted statistical indicator for deciding the number of classes in the population under study represents one of the major unresolved issues in the application of LCA. Determining the number of classes constituting the profiles of a given population is often done by using the likelihood ratio test; however, the use of such methodology is not correct theoretically. To overcome this problem, we propose an alternative to the classical latent class model selection methods, based on information criteria. This article aims to investigate the performance of information criteria for selecting latent class analysis models. Nine information criteria are compared under various sample sizes and model dimensionality. We also propose an application of ICs to select the best model for a breast cancer dataset.
Journal: Int. J. of Data Analysis Techniques and Strategies
Pages: 72-87
Issue: 1/2
Volume: 13
Year: 2021
Keywords: latent class analysis; model selection; information criteria; classification methods.
File-URL: http://www.inderscience.com/link.php?id=114669
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:ids:injdan:v:13:y:2021:i:1/2:p:72-87

Template-Type: ReDIF-Article 1.0
Author-Name: Ketan Kumar Todi
Author-X-Name-First: Ketan Kumar
Author-X-Name-Last: Todi
Author-Name: S.N. Muralikrishna
Author-X-Name-First: S.N.
Author-X-Name-Last: Muralikrishna
Author-Name: B. Ashwath Rao
Author-X-Name-First: B. Ashwath
Author-X-Name-Last: Rao
Title: Sentiment classification of review data using sentence significance score optimisation
Abstract:
A significant amount of work has been done in the field of sentiment analysis in textual data using the concepts and techniques of natural language processing (NLP). In this work, unlike the existing techniques, we present a novel method wherein we consider the significance of the sentences in formulating the opinion. Often, the sentences in a review may correspond to different aspects which are irrelevant in deciding whether the sentiment is positive or negative on a topic. Thus, we assign a sentence significance score to evaluate the overall sentiment of the review. We employ a clustering mechanism followed by the neural network approach to determine the optimal significance score for the review. The proposed supervised method shows a higher accuracy than the state-of-the-art techniques. We further determine the subjectivity of sentences and establish a relationship between subjectivity of sentences and the significance score. We experimentally show that the significance scores found in the proposed method correspond to identifying the subjective sentences and objective sentences in reviews. The sentences with a low significance score correspond to objective sentences and the sentences with a high significance score correspond to subjective sentences.
Journal: Int. J. of Data Analysis Techniques and Strategies
Pages: 59-71
Issue: 1/2
Volume: 13
Year: 2021
Keywords: aspect; sentiment classification; clustering; neural network; optimisation; significance score.
File-URL: http://www.inderscience.com/link.php?id=114670
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:ids:injdan:v:13:y:2021:i:1/2:p:59-71

Template-Type: ReDIF-Article 1.0
Author-Name: Hadjer Moulai
Author-X-Name-First: Hadjer
Author-X-Name-Last: Moulai
Author-Name: Habiba Drias
Author-X-Name-First: Habiba
Author-X-Name-Last: Drias
Title: Towards knowledge warehousing: application to smart housing
Abstract:
The terms data, information and knowledge should not be treated as synonyms in any context. In fact, a hierarchical order between these entities exists where data become information and information becomes knowledge. Massive amounts of data are analysed every day in order to extract valuable knowledge to support decision making. However, the size of the extracted knowledge compromises the speed of reasoning and exploitation of the latter. In this paper, we propose the paradigm of knowledge warehousing to store and analyse large amounts of knowledge through online knowledge processing and knowledge mining techniques. Our proposal is supported by an original knowledge warehouse framework and a case study for the smart housing technology. A multi-agent system built on a knowledge warehouse architecture is illustrated where each agent has a knowledge base about its assigned task. The paradigm is expected to be applicable to other knowledge tasks and domains as well.
Journal: Int. J. of Data Analysis Techniques and Strategies
Pages: 107-127
Issue: 1/2
Volume: 13
Year: 2021
Keywords: knowledge warehouse; knowledge management; knowledge mining; warehousing technology; smart housing; agent technology.
File-URL: http://www.inderscience.com/link.php?id=114671
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:ids:injdan:v:13:y:2021:i:1/2:p:107-127

Template-Type: ReDIF-Article 1.0
Author-Name: Btissam Bousarhane
Author-X-Name-First: Btissam
Author-X-Name-Last: Bousarhane
Author-Name: Saloua Bensiali
Author-X-Name-First: Saloua
Author-X-Name-Last: Bensiali
Author-Name: Driss Bouzidi
Author-X-Name-First: Driss
Author-X-Name-Last: Bouzidi
Title: Road signs recognition: state-of-the-art and perspectives
Abstract:
Making cars safer is a crucial element of saving lives on roads. In case of inattention or distraction, drivers need a performant system that is capable of assisting and alerting them when a road sign appears in their field of vision. To create such systems, we first need to know the major difficulties that traffic sign recognition still faces, as presented in the first and second sections of this paper. We should also study the different methods proposed by researchers to overcome each of these challenges, as proposed in the third section. Evaluation metrics and criteria for proving the effectiveness of these approaches also represent an important element, which section three of this article presents. Improving the existing methods is crucial to ensure the effectiveness of the recognition process, especially by using deep learning algorithms and optimisation techniques, as discussed in the last section of this paper.
Journal: Int. J. of Data Analysis Techniques and Strategies
Pages: 128-150
Issue: 1/2
Volume: 13
Year: 2021
Keywords: road signs recognition; detection; classification; tracking; machine learning; deep learning; evaluation datasets; evaluation metrics; optimisation.
File-URL: http://www.inderscience.com/link.php?id=114672
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:ids:injdan:v:13:y:2021:i:1/2:p:128-150

Template-Type: ReDIF-Article 1.0
Author-Name: Tarik Fissaa
Author-X-Name-First: Tarik
Author-X-Name-Last: Fissaa
Author-Name: Mahmoud El Hamlaoui
Author-X-Name-First: Mahmoud El
Author-X-Name-Last: Hamlaoui
Author-Name: Hatim Guermah
Author-X-Name-First: Hatim
Author-X-Name-Last: Guermah
Author-Name: Hatim Hafiddi
Author-X-Name-First: Hatim
Author-X-Name-Last: Hafiddi
Author-Name: Mahmoud Nassar
Author-X-Name-First: Mahmoud
Author-X-Name-Last: Nassar
Title: Combining planning and learning for context aware service composition
Abstract:
The computing vision introduced by Mark Weiser in the early '90s defined the basis of what is now called ubiquitous computing. This new discipline results from the convergence of powerful, small and affordable computing devices with networking technologies that connect them all together. Thus, ubiquitous computing has brought a new generation of service-oriented architectures (SOA) based on context-aware services. These architectures provide users with personalised and adapted behaviours by composing multiple services according to their contexts. In this context, the objective of this paper is to propose an approach for context-aware semantic-based services composition. Our contributions are built around the following axes: 1) a semantic-based context modelling and context-aware semantic composite service specification; 2) an architecture for context-aware semantic-based services composition using artificial intelligence planning; 3) an intelligent mechanism based on reinforcement learning for context-aware selection in order to deal with the dynamic and uncertain character of modern ubiquitous environments.
Journal: Int. J. of Data Analysis Techniques and Strategies
Pages: 151-169
Issue: 1/2
Volume: 13
Year: 2021
Keywords: context awareness; ontology; service composition; semantic web; AI planning; reinforcement learning.
File-URL: http://www.inderscience.com/link.php?id=114673
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:ids:injdan:v:13:y:2021:i:1/2:p:151-169

Template-Type: ReDIF-Article 1.0
Author-Name: Anita Bai
Author-X-Name-First: Anita
Author-X-Name-Last: Bai
Author-Name: Swati Hira
Author-X-Name-First: Swati
Author-X-Name-Last: Hira
Title: Microarray cancer classification using feature extraction-based ensemble learning method
Abstract:
Microarray cancer datasets generally contain many features with a small number of samples, so initially we need to reduce redundant features to allow faster convergence. To address this issue, we propose a novel feature extraction-based ensemble classification technique using support vector machine (SVM), which classifies microarray cancer data and helps to build intelligent systems for early cancer detection. The novelty of the proposed approach lies in classifying cancer data as follows: a) we extract information by reducing the size of a larger dataset using various feature selection techniques, such as principal component analysis (PCA), chi-square, genetic algorithm (GA) and F-score; b) we classify the extracted information into two classes, normal and malignant, using a majority voting ensemble SVM. In the SVM ensemble-based approach we use different SVM kernels, such as linear, polynomial, radial basis function (RBF), and sigmoid. The results calculated for the individual kernels are combined using a majority voting approach. The effectiveness of the algorithm is validated on six benchmark cancer datasets, viz. colon, ovarian, leukaemia, breast, lung and prostate, using ensemble SVM classification.
Journal: Int. J. of Data Analysis Techniques and Strategies
Pages: 244-263
Issue: 3
Volume: 13
Year: 2021
Keywords: cancer classification; support vector machine; SVM; principal component analysis; PCA; genetic algorithm; F-score; chi-square.
File-URL: http://www.inderscience.com/link.php?id=118014
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:ids:injdan:v:13:y:2021:i:3:p:244-263