International Journal of Data Analysis Techniques and Strategies (14 papers in press)
Microarray Cancer Classification using Feature Extraction based Ensemble Learning Method
by ANITA BAI, SWATI HIRA
Abstract: Microarray cancer datasets generally contain many features and a small number of samples, so redundant features must first be reduced to allow faster convergence. To address this issue, we propose a novel feature extraction based ensemble classification technique using support vector machines (SVM), which classifies microarray cancer data and helps to build intelligent systems for early cancer detection. The novelty of the proposed approach lies in two steps: a) we extract information by reducing the size of the larger dataset using various feature selection techniques, such as principal component analysis (PCA), chi-square, genetic algorithm (GA) and F-Score; b) we classify the extracted information into two classes, normal and malignant, using a majority voting ensemble SVM. In the SVM ensemble based approach we use different SVM kernels: linear, polynomial, radial basis function (RBF), and sigmoid. The results calculated with the individual kernels are combined using a majority voting approach. The effectiveness of the algorithm is validated on six benchmark cancer datasets, viz. Colon, Ovarian, Leukaemia, Breast, Lung and Prostate, using ensemble SVM classification.
Keywords: Cancer classification; Support vector machine; PCA; GA; F-Score; Chi-square.
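The majority voting step described in the abstract above can be illustrated with a minimal sketch. It assumes that each of the four SVM kernels has already produced one label per sample; the function name `majority_vote`, the example predictions, and the tie-breaking rule are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def majority_vote(predictions_per_kernel):
    """Combine per-kernel label predictions sample by sample.

    predictions_per_kernel maps a kernel name to a list of predicted
    labels; all lists must have the same length. With an even number of
    voters a tie is possible: here it is resolved in favour of the label
    that appears first among the votes (an arbitrary choice for this sketch).
    """
    columns = zip(*predictions_per_kernel.values())
    return [Counter(votes).most_common(1)[0][0] for votes in columns]


# Hypothetical predictions from four SVM kernels for five samples.
kernel_predictions = {
    "linear":     ["normal", "malignant", "malignant", "normal", "malignant"],
    "polynomial": ["normal", "malignant", "normal", "normal", "malignant"],
    "rbf":        ["normal", "normal", "malignant", "normal", "malignant"],
    "sigmoid":    ["malignant", "malignant", "malignant", "normal", "normal"],
}

combined = majority_vote(kernel_predictions)
print(combined)
```

Because four kernels vote, a 2-2 split can occur; a real system would need an explicit policy for that case, which the abstract does not specify.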
Rough set-based attribute reduction and decision rule formulation for marketing data
by Murchhana Tripathy, Anita Panda, Santilata Champati
Abstract: Using classical Rough Set Theory, this study addresses the attribute reduction problem, followed by decision rule formulation, for marketing data that contains both inconsistent and repeated data. Based on the method followed in this work, we propose an algorithm which initially uses the concepts of core and reduct and then cross-checks both using the significance of the attributes to formulate more accurate rules. For borderline cases, we propose using the support and confidence of a rule to determine whether to select or exclude it. To show the working of the method, we use the marketing data of twenty-three Indian cosmetic companies. We also conduct a sensitivity analysis of the obtained results to gain insight into the profitability of the companies.
Keywords: Discernibility Matrix; Core; Reduct; Significance of Attributes; Decision Rules; Marketing; Sensitivity Analysis.
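The support and confidence criterion mentioned in the abstract above can be sketched as follows, using the common definitions (support: share of all rows matching both sides of the rule; confidence: share of condition-matching rows that also match the decision). The decision table, attribute names, and function name are invented for illustration and do not come from the paper's data.

```python
def rule_support_confidence(rows, condition, decision):
    """Return (support, confidence) of the rule 'condition -> decision'.

    support    = fraction of all rows matching both condition and decision
    confidence = fraction of condition-matching rows that also match decision
    """
    def matches(row, pattern):
        return all(row.get(key) == value for key, value in pattern.items())

    n_condition = sum(1 for row in rows if matches(row, condition))
    n_both = sum(
        1 for row in rows if matches(row, condition) and matches(row, decision)
    )
    support = n_both / len(rows) if rows else 0.0
    confidence = n_both / n_condition if n_condition else 0.0
    return support, confidence


# Hypothetical decision table; the third row is inconsistent with the first two.
table = [
    {"advertising": "high", "price": "low", "profit": "yes"},
    {"advertising": "high", "price": "low", "profit": "yes"},
    {"advertising": "high", "price": "low", "profit": "no"},
    {"advertising": "low", "price": "high", "profit": "no"},
]

support, confidence = rule_support_confidence(
    table, {"advertising": "high", "price": "low"}, {"profit": "yes"}
)
print(support, confidence)
```

A borderline rule could then be kept or dropped by comparing these two values with thresholds; the thresholds themselves are not given in the abstract.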
Improving the predictive ability of multivariate calibration models using Support Vector Data Description
by Walid Gani
Abstract: Outlier detection is a crucial step in building multivariate calibration models and enhancing their predictive ability. However, traditional outlier detection methods often suffer from important drawbacks, mainly their reliance on assumptions about the data model distribution and their unsuitability for real-life applications. This paper investigates the use of Support Vector Data Description (SVDD) for the detection of outliers and proposes a multivariate calibration strategy which combines partial least squares (PLS) and SVDD. To assess the proposed calibration strategy, an experimental study aiming to predict four chemical properties of diesel fuels is conducted. The results show that the predictive ability of PLS-SVDD is better than that of a classical strategy which combines PLS and the T^2 method.
Keywords: multivariate calibration; outlier; SVDD; PLS; T^2 method.
Improving Sentiment Analysis Using Preprocessing Techniques and Lexical Patterns
by Stefano Cagnoni, Laura Ferrari, Paolo Fornacciari, Monica Mordonini, Laura Sani, Michele Tomaiuolo
Abstract: Sentiment Analysis has recently gained considerable attention, since the classification of the emotional content of a text (online reviews, blog messages etc.) may have a relevant impact on market research, political science and many other fields. In this paper, we focus on the importance of the text preprocessing phase, proposing a new technique we termed Lexical Pattern-based Feature Weighting (LPFW), that allows one to improve sentence-level Sentiment Analysis by increasing the relevance of the features contained in particular lexical patterns. This approach has been evaluated on two sentiment classification datasets. We show that a systematic optimization of the preprocessing filters is important for obtaining good classification accuracy. Also, we show that LPFW is effective in different application domains and with different training set sizes.
Keywords: Sentiment Analysis; POS Tagging; Natural Language Processing.
A METHODICAL EVALUATION OF CLASSIFIERS IN PREDICTING ACADEMIC PERFORMANCE FOR A MULTI-CLASS APPROACH
by A. Princy Christy, Rama N
Abstract: Predictive analytics has gained importance in recent years as it helps to proactively identify factors that contribute to the success or failure of an event in the relevant field. Academic achievements of students can be predicted early by employing algorithms and analyzing relevant data, thereby devising solutions to improve performance. In this process, choosing the right algorithm is crucial, since the performance of an algorithm varies depending on the distribution of the data and the way the algorithm is tuned to handle it. To enhance the performance of the algorithms, their hyper-parameters were tuned. Many multi-class classifiers were examined, and the prediction accuracy of each model developed with them was compared. Depending on their classification accuracy, the models developed were used to predict the performance of the students. This was done using micro and macro averaging because of the multi-class nature of the problem. The results show that ensemble classifiers performed better than their individual counterparts.
Keywords: Multi-class; Classification; Prediction; Performance metrics; XGBoost; Random Forest Classifier; Feature importance; Grid Search; Macro-average; Micro-average.
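The difference between micro and macro averaging in a multi-class setting can be shown with a small sketch. The abstract does not say which metric was averaged, so precision is used here as an assumption; the labels and the function name are hypothetical.

```python
def micro_macro_precision(y_true, y_pred):
    """Return (micro, macro) averaged precision for multi-class labels.

    Micro averaging pools true and false positives over all classes;
    macro averaging computes precision per class and takes the plain mean,
    so rare classes weigh as much as frequent ones.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp = {label: 0 for label in labels}
    fp = {label: 0 for label in labels}
    for truth, predicted in zip(y_true, y_pred):
        if truth == predicted:
            tp[predicted] += 1
        else:
            fp[predicted] += 1

    # A class that is never predicted gets precision 0.0 in this sketch.
    per_class = [
        tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        for label in labels
    ]
    macro = sum(per_class) / len(labels)

    pooled = sum(tp.values()) + sum(fp.values())
    micro = sum(tp.values()) / pooled if pooled else 0.0
    return micro, macro


# Hypothetical grades: "merit" is rare and never predicted correctly.
y_true = ["pass", "pass", "pass", "pass", "fail", "merit"]
y_pred = ["pass", "pass", "pass", "fail", "fail", "pass"]

micro, macro = micro_macro_precision(y_true, y_pred)
print(round(micro, 3), round(macro, 3))
```

In this example the macro average is pulled down by the rare class, while the micro average mostly reflects the dominant class, which is why both are often reported together.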
Ranking Enterprise Reputation in the Digital Age: A Survey of Traditional Methods and the Need For More Agile Approaches
by Canan Corlu, Anita Goyal, David Lopez-Lopez, Rocio De La Torre, Angel Juan
Abstract: Different data sources and analytical methodologies can be used to establish a ranking of enterprises according to several performance measures, including their reputation (as perceived by consumers), financial health, and future growth potential. Such a ranking can be extremely useful for third-party enterprises interested in creating alliances, outsourcing some activities, or simply contracting services offered by external firms. These rankings are already becoming popular in sectors such as higher education, where universities worldwide are analyzed according to several dimensions and sorted by different international and national rankings. This paper reviews well-established methodological approaches that have been employed to generate such rankings. As shown in our review, these techniques have typically been applied to reduced sets of large enterprises, which are usually indexed in stock exchange markets and from which abundant financial data can be obtained. We then discuss the need to extend these ranking practices to large sets of small and medium enterprises, which do not usually provide publicly available data. Considering the present digital age, we support the following concepts: (i) citation indicators such as those generated by search engines can be employed to automate the fast generation of rankings; and (ii) when properly validated, these agile rankings can be used as proxies for a reputation ranking.
Keywords: ranking enterprises; digital age; decision sciences; data analytics; management science.
Sentiment Analysis: A Review and Framework Foundations
by Bousselham EL HADDAOUI, Raddouane Chiheb, Rdouan Faizi, Abdellatif El Afia
Abstract: The rise of social media as a platform for opinion expression and social interaction has motivated the need for automated data analysis techniques for business value extraction with optimal investment considerations. In this respect, Sentiment Analysis (SA) has become the de facto approach to investigate generated data and retrieve information such as sentiments and emotions, discussed topics, etc. via traditional machine learning and modern neural network-based algorithms. Current techniques achieve reasonable accuracy scores, but their performance depends on the context of application; moreover, most implementations are complex and not reusable. Our literature review shows a lack of research studies that unify existing systems under a common framework for SA tasks. This paper also highlights the trending movement of neural network approaches and pinpoints recent research studies for SA sub-tasks. An SA framework design proposal is presented, based on key research projects and enhanced with other promising works.
Keywords: Sentiment Analysis; Social Media; Text Preprocessing; Machine Learning; Framework.
Spam Filtering based on PV-DBOW model
by Ghizlane Hnini, Anass FAHFOUH, Jamal RIFFI, Mohamed Adnane MAHRAZ, Ali YAHYAOUY, Hamid TAIRI
Abstract: Many feature extraction techniques have been proposed to deal with spam emails. However, despite their performance and efficiency, they still have many weaknesses. Term Frequency-Inverse Document Frequency (TF-IDF) and Bag-of-Words (BoW) are two well-known methods, yet they do not capture the semantic aspect of the emails, which may lead to misclassification. To tackle this issue, we propose an architecture based on the Distributed Bag of Words version of Paragraph Vector (PV-DBOW), which is considered a deep learning architecture. The features generated from an email are characterized by their richness, and they capture the semantic aspect of the emails by taking into account the context of the sentences. The obtained results show that the proposed approach outperforms the state-of-the-art methodologies in terms of precision, recall, F-measure, and accuracy.
Keywords: Spam-filtering; Deep Learning; Machine learning; PV-DBOW; Feature extraction.
A One-class Classification Approach Based on SVDD for Imbalanced and Overlapping Classes
by Seyyed-Mohammad Javadi-Moghaddam, Reyhane Rateghi
Abstract: Imbalanced data classification is challenging, especially when the two classes overlap. The overlap makes it almost impossible to distinguish the two classes and isolate them. In the real world, many data sets are imbalanced and overlapping. This paper identifies the overlapping regions optimally by comparing the results of running a single-class SVDD algorithm on each class. The method then uses the nearest-neighbor algorithm to classify the data in the overlapping region. Evaluation on datasets with a high imbalance rate shows better performance than other approaches.
Keywords: Classification; Imbalanced data; Overlapping; SVDD algorithm; KNN algorithm.
A 3-in-1 Framework for Human Resources Selection and Positioning Based on Machine Learning Tools
by Panagiota Pampouktsi, Katia Lida Kermanidis, Markos Avlonitis
Abstract: Controlling the performance of human resources, in order to make the best possible use of them and thereby economize on resources, is an important issue for all organizations. Innovation in human resource evaluation systems is a success factor in all administrative changes. Management of human resources is based on proper selection and positioning of the staff that adds value to an organization. Artificial intelligence is the new ally for managers, but it is not widespread in the public sector. In this study we collected and assessed personnel data in a public organization and, using various classification algorithms, built specific models according to job descriptions and employees' qualifications. Our aim is the development of an innovative framework that uses machine learning techniques for meritocratic personnel selection for recruitment and, simultaneously, for positioning either horizontally in the departments or vertically in leadership positions, unifying three procedures in one.
Keywords: artificial intelligence; evaluation; leadership; public sector.
HEYWOOD CASES: POSSIBLE CAUSES AND SOLUTIONS
by Rayees Farooq Farooq
Abstract: The purpose of the study is to identify the causes and recommend possible solutions to the Heywood cases. The study reviews the literature from 1960-2021 using the keyword search, "Heywood cases," "Improper solutions," and "Negative variance." The studies were explored from selected databases viz. Google scholar, Scopus and Web of Science. The study has found that fixing the negative variance to zero is the most widely used solution to the Heywood cases. The study also found that multivariate normality, small sample size with a large number of indicators, factor loadings of less than 0.5, and model misspecification are the possible causes of Heywood cases. The study suggests novel solutions to overcome the possibility of the Heywood cases, including fixing the negative variance to zero, maintaining the large sample size, and increasing the number of indicators in the construct. The study can be beneficial to the researchers who validate the model using CB-SEM. The study offers a basic understanding of the possible causes and novel solutions to the Heywood cases to help the researchers better develop the constructs/scales. The present research guides the researchers through the various effects of Heywood cases on the study's findings.
Keywords: Heywood cases; improper solutions; misspecification; item-per construct rule; Mahalanobis D2; multivariate normality.
Is matching in different situations equally applicable for impact evaluation studies when using observational data?
by Apsara Karki Nepal, Ghulam Muhammad Shah, Farid Ahmad
Abstract: Randomization of interventions is not feasible in every scenario, more so in development sectors. Alternatively, evaluation practitioners tend to rely on quasi-experimental designs and collect data from intervention and comparison groups, using statistical matching methods to create a counterfactual for impact evaluations. Although different types of statistical matching methods are available, their relative performance is generally unknown to practitioners. Using five sets of household survey data collected from samples of treatment and comparison groups from four countries in the Hindu Kush Himalaya region, we examine the extent of covariate imbalances before and after matching at these five sites using four different matching methods. Our results indicate that for small samples with substantial imbalances in the covariates before matching, nearest neighbour matching does not perform well, but matching with stratification works better. The performance of radius and kernel matching falls between that of nearest neighbour matching and matching with stratification. If the comparison sample is chosen carefully, such that its socioeconomic and geographical characteristics are similar to those of the intervention sample, the chances of getting relatively balanced covariates without matching are high. For covariates that are balanced without matching, statistical matching may not produce better results. We find that matching is useful when covariate imbalance is high before matching but may be less useful for samples with relatively balanced covariates before matching.
Keywords: matching; variable imbalance; impact evaluation; propensity scores; comparison group; observational data.
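Covariate imbalance between a treatment and a comparison group is often summarized with the standardized mean difference (SMD). The abstract above does not state which balance measure the authors used, so the SMD, the function name, and the household-size values below are assumptions made purely for illustration.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(treated, comparison):
    """Standardized mean difference of one covariate between two groups.

    Uses the pooled standard deviation sqrt((s_t^2 + s_c^2) / 2), where
    s_t^2 and s_c^2 are the sample variances of the two groups.
    Each group needs at least two observations.
    """
    pooled_sd = sqrt((variance(treated) + variance(comparison)) / 2)
    if pooled_sd == 0:
        return 0.0
    return (mean(treated) - mean(comparison)) / pooled_sd


# Hypothetical household sizes in the two groups before matching.
treated_household_size = [2, 4, 6, 8]
comparison_household_size = [1, 3, 5, 7]

smd = standardized_mean_difference(
    treated_household_size, comparison_household_size
)
print(round(smd, 4))
```

Computing such a measure for every covariate before and after matching gives a direct way to compare matching methods; which cut-off counts as "balanced" is a convention that the abstract does not specify.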
An ECOSVS based Support Vector Machine for Network Anomaly Detection
by Meenal Jain, Vikas Saxena
Abstract: In this paper, the Support Vector Machine (SVM) classification technique is introduced and evaluated for classifying normal and attack traffic in the Spark distributed environment. In terms of classification speed, SVM suffers from the important shortcomings of high time and memory training complexities, which depend on the training set size. The authors propose an Effective COrrelation based Support Vector Selection (ECOSVS) algorithm for SVM speed optimization. The ECOSVS based SVM performed better than three other supervised classifiers, namely Logistic Regression (LR), Decision Tree (DT), and Random Forest (RF), in terms of accuracy and training time. Apache Spark's RDD structure has been used for the detection of network-based anomalies. The analysis of the algorithm was performed on two publicly available network datasets, namely the Network Security Laboratory-Knowledge Discovery in Databases dataset (NSL-KDD) and the Coburg Intrusion Detection Datasets (CIDDS-2017). The results showed that our proposed algorithm reduced the training set size of NSL-KDD and CIDDS-2017 datasets to 99.3 and 85 percent, respectively. Accuracies of 80 and 87 percent for the ECOSVS based SVM classifier were achieved.
Keywords: ECOSVS; SVM; Anomaly Detection; Apache Spark.
Special Issue on: Big Data Analytics in Business Research
Detection of stragglers and optimal rescheduling of slow running tasks in Bigdata environment using LFCSO-LVQ classifier and Enhanced PSO algorithm
by Hetal A. Joshiara, Chirag S. Thaker, Sanjay M. Shah, Darshan B. Choksi
Abstract: Stragglers are widely believed to degrade the performance of Big Data (BD) analysis systems significantly, owing to poorly performing computing nodes, data skew, etc. Researchers have proposed a variety of mechanisms, frameworks, and management techniques for detecting stragglers both proactively and reactively. Although many techniques exist for Straggler Detection (SD), most conventional techniques do not address the problem of detecting stragglers accurately. Rather, a particular straggler detection approach is adopted and its effectiveness is studied against some performance metrics. In addition, they cannot be implemented in both heterogeneous and homogeneous environments. This paper aims to implement intelligent techniques for finding straggler tasks and speculating on their execution. Here, the Levy Flight based Cockroach Search Optimization-centered Learning Vector Quantization (LFCSO-LVQ) classifier is proposed to effectively identify the Slow Running (SR) tasks from a set of user tasks, and Enhanced Particle Swarm Optimization (EPSO) is proposed for optimal rescheduling of the identified SR tasks. The input data, a compilation of user tasks, are collected from the numerous openly available datasets. After data collection, the collected data are preprocessed by identifying homogeneous and heterogeneous tasks. Apache Spark (AS) then splits the preprocessed tasks into several sub-tasks, and features are extracted from these sub-tasks for SR task prediction. An Information Gain based Linear Discriminant Analysis (IG-LDA) is proposed as a Feature Selection (FS) approach that reduces the classifier's training time and helps it reach the highest accuracy level. After FS, the selected features are input to LFCSO-LVQ, which predicts the SR tasks of the dataset based on the chosen features. After that, EPSO reschedules these predicted tasks to the other fastest nodes of the Virtual Machine (VM). The performance of the proposed LFCSO-LVQ and EPSO for SD and rescheduling is analyzed experimentally. The results show that the proposed LFCSO-LVQ classifies the stragglers efficiently compared with existing LVQ, DNN, RNN, and QDLNN in terms of F-score and accuracy. The results also confirm that EPSO performs rescheduling well compared with existing WOA, MFO, PSO, and BM-LOA in terms of waiting time, turnaround time, throughput, latency, CPU usage, and execution time.
Keywords: Bigdata; Straggler Detection; Rescheduling of Tasks; Speculative Execution; slow running tasks identification; Learning Vector Quantization; Optimal Resource Scheduling; Particle Swarm Optimization (PSO).