International Journal of Data Analysis Techniques and Strategies (8 papers in press)
Detection of stragglers and optimal rescheduling of slow running tasks in Bigdata environment using LFCSO-LVQ classifier and Enhanced PSO algorithm
by Hetal A. Joshiara, Chirag S. Thaker, Sanjay M. Shah, Darshan B. Choksi
Abstract: Stragglers, caused by poorly performing computing nodes, data skew, and similar factors, are widely believed to degrade the performance of Big Data (BD) analysis systems considerably. Researchers have proposed a range of mechanisms, frameworks, and management techniques for detecting stragglers both proactively and reactively. Although many techniques exist for Straggler Detection (SD), most conventional techniques do not address the problem of detecting stragglers accurately. Instead, they adopt a particular straggler detection approach and study its effectiveness with respect to a few performance metrics. Moreover, they cannot be implemented in both heterogeneous and homogeneous environments. This paper applies intelligent techniques to find straggler tasks and to speculate on their execution. The Levy Flight based Cockroach Search Optimization-centered Learning Vector Quantization (LFCSO-LVQ) classifier is proposed to identify the Slow Running (SR) tasks among a set of user tasks, and Enhanced Particle Swarm Optimization (EPSO) is proposed to reschedule the identified SR tasks optimally. The input data, a compilation of user tasks, are collected from openly available datasets. After data collection, the data are preprocessed by identifying homogeneous and heterogeneous tasks. Apache Spark (AS) then splits the preprocessed tasks into several sub-tasks, and features are extracted from these sub-tasks for SR task prediction. An Information Gain based Linear Discriminant Analysis (IG-LDA) is proposed as a Feature Selection (FS) approach that reduces the classifier's training time and helps it attain the highest accuracy level. After FS, the selected features are input to LFCSO-LVQ, which predicts the SR tasks of the dataset based on the chosen features.
After that, EPSO reschedules these predicted tasks to the fastest of the other Virtual Machine (VM) nodes. The performance of the proposed LFCSO-LVQ and EPSO for SD and rescheduling is analyzed experimentally. The results show that the proposed LFCSO-LVQ classifies stragglers more effectively than the existing LVQ, DNN, RNN, and QDLNN in terms of f-score and accuracy. The results also confirm that EPSO performs rescheduling better than the existing WOA, MFO, PSO, and BM-LOA in terms of waiting time, turnaround time, throughput, latency, CPU usage, and execution time.
Keywords: Bigdata; Straggler Detection; Rescheduling of Tasks; Speculative Execution; slow running tasks identification; Learning Vector Quantization; Optimal Resource Scheduling; Particle Swarm Optimization (PSO).
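The feature selection step in the abstract above builds on information gain. The following is a minimal sketch of how information gain ranks discrete features, assuming toy data; the feature names (`high_cpu_wait`, `data_local`) and labels are hypothetical, and this is not the authors' IG-LDA method, only the information gain ingredient it is named after.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy reduction obtained by splitting the labels on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy data: 1 = slow-running task, 0 = normal task.
labels = [1, 1, 0, 0]
high_cpu_wait = ["yes", "yes", "no", "no"]  # separates the two classes perfectly
data_local = ["yes", "no", "yes", "no"]     # carries no information about the class

ig_wait = information_gain(high_cpu_wait, labels)
ig_local = information_gain(data_local, labels)
```

A feature selector of this kind would keep `high_cpu_wait` (gain of one bit) and drop `data_local` (gain of zero) before training the classifier.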
HEYWOOD CASES: POSSIBLE CAUSES AND SOLUTIONS
by Rayees Farooq Farooq
Abstract: The purpose of the study is to identify the causes of Heywood cases and recommend possible solutions. The study reviews the literature from 1960 to 2021 using a keyword search for "Heywood cases," "Improper solutions," and "Negative variance." The studies were retrieved from selected databases, viz. Google Scholar, Scopus, and Web of Science. The study has found that fixing the negative variance to zero is the most widely used solution to the Heywood cases. The study also found that multivariate normality, small sample size with a large number of indicators, factor loadings of less than 0.5, and model misspecification are the possible causes of Heywood cases. The study suggests novel solutions to overcome the possibility of the Heywood cases, including fixing the negative variance to zero, maintaining a large sample size, and increasing the number of indicators in the construct. The study can benefit researchers who validate models using CB-SEM. It offers a basic understanding of the possible causes of, and novel solutions to, the Heywood cases to help researchers better develop constructs and scales. The present research guides researchers through the various effects of Heywood cases on a study's findings.
Keywords: Heywood cases; improper solutions; misspecification; item-per construct rule; Mahalanobis D²; multivariate normality.
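A Heywood case appears as a negative estimated error (unique) variance, which for standardized indicators means a communality above 1. The sketch below flags such indicators and applies the "fix the negative variance to zero" remedy mentioned in the abstract. It assumes standardized indicators and uncorrelated factors; the loading values are hypothetical, and the function names are illustrative rather than taken from any SEM package.

```python
def unique_variances(loadings):
    """Return 1 - communality per indicator.

    Assumes standardized indicators and uncorrelated factors, so the
    communality is the sum of squared loadings in each row.
    """
    return [1.0 - sum(l * l for l in row) for row in loadings]


def flag_heywood(loadings):
    """Indices of indicators whose implied unique variance is negative."""
    return [i for i, u in enumerate(unique_variances(loadings)) if u < 0.0]


def fix_to_zero(variances):
    """Common remedy: constrain negative variance estimates to zero."""
    return [max(0.0, v) for v in variances]


# Hypothetical single-factor solution; the third loading exceeds 1.
loadings = [[0.72], [0.65], [1.04]]

uniq = unique_variances(loadings)
heywood = flag_heywood(loadings)
fixed = fix_to_zero(uniq)
```

Here only the third indicator is flagged, and its variance is set to zero while the other two estimates are left unchanged.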
Is matching in different situations equally applicable for impact evaluation studies when using observational data?
by Apsara Karki Nepal, Ghulam Muhammad Shah, Farid Ahmad
Abstract: Randomization of interventions is not feasible in every scenario, particularly in development sectors. Instead, evaluation practitioners tend to rely on quasi-experimental designs, collecting data from intervention and comparison groups and using statistical matching methods to create a counterfactual for impact evaluations. Although different types of statistical matching methods are available, their relative performance is generally unknown to practitioners. Using five sets of household survey data collected from samples of treatment and comparison groups from four countries in the Hindu Kush Himalaya region, we examine the extent of covariate imbalances before and after matching at these five sites using four different matching methods. Our results indicate that for small samples with considerable covariate imbalance before matching, nearest neighbour matching does not perform well, whereas matching with stratification works better. The performance of radius and kernel matching falls between that of nearest neighbour matching and matching with stratification. If the comparison sample is chosen carefully so that its socioeconomic and geographical characteristics are similar to those of the intervention sample, the chances of obtaining relatively balanced covariates without matching are high. When covariates are already balanced without matching, statistical matching may not produce better results. We find that matching is useful when covariate imbalance is high before matching but may be less useful for samples with relatively balanced covariates before matching.
Keywords: matching; variable imbalance; impact evaluation; propensity scores; comparison group; observational data.
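Covariate imbalance before and after matching, as examined in the abstract above, is commonly summarized with the standardized mean difference. The sketch below pairs it with one-to-one nearest neighbour matching (with replacement) on a propensity score. All numbers are hypothetical toy values, not the survey data, and the propensity scores are assumed to have been estimated beforehand.

```python
from statistics import mean, variance


def standardized_mean_difference(treated, comparison):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = ((variance(treated) + variance(comparison)) / 2) ** 0.5
    return (mean(treated) - mean(comparison)) / pooled_sd


def nearest_neighbour_match(treated_scores, comparison_scores):
    """For each treated unit, index of the comparison unit with the
    closest propensity score (matching with replacement)."""
    return [
        min(range(len(comparison_scores)),
            key=lambda j: abs(comparison_scores[j] - s))
        for s in treated_scores
    ]


# Hypothetical propensity scores and one covariate (household income).
treated_scores = [0.80, 0.70, 0.60]
comparison_scores = [0.20, 0.30, 0.61, 0.72, 0.79]
treated_income = [10.0, 12.0, 14.0]
comparison_income = [30.0, 28.0, 15.0, 12.5, 10.5]

matches = nearest_neighbour_match(treated_scores, comparison_scores)
matched_income = [comparison_income[j] for j in matches]

smd_before = standardized_mean_difference(treated_income, comparison_income)
smd_after = standardized_mean_difference(treated_income, matched_income)
```

In this toy case the absolute standardized mean difference shrinks after matching; with a comparison sample that is already similar to the treated sample, the before and after values would be close, which is the situation the abstract describes.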
An ECOSVS based Support Vector Machine for Network Anomaly Detection
by Meenal Jain, Vikas Saxena
Abstract: In this paper, the Support Vector Machine (SVM) classification technique is introduced and evaluated for classifying normal and attack traffic in the Spark distributed environment. In terms of classification speed, SVM suffers from high training time and memory complexity, both of which depend on the training set size. The authors propose an Effective COrrelation based Support Vector Selection (ECOSVS) algorithm for SVM speed optimization. The ECOSVS-based SVM performed better than three other supervised classifiers, namely Logistic Regression (LR), Decision Tree (DT), and Random Forest (RF), in terms of accuracy and training time. Apache Spark's RDD structure was used for the detection of network-based anomalies. The algorithm was analysed on two publicly available network datasets, namely the Network Security Laboratory-Knowledge Discovery in Databases dataset (NSL-KDD) and the Coburg Intrusion Detection Datasets (CIDDS-2017). The results showed that our proposed algorithm reduced the training set size of NSL-KDD and CIDDS-2017 datasets to 99.3 and 85 percent, respectively. Accuracies of 80 and 87 percent for the ECOSVS-based SVM classifier were achieved.
Keywords: ECOSVS; SVM; Anomaly Detection; Apache Spark.
Location and Time Factors Effect on Traffic Accidents Types in Kuwait
by Sharaf AlKheder, Fahad AlRukaibi, Ahmad Aiash
Abstract: The mortality and severe injuries due to traffic accidents in Gulf Co-operation Council (GCC) countries make a study that identifies the consequential risk factors increasingly necessary. Spatiotemporal factors are among the most critical risk factors affecting accident occurrence and severity. Very few safety studies addressing such factors are available in the literature for this region, and this work aims to fill that gap. Location and time period can be highly correlated with traffic accident types. In this study, 287,983 traffic accidents that happened in 2013, 2014, 2016, and 2017 were collected from the General Traffic Department of Kuwait. The collected traffic accidents occurred in four governorates, namely Kuwait City, Hawally, Al Farwaniya, and Al Ahmadi, as those governorates had the highest rates of traffic accidents. The types of traffic accidents included in the collected data were crashes, run-over accidents, and rollover accidents. The location and the year in which the accident occurred were chosen as the independent variables, and the dependent variable was the type of accident. A multinomial logit regression model was chosen to identify the significant variables and determine the correlation between the predictors and the dependent variable. This model was chosen because it is more flexible than an ordered model and can handle various types of variables with more than two categories. Moreover, a multicollinearity test was performed to determine whether applying this model would be problematic, as it can be affected by multicollinearity between the independent variables. The test showed that there was no multicollinearity issue.
The results showed that both location and time were significant variables that influence the occurrence of certain types of accidents. According to the model results, rollover accidents had higher odds of happening in Al Ahmadi governorate. For the time period, 2017 was found to have a higher probability of run-over accidents. Future studies should include meteorological factors, especially because of the extreme weather in most GCC countries, including sandstorms, high humidity, and high temperatures. Such studies should then try to find the correlations between those factors and traffic accident injury levels to provide a fuller understanding of all potential risk factors.
The findings of this study can provide insight into the potential risk factors that contribute to traffic accident injuries. This can help traffic institutions and police departments reduce the severe or fatal injuries associated with traffic accidents by identifying the risk factors in advance. Decision-makers can then legislate new rules or improve the traffic networks in an attempt to reduce fatal or severe traffic accidents.
Keywords: Traffic accidents; multinomial logit model; location; time-period; Kuwait.
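In a multinomial logit model such as the one described in the abstract above, each non-baseline accident type gets its own linear predictor, and the exponentiated coefficient of a predictor is the odds ratio relative to the baseline type. The sketch below illustrates this relationship. The coefficients are invented for illustration and are not the study's estimates; the choice of "crash" as the baseline category and a single `al_ahmadi` indicator variable are likewise assumptions.

```python
import math


def category_probabilities(scores):
    """Softmax over per-category linear predictors (baseline score is 0)."""
    m = max(scores.values())  # subtract the maximum for numerical stability
    exps = {k: math.exp(v - m) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


# Hypothetical coefficients; "crash" is the baseline, so its terms are zero.
coef = {
    "crash": {"intercept": 0.0, "al_ahmadi": 0.0},
    "run_over": {"intercept": -2.0, "al_ahmadi": -0.3},
    "rollover": {"intercept": -3.0, "al_ahmadi": 0.7},
}


def scores_for(al_ahmadi):
    """Linear predictor per accident type for a 0/1 location indicator."""
    return {k: c["intercept"] + c["al_ahmadi"] * al_ahmadi
            for k, c in coef.items()}


p_other = category_probabilities(scores_for(0))
p_ahmadi = category_probabilities(scores_for(1))

# Odds ratio of rollover versus crash for Al Ahmadi versus other locations.
odds_ratio_rollover = math.exp(coef["rollover"]["al_ahmadi"])
```

With a positive coefficient, the odds ratio is above 1, which is how a statement such as "rollover accidents had higher odds of happening in Al Ahmadi" would be read off a fitted model.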
Inequalities in the geographic distribution of chronic diseases in Brazil: an index methodology
by Simone Lima, Caroline Mota, Danielle Marinho
Abstract: The purpose of the present article is to compare the geographic distribution of nine chronic diseases in Brazil: arterial hypertension, arthritis/rheumatism, back/spine, bronchitis/asthma, cancer, chronic renal failure, depression, diabetes, and heart disease. The data are from the Brazilian National Health Survey (PNS), comprising 60,202 participants (≥18 years old). The morbidity rate of each disease was calculated for 27 units of Brazil. A geographic chronic disease index (gCDI) based on factor analysis (FA) was formulated as a summary measure to group and compare the distribution of these illnesses. Observing trends in health-related indexes and maps can be advantageous when analysing large databases. The final index indicated regional differences, showing that the South of Brazil had more individuals with chronic diseases than the North of the country, mainly for arterial hypertension, depression, diabetes, and heart disease.
Keywords: Public health; Brazilian National Health Survey; PNS; Non-communicable diseases; NCDs; Brazil; Chronic Disease Index; Factor analysis.
RECOGNITION OF ONLINE HANDWRITTEN TELUGU STROKE BY DETECTED DOMINANT POINTS USING CURVATURE ESTIMATION
by Srilakshmi Inuganti, R.Rajeshwara Rao
Abstract: An online handwritten Telugu character is a combination of strokes, each running from a pen-down to a pen-up position. The primary objective of Feature Extraction (FE) is to evaluate certain characteristics of a stroke that significantly distinguish it from other strokes. Traditional FE methods in Hand-Writing Recognition (HWR) suffer from high computational time and require a large number of input parameters. To overcome these flaws, we propose an FE method for Telugu strokes that uses Dominant Points (DP). This is a non-parametric approach. The procedure first defines the Region of Support (ROS) for each coordinate according to the local properties. With this ROS, the curvature is estimated for every point on the curve and is used to determine the DPs. The points with local maximum curvature are taken as DPs. The proposed feature also includes the direction between consecutive DPs of the stroke. For classification, a two-stage classifier is employed in which the DPs along with the direction between DPs are used in the pre-classifier and preprocessed (x,y) coordinates are used in the post-classifier. In both stages, the k-NN classifier is used. The proposed mechanism is verified with the HP-Lab data available in the UNIPEN format, as it contains Telugu characters. The results show that the proposed feature improves recognition accuracy on the chosen dataset.
Keywords: Online Handwritten Character Recognition (OHCR); Dominant Points; Curvature Estimation; Bending value; Two-Phase Classifier; Region of Support (ROS).
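The idea of picking dominant points at local curvature maxima, as described in the abstract above, can be sketched as follows. This is a simplified illustration, not the authors' method: it assumes a fixed region of support `k` for every point (the paper determines the region adaptively), uses the turning angle as the curvature estimate, and introduces a hypothetical `min_angle` threshold.

```python
import math


def turning_angles(points, k=1):
    """Absolute turning angle at each point, using the points k steps
    before and after as a fixed region of support. Endpoints get 0."""
    angles = [0.0] * len(points)
    for i in range(k, len(points) - k):
        ax = points[i][0] - points[i - k][0]
        ay = points[i][1] - points[i - k][1]
        bx = points[i + k][0] - points[i][0]
        by = points[i + k][1] - points[i][1]
        # Angle between the incoming and outgoing direction vectors.
        angles[i] = abs(math.atan2(ax * by - ay * bx, ax * bx + ay * by))
    return angles


def dominant_points(points, k=1, min_angle=math.radians(30)):
    """Indices where the turning angle is a local maximum above min_angle."""
    a = turning_angles(points, k)
    return [
        i for i in range(1, len(a) - 1)
        if a[i] >= min_angle and a[i] > a[i - 1] and a[i] >= a[i + 1]
    ]


# Hypothetical L-shaped stroke: a horizontal segment followed by a vertical one.
stroke = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]

corner_indices = dominant_points(stroke)
corner_angle = turning_angles(stroke)[3]
```

For this stroke the only dominant point is the corner at index 3, where the pen direction turns by a right angle; the direction between consecutive dominant points could then be appended as a further feature.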
Long-term Corporate Social Responsibility agenda considering climate change policy and conservation of forest - An Exploratory Analysis of Kerala based Companies
by Rajesh Veluthan
Abstract: CSR is increasingly viewed as a valuable approach for building stronger relations with an organization's internal and external stakeholders. Environmental management initiatives are also crucial at present owing to growing environmental concerns. Business organizations likewise view environment-related CSR activities as a social obligation and a way of giving back to society. In light of these considerations, the present paper discusses the significance of corporate social responsibility, statutory policies related to CSR, climate change and forest conservation, and the organizational motives behind CSR initiatives, with particular reference to Kerala-based organizations. The purpose is to recognize the challenges posed by climate change and inadequate forest conservation, and the role that organization-based CSR initiatives can play in addressing these rising environmental concerns. The benefits to organizations of undertaking such efforts are also discussed.
Keywords: Corporate social responsibility; CSR programs; statutory regulations; climate change; environmental responsibility; forest conservation; Kerala.