International Journal of Data Science (8 papers in press)
Financial Time Series Prediction: An Approach Using Motif Information and Neural Networks
by Pradeepkumar Dadabada, Ravi Vadlamani
Abstract: Financial time series prediction is an important and complex problem. This paper presents an approach that predicts financial time series using time series motifs and Artificial Neural Networks (ANNs) in tandem. A time series motif is a frequently occurring approximate pattern in a given time series. In the proposed approach, we first detect significant motifs in a financial time series using the Extreme Points-Clustering (EP-C) algorithm found in the literature. We then use these motifs in combination with an ANN to yield accurate predictions. Three ANNs are employed, namely the Multi-Layer Perceptron (MLP), General Regression Neural Network (GRNN), and Group Method for Data Handling (GMDH). The proposed Motif+GMDH hybrid outperformed both the Motif+MLP and Motif+GRNN hybrids on three financial time series: the EUR/USD and INR/USD exchange rates and the crude oil price (USD). Further, we compared the results of the motif-based hybrids with those of the three ANNs without motif information. We found that the Motif+MLP hybrid outperformed plain MLP on all datasets at the 1% level of statistical significance. Interestingly, however, Motif+GRNN and Motif+GMDH turned out to be statistically no different from GRNN and GMDH, respectively, at the 1% level of significance.
Keywords: Financial Time Series Prediction; Motif; MLP; GRNN; GMDH.
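To make the motif idea concrete, the sketch below finds recurring approximate subsequences by brute-force distance comparison. It is a toy stand-in for illustration only, not the EP-C algorithm the paper uses; the window size, tolerance, and example series are all hypothetical.

```python
def find_motifs(series, window=3, tol=0.5):
    """Return subsequences of length `window` that recur (within `tol`
    Euclidean distance) later in the series -- a toy stand-in for
    proper motif detection such as EP-C."""
    motifs = []
    n = len(series)
    for i in range(n - window + 1):
        a = series[i:i + window]
        for j in range(i + window, n - window + 1):  # non-overlapping matches only
            b = series[j:j + window]
            dist = sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
            if dist <= tol:
                motifs.append((i, j, a))
                break
    return motifs

# The pattern 1, 2, 3 recurs approximately at positions 0 and 4.
print(find_motifs([1.0, 2.0, 3.0, 5.0, 1.1, 2.1, 3.0, 4.0]))
# → [(0, 4, [1.0, 2.0, 3.0])]
```

In the paper's pipeline, subsequences like these would then be fed to the ANN as additional predictive information.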
Statistical Analysis of Fatal Crash in Michigan Using More than Two Time Series Models
by Liming Xie
Abstract: This article analyzes Michigan fatal crashes (MFC) in 1974-2014 as a time series, using ARIMA(0,0,1)-GARCH models to predict future values and trends. The heteroskedasticity of the series, such as in the incidence rates, is tested using ARCH (Autoregressive Conditional Heteroskedasticity) and GARCH (Generalized Autoregressive Conditional Heteroskedasticity) models. The best-fitting ARCH or GARCH model measures the volatility of the MFC series so that future values can be predicted. Both ARIMA and ARCH/GARCH models are used to forecast future values. To obtain the best fit, the author first estimates the ARIMA model and, where possible, differences the series in the ARCH/GARCH modeling to identify the best heteroskedastic residual model for the ARIMA fit. The author uses MA(1) and GARCH(1,1). The results suggest that GARCH modeling captures the dynamic change of variance accurately. It is inferred that hybrid ARIMA-ARCH/GARCH modeling is the best method for predicting ahead values of heteroskedastic series. Finally, both ARCH/GARCH forecasting models are used to predict the future values and trend of MFC; the forecasts suggest a downward trend.
Keywords: MFC; Dynamic change; Heteroskedasticity; ARIMA; ARCH; GARCH; Forecast.
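The GARCH(1,1) volatility recursion at the heart of the hybrid model above can be sketched in a few lines. The residual series and parameter values below are purely illustrative, not those fitted to the MFC data.

```python
def garch11_variance(returns, omega, alpha, beta):
    """Conditional variance recursion of a GARCH(1,1) model:
    sigma2_t = omega + alpha * r_{t-1}^2 + beta * sigma2_{t-1},
    started from the sample (unconditional) variance of the series."""
    sigma2 = [sum(r * r for r in returns) / len(returns)]
    for r in returns[:-1]:
        sigma2.append(omega + alpha * r * r + beta * sigma2[-1])
    return sigma2

# Illustrative residuals and parameters (not fitted to real data).
print(garch11_variance([0.1, -0.2, 0.05, 0.3],
                       omega=0.01, alpha=0.1, beta=0.8))
```

Large shocks (e.g. the -0.2 residual) raise the conditional variance of the following period, which is exactly the "dynamic change of variance" the abstract refers to.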
Loading, Searching and Retrieving Data from Local Data Nodes on HDFS
by Sreerama Murty Maturi
Abstract: This paper addresses loading and searching data within local data nodes using the Hadoop environment. In general, loading and searching data with a query is complex because the dataset may be very large. We propose a technique that handles the data in local nodes without overlap and retrieves data via a script. The main task of the query is to store the information in a distributed environment and to search it without delay. We define a script that avoids duplicate redundancy while loading and searching data dynamically, and we provide the Hadoop file system in a distributed environment. An Apache script is used for loading and searching the information instead of an SQL mechanism. We improve the performance of query execution using graph theory. A query can be split into three parts that search the data individually, with the results combined at execution time. We use the replica concept to store the data at query-execution time in the Hadoop file system. The script is executed in the local environment of the Hadoop file system.
Keywords: HDFS; replica; local; distributed; capacity.
RNN-Based Deep-Learning Approach to Forecasting Hospital System Demands: Application to an Emergency Department
by Farid Kadri, Kahina Abdennbi
Abstract: In recent years, hospital systems have been confronted with increasing demands. The management of patient flow is one of the main challenges faced by many hospital establishments, in particular emergency departments (EDs). The increasing number of ED demands may lead to ED overcrowding, which often increases patient length of stay; the latter has a negative impact on the quality of medical services and on medical staff. One approach to alleviating such problems is to predict attendances at the ED. Indeed, predicting ED demands greatly helps ED managers make suitable plans by optimally allocating the available limited resources to the predicted patient attendances. Existing regression and time series models such as ARIMA are mainly linear and cannot describe the stochastic and non-linear nature of time series data. In recent years, Recurrent Neural Networks (RNNs) have been applied as novel alternatives for prediction in various domains. In this paper, we propose an RNN-based deep learning approach for predicting ED demands. The experiments were carried out on a real database collected from the pediatric emergency department (PED) of the Lille regional hospital center, France. The RNN-based deep learning approach was shown to provide a useful tool for predicting ED admissions.
Keywords: ED demands; Overcrowding; Prediction; Deep learning; RNN; LSTM-GRU.
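At its core, the RNN approach described above carries a hidden state through the admissions history and reads the forecast off the final state. The scalar sketch below shows that mechanism with hypothetical hand-set weights; a real model (for instance the LSTM/GRU variants in the keywords) would use vector states and learned parameters.

```python
import math

def rnn_step(x, h, w_x, w_h, b):
    """One step of a vanilla RNN cell (scalar state for clarity):
    h' = tanh(w_x * x + w_h * h + b)."""
    return math.tanh(w_x * x + w_h * h + b)

def rnn_forecast(series, w_x, w_h, b, w_out):
    """Run the cell over the whole series and map the final
    hidden state to a next-value prediction."""
    h = 0.0
    for x in series:
        h = rnn_step(x, h, w_x, w_h, b)
    return w_out * h

# Hypothetical normalized daily ED admissions and hand-set weights.
print(rnn_forecast([0.1, 0.2], w_x=1.0, w_h=0.5, b=0.0, w_out=2.0))
```

The recurrent term `w_h * h` is what lets the prediction depend on the whole history rather than only on the most recent observation, which is where RNNs depart from linear models like ARIMA.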
What do angles of cornea curvature reveal? A new (Sinusoidal) probability density function with statistical properties assists
by Ramalingam Shanmugam
Abstract: This article introduces a new probability distribution along with its statistical properties, named the Sinusoidal probability distribution (SPD). Using the SPD, the article demonstrates an approach to learning from the modeling and statistical analysis of cornea curvature angles as observed among 23 patients in a glaucoma clinic. Specifically, analytic expressions for the survival function; the odds function with its tipping point, critical value, and convexity; Q-Q plotting positions; the variance-mean relation; heterogeneity among randomly sampled cornea patients; the vitality function; total value at risk; the past-life function; entropy; and the hazard, inverse, and mean functions of the SPD are derived and illustrated. The minimum and maximum sample values are shown to be the maximum likelihood estimators of the parameters on which the SPD is defined. The joint probability density function of the lower and upper record values of a sample from the SPD, together with their correlation function, is derived and utilized to better understand the implications of the measured cornea angles. A few comments are made at the end to further advance research on cornea illness.
Keywords: Survival function; new probability density function; mean-variance relation; glaucoma incidences.
Robust Computational Modeling of the Sodium Adsorption Ratio Using Regression Analysis and Support Vector Machine
by Alireza Rostami, Milad Arabloo, Alibakhsh Kasaeian, Khalil Shahbazi
Abstract: In the present study, two new methods, a least-squares support vector machine (LSSVM) and a regression-based model, were developed for accurate estimation of the sodium adsorption ratio in terms of the ionic concentrations of calcium (Ca2+), magnesium (Mg2+), and sodium (Na+); the bicarbonate (HCO3-) to Ca2+ ratio; and the salinity/conductivity of the water used, so as to explain the impact of water quality on irrigation water, using a reliable literature database. The results of the developed models were compared with a commonly used literature model using visual and statistical parameters. The superiority of the regression-based approach is demonstrated, with average absolute relative deviations (AARDs) of 0.06% for HCO3-/Ca2+ ratio ≤ 1 and 0.28% for HCO3-/Ca2+ ratio > 1. Finally, the proposed methods are easy to apply and sufficiently accurate, requiring fewer calculations and thereby allowing rapid estimation of the sodium adsorption ratio over a wide range of operational conditions.
Keywords: Sodium adsorption ratio; Irrigation water; Salinity; Least square support vector machine; Error analysis.
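For context, the quantity being modeled above has a standard closed-form definition: the sodium adsorption ratio is the Na+ concentration divided by the square root of half the sum of the Ca2+ and Mg2+ concentrations, all in meq/L. A minimal sketch (the sample values are illustrative):

```python
import math

def sodium_adsorption_ratio(na, ca, mg):
    """Standard SAR definition (ionic concentrations in meq/L):
    SAR = Na+ / sqrt((Ca2+ + Mg2+) / 2)."""
    return na / math.sqrt((ca + mg) / 2.0)

# Illustrative water sample: 10 meq/L Na+, 4 meq/L each of Ca2+ and Mg2+.
print(sodium_adsorption_ratio(10.0, 4.0, 4.0))  # → 5.0
```

The paper's contribution is estimating this ratio from more readily available water-quality inputs (including the HCO3-/Ca2+ ratio and conductivity) rather than computing it directly from measured ion concentrations.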
Hierarchical non-Archimedean DEA models: Application on mobile money agents locations in the city of Harare
by Jacob Muvingi, Arshad Ahmud Iqbal Peer, Farhad Hosseinzadeh Lotfi
Abstract: Hierarchical non-Archimedean data envelopment analysis (DEA) models are proposed to evaluate the efficiency of two integrated types of decision-making units (DMUs). The determination of non-Archimedean values is extended to cater for DMUs with a hierarchical group structure. The proposed approach was applied to the location analysis of mobile money agent locations. To improve the adjusted efficiency ratings of groups of unequal size, an adjustment value on selected groups' average efficiency ratings was determined through the identification of ideal location-group proxies. Three district location efficiency ratings (DLER-1, DLER-2, and DLER-3) were generated, respectively, through the non-Archimedean DEA hierarchical method, the DEA hierarchical method with the non-Archimedean epsilon ignored, and the treatment of district locations as a system made up of suburb locations. The application of the non-Archimedean value in the district location efficiency analysis reduced the number of efficient district locations.
Keywords: Data envelopment analysis; Location; Mobile money agents; Hierarchical; Parallel systems; non-Archimedean value.
Reducing Feature Selection Bias Using a Model Independent Performance Measure
by Weizeng Ni, Nuo Xu, Honghao Dai, Samuel Huang
Abstract: Feature selection is an important step in learning from data, especially when dealing with datasets with small sample sizes and high dimensionality. A popular approach to feature selection is the so-called wrapper approach. In recent years, researchers have realized that wrappers exhibit a feature selection bias due to data overfitting. External cross-validation, or dual-loop cross-validation, has been proposed to solve this problem. However, cross-validation approaches tend to introduce excessive variability for small-sample, high-dimensional data. This paper shows that a model-independent approach to feature selection, namely the minimum expected cost of misclassification (MECM), can reduce feature selection bias without the need for cross-validation. A designed experiment was conducted using a synthetic dataset. The results show that 10-fold dual-loop cross-validation-based wrapper feature selection has an error rate around 33% higher than the noise-free error rate and fails to identify discriminative features consistently across all 10 folds. By contrast, MECM selects more discriminative features than dual-loop cross-validation and is more robust to different classification models than the wrapper-based approach. A real-world colon cancer dataset is further used to demonstrate the effectiveness of MECM.
Keywords: Feature Selection; Overfitting; Microarray Data; Model-independent.
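The criterion named above scores feature subsets by the expected cost of the misclassifications they induce rather than by a fitted model's cross-validated accuracy. The binary-label sketch below shows one simple way such a cost can be computed; the cost weights and labels are illustrative, and the paper's exact MECM formulation may differ.

```python
def expected_misclassification_cost(y_true, y_pred, cost_fp=1.0, cost_fn=1.0):
    """Average misclassification cost for binary labels:
    false positives weighted by cost_fp, false negatives by cost_fn."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return (cost_fp * fp + cost_fn * fn) / len(y_true)

# One false positive and one false negative out of four samples.
print(expected_misclassification_cost([0, 0, 1, 1], [0, 1, 1, 0]))  # → 0.5
```

Because this measure does not depend on any particular fitted classifier, a feature subset chosen to minimize it cannot overfit a specific model, which is the intuition behind the model-independence claim.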