
Title: A combination of variable selection and data mining techniques for high-dimensional statistical modelling

Authors: Christos Koukouvinos; Kalliopi Mylona; Christina Parpoula

Addresses: Department of Mathematics, National Technical University of Athens, Zografou 15773, Athens, Greece; Faculty of Applied Economics, Universiteit Antwerpen, 2000 Prinsstraat 13, Antwerpen, Belgium; Department of Mathematics, National Technical University of Athens, Zografou 15773, Athens, Greece

Abstract: Variable selection is fundamental to statistical modelling in diverse fields of science. This paper deals with the problem of high-dimensional statistical modelling through the analysis of seismological data acquired in Greece during the years 1962-2003. The dataset consists of 10,333 observations and 11 factors, which are used to detect possible risk factors for large earthquakes. In our study, different statistical variable selection techniques are applied, while data mining techniques enable us to discover associations, meaningful patterns and rules. The statistical methods employed in this work were the non-concave penalised likelihood methods (SCAD, LASSO and Hard), generalised linear logistic regression and best subset variable selection. The data mining methods applied were three decision tree algorithms: the classification and regression tree (C&RT), the chi-square automatic interaction detection (CHAID) and the C5.0 algorithm. How to identify the significant variables in large datasets, along with the performance of the techniques used, is also discussed.

Keywords: variable selection; non-concave penalised likelihood; data mining; decision making; high-dimensional statistical modelling; seismological data; Greece; risk assessment; large earthquakes; decision trees.

DOI: 10.1504/IJIDS.2013.053799

International Journal of Information and Decision Sciences, 2013 Vol.5 No.2, pp.154 - 168

Published online: 28 Feb 2014


