Title: Comparison of hierarchical clustering methods for binary data from molecular markers

Authors: Emmanouil D. Pratsinakis; Symela Ntoanidou; Alexios Polidoros; Christos Dordas; Panagiotis Madesis; Ilias Eleftherohorinos; George Menexes

Addresses: Laboratory of Agronomy, School of Agriculture, Faculty of Agriculture, Forestry and Natural Environment, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece; Laboratory of Agronomy, School of Agriculture, Faculty of Agriculture, Forestry and Natural Environment, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece; Laboratory of Genetics and Plant Breeding, School of Agriculture, Faculty of Agriculture, Forestry and Natural Environment, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece; Laboratory of Agronomy, School of Agriculture, Faculty of Agriculture, Forestry and Natural Environment, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece; Institute of Applied Biosciences, CERTH, Centre for Research and Technology Hellas, 57001 6th Km. Charilaou-Thermi Road Thessaloniki, Greece; Laboratory of Agronomy, School of Agriculture, Faculty of Agriculture, Forestry and Natural Environment, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece; Laboratory of Agronomy, School of Agriculture, Faculty of Agriculture, Forestry and Natural Environment, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece

Abstract: Data from molecular markers used for constructing dendrograms, which are based on genetic distances between different plant species, are encoded as binary data. For dendrogram construction, the most commonly used linkage method is UPGMA in combination with the squared Euclidean distance. In this scientific field, this combination appears to be the 'golden standard' clustering method. This study presents a review of clustering methods used with binary data. Furthermore, it attempts an evaluation of the linkage methods and the corresponding appropriate distances (a comparison of 163 clustering methods) using binary data resulting from molecular markers applied to five populations of the wild mustard species Sinapis arvensis. The various cluster solutions were validated using external criteria. The results showed that the 'golden standard' is not a 'panacea' for dendrogram construction based on binary data derived from molecular markers. Thirty-seven other hierarchical clustering methods could also be used.
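
As an illustration of the baseline combination named in the abstract (UPGMA with the squared Euclidean distance), the following is a minimal pure-Python sketch, not the authors' code. The band profiles are hypothetical presence/absence (1/0) scores invented for the example, and the function names `squared_euclidean` and `upgma` are assumptions introduced here. For 0/1 vectors the squared Euclidean distance is simply the number of mismatching marker positions.

```python
from itertools import combinations


def squared_euclidean(a, b):
    """Squared Euclidean distance; for 0/1 vectors this counts mismatches."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def upgma(profiles):
    """Naive average-linkage (UPGMA) agglomeration.

    Returns the merge history as (cluster_a, cluster_b, height) tuples,
    where each cluster is a tuple of row indices into `profiles`.
    """
    clusters = [(i,) for i in range(len(profiles))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            # UPGMA: unweighted mean of all between-cluster pairwise distances
            total = sum(
                squared_euclidean(profiles[i], profiles[j]) for i in a for j in b
            )
            d = total / (len(a) * len(b))
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        # Replace the two closest clusters with their union
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, d))
    return merges


# Hypothetical band scores: rows are individuals, columns are marker bands
profiles = [
    [1, 1, 0, 0, 1, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 1],
    [0, 0, 1, 1, 1, 1],
]

for a, b, height in upgma(profiles):
    print(a, b, height)
```

The merge heights can be read as the levels at which branches join in a dendrogram. Replacing `squared_euclidean` with another proximity measure, or the averaging step with another linkage rule, gives a different method of the kind the study compares.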

Keywords: dendrograms; proximities; linkage methods; Benzécri's chi-squared distance; correspondence analysis; categorical binary data; ISSR markers; Sinapis arvensis.

DOI: 10.1504/IJDATS.2020.108036

International Journal of Data Analysis Techniques and Strategies, 2020 Vol.12 No.3, pp.190 - 212

Received: 06 Feb 2018
Accepted: 13 May 2018

Published online: 02 Jul 2020