Performance of clustering procedures for grouping germplasms based on mixture data with missing observations
Keywords:
Cluster analysis, Imputation, Missing data, Mixture data, Qualitative traits, Quantitative traits, Random Amplified Polymorphic DNA (RAPD)
Abstract
Missing observations are common in breeding experiments that record a mixture of qualitative and quantitative traits, and they make it difficult to cluster the germplasms. The present study compared the performance of clustering procedures meant for mixture data across five clustering methods, six ways of imputing missing data and three levels of missing observations. All the clustering methods were found to be robust against imputation up to 5% missing observations. The INDOMIX and PRINQUAL methods, used in conjunction with k-means clustering and with missing observations imputed by (i) mean substitution in quantitative traits and frequency substitution in qualitative traits or (ii) multiple imputation in quantitative traits and 0 imputation in qualitative traits, were found to perform better than the EM, ANN and PCAMIX methods for classification of germplasms. The study was conducted during 2009–10 at the Indian Agricultural Statistics Research Institute; the data used for illustration were obtained from the National Bureau of Plant Genetic Resources.
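As a rough illustration of imputation scheme (i), the sketch below fills missing quantitative values with the column mean and missing qualitative values with the most frequent observed category. It is not the authors' implementation: reading "frequency substitution" as substitution of the modal category is an assumption, and the function name `impute_mixed`, the trait names and the sample records are all hypothetical.

```python
from collections import Counter
from statistics import mean


def impute_mixed(rows, quantitative, qualitative):
    """Return a copy of rows with None replaced column by column.

    Quantitative columns receive the mean of the observed values.
    Qualitative columns receive the most frequent observed category
    (an assumed interpretation of 'frequency substitution').
    A column with no observed values at all is not handled here.
    """
    fill = {}
    for col in quantitative:
        observed = [r[col] for r in rows if r[col] is not None]
        fill[col] = mean(observed)
    for col in qualitative:
        observed = [r[col] for r in rows if r[col] is not None]
        fill[col] = Counter(observed).most_common(1)[0][0]
    # Build new dicts so the original records stay untouched.
    return [
        {col: (fill[col] if val is None and col in fill else val)
         for col, val in r.items()}
        for r in rows
    ]


# Hypothetical germplasm records; None marks a missing observation.
accessions = [
    {"plant_height": 90.0, "seed_colour": "brown"},
    {"plant_height": None, "seed_colour": "brown"},
    {"plant_height": 110.0, "seed_colour": None},
    {"plant_height": 100.0, "seed_colour": "white"},
]
completed = impute_mixed(accessions, ["plant_height"], ["seed_colour"])
```

The completed records could then be quantified and passed to a clustering routine such as k-means; that step is omitted here because the abstract does not give the details of the INDOMIX or PRINQUAL quantification.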
License
Copyright (c) 2014 The Indian Journal of Agricultural Sciences

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
The copyright of the articles published in The Indian Journal of Agricultural Sciences is vested with the Indian Council of Agricultural Research, which reserves the right to enter into any agreement with any organization in India or abroad, for reprography, photocopying, storage and dissemination of information. The Council has no objection to using the material, provided the information is not being utilized for commercial purposes and wherever the information is being used, proper credit is given to ICAR.