Performance of clustering procedures for grouping germplasms based on mixture data with missing observations

RUPAM KUMAR SARKAR; A R RAO; S D WAHI; K V BHAT

doi:10.56093/ijas.v82i12.26254

Authors

RUPAM KUMAR SARKAR Indian Agricultural Statistics Research Institute, New Delhi 110 012
A R RAO Indian Agricultural Statistics Research Institute, New Delhi 110 012
S D WAHI Indian Agricultural Statistics Research Institute, New Delhi 110 012
K V BHAT NBPGR, New Delhi

Keywords:

Cluster analysis, Imputation, Missing data, Mixture data, Qualitative traits, Quantitative traits, Random Amplified Polymorphic DNA (RAPD)

Abstract

Missing observations are common in breeding-experiment data that mix qualitative and quantitative traits, and they make it difficult to cluster germplasms. This study compared the performance of clustering procedures for mixture data across five clustering methods, six ways of imputing missing data and three levels of missing observations. All the clustering methods were found to be robust to imputation when up to 5% of the observations were missing. The INDOMIX and PRINQUAL methods, used in conjunction with k-means clustering, were found to perform better than the EM, ANN and PCAMIX methods for classifying germplasms when missing observations were imputed by (i) mean substitution in quantitative traits and frequency substitution in qualitative traits and (ii) multiple imputation in quantitative traits and 0 imputation in qualitative traits. The study was conducted during 2009–10 at the Indian Agricultural Statistics Research Institute; the data used for illustration were obtained from the National Bureau of Plant Genetic Resources.
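
As an illustration of imputation scheme (i), the following minimal Python sketch fills missing quantitative values with the column mean and missing qualitative values with the most frequent category. It is not the authors' implementation: reading "frequency substitution" as substitution of the most frequent category is an assumption, and the function name `impute_mixture`, the trait names and the toy accessions are hypothetical. The subsequent INDOMIX or PRINQUAL transformation and the k-means step are not shown.

```python
from collections import Counter
from statistics import mean


def impute_mixture(rows, quantitative, qualitative):
    """Return a copy of rows in which missing values (None) are filled in.

    Quantitative traits: mean of the observed values in that column.
    Qualitative traits: most frequent observed category (assumed reading
    of "frequency substitution"; not confirmed by the abstract).
    """
    filled = [dict(row) for row in rows]  # leave the input untouched

    for trait in quantitative:
        observed = [row[trait] for row in rows if row[trait] is not None]
        trait_mean = mean(observed)
        for row in filled:
            if row[trait] is None:
                row[trait] = trait_mean

    for trait in qualitative:
        observed = [row[trait] for row in rows if row[trait] is not None]
        most_frequent = Counter(observed).most_common(1)[0][0]
        for row in filled:
            if row[trait] is None:
                row[trait] = most_frequent

    return filled


# Hypothetical toy accessions with one quantitative and one qualitative trait.
accessions = [
    {"plant_height": 100.0, "seed_colour": "brown"},
    {"plant_height": None, "seed_colour": "brown"},
    {"plant_height": 120.0, "seed_colour": None},
    {"plant_height": 110.0, "seed_colour": "white"},
]

completed = impute_mixture(accessions, ["plant_height"], ["seed_colour"])
```

In this toy example the missing height is replaced by the mean of the three observed heights and the missing seed colour by the most common observed colour; the completed table could then be passed to whichever clustering procedure is being evaluated.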
