Segmentation of genomic data through multivariate statistical approaches: comparative analysis


Authors

  • ARFA ANJUM Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi
  • SEEMA JAGGI Assistant Director General (HRD), Education Division, KAB II, ICAR, New Delhi
  • SHWETANK LALL Aristocrat Technologies, New Delhi
  • ELDHO VARGHESE Fishery Resources Assessment Division, ICAR-Central Marine Fisheries Research Institute, Kochi
  • ANIL RAI Assistant Director General (ICT), Krishi Bhavan, ICAR, New Delhi (Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi)
  • ARPAN BHOWMIK Division of Design of Experiments, ICAR-Indian Agricultural Statistics Research Institute, New Delhi
  • DWIJESH CHANDRA MISHRA Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi

https://doi.org/10.56093/ijas.v92i7.118040

Keywords:

Genome, Segmentation, Multivariate analysis, Sequential clustering

Abstract

Segmenting a series of measurements along a genome into regions with distinct characteristics is widely used to identify functional components of the genome. Most research on biological data segmentation addresses the statistical problem of identifying break-points (change-points) in simulated data for a single variable. Although various strategies for finding change-points in a multivariate setting have been evaluated through simulation, work on segmenting actual multivariate genomic data is limited, because genomic data sets are very large and highly variable. Therefore, a study was carried out at the ICAR-Indian Agricultural Statistics Research Institute, New Delhi, during 2021 to identify the best multivariate statistical method for dividing sequences that may influence sequence properties or function into homogeneous segments. Such segmentation reduces the volume of data and eases further analysis of the segments to determine their actual properties. Genomic data of rice (Oryza sativa L.) were used for a comparative analysis of several multivariate approaches, and agglomerative sequential clustering was found to be the most acceptable owing to its low computational cost and feasibility.
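
As an illustration of the general idea behind agglomerative sequential clustering, the sketch below starts with every position as its own segment and repeatedly merges the pair of adjacent segments whose union adds the least within-segment variation. It is a minimal sketch under stated assumptions, not the procedure used in the study: the function names, the Ward-style sum-of-squares merge criterion, and the fixed target number of segments are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple

Row = Sequence[float]


def _sse(rows: Sequence[Row]) -> float:
    """Sum of squared deviations of each variable from its segment mean."""
    n = len(rows)
    total = 0.0
    for j in range(len(rows[0])):
        mean_j = sum(r[j] for r in rows) / n
        total += sum((r[j] - mean_j) ** 2 for r in rows)
    return total


def sequential_agglomerative_segments(
    data: Sequence[Row], n_segments: int
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) bounds of contiguous segments.

    Assumption: the merge criterion is the increase in within-segment
    sum of squares (Ward-style); the source does not specify one.
    """
    if not 1 <= n_segments <= len(data):
        raise ValueError("n_segments must be between 1 and len(data)")

    # Start with one segment per position along the sequence.
    bounds = [(i, i + 1) for i in range(len(data))]

    while len(bounds) > n_segments:
        best_idx, best_cost = 0, float("inf")
        # Only neighbouring segments may merge, which keeps segments contiguous.
        for i in range(len(bounds) - 1):
            (s1, e1), (s2, e2) = bounds[i], bounds[i + 1]
            cost = _sse(data[s1:e2]) - _sse(data[s1:e1]) - _sse(data[s2:e2])
            if cost < best_cost:
                best_idx, best_cost = i, cost
        start, _ = bounds[best_idx]
        _, end = bounds[best_idx + 1]
        bounds[best_idx:best_idx + 2] = [(start, end)]

    return bounds


# Hypothetical two-variable measurements at six consecutive positions.
toy = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
print(sequential_agglomerative_segments(toy, 2))
```

Restricting merges to neighbours is what distinguishes this from ordinary hierarchical clustering: every segment remains a contiguous stretch of the sequence. This naive version recomputes segment costs at every step, so it is only suitable for small inputs; a practical implementation would cache or update those costs incrementally.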

Submitted

2021-11-18

Published

2022-03-30

Issue

Section

Articles

How to Cite

ANJUM, A., JAGGI, S., LALL, S., VARGHESE, E., RAI, A., BHOWMIK, A., & MISHRA, D. C. (2022). Segmentation of genomic data through multivariate statistical approaches: comparative analysis. The Indian Journal of Agricultural Sciences, 92(7), 892-896. https://doi.org/10.56093/ijas.v92i7.118040