DDC: Deep Distribution Classifier, A Convolutional Neural Network-based Approach for Identifying Data Distributions

Samarth Godara; Avinash G; Rajender Parsad; Sudeep Marwaha

doi:10.56093/jisas.v78i02.171395

Authors

Samarth Godara, ICAR-Indian Agricultural Statistics Research Institute, New Delhi
Avinash G, ICAR-Indian Agricultural Statistics Research Institute, New Delhi
Rajender Parsad, ICAR-Indian Agricultural Statistics Research Institute, New Delhi
Sudeep Marwaha, ICAR-Indian Agricultural Statistics Research Institute, New Delhi

Keywords:

Distribution identification; Deep learning; Data distribution classification; Convolutional neural networks; Fast distribution detection.

Abstract

In domains such as the stock market and manufacturing, vast volumes of data are generated rapidly, and real-time decision-making increasingly depends on identifying data distributions quickly and accurately. Traditional methods of identifying data distributions often rely on manual inspection, a limited set of statistical tests and time-consuming analysis, which makes classification inefficient and error-prone. The presented research offers a novel approach that uses Deep Learning (DL) models to automate the process. The methodology generates synthetic data and trains a DL model on it to recognize different distribution types, enabling faster and more accurate identification. The primary objective of this study is to develop a DL model that, given an input dataset, classifies the distribution its data points follow. For model training and evaluation, a total of 1000 datasets are generated, each comprising 1000 data points. The study considers five distributions (Normal, Uniform, Exponential, Log-normal and Beta), with 200 datasets generated for each distribution using randomly selected parameters. The DL model is first trained and then evaluated on a separate, unseen test dataset, and its classification performance is assessed using metrics such as accuracy and loss. The results demonstrate that the proposed approach classifies the distribution of data points accurately, providing valuable insights into the application of DL to distribution classification tasks. The proposed method enhances scalability, robustness and efficiency by harnessing the power of convolutional neural networks and advanced preprocessing techniques.
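
The data-generation step described in the abstract can be illustrated with a short NumPy sketch. It is not the authors' code: the parameter ranges, the function names `sample_dataset` and `make_corpus`, the label order and the fixed seed are all assumptions made for illustration, since the abstract states only that parameters are randomly selected. The sketch produces 200 datasets of 1000 points for each of the five distributions, 1000 datasets in total.

```python
import numpy as np

# Label order is an assumption; the abstract only names the five distributions.
DISTRIBUTIONS = ["normal", "uniform", "exponential", "lognormal", "beta"]


def sample_dataset(name, n_points, rng):
    """Draw one dataset from the named distribution with random parameters.

    All parameter ranges below are illustrative assumptions, not values
    taken from the paper.
    """
    if name == "normal":
        return rng.normal(loc=rng.uniform(-10, 10),
                          scale=rng.uniform(0.5, 5), size=n_points)
    if name == "uniform":
        low = rng.uniform(-10, 10)
        return rng.uniform(low, low + rng.uniform(1, 10), size=n_points)
    if name == "exponential":
        return rng.exponential(scale=rng.uniform(0.5, 5), size=n_points)
    if name == "lognormal":
        return rng.lognormal(mean=rng.uniform(-1, 1),
                             sigma=rng.uniform(0.25, 1), size=n_points)
    if name == "beta":
        return rng.beta(rng.uniform(0.5, 5), rng.uniform(0.5, 5),
                        size=n_points)
    raise ValueError(f"unknown distribution: {name}")


def make_corpus(n_per_class=200, n_points=1000, seed=0):
    """Build the labelled corpus: one row per dataset, one integer label per row."""
    rng = np.random.default_rng(seed)
    n_total = len(DISTRIBUTIONS) * n_per_class
    X = np.empty((n_total, n_points))
    y = np.empty(n_total, dtype=np.int64)
    row = 0
    for label, name in enumerate(DISTRIBUTIONS):
        for _ in range(n_per_class):
            X[row] = sample_dataset(name, n_points, rng)
            y[row] = label
            row += 1
    return X, y


X, y = make_corpus()
```

Each row of `X` is one synthetic dataset and `y` holds its class index. A held-out subset of these rows would play the role of the unseen test data mentioned in the abstract, and any preprocessing applied before the convolutional network is left out here because the abstract does not specify it.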
