Imbalanced Credit Risk Prediction in Ensemble Learning Classifiers: A Comparative Analysis of SMOTE, ADASYN, SMOTETomek, and Cluster Centroids
Abstract
Financial institutions' ability to predict customer credit risk has improved with continuing advances in machine learning algorithms. Ensemble algorithms such as Random Forest and XGBoost build on the strengths of decision tree structures and perform well on datasets with complex characteristics, such as credit data. In real credit datasets, however, defaulted samples usually account for only a small percentage of observations, and this class imbalance greatly weakens the predictive performance of ensemble models. This paper therefore applies four techniques for handling class imbalance: SMOTE, ADASYN, SMOTETomek, and Cluster Centroids. Sampling design: the empirical analysis uses the German Credit dataset, a benchmark dataset comprising 1000 samples and 18 features. Measurement design: accuracy, precision, recall, and F1-score are used to assess the efficacy of the balancing techniques. Analysis design: a comparative analysis evaluates the strengths and weaknesses of the balancing techniques across different tree-based ensemble classifiers. The results show that all models perform much better on the balanced datasets than on the unbalanced data. Balancing the data in advance improves the models' predictive ability, and the over-sampling and integrated sampling methods outperform the under-sampling technique on small and medium-sized datasets.
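To make the over-sampling idea concrete, the following is a minimal sketch of SMOTE-style interpolation using only the Python standard library. It is not the paper's implementation: the function names `smote` and `balance_with_smote`, the toy data, and the choice to pick base points at random (rather than looping over every minority sample as in the original SMOTE description) are all assumptions made for illustration.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority neighbours.

    Simplified sketch: base points are drawn at random from the minority class.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the base point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        # Choose one of the k nearest neighbours at random.
        neighbour = rng.choice(others[:k])
        # Place the new point at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def balance_with_smote(X, y, minority_label=1, k=5, seed=0):
    """Over-sample the minority class until both classes have equal counts."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_majority = len(y) - len(minority)
    n_new = max(0, n_majority - len(minority))
    new_points = smote(minority, n_new, k=k, seed=seed)
    return X + new_points, y + [minority_label] * len(new_points)


# Hypothetical toy data: 8 non-defaulted (0) and 3 defaulted (1) borrowers.
X = [[1.0, 2.0], [1.2, 1.9], [0.9, 2.1], [1.1, 2.2], [1.3, 2.0],
     [0.8, 1.8], [1.0, 2.3], [1.4, 2.1], [4.0, 5.0], [4.2, 5.1], [3.9, 4.8]]
y = [0] * 8 + [1] * 3

X_bal, y_bal = balance_with_smote(X, y, minority_label=1, k=2)
print(y_bal.count(0), y_bal.count(1))  # prints: 8 8
```

Resampling of this kind is normally applied to the training split only, so that precision, recall, and F1-score are measured against the original class ratio. ADASYN follows the same interpolation idea but generates more synthetic points for minority samples that are harder to learn, SMOTETomek additionally removes Tomek links after over-sampling, and Cluster Centroids instead shrinks the majority class.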
Article Details
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Views and opinions appearing in articles in the Journal of Arts of Management are the responsibility of the author of the article and do not constitute the view or responsibility of the editorial team. I agree that the article is the copyright of the Arts and Management Journal.