Comparative Analysis of Distance Metrics in KNN and SMOTE Algorithms for Software Defect Prediction

Khusnul Rahmi Maulidha, Mohammad Reza Faisal, Setyo Wahyu Saputro, Friska Abadi, Dodon Turianto Nugrahadi, Puput Dani Prasetyo Adi, Hariyady Hariyady

Abstract


As the complexity and scale of software projects increase, new challenges arise in handling software defects. One solution is machine learning-based software defect prediction, such as the K-Nearest Neighbors (KNN) algorithm. However, KNN's performance can be hindered by its majority-vote mechanism and by the choice of distance/similarity metric, especially when applied to imbalanced datasets. This research compares the effectiveness of the Euclidean, Hamming, Cosine, and Canberra distance metrics on KNN performance, both before and after applying SMOTE (Synthetic Minority Over-sampling Technique). Results show significant improvements in AUC and F1 across various datasets after applying SMOTE. With SMOTE, Euclidean distance produced an AUC of 0.7752 and an F1 of 0.7311 on the EQ dataset. With Canberra distance and SMOTE, the JDT dataset reached an AUC of 0.7707 and an F1 of 0.6342. The LC dataset improved to 0.6752 and 0.3733, while the ML dataset climbed to 0.6845 and 0.4261 with Canberra distance. Lastly, with SMOTE, the PDE dataset improved to 0.6580 and 0.3957 with Canberra distance. The findings confirm that SMOTE, combined with a suitable distance metric, significantly boosts KNN's prediction accuracy (P = 0.0001).
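The pipeline the abstract describes can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: a KNN classifier with a pluggable distance function (Euclidean, Canberra, or Cosine, as compared in the paper) and a bare-bones SMOTE that oversamples the minority (defective) class by interpolating between minority neighbors. The cluster data and parameter values in the demo are invented for illustration.

```python
import numpy as np

# Three of the distance metrics compared in the paper.
def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def canberra(a, b):
    denom = np.abs(a) + np.abs(b)
    mask = denom > 0  # skip components where both values are zero
    return np.sum(np.abs(a - b)[mask] / denom[mask])

def cosine(a, b):
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - np.dot(a, b) / (na * nb)

def knn_predict(X_train, y_train, X_test, k=5, metric=euclidean):
    """Classify each test point by majority vote among its k nearest neighbors."""
    preds = []
    for x in X_test:
        dists = np.array([metric(x, xt) for xt in X_train])
        nearest_labels = y_train[np.argsort(dists)[:k]]
        preds.append(np.bincount(nearest_labels).argmax())
    return np.array(preds)

def smote(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples: pick a minority point,
    pick one of its k nearest minority neighbors, interpolate between them."""
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        dists = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(dists)[1:k + 1]  # [1:] skips the point itself
        j = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Toy imbalanced example: 6 non-defective (0) vs 5 defective (1) samples.
X_maj = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                  [0.3, 0.2], [0.2, 0.4], [0.4, 0.1]])
X_min = np.array([[5.0, 5.0], [5.2, 5.1], [5.1, 5.3], [4.9, 5.2], [5.3, 4.9]])
syn = smote(X_min, n_new=4, k=3, rng=0)        # balance the classes
X_train = np.vstack([X_maj, X_min, syn])
y_train = np.array([0] * 6 + [1] * (5 + 4))
preds = knn_predict(X_train, y_train,
                    np.array([[0.1, 0.1], [5.0, 5.0]]), k=3, metric=canberra)
```

In the paper's actual experiments the oversampling is applied to the training folds of the defect datasets (EQ, JDT, LC, ML, PDE) and performance is scored with AUC and F1; this sketch only shows the mechanics of swapping the distance metric and rebalancing with SMOTE.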

Keywords


Software Defect Prediction; SMOTE; KNN Algorithm; Distance Metrics


References


Alfeilat, H. A. A., Hassanat, A. B. A., Lasassmeh, O., Tarawneh, A. S., Alhasanat, M. B., Salman, H. S. E., & Prasath, V. B. S. (2019). Effects of Distance Measure Choice on K-Nearest Neighbor Classifier Performance: A Review. Big Data. Retrieved July 26, 2024, from https://www.liebertpub.com/doi/abs/10.1089/big.2018.0175.

Bala, Y. Z., Samat, P. A., Sharif, K. Y., & Manshor, N. (2024). The influence of machine learning on the predictive performance of cross-project defect prediction: empirical analysis. TELKOMNIKA (Telecommunication Computing Electronics and Control), 22(4), 830–837.

Bowers, A. J., & Zhou, X. (2019). Receiver Operating Characteristic (ROC) Area Under the Curve (AUC): A Diagnostic Measure for Evaluating the Accuracy of Predictors of Education Outcomes. Journal of Education for Students Placed at Risk (JESPAR), 24(1), 20–46. Routledge.

Chakraborty, J., Majumder, S., & Menzies, T. (2021). Bias in machine learning software: Why? how? what to do? Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2021 (pp. 429–440). New York, NY, USA: Association for Computing Machinery. Retrieved August 24, 2024, from https://dl.acm.org/doi/10.1145/3468264.3468537

D’Ambros, M., Lanza, M., & Robbes, R. (2010). An extensive comparison of bug prediction approaches. Retrieved July 26, 2024, from https://www.researchgate.net/publication/221657038_An_extensive_comparison_of_bug_prediction_approaches

Dudjak, M., & Martinović, G. (2020). In-Depth Performance Analysis of SMOTE-Based Oversampling Algorithms in Binary Classification. International journal of electrical and computer engineering systems, 11(1), 13–23. Sveučilišta Josipa Jurja Strossmayera u Osijeku, Elektrotehnički fakultet.

Ha, D.-A., Chen, T.-H., & Yuan, S.-M. (2019). Unsupervised methods for Software Defect Prediction. Proceedings of the 10th International Symposium on Information and Communication Technology, SoICT ’19 (pp. 49–55). New York, NY, USA: Association for Computing Machinery. Retrieved August 24, 2024, from https://doi.org/10.1145/3368926.3369711

Giray, G., Bennin, K. E., Köksal, Ö., Babur, Ö., & Tekinerdogan, B. (2023). On the use of deep learning in software defect prediction. Journal of Systems and Software, 195, 111537.

Hidayati, N., & Hermawan, A. (2021). K-Nearest Neighbor (K-NN) algorithm with Euclidean and Manhattan in classification of student graduation. Journal of Engineering and Applied Technology, 2(2). Retrieved August 24, 2024, from https://journal.uny.ac.id/index.php/jeatech/article/view/42777

Huang, C., Li, Y., Loy, C. C., & Tang, X. (2020). Deep Imbalanced Learning for Face Recognition and Attribute Prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(11), 2781–2794. Presented at the IEEE Transactions on Pattern Analysis and Machine Intelligence.

Iqbal, A., Aftab, S., Ali, U., Nawaz, Z., Sana, L., Ahmad, M., & Husen, A. (2019). Performance Analysis of Machine Learning Techniques on Software Defect Prediction using NASA Datasets. International Journal of Advanced Computer Science and Applications, 10, 300–308.

Javed, K., Shengbing, R., Asim, M., & Wani, M. A. (2024). Cross-Project Defect Prediction Based on Domain Adaptation and LSTM Optimization. Algorithms, 17(5), 175. Retrieved July 26, 2024, from https://www.mdpi.com/1999-4893/17/5/175

Jin, C. (2021). Cross-project software defect prediction based on domain adaptation learning and optimization. Expert Systems with Applications, 171, 114637.

Kaope, C., & Pristyanto, Y. (2023). The Effect of Class Imbalance Handling on Datasets Toward Classification Algorithm Performance. MATRIK : Jurnal Manajemen, Teknik Informatika dan Rekayasa Komputer, 22(2), 227–238.

Kumar, P., Bhatnagar, R., Gaur, K., & Bhatnagar, A. (2021). Classification of Imbalanced Data: Review of Methods and Applications. IOP Conference Series: Materials Science and Engineering, 1099(1), 012077. IOP Publishing.

Kumar, P. S., Nayak, J., & Behera, H. S. (2022). Model-based Software Defect Prediction from Software Quality Characterized Code Features by using Stacking Ensemble Learning. Journal of Engineering Science and Technology Review, 15(2), 137–155.

Malhotra, R., Agrawal, V., Pal, V., & Agarwal, T. (2021). Support vector based oversampling technique for handling class imbalance in software defect prediction. 2021 11th International Conference on Cloud Computing, Data Science & Engineering (Confluence), 1078–1083.

Thota, M. K., Shajin, F. H., & Rajesh, P. (2020). Survey on software defect prediction techniques. International Journal of Applied Science and Engineering, 17(4).

Mehta, S., & Patnaik, K. S. (2021). Improved prediction of software defects using ensemble machine learning techniques. Neural Computing and Applications, 33(16), 10551–10562.

Mushtaq, Z., Yaqub, A., Sani, S., & Khalid, A. (2020). Effective K-nearest neighbor classifications for Wisconsin breast cancer data sets. Journal of the Chinese Institute of Engineers, 43(1), 80–92. Taylor & Francis.

Nayak, S., Bhat, M., Reddy, N. V. S., & Rao, B. A. (2022). Study of distance metrics on k—Nearest neighbor algorithm for star categorization. Journal of Physics: Conference Series, 2161(1), 012004. IOP Publishing.

Pertiwi, A. G., Bachtiar, N., Kusumaningrum, R., Waspada, I., & Wibowo, A. (2020). Comparison of performance of k-nearest neighbor algorithm using smote and k-nearest neighbor algorithm without smote in diagnosis of diabetes disease in balanced data. Journal of Physics: Conference Series, 1524(1), 012048. IOP Publishing.

Prasetya, J., & Abdurakhman, A. (2023). Comparison of SMOTE Random Forest and SMOTE K-Nearest Neighbors Classification Analysis on Imbalanced Data. MEDIA STATISTIKA, 15(2), 198–208. Department of Statistics, Faculty of Science and Mathematics, Universitas Diponegoro.

Prusty, S., Patnaik, S., & Dash, S. K. (2022). SKCV: Stratified K-fold cross-validation on ML classifiers for predicting cervical cancer. Frontiers in Nanotechnology. Retrieved July 26, 2024, from https://www.frontiersin.org/journals/nanotechnology/articles/10.3389/fnano.2022.972421/full

Reddivari, S., & Raman, J. (2019). Software Quality Prediction: An Investigation Based on Machine Learning. 2019 IEEE 20th International Conference on Information Reuse and Integration for Data Science (IRI) (pp. 115–122). Presented at the 2019 IEEE 20th International Conference on Information Reuse and Integration for Data Science (IRI), Los Angeles, CA, USA: IEEE. Retrieved August 24, 2024, from https://ieeexplore.ieee.org/document/8843447/

Rehman, H. A. U., Lin, C.-Y., & Mushtaq, Z. (2021). Effective K-Nearest Neighbor Algorithms Performance Analysis of Thyroid Disease. Journal of the Chinese Institute of Engineers, 44(1), 77–87.

Sun, B., & Chen, H. (2021). A Survey of k Nearest Neighbor Algorithms for Solving the Class Imbalanced Problem. Retrieved July 26, 2024, from https://onlinelibrary.wiley.com/doi/10.1155/2021/5520990

Suyanto, S., Yunanto, P. E., Wahyuningrum, T., & Khomsah, S. (2022). A multi-voter multi-commission nearest neighbor classifier. Journal of King Saud University—Computer and Information Sciences, 34(8, Part B), 6292–6302.

Taunk, K., De, S., Verma, S., & Swetapadma, A. (2019). A Brief Review of Nearest Neighbor Algorithm for Learning and Classification. 2019 International Conference on Intelligent Computing and Control Systems (ICCS) (pp. 1255–1260). Presented at the 2019 International Conference on Intelligent Computing and Control Systems (ICCS). Retrieved July 26, 2024, from https://ieeexplore.ieee.org/document/9065747

Tsalera, E., Papadakis, A., & Samarakou, M. (2020). Monitoring, profiling and classification of urban environmental noise using sound characteristics and the KNN algorithm. Energy Reports, Technologies and Materials for Renewable Energy, Environment and Sustainability, 6, 223–230.

Uddin, S., Haque, I., Lu, H., Moni, M. A., & Gide, E. (2022). Comparative performance analysis of K-nearest neighbour (KNN) algorithm and its different variants for disease prediction. Scientific Reports, 12(1), 6256. Nature Publishing Group.

Vandewiele, G., Dehaene, I., Kovács, G., Sterckx, L., Janssens, O., Ongenae, F., De Backere, F., et al. (2021). Overly optimistic prediction results on imbalanced data: A case study of flaws and benefits when applying over-sampling. Artificial Intelligence in Medicine, 111, 101987.

Zhao, T., Zhang, X., & Wang, S. (2021). GraphSMOTE: Imbalanced Node Classification on Graphs with Graph Neural Networks. Proceedings of the 14th ACM International Conference on Web Search and Data Mining, WSDM ’21 (pp. 833–841). New York, NY, USA: Association for Computing Machinery. Retrieved August 24, 2024, from https://dl.acm.org/doi/10.1145/3437963.3441720

Zhao, Y., Zhu, Y., Yu, Q., & Chen, X. (2021). Cross-Project Defect Prediction Method Based on Manifold Feature Transformation. Future Internet, 13(8), 216. Multidisciplinary Digital Publishing Institute.




DOI: http://dx.doi.org/10.35671/telematika.v18i1.3008

Telematika
ISSN: 2442-4528 (online) | ISSN: 1979-925X (print)
Published by : Universitas Amikom Purwokerto
Jl. Let. Jend. POL SUMARTO Watumas, Purwonegoro - Purwokerto, Indonesia


This work is licensed under a Creative Commons Attribution 4.0 International License.