Pendekatan Level Data untuk Menangani Ketidakseimbangan Kelas pada Prediksi Cacat Software

Aries Saifudin; Romi Satria Wahono

Pendekatan Level Data untuk Menangani Ketidakseimbangan Kelas pada Prediksi Cacat Software

Aries Saifudin, Romi Satria Wahono

Abstract

Dataset software metrics secara umum bersifat tidak seimbang, hal ini dapat menurunkan kinerja model prediksi cacat software karena cenderung menghasilkan prediksi kelas mayoritas. Secara umum ketidakseimbangan kelas dapat ditangani dengan dua pendekatan, yaitu level data dan level algoritma. Pendekatan level data ditujukan untuk memperbaiki keseimbangan kelas, sedangkan pendekatan level algoritma ditujukan untuk memperbaiki algoritma atau menggabungkan (ensemble) pengklasifikasi agar lebih konduktif terhadap kelas minoritas. Pada penelitian ini diusulkan pendekatan level data dengan resampling, yaitu random oversampling (ROS), dan random undersampling (RUS), dan mensintesis menggunakan algoritma FSMOTE. Pengklasifikasi yang digunakan adalah Naϊve Bayes. Hasil penelitian menunjukkan bahwa model FSMOTE+NB merupakan model pendekatan level data terbaik pada prediksi cacat software karena nilai sensitivitas dan G-Mean model FSMOTE+NB meningkat secara signifikan, sedangkan model ROS+NB dan RUS+NB tidak meningkat secara signifikan.

Full Text:

PDF

References

Anantula, P. R., & Chamarthi, R. (2011). Defect Prediction and Analysis Using ODC Approach in a Web Application. (IJCSIT) International Journal of Computer Science and Information Technologies, 2(5), 2242-2245.

Attenberg, J., & Ertekin, S. (2013). Class Imbalance and Active Learning. In H. He, & Y. Ma, Imbalanced Learning: Foundations, Algorithms, and Applications (pp. 101-149). New Jersey: John Wiley & Sons.

Batuwita, R., & Palade, V. (2010). Efficient Resampling Methods for Training Support Vector Machines with Imbalanced Datasets. Proceedings of the International Joint Conference on Neural Networks (IJCNN) (pp. 1-8). Barcelona: IEEE Computer Society. doi:10.1109/IJCNN.2010.5596787

Bramer, M. (2007). Principles of Data Mining. London: Springer.

Carver, R. H., & Nash, J. G. (2012). Doing Data Analysis with SPSS® Version 18. Boston: Cengage Learning.

Catal, C. (2012). Performance Evaluation Metrics for Software Fault Prediction Studies. Acta Polytechnica Hungarica, 9(4), 193-206.

Chis, M. (2008). Evolutionary Decision Trees and Software Metrics for Module Defects Identification. World Academy of Science, Engineering and Technology, 273-277.

Corder, G. W., & Foreman, D. I. (2009). Nonparametric Statistics for Non-statisticians: A Step-by-step Approach. New Jersey: John Wiley & Sons.

Demšar, J. (2006). Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of Machine Learning Research, 1–30.

Dubey, R., Zhou, J., Wang, Y., Thompson, P. M., & Ye, J. (2014). Analysis of Sampling Techniques for Imbalanced Data: An n = 648 ADNI Study. NeuroImage, 220–241.

Fakhrahmad, S. M., & Sami, A. (2009). Effective Estimation of Modules' Metrics in Software Defect Prediction. Proceedings of the World Congress on Engineering (pp. 206-211). London: Newswood Limited.

Galar, M., Fernández, A., Barrenechea, E., & Herrera, F. (2013). EUSBoost: Enhancing Ensembles for Highly Imbalanced Data-sets by Evolutionary Under Sampling. Pattern Recognition, 3460–3471.

García, S., Fernández, A., Luengo, J., & Herrera, F. (2010). Advanced Nonparametric Tests for Multiple Comparisons in the Design of Experiments in Computational Intelligence and Data Mining: Experimental Analysis of Power. Information Sciences, 2044–2064.

Gayatri, N., Nickolas, S., Reddy, A., & Chitra, R. (2009). Performance Analysis Of Data Mining Algorithms for Software Quality Prediction. International Conference on Advances in Recent Technologies in Communication and Computing (pp. 393-395). Kottayam: IEEE Computer Society.

Gorunescu, F. (2011). Data Mining: Concepts, Models and Techniques. Berlin: Springer-Verlag.

Gray, D., Bowes, D., Davey, N., Sun, Y., & Christianson, B. (2011). The Misuse of the NASA Metrics Data Program Data Sets for Automated Software Defect Prediction. Evaluation & Assessment in Software Engineering (EASE 2011), 15th Annual Conference on, (pp. 96-103). Durham.

Hall, T., Beecham, S., Bowes, D., Gray, D., & Counsell, S. (2011). A Systematic Literature Review on Fault Prediction Performance in Software Engineering. IEEE Transactions on Software Engineering, Accepted for publication - available online, 1-31.

Han, J., Kamber, M., & Pei, J. (2011). Data Mining: Concepts and Techniques (3rd ed.). San Francisco: Morgan Kaufmann Publishers Inc.

Huck, S. W. (2012). Reading Statistics and Research. Boston: Pearson Education.

In, H. P., Baik, J., Kim, S., Yang, Y., & Boehm, B. (2006, December). A Quality-Based Cost Estimation Model for the Product Line Life Cycle. Communications of the ACM, 49(12), 85-88. doi:10.1145/1183236.1183273

Japkowicz, N. (2013). Assessment Metrics for Imbalanced Learning. In H. He, & Y. Ma, Imbalanced Learning: Foundations, Algorithms, and Applications (pp. 187-206). New Jersey: John Wiley & Sons.

Jones, C. (2013). Software Defect Origins and Removal Methods. Namcook Analytics.

Khoshgoftaar, T. M., Gao, K., & Seliya, N. (2010). Attribute Selection and Imbalanced Data: Problems in Software Defect Prediction. International Conference on Tools with Artificial Intelligence (pp. 137-144). IEEE Computer Society.

Lehtinen, T. O., Mäntylä, M. V., Vanhanen, J., Itkonen, J., & Lassenius, C. (2014). Perceived Causes of Software Project Failures - An Analysis of Their Relationships. Information and Software Technology, 623–643.

Lessmann, S., Baesens, B., Mues, C., & Pietsch, S. (2008). Benchmarking Classification Models for Software Defect Prediction: A Proposed Framework and Novel Findings. IEEE Transactions on Software Engineering, 485-496.

Liu, X.-Y., & Zhou, Z.-H. (2013). Ensemble Methods for Class Imbalance Learning. In H. He, & Y. Ma, Imbalanced Learning: Foundations, Algorithms, and Applications (pp. 61-82). New Jersey: John Wiley & Sons.

López, V., Fernández, A., & Herrera, F. (2014). On the Importance of the Validation Technique for Classification with Imbalanced Datasets: Addressing Covariate Shift when Data is Skewed. Information Sciences, 1–13. doi:10.1016/j.ins.2013.09.038

McDonald, M., Musson, R., & Smith, R. (2008). The Practical Guide to Defect Prevention. Washington: Microsoft Press.

Menzies, T., Greenwald, J., & Frank, A. (2007). Data Mining Static Code Attributes to Learn Defect Predictors. IEEE Transactions on Software Engineering, 1-12.

Myrtveit, I., Stensrud, E., & Shepperd, M. (2005). Reliability and Validity in Comparative Studies of Software Prediction Models. IEEE Transactions on Software Engineering, 380-391.

Peng, Y., & Yao, J. (2010). AdaOUBoost: Adaptive Over-sampling and Under-sampling to Boost the Concept Learning in Large Scale Imbalanced Data Sets. Proceedings of the international conference on Multimedia information retrieval (pp. 111-118). Philadelphia, Pennsylvania, USA: ACM.

Riquelme, J. C., Ruiz, R., Rodriguez, D., & Moreno, J. (2008). Finding Defective Modules From Highly Unbalanced Datasets. Actas de los Talleres de las Jornadas de Ingeniería del Software y Bases de Datos (pp. 67-74). Gijon, España: SISTEDES.

Rodriguez, D., Herraiz, I., Harrison, R., Dolado, J., & Riquelme, J. C. (2014). Preliminary Comparison of Techniques for Dealing with Imbalance in Software Defect Prediction. 18th International Conference on Evaluation and Assessment in Software Engineering (EASE 2014) (pp. 371-380). New York: ACM. doi:10.1145/2601248.2601294

Seiffert, C., Khoshgoftaar, T. M., Hulse, J. V., & Folleco, A. (2011). An Empirical Study of the Classification Performance of Learners on Imbalanced and Noisy Software Quality Data. Information Sciences, 1-25.

Seiffert, C., Khoshgoftaar, T. M., Hulse, J. V., & Napolitano, A. (2008). Building Useful Models from Imbalanced Data with Sampling and Boosting. Proceedings of the Twenty-First International Florida Artificial Intelligence Research Society Conference (pp. 306-311). California: AAAI Press.

Shepperd, M., Song, Q., Sun, Z., & Mair, C. (2013). Data Quality: Some Comments on the NASA Software Defect Data Sets. IEEE Transactions on Software Engineering, 1208-1215. doi:10.1109/TSE.2013.11

Shull, F., Basili, V., Boehm, B., Brown, A. W., Costa, P., Lindvall, M., . . . Zelkowitz, M. (2002). What We Have Learned About Fighting Defects. METRICS '02 Proceedings of the 8th International Symposium on Software Metrics (pp. 249-258). Washington: IEEE Computer Society.

Song, Q., Jia, Z., Shepperd, M., Ying, S., & Liu, J. (2011). A General Software Defect-Proneness Prediction Framework. IEEE Transactions on Software Engineering, 356-370.

Strangio, M. A. (2009). Recent Advances in Technologies. Vukovar: In-Teh.

Sun, Y., Mohamed, K. S., Wong, A. K., & Wang, Y. (2007). Cost-sensitive Boosting for Classification of Imbalanced Data. Pattern Recognition Society, 3358-3378.

Tao, W., & Wei-hua, L. (2010). Naïve Bayes Software Defect Prediction Model. Computational Intelligence and Software Engineering (CiSE) (pp. 1-4). Wuhan: IEEE Computer Society. doi:10.1109/CISE.2010.5677057

Turhan, B., & Bener, A. (2007). Software Defect Prediction: Heuristics for Weighted Naive Bayes. Proceedings of the 2nd International Conference on Software and Data Technologies (ICSOFT'07), (pp. 244-249).

Wahono, R. S., Suryana, N., & Ahmad, S. (2014, May). Metaheuristic Optimization based Feature Selection for Software Defect Prediction. Journal of Software, 9(5), 1324-1333. doi:10.4304/jsw.9.5.1324-1333

Wang, S., & Yao, X. (2013). Using Class Imbalance Learning for Software Defect Prediction. IEEE Transactions on Reliability, 434-443.

Weiss, C., Premraj, R., Zimmermann, T., & Zeller, A. (2007). How Long will it Take to Fix This Bug? Fourth International Workshop on Mining Software Repositories (MSR'07) (pp. 1-8). Washington: IEEE Computer Society.

Weiss, G. M. (2013). Foundations of Imbalanced Learning. In H. He, & Y. Ma, Imbalanced Learning: Foundations, Algorithms, and Applications (pp. 13-41). New Jersey: John Wiley & Sons.

Yap, B. W., Rani, K. A., Rahman, H. A., Fong, S., Khairudin, Z., & Abdullah, N. N. (2014). An Application of Oversampling, Undersampling, Bagging and Boosting in Handling Imbalanced Datasets. Proceedings of the First International Conference on Advanced Data and Information Engineering (DaEng-2013). 285, pp. 13-22. Singapore: Springer. doi:10.1007/978-981-4585-18-7_2

Zhang, D., Liu, W., Gong, X., & Jin, H. (2011). A Novel Improved SMOTE Resampling Algorithm Based on Fractal. Computational Information Systems, 2204-2211.

Zhang, H., & Wang, Z. (2011). A Normal Distribution-Based Over-Sampling Approach to Imbalanced Data Classification. Advanced Data Mining and Applications - 7th International Conference (pp. 83-96). Beijing: Springer.

Zhang, H., Jiang, L., & Su, J. (2005). Augmenting Na?ve Bayes for Ranking. ICML '05 Proceedings of the 22nd international conference on Machine learning (pp. 1020 - 1027). New York: ACM Press. doi:http://dx.doi.org/10.1145/1102351.1102480