Integration of SMOTE and Information Gain in Naive Bayes for Software Defect Prediction

Sukmawati Anggraini Putri, Romi Satria Wahono

Abstract


Software plays many important roles. Ensuring its quality, for example through software testing, is therefore fundamental. However, software testing is very expensive in both cost and time, so it is important for a software development company to test software quality at minimum cost. Naive Bayes has shown good performance in software defect prediction, yielding an average probability of 71 percent. It is also a simple classifier, and its training takes less time than that of other machine learning algorithms. The NASA datasets are very popular for developing software defect prediction models; they are public and can be used freely by researchers. Previous research identifies two main issues in software defect prediction: noisy attributes and class imbalance. SMOTE (Synthetic Minority Over-sampling Technique), an oversampling technique that processes the minority (positive) class, gives good and effective results in handling class imbalance. Information Gain is used for attribute selection to handle possible noisy attributes. The experiments show that applying SMOTE and Information Gain handles class imbalance and noisy attributes in software defect prediction.
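
The two preprocessing steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy data, and the choice of `k=2` neighbours are assumptions made for illustration, and the Naive Bayes classifier itself is omitted. Information gain is computed as the reduction in class entropy after splitting on a discrete attribute, and the SMOTE-style step creates synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy reduction obtained by splitting the labels on a discrete attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Hypothetical toy data: one informative and one noisy discretised attribute.
labels = ["defect", "defect", "clean", "clean", "clean", "clean"]
informative = ["high", "high", "low", "low", "low", "low"]
noisy = ["a", "b", "a", "b", "a", "b"]

gain_informative = information_gain(informative, labels)
gain_noisy = information_gain(noisy, labels)

# Hypothetical minority-class (defective) modules described by two metrics.
minority_points = [(1.0, 2.0), (2.0, 3.0), (3.0, 2.5)]
new_points = smote(minority_points, n_new=4)
```

In this sketch an attribute with low gain, such as `noisy`, would be a candidate for removal before training, while `new_points` would be appended to the minority class to reduce the imbalance.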








Journal of Software Engineering (JSE, ISSN 2356-3974)
Copyright © 2015 IlmuKomputer.Com. All rights reserved.