Handling Continuous Features with Feature Discretization Based on Expectation Maximization Clustering for Spam Email Classification Using the ID3 Algorithm

Safuan, Romi Satria Wahono, Catur Supriyanto

Abstract


Use of the internet is growing rapidly, and one of its most common applications is electronic mail (email). Spam email has recently drawn wide attention. Spam is unsolicited, unwanted email sent in bulk by strangers to mailing lists, usually with commercial intent. Spam reduces employee productivity because time must be spent deleting spam messages. Addressing this problem requires an email filter that detects spam so that it does not appear in the inbox. Many researchers have built email filters with a variety of methods, but none has achieved maximal accuracy. This study performs classification with the Iterative Dichotomiser 3 (ID3) decision tree algorithm, because ID3 is the most widely used decision tree algorithm and is known for fast classification, strong learning ability, and simple construction. However, ID3 cannot handle continuous features, so classification cannot be performed on them directly. In this study, feature discretization based on Expectation Maximization (EM) clustering is used to convert continuous features into discrete features so that spam email classification can be carried out. Experimental results show that ID3 classifies spam email with an accuracy of 91.96% when 90% of the data is used for training. This is an improvement of 28.05% over ID3 classification using binning.
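
The discretization step can be illustrated with a minimal sketch. It assumes a one-dimensional Gaussian mixture fitted by EM to a single continuous feature, with each value then replaced by the index of its most probable cluster. The function name `em_discretize`, the cluster count `k`, the iteration count `iters`, and the toy values are hypothetical choices made for illustration; the abstract does not state the configuration used in the experiments.

```python
import math


def em_discretize(values, k=2, iters=50):
    """Fit a 1-D Gaussian mixture with EM and return one cluster label per value.

    Illustrative sketch only: k and iters are assumed, not taken from the paper.
    Labels are ordered by cluster mean, so 0 is the lowest-valued cluster.
    """
    lo, hi = min(values), max(values)
    # Deterministic start: spread the k means evenly over the data range.
    means = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    spread = (hi - lo) / k or 1.0
    variances = [spread ** 2] * k
    weights = [1.0 / k] * k

    def density(x, mean, var):
        return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

    def responsibilities(x):
        # Posterior probability of each cluster given x.
        p = [w * density(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(p) or 1e-300  # guard against underflow to zero
        return [pi / total for pi in p]

    for _ in range(iters):
        # E-step: soft-assign every value to every cluster.
        resp = [responsibilities(x) for x in values]
        # M-step: re-estimate weight, mean, and variance of each cluster.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # leave an empty cluster unchanged
            weights[j] = nj / len(values)
            means[j] = sum(r[j] * x for r, x in zip(resp, values)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, values)) / nj
            variances[j] = max(var, 1e-6)  # floor keeps the density finite

    # Rank clusters by mean so the discrete labels are ordinal.
    order = sorted(range(k), key=lambda j: means[j])
    rank = {cluster: position for position, cluster in enumerate(order)}

    labels = []
    for x in values:
        r = responsibilities(x)
        labels.append(rank[r.index(max(r))])
    return labels


if __name__ == "__main__":
    # Toy feature with two well-separated groups of values.
    feature = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
    print(em_discretize(feature, k=2))
```

Applied column by column, a routine of this kind yields categorical attributes that ID3 can split on. Unlike equal-width binning, the interval boundaries follow the density of the data rather than fixed cut points.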







Journal of Intelligent Systems (JIS, ISSN 2356-3982)
Copyright © 2015 IlmuKomputer.Com. All rights reserved.