Application of the Average Gain Method, Threshold Pruning, and Cost Complexity Pruning for Attribute Splitting in the C4.5 Algorithm

Erna Sri Rahayu, Romi Satria Wahono, Catur Supriyanto

Abstract


C4.5 is a supervised learning classifier that builds a decision tree from data. Attribute splitting is the main process in forming a decision tree in C4.5. The split criterion in C4.5 does not take misclassification cost into account at each split, which affects the performance of the classifier. After attribute splitting, the next process is pruning, which cuts or eliminates unnecessary branches. Unneeded branches or nodes can make the decision tree very large, a condition called over-fitting. Over-fitting remains an active research problem. Methods for attribute splitting include the Gini Index, Information Gain, Gain Ratio, and Average Gain, which was proposed by Mitchell. Average Gain not only overcomes the weakness of Information Gain but also helps to solve the problems of Gain Ratio. The attribute split method proposed in this research uses the average gain value multiplied by the difference in misclassification. Pruning is done by combining threshold pruning and cost complexity pruning. The proposed method is tested on datasets, and its performance is compared with that of split methods using the Gini Index, Information Gain, and Gain Ratio. Selecting split attributes using average gain multiplied by the difference in misclassification can improve the classification performance of C4.5. The Friedman test shows that the proposed attribute split method, combined with threshold pruning and cost complexity pruning, ranks first in accuracy. The decision trees formed by the proposed method are also smaller.
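The split criterion described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so the definitions here are assumptions: average gain is taken as information gain divided by the number of distinct attribute values, and the "difference of misclassification" is taken as the number of majority-class errors before the split minus the number after it. All function names are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def partition(values, labels):
    """Group class labels by the attribute value of each example."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return groups


def information_gain(values, labels):
    """Entropy of the parent node minus the weighted entropy of its children."""
    n = len(labels)
    groups = partition(values, labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def average_gain(values, labels):
    """Assumed definition: information gain divided by the number of distinct values."""
    return information_gain(values, labels) / len(set(values))


def misclassification_errors(labels):
    """Examples misclassified when the node predicts its majority class."""
    return len(labels) - max(Counter(labels).values())


def misclassification_difference(values, labels):
    """Assumed definition: errors before the split minus errors after it."""
    before = misclassification_errors(labels)
    after = sum(misclassification_errors(g) for g in partition(values, labels).values())
    return before - after


def split_score(values, labels):
    """Average gain multiplied by the misclassification difference."""
    return average_gain(values, labels) * misclassification_difference(values, labels)


if __name__ == "__main__":
    labels = ["yes", "yes", "no", "no"]
    useful = ["a", "a", "b", "b"]    # separates the classes perfectly
    useless = ["a", "b", "a", "b"]   # carries no class information
    print(split_score(useful, labels), split_score(useless, labels))
```

Under these assumptions, the attribute with the highest `split_score` would be chosen at each node. An attribute that neither reduces entropy nor reduces majority-class errors scores zero.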

 

Keywords: Decision Tree, C4.5, split attribute, pruning, over-fitting, gain, average gain.


Journal of Intelligent Systems (JIS, ISSN 2356-3982)
Copyright © 2015 IlmuKomputer.Com. All rights reserved.