Two-Step Cluster based Feature Discretization of Naive Bayes for Outlier Detection in Intrinsic Plagiarism Detection

Adi Wijaya, Romi Satria Wahono

Abstract


Intrinsic plagiarism detection is the task of analyzing a document for undeclared changes in writing style, which are treated as outliers. Naive Bayes is often used for outlier detection. However, Naive Bayes assumes that the values of continuous features are normally distributed; this assumption is often strongly violated, which leads to low classification performance. Discretization of continuous features can improve the performance of Naive Bayes. In this study, feature discretization based on Two-Step Cluster is proposed for Naive Bayes. The proposed method uses tf-idf and a query language model as feature creators, together with a False Positive/False Negative (FP/FN) threshold intended to improve accuracy, and is evaluated on the PAN PC 2009 dataset. The results indicate that the proposed method with discrete features outperforms the same method with continuous features on all evaluation measures: recall, precision, F-measure, and accuracy. Using the FP/FN threshold also affects the results, since it decreases FP and FN and thus improves all evaluation measures.
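The general idea of cluster-based discretization followed by Naive Bayes can be illustrated with a minimal sketch. It is not the authors' implementation: a small 1-D k-means stands in for Two-Step Cluster, the feature values and labels are invented toy data, and the names `kmeans_1d`, `discretize`, and `CategoricalNB` are hypothetical helpers introduced only for this example. The FP/FN threshold is not modeled here.

```python
import math
from collections import Counter, defaultdict

def kmeans_1d(values, k, iters=50):
    """Tiny 1-D k-means. ASSUMPTION: a stand-in for Two-Step Cluster."""
    srt = sorted(values)
    # Deterministic start: spread the initial centers over the sorted values.
    centers = [srt[(2 * i + 1) * len(srt) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            groups[nearest].append(v)
        new = [sum(g) / len(g) if g else centers[j] for j, g in enumerate(groups)]
        if new == centers:
            break
        centers = new
    return sorted(centers)

def discretize(value, centers):
    """Map a continuous value to the index of its nearest cluster center."""
    return min(range(len(centers)), key=lambda j: abs(value - centers[j]))

class CategoricalNB:
    """Naive Bayes over discrete features with Laplace smoothing."""
    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.class_count = Counter(y)
        self.n = len(y)
        self.counts = defaultdict(int)          # (class, feature, bin) -> count
        self.bins = [set() for _ in X[0]]       # bins observed per feature
        for row, c in zip(X, y):
            for f, b in enumerate(row):
                self.counts[(c, f, b)] += 1
                self.bins[f].add(b)
        return self

    def predict(self, row):
        scores = {}
        for c in self.classes:
            s = math.log(self.class_count[c] / self.n)   # log prior
            for f, b in enumerate(row):
                num = self.counts.get((c, f, b), 0) + self.alpha
                den = self.class_count[c] + self.alpha * len(self.bins[f])
                s += math.log(num / den)                 # log likelihood
            scores[c] = s
        return max(scores, key=scores.get)

# Toy data (invented): one style feature per text segment.
values = [0.10, 0.12, 0.15, 0.11, 0.14, 0.13, 0.90, 0.85, 0.95]
labels = ["normal"] * 6 + ["outlier"] * 3

centers = kmeans_1d(values, k=2)
X = [[discretize(v, centers)] for v in values]
model = CategoricalNB().fit(X, labels)

print(model.predict([discretize(0.88, centers)]))
print(model.predict([discretize(0.12, centers)]))
```

Because the classifier only sees bin indices, it estimates a per-bin probability for each class instead of fitting a normal distribution to the raw values, which is the property the abstract relies on.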








Journal of Intelligent Systems (JIS, ISSN 2356-3982)
Copyright 2020 IlmuKomputer.Com. All rights reserved.