Abstract—Web spam detection has become one of the most important tasks for web search engines. It is a class-imbalance problem because normal pages far outnumber spam pages, and most traditional learning methods are not effective on imbalanced classification problems. To tackle this problem and make full use of the various features extracted from web pages’ content and links, this paper presents an ensemble classifier based on under-sampling and feature-partition techniques, with the C4.5 decision tree algorithm integrated as the sub-classifier to detect web spam. Experimental results on the WEBSPAM-UK2006 dataset show that the ensemble classifier outperforms other approaches on several evaluation metrics, such as F1-measure and AUC.
Index Terms—Web spam detection, under-sampling, feature partition, ensemble classifier, C4.5.
Xiaoyong Lu and Musheng Chen are with Nanchang University, China (e-mail: lxy@ncu.edu.cn, dreaminit@gmail.com).
Jhenglong Wu and Peichan Chan are with the Information Management Department, Yuan Ze University, Taiwan (e-mail: jlwu.yzu@gmail.com, iepchang@saturn.yzu.edu.tw).
Cite: Xiaoyong Lu, Musheng Chen, Jhenglong Wu, and Peichan Chan, "A Feature-Partition and Under-Sampling Based Ensemble Classifier for Web Spam Detection," International Journal of Machine Learning and Computing, vol. 5, no. 6, pp. 454-457, 2015.