Abstract—In this paper, we develop LLDA, an approach for
semi-supervised document clustering based on Latent Dirichlet
Allocation (LDA). A small number of labeled
documents are used to indicate the user's document grouping
preference. A generative model is investigated to jointly model
documents and the small number of document labels. A
variational inference algorithm is developed to infer the
document collection structure. We explore the performance of
our proposed approach on both a synthetic dataset and realistic
document datasets. Our experiments indicate that our proposed
approach performs well at grouping documents according to
different user grouping preferences. A comparison between
our proposed approach and state-of-the-art semi-supervised
clustering algorithms that use labeled instances shows that our
approach is effective.
Index Terms—Semi-supervised clustering, document
clustering, latent Dirichlet allocation, generative model.
The authors are with the College of Computer Science and Technology,
Guizhou University, Guiyang 550025, China (corresponding author: Li
Zhang; e-mail: cse.rzhuang@gzu.edu.cn, gs.pzhou11@mail.gzu.edu.cn,
lizhang_2004@126.com).
Cite: Ruizhang Huang, Ping Zhou, and Li Zhang, "A LDA-Based Approach for Semi-Supervised Document Clustering," International Journal of Machine Learning and Computing, vol. 4, no. 4, pp. 313-318, 2014.