
Clustering Posts in Online Discussion Forum Threads



Online discussion forums are considered a challenging repository for data mining tasks. Forums usually contain hundreds of threads, which in turn consist of hundreds, or even thousands, of posts. Clustering posts can be used to discover outlier and off-topic posts and would provide better visualization and exploration of online threads. In this paper, we propose Leader-based Post Clustering (LPC), a modification of the Leader algorithm for clustering posts in discussion-board threads. We also suggest using asymmetric pairwise distances to measure the dissimilarity between posts. We further investigate the effect of the indirect distance between posts and how to calibrate it against the direct distance. To evaluate the proposed methods, we conduct experiments on artificial and real threads extracted from the Slashdot and Ciao discussion forums. The experimental results demonstrate the effectiveness of the LPC algorithm when using a linear combination of direct and indirect distances, as well as when using an averaging approach to compute a representative indirect distance. Furthermore, the results show the potential of the LPC algorithm for detecting off-topic or outlier posts compared with two state-of-the-art methods for off-topic post detection.
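For readers unfamiliar with the underlying technique, the following is a minimal Python sketch of the classic single-pass Leader algorithm driven by an asymmetric distance and a linear combination of direct and indirect distances. It is not the LPC algorithm itself, whose details the abstract does not give. The names `asymmetric_distance`, `combined_distance`, and `leader_clustering`, the term-overlap distance, the weight `alpha`, the `threshold`, and the definition of the indirect distance as an average over two-hop paths are all assumptions made for illustration.

```python
from typing import List, Set


def asymmetric_distance(a: Set[str], b: Set[str]) -> float:
    """Fraction of a's terms that are missing from b (assumed measure).

    Asymmetric by construction: d(a, b) generally differs from d(b, a).
    """
    if not a:
        return 0.0
    return 1.0 - len(a & b) / len(a)


def combined_distance(i: int, j: int, posts: List[Set[str]], alpha: float) -> float:
    """alpha * direct + (1 - alpha) * indirect distance from post i to post j.

    The indirect distance here is a hypothetical choice: the average, over
    every other post k, of the mean of d(i, k) and d(k, j).
    """
    direct = asymmetric_distance(posts[i], posts[j])
    others = [k for k in range(len(posts)) if k != i and k != j]
    if not others:
        return direct
    indirect = sum(
        (asymmetric_distance(posts[i], posts[k])
         + asymmetric_distance(posts[k], posts[j])) / 2.0
        for k in others
    ) / len(others)
    return alpha * direct + (1.0 - alpha) * indirect


def leader_clustering(posts: List[Set[str]], threshold: float,
                      alpha: float = 1.0) -> List[int]:
    """Classic Leader algorithm: one pass over the posts in thread order.

    Each post joins the first leader within `threshold`; otherwise it
    becomes the leader of a new cluster. Returns one cluster label per post.
    """
    leaders: List[int] = []  # indices of the posts acting as leaders
    labels: List[int] = []
    for i in range(len(posts)):
        assigned = None
        for cluster_id, leader in enumerate(leaders):
            if combined_distance(i, leader, posts, alpha) <= threshold:
                assigned = cluster_id
                break
        if assigned is None:
            leaders.append(i)
            assigned = len(leaders) - 1
        labels.append(assigned)
    return labels


if __name__ == "__main__":
    thread = [
        {"linux", "kernel", "patch"},
        {"linux", "kernel", "driver"},
        {"pizza", "recipe"},
        {"kernel", "patch", "linux", "bug"},
    ]
    print(leader_clustering(thread, threshold=0.5, alpha=1.0))  # [0, 0, 1, 0]
```

In this toy thread the third post shares no terms with the first leader, so it ends up alone in its own cluster. Treating such singleton clusters as candidate off-topic or outlier posts is one simple way to use the clustering for the detection task the abstract describes, though the paper's own criterion may differ.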

Keywords

Distance Metrics, Clustering, Outlier Detection, Off-Topic Detection, Online Forums Mining.
Authors

Dina Said
Department of Computer Science, University of Calgary, Calgary, Canada
Nayer Wanas
Informatics Department, Electronics Research Institute, Giza, Egypt
