
A Novel Parallel Algorithm for Clustering Documents Based on the Hierarchical Agglomerative Approach


Affiliations
1 Computer Science Department, Helwan University, Cairo, Egypt
2 Princess Noura University (Former Riyadh University), Riyadh, Saudi Arabia
 

As the number of documents on the internet has grown, document clustering has become practically important, which has driven interest in developing document clustering algorithms. Exploiting parallelism plays an important role in achieving fast, high-quality clustering. In this paper, we propose a parallel algorithm that adopts a hierarchical agglomerative approach to document clustering. Our focus is on exploiting the available sources of parallelism to improve performance and reduce clustering time. The proposed algorithm is tested on a test-bed collection of 749 documents from CACM, using a message-passing multiprocessor system. Performance is evaluated using several measures, including average inter-cluster similarity, speedup, and processor utilization. Simulation results show that the proposed algorithm reduces clustering time and increases overall speedup while maintaining high clustering quality. As the number of processors increases, clustering time decreases up to a point beyond which additional processors yield no further benefit. Moreover, the algorithm is applicable to other domains and other document collections.

Keywords

Hierarchical Clustering, Parallel Algorithms, Simulation, Document Collection, Performance Evaluation.
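
For readers unfamiliar with the hierarchical agglomerative approach named in the abstract, the following is a minimal sequential sketch, not the parallel algorithm proposed in the paper. It assumes raw term-frequency vectors, cosine similarity, and group-average linkage, none of which the abstract specifies; the function names `cosine`, `group_average`, and `agglomerate` are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def group_average(c1, c2, vectors):
    """Mean pairwise similarity between the members of two clusters."""
    total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerate(docs, k):
    """Start with one cluster per document; merge the most similar pair until k remain."""
    # Assumption: whitespace tokenization and raw term counts as document vectors.
    vectors = [Counter(d.lower().split()) for d in docs]
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > k:
        # Score every pair of clusters; each pair is scored independently of the others.
        pairs = (
            (a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        a, b = max(
            pairs,
            key=lambda p: group_average(clusters[p[0]], clusters[p[1]], vectors),
        )
        # b > a always holds, so deleting index b leaves index a valid.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


docs = [
    "parallel clustering algorithm",
    "parallel clustering method",
    "database query language",
    "database query optimization",
]
print(agglomerate(docs, 2))  # [[0, 1], [2, 3]]
```

Because each pair is scored independently, the pairwise scan is one plausible place to divide work among processors; whether the paper parallelizes this step in this way is an assumption here, not something the abstract states.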


Authors

Amal Elsayed Aboutabl
Computer Science Department, Helwan University, Cairo, Egypt
Mohamed Nour Elsayed
Princess Noura University (Former Riyadh University), Riyadh, Saudi Arabia
