Abstract
Organizing web information is an important aspect of finding information in the easiest and most efficient way. We present a new method for web document clustering called WeDoCWT, which exploits the discrete wavelet transform and term signal, to improve the document representation. We studied different methods for document segmentation to construct the term signals. We used two datasets, UW-CAN and WebKB, to evaluate the proposed method. The experimental results indicated that dividing the documents into fixed segments is preferable to dividing them into logical segments based on HTML features because the web pages do not have the same structure. Mean TF–IDF reduction technique gives the best results in most cases. WeDoCWT gives [Formula: see text]-measure better than most of the previous approaches described in the literature. We used Munkres assignment algorithm to assign each produced cluster to the original class in order to evaluate the clustering results.