Abstract
Discovering relevant features in long documents that describe user information needs is challenging because synonymy, polysemy, noise, and high dimensionality are inherent problems of text. Traditional feature selection methods cannot deal with these problems effectively, because they assume that each document describes only one topic. Topic-based techniques, such as Latent Dirichlet Allocation (LDA), relax this assumption: they are built on the premise that a document can exhibit multiple hidden topics. However, LDA does not show encouraging results in selecting relevant features, because it calculates term weights locally, within individual documents, and does not generalise them globally at the collection level. To address this problem, we propose an innovative and effective extended random set model that generalises the LDA weights of local document terms. The model is used as a weighting scheme for topical terms and assigns a more discriminative and accurate weight to these terms based on their appearance in LDA topics and relevant documents. Experimental results, based on the standard RCV1 dataset, TREC topics, and five standard performance measures, show that the proposed model significantly outperforms eight state-of-the-art baseline models in information filtering.