Abstract
Guaranteeing the quality of relevance features discovered in text documents to describe user preferences is a major challenge because of the large number of terms, patterns, and noise. Term-based approaches extract a large number of features, most of which are meaningless noise, and they suffer from polysemy and synonymy. Most popular text mining techniques consider the distribution of terms in documents and data sets when calculating term weights. This paper presents an innovative technique called Specific Feature Discovery, which uses only positive feedback to discover positive patterns as high-level features that capture the relationships among terms; the high-level patterns are then deployed to low-level terms to overcome the limitations of patterns. To improve the quality of the extracted features, high-level patterns are first extracted from the feedback documents, and low-level terms are then extracted from those patterns. The weight of each low-level term is determined by the distribution of the term in the high-level patterns and by its specificity in the positive group of documents. The specificity of a low-level term is calculated from the frequency of the term in the documents and its frequency in the extracted patterns. The technique has been tested in extensive experiments using the Reuters Corpus Volume 1 data set and Text REtrieval Conference topics. The results show that the proposed method achieves encouraging performance on all measures.
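The pipeline described above (patterns from positive documents, patterns deployed to terms, terms weighted by pattern distribution and specificity) can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract gives no formulas, so the function names, the equal sharing of a pattern's support among its terms, and the product form of the specificity score are all assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations


def frequent_patterns(docs, min_support=2, max_size=2):
    """Return {termset: support} for termsets found in >= min_support documents.

    Hypothetical helper: a brute-force stand-in for a real pattern miner.
    """
    counts = Counter()
    for doc in docs:
        terms = sorted(set(doc))
        for size in range(1, max_size + 1):
            for combo in combinations(terms, size):
                counts[frozenset(combo)] += 1
    return {p: s for p, s in counts.items() if s >= min_support}


def term_weights(docs, patterns):
    """Deploy pattern support onto terms, then scale by an assumed specificity."""
    deployed = Counter()     # pattern support shared equally among its terms (assumption)
    in_patterns = Counter()  # number of patterns that contain each term
    for pattern, support in patterns.items():
        for term in pattern:
            deployed[term] += support / len(pattern)
            in_patterns[term] += 1

    in_docs = Counter(t for doc in docs for t in set(doc))  # document frequency

    weights = {}
    for term in deployed:
        # Assumed specificity: pattern frequency times document frequency,
        # each normalised to [0, 1]. The paper's actual formula may differ.
        specificity = (in_patterns[term] / len(patterns)) * (in_docs[term] / len(docs))
        weights[term] = deployed[term] * specificity
    return weights


# Toy positive feedback documents, already tokenised (illustrative data only).
docs = [["oil", "price", "rise"], ["oil", "price"], ["oil", "market"]]
patterns = frequent_patterns(docs)
weights = term_weights(docs, patterns)
```

In this toy run, only terms that occur in at least one frequent pattern receive a weight, which reflects the idea that low-level terms are drawn from the high-level patterns rather than from the raw documents.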