Abstract
Research has shown that Field Association (FA) Terms are effective in document classification, similar file retrieval, and passage retrieval, and hold considerable potential for applications in natural language processing and information retrieval. Many researchers have proposed effective methods to automatically extract relevant FA Terms for building a comprehensive dictionary. However, all previous studies are based on FA Terms in English and Japanese, and extending FA Terms to other languages such as Arabic could strengthen further research. This paper presents a new method to extract FA Terms from domain-specific corpora in Arabic using part-of-speech (POS) pattern rules and corpora comparison. Experimental evaluation is carried out for 14 different fields using 251 MB of domain-specific corpora obtained from Arabic Wikipedia dumps and Alhayah news, from which an average of 2,825 FA Terms (single and compound) per field were selected. In the experimental results, recall and precision are 84% and 79%, respectively. Moreover, the quality of the FA Terms dictionary is tested by its ability to identify the fields of 8,054 documents collected from two different sources: the Wikipedia and Alhayah corpora.