Abstract
Conference Title: 2014 9th International Conference on Informatics and Systems (INFOS). Conference Start Date: 2014, Dec. 15. Conference End Date: 2014, Dec. 17. Conference Location: Cairo, Egypt.

Information Extraction (IE) is one of the most important Natural Language Processing (NLP) applications; it extracts information such as Named Entities (NE) and collocations of terms from a corpus. A collocation is a sequence of terms that co-occur in the corpus. Arabic Information Extraction faces many problems because of the complexity of Arabic grammar and its ambiguity. In linguistics research, a corpus is generally more useful when it is annotated with Part of Speech Tagging (POST). In this paper, we propose a prototype that extracts collocations of N-gram words (2 to 6 grams) from the Arabic Quran corpus based on sequences of POST tags. The approach extracts these collocations by matching an input structural pattern of the Arabic language against the Part of Speech Tagging of the Quran corpus. The system enables users to select a sequence of tags (2 to 6 grams) and the scope of the corpus source (the whole Quran corpus or a specific Surah). To show how the system benefits linguistic research, a set of experiments has been conducted under different scenarios.
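
The matching step described above can be illustrated with a minimal Python sketch. It is not the authors' prototype. The names `match_pattern`, `extract_collocations` and `TOY_CORPUS`, the tag labels (`N`, `V`, `P`, `ADJ`), the placeholder words, the corpus layout (Surah number mapped to a list of verses), the choice to keep windows inside a single verse, and the frequency counting are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Optional, Sequence, Tuple

Token = Tuple[str, str]            # (word, POS tag); hypothetical layout
Verse = List[Token]
Corpus = Dict[int, List[Verse]]    # Surah number -> list of verses (assumed)


def match_pattern(verse: Verse, pattern: Sequence[str]) -> List[Tuple[str, ...]]:
    """Return every word n-gram in `verse` whose tag sequence equals `pattern`."""
    n = len(pattern)
    target = tuple(pattern)
    hits = []
    # Slide a window of n tokens over the verse and compare tag sequences.
    for i in range(len(verse) - n + 1):
        window = verse[i:i + n]
        if tuple(tag for _, tag in window) == target:
            hits.append(tuple(word for word, _ in window))
    return hits


def extract_collocations(
    corpus: Corpus, pattern: Sequence[str], surah: Optional[int] = None
) -> Counter:
    """Count n-grams matching `pattern` in the whole corpus or in one Surah."""
    if not 2 <= len(pattern) <= 6:
        raise ValueError("pattern must contain between 2 and 6 tags")
    # Scope selection: every Surah, or only the one requested.
    surahs = list(corpus) if surah is None else [surah]
    counts: Counter = Counter()
    for number in surahs:
        for verse in corpus.get(number, []):
            # Assumption: windows never cross a verse boundary.
            counts.update(match_pattern(verse, pattern))
    return counts


# Toy data with placeholder words; not taken from the real tagged corpus.
TOY_CORPUS: Corpus = {
    1: [
        [("w1", "P"), ("w2", "N"), ("w3", "ADJ")],
        [("w4", "V"), ("w2", "N"), ("w3", "ADJ")],
    ],
    2: [
        [("w5", "N"), ("w6", "ADJ"), ("w7", "V")],
    ],
}

if __name__ == "__main__":
    print(extract_collocations(TOY_CORPUS, ["N", "ADJ"]))           # whole corpus
    print(extract_collocations(TOY_CORPUS, ["N", "ADJ"], surah=2))  # one Surah
```

Passing `surah=None` corresponds to searching the whole corpus, while passing a Surah number restricts the scope, mirroring the two options the abstract describes. A real system would read the tags from the annotated corpus instead of a hand-written dictionary.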