Extracting N-gram terms collocation from tagged Arabic corpus

Waseem Alromima; Ibrahim F Moawad; Rania Elgohary; Mostafa Aref

Conference proceeding

The Institute of Electrical and Electronics Engineers, Inc. (IEEE) Conference Proceedings, p.NLP-10

01/12/2014

Abstract

Conference Title: 2014 9th International Conference on Informatics and Systems (INFOS)
Conference Start Date: 2014, Dec. 15
Conference End Date: 2014, Dec. 17
Conference Location: Cairo, Egypt

Information Extraction (IE) is one of the most important Natural Language Processing (NLP) applications; it extracts information such as Named Entities (NE) and term collocations from a corpus. A collocation is a sequence of terms that co-occur in the corpus. Arabic Information Extraction faces many problems because of the complexity and ambiguity of Arabic grammar. In linguistics research, a corpus annotated with Part of Speech Tagging (POST) is generally the more efficient one to work with. In this paper, we propose a prototype that extracts collocations of N-gram words (2- to 6-grams) from the Arabic Quran corpus based on sequences of POST tags. The approach extracts N-gram collocations by matching an input structural pattern of the Arabic language against the Part of Speech tags of the Quran corpus. The system lets users select a sequence of tags (2 to 6 tags long) and the scope of the corpus source (the whole Quran corpus or a specific Surah). To show how the system benefits linguistic research, a set of experiments was conducted in different scenarios.
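The tag-sequence matching described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' prototype: the function names (`match_pattern`, `extract_collocations`), the corpus layout (a dictionary from Surah number to a list of `(word, tag)` pairs), the tag names, and the transliterated toy tokens are all assumptions made for illustration. The sketch slides a window of the pattern's length over each Surah and keeps the word sequences whose tags equal the requested pattern.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Token = Tuple[str, str]  # (word, POS tag); hypothetical representation


def match_pattern(tagged: Sequence[Token],
                  pattern: Sequence[str]) -> List[Tuple[str, ...]]:
    """Return every word N-gram whose tag sequence equals `pattern`."""
    n = len(pattern)
    if not 2 <= n <= 6:
        # The abstract limits patterns to 2- to 6-grams.
        raise ValueError("pattern length must be between 2 and 6")
    target = tuple(pattern)
    matches = []
    for i in range(len(tagged) - n + 1):
        window = tagged[i:i + n]
        if tuple(tag for _, tag in window) == target:
            matches.append(tuple(word for word, _ in window))
    return matches


def extract_collocations(corpus: Dict[int, Sequence[Token]],
                         pattern: Sequence[str],
                         surah: Optional[int] = None) -> List[Tuple[str, ...]]:
    """Search one Surah, or the whole corpus when `surah` is None."""
    scope = [surah] if surah is not None else sorted(corpus)
    results = []
    for number in scope:
        # Matching per Surah keeps windows from crossing Surah boundaries.
        results.extend(match_pattern(corpus[number], pattern))
    return results


# Toy data: transliterated words with illustrative tags (assumed, not real corpus data).
TOY_CORPUS: Dict[int, List[Token]] = {
    1: [("bi", "P"), ("ismi", "N"), ("allahi", "PN"),
        ("al-rahmani", "ADJ"), ("al-rahimi", "ADJ")],
    2: [("dhalika", "DEM"), ("al-kitabu", "N")],
}

whole = extract_collocations(TOY_CORPUS, ["N", "PN"])
one_surah = extract_collocations(TOY_CORPUS, ["P", "N", "PN"], surah=1)
print(whole)
print(one_surah)
```

Passing `surah=None` corresponds to searching the whole corpus, while a specific number restricts the search to one Surah, mirroring the two scopes the abstract mentions.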

Details

Title: Extracting N-gram terms collocation from tagged Arabic corpus
Creators: Waseem Alromima; Ibrahim F Moawad; Rania Elgohary; Mostafa Aref
Publication Details: The Institute of Electrical and Electronics Engineers, Inc. (IEEE) Conference Proceedings, p.NLP-10
Publisher: The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Identifiers: 9929433008331
Academic Unit: Taibah University
Language: English
Resource Type: Conference proceeding