Clitics in Arabic Language: A Statistical Study

Fahad Alotaiby; Salah Foda; Ibrahim Alkharashi

Conference proceeding

PROCEEDINGS OF THE 24TH PACIFIC ASIA CONFERENCE ON LANGUAGE, INFORMATION AND COMPUTATION, pp.595-601

01/01/2010

Abstract

Language & Linguistics

Linguistics

Social Sciences

Clitics in Arabic can be attached to a stem or to each other without orthographic marks such as an apostrophe. In this paper we present a statistical study of clitics and their effect on the Arabic language. We tokenize a large Arabic text using white space and an automatic clitic tokenizer (AMIRA 2.0), and compare the unique-word counts in both cases with those of English. We also show the resulting distribution of clitics in Arabic and examine the performance of the tokenizer used. Using a 600-million-word Arabic corpus, we report that the corresponding lexicon size could be reduced by 24.54% when clitic tokenization is applied.
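
The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline and does not call AMIRA 2.0: the segmentation table, the function names, and the seven-word sample in Latin transliteration are all hypothetical and chosen only to show how a lexicon-size reduction and a clitic distribution might be computed.

```python
from collections import Counter

# Hypothetical toy segmentation table in Latin transliteration.
# This is NOT AMIRA 2.0 output; a real clitic tokenizer would be used instead.
# Proclitics are written "x+" and enclitics "+x".
TOY_SEGMENTATION = {
    "wktAb": ["w+", "ktAb"],   # conjunction + stem
    "bktAb": ["b+", "ktAb"],   # preposition + stem
    "ktAbh": ["ktAb", "+h"],   # stem + pronoun
    "wqlm": ["w+", "qlm"],
    "qlmh": ["qlm", "+h"],
}


def whitespace_tokenize(text):
    """Split on white space only, leaving clitics attached."""
    return text.split()


def clitic_tokenize(text):
    """Split on white space, then detach clitics using the toy table."""
    tokens = []
    for word in text.split():
        tokens.extend(TOY_SEGMENTATION.get(word, [word]))
    return tokens


def lexicon_size(tokens):
    """Number of unique token types."""
    return len(set(tokens))


def reduction_percent(before, after):
    """Relative reduction in lexicon size, in percent."""
    return 100.0 * (before - after) / before


# Hypothetical sample text (assumption, not corpus data).
sample = "ktAb wktAb bktAb ktAbh qlm wqlm qlmh"

ws_tokens = whitespace_tokenize(sample)
ct_tokens = clitic_tokenize(sample)

ws_size = lexicon_size(ws_tokens)
ct_size = lexicon_size(ct_tokens)
reduction = reduction_percent(ws_size, ct_size)

# Distribution of detached clitics only (tokens carrying a "+" marker).
clitic_counts = Counter(t for t in ct_tokens if "+" in t)

print(f"white-space lexicon: {ws_size}")
print(f"clitic-tokenized lexicon: {ct_size}")
print(f"reduction: {reduction:.2f}%")
print(dict(clitic_counts))
```

On this toy sample the figures say nothing about the 24.54% reported for the 600-million-word corpus; the sketch only shows the shape of the calculation.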

Details

Title: Clitics in Arabic Language: A Statistical Study
Creators - without role: Fahad Alotaiby - King Saud University
Salah Foda - King Saud University
Ibrahim Alkharashi - King Abdulaziz City Sci & Technol, Comp & Elect Res Inst, Riyadh 11442, Saudi Arabia
Contributors - without role: R Otoguro
K Ishikawa
H Umemoto
K Yoshimoto
Y Harada
Publication Details: PROCEEDINGS OF THE 24TH PACIFIC ASIA CONFERENCE ON LANGUAGE, INFORMATION AND COMPUTATION, pp.595-601
Publisher: Waseda Univ, Inst Digital Enhancement Cognitive Development
Number of pages: 7
Identifiers: 9918872608331
Academic Unit: King Abdulaziz City for Science & Technology; King Saud University
Language: English
Resource Type: Conference proceeding