Abstract
Clitics in Arabic can attach to a stem or to one another without any orthographic mark such as an apostrophe. In this paper we present a statistical study of clitics and their effect on the Arabic language. We tokenize a large Arabic text in two ways, by white space and with an automatic clitic tokenizer (AMIRA 2.0), and compare the unique-word counts in both cases with those of English. We also show the resulting distribution of clitics in Arabic and examine the performance of the tokenizer. Using a 600-million-word Arabic corpus, we report that the corresponding lexicon size can be reduced by 24.54% when clitic tokenization is applied.
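
The lexicon-size comparison summarized above can be illustrated with a minimal sketch. It assumes that "lexicon size" means the number of unique tokens and that the reduction is measured relative to the white-space lexicon. The function names and the toy data are hypothetical: the tokens are Latin-script stand-ins segmented by hand for illustration, not output of AMIRA 2.0 and not drawn from the corpus.

```python
def lexicon_size(tokens):
    """Number of unique token types in a token sequence."""
    return len(set(tokens))


def reduction_percent(size_before, size_after):
    """Relative lexicon reduction, in percent of the original size."""
    if size_before <= 0:
        raise ValueError("size_before must be positive")
    return 100.0 * (size_before - size_after) / size_before


# Hypothetical toy text, tokenized on white space only.
# Clitics are still attached, so each combination is a distinct type.
whitespace_tokens = ["wktAb", "ktAb", "bktAb", "wqlm", "qlm"]

# The same text after a hand-made clitic segmentation (assumed here,
# not produced by a real tokenizer). "+" marks a detached proclitic.
clitic_tokens = ["w+", "ktAb", "ktAb", "b+", "ktAb", "w+", "qlm", "qlm"]

before = lexicon_size(whitespace_tokens)
after = lexicon_size(clitic_tokens)
print(before, after, reduction_percent(before, after))
```

On this toy input the segmented text has more tokens but fewer unique types, which is the effect the reported reduction measures at corpus scale.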