Abstract
With the rapid growth of the Internet and related technologies, storing Arabic diacritical text and retrieving it from a corpus in real time has become a vital problem in information retrieval. In this paper, we propose a new representation of Arabic diacritical text in the corpus that lets search engines retrieve the desired text faster and with high precision. To this end, we segment Arabic diacritical sentences/verses into individual characters together with their diacritics, which are necessary for interpreting meaning. We then propose a new data structure that represents the data using the segmented characters. To verify the corpus representation, we use the Boyer-Moore algorithm to search for given verses of Arabic diacritical data. The proposed data structure reduces the worst-case search time from O(m*n) to O(1+m), where m denotes the length of the diacritical verse to be searched and n denotes the total number of diacritical verses. Experimental results on a popular corpus show that the proposed method outperforms existing search methods in terms of time complexity. © 2018 The Authors. Published by Elsevier B.V.
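The pipeline described above (segment the text into characters with their diacritics, index the segments, then search) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not specify the data structure, so the per-unit dictionary index here is an assumption, and the search uses the Horspool simplification of Boyer-Moore (bad-character shifts only) rather than the full algorithm. The names `segment`, `build_index`, `horspool_find`, and `search` are hypothetical.

```python
import unicodedata
from collections import defaultdict


def segment(text):
    """Split text into units: a base character plus the combining marks
    (e.g. Arabic diacritics such as fatha or damma) that follow it."""
    units = []
    for ch in text:
        if units and unicodedata.combining(ch):
            units[-1] += ch  # attach the diacritic to the preceding letter
        else:
            units.append(ch)
    return units


def build_index(verses):
    """Assumed structure: map each unit to the ids of the verses containing it."""
    verse_units = [segment(v) for v in verses]
    index = defaultdict(list)
    for vid, units in enumerate(verse_units):
        for unit in set(units):
            index[unit].append(vid)
    return verse_units, index


def horspool_find(haystack, needle):
    """Horspool variant of Boyer-Moore over lists of units.
    Returns the first match position, or -1 if there is none."""
    m, n = len(needle), len(haystack)
    if m == 0:
        return 0
    if m > n:
        return -1
    # Shift for each unit of the pattern except the last one.
    shift = {unit: m - 1 - i for i, unit in enumerate(needle[:-1])}
    pos = 0
    while pos <= n - m:
        if haystack[pos:pos + m] == needle:
            return pos
        pos += shift.get(haystack[pos + m - 1], m)
    return -1


def search(verse_units, index, query):
    """Look up candidate verses by the query's first unit, then verify each."""
    needle = segment(query)
    if not needle:
        return []
    hits = []
    for vid in index.get(needle[0], ()):
        pos = horspool_find(verse_units[vid], needle)
        if pos != -1:
            hits.append((vid, pos))
    return hits


# Three spellings of the same consonants k-t-b with different diacritics.
verses = [
    "\u0643\u064E\u062A\u064E\u0628\u064E",  # kataba
    "\u0643\u064F\u062A\u064F\u0628\u064C",  # kutubun
    "\u0643\u062A\u0628",                    # no diacritics
]
verse_units, index = build_index(verses)
print(search(verse_units, index, "\u0643\u064F\u062A\u064F"))  # [(1, 0)]
```

Because each unit carries its diacritics, the query matches only the second verse, even though all three share the same consonants. The index lookup narrows the candidates before any pattern matching runs, which is the kind of saving the proposed representation aims for. The complexity bound stated above applies to the authors' structure, not to this sketch.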