Abstract
String matching is a classical computer science problem where we search for all the occurrences of a text string of size m, typically called pattern, in a string of size n, where both strings are drawn from the same alphabet. It is an essential task for many applications such as data mining, web search engines, bioinformatics, and natural language processing. Fast hash algorithms were developed to speed up the searching process. Here, we compare the hash value of strings (signature) instead of the letters. The hash function allows exploiting bitwise operations while considering the alphabet’s and pattern’s sizes. However, the efficiency of the hash algorithms calls for further improvements. The problem with q-gram hash algorithms is that the shift skips at most m−q+1 positions, where m is the same as before, and q is the length of hashed q-gram. For a fixed m, the number of skipped positions decreases as q increases. This paper presents a new variation of the q-gram hash algorithm, which elongates the shift by skipping at most m positions over text. Theoretically, the proposed hash algorithm, namely, Longer shift HASHq (LsHASHq), has a longer shift than the state-of-the-art hash algorithms. Experimentally, the new algorithm is the fastest among the following algorithms: BNDMq, BXSq, EPSM, FHASHq, FSBNDMq, HASHq, LWFRq, QLQS, SBNDMq, TWFRq, and WFRq on different natural language texts for m>10. For human genome sequence the new algorithm was second fastest for short patterns of length 10.
•A fast q-gram hash function based exact string matching algorithm for any alphabet.•Our avg time complexity is O(n/(m−d)), where n,m are length of text and pattern.•Our algorithm shift skips £m positions vs other algorithms skip £m−q+1.•Achieved second best avg runtime on Human Chromosome 1 for patterns of size 10.•Best avg runtime (vs 23 algorithms) for pattern size <= 60 on natural language text.