LsHASHq: A string matching algorithm exploiting longer q-gram shifting

Abdulrakeeb M. Al-Ssulami; Aqil M. Azmi; Hassan Mathkour; Hatim Aboalsamh

doi:10.1016/j.ipm.2022.103057

Back

LsHASHq: A string matching algorithm exploiting longer q-gram shifting

Journal article

Peer reviewed

LsHASHq: A string matching algorithm exploiting longer q-gram shifting

Abdulrakeeb M. Al-Ssulami, Aqil M. Azmi, Hassan Mathkour and Hatim Aboalsamh

Information processing & management, Vol.59(5), p.103057

09/2022

DOI: https://doi.org/10.1016/j.ipm.2022.103057

Abstract

Online search

Pattern matching

q-gram hashing

Sequence analysis

String matching algorithms

String matching is a classical computer science problem where we search for all the occurrences of a text string of size m, typically called pattern, in a string of size n, where both strings are drawn from the same alphabet. It is an essential task for many applications such as data mining, web search engines, bioinformatics, and natural language processing. Fast hash algorithms were developed to speed up the searching process. Here, we compare the hash value of strings (signature) instead of the letters. The hash function allows exploiting bitwise operations while considering the alphabet’s and pattern’s sizes. However, the efficiency of the hash algorithms calls for further improvements. The problem with q-gram hash algorithms is that the shift skips at most m−q+1 positions, where m is the same as before, and q is the length of hashed q-gram. For a fixed m, the number of skipped positions decreases as q increases. This paper presents a new variation of the q-gram hash algorithm, which elongates the shift by skipping at most m positions over text. Theoretically, the proposed hash algorithm, namely, Longer shift HASHq (LsHASHq), has a longer shift than the state-of-the-art hash algorithms. Experimentally, the new algorithm is the fastest among the following algorithms: BNDMq, BXSq, EPSM, FHASHq, FSBNDMq, HASHq, LWFRq, QLQS, SBNDMq, TWFRq, and WFRq on different natural language texts for m>10. For human genome sequence the new algorithm was second fastest for short patterns of length 10. •A fast q-gram hash function based exact string matching algorithm for any alphabet.•Our avg time complexity is O(n/(m−d)), where n,m are length of text and pattern.•Our algorithm shift skips £m positions vs other algorithms skip £m−q+1.•Achieved second best avg runtime on Human Chromosome 1 for patterns of size 10.•Best avg runtime (vs 23 algorithms) for pattern size <= 60 on natural language text.

Metrics

1 Record Views

Details

Title: LsHASHq: A string matching algorithm exploiting longer q-gram shifting
Creators - without role: Abdulrakeeb M. Al-Ssulami - Taiz University
Aqil M. Azmi - King Saud University
Hassan Mathkour - King Saud University
Hatim Aboalsamh - King Saud University
Publication Details: Information processing & management, Vol.59(5), p.103057
Publisher: Elsevier Ltd
Identifiers: 9946721708331
Academic Unit: King Saud University
Language: English
Resource Type: Journal article