Abstract
A primitive tandem repeat a is a substring in string S if it can be expressed as two or more contiguous copies of beta, where the base beta cannot be expressed in terms of yet a shorter substring. Substring alpha is maximal if there is no copy of beta to either its left or right. Tandem repeats (or arrays) are known to play an important role in biology, e.g. determining parentage. We present a deterministic algorithm that finds all the exact primitive maximal tandem arrays that occur in S, where at iteration k it discovers all the repeats with base length k. Our algorithm uses a simple list structure for all its operations. In theory, the algorithm scales well with the size of the alphabet. For strings over the alphabet Sigma it has a complexity of O(vertical bar S vertical bar +vertical bar Sigma vertical bar in space, and O(B vertical bar S vertical bar - B-2/2) in time, where B is the length of the longest primitive base of a tandem repeat in S. Experimental results on real biological sequences, randomly generated sequences using large sized alphabets, and Fibonacci strings, show that in practice the algorithm has indeed a linear complexity, both in space and time. (C) 2016 Elsevier Inc. All rights reserved.