Abstract
One of the basic tasks in genomic research is the analysis of a sequence. An absent word of a sequence is a word over the same alphabet that does not occur in the sequence as a substring. Many studies have looked into finding the shortest absent words, and some recent studies note that longer absent words are also of interest. Simply extending the search beyond the shortest ones is impractical, as the list of absent words tends to grow exponentially in the size of the sequence. A better choice is the set of minimal absent words, whose number is known to grow linearly in the size of the sequence. An absent word is minimal if all of its proper factors occur in the sequence. Similarly, a unique word (one that occurs exactly once in the sequence) is (left-fixed) minimal unique if none of its proper prefixes is unique. In this paper we present an efficient algorithm that discovers all words up to a user-specified length that are either minimal absent or left-fixed minimal unique in the input sequence. The approach is purely deterministic, which guarantees that no such word is overlooked. At each successive iteration, the algorithm works on longer words, using a simple list structure for all operations. Theoretically, the algorithm's space complexity is linear in the size of the input sequence, while its time bound scales well with alphabet size. Experimental results on real biological sequences and on randomly generated sequences over alphabets of different sizes show that the algorithm exhibits linear-time behavior.
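To make the two definitions concrete, the following brute-force Python sketch lists the minimal absent words and the left-fixed minimal unique words of a short sequence, up to a given length. It is only an illustration of the definitions and is not the list-based algorithm presented in this paper; the function names (`factor_counts`, `minimal_absent_words`, `left_minimal_unique_words`) and the toy sequence are assumptions made for this example. It relies on the fact that a word of length at least two has all proper factors present exactly when its longest proper prefix and longest proper suffix are both present.

```python
from collections import Counter


def factor_counts(seq, max_len):
    """Count occurrences of every substring of seq with length 1..max_len."""
    counts = Counter()
    for k in range(1, max_len + 1):
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def minimal_absent_words(seq, alphabet, max_len):
    """Absent words of length <= max_len whose proper factors all occur in seq."""
    counts = factor_counts(seq, max_len)

    def occurs(word):
        # The empty word is treated as occurring in every sequence.
        return word == "" or counts[word] > 0

    found = []
    prefixes = [""]  # words of length k - 1 that occur in seq
    for k in range(1, max_len + 1):
        for prefix in prefixes:
            for letter in alphabet:
                word = prefix + letter
                # The longest proper prefix occurs by construction, so only
                # the word itself and its longest proper suffix are checked.
                if not occurs(word) and occurs(word[1:]):
                    found.append(word)
        prefixes = sorted(w for w in counts if len(w) == k)
    return found


def left_minimal_unique_words(seq, max_len):
    """Words occurring exactly once whose proper prefixes all occur more than once."""
    counts = factor_counts(seq, max_len)
    unique = [
        w for w, c in counts.items()
        # If the longest proper prefix repeats, every shorter prefix repeats too.
        if c == 1 and (len(w) == 1 or counts[w[:-1]] > 1)
    ]
    return sorted(unique, key=lambda w: (len(w), w))


print(minimal_absent_words("abaab", "ab", 3))   # ['bb', 'aaa', 'bab']
print(left_minimal_unique_words("abaab", 3))    # ['aa', 'ba', 'aba']
```

For the toy sequence `abaab`, the word `bab` is minimal absent because it does not occur while `ba` and `ab` both do, and `aba` is left-fixed minimal unique because it occurs once while its prefix `ab` occurs twice. The sketch stores every substring up to the length bound, so it is suitable only for small inputs.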