Abstract
Sequence mappability is an important task in genome resequencing. In the (k, m)-mappability problem, for a given sequence T of length n, our goal is to compute a table whose ith entry is the number of indices j not equal i such that length-m substrings of T starting at positions i and j have at most k mismatches. Previous works on this problem focused on heuristic approaches to compute a rough approximation of the result or on the case of k = 1. We present several efficient algorithms for the general case of the problem. Our main result is an algorithm that works in O(n min{m(k), log(k+1)n}) time and O(n) space for k = O(1). It requires a careful adaptation of the technique of Cole et al. [STOC 2004] to avoid multiple counting of pairs of substrings. We also show O(n(2))- time algorithms to compute all results for a fixed m and all k = 0, ..., m or a fixed k and all m = k, ..., n - 1. Finally we show that the (k, m)mappability problem cannot be solved in strongly subquadratic time for k,m = circle minus(log n) unless the Strong Exponential Time Hypothesis fails.