Investigators: Kamalika Ray & Sven Rahmann
Funding: internal
We develop methods for optimizing gapped k-mer seeds with regard to beneficial properties, such as tolerance towards mismatches in a sequenced read.
Given a sequence s, a gapped k-mer is a subsequence of s of length k, whose characters have been selected according to a given pattern (mask) of length w that consists of k significant and (w-k) insignificant positions (gaps).
An example is ##_#__#___##___#__#_##
with k=10 and w=22, where #
denotes a significant position.
Given a sequenced DNA read of length n and a total of c substitutions that can be placed arbitrarily (in particular, adversarily) in the read, how many of the gapped k-meres are changed by the c substitutions in the worst case? In other words, for a given mask like the example above, find the worst-case placement of substitutions to change as many of the gapped k-mers as possible.
Among all masks of the same shape (w,k), find the best mask that leaves as many k-mers unchanged (in the worst case) as possible. A related, question is, among all masks, maximize the reads’s worst-case coverage by gapped k-mers.
Both questions can be answered with the help of integer linear programming and by analyzing the structure of the solutions.
We can replace standard k-mers used in many alignment-free methods by optimal gapped k-mers and guarantee better sensitivity/selectivity properties.
(in preparation)
Algorithmic Bioinformatics, SIC, Saarland University | Privacy notice | Legal notice