**Investigators:** Kamalika Ray & Sven Rahmann

**Funding:** internal

We develop methods for optimizing gapped k-mer seeds with regard to beneficial properties, such as tolerance towards mismatches in a sequenced read.

Given a sequence *s*, a gapped *k*-mer is a subsequence of *s* of length *k*, whose characters have been selected according to a given pattern (*mask*) of length *w* that consists of *k significant* and *(w-k) insignificant* positions (gaps).

An example is `##_#__#___##___#__#_##`

with *k*=10 and *w*=22, where `#`

denotes a significant position.

Given a sequenced DNA read of length *n* and a total of *c* substitutions that can be placed arbitrarily (in particular, adversarily) in the read, how many of the gapped *k*-meres are changed by the *c* substitutions in the worst case?
In other words, for a given mask like the example above, find the worst-case placement of substitutions to change as many of the gapped *k*-mers as possible.

Among **all** masks of the same shape (*w,k*), find the **best** mask that leaves as many *k*-mers unchanged (in the worst case) as possible.
A related, question is, among all masks, maximize the reads’s worst-case coverage by gapped *k*-mers.

Both questions can be answered with the help of integer linear programming and by analyzing the structure of the solutions.

We can replace standard *k*-mers used in many alignment-free methods by optimal gapped *k*-mers and guarantee better sensitivity/selectivity properties.

(in preparation)

Algorithmic Bioinformatics, SIC, Saarland University | Privacy notice | Legal notice