Efficient exact Motif Discovery
We show how to solve the motif discovery problem (almost) exactly on a
practically relevant space of IUPAC generalized string patterns, using
the p-value with respect to an i.i.
d. model or a Markov model as the
measure of over-representation. In particular, (i) we use a highly
accurate compound Poisson approximation for the null distribution of
the number of motif occurrences. We show how to compute the exact clump
size distribution using a recently introduced device called
probabilistic arithme
tic automaton (PAA). (ii) We define two p-value
scores for over-representation, the first one based on the total number
of motif occurrences, the second one based on the number of sequences
in a collection with at least one occurrence. (iii) We describe an
algorithm to discover the optimal pattern with respect to either of the
scores. The method exploits monotonicity properties of the compound
Poisson approximation and is by orders of magnitude faster than
exhaustive enumeration of IUPAC strings (11.8 h compared with an
extrapolated running time of 4.8 years). (iv) We justify the use of the
proposed scores for motif discovery by showing our method to outperform
other motif discovery algorithms (e.g. MEME, Weeder) on benchmark
datasets. We also propose new motifs on Mycobacterium tuberculosis. - Article in the ISMB'09 Bioinformatics special issue:[PubMed] [Paper]
- Slides of the talk at ISMB'09: [pdf]
- Software: The method has been implemented in
Java and is part of the MoSDi package.
Modeling Evolutionary Fitness for DNA Motif DiscoveryMany
heuristic methods, including evolutionary algorithms, have been
developed for the motif discovery problem. However, comparatively
little attention has been devoted to the adequate evaluation of motif
quality, especially when comparing motifs of different lengths. We
propose an evolution strategy to solve the motif discovery problem
based on a new fitness function that simultaneously takes into account
(1) the number of motif occurrences, (2) the motif length, and (3) its
information content. Experimental results show that the proposed method
succeeds in uncovering the correct motif positions and length with high
accuracy. - Article in the GECCO'09 proceedings: [Paper] (Best paper award in the Computational Biology and Bioinformatics track)
- Slides of the talk at GECCO'09: [pdf]
- Software: A Java implementation is available here.
|