Research‎ > ‎

Motif Discovery

Efficient exact Motif Discovery

We show how to solve the motif discovery problem (almost) exactly on a practically relevant space of IUPAC generalized string patterns, using the p-value with respect to an i.i. d. model or a Markov model as the measure of over-representation. In particular, (i) we use a highly accurate compound Poisson approximation for the null distribution of the number of motif occurrences. We show how to compute the exact clump size distribution using a recently introduced device called probabilistic arithme tic automaton (PAA). (ii) We define two p-value scores for over-representation, the first one based on the total number of motif occurrences, the second one based on the number of sequences in a collection with at least one occurrence. (iii) We describe an algorithm to discover the optimal pattern with respect to either of the scores. The method exploits monotonicity properties of the compound Poisson approximation and is by orders of magnitude faster than exhaustive enumeration of IUPAC strings (11.8 h compared with an extrapolated running time of 4.8 years). (iv) We justify the use of the proposed scores for motif discovery by showing our method to outperform other motif discovery algorithms (e.g. MEME, Weeder) on benchmark datasets. We also propose new motifs on Mycobacterium tuberculosis.
  • Article in the ISMB'09 Bioinformatics special issue:[PubMed] [Paper]
  • Slides of the talk at ISMB'09: [pdf]
  • Software: The method has been implemented in Java and is part of the MoSDi package.

Modeling Evolutionary Fitness for DNA Motif Discovery

Many heuristic methods, including evolutionary algorithms, have been developed for the motif discovery problem. However, comparatively little attention has been devoted to the adequate evaluation of motif quality, especially when comparing motifs of different lengths. We propose an evolution strategy to solve the motif discovery problem based on a new fitness function that simultaneously takes into account (1) the number of motif occurrences, (2) the motif length, and (3) its information content. Experimental results show that the proposed method succeeds in uncovering the correct motif positions and length with high accuracy.
  • Article in the GECCO'09 proceedings: [Paper] (Best paper award in the Computational Biology and Bioinformatics track)
  • Slides of the talk at GECCO'09: [pdf]
  • Software: A Java implementation is available here.