Protein Hypernetworks
Johannes Köster, Eli Zamir (MPI Dortmund), Sven RahmannProtein-protein interactions give rise to biochemical systems underlying cellular complexity and functions. Despite their complexity, large systems of interacting proteins are often modeled as simple graphs with nodes and edges corresponding to the proteins and their potential interactions, respectively. This oversimplification neglects the dependencies of protein interactions, and therefore is limited in providing biological insights. We present an approach for expanding protein networks with fundamental interaction dependencies using propositional and modal logic, thereby obtaining protein hypernetworks. We illustrate the modeling potential of protein hypernetworks by showing that they improve protein complex prediction in yeast. Furthermore, by modeling the effects of protein perturbations on the possible system states we derive a perturbation impact score, a novel measure of a protein's importance besides its connectivity.
More details on protein hypernetworks.
Analysis of Ion Mobility Spectroscopy (IMS) Data
Dominik Kopczynski, Robert Kirberich, Jörg Ingo Baumbach (B&S Analytik, KIST Europe), Sven RahmannIon mobility spectrometry measures ion intensity I(r,t) of a mixture of molecules after a certain retention time r in a separation phase and a certain drift time t in the spectrometer. Different molecules generally lead to distinct peaks in the chromatogram of intensities; so it is theoretically possible to identify and quantify a number of different metabolites, for example in exhaled human breath. In practice, this is not trivial for a number of reasons:
- Peaks in the chromatogram may overlap and need to be de-mixed.
- Low-intensity peaks are hard to distinguish from the inherent noise.
- Chromatograms of complex mixtures are not necessarily the sums of chromatograms of single metabolites, because of ion interactions.
The IMS technology is moving towards small, autonomous devices that can be carried almost anywhere. Even with modern mobile technology, however, the amount of data they generate, is too large to be sent to a central repository for analysis. This means that we must have data analysis capabilities within the mobile IMS spectrometer. This, on the other hand, consumes much of the limited energy supply of such a device. Especially the high memory requirements (currently about 280 MB for a full chromatogram) contribute to the energy usage.
Our interest therefore focuses on analyzing each spectrum and reducing it to a few parameterized peaks, while the chromatogram is generated (up to 10 minutes for a high-resolution full chromatogram). Then, a few KB of memory would suffice for a mobile IMS device. Of course, not having the full chromatogram available makes it harder to identify and quantify peaks. The question we currently investigate is about the trade-off between used resources (memory, energy, time) and peak recognition quality.
Downstream data analysis (alignment of peaks between different IMS chromatograms, identification of metabolites from peak characteristics) can then be done with the derived parameterized peaks, instead of the full raw data with acceptable loss of accuracy.
Another challenge is to predict peak characteristics from the chemical metabolite structure using statistical machine learning or methods. For other spectrometry technologies, especially mass spectrometry, this is trivial, as the mass of every molecule is simply the sum of its atom masses. Ion mobilities are not so easily predicted, however. Success in this task would free us from the need to manually create a database of measured reference peaks for each substance of interest.
More details on IMS analysis.
Probabilistic Arithmetic Automata (PAAs) and Applications
Tobias Marschall, Inke Herms, Hans-Michael Kaltenbach, Sven RahmannProbabilistic Arithmetic Automata are an extension of Deterministic Finite Automata (DFAs) and Hidden Markov Models (HMMs) to allow (arithmetic or arbitrary binary) operations. Exact state-value probability distributions can be efficiently computed with in this framework, and many applications in string algorithmics and computational biology can be modeled in this framework and treated in a unifying manner. We have done work on
- Development of the PAA framework
- CPM'08 paper
- Application: Exact analysis of pattern matching algorithms, e.g., Horspool, BNDM, BOM
- LATA'10 paper
- preprint of submitted journal version (considerably extended)
- Software
- More details on the analysis of pattern matching algorithms
- Application: significance of motifs (CPM'08)
- see below: DNA Motif Discovery
- Application: seed sensitivity for seeded alignment (WABI'08)
- Application: significance of peaks in mass spectra (RECOMB-Satellite conference on Proteomics and Systems Biology 2007; precedes the formal PAA definition)
DNA Motif Discovery
Tobias Marschall, Sven RahmannThe motif discovery problem consists of finding over-represented patterns in a collection of biosequences. It is one of the classical sequence analysis problems, but still has not been satisfactorily solved in an exact and efficient manner. This is partly due to the large number of possibilities of defining the motif search space and the notion of over-representation. Even for well-defined formalizations, the problem is frequently solved in an ad hoc manner with heuristics that do not guarantee to find the best motif.
We have developed a new class of efficient exact motif discovery methods, and also worked on modeling objective functions for evolutionary algorithms for motif discovery. More details on motif discovery.
Small RNA Sequencing
Tobias Marschall, Marcel Martin, Sven RahmannMore details on small RNA sequencing.
Subsequence Combinatorics
With Cees Elzinge and Hui Wang, we have investigated several combinatorial problems (enumeration problems) related to subsequences within a sequence.A subsequence is obtained from a string by deleting any number of characters; thus in contrast to a substring, a subsequence is not necessarily a contiguous part of the string. Counting subsequences under various constraints has become relevant to biological sequence analysis, to machine learning, to the analysis of categorical time series in the social sciences, and to the theory of word complexity. We have investigated properties of subsequences that lead to efficient dynamic programming algorithms to count
- distinct subsequences in a string,
- distinct common subsequences of two strings,
- matching joint subsequence embeddings in two strings,
- distinct subsequences with a given minimum span,
- sequences generated by a string allowing characters to come in runs of a length that is bounded from above.
- sequences whose longest increasing subsequence has a given length.
Papers
- Cees Elzinga, Sven Rahmann, Hui Wang: Algorithms for Subsequence Combinatorics (Theoretical Computer Science 409(3): 394--404, 28 December 2008).
- Sven Rahmann: Subsequence Combinatorics and Applications to Microarray Production, DNA Sequencing and Chaining Algorithms. CPM 2006, LNCS 4009: 153-164.
Software
Software can be downloaded here as a single zip file. The Java class SubsequenceCombinatorics implements the algorithms described in the
Algorithms for Subsequence Combinatorics paper mentioned above. The Java class SubsequenceCombinatoricsTest is a JUnit based test and at the same time
demonstrates the use of the routines in SubsequenceCombinatorics.java. Documentation in html form is also included in the archive.
Note: In order to make the classes work on your system, you may have to change its package
declaration.
DNA Microarray Design and Layout
Sven Rahmann, Sérgio A. de Carvalho Jr.
Oligonucleotide microarrays consist of small DNA fragments (probes)
chemically synthesized at specific locations (spots) of a solid surface.
Probes are typically 25 to 60 nucleotides long and are synthesized in parallel, in a
series of repetitive steps. Each step appends a particular nucleotide to
selected regions of the chip. Selection occurs by exposure to light with the
help of a photolithographic mask or micromirror arrays.
Regardless of which method is used to direct light (masks or micromirror arrays), it is possible that some probes are accidentally activated for chemical coupling because of light diffraction, scattering or internal reflection on the chip surface. This unwanted illumination of regions introduces unexpected nucleotides that change the probe sequences, significantly reducing their chances of successful hybridization with their targets, and introducing cross-hybridizations. This problem is more likely to occur near the borders between a masked and an unmasked spot (in the case of maskless synthesis, between a spot that is receiving light and a spot that is not); hence the term border conflict.
By carefully designing the arrangement of the probes on the chip and their embeddings (the sequences of masked and unmasked steps used to synthesize each probe), it is possible to reduce the risk of unintended illumination. This issue becomes even more important as more probes need to be accommodated on a single chip, which require the production of spots at higher densities with reduced distances between the probes. Our aim is to design the layout of a microarray in such a way that we minimize the incidence of the unintended illumination problem, what we call the microarray layout problem (MLP), using the term layout to refer to where and how the probes are synthesized on the array, i.e., their arrangement and their embeddings. More details on microarray design and layout.
Collaborative Research Projects with Other Groups
CAMA: Detecting species-site dependencies in large multiple sequence alignments
Schwarz, Roland; Seibel, Philipp; Rahmann, Sven; Schoen, Christoph; Huenerberg, Miria; Müller-Reible, Clemens; Dandekar, Thomas; Karchin, Rachel; Schultz, Joerg; Mueller, TobiasResearch Interests
- Statistical and combinatorial methods in (biological) sequence analysis
- Probabilistically counting finite automata (with Inke Herms, Tobias Marschall)
- Abelian patterns (with Tahir Ejaz)
- Pattern statistics (with Tobias Marschall and Utz Pape, Berlin)
- De novo sequence motif discovery (with Tobias Wittkop)
- Subsequence combinatorics (with Cees Elzinga, Amsterdam, and Hui Wang, UK)
- Algorithms and data structures for ultra-fast sequencing technologies
- Indexing for approximate sequence matching (with Marcel Martin)
- Meta-genomes and Pan-genomes
- Computational epigenomics
- Analysis of and feature recognition light and DIC microscopy images (with several people at Janelia Farm)
- Species identification and biodiversity studies (with the Biocenter at Würzburg University)
- Computational infrastructure for understanding transcriptional gene regulation
- CoryneRegNet (with Jan Baumbach)
- CoryneCenter (with the CeBiTec BRF group, Bielefeld)
- Motif statistics of transcription factor binding sites (with Utz Pape, Martin Vingron; Berlin)
- Discovery of regulatory motifs (with Tobias Wittkop)
- Discovery of regulatory modules (with Katharina Pernhorst, Bonn, Gunnar Klau, Berlin, Axel Mosig, Shanghai)
- DNA microarray design and analysis
- Probe selection
- Array layout and manufacturing (with Sergio A. de Carvalho Jr.)
- Cross-hybridization models (with Nondas Fritzilas)
- Large-scale clustering
- Transitive graph projection, aka cluster editing (with Tobias Wittkop, Jan Baumbach, Marcel Martin, and the Jena bioinformatics group)
- Protein family clustering (with Tobias Wittkop, Jan Baumbach)
- Genome evolution and gene clusters (with Gunnar Klau, Berlin)
- Large-scale phylogenetic tree reconstruction (with the Biocenter at Würzburg University)
- Sequence-based information visualization
- HMM logos (with Benjamin Schuster-Böckler at the Sanger Center)
- Variations of sequence logos (with Madis Rumming)
