Title COMODIE - COmputational MOtif DIscovery in vertebrate genomes using Extreme-valued tuples
Reference PTDC/EIA-EIA/102943/2008
PI Sara Pinto Garcia
Participants Armando J. Pinho, Paulo J S G Ferreira, Holger Kantz
Funded by FCT
Global funding (€) 92000 (126,960 USD)
RU funding (€) 92000 (126,960 USD)
Starts 2010/05/01
Ends 2013/04/30

Executive summary

Sequence motifs are short patterns in DNA that are presumed to have biological function. As the regulatory code is increasingly conjectured to be a key player in eukaryotic genomes, discovering further sequence motifs and their interplay is of paramount importance for understanding gene regulation in vertebrates and its influence on increased organismal complexity. Studies suggest that standard motif discovery methods generally perform well on bacterial and yeast sequence data but relatively poorly on complex sequences from higher eukaryotes. This project addresses these issues by proposing a new methodology for computational motif discovery, based on information theory and extreme value statistics. We will use information theory to assess optimal tuple information measures based on the formalism of Shannon's entropy and derived quantities, and extreme value statistics to provide a framework for threshold-based selection criteria.

Enumerative methods, such as the dictionary-based methodology proposed here, overcome the optimization problems inherent to alignment-based motif discovery methods by parsing the entire search space. Unlike other dictionary-based methods, which output only the most probable tuples, the proposed rationale identifies tuples that are much more probable than expected from the probabilities of their components alone, and hence has a larger probability of detecting weak motifs against a noisy background. Moreover, our strategy is more likely to overcome two major pitfalls common to standard motif discovery methods: spurious motif detection as a direct consequence of non-functional repetitive elements in the genome, and over-representation bias for long sequences (current motif discovery algorithms for transcription factor binding sites perform poorly for sequences longer than 1,000 base pairs).
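The idea of scoring tuples against the probability of their components can be illustrated with a minimal Python sketch. It is not the project's method: it assumes that the "components" are the individual nucleotides, that they occur independently under the null model, and that a log2 observed-to-expected ratio is an acceptable score. The function names are hypothetical.

```python
from collections import Counter
from math import log2


def tuple_counts(seq, k):
    """Count every overlapping k-tuple in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def overrepresentation_scores(seq, k):
    """Log2 ratio of each observed k-tuple's frequency to the frequency
    expected if its symbols occurred independently (assumed null model)."""
    n = len(seq)
    base_freq = {b: c / n for b, c in Counter(seq).items()}
    counts = tuple_counts(seq, k)
    total = sum(counts.values())
    scores = {}
    for t, c in counts.items():
        expected = 1.0
        for b in t:
            expected *= base_freq[b]  # product of single-symbol frequencies
        scores[t] = log2((c / total) / expected)
    return scores


# Toy sequence, for illustration only
seq = "ACGTACGTTTGACGTAACGT"
scores = overrepresentation_scores(seq, 3)
top = sorted(scores, key=scores.get, reverse=True)[:3]
print(top)
```

A positive score marks a tuple seen more often than its composition predicts. Under this toy model, a tuple can be frequent yet unremarkable, which is the distinction the paragraph above draws from methods that report only the most probable tuples.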

We propose to discover three classes of complementary and interrelated motifs. The first class (EVOTs) comprises a dictionary of extreme-valued tuples directly observed in the genomic sequence, which can be compared to the dictionary of most frequent tuples and used for studying temporal correlations. The second class of motifs (EVATs) complements the previous information and consists of absent tuples with the property of being composed of two overlapping observed tuples of smaller size. Finally, the third class of motifs (BEVOTs and BEVATs) is based on binary mappings of the genomic sequence according to four rules, to which the methodology used for uncovering the previous two classes of motifs will then be applied. The case-study genomic sequences for this preliminary study are the high-quality, highly annotated mouse and human genomes, with the former being a good proxy for the latter. The third class of selected motifs, combined with a major comparative analysis of other available vertebrate genomes, ensures cross-validation of the results: the former by representing motifs more flexibly, and hence being more sensitive to the degeneracy of many consensus sequences, and the latter by enabling phylogenetic footprinting of discovered motifs. Classes of observed tuples (EVOTs and BEVOTs) will also be validated and probed for function against available literature and databases, using empirical significance testing with scores from the area under the receiver operating characteristic curve (ROC-AUC), aiming to output as small a set of false positives as possible. Finally, all methods, algorithms and databases of discovered motifs will be published on a web platform, making our findings available to the scientific community.
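The second class of motifs, absent tuples composed of two overlapping observed tuples, can be sketched as follows. This is one possible reading rather than the project's definition: it assumes that a candidate k-tuple is built from two observed (k-1)-tuples overlapping in k-2 symbols, and it omits the extreme-value selection step entirely. The function name is hypothetical.

```python
def absent_tuples_from_overlaps(seq, k):
    """Return k-tuples that never occur in seq but whose (k-1)-prefix and
    (k-1)-suffix both do (assumed interpretation of 'overlapping tuples')."""
    observed_k = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    observed_sub = {seq[i:i + k - 1] for i in range(len(seq) - k + 2)}
    candidates = set()
    for left in observed_sub:
        for right in observed_sub:
            # The two sub-tuples must agree on their k-2 shared symbols
            if left[1:] == right[:-1]:
                joined = left + right[-1]
                if joined not in observed_k:
                    candidates.add(joined)
    return candidates


# Toy sequence, for illustration only
print(absent_tuples_from_overlaps("ACGTCGA", 3))  # → {'GAC'}
```

In the toy sequence, both `GA` and `AC` occur but `GAC` does not, so it is reported. The pairwise loop is quadratic in the number of observed sub-tuples and is meant only to show the construction; a genome-scale version would need an indexed lookup.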

Computational motif discovery, like other areas of bioinformatics, benefits from multidisciplinary teams of experts with different skills. The team at IEETA has a proven record of expertise in applying techniques from time series analysis and signal processing to biological data, as well as in developing optimal computational approaches for coding and compressing genomic sequences, while the team at the MPIPKS has longstanding expertise and an international reputation in advancing the theory of entropies and in pioneering the theoretical approach for selecting and predicting extreme events from time series data. The two teams will collaborate closely on tasks that depend on methodological and algorithmic advancements, while the IEETA team will be responsible for implementing and applying those developments.