
Title Finite context models for DNA
Reference PTDC/EIA/72569/2006
PI Armando J. Pinho
Participants Paulo J S G Ferreira, António J. R. Neves, Daniel A. Martins, Diogo Pratas
Funded by FCT
Global funding (€) 40,000 € (55,200 USD)
RU funding (€) 40,000 € (55,200 USD)
Starts 2008/01/01
Ends 2010/06/30

Project Summary

In general, the purpose of studying data compression algorithms is twofold. The need for efficient storage and transmission is often the main motivation, but underlying every compression technique there is a model that tries to reproduce as closely as possible the information source to be compressed. This model can have independent interest, as it can shed light on the statistical properties of the source.

DNA data are no exception. There is an urgent need for efficient methods that reduce the storage space taken by the vast amount of genomic data that is continuously being generated. For example, the human genome has about 3000 million base pairs, whereas the wheat genome has about 16000 million. Beyond storage, we also want to know how the code of life works and what structure it possesses. Creating good models for DNA is one of the ways to gain this knowledge.

DNA is sufficiently determined by a sequence of four molecules called nucleotides (or bases): Adenine, Cytosine, Guanine, and Thymine. These nucleotides can be represented by an alphabet of four letters, A, C, G, T, and can be coded using two bits per base. According to functionality, DNA is subdivided into two parts: coding and non-coding DNA. Proteins are synthesized from the coding regions, which are organized in triplets of bases (codons), each of which codes for a protein unit, or amino acid, according to the genetic code. There are 64 possible codons, although they represent only 20 amino acids. Hence, the genetic code (which maps codons to amino acids) is redundant. The non-coding regions, also called "junk DNA", are DNA segments that do not code for proteins. However, in eukaryotes, they encode functionally important signals for the regulation of chromosomes. These non-coding regions are interspersed throughout the DNA.
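As a small illustration of the two-bits-per-base baseline mentioned above, the following Python sketch packs a sequence over A, C, G, T into bytes and unpacks it again. The particular bit assignment and the function names `pack` and `unpack` are assumptions made for this example; any fixed mapping of the four bases to 2-bit codes would serve equally well.

```python
# Hypothetical mapping: any fixed assignment of the four bases to 2-bit codes works.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def pack(seq):
    """Pack a string over A, C, G, T into bytes, two bits per base.

    Returns the packed bytes and the number of bases, which is needed
    to undo the padding of the last byte when unpacking.
    """
    value = 0
    for base in seq:
        value = (value << 2) | BASE_TO_BITS[base]
    n_bytes = (2 * len(seq) + 7) // 8  # round up to whole bytes
    return value.to_bytes(n_bytes, "big"), len(seq)


def unpack(data, length):
    """Recover the original sequence from packed bytes and its length."""
    value = int.from_bytes(data, "big")
    bases = []
    for i in range(length):
        shift = 2 * (length - 1 - i)  # most significant pair comes first
        bases.append(BITS_TO_BASE[(value >> shift) & 3])
    return "".join(bases)
```

Under this scheme a sequence of n bases takes about n/4 bytes, which is the reference point that any DNA-specific compression model has to improve on.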

Recently, we proposed a three-state finite-context model for representing coding regions of DNA. A finite-context model of an information source assigns probability estimates to the symbols of the alphabet, according to a conditioning context computed over a finite and fixed number, M, of past outcomes (order-M finite-context model). That work provided encouraging results that we aim to develop further in this project. One interesting finding was that some organisms deviate from the expected behaviour regarding the entropy values among the three bases of a codon. This characteristic will be investigated in greater depth and more systematically in this project. We also intend to use multiple finite-context models for modeling not only the coding regions of DNA but also the non-coding parts. By the end of this project, we intend to have a DNA encoder that outperforms current state-of-the-art encoders, as well as a better understanding of the information conveyed by DNA sequences.
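To make the definition of an order-M finite-context model more concrete, here is a minimal Python sketch. It is not the project's encoder: the function name `fcm_bits_per_base`, the adaptive counting scheme, and the additive smoothing parameter `alpha` are assumptions chosen for illustration. For each base, the model looks at the previous M bases, estimates the probability of the base from the counts seen so far in that context, and accumulates the ideal code length of minus log2 of that probability.

```python
from collections import defaultdict
from math import log2

ALPHABET = "ACGT"


def fcm_bits_per_base(seq, order, alpha=1.0):
    """Average ideal code length (bits per base) under an adaptive
    order-`order` finite-context model.

    `alpha` is an assumed additive-smoothing pseudo-count, so that
    symbols not yet seen in a context still get non-zero probability.
    The first `order` bases have no full context and are not counted.
    """
    # One table of symbol counts per context (the previous `order` bases).
    counts = defaultdict(lambda: dict.fromkeys(ALPHABET, 0))
    total_bits = 0.0
    n_coded = 0
    for i in range(order, len(seq)):
        context = seq[i - order:i]
        symbol = seq[i]
        ctx_counts = counts[context]
        total = sum(ctx_counts.values())
        # Probability estimate from the counts seen so far in this context.
        p = (ctx_counts[symbol] + alpha) / (total + alpha * len(ALPHABET))
        total_bits += -log2(p)
        # Update the model only after coding, as a decoder would.
        ctx_counts[symbol] += 1
        n_coded += 1
    return total_bits / n_coded if n_coded else 0.0
```

On a highly repetitive input such as `"ACGT" * 200` with `order=2`, the estimate falls well below the two bits per base of the plain encoding, because each context quickly becomes predictive. Running separate models on the three codon positions, or combining several orders, would be natural extensions in the spirit of the summary above, but they are not shown here.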