Ashutosh Gupta
Abstract
Deoxyribonucleic acid (DNA) is the physical medium in which all properties of living organisms are encoded, and understanding its sequence is a primary concern of molecular biology. Several major molecular biology databases (EMBL, GenBank, DDBJ) have been developed around the world to accumulate nucleotide sequences (DNA, RNA) and the amino-acid sequences of proteins. It is widely acknowledged that their size is now growing exponentially. Although not yet as large as some other scientific databases, they already run to hundreds of GB [1]. For complete genomes, these texts can be very long. The human genome, for example, contains about three billion characters spread over twenty-three pairs of chromosomes, and it holds all the genetic material of a human being. With an increasing number of genome sequences being made available, the difficulty of storing and using these databases has to be addressed. The compression of genetic information is therefore a very important task. A further consideration is the prediction of certain kinds of disease by searching for patterns directly in the compressed domain.