Project MIUR-SIR (2014) CMACBioSeq grant n. RBSI146R5L

Combinatorial methods for analysis and compression of biological sequences

Principal investigator: Giovanna Rosone

Storing, processing, querying and analyzing a large amount of data is a significant task that holds many obstacles and challenges. Programs for compressing, indexing and analyzing of texts are available. However, the problem of compression and computational analysis for large datasets, such as those generated by current and future sequencing technologies where data in the terabyte range is conceivable, is still a difficult issue: the analysis of the large amount of generated data is a bottleneck for many investigators.

The aim of my proposed project is to develop research in the area of Data Compression and Sequence Analysis (both for the theoretical models and applications in specific fields) by using methods from Combinatorics on Words and by developing algorithms and data structures for the treatment of strings, mainly datasets from Next-generation Sequencing (NGS) technologies or third generation sequencing technologies.

The close relationship between these two areas will add and will provide a fundamental contribution in the advancement of knowledge in both research areas. We plan to introduce new approaches for the compression of short nucleotide sequences from the NGS technologies, together with additional metadata, such as the quality scores required to assess the reliability of the sequencing process, that guarantee the preservation of the salient dataset characteristics when a lossy approach is used. We also plan to introduce new strategies for the analysis of biological sequences, also in succinct form by allowing the random access over compressed data, whose applications can be, but not limited to, de novo sequence assembly, comparison between genome sequences and reference genomic sequences, determining the presence or absence of genomic variant mutations, genome phylogeny problems, etc.

Our goal will be achieved by using tools from combinatorics on words and from information theory. Particular attention will be devoted to an extension of the Burrows-Wheeler Transform to a multiset of texts. Our purposes will take further advantage of deep understanding of the combinatorial properties underlying of these transformations, but also of the development of succinct data structures and space-time-efficient implementations, such as implementations that use the external memory and the parallelization. This goal will be obtained by directing our research towards compression techniques and compressed indexes.

We believe that the methods here introduced here in the biological field can be of considerable interest especially for its possible applications in several fields such as linguistics, network security, etc. For instance, the intrusion detection problem in network security or the text categorization problems in computational linguistics.