Research

Project MIUR-SIR (2014) CMACBioSeq grant n. RBSI146R5L

Storing, processing, querying and analyzing a large amount of data is a significant task that holds many obstacles and challenges. Programs for compressing, indexing and analyzing of texts are available. However, the problem of compression and computational analysis for large datasets, such as those generated by current and future sequencing technologies where data in the terabyte range is conceivable, is still a difficult issue: the analysis of the large amount of generated data is a bottleneck for many investigators.

The aim of my proposed project is to develop research in the area of Data Compression and Sequence Analysis (both for the theoretical models and applications in specific fields) by using methods from Combinatorics on Words and by developing algorithms and data structures for the treatment of strings, mainly datasets from Next-generation Sequencing (NGS) technologies or third generation sequencing technologies.

The close relationship between these two areas will add and will provide a fundamental contribution in the advancement of knowledge in both research areas. We plan to introduce new approaches for the compression of short nucleotide sequences from the NGS technologies, together with additional metadata, such as the quality scores required to assess the reliability of the sequencing process, that guarantee the preservation of the salient dataset characteristics when a lossy approach is used. We also plan to introduce new strategies for the analysis of biological sequences, also in succinct form by allowing the random access over compressed data, whose applications can be, but not limited to, de novo sequence assembly, comparison between genome sequences and reference genomic sequences, determining the presence or absence of genomic variant mutations, genome phylogeny problems, etc.

Our goal will be achieved by using tools from combinatorics on words and from information theory. Particular attention will be devoted to an extension of the Burrows-Wheeler Transform to a multiset of texts. Our purposes will take further advantage of deep understanding of the combinatorial properties underlying of these transformations, but also of the development of succinct data structures and space-time-efficient implementations, such as implementations that use the external memory and the parallelization. This goal will be obtained by directing our research towards compression techniques and compressed indexes.

We believe that the methods here introduced here in the biological field can be of considerable interest especially for its possible applications in several fields such as linguistics, network security, etc. For instance, the intrusion detection problem in network security or the text categorization problems in computational linguistics.

Team:

  • Giovanna Rosone, Principal Investigator (University of Pisa)
  • Veronica Guerrini, Research fellow (2018, April - 2021, November), (University of Pisa)
  • Sabrina Mantaci, Associate Professor (University of Palermo)
  • Nadia Pisanti, Associate Professor (University of Pisa)
  • Nicola Prezza, (23 months) Research fellow (2018, March - January, 2020), (University of Pisa)
  • Marinella Sciortino, Associate Professor (University of Palermo)
  • Davide Verzotto, (12 months) Research fellow (October 2017 - December 2018) (University of Pisa)

Tools:

  • Multi-string BWT/LCP/SA computation:
    [BCR_LCP_GSA_anyLength] [GitHub]
  • Multi-string ACS Computation by cLCP computation:
    [cLCP-mACS_Spire2018] [GitHub]
  • Parallel Multi-string ACS Computation by cLCP computation:
    [GitHub]
  • ebwt2snp (Positional clustering of the extended Burrows-Wheeler transform and discover SNPs):
    [ebwt2snp_Wabi2018] [GitHub]
  • ebwt2InDel (Positional clustering of the extended Burrows-Wheeler transform and discover SNPs/indels):
    [ebwt2snp_v2.zip] [GitHub]
  • LightMetaEbwt (Lightweight alignment-free and assembly-free framework for metagenomic classification):
    [LightMetaEbwt_AlCob2019] [GitHub]
  • LIME - LightMetaEbwt_v2 (Lightweight alignment-free and assembly-free framework for metagenomic classification):
    [LiME] [GitHub]
  • bwt2lcp (Compute the LCP array from the BWT of a set of DNA strings):
    [bwt2lcp_CPM2019] [GitHub]
  • RLBWT2LCP (Compute minimum LCP value for each BWT equal-letter run):
    [GitHub]
  • r-index (the run-length BWT index):
    [GitHub]
  • gsufsort, building suffix arrays, LCP-arrays and BWTs for string collections:
    [GitHub]
  • linear-centroid-decomposition:
    [GitHub]
  • SACA-K+LA computes Lyndon-array (LA) together with the suffix array (SA) for T[1,n] in linear time:
    [GitHub]
  • LiME_binning (Metagenomics binning):
    [GitHub]
  • BFQzip:
    [GitHub]
  • FASTQcompression, FASTQ Compressor:
    [GitHub]

Publications:

Accesso Aperto MIUR. These works have been partly funded by the project CMACBioSeq (“Combinatorial methods for analysis and compression of biological sequences”) grant n. RBSI146R5L of the SIR 2014 program of MIUR.

Other workshops/seminars/meeting:

  • Marinella Sciortino: A combinatorial view on BWT variants. Dipartimento di informatica, University of Verona, February 13, 2020.
  • Nicola Prezza, Nadia Pisanti, Marinella Sciortino, and Giovanna Rosone, Lightweight Reference-Free Variation Detection using the Burrows-Wheeler Transform. BITS2019-Analysis of Big Omics Data. 16th Annual Meeting of the Bioinformatics Italian Society. June 26-28, 2019, Palermo, Italy. Poster. [Poster]
  • Veronica Guerrini, Giovanna Rosone Metagenomic analysis through the eBWT. BITS2019. 16th Annual Meeting of the Bioinformatics Italian Society. June 26-28, 2019, Palermo, Italy. Oral Presentation. Extended abstract. [Slide]
  • Giovanna Rosone: BWT / eBWT similarity. 25 Years of the Burrows-Wheeler Transform. June 10 - 14 , 2019, Dagstuhl Seminar 19241 [Slide]
  • Marinella Sciortino: Combinatorial Properties of BWT. 25 Years of the Burrows-Wheeler Transform. June 10 - 14 , 2019, Dagstuhl Seminar 19241
  • Nicola Prezza: String attractors. 25 Years of the Burrows-Wheeler Transform. June 10 - 14 , 2019, Dagstuhl Seminar 19241
  • Nicola Prezza: Detecting DNA Mutations by Suffix Sorting. Dipartimento di informatica, University of Pisa, June 18, 2018.
  • Giovanna Rosone: Compression and analysis of DNA sequences via EBWT. Invited Talk to Mathematical Foundations in Bioinformatics (MatBio) 2017, London, September 14, 2017.
  • Giovanna Rosone: The Burrows-Wheeler Transform: from Combinatorics on Words to Indexing Problems. Workshop FLA 2016, Naples, January 14-16, 2016.

Articles

Avviso di Copyright / Copyright Notice

I documenti (inclusi i tools) on-line in questo sito sono forniti al fine di assicurare la tempestiva diffusione dei risultati dell'attività di ricerca a fini non commerciali. I diritti di copyright e ogni altro diritto connesso sono riservati agli autori e agli aventi diritto, anche se essi hanno reso consultabili i documenti qui in forma elettronica. Resta inteso che chiunque effettui il download di tali lavori dovrà aderire ai termini e alle limitazioni imposte dal copyright. Questi documenti sono intesi per uso personale e non possono essere ridistribuiti senza il consenso esplicito degli autori e dei possessori dei diritti di copyright.


The on-line documents (included the tools) in this site are provided as a means to ensure timely dissemination of scholarly and technical work on a noncommercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works are intended for personal use only, and may not be reposted without the explicit permission of the copyright holder.

top