Scientific Independence
of young Researchers
funded by the Italian Ministry of Education, Universities and Research (MIUR)

Project MIUR-SIR (2014) CMACBioSeq grant n. RBSI146R5L

Storing, processing, querying and analyzing a large amount of data is a significant task that holds many obstacles and challenges. Programs for compressing, indexing and analyzing of texts are available. However, the problem of compression and computational analysis for large datasets, such as those generated by current and future sequencing technologies where data in the terabyte range is conceivable, is still a difficult issue: the analysis of the large amount of generated data is a bottleneck for many investigators.

The aim of my proposed project is to develop research in the area of Data Compression and Sequence Analysis (both for the theoretical models and applications in specific fields) by using methods from Combinatorics on Words and by developing algorithms and data structures for the treatment of strings, mainly datasets from Next-generation Sequencing (NGS) technologies or third generation sequencing technologies.

The close relationship between these two areas will add and will provide a fundamental contribution in the advancement of knowledge in both research areas. We plan to introduce new approaches for the compression of short nucleotide sequences from the NGS technologies, together with additional metadata, such as the quality scores required to assess the reliability of the sequencing process, that guarantee the preservation of the salient dataset characteristics when a lossy approach is used. We also plan to introduce new strategies for the analysis of biological sequences, also in succinct form by allowing the random access over compressed data, whose applications can be, but not limited to, de novo sequence assembly, comparison between genome sequences and reference genomic sequences, determining the presence or absence of genomic variant mutations, genome phylogeny problems, etc.

Our goal will be achieved by using tools from combinatorics on words and from information theory. Particular attention will be devoted to an extension of the Burrows-Wheeler Transform to a multiset of texts. Our purposes will take further advantage of deep understanding of the combinatorial properties underlying of these transformations, but also of the development of succinct data structures and space-time-efficient implementations, such as implementations that use the external memory and the parallelization. This goal will be obtained by directing our research towards compression techniques and compressed indexes.

We believe that the methods here introduced here in the biological field can be of considerable interest especially for its possible applications in several fields such as linguistics, network security, etc. For instance, the intrusion detection problem in network security or the text categorization problems in computational linguistics.

Team:

Giovanna Rosone, Principal Investigator (University of Pisa)
Veronica Guerrini, Research fellow (2018, April - 2021, November), (University of Pisa)
Sabrina Mantaci, Associate Professor (University of Palermo)
Nadia Pisanti, Associate Professor (University of Pisa)
Nicola Prezza, (23 months) Research fellow (2018, March - January, 2020), (University of Pisa)
Marinella Sciortino, Associate Professor (University of Palermo)
Davide Verzotto, (12 months) Research fellow (October 2017 - December 2018) (University of Pisa)

Tools:

Multi-string BWT/LCP/SA computation:
[BCR_LCP_GSA_anyLength] [GitHub]
Multi-string ACS Computation by cLCP computation:
[cLCP-mACS_Spire2018] [GitHub]
Parallel Multi-string ACS Computation by cLCP computation:
[GitHub]
ebwt2snp (Positional clustering of the extended Burrows-Wheeler transform and discover SNPs):
[ebwt2snp_Wabi2018] [GitHub]
ebwt2InDel (Positional clustering of the extended Burrows-Wheeler transform and discover SNPs/indels):
[ebwt2snp_v2.zip] [GitHub]
LightMetaEbwt (Lightweight alignment-free and assembly-free framework for metagenomic classification):
[LightMetaEbwt_AlCob2019] [GitHub]
LIME - LightMetaEbwt_v2 (Lightweight alignment-free and assembly-free framework for metagenomic classification):
[LiME] [GitHub]
bwt2lcp (Compute the LCP array from the BWT of a set of DNA strings):
[bwt2lcp_CPM2019] [GitHub]
RLBWT2LCP (Compute minimum LCP value for each BWT equal-letter run):
[GitHub]
r-index (the run-length BWT index):
[GitHub]
gsufsort, building suffix arrays, LCP-arrays and BWTs for string collections:
[GitHub]
linear-centroid-decomposition:
[GitHub]
SACA-K+LA computes Lyndon-array (LA) together with the suffix array (SA) for T[1,n] in linear time:
[GitHub]
LiME_binning (Metagenomics binning):
[GitHub]
BFQzip:
[GitHub]
FASTQcompression, FASTQ Compressor:
[GitHub]

Publications:

Accesso Aperto MIUR. These works have been partly funded by the project CMACBioSeq (“Combinatorial methods for analysis and compression of biological sequences”) grant n. RBSI146R5L of the SIR 2014 program of MIUR.

Guerrini, V.; Louza, F. and Rosone, G. (2022). Lossy Compressor Preserving Variant Calling through Extended BWT. In Proceedings of the 15th International Joint Conference on Biomedical Engineering Systems and Technologies - BIOSTEC/BIOINFORMATICS 2022, ISBN 978-989-758-552-4, pages 38-48. DOI: 10.5220/0010834100003123. BEST PAPER AWARD at BIOSTEC/BIOINFORMATICS 2022
[doi] [bib]
Nicola Prezza and Giovanna Rosone, Space-efficient construction of compressed suffix trees. Theoretical Computer Science. Volume 852, 2021, Pages 138-156, ISSN 0304-3975, https://doi.org/10.1016/j.tcs.2020.11.024.
[doi] [bib] [Preprint arXiv] [Post-print paper] [code/dataset]
Giulia Bernardini, Huiping Chen, Alessio Conte, Roberto Grossi, Grigorios Loukides, Nadia Pisanti, Solon P. Pissis, Giovanna Rosone, and Michelle Sweering, Combinatorial Algorithms for String Sanitization Combinatorial Algorithms for String Sanitization.ACM Transactions on Knowledge Discovery from Data,15, 1, Article 8 (2021), (Published: 7 December 2020)
[doi] [bib] [arXiv] [Post-print paper]
Sabrina Mantaci, Antonio Restivo, Giuseppe Romana, Giovanna Rosone, Marinella Sciortino, A combinatorial view on string attractors. Theoretical Computer Science. Volume 850, 2021, Available online 10 November 2020. Pages 236-248, ISSN 0304-3975, https://doi.org/10.1016/j.tcs.2020.11.006.
[doi] [bib] [Post-print paper]
Felipe A. Louza, Guilherme P. Telles, Simon Gog, Nicola Prezza, Giovanna Rosone gsufsort: constructing suffix arrays, LCP arrays and BWTs for string collections. Algorithms Mol Biol 15, 18 (2020). 10.1186/s13015-020-00177-y, Open Access. http://hdl.handle.net/11568/1061178.
[doi] [bib] [Open Access] [Pubmed] [ResearchGate] [code/dataset]
Nicola Prezza, Nadia Pisanti, Marinella Sciortino, Giovanna Rosone: Variable-order reference-free variant discovery with the Burrows-Wheeler Transform. BMC Bioinform (2020), 21-S(8): 260, 10.1186/s12859-020-03586-3.
[doi] [bib] [Poster] [Open Access] [Pubmed] [ResearchGate] [code/dataset]
Veronica Guerrini, Felipe A. Louza, Giovanna Rosone: Metagenomic analysis through the extended Burrows-Wheeler transform. BMC Bioinform (2020), 21-S(8): 299, 10.1186/s12859-020-03628-w, Open Access, http://hdl.handle.net/11568/1061184.
[doi] [bib] [Open Access] [Pubmed] [code/dataset]
Mai Alzamel, Lorraine A. K. Ayad, Giulia Bernardini, Roberto Grossi, Costas S. Iliopoulos, Nadia Pisanti, Solon P. Pissis, Giovanna Rosone: Comparing Degenerate Strings. Fundamenta Informaticae (2020), 175(1-4): 41-58, 10.3233/FI-2020-1947. http://hdl.handle.net/11568/1053478
[doi] [bib] [Post-print paper]
Nicola Prezza, Giovanna Rosone: Faster Online Computation of the Succinct Longest Previous Factor Array. Lecture Notes in Computer Science (CIE 2020), Volume 12098, pages=339-352, Springer, Cham, Print ISBN 978-3-030-51465-5, 10.1007/978-3-030-51466-2_31. http://hdl.handle.net/11568/1061192
[doi] [bib] [Post-print paper] [ResearchGate]
Giulia Bernardini, Nadia Pisanti, Solon P. Pissis, Giovanna Rosone: Approximate pattern matching on elastic-degenerate text. Theoretical Computer Science (2020), Volume 812, Pages 109-122, Elsevier B.V., ISSN: 0304-3975, 10.1016/j.tcs.2019.08.012.
[doi] [bib] [Post-print paper] [ResearchGate]
Raffaele Giancarlo, Giovanni Manzini, Antonio Restivo, Giovanna Rosone, Marinella Sciortino: The Alternating BWT: An algorithmic perspective. Theoretical Computer Science (2020), Volume 812, Pages 230-243, Elsevier B.V., ISSN: 0304-3975, 10.1016/j.tcs.2019.11.002. http://hdl.handle.net/11568/1061188.
[doi] [bib] [Post-print paper] [ResearchGate] [code/dataset]
Lorraine A.K. Ayad, Giulia Bernardini, Roberto Grossi, Costas S. Iliopoulos, Nadia Pisanti, Solon P. Pissis, Giovanna Rosone: Longest property-preserved common factor: A new string-processing framework. Theoretical Computer Science (2020), Volume 812,Pages 244-251, Elsevier B.V., ISSN: 0304-3975, 10.1016/j.tcs.2020.02.012. http://hdl.handle.net/11568/1034802.
[doi] [bib] [Post-print paper] [ResearchGate]
Travis Gagie, Gonzalo Navarro, and Nicola Prezza: Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded Space. J. ACM , 67, 1, Article 2, 54 pages, 2020.
[doi] [bib] [Preprint arXiv] [ResearchGate]
Jarno Alanko, Giovanna D'Agostino, Alberto Policriti, and Nicola Prezza. Regular languages meet prefix sorting. In Proceedings of the Thirty-First Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '20). Society for Industrial and Applied Mathematics, USA, 911–930, 2020.
[doi] [bib] [Preprint arXiv] [Paper on SODA 2020] [ResearchGate]
Giulia Bernardini, Huiping Chen, Alessio Conte, Roberto Grossi, Grigorios Loukides, Nadia Pisanti, Solon P. Pissis, Giovanna Rosone: String Sanitization: A Combinatorial Approach. Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2019. Lecture Notes in Computer Science, vol 11906. Springer, Cham. https://doi.org/10.1007/978-3-030-46150-8_37
[arXiv] [bib] [Post-print paper] [Paper on ECML PKDD] [ResearchGate]
Nicola Prezza, Nadia Pisanti, Marinella Sciortino, Giovanna Rosone: SNPs detection by eBWT positional clustering. Algorithms for Molecular Biology (2019), 14 (1), art. no. 3. BioMed Central Ltd., ISSN: 17487188, 10.1186/s13015-019-0137-8, Open Access.
[doi] [bib] [Open Access] [Pubmed] [code/dataset] [ResearchGate]
Mai Alzamel, Alessio Conte, Daniele Greco, Veronica Guerrini, Costas Iliopoulos, Nadia Pisanti, Nicola Prezza, Giulia Punzi and Giovanna Rosone Online Algorithms on Antipowers and Antiperiods, SPIRE 2019. Lecture Notes in Computer Science (LNCS), vol 11811. Springer, Cham, ISBN 978-3-030-32685-2
[doi] [bib] [Post-print paper] [ResearchGate]
Felipe A. Louza, Sabrina Mantaci, Giovanni Manzini, Marinella Sciortino, Guilherme P. Telles Inducing the Lyndon Array. SPIRE 2019. Lecture Notes in Computer Science , vol 11811. Springer, Cham.
[doi] [bib] [Post Print - arXiv] [ResearchGate]
Davide Della Giustina, Nicola Prezza, Rossano Venturini A New Linear-Time Algorithm for Centroid Decomposition. SPIRE 2019. Lecture Notes in Computer Science , vol 11811. Springer, Cham.
[doi] [bib] [Preprint arXiv] [Paper on Venturini's personal web page] [code/dataset]
Nicola Prezza: Optimal Rank and Select Queries on Dictionary-Compressed Text. Leibniz International Proceedings in Informatics (LIPIcs), 4:1-4:12.
[doi] [bib] [preprint arXiv] [Paper]
Gonzalo Navarro, Nicola Prezza, Universal compressed text indexing, Theoretical Computer Science, Volume 762, Pages 41-50, 2019
[doi] [bib] [Preprint arXiv] [Paper on repositorio.uchile.cl] [ResearchGate]
Sabrina Mantaci, Antonio Restivo, Giuseppe Romana, Giovanna Rosone, Marinella Sciortino: String Attractors and Combinatorics on Words. ICTCS 2019, CEUR Workshop Proceedings, Volume 2504, 2019, Pages 57-71, Code 155082, ISSN: 16130073
[bib] [arXiv] [Post-print paper]
Giulia Bernardini, Pawel Gawrychowski, Nadia Pisanti, Solon P. Pissis, Giovanna Rosone: Even faster elastic-degenerate string matching via fast matrix multiplication. ICALP 2019. Leibniz International Proceedings in Informatics (LIPIcs), 132, art. no. 21. Schloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing, ISSN: 18688969, ISBN: 9783959771092
[doi] [bib] [arXiv] [Paper] [Presentation] [ResearchGate]
Nicola Prezza, Giovanna Rosone: Space-efficient computation of the LCP array from the Burrows-Wheeler transform. CPM 2019. Leibniz International Proceedings in Informatics (LIPIcs), 128, art. no. 7. Schloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing. ISSN: 18688969, ISBN: 9783959771030
[doi] [bib] [Preprint arXiv] [Paper] [code/dataset]
Raffaele Giancarlo, Giovanni Manzini, Giovanna Rosone, Marinella Sciortino: A new class of searchable and provably highly compressible string transformations. CPM 2019. Leibniz International Proceedings in Informatics (LIPIcs), 128, art. no. 12. Schloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing. ISSN: 18688969, ISBN: 9783959771030
[doi] [bib] [arXiv] [Paper]
Veronica Guerrini, Giovanna Rosone: Lightweight Metagenomic Classification via eBWT. Algorithms for Computational Biology. AlCoB 2019. Lecture Notes in Computer Science, vol 11488. Springer, Cham. ISSN: 03029743, ISBN: 9783030181734
[doi] [bib] [Slide] [Post-print paper] [ResearchGate] [code/dataset]
Fabio Garofalo, Giovanna Rosone, Marinella Sciortino, Davide Verzotto: The colored longest common prefix array computed via sequential scans. SPIRE 2018. Lecture Notes in Computer Science, vol 11147. Springer, Cham.
[doi] [bib] [arXiv] [Post-print paper] [ResearchGate] [code/dataset]
Raffaele Giancarlo, Giovanni Manzini, Antonio Restivo, Giovanna Rosone, Marinella Sciortino: Block Sorting-Based Transformations on Words: Beyond the Magic BWT. DLT 2018. Lecture Notes in Computer Science, vol 11088. Springer, Cham.
[doi] [bib] [Post-print paper] [ResearchGate] [code/dataset]
Lorraine A. K. Ayad, Giulia Bernardini, Roberto Grossi, Costas S. Iliopoulos, Nadia Pisanti, Solon P. Pissis, Giovanna Rosone: Longest Property-Preserved Common Factor. SPIRE 2018. Lecture Notes in Computer Science, vol 11147. Springer, Cham.
[doi] [bib] [Post-print paper] [ResearchGate]
Mai Alzamel, Lorraine A. K. Ayad, Giulia Bernardini, Roberto Grossi, Costas S. Iliopoulos, Nadia Pisanti, Solon P. Pissis and Giovanna Rosone: Degenerate String Comparison and Applications. WABI 2018. Leibniz International Proceedings in Informatics, LIPIcs 2018, 113, art. no. 21, Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik. doi: 10.4230/LIPIcs.WABI.2018.21
[doi] [bib] [Paper] [ResearchGate]
Nicola Prezza, Nadia Pisanti, Marinella Sciortino and Giovanna Rosone: Detecting mutations by eBWT. WABI 2018. Leibniz International Proceedings in Informatics, LIPIcs, 2018, 113, art. no. 3, Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik. doi: 10.4230/LIPIcs.WABI.2018.3
[doi] [bib] [Preprint arXiv] [Paper] [code/dataset]
Dominik Kempa and Nicola Prezza: At the roots of dictionary compression: string attractors. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing (STOC 2018). Association for Computing Machinery, New York, NY, USA, 827–840.
[doi] [bib] [Preprint Arxiv] [ResearchGate]
Sabrina Mantaci, Antonio Restivo, Giovanna Rosone, Marinella Sciortino, Luca Versari: Measuring the clustering effect of BWT via RLE. Theoretical Computer Science (2017) Volume 698, 2017, Pages 79-87, ISSN 0304-3975, 10.1016/j.tcs.2017.07.015.
[doi] [bib] [Post-print paper] [ResearchGate]
Sabrina Mantaci, Antonio Restivo, Giovanna Rosone, Floriana Russo, Marinella Sciortino: On Fixed Points of the Burrows-Wheeler Transform. Fundamenta Informaticae (2017) , vol. 154, no. 1-4, pp. 277-288, 10.3233/FI-2017-1566.
[doi] [bib] [Post-print paper] [ResearchGate]
Sabrina Mantaci, Antonio Restivo, Giovanna Rosone, Marinella Sciortino: Burrows-Wheeler Transform and Run-Length Enconding. WORDS 2017. Lecture Notes in Computer Science, vol 10432, pp 228-239, 2017. doi: 10.1007/978-3-319-66396-8_21 Springer International Publishing.
[doi] [bib] [Post-print paper] [ResearchGate]
Giulia Bernardini, Nadia Pisanti, Solon P. Pissis, Giovanna Rosone: Pattern Matching on Elastic-Degenerate Text with Errors. SPIRE 2017. Lecture Notes in Computer Science, vol 10508, pp 74-90, 2017. doi: 10.1007/978-3-319-67428-5_7. Springer International Publishing.
[doi] [bib] [Post-print paper] [ResearchGate]
Roberto Grossi, Costas Iliopoulos, Chang Liu, Nadia Pisanti, Solon Pissis, Ahmad Retha, Giovanna Rosone, Fatima Vayani and Luca Versari: On-line pattern matching on similar texts. CPM 2017. Leibniz International Proceedings in Informatics (LIPIcs), vol. 78, pp 9:1-9:14, 2017. ISBN 978-3-95977-039-2, ISSN 1868-8969. Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik.
[doi] [bib] [Paper] [ResearchGate]
Anthony J. Cox, Fabio Garofalo, Giovanna Rosone, Marinella Sciortino: Lightweight LCP construction for very large collections of strings. Journal of Discrete Algorithms (2016) 37: 17-33, March 2016, ISSN 1570-8667, 10.1016/j.jda.2016.03.003.
[doi] [bib] [ArXiv] [open archive] [ResearchGate] [code/dataset]

Other workshops/seminars/meeting:

Marinella Sciortino: A combinatorial view on BWT variants. Dipartimento di informatica, University of Verona, February 13, 2020.
Nicola Prezza, Nadia Pisanti, Marinella Sciortino, and Giovanna Rosone, Lightweight Reference-Free Variation Detection using the Burrows-Wheeler Transform. BITS2019-Analysis of Big Omics Data. 16th Annual Meeting of the Bioinformatics Italian Society. June 26-28, 2019, Palermo, Italy. Poster. [Poster]
Veronica Guerrini, Giovanna Rosone Metagenomic analysis through the eBWT. BITS2019. 16th Annual Meeting of the Bioinformatics Italian Society. June 26-28, 2019, Palermo, Italy. Oral Presentation. Extended abstract. [Slide]
Giovanna Rosone: BWT / eBWT similarity. 25 Years of the Burrows-Wheeler Transform. June 10 - 14 , 2019, Dagstuhl Seminar 19241 [Slide]
Marinella Sciortino: Combinatorial Properties of BWT. 25 Years of the Burrows-Wheeler Transform. June 10 - 14 , 2019, Dagstuhl Seminar 19241
Nicola Prezza: String attractors. 25 Years of the Burrows-Wheeler Transform. June 10 - 14 , 2019, Dagstuhl Seminar 19241
Nicola Prezza: Detecting DNA Mutations by Suffix Sorting. Dipartimento di informatica, University of Pisa, June 18, 2018.
Giovanna Rosone: Compression and analysis of DNA sequences via EBWT. Invited Talk to Mathematical Foundations in Bioinformatics (MatBio) 2017, London, September 14, 2017.
Giovanna Rosone: The Burrows-Wheeler Transform: from Combinatorics on Words to Indexing Problems. Workshop FLA 2016, Naples, January 14-16, 2016.

Avviso di Copyright / Copyright Notice

I documenti (inclusi i tools) on-line in questo sito sono forniti al fine di assicurare la tempestiva diffusione dei risultati dell'attività di ricerca a fini non commerciali. I diritti di copyright e ogni altro diritto connesso sono riservati agli autori e agli aventi diritto, anche se essi hanno reso consultabili i documenti qui in forma elettronica. Resta inteso che chiunque effettui il download di tali lavori dovrà aderire ai termini e alle limitazioni imposte dal copyright. Questi documenti sono intesi per uso personale e non possono essere ridistribuiti senza il consenso esplicito degli autori e dei possessori dei diritti di copyright.

The on-line documents (included the tools) in this site are provided as a means to ensure timely dissemination of scholarly and technical work on a noncommercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works are intended for personal use only, and may not be reposted without the explicit permission of the copyright holder.