#### Succinct Data Structures in Information Retrieval: Theory and Practice

###### Full-day tutorial at Sigir 2016

Simon Gog and Rossano Venturini

The current growth and availability of massive amounts of data gathered and processed by applications — Web search engines, textual and biological databases, just to mention a few — has changed the algorithmic requirements of basic processing and mining tools and provide ample motivation for a great deal of new theoretical and practical research on algorithms and data structures.

Not surprisingly, the last decades have seen a massive research in the field of the so-called succinct and compressed data structures. These new kind of data structures mimic the operations of their classical counterparts within a comparable time complexity but requiring much less space. These solutions usually resort to a careful combination of ideas born both in algorithmic and in data compression fields. A large fraction of classical data structures have their succinct counterpart. Most of these results are of practical interest and several implementations exist, for example we mention SDSL, Succinct, Facebook’s Folly, and Sux.

Several authors are successfully applying the ideas at the basis of the succinct data structures to solve problems in other fields of research. Information retrieval community profits a lot from these data structures as there exist several applications in which them play a central role, e.g., posting lists representation, language model representation, indexing (social) graphs, query auto-completion, document retrieval and indexing dictionary of strings, just to mention the most recent ones.

The tutorial will introduce this field of research by presenting the most important succinct data structures to represent set of integers, set of points, trees, graphs and strings together with their most important applications to Information Retrieval problems. The introduction of the succinct data structures will be sustained with a practical session with

programming handouts to solve. This will allow the attendees to directly experiment with implementations of these solutions on real datasets and understand the potential benefits they can bring on their own projects.

Please, take a look at this pdf for a more detailed description of the tutorial.

##### Objectives

Indexing is the key to efficient query processing in the vast majority of IR applications. While index compression is a relatively old and established method in IR and covered by all relevant text books the new data structures which combine compression and efficient execution of operations — like random access — are a relatively new field of which many IR researchers may not be aware of. Therefore the objectives of this tutorials are

- to introduce the audience to the world of succinct data structures
- to teach the basic techniques of succinct structures
- to present the practical impact of the structures to selected applications in IR
- to show the participants how they can implement their own succinct structures on top of existing basic structures provided by the succinct data structure library.

##### Prerequisites and support material

The theoretical part of the tutorial requires a basic knowledge of algorithms and data structures. The practical part and its handouts require a basic knowledge of C++, even if the knowledge of any modern programming language suffice for most of the tutorial.

Slides and code of the tutorial.

##### Schedule of the tutorial

The tutorial is subdivided in two parts of 3 hours each. The goal of the first part is to describe the most important succinct data structures. We will present the theoretical achievements in this field of research by focusing our attention to the ones that have demonstrated their relevance in practice. The introduction of each of these solutions will be sustained by presenting as selection of Information Retrieval problems in which they have an immediate application.

The goal of the second part is to present ready-to-use implementations of these data structures. This is done by introducing the Succinct Data Structure Library (SDSL). This library covers all the data structures presented in the previous part and has one of the presenter as its creator and one of its main developers. The speakers will also prepare a set of programming handouts which will allow the attendees to experiment with the learned concepts. The ultimate objective of the tutorial is to demonstrate to the attendees the potential benefit that succinct data structures can bring in their applications at a tiny implementation cost.

**Part I**

- Introduction to succinct data structures.
- Motivation and history
- Examples of applications in various domains

- Basic operations on binary vectors
- Rank and Select on uncompressed and compressed vectors

- Elias-Fano representation
- Rank and Select on sparse binary vectors
- Representing increasing sequences

- Succinct representations of trees and range minimum/maximum queries
- LOUDS and balanced parenthesis sequences
- Range-min-max tree

- Rank and Select for sequences with larger alphabets: Wavelet trees
- Rank and Select on general sequences
- Orthogonal range queries on grids

- Range minimum/maximum queries
- Overview on Compressed full-text indexes, Range minimum/maximum queries and the document retrieval solutions
- Data structures for IR
- Space-efficient string dictionaries
- Inverted indexes
- Query auto-completion systems
- Document retrieval systems

**Part II**

- Overview about SDSL components and concepts
- Detailed presentation of components via small code examples
- Exploration of practical performance of structures
- Presentation of benchmarking facilities
- Programming handouts

##### References

- K. Sadakane and G. Navarro

Fully-functional succinct trees

Proceedings of the 21st annual acm-siam symposium on discrete algorithms (SODA), 2010

[Bibtex]`@INPROCEEDINGS{SN10soda, TITLE = "Fully-Functional Succinct Trees", AUTHOR = "K. Sadakane and G. Navarro", BOOKTITLE = {Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms ({SODA})}, YEAR = 2010, PAGES = "134--149" }`

- G. Navarro

Spaces, trees and colors: the algorithmic landscape of document retrieval on sequences

ACM computing surveys, 2014

[Bibtex]`@article {Nav14acmsrvy, TITLE = "Spaces, Trees and Colors: The Algorithmic Landscape of Document Retrieval on Sequences", AUTHOR = "G. Navarro", JOURNAL = {{ACM} Computing Surveys}, YEAR = 2014, VOLUME = 46, NUMBER = 4, PAGES = "article 52", NOTE = "47 pages" }`

- K. Sadakane

Succinct data structures for flexible text retrieval systems

J. discrete algorithms, 2007

[Bibtex]`@article{Sada07jda, author = {Kunihiko Sadakane}, title = {Succinct data structures for flexible text retrieval systems}, journal = {J. Discrete Algorithms}, volume = {5}, number = {1}, pages = {12--22}, year = {2007}, url = {http://dx.doi.org/10.1016/j.jda.2006.03.011}, doi = {10.1016/j.jda.2006.03.011}, timestamp = {Thu, 10 May 2007 08:54:30 +0200}, biburl = {http://dblp.uni-trier.de/rec/bib/journals/jda/Sadakane07}, bibsource = {dblp computer science bibliography, http://dblp.org} }`

- D. Okanohara and K. Sadakane

Practical entropy-compressed rank/select dictionary

Proceedings of the nine workshop on algorithm engineering and experiments (ALENEX), 2007

[Bibtex]`@inproceedings{OS07alenex, author = {Daisuke Okanohara and Kunihiko Sadakane}, title = {Practical Entropy-Compressed Rank/Select Dictionary}, booktitle = {Proceedings of the Nine Workshop on Algorithm Engineering and Experiments ({ALENEX})}, year = {2007}, url = {http://dx.doi.org/10.1137/1.9781611972870.6}, doi = {10.1137/1.9781611972870.6}, timestamp = {Wed, 12 Feb 2014 17:08:14 +0100}, biburl = {http://dblp.uni-trier.de/rec/bib/conf/alenex/OkanoharaS07}, bibsource = {dblp computer science bibliography, http://dblp.org} }`

- S. Gog, A. Moffat, and M. Petri

On identifying phrases using collection statistics

Proceedings of advances in information retrieval – 37th european conference on IR research, ECIR, 2015

[Bibtex]`@inproceedings{gmp15ecir, author = {Simon Gog and Alistair Moffat and Matthias Petri}, title = {On Identifying Phrases Using Collection Statistics}, booktitle = {Proceedings of Advances in Information Retrieval - 37th European Conference on {IR} Research, {ECIR}}, pages = {278--283}, year = {2015}, url = {http://dx.doi.org/10.1007/978-3-319-16354-3_30}, doi = {10.1007/978-3-319-16354-3_30}, timestamp = {Tue, 17 Mar 2015 15:14:29 +0100}, biburl = {http://dblp.uni-trier.de/rec/bib/conf/ecir/GogMP15}, bibsource = {dblp computer science bibliography, http://dblp.org} }`

- S. Gog and M. Petri

Compact indexes for flexible top-k retrieval

Combinatorial pattern matching – 26th annual symposium, CPM, 2015

[Bibtex]`@inproceedings{GP15cpm, author = {Simon Gog and Matthias Petri}, title = {Compact Indexes for Flexible Top-k Retrieval}, booktitle = {Combinatorial Pattern Matching - 26th Annual Symposium, {CPM}}, pages = {207--218}, year = {2015}, url = {http://dx.doi.org/10.1007/978-3-319-19929-0_18}, doi = {10.1007/978-3-319-19929-0_18}, timestamp = {Tue, 16 Jun 2015 12:30:58 +0200}, biburl = {http://dblp.uni-trier.de/rec/bib/conf/cpm/GogP15}, bibsource = {dblp computer science bibliography, http://dblp.org} }`

- M. Patil, S. V. Thankachan, R. Shah, W. Hon, J. S. Vitter, and S. Chandrasekaran

Inverted indexes for phrases and strings

Proceeding of the 34th international ACM SIGIR conference on research and development in information retrieval, SIGIR, 2011

[Bibtex]`@inproceedings{PTSHVC11sigir, author = {Manish Patil and Sharma V. Thankachan and Rahul Shah and Wing-Kai Hon and Jeffrey Scott Vitter and Sabrina Chandrasekaran}, title = {Inverted indexes for phrases and strings}, booktitle = {Proceeding of the 34th International {ACM} {SIGIR} Conference on Research and Development in Information Retrieval, {SIGIR}}, pages = {555--564}, year = {2011}, url = {http://doi.acm.org/10.1145/2009916.2009992}, doi = {10.1145/2009916.2009992}, timestamp = {Fri, 05 Aug 2011 15:26:16 +0200}, biburl = {http://dblp.uni-trier.de/rec/bib/conf/sigir/PatilTSHVC11}, bibsource = {dblp computer science bibliography, http://dblp.org} }`

- G. Navarro and Y. Nekrich

Top-k document retrieval in optimal time and linear space

Proceedings of the twenty-third annual ACM-SIAM symposium on discrete algorithms, SODA, 2012

[Bibtex]`@inproceedings{NN12soda, author = {Gonzalo Navarro and Yakov Nekrich}, title = {Top-k document retrieval in optimal time and linear space}, booktitle = {Proceedings of the Twenty-Third Annual {ACM-SIAM} Symposium on Discrete Algorithms, {SODA}}, pages = {1066--1077}, year = {2012}, timestamp = {Wed, 12 Feb 2014 17:08:16 +0100}, biburl = {http://dblp.uni-trier.de/rec/bib/conf/soda/NavarroN12}, bibsource = {dblp computer science bibliography, http://dblp.org} }`

- S. Gog and G. Navarro

Improved single-term top-k document retrieval

Proceedings of the seventeenth workshop on algorithm engineering and experiments, ALENEX, 2015

[Bibtex]`@inproceedings{GN15alenex, author = {Simon Gog and Gonzalo Navarro}, title = {Improved Single-Term Top-k Document Retrieval}, booktitle = {Proceedings of the Seventeenth Workshop on Algorithm Engineering and Experiments, {ALENEX}}, pages = {24--32}, year = {2015}, url = {http://dx.doi.org/10.1137/1.9781611973754.3}, timestamp = {Wed, 28 Jan 2015 15:17:29 +0100}, biburl = {http://dblp.uni-trier.de/rec/bib/conf/alenex/GogN15}, bibsource = {dblp computer science bibliography, http://dblp.org} }`

- R. Grossi and G. Ottaviano

Fast compressed tries through path decompositions

ACM journal of experimental algorithmics, 2014

[Bibtex]`@article{GO14jea, author = {Roberto Grossi and Giuseppe Ottaviano}, title = {Fast Compressed Tries through Path Decompositions}, journal = {{ACM} Journal of Experimental Algorithmics}, volume = {19}, number = {1}, year = {2014}, url = {http://doi.acm.org/10.1145/2656332}, doi = {10.1145/2656332}, timestamp = {Tue, 07 Apr 2015 13:28:42 +0200}, biburl = {http://dblp.uni-trier.de/rec/bib/journals/jea/GrossiO14}, bibsource = {dblp computer science bibliography, http://dblp.org} }`

- G. Navarro and V. Mäkinen

Compressed full text indexes

Acm computing surveys, 2007

[Bibtex]`@article{NM-survey06, author = "G. Navarro and V. M{\"a}kinen", title = "Compressed full text indexes", journal = "ACM Computing Surveys", volume = "39", number = "1", year = 2007}`

- S. Gog, T. Beller, A. Moffat, and M. Petri

From theory to practice: plug and play with succinct data structures

Proceedings of the 13th international symposium experimental algorithms (SEA), 2014

[Bibtex]`@inproceedings{GogBMP14, author = {Simon Gog and Timo Beller and Alistair Moffat and Matthias Petri}, title = {From Theory to Practice: Plug and Play with Succinct Data Structures}, booktitle = {Proceedings of the 13th International Symposium Experimental Algorithms ({SEA})}, pages = {326--337}, year = {2014} }`

- P. Ferragina and S. S. Rao, “Tree compression and indexing.” , 2008.

[Bibtex]`@incollection{FerraginaR08, author = {Paolo Ferragina and S. Srinivasa Rao}, title = {Tree Compression and Indexing}, booktitle = {Encyclopedia of Algorithms}, year = {2008} }`

- J. Barbay, “Succinct and compressed data structures for permutations and integer functions.” , 2015.

[Bibtex]`@incollection{Barbay15, author = {J{\'{e}}r{\'{e}}my Barbay}, title = {Succinct and Compressed Data Structures for Permutations and Integer Functions}, booktitle = {Encyclopedia of Algorithms}, year = {2015} }`

- J. Barbay and I. J. Munro, “Succinct encoding of permutations: applications to text indexing.” , 2008.

[Bibtex]`@incollection{BarbayM08, author = {J{\'{e}}r{\'{e}}my Barbay and J. Ian Munro}, title = {Succinct Encoding of Permutations: Applications to Text Indexing}, booktitle = {Encyclopedia of Algorithms}, year = {2008} }`

- N. Rahman and R. Raman, “Rank and select operations on binary strings.” , 2008.

[Bibtex]`@incollection{RahmanR08, author = {Naila Rahman and Rajeev Raman}, title = {Rank and Select Operations on Binary Strings}, booktitle = {Encyclopedia of Algorithms}, year = {2008} }`

- P. Ferragina, R. González, G. Navarro, and R. Venturini

Compressed text indexes: from theory to practice

Acm journal of experimental algorithmics, 2008

[Bibtex]`@article{ExperSurvey, author = {P. Ferragina and R. Gonz{\'a}lez and G. Navarro and R. Venturini}, title = {Compressed text indexes: From theory to practice}, journal = {ACM Journal of Experimental Algorithmics}, volume = {13}, year = {2008}, ee = {http://doi.acm.org/10.1145/1412228.1455268}, bibsource = {DBLP, http://dblp.uni-trier.de} }`

- R. Konow, G. Navarro, C. L. A. Clarke, and A. L. –

Faster and smaller inverted indices with treaps

Proceedings of the 36th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2013

[Bibtex]`@inproceedings{KonowNCL13, author = {Roberto Konow and Gonzalo Navarro and Charles L. A. Clarke and Alejandro L{\'{o}}pez{-}Ortiz}, title = {Faster and smaller inverted indices with treaps}, booktitle = {Proceedings of the 36th {A}nnual {I}nternational {ACM SIGIR} {C}onference on {R}esearch and {D}evelopment in {I}nformation {R}etrieval ({SIGIR})}, pages = {193--202}, year = {2013} }`

- B. P. Hsu and G. Ottaviano

Space-efficient data structures for top-\emphk completion

Proceedings of the 22nd international world wide web conference (WWW), 2013

[Bibtex]`@inproceedings{HsuO13, author = {Bo-June Paul Hsu and Giuseppe Ottaviano}, title = {Space-efficient data structures for Top-\emph{k} completion}, booktitle = {Proceedings of the 22nd International World Wide Web Conference ({WWW})}, pages = {583--594}, year = {2013} }`

- S. Vigna

Quasi-succinct indices

Proceedings of the sixth ACM international conference on web search and data mining (WSDM), 2013

[Bibtex]`@inproceedings{Vigna13, author = {Sebastiano Vigna}, title = {Quasi-succinct indices}, booktitle = {Proceedings of the Sixth {ACM} International Conference on Web Search and Data Mining ({WSDM})}, pages = {83--92}, year = {2013} }`

- G. Ottaviano and R. Venturini

Partitioned elias-fano indexes

Proceedings of the 37th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2014

[Bibtex]`@inproceedings{SIGIR14s, author = {Giuseppe Ottaviano and Rossano Venturini}, title = {Partitioned Elias-Fano Indexes}, booktitle = {{P}roceedings of the 37th {A}nnual {I}nternational {ACM SIGIR} {C}onference on {R}esearch and {D}evelopment in {I}nformation {R}etrieval {(SIGIR)}}, year = {2014}, pages = {273-282} }`

- G. Ottaviano, N. Tonellotto, and R. Venturini

Optimal space-time tradeoffs for inverted indexes

Proceedings of the 8th Annual International ACM Conference on Web Search and Data Mining (WSDM), 2015

[Bibtex]`@inproceedings{WSDM15s, author = {Giuseppe Ottaviano and Nicola Tonellotto and Rossano Venturini}, title = {Optimal Space-time Tradeoffs for Inverted Indexes}, booktitle = {{P}roceedings of the 8th {A}nnual {I}nternational {ACM} {C}onference on {W}eb {S}earch and {D}ata {M}ining ({WSDM})}, year = {2015}, pages = {-} }`

- P. Ferragina, F. Piccinno, and R. Venturini

Compressed indexes for string-searching in labeled graphs

Proceedings of the 24th International Conference on World Wide Web (WWW), 2015

[Bibtex]`@inproceedings{WWW15s, author = {Paolo Ferragina and Francesco Piccinno and Rossano Venturini}, title = {Compressed indexes for string-searching in labeled graphs}, booktitle = {{P}roceedings of the 24th {I}nternational {C}onference on {W}orld {W}ide {W}eb ({WWW})}, year = {2015}, pages = {--} }`

- R. M. Fano

On the number of bits required to implement anassociative memory

Memorandum 61, computer structures group, project mac, 1971

[Bibtex]`@article{Fano, author = {Robert Mario Fano}, title = {On the number of bits required to implement anassociative memory}, journal = {Memorandum 61, Computer Structures Group, Project MAC}, year = {1971}, publisher = {{MIT}}, address = {Cambridge}, }`

- P. Elias

Efficient storage and retrieval by content and address of static files

Journal of the acm, 1974

[Bibtex]`@article{Elias, author = {Peter Elias}, title = {Efficient Storage and Retrieval by Content and Address of Static Files}, journal = {Journal of the ACM}, volume = {21}, issue = {2}, year = {1974}, issn = {0004-5411}, pages = {246--260}, numpages = {15}, acmid = {321820}, publisher = {ACM}, address = {New York, NY, USA}, }`

- J. Fischer and V. Heun

Space-efficient preprocessing schemes for range minimum queries on static arrays

SIAM journal on computing, 2011

[Bibtex]`@article{fischer, author = {Johannes Fischer and Volker Heun}, title = {Space-Efficient Preprocessing Schemes for Range Minimum Queries on Static Arrays}, journal = {{SIAM} Journal on Computing}, volume = {40}, number = {2}, year = {2011}, pages = {465-492}, ee = {http://dx.doi.org/10.1137/090779759}, bibsource = {DBLP, http://dblp.uni-trier.de} }`

- P. Ferragina and R. Venturini

The compressed permuterm index

Acm transactions on algorithms, 2010

[Bibtex]`@article{FV10s, author = {Paolo Ferragina and Rossano Venturini}, title = {The compressed permuterm index}, journal = {ACM Transactions on Algorithms}, volume = {7}, number = {1}, year = {2010}, pages = {10}, }`

- P. Ferragina and G. Manzini

Indexing compressed text

Journal of the acm, 2005

[Bibtex]`@article{FM, author = {Paolo Ferragina and Giovanni Manzini}, title = {Indexing compressed text}, journal = {Journal of the ACM}, volume = {52}, number = {4}, year = {2005}, pages = {552-581} }`

- W. Hon, R. Shah, and J. S. Vitter

Space-efficient framework for top-k string retrieval problems

Proceedings of the 50th annual IEEE symposium on foundations of computer science (FOCS), 2009

[Bibtex]`@inproceedings{HonSV09, author = {Wing-Kai Hon and Rahul Shah and Jeffrey Scott Vitter}, title = {Space-Efficient Framework for Top-k String Retrieval Problems}, booktitle = {Proceedings of the 50th Annual {IEEE} Symposium on Foundations of Computer Science ({FOCS})}, pages = {713--722}, year = {2009} }`

- G. Navarro

Wavelet trees for all

Journal discrete algorithms, 2014

[Bibtex]`@article{Navarro14, author = {Gonzalo Navarro}, title = {Wavelet trees for all}, journal = {Journal Discrete Algorithms}, volume = {25}, pages = {2--20}, year = {2014}, url = {http://dx.doi.org/10.1016/j.jda.2013.07.004}, doi = {10.1016/j.jda.2013.07.004}, timestamp = {Wed, 02 Apr 2014 13:53:55 +0200}, biburl = {http://dblp.uni-trier.de/rec/bib/journals/jda/Navarro14}, bibsource = {dblp computer science bibliography, http://dblp.org} }`

- P. Ferragina and R. Venturini, “Indexing compressed text.” , 2009, pp. 1442-1448.

[Bibtex]`@incollection{FerraginaV09s, author = {P. Ferragina and R. Venturini}, title = {Indexing Compressed Text}, booktitle = {Encyclopedia of Database Systems}, year = {2009}, pages = {1442-1448}, ee = {http://dx.doi.org/10.1007/978-0-387-39940-9_1144} }`

- D. Benoit, E. D. Demaine, J. I. Munro, R. Raman, V.Raman, and S. S. Rao

Representing trees of higher degree

Algorithmica, 2005

[Bibtex]`@article{BenoitDMRRR05, author = {D. Benoit and E.D. Demaine and J. I. Munro and R. Raman and V.Raman and S. S. Rao}, title = {Representing Trees of Higher Degree}, journal = {Algorithmica}, volume = {43}, number = {4}, year = {2005}, pages = {275-292} }`