Succinct Data Structures in Information Retrieval: Theory and Practice

Full-day tutorial at Sigir 2016

 

Simon Gog and Rossano Venturini

The current growth and availability of massive amounts of data gathered and processed by applications — Web search engines, textual and biological databases, just to mention a few — has changed the algorithmic requirements of basic processing and mining tools and provide ample motivation for a great deal of new theoretical and practical research on algorithms and data structures.

Not surprisingly, the last decades have seen a massive research in the field of the so-called succinct and compressed data structures. These new kind of data structures mimic the operations of their classical counterparts within a comparable time complexity but requiring much less space.  These solutions usually resort to a careful combination of ideas born both in algorithmic and in data compression fields. A large fraction of classical data structures have their succinct counterpart. Most of these results are of practical interest and several implementations exist, for example we mention SDSL, SuccinctFacebook’s Folly, and Sux.

Several authors are successfully applying the ideas at the basis of the succinct data structures to solve problems in other fields of research. Information retrieval community profits a lot from these data structures as there exist several applications in which them play a central role, e.g., posting lists representation, language model representation, indexing (social) graphs, query auto-completion, document retrieval and indexing dictionary of strings, just to mention the most recent ones.

The tutorial will introduce this field of research by presenting the most important succinct data structures to represent set of integers, set of points, trees, graphs and strings together with their most important applications to Information Retrieval problems.  The introduction of the succinct data structures will be sustained with a practical session with
programming handouts to solve. This will allow the attendees to directly experiment with implementations of these solutions on real datasets and understand the potential benefits they can bring on their own projects.

Please, take a look at this pdf for a more detailed description of the tutorial.

 

Objectives

Indexing is the key to efficient query processing in the vast majority of IR applications. While index compression is a relatively old and established method in IR and covered by all relevant text books the new data structures which combine compression and efficient execution of operations — like random access — are a relatively new field of which many IR researchers may not be aware of. Therefore the objectives of this tutorials are

  1. to introduce the audience to the world of succinct data structures
  2. to teach the basic techniques of succinct structures
  3. to present the practical impact of the structures to selected applications in IR
  4. to show the participants how they can implement their own succinct structures on top of existing basic structures provided by the succinct data structure library.

 

Prerequisites and support material

The theoretical part of the tutorial requires a basic knowledge of algorithms and data structures. The practical part and its handouts require a basic knowledge of C++, even if the knowledge of any modern programming language suffice for most of the tutorial.

Slides and code of the tutorial.

 

Schedule of the tutorial

The tutorial is subdivided in two parts of 3 hours each. The goal of the first part is to describe the most important succinct data structures. We will present the theoretical achievements in this field of research by focusing our attention to the ones that have demonstrated their relevance in practice. The introduction of each of these solutions will be sustained by presenting as selection of Information Retrieval problems in which they have an immediate application.

The goal of the second part is to present ready-to-use implementations of these data structures. This is done by introducing the Succinct Data Structure Library (SDSL). This library covers all the data structures presented in the previous part and has one of the presenter as its creator and one of its main developers. The speakers will also prepare a set of programming handouts which will allow the attendees to experiment with the learned concepts. The ultimate objective of the tutorial is to demonstrate to the attendees the potential benefit that succinct data structures can bring in their applications at a tiny implementation cost.

Part I

  • Introduction to succinct data structures.
    • Motivation and history
    • Examples of applications in various domains
  •  Basic operations on binary vectors
    • Rank and Select on uncompressed and compressed vectors
  • Elias-Fano representation
    • Rank and Select on sparse binary vectors
    • Representing increasing sequences
  • Succinct representations of trees and range minimum/maximum queries
    • LOUDS and balanced parenthesis sequences
    • Range-min-max tree
  • Rank and Select for sequences with larger alphabets: Wavelet trees
    • Rank and Select on general sequences
    • Orthogonal range queries on grids
  • Range minimum/maximum queries
  • Overview on Compressed full-text indexes, Range minimum/maximum queries and the document retrieval solutions
  • Data structures for IR
    • Space-efficient string dictionaries
    • Inverted indexes
    • Query auto-completion systems
    • Document retrieval systems

Part II

  • Overview about SDSL components and concepts
  • Detailed presentation of components via small code examples
  • Exploration of practical performance of structures
  • Presentation of benchmarking facilities
  • Programming handouts

 

References
  • K. Sadakane and G. Navarro
    Fully-functional succinct trees
    Proceedings of the 21st annual acm-siam symposium on discrete algorithms (SODA), 2010
    [Bibtex]
    @INPROCEEDINGS{SN10soda,
    TITLE = "Fully-Functional Succinct Trees",
    AUTHOR = "K. Sadakane and G. Navarro",
    BOOKTITLE = {Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms ({SODA})},
    YEAR = 2010,
    PAGES = "134--149"
    }
  • G. Navarro
    Spaces, trees and colors: the algorithmic landscape of document retrieval on sequences
    ACM computing surveys, 2014
    [Bibtex]
    @article {Nav14acmsrvy,
    TITLE = "Spaces, Trees and Colors: The Algorithmic Landscape of Document Retrieval on Sequences",
    AUTHOR = "G. Navarro",
    JOURNAL = {{ACM} Computing Surveys},
    YEAR = 2014,
    VOLUME = 46,
    NUMBER = 4,
    PAGES = "article 52",
    NOTE = "47 pages"
    }
  • [DOI] K. Sadakane
    Succinct data structures for flexible text retrieval systems
    J. discrete algorithms, 2007
    [Bibtex]
    @article{Sada07jda,
    author = {Kunihiko Sadakane},
    title = {Succinct data structures for flexible text retrieval systems},
    journal = {J. Discrete Algorithms},
    volume = {5},
    number = {1},
    pages = {12--22},
    year = {2007},
    url = {http://dx.doi.org/10.1016/j.jda.2006.03.011},
    doi = {10.1016/j.jda.2006.03.011},
    timestamp = {Thu, 10 May 2007 08:54:30 +0200},
    biburl = {http://dblp.uni-trier.de/rec/bib/journals/jda/Sadakane07},
    bibsource = {dblp computer science bibliography, http://dblp.org}
    }
  • [DOI] D. Okanohara and K. Sadakane
    Practical entropy-compressed rank/select dictionary
    Proceedings of the nine workshop on algorithm engineering and experiments (ALENEX), 2007
    [Bibtex]
    @inproceedings{OS07alenex,
    author = {Daisuke Okanohara and Kunihiko Sadakane},
    title = {Practical Entropy-Compressed Rank/Select Dictionary},
    booktitle = {Proceedings of the Nine Workshop on Algorithm Engineering and Experiments ({ALENEX})},
    year = {2007},
    url = {http://dx.doi.org/10.1137/1.9781611972870.6},
    doi = {10.1137/1.9781611972870.6},
    timestamp = {Wed, 12 Feb 2014 17:08:14 +0100},
    biburl = {http://dblp.uni-trier.de/rec/bib/conf/alenex/OkanoharaS07},
    bibsource = {dblp computer science bibliography, http://dblp.org}
    }
  • [DOI] S. Gog, A. Moffat, and M. Petri
    On identifying phrases using collection statistics
    Proceedings of advances in information retrieval – 37th european conference on IR research, ECIR, 2015
    [Bibtex]
    @inproceedings{gmp15ecir,
    author = {Simon Gog and Alistair Moffat and Matthias Petri},
    title = {On Identifying Phrases Using Collection Statistics},
    booktitle = {Proceedings of Advances in Information Retrieval - 37th European Conference on {IR} Research, {ECIR}},
    pages = {278--283},
    year = {2015},
    url = {http://dx.doi.org/10.1007/978-3-319-16354-3_30},
    doi = {10.1007/978-3-319-16354-3_30},
    timestamp = {Tue, 17 Mar 2015 15:14:29 +0100},
    biburl = {http://dblp.uni-trier.de/rec/bib/conf/ecir/GogMP15},
    bibsource = {dblp computer science bibliography, http://dblp.org}
    }
  • [DOI] S. Gog and M. Petri
    Compact indexes for flexible top-k retrieval
    Combinatorial pattern matching – 26th annual symposium, CPM, 2015
    [Bibtex]
    @inproceedings{GP15cpm,
    author = {Simon Gog and Matthias Petri},
    title = {Compact Indexes for Flexible Top-k Retrieval},
    booktitle = {Combinatorial Pattern Matching - 26th Annual Symposium, {CPM}},
    pages = {207--218},
    year = {2015},
    url = {http://dx.doi.org/10.1007/978-3-319-19929-0_18},
    doi = {10.1007/978-3-319-19929-0_18},
    timestamp = {Tue, 16 Jun 2015 12:30:58 +0200},
    biburl = {http://dblp.uni-trier.de/rec/bib/conf/cpm/GogP15},
    bibsource = {dblp computer science bibliography, http://dblp.org}
    }
  • [DOI] M. Patil, S. V. Thankachan, R. Shah, W. Hon, J. S. Vitter, and S. Chandrasekaran
    Inverted indexes for phrases and strings
    Proceeding of the 34th international ACM SIGIR conference on research and development in information retrieval, SIGIR, 2011
    [Bibtex]
    @inproceedings{PTSHVC11sigir,
    author = {Manish Patil and Sharma V. Thankachan and Rahul Shah and Wing-Kai Hon and Jeffrey Scott Vitter and Sabrina Chandrasekaran},
    title = {Inverted indexes for phrases and strings},
    booktitle = {Proceeding of the 34th International {ACM} {SIGIR} Conference on Research
    and Development in Information Retrieval, {SIGIR}},
    pages = {555--564},
    year = {2011},
    url = {http://doi.acm.org/10.1145/2009916.2009992},
    doi = {10.1145/2009916.2009992},
    timestamp = {Fri, 05 Aug 2011 15:26:16 +0200},
    biburl = {http://dblp.uni-trier.de/rec/bib/conf/sigir/PatilTSHVC11},
    bibsource = {dblp computer science bibliography, http://dblp.org}
    }
  • G. Navarro and Y. Nekrich
    Top-k document retrieval in optimal time and linear space
    Proceedings of the twenty-third annual ACM-SIAM symposium on discrete algorithms, SODA, 2012
    [Bibtex]
    @inproceedings{NN12soda,
    author = {Gonzalo Navarro and Yakov Nekrich},
    title = {Top-k document retrieval in optimal time and linear space},
    booktitle = {Proceedings of the Twenty-Third Annual {ACM-SIAM} Symposium on Discrete
    Algorithms, {SODA}},
    pages = {1066--1077},
    year = {2012},
    timestamp = {Wed, 12 Feb 2014 17:08:16 +0100},
    biburl = {http://dblp.uni-trier.de/rec/bib/conf/soda/NavarroN12},
    bibsource = {dblp computer science bibliography, http://dblp.org}
    }
  • S. Gog and G. Navarro
    Improved single-term top-k document retrieval
    Proceedings of the seventeenth workshop on algorithm engineering and experiments, ALENEX, 2015
    [Bibtex]
    @inproceedings{GN15alenex,
    author = {Simon Gog and Gonzalo Navarro},
    title = {Improved Single-Term Top-k Document Retrieval},
    booktitle = {Proceedings of the Seventeenth Workshop on Algorithm Engineering and
    Experiments, {ALENEX}},
    pages = {24--32},
    year = {2015},
    url = {http://dx.doi.org/10.1137/1.9781611973754.3},
    timestamp = {Wed, 28 Jan 2015 15:17:29 +0100},
    biburl = {http://dblp.uni-trier.de/rec/bib/conf/alenex/GogN15},
    bibsource = {dblp computer science bibliography, http://dblp.org}
    }
  • [DOI] R. Grossi and G. Ottaviano
    Fast compressed tries through path decompositions
    ACM journal of experimental algorithmics, 2014
    [Bibtex]
    @article{GO14jea,
    author = {Roberto Grossi and Giuseppe Ottaviano},
    title = {Fast Compressed Tries through Path Decompositions},
    journal = {{ACM} Journal of Experimental Algorithmics},
    volume = {19},
    number = {1},
    year = {2014},
    url = {http://doi.acm.org/10.1145/2656332},
    doi = {10.1145/2656332},
    timestamp = {Tue, 07 Apr 2015 13:28:42 +0200},
    biburl = {http://dblp.uni-trier.de/rec/bib/journals/jea/GrossiO14},
    bibsource = {dblp computer science bibliography, http://dblp.org}
    }
  • G. Navarro and V. Mäkinen
    Compressed full text indexes
    Acm computing surveys, 2007
    [Bibtex]
    @article{NM-survey06,
    author = "G. Navarro and V. M{\"a}kinen",
    title = "Compressed full text indexes",
    journal = "ACM Computing Surveys",
    volume = "39",
    number = "1",
    year = 2007}
  • S. Gog, T. Beller, A. Moffat, and M. Petri
    From theory to practice: plug and play with succinct data structures
    Proceedings of the 13th international symposium experimental algorithms (SEA), 2014
    [Bibtex]
    @inproceedings{GogBMP14,
    author = {Simon Gog and
    Timo Beller and
    Alistair Moffat and
    Matthias Petri},
    title = {From Theory to Practice: Plug and Play with Succinct Data Structures},
    booktitle = {Proceedings of the 13th International Symposium Experimental Algorithms ({SEA})},
    pages = {326--337},
    year = {2014}
    }
  • P. Ferragina and S. S. Rao, “Tree compression and indexing.” , 2008.
    [Bibtex]
    @incollection{FerraginaR08,
    author = {Paolo Ferragina and
    S. Srinivasa Rao},
    title = {Tree Compression and Indexing},
    booktitle = {Encyclopedia of Algorithms},
    year = {2008}
    }
  • J. Barbay, “Succinct and compressed data structures for permutations and integer functions.” , 2015.
    [Bibtex]
    @incollection{Barbay15,
    author = {J{\'{e}}r{\'{e}}my Barbay},
    title = {Succinct and Compressed Data Structures for Permutations and Integer
    Functions},
    booktitle = {Encyclopedia of Algorithms},
    year = {2015}
    }
  • J. Barbay and I. J. Munro, “Succinct encoding of permutations: applications to text indexing.” , 2008.
    [Bibtex]
    @incollection{BarbayM08,
    author = {J{\'{e}}r{\'{e}}my Barbay and
    J. Ian Munro},
    title = {Succinct Encoding of Permutations: Applications to Text Indexing},
    booktitle = {Encyclopedia of Algorithms},
    year = {2008}
    }
  • N. Rahman and R. Raman, “Rank and select operations on binary strings.” , 2008.
    [Bibtex]
    @incollection{RahmanR08,
    author = {Naila Rahman and
    Rajeev Raman},
    title = {Rank and Select Operations on Binary Strings},
    booktitle = {Encyclopedia of Algorithms},
    year = {2008}
    }
  • P. Ferragina, R. González, G. Navarro, and R. Venturini
    Compressed text indexes: from theory to practice
    Acm journal of experimental algorithmics, 2008
    [Bibtex]
    @article{ExperSurvey,
    author = {P. Ferragina and
    R. Gonz{\'a}lez and
    G. Navarro and
    R. Venturini},
    title = {Compressed text indexes: From theory to practice},
    journal = {ACM Journal of Experimental Algorithmics},
    volume = {13},
    year = {2008},
    ee = {http://doi.acm.org/10.1145/1412228.1455268},
    bibsource = {DBLP, http://dblp.uni-trier.de}
    }
  • R. Konow, G. Navarro, C. L. A. Clarke, and A. L. –
    Faster and smaller inverted indices with treaps
    Proceedings of the 36th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2013
    [Bibtex]
    @inproceedings{KonowNCL13,
    author = {Roberto Konow and
    Gonzalo Navarro and
    Charles L. A. Clarke and
    Alejandro L{\'{o}}pez{-}Ortiz},
    title = {Faster and smaller inverted indices with treaps},
    booktitle = {Proceedings of the 36th {A}nnual {I}nternational {ACM SIGIR} {C}onference on {R}esearch and {D}evelopment in {I}nformation {R}etrieval ({SIGIR})},
    pages = {193--202},
    year = {2013}
    }
  • B. P. Hsu and G. Ottaviano
    Space-efficient data structures for top-\emphk completion
    Proceedings of the 22nd international world wide web conference (WWW), 2013
    [Bibtex]
    @inproceedings{HsuO13,
    author = {Bo-June Paul Hsu and
    Giuseppe Ottaviano},
    title = {Space-efficient data structures for Top-\emph{k} completion},
    booktitle = {Proceedings of the 22nd International World Wide Web Conference ({WWW})},
    pages = {583--594},
    year = {2013}
    }
  • S. Vigna
    Quasi-succinct indices
    Proceedings of the sixth ACM international conference on web search and data mining (WSDM), 2013
    [Bibtex]
    @inproceedings{Vigna13,
    author = {Sebastiano Vigna},
    title = {Quasi-succinct indices},
    booktitle = {Proceedings of the Sixth {ACM} International Conference on Web Search and Data Mining
    ({WSDM})},
    pages = {83--92},
    year = {2013}
    }
  • G. Ottaviano and R. Venturini
    Partitioned elias-fano indexes
    Proceedings of the 37th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2014
    [Bibtex]
    @inproceedings{SIGIR14s,
    author = {Giuseppe Ottaviano and Rossano Venturini},
    title = {Partitioned Elias-Fano Indexes},
    booktitle = {{P}roceedings of the 37th {A}nnual {I}nternational {ACM SIGIR} {C}onference on {R}esearch and {D}evelopment in {I}nformation {R}etrieval {(SIGIR)}},
    year = {2014},
    pages = {273-282}
    }
  • G. Ottaviano, N. Tonellotto, and R. Venturini
    Optimal space-time tradeoffs for inverted indexes
    Proceedings of the 8th Annual International ACM Conference on Web Search and Data Mining (WSDM), 2015
    [Bibtex]
    @inproceedings{WSDM15s,
    author = {Giuseppe Ottaviano and
    Nicola Tonellotto and
    Rossano Venturini},
    title = {Optimal Space-time Tradeoffs for Inverted Indexes},
    booktitle = {{P}roceedings of the 8th {A}nnual {I}nternational {ACM} {C}onference on {W}eb {S}earch and {D}ata {M}ining ({WSDM})},
    year = {2015},
    pages = {-}
    }
  • P. Ferragina, F. Piccinno, and R. Venturini
    Compressed indexes for string-searching in labeled graphs
    Proceedings of the 24th International Conference on World Wide Web (WWW), 2015
    [Bibtex]
    @inproceedings{WWW15s,
    author = {Paolo Ferragina and
    Francesco Piccinno and
    Rossano Venturini},
    title = {Compressed indexes for string-searching in labeled graphs},
    booktitle = {{P}roceedings of the 24th {I}nternational {C}onference on {W}orld {W}ide {W}eb ({WWW})},
    year = {2015},
    pages = {--}
    }
  • R. M. Fano
    On the number of bits required to implement anassociative memory
    Memorandum 61, computer structures group, project mac, 1971
    [Bibtex]
    @article{Fano,
    author = {Robert Mario Fano},
    title = {On the number of bits required to implement anassociative memory},
    journal = {Memorandum 61, Computer Structures Group, Project MAC},
    year = {1971},
    publisher = {{MIT}},
    address = {Cambridge},
    }
  • P. Elias
    Efficient storage and retrieval by content and address of static files
    Journal of the acm, 1974
    [Bibtex]
    @article{Elias,
    author = {Peter Elias},
    title = {Efficient Storage and Retrieval by Content and Address of Static Files},
    journal = {Journal of the ACM},
    volume = {21},
    issue = {2},
    year = {1974},
    issn = {0004-5411},
    pages = {246--260},
    numpages = {15},
    acmid = {321820},
    publisher = {ACM},
    address = {New York, NY, USA},
    }
  • J. Fischer and V. Heun
    Space-efficient preprocessing schemes for range minimum queries on static arrays
    SIAM journal on computing, 2011
    [Bibtex]
    @article{fischer,
    author = {Johannes Fischer and
    Volker Heun},
    title = {Space-Efficient Preprocessing Schemes for Range Minimum Queries on Static Arrays},
    journal = {{SIAM} Journal on Computing},
    volume = {40},
    number = {2},
    year = {2011},
    pages = {465-492},
    ee = {http://dx.doi.org/10.1137/090779759},
    bibsource = {DBLP, http://dblp.uni-trier.de}
    }
  • P. Ferragina and R. Venturini
    The compressed permuterm index
    Acm transactions on algorithms, 2010
    [Bibtex]
    @article{FV10s,
    author = {Paolo Ferragina and
    Rossano Venturini},
    title = {The compressed permuterm index},
    journal = {ACM Transactions on Algorithms},
    volume = {7},
    number = {1},
    year = {2010},
    pages = {10},
    }
  • P. Ferragina and G. Manzini
    Indexing compressed text
    Journal of the acm, 2005
    [Bibtex]
    @article{FM,
    author = {Paolo Ferragina and
    Giovanni Manzini},
    title = {Indexing compressed text},
    journal = {Journal of the ACM},
    volume = {52},
    number = {4},
    year = {2005},
    pages = {552-581}
    }
  • W. Hon, R. Shah, and J. S. Vitter
    Space-efficient framework for top-k string retrieval problems
    Proceedings of the 50th annual IEEE symposium on foundations of computer science (FOCS), 2009
    [Bibtex]
    @inproceedings{HonSV09,
    author = {Wing-Kai Hon and
    Rahul Shah and
    Jeffrey Scott Vitter},
    title = {Space-Efficient Framework for Top-k String Retrieval Problems},
    booktitle = {Proceedings of the 50th Annual {IEEE} Symposium on Foundations of Computer Science ({FOCS})},
    pages = {713--722},
    year = {2009}
    }
  • [DOI] G. Navarro
    Wavelet trees for all
    Journal discrete algorithms, 2014
    [Bibtex]
    @article{Navarro14,
    author = {Gonzalo Navarro},
    title = {Wavelet trees for all},
    journal = {Journal Discrete Algorithms},
    volume = {25},
    pages = {2--20},
    year = {2014},
    url = {http://dx.doi.org/10.1016/j.jda.2013.07.004},
    doi = {10.1016/j.jda.2013.07.004},
    timestamp = {Wed, 02 Apr 2014 13:53:55 +0200},
    biburl = {http://dblp.uni-trier.de/rec/bib/journals/jda/Navarro14},
    bibsource = {dblp computer science bibliography, http://dblp.org}
    }
  • P. Ferragina and R. Venturini, “Indexing compressed text.” , 2009, pp. 1442-1448.
    [Bibtex]
    @incollection{FerraginaV09s,
    author = {P. Ferragina and
    R. Venturini},
    title = {Indexing Compressed Text},
    booktitle = {Encyclopedia of Database Systems},
    year = {2009},
    pages = {1442-1448},
    ee = {http://dx.doi.org/10.1007/978-0-387-39940-9_1144}
    }
  • D. Benoit, E. D. Demaine, J. I. Munro, R. Raman, V.Raman, and S. S. Rao
    Representing trees of higher degree
    Algorithmica, 2005
    [Bibtex]
    @article{BenoitDMRRR05,
    author = {D. Benoit and
    E.D. Demaine and
    J. I. Munro and
    R. Raman and
    V.Raman and
    S. S. Rao},
    title = {Representing Trees of Higher Degree},
    journal = {Algorithmica},
    volume = {43},
    number = {4},
    year = {2005},
    pages = {275-292}
    }