Succinct Data Structures in Information Retrieval: Theory and Practice

Full-day tutorial at Sigir 2016

 

Simon Gog and Rossano Venturini

The current growth and availability of massive amounts of data gathered and processed by applications — Web search engines, textual and biological databases, just to mention a few — has changed the algorithmic requirements of basic processing and mining tools and provide ample motivation for a great deal of new theoretical and practical research on algorithms and data structures.

Not surprisingly, the last decades have seen a massive research in the field of the so-called succinct and compressed data structures. These new kind of data structures mimic the operations of their classical counterparts within a comparable time complexity but requiring much less space.  These solutions usually resort to a careful combination of ideas born both in algorithmic and in data compression fields. A large fraction of classical data structures have their succinct counterpart. Most of these results are of practical interest and several implementations exist, for example we mention SDSL, SuccinctFacebook’s Folly, and Sux.

Several authors are successfully applying the ideas at the basis of the succinct data structures to solve problems in other fields of research. Information retrieval community profits a lot from these data structures as there exist several applications in which them play a central role, e.g., posting lists representation, language model representation, indexing (social) graphs, query auto-completion, document retrieval and indexing dictionary of strings, just to mention the most recent ones.

The tutorial will introduce this field of research by presenting the most important succinct data structures to represent set of integers, set of points, trees, graphs and strings together with their most important applications to Information Retrieval problems.  The introduction of the succinct data structures will be sustained with a practical session with
programming handouts to solve. This will allow the attendees to directly experiment with implementations of these solutions on real datasets and understand the potential benefits they can bring on their own projects.

Please, take a look at this pdf for a more detailed description of the tutorial.

 

Objectives

Indexing is the key to efficient query processing in the vast majority of IR applications. While index compression is a relatively old and established method in IR and covered by all relevant text books the new data structures which combine compression and efficient execution of operations — like random access — are a relatively new field of which many IR researchers may not be aware of. Therefore the objectives of this tutorials are

  1. to introduce the audience to the world of succinct data structures
  2. to teach the basic techniques of succinct structures
  3. to present the practical impact of the structures to selected applications in IR
  4. to show the participants how they can implement their own succinct structures on top of existing basic structures provided by the succinct data structure library.

 

Prerequisites and support material

The theoretical part of the tutorial requires a basic knowledge of algorithms and data structures. The practical part and its handouts require a basic knowledge of C++, even if the knowledge of any modern programming language suffice for most of the tutorial.

Slides and code of the tutorial.

 

Schedule of the tutorial

The tutorial is subdivided in two parts of 3 hours each. The goal of the first part is to describe the most important succinct data structures. We will present the theoretical achievements in this field of research by focusing our attention to the ones that have demonstrated their relevance in practice. The introduction of each of these solutions will be sustained by presenting as selection of Information Retrieval problems in which they have an immediate application.

The goal of the second part is to present ready-to-use implementations of these data structures. This is done by introducing the Succinct Data Structure Library (SDSL). This library covers all the data structures presented in the previous part and has one of the presenter as its creator and one of its main developers. The speakers will also prepare a set of programming handouts which will allow the attendees to experiment with the learned concepts. The ultimate objective of the tutorial is to demonstrate to the attendees the potential benefit that succinct data structures can bring in their applications at a tiny implementation cost.

Part I

  • Introduction to succinct data structures.
    • Motivation and history
    • Examples of applications in various domains
  •  Basic operations on binary vectors
    • Rank and Select on uncompressed and compressed vectors
  • Elias-Fano representation
    • Rank and Select on sparse binary vectors
    • Representing increasing sequences
  • Succinct representations of trees and range minimum/maximum queries
    • LOUDS and balanced parenthesis sequences
    • Range-min-max tree
  • Rank and Select for sequences with larger alphabets: Wavelet trees
    • Rank and Select on general sequences
    • Orthogonal range queries on grids
  • Range minimum/maximum queries
  • Overview on Compressed full-text indexes, Range minimum/maximum queries and the document retrieval solutions
  • Data structures for IR
    • Space-efficient string dictionaries
    • Inverted indexes
    • Query auto-completion systems
    • Document retrieval systems

Part II

  • Overview about SDSL components and concepts
  • Detailed presentation of components via small code examples
  • Exploration of practical performance of structures
  • Presentation of benchmarking facilities
  • Programming handouts

 

References

[bibtex file=http://pages.di.unipi.it/rossano/wp-content/uploads/sites/7/2016/03/biblioTutorial.bib format=rossanoieee template=default-bibtex key_format=number sor=name]