
Clustered Elias-Fano Indexes

Giulio Ermanno Pibiri and Rossano Venturini
ACM Transactions on Information Systems (TOIS), 2017 Journal Paper
@article{DBLP:journals/tois/PibiriV17,
    author      = {Giulio Ermanno Pibiri and Rossano Venturini},
    title       = {{C}lustered {E}lias-{F}ano {I}ndexes},
    journal     = {{ACM} {T}ransactions on {I}nformation {S}ystems ({TOIS})},
    volume      = {36},
    number      = {1},
    year        = {2017},
    issn        = {1046-8188},
    pages       = {2:1--2:33},
    articleno   = {2},
    numpages    = {33},
    url         = {},
    doi         = {10.1145/3052773},
    acmid       = {3052773},
    publisher   = {ACM},
    address     = {New York, NY, USA}
}


State-of-the-art encoders for inverted indexes compress each posting list individually. Encoding clusters of posting lists offers the possibility of reducing the redundancy of the lists while maintaining a noticeable query processing speed.
In this paper we propose a new index representation based on clustering the collection of posting lists and, for each created cluster, building an ad-hoc reference list with respect to which all lists in the cluster are encoded with Elias-Fano. We describe a posting list clustering algorithm tailored for our encoder and two methods for building the reference list of a cluster. Both approaches are heuristic and differ in the way postings are added to the reference list: either according to their frequency in the cluster or according to the number of bits necessary for their representation.
The extensive experimental analysis indicates that significant space reductions are indeed possible, beating the best state-of-the-art encoders.
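The core idea of encoding a list against its cluster's reference can be sketched as follows; this is a minimal illustration with a hypothetical helper name, not the paper's implementation. Postings that also appear in the reference list are remapped to their ranks inside it, so they live in a much smaller universe and become cheaper to encode with Elias-Fano, while the remaining postings are kept in the original universe.

```python
# Hedged sketch of reference-based remapping (hypothetical helper name,
# not the paper's actual encoder).

def split_against_reference(postings, reference):
    """Partition a posting list into (ranks of shared postings within the
    reference, residual postings kept in the original universe)."""
    rank = {doc_id: i for i, doc_id in enumerate(reference)}
    shared = [rank[d] for d in postings if d in rank]
    residual = [d for d in postings if d not in rank]
    return shared, residual

reference = [3, 7, 10, 15, 42, 90]   # ad-hoc reference list of a cluster
postings = [3, 10, 42, 57, 90]       # a posting list in the same cluster
shared, residual = split_against_reference(postings, reference)
print(shared)    # [0, 2, 4, 5] -> ranks in a universe of size 6
print(residual)  # [57]         -> encoded in the original universe
```

The shared part now draws its values from a universe as small as the reference list itself, which is where the space savings come from.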

Dynamic Elias-Fano Representation

Giulio Ermanno Pibiri and Rossano Venturini
Annual Symposium on Combinatorial Pattern Matching (CPM), 2017 Conference Paper
@inproceedings{DBLP:conf/cpm/PibiriV17,
  author    = {Giulio Ermanno Pibiri and Rossano Venturini},
  title     = {{D}ynamic {E}lias-{F}ano {R}epresentation},
  booktitle = {28th Annual Symposium on Combinatorial Pattern Matching, {CPM} 2017,
               July 4-6, 2017, Warsaw, Poland},
  pages     = {30:1--30:14},
  year      = {2017},
  crossref  = {DBLP:conf/cpm/2017},
  url       = {},
  doi       = {10.4230/LIPIcs.CPM.2017.30}
}


We show that it is possible to store a dynamic ordered set $\mathcal{S}(n,u)$ of $n$ integers drawn from a bounded universe of size $u$ in space close to the information-theoretic lower bound and yet preserve the asymptotic time optimality of the operations. Our results leverage the Elias-Fano representation of $\mathcal{S}(n, u)$ which takes $\textsf{EF}(\mathcal{S}(n, u)) = n\lceil \log\frac{u}{n}\rceil + 2n$ bits of space and can be shown to be less than half a bit per element away from the information-theoretic minimum.
Considering a RAM model with memory words of $\Theta(\log u)$ bits, we focus on the case in which the integers of $\mathcal{S}$ are drawn from a polynomial universe of size $u = n^\gamma$, for any $\gamma = \Theta(1)$. We represent $\mathcal{S}(n,u)$ with $\textsf{EF}(\mathcal{S}(n, u)) + o(n)$ bits of space and:
1. support static predecessor/successor queries in $\mathcal{O}(\min\{1+\log\frac{u}{n}, \log\log n\})$ worst-case time;
2. make $\mathcal{S}$ grow in an append-only fashion by spending $\mathcal{O}(1)$ per inserted element;
3. support random access in $\mathcal{O}(\frac{\log n}{\log\log n})$ worst-case, insertions/deletions in $\mathcal{O}(\frac{\log n}{\log\log n})$ amortized and predecessor/successor queries in $\mathcal{O}(\min\{1+\log\frac{u}{n}, \log\log n\})$ worst-case time. These time bounds are optimal.
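To make the space bound above concrete, here is a hedged textbook sketch of the static Elias-Fano representation in Python. A real implementation would replace the linear scan in `ef_access` with a constant-time select1 structure; the function and variable names are ours, not the paper's.

```python
from math import ceil, log2

def ef_encode(S, u):
    """Textbook Elias-Fano sketch: each x in S is split into
    l = ceil(log2(u/n)) low bits, stored verbatim, and its high bits,
    stored as a bitmap whose i-th one sits at position high(x) + i.
    Total space is close to n*ceil(log2(u/n)) + 2n bits."""
    n = len(S)
    l = max(1, ceil(log2(u / n)))
    low = [x & ((1 << l) - 1) for x in S]       # l low bits per element
    high = [0] * ((u >> l) + n + 1)             # ~2n bits for the high part
    for i, x in enumerate(S):
        high[(x >> l) + i] = 1                  # i-th one at high(x) + i
    return high, low, l

def ef_access(high, low, l, i):
    """Random access: position of the i-th one, minus i, gives the high bits."""
    ones = [p for p, b in enumerate(high) if b]  # stand-in for select1
    return ((ones[i] - i) << l) | low[i]

S, u = [3, 4, 7, 13, 14, 15, 21, 43], 44
high, low, l = ef_encode(S, u)
assert [ef_access(high, low, l, i) for i in range(len(S))] == S
```

With $n = 8$ and $u = 44$ this uses $l = 3$ low bits per element, matching the $n\lceil\log\frac{u}{n}\rceil + 2n$ bound stated above.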

Efficient Data Structures for Massive N-Gram Datasets

Giulio Ermanno Pibiri and Rossano Venturini
ACM Conference on Research and Development in Information Retrieval (SIGIR), 2017 Conference Paper
@inproceedings{DBLP:conf/sigir/PibiriV17,
  author    = {Giulio Ermanno Pibiri and Rossano Venturini},
  title     = {{E}fficient {D}ata {S}tructures for {M}assive \emph{N}-{G}ram {D}atasets},
  booktitle = {Proceedings of the 40th International {ACM} {SIGIR} Conference on
               Research and Development in Information Retrieval, Shinjuku, Tokyo,
               Japan, August 7-11, 2017},
  pages     = {615--624},
  year      = {2017},
  crossref  = {DBLP:conf/sigir/2017},
  url       = {},
  doi       = {10.1145/3077136.3080798}
}


The efficient indexing of large and sparse $N$-gram datasets is crucial in several applications in Information Retrieval, Natural Language Processing and Machine Learning. Because of the stringent efficiency requirements, dealing with billions of $N$-grams poses the challenge of introducing a compressed representation that preserves the query processing speed.
In this paper we study the problem of reducing the space required by the representation of such datasets, while maintaining the capability of looking up a given $N$-gram within microseconds. For this purpose we describe compressed, exact and lossless data structures that achieve, at the same time, high space reductions and no time degradation with respect to state-of-the-art software packages. In particular, we present a trie data structure in which each word following a context of fixed length $k$, i.e., its preceding $k$ words, is encoded as an integer whose value is proportional to the number of words that follow such a context. Since the number of words following a given context is typically very small in natural languages, we are able to lower the space of representation to compression levels that were never achieved before. Despite the significant savings in space, we show that our technique introduces a negligible penalty at query time.
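The context-based remapping can be illustrated with a small sketch, shown here for bigrams (i.e., $k = 1$) with a hypothetical helper name: each word is replaced by its rank among the distinct successors of its context, so the stored integers are bounded by the number of successors of that context rather than by the vocabulary size.

```python
from collections import defaultdict

# Hedged sketch of context-based id remapping (hypothetical helper,
# not the paper's trie implementation).

def build_successor_ranks(bigrams):
    """Map each (context, word) pair to the word's rank among the
    distinct successors of that context, in first-seen order."""
    successors = defaultdict(list)
    for ctx, w in bigrams:
        if w not in successors[ctx]:
            successors[ctx].append(w)
    return {(ctx, w): ws.index(w) for ctx, ws in successors.items() for w in ws}

bigrams = [("the", "cat"), ("the", "dog"), ("a", "cat"), ("the", "cat")]
ranks = build_successor_ranks(bigrams)
print(ranks[("the", "dog")])  # 1: second distinct successor of "the"
print(ranks[("a", "cat")])    # 0: small id regardless of vocabulary size
```

Since most contexts have very few distinct successors, the remapped integers are tiny and compress far better than raw vocabulary ids.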

Inverted Index Compression

Giulio Ermanno Pibiri and Rossano Venturini
Encyclopedia of Big Data Technologies, 2018 Chapter of Book
@Inbook{Pibiri2018,
    author = "Pibiri, Giulio Ermanno and Venturini, Rossano",
    editor = "Sakr, Sherif and Zomaya, Albert",
    title = "Inverted Index Compression",
    bookTitle = "Encyclopedia of Big Data Technologies",
    year = "2018",
    publisher = "Springer International Publishing",
    pages = "1--8",
    isbn = "978-3-319-63962-8",
    doi = "10.1007/978-3-319-63962-8_52-1",
    url = ""
}


The data structure at the core of today's large-scale search engines, social networks and storage architectures is the inverted index, which can be regarded as a collection of sorted integer sequences called inverted lists. Because of the many documents indexed by search engines and the stringent performance requirements dictated by the heavy load of user queries, the inverted lists often store several millions (even billions) of integers and must be searched efficiently.
In this scenario, compressing the inverted lists of the index is a mandatory design phase, since it brings a twofold advantage over a non-compressed representation: it feeds faster memory levels with more data in order to speed up the query processing algorithms, and it reduces the number of storage machines needed to host the whole index. This chapter surveys the most important encoding algorithms developed for efficient inverted index compression.
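As a minimal illustration of the common starting point shared by these encoders (not specific to any one of them), a sorted inverted list is first turned into d-gaps, i.e., differences between consecutive integers, which are small values that the surveyed encoders then compress.

```python
# Hedged illustration of d-gap encoding, the standard preprocessing step
# for inverted list compression (helper names are ours).

def to_gaps(postings):
    """Turn a strictly increasing list into its first element plus deltas."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def from_gaps(gaps):
    """Invert to_gaps by prefix-summing the deltas."""
    out, acc = [], 0
    for g in gaps:
        acc += g
        out.append(acc)
    return out

postings = [3, 7, 11, 23, 24, 30]
gaps = to_gaps(postings)
print(gaps)  # [3, 4, 4, 12, 1, 6] -> small integers, cheap to encode
assert from_gaps(gaps) == postings
```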

Variable-Byte Encoding is Now Space-Efficient Too

Giulio Ermanno Pibiri and Rossano Venturini, 2018 arXiv preprint
@ARTICLE{2018arXiv180410949P,
   author = {Pibiri, Giulio Ermanno and Venturini, Rossano},
    title = "{{V}ariable-{B}yte {E}ncoding is {N}ow {S}pace-{E}fficient {T}oo}",
  journal = {ArXiv e-prints},
archivePrefix = "arXiv",
   eprint = {1804.10949},
 primaryClass = "cs.IR",
 keywords = {Computer Science - Information Retrieval, Computer Science - Databases},
     year = 2018,
    month = apr
}

The ubiquitous Variable-Byte encoding is considered one of the fastest compressed representations for integer sequences. However, its compression ratio is usually not competitive with that of more sophisticated encoders, especially when the integers to be compressed are small, which is the typical case for inverted indexes. This paper shows that the compression ratio of Variable-Byte can be improved by 2x by adopting a partitioned representation of the inverted lists. This makes Variable-Byte surprisingly competitive in space with the best bit-aligned encoders, hence disproving the folklore belief that Variable-Byte is space-inefficient for inverted index compression.
Despite the significant space savings, we show that our optimization comes almost for free: we introduce an optimal partitioning algorithm that, by running in linear time and with low constant factors, does not affect indexing time; and we show, with an extensive experimental analysis and comparison with several other state-of-the-art encoders, that the query processing speed of Variable-Byte is preserved.
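For reference, here is a hedged sketch of the classic (unpartitioned) Variable-Byte scheme that the paper builds on; byte-order and continuation-bit conventions vary across implementations, and this one marks the last byte of each integer with the high bit.

```python
# Hedged sketch of classic Variable-Byte: 7 payload bits per byte, high bit
# set on the final byte of each integer (one of several common conventions).

def vbyte_encode(values):
    out = bytearray()
    for v in values:
        while v >= 128:
            out.append(v & 0x7F)   # continuation byte: high bit clear
            v >>= 7
        out.append(v | 0x80)       # terminator byte: high bit set
    return bytes(out)

def vbyte_decode(data):
    values, v, shift = [], 0, 0
    for b in data:
        if b & 0x80:               # final byte of the current integer
            values.append(v | ((b & 0x7F) << shift))
            v, shift = 0, 0
        else:
            v |= b << shift
            shift += 7
    return values

gaps = [5, 130, 7, 20000]
enc = vbyte_encode(gaps)
print(len(enc))  # 7 bytes, versus 16 with fixed 32-bit integers
assert vbyte_decode(enc) == gaps
```

Each integer costs one byte per 7 bits of magnitude, which is why small integers (such as the d-gaps of partitioned lists) are where Variable-Byte shines.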

The slides of the seminar I gave about Elias-Fano are available here.

The slides of the talk presenting my Ph.D. Thesis proposal are available here.

The slides of the results achieved after two years of Ph.D. are available here.