Positions

  • 01/11/2018 - Present

    Post-doctoral Researcher

    ISTI-CNR, Pisa

  • 01/11/2015 - 31/10/2018

    Ph.D. Student

    Department of Computer Science, Pisa

Education

  • 2015 - 2018

    Ph.D. in Computer Science

    University of Pisa, Italy

  • 2012 - 2014

    Master's Degree in Computer Science & Networking

    University of Pisa, Scuola Superiore Sant’Anna di studi universitari e perfezionamento, Italy

  • 2009 - 2012

    Bachelor's Degree in Computer Engineering

    University of Florence, Italy

  • 2004 - 2009

    High School

    Liceo Scientifico Statale Guido Castelnuovo, Florence, Italy

Awards and Grants

Publications

On Slicing Sorted Integer Sequences

Giulio Ermanno Pibiri
arXiv.org, 2019
arXiv preprint

Abstract

Representing sorted integer sequences in small space is a central problem for large-scale retrieval systems such as Web search engines. Efficient query resolution, e.g., intersection or random access, is achieved by carefully partitioning the sequences.
In this work we describe and compare two different partitioning paradigms: partitioning by cardinality and partitioning by universe. Although the ideas behind these paradigms have been known in the coding and algorithmic communities for many years, inverted index compression has extensively adopted the former, whereas the latter has received little attention. As a result, an experimental comparison between the two is missing for the setting of inverted index compression.
We also propose and implement a solution that recursively slices the universe of representation of a sequence to achieve compact storage and fast query execution. Albeit larger than some state-of-the-art representations, this slicing approach substantially improves the performance of list intersections and unions while operating in compressed space, thus offering an excellent space/time trade-off for the problem.
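
The two paradigms can be illustrated with a small sketch. The following Python toy example (illustrative only, not the paper's actual data structures; the function names are made up) splits the same sorted sequence into blocks of fixed cardinality and into slices of fixed universe span; in the latter case, two lists partitioned over the same universe slices can be intersected slice by slice, skipping slices present in only one list.

    # Toy illustration of the two partitioning paradigms discussed above.
    def partition_by_cardinality(seq, block_size):
        """Split a sorted sequence into blocks of at most block_size elements."""
        return [seq[i:i + block_size] for i in range(0, len(seq), block_size)]

    def partition_by_universe(seq, slice_span):
        """Split a sorted sequence into slices, each covering a fixed range
        of the universe; values are stored relative to their slice."""
        slices = {}
        for x in seq:
            slices.setdefault(x // slice_span, []).append(x % slice_span)
        return slices

    seq = [3, 4, 7, 13, 14, 15, 21, 25, 36, 38, 54, 62]
    print(partition_by_cardinality(seq, 4))
    # [[3, 4, 7, 13], [14, 15, 21, 25], [36, 38, 54, 62]]
    print(partition_by_universe(seq, 16))
    # {0: [3, 4, 7, 13, 14, 15], 1: [5, 9], 2: [4, 6], 3: [6, 14]}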

On Implementing the Binary Interpolative Coding Algorithm

Giulio Ermanno Pibiri
2019
Technical report

Compressed Indexes for Fast Search of Semantic Data

Raffaele Perego, Giulio Ermanno Pibiri and Rossano Venturini
arXiv.org, 2019
arXiv preprint

Abstract

The sheer increase in volume of RDF data demands efficient solutions for the triple indexing problem, that is devising a compressed data structure to compactly represent RDF triples by guaranteeing, at the same time, fast pattern matching operations. This problem lies at the heart of delivering good practical performance for the resolution of complex SPARQL queries on large RDF datasets.
In this work, we propose a trie-based index layout to solve the problem and introduce two novel techniques to reduce its space of representation for improved effectiveness. The extensive experimental analysis conducted over a wide range of publicly available real-world datasets reveals that our best space/time trade-off configuration substantially outperforms existing state-of-the-art solutions, by taking 30 ÷ 60% less space and speeding up query execution by a factor of 2 ÷ 81×.

On Optimally Partitioning Variable-Byte Codes

Giulio Ermanno Pibiri and Rossano Venturini
IEEE Transactions on Knowledge and Data Engineering (TKDE), 2019
Full paper · Journal paper

Abstract

The ubiquitous Variable-Byte encoding is one of the fastest compressed representations for integer sequences. However, its compression ratio is usually not competitive with that of more sophisticated encoders, especially when the integers to be compressed are small, which is the typical case for inverted indexes.
This paper shows that the compression ratio of Variable-Byte can be improved by 2X by adopting a partitioned representation of the inverted lists. This makes Variable-Byte surprisingly competitive in space with the best bit-aligned encoders, hence disproving the folklore belief that Variable-Byte is space-inefficient for inverted index compression. Despite the significant space savings, we show that our optimization almost comes for free: we introduce an optimal partitioning algorithm whose linear-time complexity does not affect indexing time, and we show, with an extensive experimental analysis and comparison against several other state-of-the-art encoders, that the query processing speed of Variable-Byte is preserved.
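
For context, the sketch below shows one common flavour of plain Variable-Byte coding (7 payload bits per byte, with the high bit marking the last byte of an integer). It is only a minimal illustration of the base encoder the paper builds on, not of the optimal partitioning algorithm itself.

    # Minimal Variable-Byte sketch: each integer is split into 7-bit chunks,
    # least-significant first; a set high bit marks the last byte of an integer.
    def vbyte_encode(values):
        out = bytearray()
        for v in values:
            while v >= 128:
                out.append(v & 127)   # more bytes follow for this integer
                v >>= 7
            out.append(v | 128)       # last byte of this integer
        return bytes(out)

    def vbyte_decode(data):
        values, v, shift = [], 0, 0
        for b in data:
            v |= (b & 127) << shift
            if b & 128:               # last byte reached: emit the integer
                values.append(v)
                v, shift = 0, 0
            else:
                shift += 7
        return values

    # Inverted lists are usually gap-encoded before compression,
    # so that the integers to be coded are small.
    docids = [12, 17, 30, 1000, 1003]
    gaps = [docids[0]] + [b - a for a, b in zip(docids, docids[1:])]
    assert vbyte_decode(vbyte_encode(gaps)) == gaps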

Space and Time-Efficient Data Structures for Massive Datasets

Giulio Ermanno Pibiri
a.y. 2018/2019, defended on 08/03/2019.
Ph.D. Thesis

Abstract

This thesis concerns the design of compressed data structures for the efficient storage of massive datasets of integer sequences and short strings.

Handling Massive N-Gram Datasets Efficiently

Giulio Ermanno Pibiri and Rossano Venturini
ACM Transactions on Information Systems (TOIS), 2019
Full paper · Journal paper

Abstract

Two fundamental problems concern the handling of large n-gram language models: indexing, that is compressing the n-grams and associated satellite values without compromising their retrieval speed, and estimation, that is computing the probability distribution of the n-grams extracted from a large textual source. Performing these two tasks efficiently is vital for several applications in the fields of Information Retrieval, Natural Language Processing and Machine Learning, such as auto-completion in search engines and machine translation.
Regarding the problem of indexing, we describe compressed, exact and lossless data structures that achieve, at the same time, high space reductions and no time degradation with respect to the state-of-the-art solutions and related software packages. In particular, we present a compressed trie data structure in which each word of an n-gram following a context of fixed length k, i.e., its preceding k words, is encoded as an integer whose value is proportional to the number of words that follow such context. Since the number of words following a given context is typically very small in natural languages, we lower the space of representation to compression levels that were never achieved before, allowing the indexing of billions of strings. Despite the significant savings in space, our technique introduces a negligible penalty at query time. Specifically, the most space-efficient competitors in the literature, which are both quantized and lossy, do not take less space than our trie data structure and are up to 5 times slower. Conversely, our trie is as fast as the fastest competitor, but also retains an advantage of up to 65% in absolute space.
Regarding the problem of estimation, we present a novel algorithm for estimating modified Kneser-Ney language models, which have emerged as the de-facto choice for language modeling in both academia and industry thanks to their relatively low perplexity. Estimating such models from large textual sources poses the challenge of devising algorithms that make a parsimonious use of the disk. The state-of-the-art algorithm uses three sorting steps in external memory: we show an improved construction that requires only one sorting step by exploiting the properties of the extracted n-gram strings. With an extensive experimental analysis performed on billions of n-grams, we show an average improvement of 4.5 times on the total running time of the previous approach.
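
The context-based remapping at the heart of the indexing result can be sketched as follows. This is only an illustrative toy (names and layout are invented, and the actual trie layout and coders are more involved): within each context of k preceding words, the continuations are re-mapped to their rank among the distinct continuations of that context, so the stored integers stay small.

    # Toy sketch of context-based identifier remapping for (k+1)-grams.
    from collections import defaultdict

    def build_context_maps(ngrams, k):
        """Map each word to its rank among the distinct continuations
        of its length-k context."""
        continuations = defaultdict(dict)
        for gram in ngrams:
            context, word = tuple(gram[:k]), gram[k]
            continuations[context].setdefault(word, len(continuations[context]))
        return continuations

    def remap(ngrams, k, maps):
        return [maps[tuple(g[:k])][g[k]] for g in ngrams]

    trigrams = [("the", "red", "car"), ("the", "red", "fox"),
                ("a", "red", "fox"), ("the", "blue", "car")]
    maps = build_context_maps(trigrams, k=2)
    print(remap(trigrams, 2, maps))   # [0, 1, 0, 0]: small, highly compressible values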

Fast Dictionary-based Compression for Inverted Indexes

Giulio Ermanno Pibiri, Matthias Petri and Alistair Moffat
ACM International Conference on Web Search and Data Mining (WSDM), 2019
Full paper · Conference paper

Abstract

Dictionary-based compression schemes provide fast decoding, typically at the expense of reduced compression effectiveness compared to statistical or probability-based approaches.
In this work, we apply dictionary-based techniques to the compression of inverted lists, showing that the high degree of regularity that these integer sequences exhibit is a good match for certain types of dictionary methods, and that an important new trade-off balance between compression effectiveness and compression efficiency can be achieved.
Our observations are supported by experiments using the document-level inverted index data for two large text collections, and a wide range of other index compression implementations as reference points. Those experiments demonstrate that the gap between efficiency and effectiveness can be substantially narrowed.

Handling Massive N-Gram Datasets Efficiently

Giulio Ermanno Pibiri and Rossano Venturini
arXiv.org, 2018
arXiv preprint

Abstract

This paper deals with the two fundamental problems concerning the handling of large n-gram language models: indexing, that is compressing the n-gram strings and associated satellite data without compromising their retrieval speed; and estimation, that is computing the probability distribution of the strings from a large textual source. Performing these two tasks efficiently is fundamental for several applications in the fields of Information Retrieval, Natural Language Processing and Machine Learning, such as auto-completion in search engines and machine translation.
Regarding the problem of indexing, we describe compressed, exact and lossless data structures that achieve, at the same time, high space reductions and no time degradation with respect to state-of-the-art solutions and related software packages. In particular, we present a compressed trie data structure in which each word following a context of fixed length k, i.e., its preceding k words, is encoded as an integer whose value is proportional to the number of words that follow such context. Since the number of words following a given context is typically very small in natural languages, we lower the space of representation to compression levels that were never achieved before. Despite the significant savings in space, our technique introduces a negligible penalty at query time. Compared to state-of-the-art proposals, our data structures outperform all of them in space usage, without compromising time performance. More precisely, the most space-efficient proposals in the literature, which are both quantized and lossy, are not smaller than our trie data structure and are up to 5 times slower. Conversely, we are as fast as the fastest competitor, but also retain an advantage of up to 65% in absolute space.
Regarding the problem of estimation, we present a novel algorithm for estimating modified Kneser-Ney language models, which have emerged as the de-facto choice for language modeling in both academia and industry thanks to their relatively low perplexity. Estimating such models from large textual sources poses the challenge of devising algorithms that make a parsimonious use of the disk. The state-of-the-art algorithm uses three sorting steps in external memory: we show an improved construction that requires only one sorting step by exploiting the properties of the extracted n-gram strings. With an extensive experimental analysis performed on billions of n-grams, we show an average improvement of 4.5X on the total running time of the state-of-the-art approach.

Variable-Byte Encoding is Now Space-Efficient Too

Giulio Ermanno Pibiri and Rossano Venturini
arXiv.org, 2018
arXiv preprint

Abstract

The ubiquitous Variable-Byte encoding is considered one of the fastest compressed representations for integer sequences. However, its compression ratio is usually not competitive with that of more sophisticated encoders, especially when the integers to be compressed are small, which is the typical case for inverted indexes. This paper shows that the compression ratio of Variable-Byte can be improved by 2x by adopting a partitioned representation of the inverted lists. This makes Variable-Byte surprisingly competitive in space with the best bit-aligned encoders, hence disproving the folklore belief that Variable-Byte is space-inefficient for inverted index compression.
Despite the significant space savings, we show that our optimization almost comes for free: we introduce an optimal partitioning algorithm that, by running in linear time and with low constant factors, does not affect indexing time, and we show that the query processing speed of Variable-Byte is preserved, with an extensive experimental analysis and comparison with several other state-of-the-art encoders.

Inverted Index Compression

Giulio Ermanno Pibiri and Rossano Venturini
Encyclopedia of Big Data Technologies, 2018
Book chapter

Abstract

The data structure at the core of today's large-scale search engines, social networks and storage architectures is the inverted index, which can be regarded as a collection of sorted integer sequences called inverted lists. Because of the many documents indexed by search engines and the stringent performance requirements dictated by the heavy load of user queries, the inverted lists often store several millions (even billions) of integers and must be searched efficiently.
In this scenario, compressing the inverted lists of the index is a mandatory design phase, since it introduces a twofold advantage over a non-compressed representation: it feeds faster memory levels with more data in order to speed up the query processing algorithms, and it reduces the number of storage machines needed to host the whole index. The scope of the chapter is to survey the most important encoding algorithms developed for efficient inverted index compression.

Efficient Data Structures for Massive N-Gram Datasets

Giulio Ermanno Pibiri and Rossano Venturini
ACM Conference on Research and Development in Information Retrieval (SIGIR), 2017
Full paper · Conference paper

Abstract

The efficient indexing of large and sparse $N$-gram datasets is crucial in several applications in Information Retrieval, Natural Language Processing and Machine Learning. Because of the stringent efficiency requirements, dealing with billions of $N$-grams poses the challenge of introducing a compressed representation that preserves the query processing speed.
In this paper we study the problem of reducing the space required by the representation of such datasets, maintaining the capability of looking up a given $N$-gram within microseconds. For this purpose we describe compressed, exact and lossless data structures that achieve, at the same time, high space reductions and no time degradation with respect to state-of-the-art software packages. In particular, we present a trie data structure in which each word following a context of fixed length $k$, i.e., its preceding $k$ words, is encoded as an integer whose value is proportional to the number of words that follow such context. Since the number of words following a given context is typically very small in natural languages, we are able to lower the space of representation to compression levels that were never achieved before. Despite the significant savings in space, we show that our technique introduces a negligible penalty at query time.

Dynamic Elias-Fano Representation

Giulio Ermanno Pibiri and Rossano Venturini
Annual Symposium on Combinatorial Pattern Matching (CPM), 2017
Full paper · Conference paper

Abstract

We show that it is possible to store a dynamic ordered set $\mathcal{S}(n,u)$ of $n$ integers drawn from a bounded universe of size $u$ in space close to the information-theoretic lower bound and yet preserve the asymptotic time optimality of the operations. Our results leverage the Elias-Fano representation of $\mathcal{S}(n, u)$ which takes $\textsf{EF}(\mathcal{S}(n, u)) = n\lceil \log\frac{u}{n}\rceil + 2n$ bits of space and can be shown to be less than half a bit per element away from the information-theoretic minimum.
Considering a RAM model with memory words of $\Theta(\log u)$ bits, we focus on the case in which the integers of $\mathcal{S}$ are drawn from a polynomial universe of size $u = n^\gamma$, for any $\gamma = \Theta(1)$. We represent $\mathcal{S}(n,u)$ with $\textsf{EF}(\mathcal{S}(n, u)) + o(n)$ bits of space and:
1. support static predecessor/successor queries in $\mathcal{O}(\min\{1+\log\frac{u}{n}, \log\log n\})$ time;
2. make $\mathcal{S}$ grow in an append-only fashion by spending $\mathcal{O}(1)$ per inserted element;
3. support random access in $\mathcal{O}(\frac{\log n}{\log\log n})$ worst-case, insertions/deletions in $\mathcal{O}(\frac{\log n}{\log\log n})$ amortized and predecessor/successor queries in $\mathcal{O}(\min\{1+\log\frac{u}{n}, \log\log n\})$ worst-case time. These time bounds are optimal.
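
As a quick numeric illustration of the space bound quoted above (not part of the paper), the snippet below evaluates $\textsf{EF}(\mathcal{S}(n, u)) = n\lceil \log_2\frac{u}{n}\rceil + 2n$ bits and compares it, per element, with the information-theoretic minimum $\log_2\binom{u}{n}$ bits for a set of $n$ integers from a universe of size $u$.

    import math

    def elias_fano_bits(n, u):
        # n * ceil(log2(u / n)) + 2 * n, as in the formula quoted above
        return n * math.ceil(math.log2(u / n)) + 2 * n

    def info_theoretic_bits(n, u):
        # log2 of the number of n-subsets of a universe of size u
        return math.log2(math.comb(u, n))

    n, u = 1_000, 1_000_000
    print(f"Elias-Fano : {elias_fano_bits(n, u) / n:.2f} bits per element")
    print(f"lower bound: {info_theoretic_bits(n, u) / n:.2f} bits per element")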

Clustered Elias-Fano Indexes

Giulio Ermanno Pibiri and Rossano Venturini
ACM Transactions on Information Systems (TOIS), 2017
Full paper · Journal paper

Abstract

State-of-the-art encoders for inverted indexes compress each posting list individually. Encoding clusters of posting lists offers the possibility of reducing the redundancy of the lists while maintaining a noticeable query processing speed.
In this paper we propose a new index representation based on clustering the collection of posting lists and, for each created cluster, building an ad-hoc reference list with respect to which all lists in the cluster are encoded with Elias-Fano. We describe a posting lists clustering algorithm tailored for our encoder and two methods for building the reference list for a cluster. Both approaches are heuristic and differ in the way postings are added to the reference list: either according to their frequency in the cluster or according to the number of bits necessary for their representation.
The extensive experimental analysis indicates that significant space reductions are indeed possible, beating the best state-of-the-art encoders.
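
A rough sketch of the reference-list idea follows. It is an illustrative reading of the abstract rather than the paper's actual encoder, and it omits the Elias-Fano coding step: postings of each list in a cluster are re-mapped to their positions in the cluster's reference list, so every list becomes a sequence over a much smaller universe.

    from collections import Counter

    def reference_by_frequency(cluster):
        """One of the two heuristics mentioned above: add postings to the
        reference list by decreasing frequency within the cluster."""
        freq = Counter(x for lst in cluster for x in lst)
        return [x for x, _ in freq.most_common()]

    def remap_to_reference(lst, reference):
        pos = {x: i for i, x in enumerate(reference)}
        return sorted(pos[x] for x in lst)

    cluster = [[2, 17, 150, 900], [2, 150, 900, 1024], [17, 150, 1024]]
    ref = reference_by_frequency(cluster)
    print([remap_to_reference(lst, ref) for lst in cluster])
    # the universe shrinks from 1025 to len(ref) = 5 distinct values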

Talks and Seminars

  • 07/06/2019 - ISTI-CNR, Pisa (Italy).
    A seminar about Ordered Set Problems.

  • 08/03/2019 - Department of Computer Science, Pisa (Italy).
    The (revised) slides for the final Ph.D. thesis defense.

  • 12/02/2019 - Melbourne Exhibition Center, Melbourne (Australia).
    WSDM Conference presentation.

  • 01/02/2019 - Department of Computer Science, Pisa (Italy).
    The slides of the first Ph.D. event showcasing the research of Ph.D. students.
    See also the YouTube video.

  • 15/11/2018 - Department of Computer Science, Pisa (Italy).
    The slides of the final Ph.D. thesis.

  • 29/10/2018 - Department of Computer Science, Pisa (Italy).
    A seminar talk about web graph compression.

  • 17/05/2018 - RMIT University, Melbourne (Australia).
    A seminar talk about VByte.

  • 10/04/2018 - Riken AIP, Tokyo (Japan).
    A seminar talk about Elias-Fano.

  • 10/10/2017 - Department of Computer Science, Pisa (Italy).
    The slides of the results achieved after two years of Ph.D.

  • 10/08/2017 - Keio Plaza Hotel, Tokyo (Japan).
    SIGIR Conference presentation.

  • 06/07/2017 - University of Warsaw Library, Warsaw (Poland).
    CPM Conference presentation.

  • 06/06/2017 - Università della Svizzera italiana, Lugano (Switzerland).
    IIR Conference presentation.

  • 17/10/2016 - Department of Computer Science, Pisa (Italy).
    The slides of the talk presenting my Ph.D. thesis proposal.

  • 21/06/2016 - Department of Computer Science, Pisa (Italy).
    A seminar talk about Elias-Fano.