A learned approach to design compressed rank/select data structures

Abstract

We address the problem of designing, implementing and experimenting with compressed data structures that support $\textsf{rank}$ and $\textsf{select}$ queries over a dictionary of integers. We shine a new light on this classical problem by showing a connection between the input integers and the geometry of a set of points in a Cartesian plane suitably derived from them. We then build upon some results in computational geometry to introduce the first compressed rank/select dictionary based on the idea of “learning” the distribution of such points via proper linear approximations (LA). We therefore call this novel data structure the la_vector.
We prove time and space complexities of the la_vector in several scenarios: in the worst case, in the case of input distributions with finite mean and variance, and taking into account the $k$th order entropy of some of its building blocks. We also discuss improved hybrid data structures, namely ones that suitably orchestrate known compressed rank/select dictionaries with the la_vector.
We corroborate our theoretical results with a large set of experiments over datasets originating from a variety of applications (Web search, DNA sequencing, information retrieval and natural language processing) and show that our approach provides new interesting space-time trade-offs with respect to many well-established compressed rank/select dictionary implementations. In particular, we show that our $\textsf{select}$ is the fastest, and our $\textsf{rank}$ is on the space-time Pareto frontier.

Publication
ACM Transactions on Algorithms