Abstract. In this paper we address the issue of compressing and
indexing data. We devise a data structure whose space occupancy is a function
of the entropy of the underlying data set. We call the data structure opportunistic
since its space occupancy is decreased when the input is compressible and
this space reduction is achieved at no significant slowdown in the query
performance. More precisely, its space occupancy is optimal in an information-content
sense because a text T[1,u] is stored using
O(Hk(T))
+ o(1) bits per input symbol in the worst case, where Hk(T)
is the k-th order empirical entropy of T (the bound holds
for any fixed k). Given an arbitrary string P[1,p], the opportunistic
data structure allows to retrieve all the occ occurrences of P
in
T in O(p + occ (logeu))
time, for any fixed e
>0. If data are uncompressible we achieve the best space bound currently
known [Grossi-Vitter, STOC '00]; on compressible data our solution improves
the succinct suffix array of [Grossi-Vitter, STOC '00] and the classical
suffix tree and suffix array data structures either in space or in query
time or both.
We also study our opportunistic data structure in a dynamic setting
and devise a variant achieving effective search and update time bounds.
Finally, we show how to plug our opportunistic data structure into the
Glimpse tool. The result is an indexing tool which achieves sublinear space
and sublinear query time complexity.
The data structure described in this paper has been implemented and extensively tested. The results are reported in [Ferragina Manzini, SODA '01].