FM-Index Version 2Paolo Ferragina and Rossano Venturini Dipartimento di Informatica, University of Pisa, Italy |
The present software offers a new implementation of the compressed full-text index data structure, called FM-index. How does it work?The novelty of the FM-index resides in the careful combination of the Burrows-Wheeler compression algorithm and the suffix-array data structure to obtain a sort of compressed suffix array. The resulting index is opportunistic since it takes advantage of the compressibility of the input data for decreasing the index space. This compression is achieved at no significant asymptotic slowdown in the query time with respect to the fastest known full-text indices (i.e. the suffix tree and the suffix array). The FM-index supports four main operations:
In practice all operations decompress just a tiny portion of the compressed file (typically a few kilobytes), thus taking a few milliseconds on a commodity PC even when the indexed files sum up to several megabytes. The structure of the FM-index consists of two parts: a core containing the basic indexing data structure, and an auxiliary set of information which are necessary to support the locate and extract operations, only. The FM-index allows to trade space occupancy for search time, by increasing/decreasing the size of the core and/or of the auxiliary information. The new implementation of the FM-index introduces some significant changes in the core and in the auxiliary information, as well it adopts new algorithmic solutions for the implementation of some basic steps of the FM-index inner working. This is the reason why we denote it as FM-index vers. 2 when we refer to the this new version. What is changed?The new version of FM-Index modifies the marking strategy in order to improve the performance of the locate operation. The index marks logicaly and uniformly the text by adding a special character every M characters of the original text. This way, all the BWT rows that start with that special character are contiguous, and for them we keep their explicit text position. The count algorithm must take into account the presence of the special character within the indexed text. Therefore, if a pattern P[1,p] has to be searched, we actually search for (p-1) patterns obtained by inserting the special characters in P equally spaced of M positions, plus we search for the pattern P itself. This search is implemented in parallel over all patterns by exploiting the fact that at any step i, we have to search either for P[p-i] or for the special character. This way, the overall search cost becomes quadratic in the pattern length. The output is a set of at most p ranges of rows. The locate algorithm proceeds for at most M phases as follows. At phase 0, S is the single range of rows to be located. At phase k, S contains rows that may be reached in k "backward" steps from the rows to be located. Therefore, at any step, S consists of a set of ranges of rows. To mantain the invariant, the algorithm picks up a range of S, let it be [a,b], and determines in parallel the z distinct characters that occur in the substring BWT[a,b]. Then it executes z backward steps, one per character, thus originating z new ranges of rows. The induction is preserved because the positions of these rows are at distance (k+1) from the rows to be located. The algorithm cycles over all ranges of S, and thus determines new ranges that will form the new set S. Notice that if a range of rows starts with the special character, its position in the indexed text is explicitly stored, so that the position of the corresponding row to be located, can be inferred by summing k. This range can be dropped from S. At most after M phases the set S will be empty, given that special characters are located at distance M in the original text. Where can I find it?The software package is available at the Pizza&Chili site. In order to work properly, the FM-index needs the ds_ssort library, designed by Giovanni Manzini. However, you do not need to download this library because it is included within the present package. To compile the FM-index software you must issue the command make in the FM-index home directory. This creates three useful commands:
The first two commands, fm-build and fm-search, allow to use the FM-index either as a full-text search engine, or as a compressed storage engine that allows to display parts of the indexed-compressed text without its full decompression. How to use command-line programsSome examples of usage follow (try also fm-build -h and fm-search -h): # fm-build MyFile -o IndexedMyFile Other options are available for fm-build in order to modify the size of the information stored in the core and in the auxiliary block. This allows to trade space versus time efficiency of the indexing and searching operations. For example: # fm-build -f 0.1 MyFile -o IndexedMyFile # fm-build -f 0 myfile -o IndexedMyFile # fm-search -c foo IndexedMyFile # fm-search -l foo IndexedMyFile # fm-search -e 100 -n 10 IndexedMyFile # fm-search -s 10 foo IndexedMyFile # fm-search -d MyFileCopy IndexedMyFile How to use the libraryYou can develop your own search engine by exploiting the functionalities offered by the FM-index library. In this case you need to include in your software the two files: fm-index.a and interface.h. Please have a look at the two sources fm_search_main.c and fm_build_main.c for working examples on the usage of the library. We point out that the FM-index library follows the API suggested
by the Pizza&Chili site for implementations of compressed full-text indexes. For further details on this initiative please have a look at one of the mirrors: the Italian mirror or the Chilean mirror. Please note that the function
If you set build_options to NULL then you choose the default values. ReferencesPaolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In Proc. 41st IEEE Symposium on Foundations of Computer Science (FOCS), Redondo Beach, CA, (USA), 2000. Paolo Ferragina and Giovanni Manzini. An experimental study of an opportunistic index.In Proc. 12th ACM-SIAM Symposium on Discrete Algorithms (SODA), Washington DC, (USA), 2001. Paolo Ferragina and Giovanni Manzini. Indexing compressed text. Journal of the ACM (JACM), 52, 4 (Jul. 2005), 552-581 Paolo Ferragina and Giovanni Manzini. An experimental study of a compressed index. Information Sciences. Vol 135, (2001), 13-28. |