## FM-Index Version 2Paolo Ferragina and Rossano Venturini Dipartimento di Informatica, University of Pisa, Italy |

The present software offers a new implementation of the compressed full-text index data structure, called FM-index. ## How does it work? The novelty of the FM-index resides in the careful combination of the Burrows-Wheeler compression algorithm and the suffix-array data structure to obtain a sort of The FM-index supports four main operations: -
Operation count computes the number of occurrences of an *arbitrary*pattern P as a substring of the indexed file T, in time proportional to the length of P. Hence, the running time of count*does not asymptotically*depend on the length of T. -
Operation locate retrieves the position in T of one occurrence of the pattern P in O(log n) time, where^{c}*c*is a positive constant chosen at the time the FM-index is built. -
Operation extract receives in input a position *Pos*and a positive integer*L*, and returns the substring T[Pos, Pos+L-1]. This operation can be deployed to decompress the whole file T. - Operation display receives in input a pattern
*P*and a positive integer*L*, and displays L characters surrounding each occurrence of the pattern P in T (overall 2L+|P| characters).
In practice all operations decompress just a tiny portion of the compressed file (typically a few kilobytes), thus taking a few milliseconds on a commodity PC even when the indexed files sum up to several megabytes. The structure of the FM-index consists of two parts: a
The new implementation of the FM-index introduces
some significant changes in the ## What is changed?The new version of FM-Index modifies the marking strategy in order to improve the performance of the locate operation. The index marks logicaly and uniformly the text by adding a special character every M characters of the original text. This way, all the BWT rows that start with that special character are contiguous, and for them we keep their explicit text position. The count algorithm must take into account the presence of the special character within the indexed text. Therefore, if a pattern P[1,p] has to be searched, we actually search for (p-1) patterns obtained by inserting the special characters in P equally spaced of M positions, plus we search for the pattern P itself. This search is implemented in parallel over all patterns by exploiting the fact that at any step i, we have to search either for P[p-i] or for the special character. This way, the overall search cost becomes quadratic in the pattern length. The output is a set of at most p ranges of rows. The locate algorithm proceeds for at most M phases as follows. At phase 0, S is the single range of rows to be located. At phase k, S contains rows that may be reached in k "backward" steps from the rows to be located. Therefore, at any step, S consists of a set of ranges of rows. To mantain the invariant, the algorithm picks up a range of S, let it be [a,b], and determines in parallel the z distinct characters that occur in the substring BWT[a,b]. Then it executes z backward steps, one per character, thus originating z new ranges of rows. The induction is preserved because the positions of these rows are at distance (k+1) from the rows to be located. The algorithm cycles over all ranges of S, and thus determines new ranges that will form the new set S. Notice that if a range of rows starts with the special character, its position in the indexed text is explicitly stored, so that the position of the corresponding row to be located, can be inferred by summing k. This range can be dropped from S. At most after M phases the set S will be empty, given that special characters are located at distance M in the original text. ## Where can I find it?The software package is available at the Pizza&Chili site. In order to work properly, the FM-index needs the ds_ssort library, designed by Giovanni Manzini. However, you do not need to download this library because it is included within the present package. To compile the FM-index software you must issue the command - fm-build: a command to build the FM-index data structure from an input file;
- fm-search: a command to search the FM-index, or visualize parts of the indexed text (snippets);
- fm-index.a: the FM-index library including all compression and indexing functionalities.
The first two commands, ## How to use command-line programs Some examples of usage follow (try also # fm-build MyFile -o IndexedMyFile Other options are available for fm-build in order to modify the size of the information stored in the # fm-build -f 0.1 MyFile -o IndexedMyFile # fm-build -f 0 myfile -o IndexedMyFile # fm-search -c foo IndexedMyFile # fm-search -l foo IndexedMyFile # fm-search -e 100 -n 10 IndexedMyFile # fm-search -s 10 foo IndexedMyFile # fm-search -d MyFileCopy IndexedMyFile ## How to use the libraryYou can develop your own search engine by exploiting the functionalities offered by the FM-index library. In this case you need to include in your software the two files: We point out that the FM-index library follows the API suggested
by the Pizza&Chili site for implementations of compressed full-text indexes. For further details on this initiative please have a look at one of the mirrors: the Italian mirror or the Chilean mirror. Please note that the function - -B Bsize: where
*Bsize*is the size in Kbytes of level 1 buckets (default 16); - -b bsize: where
*bsize*is the size in bytes of level 2 buckets (default 512).*bsize*must divide*Bsize*1024*; - -f frequency: where
*frequency*indicates the percentage of marked characters (default 0.02, namely 2%); - -a x: where
*x*indicates the behavior of FM-index over*text*. [default x=1]- x == 0, the FM-index operates directly on
*text*to build the suffix array. This means that you are responsible to allocate*length+overshoot*bytes for the text instead of*length*bytes, where*overshoot*is a value returned by the function*init_ds_ssort()*available within the*ds_ssort*library. You must include*ds_ssort.h*; - x == 1, the FM-index frees the allocated memory for
*text*after using it.*ds_ssort.h*is not necessary; - x == 2, the FM-index makes its internal copy of
*text*.*ds_ssort.h*is not necessary.
- x == 0, the FM-index operates directly on
If you set ## ReferencesPaolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In Paolo Ferragina and Giovanni Manzini. An experimental study of an opportunistic index.In Paolo Ferragina and Giovanni Manzini. Indexing compressed text. Journal of the ACM (JACM), 52, 4 (Jul. 2005), 552-581 Paolo Ferragina and Giovanni Manzini. An experimental study of a compressed index. Information Sciences. Vol 135, (2001), 13-28. |

Send Mail to Us | © Paolo Ferragina and Rossano Venturini, Last update: September, 2005.