INDEX
1. How
does it work
2. The
distribution package
3. Compilation
4. Experimental
results
5. Italian
documentation
The architecture of the library: functional kernel and data-structures kernel. The functional kernel contains the implementation of all basic algorithms that allow the managing and searching of the documents, the API and the console (an interconnection data structure used by the API to operate with the data-structures kernel).
In summary all the operations implemented in the functional kernel have the goal of making available to the user the information on the XML compressed document, as it is stored in the data-structures kernel.
The elements of the data structures kernel are created after that the XML documents undergo a compression and indexing process. During this step the Expat Parsing Library is used to parse the XML documents into tokens, depending on a set of separator characters specified by the user. The tokenised text is then compressed via a two phase-process: first the Huffman algorithm is applied on the tokenised text and then the Zlibcompressing function is issued on blocks of the Huffman's output of proper length. This solution offers high compression ratios (as it is shown in the experimental results) and the possibility of accessing parts of the text without decompressing the whole document.
The Huffman compression and the indexing data structures are based on the creation of a dictionary: the library stores in that dictionary not only the tokens that are needed to compress the text (that we call physical), but also those tokens that are useful to solve the queries (that we call logical; i.e. tokens derived from an internal entity expansion). There are some tokens that are both logical and physical, and they are the majority. The overall dictionary so obtained is then divided into three subdictionaries, one for each XML document component: tags, words and attributes. This feature is helpful to search the correct token during query answering. Searching operations are implemented by means of Agrep command, and these subdictionaries are compressed using Zlib. An inverted list of positions referring to the compressed document is associated with each logical token. The user may request the creation also of a proximity list for logical tokens in the words dictionary with the aim of implementing queries on the word proximity. Obviously elements of inverted lists are compressed to minimize the space occupancy using the byte align algorithm, which offers good compression ratio and fast decompression. To make the access to compressed data faster, various caching mechanisms are applied. Information to apply these mechanisms is stored in the Auxiliary Information data structure. To implement structural queries is built also a computational geometry data structure that we call tagstruct. This data structure allows to relate document structure and its textual part, and the snippets extraction process. The Data-Structures Kernel is stored on the file system and can be accessed by using the API.
The distribution package contains the Makefile, different source files which are necessary to compile and build the XCDE Library, as well other software libraries that XCDE uses. The distribution package also includes some commands and example programs, which should allow the user to practice the Library. It is distributed inside the xcde.tar file, which is a tar-compressed archive. To extract the files from the archive you must use the command:
tar -zxvf xcde.tar
which creates the directory XCDE and its files.
The directory XCDE according to the following structure:XCDE/ # makefile, COPYRIGHT, readme.txt and some scripts
|
/bin # commands and scripts
|
/examples # example programs with their source files
|
/include # C header files needed to compile the library
|
/lib # archive files *.a that contain the used libraries
|
/src # source files of the commands contained in /bin
|
/doc # documentation about algorithms and data structures
|
/libsrc # source files of other libraries and commands used by XCDE
|
/agrep # agrep command
|
/expat # expat library for XML files parsing
|
/hashfunc # library needed to build and manage hash tables
|
/integer # library for the variable length encoding of integers
|
/textbuffer # library needed to manage a text buffer
|
/xcde # XCDE library API source files
|
/zlib # zlib library for compression/decompression
After it has been decompressed, the directory XCDE contains only the source files. It's necessary to compile them in order to make the library usable, via the following command issued from the XCDE directory:
make
This command uses the Makefile presents in the XCDE directory to create commands, example programs and library archives.
WARNING: it's necessary that the two commands agrep and _xcde_trasf (that after the compilation are located in the directory /XCDE/bin) are moved to the /bin directory or in whichever other directory contained in the PATH variable.
In the following we report the results of some experiments designed to test the performance of the XCDE-Library. We ran all the experiments on a machine equipped with a 733Mhz Intel Mobile Celeron, 128 Mb RAM and a 10 Gb EIDE hard disk. The operating system was Gnu\Linux RedHat 7.2. The table below shows the size and content of the files used in our experiments. For it we use a collection of files supplied by the CIBIT (An Italian University institution for digital literature) containing Italian literary texts in XML-TEI format and some files derived from databases.
Name | Size | Content Type | Tags Dictionary Size | Attributes Dictionary size | Words Dictionary Size | Compressed Document Size | Inverted List Size | Tagstruct Size | Auxiliary Information Size |
100.xml | 272,510 | Literary Document | 259 | 300 | 32,864 | 79,056 | 178,038 | 1,427 | 1,162 |
083.xml | 851,171 | Literary Documents | 275 | 2,013 | 89,268 | 242,905 | 899,552 | 12,184 | 2,960 |
001.xml | 866,060 | Literary Document | 271 | 1,144 | 66,181 | 248,595 | 545,505 | 4,454 | 2,464 |
ado.xml | 1,978,0686 | Literary Document | 263 | 581 | 107,609 | 611,212 | 1,331,466 | 37,152 | 4,204 |
records16.xml | 16,000,089 | Database | 149 | 832,686 | 1,214,214 | 2,013,645 | 5,632,359 | 297,100 | 40,565 |
records32.xml | 32,000,080 | Database | 148 | 1,652,486 | 2,231,694 | 4,210,688 | 11,591,680 | 310,102 | 69,632 |
As we can see the space occupancy of all data structures is of the same order of magnitude of the original document size, and for larger documents it is even less. To access a document we spend approximately 10ms to open and close the file, 10ms to access an inverted list, 40 ms to decompress a text block of 64Kb and 30ms to perform a search over a dictionary. So in less that 90ms we can answer a simple query. Obviously this is a very pessimistic time estimate. These results show good time performance with a significant space reduction, especially compared to the space occupancy of other XML-Native indexers, like Natix, whose space occupancy is at least 3 times larger than the original document size, or other indexers based on Relational or Object Oriented Databases, which abundantly surpass the multiplicative factor 10.