XCDE - Query Engine

The XCDE Query Engine

INDEX

1. What is it?
2. How does it work
3. The distribution package
4. Compilation
5. Italian documentation

What is it?

The aim of this module is to support a sophisticated query language that combines some properties of XQuery (http://www.w3.org/TR/xquery/) with features typical of Information Retrieval (IR) languages. This research direction has recently been suggested by the W3C in the document http://www.w3.org/TR/xquery-full-text-requirements/ . As far as we know, this is one of the first attempts to implement that proposal on top of a compressed index for XML documents. The resulting query language lets the user perform sophisticated string queries over XML documents: proximity among multiple words, search by regular expressions or error-based (approximate) matches, snippet extraction to select the context of a query occurrence, substring/prefix/suffix word searches, and many other word-based queries. The directory Examples of the XCDE Library provides some running examples that illustrate these functionalities.

How does it work

Users issue queries over the original XML document through the XCDE query engine, which has a number of distinctive features. XQuery is a powerful query language, but it seems to have been designed with XML documents derived from structured data in mind. XML, conversely, is text-based, so its query language should be oriented mainly towards the processing and management of textual data. Our goal has therefore been to design a language that is simple to use and IR-oriented. Its distinctive features are the following.

• The query syntax is similar to SQL: SELECT-FROM-RETURN, but here the SELECT clause is specified by means of a piece of well-formed XML text. As a result, any user who knows a little XML can formulate a query without being forced to learn the XPath syntax, as XQuery requires. Most of the IR functionalities detailed in the document http://www.w3.org/TR/xquery-full-text-requirements/ have been implemented, and other powerful string-based queries are supported as well, such as regular expressions and error matches.

• The output of the query (the snippet) can be formatted via the RETURN clause which, again, contains a piece of well-formed XML text. The key tool for building that piece of XML text is a special attribute (hereafter called pivot) whose name is xml_var. This attribute is added to elements within the SELECT clause in order to identify some “interesting points” in each document subtree that matches the query. The pivots are then used in the RETURN clause to indicate how these interesting points of the matching subtree must be displayed. More than one pivot can be specified within the SELECT clause, so that many interesting points of the matching subtree can be identified simultaneously without resorting to complicated combinations of XPath expressions. Another distinctive feature of the snippet extraction process is that the user can define the size of the snippet to be extracted, whether tags are included, whether the snippet must be well-formed, and whether all elements including a given document part are retrieved.
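The role of a pivot can be pictured as a named placeholder: it is bound in the SELECT clause and expanded in the RETURN clause. The Python sketch below only mimics that substitution with string.Template from the standard library; it is not how the engine is implemented, and the function name render_return is hypothetical.

```python
from string import Template


def render_return(return_clause, bindings):
    """Expand $pivot names in a RETURN template (illustration only)."""
    # Pivot names carry a leading '$' in the query; Template keys do not.
    values = {name.lstrip("$"): text for name, text in bindings.items()}
    return Template(return_clause).substitute(values)


# Suppose one matching subtree bound the pivot $word to the text "fax".
snippet = render_return("<example_search> $word </example_search>",
                        {"$word": "fax"})
print(snippet)
```

Each matching subtree would yield its own set of bindings, and therefore its own rendered snippet.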

The query may be issued via the following command:

xcde_search2 [-p] [-f FILENAME] query_expression

The options -p and -f change the behaviour of the query engine:

-f FILENAME: this option allows the user to load the query from a file.

-p: this option tells the query engine to return the positions of the document subtrees that satisfy the query.

The query_expression is written as a piece of XML text drawn from the same set of elements and attributes as the queried document, plus some special elements that specify the IR functionalities and the snippet extraction process, as detailed below. The query expression may also use a RANGE clause, which specifies the range of results to be returned to the user.
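A minimal Python sketch of how the command line might be assembled from a script. It assumes that the whole query is passed as a single argument, that xcde_search2 is reachable through PATH, and that results are written to standard output; the helper names build_argv and run_query are hypothetical. The -f form is left out because it reads the query from a file instead.

```python
import subprocess


def build_argv(query_expression, positions=False):
    """Return the argument list for xcde_search2 (illustrative helper)."""
    argv = ["xcde_search2"]
    if positions:
        argv.append("-p")  # ask for the positions of the matching subtrees
    argv.append(query_expression)  # the whole query travels as one argument
    return argv


def run_query(query_expression, positions=False):
    """Run the query and return the engine's standard output as text."""
    # Assumption: the result snippets are printed on standard output.
    done = subprocess.run(build_argv(query_expression, positions),
                          capture_output=True, text=True, check=True)
    return done.stdout
```

Passing the arguments as a list avoids shell quoting problems with the angle brackets and quotes that appear in a query.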

A simple example of a single-word query: find all occurrences of the word "pax" with at most 1 error.

SELECT <xml_error xml_maxerr = '1' xml_var = '$word'>
          pax
       </xml_error>
FROM   exampleFile.xml
RETURN <example_search> $word </example_search>
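To make the error bound concrete, the Python sketch below checks which words lie within 1 error of "pax". It assumes that errors are counted as single-character insertions, deletions or substitutions (the usual edit distance); the error model is not spelled out here, so this is an assumption. The sketch illustrates the matching criterion only, not the engine's index-based search.

```python
def edit_distance(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def within_errors(word, pattern, max_err):
    """True if word matches pattern with at most max_err errors."""
    return edit_distance(word, pattern) <= max_err


words = ["pax", "fax", "tax", "pa", "paxo", "text"]
matches = [w for w in words if within_errors(w, "pax", 1)]
print(matches)
```

Under this assumption, "fax" and "tax" each differ from "pax" by one substitution, while "text" is three edits away and is rejected.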

The result snippets are numbered and identified by means of the name of the file that contains them.

<xcde:results>
<xcde:result>
<xcde:filename>exampleFile.xml</xcde:filename>

<xcde:return result_number = '1'>
<example_search> fax </example_search>
</xcde:return>
.
.
.
<xcde:return result_number = 'n'>
<example_search> tax </example_search>
</xcde:return>

</xcde:result>

</xcde:results >
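Since the result is itself XML, it can be post-processed with a standard parser. The Python sketch below assumes that the output has exactly the shape shown above. Because the sample does not declare the xcde prefix, the sketch binds it to a placeholder URI (urn:example:xcde, an assumption, not something defined by XCDE) before parsing; the function name parse_results is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = "urn:example:xcde"  # placeholder URI, not defined by XCDE


def parse_results(text):
    """Return (filename, result_number, snippet_xml) triples."""
    # Bind the undeclared xcde prefix so that the parser accepts the input.
    text = text.replace("<xcde:results>",
                        f'<xcde:results xmlns:xcde="{NS}">', 1)
    root = ET.fromstring(text)
    rows = []
    for result in root.findall(f"{{{NS}}}result"):
        filename = result.findtext(f"{{{NS}}}filename")
        for ret in result.findall(f"{{{NS}}}return"):
            number = ret.get("result_number")
            # Serialize each child of <xcde:return> back to text.
            snippet = "".join(
                ET.tostring(child, encoding="unicode").strip()
                for child in ret)
            rows.append((filename, number, snippet))
    return rows


sample = """<xcde:results>
<xcde:result>
<xcde:filename>exampleFile.xml</xcde:filename>
<xcde:return result_number = '1'>
<example_search> fax </example_search>
</xcde:return>
</xcde:result>
</xcde:results>"""

for row in parse_results(sample):
    print(row)
```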

The distribution package

The distribution package contains the Makefile and the source files needed to compile and build the Query Engine. It also includes some commands and example programs that should help the user become familiar with the library. It is distributed as the file xcde2.tar, a compressed tar archive. To extract the files from the archive, use the command:

tar -zxvf xcde2.tar

which creates the directory XCDE2 and its files.

The directory XCDE2 is organized according to the following structure:

XCDE2/ # makefile, COPYRIGHT, readme.txt and some scripts
  |
  /bin # commands and scripts
  |
  /examples # example programs with their source files
  |
  /include # C header files needed to compile the library
  |
  /lib # archive files *.a that contain the used libraries
  |
  /src # source files of the commands contained in /bin
  |
  /doc # documentation about algorithms and data structures
  |
  /libsrc # source files of other libraries and commands used by XCDE
    |
    /agrep # agrep command
    |
    /expat # expat library for XML files parsing
    |
    /hashfunc # library needed to build and manage hash tables
    |
    /integer # library for the variable length encoding of integers
    |
    /textbuffer # library needed to manage a text buffer
    |
    /xcde # XCDE library API source files
    |
    /zlib # zlib library for compression/decompression

Compilation

After it has been decompressed, the directory XCDE2 contains only the source files of the XCDE Kernel and Query Engine. It's necessary to compile them in order to make the library usable, via the following command issued from the XCDE2 directory:

make

This command uses the Makefile present in the XCDE2 directory to create the commands, example programs and library archives.

WARNING: the two commands agrep and _xcde_trasf (located in the directory /XCDE2/bin after compilation) must be moved to the /bin directory or to any other directory listed in the PATH variable.
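A quick way to confirm this requirement is to ask whether both commands resolve through PATH. The Python sketch below uses shutil.which from the standard library; the helper name missing_commands is hypothetical.

```python
import shutil

REQUIRED = ["agrep", "_xcde_trasf"]


def missing_commands(names=REQUIRED, which=shutil.which):
    """Return the required commands that cannot be found through PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    missing = missing_commands()
    if missing:
        print("not found in PATH:", ", ".join(missing))
    else:
        print("all required commands are reachable")
```

The which parameter exists only so that the lookup can be replaced in a test; in normal use the default is enough.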