Oracle9i Application Developer's Guide - XML Release 1 (9.0.1) Part Number A88894-01 |
|
This chapter describes the following aspects of Oracle Text (interMedia Text/Context):
This chapter contains the following sections:
Oracle Text can be used to search XML documents. It extends Oracle9i by indexing any text or document stored in Oracle. It can also search documents in the operating system (flat files) and URLs.
Oracle Text enables the following:
Oracle Text is packaged with the other interMedia products, namely, image, audio, video, and geographic location services for web content management applications.
Users can query XML data stored in the database directly, without using Oracle Text. However, Oracle Text is useful for boosting query performance.
See Also::
|
interMedia, including Oracle Text, is a standard feature that comes with every Oracle9i Standard, Enterprise, and Personal edition license. It needs to be selected during installation. No special installation instructions are required.
Oracle Text is essentially a set of schema objects owned by CTXSYS. These objects are linked to the Oracle kernel. The schema objects are present when you perform an Oracle9i installation.
You can find more examples for Oracle Text and for creating section group indexes at the following site: http://otn.oracle.com/products/text
XML text is aVARCHAR2 or CLOB type in an Oracle9i database table with character semantics. Oracle Text can also deal with documents in a file system or in URLs, but we are not considering these document types in this chapter.
To simplify the examples included in this chapter we consider a subset of the Oracle Text options. In this chapter's examples, w made the following assumptions:
Here is an example of using SECTION GROUP in CREATE INDEX:
CREATE INDEX my_index ON my_table ( my_column ) INDEXTYPE IS ctxsys.context PARAMETERS ( 'SECTION GROUP my_section_group' ) ;
With Oracle Text you can use the following users/roles:
This user is created at install time. Administer Oracle Text users as this user. It has the following privileges:
Any user can create an Oracle Text index and issue a Text query. For additional tasks, use the CTXAPP role. This is a system-defined role that allows you to perform the following tasks:
Oracle Text's main purpose is to provide an implementation for the CONTAINS operator. The CONTAINS operator is used in the WHERE clause of a SELECT statement to specify the query expression for a Text query.
Here is the CONTAINS syntax:
...WHERE CONTAINS([schema.]column,text_query VARCHAR2,[label NUMBER])
where:
For each row selected, CONTAINS returns a number between 0 and 100 that indicates how relevant the document row is to the query. The number 0 means that Oracle found no matches in the row. You can obtain this score with the SCORE operator.
The following example illustrates how the CONTAINS operator is used in a SELECT statement:
SELECT id FROM my_table WHERE CONTAINS (my_column, 'receipts') > 0
The 'receipts' parameter of the CONTAINS function is called the "Text Query Expression".
The following example searches for all documents in the text column that contain the word Oracle. The score for each row is selected with the SCORE operator using a label of 1:
SELECT SCORE(1), title from newsindex WHERE CONTAINS(text, 'oracle', 1) > 0 ORDER BY SCORE(1) DESC;
The CONTAINS operator must always be followed by the > 0 syntax. This specifies that the score value calculated by the CONTAINS operator must be greater than zero for the row selected.
When the SCORE operator is called, such as in a SELECT clause, the operator must reference the label value as shown in the example.
When documents have internal structure such as in HTML and XML, you can define document sections using embedded tags before you index. This enables you to query within the sections using the WITHIN operator.
You define sections as part of a section group. Use the WITHIN operator to narrow queries down into document sections. Document sections can be any of the following:
Here is the WITHIN syntax for querying sections:
expression WITHIN section
This searches for expression within a section. If you are using XML_SECTION_GROUP the following restrictions apply to the pre-defined zone, field, or attribute section:
You can combine and nest WITHIN clauses. For finer grained searches of XML sections, you can use WITHIN clauses inside CONTAINS select statements.
The WITHIN operator has the following limitations:
You can query within attribute sections when you index with either XML_SECTION_GROUP, AUTO_SECTION_GROUP, or PATH_SECTION_GROUP your section group type.
Consider the following XML document:
<book title="Tale of Two Cities">It was the best of times.</book>
If you use XML_SECTION_GROUP, you can name attribute sections anything with CTX_DDL.ADD_ATTR_SECTION.
To define section, title@book
, as the attribute section booktitle
, you can use either of the following methods:
CTX_DDL.ADD_ATTR_SECTION( group_name in varchar2, section_name in varchar2, tag in varchar2);
To define the title attribute as an attribute section, create an XML_SECTION_GROUP and define the attribute section as follows:
EXEC ctx_ddl_create_section_group('myxmlgroup', 'XML_SECTION_GROUP'); ctx_ddl.add_attr_section('myxmlgroup', 'booktitle', 'book@title');
When you define the TITLE attribute section as such and index the document set, you can query the XML attribute text as follows:
'Cities within booktitle'
ALTER INDEX [schema.]index REBUILD [ONLINE] [PARAMETERS (paramstring)];
where
paramstring = 'replace [datastore datastore_pref] [filter filter_pref] [lexer lexer_pref] [wordlist wordlist_pref] [storage storage_pref] [stoplist stoplist] [section group section_group] [memory memsize] | ... add attr section section_name tag tag@attr | add stop section tag'
Dynamically the clause add attr section section_name tag tag@attr
adds an attribute section section_name to the existing index. You must specify the XML tag and attribute in the form tag@attr. You can only add attribute sections to XML section groups.
The added section, section_name, applies only to documents indexed after this operation. Thus for the change to take effect, you must manually re-index any existing documents that contain the tag. The index is not rebuilt by this statement.
When you use the AUTO_SECTION_GROUP or PATH_SECTION_GROUP to index XML documents, the system automatically creates attribute sections and names them in the form attribute@tag.
To search on Tale within the attribute section booktitle
, include the following WITHIN clause in your SELECT statement:
... WHERE CONTAINS ('Tale WITHIN booktitle')>0;
... WHERE CONTAINS ('Tale WITHIN title@book')>0;
The following constraints apply to querying within attribute sections:
<book title="Tale of Two Cities">It was the best of times.</book>
querying on Tale will not work unless qualified with 'WITHIN title@book'.
....Now is the time for all good <word type="noun"> men </word> to come to the aid......
The search would result in a regular query's, "good men", and ignore the intervening attribute text.
Use the INPATH and HASPATH operators only when your index has been created with PATH_SECTION_GROUP.
Use of PATH_SE CTION_GROUP enables path searching. Path searching extends the syntax of the WITHIN operator so that the section name operand (right-hand-side) is a path instead of a section name.
Table 8-2 lists the different ways you can use the INPATH operator for path searching.
Path Search Feature | Syntax | Description |
---|---|---|
Simple Tag Searching |
virginia INPATH (//STATE) |
Finds all documents where the word "virginia" appears between <STATE> and </STATE>. The STATE element can appear at any level of the document structure. |
Case-sensitivity |
virginia INPATH (State) |
Tags and attribute names in path searching are case-sensitive. virginia INPATH STATE -- finds <STATE>virginia</STATE> but NOT <State>virginia</State>. To find the latter you must do virginia INPATH State. |
Top-Level Tag Searching |
virginia INPATH (/Legal) |
Finds all documents where "virginia" appears in a Legal element which is the top-level tag.'Legal' MUST be the top-level tag of the document.'virginia' may appear anywhere in this tag - regardless of other intervening tags. For example: <?xml version="1.0" standalone="yes"?> <!-- <?xml-stylesheet type="text/xsl" href="./xsl/vacourtfiling(html).xsl"?> --> <Filing ID="f001" FilingType="Civil"> <AddressState>VIRGINIA</AddressState> </Address> ... </Legal> |
Any Level Tag Searching |
|
'Virginia' can appear anywhere within an 'Address' tag, which may appear within any other tags. for example: <?xml version="1.0" standalone="yes"?> <!-- <?xml-stylesheet type="text/xsl" href="./xsl/vacourtfiling(html).xsl"?> --> <Filing ID="f001" FilingType="Civil"> <AddressState> VIRGINIA </AddressState>... </Legal> |
Direct Parentage Path Searching |
virginia INPATH (//CourtInformation/Location) |
Finds all documents where "virginia" appears in a Location element which is a direct child of a CourtInformation element. For example: <?xml version="1.0" standalone="yes"?> <!-- <?xml-stylesheet type="text/xsl" href="./xsl/vacourtfiling(html).xsl"?> --> <Filing ID="f001" FilingType="Civil"> <AddressState> VIRGINIA </AddressState> </Address>... </CourtInformation> |
Single-Level Wildcard Searching |
'virginia INPATH (//CaseCaption/*/Location)' |
Finds all documents where "virginia" appears in a B element which is a grandchild of an A element. For instance, <A><D><B>virginia</B></D></A>. The intermediate element does not need to be an indexed XML tag. For example: <?xml version="1.0" standalone="yes"?> <!-- <?xml-stylesheet type="text/xsl" href="./xsl/vacourtfiling(html).xsl"?> --> <Filing ID="f001" FilingType="Civil"> <AddressState>VIRGINIA</AddressState>... </Legal> |
Multi-level Wildcard Searching |
'virginia INPATH (Legal/*/Filing/*/*/CourtInformation)' |
'Legal' must be a top-level tag, and there must be exactly one tag-level between 'Legal' and 'Filing', and two between 'Filing' and 'CourtInformation'. 'Virginia' may then appear anywhere within 'CourtInformation'. For example: <?xml version="1.0" standalone="yes"?> <!-- <?xml-stylesheet type="text/xsl" href="./xsl/vacourtfiling(html).xsl"?> --> <Filing ID="f001" FilingType="Civil"> <AddressState>VIRGINIA</AddressState> IN THE CIRCUIT COURT OF LOUDOUN COUNTY </CourtInformation>.... |
Descendant Searching |
virginia INPATH(A//B) |
Finds all documents where "virginia" appears in a B element which is some descendant (any level) of an A element. |
Attribute Searching |
virginia INPATH(A/@B) |
Finds all documents where "virginia" appears in the B attribute of an A element. |
Descendant/Attribute Existence Testing |
virginia INPATH (A[B]) |
Finds all documents where "virginia" appears in an A element which has a B element as a direct child. |
|
virginia INPATH (A[@B = "pot of gold"]), would, with the default lexer and stoplist, match any of the following: <A B="POT OF GOLD">virginia</A> By default, lexing is case-independent, so "pot" matches "POT", <A B="POT BLACK GOLD">virginia</A> By default, "of" is a stopword, and, in a query, would match any word in that position, <A B=" Pot OF Gold ">virginia</A> |
Finds all documents where "virginia" appears in an A element which has a B attribute whose value is "foo".
Within equality (See "Using the HASPATH Operator for Path Searching" ) is used to evaluate the test. Whitespace is mainly ignored in text indexing. Again, lexing is case-independent: <A B="pot_of_gold">virginia</A> Underscore is a non-alphabetic character, and is not a join character by default. As a result, it is treated more or less as whitespace and breaks up that string into three words. |
|
|
Numeric literals are allowed. But they are treated as text. The within equality is used to evaluate.This means that the query does NOT match. Thst is, <A B="5.0">virginia</A> does not match A[@B=5] where "5.0", a decimal is not considered the same as 5, an integer. |
|
virginia INPATH (A[B AND @C = "foo"])... |
Predicates can be conjunctively combined. |
|
virginia INPATH (A[@B = "foo"]/C/D) virginia INPATH(A//B[@C]/D[E])... |
Node tests can be applied to any node in the path.
|
Note: The HASPATH operator functions in a similar fashion to the Existsnode() operator in XMLType. See Also Chapter 5, "Database Support for XML". |
Only use the HASPATH operator when your index has been created with the PATH_SECTION_GROUP. The syntax for the HASPATH operator is:
...WHERE CONTAINS (col,'HASPATH(A/B/C)')>0;
finds and returns a score of 100 for the document:
<A><B><C>Virginia</C></B></A>
without having to reference Virginia at all.
...WHERE CONTAINS virginia INPATH A
finds <A>virginia</A>, but it also finds <A>virginia state</A>. To limit the query to the term virginia and nothing else, you can use a section equality test with the HASPATH operator. For example:
... WHERE CONTAINS (col,'HASPATH(A="virginia")'
finds and returns a score of 100 only for the first document, and not the second.
To use Oracle Text to search and retrieve data from XML documents you must do the following overall tasks:
Before you create a section group and Oracle text index you must first determine the role you will need and grant the appropriate privilege. See "Oracle Text Users and Roles" , and grant the appropriate privilege.
After creating and preparing your data, you are ready to perform the next step. See Step 1. Create a Section Preference.
Using the section preference created, you then create an Oracle Text index. See Step 2. Create an Index Using the Section Preference Created in Step 1.
Now you can finish building your query application. See "Using Oracle Text to Search XML Documents".
First determine the role you need. See Oracle Text Reference and "Oracle Text Users and Roles" , and grant the appropriate privilege as follows:
CONNECT system/manager GRANT ctxapp to scott; CONNECT scott/tiger
Here we describe the basics of how to create section preferences using PATH_SECTION_GROUP, XML_SECTION_GROUP, and AUTO_SECTION_GROUP.
Table 8-3 describes the section groups you can use when indexing XML documents.
Section Group | Description |
---|---|
XML_SECTION_GROUP |
Use this group type for indexing XML documents and for defining sections in XML documents. |
AUTO_SECTION_GROUP |
Use this group type to automatically create a zone section for each start-tag/end-tag pair in an XML document. The section names derived from XML tags are case-sensitive as in XML. Attribute sections are created automatically for XML tags that have attributes. Attribute sections are named in the form attribute@tag. Stop sections, empty tags, processing instructions, and comments are not indexed. The following limitations apply to automatic section groups:
|
PATH_SECTION_GROUP |
Use this group type to index XML documents. Behaves like the AUTO_SECTION_GROUP. With this section group you can do path searching with the INPATH and HASPATH operators. Queries are case-sensitive for tag and attribute names. How is PATH_SECTION_GROUP Similar to AUTO_SECTION_GROUP? Documents are assumed to be XML, Every tag and every attribute is indexed by default, Stop sections can be added to prevent certain tags from being indexed, Only stop sections can be added -- ZONE, FIELD, and SPECIAL sections cannot be added, When indexing XML document collections, you do not need to explicitly define sections as Oracle automatically does this for you. How Does PATH_SECTION_GROUP Differ From AUTO_SECTION_GROUP? Path Searching is allowed at query time (see "Case Study: Searching an Online FAQ List Using Oracle Text" and "Using the HASPATH Operator for Path Searching" ) with the new INPATH and HASPATH operators, Tag and attribute names are case-sensitive in queries. |
How do you determine which section groups is best for your application? This depends on your application. Table 8-4 lists some general guidelines to help you decide which of the XML_, AUTO_, or PATH_ section groups to use when indexing your XML documents, and why.
The following command creates a section group called, xmlgroup, with the XML_SECTION_GROUP group type.
EXEC ctx_ddl.create_section_group('myxmlgroup', 'XML_SECTION_GROUP');
You can add sections to this group using CTX_DDL.ADD_SECTION
. Consider an XML file that defines the BOOK tag with a TITLE attribute as follows:
<BOOK TITLE="Tale of Two Cities"> It was the best of times. </BOOK>
To define the title attribute as an attribute section, create an XML_SECTION_GROUP and define the attribute section as follows:
EXEC ctx_ddl.add_attr_section('myxmlgroup', 'booktitle', 'book@title');
When you define the TITLE attribute section as such and index the document set, you can query the XML attribute text as follows:
'Cities within booktitle'
You can set up your indexing operation to automatically create sections from XML documents using the section group AUTO_SECTION_GROUP. Here, Oracle creates zone sections for XML tags. Attribute sections are created for those tags that have attributes, and these attribute sections are named in the form "tag@attribute."
The following command creates a section group called autogroup with the AUTO_SECTION_GROUP group type. This section group automatically creates sections from tags in XML documents.
EXEC ctx_ddl.create_section_group('autogroup', 'AUTO_SECTION_GROUP');
To enable path section searching, index your XML document with PATH_SECTION_GROUP. For example:
EXEC ctx_ddl.create_section_group('xmlpathgroup', 'PATH_SECTION_GROUP');
Create an index depending on which section group you used to create a preference:
To index your XML document when you have used XML_SECTION_GROUP, you can use the following statement:
CREATE INDEX myindex ON docs(htmlfile) INDEXTYPE IS ctxsys.context parameters('section group xmlgroup');
The following statement creates the index, myindex, on a column containing XML files using the AUTO_SECTION_GROUP:
CREATE INDEX myindex ON xmldocs(xmlfile) INDEXTYPE IS ctxsys.context PARAMETERS ('section group autogroup');
To index your XML document when you have used PATH_SECTION_GROUP, you can use the following statement:
CREATE INDEX myindex ON xmldocs(xmlfile) INDEXTYPE IS ctxsys.context PARAMETERS ('section group xmlpathgroup');
EXEC ctx_ddl_create_section_group('myxmlgroup', 'XML_SECTION_GROUP'); /* ADDING A FIELD SECTION */ EXEC ctx_ddl.Add_Field_Section /* THIS IS KEY */ ( group_name =>'my_section_group', section_name =>'author',/* do this for EVERY tag used after "WITHIN" */ tag =>'author' ); EXEC ctx_ddl.Add_Field_Section /* THIS IS KEY */ ( group_name =>'my_section_group', section_name =>'document',/*do this for EVERY tag after "WITHIN" */ tag =>'document' ); ... / /* ADDING AN ATTRIBUTE SECTION */ EXEC ctx_ddl.add_attr_section('myxmlgroup', 'booktitle', 'book@title'); /* The more sections you add to your index, the longer your search will take.*/ /* Useful for defining attributes in XML documents as sections. This allows*/ /* you to search XML attribute text using the WITHIN operator.*/ /* The section name: /* ** Is used for WITHIN queries on the attribute text. ** Cannot contain the colon (:) or dot (.) characters. ** Must be unique within group_name. ** Is case-insensitive. ** Can be no more than 64 bytes. ** The tag specifies the name of the attribute in tag@attr format. This is case-sensitive. */ /* Names used as arguments of the keyword WITHIN can be different from the actual XML tag names. Many tags can be mapped to the same name at query time.*/ /* ADDING A ZONE SECTION */ /* If You have an XML document that contains the <book> tag declared for */ /* different document types. You can create a distinct book section for each */ /* document type. If mydocname is declared in your DTD as an XML document */ /* type (root element) as follows: */ <!DOCTYPE mydocname ... [... /* Within mydocname, element <book> is declared. For this tag, you can create */ /* a section named, mybooksec, that's sensitive to the tag's document type as */ /* follows: */ EXEC ctx_ddl.add_zone_section('myxmlgroup', 'mybooksec', 'mydocname(book)'); /* Call CTX_DDL.Add_Zone_Section for each tag in your XML document that you need to search on. */ CREATE INDEX my_index ON my_table ( my_column ) INDEXTYPE IS ctxsys.context PARAMETERS ( 'SECTION GROUP my_section_group' ); SELECT my_column FROM my_table WHERE CONTAINS(my_column, 'smith WITHIN author') > 0;
ctx_ddl_create_section_group('auto', 'AUTO_SECTION_GROUP'); CREATE INDEX myindex ON docs(xmlfile_column) INDEXTYPE IS ctxsys.context PARAMETERS ('filter ctxsys.null_filter SECTION GROUP auto'); SELECT xmlfile_column FROM docs WHERE CONTAINS (xmlfile_column, 'virginia WITHIN title')>0;
EXEC ctx_ddl.create_section_group('xmlpathgroup', 'PATH_SECTION_GROUP');CREATE INDEX myindex ON xmldocs(xmlfile_column) INDEXTYPE IS ctxsys.context PARAMETERS ('section group xmlpathgroup');SELECT xmlfile_column FROM xmldocs ... WHERE CONTAINS (column, 'Tale WITHIN title@book')>0;
Building XML query applications with Oracle Text includes the following topics:
See Also:
In previous releases, the XML section group was unable to distinguish between tags in different DTD's. For instance, perhaps you have a DTD for storing contact information:
<!DOCTYPE contact> <contact> <address>506 Blue Pool Road</address> <email>dudeman@radical.com</email> </contact>
Appropriate sections might look like:
ctx_ddl.add_field_section('mysg','email', 'email'); ctx_ddl.add_field_section('mysg','address','address');
This is fine until you have a different kind of document in the same table:
<!DOCTYPE mail> <mail> <address>dudeman@radical.com</address> </mail>
Now your address section, originally intended for street addresses, starts picking up email addresses, because of tag collision.
Oracle8i release 8.1.5 and higher allow you to specify doctype limiters to distinguish between these tags across doctypes. Simply specify the doctype in parentheses before the tag as follows:
ctx_ddl.add_field_section('mysg','email','email'); ctx_ddl.add_field_section('mysg','address','(contact)address'); ctx_ddl.add_field_section('mysg','email','(mail)address');
Now when the XML section group sees an address tag, it will index it as the address section when the document type is contact
, or as the email section when the document type is mail
.
If you have both doctype-limited and unlimited tags in a section group:
ctx_ddl.add_field_section('mysg','sec1','(type1)tag1'); ctx_ddl.add_field_section('mysg','sec2','tag1');
Then the limited tag applies when in the doctype, and the unlimited tag applies in all other doctypes.
Querying is unaffected by this -- the query is done on the section name, not the tag, so querying for an email address would be done like:
radical WITHIN email
which, since we have mapped two different kinds of tags to the same section name, finds documents independent of which tags are used to express the email address.
You can query within attribute sections when you index with either of the following as your section group type:
Consider the following XML document:
<book title="Tale of Two Cities">It was the best of times.</book>
You can define the section title@book to be the attribute section title. You can do so with the CTX_DLL.ADD_ATTR_SECTION procedure or dynamically after indexing with ALTER INDEX.
To search on Tale within the attribute section title, you issue the following query:
'Tale WITHIN BOOK@TITLE'
You cannot use attribute sections in a nested WITHIN query.
Phrases ignore attribute text. For example, if the original document looked like:
Now is the time for all good <word type="noun"> men </word> to come to the aid.
The WITHIN operator requires you to know the name of the section you search. A list of defined sections can be obtained using the CTX_SECTIONS or CTX_USER_SECTIONS views.
In Oracle8i Release 1(8.1.5) and higher, XML_SECTION_GROUP offers the ability to index and search within attribute values. Consider a document with the following lines:
<comment author="jeeves"> I really like Oracle Text </comment>
Now XML_SECTION_GROUP offers an attribute section. This allows the inclusion of attribute values to index. For example:
ctx_ddl.add_attr_section('mysg','author','comment@author');
The syntax is similar to other add_section calls. The first argument is the name of the section group, the second is the name of the section, and the third is the tag, in the form <tag_name>@<attribute_name>. This tells Oracle Text to index the contents of the author attribute of the comment tag as the section "author".
Query syntax is just like for any other section:
WHERE CONTAINS ( ... ,'jeeves WITHIN author...',...)...
and finds the document.
Attribute sections allow you to search the contents of attributes. They do not allow you to use attribute values to specify sections to search. For instance, given the document:
<comment author="jeeves"> I really like Oracle Text </comment>
You can find this document by asking:
jeeves within comment@author
which is equivalent to "find me all documents which have a comment element whose author attribute's value includes the word jeeves".
However, there you cannot currently request the following:
interMedia within comment where (@author = "jeeves")
in other words, "find me all documents where interMedia appears in a comment element whose author is jeeves". This feature -- attribute value sensitive section searching -- is planned for future versions of the product.
Because the section group is defined before creating the index, Oracle8i Release 1 (8.1.5) is limited in its ability to cope with changing structured document sets; if your documents start coming with new tags, or you start getting new doctypes, you have to re-create the index to start making use of those tags.
In Oracle8i Release 2 (8.1.6) and higher allows you to add new sections to an existing index without rebuilding the index, using alter index and the new add section parameters string syntax:
add zone section <section_name> tag <tag> add field section <section_name> tag <tag> [ visible | invisible ]
For instance, to add a new zone section named tsec using the tag title:
alter index <indexname> rebuild parameters ('add zone section tsec tag title')
To add a new field section named asec using the tag author:
alter index <indexname> rebuild parameters ('add field section asec tag author')
This field section would be invisible by default, just like when using add_field_section. To add it as visible field section:
alter index <indexname> rebuild parameters ('add field section asec tag author visible')
Dynamic add section only modifies the index meta-data, and does not rebuild the index in any way. This means that these sections take effect for any document indexed after the operation, and do not affect any existing documents -- if the index already has documents with these sections, they must be manually marked for re-indexing (usually with an update of the indexed column to itself).
This operation does not support addition of special sections. Those would require all documents to be re-indexed, anyway. This operation cannot be done using rebuild online, but it should be a fairly quick operation.
The following constraints apply to querying within attribute sections:
<book title="Tale of Two Cities">It was the best of times.</book>
A query on Tale by itself does not produce a hit on the document unless qualified with WITHIN title@book. This behavior is like field sections when you set the visible flag set to false.
Now is the time for all good <word type="noun"> men </word> to come to the aid.
Then this document would hit on the regular query good men, ignoring the intervening attribute text.
WITHIN queries can distinguish repeated attribute sections. This behavior is like zone sections but unlike field sections. For example, for the following document:
<book title="Tale of Two Cities">It was the best of times.</book> <book title="Of Human Bondage">The sky broke dull and gray.</book>
Assume the book is a zone section and book@author is an attribute section. Consider the query:
'(Tale and Bondage) WITHIN book@author'
This query does not hit the document, because tale and bondage are in different occurrences of the attribute section book@author.
To build the query application with Oracle Text carry out the indexing steps first..
The next step is to build your query application. To do so follow these steps:
You can query within attribute sections when you index with either XML_SECTION_GROUP or AUTOMATIC_SECTION_GROUP as your section group type.
Nested tag searching is supported in Oracle Text.
The CTX_OBJECT_ATTRIBUTES view displays attributes that can be assigned to preferences of each object. It can be queried by all users.
Check out the structure of CTX_OBJECTS and CTX_OBJECT_ATTRIBUTE view, with the following DESCRIBE commands. Because we are only interested in querying XML documents in this chapter, we focus on XML_SECTION_GROUP and AUTO_SECTION_GROUP.
Describe ctx_objects SELECT obj_class, obj_name FROM ctx_objects ORDRR BY obj_class, obj_name;
The result is:
... SECTION_GROUP AUTO_SECTION_GROUP <<== SECTION_GROUP BASIC_SECTION_GROUP SECTION_GROUP HTML_SECTION_GROUP SECTION_GROUP NEWS_SECTION_GROUP SECTION_GROUP NULL_SECTION_GROUP SECTION_GROUP XML_SECTION_GROUP <<== ... Describe ctx_object_attributes SELECT oat_attribute FROM ctx_object_attributes WHERE oat_object = 'XML_SECTION_GROUP';
The result is:
OAT_ATTRIBUTE ------------- ATTR FIELD SPECIAL ZONE SELECT oat_attribute FROM ctx_object_attributes WHERE oat_object = 'AUTO_SECTION_GROUP';
The result is:
OAT_ATTRIBUTE ------------- STOP
The first thing you must do is create a preference. To do this, use the CTX_DDL.Create_Preference procedure. For example:
CTX_DDL.Create_Preference ( preference_name => 'books' /* or whatever you want to call it */ object_name => 'XML_SECTION GROUP' /* either XML_SECTION_GROUP or AUTO_ SECTION_GROUP */);
To drop this preference use the following syntax:
CTX_DDL.Drop_Preference ( preference_name => 'books');
To set the preference's attributes for XML_SECTION_GROUP, use the following procedures:
To set the preference's attributes for AUTO_SECTION_GROUP, use the following procedures:
There are corresponding CTX_DDL.drop sections and CTX_DDL.remove section syntax.
The syntax for CTX_DDL.add_zone_section follows:
CTX_DDL.Add_Zone_Section ( group_name => 'my_section_group' /* whatever you called it above */ section_name => 'author' /* what you want to call this section */ tag => 'my_tag' /* what represents it in XML */ );
where 'my_tag' implies opening with <my_tag> and closing with </my_tag>.
add_zone_section guidelines are listed here:
The syntax for CTX_DDL.add_attr_section follows:
CTX_DDL.Add_Attr_Section ( /* call this as many times as you need to describe the attribute sections */ group_name => 'my_section_group' /* whatever you did call it above */ section_name => 'author' /* what you want to call this section */ tag => 'my_tag' /* what represents it in XML */ );
where 'my_tag' implies opening with <my_tag> and closing with </my_tag>.
add_attr_section guidelines are listed here:
The more sections you add to your index, the longer your search will take.
add_attr_section adds an attribute section to an XML section group. This procedure is useful for defining attributes in XML documents as sections. This allows you to search XML attribute text with the WITHIN operator.
The section_name:
The tag specifies the name of the attribute in tag@attr format. This is case-sensitive.
The syntax for CTX_DDL.add_field_section follows:
CTX_DDL.Add_Field_Section ( group_name => 'my_section_group' /* whatever you called it above */ section_name => 'qq' /* what you want to call this section */ tag => 'my_tag' /* what represents it in XML */ ); visible => TRUE or FALSE );
add_field_section guidelines are listed here:
Consider again the query, "... CONTAINS virginia...". You may not get a hit if you use VISIBLE=TRUE. If VISIBLE=FALSE, the index will be smaller. You may lose some functionality but your performance will be improved, compared to if you set VISIBLE =TRUE.
Attribute section differs from field section in the following ways:
WHERE CONTAINS (..., '... jeeves',...)...
does NOT find the document, somewhat like field sections. Unlike field sections, however, attribute section within searches can distinguish between occurrences. Consider the document:
<comment author="jeeves"> I really like Oracle Text </comment> <comment author="bertram"> Me too </comment>
the query:
WHERE CONTAINS (...,'(cryil and bertram) WITHIN author', ...)...
will NOT find the document, because "jeeves" and "bertram" do not occur within the SAME attribute text.
<!DOCTYPE foo [ <!ELEMENT foo (bar)> <!ELEMENT bar (#PCDATA)>
<!ATTLIST bar rev CDATA "8i">]> <foo> <bar>whatever</bar> </foo>
and attribute section:
ctx_ddl.add_attr_section('mysg','barrev','bar@rev');
the query:
8i within barrev does not hit the document, although in XML semantics, the "bar" element has a default value for its "rev" attribute.
CtX_DDL.Add_Stop_Section ( group_name => 'my_section_group' /* whatever you called it above */ section_name => 'qq' /* what you want to call this section */ );
See the section, "Querying with the CONTAINS Operator" for information about how to use the CONTAINS operator in query statements.
You can query within attribute sections when you index with either XML_SECTION_GROUP or AUTO_SECTION_GROUP as your section group type.
Assume you have an XML document as follows:
<book title="Tale of Two Cities">It was the best of times.</book>
You can define the section title@book as the attribute section title. You can do so with the CTX_DLL.Add_Attr_Section procedure or dynamically after indexing with ALTER INDEX.
If you use the XML_SECTION_GROUP, you can name attribute sections anything with CTX_DDL.Add_Attr_Section.
To search on Tale within the attribute section title, issue the following query:
WHERE CONTAINS (...,'Tale WITHIN title', ...)
Consider an XML file that defines the BOOK tag with a TITLE attribute as follows:
<BOOK TITLE="Tale of Two Cities"> It was the best of times. </BOOK> <Author="Charles Dickens"> Born in England in the town, Stratford_Upon_Avon </Author>
Recall the CTX_DDL.Add_Attr_Section syntax is:
CTX_DDL.Add_Attr_Section ( group_name, section_name, tag );
To define the title attribute as an attribute section, create an XML_SECTION_GROUP and define the attribute section as follows:
ctx_ddl_create_section_group('myxmlgroup', 'XML_SECTION_GROUP'); ctx_ddl.add_attr_section('myxmlgroup', 'booktitle', 'book@title'); ctx_ddl.add_attr_section('myxmlgroup', 'authors', 'author'); end;
When you define the TITLE attribute section as such and index the document set, you can query the XML attribute text as follows:
... WHERE CONTAINS (...,'Cities WITHIN booktitle', ....)...
When you define the AUTHOR attribute section as such and index the document set, you can query the XML attribute text as follows:
... WHERE 'England WITHIN authors'
This example does the following:
drop table res_xml; CREATE TABLE res_xml ( pk NUMBER PRIMARY KEY , text CLOB ) ; insert into res_xml values(111, 'ENTITY chap8 "Chapter 8, <q>Keeping it Tidy: the XML Rule Book </q>"> this is the document section'); commit; --- --- script to create index on res_xml --- --- cleanup, in case we have run this before DROP INDEX res_index ; EXEC CTX_DDL.DROP_SECTION_GROUP ( 'res_sections' ) ; --- create a section group BEGIN CTX_DDL.CREATE_SECTION_GROUP ( 'res_sections', 'XML_SECTION_GROUP' ) ; CTX_DDL.ADD_FIELD_SECTION ( 'res_sections', 'chap8', '<q>') ; END ; / begin ctx_ddl.create_preference ( preference_name => 'my_basic_lexer', object_name => 'basic_lexer' ); ctx_ddl.set_attribute ( preference_name => 'my_basic_lexer', attribute_name => 'index_text', attribute_value => 'true' ); ctx_ddl.set_attribute ( preference_name => 'my_basic_lexer', attribute_name => 'index_themes', attribute_value => 'false'); end; / CREATE INDEX res_index ON res_xml(text) INDEXTYPE IS ctxsys.context PARAMETERS ( 'lexer my_basic_lexer SECTION GROUP res_sections' ) ;
Test the above index with a test query, such as:
SELECT pk FROM res_xml WHERE CONTAINS( text, 'keeping WITHIN chap8' )>0 ;
drop table explain_ex; create table explain_ex ( id number primary key, text varchar(2000) ); insert into explain_ex ( id, text ) values ( 1, 'thinks thinking thought go going goes gone went' || chr(10) || 'oracle orackle oricle dog cat bird' || chr(10) || 'President Clinton' ); insert into explain_ex ( id, text ) values ( 2, 'Last summer I went to New England' || chr(10) || 'I hiked a lot.' || chr(10) || 'I camped a bit.' ); commit;
Set Define Off select text from explain_ex WHERE CONTAINS ( text, '( $( think & go ) , ?oracle ) & ( dog , ( cat & bird ) ) & about(mammal during Bill Clinton)' ) > 0; select text from explain_ex WHERE CONTAINS ( text, 'about ( camping and hiking in new england )' ) > 0;
Consider an XML document set that contains the <book> tag declared for different document types. You need to create a distinct book section for each document type. Assume that mydocname is declared as an XML document type (root element) as follows:
<!DOCTYPE mydocname ... [...
Within mydocname, the element <book> is declared. For this tag, you can create a section named mybooksec that is sensitive to the tag's document type as follows:
begin ctx_ddl.create_section_group('myxmlgroup', 'XML_SECTION_GROUP'); ctx_ddl.add_zone_section('myxmlgroup', 'mybooksec', 'mydocname(book)'); end;
Zone sections can repeat. Each occurrence is treated as a separate section. For example, if <H1> denotes a heading section, they can repeat in the same documents as follows:
<H1> The Brown Fox </H1> <H1> The Gray Wolf </H1>
Assuming that these zone sections are named Heading.
The query:
WHERE CONTAINS (..., 'Brown WITHIN Heading', ...)...
returns this document.
But the query:
WHERE CONTAINS (...,' (Brown and Gray) WITHIN Heading',...)...
does not.
Zone sections can overlap each other. For example, if <B> and <I> denote two different zone sections, they can overlap in document as follows:
plain <B> bold <I> bold and italic </B> only italic </I> plain
Zone sections can nest, including themselves as follows:
<TD> <TABLE> <TD>nested cell</TD> </TABLE> </TD>
Using the WITHIN operator, you can write queries to search for text in sections within sections.
For example, assume the BOOK1, BOOK2, and AUTHOR zone sections occur as follows in documents doc1 and doc2:
doc1:
<book1><author>Scott Tiger</author> This is a cool book to read.</book1>
doc2:
<book2> <author>Scott Tiger</author> This is a great book to read.</book2>
Consider the nested query:
'Scott WITHIN author WITHIN book1'
This query returns only doc1.
A Text query application allows you to view the documents returned by a query. You typically select a document from the hitlist and then your application presents the document in some form.
With Oracle Text, you can render a document in different ways. For example, with the query terms highlighted. Highlighted query terms can be either the words of a word query or the themes of an ABOUT query in English. This rendering uses the CTX_DOC.HIGHLIGHT or CTX_DOC.MARKUP procedures.
You can also obtain theme information from documents with the CTX_DOC.THEMES PL/SQL package. Besides these there are several other CTX_DOC procedures for presenting your query results.
Consider the scenario where your company has several FAQs for each product. Each FAQ is an XML document similar to the following:
<?xml version="1.0"?> <FAQ OWNER="Billy Text"> <TITLE>Oracle Text FAQ</TITLE> <DESCRIPTION>Everything you always wanted to know ...</DESCRIPTION> <QUESTION>What is Oracle Text?</QUESTION> <ANSWER>Oracle Text uses standard SQL to index, ...</ANSWER> </FAQ>
In this sample FAQ case study, we are only using 3 FAQs, with titles: Text, Performance, and XML. The sample FAQ program searches for information within the FAQs' XML tags using the WITHIN operator. The pull-down menu for the FAQ user interface is generated at run-time. Figure 8-1 shows the online FAQ Search user interface. The pull down menu shows the XML elements selected for use in searching the FAQ data:
"FAQ Owner" is actually an element attribute. Attributes are also searchable. These tags or elements are used here to assist users in fine tuning their keyword search for desired FAQs.
Figure 8-2 shows to enter a search for keyword "XML", in all TITLE elements of the FAQs. The result of the search here is one FAQ with title, "XML..."
Figure 8-3 shows how you can enter an attribute search for "Billy" within FAQ OWNER by using TITLE, where TITLE is the attribute of the element FAQ OWNER. The syntax is "...WHERE Billy WITHIN faq@owner".
To create your online FAQ search program for your company, follow these steps:
1 Create and Populate Your FAQ Table. Create an Auto Section Group and Oracle Text Index
Run faqsearch_install.sql
. This script does the following:
insert into faq(tk,xml_desc) values(1,'<?xml version="1.0"?> <FAQ OWNER="Billy Text"> <TITLE>Oracle Text FAQ</TITLE><DESCRIPTION>Everything you always wanted to know about Text but were afraid to ask</DESCRIPTION> <QUESTION>What is Oracle Text?</QUESTION> <ANSWER>Oracle Text uses standard SQL to index, search, and analyze text and documents stored in the database, files or websites</ANSWER> <QUESTION>What is ABOUT?</QUESTION> <ANSWER>ABOUT queries increase the number of relevant documents returned by a query.</ANSWER>'); insert into faq(tk,xml_desc) values(2,'<?xml version="1.0"?><FAQ OWNER="Jack Performance"> <TITLE>Text Performance Guide</TITLE> <DESCRIPTION>Oracle Text and interMedia Text performance guide</DESCRIPTION> <QUESTION>What do we mean by query performance anyway?</QUESTION> <ANSWER>There are generally two measures of query performance - response time (the time to get an answer to an individual query), and throughput (the number of queries that can be run in any time period, eg queries per second).</ANSWER></FAQ>'); insert into faq(tk,xml_desc) values(3,'<?xml version="1.0"?><FAQ OWNER="John XML"> <TITLE>XML FAQ</TITLE><DESCRIPTION>Oracle XML FAQ</DESCRIPTION> <QUESTION>What is XML?</QUESTION> <ANSWER>XML stands for eXtensible Markup Language. XML s quickly becoming the standard way to identify and describe data on the web because it has proved broadly implementable and easy to deploy</ANSWER> <QUESTION>What is the Oracle Kit?</QUESTION> <ANSWER>The Oracle XML Developer Kit (XDK) contains the basic building blocks for reading, manipulating, transforming and viewing XML documents. To provide a broad variety of deployment options, the Oracle XDK is available for Java, C, C++ and PL/SQL.</ANSWER></FAQ>'); commit; begin ctx_ddl.create_section_group('faq_auto_section_group','auto_section_group'); end; / create index faq_idx on faq(xml_desc) indextype is ctxsys.context parameters('section group faq_auto_section_group') /
To develop and deploy PL/SQL Server Pages (PSP), you need the Oracle server at version 8.1.6 or later, together with a PL/SQL web gateway. Currently, the web gateways are Oracle Internet Application Server (iAS), the WebDB PL/SQL Gateway, and the OAS PL/SQL Cartridge. Before you start with PSP, you should have access to both the database server and the web server for one of these gateways.
To compile showxml.psp,
use the command:
loadpsp -replace -user username/passwd showxml.psp
<%@ plsql procedure="showxml" %> <%@ plsql parameter="id" default="null" %> <%! v_text varchar2(32767); %> <%! v_text_xml varchar2(32767); %> <% select xml_desc into v_text from faq where tk=id; %> <html> <title>Show xml </title> <body> <h3>XML Content</h3> <hr> <pre> <% v_text_xml := replace(v_text,'<','<'); %> <%= v_text_xml %> </pre> </body> </html>
Open a URL in your browser to access faqsearch.psp
, as follows:
http://myserver_and_directory/faqsearch
<%@ plsql parameter="query" default="null" %> <%@ plsql parameter="tagvalue" default="null" %> <%! v_results numeric := 0; %> <html> <head> <title>FAQ Search </title> </head> <body> <% If query is null Then %> <center> <h3>FAQ Search </h3> <form method=post action="faqsearch"> <p> Search for <input type=Text size=15 maxlength=25 name="query"> within <select name="tagvalue"> <% -- generates the pull-down menu with the following select for c in (select token_text from DR$FAQ_IDX$I where token_type=2) loop %> <option value="<%= c.token_text %>"><%= c.token_text %> <% end loop; %> <option value="faq@Owner">FAQ Owner </select> <input type=submit value="Search"> </form> <% Else %> <%! color varchar2(6) := 'ffffff'; %> <center> <h3>FAQ Search</h3> <form method=post action="faqsearch"> <p> Search for <input type=text size=15 maxlength=25 name="query" value=<%= query %>> within <select name="tagvalue"> <% -- generates the pull-down menu with the following select for c in (select token_text from DR$FAQ_IDX$I where token_type=2) loop %> <option value="<%= c.token_text %>"><%= c.token_text %> <% end loop; %> <option value="faq@Owner">FAQ Owner </select> <input type=submit value="Search"> </form> <p> <%! v_query varchar2(400); %> <%! v_desc_substr varchar2(1000); %> <%! v_desc_item varchar2(1000); %> <%! v_desc_start number; %> <%! v_desc_final number; %> <%! v_title_substr varchar2(1000); %> <%! v_title_start number; %> <%! v_title_final number; %> <% v_query := query || ' WITHIN '|| tagvalue; -- Text query using WITHIN for XML documents for doc in (select tk, xml_desc from faq where contains(xml_desc,v_query) >0 ) loop v_results := v_results + 1; if v_results = 1 then %> <table border="1" cellpadding="4" cellspacing="0"> <tr bgcolor="#6699CC"> <th>Title</th> <th>Description</th> </tr> <% end if; %> <tr bgcolor="#<%= color %>"> <% v_title_start := instr(doc.xml_desc,'<TITLE>'); v_title_final := instr(doc.xml_desc,'</TITLE>'); v_title_substr := substr(doc.xml_desc,v_title_ start+length('<TITLE>'),v_title_final - length('</TITLE>') - v_title_start+1); v_desc_start := instr(doc.xml_desc,'<DESCRIPTION>'); v_desc_final := instr(doc.xml_desc,'</DESCRIPTION>'); v_desc_substr := substr(doc.xml_desc,v_desc_ start+length('<DESCRIPTION>'),v_desc_final - length('</DESCRIPTION>') - v_desc_ start+1); %> <td> <a href="showxml?id=<%= doc.tk %>"><%= v_title_substr %></a> </td> <% v_desc_item := replace(v_desc_substr,'<','<'); %> <td> <%= v_desc_item %> </td> </tr> <% if (color = 'ffffff') then /* alternate row color */ color := 'eeeeee'; else color := 'ffffff'; end if; end loop; %> </table> </center> <% if v_results = 0 then %> <center><h3>Found 0 records for your query</h3></center> <% end if; %> <% End if;%> <p> <hr> <p> </body> </html>
This FAQ section is divided into the following categories:
Currently Oracle Text (intermedia Text) has the option to create indexes based on the content of a section group. But most XML Elements are of the type of Element. So, the only option for searching would be attribute values. Can I build indexes on attribute values?
Releases from 8.1.6 and higher allow attribute indexing.
See the following site:
http://otn.oracle.com/products/intermedia/htdocs/text_training_816/Samples/imt_816_techover.html#SCN
I know that an intact XML document can be stored in a CLOB in ORACLE's XML solution.
[XML document stored in BLOB]...<name id="1111"><first>lee</first> <sencond>jumee</second></name>...
Can value(lee, jumee) be queried by elements, attributes and the structure of XML document?
lee within first or this:jumee within second or this:1111 within name@id
you can combine these like this:
lee within first and jumee within secondor this:(lee within first) within name.
For more information, please read the "interMedia Text Technical Overview" for 8.1.5 and 8.1.6 available on OTN.
Another alternative is to decompose the XML document and store the information in relational fields. Then you could modify individual elements, have element-level simultaneous access, and so on. In this case, using something called the USER_DATASTORE, you can use PL/SQL to reconstitute the document to XML for text indexing. Then, you get text search as if it were XML, but data management as if it were relational data. Again, see interMedia Text Technical Overview for more information.
Is it possible for Oracle Text (intermedia Text) to index XML such as:
2/7/1968
and then process a query such as:
Who has brown hair, that is, select name from person where hair.color = "BROWN"
Searches based on structural conditions are not yet available through Oracle Text (intermedia Text). Attribute searches are supported fromOracle8i Release 2 (8.1.6). For reference you should not put data in attributes as that will not be compliant with XML Schema when it becomes a recommendation.
I need to store a large XML file in Oracle8i, search it, and return a specific tagged area.Using Oracle Text (intermedia Text) some of this is possible:
How do I return a zone or a field based on a text search?
Oracle Text (intermedia Text) will only return the "hits". You will need to subsequently parse the CLOB to extract a section.
How do I insert XML documents into a database? I need to insert the XML document "as is" in column of datatype CLOB into a table.
Oracle's XML SQL Utility for Java offers a command-line utility that can be used for loading XML data. More information can be found on the XML SQL Utility at: http://otn.oracle.com/tech/xml and here in Chapter 7, "XML SQL Utility (XSU)".
You can insert the XML documents as you would any text file. There is nothing special about an XML-formatted file from a CLOB perspective.
Question 2
Oracle Text (intermedia Text) can be used to index and search XML stored in CLOBs? How can we get started?
Versions of Oracle Text (intermedia Text) before Oracle8i Release 2 (8.1.6) only allowed tag-based searching. The current version allows for XML structure and attribute based searching. There is documentation on how to have the index built and the SQL usage in the Oracle Text documentation.
I have this xml:
<person> <name>efrat</name> <childrens> <child> <id>1</id> <name>keren</name> </child> </childrens> </person>
How do I find the person who has a child name keren but not the person's name keren? Assuming I defined every tag with the add_zone_section that can be nested and can include themselves.
Use selectSingleNode or selectNodes with XPATH string as a parameter.eg. selectSingleNode("//child/name[.='keren'])Also, I recommend making id as an attribute instead of a tag.
Where can I get good samples of searching XML with Oracle Text.
See the following manuals:
Can Oracle Text (intermedia Text) recognize the tags in my XML document or do I have to use the add_field_section command for each tag in the XML document? My XML documents have hundreds of tags. Is there an easy way to do this?
Which version of the database are you using? I believe you need to do it for 8.1.5 but not in Oracle8i Release 2(8.1.6). You can use AUTO_SECTION_GROUP in 8.1.6
XSQL Servlet ships with a complete (albeit simple from the interMedia standpoint) example of a SQL script that creates a complex XML Datagram out of Object Types, and then creates an Oracle Text (intermedia Text) index on the XML Document Fragment stored in the "Insurance Claim" type.
If you download the XSQL Servlet, and look at the file ./xsql/demo/insclaim.sql you'll be able to see the interMedia stuff at the bottom of the file. One of the key new features in interMedia in Oracle8i Release 2(8.1.6) is the AUTO Sectioner for XML.
I have an XML document that I have stored in CLOB. I have also created the indexes on the tags using section_group, and so on. One of the tags is <SALARY> </SALARY> I want to write an SQL statement so as to select all the records that have salary lets say > 5000. How do I do this? I cannot use WITHIN operator. I want to interpret the value present in this tag as a number. This could be floating point number also since this is salary.
You cannot do this in Oracle Text. Range search is not really a text operation. The best solution is to use the other Oracle XML parsing utilities to extract the salary into a NUMBER field -- then you can use Oracle Text (intermedia Text) for text searching, and normal SQL operators for the more structured fields, and achieve the same results.
We are storing all our documents in XML format in a CLOB. Are there utilities available in Oracle perhaps interMedia to retrieve the contents a field at a time, that is given a field name, get the text between tags, as opposed to retrieving the whole document and traversing the structure?
interMedia does not do section extraction. See the XML SQL Utility for this in Chapter 7, "XML SQL Utility (XSU)".
I have created a view based on 7-8 tables and it has columns like, custordnumber, product_dscr, qty, prdid,shipdate, ship_status, and so on. I need to create an Oracle Text index on the three columns:
Is there a way to create a Text index on these columns?
The short answer is yes. You have two options:
We are using mySQL to do partial indexing of 9 million Web pages a day. We are running on a 4-processor Sparc 420 and unable to do full text indexing. Can Oracle8i or Oracle9i do this?
We are not interested in transactional integrity, applying any special filters to the text pages, or in doing any other searching other than straight boolean word searches (no scoring, no stemming, no fuzzy searches, no proximity searches, and so on).
I have are the following questions:
Yes. Oracle Text (interMedia Text) can create a full-text index on 9 million web pages - and pretty quickly. In a benchmark on a large Sun box, we indexed 100Gig of web pages (about 15 million) in 7 hours. We can also do partial indexing via regular DML or (in 9i) via partitioning.
You can do "indexing light" to some extent - you can disable theme indexing, you do not need to filter documents if they are already ASCII/HTML/XML, and most common expansions - fuzzy, stemming, proximity - are done at query time.
We know that Oracle 8i Release 2 (8.1.6) allows multiple language records to be stored in the same table (and column) and interMedia handles the index appropriately based on the language setting for each row (using the multi-lexer feature).
Currently we use one CLOB column in the table and it is indexed using Oracle Text. The column content is in XML (tagged) format and we use fields, sections and zone functions for indexing and search. This works perfectly as we only have a single language data in the database (and we have different database for different languages and sites) and we are currently using Oralce8i Release 1 (8.1.5) so we have NLS_LANG appropriately set for indexing and searches work correctly for individual languages.
However, we now have to store multi-lingual data in the same table (and column). Individual data elements may also be in different languages. For example, this is a record for a book that has a French title but Spanish authors. At present we have all this information in the CLOB column separated by fields/sections. My questions are:
Answer
How would I define interMedia parameters so that I would be able to search my CLOB column for records that contained "aorta" and "damage". For example using the following XML (DTD implied):
WellKnownFileName.gif echocardiogram aorta
This is an image of the vessel damage. It would be nice to see a simple (or complicated) example of an XML interMedia implementation.
I assume there is no need to setup the ZONE or FIELDS.....Is this the case?
If you save an XML Document fragment in a CLOB, and enable an Oracle Text (intermedia Text) XML index on it, then you can do an SQL query which uses the CONTAINS() operator as the following query does:
Assume you have a document like an insurance claim:
77804 1999-01-01 00:00:00.0 8895 1044 Paul Astoria 123 Cherry Lane SF CA 94132 1999-01-05 00:00:00.0 7600 JCOX It was becase of Faulty Brakes
If you store the content as a document fragment in a CLOB, then you can do a query like the following (assuming everything else you store in relational tables):
REM Select the SUM of the amounts of REM all settlement payments approved by "JCOX" REM for claims whose relates to Brakes. select sum(n.amount) as TotalApprovedAmount from insurance_claim_view v, TABLE(v.settlements) n where n.approver = 'JCOX' and contains(damageReport,'Brakes within Cause') >
If I store XML in CLOBs and use the DOM or SAX to reparse the XML later as needed.How can I search this document repository? Oracle Text (intermedia Text) seems ideal. Do you have an example of setting this up using intermedia in Oracle8i, demonstrating how to define the XML_SECTION_GROUP and where to use a ZONE as opposed to a FIELD, and so on? For example:
How would I define Intermedia parameters so that I would be able to search my CLOB column for records that had the "aorta" and "damage" in the using the following XML (DTD implied) WellKnownFileName.gif echo cardiogram aorta This is an image of the vessel damage.
Oracle8i Release 2 (8.1.6) and higher allow searching within attribute text. That's something like: state within book@author. Oracle now offers attribute value sensitive search, more like the following:
state within book[@author = "Eric"]:
begin ctx_ddl.create_section_group('mygrp','basic_section_group'); ctx_ddl.add_field_section('mygrp','keyword','keyword'); ctx_ddl.add_field_section('mygrp','caption','caption'); end; create index myidx on mytab(mytxtcolumn)indextype is ctxsys.contextparameters ('section group mygrp'); select * from mytab where contains(mytxtcolumn, 'aorta within keyword')>0; options:
It is not so clear. It looks to me like this example is trying to find instances of elements containing "damage" that have a sibling element containing "aorta" within the same record. It's not clear what exactly he means by "record".
If each record equates to the in his example, and there can be multiple records in a single XML LOB, than I don't see how you could do this search with interMedia.
If there is only one per CLOB/row, than perhaps you could find it by ANDing two context element queries. But that would still be a sloppy sort of xml search relying on some expected limitations of the situation more so than the structural composition actually called for.
I need to store XML files (that are present on the file system as of now) into the database. I want to store the whole document. What I mean is that I do not want to break the document as per the tags and then store the info in separate tables/fields. Rather I want to have a universal table, that I can use to store different XML documents. I think internally it will be stored in a CLOB type of field in my case. My XML files will always contain ASCII data.
Can this be done using interMedia. Should we be using Oracle Text (intermedia Text) or interMedia Annotator for this? I downloaded Annotator from OTN, but I could not store XML document in the database.
I am trying to store XML document into CLOB column. Basically I have one table with the following definition:
CREATE TABLE xml_store_testing ( xml_doc_id NUMBER, xml_doc CLOB )
I want to store my XML document in xml_doc field.
I have written another PL/SQL procedure shown below, to read the contents of the XML Document. The XML document is available on the file system. XML document contains just ASCII data - no binary data.
CREATE OR REPLACE PROCEDURE FileExec ( p_Directory IN VARCHAR2, p_FileName IN VARCHAR2) AS v_CLOBLocator CLOB; v_FileLocator BFILE; BEGIN SELECT xml_doc INTO v_CLOBLocator FROM xml_store_testing WHERE xml_doc_id = 1 FOR UPDATE; v_FileLocator := BFILENAME(p_Directory, p_FileName); DBMS_LOB.FILEOPEN(v_FileLocator, DBMS_LOB.FILE_READONLY); dbms_output.put_line(to_char(DBMS_LOB.GETLENGTH(v_FileLocator))); DBMS_LOB.LOADFROMFILE(v_CLOBLocator, v_FileLocator, DBMS_LOB.GETLENGTH(v_FileLocator)); DBMS_LOB.FILECLOSE(v_FileLocator); END FileExec;
When I execute this procedure, it executes successfully. But when I select from the table I see unknown characters in the table in CLOB field. Could this be because of the reason of the character set difference between operating system (where XML file resides) and database (where CLOB data resides).
Yes. If the character sets are different then you probably have to pass the data through UTL_RAW.CONVERT to do a character set conversion before writing to the CLOB.
We need to insert data in the Database from an XML file. Currently we only can insert structured data with the table already created. Is this true?
We are working in a law project where we need to store laws that have structured data and unstructured data, and then search the data using Oracle Text (interMedia Text). Can we insert unstructured data too? Or do we need to develop a custom application to do it? Then if we have the data stored with some structured parts and some unstructured parts, can we use Oracle Text to search it? If we stored the unstructured part in a CLOB, and the CLOB has tags, how can we search only data in an specific tag?
Consider using iFS which allows you to break up a document storing it across tables and in a LOB. Oracle Text can perform data searches with tags and is knowledgeable about the hierachical XML structure. From Oracle8i Release 2 (8.1.6), Oracle Text (intermedia Text) has this capability along with name/value pair attribute searches.
Hence, this document breaking is not possible in these cases if I don't create a custom development? Although interMedia does not understand hierachical XML structure, can I do something like this?
<report> <day>yesterday</day> there was a disaster <cause>hurricane</cause> </report>
Indexing with Oracle Text I would like to search LOBs where cause was hurricane. Is this possible?
You can perform that level of searching with the current release of Oracle Text (intermedia Text). Currently to break a document up you have to use the XML Parser with XSLT to create a stylesheet that transforms the XML into DDL. iFS gives you a higher level interface.
Another technique is to use a JDBC program to insert the text of the document or document fragment into a CLOB or LONG column, then do the searching using the CONTAINS() operator after setting up the indexes.
|
Copyright © 1996-2001, Oracle Corporation. All Rights Reserved. |
|