8 Globalization and Unicode Support

This chapter describes OCCI support for multibyte and Unicode charactersets.

This chapter contains these topics:

Overview of Globalization and Unicode Support
Specifying Charactersets
Datatypes for Globalization and Unicode Support
Objects and OTT Support

Overview of Globalization and Unicode Support

OCCI now enables application development in all Oracle supported multibyte and Unicode charactersets. The UTF16 encoding of Unicode is fully supported. Application programs can specify their charactersets when the OCCI Environment is created. OCCI interfaces that take character string arguments (such as SQL statements, username/passwords, error messages, object names, and so on) have been extended to handle data in any characterset. Character data from relational tables or objects can be in any characterset. OCCI can be used to develop multi-lingual, global and Unicode applications.

Specifying Charactersets

OCCI applications need to specify the client characterset and client national characterset when initializing the OCCI Environment. The client characterset specifies the characterset for all SQL statements, object/user names, error messages, and data of all CHAR datatype (CHAR, VARCHAR2, LONG) columns/attributes. The client national characterset specifies the characterset for data of all NCHAR datatype (NCHAR, NVARCHAR2) columns/attributes.

A new createEnvironment() interface that takes the client characterset and client national characterset is now provided. This allows OCCI applications to set characterset information dynamically, independent of the NLS_LANG and NLS_CHAR initialization parameter.

Example 8-1 How to Use Globalization and Unicode Support

Environment *env = Environment:createEnvironment("JA16SJIS","UTF8");

This statement creates a OCCI Environment with JA16SJIS as the client characterset and UTF8 as the client national characterset.

Any valid Oracle characterset name (except 'AL16UTF16') can be passed to createEnvironment(). A OCCI specific string "OCCIUTF16" (in uppercase) can be passed to specify UTF16 as the characterset.

Environment *env = Environment::createEnvironment("OCCIUTF16","OCCIUTF16");
Environment *env = Environment::createEnvironment("US7ASCII", "OCCIUTF16");

Note:

If an application specifies "OCCIUTF16" as the client characterset (first argument), then the application should use only the UTF16 interfaces of OCCI. These interfaces take UString argument types

The charactersets in the OCCI Environment are client-side only. They indicate the charactersets the OCCI application uses to interact with Oracle. The database characterset and database national characterset are specified when the database is created. Oracle converts all data from the client characterset/national characterset to the database characterset/national characterset before the server processes the data.

Datatypes for Globalization and Unicode Support

The datatypes used for supporting globalization and use of unicode include UString Datatype, Multibyte and UTF16 data, and CLOB and NCLOB Datatypes.

UString Datatype

UString is a datatype that enables applications and the OCCI library to pass and receive Unicode data in UTF-16 encoding. UString is templated from the C++ STL basic_string with Oracle's utext datatype.

typedef basic_string<utext> UString;

Oracle's utext datatype is a 2 byte short datatype and represents Unicode characters in the UTF-16 encoding. A Unicode character's codepoint can be represented in 1 utext or 2 utexts (2 or 4 bytes). Characters from European and most Asian scripts are represented in a single utext. Supplementary characters defined in the Unicode 3.1 standard are represented with 2 utext elements.

In Microsoft Windows platforms, UString is equivalent to the C++ standard wstring datatype. This is because the wchar_t datatype is type defined to a 2 byte short in these platforms, which is same as Oracle's utext, allowing applications to use a wstring type variable where a UString would be normally required. Consequently, applications can also pass wide-character string literals, created by prefixing the literal with the letter 'L', to OCCI Unicode APIs.

Example 8-2 Using wstring Datatype

//bind Unicode data using wstring datatype
//binding the Euro symbol, UTF16 codepoint 0x20AC
wchar_t eurochars[] = {0x20AC,0x00};
wstring eurostr(eurochars);
stmt->setUString(1,eurostr);
 
//Call the Unicode version of createConnection by
//passing widechar literals
Connection *conn = env->createConnection(L"SCOTT",L"TIGER",L"");

OCCI applications should use the UString datatype for data in UTF16 characterset

Multibyte and UTF16 data

For data in multibyte charactersets like JA16SJIS and UTF8, applications should use the C++ string type. The existing OCCI APIs that take string arguments can handle data in any multibyte characterset. Due to the use of string type, OCCI supports only byte length semantics for multibyte characterset strings.

Example 8-3 Binding UTF8 Data Using the string Datatype

//bind UTF8 data
//binding the Euro symbol, UTF8 codepoint : 0xE282AC
char eurochars[] = {0xE2,0x82,0xAC,0x00};
string eurostr(eurochars)
stmt->setString(1,eurostr);//use the string interface

For Unicode data in the UTF16 characterset, the OCCI specific datatype: UString and the OCCI UTF16 interfaces should be used.

Example 8-4 Binding UTF16 Data Using the UString Datatype

//bind Unicode data using UString datatype
//binding the Euro symbol, UTF16 codepoint 0x20AC
utext eurochars[] = {0x20AC,0x00};
UString eurostr(eurochars);
stmt->setUString(1,eurostr);//use the UString interface

CLOB and NCLOB Datatypes

Oracle provides the CLOB and NCLOB datatypes for storing and processing large amounts of character data. CLOBs represent data in the database characterset and NCLOBs represent data in the database national characterset. CLOBs and NCLOBs can be used as column types in relational tables and as attributes in object types.

The OCCI Clob class is used to work with both CLOB and NCLOB datatypes. If the database type is NCLOB, then the Clob set CharSetForm() method should be called with OCCI_SQLCS_NCHAR before reading/writing from the LOB.

The OCCI Clob class has support for multibyte and UTF16 charactersets. By default, the Clob interfaces assume the data is encoded in the client-side characterset (for both CLOBs and NCLOBs). To specify a different characterset or to specify the client-side national characterset for a NCLOB, call the setCharSetId() or setCharSetIdUString() methods with the appropriate characterset. The OCCI specific string 'OCCIUTF16' can be passed to indicate UTF16 as the characterset.

Example 8-5 Using CLOB and NCLOB Datatypes

//client characterset - ZHT16BIG5, national characterset - UTF16
Environment *env = Environment::createEnvironment("ZHT16BIG5","OCCIUTF16");...
Clob nclobvar;
//for NCLOBs, need to call setCharSetForm method. 
nclobvar.setCharSetForm(OCCI_SQLCS_NCHAR);...
//if reading/writing data in UTF16 for this NCLOB, still need to 
//explicitly call setCharSetId
nclobvar.setCharSetId("OCCIUTF16")

To read or write data in multibyte charactersets, use the existing read and write interfaces that take a char buffer. New overloaded interfaces that take utext buffers for UTF16 data have been added to the Clob Class as read(), write() and writeChunk() methods. The arguments and return values for these methods are either bytes or characters, depending on the characterset of the LOB.

Objects and OTT Support

Multibyte and UTF16 charactersets are supported for handling character data in object attributes. All CHAR datatype (CHAR/VARCHAR2) attributes hold data in the client-side characterset, while all NCHAR datatype (NCHAR/NVARCHAR2) attributes hold data in the client-side national characterset. A member variable of UString datatype represents an attribute in UTF16 characterset.