Oracle® Data Mining Application Developer's Guide, 10g Release 2 (10.2) Part Number B14340-01 |
|
|
View PDF |
This chapter describes data requirements and options for Oracle Data Mining. This information applies to data sets used to build, test, and score models.
You should ensure that a data set is properly defined before applying transformations to optimize it for a particular model. Data transformation techniques are not addressed in this chapter.
See Also:
Oracle Data Mining Concepts for information about data transformations.
DBMS_DATA_MINING_TRANSFORM
in Oracle Database PL/SQL Packages and Types Reference for information about data transformations in PL/SQL.
"Preparing the Data" for information about data transformations in Java.
Oracle Database SQL Reference for information about Oracle schema objects and data types.
This chapter contains the following topics:
The input to ODM is a table or a view. The columns can have numeric or character data types: NUMBER
, FLOAT
, VARCHAR2
, or CHAR
.
Additionally, ODM supports columns of type DM_NESTED_CATEGORICALS
and DM_NESTED_NUMERICALS
. These are collection types that define nested tables.
The ODM collection types define tables of attribute name/value pairs. ODM data sets can include any number of these nested table columns in addition to scalar columns with built-in numeric or character data types. See "Nested Tables" for more information.
ODM uses features of Oracle Text to transform unstructured text columns to structured columns of type DM_NESTED_NUMERICALS
for mining. The ODM Java API provides the OraTextTransform
interface to manage the text transformation process for you. However, if you are using the PL/SQL API, you must use Oracle Text routines directly (See Chapter 5).
Structured text columns are supported by several ODM algorithms (Support Vector Machine for classification and regression, Non-Negative Matrix Factorization, Association, and k-Means clustering).
See Also:
"Using Text Transformation" for information on text transformation in the Java API
Oracle Data Mining Administrator's Guide for information on sample programs that illustrate text transformation and text mining
ODM interprets the columns of the input table as attributes for data mining. Attributes are the predictors or descriptors on which the model is based.
A model may additionally identify a case ID column, a target column, or both.
Case ID
A case ID column holds a unique identifier for each record (row) of data. The case ID must be specified at model build time for all algorithms except Decision Tree. If a case ID is present in a Decision Tree model, it is not considered a possible predictor.
In the PL/SQL API and Java APIs, the case ID must be specified at apply time for all algorithms. The SQL scoring functions do not use a case ID.
The case ID column can be of type VARCHAR2
, CHAR
, or NUMBER
, and its maximum length is 128 bytes.
Target
Predictive algorithms (Classification, Regression, and Attribute Importance) require that one column be designated as a target. The name of the target column is supplied as an argument when the model is created. The target column holds the predictions generated by the model. The target column must be of type VARCHAR2
, CHAR
, NUMBER
, or FLOAT
. SVM Regression supports only numeric targets. One-Class SVM does not use a target.
ODM interprets attributes as either categorical or numerical.
Categorical attributes are values, such as gender or job title, that belong to a category or domain. Values of a categorical attribute do not have a meaningful order. Categorical attributes have character data types.
Numerical attributes are values, such as age or income, that fall within a continuum. Numerical attributes represent interval data that has a measurable order. Numerical attributes have numeric data types.
If the column data type is incompatible with the attribute type, you must convert the data type. For example, an application might use postal codes as a categorical attribute, but the data is actually stored in a numeric column. In this case, you would use the TO_CHAR
function to convert the column to a character data type.
If your mining data includes DATE
and TIMESTAMP
columns, and you are not using Predictive Analytics, you must convert those columns to numeric or character data types. In most cases, these data types should be converted to NUMBER
, but you should evaluate each case individually. If, for example, the date serves as a timestamp indicating when a transaction occurred, converting the date to VARCHAR2
makes it categorical with unique values, one in each record. This kind of column is known as an identifier and is not useful in model building. However, if the date values are coarse and significantly fewer than the number of records (for example, they might indicate the week or month when an item was purchased), it may be useful to use character values.
You can convert dates to numbers by selecting a starting date and subtracting it from each date value. This process results in a NUMBER
column. Another approach would be to parse the date and distribute its components over several columns. This is the conversion method used by Predictive Analytics.
Predictive Analytics interprets DATE
data and all forms of TIMESTAMP
data, including TIMESTAMP WITH TIMEZONE
and TIMESTAMP WITH LOCAL TIMEZONE
, as a set of numerical attributes. For example, a column named TIMECOL
would be transformed into attributes for year, month, week, day of year, day of month, day of week, hour, and minute. Each attribute would be named TIMECOL_
x
, where x
is the suffix used to convert the date into a number. For example, the name of the year attribute would be TIMECOL_YYYY
.
The attributes resulting from DATE
and TIMESTAMP
data are visible in the results of an EXPLAIN
operation. They are not visible in the results of a PREDICT
operation.
The names of ODM attributes must be valid column names. Naming requirements for columns are the same as the naming requirements for Oracle schema objects.
Schema object names can be quoted or nonquoted identifiers from one to thirty bytes long. Nonquoted identifiers are not case sensitive; Oracle converts them to uppercase. Nonquoted identifiers can consist of alphanumeric characters and the underscore (_), dollar sign ($), and pound sign (#). The initial character must be alphabetic. Quoted identifiers are case sensitive and can contain most characters.
See Also:
Oracle Database SQL Reference for information on schema object naming requirements.ODM accepts data in single-record case format, where all the information (attributes) concerning an individual is contained in one row. Single-record case, also known as non-transactional format, is illustrated in the table in Example 2-1. This table contains descriptive information about customers. CUSTOMER_ID
is the case ID column.
Example 2-1 Non-Transactional Format
CUSTOMER_ID GENDER AGE MARITAL_STATUS ------------ ------ --- -------------- 1 Male 30 Married 2 Female 35 Single 3 Male 21 Single 4 Female 20 Single 5 Female 35 Married
Sometimes data is organized in multi-record case, also known as transactional, format. A typical example is market basket data. In transactional format, the data pertaining to an individual is distributed across multiple records. The table in Example 2-2 illustrates transactional format. This table contains information about products purchased by a group of customers on a given day. CUSTOMER_ID
is the case ID column.
Example 2-2 Transactional Format
CUST_ID PROD_ID PROD_NAME -------- ------- --------- 1 524 brand x icecream 1 530 brand y frozen dinners 1 109 brand z dog food 2 578 brand a orange juice 2 191 brand x frozen dinners
ODM does not support multi-record case format. However, there could be circumstances in which you want to construct a model using transactional data. For example, you might want to use transactional data like that in Example 2-2 to predict the products that each customer is likely to buy on his next visit to the store. Discount coupons for these or similar products could then be generated with the checkout receipt.
If you want to construct a model using transactional data, you must first convert the data to single-record case. You must do this by defining columns of nested tables using the ODM fixed collection types, DM_NESTED_NUMERICALS
and DM_NESTED_CATEGORICALS
. These types define collections of numerical attributes and categorical attributes respectively. The data type descriptions are shown as follows.
SQL> describe dm_nested_numerical Name Null? Type ----------------------------------------- -------- ---------------------------- ATTRIBUTE_NAME VARCHAR2(30) VALUE NUMBER SQL> describe dm_nested_numericals DM_NESTED_NUMERICALS TABLE OF DMSYS.DM_NESTED_NUMERICAL Name Null? Type ----------------------------------------- -------- ---------------------------- ATTRIBUTE_NAME VARCHAR2(30) VALUE NUMBER SQL> describe dm_nested_categorical Name Null? Type ----------------------------------------- -------- ---------------------------- ATTRIBUTE_NAME VARCHAR2(30) VALUE VARCHAR2(4000) SQL> describe dm_nested_categoricals DM_NESTED_CATEGORICALS TABLE OF DMSYS.DM_NESTED_CATEGORICAL Name Null? Type ----------------------------------------- -------- ---------------------------- ATTRIBUTE_NAME VARCHAR2(30) VALUE VARCHAR2(4000)
For a given case identifier, attribute names must be unique across all the collections and individual columns. The fixed collection types enforce this requirement. However, the attribute naming requirements, described in "Attribute Names", do not apply to the attribute_name column of a nested table.
The attributes in Example 2-2 could be stored in nested table columns, as illustrated in Example 2-3. The column PRODUCT_IDENTIFIERS
is of type DM_NESTED_NUMERICALS
, and the column PRODUCT_NAMES
is of type DM_NESTED_CATEGORICALS
.
Example 2-3 Nested Tables
CUST_ID | PRODUCT_IDENTIFIERS | PRODUCT_NAMES |
---|---|---|
attribute_name value | attribute_name value | |
1 | PROD_ID 524
|
PROD_NAME brand x ice cream
|
2 | PROD_ID 578
|
PROD_NAME brand a orange juice
|
You can create an object view that presents several sources of transactional data (implemented with nested table columns) as a single data set for data mining. See "Example: Multi-Record Collections With an Object View".
Apart from the benefit of providing all your mining attributes through a single row-source without impacting their physical data storage, the view acts as a join specification on the underlying tables that can be used by the server for efficiently accessing your data.
Note:
Oracle recommends that you perform any necessary data transformations on the base tables before building object views. In this way, all attributes are transformed in a similar way. In most cases, attributes in transactional format are of the same scale, and thus this approach works. Otherwise, you can split the data into sets of similar items and then transform them separately.See DBMS_DATA_MINING_TRANSFORM
in Oracle Database PL/SQL Packages and Types Reference for information about data transformations using PL/SQL. See "Preparing the Data" for information about data transformations using Java.
A real-world example of an analytical pipeline for brain tumor research illustrates multi-case collections with an object view. The underlying tables store gene expression data and clinical data about the patient.
The fact table, GENE_EXPRESSION_DATA
, stores gene expression data. It has the following columns.
case_ID NUMBER gene VARCHAR2(30) expr NUMBER
The dimension table, CLINICAL_DATA_TABLE
, stores clinical patient data. It has the following columns.
case_ID NUMBER name VARCHAR2(30) type VARCHAR2(30) subtype VARCHAR2(30) gender CHAR(1) age NUMBER status VARCHAR2(30)
In this example, we want to create a model that predicts status based on gender, age, and gene expression. The build data for the model is an object view that uses columns of clinical patient data and a nested column of gene expression data. The view will have the following columns.
case_id NUMBER gender CHAR(1) age NUMBER gene_expr DM_NESTED_NUMERICALS status VARCHAR2(30)
The following statement constructs the object view gene_expr_build
, which can be used as build data for the model.
CREATE OR REPLACE VIEW gene_expr_build AS SELECT C.case_id, C.gender, C.age, CAST(MULTISET( SELECT gene, expr FROM gene_expression_data WHERE case_id = C.case_id) AS DM_NESTED_NUMERICALS ) gene_expr, C.status FROM clinical_data_table C
If there are a few hundred mining attributes and your application requires the attributes to be represented as columns in the same row of the table, data storage must be carefully designed.
For a table with several columns, the key question to consider is the (average) row length, not the number of columns. Having more than 255 columns in a table built with a smaller block size typically results in intrablock chaining.
Oracle stores multiple row pieces in the same block, but the overhead to maintain the column information is minimal as long as all row pieces fit in a single data block. If the rows don't fit in a single data block, you may consider using a larger database block size (or use multiple block sizes in the same database).
See Also:
Oracle Database Performance Tuning Guide for more details.