17 High Availability

Computing environments configured to provide nearly full-time availability are known as high availability systems. Oracle has a number of products and features that provide high availability in cases of unplanned downtime or planned downtime.

This chapter includes the following topics:

Introduction to High Availability
Overview of Unplanned Downtime
Overview of Planned Downtime

Introduction to High Availability

Computing environments configured to provide nearly full-time availability are known as high availability systems. Such systems typically have redundant hardware and software that makes the system available despite failures. Well-designed high availability systems avoid having single points of failure.

Oracle has a number of products and features that provide high availability in cases of unplanned downtime or planned downtime.

Overview of Unplanned Downtime

Various things can cause unplanned downtime. Oracle offers the following features to maintain high availability during unplanned downtime:

Oracle Solutions to System Failures
Oracle Solutions to Data Failures
Oracle Solutions to Disasters
Overview of Oracle Data Guard

Oracle Solutions to System Failures

This section covers some Oracle solutions to system failures, including the following:

Overview of Fast-Start Fault Recovery
Overview of Real Application Clusters

Overview of Fast-Start Fault Recovery

Oracle Enterprise Edition includes fast-start fault recovery functionality to control instance recovery. This reduces the time required for cache recovery and makes the recovery bounded and predictable by limiting the number of dirty buffers and the number of redo records generated between the most recent redo record and the last checkpoint.

The foundation of fast-start recovery is the fast-start checkpointing architecture. Instead of conventional event-driven checkpointing (that is, checkpointing at log switches), which does bulk writes, fast-start checkpointing occurs incrementally. Each DBWn process periodically writes buffers to disk to advance the checkpoint position. The oldest modified blocks are written first to ensure that every write lets the checkpoint advance. Fast-start checkpointing eliminates bulk writes and the resultant I/O spikes that occur with conventional checkpointing.
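The oldest-first write order can be illustrated with a small conceptual model. This is only a sketch under simplifying assumptions, not Oracle's implementation: the class `BufferCache`, its methods, and the integer redo positions are all hypothetical names introduced for illustration.

```python
from collections import OrderedDict


class BufferCache:
    """Toy model of incremental checkpointing (hypothetical, not Oracle code)."""

    def __init__(self):
        # Dirty buffers keyed by block id, ordered by the redo position of
        # the first change that dirtied them (oldest first).
        self.dirty = OrderedDict()
        self.next_redo_pos = 0    # position the next redo record will get
        self.checkpoint_pos = 0   # instance recovery would start reading here

    def modify(self, block_id):
        """Record one change; a block keeps the position of its first change."""
        if block_id not in self.dirty:
            self.dirty[block_id] = self.next_redo_pos
        self.next_redo_pos += 1

    def write_oldest(self, n):
        """Write up to n of the oldest dirty buffers, then advance the checkpoint."""
        for _ in range(min(n, len(self.dirty))):
            self.dirty.popitem(last=False)  # oldest dirty buffer goes to disk
        if self.dirty:
            # The checkpoint may advance to the first change of the oldest
            # buffer that is still dirty.
            self.checkpoint_pos = next(iter(self.dirty.values()))
        else:
            self.checkpoint_pos = self.next_redo_pos

    def redo_to_apply(self):
        """Number of redo records a crash recovery would have to read."""
        return self.next_redo_pos - self.checkpoint_pos


cache = BufferCache()
for block in ["A", "B", "C", "A", "D"]:
    cache.modify(block)
backlog_before = cache.redo_to_apply()
cache.write_oldest(2)  # writes A and B, the two oldest dirty buffers
backlog_after = cache.redo_to_apply()
```

Because the oldest buffers are written first, even a small write shrinks the amount of redo that recovery would need to read. Writing the newest buffers instead would leave the checkpoint position unchanged.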

With fast-start fault recovery, the Oracle database is opened for access by applications without having to wait for the undo, or rollback, phase to be completed. The rollback of data locked by uncommitted transactions is done dynamically on an as-needed basis. If a user process encounters a row locked by a crashed transaction, then it rolls back just that row. The impact of rolling back the rows requested by a query is negligible.
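The as-needed rollback described above can be sketched as follows. This is a conceptual model only, assuming a per-row undo entry; `RecoveringTable` and its methods are hypothetical and do not correspond to any Oracle interface.

```python
class RecoveringTable:
    """Toy table that rolls back rows of crashed transactions on demand."""

    def __init__(self, rows):
        self.rows = dict(rows)  # row id -> current value
        self.undo = {}          # row id -> (transaction id, value before change)

    def update(self, txn_id, row_id, new_value):
        """Apply an uncommitted change, remembering the original value once."""
        self.undo.setdefault(row_id, (txn_id, self.rows[row_id]))
        self.rows[row_id] = new_value

    def read(self, row_id, crashed_txns):
        """Return the row, first undoing it if a crashed transaction holds it."""
        entry = self.undo.get(row_id)
        if entry is not None and entry[0] in crashed_txns:
            self.rows[row_id] = entry[1]  # roll back only this row
            del self.undo[row_id]
        return self.rows[row_id]


table = RecoveringTable({1: "a", 2: "b"})
table.update("T1", 1, "x")
table.update("T1", 2, "y")
value = table.read(1, crashed_txns={"T1"})  # row 1 is rolled back on access
```

In this model, row 2 stays in its uncommitted state until something touches it, which is why opening the database does not have to wait for the full rollback phase.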

Fast-start fault recovery is very fast, because undo data is stored in the database, not in the log files. Undoing a block does not require an expensive sequential scan of a log file. It is simply a matter of locating the right version of the data block within the database.

Fast-start recovery can greatly reduce mean time to recover (MTTR) with minimal effects on online application performance. Oracle continuously estimates the recovery time and automatically adjusts the checkpointing rate to meet the target recovery time.

See Also:

Oracle Database Performance Tuning Guide for information on fast-start fault recovery

Overview of Real Application Clusters

Real Application Clusters (RAC) databases are inherently high availability systems. The clusters that are typical of RAC environments can provide continuous service for both planned and unplanned outages. RAC builds higher levels of availability on top of the standard Oracle features. All single instance high availability features, such as fast-start recovery and online reorganizations, apply to RAC as well.

In addition to all the regular Oracle features, RAC exploits the redundancy provided by clustering to deliver availability with n-1 node failures in an n-node cluster. In other words, all users have access to all data as long as there is one available node in the cluster.

Oracle Solutions to Data Failures

This section covers some Oracle solutions to data failures, including the following:

Overview of Backup and Recovery Features for High Availability
Overview of Partitioning
Overview of Transparent Application Failover

Overview of Backup and Recovery Features for High Availability

In addition to fast-start fault recovery, which reduces mean time to recover, Oracle provides several solutions to protect against and recover from data and media failures. A system or network fault may prevent users from accessing data, but media failures without proper backups can lead to lost data that cannot be recovered. Oracle's backup and recovery solutions include the following:

Recovery Manager (RMAN) is Oracle's utility to manage the backup and recovery of the database. It determines the most efficient method of running the requested backup, restore, or recovery operation. RMAN and the server automatically identify modifications to the structure of the database and dynamically adjust the required operation to adapt to the changes. You have the option to specify the maximum disk space when restoring logs during media recovery, thus enabling an efficient space management during the recovery process.
Oracle Flashback Database lets you quickly recover an Oracle database to a previous time to correct problems caused by logical data corruptions or user errors.
Oracle Flashback Query lets you view data at a point in time in the past. This can be used to view and reconstruct lost data that was deleted or changed by accident. Developers can use this feature to build self-service error correction into their applications, empowering end-users to undo and correct their errors.
Backup information can be stored in an independent flash recovery area. This increases the resilience of the information, and allows easy querying of backup information. It also acts as a central repository for backup information for all databases across the enterprise, providing a single point of management.
When performing a point-in-time recovery, you can query the database without terminating recovery. This helps determine whether errors affect critical data or non-critical structures, such as indexes. Oracle also provides trial recovery, in which recovery continues but can be backed out if an error occurs. Trial recovery can also be used to "undo" recovery if point-in-time recovery has gone on for too long.
With Oracle's block-level media recovery, if only a single block is damaged, then only that block needs to be recovered. The rest of the file, and thus the table containing the block, remains online and accessible.
LogMiner lets a DBA find and correct unwanted changes. Its simple SQL interface allows searching by user, table, time, type of update, value in update, or any combination of these. LogMiner provides SQL statements needed to undo the erroneous operation. The GUI interface shows the change history. Damaged log files can be searched with the LogMiner utility, thus recovering some of the transactions recorded in the log files.

See Also:

Chapter 15, "Backup and Recovery" for information on backup and recovery solutions, including Oracle Flashback Database and Oracle Flashback Table
Oracle Database Backup and Recovery Basics for information on RMAN and backup and recovery solutions
Chapter 13, "Data Concurrency and Consistency" for information on Oracle Flashback Query

Overview of Partitioning

Partitioning addresses key issues in supporting very large tables and indexes by letting you decompose them into smaller and more manageable pieces called partitions. SQL queries and DML statements do not need to be modified in order to access partitioned tables. However, after partitions are defined, DDL statements can access and manipulate individual partitions rather than entire tables or indexes. This is how partitioning can simplify the manageability of large database objects. Also, partitioning is entirely transparent to applications.

Overview of Transparent Application Failover

Transparent Application Failover enables an application user to automatically reconnect to a database if the connection fails. Active transactions roll back, but the new database connection, made by way of a different node, is identical to the original. This is true regardless of how the connection fails.

With Transparent Application Failover, a client notices no loss of connection as long as there is one instance left serving the application. The database administrator controls which applications run on which instances and also creates a failover order for each application. This works best with Real Application Clusters (RAC): If one node dies, then you can quickly reconnect to another node in the cluster.

Elements Affected by Transparent Application Failover

During normal client/server database operations, the client maintains a connection to the database so the client and server can communicate. If the server fails, then so does the connection. The next time the client tries to use the connection, the client receives an error. At this point, the user must log in to the database again.

With Transparent Application Failover, however, Oracle automatically obtains a new connection to the database. This enables users to continue working as if the original connection had never failed.

There are several elements associated with active database connections. These include:

Client/server database connections
Users' database sessions running statements
Open cursors used for fetching
Active transactions
Server-side program variables

Transparent Application Failover can be used to restore client/server database connections, users' database sessions and optionally an active query. To restore other elements of an active database connection, such as active transactions and server-side package state, the application code must be capable of re-running statements that occurred after the last commit.
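The requirement that application code re-run statements issued after the last commit can be sketched as a replay loop. This is a minimal sketch under stated assumptions: `ConnectionLost`, `FakeConnection`, and `run_transaction` are hypothetical names, and the fake connection stands in for whatever driver the application uses. It does not show a real Oracle client API.

```python
class ConnectionLost(Exception):
    """Raised by the hypothetical driver when the connection fails."""


class FakeConnection:
    """Stand-in for a driver connection; optionally fails on one statement."""

    def __init__(self, committed, fail_on=None):
        self.committed = committed  # shared list that plays the database
        self.fail_on = fail_on
        self.pending = []           # uncommitted work, lost if the session dies

    def execute(self, statement):
        if statement == self.fail_on:
            raise ConnectionLost(statement)
        self.pending.append(statement)

    def commit(self):
        self.committed.extend(self.pending)
        self.pending = []


def run_transaction(connect, statements, max_attempts=3):
    """Run statements as one transaction, replaying all of them after a failure.

    Returns the number of attempts that were needed.
    """
    last_error = ConnectionLost("no attempt was made")
    for attempt in range(1, max_attempts + 1):
        conn = connect()  # after failover this is the restored connection
        try:
            for statement in statements:
                conn.execute(statement)
            conn.commit()
            return attempt
        except ConnectionLost as exc:
            last_error = exc  # uncommitted work was rolled back; start over
    raise last_error


committed = []
connections = iter([
    FakeConnection(committed, fail_on="credit account 2"),  # dies mid-transaction
    FakeConnection(committed),                              # healthy replacement
])
attempts = run_transaction(lambda: next(connections),
                           ["debit account 1", "credit account 2"])
```

The design choice here is to treat everything since the last commit as one replayable unit, so a failure in the middle never leaves half of the work applied.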

RAC High Availability Event Notification

OCI and JDBC (Thick) clients can register for RAC high availability event notification and take appropriate action when an event occurs. With this, you can improve connection failover response time and remove stale connections from connection pools and session pools. Reducing failure detection time allows Transparent Application Failover to react more quickly when failures do occur, benefiting client applications running during a node or instance failure.

Clients must connect to a database service that has been enabled for Oracle Streams Advanced Queuing high availability notifications. Database services may be modified with Enterprise Manager to support these notifications. Once enabled, clients can register a callback that is invoked whenever a high availability event occurs.
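The effect of such a callback on a connection pool can be sketched as follows. The event shape (a dictionary with `instance` and `status` keys) and the `ConnectionPool` class are assumptions made for illustration. They are not the OCI or JDBC notification interface.

```python
class ConnectionPool:
    """Toy pool that drops stale connections when told an instance is down."""

    def __init__(self):
        self.idle = []  # list of (instance name, connection) pairs

    def add(self, instance, connection):
        self.idle.append((instance, connection))

    def on_ha_event(self, event):
        """Callback invoked for each high availability event (hypothetical shape)."""
        if event["status"] == "down":
            # Remove connections to the failed instance immediately instead of
            # waiting for a client to discover that they are stale.
            self.idle = [(inst, conn) for inst, conn in self.idle
                         if inst != event["instance"]]

    def instances(self):
        return [inst for inst, _ in self.idle]


pool = ConnectionPool()
pool.add("inst1", object())
pool.add("inst2", object())
pool.add("inst1", object())
pool.on_ha_event({"instance": "inst1", "status": "down"})
remaining = pool.instances()
```

Cleaning the pool at notification time means the next caller is never handed a connection that is already known to be dead.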

Note:

With JDBC Thick clients, event notification is limited to connection pools.

See Also:

Oracle Call Interface Programmer's Guide

Oracle Solutions to Disasters

Oracle's primary solution to disasters is the Oracle Data Guard product.

Overview of Oracle Data Guard

Oracle Data Guard lets you maintain uptime automatically and transparently, despite failures and outages. Oracle Data Guard maintains up to nine standby databases, each of which is a real-time copy of the production database, to protect against all threats—corruptions, data failures, human errors, and disasters. If a failure occurs on the production (primary) database, then you can fail over to one of the standby databases to become the new primary database. In addition, planned downtime for maintenance can be reduced, because you can quickly and easily move (switch over) production processing from the current primary database to a standby database, and then back again.

Fast-start failover provides the ability to automatically, quickly, and reliably fail over to a designated, synchronized standby database in the event of loss of the primary database, without requiring that you perform complex manual steps to invoke the failover. This lets you maintain uptime transparently and increase the degree of high availability for system failures, data failures, and site outages, as well as the robustness of disaster recovery.
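The difference between a switchover and a failover comes down to what happens to the old primary. The following sketch models only the role changes. `DataGuardConfig` and the site names are hypothetical, and the nine-standby limit is the one stated in the text above.

```python
class DataGuardConfig:
    """Toy model of database roles in a Data Guard configuration."""

    MAX_STANDBYS = 9  # limit stated in this chapter

    def __init__(self, primary, standbys):
        if len(standbys) > self.MAX_STANDBYS:
            raise ValueError("at most nine standby databases are supported")
        self.primary = primary
        self.standbys = list(standbys)

    def switchover(self, target):
        """Planned role swap: the old primary stays on as a standby."""
        self.standbys.remove(target)
        self.standbys.append(self.primary)
        self.primary = target

    def failover(self, target):
        """Unplanned transition: the old primary is lost, not demoted."""
        self.standbys.remove(target)
        self.primary = target


config = DataGuardConfig("boston", ["chicago", "denver"])
config.switchover("chicago")  # maintenance window on boston
config.switchover("boston")   # and back again afterwards
config.failover("denver")     # boston is lost; denver takes over
```

After the failover in this example, the configuration has one fewer database, because the failed primary does not rejoin as a standby in this simplified model.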

Oracle Data Guard Configurations

An Oracle Data Guard configuration is a collection of loosely connected systems, consisting of a single primary database and up to nine standby databases that can include a mix of both physical and logical standby databases. The databases in a Data Guard configuration can be connected by a LAN in the same data center, or—for maximum disaster protection—geographically dispersed over a WAN and connected by Oracle Net Services.

A Data Guard configuration can be deployed for any database. This is possible because its use is transparent to applications; no application code changes are required to accommodate a standby database. Moreover, Data Guard lets you tune the configuration to balance data protection levels and application performance impact; you can configure the protection mode to maximize data protection, maximize availability, or maximize performance.

As application transactions make changes to the primary database, the changes are logged locally in redo logs. For physical standby databases, the changes are applied to each physical standby database that is running in managed recovery mode. For logical standby databases, the changes are applied using SQL regenerated from the archived redo logs.

Physical Standby Databases

A physical standby database is physically identical to the primary database. While the primary database is open and active, a physical standby database is either performing recovery (by applying logs), or open for reporting access. A physical standby database can be queried read only when not performing recovery while the production database continues to ship redo data to the physical standby site.

Physical standby on-disk database structures must be identical to the primary database on a block-for-block basis, because a recovery operation applies changes block-for-block using the physical rowid. The database schema, including indexes, must be the same, and the database cannot be opened (other than for read-only access). If opened for read/write access, the physical standby database will have different rowids, making continued recovery impossible.

Logical Standby Databases

A logical standby database takes standard Oracle archived redo logs, transforms the redo records they contain into SQL transactions, and then applies them to an open standby database. Although changes can be applied concurrently with end-user access, the tables being maintained through regenerated SQL transactions allow read-only access to users of the logical standby database. Because the database is open, it is physically different from the primary database. The database tables can have different indexes and physical characteristics from their primary database peers, but must maintain logical consistency from an application access perspective, to fulfill their role as a standby data source.

Oracle Data Guard Broker

Oracle Data Guard Broker automates complex creation and maintenance tasks and provides dramatically enhanced monitoring, alert, and control mechanisms. It uses background agent processes that are integrated with the Oracle database server and associated with each Data Guard site to provide a unified monitoring and management infrastructure for an entire Data Guard configuration. Two user interfaces are provided to interact with the Data Guard configuration, a command-line interface (DGMGRL) and a graphical user interface called Data Guard Manager.

Oracle Data Guard Manager, which is integrated with Oracle Enterprise Manager, provides wizards to help you easily create, manage, and monitor the configuration. This integration lets you take advantage of other Enterprise Manager features, such as the event service for alerts, the discovery service for easier setup, and the job service to ease maintenance.

Data Guard with RAC

RAC enables multiple independent servers that are linked by an interconnect to share access to an Oracle database, providing high availability, scalability, and redundancy during failures. RAC and Data Guard together provide the benefits of system-level, site-level, and data-level protection, resulting in high levels of availability and disaster recovery without loss of data:

RAC addresses system failures by providing rapid and automatic recovery from failures, such as node failures and instance crashes. It also provides increased scalability for applications.
Data Guard addresses site failures and data protection through transactionally consistent primary and standby databases that do not share disks, enabling recovery from site disasters and data corruption.

Many different architectures using RAC and Data Guard are possible depending on the use of local and remote sites and the use of nodes and a combination of logical and physical standby databases.

Oracle Solutions to Human Errors

This section covers some Oracle solutions to human errors, including the following:

Overview of Oracle Flashback Features
Overview of LogMiner
Overview of Security Features for High Availability

Overview of Oracle Flashback Features

If a major error occurs, such as a batch job being run twice in succession, the database administrator can request a Flashback operation that quickly recovers the entire database to a previous point in time, eliminating the need to restore backups and do a point-in-time recovery. In addition to Flashback operations at the database level, it is also possible to flash back an entire table. Similarly, the database can recover tables that have been inadvertently dropped by a user.

Oracle Flashback Database lets you quickly bring your database to a prior point in time by undoing all the changes that have taken place since that time. This operation is fast, because you do not need to restore the backups. This in turn results in much less downtime following data corruption or human error.
Oracle Flashback Table lets you quickly recover a table to a point in time in the past without restoring a backup.
Oracle Flashback Drop provides a way to restore accidentally dropped tables.
Oracle Flashback Query lets you view data at a point in time in the past. This can be used to view and reconstruct lost data that was deleted or changed by accident. Developers can use this feature to build self-service error correction into their applications, empowering end-users to undo and correct their errors.
Oracle Flashback Version Query uses undo data stored in the database to view the changes to one or more rows along with all the metadata of the changes.
Oracle Flashback Transaction Query lets you examine changes to the database at the transaction level. As a result, you can diagnose problems, perform analysis, and audit transactions.
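The idea of viewing a row as of an earlier time can be illustrated with a small version history. This is a conceptual sketch only. `VersionedRow` is a hypothetical class, timestamps are plain integers, and nothing here reflects how Oracle stores undo data.

```python
import bisect


class VersionedRow:
    """Toy row that keeps every past value so it can be read as of a time."""

    def __init__(self):
        self.times = []   # change timestamps, assumed to arrive in increasing order
        self.values = []  # value written at the matching timestamp

    def write(self, timestamp, value):
        self.times.append(timestamp)
        self.values.append(value)

    def as_of(self, timestamp):
        """Return the value visible at the given time, or None if none existed."""
        index = bisect.bisect_right(self.times, timestamp)
        return self.values[index - 1] if index else None

    def versions(self):
        """Return the full change history as (timestamp, value) pairs."""
        return list(zip(self.times, self.values))


row = VersionedRow()
row.write(10, "Chung")
row.write(20, "Chang")     # accidental change
before = row.as_of(15)     # value as it was before the mistake
row.write(30, before)      # self-service correction using the old value
```

Reading the earlier value and writing it back is the self-service correction pattern the list describes, and `versions()` loosely corresponds to looking at all changes to a row over time.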

See Also:

Chapter 15, "Backup and Recovery" for more information on Oracle Flashback Database and Oracle Flashback Table
Oracle Database Backup and Recovery Basics
Chapter 13, "Data Concurrency and Consistency" for information on Oracle Flashback Query

Overview of LogMiner

Oracle LogMiner lets you query redo log files through a SQL interface. Redo log files contain information about the history of activity on a database. Oracle Enterprise Manager includes the Oracle LogMiner Viewer graphical user interface (GUI).

All changes made to user data or to the database dictionary are recorded in the Oracle redo log files. Therefore, redo log files contain all the necessary information to perform recovery operations. Because redo log file data is often kept in archived files, the data is already available. To take full advantage of all the features LogMiner offers, you should enable supplemental logging.

See Also:

Chapter 11, "Oracle Utilities"

Overview of Security Features for High Availability

Oracle Internet Directory lets you manage the security attributes and privileges for users, including users authenticated by X.509 certificates. Oracle Internet Directory also enforces attribute-level access control. This enables read, write, or update privileges on specific attributes to be restricted to specific named users, such as an enterprise security administrator. Directory queries and responses can use SSL encryption for enhanced protection during authentication and other interactions. Other database security features including Virtual Private Database (VPD), Label Security, audit, and proxy authentication can be leveraged for these directory-based users when configured as enterprise users.

The Oracle Advanced Security User Migration Utility assists in migrating existing database users to Oracle Internet Directory. After a user is created in the directory, organizations can continue to build new applications in a Web environment and leverage the same user identity in Oracle Internet Directory for provisioning the user access to these applications.

See Also:

Chapter 20, "Database Security"

Overview of Planned Downtime

Oracle provides a number of capabilities to reduce or eliminate planned downtime. These include the following:

System Maintenance
Data Maintenance
Database Maintenance

System Maintenance

Oracle provides a high degree of self-management, automating routine DBA tasks and reducing the complexity of space, memory, and resource administration. Self-management features include the following:

Automatic undo management, so database administrators do not need to plan or tune the number and sizes of rollback segments or consider how to strategically assign transactions to a particular rollback segment.
Dynamic memory management to resize the Oracle shared memory components dynamically. Oracle also provides advisories to help administrators size the memory allocation for optimal database performance.
Oracle-managed files to automatically create and delete files as needed.
Free space management within a table with bitmaps. Additionally, Oracle provides automatic extension of data files, so the files can grow automatically based on the amount of data in the files.
Data Guard for hardware and operating system maintenance

Data Maintenance

Database administrators can perform a variety of online operations on table definitions, including online reorganization of heap-organized tables. This makes it possible to reorganize a table while users have full access to it.

This online architecture provides the following capabilities:

Any physical attribute of the table can be changed online. The table can be moved to a new location. The table can be partitioned. The table can be converted from one type of organization (such as heap-organized) to another (such as index-organized).
Many logical attributes can also be changed. Column names, types, and sizes can be changed. Columns can be added, deleted, or merged. One restriction is that the primary key of the table cannot be modified.
Online creation and rebuilding of secondary indexes on index-organized tables (IOTs). Secondary indexes support efficient use of block hints (physical guesses). Invalid physical guesses can be repaired online.
Indexes can be created online and analyzed at the same time.
The physical guess component of logical rowids stored in secondary indexes on IOTs can be fixed online. This allows online repair of invalid physical guesses.

Database Maintenance

Oracle provides technology to do maintenance of database software with little or no database downtime. Patches can be applied to Real Application Clusters instances one at a time, such that database service is always available.

A Real Application Clusters system can run in mixed mode, with some instances patched and others not yet patched, for an arbitrary period to test the patch in the production environment. When you are satisfied that the patch is successful, the patching procedure is repeated for the remaining nodes in the cluster. When all nodes in the cluster have been patched, the rolling patch upgrade is complete, and all nodes are running the same version of Oracle.
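The rolling procedure can be summarized as a short loop. This is a sketch of the control flow only. `rolling_patch`, the node names, and the version labels are hypothetical, and the `verify` callback stands in for whatever testing is done while the cluster runs in mixed mode.

```python
def rolling_patch(nodes, new_version, verify):
    """Patch one node, test in mixed mode, then patch the remaining nodes.

    nodes maps node name to installed version and is updated in place.
    Returns True if every node ended up patched, False if verification failed.
    """
    if not nodes:
        return True
    names = list(nodes)
    nodes[names[0]] = new_version  # the other nodes keep serving the database
    if not verify(nodes):          # the cluster is now running in mixed mode
        return False
    for name in names[1:]:
        nodes[name] = new_version  # still one node at a time
    return True


cluster = {"node1": "old", "node2": "old", "node3": "old"}
observed = []


def check(state):
    """Record which versions coexist during the mixed-mode test."""
    observed.append(sorted(set(state.values())))
    return True


completed = rolling_patch(cluster, "new", check)
```

If verification fails, the function stops with only the first node patched, which mirrors the point of testing before committing the rest of the cluster.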