Oracle Intelligent Agent User's Guide Release 9.0.2 Part Number A95412-01
This chapter covers generic troubleshooting strategies in the event your Intelligent Agent does not function properly. The following topics are discussed:
Under most circumstances, the Intelligent Agent itself requires very little in the way of configuration. In order to function properly, however, the Agent must be able to communicate with the managing host and managed services. If you are familiar with Oracle and your operating system, using the following abbreviated checklists will likely solve problems that can interfere with Agent operation.
The following checklists cover the areas most likely to affect Agent operation. They are divided according to the two most common platforms on which the Agent is run: Windows NT and UNIX. The checklists are abbreviated and assume knowledge of Oracle, the operating system, and related communication protocols. Specific troubleshooting procedures are covered in detail later in this chapter.
If you are running an Agent on a Windows NT system, use the following checklist.
ORACLE_HOME\network\admin
directory, and services.ora is in the ORACLE_HOME\network\agent
directory.
Compare the services listed with the services which are available on the machine. Please refer to Appendix A, "Agent Configuration Files" for valid sample files.
If services are missing, check the following files for inconsistency or corruption:
The Agent is a service and runs by default as SYSTEM. It also needs DLLs from the ORACLE_HOME/BIN directory. If you need mapped drives in your path, you MUST NOT set them in the SYSTEM path.
To set your own path:
dbsnmp.trace_level=admin (or 16 if you want maximum information)
dbsnmp.trace_directory=<any directory in which the Oracle user has write privileges>
dbsnmp.trace_file=<name of the trace output file>
oracle_home/network/log
directory.
DBSNMP.LOG should show general Agent problems.
DBSNMP.NOHUP should show any errors related to the Agent's "watchdog" dbsnmpwd process.
DBSNMPCONFIG.LOG should show problems with auto-discovery.
If you are running an Agent on a UNIX system, use the following checklist.
agentctl status
Alternatively, you can check to see if the Intelligent Agent is running by entering the following command:
ps -eaf | grep dbsnmp
If your Agent is running, you should see something similar to the following:
DBSNMP for Solaris: Version 9.0.0.0.0 - Production on 04-NOV-01 18:44:15
(c) Copyright 2001 Oracle Corporation. All rights reserved.
The db subagent is already running.
These checks should show that a "dbsnmp" process and/or the "dbsnmpwd" watchdog script is running.
ORACLE_HOME/network/log/dbsnmp*.log
file for errors on UNIX. (nmiconf.log for discovery).
ORACLE_HOME/network/log
directory
Compare the services listed with the services which are available on the machine. Please refer to Appendix A, "Agent Configuration Files" for valid sample files.
If services are missing, check the following files for inconsistency or corruption:
If you are trying to do backups, you must run backupts.sql with the dbsnmp/dbsnmp account.
If after going through the quick checks your Intelligent Agent still is not functioning correctly, use the following section to cover other areas of Agent operation that are less probable causes of Agent operating problems. In addition, many of the steps in the checklists are covered in greater detail for those users who may be less familiar with Oracle and/or the operating system on which the Agent is running. The following questions are covered in this section:
One of the most common problems that prevents the Agent from starting is TCP/IP configuration. To check whether your TCP/IP setup is configured correctly, issue the following commands at the command line:
telnet <hostname>
If these files have never been used, only sample files will exist in the directory. Either rename or copy the .sam files to just the file name with no extension.
(UNIX) Log in as root and edit the /etc/hosts file.
Example: (Windows NT)
(Replace the information in brackets with the actual host information for that system.)
HOSTS file:
<122.111.111.111> <hostname>
LMHOSTS file:
<122.111.111.111> <netbios name or hostname> #PRE
Note: You can also verify this information through the Windows NT Control Panel -> Network property sheet.
Note: The *.q files contain information about current jobs and events. Do not delete these files without first removing all jobs and events registered against this Agent.
Before Release 8.0.4 of the Agent, the NT Agent required the DNS Hostname and the Computer Name to be identical. These parameters can be checked/changed from the following Windows NT Control Panel property sheets.
To verify the computer name:
To verify the DNS Name:
In addition to proper network configuration, which allows nodes in your network to communicate, components of your Oracle environment must also be able to communicate with each other. Oracle Net provides the session and data communication medium between client machines and Oracle servers, or between Oracle servers. For this reason, proper Oracle Net configuration is a prerequisite for Agent communication. This section covers the most common problems that can occur when Agent communication fails.
Oracle Net configuration files are found in ORACLE_HOME\network\admin or the directory specified by TNS_ADMIN (Windows NT), or in $ORACLE_HOME/network/admin (UNIX).
Primary configuration files are:
See Appendix A, "Agent Configuration Files" for information and examples of the above files.
TNS_ADMIN variable usage during Agent Discovery
All versions of the Unix discovery script allow the use of the TNS_ADMIN variable to locate input files (listener.ora and tnsnames.ora). Only Agent versions 7.3.4 and above correctly write the output files (snmp_ro.ora and snmp_rw.ora) into TNS_ADMIN, if set.
Beginning with version 8.0.5, the discovery script also reads the TNS_ADMIN value from the NT Registry.
The Agent also uses the TNS alias information found in the listener.ora file. The Agent does so even within an Oracle Names environment. This behavior is intentional, since an Oracle Names server may be temporarily unavailable and the Agent needs to be able to resolve names at all times. Check the following to make sure the local translation of the TNS alias takes place:
Do not activate the listener on port 1748, since the Agent listens on this port. (This is the reason you can use TNSPING against the Agent; TNSPING cannot differentiate between a listener and an Agent.)
The Agent requires IPC entries and TNS alias definitions on the server, in addition to alias definitions from the Console, to perform alias translations. Correct IPC entries and TNS alias definitions are essential for correct Agent/Console (V1) or Agent/Management Server (V2) communications.
If your Oracle Net configuration is correct and you are still unable to contact the Agent, the next step is to determine whether services in your Oracle Net network can be reached. You can use the TNSPING utility on each database you want to access by entering the following at the command prompt:
tnsping <network service name>
If you can connect successfully from a client to a server (or from a server to a server) using TNSPING, the command will return an estimate of the round trip time (in milliseconds) it takes to reach the Oracle Net service. This indicates Oracle Net is functioning properly.
Next, add the following alias (Agent debug entry) to the Console's tnsnames.ora file:
agent_<sid>.world =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS =
        (COMMUNITY = TCP.world)
        (PROTOCOL = TCP)
        (Host = <your-agent-hostname>)
        (Port = 1748)
      )
    )
  )
Then ping the Agent from the OEM console using:
tnsping agent_<sid>
or
tnsping80 agent_<sid>
If the TNSPING command does not work, add the above alias to the Agent machine's tnsnames.ora file and try using TNSPING from the machine on which the Agent resides. Every Agent must be TNSPING-able using this alias.
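When several Agents need a debug alias, the entry can be generated rather than typed by hand. The following is a minimal sketch in Python, not part of the Oracle tooling: the function name agent_debug_alias and its parameters are hypothetical, and the layout simply mirrors the alias shown above, with port 1748 and the "world" domain as defaults.

```python
def agent_debug_alias(sid, host, port=1748, domain="world"):
    """Return a tnsnames.ora debug alias for an Agent.

    The layout copies the alias shown in this chapter; 1748 is the
    Agent port named there. This helper itself is illustrative only.
    """
    return (
        f"agent_{sid}.{domain} =\n"
        "  (DESCRIPTION =\n"
        "    (ADDRESS_LIST =\n"
        "      (ADDRESS =\n"
        f"        (COMMUNITY = TCP.{domain})\n"
        "        (PROTOCOL = TCP)\n"
        f"        (Host = {host})\n"
        f"        (Port = {port})\n"
        "      )\n"
        "    )\n"
        "  )\n"
    )


if __name__ == "__main__":
    # "orcl" and "machine-pc" are placeholder values.
    print(agent_debug_alias("orcl", "machine-pc"))
```

The printed text can be appended to the tnsnames.ora file on the Console or Agent machine and then tested with TNSPING as described above.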
To check whether the Agent process is running, issue the following command:
agentctl status
If the Agent did not start up, use any of the hints listed in the following table:
If you still do not know why the Agent did not start, turn on tracing. For more information on setting up Agent tracing, see "Tracing the 9i Agent".
For both UNIX and Windows NT systems check:
$ORACLE_HOME/network/log/dbsnmp.nohup
To test whether an Agent can connect to the database(s) it monitors on a given node, try connecting to each database with the following connect string:
dbsnmp/dbsnmp@address_list
You must perform this test on the node where the Agent resides.
Note: Agents prior to 7.3.3 maintain two permanent connections to their local databases. Post 7.3.3 Agents maintain only one permanent connection.
To verify whether the Agent has the correct user permissions, see Installing the Intelligent Agent on page 2-2 .
An OS user needs to be specified for the node and must have the following permissions:
(Windows NT) Check the NT EVENT VIEWER -> APPLICATIONS -> LOG for any errors starting the DBSNMP process.
(Windows NT and UNIX) Check the $ORACLE_HOME/network/log/nmiconf.log file for discovery errors.
For both UNIX and Windows NT systems check the following file for additional errors:
$ORACLE_HOME/network/log/dbsnmp.nohup
Most likely the job does actually run, but the Agent is unable to contact the Console to send back notifications. Verify that hostname resolution can occur. Verify that the IP and hostname of the Windows NT machine running the console is in the /etc/hosts file on the Unix box or the hostname can be resolved via DNS/NIS. Retry the job.
To test the TCP/IP resolution, perform the following tests from a command prompt:
ping <hostname>
ping <IPaddress>
If the server is running telnet or ftp services (UNIX):
telnet <hostname>
ftp <hostname>
Since PING uses IP and not TCP, it is a good way of determining if the problem is in the packet routing.
To determine if the problem is actually with TCP, use the telnet or ftp utilities.
Be sure the name and IP address of the Enterprise Manager Console machine are in the /etc/hosts file on the Sun server. Otherwise, the Agent cannot return messages to the console because it cannot resolve the name of the machine to an IP address.
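The hosts-file check can also be scripted. The following is a minimal sketch in Python, not an Oracle utility: the function names are hypothetical, the sample host name and address are placeholders, and it assumes the usual hosts-file layout (an IP address followed by one or more names, with "#" starting a comment) and that the first matching entry wins.

```python
def parse_hosts(text):
    """Map each host name or alias in hosts-file text to its IP address."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        for name in names:
            # Assumption: the first entry for a name takes precedence.
            table.setdefault(name.lower(), ip)
    return table


def console_is_listed(hosts_text, console_name, console_ip):
    """Return True if console_name maps to console_ip in the hosts text."""
    return parse_hosts(hosts_text).get(console_name.lower()) == console_ip


if __name__ == "__main__":
    sample = "127.0.0.1 localhost\n192.0.2.10 oemconsole  # console machine\n"
    print(console_is_listed(sample, "oemconsole", "192.0.2.10"))
```

On the server, the text would typically be read from /etc/hosts. A name that is resolved only through DNS or NIS does not appear in that file, so a negative result is not conclusive on its own.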
The default listening address (TNS format) is:
LISTENING ADDRESS = (ADDRESS=(PROTOCOL= TCP)(Host=machine_name)(Port=7770))
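Whether that address accepts TCP connections from the Agent machine can be checked with a short script. The following is a minimal sketch in Python, not part of the Agent or Console: the function name tcp_reachable is hypothetical, "machine_name" is the placeholder used above, and port 7770 is taken from the default address above. A successful connection only shows that something is listening on the port; it does not prove that the listener is the Console.

```python
import socket


def tcp_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # Replace "machine_name" with the Console host; 7770 is the default port above.
    print(tcp_reachable("machine_name", 7770))
```

Because name resolution happens inside the connection attempt, a False result can mean either that the host name cannot be resolved or that the port is not reachable. The ping and telnet tests above help separate the two cases.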
If a job stays in the scheduled status, repeatedly delete it using the DEL key, then restart the job. Sometimes it takes several submissions before the job starts. A delay of up to a minute before a job starts is common, especially the first time an old Agent (7.3.2) tries to sync with the OEM console.
The following error messages and resolutions are categorized by operating system. Situations that apply to all systems are listed under "Generic Agent."
Copy the snmp.address.<host_name> parameter from your $ORACLE_HOME\network\admin\snmp_ro.ora file. Paste this address and parameter into your $ORACLE_HOME\network\admin\snmp_rw.ora file. In snmp_rw.ora, reduce the size of this connect string by removing the address entries for IPC. (NMP and SPX may also be removed.)
Shutdown/restart the Agent. See examples below.
Note: The parameter snmp.address is no longer found in snmp_ro.ora starting with the 7.3.4/8.0.3 Agents. Therefore, you will have to use this example to add a new variable to your snmp_rw.ora.
EXAMPLES:
Entry to be copied out of snmp_ro.ora:
snmp.address.ORCL_MACHINE-PC = (DESCRIPTION=(ADDRESS_LIST =(ADDRESS=(PROTOCOL=IPC)(KEY=oracle.world))(ADDRESS=(PROTOCOL=IPC)(KEY=ORCL))(ADDRESS=(COMMUNITY= TCP.world)(Host=machine-pc) (PROTOCOL=TCP)(Port=1521))(ADDRESS=(COMMUNITY=TCP.world)(Host=machine-pc) (PROTOCOL=TCP)(Port= 1526)))(CONNECT_DATA=(SID=ORCL)(SERVER=DEDICATED)))
Modified entry in snmp_rw.ora:
snmp.address.ORCL_machine-PC = (DESCRIPTION=(ADDRESS_LIST =(ADDRESS=(COMMUNITY=TCP.world)(Host = machine-pc)(PROTOCOL= TCP)(Port= 1521))(ADDRESS=(COMMUNITY= TCP.world)(Host = machine-pc)(PROTOCOL= TCP)(Port=1526)))(CONNECT_DATA=(SID=ORCL)(SERVER=DEDICATED)))
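The edit shown in the two entries above (removing the IPC address entries from the connect string) can be done mechanically. The following is a minimal sketch in Python, not an Oracle tool: the function names are hypothetical, and it assumes the entry has balanced parentheses and that each address to drop is an (ADDRESS=...) group containing PROTOCOL=IPC.

```python
import re

# Matches the start of an ADDRESS group, but not ADDRESS_LIST.
ADDRESS_START = re.compile(r"\(\s*ADDRESS\s*=", re.IGNORECASE)
IPC_PROTOCOL = re.compile(r"PROTOCOL\s*=\s*IPC", re.IGNORECASE)


def matching_paren(text, start):
    """Return the index of the ')' that closes the '(' at text[start]."""
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "(":
            depth += 1
        elif text[i] == ")":
            depth -= 1
            if depth == 0:
                return i
    raise ValueError("unbalanced parentheses in entry")


def remove_ipc_addresses(entry):
    """Return the entry with every (ADDRESS=...) group that uses IPC removed."""
    out = []
    pos = 0
    while True:
        match = ADDRESS_START.search(entry, pos)
        if match is None:
            out.append(entry[pos:])
            break
        end = matching_paren(entry, match.start())
        out.append(entry[pos:match.start()])
        group = entry[match.start():end + 1]
        if not IPC_PROTOCOL.search(group):
            out.append(group)  # keep TCP (and other non-IPC) addresses
        pos = end + 1
    return "".join(out)
```

Applied to the snmp_ro.ora entry above, the function leaves the two TCP addresses and the CONNECT_DATA section in place. Review the result before pasting it into snmp_rw.ora.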
This is actually an Oracle Net Listener error.
The following is documented in the 8.0.3.0.0 Intel NT release notes for the Oracle Net Listener. When a client connects to an Oracle8 server in dedicated server mode, WINSOCK2 Shared Sockets feature is used so that the client connection is routed from the listener to the database server. This feature improves the connection time, because the client does not need to close the socket connection with the listener and establish a new connection with the database server.
With the use of Shared Sockets, threads also use the same port as the listener. If you shut down the listener and try to start it up again for the same port, the listener does not start up if the port is in use due to any open connections with the database. Ensure that no client is connected to the database before starting up the listener. Note that if you are using a listener with a different port number you are able to start it up.
See Oracle Networking Products Getting Started for Windows Platforms for more information about the listener.ora file and the LSNRCTL80 utility. Oracle Corporation attempted to overcome the restriction by using the WINSOCK2 option to allow the re-use of a port, but the option does not work reliably. Oracle Corporation is currently working with Microsoft Corporation to resolve this issue.
For additional information about the reload command, see the Oracle Net Administrator's Guide.
While submitting a job, validation fails with "failed to find address for Agent_node", followed by "VOC-04816 Invalid Destination". This might also be caused by an invalid address in the tnsnames.ora file located on the console.
Upgrade your Agent to 7.3.3 or later.
Verify that your SQL*Net configuration files are correct.
In order for the Agent to execute jobs on a managed node, the following conditions must be met:
This usually happens if you have a database prior to 7.3.3 on the machine. From V7.3.3 onwards, a script called CATSNMP.SQL is included in the CATALOG.SQL dictionary script. This script is responsible for creating the DBSNMP user the Agent needs to connect. Older databases did not include this script.
Verify if the user 'DBSNMP' exists. If not, run the catsnmp.sql script.
This message comes from the discovery script, nmiconf.tcl. Make sure you have $ORACLE_HOME environment variable set to the ORACLE_HOME of the Agent and re-start the Agent.
If you have more than one database on a single node, then you need to make sure that each instance has a unique GLOBAL_DBNAME in the listener.ora. You may have to define this manually in the listener.ora.
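Duplicate GLOBAL_DBNAME values can be found by scanning the listener.ora text. The following is a minimal sketch in Python, not an Oracle utility: the function name is hypothetical, and it assumes that names are compared without regard to case and that each value ends at whitespace or a parenthesis.

```python
import re
from collections import Counter

# Captures the value that follows GLOBAL_DBNAME = up to whitespace or a parenthesis.
GLOBAL_DBNAME = re.compile(r"GLOBAL_DBNAME\s*=\s*([^\s()]+)", re.IGNORECASE)


def duplicate_global_dbnames(listener_ora_text):
    """Return the GLOBAL_DBNAME values that occur more than once, sorted."""
    names = [name.lower() for name in GLOBAL_DBNAME.findall(listener_ora_text)]
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)


if __name__ == "__main__":
    # Hypothetical SID_LIST fragment with a repeated name.
    sample = """
    SID_LIST_LISTENER =
      (SID_LIST =
        (SID_DESC = (GLOBAL_DBNAME = prod.world)(SID_NAME = PROD))
        (SID_DESC = (GLOBAL_DBNAME = prod.world)(SID_NAME = TEST))
      )
    """
    print(duplicate_global_dbnames(sample))
```

An empty result means every GLOBAL_DBNAME found in the text is unique. SID_DESC entries that have no GLOBAL_DBNAME at all are not reported and still have to be checked by hand.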
This error can occur if the Agent cannot write to $ORACLE_HOME\network\admin. Refer to $ORACLE_HOME\network\log\nmiconf.log for errors. For more information on Agent startup problems, see "Did the Agent startup successfully?".
Check the services.ora file to determine which services have been discovered.
All the services the Agent finds on a machine must be defined in the relevant SQL*Net/Oracle Net configuration files. If the services are not defined, service discovery will fail and, in the worst case, the Agent will hang or return errors.
Windows NT: Beginning with version 8.0.4, the Agent searches for service names that begin with 'OracleService' or 'OracleService<SID>'. Every entry beginning with 'OracleService' is considered to be a database running on this machine. Every SID encountered by the Agent must be defined in the relevant SQL*Net/Oracle Net files.
UNIX: The oratab file is used to determine which SIDs are present. For 7.3.3 Agents and earlier, discovery fails if it encounters a SID that is not accurate (like in a Developer 2000 environment). To work around this problem, the environment variable $ORATAB can be used to access an alternate oratab file which contains only the databases you wish the Agent to see.
For the remaining databases, check the oratab file, and the SQL*Net/Oracle Net files to see if these files exist and that all definitions are present. Make sure that all of the databases are listed in the listener.ora file. For more information, see "Are the Oracle Net configuration files correct?" and "Is Oracle Net functioning properly?" .
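An alternate oratab file for the $ORATAB workaround can be produced by filtering the existing one. The following is a minimal sketch in Python, not an Oracle utility: the function name is hypothetical, the sample SIDs and paths are placeholders, and it assumes the usual oratab layout of SID:ORACLE_HOME:Y|N with "#" starting a comment line.

```python
def filter_oratab(oratab_text, wanted_sids):
    """Return oratab text that keeps only the entries for wanted_sids."""
    wanted = set(wanted_sids)
    kept = []
    for line in oratab_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        sid = stripped.split(":", 1)[0]
        if sid in wanted:
            kept.append(stripped)
    return "\n".join(kept) + ("\n" if kept else "")


if __name__ == "__main__":
    sample = (
        "# sample oratab (hypothetical paths)\n"
        "ORCL:/u01/app/oracle/product/9.0.1:Y\n"
        "DEV2K:/u01/app/dev2000:N\n"
    )
    print(filter_oratab(sample, ["ORCL"]), end="")
```

Write the result to a separate file and point $ORATAB at it before starting the Agent; the original oratab file is left untouched.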
This error is usually seen when the services on the console and the services discovered by the Agent are out of sync. For example, if you have an event registered against TESTDB and someone changes the name of the database to PRODDB, the Agent and Console are out of sync.
To fix this, start by removing all job and event registrations from this service and dropping the node where the services exist from the console. Rediscover the node from the console using the auto-discovery wizard.
NOTE: With 7.3.2, aliases are case sensitive.
If you have an NT Agent, please refer to 'Invalid service name' while registering a job or event.
This indicates a problem with the TCP/IP layer. The most obvious cause is that the IP address and the hostname do not reference the same physical machine.
Verify that TCP/IP is configured and running correctly. (See "Is TCP/IP Installed and Running Correctly".)
You may receive this error while executing a TCL script using the oratcl verb oralogon through the Software Developer's Kit. "Oralogin failed in orlon" means that the connect string is either wrong or for some reason, the account used cannot logon to the database.
If you see an OS error when starting the Agent, check to see whether it is an actual Agent error as described in snmimsg.mc. Due to one of the Windows APIs not working as documented, the Agent fails to print out the real cause of the error.
Use the Event Viewer in the Administrative Tools group of Windows NT. You should find the true cause of the problem documented. The source for the Agent errors is listed under the service name "dbsnmp". Highlight the most recent dbsnmp entry in the list. Double-click the event to get the actual results.
There are in fact two hostname definitions on NT. One is the NetBios name, used for NT's internal Named Pipes protocol, which is always installed. The other is the TCP/IP hostname, which is only configurable when you install TCP/IP on NT.
To find the NT NetBios hostname:
To find the TCP/IP hostname:
On an NT server, you can 'ping' the two names, even if they are configured differently. Other clients, however, can only 'ping' real TCP/IP hostnames. If the Agent is using local IPC connections, it uses Named Pipes and therefore the NetBios name, while all external connections use the TCP/IP name.
A mismatch in these names leads to 'unable to contact Agent', or forever pending jobs in the console. Therefore, make sure that the NetBios and the TCP/IP hostname are identical.
The Windows NT user that you created for the Agent (see Agent Configuration, Configuration Guide) needs read/write permissions to the $ORACLE_HOME\network\agent directory (and the TEMP directory, for some applications) and read permissions to the SYSTEM32 directory.
Verify that the NT user has these permissions.
This problem has been fixed for Agent versions 7.3.4 and higher. For Agent versions 7.3.3 and lower, the following workaround can be used.
Check the listener.ora file, and make sure that no $ORACLE_HOME parameter is specified in the SID_LIST section. Specifying an $ORACLE_HOME in the SID_LIST section prevents the Agent from finding the requisite files for service discovery.
If you have an 8.0.4 Agent and a default domain other than ".world", you may experience this problem. The Agent tries to append ".world" to the database name during discovery. For example, if your default domain is nl.oracle.com and you define your GLOBAL_DBNAME = database.nl.oracle.com, the Agent defines the database name to be database.nl.oracle.com.world. This problem only occurs when the Agent and Console reside on the same machine (they share the same configuration files).
The workaround is to append ".world" to all services that do not currently have a specified domain.
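The workaround amounts to a simple naming rule, sketched below in Python. This is an illustration only, not Agent code: the function name is hypothetical, and it assumes that a service name containing no period has no domain specified.

```python
def qualify_service_name(name, domain="world"):
    """Append the domain to a service name that has no domain of its own."""
    if "." in name:
        return name  # already carries a domain; leave unchanged
    return f"{name}.{domain}"


if __name__ == "__main__":
    # Placeholder service names.
    for service in ("orcl", "database.nl.oracle.com"):
        print(service, "->", qualify_service_name(service))
```

Only names without a domain are changed, so an entry such as database.nl.oracle.com is left as it is.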
First check that all of the SQL*Net files are present and correctly defined. You can then debug discovery by editing your oratab file so that it contains only a valid SID with a listener running. After you get this working, you can add the remaining entries to the oratab file one at a time to see which entry is causing the problem.
Check the $ORACLE_HOME/network/log/nmiconf.log files for errors.
There are two possible causes for this error:
Only have one Agent on a machine.
To confirm that the port is being used by another process:
netstat -a | grep 1748
^---- this is port #
If any line of the output ends in "LISTENING", the port is in use.
Then do this.
This will re-start the Agent and remove all of the job and event queues it was using in the past.
If all else fails, re-booting the machine will free up the port.
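The port check described above can also be made without netstat by trying to bind the port. The following is a minimal sketch in Python, not an Oracle utility: the function name port_in_use is hypothetical, and 1748 is the Agent port named in this chapter. A failed bind is treated as "in use", which is an approximation, since a bind can also fail for other reasons such as insufficient privileges.

```python
import socket


def port_in_use(port, host=""):
    """Return True if binding a TCP socket to (host, port) fails.

    An empty host means all local interfaces.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return True  # most commonly: address already in use
        return False


if __name__ == "__main__":
    print("port 1748 in use:", port_in_use(1748))
```

Run the check while the Agent is stopped. If it still reports the port as in use, another process is holding it.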
This message indicates that the SNMP Master Agent (the process on UNIX that controls the SNMP protocol) could not be contacted. By default the Agent listens and works over SQL*Net, but the Agent can also work over SNMP on UNIX systems.
This message can safely be ignored unless you are trying to communicate with a Master Agent.
Events registered with the Agent for monitoring a "seed" database of version 9.0.0.0 will not work because, by default, the Agent's database account "dbsnmp" is locked when the seed database is created. A "seed" database is a sample database that is created when the user performs a "typical" Oracle Server installation.
Under these conditions, an Enterprise Manager database up-down event will always indicate that the seed database is down. The Agent's log file dbsnmp.log will contain an NMS-00207 error message indicating that the dbsnmp user account for the seed database is locked.
To resolve this problem, you must log in to the seed database and perform the following:
ALTER USER dbsnmp ACCOUNT UNLOCK;
ALTER USER dbsnmp IDENTIFIED BY <password>;
SNMP.CONNECT.<service_name>.PASSWORD=<password>
where service_name is the name of the seed database as discovered by the Agent in snmp_ro.ora/snmp_rw.ora.
Run the catsnmp.sql script for that database with either the SYS or INTERNAL accounts.
The 'dbsnmp' user could not be located.
Run the catsnmp.sql script for that database with either the SYS or INTERNAL accounts.
This happens if there are mismatches between the IDs in the '*.q' files in the $ORACLE_HOME/network/agent directory. Delete all the '*.q' files in the $ORACLE_HOME/network/agent directory. Rebuild your repository. Restart the Agent.
Beginning with 7.3.3, the Agent reads information from the snmp_ro.ora and snmp_rw.ora files in the $ORACLE_HOME\network\admin directory.
Example of modifications of the snmp_rw.ora file:
DBSNMP.TRACE_LEVEL = (OFF | USER | ADMIN | 16 )
The DBSNMP.TRACE_LEVEL settings mirror those used for SQL*Net.
Optional:
DBSNMP.TRACE_FILE = agent (Default=dbsnmp.trc)
DBSNMP.TRACE_DIRECTORY = /private/temp (Default=$ORACLE_HOME/network/trace)
(Any existing directory where the Agent has write permissions)
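The trace parameters above can be generated with a small helper that rejects trace levels not listed in this section. The following is a minimal sketch in Python, not part of the Agent: the function name trace_settings is hypothetical, and the parameter names and allowed levels are taken from the examples above.

```python
VALID_LEVELS = ("OFF", "USER", "ADMIN", "16")


def trace_settings(level, trace_file=None, trace_directory=None):
    """Return snmp_rw.ora lines for the given Agent trace settings."""
    level = str(level).upper()
    if level not in VALID_LEVELS:
        raise ValueError(f"trace level must be one of {VALID_LEVELS}")
    lines = [f"DBSNMP.TRACE_LEVEL = {level}"]
    if trace_file:
        lines.append(f"DBSNMP.TRACE_FILE = {trace_file}")
    if trace_directory:
        lines.append(f"DBSNMP.TRACE_DIRECTORY = {trace_directory}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Values taken from the example settings in this section.
    print(trace_settings("admin", "agent", "/private/temp"), end="")
```

The optional parameters are written only when a value is supplied, so the Agent's defaults apply otherwise.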
The log file, $ORACLE_HOME/network/log/dbsnmp.log, is written by the Agent on every startup, even if tracing is not turned on. It contains the name and version of the Agent and the name and location of the Agent's configuration files. If tracing is turned on, it also contains problems encountered with the database and listener connections.
The log file, $ORACLE_HOME/network/log/nmiconf.log, is created on the first startup of the Agent and appended to on every startup after that. Auto-discovery is done by the Tcl script nmiconf.tcl (hence the log file name). This file is written to only during startup.
$ORACLE_HOME/agentbin/ORATCLSH is a special-purpose TCL shell that supports all standard TCL verbs (supported in TCL75.dll) plus a large subset (not all) of the ORATCL verbs supported by the OEM Agent. ORATCLSH is not a general purpose utility and may only be used in combination with the OEM Agent as it depends on files and data structures maintained by the OEM Agent.
There is no documentation of ORATCLSH and it has never been part of the supported feature set of the OEM Agent. It is provided strictly as a debugging tool to help Oracle customers and developers in developing OEM job and event scripts. The executable ORATCLSH is provided for debugging your TCL scripts. Before executing ORATCLSH, set the environment variable TCL_LIBRARY to point to $ORACLE_HOME/network/agent/tcl, the location of the init.tcl file.
You may also turn Tcl tracing on by setting the environment variable ORATCL_DEBUG and turning tracing on in the snmp_rw.ora file. ORATCL_DEBUG must be set to the $ORACLE_HOME/network/trace directory. You must shut down and restart the Agent for these parameters to take effect. TCL tracing creates a file, oratcl.trc, in the above location. Every time an event is run, an entry is added to the oratcl.trc file.
Copyright © 2002 Oracle Corporation. All Rights Reserved.