Monday, February 1, 2016

R12 concurrent manager troubleshooting

Concurrent managers may not work properly for various reasons.  To check if any concurrent jobs are running, run the script $FND_TOP/sql/afcmrrq.sql. See Doc ID 2089560.6 (How To Tell If Concurrent Managers For A Particular SID Are Running) for other scripts.
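
If the script is not handy, a quick check can also be run directly against FND_CONCURRENT_REQUESTS (a minimal sketch; phase_code 'R' marks running requests):

SQL> select request_id, phase_code, status_code
       from fnd_concurrent_requests
      where phase_code = 'R';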

1. After instance refresh or clone, concurrent managers fail to start.
When old (source) server info is still saved in the database tables, concurrent managers get confused and will not start properly, even after running the lines below to clean the source info before running AutoConfig on the database server (the full AutoConfig sequence is sketched after the verification queries):
$ sqlplus apps/passwd
SQL> @cmclean.sql         ( only needed for releases before R12.2 )
SQL> EXEC FND_CONC_CLONE.SETUP_CLEAN;
SQL> commit;

SQL> select * from fnd_nodes;   ( should return 0 rows; one row appears after the DBA runs AutoConfig on the database tier )
SQL> select unique(node_name) from fnd_concurrent_queues;
SQL> select * from fnd_concurrent_processes;
SQL> select * from fnd_conflicts_domain;
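
For reference, FND_NODES stays empty after the clean until AutoConfig repopulates it. A rough sequence (script locations depend on the install; <CONTEXT_NAME> is a placeholder):

On the database server:
$ cd $ORACLE_HOME/appsutil/scripts/<CONTEXT_NAME>
$ ./adautocfg.sh

Then on each application server:
$ cd $ADMIN_SCRIPTS_HOME          ( i.e. $INST_TOP/admin/scripts in R12.1 )
$ ./adautocfg.sh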

I saw message lines like these in the log files under $APPLCSF/log:
CONC-SM TNS FAIL
Routine AFPEIM encountered an error while starting concurrent manager FNDSCH with library $APPL_TOP/fnd/12.0.0/bin/FNDSCH.

Check that your system has enough resources to start a concurrent manager process. Contact your system ad : 06-DEC-2015 09:58:42

The best fix is to apply patch 18539575, which fixes a bug in R12.1 (see Doc ID 1646026.1). After this patch, concurrent services start normally in my refreshed instances.

One time, when the CM failed to start, I saw this message in the log file:
Routine AFPEIM encountered an error while starting concurrent manager STANDARD with library $APPL_TOP/fnd/12.0.0/bin/FNDLIBR.
Check that your system has enough resources to start a concurrent manager process. Contact your sys : 30-DEC-2015 15:08:42
Starting Internal Concurrent Manager Concurrent Manager : 30-DEC-2015 15:08:42
CONC-SM TNS FAIL
 : ICM failed to start for target node1Name.  Review ICM log files for additional information.
                     Process monitor session ended : 30-DEC-2015 15:08:42


I realized that node1Name was the wrong node, so I ran SQL to update the data (with caution!):

SQL> select unique(node_name) from fnd_concurrent_queues;
NODE_NAME
------------------------------
node1Name
node2Name


SQL> select count(*) from fnd_concurrent_queues;
  COUNT(*)
----------------
        45


SQL> create table fnd_concurrent_queues_BK_jan as select * from fnd_concurrent_queues;
SQL> update fnd_concurrent_queues set NODE_NAME='node2Name'
where NODE_NAME='node1Name';

3 rows updated.

SQL> select unique(node_name) from fnd_concurrent_queues;
NODE_NAME
------------------------------
node2Name

SQL> commit;

After the table was updated, all concurrent managers started on server node2Name.

2. When the concurrent manager processes were not fully or cleanly shut down (e.g. database connections were interrupted), CM services may fail to restart. The numbers of processes in "Actual" and "Target" may not match, with the message "System Hold, Fix Manager before resetting counters". In this case, the first thing to try is using OAM to restart the individual managers; a command-line fallback is sketched after the steps below. For example (in Oracle Applications Manager version 2.3.1):

Site Map => Generic Services
Select the radio button for Output Post Processor (or Conflict Resolution Manager) => View Details => Start (or verify) in the dropdown at the top => Go
(click Next 10) Generic Service Component Container => Workflow Agent Listener Service => Start in the dropdown at the top => Go

Site Map => Request Processing Manager
Select the radio button for Standard Manager => View Status tab at the top => Start => OK. Then click Service Instances at the top to return to the manager list.
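
If OAM restarts do not help, a full stop/start of the concurrent services from the OS command line is the usual fallback. A minimal sketch (the apps password is a placeholder; adcmctl.sh lives under $ADMIN_SCRIPTS_HOME in R12):

$ cd $ADMIN_SCRIPTS_HOME
$ ./adcmctl.sh stop apps/appsPWD
$ ps -ef | grep FNDLIBR          ( wait until the old manager processes exit )
$ ./adcmctl.sh start apps/appsPWD
$ ./adcmctl.sh status apps/appsPWD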

3. When trying to open a request log or output file in the GUI, it gives this error:

An error occurred while attempting to establish an Applications File Server connection with the
node FNDFS_NODE2NAME. There may be a network configuration problem, or the TNS listener
on node FNDFS_NODE2NAME may not be running. Please contact your system administrator.

Here, Node2NAME is the concurrent processing server. After making sure the listener (tnslsnr) is running, the first thing to try is to ping the TNS alias from Node1 (the web/forms server): $ tnsping FNDFS_NODE2NAME
If the tnsping fails, the real problem could be in the $TNS_ADMIN/tnsnames.ora file on Node1. In one case, "domain.com" was not appended to Node2NAME in the entries of the tnsnames.ora file on the Node1 server. After I re-ran AutoConfig on all nodes, the problem was fixed.
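
For reference, the FNDFS alias generated by AutoConfig usually looks roughly like the entry below; the host, domain and port here are made-up examples, and the key point is that HOST must resolve (fully qualified when your SQL*Net setup expects it):

FNDFS_NODE2NAME.domain.com=
        (DESCRIPTION=
                (ADDRESS=(PROTOCOL=tcp)(HOST=NODE2NAME.domain.com)(PORT=1626))
                (CONNECT_DATA=(SID=FNDFS))
        )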

Another place to check: make sure the "RRA:%" profile options have nothing strange in their values.

If it works, a line like the one below should be listed under "Concurrent => Manager => Administer":
NAME                                   Node      Actual    Target   Status
Service Manager: Node1    Node1     1            0          {blank}

For a Service Manager, if the Status shows "Target node/queue unavailable", clicking "Restart" or retrying from OAM may not help.
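
In that case it is worth checking which node the service manager queue points to (a sketch; service manager queue names normally start with FNDSM):

SQL> select concurrent_queue_name, node_name, target_node, control_code
       from fnd_concurrent_queues
      where concurrent_queue_name like 'FNDSM%';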

UPDATE in Sept 2020: Recently, it seems the R12.1 clone script on RHEL7 did not capture all info during the clone (maybe due to a database connection error at the beginning). After the clone completed and adautocfg.sh was run on all nodes, the tnsnames.ora file on the CM node was missing some entries. After a 2nd run of adautocfg.sh on all nodes, some entries in tnsnames.ora on the CM node were missing "domain.com". The tnsnames.ora file on the CM node did not match the original pre-clone tnsnames.ora until the 3rd run of adautocfg.sh completed.

4. Sometimes you are in a hurry to stop CM services but the OS processes keep running. You may see in the GUI form that "Actual" is non-zero while "Target" is zero, which means some requests are still running. Clicking "Terminate" on a concurrent manager in the GUI will NOT kill its OS process, so it does not really speed things up. The best way is to find the running requests and cancel them; a query sketch follows.
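
A sketch of such a query (join to the TL table to get the program name; adjust as needed):

SQL> select r.request_id, t.user_concurrent_program_name,
            r.oracle_process_id, r.os_process_id
       from fnd_concurrent_requests r, fnd_concurrent_programs_tl t
      where r.concurrent_program_id = t.concurrent_program_id
        and r.program_application_id = t.application_id
        and t.language = 'US'
        and r.phase_code = 'R';

Each request found this way can be cancelled from the Requests form, and os_process_id (when populated) helps map the request to an OS process.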

5. One time, the Output Post Processor did not start. I checked the file $APPLCSF/log/FNDOPPxxxxx.txt and saw this error:
Exception in static block of jtf.cache.CacheManager. Stack trace is: oracle.apps.jtf.base.resources.FrameworkException:
IAS Cache initialization failed. The Distributed Caching System failed to initialize on port: 12351. The list of hosts in the distributed caching system is: 157.121.49.41 157.121.53.42 157.121.53.43 . The port 12351 should be free on each host running the JVMs.


That means port 12351 is in use. After the port became free, the Output Post Processor started.
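
A quick way to confirm what is holding the port on each host listed (assuming netstat is available; lsof works too if installed):

$ netstat -an | grep 12351
$ lsof -i :12351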

6. One way to check whether the CM files on the file system are good is to run the report below as a test. If everything works, it will generate and save the Active Users report for you even when all EBS services are shut down at the OS level.

$INST_TOP/ora/10.1.2/bin/appsrwrun.sh userid=apps/appsPWD mode=character report=$FND_TOP/reports/US/FNDSCURS.rdf \
batch=yes destype=file desname=./areport.out desformat=$FND_TOP/reports/HPL pagesize=132x66 traceopts=trace_all tracefile=areport.trc tracemode=trace_replace 

7. The errors below may indicate that a node name is not registered in the FND_NODES table, for example when AutoConfig was not run on all nodes after a clean-up (a query to verify is shown after the error text).

List of errors encountered:
.............................................................................

_ 1 _
Concurrent Manager cannot find error description for CONC-System Node
Name not Registered

Contact your support representative.
.............................................................................


List of errors encountered:
.............................................................................

_ 1 _
Routine AFPCAL received failure code while parsing or running your
concurrent program CPMGR

Review your concurrent request log file for more detailed information.
Make sure you are passing arguments in the correct format.
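
A quick query to verify the registration (all active application-tier and database nodes should be listed with the right service flags):

SQL> select node_name, status, support_cp, support_forms, support_web, support_db
       from fnd_nodes;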


8. If you do not want concurrent managers to run on a node at all, use these two parameters in $CONTEXT_FILE to turn them off:

<oa_service_group_status oa_var="s_batch_status">disabled</oa_service_group_status>
<oa_service_group_status oa_var="s_other_service_group_status">disabled</oa_service_group_status>

I once noticed the error below in a manager log file, while concurrent processing worked fine. I believe the error showed up after CM services got started on the wrong node because of wrong values in $CONTEXT_FILE when the node was bounced after a crash. Oracle Support said it was a database issue, but I did not touch any tables, and the error went away by itself after the above two values were set to "disabled" in $CONTEXT_FILE on that node.

Routine &ROUTINE has attempted to start the internal concurrent manager. The ICM is already running. Contact you system administrator for further assistance.afpdlrq received an unsuccessful result from PL/SQL procedure or function FND_DCP.Request_Session_Lock.
Routine FND_DCP.REQUEST_SESSION_LOCK received a result code of 1 from the call to DBMS_LOCK.Request.
Possible DBMS_LOCK.Request result ORACLE error 1036 in tag_db_session

Cause: tag_db_session failed due to ORA-01036: illegal variable name/number.

The SQL statement being executed at the time of the error was: ... and was executed from the file &ERRFILE.Call to establish_icm failed
The Internal Concurrent Manager has encountered an error.

Review concurrent manager log file for more detailed information. : 26-JUL-2015 10:52:55 -
Shutting down Internal Concurrent Manager : 26-JUL-2015 10:52:55

List of errors encountered:
.............................................................................
_ 1 _
Routine AFPCSQ encountered an ORACLE error. .
Review your error messages for the cause of the error. (=<POINTER>)
.............................................................................

List of errors encountered:
.............................................................................
_ 1 _
Routine AFPCAL received failure code while parsing or running your
concurrent program CPMGR
Review your concurrent request log file for more detailed information.
Make sure you are passing arguments in the correct format.
.............................................................................
The EBSXXXX_0726@EBSXXXX internal concurrent manager has terminated with status 1 - giving up.