Saturday, June 18, 2022

FS_CLONE failed with directory error

I ran "adop phase=fs_clone" and it failed on the 2nd node of a two-node instance. The error is

$ adopscanlog -latest=yes
... ...
$NE_BASE/EBSapps/log/adop/.../fs_clone/node2Name/TXK_SYNC_create/txkADOPPreparePhaseSynchronize.log:
-------------------------------------------------------------------------------------------------------------
Lines #(345-347):
... ...
FUNCTION: main::removeDirectory [ Level 1 ]
ERRORMSG: Failed to delete the directory $PATCH_BASE/EBSapps/comn.

When that happened, some folders had already been deleted from the PATCH file system by FS_CLONE, which is kind of scary. The only way to get past this is to address the root cause and then re-run FS_CLONE.

The error matches the description in Doc ID 2690029.1 (ADOP: Fs_clone fails with error Failed to delete the directory). The root cause is that developers copied files into directories owned by applMgr, or concurrent jobs wrote logs to folders under CUSTOM TOPs (where this happens most often), so those files belong to other OS users and applMgr has no permission to remove them. The fix is to ask the OS System Admin to find those files and change their owner to applMgr, or to delete all the troubling files while logged in as the file owner.

$ cd $PATCH_BASE/EBSapps/comn
$ find . ! -user applMgr                   (then log in as the file owner to delete them)
$ ls -lR | grep -v applMgr | more          (optional: to see the detailed list)
$ find . -user wrong_userID -exec chown applMgr:userGroup {} \;
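Before re-running FS_CLONE, a quick sanity check (my own habit, assuming applMgr is the EBS OS owner) is to confirm that no files owned by other users remain:

$ cd $PATCH_BASE/EBSapps/comn
$ find . ! -user applMgr | wc -l           (should print 0 before fs_clone is re-run)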

After the fix at the OS level, I tried to run "adop phase=fs_clone allnodes=no force=yes" directly on the 2nd node and got this error:
[UNEXPECTED]The admin server for the patch file system is not running.        
Start the patch file system admin server from the admin node and then rerun fs_clone.

There are two options to make it work. One is to run "adop phase=fs_clone force=yes" on the Primary node; adop seems to understand that fs_clone already succeeded on the 1st node and quickly progresses to running it on the 2nd node. The other is to start the WLS Admin Server on the patch file system from the Primary node and then run "adop phase=fs_clone allnodes=no force=yes" on the 2nd node.
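For reference, the two options look roughly like this on the command line. The "forcepatchfs" argument to adadminsrvctl.sh is my understanding of how to start the Admin Server on the patch file system; verify it against Oracle's documentation for your release before using it.

Option 1 (on the Primary node):
$ adop phase=fs_clone force=yes

Option 2 (start the patch file system Admin Server from the Primary node, then clone only the 2nd node):
$ adadminsrvctl.sh start forcepatchfs         (run from the RUN file system environment on the Primary node)
$ adop phase=fs_clone allnodes=no force=yes   (run on the 2nd node)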

FS_CLONE normal log:

$ adop phase=fs_clone
... ...
Running fs_clone on admin node: [node1Name].
    Output: $NE_BASE/EBSapps/log/adop/.../fs_clone/remote_execution_result_level1.xml
    Log: $NE_BASE/EBSapps/log/adop/.../fs_clone/node1Name/txkADOPEvalSrvStatus.pl returned SUCCESS

Running fs_clone on node(s): [node2Name].
    Output: $NE_BASE/EBSapps/log/adop/.../fs_clone/remote_execution_result_level2.xml
    Log: $NE_BASE/EBSapps/log/adop/.../fs_clone/node2Name/txkADOPEvalSrvStatus.pl returned SUCCESS

Stopping services on patch file system.

    Stopping admin server.
You are running adadminsrvctl.sh version 120.10.12020000.11
Stopping WLS Admin Server...
Refer $PATCH_BASE/inst/apps/$CONTEXT_NAME/logs/appl/admin/log/adadminsrvctl.txt for details
AdminServer logs are located at $PATCH_BASE/FMW_Home/user_projects/domains/EBS_domain/servers/AdminServer/logs
adadminsrvctl.sh: exiting with status 0
adadminsrvctl.sh: check the logfile $PATCH_BASE/inst/apps/$CONTEXT_NAME/logs/appl/admin/log/adadminsrvctl.txt for more information ...

    Stopping node manager.
You are running adnodemgrctl.sh version 120.11.12020000.12
The Node Manager is already shutdown
NodeManager log is located at $PATCH_BASE/FMW_Home/wlserver_10.3/common/nodemanager/nmHome1
adnodemgrctl.sh: exiting with status 2
adnodemgrctl.sh: check the logfile $PATCH_BASE/inst/apps/$CONTEXT_NAME/logs/appl/admin/log/adnodemgrctl.txt for more information ...

Summary report for current adop session:
    Node node1Name:
       - Fs_clone status:   Completed successfully
    Node node2Name:
       - Fs_clone status:   Completed successfully
    For more details, run the command: adop -status -detail
adop exiting with status = 0 (Success)

NOTES:
In one instance, FS_CLONE took 8 hours on the first node, with no log entries or updates during that time. It just stayed frozen for hours!
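When that happens, a generic OS-level way to confirm FS_CLONE is still doing work (not an adop feature, just my own check) is to watch the patch file system grow and look for running clone processes:

$ du -sh $PATCH_BASE                   (re-run every few minutes; the size should keep increasing while files are copied)
$ ps -ef | grep -i clone | grep -v grep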
