We use a simple concept to check for any standby gap or logs
apply problems in all the high availability (Dataguard) servers in our
environment. Use the below query in a script and make it run from cron
periodically so we don’t have to go and check manually all the time.
spool /home/oracle/dba_scripts/logs/standby_db_sync_err.txt
select PRIM-SEC from
(select MAX(SEQUENCE#) PRIM FROM V$ARCHIVED_LOG WHERE DEST_ID=1 AND
ARCHIVED='YES'),
(select MAX(SEQUENCE#) SEC FROM V$ARCHIVED_LOG WHERE DEST_ID=2 AND
APPLIED='YES');
spool off
Today I received log applied gaps error alert mail stating
we have a log applied gap of 24 logs. My script runs every hour and sends the
alert mail only when the log applied gap crosses 10 logs since I have delay
parameter set in the database.
Note: I used the term “log applied gaps” as I’m meaning that
the term is used to differentiate the logs applied on primary with logs applied
on standby database.
I first thought that there might be some activity going on
in the database and the gap will catch up soon as this happens most of the
time, but now the gaps increased to 35, the next hour. So something is wrong ,
and I should check it.
I used the following query on the standby database which
gave the below.
SQL> select sequence#, process, status from v$managed_standby;
SEQUENCE# PROCESS STATUS
---------- --------- ------------
0 ARCH CONNECTED
0 ARCH CONNECTED
0 ARCH CONNECTED
0 ARCH CONNECTED
0 ARCH CONNECTED
65999 MRP0 WAIT_FOR_LOG
66007 RFS IDLE
66008 RFS IDLE
66009 RFS IDLE
This is obvious, RFS gets the logs as and when generated at
primary and MRP waits until the “delay” time reaches before it applies. The gap
is not much but only 10 logs. Then I thought why the script gives wrong
information?
I ran the script manually to check and still receive the
same alert mail of log applied gap 35. So I ran the query separately in primary
database to check/troubleshoot what is going on. I got the below response from
the primary database.
SQL> select MAX(SEQUENCE#) PRIM FROM V$ARCHIVED_LOG WHERE
DEST_ID=1 AND ARCHIVED='YES';
PRIM
----------
66010
SQL> select MAX(SEQUENCE#) SEC FROM V$ARCHIVED_LOG WHERE
DEST_ID=2 AND APPLIED='YES';
SEC
----------
65975
Ok, I just wait for some time and check the same again but to
see the same result. (Many a times, in standby environment I get the results
delayed. J
)
Now, it’s time to dig Oracle support documents to find what is really
going on. I thought it might be some sort of bug as I haven’t faced such an
issue before. As my prediction it turns to be a bug.
The MOS
article “APPLIED - Column not updated if Heartbeat-ARCH hangs [ID 1369630.1]”
states the issue we are dealing with is due to the bug “Bug 6113783 - Arch
processes can hang indefinitely on network [ID 6113783.8]”.
The bug is fixed in 11.2.0.1 base release and I’m currently
running my database in 10.2.0.4.
There are 2 options provided.
- Upgrade DB to 11.2.0.1 or higher
- Use the work around of killing the hung ARCH process in the primary database.
Option 2 looked feasible at this time as upgrading of the DB
requires a lot of approvals, application compatibility, etc., etc.,.
So how to find the hung ARCH process?
MOS says Look for Messages like
"ARCn: Becoming the heartbeat ARCH"
in the ALERT.LOG of the Primary Database to determine the
current Heartbeat ARCn-Process.
So I searched my alert log file to find the below.
Fri Aug 17 05:04:08 2012
ARC2: Becoming the heartbeat ARCH
ARC3 started with pid=103, OS id=28106
Fri Aug 17 05:04:08 2012
Check for the OS process id for ARC2
[oracle@slv00112 bdump]$ ps -ef|grep ora_arc
oracle 32224 1
0 2012 ? 02:43:53 ora_arc0_edisipd
oracle 32226 1
0 2012 ? 02:21:49 ora_arc1_edisipd
oracle 32232 1
0 2012 ? 02:41:07 ora_arc2_edisipd
oracle 32234 1
0 2012 ? 02:26:41 ora_arc3_edisipd
oracle 23429 16642 0 04:51 pts/1 00:00:00 grep ora_arc
Kill the Hung ARCH process
[oracle@slv00112 bdump]$ kill -9 32232
Now check the alert log file again to find the below.
ARCH: Detected ARCH process failure
ARCH: STARTING ARCH PROCESSES
ARC2: Archival started
ARCH: STARTING ARCH PROCESSES COMPLETE
ARC2 started with pid=67, OS id=23425
ARC2: Becoming the heartbeat ARCH
Wed Jan 16 04:53:47 2013
Ok, now the new process has been generated. Hence check with
the primary database whether the APPLIED column is updated.
SQL> select MAX(SEQUENCE#) SEC FROM V$ARCHIVED_LOG WHERE
DEST_ID=2 AND APPLIED='YES';
SEC
----------
66003
It was a great article, helped me in resolving the most frequently occured issue.
ReplyDelete~Sandy
That is good to hear. Thank you!!
DeleteThis is excellent post. Appreciated for sharing such real time issue and solution
ReplyDeleteThank you!!
Delete