Wednesday, January 16, 2013

APPLIED column on v$archived_log when Standby database is setup


We use a simple concept to check for any standby gap or logs apply problems in all the high availability (Dataguard) servers in our environment. Use the below query in a script and make it run from cron periodically so we don’t have to go and check manually all the time.

spool /home/oracle/dba_scripts/logs/standby_db_sync_err.txt

select PRIM-SEC from
(select MAX(SEQUENCE#) PRIM FROM V$ARCHIVED_LOG WHERE DEST_ID=1 AND ARCHIVED='YES'),
(select MAX(SEQUENCE#) SEC FROM V$ARCHIVED_LOG WHERE DEST_ID=2 AND APPLIED='YES');

spool off

Today I received log applied gaps error alert mail stating we have a log applied gap of 24 logs. My script runs every hour and sends the alert mail only when the log applied gap crosses 10 logs since I have delay parameter set in the database.
Note: I used the term “log applied gaps” as I’m meaning that the term is used to differentiate the logs applied on primary with logs applied on standby database.

I first thought that there might be some activity going on in the database and the gap will catch up soon as this happens most of the time, but now the gaps increased to 35, the next hour. So something is wrong , and I should check it.

I used the following query on the standby database which gave the below.

SQL> select sequence#, process, status from v$managed_standby;

 SEQUENCE# PROCESS   STATUS
---------- --------- ------------
         0 ARCH      CONNECTED
         0 ARCH      CONNECTED
         0 ARCH      CONNECTED
         0 ARCH      CONNECTED
         0 ARCH      CONNECTED
     65999 MRP0      WAIT_FOR_LOG
     66007 RFS       IDLE
     66008 RFS       IDLE
     66009 RFS       IDLE

This is obvious, RFS gets the logs as and when generated at primary and MRP waits until the “delay” time reaches before it applies. The gap is not much but only 10 logs. Then I thought why the script gives wrong information?
I ran the script manually to check and still receive the same alert mail of log applied gap 35. So I ran the query separately in primary database to check/troubleshoot what is going on. I got the below response from the primary database.

SQL> select MAX(SEQUENCE#) PRIM FROM V$ARCHIVED_LOG WHERE DEST_ID=1 AND ARCHIVED='YES';

      PRIM
----------
     66010

SQL> select MAX(SEQUENCE#) SEC FROM V$ARCHIVED_LOG WHERE DEST_ID=2 AND APPLIED='YES';

       SEC
----------
     65975

Ok, I just wait for some time and check the same again but to see the same result. (Many a times, in standby environment I get the results delayed. J )

Now, it’s time to dig Oracle support documents to find what is really going on. I thought it might be some sort of bug as I haven’t faced such an issue before. As my prediction it turns to be a bug.
The MOS article “APPLIED - Column not updated if Heartbeat-ARCH hangs [ID 1369630.1]” states the issue we are dealing with is due to the bug “Bug 6113783 - Arch processes can hang indefinitely on network [ID 6113783.8]”.

The bug is fixed in 11.2.0.1 base release and I’m currently running my database in 10.2.0.4.
There are 2 options provided.
  1. Upgrade DB to 11.2.0.1 or higher
  2. Use the work around of killing the hung ARCH process in the primary database.


Option 2 looked feasible at this time as upgrading of the DB requires a lot of approvals, application compatibility, etc., etc.,.

So how to find the hung ARCH process?

MOS says Look for Messages like
"ARCn: Becoming the heartbeat ARCH"
in the ALERT.LOG of the Primary Database to determine the current Heartbeat ARCn-Process.

So I searched my alert log file to find the below.

Fri Aug 17 05:04:08 2012
ARC2: Becoming the heartbeat ARCH
ARC3 started with pid=103, OS id=28106
Fri Aug 17 05:04:08 2012

Check for the OS process id for ARC2
[oracle@slv00112 bdump]$ ps -ef|grep ora_arc
oracle   32224     1  0  2012 ?        02:43:53 ora_arc0_edisipd
oracle   32226     1  0  2012 ?        02:21:49 ora_arc1_edisipd
oracle   32232     1  0  2012 ?        02:41:07 ora_arc2_edisipd
oracle   32234     1  0  2012 ?        02:26:41 ora_arc3_edisipd
oracle   23429 16642  0 04:51 pts/1    00:00:00 grep ora_arc

Kill the Hung ARCH process

[oracle@slv00112 bdump]$ kill -9 32232

Now check the alert log file again to find the below.

ARCH: Detected ARCH process failure
ARCH: STARTING ARCH PROCESSES
ARC2: Archival started
ARCH: STARTING ARCH PROCESSES COMPLETE
ARC2 started with pid=67, OS id=23425
ARC2: Becoming the heartbeat ARCH
Wed Jan 16 04:53:47 2013

Ok, now the new process has been generated. Hence check with the primary database whether the APPLIED column is updated.

SQL> select MAX(SEQUENCE#) SEC FROM V$ARCHIVED_LOG WHERE DEST_ID=2 AND APPLIED='YES';

       SEC
----------
     66003

Now it has been updated successfully. J

Monday, October 8, 2012

REDUCE TOP WAIT EVENTS – AN ATTEMPT


This article is prepared to show how I take steps to reduce the top wait events in a busy system running in my environment. As always your comments and suggestions are welcome so that you and I can learn oracle and benefited mutually.

For basic tuning steps and some threshold values please refer to the links below.


These links above give you the basics of performance tuning. Mind this is very old approach and many things have changed from this approach. But when speaking about threshold metrics, you can use those individual queries to check the performance metrics.  

For eg.

select 100*(1 - (v3.value / (v1.value + v2.value))) "Cache Hit Ratio [%]"
from v$sysstat v1, v$sysstat v2, v$sysstat v3
where
v1.name = 'db block gets' and
v2.name = 'consistent gets' and
v3.name = 'physical reads';

The above query is used to calculate the buffer cache hit ratio and typically it should be very high (more than 90%)

Ok, coming to the point where I fine tuned the application is as below.

Scenario: This is an application where batch jobs of huge amount of data (millions of rows) are deleted and updated/inserted on a daily basis. Application team reported that the application is responding very slowly than before as their delete job runs 4~5 folds slower.
While generating the AWR report at the specified time (the time the delete/insert batch job runs), we can find the below report portion.

I have got the instance efficiency percentages as below



You can see from the above that the buffer hit percentage is nearly 100% and comparatively library hit and soft parse % are good as well. We can tune the machine to increase the soft parse % here by making either the SGA to a bigger value or by making use of bind variables (This is an application where many queries are using bind variables and still the value is low, reason being cursor_sharing is set to EXACT which will not take sql statements which use bind variables as used statement and takes every statement as new statement).

While generating the AWR report at the specified time (the time the delete/insert batch job runs), we can find the below report portion.



The Top 5 timed event resulted as above which shows that the reads are taking time in the system. Hence check with the SQL ordered by Reads as below.



Looking at the statistics from the AWR report, the following query is consuming more reads.

  1. call dbms_stats.gather_database_stats_job_proc ( )        
  2. insert /*+ append */ into sys.ora_temp_1_ds_28645 select /*+ no_parallel(t) no_parallel_index(t) dbms_stats cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) no_monitoring */"OBJECT_ID", "ARCHIVE_DATE", "WF_ID", "REC_TIME", rowid SYS_DS_ALIAS_0 from "EDIAPP"."CORRELATION_SET" sample ( .3482546807) t

Looking at the statistics above, it is clear that most of the reads are consumed by statistics gatherer run by scheduler jobs which is scheduled to run automatically in oracle 10g using DBMS_STATS.
The task is scheduled in WEEKNIGHT_WINDOW of the DBA_SCHEDULER_WINDOWS which starts daily at 10 00 pm EST.

So in order to reduce this waits on db file sequential read, its better we disable this scheduled job as we have statistics gathering run explicitly.

Also init parameter db_file_multiblock_read_count is currently set to 32. Oracle recommends not setting this value manually and letting oracle choose the optimal value. So we can make this alteration as a step to improve the performance.

The other wait events log file sync and log file parallel write are related to each other. By decreasing the log file sync wait even, we can reduce log file parallel write wait event.
From dataguard perspective (as we have dataguard set up for this environment),
For Data Guard with synchronous (SYNC) transport and commit WAIT defaults, log file parallel write also includes the time for the network write and the RFS/redo write to the standby redo logs. This makes us clear that an improved network can improve the waits encountered.

Steps to improve the performance by eliminating the above wait events is as follows.

  • Tune LGWR to get good throughput to disk . eg: Do not put redo logs on RAID 5. à This is checked by Unix admin and made clear files are not in RAID 5  
  • If there are lots of short duration transactions see if it is possible to BATCH transactions together so there are fewer distinct COMMIT operations. Each commit has to have it confirmed that the relevant REDO is on disk. Although commits can be "piggybacked" by Oracle reducing the overall number of commits by batching transactions can have a very beneficial effect. à Apps team should take this into account for improving performance and hence advised them the same.
  • See if any of the processing can use the COMMIT NOWAIT option (be sure to understand the semantics of this before using it). à This is not recommended though
  • See if any activity can safely be done with NOLOGGING / UNRECOVERABLE options. à Compromises high availability and hence is not recommended as we have standby setup
  • Check to see if redologs are large enough. Enlarge the redologs so the logs switch between 15 to 20 minutes. à Current redo log size is 500 mb. We can increase this to 1gb so the log switch takes increased time.

Action plan would be as follows.

  1. Disable scheduled jobs to collect statistics
  2. Change db_file_multiblock_read_count value
  3. Increase redo logs to 1gb
  4. Reduce log buffers from the current size
The above example may provide a good feed to your white paper as this is the live tuning that has been recommended to application team and this works well after reducing log buffers size. This resulted in decreased log file sync.
The statistics were gathered for the database and indexes (for the tables that were affected by the batch job) were rebuilt and the performance is monitored again. This shows a very good increase in performance by reduction of sequential read and scattered read. Application team is satisfied with the performance which is the primary goal.

Thank you!! J Happy tuning!! JJ