联系:手机/微信(+86 17813235971) QQ(107644445)
作者:惜分飞©版权所有[未经本人同意,不得以任何形式转载,否则有进一步追究法律责任的权利.]
有客户反馈系统hang住,不能归档,需要我们紧急介入分析
节点1日志
出现redo不能归档,redo日志都已经被写满,人工执行了ALTER SYSTEM ARCHIVE LOG CURRENT,数据库就开始把redo全部归档,但是后面产生的redo又不能归档,当redo全部写满之后,数据库有出现大量log file switch (archiving needed)等待
Tue Sep 24 22:05:37 2013 Thread 1 advanced to log sequence 47282 (LGWR switch) Current log# 6 seq# 47282 mem# 0: +DATA/q9db/onlinelog/group_6.1244.818697409 Tue Sep 24 22:07:31 2013 ORACLE Instance q9db1 - Can not allocate log, archival required Thread 1 cannot allocate new log, sequence 47283 All online logs needed archiving Current log# 6 seq# 47282 mem# 0: +DATA/q9db/onlinelog/group_6.1244.818697409 Tue Sep 24 22:28:17 2013 ALTER SYSTEM ARCHIVE LOG Archived Log entry 259646 added for thread 1 sequence 47266 ID 0x354620c2 dest 1: Tue Sep 24 22:28:18 2013 Thread 1 advanced to log sequence 47283 (LGWR switch) Current log# 7 seq# 47283 mem# 0: +DATA/q9db/onlinelog/group_7.1243.818697415 Archived Log entry 259647 added for thread 1 sequence 47267 ID 0x354620c2 dest 1: Archived Log entry 259648 added for thread 1 sequence 47268 ID 0x354620c2 dest 1: Archived Log entry 259649 added for thread 1 sequence 47269 ID 0x354620c2 dest 1: Archived Log entry 259650 added for thread 1 sequence 47270 ID 0x354620c2 dest 1: Archived Log entry 259651 added for thread 1 sequence 47271 ID 0x354620c2 dest 1: Archived Log entry 259652 added for thread 1 sequence 47272 ID 0x354620c2 dest 1: Tue Sep 24 22:28:28 2013 Archived Log entry 259653 added for thread 1 sequence 47273 ID 0x354620c2 dest 1: Archived Log entry 259654 added for thread 1 sequence 47274 ID 0x354620c2 dest 1: Archived Log entry 259655 added for thread 1 sequence 47275 ID 0x354620c2 dest 1: Archived Log entry 259656 added for thread 1 sequence 47276 ID 0x354620c2 dest 1: Archived Log entry 259657 added for thread 1 sequence 47277 ID 0x354620c2 dest 1: Archived Log entry 259658 added for thread 1 sequence 47278 ID 0x354620c2 dest 1: Archived Log entry 259659 added for thread 1 sequence 47279 ID 0x354620c2 dest 1: Tue Sep 24 22:28:39 2013 Archived Log entry 259660 added for thread 1 sequence 47280 ID 0x354620c2 dest 1: Archived Log entry 259661 added for thread 1 sequence 47281 ID 0x354620c2 dest 1: Archived Log entry 259662 added for thread 1 sequence 47282 ID 0x354620c2 dest 1: Tue Sep 24 22:29:39 2013 Thread 1 advanced to log sequence 47284 (LGWR switch) Current log# 8 seq# 47284 mem# 0: +DATA/q9db/onlinelog/group_8.1242.818697417 Tue Sep 24 22:31:18 2013 Thread 1 advanced to log sequence 47285 (LGWR switch) Current log# 16 seq# 47285 mem# 0: +DATA/q9db/onlinelog/group_16.1884.827003545 Thread 1 advanced to log sequence 47286 (LGWR switch) Current log# 17 seq# 47286 mem# 0: +DATA/q9db/onlinelog/group_17.1885.827003587
节点2日志
节点2中出现大量的IPC Send timeout
Tue Sep 24 15:22:19 2013 IPC Send timeout detected. Sender: ospid 4008 [oracle@q9db02.800best.com (PING)] ………… Tue Sep 24 18:51:55 2013 IPC Send timeout detected. Sender: ospid 4008 [oracle@q9db02.800best.com (PING)] Tue Sep 24 18:57:54 2013 IPC Send timeout detected. Sender: ospid 4008 [oracle@q9db02.800best.com (PING)] Receiver: inst 1 binc 464003926 ospid 1566 Tue Sep 24 19:03:57 2013 IPC Send timeout detected. Sender: ospid 4008 [oracle@q9db02.800best.com (PING)] Receiver: inst 1 binc 464003926 ospid 1566 Tue Sep 24 19:09:53 2013 IPC Send timeout detected. Sender: ospid 4008 [oracle@q9db02.800best.com (PING)] ………… Tue Sep 24 20:22:00 2013 IPC Send timeout detected. Sender: ospid 4008 [oracle@q9db02.800best.com (PING)]
节点1因为不能归档hang住,节点2紧接着也就hang住。对节点1hang住之时对两个节点分别做systemstate dump,使用ass进行分析得到节点1和节点2的记录大体如下:
节点1
393:waiting for 'log file switch (archiving needed)' 394:waiting for 'log file switch (archiving needed)' Cmd: Insert 395:waiting for 'log file switch (archiving needed)' Cmd: Insert 397:waiting for 'log file switch (archiving needed)' Cmd: Insert 398:waiting for 'log file switch (archiving needed)' Cmd: Insert 451:waiting for 'SQL*Net message from client' 469:waiting for 'log file switch (archiving needed)' Cmd: Insert 470:waiting for 'log file switch (archiving needed)' Cmd: Insert 471:waiting for 'log file switch (archiving needed)' Cmd: Insert 618:waiting for 'log file switch (archiving needed)' Cmd: Insert 626:waiting for 'log file switch (archiving needed)' Cmd: Insert NO BLOCKING PROCESSES FOUND
节点2
515:waiting for 'gc buffer busy acquire' Cmd: Insert 516:waiting for 'gc buffer busy acquire' Cmd: Insert 517:waiting for 'gc buffer busy acquire' Cmd: Insert 518:waiting for 'gc buffer busy acquire' Cmd: Insert 519:waiting for 'gc buffer busy acquire' Cmd: Insert 520:waiting for 'gc buffer busy acquire' Cmd: Select 521:waiting for 'gc current request' Cmd: Insert 522:waiting for 'enq: TX - row lock contention'[Enq TX-00BA0020-001E3E3C] Cmd: Select 523:waiting for 'gc buffer busy acquire' Cmd: Insert 524:waiting for 'SQL*Net message from client' 525:waiting for 'gc buffer busy acquire' Cmd: Insert 526:waiting for 'gc buffer busy acquire' Cmd: Insert 527:waiting for 'enq: TX - row lock contention'[Enq TX-00BA0020-001E3E3C] Cmd: Select 528:waiting for 'SQL*Net message from client' 529:waiting for 'gc buffer busy acquire' Cmd: Select Resource Holder State Enq TX-0005001E-0022374F 223: waiting for 'gc current request' Enq TX-0047001B-002BCEB2 247: waiting for 'gc current request' Enq TX-015B001E-000041FF 330: waiting for 'gc current request' Enq TX-00010010-002EA7CD 179: waiting for 'gc current request' Enq TX-00BA0020-001E3E3C ??? Blocker Object Names ~~~~~~~~~~~~ Enq TX-0005001E-0022374F Enq TX-0047001B-002BCEB2 Enq TX-015B001E-000041FF Enq TX-00010010-002EA7CD Enq TX-00BA0020-001E3E3C
通过这里,我们可以明白,节点2的很多事务hang住是因为请求gc current request,而该等待是因为节点1无法归档,有些block无法正常传输到节点2,导致节点2一直hang在这里,然后就出现IPC Send timeout;节点1上的事务阻塞甚至hang住是因为无法归档导致.到此需要定位的问题是为什么节点1不能归档
继续分析节点1 alert日志
Tue Sep 24 15:18:20 2013 opidrv aborting process O000 ospid (7332) as a result of ORA-28 Immediate Kill Session#: 1904, Serial#: 1065 Immediate Kill Session: sess: 0x24a2522a38 OS pid: 7338 Immediate Kill Session#: 3597, Serial#: 11107 Immediate Kill Session: sess: 0x24c27cf498 OS pid: 7320 Tue Sep 24 15:18:23 2013 opidrv aborting process W000 ospid (7980) as a result of ORA-28 Tue Sep 24 15:18:23 2013 opidrv aborting process W001 ospid (8560) as a result of ORA-28 Tue Sep 24 15:18:35 2013 LGWR: Detected ARCH process failure LGWR: Detected ARCH process failure LGWR: Detected ARCH process failure LGWR: Detected ARCH process failure LGWR: STARTING ARCH PROCESSES Tue Sep 24 15:18:35 2013 ARC0 started with pid=66, OS id=10793 Tue Sep 24 15:18:35 2013 Errors in file /u01/app/oracle/diag/rdbms/q9db/q9db1/trace/q9db1_nsa2_12635.trc: ORA-00028: your session has been killed LNS: Failed to archive log 8 thread 1 sequence 47156 (28) ARC0: Archival started LGWR: STARTING ARCH PROCESSES COMPLETE Thread 1 advanced to log sequence 47157 (LGWR switch) ARC0: STARTING ARCH PROCESSES Current log# 9 seq# 47157 mem# 0: +DATA/q9db/onlinelog/group_9.1241.818697421 Tue Sep 24 15:18:36 2013 ARC1 started with pid=81, OS id=10805 Tue Sep 24 15:18:36 2013 ARC2 started with pid=84, OS id=10807 Tue Sep 24 15:18:36 2013 ARC3 started with pid=87, OS id=10809 ARC1: Archival started ARC2: Archival started ARC2: Becoming the 'no FAL' ARCH ARC2: Becoming the 'no SRL' ARCH ARC1: Becoming the heartbeat ARCH Error 1031 received logging on to the standby PING[ARC1]: Heartbeat failed to connect to standby 'q9adgdg'. Error is 1031. ARC3: Archival started ARC0: STARTING ARCH PROCESSES COMPLETE Archived Log entry 259135 added for thread 1 sequence 47156 ID 0x354620c2 dest 1: Error 1031 received logging on to the standby FAL[server, ARC3]: Error 1031 creating remote archivelog file 'q9adgdg' FAL[server, ARC3]: FAL archive failed, see trace file. ARCH: FAL archive failed. Archiver continuing ORACLE Instance q9db1 - Archival Error. Archiver continuing. Tue Sep 24 15:18:46 2013 opidrv aborting process O001 ospid (9605) as a result of ORA-28 Tue Sep 24 15:18:46 2013 opidrv aborting process O000 ospid (10813) as a result of ORA-28 Tue Sep 24 15:18:46 2013 Immediate Kill Session#: 2909, Serial#: 369 Immediate Kill Session: sess: 0x24226c7200 OS pid: 9091 Immediate Kill Session#: 3380, Serial#: 30271 Immediate Kill Session: sess: 0x2422782c58 OS pid: 10265 Immediate Kill Session#: 3597, Serial#: 11109 Immediate Kill Session: sess: 0x24c27cf498 OS pid: 10267 Tue Sep 24 15:20:14 2013 Restarting dead background process DIAG Tue Sep 24 15:20:14 2013 DIAG started with pid=64, OS id=20568 Restarting dead background process PING Tue Sep 24 15:20:14 2013 PING started with pid=68, OS id=20570 Restarting dead background process LMHB Tue Sep 24 15:20:14 2013 LMHB started with pid=70, OS id=20572 Restarting dead background process SMCO ………… Tue Sep 24 15:23:13 2013 ARC0: Detected ARCH process failure Tue Sep 24 15:23:13 2013 Thread 1 advanced to log sequence 47158 (LGWR switch) Current log# 10 seq# 47158 mem# 0: +DATA/q9db/onlinelog/group_10.1240.818697423 ARC0: STARTING ARCH PROCESSES ARC0: STARTING ARCH PROCESSES COMPLETE ARC0: Becoming the heartbeat ARCH ARCH: Archival stopped, error occurred. Will continue retrying ORACLE Instance q9db1 - Archival Error ORA-00028: your session has been killed
查看ARCn进程
[oracle@q9db01 ~]$ ps -ef|grep ora_ar oracle 20718 12870 0 22:07 pts/14 00:00:00 grep ora_ar [oracle@q9db01 ~]$ ps -ef|grep ora_ar oracle 25998 12870 0 22:07 pts/14 00:00:00 grep ora_ar
这里基本上明白了,因为客户的系统从15:15开始由于中间件程序异常,导致大量会话连接数据库,然后dba为了防止其他业务不受影响,然后开始大量通过alter system kill session,误杀了不少系统进程,包括ARCn(0,1,2,3)进程,在后面ARCn进程因为某种原因无法正常启动,导致redo无法归档,所有的redo组写满系统即hang住,该系统由于大量kill session已经导致了实例本身异常(正常情况ARCn进程kill之后会自动重启),处理方案:先增加redo组配合定时人工归档,等待业务低峰重启节点1,解决问题。温馨提醒:kill进程请小心
为什么不首先查看alert日志呢?
这种情况一般还是少见的吧