Category archive: Oracle RAC

has a disk HB, but no network HB: caused by traceroute failing

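A customer reported that one node of the cluster was misbehaving. A check showed that the CRS processes on it were in an abnormal state.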


Restarting CRS showed that the cssd process could not start properly.
[screenshot: no-network-hb]

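The private network was clearly at fault. Further analysis showed that the nodes could ping each other over the private network, but traceroute to the other node failed.
[screenshots: traceroute-not-work, traceroute-not-work2]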
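
Ping uses ICMP echo, while the Linux traceroute sends UDP probes by default, so "ping works, traceroute fails" hints that something between the nodes is filtering UDP. As far as I know the clusterware interconnect traffic also runs over UDP, which would fit a missing network heartbeat; that reading is an assumption, not something the post states. The sketch below is one rough way to check whether a UDP datagram makes the round trip between two hosts. The function names and the payload are hypothetical, and the echo side has to be running on the peer node first.

```python
import socket


def echo_once(sock):
    """Peer side: receive one datagram on a bound UDP socket and send it back."""
    data, addr = sock.recvfrom(1024)
    sock.sendto(data, addr)


def udp_roundtrip(host, port, payload=b"hb", timeout=1.0):
    """Probe side: send one datagram and report whether the same bytes come back.

    Returns False on timeout or on any socket error, which is what a filter
    that drops UDP on the private network would look like from here.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        try:
            s.sendto(payload, (host, port))
            data, _ = s.recvfrom(1024)
        except OSError:  # socket.timeout is a subclass of OSError
            return False
        return data == payload
```

On the peer, bind a UDP socket to the private address and a free port and call `echo_once(sock)`; on the probing node, call `udp_roundtrip(peer_private_ip, port)`. A `False` result while ping still succeeds points at UDP filtering rather than at the link itself.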

The customer said that security software had recently been installed. Once the customer stopped it, traceroute worked again.

The cluster then started normally as well:

[root@his01 cssd]# crsctl status res -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS       
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
               ONLINE  ONLINE       his01                                        
               ONLINE  ONLINE       his02                                        
ora.FRA.dg
               ONLINE  ONLINE       his01                                        
               ONLINE  ONLINE       his02                                        
ora.LISTENER.lsnr
               ONLINE  ONLINE       his01                                        
               ONLINE  ONLINE       his02                                        
ora.OCRVOTE.dg
               ONLINE  ONLINE       his01                                        
               ONLINE  ONLINE       his02                                        
ora.asm
               ONLINE  ONLINE       his01                    Started             
               ONLINE  ONLINE       his02                    Started             
ora.gsd
               OFFLINE OFFLINE      his01                                        
               OFFLINE OFFLINE      his02                                        
ora.net1.network
               ONLINE  ONLINE       his01                                        
               ONLINE  ONLINE       his02                                        
ora.ons
               ONLINE  ONLINE       his01                                        
               ONLINE  ONLINE       his02                                        
ora.registry.acfs
               ONLINE  ONLINE       his01                                        
               ONLINE  ONLINE       his02                                        
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       his02                                        
ora.cvu
      1        ONLINE  ONLINE       his02                                        
ora.his01.vip
      1        ONLINE  ONLINE       his01                                        
ora.his02.vip
      1        ONLINE  ONLINE       his02                                        
ora.oc4j
      1        ONLINE  ONLINE       his02                                        
ora.orcl.db
      1        ONLINE  ONLINE       his01                    Open                
      2        ONLINE  ONLINE       his02                    Open                
ora.scan1.vip
      1        ONLINE  ONLINE       his02                                        

12.1: manually changing the operating system time causes a database failure

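A customer running a 12.1.0.1 RAC database had the database restart unexpectedly and asked for help finding the cause.
The database alert log reported ORA-15064: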

Mon Apr 15 15:06:26 2019
WARNING: inbound connection timed out (ORA-3136)
Mon Apr 15 15:41:26 2019
NOTE: ASMB terminating
Mon Apr 15 15:41:26 2019
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_asmb_61426.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID: 
Session ID: 1892 Serial number: 29
Mon Apr 15 15:41:26 2019
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_asmb_61426.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID: 
Session ID: 1892 Serial number: 29
Mon Apr 15 15:41:26 2019
System state dump requested by (instance=1, osid=61426 (ASMB)), summary=[abnormal instance termination].
Mon Apr 15 15:41:26 2019
USER (ospid: 61426): terminating the instance due to error 15064
Mon Apr 15 15:41:26 2019
System State dumped to trace file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_diag_61287.trc
Mon Apr 15 15:41:27 2019
opiodr aborting process unknown ospid (1171) as a result of ORA-1092
Mon Apr 15 15:41:27 2019
ORA-1092 : opitsk aborting process

This shows that the ASMB failure left the database unable to reach ASM, which in turn crashed the database instance.

ASM alert log:

Mon Apr 15 15:41:26 2019
WARNING: client [+ASM1:+ASM] not responsive for 2069s; state=0x1. pid 23155
NOTE: umbilicus traces dumped to /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_gen0_23050.trc
WARNING: client [orcl1:orcl] not responsive for 2069s; state=0x1. killing pid 61436
NOTE: umbilicus traces dumped to /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_gen0_23050.trc
WARNING: fencing client [orcl1:orcl] after 2069 seconds (mbr 2)
WARNING: client [-MGMTDB:_mgmtdb] not responsive for 2070s; state=0x1. killing pid 24026
NOTE: umbilicus traces dumped to /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_gen0_23050.trc
WARNING: fencing client [-MGMTDB:_mgmtdb] after 2070 seconds (mbr 1)
Mon Apr 15 15:41:26 2019
NOTE: cleaned up ASM client -MGMTDB:_mgmtdb
NOTE: cleaned up ASM client orcl1:orcl
Mon Apr 15 15:41:43 2019
NOTE: Standard client -MGMTDB:_mgmtdb registered, osid 183707, mbr 0x1 (reg:1371965153)
Mon Apr 15 15:42:16 2019
NOTE: Standard client orcl1:orcl registered, osid 184063, mbr 0x2 (reg:2088418628)
Mon Apr 15 15:44:30 2019
Warning: VKTM detected a time drift.
Time drifts can result in an unexpected behavior such as time-outs. Please check trace file for more details.

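The ASM log matches the MOS note "GEN0 terminating the ASM instance due to error 15082" (Doc ID 2096988.1) fairly well. According to the customer, they had used NTP to change the time, so this is almost certainly Oracle Bug 19032250 (fixed in 12.1.0.2), triggered when NTP steps the clock by too large an amount. Changing the time directly by hand can cause a similar problem.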
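
The stall durations in the ASM log are worth pulling out, because they show the size of the jump ASM thinks it saw. A minimal Python sketch follows; the regular expression is written against the three warning lines quoted above and is only assumed to fit other versions' log formats.

```python
import re

# Matches e.g. "client [orcl1:orcl] not responsive for 2069s"
STALL = re.compile(r"client \[([^\]]+)\] not responsive for (\d+)s")

SAMPLE = """\
WARNING: client [+ASM1:+ASM] not responsive for 2069s; state=0x1. pid 23155
WARNING: client [orcl1:orcl] not responsive for 2069s; state=0x1. killing pid 61436
WARNING: client [-MGMTDB:_mgmtdb] not responsive for 2070s; state=0x1. killing pid 24026
"""


def stalled_clients(log_text):
    """Return {client: seconds} for every 'not responsive for Ns' warning."""
    return {m.group(1): int(m.group(2)) for m in STALL.finditer(log_text)}


if __name__ == "__main__":
    for client, seconds in stalled_clients(SAMPLE).items():
        print(f"{client}: {seconds}s (~{seconds / 60:.1f} min)")
```

All three clients show the same stall of roughly 34.5 minutes, reported in the same second. That looks more like one clock step seen by every client than like three independent hangs.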

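Recommendations for changing the time on RAC
1. If the clock is behind: shut down the database and the cluster, set the clock forward directly, then start the cluster and the database.
2. If the clock is ahead: shut down the database and the cluster, wait until the real time has passed the (clock) time at which the cluster and database were shut down, then set the clock back and start the cluster and the database.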
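
The two rules above can be written down as a small planning helper. This is only a sketch of the advice in this post, not an Oracle procedure; `plan_clock_fix` and the wording of the steps are made up for illustration. The point it makes explicit is that when the clock is ahead, the waiting time after shutdown is at least the clock offset.

```python
from datetime import datetime, timedelta


def plan_clock_fix(system_now, real_now):
    """Return the ordered steps for correcting the node clock.

    system_now: what the node's clock currently reads
    real_now:   the correct time from a trusted source
    """
    offset = system_now - real_now
    if offset == timedelta(0):
        return []
    steps = ["stop the database and the cluster"]
    if offset > timedelta(0):
        # Clock is ahead: never let the stack restart at a time earlier than
        # the one it was running at, so wait out the offset first.
        steps.append(f"wait at least {offset} after the shutdown")
        steps.append("set the clock back to the real time")
    else:
        # Clock is behind: moving forward is safe once everything is down.
        steps.append("set the clock forward to the real time")
    steps.append("start the cluster and the database")
    return steps


if __name__ == "__main__":
    real = datetime(2019, 4, 15, 15, 0, 0)
    for step in plan_clock_fix(real + timedelta(minutes=35), real):
        print(step)
```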


Aftermath of a directly cabled private network: one node failing to boot keeps HAIP from starting on the other node

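This case is a two-node RAC (11.2.0.4) whose private network is cabled directly between the two nodes. One node's host failed and could not boot, and when the cluster was started on the other node, HAIP could not start: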

# crsctl stat res -t -init
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
     1        ONLINE  ONLINE       xifenfei1                  Started                     
ora.cluster_interconnect.haip                                                      >>>>  OFFLINE
     1        ONLINE  OFFLINE
ora.crf
     1        ONLINE  ONLINE       xifenfei1
ora.crsd
     1        ONLINE  OFFLINE                                                      >>>>  OFFLINE
ora.cssd
     1        ONLINE  ONLINE       xifenfei1
ora.cssdmonitor
     1        ONLINE  ONLINE       xifenfei1
ora.ctssd
     1        ONLINE  ONLINE       xifenfei1                  OBSERVER
ora.diskmon
     1        OFFLINE OFFLINE
ora.drivers.acfs
     1        ONLINE  ONLINE       xifenfei1
ora.evmd
     1        ONLINE  INTERMEDIATE xifenfei1
ora.gipcd
     1        ONLINE  ONLINE       xifenfei1
ora.gpnpd
     1        ONLINE  ONLINE       xifenfei1
ora.mdnsd
     1        ONLINE  ONLINE       xifenfei1
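
With longer listings it is easy to miss a resource whose STATE does not match its TARGET. Here is a rough Python sketch that scans the text of `crsctl stat res -t` (with or without `-init`) for such rows. The parsing is an assumption based on the 11.2 layout shown in this post, and the function name is hypothetical.

```python
def mismatched_resources(crsctl_text):
    """Return (name, target, state) for each row whose STATE differs from TARGET."""
    known = {"ONLINE", "OFFLINE", "INTERMEDIATE", "UNKNOWN"}
    name = None
    found = []
    for line in crsctl_text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0].startswith("ora."):
            # Resource name line; the state rows that follow belong to it.
            name = tokens[0]
            continue
        if tokens[0].isdigit():
            # Cluster resources carry a leading instance number; drop it.
            tokens = tokens[1:]
        if name and len(tokens) >= 2 and tokens[0] in known and tokens[1] in known:
            if tokens[0] != tokens[1]:
                found.append((name, tokens[0], tokens[1]))
    return found
```

Fed the listing above, this should report `ora.cluster_interconnect.haip` and `ora.crsd` (ONLINE target, OFFLINE state) and also `ora.evmd`, whose state is INTERMEDIATE.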

alert<hostname>.log:

2018-09-02 10:38:56.767: 
[/u01/app/11.2.0/grid/bin/orarootagent.bin(7866)]CRS-5818:Aborted command 'start' for resource 'ora.cluster_interconnect.haip'. Details at (:CRSAGF00113:) {0:0:2} in /u01/app/11.2.0/grid/log/xifenfei1/agent/ohasd/orarootagent_root/orarootagent_root.log.
2018-09-02 10:39:00.771: 
[ohasd(7495)]CRS-2757:Command 'Start' timed out waiting for response from the resource 'ora.cluster_interconnect.haip'. Details at (:CRSPE00111:) {0:0:2} in /u01/app/11.2.0/grid/log/xifenfei1/ohasd/ohasd.log.
2018-09-02 10:40:00.802: 
[/u01/app/11.2.0/grid/bin/orarootagent.bin(7866)]CRS-5818:Aborted command 'start' for resource 'ora.cluster_interconnect.haip'. Details at (:CRSAGF00113:) {0:0:2} in /u01/app/11.2.0/grid/log/xifenfei1/agent/ohasd/orarootagent_root/orarootagent_root.log.
2018-09-02 10:40:04.806: 
[ohasd(7495)]CRS-2757:Command 'Start' timed out waiting for response from the resource 'ora.cluster_interconnect.haip'. Details at (:CRSPE00111:) {0:0:2} in /u01/app/11.2.0/grid/log/xifenfei1/ohasd/ohasd.log.

orarootagent_root log:

2018-09-02 10:37:56.805: [ USRTHRD][3650455296]{0:0:2} No HAIP info configured in GPNP, using defaults
2018-09-02 10:37:56.805: [ USRTHRD][3650455296]{0:0:2} The final CIDR subnet 169.254/16
2018-09-02 10:37:56.805: [ default][3650455296]clsvactversion:4: Retrieving Active Version from local storage.
2018-09-02 10:37:56.809: [ USRTHRD][3650455296]{0:0:2} HAIP: mbr num is 0.
[   CLWAL][3650455296]clsw_Initialize: OLR initlevel [70000]
2018-09-02 10:37:56.843: [ USRTHRD][3650455296]{0:0:2} HAIP: initializing to 1 interfaces
2018-09-02 10:37:56.844: [ USRTHRD][3650455296]{0:0:2} HAIP: configured to use 1 interfaces

gipcd.log:

2018-09-02 10:38:56.787: [ CLSINET][2477147904] Returning NETDATA: 0 interfaces
2018-09-02 10:38:56.988: [GIPCDCLT][2477147904] gipcdClientInterfaceRequest: sent local interface list back to client
2018-09-02 10:38:56.822: [GIPCHDEM][2468742912] gipchaDaemonInfRequest: sent local interfaceRequest,  hctx 0x1369730 [0000000000000010] { gipchaContext : host 'xifenfei1', name 'gipcd_ha_name', luid '184dd356-00000000', numNode 0, numInf 0, usrFlags 0x0, flags 0x63 } to gipcd
2018-09-02 10:38:56.822: [GIPCDCLT][2477147904] gipcdClientThread: req from local client of type gipcdmsgtypeInterfaceRequest, endp 00000000000002cb
2018-09-02 10:38:56.822: [GIPCDCLT][2477147904] gipcdClientInterfaceRequest: Received type(gipcdmsgtypeInterfaceRequest), endp(00000000000002cb), len(1032), buf(0x7fab858b7a78):[hostname(xifenfei1), retStatus(gipcretSuccess)]
2018-09-02 10:38:56.822: [GIPCDCLT][2477147904] gipcdClientInterfaceQueryToMonitor: enqueue local interface query (2) to worklist
2018-09-02 10:38:56.823: [GIPCDCLT][2477147904] gipcdClientInterfaceRequest: sent local interface query
2018-09-02 10:38:56.823: [GIPCDMON][2472945408] gipcdMonitorCheckXfer: set new infQuery
2018-09-02 10:38:56.831: [ GIPCLIB][2477147904] gipclibSetTraceLevel: to set level to 0

ohasd.log:

2018-09-02 10:38:52.494: [GIPCHDEM][1878710016]gipchaDaemonInfRequest: sent local interfaceRequest,  hctx 0x2749eb0 [0000000000000010] { gipchaContext : host 'xifenfei1', name 'CLSFRAME_oracler-cluster', luid '47624c02-00000000', numNode 0, numInf 0, usrFlags 0x0, flags 0x63 } to gipcd
2018-09-02 10:38:57.255: [    AGFW][3305629440]{0:0:2} Received the reply to the message: RESOURCE_START[ora.cluster_interconnect.haip 1 1] ID 4098:502 from the agent /u01/app/11.2.0/grid/bin/orarootagent_root
2018-09-02 10:38:57.255: [    AGFW][3305629440]{0:0:2} Agfw Proxy Server sending the reply to PE for message:RESOURCE_START[ora.cluster_interconnect.haip 1 1] ID 4098:500
2018-09-02 10:38:57.255: [   CRSPE][3295123200]{0:0:2} Received reply to action [Start] message ID: 500
2018-09-02 10:38:57.256: [   CRSPE][3295123200]{0:0:2} Got agent-specific msg: CRS-5017: The resource action "ora.cluster_interconnect.haip start" encountered the following error: 
Start action for HAIP aborted. For details refer to "(:CLSN00107:)" in "/u01/app/11.2.0/grid/log/xifenfei1/agent/ohasd/orarootagent_root/orarootagent_root.log".
2018-09-02 10:38:57.500: [GIPCHDEM][1878710016]gipchaDaemonInfRequest: sent local interfaceRequest,  hctx 0x2749eb0 [0000000000000010] { gipchaContext : host 'xifenfei1', name 'CLSFRAME_oracler-cluster', luid '47624c02-00000000', numNode 0, numInf 0, usrFlags 0x0, flags 0x63 } to gipcd

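Checking the private network showed that the link on eth1 was down: the private network is cabled directly to the other machine, and that machine could not boot.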

[root@xifenfei1 rules.d]# ethtool eth1
Settings for eth1:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full 
                                100baseT/Half 100baseT/Full 
                                1000baseT/Full 
        Supported pause frame use: Symmetric
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full 
                                100baseT/Half 100baseT/Full 
                                1000baseT/Full 
        Advertised pause frame use: Symmetric
        Advertised auto-negotiation: Yes
        Speed: Unknown!
        Duplex: Unknown! (255)
        Port: Twisted Pair
        PHYAD: 1
        Transceiver: internal
        Auto-negotiation: on
        MDI-X: Unknown
        Supports Wake-on: d
        Wake-on: d
        Current message level: 0x00000007 (7)
                               drv probe link
        Link detected: no   ====> NIC link state is abnormal


[root@xifenfei1 rules.d]# ifconfig
eth0      Link encap:Ethernet  HWaddr 6C:92:BF:2B:7B:36  
          inet addr:10.10.17.42  Bcast:172.17.17.255  Mask:255.255.255.0
          inet6 addr: fe80::6e92:bfff:fe2b:7b36/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1     ---------> note
          RX packets:234424 errors:0 dropped:0 overruns:0 frame:0
          TX packets:160916 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:16926236 (16.1 MiB)  TX bytes:24269882 (23.1 MiB)
          Memory:91160000-91180000 

eth1      Link encap:Ethernet  HWaddr 6C:92:BF:2B:7B:37  
          inet addr:11.1.1.2  Bcast:11.1.1.255  Mask:255.255.255.0
          UP BROADCAST MULTICAST  MTU:1500  Metric:1      ---------> note that RUNNING is missing
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
          Memory:91140000-91160000 
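
The tell-tale sign in the `ifconfig` output is an interface flagged UP but not RUNNING, meaning it is configured but has no carrier. A small Python sketch that picks such interfaces out of classic net-tools `ifconfig` text is below. It assumes the old `MTU:1500` style of output shown here (newer `ifconfig` versions print flags differently), and the function name is hypothetical.

```python
SAMPLE = """\
eth0      Link encap:Ethernet  HWaddr 6C:92:BF:2B:7B:36
          inet addr:10.10.17.42  Bcast:172.17.17.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

eth1      Link encap:Ethernet  HWaddr 6C:92:BF:2B:7B:37
          inet addr:11.1.1.2  Bcast:11.1.1.255  Mask:255.255.255.0
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
"""


def up_but_not_running(ifconfig_text):
    """Return names of interfaces whose flags contain UP but not RUNNING."""
    result = []
    current = None
    for line in ifconfig_text.splitlines():
        if line and not line[0].isspace():
            # An unindented line starts a new interface stanza.
            current = line.split()[0]
        elif current and "MTU:" in line:
            # The flags are the words before "MTU:" on the flags line.
            flags = line.split("MTU:")[0].split()
            if "UP" in flags and "RUNNING" not in flags:
                result.append(current)
    return result


if __name__ == "__main__":
    print(up_but_not_running(SAMPLE))
```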

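For the MOS description of a NIC link fault preventing HAIP from starting, see: CRSD & HAIP Resources Remain In OFFLINE as Private Network Interface is Partially Up (Doc ID 1529721.1). This case is a direct consequence of cabling the private network of an 11.2 cluster back to back (a direct connection for the cluster private network is strongly discouraged).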
