分类目录归档:AIX

pvid=yes导致asm无法mount

今天凌晨接到客户恢复请求,对于aix rac,两个ibm存储做mirror的环境中,客户做存储容灾演练,发现磁盘的名称发生改变,然后对其中一个磁盘设置pvid,结果悲剧了导致asm一个磁盘组无法正常起来。然后又aix端删除这些设备,然后重新扫描设备。结果不是一个磁盘组不能mount,而是整个gi就无法正常启动。希望我们给予技术支持。
查看asm 日志,确定asm disk信息
asm-disk1
asm-disk2


从这里可以确定,一共有两个asm diskgroup,每个group有两个磁盘,hdisk2和hdisk3 为hisdata,hdisk4,和hdisk5为emrdata.

使用kfed分析磁盘头

dd if=/dev/rhdisk2 of=/tmp/xifenfei/rhdisk2.dd bs=1024k count=10
dd if=/dev/rhdisk3 of=/tmp/xifenfei/rhdisk3.dd bs=1024k count=10
dd if=/dev/rhdisk4 of=/tmp/xifenfei/rhdisk4.dd bs=1024k count=10
dd if=/dev/rhdisk5 of=/tmp/xifenfei/rhdisk5.dd bs=1024k count=10
--传输到我电脑上分析

C:\Users\FAL>kfed read H:\temp\xifenfei\tmp\xifenfei\rhdisk2.dd|grep name
kfdhdb.dskname:            HISDATA_0000 ; 0x028: length=12
kfdhdb.grpname:                 HISDATA ; 0x048: length=7
kfdhdb.fgname:             HISDATA_0000 ; 0x068: length=12
kfdhdb.capname:                         ; 0x088: length=0

C:\Users\FAL>kfed read H:\temp\xifenfei\tmp\xifenfei\rhdisk3.dd|grep name
kfdhdb.dskname:            HISDATA_0001 ; 0x028: length=12
kfdhdb.grpname:                 HISDATA ; 0x048: length=7
kfdhdb.fgname:             HISDATA_0001 ; 0x068: length=12
kfdhdb.capname:                         ; 0x088: length=0

C:\Users\FAL>kfed read H:\temp\xifenfei\tmp\xifenfei\rhdisk4.dd|grep name
kfdhdb.dskname:            EMRDATA_0000 ; 0x028: length=12
kfdhdb.grpname:                 EMRDATA ; 0x048: length=7
kfdhdb.fgname:             EMRDATA_0000 ; 0x068: length=12
kfdhdb.capname:                         ; 0x088: length=0

C:\Users\FAL>kfed read H:\temp\xifenfei\tmp\xifenfei\rhdisk5.dd|grep name

C:\Users\FAL>kfed read H:\temp\xifenfei\tmp\xifenfei\rhdisk5.dd
kfbh.endian:                        201 ; 0x000: 0xc9
kfbh.hard:                          194 ; 0x001: 0xc2
kfbh.type:                          212 ; 0x002: *** Unknown Enum ***
kfbh.datfmt:                        193 ; 0x003: 0xc1
kfbh.block.blk:                       0 ; 0x004: blk=0
kfbh.block.obj:                       0 ; 0x008: file=0
kfbh.check:                           0 ; 0x00c: 0x00000000
kfbh.fcn.base:                        0 ; 0x010: 0x00000000
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
000000000 C1D4C2C9 00000000 00000000 00000000  [................]
000000010 00000000 00000000 00000000 00000000  [................]
  Repeat 254 times
KFED-00322: Invalid content encountered during block traversal: [kfbtTraverseBlock][Invalid OSM block type][][212]

C:\Users\FAL>kfed read H:\temp\xifenfei\tmp\xifenfei\rhdisk5.dd blkn=2|grep kfbh
kfbh.endian:                          0 ; 0x000: 0x00
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                            3 ; 0x002: KFBTYP_ALLOCTBL
kfbh.datfmt:                          2 ; 0x003: 0x02
kfbh.block.blk:                33554432 ; 0x004: blk=33554432
kfbh.block.obj:                16777344 ; 0x008: file=128
kfbh.check:                  2654889601 ; 0x00c: 0x9e3e6681
kfbh.fcn.base:               1696071680 ; 0x010: 0x65180000
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000

C:\Users\FAL>kfed read H:\temp\xifenfei\tmp\xifenfei\rhdisk5.dd blkn=510|grep name
kfdhdb.dskname:            EMRDATA_0001 ; 0x028: length=12
kfdhdb.grpname:                 EMRDATA ; 0x048: length=7
kfdhdb.fgname:             EMRDATA_0001 ; 0x068: length=12
kfdhdb.capname:                         ; 0x088: length=0

通过上述分析,基本上确定由于对hdisk5设置了pvid导致该asm disk的磁盘头损坏.这个可以直接使用asm repair功能修复(注意要clear pvid)

C:\Users\FAL>kfed read H:\temp\xifenfei\tmp\xifenfei\rhdisk5.dd |grep name
kfdhdb.dskname:            EMRDATA_0001 ; 0x028: length=12
kfdhdb.grpname:                 EMRDATA ; 0x048: length=7
kfdhdb.fgname:             EMRDATA_0001 ; 0x068: length=12
kfdhdb.capname:                         ; 0x088: length=0

启动crs到cssd进程报错分析
1. 由于删除磁盘,扫描设备导致hdisk[2-5] 权限和用户组不对
2. 由于删除,扫描磁盘导致磁盘共享模式不对
修复磁盘头和解决这两个问题之后,gi启动正常,磁盘组也正常mount,数据库也正常启动,数据0丢失,至此完美恢复
oracle-open


类似客户恢复案例:asm disk误设置pvid导致asm diskgroup无法mount恢复
如果您遇到此类情况,无法解决请联系我们,提供专业ORACLE数据库恢复技术支持
Phone:17813235971    Q Q:107644445QQ咨询惜分飞    E-Mail:dba@xifenfei.com

发表在 AIX, Oracle备份恢复 | 标签为 , , , , , | 评论关闭

aix中procmap 查看oracle进程占用系统内存

procmap是用来显示进程地址空间,通过这个命令找出来的“read/write”表示为进程的私有内存,如果对应到oracle 进程的LOCAL中来,也就是对应了是oracle 会话进程占用的操作系统内存,和sga与pga无关,即ORACLE数据库进程占用的额外的系统内存,在计算oracle数据库消耗内存的时候,要考虑sga+pga+process占用的内存
procmap命令使用

$procmap 7931354
7931354 : oracleccicdx (LOCAL=NO) 
100000000            95504K  read/exec         oracle
110000035             2399K  read/write        oracle
9fffffff0000000         51K  read/exec         /usr/ccs/bin/usla64
9fffffff000cfe2          0K  read/write        /usr/ccs/bin/usla64
900000000b05930          2K  read/exec         /usr/lib/libC.a[shr3_64.o]
9001000a0122930          0K  read/write        /usr/lib/libC.a[shr3_64.o]
900000000ae6b00        118K  read/exec         /usr/lib/libC.a[shrcore_64.o]
9001000a030a100         12K  read/write        /usr/lib/libC.a[shrcore_64.o]
900000000ac8000        118K  read/exec         /usr/lib/libC.a[ansicore_64.o]
9001000a0300e00         36K  read/write        /usr/lib/libC.a[ansicore_64.o]
900000000411468          0K  read/exec         /usr/lib/libicudata.a[shr_64.o]
9001000a0121468          0K  read/write        /usr/lib/libicudata.a[shr_64.o]
90000000040f738          2K  read/exec         /usr/lib/libC.a[shr2_64.o]
9001000a0314738          0K  read/write        /usr/lib/libC.a[shr2_64.o]
9000000008dd800       1699K  read/exec         /usr/lib/libC.a[ansi_64.o]
9001000a0315a00        277K  read/write        /usr/lib/libC.a[ansi_64.o]
9000000008bab00        135K  read/exec         /usr/lib/libC.a[shr_64.o]
9001000a030eb00         19K  read/write        /usr/lib/libC.a[shr_64.o]
900000000708180       1732K  read/exec         /usr/lib/libicuuc.a[shr_64.o]
9001000a035cdac        180K  read/write        /usr/lib/libicuuc.a[shr_64.o]
900000000493d80       2510K  read/exec         /usr/lib/libicui18n.a[shr_64.o]
9001000a038a148        270K  read/write        /usr/lib/libicui18n.a[shr_64.o]
900000000473200         91K  read/exec         /usr/lib/libsrc.a[shr_64.o]
9001000a01127a8         55K  read/write        /usr/lib/libsrc.a[shr_64.o]
90000000045a300         98K  read/exec         /usr/lib/libcorcfg.a[shr_64.o]
9001000a04147c8         18K  read/write        /usr/lib/libcorcfg.a[shr_64.o]
900000000b16200        750K  read/exec         /usr/lib/liblvm.a[shr_64.o]
9001000a03dd028        219K  read/write        /usr/lib/liblvm.a[shr_64.o]
900000000444f00         82K  read/exec         /usr/lib/libcfg.a[shr_64.o]
9001000a03d58f0         26K  read/write        /usr/lib/libcfg.a[shr_64.o]
90000000040e3a0          2K  read/exec         /usr/lib/libcrypt.a[shr_64.o]
9001000a0106948          0K  read/write        /usr/lib/libcrypt.a[shr_64.o]
90000001615d860          5K  read/exec         /usr/lib/libc.a[aio_64.o]
9001000a3aed568          0K  read/write        /usr/lib/libc.a[aio_64.o]
9000000003efc00        120K  read/exec         /usr/lib/libodm.a[shr_64.o]
9001000a0107cc8         40K  read/write        /usr/lib/libodm.a[shr_64.o]
900000000bd2c80        147K  read/exec         /usr/lib/libperfstat.a[shr_64.o]
9001000a041a960         14K  read/write        /usr/lib/libperfstat.a[shr_64.o]
9000000017d7000          0K  read/exec         /usr/lib/libdl.a[shr_64.o]
9001000a0517000          0K  read/write        /usr/lib/libdl.a[shr_64.o]
9000000158ed100       8636K  read/exec         /oracle/product/db10gr2/lib/libjox10.a[shr.o]
8001000a0000b78        587K  read/write        /oracle/product/db10gr2/lib/libjox10.a[shr.o]
900000000a87000        257K  read/exec         /usr/lib/libpthreads.a[shr_xpg5_64.o]
9001000a0274000        559K  read/write        /usr/lib/libpthreads.a[shr_xpg5_64.o]
900000000000800       4025K  read/exec         /usr/lib/libc.a[shr_64.o]
9001000a0000020       1047K  read/write        /usr/lib/libc.a[shr_64.o]
         Total      121863K

简化命令,统计私有内存,procmap 7931354|grep “read/write” |awk -F ” ” ‘{print $2}’,通过相关计算的出来,在当前的操作系统和数据库版本中,一个LOCAL=NO进程占用系统内存为:5758KB

补充说明
1.操作系统版本

$oslevel -r
6100-06

2.数据库版本

SQL> select * from v$version;

BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bi
PL/SQL Release 10.2.0.4.0 - Production
CORE    10.2.0.4.0      Production
TNS for IBM/AIX RISC System/6000: Version 10.2.0.4.0 - Productio
NLSRTL Version 10.2.0.4.0 - Production

3.通过跟踪多个LOCAL=NO进程,发现类似进程占用的系统内存相同,估算给系统oracle进程占用的内存,可以通过该值进行大概估算
4.确认ORACLE使用的内存量不是以往认识的sga+pga,实际上应该是sga+pga+所有oracle进程占用
5.在linux中使用pmap来查看

发表在 AIX, Oracle | 2 条评论

ORACLE在AIX中产生SOFTWARE PROGRAM ABNORMALLY TERMINATED警告原因

数据库中发现如下错误
该错误的解决方案:ORA-07445[dbgrmqmqpk_query_pick_key()+0f88]

Dump file /oracle/diag/rdbms/sgerp5/sgerp5/incident/incdir_579300/sgerp5_m000_7602504_i579300.trc
Oracle Database 11g Enterprise Edition Release 11.1.0.6.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
ORACLE_HOME = /oracle/product/11.1.0/db_1
System name:    AIX
Node name:  sgerp5
Release:    1
Version:    6
Machine:    00C8F0564C00
Instance name: sgerp5
Redo thread mounted by this instance: 1
Oracle process number: 138
Unix process pid: 7602504, image: oracle@sgerp5 (m000)
 
*** 2012-05-11 03:52:35.200
*** SESSION ID:(752.5029) 2012-05-11 03:52:35.200
*** CLIENT ID:() 2012-05-11 03:52:35.200
*** SERVICE NAME:(SYS$BACKGROUND) 2012-05-11 03:52:35.200
*** MODULE NAME:(MMON_SLAVE) 2012-05-11 03:52:35.200
*** ACTION NAME:(Auto-Purge Slave Action) 2012-05-11 03:52:35.200
 
Dump continued from file: /oracle/diag/rdbms/sgerp5/sgerp5/trace/sgerp5_m000_7602504.trc
ORA-07445: exception encountered: core dump [dbgrmqmqpk_query_pick_key()+0f88] 
[SIGSEGV] [ADDR:0xB38F0000000049][PC:0x100213C08] [Address not mapped to object] []

errpt错误说明
在产生7445错误的同时观察aix系统错误日志发现SOFTWARE PROGRAM ABNORMALLY TERMINATED错误

sgerp5_[oracle]-->errpt -aj A924A5FC
---------------------------------------------------------------------------
LABEL:          CORE_DUMP
IDENTIFIER:     A924A5FC

Date/Time:       Fri May 11 03:52:55 BEIST 2012
Sequence Number: 471
Machine Id:      00C8F0564C00
Node Id:         sgerp5
Class:           S
Type:            PERM
WPAR:            Global
Resource Name:   SYSPROC         

Description
SOFTWARE PROGRAM ABNORMALLY TERMINATED

Probable Causes
SOFTWARE PROGRAM

User Causes
USER GENERATED SIGNAL

        Recommended Actions
        CORRECT THEN RETRY

Failure Causes
SOFTWARE PROGRAM

        Recommended Actions
        RERUN THE APPLICATION PROGRAM
        IF PROBLEM PERSISTS THEN DO THE FOLLOWING
        CONTACT APPROPRIATE SERVICE REPRESENTATIVE

Detail Data
SIGNAL NUMBER
           6
USER'S PROCESS ID:
               7602504
FILE SYSTEM SERIAL NUMBER
          14
INODE NUMBER
           0      367648
CORE FILE NAME
/oracle/diag/rdbms/sgerp5/sgerp5/cdump/core_7602504/core
PROGRAM NAME
oracle
STACK EXECUTION DISABLED
           0
COME FROM ADDRESS REGISTER
sskgmcrea 0

PROCESSOR ID
  hw_fru_id: 1
  hw_cpu_id: 2

ADDITIONAL INFORMATION
skgdbgcra 224
??
ksdbgcra 3D0
ssexhd 978
??

Symptom Data
REPORTABLE
1
INTERNAL ERROR
0
SYMPTOM CODE
PCSS/SPI2 FLDS/oracle SIG/6 FLDS/skgdbgcra VALU/224

错误原因

This error is logged when a software program abnormally ends and causes a core dump. Users might
not be exiting applications correctly, the system might have been shut down while users were
working in application, or the user's terminal might have locked up and the application stopped
1)这里也就是说如果oracle进程在aix机器上异常终止,并且产生了一个core dump文件,
  就会出现SOFTWARE PROGRAM ABNORMALLY TERMINATED警告信息
2)用户登录系统没有正常退出,而系统被关闭
3)用户强制终止一个一个lock,而导致进程停止

本次AIX日志警告原因:由于进程7602504异常终止(ORA-07445错误)并且产生了 /oracle/diag/rdbms/sgerp5/sgerp5/cdump/core_7602504/core dump 文件,从而有了AIX中的SOFTWARE PROGRAM ABNORMALLY TERMINATED警告信息

发表在 AIX, Oracle | 2 条评论