联系:手机/微信(+86 17813235971) QQ(107644445)
作者:惜分飞©版权所有[未经本人同意,不得以任何形式转载,否则有进一步追究法律责任的权利.]
接到客户恢复请求:把oracle asm datagroup中的一个磁盘增加到vg中,现在磁盘组无法mount,数据库无法正常启动.远程登录现场进行分析发现情况如下:
操作系统层面分析
history操作记录
这里比较明显把一个磁盘做成pv,并且加入到vg中,然后再分配199G给lv_home,系统层面分析lvm信息
--查看pv信息 [root@xff1 ~]# pvdisplay --- Physical volume --- PV Name /dev/sda2 VG Name VolGroup PV Size 277.98 GiB / not usable 3.00 MiB Allocatable yes (but full) PE Size 4.00 MiB Total PE 71161 Free PE 0 Allocated PE 71161 PV UUID F6QO3f-065n-mwTW-Xbq2-Xx2y-c8HD-Tkr7V7 --- Physical volume --- PV Name /dev/sdg <----新加入的磁盘 VG Name VolGroup PV Size 200.00 GiB / not usable 4.00 MiB Allocatable yes PE Size 4.00 MiB Total PE 51199 Free PE 255 Allocated PE 50944 PV UUID i69vUG-nCIK-dtxL-FvpD-2WZd-bvLv-n7lwrb [root@xff1 ~]# lvdisplay --- Logical volume --- LV Path /dev/VolGroup/lv_root LV Name lv_root VG Name VolGroup LV UUID JUNnkN-m4zq-D0gh-h42b-cUM1-Wh1q-ZMtQE4 LV Write Access read/write LV Creation host, time localhost.localdomain, 2017-07-19 20:08:47 +0800 LV Status available # open 1 LV Size 50.00 GiB Current LE 12800 Segments 1 Allocation inherit Read ahead sectors auto - currently set to 256 Block device 253:0 --- Logical volume --- LV Path /dev/VolGroup/lv_home LV Name lv_home VG Name VolGroup LV UUID eZTkLt-cNGX-371i-m8Bd-VdD9-q6Hz-wYDRIJ LV Write Access read/write LV Creation host, time localhost.localdomain, 2017-07-19 20:08:54 +0800 LV Status available # open 1 LV Size 422.97 GiB <-----lv大小变成422G,应该是被扩了199G后结果 Current LE 108281 Segments 2 Allocation inherit Read ahead sectors auto - currently set to 256 Block device 253:2 --- Logical volume --- LV Path /dev/VolGroup/lv_swap LV Name lv_swap VG Name VolGroup LV UUID 54P9ok-VpwO-zM68-hvwY-9GBf-89yb-8xQAMn LV Write Access read/write LV Creation host, time localhost.localdomain, 2017-07-19 20:09:23 +0800 LV Status available # open 1 LV Size 4.00 GiB Current LE 1024 Segments 1 Allocation inherit Read ahead sectors auto - currently set to 256 Block device 253:1 [root@xff1 ~]# df -h Filesystem Size Used Avail Use% Mounted on /dev/mapper/VolGroup-lv_root 50G 3.9G 43G 9% / tmpfs 63G 509M 63G 1% /dev/shm /dev/sda1 477M 44M 408M 10% /boot /dev/mapper/VolGroup-lv_home 417G 226G 170G 58% /home <----增加了199g空间,剩余只剩170G,证明增加空间之后最少使用了30G以上
基于这样的情况,基本上可以确定sdg盘加入VolGroup中并且被分配给 lv_home中,而且还写入了数据(/home空闲空间只剩余170G,lv_home当时扩了199G).
asm层面分析
asm磁盘组无法mount,提示缺少一块磁盘
SQL> ALTER DISKGROUP DATA MOUNT /* asm agent *//* {1:12056:279} */ NOTE: cache registered group DATA number=1 incarn=0xa1dbff16 NOTE: cache began mount (first) of group DATA number=1 incarn=0xa1dbff16 NOTE: Assigning number (1,2) to disk (/dev/asmdisk3) NOTE: Assigning number (1,1) to disk (/dev/asmdisk2) Sat Apr 25 13:04:58 2020 ERROR: no read quorum in group: required 1, found 0 disks NOTE: cache dismounting (clean) group 1/0xA1DBFF16 (DATA) NOTE: messaging CKPT to quiesce pins Unix process pid: 81552, image: oracle@rac2db1 (TNS V1-V3) NOTE: dbwr not being msg'd to dismount NOTE: lgwr not being msg'd to dismount NOTE: cache dismounted group 1/0xA1DBFF16 (DATA) NOTE: cache ending mount (fail) of group DATA number=1 incarn=0xa1dbff16 NOTE: cache deleting context for group DATA 1/0xa1dbff16 GMON dismounting group 1 at 19 for pid 30, osid 81552 NOTE: Disk DATA_0001 in mode 0x9 marked for de-assignment NOTE: Disk DATA_0002 in mode 0x9 marked for de-assignment ERROR: diskgroup DATA was not mounted ORA-15032: not all alterations performed ORA-15017: diskgroup "DATA" cannot be mounted ORA-15040: diskgroup is incomplete ERROR: ALTER DISKGROUP DATA MOUNT /* asm agent *//* {1:12056:279} */
报错比较明显asm disk磁盘头被lvm的信息取代(因为asm disk 被加入到vg中),根据前面的分析,该磁盘被写入数据很可能超过30G,使用kfed分析一个随意au,确认被破坏,证明开始判断基本正确
root@xff1:/home/oracle11g$kfed read /dev/asmdisk1 aun=10000 kfbh.endian: 51 ; 0x000: 0x33 kfbh.hard: 55 ; 0x001: 0x37 kfbh.type: 32 ; 0x002: *** Unknown Enum *** kfbh.datfmt: 42 ; 0x003: 0x2a kfbh.block.blk: 1329801248 ; 0x004: blk=1329801248 kfbh.block.obj: 1128615502 ; 0x008: file=347726 kfbh.check: 1094999892 ; 0x00c: 0x41445f54 kfbh.fcn.base: 675103060 ; 0x010: 0x283d4154 kfbh.fcn.wrap: 1448232275 ; 0x014: 0x56524553 kfbh.spare1: 1598374729 ; 0x018: 0x5f454349 kfbh.spare2: 1162690894 ; 0x01c: 0x454d414e 7F7843EAD400 2A203733 4F432820 43454E4E 41445F54 [37 * (CONNECT_DA] 7F7843EAD410 283D4154 56524553 5F454349 454D414E [TA=(SERVICE_NAME] 7F7843EAD420 6361723D 29626432 44494328 5250283D [=rac2db)(CID=(PR] 7F7843EAD430 4152474F 3A443D4D 4341505C DFCF3153 [OGRAM=D:\PACS1..] 7F7843EAD440 B3BEB7BB 6369445C 65536D6F 72657672 [....\DicomServer] 7F7843EAD450 445C524D 6D6F6369 76726553 524D7265 [MR\DicomServerMR] 7F7843EAD460 6578652E 4F482829 573D5453 362D4E49 [.exe)(HOST=WIN-6] 7F7843EAD470 51414C38 54553645 28294A30 52455355 [8LAQE6UT0J)(USER] 7F7843EAD480 6D64413D 73696E69 74617274 2929726F [=Administrator))] 7F7843EAD490 202A2029 44444128 53534552 5250283D [) * (ADDRESS=(PR] 7F7843EAD4A0 434F544F 743D4C4F 28297063 54534F48 [OTOCOL=tcp)(HOST] 7F7843EAD4B0 2E30313D 2E303831 30332E31 4F502829 [=10.180.1.30)(PO] 7F7843EAD4C0 343D5452 37333539 2A202929 74736520 [RT=49537)) * est] 7F7843EAD4D0 696C6261 2A206873 63617220 20626432 [ablish * rac2db ] 7F7843EAD4E0 3231202A 0A343135 2D534E54 31353231 [* 12514.TNS-1251] 7F7843EAD4F0 54203A34 6C3A534E 65747369 2072656E [4: TNS:listener ] 7F7843EAD500 73656F64 746F6E20 72756320 746E6572 [does not current] 7F7843EAD510 6B20796C 20776F6E 7320666F 69767265 [ly know of servi] 7F7843EAD520 72206563 65757165 64657473 206E6920 [ce requested in ] 7F7843EAD530 6E6E6F63 20746365 63736564 74706972 [connect descript] ……………… 7F7843EAE300 6F636944 7265536D 4D726576 69445C52 [DicomServerMR\Di] 7F7843EAE310 536D6F63 65767265 2E524D72 29657865 [comServerMR.exe)] 7F7843EAE320 534F4828 49573D54 4F302D4E 314B304A [(HOST=WIN-0OJ0K1] 7F7843EAE330 4955304E 55282954 3D524553 696D6441 [N0UIT)(USER=Admi] 7F7843EAE340 7473696E 6F746172 29292972 28202A20 [nistrator))) * (] 7F7843EAE350 52444441 3D535345 4F525028 4F434F54 [ADDRESS=(PROTOCO] 7F7843EAE360 63743D4C 48282970 3D54534F 312E3031 [L=tcp)(HOST=10.1] 7F7843EAE370 312E3038 2930332E 524F5028 35353D54 [80.1.30)(PORT=55] 7F7843EAE380 29383632 202A2029 61747365 73696C62 [268)) * establis] 7F7843EAE390 202A2068 32636172 2A206264 35323120 [h * rac2db * 125] 7F7843EAE3A0 540A3431 312D534E 34313532 4E54203A [14.TNS-12514: TN] 7F7843EAE3B0 696C3A53 6E657473 64207265 2073656F [S:listener does ] 7F7843EAE3C0 20746F6E 72727563 6C746E65 6E6B2079 [not currently kn] 7F7843EAE3D0 6F20776F 65732066 63697672 65722065 [ow of service re] 7F7843EAE3E0 73657571 20646574 63206E69 656E6E6F [quested in conne] 7F7843EAE3F0 64207463 72637365 6F747069 34320A72 [ct descriptor.24] KFED-00322: Invalid content encountered during block traversal: [kfbtTraverseBlock][Invalid OSM block type][][32]
通过上述kfed可以看到第10000 au的位置被写入的是数据库异常之后listener.log的信息(该数据库安装在/home目录中),进一步证明覆盖,通过以下信息证明sdg就是asmdisk1
[root@xff1 dev]# ls -l sdg brw-rw---- 1 root disk 8, 96 Apr 25 00:05 sdg [root@xff1 dev]# ls -l asmdisk1 brw-rw---- 1 grid asmadmin 8, 96 Apr 25 00:05 asmdisk1
基于现在的情况,data磁盘组是由三块 200G的磁盘组成,第一块磁盘被意外加入vg,并且写入数据大于30G,无法从asm层面直接通过kfed修复磁盘组,然后直接mount,只能通过oracle asm磁盘数据块重组技术(asm disk header 彻底损坏恢复)实现没有覆盖数据的恢复.
该客户运气还不错,通过仅剩的2019年12月份几天的不成功备份找出来所有的数据文件(无归档),然后强制拉库成功.通过碎片恢复的最新的数据文件数据结合2019年12月份备份,实现绝大部分业务数据恢复,最大限度减少客户损失.对于oracle rac数据库服务器磁盘操作需要谨慎.
如果不幸有类似oracle asm disk被破坏(格式化,dd部分,做成lv等),需要进行恢复支持,可以联系我们,做专业的恢复评估,最大限度,最快速度抢救数据,减少损失
Phone:17813235971 Q Q:107644445 E-Mail:dba@xifenfei.com
恢复过部分asm被格式化案例:
又一例asm格式化文件系统恢复
一次完美的asm disk被格式化ntfs恢复
oracle asm disk格式化恢复—格式化为ext4文件系统
oracle asm disk格式化恢复—格式化为ntfs文件系统