标签云
asm恢复 bbed bootstrap$ dul In Memory kcbzib_kcrsds_1 kccpb_sanity_check_2 MySQL恢复 ORA-00312 ORA-00607 ORA-00704 ORA-00742 ORA-01110 ORA-01555 ORA-01578 ORA-08103 ORA-600 2131 ORA-600 2662 ORA-600 2663 ORA-600 3020 ORA-600 4000 ORA-600 4137 ORA-600 4193 ORA-600 4194 ORA-600 16703 ORA-600 kcbzib_kcrsds_1 ORA-600 KCLCHKBLK_4 ORA-15042 ORA-15196 ORACLE 12C oracle dul ORACLE PATCH Oracle Recovery Tools oracle加密恢复 oracle勒索 oracle勒索恢复 oracle异常恢复 ORACLE恢复 Oracle 恢复 ORACLE数据库恢复 oracle 比特币 OSD-04016 YOUR FILES ARE ENCRYPTED 勒索恢复 比特币加密文章分类
- Others (2)
- 中间件 (2)
- WebLogic (2)
- 操作系统 (102)
- 数据库 (1,697)
- DB2 (22)
- MySQL (74)
- Oracle (1,558)
- Data Guard (52)
- EXADATA (8)
- GoldenGate (24)
- ORA-xxxxx (159)
- ORACLE 12C (72)
- ORACLE 18C (6)
- ORACLE 19C (15)
- ORACLE 21C (3)
- Oracle 23ai (8)
- Oracle ASM (68)
- Oracle Bug (8)
- Oracle RAC (53)
- Oracle 安全 (6)
- Oracle 开发 (28)
- Oracle 监听 (28)
- Oracle备份恢复 (571)
- Oracle安装升级 (93)
- Oracle性能优化 (62)
- 专题索引 (5)
- 勒索恢复 (81)
- PostgreSQL (18)
- PostgreSQL恢复 (6)
- SQL Server (27)
- SQL Server恢复 (8)
- TimesTen (7)
- 达梦数据库 (2)
- 生活娱乐 (2)
- 至理名言 (11)
- 虚拟化 (2)
- VMware (2)
- 软件开发 (37)
- Asp.Net (9)
- JavaScript (12)
- PHP (2)
- 小工具 (20)
-
最近发表
- ORA-600 ktuPopDictI_1恢复
- impdp导入数据丢失sys授权问题分析
- impdp 创建index提示ORA-00942: table or view does not exist
- 数据泵导出 (expdp) 和导入 (impdp)工具性能降低分析参考
- 19c非归档数据库断电导致ORA-00742故障恢复
- Oracle 19c – 手动升级到 Non-CDB Oracle Database 19c 的完整核对清单
- sqlite数据库简单操作
- Oracle 暂定和恢复功能
- .pzpq扩展名勒索恢复
- Oracle read only用户—23ai新特性:只读用户
- 迁移awr快照数据到自定义表空间
- .hmallox加密mariadb/mysql数据库恢复
- 2025年首个故障恢复—ORA-600 kcbzib_kcrsds_1
- 第一例Oracle 21c恢复咨询
- ORA-15411: Failure groups in disk group DATA have different number of disks.
- 断电引起的ORA-08102: 未找到索引关键字, 对象号 39故障处理
- ORA-00227: corrupt block detected in control file
- 手工删除19c rac
- 解决oracle数据文件路径有回车故障
- .wstop扩展名勒索数据库恢复
标签归档:Input/output error
记录一次oracle现场故障处理经过
近期到现场进行了一个数据库恢复,我在恢复之前该库先由于硬件进行恢复,然后由其他人对其进行了一系列数据库恢复,但是未恢复成功,客户希望我们到现场进行处理(因为网络原因无法远程).接手库之后,处理第一个问题,是客户在进行现场备份的时候(把linux数据拷贝到win的过程中)发现有几个文件拷贝异常,这个错误很可能是由于当初的硬件故障修复之后留下的后遗症(由于io设备错误,无法运行此项请求),通过工具进行拷贝,恢复出来
DUL> copy file from /oradata2/xifenfeidata.dbf to /oradata2/xifenfeidata.dbf starting copy datafile '/oradata1/xifenfeidata.dbf' to '/oradata2/xifenfeidata.dbf' read data error from file '/oradata1/xifenfeidata.dbf'.error message:Input/output error read block# error: 560171 read data error from file '/oradata1/xifenfeidata.dbf'.error message:Input/output error read block# error: 560179 datafile copy completed with 2 block error.
[oracle@localhost ~]$ dbv file=/oradata2/xifenfeidata.dbf blocksize=16384 DBVERIFY: Release 11.2.0.3.0 - Production on Mon Mar 29 17:28:17 2021 Copyright (c) 1982, 2011, Oracle and/or its affiliates. All rights reserved. DBVERIFY - Verification starting : FILE = /oradata2/xifenfeidata.dbf Page 560171 is marked corrupt Corrupt block relative dba: 0x3bc88c2b (file 239, block 560171) Completely zero block found during dbv: Page 560179 is marked corrupt Corrupt block relative dba: 0x3bc88c33 (file 239, block 560179) Completely zero block found during dbv: DBVERIFY - Verification complete Total Pages Examined : 4194302 Total Pages Processed (Data) : 2230726 Total Pages Failing (Data) : 0 Total Pages Processed (Index): 1936953 Total Pages Failing (Index): 0 Total Pages Processed (Other): 26618 Total Pages Processed (Seg) : 0 Total Pages Failing (Seg) : 0 Total Pages Empty : 3 Total Pages Marked Corrupt : 2 Total Pages Influx : 0 Total Pages Encrypted : 0 Highest block SCN : 304929867 (106.304929867)
修复完相关无法拷贝文件之后,启动数据库报控制文件异常
Mon Mar 29 15:03:38 2021 alter database mount USER (ospid: 29044): terminating the instance Mon Mar 29 15:03:42 2021 System state dump requested by (instance=1, osid=29044), summary=[abnormal instance termination]. System State dumped to trace file /u01/app/oracle/diag/rdbms/xff/xff/trace/xff_diag_28961.trc Instance terminated by USER, pid = 29044
尝试重建ctl
[oracle@localhost ~]$ sqlplus / as sysdba SQL*Plus: Release 11.2.0.3.0 Production on Mon Mar 29 17:40:17 2021 Copyright (c) 1982, 2011, Oracle. All rights reserved. Connected to an idle instance. SQL> startup nomount pfile='/tmp/pfile' ORACLE instance started. Total System Global Area 1.7704E+10 bytes Fixed Size 2235568 bytes Variable Size 2348811088 bytes Database Buffers 1.5301E+10 bytes Redo Buffers 52580352 bytes SQL> @/tmp/ctl.sql CREATE CONTROLFILE REUSE DATABASE xff NORESETLOGS NOARCHIVELOG * ERROR at line 1: ORA-01503: CREATE CONTROLFILE failed ORA-01189: file is from a different RESETLOGS than previous files ORA-01110: data file 249: '/oradata/xff/system03.dbf'
初步判断是由于对方之前恢复导致部分文件resetlogs scn异常,通过bbed进行判断确认
BBED> set file 1 FILE# 1 BBED> p kcvfhrls struct kcvfhrls, 8 bytes @116 ub4 kscnbas @116 0x00000001 ub2 kscnwrp @120 0x0000 BBED> set file 249 FILE# 249 BBED> p kcvfhrls struct kcvfhrls, 8 bytes @116 ub4 kscnbas @116 0x00000001 ub2 kscnwrp @120 0x0000
通过bbed修改相关值,然后重建控制文件成功,尝试resetlogs库,报ORA-01248错误
SQL> alter database open resetlogs; alter database open resetlogs * ERROR at line 1: ORA-01248: file 234 was created in the future of incomplete recovery ORA-01110: data file 234: '/oradata1/xifenfeidata5.DBF'
关于ORA-01248的错误解释
01248, 00000, "file %s was created in the future of incomplete recovery" // *Cause: Attempting to do a RESETLOGS open with a file entry in the // control file that was originally created after the UNTIL time // of the incomplete recovery. // Allowing such an entry may hide the version of the file that // is needed at this time. The file number may be in use for // a different file which would be lost if the RESETLOGS was allowed. // *Action: If more recovery is desired then apply redo until the creation // time of the file is reached. If the file is not wanted and the // same file number is not in use at the stop time of the recovery, // then the file can be taken offline with the FOR DROP option. // Otherwise a different control file is needed to allow the RESETLOGS. // Another backup can be restored and recovered, or a control file can // be created via CREATE CONTROLFILE.
大概的意思是文件的创建时间大于文件当前的scn,通过查询确实如此
SQL> select file#,CREATION_CHANGE#,CREATION_TIME from v$datafile_header where file#=234; FILE# CREATION_CHANGE# CREATION_ ---------------- ---------------- --------- 234 419298664864 02-AUG-19 SQL> SELECT status, 2 to_char(checkpoint_change#,'9999999999999999') "SCN", 3 to_char(checkpoint_time,'yyyy-mm-dd hh24:mi:ss') checkpoint_time,FUZZY, 4 count(*) ROW_NUM 5 FROM v$datafile_header 6 GROUP BY status, checkpoint_change#, to_char(checkpoint_time,'yyyy-mm-dd hh24:mi:ss'),fuzzy 7 ORDER BY status, checkpoint_change#, checkpoint_time; STATUS SCN CHECKPOINT_TIME FUZ ROW_NUM ------- ----------------- ------------------- --- ---------------- ONLINE 417750848223 2021-02-23 23:50:46 YES 7 ONLINE 417750848223 2021-03-21 11:44:25 NO 396
通过对部分scn进行修改(比如减小创建时间的scn),然后尝试resetlogs库
SQL> alter database open resetlogs; alter database open resetlogs * ERROR at line 1: ORA-01092: ORACLE instance terminated. Disconnection forced ORA-00704: bootstrap process failure ORA-00704: bootstrap process failure ORA-00604: error occurred at recursive SQL level 1 ORA-01555: snapshot too old: rollback segment number 5 with name "_SYSSMU5_2708889888$" too small Process ID: 3182 Session ID: 1 Serial number: 3
这个错误比较简单,参考以前的部分文章:在数据库open过程中常遇到ORA-01555汇总数据库open过程遭遇ORA-1555对应sql语句补充,处理之后,数据库open成功
SQL> startup mount; ORACLE instance started. Total System Global Area 1.7704E+10 bytes Fixed Size 2235568 bytes Variable Size 2348811088 bytes Database Buffers 1.5301E+10 bytes Redo Buffers 52580352 bytes Database mounted. SQL> alter database open; Database altered.
本次数据库恢复基本上完成,已经最大限度恢复数据,导出数据到新库,完成恢复任务
Input/output error故障恢复
客户由于硬件故障,导致数据文件出现io错误
oracle@linux1:~> dd if=/oradata/orcl/system01.dbf of=/oradata/orcl/system01.dbf_bak bs=8192 dd: reading `/oradata/orcl/system01.dbf': Input/output error 83871+0 records in 83871+0 records out 687071232 bytes (687 MB) copied, 1.07177 s, 641 MB/s dd: reading `/oradata/orcl/system01.dbf': Input/output error 83871+0 records in 83871+0 records out 687071232 bytes (687 MB) copied, 1.0731 s, 640 MB/s dd: reading `/oradata/orcl/system01.dbf': Input/output error 83871+0 records in 83871+0 records out 687071232 bytes (687 MB) copied, 1.07431 s, 640 MB/s dd: reading `/oradata/orcl/system01.dbf': Input/output error 85158+1 records in 85158+1 records out 697618432 bytes (698 MB) copied, 4.11649 s, 169 MB/s dd: reading `/oradata/orcl/system01.dbf': Input/output error 85158+1 records in 85158+1 records out 697618432 bytes (698 MB) copied, 5.64775 s, 124 MB/s dd: reading `/oradata/orcl/system01.dbf': Input/output error 85158+1 records in 85158+1 records out 697618432 bytes (698 MB) copied, 7.1791 s, 97.2 MB/s dd: reading `/oradata/orcl/system01.dbf': Input/output error 85158+1 records in 85158+1 records out 697618432 bytes (698 MB) copied, 8.70247 s, 80.2 MB/s dd: reading `/oradata/orcl/system01.dbf': Input/output error 85158+1 records in 85158+1 records out 697618432 bytes (698 MB) copied, 10.2258 s, 68.2 MB/s dd: reading `/oradata/orcl/system01.dbf': Input/output error 85158+1 records in 85158+1 records out 697618432 bytes (698 MB) copied, 10.2272 s, 68.2 MB/s dd: reading `/oradata/orcl/system01.dbf': Input/output error 85158+1 records in 85158+1 records out 697618432 bytes (698 MB) copied, 10.2284 s, 68.2 MB/s dd: reading `/oradata/orcl/system01.dbf': Input/output error 85158+1 records in 85158+1 records out 697618432 bytes (698 MB) copied, 10.2296 s, 68.2 MB/s dd: reading `/oradata/orcl/system01.dbf': Input/output error 85158+1 records in 85158+1 records out 697618432 bytes (698 MB) copied, 10.2309 s, 68.2 MB/s dd: reading `/oradata/orcl/system01.dbf': Input/output error 85170+1 records in 85170+1 records out 697716736 bytes (698 MB) copied, 11.7563 s, 59.3 MB/s dd: reading `/oradata/orcl/system01.dbf': Input/output error 85170+1 records in 85170+1 records out 697716736 bytes (698 MB) copied, 13.3038 s, 52.4 MB/s 93431+1 records in 93431+1 records out 765390848 bytes (765 MB) copied, 18.2578 s, 41.9 MB/s
这个明显io错误比较多,无法直接使用以前的dd方法较好的恢复数据,只能通过linux平台的一些io工具修复文件(或者直接把磁盘挂载到win上通过工具处理),然后下载到win机器之后效果不错,只有17个坏块
C:\Users\XIFENFEI>dbv file=f:/11.2.0.1/system01.dbf DBVERIFY: Release 10.2.0.3.0 - Production on 星期日 2月 24 22:46:59 2019 Copyright (c) 1982, 2005, Oracle. All rights reserved. DBVERIFY - 开始验证: FILE = f:/11.2.0.1/system01.dbf 页 83871 标记为损坏 Corrupt block relative dba: 0x0041479f (file 1, block 83871) Bad header found during dbv: Data in bad block: type: 229 format: 5 rdba: 0xe5e5e5e5 last change scn: 0xe5e5.e5e5e5e5 seq: 0xe5 flg: 0xe5 spare1: 0xe5 spare2: 0xe5 spare3: 0xe5e5 consistency value in tail: 0xe5e5e5e5 check value in block header: 0xe5e5 computed block checksum: 0x0 页 83872 标记为损坏 Corrupt block relative dba: 0x004147a0 (file 1, block 83872) Bad header found during dbv: Data in bad block: type: 229 format: 5 rdba: 0xe5e5e5e5 last change scn: 0xe5e5.e5e5e5e5 seq: 0xe5 flg: 0xe5 spare1: 0xe5 spare2: 0xe5 spare3: 0xe5e5 consistency value in tail: 0xe5e5e5e5 check value in block header: 0xe5e5 computed block checksum: 0x0 页 83873 标记为损坏 Corrupt block relative dba: 0x004147a1 (file 1, block 83873) Bad header found during dbv: Data in bad block: type: 229 format: 5 rdba: 0xe5e5e5e5 last change scn: 0xe5e5.e5e5e5e5 seq: 0xe5 flg: 0xe5 spare1: 0xe5 spare2: 0xe5 spare3: 0xe5e5 consistency value in tail: 0x759c0601 check value in block header: 0xe5e5 computed block checksum: 0xddc6 页 85161 流入 - 很可能是介质损坏 Corrupt block relative dba: 0x00414ca9 (file 1, block 85161) Fractured block found during dbv: Data in bad block: type: 6 format: 2 rdba: 0x00414ca9 last change scn: 0x0000.0ce20ac2 seq: 0x2 flg: 0x04 spare1: 0x0 spare2: 0x0 spare3: 0x0 consistency value in tail: 0xe5e5e5e5 check value in block header: 0x47f0 computed block checksum: 0xc3ab 页 85162 标记为损坏 Corrupt block relative dba: 0x00414caa (file 1, block 85162) Bad header found during dbv: Data in bad block: type: 229 format: 5 rdba: 0xe5e5e5e5 last change scn: 0xe5e5.e5e5e5e5 seq: 0xe5 flg: 0xe5 spare1: 0xe5 spare2: 0xe5 spare3: 0xe5e5 consistency value in tail: 0xe5e5e5e5 check value in block header: 0xe5e5 computed block checksum: 0x0 页 85163 标记为损坏 Corrupt block relative dba: 0x00414cab (file 1, block 85163) Bad header found during dbv: Data in bad block: type: 229 format: 5 rdba: 0xe5e5e5e5 last change scn: 0xe5e5.e5e5e5e5 seq: 0xe5 flg: 0xe5 spare1: 0xe5 spare2: 0xe5 spare3: 0xe5e5 consistency value in tail: 0xe5e5e5e5 check value in block header: 0xe5e5 computed block checksum: 0x0 页 85164 标记为损坏 Corrupt block relative dba: 0x00414cac (file 1, block 85164) Bad header found during dbv: Data in bad block: type: 229 format: 5 rdba: 0xe5e5e5e5 last change scn: 0xe5e5.e5e5e5e5 seq: 0xe5 flg: 0xe5 spare1: 0xe5 spare2: 0xe5 spare3: 0xe5e5 consistency value in tail: 0xe5e5e5e5 check value in block header: 0xe5e5 computed block checksum: 0x0 页 85165 标记为损坏 Corrupt block relative dba: 0x00414cad (file 1, block 85165) Bad header found during dbv: Data in bad block: type: 229 format: 5 rdba: 0xe5e5e5e5 last change scn: 0xe5e5.e5e5e5e5 seq: 0xe5 flg: 0xe5 spare1: 0xe5 spare2: 0xe5 spare3: 0xe5e5 consistency value in tail: 0xe5e5e5e5 check value in block header: 0xe5e5 computed block checksum: 0x0 页 85166 标记为损坏 Corrupt block relative dba: 0x00414cae (file 1, block 85166) Bad header found during dbv: Data in bad block: type: 229 format: 5 rdba: 0xe5e5e5e5 last change scn: 0xe5e5.e5e5e5e5 seq: 0xe5 flg: 0xe5 spare1: 0xe5 spare2: 0xe5 spare3: 0xe5e5 consistency value in tail: 0xe5e5e5e5 check value in block header: 0xe5e5 computed block checksum: 0x0 页 85167 标记为损坏 Corrupt block relative dba: 0x00414caf (file 1, block 85167) Bad header found during dbv: Data in bad block: type: 229 format: 5 rdba: 0xe5e5e5e5 last change scn: 0xe5e5.e5e5e5e5 seq: 0xe5 flg: 0xe5 spare1: 0xe5 spare2: 0xe5 spare3: 0xe5e5 consistency value in tail: 0xe5e5e5e5 check value in block header: 0xe5e5 computed block checksum: 0x0 页 85177 流入 - 很可能是介质损坏 Corrupt block relative dba: 0x00414cb9 (file 1, block 85177) Fractured block found during dbv: Data in bad block: type: 6 format: 2 rdba: 0x00414cb9 last change scn: 0x0000.0ce55ebf seq: 0x1 flg: 0x06 spare1: 0x0 spare2: 0x0 spare3: 0x0 consistency value in tail: 0xe5e5e5e5 check value in block header: 0x54c9 computed block checksum: 0x5ce5 页 85178 标记为损坏 Corrupt block relative dba: 0x00414cba (file 1, block 85178) Bad header found during dbv: Data in bad block: type: 229 format: 5 rdba: 0xe5e5e5e5 last change scn: 0xe5e5.e5e5e5e5 seq: 0xe5 flg: 0xe5 spare1: 0xe5 spare2: 0xe5 spare3: 0xe5e5 consistency value in tail: 0xe5e5e5e5 check value in block header: 0xe5e5 computed block checksum: 0x0 页 85179 标记为损坏 Corrupt block relative dba: 0x00414cbb (file 1, block 85179) Bad header found during dbv: Data in bad block: type: 229 format: 5 rdba: 0xe5e5e5e5 last change scn: 0xe5e5.e5e5e5e5 seq: 0xe5 flg: 0xe5 spare1: 0xe5 spare2: 0xe5 spare3: 0xe5e5 consistency value in tail: 0xe5e5e5e5 check value in block header: 0xe5e5 computed block checksum: 0x0 页 85180 标记为损坏 Corrupt block relative dba: 0x00414cbc (file 1, block 85180) Bad header found during dbv: Data in bad block: type: 229 format: 5 rdba: 0xe5e5e5e5 last change scn: 0xe5e5.e5e5e5e5 seq: 0xe5 flg: 0xe5 spare1: 0xe5 spare2: 0xe5 spare3: 0xe5e5 consistency value in tail: 0xe5e5e5e5 check value in block header: 0xe5e5 computed block checksum: 0x0 页 85181 标记为损坏 Corrupt block relative dba: 0x00414cbd (file 1, block 85181) Bad header found during dbv: Data in bad block: type: 229 format: 5 rdba: 0xe5e5e5e5 last change scn: 0xe5e5.e5e5e5e5 seq: 0xe5 flg: 0xe5 spare1: 0xe5 spare2: 0xe5 spare3: 0xe5e5 consistency value in tail: 0xe5e5e5e5 check value in block header: 0xe5e5 computed block checksum: 0x0 页 85182 标记为损坏 Corrupt block relative dba: 0x00414cbe (file 1, block 85182) Bad header found during dbv: Data in bad block: type: 229 format: 5 rdba: 0xe5e5e5e5 last change scn: 0xe5e5.e5e5e5e5 seq: 0xe5 flg: 0xe5 spare1: 0xe5 spare2: 0xe5 spare3: 0xe5e5 consistency value in tail: 0xe5e5e5e5 check value in block header: 0xe5e5 computed block checksum: 0x0 页 85183 标记为损坏 Corrupt block relative dba: 0x00414cbf (file 1, block 85183) Bad header found during dbv: Data in bad block: type: 229 format: 5 rdba: 0xe5e5e5e5 last change scn: 0xe5e5.e5e5e5e5 seq: 0xe5 flg: 0xe5 spare1: 0xe5 spare2: 0xe5 spare3: 0xe5e5 consistency value in tail: 0xe5e5e5e5 check value in block header: 0xe5e5 computed block checksum: 0x0 DBVERIFY - 验证完成 检查的页总数: 93440 处理的页总数 (数据): 64294 失败的页总数 (数据): 0 处理的页总数 (索引): 12616 失败的页总数 (索引): 0 处理的页总数 (其它): 3111 处理的总页数 (段) : 0 失败的总页数 (段) : 0 空的页总数: 13402 标记为损坏的总页数: 17 流入的页总数: 2 最高块 SCN : 1073748415 (0.1073748415) C:\Users\XIFENFEI>
经过一系列恢复,数据库强制打开,数据库后台报ORA-7445 kkogbro
Completed: alter database open resetlogs upgrade Sun Feb 24 18:06:36 2019 MMON started with pid=15, OS id=9032 Sun Feb 24 18:07:50 2019 Errors in file d:\app\diag\rdbms\orcl\orcl\trace\orcl_ora_8336.trc: Sun Feb 24 18:07:52 2019 Trace dumping is performing id=[cdmp_20190224180752] Sun Feb 24 18:09:42 2019 alter tablespace temp add tempfile 'f:/11.2.0.1/temp01.dbf' size 128m autoextend on Completed: alter tablespace temp add tempfile 'f:/11.2.0.1/temp01.dbf' size 128m autoextend on Exception [type: ACCESS_VIOLATION, UNABLE_TO_READ] [ADDR:0x166] [PC:0x38E41AD, kkogbro()+497] ERROR: Unable to normalize symbol name for the following short stack (at offset 199): dbgexProcessError()+193<-dbgeExecuteForError()+65<-dbgePostErrorKGE()+1726<-dbkePostKGE_kgsf()+ 75<-kgeade()+560<-kgerev()+125<-kgerec5()+60<-sss_xcpt_EvalFilterEx()+1869<- sss_xcpt_EvalFilter()+174<-.1.6_8+59<-0000000077207388<-000000007721BF7D<- 00000000771F043A<-000000007721B61E<-kkogbro()+497<-kkogjro()+99<-kkojnp() +10299<-kkocnp()+78<-kkooqb()+1549<-kkoqbc()+2474<-apakkoqb()+200<- apaqbdDescendents()+496<-apaqbdList()+79<-apaqbdDescendents()+795<- apaqbdList()+79<-apaqbd()+17<-apadrv()+818<-opitca()+2518<-kksLoadChild()+9008 <-kxsGetRuntimeLock()+2320<-kksfbc()+15225<-kkspbd0()+669<-kksParseCursor()+741 <-opiosq0()+2538<-opipls()+12841<-opiodr()+1662<-rpidrus()+862<-rpidru()+154 <-rpiswu2()+2757<-rpidrv()+6105<-psddr0()+614<-psdnal()+510<-pevm_EXECC()+365 <-pfrinstr_EXECC()+90<-pfrrun_no_tool()+65<-pfrrun()+1241<-plsql_run()+875 <-peicnt()+329<-kkxexe()+616<-opiexe()+20006 Errors in file d:\app\diag\rdbms\orcl\orcl\trace\orcl_ora_8336.trc (incident=2540): ORA-07445: 出现异常错误: 核心转储 [kkogbro()+497] [ACCESS_VIOLATION] [ADDR:0x166] [PC:0x38E41AD] [UNABLE_TO_READ] [] Incident details in: d:\app\diag\rdbms\orcl\orcl\incident\incdir_2540\orcl_ora_8336_i2540.trc
通过分析trace文件,确认是和坏块有关系,对于上述坏块进行处理之后,数据正常导出.
分布式存储crash导致ORACLE坏块原因分析
oracle运行在分布式存储中,我们在实验室模拟所有存储节点故障(整个存储断电),然后加电,数据库启动没有坏块,但是有客户在自己的环境做测试,发现数据库启动有大量坏块(几次测试问题依旧)。
客户异常环境报错汇总
模拟存储故障后,数据库报错信息,主要体现犹豫存储模拟故障,数据库读写io失败ORA-15081,ORA-27072,ORA-27061
Thread 2 cannot allocate new log, sequence 488 Checkpoint not complete Current log# 3 seq# 487 mem# 0: +DATADG/testdb/onlinelog/group_3.318.970157559 Current log# 3 seq# 487 mem# 1: +ARCHDG/testdb/onlinelog/group_3.616.970157559 Thu Mar 08 16:17:08 2018 WARNING: Read Failed. group:2 disk:0 AU:6925 offset:49152 size:16384 WARNING: failed to read mirror side 1 of virtual extent 4 logical extent 0 of file 313 in group [2.1286527068] from disk DATADG_0000 allocation unit 6925 reason error; if possible, will try another mirror side Thu Mar 08 16:17:08 2018 WARNING: Read Failed. group:2 disk:0 AU:6921 offset:16384 size:16384 WARNING: failed to read mirror side 1 of virtual extent 0 logical extent 0 of file 313 in group [2.1286527068] from disk DATADG_0000 allocation unit 6921 reason error; if possible, will try another mirror side Errors in file /oracle/app/db/diag/rdbms/testdb/testdb2/trace/testdb2_lmon_21091.trc: ORA-00202: control file: '+DATADG/testdb/controlfile/current.313.970157461' ORA-15081: failed to submit an I/O operation to a disk ORA-27072: File I/O error Linux-x86_64 Error: 5: Input/output error Additional information: 4 Additional information: 56729696 Additional information: -1 Errors in file /oracle/app/db/diag/rdbms/testdb/testdb2/trace/testdb2_lgwr_21111.trc: ORA-00202: control file: '+DATADG/testdb/controlfile/current.313.970157461' ORA-15081: failed to submit an I/O operation to a disk ORA-27072: File I/O error Linux-x86_64 Error: 5: Input/output error Additional information: 4 Additional information: 56696864 Additional information: -1 Thu Mar 08 16:17:08 2018 WARNING: Write Failed. group:1 disk:0 AU:56970 offset:65536 size:16384 WARNING: Write Failed. group:2 disk:0 AU:6921 offset:65536 size:16384 ………… WARNING: failed to write mirror side 1 of virtual extent 147 logical extent 0 of file 335 in group 2 on disk 0 allocation unit 11789 Errors in file /oracle/app/db/diag/rdbms/testdb/testdb2/trace/testdb2_dbw0_21109.trc: ORA-15080: synchronous I/O operation to a disk failed ORA-27061: waiting for async I/Os failed Linux-x86_64 Error: 5: Input/output error Additional information: -1 Additional information: 8192 WARNING: failed to write mirror side 1 of virtual extent 147 logical extent 0 of file 335 in group 2 on disk 0 allocation unit 11789 KCF: read, write or open error, block=0xddec online=1 file=18 '+DATADG/testdb/datafile/soe.333.970157833' error=15081 txt: '' Errors in file /oracle/app/db/diag/rdbms/testdb/testdb2/trace/testdb2_dbw0_21109.trc: Thu Mar 08 16:17:09 2018 System state dump requested by (instance=2, osid=21113 (CKPT)), summary=[abnormal instance termination]. System State dumped to trace file /oracle/app/db/diag/rdbms/testdb/testdb2/trace/testdb2_diag_21081_20180308161709.trc Thu Mar 08 16:17:09 2018 ORA-1092 : opitsk aborting process Thu Mar 08 16:17:12 2018 ORA-1092 : opitsk aborting process Instance terminated by CKPT, pid = 21113
asm日志报错信息,主要也是体现在io的读写错误上Linux-x86_64 Error: 5: Input/output error
Thu Mar 08 16:17:20 2018 Errors in file /oracle/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_9636.trc: ORA-17503: ksfdopn:2 Failed to open file +DATADG/testdb/spfiletestdb.ora ORA-15001: diskgroup "DATADG" does not exist or is not mounted Thu Mar 08 16:17:37 2018 WARNING: Write Failed. group:3 disk:0 AU:1 offset:4190208 size:4096 Thu Mar 08 16:17:37 2018 WARNING: Write Failed. group:3 disk:0 AU:22 offset:143360 size:4096 Errors in file /oracle/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_20679.trc: ORA-15080: synchronous I/O operation to a disk failed ORA-27072: File I/O error Linux-x86_64 Error: 5: Input/output error Additional information: 4 Additional information: 180504 Additional information: -1 WARNING: Hbeat write to PST disk 0.3916310704 in group 3 failed. [4] WARNING: failed to write mirror side 1 of virtual extent 0 logical extent 0 of file 255 in group 3 on disk 0 allocation unit 22 Errors in file /oracle/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_20679.trc: ORA-15081: failed to submit an I/O operation to a disk ORA-27072: File I/O error Linux-x86_64 Error: 5: Input/output error Additional information: 4 Additional information: 180504 Additional information: -1 Thu Mar 08 16:17:37 2018 NOTE: process _b001_+asm2 (9546) initiating offline of disk 0.3916310704 (OCRDG_0000) with mask 0x7e in group 3 NOTE: initiating PST update: grp = 3, dsk = 0/0xe96e28b0, mask = 0x6a, op = clear GMON updating disk modes for group 3 at 15 for pid 27, osid 9546 ERROR: no read quorum in group: required 1, found 0 disks Thu Mar 08 16:17:37 2018 NOTE: cache dismounting (not clean) group 3/0x4CAED85D (OCRDG) WARNING: Offline for disk OCRDG_0000 in mode 0x7f failed. NOTE: messaging CKPT to quiesce pins Unix process pid: 9505, image: oracle@dbtest02 (B000) Thu Mar 08 16:17:37 2018 NOTE: halting all I/Os to diskgroup 3 (OCRDG) Thu Mar 08 16:17:37 2018 NOTE: LGWR doing non-clean dismount of group 3 (OCRDG) NOTE: LGWR sync ABA=11.6 last written ABA 11.6 WARNING: Write Failed. group:3 disk:0 AU:22 offset:102400 size:4096 Errors in file /oracle/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_20679.trc: ORA-15080: synchronous I/O operation to a disk failed ORA-27072: File I/O error Linux-x86_64 Error: 5: Input/output error Additional information: 4 Additional information: 180424 Additional information: -1 WARNING: failed to write mirror side 1 of virtual extent 0 logical extent 0 of file 255 in group 3 on disk 0 allocation unit 22 Errors in file /oracle/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_20679.trc: ORA-15081: failed to submit an I/O operation to a disk ORA-27072: File I/O error Linux-x86_64 Error: 5: Input/output error Additional information: 4 Additional information: 180424 Additional information: -1 WARNING: Write Failed. group:3 disk:0 AU:22 offset:147456 size:4096 Errors in file /oracle/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_20679.trc: ORA-15080: synchronous I/O operation to a disk failed ORA-27072: File I/O error ………… Thu Mar 08 16:17:37 2018 ERROR: -9(Error 27061, OS Error (Linux-x86_64 Error: 5: Input/output error Additional information: -1 Additional information: 512) ) ERROR: -9(Error 27061, OS Error (Linux-x86_64 Error: 5: Input/output error Additional information: -1 Additional information: 512) ) ERROR: -9(Error 27061, OS Error (Linux-x86_64 Error: 5: Input/output error Additional information: -1 Additional information: 512) )
存储恢复之后,asm mount磁盘组没有报任何错误
SQL> ALTER DISKGROUP ALL MOUNT /* asm agent call crs *//* {0:0:2} */ NOTE: Diskgroups listed in ASM_DISKGROUPS are ARCHDG DATADG NOTE: Diskgroup used for Voting files is: OCRDG Diskgroup with spfile:OCRDG Diskgroup used for OCR is:OCRDG NOTE: cache registered group ARCHDG number=1 incarn=0xb81889ff NOTE: cache began mount (first) of group ARCHDG number=1 incarn=0xb81889ff NOTE: cache registered group DATADG number=2 incarn=0xb8288a00 NOTE: cache began mount (first) of group DATADG number=2 incarn=0xb8288a00 NOTE: cache registered group OCRDG number=3 incarn=0xb8288a01 NOTE: cache began mount (first) of group OCRDG number=3 incarn=0xb8288a01 NOTE: Loaded library: /opt/oracle/extapi/64/asm/orcl/1/libasm.so NOTE: Assigning number (1,0) to disk (/dev/oracleasm/disks/ARCHDISK01) NOTE: Assigning number (2,0) to disk (/dev/oracleasm/disks/DATADISK01) NOTE: Assigning number (3,0) to disk (/dev/oracleasm/disks/OCRDISK01) NOTE: GMON heartbeating for grp 1 GMON querying group 1 at 5 for pid 23, osid 3303 NOTE: cache opening disk 0 of grp 1: ARCHDG_0000 path:/dev/oracleasm/disks/ARCHDISK01 NOTE: F1X0 found on disk 0 au 2 fcn 0.0 NOTE: cache mounting (first) external redundancy group 1/0xB81889FF (ARCHDG) * allocate domain 1, invalid = TRUE NOTE: attached to recovery domain 1 NOTE: starting recovery of thread=1 ckpt=13.6287 group=1 (ARCHDG) NOTE: starting recovery of thread=2 ckpt=12.7985 group=1 (ARCHDG) NOTE: advancing ckpt for group 1 (ARCHDG) thread=1 ckpt=13.6287 NOTE: advancing ckpt for group 1 (ARCHDG) thread=2 ckpt=12.7985 NOTE: cache recovered group 1 to fcn 0.122934 NOTE: redo buffer size is 256 blocks (1053184 bytes) NOTE: LGWR attempting to mount thread 1 for diskgroup 1 (ARCHDG) Process LGWR (pid 3287) is running at high priority QoS for Exadata I/O NOTE: LGWR found thread 1 closed at ABA 13.6286 NOTE: LGWR mounted thread 1 for diskgroup 1 (ARCHDG) NOTE: LGWR opening thread 1 at fcn 0.122934 ABA 14.6287 NOTE: cache mounting group 1/0xB81889FF (ARCHDG) succeeded NOTE: cache ending mount (success) of group ARCHDG number=1 incarn=0xb81889ff NOTE: GMON heartbeating for grp 2 GMON querying group 2 at 7 for pid 23, osid 3303 NOTE: cache opening disk 0 of grp 2: DATADG_0000 path:/dev/oracleasm/disks/DATADISK01 NOTE: F1X0 found on disk 0 au 2 fcn 0.0 NOTE: cache mounting (first) external redundancy group 2/0xB8288A00 (DATADG) * allocate domain 2, invalid = TRUE NOTE: attached to recovery domain 2 NOTE: starting recovery of thread=1 ckpt=12.2323 group=2 (DATADG) NOTE: starting recovery of thread=2 ckpt=11.2681 group=2 (DATADG) NOTE: advancing ckpt for group 2 (DATADG) thread=1 ckpt=12.2326 NOTE: advancing ckpt for group 2 (DATADG) thread=2 ckpt=11.2687 NOTE: cache recovered group 2 to fcn 0.21395 NOTE: redo buffer size is 512 blocks (2101760 bytes) NOTE: LGWR attempting to mount thread 1 for diskgroup 2 (DATADG) NOTE: LGWR found thread 1 closed at ABA 12.2325 NOTE: LGWR mounted thread 1 for diskgroup 2 (DATADG) NOTE: LGWR opening thread 1 at fcn 0.21395 ABA 13.2326 NOTE: cache mounting group 2/0xB8288A00 (DATADG) succeeded NOTE: cache ending mount (success) of group DATADG number=2 incarn=0xb8288a00 NOTE: GMON heartbeating for grp 3 GMON querying group 3 at 9 for pid 23, osid 3303 NOTE: cache opening disk 0 of grp 3: OCRDG_0000 path:/dev/oracleasm/disks/OCRDISK01 NOTE: F1X0 found on disk 0 au 2 fcn 0.0 NOTE: cache mounting (first) external redundancy group 3/0xB8288A01 (OCRDG) * allocate domain 3, invalid = TRUE Thu Mar 08 17:00:24 2018 NOTE: attached to recovery domain 3 NOTE: starting recovery of thread=1 ckpt=13.55 group=3 (OCRDG) NOTE: starting recovery of thread=2 ckpt=11.7 group=3 (OCRDG) NOTE: advancing ckpt for group 3 (OCRDG) thread=1 ckpt=13.55 NOTE: advancing ckpt for group 3 (OCRDG) thread=2 ckpt=11.7 NOTE: cache recovered group 3 to fcn 0.851 NOTE: redo buffer size is 512 blocks (2101760 bytes) Thu Mar 08 17:00:24 2018 NOTE: LGWR attempting to mount thread 1 for diskgroup 3 (OCRDG) NOTE: LGWR found thread 1 closed at ABA 13.54 NOTE: LGWR mounted thread 1 for diskgroup 3 (OCRDG) NOTE: LGWR opening thread 1 at fcn 0.851 ABA 14.55 NOTE: cache mounting group 3/0xB8288A01 (OCRDG) succeeded NOTE: cache ending mount (success) of group OCRDG number=3 incarn=0xb8288a01 Thu Mar 08 17:00:24 2018 NOTE: Instance updated compatible.asm to 11.2.0.0.0 for grp 1 SUCCESS: diskgroup ARCHDG was mounted NOTE: Instance updated compatible.asm to 11.2.0.0.0 for grp 2 SUCCESS: diskgroup DATADG was mounted NOTE: Instance updated compatible.asm to 11.2.0.0.0 for grp 3 SUCCESS: diskgroup OCRDG was mounted SUCCESS: ALTER DISKGROUP ALL MOUNT /* asm agent call crs *//* {0:0:2} */
数据库虽然启动成功了,但是大量的坏块生产ORA-01578,ORA-01110
ALTER DATABASE OPEN /* db agent *//* {2:40834:2} */ This instance was first to open Beginning crash recovery of 2 threads parallel recovery started with 7 processes Started redo scan Completed redo scan read 107566 KB redo, 10569 data blocks need recovery Started redo application at Thread 1: logseq 767, block 101851 Thread 2: logseq 486, block 91861 Recovery of Online Redo Log: Thread 1 Group 1 Seq 767 Reading mem 0 Mem# 0: +DATADG/testdb/onlinelog/group_1.314.970157463 Mem# 1: +ARCHDG/testdb/onlinelog/group_1.614.970157463 Recovery of Online Redo Log: Thread 2 Group 4 Seq 486 Reading mem 0 Mem# 0: +DATADG/testdb/onlinelog/group_4.319.970157561 Mem# 1: +ARCHDG/testdb/onlinelog/group_4.617.970157561 Recovery of Online Redo Log: Thread 1 Group 2 Seq 768 Reading mem 0 Mem# 0: +DATADG/testdb/onlinelog/group_2.315.970157463 Mem# 1: +ARCHDG/testdb/onlinelog/group_2.615.970157463 Recovery of Online Redo Log: Thread 2 Group 3 Seq 487 Reading mem 0 Mem# 0: +DATADG/testdb/onlinelog/group_3.318.970157559 Mem# 1: +ARCHDG/testdb/onlinelog/group_3.616.970157559 Thu Mar 08 17:01:11 2018 Hex dump of (file 12, block 126469) in trace file /oracle/app/db/diag/rdbms/testdb/testdb2/trace/testdb2_p001_4090.trc Corrupt block relative dba: 0x0301ee05 (file 12, block 126469) Fractured block found during crash/instance recovery Data in bad block: type: 32 format: 2 rdba: 0x0301ee05 last change scn: 0x0000.00446cbb seq: 0x1 flg: 0x04 spare1: 0x0 spare2: 0x0 spare3: 0x0 consistency value in tail: 0xeac22001 check value in block header: 0xbd0c computed block checksum: 0x8679 Reading datafile '+DATADG/testdb/datafile/soe.327.970157825' for corruption at rdba: 0x0301ee05 (file 12, block 126469) Reread (file 12, block 126469) found same corrupt data (no logical check) Completed redo application of 84.56MB Completed crash recovery at Thread 1: logseq 768, block 102395, scn 4586960 Thread 2: logseq 487, block 101664, scn 4587050 10569 data blocks read, 10569 data blocks written, 107566 redo k-bytes read ARCH: STARTING ARCH PROCESSES Thu Mar 08 17:01:12 2018 ARC0 started with pid=43, OS id=4151 Thu Mar 08 17:01:13 2018 ARC0: Archival started ARCH: STARTING ARCH PROCESSES COMPLETE ARC0: STARTING ARCH PROCESSES Thu Mar 08 17:01:13 2018 ARC1 started with pid=44, OS id=4153 Thu Mar 08 17:01:13 2018 ARC2 started with pid=45, OS id=4155 ARC1: Archival started ARC2: Archival started Thu Mar 08 17:01:13 2018 ARC3 started with pid=46, OS id=4157 ARC1: Becoming the 'no FAL' ARCH ARC1: Becoming the 'no SRL' ARCH ARC2: Becoming the heartbeat ARCH Thread 1 advanced to log sequence 769 (thread recovery) Picked broadcast on commit scheme to generate SCNs Thu Mar 08 17:01:13 2018 Thread 2 advanced to log sequence 488 (thread open) Thread 2 opened at log sequence 488 Current log# 4 seq# 488 mem# 0: +DATADG/testdb/onlinelog/group_4.319.970157561 Current log# 4 seq# 488 mem# 1: +ARCHDG/testdb/onlinelog/group_4.617.970157561 Successful open of redo thread 2 MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set Thu Mar 08 17:01:13 2018 SMON: enabling cache recovery Archived Log entry 1249 added for thread 2 sequence 487 ID 0xa41f5c11 dest 1: [4046] Successfully onlined Undo Tablespace 2. Undo initialization finished serial:0 start:4294332240 end:4294332460 diff:220 (2 seconds) Verifying file header compatibility for 11g tablespace encryption.. Verifying 11g file header compatibility for tablespace encryption completed SMON: enabling tx recovery Database Characterset is AL32UTF8 No Resource Manager plan active Starting background process GTX0 Thu Mar 08 17:01:14 2018 GTX0 started with pid=55, OS id=4181 Starting background process RCBG Thu Mar 08 17:01:14 2018 RCBG started with pid=57, OS id=4185 replication_dependency_tracking turned off (no async multimaster replication found) Starting background process QMNC Thu Mar 08 17:01:14 2018 QMNC started with pid=61, OS id=4195 Thu Mar 08 17:01:15 2018 Hex dump of (file 9, block 43523) in trace file /oracle/app/db/diag/rdbms/testdb/testdb2/trace/testdb2_p009_4175.trc Corrupt block relative dba: 0x0240aa03 (file 9, block 43523) Fractured block found during buffer read Data in bad block: type: 32 format: 2 rdba: 0x0240aa03 last change scn: 0x0000.0044ccc4 seq: 0x1 flg: 0x04 spare1: 0x0 spare2: 0x0 spare3: 0x0 consistency value in tail: 0xd94f2001 check value in block header: 0xb09c computed block checksum: 0x158b Reading datafile '+DATADG/testdb/datafile/soe.324.970157821' for corruption at rdba: 0x0240aa03 (file 9, block 43523) Reread (file 9, block 43523) found same corrupt data (no logical check) Errors in file /oracle/app/db/diag/rdbms/testdb/testdb2/trace/testdb2_p009_4175.trc (incident=216418): ORA-01578: ORACLE data block corrupted (file # 9, block # 43523) ORA-01110: data file 9: '+DATADG/testdb/datafile/soe.324.970157821' Incident details in: /oracle/app/db/diag/rdbms/testdb/testdb2/incident/incdir_216418/testdb2_p009_4175_i216418.trc ARC3: Archival started ARC0: STARTING ARCH PROCESSES COMPLETE Thu Mar 08 17:01:15 2018 Hex dump of (file 14, block 71173) in trace file /oracle/app/db/diag/rdbms/testdb/testdb2/trace/testdb2_p003_4094.trc Corrupt block relative dba: 0x03811605 (file 14, block 71173) Fractured block found during buffer read Data in bad block: type: 32 format: 2 rdba: 0x03811605 last change scn: 0x0000.0043916d seq: 0x1 flg: 0x04 spare1: 0x0 spare2: 0x0 spare3: 0x0 consistency value in tail: 0x93662001 check value in block header: 0xad98 computed block checksum: 0x20b Reading datafile '+DATADG/testdb/datafile/soe.329.970157831' for corruption at rdba: 0x03811605 (file 14, block 71173) Reread (file 14, block 71173) found same corrupt data (no logical check) Errors in file /oracle/app/db/diag/rdbms/testdb/testdb2/trace/testdb2_p003_4094.trc (incident=216306): ORA-01578: ORACLE data block corrupted (file # 14, block # 71173) ORA-01110: data file 14: '+DATADG/testdb/datafile/soe.329.970157831' Incident details in: /oracle/app/db/diag/rdbms/testdb/testdb2/incident/incdir_216306/testdb2_p003_4094_i216306.trc Hex dump of (file 12, block 112385) in trace file /oracle/app/db/diag/rdbms/testdb/testdb2/trace/testdb2_smon_4005.trc Corrupt block relative dba: 0x0301b701 (file 12, block 112385) Fractured block found during buffer read Data in bad block: type: 6 format: 2 rdba: 0x0301b701 last change scn: 0x0000.003cb146 seq: 0x1 flg: 0x06 spare1: 0x0 spare2: 0x0 spare3: 0x0 consistency value in tail: 0x1d030601 check value in block header: 0x1e62 computed block checksum: 0x7a47 Reading datafile '+DATADG/testdb/datafile/soe.327.970157825' for corruption at rdba: 0x0301b701 (file 12, block 112385) Reread (file 12, block 112385) found same corrupt data (no logical check) Errors in file /oracle/app/db/diag/rdbms/testdb/testdb2/trace/testdb2_smon_4005.trc (incident=216170): ORA-01578: ORACLE data block corrupted (file # 12, block # 112385) ORA-01110: data file 12: '+DATADG/testdb/datafile/soe.327.970157825' Incident details in: /oracle/app/db/diag/rdbms/testdb/testdb2/incident/incdir_216170/testdb2_smon_4005_i216170.trc Errors in file /oracle/app/db/diag/rdbms/testdb/testdb2/trace/testdb2_p003_4094.trc (incident=216307): ORA-01578: ORACLE data block corrupted (file # , block # ) Incident details in: /oracle/app/db/diag/rdbms/testdb/testdb2/incident/incdir_216307/testdb2_p003_4094_i216307.trc Completed: ALTER DATABASE OPEN /* db agent *//* {2:40834:2} */
rman 检查坏块信息,发现有大量坏块,而且主要坏块类型是FRACTURED(折断)
SQL> select * from v$database_block_corruption; FILE# BLOCK# BLOCKS CORRUPTION_CHANGE# CORRUPTIO ---------- ---------- ---------- ------------------ --------- 12 126469 1 0 FRACTURED 12 112385 1 0 FRACTURED 14 71173 1 0 FRACTURED 9 43523 1 0 FRACTURED 15 44801 1 0 FRACTURED 15 44855 1 0 FRACTURED 15 44862 1 0 FRACTURED 11 56252 1 0 FRACTURED 12 116481 1 0 CHECKSUM 12 116535 1 0 FRACTURED 15 46903 1 0 FRACTURED 12 108289 1 0 FRACTURED 12 108343 1 0 FRACTURED 8 95872 1 0 FRACTURED 8 99712 1 0 FRACTURED 8 102783 1 0 FRACTURED 8 104974 1 0 FRACTURED 8 105856 1 0 FRACTURED 8 105998 1 0 FRACTURED 8 109120 1 0 FRACTURED 8 110016 1 0 FRACTURED 8 110092 1 0 FRACTURED 8 116226 1 0 FRACTURED 8 116231 1 0 FRACTURED 8 119295 1 0 FRACTURED 8 130559 1 0 FRACTURED 12 113163 1 0 FRACTURED 12 118283 1 0 FRACTURED 12 119553 1 0 FRACTURED 12 119608 1 0 FRACTURED 12 120576 1 0 FRACTURED 12 120632 1 0 FRACTURED 12 120639 1 0 FRACTURED 12 123407 1 0 FRACTURED 9 38411 1 0 FRACTURED 9 67647 1 0 FRACTURED 9 109068 1 0 FRACTURED 9 109070 1 0 FRACTURED 11 99839 1 0 FRACTURED 11 104960 1 0 FRACTURED 13 63214 1 0 FRACTURED 13 65097 1 0 FRACTURED 13 71183 1 0 FRACTURED 13 77319 1 0 FRACTURED 13 103104 1 0 FRACTURED 13 107072 1 0 FRACTURED 13 110082 1 0 FRACTURED 13 111106 1 0 FRACTURED 13 114752 1 0 FRACTURED 14 72196 1 0 FRACTURED 14 75275 1 0 FRACTURED 7 76293 1 0 FRACTURED 7 76296 1 0 FRACTURED 7 76415 1 0 FRACTURED 7 76864 1 0 FRACTURED 15 49976 1 0 FRACTURED 15 81413 1 0 FRACTURED 2 61512 1 0 FRACTURED 19 35338 1 0 FRACTURED 19 42687 1 0 FRACTURED 60 rows selected.
对比实验室和客户环境配置
实验室环境
[oracle@i-q2ghx82t ~]$ more /etc/issue CentOS release 6.8 (Final) Kernel \r on an \m SQL> select path from v$asm_disk; PATH ---------------------------------------- /dev/asm-disk011 /dev/asm-disk001 /dev/asm-disk014 /dev/asm-disk002 /dev/asm-disk003 /dev/asm-disk015 /dev/asm-disk012 /dev/asm-disk013 8 rows selected. [root@i-q2ghx82t ~]# fdisk -l|grep vd Disk /dev/vda: 107.4 GB, 107374182400 bytes /dev/vda1 * 1 13055 104856576 83 Linux Disk /dev/vdb: 34.4 GB, 34359738368 bytes Disk /dev/vdc: 107.4 GB, 107374182400 bytes Disk /dev/vdd: 268.4 GB, 268435456000 bytes /dev/vdd1 1 32635 262140606 83 Linux Disk /dev/vde: 1073.7 GB, 1073741824000 bytes Disk /dev/vdf: 53.7 GB, 53687091200 bytes /dev/vdf1 1 104025 52428568+ 83 Linux Disk /dev/vdg: 107.4 GB, 107374182400 bytes Disk /dev/vdh: 1073.7 GB, 1073741824000 bytes Disk /dev/vdi: 107.4 GB, 107374182400 bytes Disk /dev/vdj: 1073.7 GB, 1073741824000 bytes Disk /dev/vdk: 1073.7 GB, 1073741824000 bytes Disk /dev/vdl: 1073.7 GB, 1073741824000 bytes [oracle@i-q2ghx82t ~]$ ls -l /dev/asm-* lrwxrwxrwx 1 root root 3 Mar 10 17:05 /dev/asm-disk001 -> vdg lrwxrwxrwx 1 root root 3 Mar 10 17:05 /dev/asm-disk002 -> vdc lrwxrwxrwx 1 root root 3 Mar 10 17:05 /dev/asm-disk003 -> vdi lrwxrwxrwx 1 root root 3 Mar 10 17:05 /dev/asm-disk011 -> vde lrwxrwxrwx 1 root root 3 Mar 10 17:05 /dev/asm-disk012 -> vdh lrwxrwxrwx 1 root root 3 Mar 10 17:05 /dev/asm-disk013 -> vdl lrwxrwxrwx 1 root root 3 Mar 10 17:05 /dev/asm-disk014 -> vdj lrwxrwxrwx 1 root root 3 Mar 10 17:05 /dev/asm-disk015 -> vdk [root@i-q2ghx82t ~]# fdisk -l /dev/vdj Disk /dev/vdj: 1073.7 GB, 1073741824000 bytes 16 heads, 63 sectors/track, 2080507 cylinders Units = cylinders of 1008 * 512 = 516096 bytes Sector size (logical/physical): 512 bytes / 4096 bytes I/O size (minimum/optimal): 4096 bytes / 4096 bytes Disk identifier: 0x00000000
客户环境
[oracle@dbtest02 ~]$ more /etc/issue Red Hat Enterprise Linux Server release 6.6 (Santiago) Kernel \r on an \m SQL> select path from v$asm_disk; PATH ---------------------------------------- /dev/oracleasm/disks/ARCHDISK01 /dev/oracleasm/disks/OCRDISK01 /dev/oracleasm/disks/DATADISK01 [root@dbtest02 ~]# oracleasm listdisks ARCHDISK01 DATADISK01 OCRDISK01 [root@dbtest02 ~]# oracleasm querydisk -p DATADISK01 Disk "DATADISK01" is a valid ASM disk /dev/vdc1: LABEL="DATADISK01" TYPE="oracleasm" [root@dbtest02 ~]# oracleasm querydisk -p ARCHDISK01 Disk "ARCHDISK01" is a valid ASM disk /dev/vde1: LABEL="ARCHDISK01" TYPE="oracleasm" [root@dbtest02 ~]# oracleasm querydisk -p OCRDISK01 Disk "OCRDISK01" is a valid ASM disk /dev/vdd1: LABEL="OCRDISK01" TYPE="oracleasm" [root@dbtest02 ~]# fdisk -l|grep vd Disk /dev/vda: 53.7 GB, 53687091200 bytes /dev/vda1 1 6528 52427776 83 Linux Disk /dev/vdb: 17.2 GB, 17179869184 bytes Disk /dev/vdc: 1073.7 GB, 1073741824000 bytes /dev/vdc1 1 2080507 1048575496+ 83 Linux Disk /dev/vdd: 107.4 GB, 107374182400 bytes /dev/vdd1 1 208050 104857168+ 83 Linux Disk /dev/vde: 536.9 GB, 536870912000 bytes /dev/vde1 1 1040253 524287480+ 83 Linux Disk /dev/vdf: 859.0 GB, 858993459200 bytes /dev/vdf1 1 1664406 838860592+ 83 Linux [root@dbtest02 ~]# fdisk -l /dev/vdf Disk /dev/vdf: 859.0 GB, 858993459200 bytes 16 heads, 63 sectors/track, 1664406 cylinders Units = cylinders of 1008 * 512 = 516096 bytes Sector size (logical/physical): 512 bytes / 4096 bytes I/O size (minimum/optimal): 4096 bytes / 4096 bytes Disk identifier: 0x1bdbf439 Device Boot Start End Blocks Id System /dev/vdf1 1 1664406 838860592+ 83 Linux Partition 1 does not start on physical sector boundary.
这里比较明显:1)实验室环境使用的是udev,而客户环境使用的是asmlib.2)客户环境使用了分区,而实验室环境直接使用裸盘.
尝试在实验室环境使用asmlib
[root@i-q2ghx82t tmp]# ls -l *oracleasm* -rw-r--r-- 1 root root 36288 Mar 10 13:09 kmod-oracleasm-2.0.8-13.el6_8.x86_64.rpm -rw-r--r-- 1 root root 17168 Mar 10 13:09 oracleasmlib-2.0.12-1.el6.x86_64.rpm -rw-r--r-- 1 root root 74984 Mar 10 13:09 oracleasm-support-2.1.8-1.el6.x86_64.rpm [root@i-q2ghx82t tmp]# rpm -ivh kmod-oracleasm-2.0.8-13.el6_8.x86_64.rpm warning: kmod-oracleasm-2.0.8-13.el6_8.x86_64.rpm: Header V3 RSA/SHA256 Signature, key ID ec551f03: NOKEY Preparing... ########################################### [100%] 1:kmod-oracleasm ########################################### [100%] [root@i-q2ghx82t tmp]# rpm -ivh oracleasm*.rpm warning: oracleasmlib-2.0.12-1.el6.x86_64.rpm: Header V3 RSA/SHA256 Signature, key ID ec551f03: NOKEY Preparing... ########################################### [100%] 1:oracleasm-support ########################################### [ 50%] 2:oracleasmlib ########################################### [100%] [root@i-q2ghx82t tmp]# [root@i-q2ghx82t tmp]# rpm -aq|grep oracleasm kmod-oracleasm-2.0.8-13.el6_8.x86_64 oracleasmlib-2.0.12-1.el6.x86_64 oracleasm-support-2.1.8-1.el6.x86_64 [root@i-q2ghx82t dev]# oracleasm createdisk ASMVDL /dev/vdl Device "/dev/vdl" is not a partition
测试比较明显,对于没有分区的分布式存储中的盘,无法直接被asmlib创建asm disk.如果要使用asmlib,需要对磁盘进行分区.
存储特征
通过咨询分布式存储开发,确认几点:
0. 整个存储是基于网络技术实现分布式存储内部数据同步
1. 整个分布式存储使用ssd卡,默认三副本
2. 三副本中有两个副本写入成功才会反馈给应用(数据库),io成功
3. 整个分布式存储,没有引入任何的cache.
4. 存储的最小io单元是4k,由于ssd卡决定的
5. 分布式存储划分出来的lun是自动实现快对齐的
故障原因
由于存储最小单元是4k,但是在os层面物理扇区为4k,逻辑扇区为512,客户为了使用asmlib,对磁盘进行了默认分区,而没有考虑块对齐.会使得在存储整体crash的时候,很多block由于没有块对齐,很可能被进行了拆分(本来一个4k的io,存储上对对应的一个原子io完成,但是现在这个os层面的4k被拆分成了多个io的可能性,使得os层面的4k的io无法有了原则性),进而使得在数据库主机,存储突发异常的时候,有可能发生坏块.
处理建议
1. 在linux 6开始,对于asmlib,建议使用udev
2. 如果坚持asmlib,分区的时候考虑块对齐,不然会出现较大的性能下降,而且还会引起坏块的风险