联系:手机/微信(+86 17813235971) QQ(107644445)
作者:惜分飞©版权所有[未经本人同意,不得以任何形式转载,否则有进一步追究法律责任的权利.]
一、现状描述
11月25日晚上8点40分接到现场电话,说我们公司所有员工的im不能正常登陆,im数据库服务器(mysql数据库)的公司表中对应的我们公司名称被删除,需要处理。接到这个异常后,第一想到的是下午下班前,收到一封ogg的警告邮件说ogg进程终止,然后我登陆数据库查看的时候,发现ogg已经工作正常,所以也就没有太多关注。既然已经出现了这个问题,那么先解决问题,再分析原因。因为这些都是ogg从oracle端同步过来的一些数据,所以直接从oracle那边初始化一份过来,然后重设同步程序就可以了。
二、错误分析
1、出现这个问题,第一想到的就是binlog,因为公司表的数据是从我们oracle那边同步过去的,而且oracle那边没有任何关于这个表的删除操作,所以我定位这个表的delete操作
[mysql@ezgclient mysqllog]$ mysqlbinlog mysqlbin.000150>/tmp/11_25.txt [mysql@ezgclient mysqllog]$ vi /tmp/11_25.txt …… # at 1396789 # at 1396875 #111125 16:15:31 server id 2 end_log_pos 1396875 Table_map: `a`.`abc` mapped to number 5304 #111125 16:15:31 server id 2 end_log_pos 1397000 Delete_rows: table id 5304 flags: STMT_END_F BINLOG ' o07PThMCAAAAVgAAAItQFQAAALgUAAAAAAAABnNydGFpcgAPdGFiX3V1bV9jb21wYW55AA4DD/b2 CAwMCA8PDw8PCBAAARQAFgCAAIAAQABAAEAA+D8= o07PThkCAAAAfQAAAAhRFQAQALgUAAAAAAEADv//AP7iAAAAFAC6vNbdysC1vL/GvLzT0M/euavL voAAAAAAOLkeQYAAAAAAAAAAAAEpAAAAAAAAANkqumA8EgAAO8kaakoSAAABAAAAAAAAAA9CVTAw MDAwMDAzNTgwNzg= ……
2、因为binlog_format采用的是row模式,所以需要进一步解析binglog
[mysql@ezgclient mysqllog]$ mysqlbinlog -v -v mysqlbin.000150>/tmp/11_25.txt [mysql@ezgclient mysqllog]$ vi /tmp/11_25.txt #111125 16:15:31 server id 2 end_log_pos 1396875 Table_map: `a`.`abc` mapped to number 5304 #111125 16:15:31 server id 2 end_log_pos 1397000 Delete_rows: table id 5304 flags: STMT_END_F BINLOG ' o07PThMCAAAAVgAAAItQFQAAALgUAAAAAAAABnNydGFpcgAPdGFiX3V1bV9jb21wYW55AA4DD/b2 CAwMCA8PDw8PCBAAARQAFgCAAIAAQABAAEAA+D8= o07PThkCAAAAfQAAAAhRFQAQALgUAAAAAAEADv//AP7iAAAAFAC6vNbdysC1vL/GvLzT0M/euavL voAAAAAAOLkeQYAAAAAAAAAAAAEpAAAAAAAAANkqumA8EgAAO8kaakoSAAABAAAAAAAAAA9CVTAw MDAwMDAzNTgwNzg= '/*!*/; ### DELETE FROM a.abc ### WHERE ### @1=226 /* INT meta=0 nullable=0 is_null=0 */ ### @2='*****有限公司' /* VARSTRING(256) meta=256 nullable=0 is_null=0 */ ### @3=951656001 /* DECIMAL(20,0) meta=5120 nullable=0 is_null=0 */ ### @4=000000001 /* DECIMAL(22,0) meta=5632 nullable=1 is_null=0 */ ### @5=41 /* LONGINT meta=0 nullable=1 is_null=0 */ ### @6=2005-05-30 15:11:29 /* DATETIME meta=0 nullable=1 is_null=0 */ ### @7=2011-08-17 02:02:19 /* DATETIME meta=0 nullable=1 is_null=0 */ ### @8=1 /* LONGINT meta=0 nullable=1 is_null=0 */ ### @9='BU0000000358078' /* VARSTRING(128) meta=128 nullable=1 is_null=0 */ ### @10=NULL /* VARSTRING(128) meta=128 nullable=1 is_null=1 */ ### @11=NULL /* VARSTRING(128) meta=64 nullable=1 is_null=1 */ ### @12=NULL /* VARSTRING(128) meta=64 nullable=1 is_null=1 */ ### @13=NULL /* VARSTRING(128) meta=64 nullable=1 is_null=1 */ ### @14=NULL /* VARSTRING(128) meta=0 nullable=1 is_null=1 */ # at 1397000 #111125 16:15:31 server id 2 end_log_pos 1397027 Xid = 79238866 COMMIT/*!*/; DELIMITER ; # End of log file ROLLBACK /* added by mysqlbinlog */;
通过上面的分析和这里的日志情况显示,很明显有人误删除了这条记录,导致我们公司所有员工不能登录im(登录在线的,不会使用到这条记录,这个也就是导致了我们到晚上八点多才发现这个异常)
3、检查error日志
检查这个日志,发现一个很明显的bug,这个是导致数据库重启,以及那个时间因为数据库重启导致ogg进程异常收到邮件
111125 16:15:35 InnoDB: Assertion failure in thread 1095162176 in file row/row0mysql.c line 1534 InnoDB: Failing assertion: index->type & DICT_CLUSTERED InnoDB: We intentionally generate a memory trap. InnoDB: Submit a detailed bug report to http://bugs.mysql.com. InnoDB: If you get repeated assertion failures or crashes, even InnoDB: immediately after the mysqld startup, there may be InnoDB: corruption in the InnoDB tablespace. Please refer to InnoDB: http://dev.mysql.com/doc/refman/5.1/en/forcing-recovery.html InnoDB: about forcing recovery. 111125 16:15:35 - mysqld got signal 6 ; This could be because you hit a bug. It is also possible that this binary or one of the libraries it was linked against is corrupt, improperly built, or misconfigured. This error can also be caused by malfunctioning hardware. We will try our best to scrape up some info that will hopefully help diagnose the problem, but since we have already crashed, something is definitely wrong and this may fail. key_buffer_size=8388608 read_buffer_size=1048576 max_used_connections=30 max_threads=1000 threads_connected=14 It is possible that mysqld could use up to key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 9234379 K bytes of memory Hope that's ok; if not, decrease some variables in the equation. thd: 0xb8bf170 Attempting backtrace. You can use the following information to find out where mysqld died. If you see no messages after this, something went terribly wrong... stack_bottom = 0x4146d100 thread_stack 0x30000 /opt/mysql/product/5.1/bin/mysqld(my_print_stacktrace+0x2e)[0x8a74ce] /opt/mysql/product/5.1/bin/mysqld(handle_segfault+0x322)[0x5dc992] /lib64/libpthread.so.0[0x357980eb10] /lib64/libc.so.6(gsignal+0x35)[0x3578c30265] /lib64/libc.so.6(abort+0x110)[0x3578c31d10] /opt/mysql/product/5.1/bin/mysqld(row_unlock_for_mysql+0x2f2)[0x7f4a52] /opt/mysql/product/5.1/bin/mysqld(row_search_for_mysql+0x22e1)[0x802591] /opt/mysql/product/5.1/bin/mysqld(_ZN11ha_innobase10index_readEPhPKhj16ha_rkey_function+0x192)[0x7724d2] /opt/mysql/product/5.1/bin/mysqld(_ZN7handler16read_range_firstEPK12st_key_rangeS2_bb+0xbe)[0x6caa9e] /opt/mysql/product/5.1/bin/mysqld(_ZN7handler22read_multi_range_firstEPP18st_key_multi_rangeS1_jbP17st_handler_buffer+0xce)[0x6c85be] /opt/mysql/product/5.1/bin/mysqld(_ZN18QUICK_RANGE_SELECT8get_nextEv+0x127)[0x6aa557] /opt/mysql/product/5.1/bin/mysqld[0x6c415d] /opt/mysql/product/5.1/bin/mysqld(_Z12mysql_deleteP3THDP10TABLE_LISTP4ItemP11st_sql_listyyb+0x86c)[0x66fc5c] /opt/mysql/product/5.1/bin/mysqld(_Z21mysql_execute_commandP3THD+0x38bf)[0x5f03af] /opt/mysql/product/5.1/bin/mysqld(_Z11mysql_parseP3THDPKcjPS2_+0x357)[0x5f25e7] /opt/mysql/product/5.1/bin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcj+0xe63)[0x5f3453] /opt/mysql/product/5.1/bin/mysqld(_Z10do_commandP3THD+0xe6)[0x5f3d16] /opt/mysql/product/5.1/bin/mysqld(handle_one_connection+0x236)[0x5e66d6] /lib64/libpthread.so.0[0x357980673d] /lib64/libc.so.6(clone+0x6d)[0x3578cd3d1d] Trying to get some variables. Some pointers may be invalid and cause the dump to abort... thd->query at 0x2aab4890fcd0 is an invalid pointer thd->thread_id=62259 thd->killed=NOT_KILLED The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains information that should help you find out what is causing the crash. 111125 16:15:35 mysqld_safe Number of processes running now: 0 111125 16:15:35 mysqld_safe mysqld restarted InnoDB: Log scan progressed past the checkpoint lsn 0 694228728 111125 16:15:36 InnoDB: Database was not shut down normally! InnoDB: Starting crash recovery. InnoDB: Reading tablespace information from the .ibd files... InnoDB: Restoring possible half-written data pages from the doublewrite InnoDB: buffer... InnoDB: Doing recovery: scanned up to log sequence number 0 694229872 111125 16:15:36 InnoDB: Starting an apply batch of log records to the database... InnoDB: Progress in percents: 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 5 6 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 InnoDB: Apply batch completed InnoDB: Last MySQL binlog file position 0 1397027, file name /opt/mysql/mysqldata/mysqllog/mysqlbin.000149 111125 16:15:37 InnoDB: Started; log sequence number 0 694229872 111125 16:15:37 [Note] Recovering after a crash using /opt/mysql/mysqldata/mysqllog/mysqlbin 111125 16:15:37 [Note] Starting crash recovery... 111125 16:15:37 [Note] Crash recovery finished.
而且还有个巧合就是查看binlog发现,DELETE FROM a.abc where ……之后,数据库就因为这个bug自动重启了。
4、网友解释
# /opt/mysql/product/5.1/bin/mysqld(_Z12mysql_deleteP3THDP10TABLE_LISTP4ItemP11st_sql_listyyb+0x86c)[0x66fc5c] # /opt/mysql/product/5.1/bin/mysqld(_Z21mysql_execute_commandP3THD+0x38bf)[0x5f03af] ---从这信息看是整理簇索引,导致表空间出现损坏 ---分析的情况,你的系统应该正在做一个DELETE操作,而且应该无索引可走,删除的数据量也比较大 ---可能是大量数据被缓存在innodb_buffer_pool_size中,并且其内部有创建自适应的hash索引,因删除数据而不得不重新创建, ---以及你的服务器当时IO出现瓶颈,导致一时无法响应Innodb master thread,而出现问题,并且InnoDB引擎在此方面出现过BUG ---解决版本是5.1.37之后,所以建议使用:5.1.40版本,较稳定
感谢jinguanding前辈热情帮助
http://www.itpub.net/forum.php?mod=viewthread&tid=1515971&page=1#pid18593129
请问除了提升mysql的版本之外, 还有其他方法吗?
因为我测试过提升版本, 可是还是有此问题呢.