联系:手机/微信(+86 17813235971) QQ(107644445)
标题:PostgreSQL恢复系列:pg_control异常恢复
作者:惜分飞©版权所有[未经本人同意,不得以任何形式转载,否则有进一步追究法律责任的权利.]
在PG中pg_control文件类似oracle数据库的control文件(控制文件),在Oracle中如果该文件丢失/损坏,可以通过alter database create controlfile命令进行创建,对于PG数据库来说也可以通过pg_resetwal命令来实现创建,由于pg_control文件损坏,需要人工指定一些参数完成pg_resetwal相关操作
pg_resetwal 使用说明
-bash-4.2$ pg_resetwal --help pg_resetwal resets the PostgreSQL write-ahead log. Usage: pg_resetwal [OPTION]... DATADIR Options: -c, --commit-timestamp-ids=XID,XID set oldest and newest transactions bearing commit timestamp (zero means no change) [-D, --pgdata=]DATADIR data directory -e, --epoch=XIDEPOCH set next transaction ID epoch -f, --force force update to be done -l, --next-wal-file=WALFILE set minimum starting location for new WAL -m, --multixact-ids=MXID,MXID set next and oldest multitransaction ID -n, --dry-run no update, just show what would be done -o, --next-oid=OID set next OID -O, --multixact-offset=OFFSET set next multitransaction offset -u, --oldest-transaction-id=XID set oldest transaction ID -V, --version output version information, then exit -x, --next-transaction-id=XID set next transaction ID --wal-segsize=SIZE size of WAL segments, in megabytes -?, --help show this help, then exit Report bugs to <pgsql-bugs@lists.postgresql.org>. PostgreSQL home page: <https://www.postgresql.org/>
确认现在业务表记录情况
-bash-4.2$ psql psql (14.3) Type "help" for help. postgres=# select count(1) from ac_event; count -------- 246266 (1 row)
模拟pg_control文件异常
-bash-4.2$ ps -ef|grep postgres postgres 37178 1 0 09:58 ? 00:00:00 /usr/pgsql-14/bin/postgres -D /var/lib/pgsql/14/data postgres 37179 37178 0 09:58 ? 00:00:00 postgres: logger postgres 37181 37178 0 09:58 ? 00:00:00 postgres: checkpointer postgres 37182 37178 0 09:58 ? 00:00:00 postgres: background writer postgres 37183 37178 0 09:58 ? 00:00:00 postgres: walwriter postgres 37184 37178 0 09:58 ? 00:00:00 postgres: autovacuum launcher postgres 37185 37178 0 09:58 ? 00:00:00 postgres: stats collector postgres 37186 37178 0 09:58 ? 00:00:00 postgres: logical replication launcher root 41368 41314 0 11:06 pts/1 00:00:00 su - postgres postgres 41369 41368 0 11:06 pts/1 00:00:00 -bash postgres 45071 41369 0 12:07 pts/1 00:00:00 ps -ef postgres 45072 41369 0 12:07 pts/1 00:00:00 grep --color=auto postgres -bash-4.2$ kill -9 37178 -bash-4.2$ ps -ef|grep postgres root 41368 41314 0 11:06 pts/1 00:00:00 su - postgres postgres 41369 41368 0 11:06 pts/1 00:00:00 -bash postgres 45095 41369 0 12:08 pts/1 00:00:00 ps -ef postgres 45096 41369 0 12:08 pts/1 00:00:00 grep --color=auto postgres -bash-4.2$ pwd /var/lib/pgsql/14/data/global -bash-4.2$ ls -l pg_control -rw-------. 1 postgres postgres 8192 May 30 12:04 pg_control -bash-4.2$ rm -rf pg_control -bash-4.2$ ls -l pg_control ls: cannot access pg_control: No such file or directory
PG启动失败
-bash-4.2$ pg_ctl start pg_ctl: another server might be running; trying to start server anyway waiting for server to start....postgres: could not find the database system Expected to find it in the directory "/var/lib/pgsql/14/data", but could not open file "/var/lib/pgsql/14/data/global/pg_control": No such file or directory stopped waiting pg_ctl: could not start server Examine the log output.
创建空pg_control文件启动依旧失败
-bash-4.2$ touch /var/lib/pgsql/14/data/global/pg_control -bash-4.2$ pg_ctl start pg_ctl: another server might be running; trying to start server anyway waiting for server to start....2022-05-30 12:09:43.953 CST [45215] PANIC: could not read file "global/pg_control": read 0 of 296 stopped waiting pg_ctl: could not start server Examine the log output.
设置next-wal-file
-l, –next-wal-file=WALFILE,这个参数设置下一个新的WAL文件的最小值,这个值可以从$PGDATA/pg_wal目录下去看最后一个WAL 文件,这个文件的id+1即可
-bash-4.2$ pwd /var/lib/pgsql/14/data/pg_wal -bash-4.2$ ls -l total 16384 -rw-------. 1 postgres postgres 16777216 May 30 12:04 000000010000000000000014 drwx------. 2 postgres postgres 6 May 24 02:20 archive_status -bash-4.2$
这个文件+1,-l 000000010000000000000015
设置next-transaction
-x, –next-transaction-id=XID,这个参数设置pg_control中的下一个XID的值,这个值可以从pg_xact目录下的文件中查询
-bash-4.2$ pwd /var/lib/pgsql/14/data/pg_xact -bash-4.2$ ls -ltr total 8 -rw-------. 1 postgres postgres 8192 May 30 12:03 0000
最后一个是0000,那么下一个XID就是0001,然后乘以 1048576 (0×100000),实际上后面直接加5个0就行了。注意,这个值是16进制的。-x 0×000100000
multixact-ids设置
-m, –multixact-ids=MXID1,MXID2,这个参数包含两个部分,MXID1和MXID2,都可以从$PGDATA/pg_multixact/offsets目录下获得。MXID1的值,首先找到最大值,+1,再乘以 65536 (0×10000,相当于后面加4个0)作为这个参数的前半部分。找到最小的值,后面加4个0,作为MXID2的值
-bash-4.2$ pwd /var/lib/pgsql/14/data/pg_multixact/offsets -bash-4.2$ ls -ltr total 8 -rw-------. 1 postgres postgres 8192 May 29 22:06 0000 -bash-4.2$
-m 0×00010000, 0×00000000(由于oldest multitransaction ID不能为0,因此后续这个值需要适当调整)
multixact-offset设置
-O, –multixact-offset=OFFSET,这个参数可以从$PGDATA/pg_multixact/members目录下获得。找到最大值,+1,乘以 52352 (0xCC80)
-bash-4.2$ pwd /var/lib/pgsql/14/data/pg_multixact/members -bash-4.2$ ls -ltr total 8 -rw-------. 1 postgres postgres 8192 May 24 02:20 0000
-O 0xCC80
尝试执行pg_resetwal
-bash-4.2$ pg_resetwal -l 000000010000000000000015 -x 0x000100000 -m 0x00010000,0x00000000 -O 0xCC80 $PGDATA pg_resetwal: error: oldest multitransaction ID (-m) must not be 0
multixact-ids值不对,进行调整后处理
postmaster.pid文件需要清理
由于PG库异常关闭,需要人工清理掉该文件
-bash-4.2$ pg_resetwal -l 000000010000000000000015 -x 0x000100000 -m 0x00020000,0x00010000 -O 0xCC80 $PGDATA pg_resetwal: error: lock file "postmaster.pid" exists -bash-4.2$ rm -rf postmaster.pid
pg_resetwal结果预览
-bash-4.2$ pg_resetwal -l 000000010000000000000015 -x 0x000100000 -m 0x00020000,0x00010000 -O 0xCC80 $PGDATA pg_resetwal: warning: pg_control exists but is broken or wrong version; ignoring it Guessed pg_control values: pg_control version number: 1300 Catalog version number: 202107181 Database system identifier: 7103392535324046312 Latest checkpoint's TimeLineID: 1 Latest checkpoint's full_page_writes: off Latest checkpoint's NextXID: 0:3 Latest checkpoint's NextOID: 12000 Latest checkpoint's NextMultiXactId: 1 Latest checkpoint's NextMultiOffset: 0 Latest checkpoint's oldestXID: 3 Latest checkpoint's oldestXID's DB: 0 Latest checkpoint's oldestActiveXID: 0 Latest checkpoint's oldestMultiXid: 1 Latest checkpoint's oldestMulti's DB: 0 Latest checkpoint's oldestCommitTsXid:0 Latest checkpoint's newestCommitTsXid:0 Maximum data alignment: 8 Database block size: 8192 Blocks per segment of large relation: 131072 WAL block size: 8192 Bytes per WAL segment: 16777216 Maximum length of identifiers: 64 Maximum columns in an index: 32 Maximum size of a TOAST chunk: 1996 Size of a large-object chunk: 2048 Date/time type storage: 64-bit integers Float8 argument passing: by value Data page checksum version: 0 Values to be changed: First log segment after reset: 000000010000000000000015 NextMultiXactId: 131072 OldestMultiXid: 65536 OldestMulti's DB: 0 NextMultiOffset: 52352 NextXID: 1048576 OldestXID: 3 OldestXID's DB: 0 If these values seem acceptable, use -f to force reset.
pg_resetwal进行创建pg_control并启动PG
-bash-4.2$ pg_resetwal -l 000000010000000000000015 -x 0x000100000 -m 0x00020000,0x00010000 -O 0xCC80 -f $PGDATA pg_resetwal: warning: pg_control exists but is broken or wrong version; ignoring it Write-ahead log reset -bash-4.2$ pg_ctl start waiting for server to start....2022-05-30 13:33:28.266 CST [51437] LOG: redirecting log output to logging collector process 2022-05-30 13:33:28.266 CST [51437] HINT: Future log output will appear in directory "log". done server started
验证数据
-bash-4.2$ psql psql (14.3) Type "help" for help. postgres=# select count(1) from ac_event; count -------- 245275 (1 row)
这种方法恢复之后,建议理解dump数据,然后导入到新库中