主机[postgres@localhost ~]$ pg_controldatapg_control version number: 903Catalog version number: 201105231Database system identifier: 5809766734683241747Database cluster state: in production备机[postgres@localhost pg_log]$ pg_controldatapg_control version number: 903Catalog version number: 201105231Database system identifier: 5809766734683241747Database cluster state: in archive recovery2.通过数据字典表pg_stat_replication 主机表中能查到记录,备机表中是没有的 主机
postgres=# select procpid,usename,application_name,client_addr,client_port,state,sync_state from pg_stat_replication;procpid | usename | application_name | client_addr | client_port | state | sync_state---------+---------+------------------+---------------+-------------+-----------+------------ 7101 | repuser | walreceiver | 192.168.2.137 | 46655 | streaming | async(1 row)--9.2版本,procpid已经改为pid3.通过进程查看,显示wal sender的是主机,显示wal receiver的是备机
主机[postgres@localhost ]$ ps -ef|grep postgrespostgres 5228 5224 0 05:26 ? 00:00:06 postgres: wal sender process repuser 192.168.2.137(43246) 0/ED610000 .....备机[postgres@localhost ]$ ps -ef|grep postgrespostgres 5228 5224 0 05:22 ? 00:00:06 postgres: wal receiver process streaming 0/ED610000postgres 5231 5224 0 05:22 ? 00:00:01 postgres: writer process .....4.通过自带的函数,是备机则是true
postgres=# select pg_is_in_recovery(); pg_is_in_recovery ------------------- f(1 row)二、模拟主机宕机,备机切换到主机 1.主机宕机: [postgres@localhost ]$pg_stop 主机停机前备机的进程
[postgres@localhost pg_log]$ ps -ef|grep postgresroot 2215 2197 0 00:35 pts/0 00:00:00 su - postgrespostgres 2217 2215 0 00:35 pts/0 00:00:00 -bashpostgres 5224 1 0 05:22 pts/0 00:00:02 /home/postgres/bin/postgres -D /database/pgdatapostgres 5226 5224 0 05:22 ? 00:00:00 postgres: logger process postgres 5227 5224 0 05:22 ? 00:00:00 postgres: startup process recovering 00000001000000000000003Bpostgres 5228 5224 0 05:22 ? 00:00:06 postgres: wal receiver process streaming 0/ED610000postgres 5231 5224 0 05:22 ? 00:00:01 postgres: writer process postgres 5232 5224 0 05:22 ? 00:00:00 postgres: stats collector process postgres 5902 2217 0 19:05 pts/0 00:00:00 ps -efpostgres 5903 2217 0 19:05 pts/0 00:00:00 grep postgres主机停机后备机的进程
[postgres@localhost pg_log]$ ps -ef|grep postgresroot 2215 2197 0 00:35 pts/0 00:00:00 su - postgrespostgres 2217 2215 0 00:35 pts/0 00:00:00 -bashpostgres 5224 1 0 05:22 pts/0 00:00:02 /home/postgres/bin/postgres -D /database/pgdatapostgres 5226 5224 0 05:22 ? 00:00:00 postgres: logger process postgres 5227 5224 0 05:22 ? 00:00:00 postgres: startup process waiting for 00000001000000000000003Cpostgres 5231 5224 0 05:22 ? 00:00:01 postgres: writer process postgres 5232 5224 0 05:22 ? 00:00:00 postgres: stats collector process postgres 5904 2217 0 19:05 pts/0 00:00:00 ps -efpostgres 5905 2217 0 19:05 pts/0 00:00:00 grep postgres可以发现原先的streaming进程(pid=5228)没了。 同时,备机中的日志,出现大量的错误信息
2012-12-24 19:09:07.064 PST,,,5948,,50d918d3.173c,1,,2012-12-24 19:09:07 PST,,0,FATAL,XX000,"could not connect to the primary server: could not connect to server: Connection refused Is the server running on host ""192.168.2.134"" and accepting TCP/IP connections on port 5432?",,,,,,,,,""2012-12-24 19:09:12.069 PST,,,5949,,50d918d8.173d,1,,2012-12-24 19:09:12 PST,,0,FATAL,XX000,"could not connect to the primary server: could not connect to server: Connection refused Is the server running on host ""192.168.2.134"" and accepting TCP/IP connections on port 5432?",,,,,,,,,""2012-12-24 19:09:17.077 PST,,,5950,,50d918dd.173e,1,,2012-12-24 19:09:17 PST,,0,FATAL,XX000,"could not connect to the primary server: could not connect to server: Connection refused Is the server running on host ""192.168.2.134"" and accepting TCP/IP connections on port 5432?",,,,,,,,,""2012-12-24 19:09:22.081 PST,,,5951,,50d918e2.173f,1,,2012-12-24 19:09:22 PST,,0,FATAL,XX000,"could not connect to the primary server: could not connect to server: Connection refused Is the server running on host ""192.168.2.134"" and accepting TCP/IP connections on port 5432?显示的错误信息很明显,primary 服务器连不上了。 此时查看备机的pg_controldata状态信息,仍是备机状态
[postgres@localhost ]$ pg_controldatapg_control version number: 903Catalog version number: 201105231Database system identifier: 5809766734683241747Database cluster state: in archive recoverypg_control last modified: Mon 24 Dec 2012 07:07:54 PM PSTLatest checkpoint location: 0/F0000020Prior checkpoint location: 0/EC01E768Latest checkpoint's REDO location: 0/F0000020Latest checkpoint's TimeLineID: 1Latest checkpoint's NextXID: 0/1968Latest checkpoint's NextOID: 24607Latest checkpoint's NextMultiXactId: 1Latest checkpoint's NextMultiOffset: 0Latest checkpoint's oldestXID: 1792Latest checkpoint's oldestXID's DB: 1Latest checkpoint's oldestActiveXID: 0Time of latest checkpoint: Mon 24 Dec 2012 07:05:05 PM PSTMinimum recovery ending location: 0/EC018CA0Backup start location: 0/0Current wal_level setting: hot_standbyCurrent max_connections setting: 100Current max_prepared_xacts setting: 0......[postgres@localhost ]$2.备机切换成主机 之前备机上的recovery.conf中配置了 trigger_file = '/database/pgdata/trigger.kenyon' 要切换备机成主机,只要创建一个触发文件trigger.kenyon即可,这个名字可以随便写。 [postgres@localhost ]$ touch /database/pgdata/trigger.kenyon 此时查看备机上的日志,可以看到成功切换到主机了。
[postgres@localhost ]$tail -f postgresql-2012-12-24_190930.csv 2012-12-24 19:09:37.100 PST,,,5954,,50d918f1.1742,1,,2012-12-24 19:09:37 PST,,0,FATAL,XX000,"could not connect to the primary server: could not connect to server: Connection refused Is the server running on host ""192.168.2.134"" and accepting TCP/IP connections on port 5432?",,,,,,,,,""2012-12-24 19:09:42.093 PST,,,5227,,50d85726.146b,6,,2012-12-24 05:22:46 PST,1/0,0,LOG,00000,"trigger file found: /database/pgdata/trigger.kenyon",,,,,,,,,""2012-12-24 19:09:42.097 PST,,,5227,,50d85726.146b,7,,2012-12-24 05:22:46 PST,1/0,0,LOG,00000,"redo done at 0/F0000020",,,,,,,,,""2012-12-24 19:09:42.104 PST,,,5227,,50d85726.146b,8,,2012-12-24 05:22:46 PST,1/0,0,LOG,00000,"last completed transaction was at log time 2012-12-24 05:29:38.526602-08",,,,,,,,,""2012-12-24 19:09:42.112 PST,,,5227,,50d85726.146b,9,,2012-12-24 05:22:46 PST,1/0,0,LOG,00000,"selected new timeline ID: 2",,,,,,,,,""2012-12-24 19:10:04.403 PST,,,5227,,50d85726.146b,10,,2012-12-24 05:22:46 PST,1/0,0,LOG,00000,"archive recovery complete",,,,,,,,,""2012-12-24 19:10:04.705 PST,,,5224,,50d8571c.1468,2,,2012-12-24 05:22:36 PST,,0,LOG,00000,"database system is ready to accept connections",,,,,,,,,""2012-12-24 19:10:04.710 PST,,,5964,,50d9190c.174c,1,,2012-12-24 19:10:04 PST,,0,LOG,00000,"autovacuum launcher started",,,,,,,,,""日志里可以体现出来原来的备机已经切换为主机了。 再去看现在这台机子的pg_controldata的信息,再次确认一下:
[postgres@localhost pg_log]$ pg_controldatapg_control version number: 903Catalog version number: 201105231Database system identifier: 5809766734683241747Database cluster state: in productionpg_control last modified: Mon 24 Dec 2012 07:10:04 PM PST已经变成production了,对,备机切主机就这么简单。 还有一处明显的变化是现在的主机(137)上的recovery.conf文件名字变成了recovery.done。 备机切换为主机后,就可以正常连接使用了。此时就有时间去处理原master端问题了。 3.宕机的主机切换成备机 先在现在的主机(137)上做一些数据的增删改
[postgres@localhost pgdata]$ psqlPassword:psql (9.1.2)Type "help" for help.postgres=# \d List of relationsSchema | Name | Type | Owner --------+--------------+-------+----------public | empsalary | table | postgrespublic | t2 | table | postgrespublic | t_index_test | table | postgrespublic | t_kenyon | table | postgrespublic | tab_kenyon | table | postgrespublic | xxxx | table | postgres(6 rows)postgres=# drop table xxxx;DROP TABLEpostgres=# drop table t2;DROP TABLEpostgres=# create table kenyon_rep(ir int,name varchar(10));CREATE TABLEpostgres=# insert into kenyon_rep select generate_series(1,100),repeat('kenyon_rep',2);ERROR: value too long for type character varying(10)postgres=# insert into kenyon_rep select generate_series(1,100),repeat('kenyon_rep',1);INSERT 0 100postgres=# \d List of relationsSchema | Name | Type | Owner --------+--------------+-------+----------public | empsalary | table | postgrespublic | kenyon_rep | table | postgrespublic | t_index_test | table | postgrespublic | t_kenyon | table | postgrespublic | tab_kenyon | table | postgres(5 rows)首先在现在的备机(134)上准备恢复文件,拷贝recovery.conf文件,并修改
[postgres@localhost ~]$ cp $PGHOME/share/recovery.conf.sample $PGDATA/recovery.conf[postgres@localhost ~]$ vi $PGDATA/recovery.confrecovery_target_timeline = 'latest'primary_conninfo = 'host=192.168.2.137 port=5432 user=repuser password=repuser'--指定137为新的主机trigger_file = '/database/pgdata/trigger.kenyon' --新的触发文件standby_mode = on --标记为备机同时修改postgresql.conf文件[postgres@localhost ~]$ vi $PGDATA/postgresql.confhot_standby = on配置好了后,我们启动134这台模拟宕掉的原主机,并使之与137连接,并做他的备机。 [postgres@localhost ~]$ pg_start 此时我们查看备机(134)的日志,有很多时间线不一致的问题(timeline)
2012-12-24 21:57:13.135 PST,,,11936,,50d94039.2ea0,1,,2012-12-24 21:57:13 PST,,0,FATAL,XX000,"timeline 2 of the primary does not match recovery target timeline 1",,,,,,,,,""2012-12-24 21:57:18.055 PST,,,11937,,50d9403e.2ea1,1,,2012-12-24 21:57:18 PST,,0,FATAL,XX000,"timeline 2 of the primary does not match recovery target timeline 1",,,,,,,,,""2012-12-24 21:57:23.070 PST,,,11938,,50d94043.2ea2,1,,2012-12-24 21:57:23 PST,,0,FATAL,XX000,"timeline 2 of the primary does not match recovery target timeline 1",,,,,,,,,""2012-12-24 21:57:28.072 PST,,,11939,,50d94048.2ea3,1,,2012-12-24 21:57:28 PST,,0,FATAL,XX000,"timeline 2 of the primary does not match recovery target timeline 1",,,,,,,,,""遇到这种问题,需要把现在主机上归档线文件拷贝到备机上,并查看日志
[postgres@localhost]$ scp 00000002.history postgres@192.168.2.134:/database/pgdata/pg_xlog查看日志,看其变化
2012-12-24 22:00:23.276 PST,,,12004,,50d940f7.2ee4,1,,2012-12-24 22:00:23 PST,,0,FATAL,XX000,"timeline 2 of the primary does not match recovery target timeline 1",,,,,,,,,""2012-12-24 22:00:28.294 PST,,,12005,,50d940fc.2ee5,1,,2012-12-24 22:00:28 PST,,0,FATAL,XX000,"timeline 2 of the primary does not match recovery target timeline 1",,,,,,,,,""2012-12-24 22:00:33.261 PST,,,11678,,50d93c09.2d9e,5,,2012-12-24 21:39:21 PST,,0,LOG,00000,"new target timeline is 2",,,,,,,,,""2012-12-24 22:00:33.289 PST,,,12007,,50d94101.2ee7,1,,2012-12-24 22:00:33 PST,,0,LOG,00000,"streaming replication successfully connected to primary",,,,,,,,,""2012-12-24 22:00:49.254 PST,,,11678,,50d93c09.2d9e,6,,2012-12-24 21:39:21 PST,,0,LOG,00000,"redo starts at 0/F0000078",,,,,,,,,""2012-12-24 22:11:08.361 PST,"postgres","postgres",12047,"[local]",50d9437c.2f0f,1,"",2012-12-24 22:11:08 PST,,0,FATAL,57P03,"the database system is starting up",,,,,,,,,""2012-12-24 22:11:28.493 PST,"postgres","postgres",12051,"[local]",50d94390.2f13,1,"",2012-12-24 22:11:28 PST,,0,FATAL,57P03,"the database system is starting up",,,,,,,,,""2012-12-24 22:12:30.547 PST,,,11672,,50d93c02.2d98,1,,2012-12-24 21:39:14 PST,,0,LOG,00000,"received fast shutdown request",,,,,,,,,""2012-12-24 22:12:30.549 PST,,,12007,,50d94101.2ee7,2,,2012-12-24 22:00:33 PST,,0,FATAL,57P01,"terminating walreceiver process due to administrator command",,,,,,,,,""2012-12-24 22:12:30.563 PST,,,11679,,50d93c09.2d9f,1,,2012-12-24 21:39:21 PST,,0,LOG,00000,"shutting down",,,,,,,,,""2012-12-24 22:12:30.567 PST,,,11679,,50d93c09.2d9f,2,,2012-12-24 21:39:21 PST,,0,LOG,00000,"database system is shut down",,,,,,,,,""仔细看日志的话可以发现,里面有两条数据很奇怪
2012-12-24 22:11:08.361 PST,"postgres","postgres",12047,"[local]",50d9437c.2f0f,1,"",2012-12-24 22:11:08 PST,,0,FATAL,57P03,"the database system is starting up",,,,,,,,,""2012-12-24 22:11:28.493 PST,"postgres","postgres",12051,"[local]",50d94390.2f13,1,"",2012-12-24 22:11:28 PST,,0,FATAL,57P03,"the database system is starting up",,,,,,,,,""这是因为在切换原主机为备机的时候,postgresql.conf中的一个参数hot_standby = off所导致,结果导致流复制是能连上的,但是数据库起不来,连接了两次,报上述两条错误信息。将这个参数改成on,然后重启一下数据库就正常了。 查看现在备机上的数据,看看更新有没有同步
查看备机上的表数据,与主机上一致了postgres=# \d List of relationsSchema | Name | Type | Owner --------+--------------+-------+----------public | empsalary | table | postgrespublic | kenyon_rep | table | postgrespublic | t_index_test | table | postgrespublic | t_kenyon | table | postgrespublic | tab_kenyon | table | postgres(5 rows)postgres=# select count(1) from kenyon_rep ;count------- 100(1 row)至此,主机和备机互为切换成功,在新主机上更新的数据也已经同步到备机上去了。 补充:9.3的版本可以不用手工复制.history了,有一个补丁将实现这个功能。 其他:采用异步方式流复制,当原主机有大量的事务操作压力比较大时,比如update,delete等操作,在原备机提升为主机后,原主机很多时候并不能正常切为备机,这是因为对于原主机,原备机会有一定的延时,也就是说原主机是超前,切换后有一部分内容主备间是不一致的,这个时候原主机降为备机就会报错。这种情况很容易模拟,在不关闭原主机的时候,把备机提升为主机,然后原主机插入新数据,再切为备机即可。 参考: http://francs3.blog.163.com/blog/static/405767272011724103133766/ http://www.depesz.com/2012/12/22/waiting-for-9-3-allow-a-streaming-replication-standby-to-follow-a-timeline-switch/