PostgreSQL repmgr 高可用之故障转移

吉娅寿 发表于 2025-9-18 20:48:53

PostgreSQL高可用之repmgr自动切换
之前写过一个repmgr的高可用搭建的，https://www.cnblogs.com/wy123/p/18531710，repmgr的搭建过程还是比较简单的，具体过程不再赘述。这里为了简化，做了1主2从的结构，之前一直没空测试repmgr的手动和自动故障转移，抽空找了个环境，做了个repmgr的故障转移测试。

环境：

ubuntu05:192.168.152.111（postgre服务为postgresql9000，repmgr服务为repmgr9000）
ubuntu06:192.168.152.111（postgre服务为postgresql9000，repmgr服务为repmgr9000）
ubuntu07:192.168.152.111（postgre服务为postgresql9000，repmgr服务为repmgr9000）
1，ubuntu05，ubuntu06，ubuntu07是一个repmgr集群，ubuntu05为主节点，其他两个为从节点
2，强制关闭ubuntu05上的PostgreSQL服务
3，repmgr完整自动故障转移，自动提升ubuntu06为这点

repmgr配置

repmgr的配置文件repmgr.conf
node_id=2
node_name='ubuntu06'
conninfo='host=192.168.152.112 user=repmgr dbname=repmgr port=9000 connect_timeout=100'
data_directory='/usr/local/pgsql16/pg9000/data'
pg_bindir='/usr/local/pgsql16/server/bin'
priority=80

#自动故障转移配置
failover=automatic
promote_command='/usr/local/pgsql16/server/bin/repmgr standby promote -f /usr/local/pgsql16/repmgr/repmgr.conf --log-to-file'
follow_command='/usr/local/pgsql16/server/bin/repmgr standby follow -f /usr/local/pgsql16/repmgr/repmgr.conf --log-to-file --upstream-node-id=%n'
log_file='/usr/local/pgsql16/repmgr/repmgr.log'

#要启用 repmgrd 守护进程和监控，需在 repmgr.conf中启用 moitoring_history=yes
monitoring_history=true
#默认监控时间间隔为2秒
monitor_interval_secs=5
#故障转移之前，尝试重新连接主库次数（默认为6）参数
reconnect_attempts=12
#每间隔5s尝试重新连接一次参数
reconnect_interval=5repmgrd的systemd服务启动脚本，设置repmgrd自动启动

Description=PostgreSQL Replication Manager Daemon
After=network.target postgresql9000.service
Requires=postgresql9000.service

Type=forking
User=postgres
Group=postgres
ExecStart=/usr/local/pgsql16/server/bin/repmgrd -f /usr/local/pgsql16/repmgr/repmgr.conf --pid-file /usr/local/pgsql16/repmgr/repmgrd.pid
ExecStop=/bin/kill -QUIT $MAINPID
PIDFile=/usr/local/pgsql16/repmgr/repmgrd.pid
Restart=always
RestartSec=5

# 环境变量（如果需要）
Environment=PATH=/usr/local/pgsql16/server/bin:/usr/local/bin:/usr/bin:/bin

WantedBy=multi-user.target

手动切换主从

repmgr的前置条件是需要节点之间ssh互信，

1，手动故障转移，哪个从节点需要提升为主节点，就在哪个节点上执行：
/usr/local/pgsql16/server/bin/repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf standby switchover --siblings-follow
--siblings-follow表示所有从库的同步源自动改成最新的主库节点

switchover的内部流程如下：
1.关闭当前的主库 ubuntu06
2.等待老主库彻底关闭后，在 ubuntu05 上进行 pg_promote()
3.重启启动老主库 ubuntu06，降级成 standby 数据库，指向复制源 ubuntu05
4.sibling nodes兄弟节点同样进行了复制源重定向，指向 ubuntu05
5.整个switchover 过程结束

在当前节点Ubuntu04查看集群状态
repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf cluster show
postgres@ubuntu05:~$ repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+----------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------------
1| ubuntu05 | standby | running | ubuntu06 | default| 80    | 2    | host=192.168.152.111 user=repmgr dbname=repmgr port=9000 connect_timeout=100
2| ubuntu06 | primary | * running |       | default| 80    | 2    | host=192.168.152.112 user=repmgr dbname=repmgr port=9000 connect_timeout=100
3| ubuntu07 | standby | running | ubuntu06 | default| 60    | 2    | host=192.168.152.113 user=repmgr dbname=repmgr port=9000 connect_timeout=100
postgres@ubuntu05:~$
postgres@ubuntu05:~$
postgres@ubuntu05:~$

执行switchover
postgres@ubuntu05:~$ /usr/local/pgsql16/server/bin/repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf standby switchover --siblings-follow
NOTICE: executing switchover on node "ubuntu05" (ID: 1)
NOTICE: attempting to pause repmgrd on 3 nodes
NOTICE: local node "ubuntu05" (ID: 1) will be promoted to primary; current primary "ubuntu06" (ID: 2) will be demoted to standby
NOTICE: stopping current primary node "ubuntu06" (ID: 2)
NOTICE: issuing CHECKPOINT on node "ubuntu06" (ID: 2)
DETAIL: executing server command "/usr/local/pgsql16/server/bin/pg_ctl-D '/usr/local/pgsql16/pg9000/data' -W -m fast stop"
INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
INFO: checking for primary shutdown; 2 of 60 attempts ("shutdown_check_timeout")
INFO: checking for primary shutdown; 3 of 60 attempts ("shutdown_check_timeout")
INFO: checking for primary shutdown; 4 of 60 attempts ("shutdown_check_timeout")
INFO: checking for primary shutdown; 5 of 60 attempts ("shutdown_check_timeout")
INFO: checking for primary shutdown; 6 of 60 attempts ("shutdown_check_timeout")
NOTICE: current primary has been cleanly shut down at location 0/18000028
NOTICE: promoting standby to primary
DETAIL: promoting server "ubuntu05" (ID: 1) using pg_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "ubuntu05" (ID: 1) was successfully promoted to primary
NOTICE: node "ubuntu05" (ID: 1) promoted to primary, node "ubuntu06" (ID: 2) demoted to standby
NOTICE: executing STANDBY FOLLOW on 1 of 1 siblings
INFO: STANDBY FOLLOW successfully executed on all reachable sibling nodes
NOTICE: switchover was successful
DETAIL: node "ubuntu05" is now primary and node "ubuntu06" is attached as standby
NOTICE: STANDBY SWITCHOVER has completed successfully
postgres@ubuntu05:~$
postgres@ubuntu05:~$

postgres@ubuntu05:~$
postgres@ubuntu05:~$ repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+----------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------------
1| ubuntu05 | primary | * running |       | default| 80    | 3    | host=192.168.152.111 user=repmgr dbname=repmgr port=9000 connect_timeout=100
2| ubuntu06 | standby | running | ubuntu05 | default| 80    | 2    | host=192.168.152.112 user=repmgr dbname=repmgr port=9000 connect_timeout=100
3| ubuntu07 | standby | running | ubuntu05 | default| 60    | 2    | host=192.168.152.113 user=repmgr dbname=repmgr port=9000 connect_timeout=100
postgres@ubuntu05:~$
手动故障转移

1，kill或者停止主节点服务来模拟主节点故障
systemctl stop postgresql9000

2，从节点上查看集群状态，此时原始主节点已不可达
postgres@ubuntu06:~$ repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf cluster show
ID | Name | Role | Status    | Upstream | Location | Priority | Timeline | Connection string
----+----------+---------+---------------+------------+----------+----------+----------+------------------------------------------------------------------------------
1| ubuntu05 | primary | ? unreachable | ?       | default| 80    |       | host=192.168.152.111 user=repmgr dbname=repmgr port=9000 connect_timeout=100
2| ubuntu06 | standby | running | ? ubuntu05 | default| 80    | 3    | host=192.168.152.112 user=repmgr dbname=repmgr port=9000 connect_timeout=100
3| ubuntu07 | standby | running | ? ubuntu05 | default| 60    | 3    | host=192.168.152.113 user=repmgr dbname=repmgr port=9000 connect_timeout=100

WARNING: following issues were detected
- unable to connect to node "ubuntu05" (ID: 1)
- node "ubuntu05" (ID: 1) is registered as an active primary but is unreachable
- unable to connect to node "ubuntu06" (ID: 2)'s upstream node "ubuntu05" (ID: 1)
- unable to determine if node "ubuntu06" (ID: 2) is attached to its upstream node "ubuntu05" (ID: 1)
- unable to connect to node "ubuntu07" (ID: 3)'s upstream node "ubuntu05" (ID: 1)
- unable to determine if node "ubuntu07" (ID: 3) is attached to its upstream node "ubuntu05" (ID: 1)

HINT: execute with --verbose option to see connection error messages
postgres@ubuntu06:~$

3，手动 promote 把 ubuntu06 提升为主库
/usr/local/pgsql16/server/bin/repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf standby promote --siblings-follow
检查集群状态，此时Ubuntu06已经成为主节点，原主库 pg02 被标记为 failed 的状态

postgres@ubuntu06:~$
postgres@ubuntu06:~$ /usr/local/pgsql16/server/bin/repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf standby promote --siblings-follow
NOTICE: promoting standby to primary
DETAIL: promoting server "ubuntu06" (ID: 2) using pg_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "ubuntu06" (ID: 2) was successfully promoted to primary
NOTICE: executing STANDBY FOLLOW on 1 of 1 siblings
INFO: STANDBY FOLLOW successfully executed on all reachable sibling nodes
postgres@ubuntu06:~$
postgres@ubuntu06:~$###检查集群状态，此时Ubuntu06已经成为主节点
postgres@ubuntu06:~$ repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+----------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------------
1| ubuntu05 | primary | - failed| ?    | default| 80    |       | host=192.168.152.111 user=repmgr dbname=repmgr port=9000 connect_timeout=100
2| ubuntu06 | primary | * running |       | default| 80    | 4    | host=192.168.152.112 user=repmgr dbname=repmgr port=9000 connect_timeout=100
3| ubuntu07 | standby | running | ubuntu06 | default| 60    | 3    | host=192.168.152.113 user=repmgr dbname=repmgr port=9000 connect_timeout=100

WARNING: following issues were detected
- unable to connect to node "ubuntu05" (ID: 1)

HINT: execute with --verbose option to see connection error messages
postgres@ubuntu06:~$

4，老主库重新加入集群
4.1 启动老主库
root@ubuntu05:~# systemctl start postgresql9000
root@ubuntu05:~#
root@ubuntu05:~# su - postgres
postgres@ubuntu05:~$
postgres@ubuntu05:~$
postgres@ubuntu05:~$ /usr/local/pgsql16/server/bin/repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf cluster show
ID | Name | Role | Status             | Upstream | Location | Priority | Timeline | Connection string
----+----------+---------+----------------------+------------+----------+----------+----------+------------------------------------------------------------------------------
1| ubuntu05 | primary | * running          |          | default| 80    | 3    | host=192.168.152.111 user=repmgr dbname=repmgr port=9000 connect_timeout=100
2| ubuntu06 | standby | ! running as primary |          | default| 80    | 4    | host=192.168.152.112 user=repmgr dbname=repmgr port=9000 connect_timeout=100
3| ubuntu07 | standby | running          | ! ubuntu06 | default| 60    | 3    | host=192.168.152.113 user=repmgr dbname=repmgr port=9000 connect_timeout=100

WARNING: following issues were detected
   - node "ubuntu06" (ID: 2) is registered as standby but running as primary
   - node "ubuntu07" (ID: 3) reports a different upstream (reported: "ubuntu06", expected "ubuntu05")

postgres@ubuntu05:~$

4.2 执行pg_rewind
/usr/local/pgsql16/server/bin/repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf node rejoin -d 'host=ubuntu06 dbname=repmgr user=repmgr password=****** port=9000' --force-rewind --dry-run

postgres@ubuntu05:~$ /usr/local/pgsql16/server/bin/repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf node rejoin -d 'host=ubuntu06 dbname=repmgr user=repmgr password=****** port=9000' --force-rewind --dry-run
NOTICE: rejoin target is node "ubuntu06" (ID: 2)
INFO: replication connection to the rejoin target node was successful
INFO: local and rejoin target system identifiers match
DETAIL: system identifier is 7550951818891860956
NOTICE: pg_rewind execution required for this node to attach to rejoin target node 2
DETAIL: rejoin target server s timeline 4 forked off current database system timeline 3 before current recovery point 0/1B000028
INFO: prerequisites for using pg_rewind are met
INFO: pg_rewind would now be executed
DETAIL: pg_rewind command is:
   /usr/local/pgsql16/server/bin/pg_rewind -D '/usr/local/pgsql16/pg9000/data' --source-server='host=192.168.152.112 user=repmgr dbname=repmgr port=9000 connect_timeout=100'
INFO: prerequisites for executing NODE REJOIN are met
postgres@ubuntu05:~$
postgres@ubuntu05:~$

或者简单粗暴，直接删除本地的数据，重新克隆

克隆数据库
/usr/local/pgsql16/server/bin/repmgr -h 192.168.152.112 -p 9000 -U repmgr -d repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf standby clone --dry-run
直接启动数据库服务即可
--取消注册，实际上是从nodes表中删除数据
/usr/local/pgsql16/server/bin/repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf standby unregister
--重新注册，重新将repmgr.conf中的配置加载到nodes表中
/usr/local/pgsql16/server/bin/repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf standby register

--强制注册force，实际上就是覆盖现有的配置
/usr/local/pgsql16/server/bin/repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf standby register --force
--指定主节点，一般不用指定，直接会根据postgresql.auto.conf找到主节点
/usr/local/pgsql16/server/bin/repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf standby register--upstream-node-id=2

对正常节点重新注册，目的是修改配置之后，重新注册会，达到重新加载的功能，从节点(pg02，pg03)进行重新注册操作
$ repmgr -f /home/postgres/repmgr/repmgr.conf standby unregister
$ repmgr -f /home/postgres/repmgr/repmgr.conf standby register --upstream-node-id=1

自动故障转移

强制关闭主节点Ubuntu05上的PostgreSQL服务模拟故障

自动故障转移过程如下：

repmgr的转移过程日志，可以看到repmgr会根据上面配置文件的重试间隔reconnect_interval和重试参数reconnect_attempts，一直重试，如果最终主节点不可达，开始故障转移，整个过程为1分钟
monitoring connection to upstream node "ubuntu05" (ID: 1)
node "ubuntu06" (ID: 2) monitoring upstream node "ubuntu05" (ID: 1) in normal state
last monitoring statistics update was 5 seconds ago
node "ubuntu06" (ID: 2) monitoring upstream node "ubuntu05" (ID: 1) in normal state
last monitoring statistics update was 5 seconds ago
***************************************************这里开始模拟主节点故障，从节点开始重试*************************************************************************
unable to ping "host=192.168.152.111 user=repmgr dbname=repmgr port=9000 connect_timeout=100"
PQping() returned "PQPING_NO_RESPONSE"
unable to connect to upstream node "ubuntu05" (ID: 1)
checking state of node "ubuntu05" (ID: 1), 1 of 12 attempts
unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
PQping() returned "PQPING_NO_RESPONSE"
sleeping up to 5 seconds until next reconnection attempt
unable to ping "host=192.168.152.111 user=repmgr dbname=repmgr port=9000 connect_timeout=100"
PQping() returned "PQPING_NO_RESPONSE"
unable to connect to upstream node "ubuntu05" (ID: 1)
checking state of node "ubuntu05" (ID: 1), 1 of 12 attempts
unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
PQping() returned "PQPING_NO_RESPONSE"
sleeping up to 5 seconds until next reconnection attempt
checking state of node "ubuntu05" (ID: 1), 2 of 12 attempts
unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
PQping() returned "PQPING_NO_RESPONSE"
sleeping up to 5 seconds until next reconnection attempt
checking state of node "ubuntu05" (ID: 1), 2 of 12 attempts
unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
PQping() returned "PQPING_NO_RESPONSE"
sleeping up to 5 seconds until next reconnection attempt
checking state of node "ubuntu05" (ID: 1), 3 of 12 attempts
unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
PQping() returned "PQPING_NO_RESPONSE"
sleeping up to 5 seconds until next reconnection attempt
checking state of node "ubuntu05" (ID: 1), 3 of 12 attempts
unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
PQping() returned "PQPING_NO_RESPONSE"
sleeping up to 5 seconds until next reconnection attempt
checking state of node "ubuntu05" (ID: 1), 4 of 12 attempts
unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
PQping() returned "PQPING_NO_RESPONSE"
sleeping up to 5 seconds until next reconnection attempt
checking state of node "ubuntu05" (ID: 1), 4 of 12 attempts
unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
PQping() returned "PQPING_NO_RESPONSE"
sleeping up to 5 seconds until next reconnection attempt
checking state of node "ubuntu05" (ID: 1), 5 of 12 attempts
unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
PQping() returned "PQPING_NO_RESPONSE"
sleeping up to 5 seconds until next reconnection attempt
checking state of node "ubuntu05" (ID: 1), 5 of 12 attempts
unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
PQping() returned "PQPING_NO_RESPONSE"
sleeping up to 5 seconds until next reconnection attempt
checking state of node "ubuntu05" (ID: 1), 6 of 12 attempts
unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
PQping() returned "PQPING_NO_RESPONSE"
sleeping up to 5 seconds until next reconnection attempt
checking state of node "ubuntu05" (ID: 1), 6 of 12 attempts
unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
PQping() returned "PQPING_NO_RESPONSE"
sleeping up to 5 seconds until next reconnection attempt
checking state of node "ubuntu05" (ID: 1), 7 of 12 attempts
unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
PQping() returned "PQPING_NO_RESPONSE"
sleeping up to 5 seconds until next reconnection attempt
checking state of node "ubuntu05" (ID: 1), 7 of 12 attempts
unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
PQping() returned "PQPING_NO_RESPONSE"
sleeping up to 5 seconds until next reconnection attempt
checking state of node "ubuntu05" (ID: 1), 8 of 12 attempts
unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
PQping() returned "PQPING_NO_RESPONSE"
sleeping up to 5 seconds until next reconnection attempt
checking state of node "ubuntu05" (ID: 1), 8 of 12 attempts
unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
PQping() returned "PQPING_NO_RESPONSE"
sleeping up to 5 seconds until next reconnection attempt
checking state of node "ubuntu05" (ID: 1), 9 of 12 attempts
unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
PQping() returned "PQPING_NO_RESPONSE"
sleeping up to 5 seconds until next reconnection attempt
checking state of node "ubuntu05" (ID: 1), 9 of 12 attempts
unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
PQping() returned "PQPING_NO_RESPONSE"
sleeping up to 5 seconds until next reconnection attempt
checking state of node "ubuntu05" (ID: 1), 10 of 12 attempts
unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
PQping() returned "PQPING_NO_RESPONSE"
sleeping up to 5 seconds until next reconnection attempt
checking state of node "ubuntu05" (ID: 1), 10 of 12 attempts
unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
PQping() returned "PQPING_NO_RESPONSE"
sleeping up to 5 seconds until next reconnection attempt
checking state of node "ubuntu05" (ID: 1), 11 of 12 attempts
unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
PQping() returned "PQPING_NO_RESPONSE"
sleeping up to 5 seconds until next reconnection attempt
checking state of node "ubuntu05" (ID: 1), 11 of 12 attempts
unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
PQping() returned "PQPING_NO_RESPONSE"
sleeping up to 5 seconds until next reconnection attempt
checking state of node "ubuntu05" (ID: 1), 12 of 12 attempts
unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
PQping() returned "PQPING_NO_RESPONSE"
unable to reconnect to node "ubuntu05" (ID: 1) after 12 attempts
1 active sibling nodes registered
3 total nodes registered
primary node"ubuntu05" (ID: 1) and this node have the same location ("default")
local node's last receive lsn: 0/220000A0
checking state of sibling node "ubuntu07" (ID: 3)
node "ubuntu07" (ID: 3) reports its upstream is node 1, last seen 56 second(s) ago
standby node "ubuntu07" (ID: 3) last saw primary node 56 second(s) ago
last receive LSN for sibling node "ubuntu07" (ID: 3) is: 0/220000A0
node "ubuntu07" (ID: 3) has same LSN as current candidate "ubuntu06" (ID: 2)
node "ubuntu07" (ID: 3) has lower priority (60) than current candidate "ubuntu06" (ID: 2) (80)
visible nodes: 2; total nodes: 2; no nodes have seen the primary within the last 10 seconds
promotion candidate is "ubuntu06" (ID: 2)
this node is the winner, will now promote itself and inform other nodes
promote_command is:
"/usr/local/pgsql16/server/bin/repmgr standby promote -f /usr/local/pgsql16/repmgr/repmgr.conf --log-to-file"
redirecting logging output to "/usr/local/pgsql16/repmgr/repmgr.log"

1 sibling nodes found, but option "--siblings-follow" not specified
these nodes will remain attached to the current primary:
ubuntu07 (node ID: 3)
promoting standby to primary
promoting server "ubuntu06" (ID: 2) using pg_promote()
waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
checking state of node "ubuntu05" (ID: 1), 12 of 12 attempts
unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
PQping() returned "PQPING_NO_RESPONSE"
unable to reconnect to node "ubuntu05" (ID: 1) after 12 attempts
1 active sibling nodes registered
3 total nodes registered
primary node"ubuntu05" (ID: 1) and this node have the same location ("default")
local node's last receive lsn: 0/220000A0
checking state of sibling node "ubuntu07" (ID: 3)
node "ubuntu07" (ID: 3) reports its upstream is node 1, last seen 56 second(s) ago
standby node "ubuntu07" (ID: 3) last saw primary node 56 second(s) ago
last receive LSN for sibling node "ubuntu07" (ID: 3) is: 0/220000A0
node "ubuntu07" (ID: 3) has same LSN as current candidate "ubuntu06" (ID: 2)
node "ubuntu07" (ID: 3) has lower priority (60) than current candidate "ubuntu06" (ID: 2) (80)
visible nodes: 2; total nodes: 2; no nodes have seen the primary within the last 10 seconds
promotion candidate is "ubuntu06" (ID: 2)
this node is the winner, will now promote itself and inform other nodes
promote_command is:
"/usr/local/pgsql16/server/bin/repmgr standby promote -f /usr/local/pgsql16/repmgr/repmgr.conf --log-to-file"
redirecting logging output to "/usr/local/pgsql16/repmgr/repmgr.log"

STANDBY PROMOTE can only be executed on a standby node
promote command failed
promote command exited with error code 8
checking if original primary node has reappeared
connection to database failed

connection to server at "192.168.152.111", port 9000 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?

attempted to connect using:
user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr options=-csearch_path=
unable to ping "host=192.168.152.111 user=repmgr dbname=repmgr port=9000 connect_timeout=100"
PQping() returned "PQPING_NO_RESPONSE"
local node is primary, checking local node state
resuming monitoring as primary node after 0 seconds
1 followers to notify
reconnecting to node "ubuntu07" (ID: 3)...
notifying node "ubuntu07" (ID: 3) to follow node 2
INFO:node 3 received notification to follow node 2
monitoring cluster primary "ubuntu06" (ID: 2)
STANDBY PROMOTE successful
server "ubuntu06" (ID: 2) was successfully promoted to primary
checking state of node 2, 1 of 12 attempts
node 2 has recovered, reconnecting
connection to node 2 succeeded
original connection is still available
1 followers to notify
notifying node "ubuntu07" (ID: 3) to follow node 2
INFO:node 3 received notification to follow node 2
switching to primary monitoring mode
monitoring cluster primary "ubuntu06" (ID: 2)
child node "ubuntu07" (ID: 3) is attached
new standby "ubuntu07" (ID: 3) has connected
monitoring primary node "ubuntu06" (ID: 2) in normal state
monitoring primary node "ubuntu06" (ID: 2) in normal state
monitoring primary node "ubuntu06" (ID: 2) in normal state
monitoring primary node "ubuntu06" (ID: 2) in normal state
monitoring primary node "ubuntu06" (ID: 2) in normal state
monitoring primary node "ubuntu06" (ID: 2) in normal state
monitoring primary node "ubuntu06" (ID: 2) in normal state
monitoring primary node "ubuntu06" (ID: 2) in normal state
repmgr的优缺点总结

repmgr在高可用方案上，勉强能用吧。
优点是安装配置都比较简单，
缺点是没办法做到连续自动故障转移，第一次转移完成后，故障节点想拉起来，还是要先做手动pg_rewind。
repmgr把元数据保存在本地的PostgreSQL数据库中，数据库启动之前repmgr进程不知道集群状态，所以不可能自动rewind，这也就是用PostgreSQL自身保存集群元数据的缺陷，也算是跟partoni的差距吧。

来源：程序园用户自行投稿发布，如果侵权，请联系站长删除
免责声明：如果侵犯了您的权益，请联系站长，我们会及时删除侵权内容，谢谢合作！

唐嘉懿 发表于 2025-11-6 05:44:16

喜欢鼓捣这些软件，现在用得少，谢谢分享！

乳杂丫 发表于 2025-12-10 20:23:03

感谢发布原创作品，程序园因你更精彩

红弘丽 发表于 2025-12-22 13:19:27

懂技术并乐意极积无私分享的人越来越少。珍惜

旌磅箱 发表于 2026-1-5 20:24:34

前排留名，哈哈哈

讣丢发表于 2026-1-14 16:27:32

不错，里面软件多更新就更好了

迫蔺发表于 2026-1-20 07:56:31

用心讨论，共获提升！

坪钗发表于 2026-1-26 10:34:43

喜欢鼓捣这些软件，现在用得少，谢谢分享！

恙髡发表于 2026-1-29 06:50:17

这个有用。

全阳霁 发表于 2026-2-7 09:29:04

感谢分享，学习下。

粒浊发表于 2026-2-8 00:54:09

鼓励转贴优秀软件安全工具和文档！

骆贵发表于 2026-2-8 04:06:53

谢谢分享，辛苦了

扈怀易 发表于 2026-2-9 06:52:40

新版吗？好像是停更了吧。

辜酗徇 发表于 2026-2-9 17:53:43

感谢发布原创作品，程序园因你更精彩

喳谍发表于 2026-2-10 04:25:41

分享、互助让互联网精神温暖你我

糙昧邵 发表于 2026-2-10 18:25:38

热心回复！

上官泰 发表于 2026-2-10 22:08:34

喜欢鼓捣这些软件，现在用得少，谢谢分享！

鞍汉发表于 2026-2-10 23:21:29

感谢分享，学习下。

郗新语 发表于 2026-2-12 15:23:57

这个好，看起来很实用

碛物发表于 2026-2-12 20:42:28

懂技术并乐意极积无私分享的人越来越少。珍惜

页: [1] 2

程序园's Archiver

PostgreSQL repmgr 高可用之故障转移