
PostgreSQL repmgr High Availability: Failover

吉娅寿 2025-9-18 20:48:53
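PostgreSQL High Availability: Automatic Switchover with repmgr
I previously wrote up how to build a repmgr high-availability cluster (https://www.cnblogs.com/wy123/p/18531710). Setting up repmgr is fairly simple, so the steps are not repeated here. To keep things simple, this test uses a topology of one primary and two standbys. I had never found time to test repmgr's manual and automatic failover, so I set aside an environment and ran the failover tests described below.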


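Environment:

ubuntu05: 192.168.152.111 (PostgreSQL service: postgresql9000; repmgr service: repmgr9000)
ubuntu06: 192.168.152.112 (PostgreSQL service: postgresql9000; repmgr service: repmgr9000)
ubuntu07: 192.168.152.113 (PostgreSQL service: postgresql9000; repmgr service: repmgr9000)
1. ubuntu05, ubuntu06, and ubuntu07 form one repmgr cluster; ubuntu05 is the primary and the other two are standbys.
2. Forcibly stop the PostgreSQL service on ubuntu05.
3. repmgr completes an automatic failover and promotes ubuntu06 to primary.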
 
repmgr configuration

The repmgr configuration file, repmgr.conf (shown here for node ubuntu06):
    node_id=2
    node_name='ubuntu06'
    conninfo='host=192.168.152.112 user=repmgr dbname=repmgr port=9000 connect_timeout=100'
    data_directory='/usr/local/pgsql16/pg9000/data'
    pg_bindir='/usr/local/pgsql16/server/bin'
    priority=80
    # Automatic failover settings
    failover=automatic
    promote_command='/usr/local/pgsql16/server/bin/repmgr standby promote -f /usr/local/pgsql16/repmgr/repmgr.conf --log-to-file'
    follow_command='/usr/local/pgsql16/server/bin/repmgr standby follow -f /usr/local/pgsql16/repmgr/repmgr.conf --log-to-file --upstream-node-id=%n'
    log_file='/usr/local/pgsql16/repmgr/repmgr.log'
    # To enable monitoring by the repmgrd daemon, turn on monitoring_history in repmgr.conf
    monitoring_history=true
    # Monitoring interval in seconds (the default is 2)
    monitor_interval_secs=5
    # Number of attempts to reconnect to the primary before failing over (the default is 6)
    reconnect_attempts=12
    # Wait 5 seconds between reconnection attempts
    reconnect_interval=5
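As a rough check on the retry settings above, the time repmgrd spends retrying before it starts a failover is about reconnect_attempts × reconnect_interval. The sketch below reads both values from a repmgr.conf-style file and prints the product. The helper names conf_value and failover_window_secs are hypothetical, the parsing assumes plain key=value lines as in the file above, and the fallback values 6 and 10 are assumed to be repmgr's defaults.

```shell
#!/bin/sh
# Hypothetical helpers; they assume plain key=value lines as in the repmgr.conf above.

# Print the value of key $2 from config file $1, stripping optional single quotes.
conf_value() {
    sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*//p" "$1" | tail -n 1 | tr -d "'"
}

# Print reconnect_attempts * reconnect_interval (in seconds) for config file $1.
# Falls back to 6 attempts and a 10-second interval when a key is missing
# (assumed repmgr defaults).
failover_window_secs() {
    attempts=$(conf_value "$1" reconnect_attempts)
    interval=$(conf_value "$1" reconnect_interval)
    echo $(( ${attempts:-6} * ${interval:-10} ))
}

# Usage: failover_window_secs /usr/local/pgsql16/repmgr/repmgr.conf
```

With the values in the file above (12 attempts, 5 seconds apart) this prints 60, which matches the roughly one-minute delay observed in the automatic failover test.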
The systemd unit file for repmgrd, which makes repmgrd start automatically:
    [Unit]
    Description=PostgreSQL Replication Manager Daemon
    After=network.target postgresql9000.service
    Requires=postgresql9000.service
    [Service]
    Type=forking
    User=postgres
    Group=postgres
    ExecStart=/usr/local/pgsql16/server/bin/repmgrd -f /usr/local/pgsql16/repmgr/repmgr.conf --pid-file /usr/local/pgsql16/repmgr/repmgrd.pid
    ExecStop=/bin/kill -QUIT $MAINPID
    PIDFile=/usr/local/pgsql16/repmgr/repmgrd.pid
    Restart=always
    RestartSec=5
    # Environment variables (if needed)
    Environment=PATH=/usr/local/pgsql16/server/bin:/usr/local/bin:/usr/bin:/bin
    [Install]
    WantedBy=multi-user.target
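For systemd to pick up the unit and start it at boot, it would typically be reloaded, enabled, and started along the following lines. This is only a sketch: the source names the service repmgr9000 but does not say where the unit file is stored, so the path /etc/systemd/system/repmgr9000.service is an assumption, and enable_repmgrd is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: reload systemd, then enable, start, and verify the given unit.
# Assumes the unit file above was saved as /etc/systemd/system/<unit>.service.
enable_repmgrd() {
    unit=${1:-repmgr9000}
    systemctl daemon-reload &&
    systemctl enable "$unit" &&
    systemctl start "$unit" &&
    systemctl is-active "$unit"
}

# Usage (as root): enable_repmgrd repmgr9000
```

Because the unit declares Requires=postgresql9000.service, starting it this way should also pull in the PostgreSQL service if it is not already running.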
 

Manual switchover
    Passwordless SSH between the nodes is a prerequisite for a repmgr switchover.
    1. To switch over manually, run the following on whichever standby is to be promoted to primary:
        /usr/local/pgsql16/server/bin/repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf standby switchover --siblings-follow
        --siblings-follow makes every other standby automatically change its replication source to the new primary.
        Internally, the switchover proceeds as follows:
        1. Shut down the current primary, ubuntu06.
        2. Once the old primary has shut down completely, run pg_promote() on ubuntu05.
        3. Restart the old primary ubuntu06, demoted to a standby whose replication source is ubuntu05.
        4. The sibling nodes are likewise redirected to replicate from ubuntu05.
        5. The switchover is complete.

        Check the cluster status on the current node, ubuntu05:
            postgres@ubuntu05:~$ repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf cluster show
             ID | Name     | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
            ----+----------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------------
             1  | ubuntu05 | standby |   running | ubuntu06 | default  | 80       | 2        | host=192.168.152.111 user=repmgr dbname=repmgr port=9000 connect_timeout=100
             2  | ubuntu06 | primary | * running |          | default  | 80       | 2        | host=192.168.152.112 user=repmgr dbname=repmgr port=9000 connect_timeout=100
             3  | ubuntu07 | standby |   running | ubuntu06 | default  | 60       | 2        | host=192.168.152.113 user=repmgr dbname=repmgr port=9000 connect_timeout=100

        Run the switchover:
            postgres@ubuntu05:~$ /usr/local/pgsql16/server/bin/repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf standby switchover --siblings-follow
            NOTICE: executing switchover on node "ubuntu05" (ID: 1)
            NOTICE: attempting to pause repmgrd on 3 nodes
            NOTICE: local node "ubuntu05" (ID: 1) will be promoted to primary; current primary "ubuntu06" (ID: 2) will be demoted to standby
            NOTICE: stopping current primary node "ubuntu06" (ID: 2)
            NOTICE: issuing CHECKPOINT on node "ubuntu06" (ID: 2)
            DETAIL: executing server command "/usr/local/pgsql16/server/bin/pg_ctl  -D '/usr/local/pgsql16/pg9000/data' -W -m fast stop"
            INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
            INFO: checking for primary shutdown; 2 of 60 attempts ("shutdown_check_timeout")
            INFO: checking for primary shutdown; 3 of 60 attempts ("shutdown_check_timeout")
            INFO: checking for primary shutdown; 4 of 60 attempts ("shutdown_check_timeout")
            INFO: checking for primary shutdown; 5 of 60 attempts ("shutdown_check_timeout")
            INFO: checking for primary shutdown; 6 of 60 attempts ("shutdown_check_timeout")
            NOTICE: current primary has been cleanly shut down at location 0/18000028
            NOTICE: promoting standby to primary
            DETAIL: promoting server "ubuntu05" (ID: 1) using pg_promote()
            NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
            NOTICE: STANDBY PROMOTE successful
            DETAIL: server "ubuntu05" (ID: 1) was successfully promoted to primary
            NOTICE: node "ubuntu05" (ID: 1) promoted to primary, node "ubuntu06" (ID: 2) demoted to standby
            NOTICE: executing STANDBY FOLLOW on 1 of 1 siblings
            INFO: STANDBY FOLLOW successfully executed on all reachable sibling nodes
            NOTICE: switchover was successful
            DETAIL: node "ubuntu05" is now primary and node "ubuntu06" is attached as standby
            NOTICE: STANDBY SWITCHOVER has completed successfully

        Check the cluster status again; ubuntu05 is now the primary:
            postgres@ubuntu05:~$ repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf cluster show
             ID | Name     | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
            ----+----------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------------
             1  | ubuntu05 | primary | * running |          | default  | 80       | 3        | host=192.168.152.111 user=repmgr dbname=repmgr port=9000 connect_timeout=100
             2  | ubuntu06 | standby |   running | ubuntu05 | default  | 80       | 2        | host=192.168.152.112 user=repmgr dbname=repmgr port=9000 connect_timeout=100
             3  | ubuntu07 | standby |   running | ubuntu05 | default  | 60       | 2        | host=192.168.152.113 user=repmgr dbname=repmgr port=9000 connect_timeout=100
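Because a switchover stops the running primary, it can be worth rehearsing first. repmgr's standby switchover accepts --dry-run, which checks the prerequisites (SSH access among them) without changing anything. The sketch below chains the rehearsal and the real run. REPMGR, REPMGR_CONF, and safe_switchover are names introduced here for illustration; the paths are the ones used in the commands above.

```shell
#!/bin/sh
# Paths as used in this article; override through the environment if they differ.
REPMGR=${REPMGR:-/usr/local/pgsql16/server/bin/repmgr}
REPMGR_CONF=${REPMGR_CONF:-/usr/local/pgsql16/repmgr/repmgr.conf}

# Hypothetical wrapper, to be run on the standby that should become primary.
# The real switchover is attempted only if the dry run exits successfully.
safe_switchover() {
    "$REPMGR" -f "$REPMGR_CONF" standby switchover --siblings-follow --dry-run &&
    "$REPMGR" -f "$REPMGR_CONF" standby switchover --siblings-follow
}

# Usage (as the postgres user): safe_switchover
```

If the dry run reports a problem, the function returns its exit status and the cluster is left untouched.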
 
Manual failover
    1. Kill or stop the PostgreSQL service on the primary to simulate a primary failure:
        systemctl stop postgresql9000

    2. Check the cluster status from a standby; the original primary is now unreachable:
        postgres@ubuntu06:~$ repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf cluster show
         ID | Name     | Role    | Status        | Upstream   | Location | Priority | Timeline | Connection string
        ----+----------+---------+---------------+------------+----------+----------+----------+------------------------------------------------------------------------------
         1  | ubuntu05 | primary | ? unreachable | ?          | default  | 80       |          | host=192.168.152.111 user=repmgr dbname=repmgr port=9000 connect_timeout=100
         2  | ubuntu06 | standby |   running     | ? ubuntu05 | default  | 80       | 3        | host=192.168.152.112 user=repmgr dbname=repmgr port=9000 connect_timeout=100
         3  | ubuntu07 | standby |   running     | ? ubuntu05 | default  | 60       | 3        | host=192.168.152.113 user=repmgr dbname=repmgr port=9000 connect_timeout=100
        WARNING: following issues were detected
          - unable to connect to node "ubuntu05" (ID: 1)
          - node "ubuntu05" (ID: 1) is registered as an active primary but is unreachable
          - unable to connect to node "ubuntu06" (ID: 2)'s upstream node "ubuntu05" (ID: 1)
          - unable to determine if node "ubuntu06" (ID: 2) is attached to its upstream node "ubuntu05" (ID: 1)
          - unable to connect to node "ubuntu07" (ID: 3)'s upstream node "ubuntu05" (ID: 1)
          - unable to determine if node "ubuntu07" (ID: 3) is attached to its upstream node "ubuntu05" (ID: 1)
        HINT: execute with --verbose option to see connection error messages

    3. Manually promote ubuntu06 to primary:
        /usr/local/pgsql16/server/bin/repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf standby promote --siblings-follow
        Then check the cluster status: ubuntu06 is now the primary, and the old primary ubuntu05 is marked as failed.

        postgres@ubuntu06:~$ /usr/local/pgsql16/server/bin/repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf standby promote --siblings-follow
        NOTICE: promoting standby to primary
        DETAIL: promoting server "ubuntu06" (ID: 2) using pg_promote()
        NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
        NOTICE: STANDBY PROMOTE successful
        DETAIL: server "ubuntu06" (ID: 2) was successfully promoted to primary
        NOTICE: executing STANDBY FOLLOW on 1 of 1 siblings
        INFO: STANDBY FOLLOW successfully executed on all reachable sibling nodes

        # Check the cluster status: ubuntu06 is now the primary
        postgres@ubuntu06:~$ repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf cluster show
         ID | Name     | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
        ----+----------+---------+-----------+----------+----------+----------+----------+------------------------------------------------------------------------------
         1  | ubuntu05 | primary | - failed  | ?        | default  | 80       |          | host=192.168.152.111 user=repmgr dbname=repmgr port=9000 connect_timeout=100
         2  | ubuntu06 | primary | * running |          | default  | 80       | 4        | host=192.168.152.112 user=repmgr dbname=repmgr port=9000 connect_timeout=100
         3  | ubuntu07 | standby |   running | ubuntu06 | default  | 60       | 3        | host=192.168.152.113 user=repmgr dbname=repmgr port=9000 connect_timeout=100

        WARNING: following issues were detected
          - unable to connect to node "ubuntu05" (ID: 1)

        HINT: execute with --verbose option to see connection error messages

    4. Rejoin the old primary to the cluster
        4.1 Start the old primary. Its local metadata is stale, so it still considers itself the primary:
            root@ubuntu05:~# systemctl start postgresql9000
            root@ubuntu05:~# su - postgres
            postgres@ubuntu05:~$ /usr/local/pgsql16/server/bin/repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf cluster show
             ID | Name     | Role    | Status               | Upstream   | Location | Priority | Timeline | Connection string
            ----+----------+---------+----------------------+------------+----------+----------+----------+------------------------------------------------------------------------------
             1  | ubuntu05 | primary | * running            |            | default  | 80       | 3        | host=192.168.152.111 user=repmgr dbname=repmgr port=9000 connect_timeout=100
             2  | ubuntu06 | standby | ! running as primary |            | default  | 80       | 4        | host=192.168.152.112 user=repmgr dbname=repmgr port=9000 connect_timeout=100
             3  | ubuntu07 | standby |   running            | ! ubuntu06 | default  | 60       | 3        | host=192.168.152.113 user=repmgr dbname=repmgr port=9000 connect_timeout=100

            WARNING: following issues were detected
              - node "ubuntu06" (ID: 2) is registered as standby but running as primary
              - node "ubuntu07" (ID: 3) reports a different upstream (reported: "ubuntu06", expected "ubuntu05")

        4.2 Run pg_rewind through repmgr node rejoin --force-rewind (shown here as a dry run):
            /usr/local/pgsql16/server/bin/repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf node rejoin -d 'host=ubuntu06 dbname=repmgr user=repmgr password=****** port=9000' --force-rewind --dry-run

            postgres@ubuntu05:~$ /usr/local/pgsql16/server/bin/repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf node rejoin -d 'host=ubuntu06 dbname=repmgr user=repmgr password=****** port=9000' --force-rewind --dry-run
            NOTICE: rejoin target is node "ubuntu06" (ID: 2)
            INFO: replication connection to the rejoin target node was successful
            INFO: local and rejoin target system identifiers match
            DETAIL: system identifier is 7550951818891860956
            NOTICE: pg_rewind execution required for this node to attach to rejoin target node 2
            DETAIL: rejoin target server's timeline 4 forked off current database system timeline 3 before current recovery point 0/1B000028
            INFO: prerequisites for using pg_rewind are met
            INFO: pg_rewind would now be executed
            DETAIL: pg_rewind command is:
              /usr/local/pgsql16/server/bin/pg_rewind -D '/usr/local/pgsql16/pg9000/data' --source-server='host=192.168.152.112 user=repmgr dbname=repmgr port=9000 connect_timeout=100'
            INFO: prerequisites for executing NODE REJOIN are met

            --dry-run only checks the prerequisites; run the same command without it to actually rewind and rejoin.

            Alternatively, take the blunt approach: delete the local data and clone again.

            Clone the database (remove --dry-run to perform the actual clone):
            /usr/local/pgsql16/server/bin/repmgr -h 192.168.152.112 -p 9000 -U repmgr -d repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf standby clone --dry-run
            Then simply start the database service.
            -- Unregister: this actually deletes the node's row from the nodes table
            /usr/local/pgsql16/server/bin/repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf standby unregister
            -- Register again: this reloads the settings in repmgr.conf into the nodes table
            /usr/local/pgsql16/server/bin/repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf standby register

            -- Forced registration (--force) simply overwrites the existing record
            /usr/local/pgsql16/server/bin/repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf standby register --force
            -- Specify the upstream node explicitly; usually unnecessary, because the primary is found from postgresql.auto.conf
            /usr/local/pgsql16/server/bin/repmgr -f /usr/local/pgsql16/repmgr/repmgr.conf standby register  --upstream-node-id=2

            A healthy node can also be registered again: after its configuration changes, registering it again reloads the new settings. Run this on the standby nodes (pg02 and pg03 in this example):
            $ repmgr -f /home/postgres/repmgr/repmgr.conf standby unregister
            $ repmgr -f /home/postgres/repmgr/repmgr.conf standby register --upstream-node-id=1
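The dry run in step 4.2 only reports what would happen. A sketch of the complete rejoin on the old primary follows. It assumes, per repmgr's documentation for node rejoin, that the local PostgreSQL instance has to be cleanly shut down before the rejoin, and that the password comes from ~/.pgpass rather than the connection string. PG_CTL, REPMGR, NEW_PRIMARY, and rejoin_old_primary are names introduced here for illustration; the paths and host name are the ones used above.

```shell
#!/bin/sh
# Paths as used in this article; override through the environment if they differ.
PG_CTL=${PG_CTL:-/usr/local/pgsql16/server/bin/pg_ctl}
REPMGR=${REPMGR:-/usr/local/pgsql16/server/bin/repmgr}
PG_DATA=${PG_DATA:-/usr/local/pgsql16/pg9000/data}
REPMGR_CONF=${REPMGR_CONF:-/usr/local/pgsql16/repmgr/repmgr.conf}
# Connection string for the new primary; the password is assumed to be in ~/.pgpass.
NEW_PRIMARY=${NEW_PRIMARY:-host=ubuntu06 dbname=repmgr user=repmgr port=9000}

# Hypothetical helper, to be run as the postgres user on the old primary (ubuntu05 here).
# Steps: stop the local instance if it is running, rehearse the rejoin,
# perform it (pg_rewind runs only if repmgr finds it necessary), then show the cluster.
rejoin_old_primary() {
    if "$PG_CTL" -D "$PG_DATA" status >/dev/null 2>&1; then
        "$PG_CTL" -D "$PG_DATA" -m fast stop || return 1
    fi
    "$REPMGR" -f "$REPMGR_CONF" node rejoin -d "$NEW_PRIMARY" --force-rewind --dry-run &&
    "$REPMGR" -f "$REPMGR_CONF" node rejoin -d "$NEW_PRIMARY" --force-rewind &&
    "$REPMGR" -f "$REPMGR_CONF" cluster show
}

# Usage: rejoin_old_primary
```

After a successful run, cluster show should list the old primary as a standby whose upstream is the new primary. Note that pg_rewind itself needs wal_log_hints or data checksums enabled on the node being rewound; the dry-run output above ("prerequisites for using pg_rewind are met") indicates this environment satisfies that.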
 
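Automatic failover

Forcibly stop the PostgreSQL service on the primary, ubuntu05, to simulate a failure.

The automatic failover process is as follows:
[Image: 1.png]

[Image: 2.png]

The repmgr log of the failover follows. repmgr keeps retrying according to the retry interval (reconnect_interval) and retry count (reconnect_attempts) from the configuration file above. If the primary is still unreachable after the last attempt, the failover begins. The whole process takes about one minute (12 attempts at 5-second intervals).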
  1. [2025-09-18 13:24:00] [INFO] monitoring connection to upstream node "ubuntu05" (ID: 1)
  2. [2025-09-18 13:26:26] [INFO] node "ubuntu06" (ID: 2) monitoring upstream node "ubuntu05" (ID: 1) in normal state
  3. [2025-09-18 13:26:26] [DETAIL] last monitoring statistics update was 5 seconds ago
  4. [2025-09-18 13:29:01] [INFO] node "ubuntu06" (ID: 2) monitoring upstream node "ubuntu05" (ID: 1) in normal state
  5. [2025-09-18 13:29:01] [DETAIL] last monitoring statistics update was 5 seconds ago
  6. *************** The primary failure is simulated from here; the standby begins retrying ***************
  7. [2025-09-18 13:30:01] [WARNING] unable to ping "host=192.168.152.111 user=repmgr dbname=repmgr port=9000 connect_timeout=100"
  8. [2025-09-18 13:30:01] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
  9. [2025-09-18 13:30:01] [WARNING] unable to connect to upstream node "ubuntu05" (ID: 1)
  10. [2025-09-18 13:30:01] [INFO] checking state of node "ubuntu05" (ID: 1), 1 of 12 attempts
  11. [2025-09-18 13:30:01] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
  12. [2025-09-18 13:30:01] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
  13. [2025-09-18 13:30:01] [INFO] sleeping up to 5 seconds until next reconnection attempt
  14. [2025-09-18 13:30:02] [WARNING] unable to ping "host=192.168.152.111 user=repmgr dbname=repmgr port=9000 connect_timeout=100"
  15. [2025-09-18 13:30:02] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
  16. [2025-09-18 13:30:02] [WARNING] unable to connect to upstream node "ubuntu05" (ID: 1)
  17. [2025-09-18 13:30:02] [INFO] checking state of node "ubuntu05" (ID: 1), 1 of 12 attempts
  18. [2025-09-18 13:30:02] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
  19. [2025-09-18 13:30:02] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
  20. [2025-09-18 13:30:02] [INFO] sleeping up to 5 seconds until next reconnection attempt
  21. [2025-09-18 13:30:06] [INFO] checking state of node "ubuntu05" (ID: 1), 2 of 12 attempts
  22. [2025-09-18 13:30:06] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
  23. [2025-09-18 13:30:06] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
  24. [2025-09-18 13:30:06] [INFO] sleeping up to 5 seconds until next reconnection attempt
  25. [2025-09-18 13:30:07] [INFO] checking state of node "ubuntu05" (ID: 1), 2 of 12 attempts
  26. [2025-09-18 13:30:07] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
  27. [2025-09-18 13:30:07] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
  28. [2025-09-18 13:30:07] [INFO] sleeping up to 5 seconds until next reconnection attempt
  29. [2025-09-18 13:30:11] [INFO] checking state of node "ubuntu05" (ID: 1), 3 of 12 attempts
  30. [2025-09-18 13:30:11] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
  31. [2025-09-18 13:30:11] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
  32. [2025-09-18 13:30:11] [INFO] sleeping up to 5 seconds until next reconnection attempt
  33. [2025-09-18 13:30:12] [INFO] checking state of node "ubuntu05" (ID: 1), 3 of 12 attempts
  34. [2025-09-18 13:30:12] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
  35. [2025-09-18 13:30:12] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
  36. [2025-09-18 13:30:12] [INFO] sleeping up to 5 seconds until next reconnection attempt
  37. [2025-09-18 13:30:16] [INFO] checking state of node "ubuntu05" (ID: 1), 4 of 12 attempts
  38. [2025-09-18 13:30:16] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
  39. [2025-09-18 13:30:16] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
  40. [2025-09-18 13:30:16] [INFO] sleeping up to 5 seconds until next reconnection attempt
  41. [2025-09-18 13:30:17] [INFO] checking state of node "ubuntu05" (ID: 1), 4 of 12 attempts
  42. [2025-09-18 13:30:17] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
  43. [2025-09-18 13:30:17] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
  44. [2025-09-18 13:30:17] [INFO] sleeping up to 5 seconds until next reconnection attempt
  45. [2025-09-18 13:30:22] [INFO] checking state of node "ubuntu05" (ID: 1), 5 of 12 attempts
  46. [2025-09-18 13:30:22] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
  47. [2025-09-18 13:30:22] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
  48. [2025-09-18 13:30:22] [INFO] sleeping up to 5 seconds until next reconnection attempt
  49. [2025-09-18 13:30:22] [INFO] checking state of node "ubuntu05" (ID: 1), 5 of 12 attempts
  50. [2025-09-18 13:30:22] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
  51. [2025-09-18 13:30:22] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
  52. [2025-09-18 13:30:22] [INFO] sleeping up to 5 seconds until next reconnection attempt
  53. [2025-09-18 13:30:27] [INFO] checking state of node "ubuntu05" (ID: 1), 6 of 12 attempts
  54. [2025-09-18 13:30:27] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
  55. [2025-09-18 13:30:27] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
  56. [2025-09-18 13:30:27] [INFO] sleeping up to 5 seconds until next reconnection attempt
  57. [2025-09-18 13:30:27] [INFO] checking state of node "ubuntu05" (ID: 1), 6 of 12 attempts
  58. [2025-09-18 13:30:27] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
  59. [2025-09-18 13:30:27] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
  60. [2025-09-18 13:30:27] [INFO] sleeping up to 5 seconds until next reconnection attempt
  61. [2025-09-18 13:30:32] [INFO] checking state of node "ubuntu05" (ID: 1), 7 of 12 attempts
  62. [2025-09-18 13:30:32] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
  63. [2025-09-18 13:30:32] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
  64. [2025-09-18 13:30:32] [INFO] sleeping up to 5 seconds until next reconnection attempt
  65. [2025-09-18 13:30:32] [INFO] checking state of node "ubuntu05" (ID: 1), 7 of 12 attempts
  66. [2025-09-18 13:30:32] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
  67. [2025-09-18 13:30:32] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
  68. [2025-09-18 13:30:32] [INFO] sleeping up to 5 seconds until next reconnection attempt
  69. [2025-09-18 13:30:37] [INFO] checking state of node "ubuntu05" (ID: 1), 8 of 12 attempts
  70. [2025-09-18 13:30:37] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
  71. [2025-09-18 13:30:37] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
  72. [2025-09-18 13:30:37] [INFO] sleeping up to 5 seconds until next reconnection attempt
  73. [2025-09-18 13:30:37] [INFO] checking state of node "ubuntu05" (ID: 1), 8 of 12 attempts
  74. [2025-09-18 13:30:37] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
  75. [2025-09-18 13:30:37] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
  76. [2025-09-18 13:30:37] [INFO] sleeping up to 5 seconds until next reconnection attempt
  77. [2025-09-18 13:30:42] [INFO] checking state of node "ubuntu05" (ID: 1), 9 of 12 attempts
  78. [2025-09-18 13:30:42] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
  79. [2025-09-18 13:30:42] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
  80. [2025-09-18 13:30:42] [INFO] sleeping up to 5 seconds until next reconnection attempt
  81. [2025-09-18 13:30:42] [INFO] checking state of node "ubuntu05" (ID: 1), 9 of 12 attempts
  82. [2025-09-18 13:30:42] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
  83. [2025-09-18 13:30:42] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
  84. [2025-09-18 13:30:42] [INFO] sleeping up to 5 seconds until next reconnection attempt
  85. [2025-09-18 13:30:47] [INFO] checking state of node "ubuntu05" (ID: 1), 10 of 12 attempts
  86. [2025-09-18 13:30:47] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
  87. [2025-09-18 13:30:47] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
  88. [2025-09-18 13:30:47] [INFO] sleeping up to 5 seconds until next reconnection attempt
  89. [2025-09-18 13:30:47] [INFO] checking state of node "ubuntu05" (ID: 1), 10 of 12 attempts
  90. [2025-09-18 13:30:47] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
  91. [2025-09-18 13:30:47] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
  92. [2025-09-18 13:30:47] [INFO] sleeping up to 5 seconds until next reconnection attempt
  93. [2025-09-18 13:30:52] [INFO] checking state of node "ubuntu05" (ID: 1), 11 of 12 attempts
  94. [2025-09-18 13:30:52] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
  95. [2025-09-18 13:30:52] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
  96. [2025-09-18 13:30:52] [INFO] sleeping up to 5 seconds until next reconnection attempt
  97. [2025-09-18 13:30:52] [INFO] checking state of node "ubuntu05" (ID: 1), 11 of 12 attempts
  98. [2025-09-18 13:30:52] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
  99. [2025-09-18 13:30:52] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
  100. [2025-09-18 13:30:52] [INFO] sleeping up to 5 seconds until next reconnection attempt
  101. [2025-09-18 13:30:57] [INFO] checking state of node "ubuntu05" (ID: 1), 12 of 12 attempts
  102. [2025-09-18 13:30:57] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
  103. [2025-09-18 13:30:57] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
  104. [2025-09-18 13:30:57] [WARNING] unable to reconnect to node "ubuntu05" (ID: 1) after 12 attempts
  105. [2025-09-18 13:30:57] [INFO] 1 active sibling nodes registered
  106. [2025-09-18 13:30:57] [INFO] 3 total nodes registered
  107. [2025-09-18 13:30:57] [INFO] primary node  "ubuntu05" (ID: 1) and this node have the same location ("default")
  108. [2025-09-18 13:30:57] [INFO] local node's last receive lsn: 0/220000A0
  109. [2025-09-18 13:30:57] [INFO] checking state of sibling node "ubuntu07" (ID: 3)
  110. [2025-09-18 13:30:57] [INFO] node "ubuntu07" (ID: 3) reports its upstream is node 1, last seen 56 second(s) ago
  111. [2025-09-18 13:30:57] [INFO] standby node "ubuntu07" (ID: 3) last saw primary node 56 second(s) ago
  112. [2025-09-18 13:30:57] [INFO] last receive LSN for sibling node "ubuntu07" (ID: 3) is: 0/220000A0
  113. [2025-09-18 13:30:57] [INFO] node "ubuntu07" (ID: 3) has same LSN as current candidate "ubuntu06" (ID: 2)
  114. [2025-09-18 13:30:57] [INFO] node "ubuntu07" (ID: 3) has lower priority (60) than current candidate "ubuntu06" (ID: 2) (80)
  115. [2025-09-18 13:30:57] [INFO] visible nodes: 2; total nodes: 2; no nodes have seen the primary within the last 10 seconds
  116. [2025-09-18 13:30:57] [NOTICE] promotion candidate is "ubuntu06" (ID: 2)
  117. [2025-09-18 13:30:57] [NOTICE] this node is the winner, will now promote itself and inform other nodes
  118. [2025-09-18 13:30:57] [INFO] promote_command is:
  119.   "/usr/local/pgsql16/server/bin/repmgr standby promote -f /usr/local/pgsql16/repmgr/repmgr.conf --log-to-file"
  120. [2025-09-18 13:30:57] [NOTICE] redirecting logging output to "/usr/local/pgsql16/repmgr/repmgr.log"
  121. [2025-09-18 13:30:57] [WARNING] 1 sibling nodes found, but option "--siblings-follow" not specified
  122. [2025-09-18 13:30:57] [DETAIL] these nodes will remain attached to the current primary:
  123.   ubuntu07 (node ID: 3)
  124. [2025-09-18 13:30:57] [NOTICE] promoting standby to primary
  125. [2025-09-18 13:30:57] [DETAIL] promoting server "ubuntu06" (ID: 2) using pg_promote()
  126. [2025-09-18 13:30:57] [NOTICE] waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
  127. [2025-09-18 13:30:57] [INFO] checking state of node "ubuntu05" (ID: 1), 12 of 12 attempts
  128. [2025-09-18 13:30:57] [WARNING] unable to ping "user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr"
  129. [2025-09-18 13:30:57] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
  130. [2025-09-18 13:30:57] [WARNING] unable to reconnect to node "ubuntu05" (ID: 1) after 12 attempts
  131. [2025-09-18 13:30:57] [INFO] 1 active sibling nodes registered
  132. [2025-09-18 13:30:57] [INFO] 3 total nodes registered
  133. [2025-09-18 13:30:57] [INFO] primary node  "ubuntu05" (ID: 1) and this node have the same location ("default")
  134. [2025-09-18 13:30:57] [INFO] local node's last receive lsn: 0/220000A0
  135. [2025-09-18 13:30:57] [INFO] checking state of sibling node "ubuntu07" (ID: 3)
  136. [2025-09-18 13:30:57] [INFO] node "ubuntu07" (ID: 3) reports its upstream is node 1, last seen 56 second(s) ago
  137. [2025-09-18 13:30:57] [INFO] standby node "ubuntu07" (ID: 3) last saw primary node 56 second(s) ago
  138. [2025-09-18 13:30:57] [INFO] last receive LSN for sibling node "ubuntu07" (ID: 3) is: 0/220000A0
  139. [2025-09-18 13:30:57] [INFO] node "ubuntu07" (ID: 3) has same LSN as current candidate "ubuntu06" (ID: 2)
  140. [2025-09-18 13:30:57] [INFO] node "ubuntu07" (ID: 3) has lower priority (60) than current candidate "ubuntu06" (ID: 2) (80)
  141. [2025-09-18 13:30:57] [INFO] visible nodes: 2; total nodes: 2; no nodes have seen the primary within the last 10 seconds
  142. [2025-09-18 13:30:57] [NOTICE] promotion candidate is "ubuntu06" (ID: 2)
  143. [2025-09-18 13:30:57] [NOTICE] this node is the winner, will now promote itself and inform other nodes
  144. [2025-09-18 13:30:57] [INFO] promote_command is:
  145.   "/usr/local/pgsql16/server/bin/repmgr standby promote -f /usr/local/pgsql16/repmgr/repmgr.conf --log-to-file"
  146. [2025-09-18 13:30:57] [NOTICE] redirecting logging output to "/usr/local/pgsql16/repmgr/repmgr.log"
  147. [2025-09-18 13:30:57] [ERROR] STANDBY PROMOTE can only be executed on a standby node
  148. [2025-09-18 13:30:57] [ERROR] promote command failed
  149. [2025-09-18 13:30:57] [DETAIL] promote command exited with error code 8
  150. [2025-09-18 13:30:57] [INFO] checking if original primary node has reappeared
  151. [2025-09-18 13:30:57] [ERROR] connection to database failed
  152. [2025-09-18 13:30:57] [DETAIL]
  153. connection to server at "192.168.152.111", port 9000 failed: Connection refused
  154.         Is the server running on that host and accepting TCP/IP connections?
  155. [2025-09-18 13:30:57] [DETAIL] attempted to connect using:
  156.   user=repmgr connect_timeout=100 dbname=repmgr host=192.168.152.111 port=9000 fallback_application_name=repmgr options=-csearch_path=
  157. [2025-09-18 13:30:57] [WARNING] unable to ping "host=192.168.152.111 user=repmgr dbname=repmgr port=9000 connect_timeout=100"
  158. [2025-09-18 13:30:57] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
  159. [2025-09-18 13:30:57] [NOTICE] local node is primary, checking local node state
  160. [2025-09-18 13:30:57] [NOTICE] resuming monitoring as primary node after 0 seconds
  161. [2025-09-18 13:30:57] [INFO] 1 followers to notify
  162. [2025-09-18 13:30:57] [INFO] reconnecting to node "ubuntu07" (ID: 3)...
  163. [2025-09-18 13:30:57] [NOTICE] notifying node "ubuntu07" (ID: 3) to follow node 2
  164. INFO:  node 3 received notification to follow node 2
  165. [2025-09-18 13:30:57] [NOTICE] monitoring cluster primary "ubuntu06" (ID: 2)
  166. [2025-09-18 13:30:58] [NOTICE] STANDBY PROMOTE successful
  167. [2025-09-18 13:30:58] [DETAIL] server "ubuntu06" (ID: 2) was successfully promoted to primary
  168. [2025-09-18 13:30:58] [INFO] checking state of node 2, 1 of 12 attempts
  169. [2025-09-18 13:30:58] [NOTICE] node 2 has recovered, reconnecting
  170. [2025-09-18 13:30:58] [INFO] connection to node 2 succeeded
  171. [2025-09-18 13:30:58] [INFO] original connection is still available
  172. [2025-09-18 13:30:58] [INFO] 1 followers to notify
  173. [2025-09-18 13:30:58] [NOTICE] notifying node "ubuntu07" (ID: 3) to follow node 2
  174. INFO:  node 3 received notification to follow node 2
  175. [2025-09-18 13:30:58] [INFO] switching to primary monitoring mode
  176. [2025-09-18 13:30:58] [NOTICE] monitoring cluster primary "ubuntu06" (ID: 2)
  177. [2025-09-18 13:30:58] [INFO] child node "ubuntu07" (ID: 3) is attached
  178. [2025-09-18 13:31:02] [NOTICE] new standby "ubuntu07" (ID: 3) has connected
  179. [2025-09-18 13:35:57] [INFO] monitoring primary node "ubuntu06" (ID: 2) in normal state
  180. [2025-09-18 13:35:58] [INFO] monitoring primary node "ubuntu06" (ID: 2) in normal state
  181. [2025-09-18 13:40:58] [INFO] monitoring primary node "ubuntu06" (ID: 2) in normal state
  182. [2025-09-18 13:40:58] [INFO] monitoring primary node "ubuntu06" (ID: 2) in normal state
  183. [2025-09-18 13:45:58] [INFO] monitoring primary node "ubuntu06" (ID: 2) in normal state
  184. [2025-09-18 13:45:59] [INFO] monitoring primary node "ubuntu06" (ID: 2) in normal state
  185. [2025-09-18 13:50:58] [INFO] monitoring primary node "ubuntu06" (ID: 2) in normal state
  186. [2025-09-18 13:50:59] [INFO] monitoring primary node "ubuntu06" (ID: 2) in normal state
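To check the one-minute figure against the log rather than by eye, the time between the first failed ping and the successful promotion can be pulled out of repmgr.log. The sketch below is a hypothetical helper (failover_duration_secs is not part of repmgr). It relies on the "[YYYY-MM-DD HH:MM:SS]" prefix seen in the log above and assumes both events fall on the same day.

```shell
#!/bin/sh
# Print the number of seconds between the first "unable to ping" warning and the
# first "STANDBY PROMOTE successful" notice that follows it in repmgr log file $1.
failover_duration_secs() {
    awk '
        # Convert the "HH:MM:SS]" field of the timestamp prefix to seconds since midnight.
        function secs(t,    p) {
            sub(/\]$/, "", t)
            split(t, p, ":")
            return p[1] * 3600 + p[2] * 60 + p[3]
        }
        /unable to ping/ && !found { start = secs($2); found = 1 }
        /STANDBY PROMOTE successful/ && found { print secs($2) - start; exit }
    ' "$1"
}

# Usage: failover_duration_secs /usr/local/pgsql16/repmgr/repmgr.log
```

For the timestamps in the log above (first failed ping at 13:30:01, promotion reported at 13:30:58) the difference is 57 seconds.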
 
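Summary of repmgr's pros and cons

As a high-availability solution, repmgr is barely adequate.
The advantage is that installation and configuration are both fairly simple.
The drawback is that it cannot handle consecutive automatic failovers: after the first failover completes, bringing the failed node back still requires a manual pg_rewind first.
repmgr stores its metadata in the local PostgreSQL database, so before the database starts, the repmgr process has no knowledge of the cluster state and cannot rewind automatically. This is the weakness of keeping cluster metadata inside PostgreSQL itself, and it is one of the gaps between repmgr and Patroni.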



