ceph mgr balancer module: execution flow and configuration schemes

Posted by 梦霉 on 2025-6-29 17:22:04

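As OSDs are replaced and the cluster is expanded or shrunk, the distribution of PGs across OSDs gradually becomes uneven. The actual capacity utilization of individual OSDs then diverges, and the usable capacity of the cluster as a whole drops. The ceph balancer module evens out the PG distribution either by adjusting weights or by pinning PG mappings with upmap, and accordingly has two modes, upmap and crush-compat. This article is based on the Pacific release and mainly analyzes and uses the upmap mode.
How the balancer module runs

Overview of the balancer module's execution flow:

The ceph balancer has two modes, upmap and crush-compat, and only one of them can be active at a time. The two modes compare as follows:
Implementation

crush-compat generates a separate choose_args list in the crush map that contains the adjusted weight set, and data distribution is adjusted through that list. It can be viewed with ceph osd crush dump, for example: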
"choose_args": {
   "-1": [
         {
            "bucket_id": -1,
            "weight_set": [
               [
                     0.1983489990234375,
                     0.1943206787109375,
                     0.1934051513671875
               ]
            ]
         },
         {
            "bucket_id": -2,
            "weight_set": [
               [
                     0.1983489990234375,
                     0.1943206787109375,
                     0.1934051513671875
               ]
            ]
         },
         {
            "bucket_id": -3,
            "weight_set": [
               [
                     0.1060638427734375,
                     0.09228515625
               ]
            ]
         },
         {
            "bucket_id": -4,
            "weight_set": [
               [
                     0.1060638427734375,
                     0.09228515625
               ]
            ]
         },
         {
            "bucket_id": -5,
            "weight_set": [
               [
                     0.096160888671875,
                     0.0981597900390625
               ]
            ]
         },
         {
            "bucket_id": -6,
            "weight_set": [
               [
                     0.096160888671875,
                     0.0981597900390625
               ]
            ]
         },
         {
            "bucket_id": -7,
            "weight_set": [
               [
                     0.0999908447265625,
                     0.093414306640625
               ]
            ]
         },
         {
            "bucket_id": -8,
            "weight_set": [
               [
                     0.0999908447265625,
                     0.093414306640625
               ]
            ]
         }
   ]
}

upmap mode generates PG mappings in the osd map. In the example below, the first entry means that PG 3.1e is moved from OSD 5 to OSD 2:
pg_upmap_items 3.1e
pg_upmap_items 3.27
pg_upmap_items 3.28
pg_upmap_items 3.29
pg_upmap_items 3.33
pg_upmap_items 3.3e

Client versions

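- crush-compat mode is compatible with clients of all versions. When a client requests the OSDMap and CRUSH map, it uses the weights in the choose_args structure (generated by the balancer's adjustments).
- upmap mode does not support clients older than the L (Luminous) release.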
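
The version rule above can be expressed as a simple ordering check. The following is a minimal sketch, not part of Ceph: the helper name `supports_upmap` and the partial release list are assumptions made for illustration. On a real cluster, `ceph features` reports the releases of connected clients.

```python
# Ceph release names in chronological order (partial list, up to Pacific).
RELEASES = [
    "hammer", "infernalis", "jewel", "kraken",
    "luminous", "mimic", "nautilus", "octopus", "pacific",
]


def supports_upmap(client_release: str) -> bool:
    """Return True if a client of this release is Luminous or newer.

    Hypothetical helper: it only compares positions in RELEASES.
    """
    name = client_release.strip().lower()
    if name not in RELEASES:
        raise ValueError(f"unknown release: {client_release!r}")
    return RELEASES.index(name) >= RELEASES.index("luminous")


if __name__ == "__main__":
    for release in ("jewel", "luminous", "pacific"):
        print(release, supports_upmap(release))
```

If any connected client fails this check, crush-compat is the mode that remains usable.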
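Scope of impact

crush-compat mode controls how data is distributed across OSDs through weights. The cluster remaps PGs according to those weights, so the set of affected PGs cannot be bounded.
upmap bounds the impact of PG remapping through two user settings: the maximum tolerated PG deviation and the maximum number of PGs that may be adjusted per cycle.
upmap execution flow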
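
To make the two limits concrete, here is a deliberately simplified sketch of a bounded greedy pass. It is not Ceph's actual upmap calculation: it assumes one replica per PG, ignores CRUSH failure domains, and the function name `plan_moves` and its arguments are hypothetical. It only shows how a deviation threshold and a per-run cap bound the number of remapped PGs.

```python
def plan_moves(pg_to_osd, osds, max_deviation, max_optimizations):
    """Plan at most max_optimizations PG moves from the fullest OSD to the
    emptiest one, stopping once every OSD is within max_deviation PGs of
    the mean. Simplified illustration only (single replica, no CRUSH rules).
    """
    mapping = dict(pg_to_osd)
    counts = {osd: 0 for osd in osds}
    for osd in mapping.values():
        counts[osd] += 1
    mean = len(mapping) / len(osds)

    moves = []
    while len(moves) < max_optimizations:
        fullest = max(osds, key=lambda o: counts[o])
        emptiest = min(osds, key=lambda o: counts[o])
        within_limit = (counts[fullest] - mean <= max_deviation
                        and mean - counts[emptiest] <= max_deviation)
        if within_limit:
            break
        # Pick the first PG (in sorted order) currently on the fullest OSD.
        pg = next(p for p, o in sorted(mapping.items()) if o == fullest)
        mapping[pg] = emptiest
        counts[fullest] -= 1
        counts[emptiest] += 1
        moves.append((pg, fullest, emptiest))
    return moves


if __name__ == "__main__":
    # Hypothetical pool: six PGs, all on osd 5, none on osd 2.
    pgs = {f"3.{i}": 5 for i in range(6)}
    print(plan_moves(pgs, [2, 5], max_deviation=1, max_optimizations=10))
```

With the cap set to 1 the same input yields a single move, which is how a small per-run limit keeps each backfill burst short.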

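Turn up OSD debug logging with ceph tell mgr.* config set debug_osd 30/30 and analyze the logs against the upmap code path. In the figure, green marks the mgr Python module and blue marks log output.
The flow in the figure shows, for a two-replica pool, the up_set of pg 3.28 being changed so that osd.5 is replaced by osd.2, that is, pg 3.28 is moved from osd.5 to osd.2.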
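
The effect of a pg_upmap_items entry on an up set can be sketched as a list substitution. This is a minimal illustration under stated assumptions, not the OSDMap code: the function name `apply_upmap_items` is hypothetical, and the second replica (osd 3) in the usage example is made up, since only the osd.5 to osd.2 move is known here.

```python
def apply_upmap_items(up_set, upmap_items):
    """Return a new up set with each (from_osd, to_osd) pair applied.

    Simplified: a pair is skipped when the target OSD already holds a
    replica, so the PG never ends up with two copies on one OSD.
    """
    result = list(up_set)
    for from_osd, to_osd in upmap_items:
        if to_osd in result:
            continue
        result = [to_osd if osd == from_osd else osd for osd in result]
    return result


if __name__ == "__main__":
    # Two-replica pool; osd 3 is an assumed second replica for illustration.
    print(apply_upmap_items([5, 3], [(5, 2)]))
```

Because the entry lives in the osd map and overrides only the listed OSDs, the CRUSH rule itself stays untouched.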

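Parameters

Time parameters

mgr/balancer/begin_time: start time, in HHMM format, e.g. 0000
mgr/balancer/end_time: end time, in HHMM format, e.g. 0100
mgr/balancer/begin_weekday: day of the week on which to start, one of 1, 2, 3, 4, 5, 6, 7, e.g. 2
mgr/balancer/end_weekday: day of the week on which to stop, one of 1, 2, 3, 4, 5, 6, 7, e.g. 7
mgr/balancer/sleep_interval: number of seconds the balancer sleeps before running an adjustment pass, e.g. 180

upmap-related control parameters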

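mgr/balancer/active: whether the balancer module is enabled, true/false, e.g. true
mgr/balancer/mode: balancer mode, either upmap or crush-compat, e.g. upmap
mgr/balancer/upmap_max_deviation: allowed deviation in the number of PGs per OSD, e.g. 5
mgr/balancer/upmap_max_optimizations: maximum number of optimization rounds per balancer run before it exits, e.g. 10

crush-compat-related control parameters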

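mgr/balancer/crush_compat_max_iterations: maximum number of adjustments at the given step size, e.g. 25
mgr/balancer/crush_compat_metrics: metrics that take part in the score calculation, e.g. pgs,objects,bytes
mgr/balancer/crush_compat_step: step size for weight adjustment, which controls the precision of the adjusted weights, e.g. 0.500000
mgr/balancer/min_score: the adjustment is considered complete only once the score is less than or equal to this value, e.g. 0.020000
mgr/balancer/mode: balancing mode, e.g. crush-compat

Configuration schemes

Unattended

Scheme: enable the balancer in upmap mode and optimize only the pools with id 2 and 3. The balancer runs between 02:00 and 05:00 every day. Within that window it checks whether the PG counts of the OSDs in a pool differ by more than 5 and, if so, optimizes. If 10 optimization rounds still cannot bring the difference below 5 PGs, the balancer sleeps for 180 seconds and then tries again.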
ceph config set mgr mgr/balancer/active true
ceph config set mgr mgr/balancer/mode upmap
ceph config set mgr mgr/balancer/begin_weekday 1
ceph config set mgr mgr/balancer/end_weekday 7
ceph config set mgr mgr/balancer/begin_time 0200
ceph config set mgr mgr/balancer/end_time 0500
ceph config set mgr mgr/balancer/sleep_interval 180
ceph config set mgr mgr/balancer/upmap_max_deviation 5
ceph config set mgr mgr/balancer/upmap_max_optimizations 10
ceph config set mgr mgr/balancer/pool_ids 2,3

Notes:

- Set mgr/balancer/pool_ids to the pool ids of the real environment.
- Do not run performance tests while the balancer is active; PG backfill consumes extra resources.
Manual judgment
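
The daily window in the unattended scheme (begin_time 0200, end_time 0500, weekdays 1 to 7) can be reasoned about with a small check like the one below. It is a sketch under assumptions, not the module's own code: the function name `in_balancer_window` is hypothetical, weekdays are taken as ISO numbers (1 = Monday through 7 = Sunday) with an inclusive end day, and Ceph's own weekday convention may differ.

```python
from datetime import datetime


def in_balancer_window(now, begin_time, end_time, begin_weekday, end_weekday):
    """Return True if `now` falls inside the configured window.

    begin_time/end_time are zero-padded "HHMM" strings, so plain string
    comparison orders them correctly. Ranges that wrap around (for example
    2300 to 0100, or weekday 6 to 2) are handled explicitly.
    Assumption: ISO weekdays, 1 = Monday ... 7 = Sunday, end day inclusive.
    """
    hhmm = now.strftime("%H%M")
    weekday = now.isoweekday()

    if begin_weekday <= end_weekday:
        day_ok = begin_weekday <= weekday <= end_weekday
    else:
        day_ok = weekday >= begin_weekday or weekday <= end_weekday

    if begin_time <= end_time:
        time_ok = begin_time <= hhmm < end_time
    else:
        time_ok = hhmm >= begin_time or hhmm < end_time

    return day_ok and time_ok


if __name__ == "__main__":
    sample = datetime(2025, 6, 30, 3, 0)
    print(in_balancer_window(sample, "0200", "0500", 1, 7))
```

Outside the window the balancer simply does nothing until the next pass, so sleep_interval determines how soon after 02:00 the first optimization can start.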

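Alternatively, leave the balancer off and let administrators decide when to rebalance. As OSDs are replaced and the cluster is expanded or shrunk, PG counts become uneven. When, for example, the spread in capacity utilization exceeds 20%, an administrator can pick a time to enable the balancer and restrict it to a specified time window.
Find the OSD whose PG count deviates the most and use it as the baseline, then lower upmap_max_deviation step by step. For example, if osd.1 holds 30 more PGs than the average across all OSDs in the pool, set upmap_max_deviation to 20 and wait for the cluster's backfill to finish. If the capacity spread is still outside tolerance, lower upmap_max_deviation further and start the next round.
Test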
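
Finding the baseline OSD is a small calculation once the per-OSD PG counts are known. The sketch below is illustrative only: the function name `largest_pg_excess` is hypothetical, and the sample counts are the PGS column of the "before" output in the Test section, which counts PGs across all pools, whereas the procedure above compares OSDs within a single pool.

```python
def largest_pg_excess(pg_counts):
    """Return (osd_name, excess) for the OSD furthest above the mean PG count."""
    mean = sum(pg_counts.values()) / len(pg_counts)
    osd = max(pg_counts, key=lambda name: pg_counts[name] - mean)
    return osd, pg_counts[osd] - mean


if __name__ == "__main__":
    # PGS column from the "before" `ceph osd df tree` output of the test cluster.
    before = {"osd.0": 25, "osd.1": 28, "osd.2": 23,
              "osd.3": 27, "osd.4": 29, "osd.5": 31}
    osd, excess = largest_pg_excess(before)
    print(f"{osd} is {excess:.2f} PGs above the mean")
```

The operator then picks an upmap_max_deviation below that excess and repeats the measurement after each backfill completes.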

Before adjustment:
# ceph osd df tree
ID  CLASS   WEIGHT  REWEIGHT     SIZE  RAW USE     DATA     OMAP      META    AVAIL   %USE   VAR  PGS  STATUS  TYPE NAME
-1         0.58612         -  600 GiB   81 GiB   74 GiB  408 MiB   6.4 GiB  519 GiB  13.50  1.00    -          root default
-3         0.19537         -  200 GiB   27 GiB   25 GiB  139 MiB   2.1 GiB  173 GiB  13.56  1.00    -              host ceph-01
 2  hdd    0.09769   1.00000  100 GiB   11 GiB  9.7 GiB   72 MiB  1008 MiB   89 GiB  10.76  0.80   23  up              osd.2
 5  hdd    0.09769   1.00000  100 GiB   16 GiB   15 GiB   67 MiB   1.1 GiB   84 GiB  16.36  1.21   31  up              osd.5
-5         0.19537         -  200 GiB   30 GiB   27 GiB   92 MiB   2.8 GiB  170 GiB  15.09  1.12    -              host ceph-02
 1  hdd    0.09769   1.00000  100 GiB   15 GiB   13 GiB   37 MiB   1.5 GiB   85 GiB  14.84  1.10   28  up              osd.1
 4  hdd    0.09769   1.00000  100 GiB   15 GiB   14 GiB   55 MiB   1.3 GiB   85 GiB  15.35  1.14   29  up              osd.4
-7         0.19537         -  200 GiB   24 GiB   22 GiB  177 MiB   1.6 GiB  176 GiB  11.86  0.88    -              host ceph-03
 0  hdd    0.09769   1.00000  100 GiB   11 GiB  9.8 GiB  107 MiB   1.1 GiB   89 GiB  10.99  0.81   25  up              osd.0
 3  hdd    0.09769   1.00000  100 GiB   13 GiB   12 GiB   70 MiB   501 MiB   87 GiB  12.72  0.94   27  up              osd.3
                       TOTAL  600 GiB   81 GiB   74 GiB  408 MiB   6.4 GiB  519 GiB  13.50
MIN/MAX VAR: 0.80/1.21  STDDEV: 2.15

After adjustment:
# ceph osd df tree
ID  CLASS   WEIGHT  REWEIGHT     SIZE  RAW USE     DATA     OMAP      META    AVAIL   %USE   VAR  PGS  STATUS  TYPE NAME
-1         0.58612         -  600 GiB   82 GiB   74 GiB  408 MiB   7.0 GiB  518 GiB  13.60  1.00    -          root default
-3         0.19537         -  200 GiB   27 GiB   25 GiB  139 MiB   2.3 GiB  173 GiB  13.70  1.01    -              host ceph-01
 2  hdd    0.09769   1.00000  100 GiB   14 GiB   13 GiB   72 MiB   1.3 GiB   86 GiB  13.93  1.02   28  up              osd.2
 5  hdd    0.09769   1.00000  100 GiB   13 GiB   12 GiB   67 MiB   1.1 GiB   87 GiB  13.47  0.99   26  up              osd.5
-5         0.19537         -  200 GiB   28 GiB   25 GiB   92 MiB   2.8 GiB  172 GiB  14.17  1.04    -              host ceph-02
 1  hdd    0.09769   1.00000  100 GiB   14 GiB   13 GiB   37 MiB   1.5 GiB   86 GiB  14.23  1.05   27  up              osd.1
 4  hdd    0.09769   1.00000  100 GiB   14 GiB   13 GiB   55 MiB   1.3 GiB   86 GiB  14.10  1.04   27  up              osd.4
-7         0.19537         -  200 GiB   26 GiB   24 GiB  177 MiB   1.8 GiB  174 GiB  12.93  0.95    -              host ceph-03
 0  hdd    0.09769   1.00000  100 GiB   13 GiB   12 GiB  107 MiB   1.4 GiB   87 GiB  13.14  0.97   28  up              osd.0
 3  hdd    0.09769   1.00000  100 GiB   13 GiB   12 GiB   70 MiB   501 MiB   87 GiB  12.72  0.94   27  up              osd.3
                       TOTAL  600 GiB   82 GiB   74 GiB  408 MiB   7.0 GiB  518 GiB  13.60
MIN/MAX VAR: 0.94/1.05  STDDEV: 0.54

The PG count deviation across OSDs is now no greater than the configured value of 2, the CRUSH rules are unchanged, and OSD capacity usage has evened out as well.
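
The summary lines of the two outputs can be cross-checked from the per-OSD %USE column. The sketch below assumes that VAR is an OSD's %USE divided by the average and that STDDEV is a population standard deviation over the OSDs; for this six-OSD cluster both assumptions reproduce the reported figures. The function name `summarize` is hypothetical.

```python
from statistics import pstdev

# %USE per OSD, in the row order of the two `ceph osd df tree` outputs:
# osd.2, osd.5, osd.1, osd.4, osd.0, osd.3
USE_BEFORE = [10.76, 16.36, 14.84, 15.35, 10.99, 12.72]
USE_AFTER = [13.93, 13.47, 14.23, 14.10, 13.14, 12.72]


def summarize(percent_use):
    """Return min/max VAR and the population stddev, rounded to 2 decimals."""
    mean = sum(percent_use) / len(percent_use)
    return {
        "min_var": round(min(percent_use) / mean, 2),
        "max_var": round(max(percent_use) / mean, 2),
        "stddev": round(pstdev(percent_use), 2),
    }


if __name__ == "__main__":
    print("before:", summarize(USE_BEFORE))
    print("after: ", summarize(USE_AFTER))
```

The drop in STDDEV from 2.15 to 0.54 is the capacity-side effect of evening out the PG counts.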
