Handling a down OSD

Check the cluster status

ceph -s

View the OSD tree

ceph osd tree | more
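
On a large cluster the tree output is long, so it can help to filter it down to the OSDs that are marked down. The following is a minimal sketch, assuming the usual layout in which each OSD name (osd.N) is immediately followed by its status column. The function name list_down_osds is hypothetical.

```shell
# list_down_osds: read `ceph osd tree` output on stdin and print the
# names of the OSDs whose status column is "down".
# Assumption: the status (up/down) is the field right after the osd.N name.
list_down_osds() {
  awk '{
    for (i = 1; i < NF; i++)
      if ($i ~ /^osd\.[0-9]+$/ && $(i + 1) == "down")
        print $i
  }'
}

# Usage (on a node with admin access to the cluster):
#   ceph osd tree | list_down_osds
```

The host line above each reported OSD in the full tree tells you which node to log in to.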

Recover the OSD

Log in to the node that hosts the OSD and restart it

service ceph status osd.77
service ceph restart osd.77

Verify that the OSD is up

ceph osd tree | more

Check the ceph-osd processes (either command works)

ps -aux|grep ceph-osd

ps -ef|grep ceph-osd

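Throttling data recovery traffic

Essentially, when user data is written to Ceph, it is split into objects of equal size. These objects are grouped into PGs (placement groups) and distributed across different OSDs (each OSD usually corresponds to one disk). Data migrates in units of PGs, so whenever the PG placement changes, data is rebalanced.

Backend rebalancing I/O competes with client I/O and therefore affects the cluster's business I/O, so it needs to be controlled. There are two main modes: business priority and recovery priority.

Business priority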

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-max-single-start 1'
ceph tell osd.* injectargs '--osd-recovery-sleep 1'

Recovery priority

ceph tell osd.* injectargs '--osd-max-backfills 5 --osd-recovery-max-active 5 --osd-recovery-max-single-start 5'
ceph tell osd.* injectargs '--osd-recovery-sleep 0'
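
The two profiles above differ only in their numeric values, so they can be wrapped in a small helper. This is a sketch, not an official tool: the function name recovery_args and the profile names business and recovery are hypothetical, and the values are the ones used in the commands above.

```shell
# recovery_args: print the injectargs string for a given profile.
# "business" throttles recovery; "recovery" lets it run faster.
recovery_args() {
  case "$1" in
    business) n=1; sleep_s=1 ;;
    recovery) n=5; sleep_s=0 ;;
    *) echo "usage: recovery_args business|recovery" >&2; return 1 ;;
  esac
  printf '%s\n' "--osd-max-backfills $n --osd-recovery-max-active $n --osd-recovery-max-single-start $n --osd-recovery-sleep $sleep_s"
}

# Usage:
#   ceph tell 'osd.*' injectargs "$(recovery_args business)"
```

Printing the argument string rather than running ceph directly makes it easy to review the values before applying them.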

When business load is heavy, disable data recovery and migration entirely:

ceph osd set norebalance
ceph osd set norecover
ceph osd set nobackfill

When business load is light, re-enable data recovery and migration:

ceph osd unset norebalance
ceph osd unset norecover
ceph osd unset nobackfill
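
Because the three flags are always toggled together, a small wrapper reduces the chance of leaving one of them set by mistake. A minimal sketch: the function name recovery_flags and the CEPH override variable are hypothetical and exist only to allow a dry run.

```shell
# recovery_flags: set or unset norebalance, norecover and nobackfill together.
# CEPH is a hypothetical override for the ceph binary; with CEPH=echo the
# commands are printed instead of executed.
recovery_flags() {
  case "$1" in
    set|unset) ;;
    *) echo "usage: recovery_flags set|unset" >&2; return 1 ;;
  esac
  for flag in norebalance norecover nobackfill; do
    "${CEPH:-ceph}" osd "$1" "$flag" || return 1
  done
}

# Usage:
#   recovery_flags set      # busy hours
#   recovery_flags unset    # idle hours
```

While the flags are set, ceph -s reports them in its health output, which is a quick way to confirm that they have been cleared again afterwards.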

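Values applied with injectargs take effect immediately but are lost when an OSD restarts. To make them persistent, apply them at runtime as above and then also update the configuration file on every node of the Ceph cluster.

Note: to view the current recovery settings, query the OSD's admin socket. Here 133 is the ID of a specific OSD.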
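
As an illustration, the persisted settings could look like the section printed below. This is a sketch under assumptions: it uses the business-priority values, the function name print_osd_conf is hypothetical, and the config file is assumed to be the common default /etc/ceph/ceph.conf, which may differ in your deployment.

```shell
# print_osd_conf: print an [osd] section with the business-priority values,
# to be reviewed and merged by hand into the config file on each node
# (assumed here to be /etc/ceph/ceph.conf).
print_osd_conf() {
  cat <<'EOF'
[osd]
osd_max_backfills = 1
osd_recovery_max_active = 1
osd_recovery_max_single_start = 1
osd_recovery_sleep = 1
EOF
}

# Usage:
#   print_osd_conf    # review the output, then merge it into the config file
```

If the file already has an [osd] section, merge the keys into it rather than appending a second section.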

ceph --admin-daemon  /var/run/ceph/ceph-osd.133.asok config show | grep -E "osd_max_backfills|osd_recovery_max_active|osd_recovery_max_single_start|osd_recovery_sleep"
    "osd_max_backfills": "1",
    "osd_recovery_max_active": "1",
    "osd_recovery_max_single_start": "1",
    "osd_recovery_sleep": "0.000000",
    "osd_recovery_sleep_hdd": "0.100000",
    "osd_recovery_sleep_hybrid": "0.025000",
    "osd_recovery_sleep_ssd": "0.000000",
