(6)ceph集群osd down 故障处理
admin
2023-03-12 02:21:45
0

(1)查看集群状态,发现2个osd 状态为down

[root@node140 /]# ceph -s 
  cluster:
    id:     58a12719-a5ed-4f95-b312-6efd6e34e558
    health: HEALTH_ERR
            noout flag(s) set
            2 osds down
            1 scrub errors
            Possible data damage: 1 pg inconsistent
            Degraded data redundancy: 1633/10191 objects degraded (16.024%), 84 pgs degraded, 122 pgs undersized

  services:
    mon: 2 daemons, quorum node140,node142 (age 3d)
    mgr: admin(active, since 3d), standbys: node140
    osd: 18 osds: 16 up (since 3d), 18 in (since 5d)
         flags noout

  data:
    pools:   2 pools, 384 pgs
    objects: 3.40k objects, 9.8 GiB
    usage:   43 GiB used, 8.7 TiB / 8.7 TiB avail
    pgs:     1633/10191 objects degraded (16.024%)
             261 active+clean
             84  active+undersized+degraded
             38  active+undersized
             1   active+clean+inconsistent

(2)查看osd状态

[root@node140 /]# ceph  osd tree
ID CLASS WEIGHT  TYPE NAME        STATUS REWEIGHT PRI-AFF 
-1       9.80804 root default                             
-2       3.26935     host node140                         
 0   hdd 0.54489         osd.0        up  1.00000 1.00000 
 1   hdd 0.54489         osd.1        up  1.00000 1.00000 
 2   hdd 0.54489         osd.2        up  1.00000 1.00000 
 3   hdd 0.54489         osd.3        up  1.00000 1.00000 
 4   hdd 0.54489         osd.4        up  1.00000 1.00000 
 5   hdd 0.54489         osd.5        up  1.00000 1.00000 
-3       3.26935     host node141                         
12   hdd 0.54489         osd.12       up  1.00000 1.00000 
13   hdd 0.54489         osd.13       up  1.00000 1.00000 
14   hdd 0.54489         osd.14       up  1.00000 1.00000 
15   hdd 0.54489         osd.15       up  1.00000 1.00000 
16   hdd 0.54489         osd.16       up  1.00000 1.00000 
17   hdd 0.54489         osd.17       up  1.00000 1.00000 
-4       3.26935     host node142                         
 6   hdd 0.54489         osd.6        up  1.00000 1.00000 
 7   hdd 0.54489         osd.7      down  1.00000 1.00000 
 8   hdd 0.54489         osd.8      down  1.00000 1.00000 
 9   hdd 0.54489         osd.9        up  1.00000 1.00000 
10   hdd 0.54489         osd.10       up  1.00000 1.00000 
11   hdd 0.54489         osd.11       up  1.00000 1.00000 

(3)osd 7 osd 8状态查看,已经failed了,重启也无法启动

[root@node140 /]# systemctl status ceph-osd@8.service
● ceph-osd@8.service - Ceph object storage daemon osd.8
   Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled; vendor preset: disabled)
   Active: failed (Result: start-limit) since Fri 2019-08-30 17:36:50 CST; 1min 20s ago
  Process: 433642 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i (code=exited, status=1/FAILURE)

Aug 30 17:36:50 node140 systemd[1]: Failed to start Ceph object storage daemon osd.8.
Aug 30 17:36:50 node140 systemd[1]: Unit ceph-osd@8.service entered failed state.
Aug 30 17:36:50 node140 systemd[1]: ceph-osd@8.service failed.
Aug 30 17:36:50 node140 systemd[1]: ceph-osd@8.service holdoff time over, scheduling restart.
Aug 30 17:36:50 node140 systemd[1]: Stopped Ceph object storage daemon osd.8.
Aug 30 17:36:50 node140 systemd[1]: start request repeated too quickly for ceph-osd@8.service
Aug 30 17:36:50 node140 systemd[1]: Failed to start Ceph object storage daemon osd.8.
Aug 30 17:36:50 node140 systemd[1]: Unit ceph-osd@8.service entered failed state.
Aug 30 17:36:50 node140 systemd[1]: ceph-osd@8.service failed.

(4)osd硬盘故障,状态变化
osd硬盘故障,状态变为down。在经过mod osd down out interval 设定的时间间隔后,ceph将其标记为out,并开始进行数据迁移恢复。 为了降低影响可以先关闭,待硬盘更换完成后再开启
[root@node140 /]# cat /etc/ceph/ceph.conf
[global]
mon osd down out interval = 900

(5)停止数据均衡
[root@node140 /]# for i in noout nobackfill norecover noscrub nodeep-scrub;do ceph osd set $i;done

(6)定位i故障盘
[root@node140 /]# ceph osd tree | grep -i down
7 hdd 0.54489 osd.7 down 0 1.00000
8 hdd 0.54489 osd.8 down 0 1.00000

(7)卸载故障的节点
[root@node142 ~]# umount /var/lib/ceph/osd/ceph-7
[root@node142 ~]# umount /var/lib/ceph/osd/ceph-8

(8)从crush map 中移除osd
[root@node142 ~]# ceph osd crush remove osd.7
removed item id 7 name 'osd.7' from crush map
[root@node142 ~]# ceph osd crush remove osd.8
removed item id 8 name 'osd.8' from crush map

(9)删除故障osd的密钥
[root@node142 ~]# ceph auth del osd.7
updated
[root@node142 ~]# ceph auth del osd.8
updated

(10)删除故障osd

[root@node142 ~]# ceph osd rm 7 
removed osd.7
[root@node142 ~]# ceph osd rm 8
removed osd.8
[root@node142 ~]# ceph osd tree 
ID CLASS WEIGHT  TYPE NAME        STATUS REWEIGHT PRI-AFF 
-1       8.71826 root default                             
-2       3.26935     host node140                         
 0   hdd 0.54489         osd.0        up  1.00000 1.00000 
 1   hdd 0.54489         osd.1        up  1.00000 1.00000 
 2   hdd 0.54489         osd.2        up  1.00000 1.00000 
 3   hdd 0.54489         osd.3        up  1.00000 1.00000 
 4   hdd 0.54489         osd.4        up  1.00000 1.00000 
 5   hdd 0.54489         osd.5        up  1.00000 1.00000 
-3       3.26935     host node141                         
12   hdd 0.54489         osd.12       up  1.00000 1.00000 
13   hdd 0.54489         osd.13       up  1.00000 1.00000 
14   hdd 0.54489         osd.14       up  1.00000 1.00000 
15   hdd 0.54489         osd.15       up  1.00000 1.00000 
16   hdd 0.54489         osd.16       up  1.00000 1.00000 
17   hdd 0.54489         osd.17       up  1.00000 1.00000 
-4       2.17957     host node142                         
 6   hdd 0.54489         osd.6        up  1.00000 1.00000 
 9   hdd 0.54489         osd.9        up  1.00000 1.00000 
10   hdd 0.54489         osd.10       up  1.00000 1.00000 
11   hdd 0.54489         osd.11       up  1.00000 1.00000 
[root@node142 ~]# 

(11)更换故障硬盘查看盘符,然后重建
[root@node142 ~]# ceph-volume lvm create --data /dev/sdd
[root@node142 ~]# ceph-volume lvm create --data /dev/sdc

(12)
[root@node142 ~]# ceph-volume lvm list
(13)待新osd添加crush map后,重新开启集群禁用标志
for i in noout nobackfill norecover noscrub nodeep-scrub;do ceph osd unset $i;done

相关内容

热门资讯

苏巧慧阵营影射李四川家族涉黑,... 海峡导报综合报道 新北市长选战硝烟渐起,身处苏巧慧阵营的新北市议员翁震州发文质疑李四川家族在小琉球做...
红场阅兵后,普京同军官握手致意 据凤凰卫视报道,5月9日,俄罗斯胜利日阅兵接近尾声,普京与军官一一握手致意。
车间里走出“准工程师”——黄河... 3500元项目津贴,三个真实项目研发经历,一份实习期月薪8000多元的录用通知——这是工学部2022...
狱中一年瘦脱相,被释放的巴勒斯... 曾为巴勒斯坦媒体工作的记者阿里·萨穆迪在以色列监狱被关押一年后,于4月30日获释。获释当天,萨穆迪的...
俄罗斯红场阅兵战士齐呼乌拉 当地时间5月9日10时(北京时间15时),俄罗斯纪念卫国战争胜利81周年阅兵在莫斯科红场举行。俄罗斯...
福建一地发现华南虎?当地辟谣:... 近日,福建龙岩漳平市,有网民称,“三重岭发现华南虎”,还配了一张老虎夜间出没山林的图片。这位网民表示...
俄红场阅兵现场播放无人机作战视... 据凤凰卫视报道,当地时间5月9日,俄罗斯胜利日阅兵式在莫斯科举行。阅兵现场,同步播放了无人机作战相关...
5月10日起,北京部分地铁线试... 为服务骑行爱好者携车出行,在借鉴国内先进城市成熟经验、深入开展实地调研的基础上,结合本市轨道交通运营...
解放军主战舰艇编队进入澎湖西南... 5月9日下午,国防部新闻局副局长、国防部新闻发言人蒋斌大校就近期涉军问题发布消息。媒体提到,据报道,...
普京会见三国总统 强化后苏联空... 普京会见三国总统  【普京会见三国总统】莫斯科5月8日电​ 当地时间5月8日,俄罗斯总统普京在莫斯科...