Replacing a Failed OSD Disk in a Ceph Cluster | The Full Procedure


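Preface

Once a Ceph cluster grows large enough, failed disks become routine. This post records the steps I follow to replace one.

Procedure

Finding the bad disk

Start by checking the state of the OSDs with ceph osd perf or directly in the management dashboard. An OSD with unusually high latency is the likely culprit. If its utilization is low and its latency is still high, replace it without hesitation.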
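As a rough sketch of the latency check, the filter below reads `ceph osd perf` output on stdin and prints the OSDs whose commit latency is above a threshold. The function name `slow_osds`, the 100 ms default, and the three-column layout (OSD id, commit latency, apply latency) are assumptions made for illustration; the column layout may differ between Ceph releases.

```shell
# Hypothetical helper: list OSDs whose commit latency (column 2, in ms)
# exceeds a threshold. Assumes the three-column `ceph osd perf` layout:
#   osd  commit_latency(ms)  apply_latency(ms)
slow_osds() {
    local threshold="${1:-100}"   # assumed default, tune for your cluster
    awk -v t="$threshold" '
        # Skip the header: only rows whose first field is a numeric OSD id.
        $1 ~ /^[0-9]+$/ && $2 + 0 > t + 0 { printf "osd.%s %s\n", $1, $2 }
    '
}

# Usage (on a node with the ceph CLI available):
#   ceph osd perf | slow_osds 200
```

High latency alone is only a hint, so it is worth cross-checking the candidates against their utilization before pulling a disk.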

Marking the bad disk out

ceph osd out 23   # replace 23 with the ID of your problem OSD

Stopping the service

systemctl stop ceph-osd@23   # replace 23 with the ID of your problem OSD
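Since both commands take the same OSD ID, they can be wrapped so the ID is typed only once. The following is a minimal sketch, not part of the original procedure: `run`, `retire_osd`, and the `DRY_RUN` variable are hypothetical names, and by default the commands are only printed so they can be reviewed first.

```shell
# Hypothetical dry-run wrapper (not part of the original procedure).
# With DRY_RUN=1, the default, commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi
}

retire_osd() {
    local osd_id="$1"
    # Accept only a plain numeric OSD id such as 23.
    case "$osd_id" in
        ''|*[!0-9]*) echo "usage: retire_osd <numeric-osd-id>" >&2; return 2 ;;
    esac
    run ceph osd out "$osd_id" || return      # mark out so data rebalances away
    run systemctl stop "ceph-osd@$osd_id"     # then stop the OSD daemon
}

# Usage:
#   retire_osd 23              # prints the two commands
#   DRY_RUN=0 retire_osd 23    # actually runs them
```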

Removing the disk



root@SH-1004:~# ceph osd safe-to-destroy osd.23

OSD(s) 23 are safe to destroy without reducing data durability.

# First confirm that the disk can be removed safely, then go ahead:

root@SH-1004:~# ceph osd destroy 23 --yes-i-really-mean-it

destroyed osd.23
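Right after an OSD is marked out, the safety check may fail until the data has been rebalanced elsewhere. A sketch of waiting for it, assuming `ceph osd safe-to-destroy` exits non-zero while the OSD is not yet safe; the function name, the 60-second interval, and the retry limit are illustrative choices:

```shell
# Sketch: poll `ceph osd safe-to-destroy` until it succeeds.
# Relies on the command exiting non-zero while the OSD still holds needed data.
wait_safe_to_destroy() {
    local osd_id="$1"
    local interval="${2:-60}"    # seconds between checks (assumed default)
    local max_tries="${3:-0}"    # 0 means retry forever
    local tries=0
    until ceph osd safe-to-destroy "osd.$osd_id"; do
        tries=$((tries + 1))
        if [ "$max_tries" -gt 0 ] && [ "$tries" -ge "$max_tries" ]; then
            echo "osd.$osd_id still not safe after $tries checks" >&2
            return 1
        fi
        sleep "$interval"
    done
}

# Usage:
#   wait_safe_to_destroy 23 && ceph osd destroy 23 --yes-i-really-mean-it
```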

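Replacing the drive

First mark the drive with MegaCli, then have the datacenter staff do the physical swap. Step one is to find the serial number of the disk behind this device name: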



root@SH-1004:~# smartctl -a /dev/sda -d megaraid,0

smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.34-1-pve] (local build)

Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: HP
Product: MB2000FAMYV
Revision: HPD7
Compliance: SPC-3
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Logical block size: 512 bytes
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x5000c500211c625f
Serial number: 9WM15QAD0000C050EV4F
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Mon Dec 21 20:04:53 2020 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Current Drive Temperature: 0 C
Drive Trip Temperature: 0 C

Elements in grown defect list: 4096

Error Counter logging not supported

Device does not support Self Test logging
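Only two fields from this output matter for the replacement: the serial number to hand to the datacenter, and the grown defect count as evidence that the drive is failing. A small sketch that extracts both, assuming the SAS-style field labels shown in the output above (`smart_summary` is a hypothetical name):

```shell
# Sketch: pull the serial number and grown-defect count out of
# `smartctl -a` output for a SAS drive (field labels as in the output above).
smart_summary() {
    # Split each line on the colon plus any padding spaces after it.
    awk -F': *' '
        /^Serial number:/                 { serial = $2 }
        /^Elements in grown defect list:/ { defects = $2 }
        END { printf "serial=%s grown_defects=%s\n", serial, defects }
    '
}

# Usage:
#   smartctl -a /dev/sda -d megaraid,0 | smart_summary
```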

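Next, list the physical drives with PDlist and find the slot whose serial number matches. Here the drive sits at enclosure 32, slot 0, which is written as [32:0] in the commands that follow:



MegaCli64 -PDlist -a0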
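Scanning a long PDlist dump by eye is error-prone, so the lookup can be scripted. This is a sketch under stated assumptions: each drive record in the PDlist output is assumed to contain "Enclosure Device ID:", "Slot Number:" and "Inquiry Data:" lines in that order, with the serial number embedded somewhere in the Inquiry Data string. The function name `find_slot` is hypothetical; check the field names against your controller's actual output.

```shell
# Sketch: map a drive serial number to its [enclosure:slot] address.
# Reads `MegaCli64 -PDlist -a0` output on stdin (assumed field names, see above).
find_slot() {
    [ -n "$1" ] || { echo "usage: find_slot <serial>" >&2; return 2; }
    awk -v serial="$1" '
        /^Enclosure Device ID:/ { enc = $NF }    # remember the current enclosure
        /^Slot Number:/         { slot = $NF }   # remember the current slot
        /^Inquiry Data:/ && index($0, serial) { printf "[%s:%s]\n", enc, slot }
    '
}

# Usage:
#   MegaCli64 -PDlist -a0 | find_slot 9WM15QAD0000C050EV4F
```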

Then take the drive offline and turn on its locate LED:



MegaCli64 -pdoffline -physdrv[32:0] -a0

MegaCli64 -PdLocate -start -physdrv[32:0] -a0
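To avoid mistyping the drive address, the commands can be generated from the enclosure and slot numbers. The helper below is a hypothetical sketch that only prints the commands for review. It also prints a `-PdLocate -stop` command for switching the LED off after the swap, which the steps above do not cover.

```shell
# Hypothetical helper: print the MegaCli commands for one drive address.
# Nothing is executed; review the output, then run the lines by hand.
megacli_replace_cmds() {
    local enc="$1" slot="$2" adapter="${3:-0}"
    local drv="-physdrv[$enc:$slot]"
    echo "MegaCli64 -pdoffline $drv -a$adapter"         # take the drive offline
    echo "MegaCli64 -PdLocate -start $drv -a$adapter"   # blink the locate LED
    echo "MegaCli64 -PdLocate -stop $drv -a$adapter"    # after the swap: LED off
}

# Usage:
#   megacli_replace_cmds 32 0
```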


This article is under CC BY-NC-SA 4.0 license.
Please quote the original link:https://www.liujason.com/article/1125.html