
Identifying the problem

This post documents a Ceph OSD down incident and how the disk behind the failed OSD was wiped back to a bare device and re-added to the cluster.

Check the OSD status

ceph osd status
ceph osd tree

# check which disk backs the OSD
ceph device ls-by-daemon osd.0
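
If the mapping from OSD to physical device is still unclear, the OSD metadata and the ceph-volume inventory expose the same information. Two alternatives, assuming OSD id 0 and a cephadm-managed cluster:

ceph osd metadata 0 | grep -E '"devices"|"bluestore_bdev_dev_node"'
cephadm ceph-volume lvm list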

Restart the problematic OSD

systemctl status [email protected]
systemctl restart [email protected]
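
If the restart keeps failing, the startup log can be pulled from the journal, or via cephadm on a cephadm-managed cluster. A sketch, assuming the daemon is osd.0:

journalctl -u [email protected] -n 200 --no-pager
cephadm logs --name osd.0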

Check the startup log

2024-03-13T01:57:33.588+0000 7f510481f640 0 osd.0:4.OSDShard using op scheduler mClockScheduler
2024-03-13T01:57:33.756+0000 7f510481f640 1 bdev(0x5573cf796000 /var/lib/ceph/osd/ceph-0/block) open path /var/lib/ceph/osd/ceph-0/block
2024-03-13T01:57:33.760+0000 7f510481f640 1 bdev(0x5573cf796000 /var/lib/ceph/osd/ceph-0/block) open size 107369988096 (0x18ffc00000, 100 GiB) block_size 4096 (4 KiB) rotational device, discard supported
2024-03-13T01:57:33.896+0000 7f510481f640 1 bluestore(/var/lib/ceph/osd/ceph-0) _set_cache_sizes cache_size 1073741824 meta 0.45 kv 0.45 data 0.06
2024-03-13T01:57:33.896+0000 7f510481f640 1 bdev(0x5573cf796380 /var/lib/ceph/osd/ceph-0/block) open path /var/lib/ceph/osd/ceph-0/block
2024-03-13T01:57:33.896+0000 7f510481f640 1 bdev(0x5573cf796380 /var/lib/ceph/osd/ceph-0/block) open size 107369988096 (0x18ffc00000, 100 GiB) block_size 4096 (4 KiB) rotational device, discard supported
2024-03-13T01:57:33.896+0000 7f510481f640 1 bluefs add_block_device bdev 1 path /var/lib/ceph/osd/ceph-0/block size 100 GiB
2024-03-13T01:57:33.896+0000 7f510481f640 1 bluefs mount
2024-03-13T01:57:34.020+0000 7f510481f640 1 bluefs _init_alloc shared, id 1, capacity 0x18ffc00000, block size 0x10000
2024-03-13T01:57:34.432+0000 7f510481f640 1 bluefs mount shared_bdev_used = 0
2024-03-13T01:57:34.436+0000 7f510481f640 1 bluestore(/var/lib/ceph/osd/ceph-0) _prepare_db_environment set db_paths to db,102001488691 db.slow,102001488691
2024-03-13T01:57:34.756+0000 7f510481f640 -1 bluestore(/var/lib/ceph/osd/ceph-0) _open_db erroring opening db:
2024-03-13T01:57:34.756+0000 7f510481f640 1 bluefs umount
2024-03-13T01:57:34.756+0000 7f510481f640 1 bdev(0x5573cf796380 /var/lib/ceph/osd/ceph-0/block) close
2024-03-13T01:57:34.936+0000 7f510481f640 1 bdev(0x5573cf796000 /var/lib/ceph/osd/ceph-0/block) close
2024-03-13T01:57:35.052+0000 7f510481f640 -1 osd.0 0 OSD:init: unable to mount object store
2024-03-13T01:57:35.052+0000 7f510481f640 -1 ** ERROR: osd init failed: (5) Input/output error

The log shows BlueStore failing to open its database and the OSD init aborting with "Input/output error" (I/O error).
Possible causes:

  • Disk/partition failure: the disk or partition backing osd.0 may have a physical fault, so reads and writes fail
  • Filesystem corruption: the on-disk store used by osd.0 may have become corrupted for any number of reasons (the fsck sketch after this list can confirm it)
  • Misconfiguration: a problem in the Ceph configuration could keep the OSD from finding or correctly using its storage device
  • Permission problems: the Ceph process may be missing the necessary permissions on /var/lib/ceph/osd/ceph-0 and the files in it
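
To narrow things down, BlueStore's own consistency check can be run against the stopped OSD. This is only a sketch: the cephadm shell invocation assumes the daemon is named osd.0 and that its data directory is mounted at the default /var/lib/ceph/osd/ceph-0 inside the container.

systemctl stop [email protected]
cephadm shell --name osd.0 -- ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-0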

Troubleshooting

The configuration had not been changed, so a configuration problem is unlikely.

Check the disk

smartctl -a /dev/nvme1n1

A SMART check shows that the disk itself is fine:

smartctl 7.0 2018-12-30 r4883 [x86_64-linux-5.19.7-1.el7.elrepo.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number: Amazon Elastic Block Store
Serial Number: vol0d9d42104245d5125
Firmware Version: 2.0
PCI Vendor/Subsystem ID: 0x1d0f
IEEE OUI Identifier: 0xa002dc
Controller ID: 0
Number of Namespaces: 1
Namespace 1 Size/Capacity: 214,748,364,800 [214 GB]
Namespace 1 Formatted LBA Size: 512
Local Time is: Wed Mar 13 04:21:53 2024 UTC
Firmware Updates (0x03): 1 Slot, Slot 1 R/O
Maximum Data Transfer Size: 64 Pages
Warning Comp. Temp. Threshold: 70 Celsius
Namespace 1 Features (0x12): NA_Fields *Other*

Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 0.01W - - 0 0 0 0 1000000 1000000

Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: -
Available Spare: 0%
Available Spare Threshold: 0%
Percentage Used: 0%
Data Units Read: 0
Data Units Written: 0
Host Read Commands: 0
Host Write Commands: 0
Controller Busy Time: 0
Power Cycles: 0
Power On Hours: 0
Unsafe Shutdowns: 0
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0

Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged
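
SMART alone does not rule out transient I/O problems, so the kernel log and the NVMe health log are worth a quick look too. A sketch, assuming nvme-cli is installed and the device is /dev/nvme1n1:

dmesg -T | grep -iE 'nvme1n1|I/O error'
nvme smart-log /dev/nvme1n1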

Check permissions

ll /var/lib/ceph/6b6d577c-b035-11ee-8b7f-e51965640d97/osd.0

The permissions on the osd.0 data directory also look fine:

lrwxrwxrwx 1 167 167  111 Mar 13 04:16 block -> /dev/mapper/ceph--2ee73f01--0a14--4ea0--a33d--59977dd4e642-osd--block--8e734088--2ddc--4a50--8e04--9d4570e7ae88
-rw------- 1 167 167 37 Mar 13 04:16 ceph_fsid
-rw------- 1 167 167 259 Mar 13 04:16 config
-rw------- 1 167 167 37 Mar 13 04:16 fsid
-rw------- 1 167 167 55 Mar 13 04:16 keyring
-rw------- 1 167 167 6 Mar 13 04:16 ready
-rw------- 1 167 167 3 Mar 13 04:16 require_osd_release
-rw------- 1 167 167 10 Mar 13 04:16 type
-rw------- 1 167 167 38 Mar 13 04:16 unit.configured
-rw------- 1 167 167 48 Mar 13 04:16 unit.created
-rw------- 1 167 167 90 Mar 13 04:16 unit.image
-rw------- 1 167 167 361 Mar 13 04:16 unit.meta
-rw------- 1 167 167 1.7K Mar 13 04:16 unit.poststop
-rw------- 1 167 167 2.8K Mar 13 04:16 unit.run
-rw------- 1 167 167 330 Mar 13 04:16 unit.stop
-rw------- 1 167 167 2 Mar 13 04:16 whoami
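
It is also worth confirming that the block symlink resolves to a live device-mapper node, since a missing mapping would keep the OSD from opening its store as well. A quick check, using the paths from the listing above:

readlink -f /var/lib/ceph/6b6d577c-b035-11ee-8b7f-e51965640d97/osd.0/block
dmsetup info ceph--2ee73f01--0a14--4ea0--a33d--59977dd4e642-osd--block--8e734088--2ddc--4a50--8e04--9d4570e7ae88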

The conclusion is that the problem lies in a corrupted Ceph on-disk store, which has to be rebuilt.

Recovery

Remove the OSD

ceph osd out osd.0
ceph osd purge 0 --yes-i-really-mean-it
ceph osd crush remove osd.0
ceph auth rm osd.0
ceph osd rm osd.0
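
Note that ceph osd purge already removes the CRUSH entry, the auth key, and the OSD id, so the three commands after it are belt and braces. Before purging, it is also worth confirming that destroying the OSD cannot cause data loss, and on a cephadm-managed cluster the containerized daemon usually has to be removed too. A sketch, assuming the daemon is named osd.0:

ceph osd safe-to-destroy 0
ceph orch daemon rm osd.0 --force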

Wipe the disk

sgdisk --zap-all /dev/nvme1n1
wipefs -a -f /dev/nvme1n1
blkdiscard /dev/nvme1n1
partprobe /dev/nvme1n1
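
Even after these commands the old ceph logical volume can still sit on top of the disk, which is what makes the zap below fail. A quick way to see whether anything is still layered on the device:

lsblk -f /dev/nvme1n1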

Find the volume group mapping for the disk and remove it

ls /dev/mapper/ceph-*
rm -rf /dev/mapper/ceph--52ae98df--3053--40af--86c4--def1ed9b9e68-osd--block--f09b0512--e73a--473b--8316--ccfe092c142d
ceph-volume lvm zap /dev/nvme1n1 --destroy

This errors out, showing that the disk is still in use; the holder has to be released before the device can be wiped.

--> Zapping: /dev/nvme1n1
stderr: wipefs: error: /dev/nvme1n1: probing initialization failed: Device or resource busy
--> failed to wipefs device, will try again to workaround probable race condition
stderr: wipefs: error: /dev/nvme1n1: probing initialization failed: Device or resource busy
--> failed to wipefs device, will try again to workaround probable race condition
stderr: wipefs: error: /dev/nvme1n1: probing initialization failed: Device or resource busy
--> failed to wipefs device, will try again to workaround probable race condition
stderr: wipefs: error: /dev/nvme1n1: probing initialization failed: Device or resource busy
--> failed to wipefs device, will try again to workaround probable race condition
stderr: wipefs: error: /dev/nvme1n1: probing initialization failed: Device or resource busy
--> failed to wipefs device, will try again to workaround probable race condition

Check whether any process is holding the device; if so, it needs to be cleaned up first.

lsof /dev/nvme1n1
fuser /dev/nvme1n1
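
If no process shows up, the "busy" state usually comes from a device-mapper holder rather than an open file. The holders can be listed straight from sysfs (the path assumes the device is nvme1n1):

ls /sys/block/nvme1n1/holders/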

Check for a leftover device-mapper mapping and remove it.

dmsetup ls
dmsetup remove ceph--52ae98df--3053--40af--86c4--def1ed9b9e68-osd--block--f09b0512--e73a--473b--8316--ccfe092c142d
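
If LVM metadata is still left behind after removing the mapping, the volume group and physical volume can also be cleaned up with the standard LVM tools before re-running ceph-volume lvm zap. A sketch; the VG name is decoded from the device-mapper name above, so confirm it against the vgs output first:

vgs
vgremove -f ceph-52ae98df-3053-40af-86c4-def1ed9b9e68
pvremove -ff /dev/nvme1n1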

Re-add the OSD

ceph orch daemon add osd ceph1:/dev/nvme1n1
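
If the add is rejected because the device is still reported as unavailable, the orchestrator's view of the disk can be refreshed and the device zapped from the orchestrator side as well. A sketch, using the same host and device names:

ceph orch device ls ceph1 --refresh
ceph orch device zap ceph1 /dev/nvme1n1 --force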

Check cluster health

Re-checking the cluster status shows the OSD is back to normal; all that is left is to wait for the PGs to recover.

ceph -s
ceph osd tree
ceph osd status
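
Recovery progress can be followed until all PGs are back to active+clean, for example with:

ceph pg stat
ceph -w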