Cannot switch DRBD to secondary

Question

I'm running drbd83 with ocfs2 on CentOS 5 and plan to use Pacemaker with them. After some time, I'm facing a DRBD split-brain problem.

    version: 8.3.13 (api:88/proto:86-96)
    GIT-hash: 83ca112086600faacab2f157bc5a9324f7bd7f77 build by mockbuild@builder10.centos.org, 2012-05-07 11:56:36

     1: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown   r-----
        ns:0 nr:0 dw:112281991 dr:797551 al:99 bm:6401 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:60
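For anyone who wants to check these fields from a script, here is a minimal sketch that pulls the connection state, roles and disk states out of `/proc/drbd`-style text. The helper name `drbd_summary` is my own invention and not part of drbd-utils, and it assumes the 8.3 layout shown above.

```shell
# Sketch: summarise each DRBD minor from /proc/drbd-style text on stdin.
# drbd_summary is a hypothetical helper name, not a drbd-utils command.
drbd_summary() {
    awk '
        # Device lines start with "<minor>:" followed by cs:, ro: and ds: fields.
        $1 ~ /^[0-9]+:$/ {
            minor = $1
            sub(/:$/, "", minor)
            cs = "?"; ro = "?"; ds = "?"
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^cs:/) cs = substr($i, 4)
                if ($i ~ /^ro:/) ro = substr($i, 4)
                if ($i ~ /^ds:/) ds = substr($i, 4)
            }
            # Output: minor, connection state, roles, disk states.
            print minor, cs, ro, ds
        }
    '
}
```

On a live node it would be fed with `drbd_summary < /proc/drbd`; a `StandAlone` connection state together with `Primary/Unknown` is the situation described here.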

I cannot switch my DRBD to secondary.

    drbdadm secondary r0
    1: State change failed: (-12) Device is held open by someone
    Command 'drbdsetup 1 secondary' terminated with exit code 11

My DRBD resource configuration:

    resource r0 {
        syncer {
            rate 1000M;
            verify-alg sha1;
        }

        disk {
            on-io-error detach;
        }

        handlers {
            pri-lost-after-sb "/usr/lib/drbd/notify-split-brain.sh root";
        }

        net {
            allow-two-primaries;
            after-sb-0pri discard-younger-primary;
            after-sb-1pri call-pri-lost-after-sb;
            after-sb-2pri call-pri-lost-after-sb;
        }

        startup {
            become-primary-on both;
        }

        on serving_4130 {
            device /dev/drbd1;
            disk /dev/sdb1;
            address 192.168.4.130:7789;
            meta-disk internal;
        }

        on MT305-3182 {
            device /dev/drbd1;
            disk /dev/xvdb1;
            address 192.168.3.182:7789;
            meta-disk internal;
        }
    }

OCFS2 status:

    service ocfs2 status
    Configured OCFS2 mountpoints: /data

lsof shows that there is a process related to DRBD:

    lsof | grep drbd
    COMMAND     PID  USER   FD     TYPE  DEVICE  SIZE  NODE  NAME
    drbd1_wor  7782  root  cwd      DIR   253,0  4096     2  /
    drbd1_wor  7782  root  rtd      DIR   253,0  4096     2  /
    drbd1_wor  7782  root  txt  unknown                      /proc/7782/exe

And it is a dead symlink:

    # ls -l /proc/7782/exe
    ls: cannot read symbolic link /proc/7782/exe: No such file or directory
    lrwxrwxrwx 1 root root 0 May  4 09:56 /proc/7782/exe
    # ps -ef | awk '$2 == "7782" { print $0 }'
    root      7782     1  0 Apr22 ?        00:00:20 [drbd1_worker]

Note that this process is wrapped in square brackets:

man ps:

    args    COMMAND    command with all its arguments as a string.
                       Modifications to the arguments may be shown. The
                       output in this column may contain spaces. A process
                       marked <defunct> is partly dead, waiting to be fully
                       destroyed by its parent. Sometimes the process args
                       will be unavailable; when this happens, ps will
                       instead print the executable name in brackets.
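The bracketed name and the unreadable `exe` link are what you would expect from a kernel thread, which has no command line at all. A rough way to test for that from the shell is sketched below. The name `is_kernel_thread` and the optional second argument (an alternative proc root, handy for trying it out) are my own, and the check is only a heuristic, since a zombie process also has an empty `cmdline`.

```shell
# Sketch: heuristic check for a kernel thread such as [drbd1_worker].
# is_kernel_thread and the optional proc-root argument are illustrative names.
is_kernel_thread() {
    pid=$1
    proc_root=${2:-/proc}
    cmdline="$proc_root/$pid/cmdline"

    # No such process (or not readable): not something we can classify.
    [ -r "$cmdline" ] || return 1

    # Files under /proc report size 0, so read the content instead of using -s.
    # Kernel threads (and zombies) have an empty command line.
    [ -z "$(tr -d '\0' < "$cmdline")" ]
}
```

For the process above this would be called as `is_kernel_thread 7782`. A kernel thread cannot be removed with `kill`, so the holder of the device has to be found elsewhere.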

So, the final question is: how can we manually recover DRBD in this case without rebooting?

Reply to @andreask:

My partition table:

    # df -h
    Filesystem                       Size  Used Avail Use% Mounted on
    /dev/mapper/VolGroup00-LogVol00   35G  6.9G   27G  21% /
    /dev/xvda1                        99M   20M   74M  22% /boot
    tmpfs                            1.0G     0  1.0G   0% /dev/shm
    /dev/drbd1                       100G  902M  100G   1% /data

The device names:

    # dmsetup ls --tree -o inverted
     (202:2)
     ├─VolGroup00-LogVol01 (253:1)
     └─VolGroup00-LogVol00 (253:0)

Pay attention to the block device (253:0); it is the same one that appears in the lsof output:

    # lvdisplay
      --- Logical volume ---
      LV Name                /dev/VolGroup00/LogVol00
      VG Name                VolGroup00
      LV UUID                vCd152-amVZ-GaPo-H9Zs-TIS0-KI6j-ej8kYi
      LV Write Access        read/write
      LV Status              available
      # open                 1
      LV Size                35.97 GB
      Current LE             1151
      Segments               1
      Allocation             inherit
      Read ahead sectors     auto
      - currently set to     256
      Block device           253:0

Reply to @Doug:

    # vgdisplay
      --- Volume group ---
      VG Name               VolGroup00
      System ID
      Format                lvm2
      Metadata Areas        1
      Metadata Sequence No  3
      VG Access             read/write
      VG Status             resizable
      MAX LV                0
      Cur LV                2
      Open LV               2
      Max PV                0
      Cur PV                1
      Act PV                1
      VG Size               39.88 GB
      PE Size               32.00 MB
      Total PE              1276
      Alloc PE / Size       1276 / 39.88 GB
      Free  PE / Size       0 / 0
      VG UUID               OTwzII-AP5H-nIbH-k2UA-H9nw-juBv-wcvmBq

Update Fri May 17 16:08:16 ICT 201

Here are some ideas from Lars Ellenberg:

If the file system is still mounted ... oh well. Unmount it. Not lazily, but really.

I'm sure that OCFS2 was already unmounted.

If NFS was involved, try

    killall -9 nfsd
    killall -9 lockd
    echo 0 > /proc/fs/nfsd/threads

No, NFS was not involved.

If LVM/dmsetup/kpartx/multipath/udev is involved, try

    dmsetup ls --tree -o inverted

and check whether there are any dependencies on DRBD.

As you can see from my output above, LVM is not related to DRBD:

    pvdisplay -m
      --- Physical volume ---
      PV Name               /dev/xvda2
      VG Name               VolGroup00
      PV Size               39.90 GB / not usable 20.79 MB
      Allocatable           yes (but full)
      PE Size (KByte)       32768
      Total PE              1276
      Free PE               0
      Allocated PE          1276
      PV UUID               1t4hkB-p43c-ABex-stfQ-XaRt-9H4i-51gSTD

      --- Physical Segments ---
      Physical extent 0 to 1148:
        Logical volume      /dev/VolGroup00/LogVol00
        Logical extents     0 to 1148
      Physical extent 1149 to 1275:
        Logical volume      /dev/VolGroup00/LogVol01
        Logical extents     0 to 126

    fdisk -l

    Disk /dev/xvda: 42.9 GB, 42949672960 bytes
    255 heads, 63 sectors/track, 5221 cylinders
    Units = cylinders of 16065 * 512 = 8225280 bytes

        Device Boot      Start         End      Blocks   Id  System
    /dev/xvda1   *           1          13      104391   83  Linux
    /dev/xvda2              14        5221    41833260   8e  Linux LVM

    Disk /dev/xvdb: 107.3 GB, 107374182400 bytes
    255 heads, 63 sectors/track, 13054 cylinders
    Units = cylinders of 16065 * 512 = 8225280 bytes

        Device Boot      Start         End      Blocks   Id  System
    /dev/xvdb1               1       13054   104856223+  83  Linux

If loop/cryptoloop/etc. is involved, check whether any of them is still accessing it.

If some virtualization technique is in use, shut down/destroy all containers/VMs that may have accessed that DRBD during their lifetime.

No, none of that is in use.

Sometimes it's just udev or an equivalent racing with you.

I disabled the multipath rule and even stopped udevd, and nothing changed.

Sometimes it's a UNIX domain socket or similar that is still held open (it will not necessarily show up in lsof/fuser).

If so, how can we find this UNIX socket?

Update Wed May 22 22:10:41 ICT 201

Here is the stack trace of the DRBD worker process, dumped via the magic SysRq key:

    kernel: drbd1_worker  S ffff81007ae21820     0  7782      1          7795  7038 (L-TLB)
    kernel:  ffff810055d89e00 0000000000000046 000573a8befba2d6 ffffffff8008e82f
    kernel:  00078d18577c6114 0000000000000009 ffff81007ae21820 ffff81007fcae040
    kernel:  00078d18577ca893 00000000000002b1 ffff81007ae21a08 000000017a590180
    kernel: Call Trace:
    kernel:  [<ffffffff8008e82f>] enqueue_task+0x41/0x56
    kernel:  [<ffffffff80063002>] thread_return+0x62/0xfe
    kernel:  [<ffffffff80064905>] __down_interruptible+0xbf/0x112
    kernel:  [<ffffffff8008ee84>] default_wake_function+0x0/0xe
    kernel:  [<ffffffff80064713>] __down_failed_interruptible+0x35/0x3a
    kernel:  [<ffffffff885d461a>] :drbd:.text.lock.drbd_worker+0x2d/0x43
    kernel:  [<ffffffff885eca37>] :drbd:drbd_thread_setup+0x127/0x1e1
    kernel:  [<ffffffff800bab82>] audit_syscall_exit+0x329/0x344
    kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
    kernel:  [<ffffffff885ec910>] :drbd:drbd_thread_setup+0x0/0x1e1
    kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11

I'm not sure whether this OCFS2 heartbeat region is what prevents DRBD from switching to secondary:

    kernel: o2hb-C3E41CA2 S ffff810002536420     0  9251     31                3690 (L-TLB)
    kernel:  ffff810004af7d20 0000000000000046 ffff810004af7d30 ffffffff80063002
    kernel:  1400000004000000 000000000000000a ffff81007ec307a0 ffffffff80319b60
    kernel:  000935c260ad6764 0000000000000fcd ffff81007ec30988 0000000000027e86
    kernel: Call Trace:
    kernel:  [<ffffffff80063002>] thread_return+0x62/0xfe
    kernel:  [<ffffffff8006389f>] schedule_timeout+0x8a/0xad
    kernel:  [<ffffffff8009a41d>] process_timeout+0x0/0x5
    kernel:  [<ffffffff8009a97c>] msleep_interruptible+0x21/0x42
    kernel:  [<ffffffff884b3b0b>] :ocfs2_nodemanager:o2hb_thread+0xd2c/0x10d6
    kernel:  [<ffffffff80063002>] thread_return+0x62/0xfe
    kernel:  [<ffffffff800a329f>] keventd_create_kthread+0x0/0xc4
    kernel:  [<ffffffff884b2ddf>] :ocfs2_nodemanager:o2hb_thread+0x0/0x10d6
    kernel:  [<ffffffff800a329f>] keventd_create_kthread+0x0/0xc4
    kernel:  [<ffffffff80032632>] kthread+0xfe/0x132
    kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
    kernel:  [<ffffffff800a329f>] keventd_create_kthread+0x0/0xc4
    kernel:  [<ffffffff80032534>] kthread+0x0/0x132
    kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11

quanta · Accepted Answer

I'm not sure whether this OCFS2 heartbeat region is what prevents DRBD from switching to secondary:

Maybe. Have you tried killing that region by following this guide?

    # /etc/init.d/o2cb offline serving
    Stopping O2CB cluster serving: Failed
    Unable to stop cluster as heartbeat region still active

OK, first list the OCFS2 volumes along with their labels and UUIDs:

    # mounted.ocfs2 -d
    Device      FS     Stack  UUID                              Label
    /dev/sdb1   ocfs2  o2cb   C3E41CA2BDE8477CA7FF2C796098633C  data_ocfs2
    /dev/drbd1  ocfs2  o2cb   C3E41CA2BDE8477CA7FF2C796098633C  data_ocfs2

Second, check whether there are any references to this device:

    # ocfs2_hb_ctl -I -d /dev/sdb1
    C3E41CA2BDE8477CA7FF2C796098633C: 1 refs
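If you want to script this check, the reference count can be picked out of that output with a few lines of shell. This is only a sketch: `hb_refs` is a made-up helper name, and it assumes the `UUID: N refs` format shown above.

```shell
# Sketch: extract the reference count from "ocfs2_hb_ctl -I" output on stdin.
# hb_refs is a hypothetical helper name; it assumes lines like "<UUID>: 1 refs".
hb_refs() {
    awk '
        $3 == "refs" { print $2; found = 1 }
        # Fail if no line matched the expected format.
        END { if (!found) exit 1 }
    '
}
```

It would be used as `ocfs2_hb_ctl -I -d /dev/sdb1 | hb_refs`. A non-zero count with the file system unmounted suggests the heartbeat region is still holding the device.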

Try to kill it:

    # ocfs2_hb_ctl -K -d /dev/sdb1 ocfs2

then stop the cluster stack:

    # /etc/init.d/o2cb stop
    Stopping O2CB cluster serving: OK
    Unmounting ocfs2_dlmfs filesystem: OK
    Unloading module "ocfs2_dlmfs": OK
    Unmounting configfs filesystem: OK
    Unloading module "configfs": OK

and bring the device back into the secondary role:

    # drbdadm secondary r0
    # drbd-overview
      1:r0  StandAlone Secondary/Unknown UpToDate/DUnknown r-----

Now you can recover from the split brain as usual:

    # drbdadm -- --discard-my-data connect r0
    # drbd-overview
      1:r0  WFConnection Secondary/Unknown UpToDate/DUnknown C r-----

On the other node (the split-brain survivor):

    # drbdadm connect r0
    # drbd-overview
      1:r0  SyncSource Primary/Secondary UpToDate/Inconsistent C r---- /data ocfs2 100G 1.9G 99G 2%
        [>....................] sync'ed:  3.2% (753892/775004)K delay_probe: 28

On the split-brain victim:

    # /etc/init.d/o2cb start
    Loading filesystem "configfs": OK
    Mounting configfs filesystem at /sys/kernel/config: OK
    Loading filesystem "ocfs2_dlmfs": OK
    Mounting ocfs2_dlmfs filesystem at /dlm: OK
    Starting O2CB cluster serving: OK
    # /etc/init.d/ocfs2 start
    Starting Oracle Cluster File System (OCFS2)                [  OK  ]

Verify that the mount point is up and running:

    # df -h /data/
    Filesystem            Size  Used Avail Use% Mounted on
    /dev/drbd1            100G  1.9G   99G   2% /data
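To keep the order of the steps straight, here is a small dry-run sketch that only prints the commands used in this answer, per node role, without executing anything. The function name `recovery_plan`, the role names and the default arguments (`r0`, `/dev/sdb1`) are my own choices taken from this particular setup; adapt them before relying on it.

```shell
# Sketch: print (do not run) the recovery commands from this answer, in order.
# recovery_plan and its role names are illustrative, not an existing tool.
recovery_plan() {
    role=$1               # victim | survivor | victim-restart
    res=${2:-r0}          # DRBD resource name (assumed default)
    dev=${3:-/dev/sdb1}   # device passed to ocfs2_hb_ctl (assumed default)

    case $role in
        victim)
            # Release the heartbeat region, stop the stack, demote, reconnect.
            echo "ocfs2_hb_ctl -K -d $dev ocfs2"
            echo "/etc/init.d/o2cb stop"
            echo "drbdadm secondary $res"
            echo "drbdadm -- --discard-my-data connect $res"
            ;;
        survivor)
            echo "drbdadm connect $res"
            ;;
        victim-restart)
            # Run on the victim once the survivor has reconnected.
            echo "/etc/init.d/o2cb start"
            echo "/etc/init.d/ocfs2 start"
            ;;
        *)
            echo "usage: recovery_plan victim|survivor|victim-restart [resource] [device]" >&2
            return 1
            ;;
    esac
}
```

Running `recovery_plan victim` on the node whose data is to be discarded lists the four commands to review. Note that `--discard-my-data` throws away that node's changes, so double-check which node is the victim.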

andreask · Answer

A common reason DRBD cannot demote a resource is an active device-mapper device on top of it ... such as a volume group. You can check that e.g. with:

    dmsetup ls --tree -o inverted
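If the tree is long, a quick filter for DRBD's block major number (147) can help. This is a minimal sketch: `dm_on_drbd` is a hypothetical helper name, and it assumes device numbers are printed as `(major:minor)`, as in the output posted in the question.

```shell
# Sketch: show lines of "dmsetup ls --tree -o inverted" that mention a DRBD device.
# dm_on_drbd is a hypothetical helper; 147 is the block major assigned to DRBD.
dm_on_drbd() {
    # Exit status is 0 only if at least one (147:<minor>) entry is present.
    grep -E '\(147:[0-9]+\)'
}
```

Used as `dmsetup ls --tree -o inverted | dm_on_drbd`; no output (and a non-zero exit status) suggests that no device-mapper target is stacked on a DRBD device.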