Project

General

Profile

Anomalie #4236

Reboot super long de calamus

Added by Christian P. Momon 21 days ago. Updated 21 days ago.

Status:
En cours de traitement
Priority:
Élevée
Category:
-
Target version:
Start date:
02/02/2020
Due date:
% Done:

0%

Estimated time:
Difficulté:
2 Facile

Description

Le 02/02/2020, suite à un besoin d'upgrade, reboot de calamus : durée ~45 minutes.
À noter que le teste de détection de fsck a répondu OK.
Juste après, virola a rebootée dans un temps « habituel ».

History

#1

Updated by Christian P. Momon 21 days ago

  • Status changed from Nouveau to En cours de traitement
  • Assignee set to Christian P. Momon
  • Target version changed from Backlog to Février 2020

Ci-après, le copier/coller de ma session ssh :

cpm@ocmstar (14:57:03) ~ 1 > sshapril root@calamus.april.org
Linux calamus.april.org 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Sun Feb  2 11:59:22 2020 from 91.160.139.56
(April) root@calamus:~# virsh list
 Id   Name                   State
--------------------------------------
 1    adl                    running
 3    agir                   running
 5    bots                   running
 7    candidatsfr            running
 12   lamp                   running
 15   pad                    running
 16   photos                 running
 17   pouet                  running
 18   republique-numerique   running
 19   scm                    running
 21   webchat                running

(April) root@calamus:~# for disk in `mount -t ext4 | awk '{print $1}'`; do tune2fs -l $disk 2>/dev/null | /srv/scripts/common/check_tune2fs.py ; done
Mount count and Fsck delay are OK
Mount count and Fsck delay are OK
Mount count and Fsck delay are OK
Mount count and Fsck delay are OK
Mount count and Fsck delay are OK
(April) root@calamus:~# uptime
 13:58:02 up 19 days,  2:13,  1 user,  load average: 5.57, 5.86, 5.83
(April) root@calamus:~# apt-get update
Hit:1 http://apt.april.org/debian buster InRelease
Hit:2 http://apt.easter-eggs.com/debian buster-ee InRelease
Hit:3 http://ftp.fr.debian.org/debian buster InRelease
Hit:4 http://security.debian.org buster/updates InRelease
Reading package lists... Done
(April) root@calamus:~# apt-get dist-upgrade
Reading package lists... Done
Building dependency tree
Reading state information... Done
Calculating upgrade... Done
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
(April) root@calamus:~# cd /etc/libvirt/qemu
(April) root@calamus:/etc/libvirt/qemu[master u=]# for host in $(ls *xml | sed -e 's/.xml//g') ; do virsh shutdown $host ; done; watch virsh list
Domain adl is being shutdown

error: Failed to shutdown domain admin
error: Requested operation is not valid: domain is not running

Domain agir is being shutdown

error: Failed to shutdown domain bastion
error: Requested operation is not valid: domain is not running

Domain bots is being shutdown

error: Failed to shutdown domain candidatsbe
error: Requested operation is not valid: domain is not running

Domain candidatsfr is being shutdown

error: Failed to shutdown domain cms-dev
error: Requested operation is not valid: domain is not running

error: Failed to shutdown domain dns
error: Requested operation is not valid: domain is not running

error: Failed to shutdown domain drupal6
error: Requested operation is not valid: domain is not running

error: Failed to shutdown domain dtc
error: Requested operation is not valid: domain is not running

Domain lamp is being shutdown

error: Failed to shutdown domain mail
error: Requested operation is not valid: domain is not running

error: Failed to shutdown domain mumble
error: Requested operation is not valid: domain is not running

Domain pad is being shutdown

Domain photos is being shutdown

Domain pouet is being shutdown

Domain republique-numerique is being shutdown

Domain scm is being shutdown

error: Failed to shutdown domain sympa
error: Requested operation is not valid: domain is not running

Domain webchat is being shutdown

(April) root@calamus:/etc/libvirt/qemu[master u=]# umount /var/lib/libvirt/calamus
(April) root@calamus:/etc/libvirt/qemu[master u=]# drbdadm disconnect virola
(April) root@calamus:/etc/libvirt/qemu[master u=]# drbdadm secondary calamus
(April) root@calamus:/etc/libvirt/qemu[master u=]# drbdadm disconnect calamus
(April) root@calamus:/etc/libvirt/qemu[master u=]# reboot
===================== ATTENTION !! ======================

Vous êtes sur le point de rebooter calamus.

Il s'agit d'un hyperviseur qui requiert une attention particulière.

Avant de procéder, veuillez svp ouvrir la page
https://admin.april.org/dokuwiki/doku.php?id=sysadm:redemarrer_calamus
et vérifier que vous suivez les points de la procédure
de redémarrage.

En particulier, assurez vous que quelqu'un de qualifié est
disponible pour intervenir physiquement en cas de besoin.

Pour continuer la procédure de reboot, saisissez RjU7
=========================================================

Saisissez le mot de passe : RjU7
W: molly-guard: SSH session detected!
Please type in hostname of the machine to reboot: calamus
Connection to calamus.april.org closed by remote host.
Connection to calamus.april.org closed.

[…]
cpm@ocmstar (15:16:50) ~ 8 >  sshapril root@calamus.april.org
^[[1;2Bssh: connect to host calamus.april.org port 22: Connection timed out

cpm@ocmstar (15:49:02) ~ 23 > sshapril  root@calamus.april.org
Linux calamus.april.org 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) x86_64
[…]

#2

Updated by Christian P. Momon 21 days ago

Extrait des logs systèmes :

/var/log/messages :
Feb  2 14:04:42 calamus kernel: [1649977.633223] drbd calamus: Terminating drbd_a_calamus
Feb  2 14:04:42 calamus kernel: [1649977.756268] drbd calamus: Connection closed
Feb  2 14:04:42 calamus kernel: [1649977.756285] drbd calamus: conn( Disconnecting -> StandAlone )
Feb  2 14:04:42 calamus kernel: [1649977.756287] drbd calamus: receiver terminated
Feb  2 14:04:42 calamus kernel: [1649977.756294] drbd calamus: Terminating drbd_r_calamus
Feb  2 14:04:57 calamus kernel: [1649992.302489] block drbd1: disk( UpToDate -> Failed )
Feb  2 14:04:57 calamus kernel: [1649992.302625] block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Feb  2 14:04:57 calamus kernel: [1649992.302629] block drbd1: disk( Failed -> Diskless )
Feb  2 14:46:34 calamus kernel: [    0.000000] Linux version 4.19.0-6-amd64 (debian-kernel@lists.debian.org) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP Debian 4.19.67-2+\
deb10u2 (2019-11-11)
Feb  2 14:46:34 calamus kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.19.0-6-amd64 root=/dev/mapper/vg_calamus-root ro quiet
Feb  2 14:46:34 calamus kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Feb  2 14:46:34 calamus kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Feb  2 14:46:34 calamus kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Feb  2 14:46:34 calamus kernel: [    0.000000] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
Feb  2 14:46:34 calamus kernel: [    0.000000] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.

C'est très caractéristique d'un fsck.

#3

Updated by Christian P. Momon 21 days ago

Vérification des valeurs utilisées par la commande de détection d'un fsck prochain, réalisé après le reboot super long de calamus :

(April) root@calamus:/var/log# for disk in `mount -t ext4 | awk '{print $1}'`; do tune2fs -l $disk 2>/dev/null ; done |egrep -i "(Last checked|interval|mount count)" 
Mount count:              75
Maximum mount count:      -1
Last checked:             Mon Mar  7 16:24:15 2016
Check interval:           0 (<none>)
Mount count:              75
Maximum mount count:      -1
Last checked:             Mon Mar  7 16:24:18 2016
Check interval:           0 (<none>)
Mount count:              75
Maximum mount count:      -1
Last checked:             Mon Mar  7 16:24:18 2016
Check interval:           0 (<none>)
Mount count:              72
Maximum mount count:      -1
Last checked:             Mon Mar  7 19:54:05 2016
Check interval:           0 (<none>)

#4

Updated by Christian P. Momon 21 days ago

Valeurs avant le reboot de virola :

cpm@ocmstar (15:55:33) ~ 24 > sshapril  root@virola.april.org
Linux virola.april.org 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) x86_64
[…]
(April) root@virola:/etc/libvirt/qemu[master$ u=]# for disk in `mount -t ext4 | awk '{print $1}'`; do tune2fs -l $disk 2>/dev/null ; done |egrep -i "(Last checked|interval|mount count)" 
Mount count:              99
Maximum mount count:      -1
Last checked:             Mon Mar  7 16:05:58 2016
Check interval:           0 (<none>)
Mount count:              99
Maximum mount count:      -1
Last checked:             Mon Mar  7 16:06:04 2016
Check interval:           0 (<none>)
Mount count:              99
Maximum mount count:      -1
Last checked:             Mon Mar  7 16:06:03 2016
Check interval:           0 (<none>)
Mount count:              96
Maximum mount count:      -1
Last checked:             Mon Mar  7 19:52:45 2016
Check interval:           0 (<none>)
Mount count:              78
Maximum mount count:      -1
Last checked:             Mon Mar  7 19:57:03 2016
Check interval:           0 (<none>)

Valeurs après le reboot de virola :

Last login: Sun Feb  2 15:12:47 2020 from 91.160.139.56
(April) root@virola:~# for disk in `mount -t ext4 | awk '{print $1}'`; do tune2fs -l $disk 2>/dev/null ; done |egrep -i "(Last checked|interval|mount count)" 
Mount count:              100
Maximum mount count:      -1
Last checked:             Mon Mar  7 16:05:58 2016
Check interval:           0 (<none>)
Mount count:              100
Maximum mount count:      -1
Last checked:             Mon Mar  7 16:06:03 2016
Check interval:           0 (<none>)
Mount count:              100
Maximum mount count:      -1
Last checked:             Mon Mar  7 16:06:04 2016
Check interval:           0 (<none>)
Mount count:              97
Maximum mount count:      -1
Last checked:             Mon Mar  7 19:52:45 2016
Check interval:           0 (<none>)
Mount count:              79
Maximum mount count:      -1
Last checked:             Mon Mar  7 19:57:03 2016
Check interval:           0 (<none>)

#5

Updated by Christian P. Momon 21 days ago

  • Priority changed from Normale to Élevée

À en croire les valeurs suivantes, la détection ne peut pas fonctionner.

Mount count:              100
Maximum mount count:      -1
Last checked:             Mon Mar  7 16:05:58 2016
Check interval:           0 (<none>)

Questions :
  • pourquoi un fsck s'est-il déclenché quand même ?
  • Le champs « Last checked » est-il le bon ?
  • Que faut-il modifier dans la commande de détection ?
  • Faut-il forcer un fsck préventif ?
  • Des sondes existent-elles et alors faut-il les modifier ? Ou faut-il en créer ?
#6

Updated by François Poulain 21 days ago

Tu peux aussi avoir un soucis sur le volume, indétecté par la sonde, mais qui pousse au fsck via le paramètre pass du fstab, non ?

Also available in: Atom PDF