Project

General

Profile

Anomalie #4601

Redémarrage difficile des vm coon

Added by Christian P. Momon 3 months ago. Updated 27 days ago.

Status:
Fermé
Priority:
Élevée
Start date:
07/15/2020
Due date:
% Done:

0%

Estimated time:

Description

Suite à une simple mise à jour de sécurité, de grosses difficultés à redémarrer coon.

Quelques traces :

Voici quelques traces :
<pre>
=(^-^)=root@coon:/etc/libvirt/qemu# for host in $(ls *xml | sed -e 's/.xml//g'| grep -v modele) ; do virsh start $host ; done
error: Failed to start domain admin
error: Requested operation is not valid: network 'default' is not active

error: Failed to start domain allo
error: Requested operation is not valid: network 'default' is not active

error: Failed to start domain bastion
error: Requested operation is not valid: network 'default' is not active

error: Failed to start domain bla
error: Requested operation is not valid: network 'default' is not active

error: Failed to start domain dns
error: Requested operation is not valid: network 'default' is not active

error: Failed to start domain drop
error: Requested operation is not valid: network 'default' is not active

error: Failed to start domain lamp
error: Requested operation is not valid: network 'default' is not active

error: Failed to start domain libreoffice
error: Requested operation is not valid: network 'default' is not active

error: Failed to start domain ludo
error: Requested operation is not valid: network 'default' is not active

error: Failed to start domain mail
error: Requested operation is not valid: network 'default' is not active

error: Failed to start domain pad
error: Requested operation is not valid: network 'default' is not active

error: Failed to start domain pouet
error: Requested operation is not valid: network 'default' is not active

error: Failed to start domain sympa
error: Requested operation is not valid: network 'default' is not active

error: Failed to start domain valise
error: Requested operation is not valid: network 'default' is not active

error: Failed to start domain xmpp
error: Requested operation is not valid: network 'default' is not active

=================================================================================================

En regardant Virtmanager > coon > réseaux virtuels > default > État, il est indiqué « inactif ».
Si activation alors :
Erreur lors du démarrage du réseau « default »: internal error: Network is already in use by interface virbr0
Traceback (most recent call last):
  File "/usr/share/virt-manager/virtManager/asyncjob.py", line 75, in cb_wrapper
    callback(asyncjob, *args, **kwargs)
  File "/usr/share/virt-manager/virtManager/asyncjob.py", line 111, in tmpcb
    callback(*args, **kwargs)
  File "/usr/share/virt-manager/virtManager/libvirtobject.py", line 66, in newfn
    ret = fn(self, *args, **kwargs)
  File "/usr/share/virt-manager/virtManager/network.py", line 76, in start
    self._backend.create()
  File "/usr/lib/python3/dist-packages/libvirt.py", line 2996, in create
    if ret == -1: raise libvirtError ('virNetworkCreate() failed', net=self)
libvirt.libvirtError: internal error: Network is already in use by interface virbr0

=================================================================================================

juil. 15 00:57:44 coon.chapril.org icinga2[1305]: [2020-07-15 00:57:44 +0200] information/ApiListener: Finished reconnecting to endpoint 'admin.cluster.chapril.org' via host 'admin.cluster.chapril.org' and port '5665'
juil. 15 00:57:44 coon.chapril.org systemd[1]: Reloading Postfix Mail Transport Agent.
juil. 15 00:57:44 coon.chapril.org systemd[1]: Reloaded Postfix Mail Transport Agent.
juil. 15 00:57:44 coon.chapril.org kernel: e1000e: enp1s0 NIC Link is Down
juil. 15 00:57:44 coon.chapril.org kernel: virbr0: port 2(enp1s0) entered disabled state
juil. 15 00:57:44 coon.chapril.org libvirtd[2697]: libvirt version: 5.0.0, package: 4+deb10u1 (Guido Günther <agx@sigxcpu.org> Thu, 05 Dec 2019 00:22:14 +0100)
juil. 15 00:57:44 coon.chapril.org libvirtd[2697]: hostname: coon.chapril.org
juil. 15 00:57:44 coon.chapril.org libvirtd[2697]: Network name='default' uuid=0f6e21a4-a6e3-45fe-af5b-3af5361ec327 is tainted: hook-script
juil. 15 00:57:44 coon.chapril.org libvirtd[2697]: internal error: Network is already in use by interface virbr0
juil. 15 00:57:45 coon.chapril.org kernel: drop UNMATCHED IN-external_trIN=enp0s31f6 OUT= MAC=90:1b:0e:cb:cd:12:40:71:83:a5:f1:d0:08:00 SRC=185.39.11.32 DST=94.130.8.3 LEN=40 TOS=0x00 PREC=0x00 TTL=250 ID=60789 PROTO=TCP SPT=41728 DPT=622 WINDOW=1024 RES=0x00 SYN URGP=0 
juil. 15 00:57:52 coon.chapril.org sshd[2762]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=222.186.173.226  user=root
juil. 15 00:57:52 coon.chapril.org systemd[1]: systemd-fsckd.service: Succeeded.
juil. 15 00:57:52 coon.chapril.org kernel: drbd coon: bind before connect failed, err = -99
juil. 15 00:57:52 coon.chapril.org kernel: drbd coon: conn( WFConnection -> Disconnecting ) 
juil. 15 00:57:53 coon.chapril.org kernel: drop UNMATCHED IN-external_trIN=enp0s31f6 OUT= MAC=90:1b:0e:cb:cd:12:40:71:83:a5:f1:d0:08:00 SRC=93.174.93.123 DST=94.130.8.3 LEN=40 TOS=0x00 PREC=0x00 TTL=250 ID=57905 PROTO=TCP SPT=43411 DPT=11395 WINDOW=1024 RES=0x00 SYN URGP=0 
juil. 15 00:57:54 coon.chapril.org sshd[2762]: Failed password for root from 222.186.173.226 port 62800 ssh2
juil. 15 00:57:54 coon.chapril.org icinga2[1305]: [2020-07-15 00:57:54 +0200] information/ApiListener: Reconnecting to endpoint 'admin.cluster.chapril.org' via host 'admin.cluster.chapril.org' and port '5665'
juil. 15 00:57:54 coon.chapril.org FireHOL[2965]: FireHOL started from '/' with: /usr/sbin/firehol start
juil. 15 00:57:54 coon.chapril.org FireHOL[2967]: Saving active firewall to a temporary file started
juil. 15 00:57:54 coon.chapril.org delayed_fw_reload[2714]: FireHOL: Saving active firewall to a temporary file...  OK
juil. 15 00:57:54 coon.chapril.org FireHOL[2979]: Saving active firewall to a temporary file succeeded
juil. 15 00:57:54 coon.chapril.org FireHOL[2980]: Processing file '/etc/firehol/firehol.conf' started
juil. 15 00:57:55 coon.chapril.org drbd[1324]: WARN: stdin/stdout is not a TTY; using /dev/console
juil. 15 00:57:55 coon.chapril.org drbd[1324]: .
juil. 15 00:57:55 coon.chapril.org systemd[1]: Started LSB: Control DRBD resources..
juil. 15 00:57:55 coon.chapril.org kernel: drbd maine: bind before connect failed, err = -99
juil. 15 00:57:55 coon.chapril.org kernel: drbd maine: conn( WFConnection -> Disconnecting )
</pre>


Related issues

Related to Infra Chapril - Anomalie #3603: Rédémarrage difficile des vm de coonFermé02/19/2019

Actions
Related to Infra Chapril - Demande #4599: Ouverture de ports pour ludoFermé07/14/2020

Actions

History

#1

Updated by Christian P. Momon 3 months ago

  • Description updated (diff)
#2

Updated by Christian P. Momon 3 months ago

  • Status changed from Nouveau to En cours de traitement
  • Priority changed from Normale to Élevée

Pour débloquer la situation, il a fallu annuler des modifs de configuration de minetest dans firehol.conf alors que ce dernier ne présente absolument aucune source potentielle d'anomalie.

#3

Updated by Christian P. Momon 3 months ago

Commentaire de PoluX sur #4599 :

Après une longue investigation avec Christian, on a tiqué sur :
Jul 15 00:01:20 coon delayed_fw_reload[1616]: ERROR: FireHOL is already running. Exiting...
On soupçonne une race condition qui apparait du fait que les regles iptables deviennent nombreuses (4000).
En regardant la conf systemd de libvirtd sur coon on tombe sur :
  Drop-In: /etc/systemd/system/libvirtd.service.d
           └─override.conf
[...]
  Process: 1441 ExecStartPost=/usr/local/bin/delayed_fw_reload (code=exited, status=0/SUCCESS)
Ça ressemble à une cochonceté oubliée, datant des premiers jours du cluster. Depuis, les hooks network on rendu ça inutiles.
Par ailleurs le code impliqué est plus que naïf.
# cat /usr/local/bin/delayed_fw_reload
#!/bin/bash

sleep 10 && firehol start

Je dégage tout ça de coon. Maine en est exempt.
#4

Updated by Christian P. Momon 3 months ago

Le test de redémarrage avec les modifs firehost ré-intégrées donne un résultat nominal. Donc nous validons que ça venait de ça \o/

#5

Updated by Christian P. Momon 3 months ago

  • Status changed from En cours de traitement to Résolu

PoluX met évidence que le problème existait depuis un moment, on retrouve les symptômes dans les tickets #3603, #3734 et le courriel du 08/10/2019…

Donc, cool d'avoir trouvé et corrigé !

#6

Updated by Christian P. Momon 3 months ago

  • Related to Anomalie #3603: Rédémarrage difficile des vm de coon added
#7

Updated by Christian P. Momon 3 months ago

#8

Updated by Christian P. Momon 28 days ago

  • Status changed from Résolu to Fermé
#9

Updated by Christian P. Momon 27 days ago

  • Target version changed from Backlog to Sprint 2020 été

Also available in: Atom PDF