Projet

Général

Profil

Anomalie #4447

Coupures des connexions au service Mumble April

Ajouté par Christian P. Momon il y a presque 4 ans. Mis à jour il y a presque 4 ans.

Statut:
Fermé
Priorité:
Normale
Assigné à:
Christian P. Momon
Catégorie:
-
Version cible:
Début:
30/04/2020
Echéance:
% réalisé:

0%

Temps estimé:
Difficulté:
7 Cauchemard

Description

Ce jeudi 30 avril 2020 à 11h, avait lieu une réunion d'adhérents sur le Mumble April, moins d'une dizaine de personnes.

Entre 10h59 et 11h09 (à peu prêt), constat de déconnexions répétées.

La solution a été de migrer vers le Mumble Chapril.

Demande : essayer de voir d'où c'est venu.


Fichiers

graphique-virola.png (2 Mo) graphique-virola.png Christian P. Momon, 30/04/2020 14:48

Historique

#1

Mis à jour par Christian P. Momon il y a presque 4 ans

Sur Icinga1, section « Event log » (merci PoluX) :

Service Critical[2020-04-30 09:58:40] SERVICE ALERT: drupal6;Check Web april.org: https form action;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
Service Ok      [2020-04-30 09:59:32] SERVICE ALERT: drupal6;Check Web april.org: https form action;OK;SOFT;2;HTTP OK: Status line output matched "200" - 48922 bytes in 0.149 second response time

 April 30, 2020 09:00 

Service Warning [2020-04-30 10:02:20] SERVICE ALERT: drupal6;LOAD;WARNING;SOFT;1;WARNING - load average: 6.18, 6.85, 3.65
Service Warning [2020-04-30 10:03:21] SERVICE ALERT: drupal6;LOAD;WARNING;SOFT;2;WARNING - load average: 2.27, 5.60, 3.42
Service Ok      [2020-04-30 10:04:20] SERVICE ALERT: drupal6;LOAD;OK;SOFT;3;OK - load average: 0.98, 4.62, 3.22
Service Warning [2020-04-30 10:28:10] SERVICE ALERT: calamus;LOAD;WARNING;SOFT;1;WARNING - load average: 5.32, 3.62, 2.87
Service Ok      [2020-04-30 10:29:10] SERVICE ALERT: calamus;LOAD;OK;SOFT;2;OK - load average: 4.29, 3.67, 2.94
Service Critical[2020-04-30 10:29:20] SERVICE ALERT: drupal6;LOAD;CRITICAL;SOFT;1;CRITICAL - load average: 11.26, 9.59, 5.03
Service Warning [2020-04-30 10:30:01] SERVICE ALERT: virola;IOWAIT DRBD0;WARNING;SOFT;1;WARNING - iowait stats iowait=19.38
Service Critical[2020-04-30 10:30:20] SERVICE ALERT: bastion;Check Web Secure Redirection wiki.april.org;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
Service Critical[2020-04-30 10:30:20] SERVICE ALERT: bastion;Check Web Redirection wiki.april.org;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
Service Warning [2020-04-30 10:30:20] SERVICE ALERT: drupal6;LOAD;WARNING;SOFT;2;WARNING - load average: 5.83, 8.30, 4.87
Service Ok      [2020-04-30 10:31:01] SERVICE ALERT: virola;IOWAIT DRBD0;OK;SOFT;2;OK - iowait stats iowait=7.09
Service Ok      [2020-04-30 10:31:10] SERVICE ALERT: bastion;Check Web Secure Redirection wiki.april.org;OK;SOFT;2;HTTP OK: Status line output matched "301,Location: https://wiki.april.org/w/Accueil" - 446 bytes in 0.075 second response time
Service Ok      [2020-04-30 10:31:11] SERVICE ALERT: bastion;Check Web Redirection wiki.april.org;OK;SOFT;2;HTTP OK: Status line output matched "301,Location: https://wiki.april.org/w/Accueil" - 446 bytes in 0.079 second response time
Service Warning [2020-04-30 10:31:20] SERVICE ALERT: drupal6;LOAD;WARNING;SOFT;3;WARNING - load average: 3.25, 7.12, 4.68
Service Warning [2020-04-30 10:32:21] SERVICE ALERT: drupal6;LOAD;WARNING;HARD;4;WARNING - load average: 2.18, 6.10, 4.47
Service Critical[2020-04-30 10:42:20] SERVICE ALERT: drupal6;LOAD;CRITICAL;HARD;4;CRITICAL - load average: 13.45, 7.67, 5.59
Service Warning [2020-04-30 10:43:31] SERVICE ALERT: calamus;IOWAIT DRBD1;WARNING;SOFT;1;WARNING - iowait stats iowait=20.33
Service Ok      [2020-04-30 10:44:32] SERVICE ALERT: calamus;IOWAIT DRBD1;OK;SOFT;2;OK - iowait stats iowait=13.59
Service Critical[2020-04-30 10:46:01] SERVICE ALERT: virola;IOWAIT DRBD0;CRITICAL;SOFT;1;CRITICAL - iowait stats iowait=45.66
Service Ok      [2020-04-30 10:47:01] SERVICE ALERT: virola;IOWAIT DRBD0;OK;SOFT;2;OK - iowait stats iowait=7.27
Service Ok      [2020-04-30 10:47:20] SERVICE ALERT: drupal6;LOAD;OK;HARD;4;OK - load average: 1.19, 3.88, 4.57
Service Warning [2020-04-30 10:49:32] SERVICE ALERT: calamus;IOWAIT DRBD1;WARNING;SOFT;1;WARNING - iowait stats iowait=21.91
Service Ok      [2020-04-30 10:50:31] SERVICE ALERT: calamus;IOWAIT DRBD1;OK;SOFT;2;OK - iowait stats iowait=7.57
Auto-save retention data[2020-04-30 10:51:59] Auto-save of retention data completed successfully.
Service Warning [2020-04-30 10:55:31] SERVICE ALERT: calamus;IOWAIT DRBD1;WARNING;SOFT;1;WARNING - iowait stats iowait=16.49
Service Ok      [2020-04-30 10:56:32] SERVICE ALERT: calamus;IOWAIT DRBD1;OK;SOFT;2;OK - iowait stats iowait=10.33
Service Critical[2020-04-30 10:57:01] SERVICE ALERT: virola;IOWAIT DRBD0;CRITICAL;SOFT;1;CRITICAL - iowait stats iowait=39.41
Service Ok      [2020-04-30 10:58:01] SERVICE ALERT: virola;IOWAIT DRBD0;OK;SOFT;2;OK - iowait stats iowait=12.41
Service Warning [2020-04-30 10:58:51] SERVICE ALERT: virola;LOAD;WARNING;SOFT;1;WARNING - load average: 8.37, 5.66, 4.16
Service Critical[2020-04-30 10:59:11] SERVICE ALERT: calamus;LOAD;CRITICAL;SOFT;1;CRITICAL - load average: 11.29, 5.84, 4.28
Service Critical[2020-04-30 10:59:51] SERVICE ALERT: virola;LOAD;CRITICAL;SOFT;2;CRITICAL - load average: 12.54, 7.53, 4.90

 April 30, 2020 10:00 

Service Warning [2020-04-30 11:00:12] SERVICE ALERT: calamus;LOAD;WARNING;SOFT;2;WARNING - load average: 9.23, 6.21, 4.50
Service Critical[2020-04-30 11:00:50] SERVICE ALERT: virola;LOAD;CRITICAL;SOFT;3;CRITICAL - load average: 12.68, 8.43, 5.38
Service Warning [2020-04-30 11:01:11] SERVICE ALERT: calamus;LOAD;WARNING;SOFT;3;WARNING - load average: 9.34, 6.85, 4.84
Service Critical[2020-04-30 11:01:31] SERVICE ALERT: calamus;IOWAIT DRBD1;CRITICAL;SOFT;1;CRITICAL - iowait stats iowait=45.59
Service Critical[2020-04-30 11:01:50] SERVICE ALERT: virola;LOAD;CRITICAL;HARD;4;CRITICAL - load average: 11.58, 8.98, 5.77
Service Warning [2020-04-30 11:02:10] SERVICE ALERT: calamus;LOAD;WARNING;HARD;4;WARNING - load average: 7.54, 6.88, 4.97
Service Critical[2020-04-30 11:02:31] SERVICE ALERT: calamus;IOWAIT DRBD1;CRITICAL;SOFT;2;CRITICAL - iowait stats iowait=34.38
Service Critical[2020-04-30 11:03:01] SERVICE ALERT: virola;IOWAIT DRBD0;CRITICAL;SOFT;1;CRITICAL - iowait stats iowait=75.85
Service Critical[2020-04-30 11:03:31] SERVICE ALERT: calamus;IOWAIT DRBD1;CRITICAL;SOFT;3;CRITICAL - iowait stats iowait=39.13
Service Warning [2020-04-30 11:04:01] SERVICE ALERT: mail;Check SRS procs;WARNING;SOFT;1;PROCS WARNING: 19 processes with command name 'postsrsd'
Service Ok      [2020-04-30 11:04:01] SERVICE ALERT: virola;IOWAIT DRBD0;OK;SOFT;2;OK - iowait stats iowait=6.06
Service Critical[2020-04-30 11:04:32] SERVICE ALERT: calamus;IOWAIT DRBD1;CRITICAL;HARD;4;CRITICAL - iowait stats iowait=46.75
Service Warning [2020-04-30 11:05:00] SERVICE ALERT: mail;Check SRS procs;WARNING;SOFT;2;PROCS WARNING: 19 processes with command name 'postsrsd'
Service Critical[2020-04-30 11:05:11] SERVICE ALERT: bastion;Check Secure Web libre-en-fete.net;CRITICAL;SOFT;1;HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 429 Too Many Requests
Service Ok      [2020-04-30 11:06:00] SERVICE ALERT: mail;Check SRS procs;OK;SOFT;3;PROCS OK: 8 processes with command name 'postsrsd'
Service Critical[2020-04-30 11:06:10] SERVICE ALERT: bastion;Check Secure Web libre-en-fete.org;CRITICAL;SOFT;1;HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 429 Too Many Requests
Service Ok      [2020-04-30 11:06:11] SERVICE ALERT: bastion;Check Secure Web libre-en-fete.net;OK;SOFT;2;HTTP OK: Status line output matched "200" - 15330 bytes in 0.067 second response time
Service Ok      [2020-04-30 11:07:10] SERVICE ALERT: bastion;Check Secure Web libre-en-fete.org;OK;SOFT;2;HTTP OK: Status line output matched "200" - 15330 bytes in 0.122 second response time
Service Ok      [2020-04-30 11:09:32] SERVICE ALERT: calamus;IOWAIT DRBD1;OK;HARD;4;OK - iowait stats iowait=5.76
Service Warning [2020-04-30 11:11:50] SERVICE ALERT: virola;LOAD;WARNING;HARD;4;WARNING - load average: 2.85, 5.53, 6.23
Service Ok      [2020-04-30 11:12:10] SERVICE ALERT: calamus;LOAD;OK;HARD;4;OK - load average: 2.88, 4.51, 4.94
Service Critical[2020-04-30 11:13:22] SERVICE ALERT: calamus;Check if Fail2ban DB is purged;CRITICAL;SOFT;1;CRITICAL : sqlite returned an error
Service Ok      [2020-04-30 11:14:20] SERVICE ALERT: calamus;Check if Fail2ban DB is purged;OK;SOFT;2;OK : database is purged
Service Warning [2020-04-30 11:19:01] SERVICE ALERT: virola;IOWAIT DRBD0;WARNING;SOFT;1;WARNING - iowait stats iowait=16.07
Service Warning [2020-04-30 11:20:01] SERVICE ALERT: virola;IOWAIT DRBD0;WARNING;SOFT;2;WARNING - iowait stats iowait=29.51
Service Warning [2020-04-30 11:21:00] SERVICE ALERT: mail;Check SRS procs;WARNING;SOFT;1;PROCS WARNING: 22 processes with command name 'postsrsd'
Service Warning [2020-04-30 11:21:01] SERVICE ALERT: virola;IOWAIT DRBD0;WARNING;SOFT;3;WARNING - iowait stats iowait=18.95
Service Warning [2020-04-30 11:22:01] SERVICE ALERT: mail;Check SRS procs;WARNING;SOFT;2;PROCS WARNING: 22 processes with command name 'postsrsd'
Service Ok      [2020-04-30 11:22:01] SERVICE ALERT: virola;IOWAIT DRBD0;OK;SOFT;4;OK - iowait stats iowait=13.96
Service Warning [2020-04-30 11:23:01] SERVICE ALERT: mail;Check SRS procs;WARNING;SOFT;3;PROCS WARNING: 22 processes with command name 'postsrsd'
Service Warning [2020-04-30 11:24:01] SERVICE ALERT: mail;Check SRS procs;WARNING;HARD;4;PROCS WARNING: 21 processes with command name 'postsrsd'
Service Ok      [2020-04-30 11:26:51] SERVICE ALERT: virola;LOAD;OK;HARD;4;OK - load average: 3.44, 3.39, 4.39
Service Ok      [2020-04-30 11:29:00] SERVICE ALERT: mail;Check SRS procs;OK;HARD;4;PROCS OK: 9 processes with command name 'postsrsd'
Service Ok      [2020-04-30 11:51:59] Auto-save of retention data completed successfully.

#2

Mis à jour par Christian P. Momon il y a presque 4 ans

Sur Icinga2 April :

virola.april.org Service: Ping Remote LAN WARNING 10:59:33 [ 1/5 ] PING WARNING - Packet loss = 0%, RTA = 196.49 ms

#3

Mis à jour par Christian P. Momon il y a presque 4 ans

Des graphiques de virola par Scaleway. Le pic de 11h est à comparer au pic de 21h la veille, moment de la réunion adminsys sur le même Mumble.

#4

Mis à jour par Christian P. Momon il y a presque 4 ans

Ce vendredi 22 mai 2020 à 11h00 pile, une petite dizaine de personnes en réunion sur le Mumble April se sont vus déconnectés, une fois.

Cela rappelle exactement le problème qui a ouvert ce ticket. Même horaire, pas le même jour.

<W>2020-05-22 10:59:58.495 1 => <146:(-1)> New connection: 90.XX.XX.XX:44080
<W>2020-05-22 10:59:59.377 1 => <146:(-1)> Client version 1.3.0 (X11: 1.3.0~git20190125.440b173+dfsg-2)
<W>2020-05-22 10:59:59.514 1 => <146:lonugem(-1)> Authenticated
<W>2020-05-22 11:00:17.129 1 => <146:lonugem(-1)> Moved lonugem:146(-1) to Terrasse Sud[5:0]
<W>2020-05-22 11:00:17.531 1 => <147:(-1)> New connection: 91.XX.XX.XX:27936
<W>2020-05-22 11:00:19.251 1 => <138:isa(-1)> Connection closed: The remote host closed the connection [1]
<W>2020-05-22 11:00:21.948 1 => <145:Kévin(-1)> Connection closed: The remote host closed the connection [1]
<W>2020-05-22 11:00:22.630 1 => <146:lonugem(-1)> Connection closed: The remote host closed the connection [1]
<W>2020-05-22 11:00:23.084 1 => <141:mo(-1)> Connection closed: The remote host closed the connection [1]
<W>2020-05-22 11:00:23.878 1 => <139:madix(-1)> Connection closed: The remote host closed the connection [1]
<W>2020-05-22 11:00:24.113 1 => <140:Christian(Cpm)(-1)> Connection closed: The remote host closed the connection [1]
<W>2020-05-22 11:00:24.598 1 => <147:(-1)> Client version 1.2.19 (X11: 1.2.19-1ubuntu1)
<W>2020-05-22 11:00:24.971 1 => <147:Didier(-1)> Authenticated
<W>2020-05-22 11:00:26.393 1 => <148:(-1)> New connection: 93.XX.XX.XX:46384
<W>2020-05-22 11:00:27.018 1 => <149:(-1)> New connection: 90.XX.XX.XX:49896
<W>2020-05-22 11:00:27.227 1 => <148:(-1)> Client version 1.3.0 (X11: 1.3.0~git20190125.440b173+dfsg-2)
<W>2020-05-22 11:00:27.441 1 => <148:isa(-1)> Authenticated
<W>2020-05-22 11:00:27.693 1 => <149:(-1)> Client version 1.3.0 (X11: 1.3.0)
<W>2020-05-22 11:00:27.886 1 => <149:Kévin(-1)> Authenticated

Ça serait bien de trouver d'où vient le problème.

Deux fois au même horaire, le problème est-il reproductible ?

#5

Mis à jour par Christian P. Momon il y a presque 4 ans

Dans bastion:/var/log/syslog, détection d'un incident au même moment :

May 22 10:59:40 bastion icinga2[510]: [2020-05-22 10:59:40 +0200] information/RemoteCheckQueue: items: 0, rate: 0/s (6/min 30/5min 90/15min);
May 22 10:59:50 bastion icinga2[510]: [2020-05-22 10:59:50 +0200] information/RemoteCheckQueue: items: 0, rate: 0/s (12/min 60/5min 180/15min);
May 22 11:00:00 bastion icinga2[510]: [2020-05-22 11:00:00 +0200] information/RemoteCheckQueue: items: 0, rate: 0/s (6/min 30/5min 90/15min);
May 22 11:00:00 bastion kernel: [1027919.996688] warn_alloc: 62 callbacks suppressed
May 22 11:00:00 bastion kernel: [1027919.996692] ksoftirqd/0: page allocation failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null)
May 22 11:00:00 bastion kernel: [1027919.996693] ksoftirqd/0 cpuset=/ mems_allowed=0
May 22 11:00:00 bastion kernel: [1027919.996698] CPU: 0 PID: 9 Comm: ksoftirqd/0 Not tainted 4.19.0-9-amd64 #1 Debian 4.19.118-2
May 22 11:00:00 bastion kernel: [1027919.996699] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
May 22 11:00:00 bastion kernel: [1027919.996700] Call Trace:
May 22 11:00:00 bastion kernel: [1027919.996719]  dump_stack+0x66/0x90
May 22 11:00:00 bastion kernel: [1027919.996726]  warn_alloc.cold.120+0x6c/0xec
May 22 11:00:00 bastion kernel: [1027919.996730]  __alloc_pages_slowpath+0xc73/0xcb0
May 22 11:00:00 bastion kernel: [1027919.996733]  __alloc_pages_nodemask+0x28b/0x2b0
May 22 11:00:00 bastion kernel: [1027919.996734]  page_frag_alloc+0x124/0x150
May 22 11:00:00 bastion kernel: [1027919.996737]  __napi_alloc_skb+0x88/0xe0
May 22 11:00:00 bastion kernel: [1027919.996740]  ? dev_gro_receive+0x39b/0x660
May 22 11:00:00 bastion kernel: [1027919.996755]  page_to_skb+0x62/0x330 [virtio_net]
May 22 11:00:00 bastion kernel: [1027919.996758]  receive_buf+0x61d/0x14c0 [virtio_net]
May 22 11:00:00 bastion kernel: [1027919.996762]  ? enqueue_entity+0x108/0x640
May 22 11:00:00 bastion kernel: [1027919.996766]  ? vring_unmap_one+0x16/0x70 [virtio_ring]
May 22 11:00:00 bastion kernel: [1027919.996768]  ? detach_buf+0x69/0x110 [virtio_ring]
May 22 11:00:00 bastion kernel: [1027919.996770]  virtnet_poll+0x143/0x328 [virtio_net]
May 22 11:00:00 bastion kernel: [1027919.996773]  net_rx_action+0x149/0x3b0
May 22 11:00:00 bastion kernel: [1027919.996776]  __do_softirq+0xde/0x2d8
May 22 11:00:00 bastion kernel: [1027919.996779]  ? sort_range+0x20/0x20
May 22 11:00:00 bastion kernel: [1027919.996782]  run_ksoftirqd+0x26/0x40
May 22 11:00:00 bastion kernel: [1027919.996785]  smpboot_thread_fn+0xc5/0x160
May 22 11:00:00 bastion kernel: [1027919.996788]  kthread+0x112/0x130
May 22 11:00:00 bastion kernel: [1027919.996789]  ? kthread_bind+0x30/0x30
May 22 11:00:00 bastion kernel: [1027919.996791]  ret_from_fork+0x35/0x40
May 22 11:00:00 bastion kernel: [1027919.996795] warn_alloc_show_mem: 9 callbacks suppressed
May 22 11:00:00 bastion kernel: [1027919.996796] Mem-Info:
May 22 11:00:00 bastion kernel: [1027919.996799] active_anon:3924 inactive_anon:5484 isolated_anon:0
May 22 11:00:00 bastion kernel: [1027919.996799]  active_file:11966 inactive_file:10683 isolated_file:0
May 22 11:00:00 bastion kernel: [1027919.996799]  unevictable:0 dirty:2253 writeback:0 unstable:0
May 22 11:00:00 bastion kernel: [1027919.996799]  slab_reclaimable:10120 slab_unreclaimable:9080
May 22 11:00:00 bastion kernel: [1027919.996799]  mapped:7645 shmem:454 pagetables:452 bounce:0
May 22 11:00:00 bastion kernel: [1027919.996799]  free:1485 free_pcp:81 free_cma:0
May 22 11:00:00 bastion kernel: [1027919.996802] Node 0 active_anon:15696kB inactive_anon:21936kB active_file:47864kB inactive_file:42732kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:30580kB dirty:9012kB writeback:0kB shmem:1816kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
May 22 11:00:00 bastion kernel: [1027919.996803] Node 0 DMA free:2004kB min:140kB low:172kB high:204kB active_anon:372kB inactive_anon:500kB active_file:2352kB inactive_file:4644kB unevictable:0kB writepending:64kB present:15992kB managed:15908kB mlocked:0kB kernel_stack:256kB pagetables:36kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
May 22 11:00:00 bastion kernel: [1027919.996806] lowmem_reserve[]: 0 184 184 184 184
May 22 11:00:00 bastion kernel: [1027919.996808] Node 0 DMA32 free:3936kB min:1664kB low:2080kB high:2496kB active_anon:15324kB inactive_anon:21436kB active_file:45508kB inactive_file:38088kB unevictable:0kB writepending:8940kB present:245604kB managed:220820kB mlocked:0kB kernel_stack:1680kB pagetables:1772kB bounce:0kB free_pcp:324kB local_pcp:324kB free_cma:0kB
May 22 11:00:00 bastion kernel: [1027919.996811] lowmem_reserve[]: 0 0 0 0 0

Cela pourrait être un problème de mémoire insuffisante sur la machine bastion. Ticket à suivre : #4497.

#6

Mis à jour par Christian P. Momon il y a presque 4 ans

  • Statut changé de Nouveau à En cours de traitement
  • Assigné à mis à Christian P. Momon
#7

Mis à jour par Christian P. Momon il y a presque 4 ans

  • Statut changé de En cours de traitement à Résolu

La concomitance avec le ticket #4497 est cohérente donc on va dire que c'est ça, un problème de mémoire sur la vm bastion.

#8

Mis à jour par Quentin Gibeaux il y a presque 4 ans

  • Statut changé de Résolu à Fermé

Formats disponibles : Atom PDF