Bug #1321

closed

Opium: frequent backtraces in the syslogs and SMART state of one of the disks

Added by Anonymous about 11 years ago. Updated about 5 years ago.

Status:
Closed
Priority:
Urgent
Assigned to:
Category:
Task
Target version:
Start date:
29/06/2013
Due date:
% Done:

100%

Estimated time:
Difficulty:
2 Easy

Description

More and more frequently, syslog broadcasts messages like the following to the terminals via wall.
They are always associated with "/sys/devices/virtual/block/vroot7/stat":

kernel:[117500.193975] general protection fault: 0000 [#301] SMP 

Message from syslogd@opium at Jun 29 00:21:24 ...
 kernel:[117500.194060] last sysfs file: /sys/devices/virtual/block/vroot7/stat

Message from syslogd@opium at Jun 29 00:21:24 ...
 kernel:[117500.197068] Stack:

Message from syslogd@opium at Jun 29 00:21:24 ...
 kernel:[117500.197570] Call Trace:

Message from syslogd@opium at Jun 29 00:21:24 ...
 kernel:[117500.197947] Code: fa 66 66 90 66 66 90 65 8b 04 25 a8 e3 00 00 48 98 49 8b 94 c4 f0 02 00 00 8b 4a 18 89 4c 24 14 48 8b 1a 48 85 db 74 0c 8b 42 14 <48> 8b 04 c3 48 89 02 eb 19 48 8b 4c 24 08 49 89 d0 44 89 ee 83 

In the syslogs, we find messages like this:

[117727.112851] general protection fault: 0000 [#377] SMP 
[117727.112935] last sysfs file: /sys/devices/virtual/block/vroot7/stat
[117727.112966] CPU 2 
[117727.113020] Modules linked in: cryptd aes_x86_64 aes_generic xts gf128mul ses enclosure usb_storage tun ipt_LOG xt_tcpudp ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables ext4 jbd2 crc16 ext2 dm_crypt firewire_sbp2 loop snd_hda_codec_atihdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_pcsp asus_atk0110 parport_pc snd_pcm edac_core i2c_piix4 parport snd_timer edac_mce_amd processor i2c_core shpchp button evdev snd soundcore snd_page_alloc pci_hotplug ext3 jbd mbcache dm_mod raid456 md_mod async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx ide_cd_mod sd_mod cdrom crc_t10dif ide_pci_generic ohci_hcd ata_generic ahci thermal r8169 mii atiixp ehci_hcd ide_core firewire_ohci usbcore firewire_core nls_base crc_itu_t floppy thermal_sys libata scsi_mod [last unloaded: scsi_wait_scan]
[117727.115436] Pid: 11837, comm: munin-node Tainted: G      D W  2.6.32-bpo.3-vserver-amd64 #1 System Product Name
[117727.115470] RIP: 0010:[<ffffffff810f0f6c>]  [<ffffffff810f0f6c>] __kmalloc+0xd2/0x141
[117727.115533] RSP: 0018:ffff88008b103b88  EFLAGS: 00010082
[117727.115564] RAX: 0000000000000000 RBX: 940f003e8348c031 RCX: 0000000000000020
[117727.115597] RDX: ffff880005111f00 RSI: 00000000000000d0 RDI: ffffffff811308e1
[117727.115630] RBP: 0000000000000246 R08: 0000000000000000 R09: 0000000000000000
[117727.115663] R10: 00000000000001c0 R11: ffff88008b103b38 R12: ffffffff8144c4f0
[117727.115696] R13: 00000000000000d0 R14: 00000000000000d0 R15: 000000000000001c
[117727.115730] FS:  00007ff34f358700(0000) GS:ffff880005100000(0000) knlGS:0000000000000000
[117727.115763] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[117727.115794] CR2: 00007ff34e3d8050 CR3: 0000000110aee000 CR4: 00000000000006e0
[117727.115828] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[117727.115861] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[117727.115895] Process munin-node (pid: 11837, threadinfo ffff88008b102000, task ffff88011c2f1590)
[117727.115928] Stack:
[117727.115957]  00000000000001c0 ffffffff811308e1 00000020811306f0 ffff880052445238
[117727.116068] <0> 00000000000001c0 00000000fffffff8 ffffffff811306f0 00000000fffffff4
[117727.116232] <0> 0000000000000001 ffffffff811308e1 ffff88010dc17d00 0000000000000080
[117727.116424] Call Trace:
[117727.116456]  [<ffffffff811308e1>] ? load_elf_binary+0x1f1/0x1958
[117727.116488]  [<ffffffff811306f0>] ? load_elf_binary+0x0/0x1958
[117727.116519]  [<ffffffff811308e1>] ? load_elf_binary+0x1f1/0x1958
[117727.116551]  [<ffffffff810f7cd9>] ? do_sync_read+0xce/0x113
[117727.116583]  [<ffffffff812fdc17>] ? __down_read+0x15/0xab
[117727.116615]  [<ffffffff81065a0e>] ? autoremove_wake_function+0x0/0x2e
[117727.116648]  [<ffffffff811306f0>] ? load_elf_binary+0x0/0x1958
[117727.116680]  [<ffffffff810fc5a6>] ? search_binary_handler+0xb4/0x245
[117727.116712]  [<ffffffff8112f0fc>] ? load_script+0x0/0x1ec
[117727.116743]  [<ffffffff8112f2bd>] ? load_script+0x1c1/0x1ec
[117727.116775]  [<ffffffff810fc1ba>] ? get_arg_page+0x4b/0xa4
[117727.116807]  [<ffffffff810fc5a6>] ? search_binary_handler+0xb4/0x245
[117727.116815]  [<ffffffff810fda14>] ? do_execve+0x1e8/0x2dc
[117727.116815]  [<ffffffff8100f4eb>] ? sys_execve+0x35/0x4c
[117727.116815]  [<ffffffff81010f9a>] ? stub_execve+0x6a/0xc0
[117727.116815] Code: fa 66 66 90 66 66 90 65 8b 04 25 a8 e3 00 00 48 98 49 8b 94 c4 f0 02 00 00 8b 4a 18 89 4c 24 14 48 8b 1a 48 85 db 74 0c 8b 42 14 <48> 8b 04 c3 48 89 02 eb 19 48 8b 4c 24 08 49 89 d0 44 89 ee 83 
[117727.116815] RIP  [<ffffffff810f0f6c>] __kmalloc+0xd2/0x141
[117727.116815]  RSP <ffff88008b103b88>
[117727.116815] ---[ end trace 721a8e84d9e66c0b ]---

Among the processes named in these traces, there is a bit of everything: munin-node (many times), nrpe (twice), bash (once), ntp_kernel_pll_...
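The tally above can be reproduced mechanically. The following is a minimal sketch, assuming the traces keep the "Pid: ..., comm: ..." line format shown in the excerpt; the function name and the sample input are illustrative and not part of the original report.

```python
import re
from collections import Counter

# Matches the "Pid: <n>, comm: <name>" line printed once per kernel trace.
COMM_RE = re.compile(r"Pid: \d+, comm: (\S+)")

def count_faulting_processes(lines):
    """Return a Counter mapping process name to the number of traces naming it."""
    counts = Counter()
    for line in lines:
        match = COMM_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Two lines copied from the trace above, used as sample input.
sample = [
    "[117727.112851] general protection fault: 0000 [#377] SMP ",
    "[117727.115436] Pid: 11837, comm: munin-node Tainted: G      D W  "
    "2.6.32-bpo.3-vserver-amd64 #1 System Product Name",
]
print(count_faulting_processes(sample).most_common())
```

To run it against a real log, pass an open file object instead of the sample list; the location of the syslog file depends on the host configuration.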

This smells like a hardware problem.

On top of that, the SMART data for /dev/sdc is in a rather sorry state:

root@opium:~# smartctl -a /dev/sdc
smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.12 family
Device Model:     ST31000528AS
Serial Number:    5VP4GSKE
Firmware Version: CC38
User Capacity:    1 000 204 886 016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Sat Jun 29 00:33:08 2013 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:          ( 609) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:      (   1) minutes.
Extended self-test routine
recommended polling time:      ( 185) minutes.
Conveyance self-test routine
recommended polling time:      (   2) minutes.
SCT capabilities:            (0x103f)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       120470406
  3 Spin_Up_Time            0x0003   095   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       280
  5 Reallocated_Sector_Ct   0x0033   096   096   036    Pre-fail  Always       -       198
  7 Seek_Error_Rate         0x000f   087   060   030    Pre-fail  Always       -       633051438
  9 Power_On_Hours          0x0032   069   069   000    Old_age   Always       -       27454
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       140
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   088   088   000    Old_age   Always       -       12
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       42950328331
189 High_Fly_Writes         0x003a   099   099   000    Old_age   Always       -       1
190 Airflow_Temperature_Cel 0x0022   051   036   045    Old_age   Always   In_the_past 49 (34 33 54 39)
194 Temperature_Celsius     0x0022   049   064   000    Old_age   Always       -       49 (0 16 0 0)
195 Hardware_ECC_Recovered  0x001a   041   013   000    Old_age   Always       -       120470406
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       3
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       3
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       272176372542530
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       2042468901
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1485209532

SMART Error Log Version: 1
ATA Error Count: 12 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 12 occurred at disk power-on lifetime: 26972 hours (1123 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 39 59 1b 00

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 35 59 1b 40 00   7d+12:48:02.411  READ FPDMA QUEUED
  27 00 00 00 00 00 e0 00   7d+12:48:02.411  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00   7d+12:48:02.410  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00   7d+12:48:02.410  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00   7d+12:48:02.390  READ NATIVE MAX ADDRESS EXT

Error 11 occurred at disk power-on lifetime: 26972 hours (1123 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 39 59 1b 00

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 35 59 1b 40 00   7d+12:47:59.230  READ FPDMA QUEUED
  27 00 00 00 00 00 e0 00   7d+12:47:59.230  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00   7d+12:47:59.229  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00   7d+12:47:59.229  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00   7d+12:47:59.210  READ NATIVE MAX ADDRESS EXT

Error 10 occurred at disk power-on lifetime: 26972 hours (1123 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 39 59 1b 00

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 35 59 1b 40 00   7d+12:47:56.067  READ FPDMA QUEUED
  27 00 00 00 00 00 e0 00   7d+12:47:56.066  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00   7d+12:47:56.066  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00   7d+12:47:56.065  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00   7d+12:47:56.046  READ NATIVE MAX ADDRESS EXT

Error 9 occurred at disk power-on lifetime: 26972 hours (1123 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 39 59 1b 00

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 35 59 1b 40 00   7d+12:47:52.895  READ FPDMA QUEUED
  27 00 00 00 00 00 e0 00   7d+12:47:52.894  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00   7d+12:47:52.894  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00   7d+12:47:52.893  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00   7d+12:47:52.874  READ NATIVE MAX ADDRESS EXT

Error 8 occurred at disk power-on lifetime: 26972 hours (1123 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 39 59 1b 00

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 08 35 59 1b 40 00   7d+12:47:49.748  READ FPDMA QUEUED
  27 00 00 00 00 00 e0 00   7d+12:47:49.747  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00   7d+12:47:49.746  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00   7d+12:47:49.746  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00   7d+12:47:49.727  READ NATIVE MAX ADDRESS EXT

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

We should check whether the two are related and, if so, take action.
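As a first step, the worrying counters can be pulled out of the smartctl report automatically. This is a minimal sketch, assuming the attribute table layout shown above (ten whitespace-separated columns, raw value in the tenth); the choice of watched attributes and the "raw value above zero" rule are assumptions of this sketch, not a vendor-documented threshold. It only characterizes the disk and does not by itself establish a link with the general protection faults.

```python
# Attributes whose non-zero raw value commonly points at failing media
# (an assumption of this sketch, not an official rule).
WATCHED = {
    "Reallocated_Sector_Ct",
    "Reported_Uncorrect",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}

def suspicious_attributes(smartctl_output):
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    found = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in WATCHED and raw.isdigit() and int(raw) > 0:
            found[name] = int(raw)
    return found

# Rows copied from the report above, used as sample input.
sample = """\
  5 Reallocated_Sector_Ct   0x0033   096   096   036    Pre-fail  Always       -       198
187 Reported_Uncorrect      0x0032   088   088   000    Old_age   Always       -       12
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       3
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       3
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""
print(suspicious_attributes(sample))
```

On the report above, this flags the 198 reallocated sectors, the 12 reported uncorrectable errors, and the 3 pending and 3 offline-uncorrectable sectors.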


Related issues 2 (0 open, 2 closed)

Related to Admins - Bug #1323: Paniclog on Opium, Closed, Quentin CHERGUI, 04/07/2013
Related to Admins - Bug #1166: oops on ns1, Closed, François Poulain, 03/01/2013
