[gpfsug-discuss] /sbin/rmmod mmfs26 hangs on mmshutdown

Billich Heinrich Rainer (PSI) heiner.billich at psi.ch
Wed Jul 11 14:40:46 BST 2018


Hello,

I have two nodes that hang on ‘mmshutdown’; more precisely, the command ‘/sbin/rmmod mmfs26’ hangs. The kernel messages are appended below. Does this look familiar to anybody? Is it a known bug? I can avoid the issue by reducing pagepool from 128G to 64G.

Running ‘systemctl stop gpfs’ shows the same issue. systemd forcefully terminates the stop job after a timeout, but ‘rmmod’ stays stuck.

Two functions seem to be involved: cxiReleaseAndForgetPages, which is part of GPFS (module mmfslinux), and put_page, which is a kernel function.
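
To check which frames recur in the traces appended below, a short script can pull the symbol and module names out of the kernel lines. This is only a sketch: the regular expression and the helper name extract_frames are illustrative and assume the RHEL 7 stack-frame format seen in this log.

```python
import re

# Matches kernel stack frames such as:
#   [<ffffffffc08e3562>] cxiReleaseAndForgetPages+0xb2/0x1c0 [mmfslinux]
# An optional "? " before the symbol marks a frame the kernel is unsure about.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?:\?\s+)?(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:[ \t]+\[(?P<mod>\w+)\])?"
)

def extract_frames(log_text):
    """Return (symbol, module) pairs in order of appearance.

    module is None for frames that belong to the core kernel.
    """
    return [(m.group("sym"), m.group("mod")) for m in FRAME_RE.finditer(log_text)]

sample = """\
Jul 11 14:12:41 node-1.x.y kernel:  [<ffffffff81192275>] put_page+0x45/0x50
Jul 11 14:12:41 node-1.x.y kernel:  [<ffffffffc08e3562>] cxiReleaseAndForgetPages+0xb2/0x1c0 [mmfslinux]
Jul 11 14:12:41 node-1.x.y kernel:  [<ffffffff811e0b02>] ? kmem_cache_free+0x1e2/0x200
"""

for sym, mod in extract_frames(sample):
    print(sym, mod or "kernel")
```

Fed the full log, the same function also picks up the symbol on the RIP line, since that line uses the same bracketed-address format.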

The servers each have 256G of memory and 72 (virtual) cores.
I run GPFS 5.0.1-1 on RHEL 7.4 with kernel 3.10.0-693.17.1.el7.x86_64.
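
For a rough sense of scale, the two pagepool settings mentioned above can be converted into page counts. This is a back-of-the-envelope sketch that assumes "G" means GiB and 4 KiB base pages; the trace shows put_compound_page, so the unit actually released may be larger, and the numbers do not by themselves show that the page count causes the lockup.

```python
GIB = 1024 ** 3
PAGE_SIZE = 4096  # assumption: 4 KiB base pages on x86_64
RAM_GIB = 256     # memory per server, as stated above

def pages_in(pool_gib, page_size=PAGE_SIZE):
    """Number of pages of page_size bytes backing a pool of pool_gib GiB."""
    return pool_gib * GIB // page_size

for pool in (128, 64):
    share = pool / RAM_GIB
    print(f"pagepool {pool}G: {pages_in(pool):,} pages, {share:.0%} of {RAM_GIB}G RAM")
```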

I can try switching back to 5.0.0.

Thank you & kind regards,

Heiner



Jul 11 14:12:04 node-1.x.y mmremote[1641]: Unloading module mmfs26
Jul 11 14:12:04 node-1.x.y mmsysmon[2440]: [E] Event raised: The Spectrum Scale service process not running on this node. Normal operation cannot be done
Jul 11 14:12:04 node-1.x.y mmsysmon[2440]: [I] Event raised: The Spectrum Scale service process is running
Jul 11 14:12:04 node-1.x.y mmsysmon[2440]: [E] Event raised: The node is not able to form a quorum with the other available nodes.
Jul 11 14:12:38 node-1.x.y sshd[2826]: Connection closed by xxx port 52814 [preauth]

Jul 11 14:12:41 node-1.x.y kernel: NMI watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [rmmod:2695]

Jul 11 14:12:41 node-1.x.y kernel: Modules linked in: mmfs26(OE-) mmfslinux(OE) tracedev(OE) tcp_diag inet_diag rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) mlx4_ib(OE) ib_core(OE) vfat fat ext4 sb_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi mbcache jbd2 kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt iTCO_vendor_support ipmi_ssif pcc_cpufreq hpilo ipmi_si sg hpwdt pcspkr i2c_i801 lpc_ich ipmi_devintf wmi ioatdma shpchp ipmi_msghandler acpi_power_meter binfmt_misc nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect
Jul 11 14:12:41 node-1.x.y kernel:  sysimgblt fb_sys_fops ttm ixgbe mlx4_core(OE) crct10dif_pclmul mdio mlx_compat(OE) crct10dif_common drm ptp crc32c_intel devlink hpsa pps_core i2c_core scsi_transport_sas dca dm_mirror dm_region_hash dm_log dm_mod [last unloaded: tracedev]
Jul 11 14:12:41 node-1.x.y kernel: CPU: 28 PID: 2695 Comm: rmmod Tainted: G        W  OEL ------------   3.10.0-693.17.1.el7.x86_64 #1
Jul 11 14:12:41 node-1.x.y kernel: Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 01/22/2018
Jul 11 14:12:41 node-1.x.y kernel: task: ffff8808c4814f10 ti: ffff881619778000 task.ti: ffff881619778000
Jul 11 14:12:41 node-1.x.y kernel: RIP: 0010:[<ffffffff816a2970>]  [<ffffffff816a2970>] put_compound_page+0xc3/0x174
Jul 11 14:12:41 node-1.x.y kernel: RSP: 0018:ffff88161977bd50  EFLAGS: 00000246
Jul 11 14:12:41 node-1.x.y kernel: RAX: 0000000000000283 RBX: 00000000fae3d201 RCX: 0000000000000284
Jul 11 14:12:41 node-1.x.y kernel: RDX: 0000000000000283 RSI: 0000000000000246 RDI: ffffea003d478000
Jul 11 14:12:41 node-1.x.y kernel: RBP: ffff88161977bd68 R08: ffff881ffae3d1e0 R09: 0000000180800059
Jul 11 14:12:41 node-1.x.y kernel: R10: 00000000fae3d201 R11: ffffea007feb8f40 R12: 00000000fae3d201
Jul 11 14:12:41 node-1.x.y kernel: R13: ffff88161977bd40 R14: 0000000000000000 R15: ffff88161977bd40
Jul 11 14:12:41 node-1.x.y kernel: FS:  00007f81a1db0740(0000) GS:ffff883ffee80000(0000) knlGS:0000000000000000
Jul 11 14:12:41 node-1.x.y kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 11 14:12:41 node-1.x.y kernel: CR2: 00007fa96e38f980 CR3: 0000000c36b2c000 CR4: 00000000001607e0
Jul 11 14:12:41 node-1.x.y kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jul 11 14:12:41 node-1.x.y kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

Jul 11 14:12:41 node-1.x.y kernel: Call Trace:
Jul 11 14:12:41 node-1.x.y kernel:  [<ffffffff81192275>] put_page+0x45/0x50
Jul 11 14:12:41 node-1.x.y kernel:  [<ffffffffc08e3562>] cxiReleaseAndForgetPages+0xb2/0x1c0 [mmfslinux]
Jul 11 14:12:41 node-1.x.y kernel:  [<ffffffffc08e3ae5>] cxiDeallocPageList+0x45/0x110 [mmfslinux]
Jul 11 14:12:41 node-1.x.y kernel:  [<ffffffff811e0b02>] ? kmem_cache_free+0x1e2/0x200
Jul 11 14:12:41 node-1.x.y kernel:  [<ffffffffc08e3cda>] cxiFreeSharedMemory+0x12a/0x130 [mmfslinux]
Jul 11 14:12:41 node-1.x.y kernel:  [<ffffffffc0c70c12>] kxFreeAllSharedMemory+0xe2/0x160 [mmfs26]
Jul 11 14:12:41 node-1.x.y kernel:  [<ffffffffc0c5bd15>] mmfs+0xc85/0xca0 [mmfs26]
Jul 11 14:12:41 node-1.x.y kernel:  [<ffffffffc08c8f16>] gpfs_clean+0x26/0x30 [mmfslinux]
Jul 11 14:12:41 node-1.x.y kernel:  [<ffffffffc0da5565>] cleanup_module+0x25/0x30 [mmfs26]
Jul 11 14:12:41 node-1.x.y kernel:  [<ffffffff8110044b>] SyS_delete_module+0x19b/0x300
Jul 11 14:12:41 node-1.x.y kernel:  [<ffffffff816b89fd>] system_call_fastpath+0x16/0x1b
Jul 11 14:12:41 node-1.x.y kernel: Code: d1 00 00 00 4c 89 e7 e8 3a ff ff ff e9 c4 00 00 00 4c 39 e3 74 c1 41 8b 54 24 1c 85 d2 74 b8 8d 4a 01 89 d0 f0 41 0f b1 4c 24 1c <39> c2 74 04 89 c2 eb e8 e8 f3 f0 ae ff 49 89 c5 f0 41 0f ba 2c

Jul 11 14:13:23 node-1.x.y systemd[1]: gpfs.service stopping timed out. Terminating.

Jul 11 14:13:27 node-1.x.y kernel: NMI watchdog: BUG: soft lockup - CPU#28 stuck for 21s! [rmmod:2695]

Jul 11 14:13:27 node-1.x.y kernel: Modules linked in: mmfs26(OE-) mmfslinux(OE) tracedev(OE) tcp_diag inet_diag rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlx4_en(OE) mlx4_ib(OE) ib_core(OE) vfat fat ext4 sb_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi mbcache jbd2 kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd iTCO_wdt iTCO_vendor_support ipmi_ssif pcc_cpufreq hpilo ipmi_si sg hpwdt pcspkr i2c_i801 lpc_ich ipmi_devintf wmi ioatdma shpchp ipmi_msghandler
Jul 11 14:13:27 node-1.x.y kernel: INFO: rcu_sched detected stalls on CPUs/tasks:
Jul 11 14:13:27 node-1.x.y kernel:  {
Jul 11 14:13:27 node-1.x.y kernel:  28
Jul 11 14:13:27 node-1.x.y kernel: }
Jul 11 14:13:27 node-1.x.y kernel: (detected by 17, t=60002 jiffies, g=267734, c=267733, q=36089)
Jul 11 14:13:27 node-1.x.y kernel: Task dump for CPU 28:
Jul 11 14:13:27 node-1.x.y kernel: rmmod           R
Jul 11 14:13:27 node-1.x.y kernel:   running task
Jul 11 14:13:27 node-1.x.y kernel:     0  2695   2642 0x00000008
Jul 11 14:13:27 node-1.x.y kernel: Call Trace:
Jul 11 14:13:27 node-1.x.y kernel:
Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffff811dea1c>] ? __free_slab+0xdc/0x200
Jul 11 14:13:27 node-1.x.y kernel:
Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffff816a28ad>] ? __put_compound_page+0x22/0x22
Jul 11 14:13:27 node-1.x.y kernel:
Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffff81192275>] ? put_page+0x45/0x50
Jul 11 14:13:27 node-1.x.y kernel:
Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffffc08e3562>] ? cxiReleaseAndForgetPages+0xb2/0x1c0 [mmfslinux]
Jul 11 14:13:27 node-1.x.y kernel:
Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffffc08e3ae5>] ? cxiDeallocPageList+0x45/0x110 [mmfslinux]
Jul 11 14:13:27 node-1.x.y kernel:
Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffffc08e3cda>] ? cxiFreeSharedMemory+0x12a/0x130 [mmfslinux]
Jul 11 14:13:27 node-1.x.y kernel:
Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffffc0c70c12>] ? kxFreeAllSharedMemory+0xe2/0x160 [mmfs26]
Jul 11 14:13:27 node-1.x.y kernel:
Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffffc0c5bd15>] ? mmfs+0xc85/0xca0 [mmfs26]
Jul 11 14:13:27 node-1.x.y kernel:
Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffffc08c8f16>] ? gpfs_clean+0x26/0x30 [mmfslinux]
Jul 11 14:13:27 node-1.x.y kernel:
Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffffc0da5565>] ? cleanup_module+0x25/0x30 [mmfs26]
Jul 11 14:13:27 node-1.x.y kernel:
Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffff8110044b>] ? SyS_delete_module+0x19b/0x300
Jul 11 14:13:27 node-1.x.y kernel:
Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffff816b89fd>] ? system_call_fastpath+0x16/0x1b
Jul 11 14:13:27 node-1.x.y kernel:  acpi_power_meter
Jul 11 14:13:27 node-1.x.y kernel:  binfmt_misc nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ixgbe mlx4_core(OE) crct10dif_pclmul mdio mlx_compat(OE) crct10dif_common drm ptp crc32c_intel devlink hpsa pps_core i2c_core scsi_transport_sas dca dm_mirror dm_region_hash dm_log dm_mod [last unloaded: tracedev]
Jul 11 14:13:27 node-1.x.y kernel: CPU: 28 PID: 2695 Comm: rmmod Tainted: G        W  OEL ------------   3.10.0-693.17.1.el7.x86_64 #1
Jul 11 14:13:27 node-1.x.y kernel: Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 01/22/2018
Jul 11 14:13:27 node-1.x.y kernel: task: ffff8808c4814f10 ti: ffff881619778000 task.ti: ffff881619778000
Jul 11 14:13:27 node-1.x.y kernel: RIP: 0010:[<ffffffff816a28ad>]  [<ffffffff816a28ad>] __put_compound_page+0x22/0x22
Jul 11 14:13:27 node-1.x.y kernel: RSP: 0018:ffff88161977bd70  EFLAGS: 00000282
Jul 11 14:13:27 node-1.x.y kernel: RAX: 002fffff00008010 RBX: 0000000000000135 RCX: 00000000000001c1
Jul 11 14:13:27 node-1.x.y kernel: RDX: ffff8814adbbf000 RSI: 0000000000000246 RDI: ffffea00650e7040
Jul 11 14:13:27 node-1.x.y kernel: RBP: ffff88161977bd78 R08: ffff881ffae3df60 R09: 0000000180800052
Jul 11 14:13:27 node-1.x.y kernel: R10: 00000000fae3db01 R11: ffffea007feb8f40 R12: ffff881ffae3df60
Jul 11 14:13:27 node-1.x.y kernel: R13: 0000000180800052 R14: 00000000fae3db01 R15: ffffea007feb8f40
Jul 11 14:13:27 node-1.x.y kernel: FS:  00007f81a1db0740(0000) GS:ffff883ffee80000(0000) knlGS:0000000000000000
Jul 11 14:13:27 node-1.x.y kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 11 14:13:27 node-1.x.y kernel: CR2: 00007fa96e38f980 CR3: 0000000c36b2c000 CR4: 00000000001607e0
Jul 11 14:13:27 node-1.x.y kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jul 11 14:13:27 node-1.x.y kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Jul 11 14:13:27 node-1.x.y kernel: Call Trace:
Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffff81192275>] ? put_page+0x45/0x50
Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffffc08e3562>] cxiReleaseAndForgetPages+0xb2/0x1c0 [mmfslinux]
Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffffc08e3ae5>] cxiDeallocPageList+0x45/0x110 [mmfslinux]
Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffffc08e3cda>] cxiFreeSharedMemory+0x12a/0x130 [mmfslinux]
Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffffc0c70c12>] kxFreeAllSharedMemory+0xe2/0x160 [mmfs26]
Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffffc0c5bd15>] mmfs+0xc85/0xca0 [mmfs26]
Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffffc08c8f16>] gpfs_clean+0x26/0x30 [mmfslinux]
Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffffc0da5565>] cleanup_module+0x25/0x30 [mmfs26]
Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffff8110044b>] SyS_delete_module+0x19b/0x300
Jul 11 14:13:27 node-1.x.y kernel:  [<ffffffff816b89fd>] system_call_fastpath+0x16/0x1b
Jul 11 14:13:27 node-1.x.y kernel: Code: c0 0f 95 c0 0f b6 c0 5d c3 0f 1f 44 00 00 55 48 89 e5 53 48 8b 07 48 89 fb a8 20 74 05 e8 0c f8 ae ff 48 89 df ff 53 60 5b 5d c3 <0f> 1f 44 00 00 55 48 89 e5 41 55 41 54 53 48 8b 07 48 89 fb f6

--
Paul Scherrer Institut
Science IT
Heiner Billich
WHGA 106
CH 5232  Villigen PSI
056 310 36 02
https://www.psi.ch



