[gpfsug-discuss] [item 435]: GPFS fails to start - NSD thread configuration needs more threads
Aaron Knister
aaron.s.knister at nasa.gov
Fri Mar 16 14:52:11 GMT 2018
Ah. You, my friend, have been struck by a smooth criminal. And by smooth
criminal I mean systemd. I ran into this last week and spent many hours
banging my head against the wall trying to figure it out.
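systemd by default limits cgroups to, I think, 512 tasks, and since a
thread counts as a task, that's likely what you're running into. The
"fork: err 11" in your log fits too: errno 11 is EAGAIN on Linux, which
is what fork returns when a task limit is hit.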
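
To see how many tasks a process is contributing, you can count its
threads, since each one is a task. A minimal Python sketch, assuming a
Linux /proc filesystem; the function names are made up for illustration
and are not part of GPFS or systemd:

```python
def threads_from_status(status_text):
    """Return the thread count from the text of /proc/<pid>/status, or None."""
    for line in status_text.splitlines():
        # The kernel reports the thread count on a line starting with "Threads:".
        if line.startswith("Threads:"):
            return int(line.split(":", 1)[1].strip())
    return None


def thread_count(pid="self"):
    """Read the thread count of a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return threads_from_status(f.read())
```

Calling thread_count() with the pid of mmfsd shows how many tasks the
daemon alone accounts for in its cgroup.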
Try setting DefaultTasksMax=infinity in the [Manager] section of
/etc/systemd/system.conf and then reboot (and yes, I mean reboot;
changing it live doesn't seem possible because of the infinite wisdom of
the systemd developers).
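
Before rebooting, it may be worth confirming that the setting is really
active in the file and not still commented out. This is only a sketch:
it reads the main file and ignores any drop-ins under
/etc/systemd/system.conf.d/, and the function names are hypothetical:

```python
def default_tasks_max(conf_text):
    """Return the active DefaultTasksMax value from system.conf text, or None."""
    section = None
    value = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments ("#" or ";" at the start of a line).
        if not line or line[0] in "#;":
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        key, sep, val = line.partition("=")
        # If the key appears more than once, the last assignment is kept.
        if sep and section == "Manager" and key.strip() == "DefaultTasksMax":
            value = val.strip()
    return value


def check_system_conf(path="/etc/systemd/system.conf"):
    """Read the file at `path` and return its DefaultTasksMax value, or None."""
    with open(path) as f:
        return default_tasks_max(f.read())
```

A result of None means the line is missing or commented out, so the
built-in default still applies.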
The pid limit of a given slice/unit cgroup may already be overridden to
something more reasonable than the 512 default, so if, for example, you
were logging in and starting it via ssh, the limit may be different than
if it's started from the gpfs.service unit, because mmfsd is effectively
running in a different cgroup in each case.
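
To compare the two cases, you can look up which pids cgroup a given
mmfsd process sits in and what limit applies there. A rough Python
sketch, assuming the usual mount points (/sys/fs/cgroup/pids for cgroup
v1, /sys/fs/cgroup for the unified hierarchy); adjust if your system
mounts them elsewhere. The function names are illustrative only:

```python
import os


def pids_cgroup_path(cgroup_text):
    """Return (mount_root, cgroup_path) for the pids controller, or None.

    `cgroup_text` is the content of /proc/<pid>/cgroup, one
    "id:controllers:path" entry per line.
    """
    unified = None
    for line in cgroup_text.splitlines():
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue
        _hier_id, controllers, path = parts
        if "pids" in controllers.split(","):
            # cgroup v1: the pids controller has its own hierarchy.
            return "/sys/fs/cgroup/pids", path
        if controllers == "":
            # cgroup v2 (unified) entry, written as "0::<path>".
            unified = path
    if unified is not None:
        return "/sys/fs/cgroup", unified
    return None


def task_limit(pid="self"):
    """Return pids.max and pids.current for the cgroup of `pid`, or None."""
    with open(f"/proc/{pid}/cgroup") as f:
        found = pids_cgroup_path(f.read())
    if found is None:
        return None
    root, path = found
    base = os.path.join(root, path.lstrip("/"))
    result = {}
    for name in ("pids.max", "pids.current"):
        try:
            with open(os.path.join(base, name)) as f:
                result[name] = f.read().strip()
        except OSError:
            # The file is absent, e.g. in the root cgroup.
            result[name] = None
    return result
```

Running task_limit() against the mmfsd pid once when it was started by
the unit and once when it was started from an ssh session should show
whether the pids.max values really differ.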
Hope that helps!
-Aaron
On 3/16/18 10:25 AM, Tomasz.Wolski at ts.fujitsu.com wrote:
> Hello GPFS Team,
>
> We are observing strange behavior of GPFS during startup on a SLES12 node.
>
> In our test cluster, we reinstalled the VLP1 node with SLES 12 SP3 as a
> base, and when GPFS starts for the first time on this node, it complains
> about too few NSD threads:
>
> ..
>
> 2018-03-16_13:11:28.947+0100: GPFS: 6027-310 [I] mmfsd initializing.
> {Version: 4.2.3.7 Built: Feb 15 2018 11:38:38} ...
>
> 2018-03-16_13:11:28.947+0100: [I] Cleaning old shared memory ...
>
> 2018-03-16_13:11:28.947+0100: [I] First pass parsing mmfs.cfg ...
>
> ..
>
> 2018-03-16_13:11:29.375+0100: [I] Initializing the cluster manager ...
>
> 2018-03-16_13:11:29.523+0100: [I] Initializing the token manager ...
>
> 2018-03-16_13:11:29.524+0100: [I] Initializing network shared disks ...
>
> *_2018-03-16_13:11:29.626+0100: [E] NSD thread configuration needs 413
> more threads, exceeds max thread count 1024_*
>
> 2018-03-16_13:11:29.628+0100: GPFS: 6027-311 [N] mmfsd is shutting down.
>
> 2018-03-16_13:11:29.628+0100: [N] Reason for shutdown: Could not
> initialize network shared disks
>
> 2018-03-16_13:11:29.633+0100: [E] processStart: fork: err 11
>
> 2018-03-16_13:11:30.701+0100: runmmfs starting
>
> Removing old /var/adm/ras/mmfs.log.* files:
>
> 2018-03-16_13:11:30.713+0100 runmmfs: respawn 32 waiting 336 seconds
> before restarting mmfsd
>
> 2018-03-16_13:13:13.298+0100: [I] Calling user exit script mmSdrBackup:
> event mmSdrBackup, async command /var/mmfs/etc/mmsdrbackup
>
> GPFS enters a loop and tries to respawn mmfsd periodically:
>
> *_2018-03-16_13:11:30.713+0100 runmmfs: respawn 32 waiting 336 seconds
> before restarting mmfsd_*
>
> It seems that this issue can be resolved by doing mmshutdown. Later,
> when we manually perform mmstartup the problem is gone.
>
> We are running GPFS 4.2.3.7 and all nodes except VLP1 are running SLES11
> SP4. Only on VLP1 we installed SLES12 SP3.
>
> The test cluster looks as below:
>
> Node  Daemon node name  IP address       Admin node name  Designation
> -----------------------------------------------------------------------
>    1  VLP0.cs-intern    192.168.101.210  VLP0.cs-intern   quorum-manager-snmp_collector
>    2  VLP1.cs-intern    192.168.101.211  VLP1.cs-intern   quorum-manager
>    3  TBP0.cs-intern    192.168.101.215  TBP0.cs-intern   quorum
>    4  IDP0.cs-intern    192.168.101.110  IDP0.cs-intern
>    5  IDP1.cs-intern    192.168.101.111  IDP1.cs-intern
>    6  IDP2.cs-intern    192.168.101.112  IDP2.cs-intern
>    7  IDP3.cs-intern    192.168.101.113  IDP3.cs-intern
>    8  ICP0.cs-intern    192.168.101.10   ICP0.cs-intern
>    9  ICP1.cs-intern    192.168.101.11   ICP1.cs-intern
>   10  ICP2.cs-intern    192.168.101.12   ICP2.cs-intern
>   11  ICP3.cs-intern    192.168.101.13   ICP3.cs-intern
>   12  ICP4.cs-intern    192.168.101.14   ICP4.cs-intern
>   13  ICP5.cs-intern    192.168.101.15   ICP5.cs-intern
>
> We have enabled traces and reproduced the issue as follows:
>
> 1. While the GPFS daemon was in a respawn loop, we started traces. All
> files from this period can be found in the uploaded archive under the
> *_1_nsd_threads_problem_* directory.
>
> 2. We manually stopped the “respawn” loop on VLP1 by executing
> mmshutdown and then started GPFS manually with mmstartup. All traces from
> this execution can be found in the archive under the
> *_2_mmshutdown_mmstartup_* directory.
>
> All data related to this problem is uploaded to our ftp to file:
>
> ftp.ts.fujitsu.com/CS-Diagnose/IBM
> <ftp://ftp.ts.fujitsu.com/CS-Diagnose/IBM>, (fe_cs_oem, 12Monkeys)
> item435_nsd_threads.tar.gz
>
> Could you please have a look at this problem?
>
> Best regards,
>
> Tomasz Wolski
>
>
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
--
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776