[gpfsug-discuss] [item 435]: GPFS fails to start - NSD thread configuration needs more threads

Tomasz.Wolski at ts.fujitsu.com Tomasz.Wolski at ts.fujitsu.com
Fri Mar 16 15:01:08 GMT 2018


Hi Aaron,

Thanks for the hint! :)

Best regards,
Tomasz Wolski

> -----Original Message-----
> From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-
> bounces at spectrumscale.org] On Behalf Of Aaron Knister
> Sent: Friday, March 16, 2018 3:52 PM
> To: gpfsug-discuss at spectrumscale.org
> Subject: Re: [gpfsug-discuss] [item 435]: GPFS fails to start - NSD thread
> configuration needs more threads
> 
> Ah. You, my friend, have been struck by a smooth criminal. And by smooth
> criminal I mean systemd. I ran into this last week and spent many hours
> banging my head against the wall trying to figure it out.
> 
> systemd by default limits cgroups to (I think) 512 tasks, and since a thread
> counts as a task, that's likely what you're running into.
> 
> Try setting DefaultTasksMax=infinity in /etc/systemd/system.conf and then
> reboot (and yes, I mean reboot: changing it live doesn't seem possible,
> thanks to the infinite wisdom of the systemd developers).
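> 
> To illustrate, a minimal sketch of the change (your systemd must already
> support DefaultTasksMax, or you wouldn't be hitting this limit in the
> first place):
> 
>     # /etc/systemd/system.conf
>     [Manager]
>     DefaultTasksMax=infinity
> 
>     # after the reboot, show the manager's own properties and check the value
>     systemctl show -p DefaultTasksMax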
> 
> The pid limit of a given slice/unit cgroup may already be overridden to
> something more reasonable than the 512 default, so if, for example, you
> were logging in and starting it via ssh, the limit may be different than if
> it's started from the gpfs.service unit, because mmfsd effectively runs in
> a different cgroup in each case.
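> 
> If you want to see which limit actually applies in each case, something
> like this should work (the session scope name is just an example, use
> systemd-cgls to find the real one; the pids.max path assumes the cgroup v1
> layout that SLES 12 uses):
> 
>     # limit on the GPFS service unit
>     systemctl show -p TasksMax gpfs.service
> 
>     # limit on an interactive ssh session's scope
>     systemctl show -p TasksMax session-1.scope
> 
>     # or read the pids controller directly for the cgroup mmfsd ends up in
>     cat /sys/fs/cgroup/pids/system.slice/gpfs.service/pids.max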
> 
> Hope that helps!
> 
> -Aaron
> 
> On 3/16/18 10:25 AM, Tomasz.Wolski at ts.fujitsu.com wrote:
> > Hello GPFS Team,
> >
> > We are observing strange behavior of GPFS during startup on a SLES 12 node.
> >
> > In our test cluster, we reinstalled the VLP1 node with SLES 12 SP3 as a
> > base, and when GPFS starts for the first time on this node, it complains
> > about too few NSD threads:
> >
> > ..
> >
> > 2018-03-16_13:11:28.947+0100: GPFS: 6027-310 [I] mmfsd initializing.
> > {Version: 4.2.3.7   Built: Feb 15 2018 11:38:38} ...
> >
> > 2018-03-16_13:11:28.947+0100: [I] Cleaning old shared memory ...
> >
> > 2018-03-16_13:11:28.947+0100: [I] First pass parsing mmfs.cfg ...
> >
> > ..
> >
> > 2018-03-16_13:11:29.375+0100: [I] Initializing the cluster manager ...
> >
> > 2018-03-16_13:11:29.523+0100: [I] Initializing the token manager ...
> >
> > 2018-03-16_13:11:29.524+0100: [I] Initializing network shared disks ...
> >
> > 2018-03-16_13:11:29.626+0100: [E] NSD thread configuration needs 413
> > more threads, exceeds max thread count 1024
> >
> > 2018-03-16_13:11:29.628+0100: GPFS: 6027-311 [N] mmfsd is shutting down.
> >
> > 2018-03-16_13:11:29.628+0100: [N] Reason for shutdown: Could not
> > initialize network shared disks
> >
> > 2018-03-16_13:11:29.633+0100: [E] processStart: fork: err 11
> >
> > 2018-03-16_13:11:30.701+0100: runmmfs starting
> >
> > Removing old /var/adm/ras/mmfs.log.* files:
> >
> > 2018-03-16_13:11:30.713+0100 runmmfs: respawn 32 waiting 336 seconds
> > before restarting mmfsd
> >
> > 2018-03-16_13:13:13.298+0100: [I] Calling user exit script mmSdrBackup:
> > event mmSdrBackup, async command /var/mmfs/etc/mmsdrbackup
> >
> > GPFS then enters a loop and tries to respawn mmfsd periodically:
> >
> > 2018-03-16_13:11:30.713+0100 runmmfs: respawn 32 waiting 336 seconds
> > before restarting mmfsd
> >
> > It seems that this issue can be worked around by running mmshutdown;
> > when we later start GPFS manually with mmstartup, the problem is gone.
> >
> > We are running GPFS 4.2.3.7. All nodes except VLP1 are running
> > SLES 11 SP4; only on VLP1 did we install SLES 12 SP3.
> >
> > The test cluster looks as below:
> >
> > Node  Daemon node name  IP address       Admin node name  Designation
> > -----------------------------------------------------------------------
> >    1  VLP0.cs-intern    192.168.101.210  VLP0.cs-intern   quorum-manager-snmp_collector
> >    2  VLP1.cs-intern    192.168.101.211  VLP1.cs-intern   quorum-manager
> >    3  TBP0.cs-intern    192.168.101.215  TBP0.cs-intern   quorum
> >    4  IDP0.cs-intern    192.168.101.110  IDP0.cs-intern
> >    5  IDP1.cs-intern    192.168.101.111  IDP1.cs-intern
> >    6  IDP2.cs-intern    192.168.101.112  IDP2.cs-intern
> >    7  IDP3.cs-intern    192.168.101.113  IDP3.cs-intern
> >    8  ICP0.cs-intern    192.168.101.10   ICP0.cs-intern
> >    9  ICP1.cs-intern    192.168.101.11   ICP1.cs-intern
> >   10  ICP2.cs-intern    192.168.101.12   ICP2.cs-intern
> >   11  ICP3.cs-intern    192.168.101.13   ICP3.cs-intern
> >   12  ICP4.cs-intern    192.168.101.14   ICP4.cs-intern
> >   13  ICP5.cs-intern    192.168.101.15   ICP5.cs-intern
> >
> > We have enabled traces and reproduced the issue as follows:
> >
> > 1. When the GPFS daemon was in the respawn loop, we started traces; all
> >    files from this period can be found in the uploaded archive under the
> >    1_nsd_threads_problem directory.
> >
> > 2. We manually stopped the “respawn” loop on VLP1 by executing mmshutdown
> >    and then started GPFS manually with mmstartup. All traces from this run
> >    can be found in the archive under the 2_mmshutdown_mmstartup directory.
> >
> > All data related to this problem has been uploaded to our FTP:
> >
> > ftp.ts.fujitsu.com/CS-Diagnose/IBM
> > <ftp://ftp.ts.fujitsu.com/CS-Diagnose/IBM> (fe_cs_oem, 12Monkeys),
> > file item435_nsd_threads.tar.gz
> >
> > Could you please have a look at this problem?
> >
> > Best regards,
> >
> > Tomasz Wolski
> >
> >
> >
> >
> 
> --
> Aaron Knister
> NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight
> Center
> (301) 286-2776
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss


