[gpfsug-discuss] FS freeze on client nodes with nbCores>workerThreads

IBM Spectrum Scale scale at us.ibm.com
Fri Jun 30 07:57:49 BST 2017


I'm not aware of this kind of defect, and it seems it should not happen. But
without more data, we cannot tell what happened. I suggest you open a PMR for
your issue. Thanks.

Regards, The Spectrum Scale (GPFS) team

------------------------------------------------------------------------------------------------------------------

If you feel that your question can benefit other users of Spectrum Scale
(GPFS), then please post it to the public IBM developerWorks Forum at
https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479.


If your query concerns a potential software error in Spectrum Scale (GPFS)
and you have an IBM software maintenance contract please contact
1-800-237-5511 in the United States or your local IBM Service Center in
other countries.

The forum is informally monitored as time permits and should not be used
for priority messages to the Spectrum Scale (GPFS) team.



From:	"CAPIT, NICOLAS" <ncapit at atos.net>
To:	gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date:	06/27/2017 02:59 PM
Subject:	Re: [gpfsug-discuss] FS freeze on client nodes with
            nbCores>workerThreads
Sent by:	gpfsug-discuss-bounces at spectrumscale.org



Hello,

When the node is locked up there are no waiters (neither "mmdiag --waiters"
nor "mmfsadm dump waiters" shows any).
There is nothing in the GPFS log file "/var/mmfs/gen/mmfslog", and nothing
in the dmesg output or system log.
The "mmgetstate" command reports the node as "active".
The only symptom is the freeze of the FS.
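
The checks listed above could be gathered in one pass with a small script. This is only a minimal sketch: the helper names (run_with_timeout, collect) and the 10-second timeout are assumptions made for illustration, and the command list simply repeats the commands named in this thread. The timeout is best-effort, since a process blocked inside the frozen file system may not react to being killed, which is why "ls /gpfs" is deliberately left out.

```python
import subprocess

# Commands named in this thread; they are only meaningful on a GPFS node.
DIAGNOSTICS = [
    ["mmgetstate"],
    ["mmdiag", "--waiters"],
    ["dmesg"],
]


def run_with_timeout(cmd, timeout_s=10.0):
    """Run cmd and return (status, output).

    status is one of "ok", "failed", "timeout" or "missing".
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=timeout_s,
        )
    except FileNotFoundError:
        # The command is not installed or not on PATH.
        return "missing", ""
    except subprocess.TimeoutExpired:
        # The command did not finish in time; it may itself be hung.
        return "timeout", ""
    status = "ok" if proc.returncode == 0 else "failed"
    return status, proc.stdout


def collect(commands=DIAGNOSTICS, timeout_s=10.0):
    """Run every diagnostic command and map its command line to its result."""
    return {" ".join(cmd): run_with_timeout(cmd, timeout_s) for cmd in commands}
```

Calling collect() on the affected node returns a dictionary keyed by command line, so an empty waiters list and an "active" state can be recorded together with the time of the freeze before opening a PMR.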

Best regards,
Nicolas Capit
________________________________________
From: gpfsug-discuss-bounces at spectrumscale.org
[gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister
[aaron.s.knister at nasa.gov]
Sent: Tuesday, June 27, 2017 01:57
To: gpfsug-discuss at spectrumscale.org
Subject: Re: [gpfsug-discuss] FS freeze on client nodes with
nbCores>workerThreads

That's a fascinating bug. When the node is locked up what does "mmdiag
--waiters" show from the node in question? I suspect there's more
low-level diagnostic data that's helpful for the gurus at IBM but I'm
just curious what the waiters look like.

-Aaron

On 6/26/17 3:49 AM, CAPIT, NICOLAS wrote:
> Hello,
>
> I don't know if this behavior/bug was already reported on this ML, so I am
> reporting it just in case.
>
> Context:
>
>    - SpectrumScale 4.2.2-3
>    - client node with 64 cores
>    - OS: RHEL7.3
>
> When an MPI job with 64 processes is launched on the node with 64 cores,
> the FS freezes (only the output log file of the MPI job is written to
> GPFS, so it may be related to the 64 processes writing to the same
> file?).
>
>    strace -p 3105         # PID of the stuck mmfsd
>    Process 3105 attached
>    wait4(-1,              # stuck at this point
>
>    strace ls /gpfs
>    stat("/gpfs", {st_mode=S_IFDIR|0755, st_size=131072, ...}) = 0
>    openat(AT_FDCWD, "/gpfs", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC
>    # stuck at this point
>
> I have no problem with the other nodes, which have 28 cores.
> The GPFS command mmgetstate is working and I am able to use mmshutdown
> to recover the node.
>
>
> If I set workerThreads=72 on the 64-core node, I can no longer
> reproduce the freeze and the node behaves correctly.
>
> Is this a known bug when the number of cores is greater than workerThreads?
>
> Best regards,
> --
> *Nicolas Capit*
>
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
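
The condition named in the subject line (more cores than workerThreads) can be checked per node with a few lines of Python. This is a hedged sketch, not a confirmed diagnosis: the thread does not establish that this condition causes the freeze, the function name is hypothetical, and the configured workerThreads value is passed in by hand rather than read from GPFS.

```python
import os


def cores_exceed_worker_threads(worker_threads, n_cores=None):
    """Return True when the node has more cores than worker threads.

    worker_threads: the workerThreads value configured for the node.
    n_cores: core count to test; defaults to the local machine's count.
    """
    if n_cores is None:
        # os.cpu_count() may return None on unusual platforms.
        n_cores = os.cpu_count() or 1
    return n_cores > worker_threads


# Values from the report: 64 cores, workerThreads=72 avoided the freeze.
print(cores_exceed_worker_threads(72, n_cores=64))
# Hypothetical smaller setting, for illustration only.
print(cores_exceed_worker_threads(48, n_cores=64))
```

A node for which this returns True matches the configuration in which the freeze was reported, so it is a candidate for the same workerThreads test.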

--
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


