[gpfsug-discuss] FS freeze on client nodes with nbCores>workerThreads

CAPIT, NICOLAS ncapit at atos.net
Mon Jun 26 08:49:28 BST 2017


Hello,

I don't know whether this behavior/bug has already been reported on this mailing list, so I am posting it in case it has not.

Context:

  - SpectrumScale 4.2.2-3
  - client node with 64 cores
  - OS: RHEL7.3

When an MPI job with 64 processes is launched on the 64-core node, the FS freezes. Only the output log file of the MPI job is written to GPFS, so the freeze may be related to the 64 processes writing to the same file.
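
For anyone who wants to try this without an MPI stack, the I/O pattern described above can be roughly approximated as follows. This is only a sketch under assumptions: it is not the original MPI job, the helper name hammer_shared_file and the number of lines per process are made up for illustration, and it is not confirmed that this pattern alone is enough to trigger the freeze.

```python
import os
import subprocess
import sys

# Code run by each child process: append `count` short lines to one shared file.
CHILD = r"""
import os, sys
path, rank, count = sys.argv[1], int(sys.argv[2]), int(sys.argv[3])
fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
for i in range(count):
    os.write(fd, f"rank {rank} line {i}\n".encode())
os.close(fd)
"""


def hammer_shared_file(path, nprocs=None, lines_per_proc=1000):
    """Start nprocs writers (default: one per core) that all append to `path`.

    Hypothetical helper: it imitates many processes logging to one file,
    not the actual MPI application.
    """
    nprocs = nprocs or os.cpu_count() or 1
    procs = [
        subprocess.Popen(
            [sys.executable, "-c", CHILD, path, str(rank), str(lines_per_proc)]
        )
        for rank in range(nprocs)
    ]
    # Exit codes of the writers; all zeros means every writer finished.
    return [p.wait() for p in procs]
```

Pointing path at a file under the GPFS mount (any scratch file you are allowed to create) and leaving nprocs at its default gives one writer per core, which matches the 64-processes-on-64-cores case.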

  strace -p 3105         # mmfsd PID, stuck
  Process 3105 attached
  wait4(-1,              # stuck at this point

  strace ls /gpfs
  stat("/gpfs", {st_mode=S_IFDIR|0755, st_size=131072, ...}) = 0
  openat(AT_FDCWD, "/gpfs", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC   # stuck at this point

I have no problem on the other nodes, which have 28 cores.
The GPFS command mmgetstate still works on the frozen node, and I am able to use mmshutdown to recover it.
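
Since the daemon state still looks fine while access to /gpfs hangs, a listing with a timeout is a more direct way to detect the freeze than mmgetstate. A minimal sketch, assuming a plain "ls" on the mount point is a good enough probe; the function name listing_responds and the 10 second default are arbitrary choices for illustration:

```python
import subprocess


def listing_responds(path, timeout_s=10.0):
    """Return True if `ls path` finishes within timeout_s seconds."""
    proc = subprocess.Popen(
        ["ls", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        proc.wait(timeout=timeout_s)
        return True
    except subprocess.TimeoutExpired:
        # Ask the kernel to kill the child, but do not wait for it:
        # a process blocked inside the file system may never exit.
        proc.kill()
        return False
```

Note that the function only reports whether the listing returned in time, not whether it succeeded, so a missing directory also counts as "responding".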


If I set workerThreads=72 on the 64-core node, I can no longer reproduce the freeze and the node behaves correctly.
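
The workaround can be written down as a small check plus the command that applies it. This is a sketch under assumptions: the rule "workerThreads must be at least the number of cores" is only the hypothesis raised in this post, not a documented requirement; the node name node64 used below is hypothetical; and whether the change needs a daemon restart on the node should be checked in the mmchconfig documentation for your release.

```python
import os


def worker_threads_at_least_cores(worker_threads, cores=None):
    """True if workerThreads is not smaller than the core count.

    Encodes the hypothesis from this post only; it is not an official rule.
    """
    cores = cores or os.cpu_count() or 1
    return worker_threads >= cores


def mmchconfig_command(worker_threads, node):
    """Build (but do not run) the mmchconfig call for a single node."""
    return ["mmchconfig", f"workerThreads={worker_threads}", "-N", node]
```

For the node in this report, worker_threads_at_least_cores(72, cores=64) holds, and mmchconfig_command(72, "node64") yields the argument list for "mmchconfig workerThreads=72 -N node64".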

Is this a known bug when the number of cores is greater than workerThreads?

Best regards,
[--]
Nicolas Capit