[gpfsug-discuss] gpfs.snap taking a really long time (4.2.3.6 efix17)

Aaron Knister aaron.s.knister at nasa.gov
Sat Mar 10 21:39:28 GMT 2018


Hey All,

I've noticed after upgrading from 4.1 to 4.2.3.6 efix17 that a gpfs.snap 
now takes a really long time as in... a *really* long time. Digging into 
it I can see that the snap command is actually done but the sshd child 
is left waiting on a sleep process on the clients (a sleep 600 at that). 
Trying to get 3500 nodes snapped in chunks of 64 nodes that each take 10 
minutes looks like it'll take a good 10 hours.

It seems the trouble is in the runCommand function in gpfs.snap. The 
function creates a child process to act as a sort of alarm to kill the 
specified command if it exceeds the timeout. The problem is that while 
the alarm process gets killed, the kill signal isn't passed to the sleep 
process (because the sleep command is run as a separate child process of 
the "alarm" child shell process), so the sleep is left running.

In gpfs.snap changing this:
[[ -n $sleepingAgentPid ]] && $kill -9 $sleepingAgentPid > /dev/null 2>&1

to this:
[[ -n $sleepingAgentPid ]] && $kill -9 $(findDescendants $sleepingAgentPid) $sleepingAgentPid > /dev/null 2>&1

seems to fix the behavior.
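
For anyone who wants to see the effect outside of gpfs.snap, here is a 
rough standalone sketch of what I believe is going on. It is not the 
gpfs.snap code: children_of, descendants_of, is_running and start_alarm 
are made-up helper names (descendants_of only stands in for whatever 
findDescendants really does), the sleep is shortened to 30 seconds, and 
it assumes a ps that accepts "-e -o pid= -o ppid= -o stat=" (procps on 
Linux does).

```shell
#!/bin/sh
# Sketch only: all helper names are hypothetical, not taken from gpfs.snap.

# Print the direct children of PID $1.
children_of() {
    ps -e -o pid= -o ppid= | awk -v p="$1" '$2 == p { print $1 }'
}

# Print every descendant of PID $1 (children, grandchildren, ...).
descendants_of() {
    for child in $(children_of "$1"); do
        echo "$child"
        descendants_of "$child"
    done
}

# Succeed if PID $1 exists and is not a zombie.
is_running() {
    ps -e -o pid= -o stat= |
        awk -v p="$1" '$1 == p && $2 !~ /^Z/ { found = 1 } END { exit !found }'
}

# Start a stand-in "alarm": a subshell that runs sleep as a child process.
start_alarm() {
    ( sleep 30; : "timeout action would go here" ) >/dev/null 2>&1 &
    alarm_pid=$!
    sleep 1                                   # let the subshell fork its sleep
    sleeper_pid=$(children_of "$alarm_pid")
}

# Case 1: kill only the alarm shell; its sleep child is orphaned and lives on.
start_alarm
kill -9 "$alarm_pid"
wait "$alarm_pid" 2>/dev/null || true
sleep 1
if is_running "$sleeper_pid"; then without_fix=survived; else without_fix=killed; fi
kill -9 "$sleeper_pid" 2>/dev/null || true    # clean up the orphan

# Case 2: kill the descendants together with the alarm shell.
start_alarm
kill -9 $(descendants_of "$alarm_pid") "$alarm_pid"
wait "$alarm_pid" 2>/dev/null || true
sleep 1
if is_running "$sleeper_pid"; then with_fix=survived; else with_fix=killed; fi

echo "without fix: sleep $without_fix"
echo "with fix:    sleep $with_fix"
```

The first case mirrors the original kill line (only the alarm shell's 
PID is signalled), the second mirrors the modified line. SIGKILL is not 
forwarded to children, which is why the sleep has to be named explicitly 
in the kill.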

I'll open a PMR for this shortly but I'm just wondering if anyone else 
has seen this.

-Aaron


-- 
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776


