[gpfsug-discuss] Critical Hang issues with GPFS 5.0. Downgrading from GPFS 5.0.0-2 to GPFS 4.2.3.2

Damir Krstic damir.krstic at gmail.com
Fri Nov 2 15:55:27 GMT 2018


Hi,

Did you ever figure out the root cause of the issue? We recently (at the
end of June) upgraded our storage to gpfs.base-5.0.0-1.1.3.ppc64.

In the last few weeks we have seen an increasing number of ps hangs across
compute and login nodes on our cluster. The filesystem version (of all
filesystems on our cluster) is:
 -V                 15.01 (4.2.0.0)          File system version
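
For reference, that line comes from mmlsfs; the check looks roughly like
this (gpfs0 stands in for your actual filesystem device name):

    mmlsfs gpfs0 -V     # report the filesystem format version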

I am just wondering if anyone has seen this type of issue since you first
reported it and if there is a known fix for it.

Damir

On Tue, May 22, 2018 at 10:43 AM <valleru at cbio.mskcc.org> wrote:

> Hello All,
>
> We upgraded from GPFS 4.2.3.2 to GPFS 5.0.0-2 about a month ago. We have
> not yet converted the 4.2.2.2 filesystem version to 5 (that is, we have
> not run the mmchconfig release=LATEST command).
> Right after the upgrade, we started seeing many "ps" hangs across the
> cluster. All of the "ps" hangs happen when jobs run a Java process or
> many Java threads (example: GATK).
> The hangs are fairly random and follow no particular pattern, except that
> they always involve either Java or jobs reading from directories
> containing about 600,000 files.
>
> I raised an IBM critical service request about this about a month ago -
> PMR: 24090,L6Q,000.
> However, according to the ticket, they seem to feel that it might not be
> related to GPFS.
> We are sure, though, that these hangs started to appear only after we
> upgraded from GPFS 4.2.3.2 to 5.0.0-2.
>
> One of the other reasons we are not able to prove that it is GPFS is
> that we are unable to capture any logs or traces from GPFS once the hang
> happens.
> Even the GPFS trace commands hang once "ps" hangs, which makes it
> difficult to get any dumps from GPFS.
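>
> For reference, the trace and dump commands we attempt when a hang occurs
> are along these lines (a sketch; exact options vary by installation):
>
>     mmtracectl --start    # begin low-level GPFS tracing
>     mmtracectl --stop     # stop tracing and write out the trace file
>     gpfs.snap             # collect a full support snapshot for IBM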
>
> Also, according to the IBM ticket, they seem to have seen a "ps hang"
> issue like this before, and their suggestion is that we run the
> mmchconfig release=LATEST command, which should resolve the issue.
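>
> As I understand it, that command permanently raises the cluster's
> minimum release level; the filesystem format itself is upgraded
> separately, with mmchfs -V. Checking and changing the release level
> would look roughly like this:
>
>     mmlsconfig minReleaseLevel   # show the current cluster release level
>     mmchconfig release=LATEST    # raise it permanently (not reversible)
>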
> However, we are not comfortable making the permanent change to filesystem
> version 5, and since we see no imminent solution to these hangs, we are
> considering downgrading to GPFS 4.2.3.2, the last state in which we know
> the cluster was stable.
>
> Can downgrading GPFS take us back to exactly the previous GPFS
> configuration state?
> With respect to downgrading from 5 to 4.2.3.2: is it just a matter of
> reinstalling all the RPMs at the previous version, or is there anything
> else I need to check with respect to the GPFS configuration?
> I ask because GPFS 5.0 might have updated internal default GPFS
> configuration parameters, and I am not sure whether downgrading GPFS will
> change them back to what they were in GPFS 4.2.3.2.
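>
> Before any downgrade we would snapshot the current configuration for
> comparison afterwards; a rough sketch (gpfs0 is a placeholder for the
> actual device name):
>
>     mmlsconfig > /root/mmlsconfig.before        # dump of cluster config
>     mmbackupconfig gpfs0 -o /root/gpfs0.cfgbak  # filesystem config backup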
>
> Our previous state:
>
> 2 Storage clusters - 4.2.3.2
> 1 Compute cluster - 4.2.3.2  ( remote mounts the above 2 storage clusters )
>
> Our current state:
>
> 2 Storage clusters - 5.0.0-2 (filesystem version: 4.2.2.2)
> 1 Compute cluster - 5.0.0-2
>
> Do I need to downgrade all the clusters to get back to the previous
> state, or is it OK to downgrade just the compute cluster?
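>
> To confirm what each cluster is actually running before deciding,
> something like the following on a node in each cluster should do:
>
>     mmdiag --version      # daemon build running on this node
>     rpm -qa | grep gpfs   # installed GPFS packages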
>
> Any advice on the best steps forward would greatly help.
>
> Thanks,
>
> Lohit
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>