[gpfsug-discuss] gpfs client cluster, lost quorum, ccr issues

Wed Jun 27 22:14:23 BST 2018

Hi Renata,

You may want to reduce the set of quorum nodes.  If your version supports
the --force option, you can run

mmchnode --noquorum -N <broken-nodes> --force

It is a good idea to configure tiebreaker disks in a cluster that has only
2 quorum nodes.
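
As a rough illustration of why shrinking the quorum set helps, here is a minimal sketch of the majority arithmetic. It assumes node quorum is a simple majority (one plus half of the defined quorum nodes). It also assumes, from the description in the original message, that only one of the three defined quorum nodes still holds valid cluster configuration state. The function names are hypothetical and are not part of any GPFS tooling.

```python
def quorum_needed(defined_quorum_nodes: int) -> int:
    """Smallest number of reachable quorum nodes that forms a majority."""
    return defined_quorum_nodes // 2 + 1


def has_quorum(reachable: int, defined_quorum_nodes: int) -> bool:
    """True if the reachable quorum nodes form a majority of those defined."""
    return reachable >= quorum_needed(defined_quorum_nodes)


# Assumed reading of the thread: 3 quorum nodes defined, 1 still usable.
print(quorum_needed(3), has_quorum(1, 3))   # prints: 2 False

# After dropping the two broken nodes from the quorum set: 1 defined, 1 usable.
print(quorum_needed(1), has_quorum(1, 1))   # prints: 1 True
```

With three quorum nodes defined, two must be reachable, so a single survivor cannot satisfy the check; once the broken nodes are no longer counted, the same survivor is a majority on its own.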

Regards, The Spectrum Scale (GPFS) team

------------------------------------------------------------------------------------------------------------------

If you feel that your question can benefit other users of Spectrum Scale
(GPFS), then please post it to the public IBM developerWorks Forum at
https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479.

If your query concerns a potential software error in Spectrum Scale (GPFS)
and you have an IBM software maintenance contract, please contact
1-800-237-5511 in the United States or your local IBM Service Center in
other countries.

The forum is informally monitored as time permits and should not be used
for priority messages to the Spectrum Scale (GPFS) team.

From:	Renata Maria Dart <renata at slac.stanford.edu>
To:	gpfsug-discuss at spectrumscale.org
Date:	06/27/2018 02:21 PM
Subject:	[gpfsug-discuss] gpfs client cluster, lost quorum, ccr issues
Sent by:	gpfsug-discuss-bounces at spectrumscale.org

Hi, we have a client cluster of 4 nodes with 3 quorum nodes.  One of the
quorum nodes is no longer in service and another was reinstalled with
a newer OS, both without informing the gpfs admins.  Gpfs is still
"working" on the two remaining nodes, that is, they continue to have access
to the gpfs data on the remote clusters.  But I can no longer get
any gpfs commands to work.  On one of the 2 nodes that are still serving
data:

[root at ocio-gpu01 ~]# mmlscluster
get file failed: Not enough CCR quorum nodes available (err 809)
gpfsClusterInit: Unexpected error from ccr fget mmsdrfs.  Return code: 158
mmlscluster: Command failed. Examine previous error messages to determine
cause.

On the reinstalled node, this fails in the same way:

[root at ocio-gpu02 ccr]# mmstartup
get file failed: Not enough CCR quorum nodes available (err 809)
gpfsClusterInit: Unexpected error from ccr fget mmsdrfs.  Return code: 158
mmstartup: Command failed. Examine previous error messages to determine
cause.

I have looked through the users group interchanges but didn't find anything
that seems to fit this scenario.

Is there a way to salvage this cluster?  Can it be done without
shutting gpfs down on the 2 nodes that continue to work?

Thanks for any advice,

Renata Dart
SLAC National Accelerator Lab

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
