[gpfsug-discuss] gpfs client cluster, lost quorum, ccr issues

Renata Maria Dart renata at slac.stanford.edu
Tue Jul 10 19:35:28 BST 2018


Hi, many thanks for all of the suggestions on how to deal with this issue.
For the record, I tried this

mmchnode --noquorum -N <broken-nodes> --force

on the node that was reinstalled, which reinstated some of the
communications between the cluster nodes.  But when I restarted
the cluster, communications began to fail again, complaining about
not enough CCR nodes for quorum.  I ended up
reinstalling the cluster, since at that point the nodes couldn't mount
the remote data and I thought a rebuild would be faster.
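
A quick way to confirm whether such a change took effect (a sketch; both
commands need enough CCR nodes responding before they will run at all):

# Show the cluster configuration, including the current quorum node list.
mmlscluster

# Show the daemon state on every node before restarting anything.
mmgetstate -a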

Thanks again for all of the responses,

Renata Dart
SLAC National Accelerator Lab



On Wed, 27 Jun 2018, IBM Spectrum Scale wrote:

>
>Hi Renata,
>
>You may want to reduce the set of quorum nodes.  If your version supports
>the --force option, you can run
>
>mmchnode --noquorum -N <broken-nodes> --force
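>
>One approach sometimes used for a reinstalled node, once enough CCR nodes
>respond again, is to pull its configuration back from a healthy peer with
>mmsdrrestore.  A sketch, with ocio-gpu01 standing in for a node that still
>holds an intact copy of the configuration:
>
># Run on the reinstalled node to recover its GPFS configuration files,
># then try starting the daemon again.
>mmsdrrestore -p ocio-gpu01 -R /usr/bin/scp
>mmstartup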
>
>It is a good idea to configure tiebreaker disks in a cluster that has only
>2 quorum nodes.
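>
>A minimal sketch of that, assuming two NSDs named nsd1 and nsd2 (older
>releases require the daemon to be down cluster-wide before this setting
>can be changed):
>
># Stop GPFS everywhere, designate the tiebreaker disks, then restart.
>mmshutdown -a
>mmchconfig tiebreakerDisks="nsd1;nsd2"
>mmstartup -a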
>
>Regards, The Spectrum Scale (GPFS) team
>
>------------------------------------------------------------------------------------------------------------------
>
>If you feel that your question can benefit other users of Spectrum Scale
>(GPFS), then please post it to the public IBM developerWorks Forum at
>https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479.
>
>
>If your query concerns a potential software error in Spectrum Scale (GPFS)
>and you have an IBM software maintenance contract please contact
>1-800-237-5511 in the United States or your local IBM Service Center in
>other countries.
>
>The forum is informally monitored as time permits and should not be used
>for priority messages to the Spectrum Scale (GPFS) team.
>
>
>
>From:	Renata Maria Dart <renata at slac.stanford.edu>
>To:	gpfsug-discuss at spectrumscale.org
>Date:	06/27/2018 02:21 PM
>Subject:	[gpfsug-discuss] gpfs client cluster, lost quorum, ccr issues
>Sent by:	gpfsug-discuss-bounces at spectrumscale.org
>
>
>
>Hi, we have a client cluster of 4 nodes with 3 quorum nodes.  One of the
>quorum nodes is no longer in service and another was reinstalled with
>a newer OS, both without the GPFS admins being informed.  GPFS is still
>"working" on the two remaining nodes, that is, they continue to have access
>to the GPFS data on the remote clusters.  But I can no longer get
>any GPFS commands to work.  On one of the two nodes that are still serving
>data:
>
>[root@ocio-gpu01 ~]# mmlscluster
>get file failed: Not enough CCR quorum nodes available (err 809)
>gpfsClusterInit: Unexpected error from ccr fget mmsdrfs.  Return code: 158
>mmlscluster: Command failed. Examine previous error messages to determine cause.
>
>
>On the reinstalled node, this fails in the same way:
>
>[root@ocio-gpu02 ccr]# mmstartup
>get file failed: Not enough CCR quorum nodes available (err 809)
>gpfsClusterInit: Unexpected error from ccr fget mmsdrfs.  Return code: 158
>mmstartup: Command failed. Examine previous error messages to determine cause.
>
>
>I have looked through the users group discussions but didn't find anything
>that fits this scenario.
>
>Is there a way to salvage this cluster?  Can it be done without
>shutting GPFS down on the two nodes that continue to work?
>
>Thanks for any advice,
>
>Renata Dart
>SLAC National Accelerator Lab
>
>_______________________________________________
>gpfsug-discuss mailing list
>gpfsug-discuss at spectrumscale.org
>http://gpfsug.org/mailman/listinfo/gpfsug-discuss



