[gpfsug-discuss] correct way of taking IO server down for maintenance

Damir Krstic damir.krstic at gmail.com
Tue Dec 20 17:19:48 GMT 2016


For the sake of everyone else on this listserv, I'll highlight the appropriate
procedure here. It turns out that changing recovery groups on an active system
is not recommended by IBM. We tried following Jan's recommendation this
morning, and the system became unresponsive for about 30 minutes. It only
became responsive (and the recovery group change finished) after we killed a
couple of processes (ssh and scp) going to a couple of clients.

I opened a Sev. 1 with IBM, and they tell me that the appropriate steps for
IO server maintenance are as follows (see the command sketch after the list):

1. change the cluster manager to a system that will stay up (check with mmlsmgr, change with mmchmgr)
2. unmount GPFS on the IO node that is going down
3. shut down GPFS on the IO node that is going down
4. shut down the OS
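
A minimal command sketch of those four steps, assuming a filesystem named
fs0, with io1 as the server going down and io2 as the server staying up
(all names hypothetical):

   mmlsmgr                # check current cluster and filesystem managers
   mmchmgr fs0 io2        # move the filesystem manager off io1, if it holds it
   mmchmgr -c io2         # move the cluster manager off io1, if it holds it
   mmumount all -N io1    # unmount GPFS filesystems on io1
   mmshutdown -N io1      # shut down GPFS on io1
   # then shut down the OS on io1 itself (e.g. systemctl poweroff)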

That's it; the recovery groups should not be changed. If there is a need to
change a recovery group, use the --active option (not a permanent change).
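
A sketch of what that temporary change looks like, assuming recovery groups
named rg_io1 and rg_io2 served by io1 and io2 (names hypothetical); as I
understand it, --active moves where the recovery group is currently served
without changing its defined server list:

   mmchrecoverygroup rg_io1 --active io2   # temporarily serve rg_io1 from io2
   mmlsrecoverygroup rg_io1 -L             # confirm which server is now active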

We are now stuck in a situation where the io2 server owns both recovery
groups. The way IBM tells us to fix this is to unmount the filesystem on
all clients and then change the recovery groups. We can't do that now and
will have to schedule maintenance sometime in 2017. For now, we have switched
the recovery groups using the --active flag and things (filesystem performance)
seem to be OK. The load average on both IO servers is quite high (around 250)
and does not seem to be going down.
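
For anyone who wants to check which server currently owns each recovery
group, this is roughly how we have been verifying it (recovery group names
hypothetical):

   mmlsrecoverygroup             # list all recovery groups and their servers
   mmlsrecoverygroup rg_io1 -L   # detailed view, including the active server
   mmlsrecoverygroup rg_io2 -L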

I really wish these maintenance procedures were documented somewhere on the
IBM website. This morning's experience has really shaken my confidence in
ESS.

Damir

On Mon, Dec 19, 2016 at 9:53 AM Jan-Frode Myklebust <janfrode at tanso.net>
wrote:

>
> Move its recovery groups to the other node by putting the other node as
> the primary server for it:
>
> mmchrecoverygroup rgname --servers otherServer,thisServer
>
> Then verify that it's now active on the other node with "mmlsrecoverygroup
> rgname -L".
>
> Move away any filesystem manager or cluster manager roles that are active
> on it. Check with mmlsmgr; move with mmchmgr / mmchmgr -c.
>
> Then you can run mmshutdown on it (assuming you still have enough quorum
> nodes in the remaining cluster).
>
>
>   -jf
>
> Mon. 19 Dec 2016 at 15:53, Damir Krstic <damir.krstic at gmail.com> wrote:
>
> We have a single ESS GL6 system running GPFS 4.2.0-1. Last night one of
> the IO servers phoned home with a memory error. IBM is coming out today to
> replace the faulty DIMM.
>
> What is the correct way of taking this system out for maintenance?
>
> Before ESS we had a large GPFS 3.5 installation with 14 IO servers. When
> we needed to do maintenance on the old system, we would migrate the manager
> role and also move the primary and secondary server roles if one of those
> systems had to be taken down.
>
> With ESS and resource pool manager roles etc., is there a correct way of
> shutting down one of the IO servers for maintenance?
>
> Thanks,
> Damir
>
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>