[gpfsug-discuss] GPFS 3.5 to 4.1 Upgrade Question

Aaron Knister aaron.s.knister at nasa.gov
Sun Dec 11 15:07:09 GMT 2016


I thought I'd share this with folks. I saw some log asserts in our test
environment (~1050 client nodes and 12 manager/server nodes). I'm going
from 3.5.0.31 (well, two clients are still at 3.5.0.19) to 4.1.1.10.
I've been running filebench in a loop for the past several days. It's
sustaining about 60k write IOPS and about 15k read IOPS to the metadata
disks for the filesystem I'm testing with, so I'd say it's being pushed
reasonably hard. The test cluster had 4.1 clients before it had 4.1
servers, but after flipping 420 clients from 3.5.0.31 to 4.1.1.10 and
starting up filebench I'm now seeing periodic log asserts on the
manager/server nodes:

Dec 11 08:57:39 loremds12 mmfs: Generic error in /project/sprelfks2/build/rfks2s010a/src/avs/fs/mmfs/ts/tm/HandleReq.C line 304  retCode 0, reasonCode 0
Dec 11 08:57:39 loremds12 mmfs: mmfsd: Error=MMFS_GENERIC, ID=0x30D9195E, Tag=4908715
Dec 11 08:57:39 loremds12 mmfs: Tag=4908715
Dec 11 08:57:39 loremds12 mmfs: Tag=4908715   (!"downgrade to mode which is not StrictlyWeaker")
Dec 11 08:57:39 loremds12 mmfs: Tag=4908715   node 584 old mode ro new mode (A:    D:  A)
Dec 11 08:57:39 loremds12 mmfs: [X] logAssertFailed: (!"downgrade to mode which is not StrictlyWeaker")
Dec 11 08:57:39 loremds12 mmfs: [X] return code 0, reason code 0, log record tag 0
Dec 11 08:57:42 loremds12 mmfs: [N] Signal 6 at location 0x7FF4E5F456D5 in process 12188, link reg 0xFFFFFFFFFFFFFFFF.
Dec 11 08:57:42 loremds12 mmfs: [X] *** Assert exp((!"downgrade to mode which is not StrictlyWeaker") node 584 old mode ro new mode (A:    D:  A)  ) in line 304 of file /project/sprelfks2/build/rfks2s010a/src/avs/fs/mmfs/ts/tm/HandleReq.C
Dec 11 08:57:42 loremds12 mmfs: [E] *** Traceback:
Dec 11 08:57:42 loremds12 mmfs: [E]         2:0x9F95E9 logAssertFailed.9F9440 + 0x1A9 at ??:0
Dec 11 08:57:42 loremds12 mmfs: [E]         3:0x1232836 TokenClass::fixClientMode(Token*, int, int, int, CopysetRevoke*).1232350 + 0x4E6 at ??:0
Dec 11 08:57:42 loremds12 mmfs: [E]         4:0x1235593 TokenClass::HandleTellRequest(RpcContext*, Request*, char**, int).1232AD0 + 0x2AC3 at ??:0
Dec 11 08:57:42 loremds12 mmfs: [E]         5:0x123A23C HandleTellRequestInterface(RpcContext*, Request*, char**, int).123A0D0 + 0x16C at ??:0
Dec 11 08:57:42 loremds12 mmfs: [E]         6:0x125C6B0 queuedTellServer(RpcContext*, Request*, int, unsigned int).125C670 + 0x40 at ??:0
Dec 11 08:57:42 loremds12 mmfs: [E]         7:0x125EF72 tmHandleTellServer(RpcContext*, char*).125EEC0 + 0xB2 at ??:0
Dec 11 08:57:42 loremds12 mmfs: [E]         8:0xA12668 tscHandleMsg(RpcContext*, MsgDataBuf*).A120D0 + 0x598 at ??:0
Dec 11 08:57:42 loremds12 mmfs: [E]         9:0xA1BC4E RcvWorker::RcvMain().A1BB50 + 0xFE at ??:0
Dec 11 08:57:42 loremds12 mmfs: [E]         10:0xA1BD5B RcvWorker::thread(void*).A1BD00 + 0x5B at ??:0
Dec 11 08:57:42 loremds12 mmfs: [E]         11:0x622126 Thread::callBody(Thread*).6220E0 + 0x46 at ??:0
Dec 11 08:57:42 loremds12 mmfs: [E]         12:0x61220F Thread::callBodyWrapper(Thread*).612180 + 0x8F at ??:0
Dec 11 08:57:42 loremds12 mmfs: [E]         13:0x7FF4E6BE66B6 start_thread + 0xE6 at ??:0
Dec 11 08:57:42 loremds12 mmfs: [E]         14:0x7FF4E5FEE06D clone + 0x6D at ??:0
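
For context, the load generator here is nothing exotic: it's just filebench
run in a loop against the test filesystem, roughly like the sketch below.
The workload file name is a placeholder rather than my actual personality
file, and the target directory gets set via "set $dir=..." inside that .f
file:

#!/bin/bash
# Rough sketch of the stress loop: run a filebench workload repeatedly
# against the filesystem under test. The workload file below is a
# placeholder, not the exact personality file I'm using.
WORKLOAD=/root/filebench/fileserver.f

while true; do
    # Each pass lays out the fileset, runs for the duration defined in
    # the workload file, prints its summary, and then starts over.
    filebench -f "$WORKLOAD" || break
done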

I've seen a few different variants of that third "Tag=" line across these asserts:

Dec 11 00:16:40 loremds11 mmfs: Tag=5012168   node 825 old mode ro new mode 0x31
Dec 11 01:52:53 loremds10 mmfs: Tag=5016618   node 655 old mode ro new mode (A: MA D:   )
Dec 11 02:15:57 loremds10 mmfs: Tag=5045549   node 994 old mode ro new mode (A:  A D:  A)
Dec 11 08:14:22 loremds10 mmfs: Tag=5067054   node 237 old mode ro new mode 0x08
Dec 11 08:57:39 loremds12 mmfs: Tag=4908715   node 584 old mode ro new mode (A:    D:  A)
Dec 11 00:47:39 loremds09 mmfs: Tag=4998635   node 461 old mode ro new mode (A:R   D:   )

It's interesting to note that the nodes behind all of these indexes are
still running 3.5. I'm going to open a PMR, but I thought I'd share the
gory details here and see if folks have any insight. I'm starting to
wonder if 4.1 clients are more tolerant of 3.5 servers than 4.1 servers
are of 3.5 clients.
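
If anyone wants to do the same cross-check on their own cluster, something
along these lines should be close. It assumes the node number column in
mmlscluster lines up with the index reported in the assert, that root ssh
to the daemon node names works, and that your syslog lands in
/var/log/messages*:

#!/bin/bash
# Map the node indexes from the assert messages back to node names and
# installed GPFS levels. Assumes the "Node" column of mmlscluster matches
# the index in the assert and that passwordless root ssh works.
grep -h 'old mode ro new mode' /var/log/messages* \
    | sed -n 's/.*node \([0-9][0-9]*\) old mode.*/\1/p' \
    | sort -un \
    | while read -r idx; do
        # Column 1 of the mmlscluster node list is the node number,
        # column 2 the daemon node name.
        name=$(/usr/lpp/mmfs/bin/mmlscluster | awk -v n="$idx" '$1 == n {print $2; exit}')
        ver=$(ssh "$name" rpm -q gpfs.base 2>/dev/null)
        printf '%s\t%s\t%s\n' "$idx" "$name" "$ver"
      done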

-Aaron

On 12/5/16 4:31 PM, Aaron Knister wrote:
> Hi Everyone,
>
> The GPFS documentation
> (http://www.ibm.com/support/knowledgecenter/SSFKCN_4.1.0/com.ibm.cluster.gpfs.v4r1.gpfs300.doc/bl1ins_migratl.htm)
> has this to say about the duration of an upgrade from 3.5 to 4.1:
>
>> Rolling upgrades allow you to install new GPFS code one node at a time
>> without shutting down GPFS on other nodes. However, you must upgrade
>> all nodes within a short time. The time dependency exists because some
>> GPFS 4.1 features become available on each node as soon as the node is
>> upgraded, while other features will not become available until you
>> upgrade all participating nodes.
>
> Does anyone have a feel for what "a short time" means? I'm looking to
> upgrade from 3.5.0.31 to 4.1.1.10 in a rolling fashion, but given the
> size of our system it might take several weeks to complete. This
> language makes me worry that after some period of time something bad
> will happen, but I don't know what that period of time is.
>
> Also, if anyone has done a rolling 3.5 to 4.1 upgrade and has any
> anecdotes they'd like to share, I would like to hear them.
>
> Thanks!
>
> -Aaron
>
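
(One more note while I'm at it: for tracking where a rolling upgrade
actually stands, these are the checks I'd lean on. The command names are
the standard GPFS ones; "testfs" is a placeholder filesystem name.)

# Cluster-wide function level; this stays at the old release until
# "mmchconfig release=LATEST" is run once every node has been upgraded.
/usr/lpp/mmfs/bin/mmlsconfig minReleaseLevel

# Daemon build level on the local node.
/usr/lpp/mmfs/bin/mmdiag --version

# On-disk filesystem format version; it only moves forward when
# "mmchfs testfs -V full" (or -V compat) is run as a separate,
# deliberate step after the code upgrade.
/usr/lpp/mmfs/bin/mmlsfs testfs -V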

-- 
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776


