[gpfsug-discuss] Joining RDMA over different networks?
Jonathan Buzzard
jonathan.buzzard at strath.ac.uk
Tue Aug 22 11:20:38 BST 2023
On 22/08/2023 10:51, Kidger, Daniel wrote:
>
> Jonathan,
>
> Thank you for the great answer!
> Just to be clear though - are you talking about TCP/IP mounting of the filesystem(s) rather than RDMA ?
>
Yes, for a few reasons. Firstly, a bunch of our Ethernet adaptors don't
support RDMA. Secondly, there are a lot of ducks to be got in line, and
kept in line, for RDMA to work, and that's too much effort IMHO. Thirdly,
the nodes can peg the 10Gbps interface they have, which is a hard QoS
that we are happy with. Though if specifying today we would have 25Gbps
to the compute nodes and 100, possibly 200, Gbps on the DSS-G nodes.
Basically we don't want one node to go nuts and monopolize the file
system :-) The DSS-G nodes don't have an issue keeping up, so I am not
sure there is much performance benefit to be had from RDMA.
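For what it's worth, a quick way to see which of a node's adaptors the kernel has registered as RDMA-capable is to look under /sys/class/infiniband (RoCE-capable Ethernet NICs show up there too, not just InfiniBand HCAs). A minimal sketch, nothing GPFS-specific:

```python
import os

def rdma_devices(sysfs="/sys/class/infiniband"):
    """List RDMA-capable devices the kernel has registered.

    Returns an empty list when the directory is absent, i.e. no
    RDMA-capable adaptors (or no RDMA drivers loaded) on this node.
    """
    try:
        return sorted(os.listdir(sysfs))
    except FileNotFoundError:
        return []

print(rdma_devices() or "no RDMA-capable adaptors registered")
```

On a node with only plain Ethernet NICs this prints the "no RDMA-capable adaptors" message, which is exactly the first of the reasons above.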
That said, you are supposed to be able to do IPoIB over the RDMA
hardware's network, and I had presumed that the same could be said of
TCP/IP over RDMA-capable Ethernet.
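For reference, on Spectrum Scale the RDMA ducks are lined up with the verbsRdma and verbsPorts configuration options. A hedged sketch only; the device/port name (mlx5_0/1) is a placeholder for whatever your own adaptors are called, and the whole-cluster restart shown is the blunt version:

```shell
# Enable RDMA (verbs) transport for GPFS daemon traffic.
# Device/port names below are placeholders for your own hardware.
mmchconfig verbsRdma=enable
mmchconfig verbsPorts="mlx5_0/1"

# The mmfsd daemon must be restarted on the affected nodes
# for the change to take effect.
mmshutdown -a && mmstartup -a
```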
> I think routing of RDMA is perhaps something only Lustre can do?
>
Possibly. Something else is that we have our DSS-G nodes doing MLAGs
over a pair of switches. I need to be able to do firmware updates on the
network switches the DSS-G nodes are connected to without shutting down
the cluster. Reading the switch manuals, I don't think you can do that
with RDMA, so that's another reason not to do it IMHO. In the 2020s the
mantra is patch, baby, patch, and everything is focused on making that
quick and easy to achieve. Your expensive HPC system is for jack if
hackers have taken it over because you didn't patch it in a timely
fashion. Also I would have a *lot* of explaining to do, which I would
rather not.
Also, in our experience storage is rarely the bottleneck. When it is,
e.g. when Gromacs is creating a ~1TB temp file at 10Gbps (yeah, that's a
real thing we have observed on a fairly regular basis), that's an
intended QoS so everyone else can get work done and I don't get a bunch
of tickets from users complaining about the file system performing
badly. We have seen enough simultaneous Gromacs jobs that without the
10Gbps hard QoS the filesystem would have been brought to its knees.
We can't do the temp files locally on the node because we only spec'ed
them with 1TB local disks and the Gromacs temp files regularly exceed
the available local space. Also getting users to do it would be a
nightmare :-)
JAB.
--
Jonathan A. Buzzard Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG