[gpfsug-discuss] Joining RDMA over different networks?
Jonathan Buzzard
jonathan.buzzard at strath.ac.uk
Thu Aug 24 19:42:33 BST 2023
On 24/08/2023 17:46, Alec wrote:
> So why not use the built in QOS features of Spectrum Scale to adjust the
> performance of a particular fileset, that way you can ensure you have
> appropriate bandwidth?
>
Because all the users' files are in the same fileset would be the simple
answer. There is way, way too much administration overhead for that to
change. There is a huge amount of KISS involved in the cluster design.
Also, it's only a subset of John's jobs that peg the network. Oh, and at
tender we didn't know we would get GPFS, so we had to account for that
in the system design.
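For anyone curious, the QoS throttling Alec refers to is the mmchqos
mechanism, which caps I/O per storage pool rather than per user. A minimal
sketch of what it looks like (the filesystem name gpfs01 and the IOPS cap
are made up for illustration):

```shell
# Sketch only: enable QoS on filesystem "gpfs01" (name assumed), capping
# ordinary ("other") I/O against the system pool at 10000 IOPS while
# leaving the maintenance class unthrottled.
mmchqos gpfs01 --enable pool=system,other=10000IOPS,maintenance=unlimited

# Show the currently configured QoS settings.
mmlsqos gpfs01
```

Note the granularity: the cap applies to a pool-wide class of I/O, which is
exactly why it is a poor fit when everyone shares one fileset.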
As a side note, GPU nodes get 40Gbps network connections, so I am
bandwidth limiting by node type.
The flip side is that the high speed network (Omnipath in this case) has
been reserved for MPI (or similar) traffic.
Basically we observed that core counts were growing at a faster rate than
Infiniband/Omnipath bandwidth. We went from 12 cores a node to 40 cores,
but only from 40Gbps Infiniband to 100Gbps Omnipath. So rather than
mixing both storage and MPI on the same fabric, we moved the storage out
onto 10Gbps Ethernet, which for >99% of users is adequate, and freed up
capacity on the low latency, high speed network for the MPI traffic. I
stand by that design choice 110%.
Then, because the low latency/high speed network is only for MPI traffic,
we don't need to equip all nodes with Omnipath (as the tender turned
out), which saved $$$$ that could be spent elsewhere. A login node, for
example, does just fine with plain Ethernet. As does a large memory (3TB
RAM) node which doesn't run multinode jobs. The same goes for GPU nodes,
and it worked in our favour again when we added a whole bunch of refurb
Ethernet-only connected standard nodes last year because we had capacity
problems. Most of our jobs run on a single node, so topology aware
scheduling in Slurm comes to the rescue. It is a cheap addition if your
storage is on commodity Ethernet; it would have been horrendously
expensive for Omnipath.
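For the record, topology aware scheduling in Slurm just needs the tree
topology plugin plus a topology.conf describing which nodes hang off
which switch, so the scheduler packs jobs within a switch. A sketch with
entirely made-up switch names and node ranges:

```shell
# slurm.conf (fragment): select the tree topology plugin.
TopologyPlugin=topology/tree

# topology.conf: hypothetical layout -- two leaf switches of 40 nodes
# each, joined by a top-level switch. Node and switch names are assumed.
SwitchName=eth-sw1 Nodes=node[001-040]
SwitchName=eth-sw2 Nodes=node[041-080]
SwitchName=top Switches=eth-sw[1-2]
```

With that in place single-node jobs cost nothing extra, and the few
multi-node Ethernet jobs at least land under a common switch where
possible.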
There are also other considerations. Running GPFS is already enough of a
minority sport that running it over the likes of Omnipath or Infiniband
or even with RDMA is just asking for problems and fails the KISS test IMHO.
JAB.
--
Jonathan A. Buzzard Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG