[gpfsug-discuss] Joining RDMA over different networks?
Jonathan Buzzard
jonathan.buzzard at strath.ac.uk
Thu Aug 24 16:26:34 BST 2023
On 22/08/2023 11:52, Alec wrote:
> I wouldn't want to use GPFS if I didn't want my nodes to be able to go
> nuts, why bother to be frank.
>
Because there are multiple users on the system. Do you want to be the
one explaining to 50 other users that they can't use the system today
because John from Chemistry is pounding the filesystem to death for his
jobs? Didn't think so.
There is not an infinite amount of money available, and it is not
possible for a reasonable amount of money to build a file system that
every node can max out its network connection against at once.
> I had tested a configuration with a single x86 box and 4 x 100Gbe
> adapters talking to an ESS, that thing did amazing performance in excess
> of 25 GB/s over Ethernet. If you have a node that needs that
> performance build to it. Spend more time configuring QoS to fair share
> your bandwidth than baking bottlenecks into your configuration.
>
There are finite budgets and compromises have to be made. The
compromises we made back in 2017 when the specification was written and
put out to tender have held up really well.
> The reasoning of holding end nodes to a smaller bandwidth than the
> backend doesn't make sense. You want to clear "the work" as efficiently
> as possible, more than keep IT from having any constraints popping up.
> That's what leads to just endless dithering and diluting of
> infrastructure until no one can figure out how to get real performance.
>
It does, because a small number of jobs can hold the system to ransom
for lots of other users. I have to balance things across a large number
of nodes. There is only a finite amount of bandwidth to the storage and
it has to be shared out fairly. I could attempt to do that with QoS on
the switches, or I could say sod that for a lark: 10Gbps is all you get,
and let's keep it simple. Though, as I said, today it would be 25Gbps;
this was a specification written six years ago, when 25Gbps Ethernet was
rather exotic and too expensive.
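The sums behind capping node links are simple. As a back-of-envelope
sketch (the node count and backend figure below are purely illustrative
assumptions, not the actual ARCHIE-WeSt configuration; the 25 GB/s
matches the ESS number Alec quotes):

```python
# Back-of-envelope oversubscription check: capping each node's link is a
# crude but simple way to share a finite storage backend fairly.
# Node count and backend bandwidth are illustrative assumptions only.

def oversubscription(nodes: int, link_gbps: float,
                     backend_gbytes_per_s: float) -> float:
    """Ratio of aggregate node link capacity to storage backend capacity."""
    aggregate_gbps = nodes * link_gbps
    backend_gbps = backend_gbytes_per_s * 8  # GB/s -> Gbit/s
    return aggregate_gbps / backend_gbps

# 300 nodes each capped at 10 Gbps against a 25 GB/s (200 Gbit/s) backend:
ratio = oversubscription(300, 10, 25)
print(f"{ratio:.1f}x oversubscribed")  # prints "15.0x oversubscribed"
```

Even with every node capped at 10Gbps the storage is heavily
oversubscribed, which is exactly why the cap plus fair sharing matters.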
> So yeah 95% of the workloads don't care about their performance and can
> live on dithered and diluted infrastructure that costs a zillion times
> more money than what the 5% of workload that does care about bandwidth
> needs to spend to actually deliver.
>
They do care about performance, they just don't need to max out the
allotted per-node performance. However, if the performance of the file
system is bad, the performance of their jobs will also be bad and the
total FLOPS I get from the system will plummet through the floor.
Note it is more like 0.1% of jobs that peg the 10Gbps network interface
for any period of time at all.
> Build your infrastructure storage as high bandwidth as possible per node
> because compared to all the other costs it's a drop in the bucket...
> Don't cheap out on "cables".
No it's not. The Omnipath network (which, by the way, is deliberately
reserved for MPI) cost a *LOT* of money. We are having serious
conversations that, with current core counts per node, an
Infiniband/Omnipath network doesn't make sense any more, and that 25Gbps
Ethernet will do just fine for a standard compute node.
Around 85% of our jobs run on 40 cores (aka one node) or less. If you go
to 128 cores a node it's more like 95% of all jobs. If you go to 192
cores it's about 98% of all jobs. The maximum job size we allow
currently is 400 cores.
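Put another way, the cumulative figures above leave only a small tail of
jobs that would span multiple nodes at higher core counts. A trivial
calculation from the numbers quoted:

```python
# Fraction of jobs that would still need a multi-node interconnect,
# taken directly from the cumulative figures above:
# 85% of jobs fit in 40 cores, 95% in 128, 98% in 192.
single_node_fraction = {40: 0.85, 128: 0.95, 192: 0.98}

for cores_per_node, fraction in single_node_fraction.items():
    multi_node = 1 - fraction
    print(f"{cores_per_node:3d}-core nodes: "
          f"{multi_node:.0%} of jobs would span nodes")
```

At 192 cores per node only about 2% of jobs would touch the expensive
interconnect at all, which is what drives the thinking below.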
Better to ditch the expensive interconnect and use the hundreds of
thousands of dollars saved to buy more compute nodes, is the current
thinking. The 2% of users will just have longer runtimes, but there
will be a lot more FLOPS available in total, and they rarely have just
one job in the queue, so it will all balance out in the wash and be
positive for most users.
In consultation the users are on board with this direction of travel.
From our perspective, if a user absolutely needs more than 192 cores on
a modern system, it would not be unreasonable to direct them to a
national facility that can handle the really huge jobs. We are an
institutional HPC facility after all. We don't claim to be able to
handle a 1000-core job, for example.
JAB.
--
Jonathan A. Buzzard Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG