[gpfsug-discuss] Looking for a way to see which node is having an impact on server?

Mon Dec 9 21:05:41 GMT 2013

Hi Alex,

I should have mentioned that my GPFS network runs over
InfiniBand/RDMA, so looking at the TCP traffic probably won't work. I will
try to see if the traffic can be seen through ib0 (instead of eth0), but I
have my doubts.

As for the placement: the file system was 95% full when I added the new
NSDs. I know from the waiters output that what is waiting now is I/O
to the 2 NSDs:

waiting 0.791707000 seconds, NSDThread: for I/O completion on disk d9

I have added more NSDs since then, but the waiting is still on the 2
disks and none of the others.
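
To check that the waits really are confined to those 2 disks, the waiter lines could be tallied per disk. The following is only a minimal sketch: it assumes every line of interest has the same shape as the one quoted above, and the name tally_waiters is hypothetical. Waiter lines in any other format are skipped.

```python
import re
from collections import defaultdict

# Matches lines shaped like the waiter quoted above, e.g.
# "waiting 0.791707000 seconds, NSDThread: for I/O completion on disk d9".
# Assumption: other waiter formats exist and are deliberately ignored here.
WAITER_RE = re.compile(
    r"waiting\s+(?P<secs>\d+(?:\.\d+)?)\s+seconds,\s+"
    r"(?P<thread>\w+):\s+for I/O completion on disk\s+(?P<disk>\S+)"
)

def tally_waiters(lines):
    """Return {disk: (count, total_seconds, longest_seconds)}."""
    stats = defaultdict(lambda: [0, 0.0, 0.0])
    for line in lines:
        match = WAITER_RE.search(line)
        if match is None:
            continue  # not an I/O-completion waiter in the assumed format
        secs = float(match.group("secs"))
        entry = stats[match.group("disk")]
        entry[0] += 1                    # number of waiters on this disk
        entry[1] += secs                 # summed wait time
        entry[2] = max(entry[2], secs)   # longest single wait
    return {disk: tuple(values) for disk, values in stats.items()}

# Example with the single line quoted above.
sample = ["waiting 0.791707000 seconds, NSDThread: for I/O completion on disk d9"]
for disk, (count, total, longest) in tally_waiters(sample).items():
    print(f"{disk}: {count} waiter(s), total {total:.3f}s, longest {longest:.3f}s")
```

Feeding it the saved waiters output from all the NSD servers would show at a glance whether any disk other than the 2 ever appears.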

Richard

On 12/09/2013 02:52 PM, Alex Chekholko wrote:
> Hi Richard,
> 
> I would just use something like 'iftop' to look at the traffic between
> the nodes.  Or 'collectl'.  Or 'dstat'.
> 
> e.g. dstat -N eth0 --gpfs --gpfs-ops --top-cpu-adv --top-io 2 10
> http://dag.wiee.rs/home-made/dstat/
> 
> For the NSD balance question, since GPFS stripes the blocks evenly
> across all the NSDs, they will end up balanced over time.  Or you can
> rebalance manually with 'mmrestripefs -b' or similar.
> 
> It is unlikely that particular files ended up on a single NSD, unless
> the other NSDs are totally full.
> 
> Regards,
> Alex
> 
> On 12/06/2013 04:31 PM, Richard Lefebvre wrote:
>> Hi,
>>
>> I'm looking for a way to see which node (or nodes) is having an impact
>> on the GPFS server nodes and slowing down the whole file system. What
>> usually happens is that a user is doing some I/O that doesn't fit the
>> configuration of the GPFS file system or the way it was explained
>> to use it efficiently.  It is usually a lot of unbuffered, byte-sized,
>> very random I/O on a file system that was made for large files
>> and a large block size.
>>
>> My problem is finding out who is doing that. I haven't found a way to
>> pinpoint the node or nodes that could be the source of the problem, with
>> over 600 client nodes.
>>
>> I tried to use "mmlsnode -N waiters -L", but there are so many waiters
>> that I cannot pinpoint anything.
>>
>> I must be missing something simple. Anyone got any help?
>>
>> Note: there is another thing I'm trying to pinpoint. A temporary
>> imbalance was created by adding a new NSD. It seems that a group of
>> files has been created on that same NSD, and a user keeps hitting that
>> NSD, causing a high load.  I'm trying to pinpoint the origin of that too,
>> at least until everything is balanced again. But will rebalancing spread
>> those files, since they are already on the emptiest NSD?
>>
>> Richard