[gpfsug-discuss] Looking for a way to see which node is having an impact on server?

Andrews, Vincent v.andrews at noc.ac.uk
Tue Dec 10 09:59:06 GMT 2013


We do not have as many client nodes, but we have an extensive Ganglia
configuration that monitors all of the nodes on our network.

For the client nodes we also run a script that pushes stats into Ganglia
using 'mmpmon'.

Using this we have been able to locate problem machines a lot more quickly.

I have attached the script; it is released on an 'it works for me' basis.

We run it every minute from cron.
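
For anyone who cannot pull the attachment off the archive, the idea is
roughly as follows, sketched here in Python rather than the attached Perl.
The mmpmon request keyword and the gmetric flags are from memory, so treat
this as an illustration of the approach rather than the script itself:

#!/usr/bin/env python3
# Illustrative sketch only -- NOT the attached ganglia_gpfs_client_stats.pl.
# Reads per-filesystem I/O counters from 'mmpmon fs_io_s' (parseable mode)
# and pushes a few of them into Ganglia with gmetric. Must run as root,
# since mmpmon talks to the local mmfsd.
import subprocess

MMPMON = "/usr/lpp/mmfs/bin/mmpmon"
GMETRIC = "gmetric"   # assumes the Ganglia gmond tools are on the PATH

def read_fs_io_stats():
    """Return one dict of keyword/value pairs per mounted GPFS filesystem."""
    out = subprocess.run([MMPMON, "-p", "-s"], input="fs_io_s\n",
                         capture_output=True, text=True, check=True).stdout
    stats = []
    for line in out.splitlines():
        if not line.startswith("_fs_io_s_"):
            continue
        # Parseable output alternates keyword/value pairs, e.g.
        # "_fs_io_s_ _n_ <ip> _nn_ <node> ... _fs_ <fsname>
        #  _br_ <bytes read> _bw_ <bytes written> _rdc_ <reads> _wc_ <writes>"
        fields = line.split()
        stats.append(dict(zip(fields[1::2], fields[2::2])))
    return stats

def push(name, value, units):
    """Publish a single value into Ganglia via gmetric."""
    subprocess.run([GMETRIC, "--name", name, "--value", str(value),
                    "--type", "double", "--units", units], check=True)

if __name__ == "__main__":
    for fs in read_fs_io_stats():
        prefix = "gpfs_" + fs.get("_fs_", "unknown")
        push(prefix + "_bytes_read",    fs.get("_br_", "0"),  "bytes")
        push(prefix + "_bytes_written", fs.get("_bw_", "0"),  "bytes")
        push(prefix + "_reads",         fs.get("_rdc_", "0"), "ops")
        push(prefix + "_writes",        fs.get("_wc_", "0"),  "ops")

The counters mmpmon reports are cumulative, so when running this every
minute you would want to keep the previous sample and push per-minute
deltas (or rates) rather than the raw values.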

Vince.

--

Vincent Andrews
NOC,
European Way,
Southampton,
SO14 3ZH
Ext. 27616
External 023 80597616


This e-mail (and any attachments) is confidential and intended solely for
the use of the individual or entity to whom it is addressed. Both NERC and
the University of Southampton (who operate NOCS as a collaboration) are
subject to the Freedom of Information Act 2000. The information contained
in this e-mail and any reply you make may be disclosed unless it is
legally exempt from disclosure. Any material supplied to NOCS may be stored in
the electronic records management system of either the University or NERC
as appropriate.



On 09/12/2013 21:21, "Alex Chekholko" <chekh at stanford.edu> wrote:

>Hi Richard,
>
>For IB traffic, you can use 'collectl -sx'
>http://collectl.sourceforge.net/Infiniband.html
>or else mmpmon (which is what 'dstat --gpfs' uses underneath anyway)
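>
>For reference, raw mmpmon usage on a client looks roughly like this (run
>as root; the request keyword and output format are from memory, so check
>the mmpmon documentation):
>
>  echo fs_io_s | /usr/lpp/mmfs/bin/mmpmon -p
>
>which prints one parseable "_fs_io_s_ ..." line per mounted filesystem,
>with cumulative byte and operation counters for the local node.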
>
>If your other NSDs are full, then of course all writes will go to the
>empty NSDs.  And then, when reading those new files, your performance will
>be limited to just the new NSDs.
>
>
>Regards,
>Alex
>
>On 12/09/2013 01:05 PM, Richard Lefebvre wrote:
>> Hi Alex,
>>
>> I should have mentioned that my GPFS network runs over
>> InfiniBand/RDMA, so looking at TCP traffic probably won't work. I will try
>> to see if the traffic can be seen through ib0 (instead of eth0), but I
>> have my doubts.
>>
>> As for the placement: the file system was 95% full when I added the new
>> NSDs. I know from the waiters command that what is waiting now is I/O
>> to the 2 NSDs:
>>
>> waiting 0.791707000 seconds, NSDThread: for I/O completion on disk d9
>>
>> I have added more NSDs since then, but the waiting is still on those 2
>> disks, none of the others.
>>
>> Richard
>>
>> On 12/09/2013 02:52 PM, Alex Chekholko wrote:
>>> Hi Richard,
>>>
>>> I would just use something like 'iftop' to look at the traffic between
>>> the nodes.  Or 'collectl'.  Or 'dstat'.
>>>
>>> e.g. dstat -N eth0 --gpfs --gpfs-ops --top-cpu-adv --top-io 2 10
>>> http://dag.wiee.rs/home-made/dstat/
>>>
>>> For the NSD balance question, since GPFS stripes the blocks evenly
>>> across all the NSDs, they will end up balanced over time.  Or you can
>>> rebalance manually with 'mmrestripefs -b' or similar.
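>>>
>>> For example (the device name 'gpfs1' is just a placeholder here; check
>>> the man pages before running):
>>>
>>>   mmdf gpfs1             # free space per NSD, shows the imbalance
>>>   mmrestripefs gpfs1 -b  # rebalance existing data across all NSDs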
>>>
>>> It is unlikely that particular files ended up on a single NSD, unless
>>> the other NSDs are totally full.
>>>
>>> Regards,
>>> Alex
>>>
>>> On 12/06/2013 04:31 PM, Richard Lefebvre wrote:
>>>> Hi,
>>>>
>>>> I'm looking for a way to see which node (or nodes) is having an impact
>>>> on the GPFS server nodes and slowing down the whole file system. What
>>>> usually happens is that a user is doing I/O that doesn't fit the
>>>> configuration of the GPFS file system or the way it was explained that
>>>> it should be used efficiently: typically a lot of unbuffered, byte-sized,
>>>> very random I/O on a file system that was made for large files and a
>>>> large block size.
>>>>
>>>> My problem is finding out who is doing that. I haven't found a way to
>>>> pinpoint the node or nodes that could be the source of the problem,
>>>> with over 600 client nodes.
>>>>
>>>> I tried to use "mmlsnode -N waiters -L", but there are so many waiters
>>>> that I cannot pinpoint anything.
>>>>
>>>> I must be missing something simple. Anyone got any help?
>>>>
>>>> Note: there is another thing I'm trying to pinpoint. A temporary
>>>> imbalance was created by adding a new NSD. It seems that a group of
>>>> files have been created on that same NSD and a user keeps hitting that
>>>> NSD, causing a high load.  I'm trying to pinpoint the origin of that
>>>> too, at least until everything is balanced back. But will rebalancing
>>>> spread those files out, since they are already on the emptiest NSD?
>>>>
>>>> Richard
>>>
>>
>
>--
>Alex Chekholko chekh at stanford.edu 347-401-4860
>_______________________________________________
>gpfsug-discuss mailing list
>gpfsug-discuss at gpfsug.org
>http://gpfsug.org/mailman/listinfo/gpfsug-discuss


This message (and any attachments) is for the recipient only. NERC is subject to the Freedom of Information Act 2000 and the contents of this email and any reply you make may be disclosed by NERC unless it is exempt from release under the Act. Any material supplied to NERC may be stored in an electronic records management system.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ganglia_gpfs_client_stats.pl
Type: text/x-perl-script
Size: 3701 bytes
Desc: ganglia_gpfs_client_stats.pl
URL: <http://gpfsug.org/pipermail/gpfsug-discuss_gpfsug.org/attachments/20131210/4ec4400e/attachment.bin>

