[gpfsug-discuss] CTDB woes

Orlando Richards orlando.richards at ed.ac.uk
Mon Apr 22 15:52:55 BST 2013


On 17/04/13 11:30, Orlando Richards wrote:
> Hi All - an update to this,
>
> After re-initialising the databases on Monday, things did seem to be
> running better, but ultimately we got back to suffering from spikes in
> ctdb processes and corresponding "pauses" in service. We fell back to a
> single node again for Tuesday (and things were stable once again), and
> this morning rolled out CTDB 1.2.61 (plus a 3.6.12 samba which was
> rebuilt against CTDB 1.2.61 headers).
>
> Things seem to be stable for now - more so than on Monday.
>
> For the record - one metric I'm watching is the number of ctdb processes
> running (this would spike to > 1000 under the failure conditions). It's
> currently sitting consistently at 3 processes, with occasional blips of
> 5-7 processes.
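>
> (For anyone wanting to watch the same thing, I'm just counting processes
> with something like the one-liner below - grepping for ctdbd catches the
> daemon and the children it forks:)
>
>   watch -n 10 "ps -ef | grep '[c]tdbd' | wc -l"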
>


Hi all,

Looks like things have been running fine since we upgraded ctdb last 
Wednesday, so I think it's safe to say that we've found a fix for our 
problem in CTDB 1.2.61.

Thanks for all the input! If anyone wants more info, feel free to get in 
touch.


--
Orlando

> --
> Orlando
>
>
>
>
>
> On 15/04/13 10:54, Orlando Richards wrote:
>> On 12/04/13 19:44, Vic Cornell wrote:
>>> Have you tried putting the ctdb files onto a separate gpfs filesystem?
>>
>> No - but considered it. However, the only "live" CTDB file that sits on
>> GPFS is the reclock file, which - I think - is only used as the
>> heartbeat between nodes and for the recovery process. Now, there's
>> mileage in insulating that, certainly, but I don't think that's what
>> we're suffering from here.
>>
>> On a positive note - this morning we took the step of re-initialising the
>> ctdb databases from current data, and things seem to be stable so far
>> today.
>>
>> Basically - shut down ctdb on all but one node. On each of the stopped nodes, do:
>> mv /var/ctdb/ /var/ctdb.save.date
>>
>> then start up ctdb on those nodes. Once they've come up, shut down ctdb
>> on the last node, move its /var/ctdb out of the way, and restart it.
>> That brings them all up with freshly compacted databases.
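>>
>> In command terms, that's roughly the following per node (run on all but
>> one node first, then on the remaining node once the others are back and
>> healthy). This assumes the stock init script and the default /var/ctdb
>> database directory - adjust to taste:
>>
>>   # stop ctdb, move the old databases aside, restart with fresh ones
>>   service ctdb stop
>>   mv /var/ctdb /var/ctdb.save.$(date +%Y%m%d)
>>   service ctdb start
>>   # wait for the node to show healthy before moving to the next one
>>   ctdb status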
>>
>> Also, from the samba-technical mailing list came the advice to use a
>> more recent ctdb - specifically, 1.2.61. I've got that built and ready
>> to go (and a rebuilt samba compiled against it too), but if things prove
>> to be stable after today's compacting, then we will probably leave it at
>> that and not deploy this.
>>
>> Interesting that 2.0 wasn't suggested for "stable", and that the current
>> "dev" version is 2.1.
>>
>> For reference, here's the start of the thread:
>> https://lists.samba.org/archive/samba-technical/2013-April/091525.html
>>
>> --
>> Orlando.
>>
>>
>>
>>>
>>> On 12 Apr 2013, at 16:43, Orlando Richards <orlando.richards at ed.ac.uk>
>>> wrote:
>>>
>>>> On 12/04/13 15:43, Bob Cregan wrote:
>>>>> Hi Orlando,
>>>>>                        We use ctdb/samba for CIFS, and CNFS for NFS
>>>>> (GPFS version 3.4.0-13) . Current versions are
>>>>>
>>>>> ctdb - 1.0.99
>>>>> samba 3.5.15
>>>>>
>>>>> Both compiled from source. We have about 300+ users normally.
>>>>>
>>>>
>>>> We have suspicions that 3.6 has put additional "chatter" into the
>>>> ctdb database stream, which has pushed us over the edge. Barry Evans
>>>> has found that the clustered locking databases, in particular, prove
>>>> to be a scalability/usability limit for ctdb.
>>>>
>>>>
>>>>> We have had no issues with this setup apart from CNFS, which had 2 or 3
>>>>> bad moments over the last year. These have gone away since we fixed a
>>>>> bug with our 10G NIC drivers (Emulex cards, kernel module be2net) which
>>>>> led to occasional dropped packets for jumbo frames. There have been no
>>>>> issues with samba/ctdb.
>>>>>
>>>>> The only comment I can make is that during initial investigations into
>>>>> an upgrade of samba to 3.6.x we discovered that the 3.6 code would not
>>>>> compile against ctdb 1.0.99 (compilation requires the ctdb source),
>>>>> with error messages like:
>>>>>
>>>>>   configure: checking whether cluster support is available
>>>>> checking for ctdb.h... yes
>>>>> checking for ctdb_private.h... yes
>>>>> checking for CTDB_CONTROL_TRANS3_COMMIT declaration... yes
>>>>> checking for CTDB_CONTROL_SCHEDULE_FOR_DELETION declaration... no
>>>>> configure: error: "cluster support not available: support for
>>>>> SCHEDULE_FOR_DELETION control missing"
>>>>>
>>>>>
>>>>> What occurs to me is that this message seems to indicate that it is
>>>>> possible to run a ctdb version that is incompatible with samba 3.6.
>>>>> That would imply that an upgrade to a higher version of ctdb might
>>>>> help; of course, it might not, and it could make backing out harder.
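>>>>>
>>>>> For what it's worth, the clustered build we were testing was roughly
>>>>> the following (the paths are placeholders and the flags are from
>>>>> memory - check ./configure --help in your own tree):
>>>>>
>>>>>   cd samba-3.6.x/source3
>>>>>   # point configure at the ctdb headers and ask for cluster support
>>>>>   CFLAGS="-I/usr/local/ctdb/include" ./configure --with-cluster-support
>>>>>   make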
>>>>
>>>> Certainly 1.0.114 builds fine - I've not tried 2.0, I'm too scared!
>>>> The versioning in CTDB has proved hard for me to fathom...
>>>>
>>>>>
>>>>> A compile against ctdb 2.0 works fine. We will soon be rolling out this
>>>>> upgrade, but I'm waiting to see what the samba people say at the UG
>>>>> meeting first!
>>>>>
>>>>
>>>> It has to be said - the timing is good!
>>>> Cheers,
>>>> Orlando
>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>> Bob
>>>>>
>>>>>
>>>>> On 12 April 2013 13:37, Orlando Richards
>>>>> <orlando.richards at ed.ac.uk> wrote:
>>>>>
>>>>>     Hi folks,
>>>>>
>>>>>     We've long been using CTDB and Samba for our NAS service,
>>>>>     servicing ~500 users. We've been suffering from some problems with
>>>>>     CTDB performance over the last few weeks, likely triggered either
>>>>>     by an upgrade of samba from 3.5 to 3.6 (and the enabling of SMB2
>>>>>     as a result), or possibly by additional users coming on with a new
>>>>>     workload.
>>>>>
>>>>>     We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again,
>>>>>     from sernet). Before we roll back, we'd like to make sure we can't
>>>>>     fix the problem and stick with Samba 3.6 (and we don't even know
>>>>>     that a roll back would fix the issue).
>>>>>
>>>>>     The symptoms are a complete freeze of the service for CIFS users
>>>>>     for 10-60 seconds, and on the servers a corresponding spawning of
>>>>>     large numbers of CTDB processes, which seem to be created in a
>>>>>     "big bang", and then do what they do and exit in the subsequent
>>>>>     10-60 seconds.
>>>>>
>>>>>     We also serve up NFS from the same ctdb-managed frontends, and
>>>>>     GPFS from the cluster - and these are both fine throughout.
>>>>>
>>>>>     This was happening 5-10 times per hour, though not at exact
>>>>>     intervals. When we added a third node to the CTDB cluster, it "got
>>>>>     worse"; when we dropped the CTDB cluster down to a single node,
>>>>>     everything started behaving fine - which is where we are now.
>>>>>
>>>>>     So, I've got a bunch of questions!
>>>>>
>>>>>       - does anyone know why ctdb would be spawning these processes, and
>>>>>     if there's anything we can do to stop it needing to do it?
>>>>>       - has anyone done any more general performance / config
>>>>>     optimisation of CTDB?
>>>>>
>>>>>     And - more generally - does anyone else actually use
>>>>>     ctdb/samba/gpfs on the scale of ~500 users or higher? If so - how
>>>>>     do you find it?
>>>>>
>>>>>
>>>>>     --
>>>>>                  --
>>>>>         Dr Orlando Richards
>>>>>        Information Services
>>>>>     IT Infrastructure Division
>>>>>             Unix Section
>>>>>          Tel: 0131 650 4994
>>>>>
>>>>>     The University of Edinburgh is a charitable body, registered in
>>>>>     Scotland, with registration number SC005336.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Bob Cregan
>>>>>
>>>>> Senior Storage Systems Administrator
>>>>>
>>>>> ACRC
>>>>>
>>>>> Bristol University
>>>>>
>>>>> Tel:     +44 (0) 117 331 4406
>>>>>
>>>>> skype:  bobcregan
>>>>>
>>>>> Mobile: +44 (0) 7712388129
>>>>>
>>>>
>>>>
>>>> --
>>>>             --
>>>>    Dr Orlando Richards
>>>>   Information Services
>>>> IT Infrastructure Division
>>>>        Unix Section
>>>>     Tel: 0131 650 4994
>>>>
>>>> The University of Edinburgh is a charitable body, registered in
>>>> Scotland, with registration number SC005336.
>>>
>>>
>>
>>
>
>


-- 
             --
    Dr Orlando Richards
   Information Services
IT Infrastructure Division
        Unix Section
     Tel: 0131 650 4994

The University of Edinburgh is a charitable body, registered in 
Scotland, with registration number SC005336.


