[gpfsug-discuss] CTDB woes
Orlando Richards
orlando.richards at ed.ac.uk
Mon Apr 22 15:52:55 BST 2013
On 17/04/13 11:30, Orlando Richards wrote:
> Hi All - an update to this,
>
> After re-initialising the databases on Monday, things did seem to be
> running better, but ultimately we got back to suffering from spikes in
> ctdb processes and corresponding "pauses" in service. We fell back to a
> single node again for Tuesday (and things were stable once again), and
> this morning rolled out CTDB 1.2.61 (plus a 3.6.12 samba which was
> rebuilt against CTDB 1.2.61 headers).
>
> Things seem to be stable for now - more so than on Monday.
>
> For the record - one metric I'm watching is the number of ctdb processes
> running (this would spike to > 1000 under the failure conditions). It's
> currently sitting consistently at 3 processes, with occasional blips of
> 5-7 processes.
>
Hi all,
Looks like things have been running fine since we upgraded ctdb last
Wednesday, so I think it's safe to say that we've found a fix for our
problem in CTDB 1.2.61.
Thanks for all the input! If anyone wants more info, feel free to get in
touch.
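
For anyone wanting to reproduce that process-count check, here's a rough
sketch. The assumption (mine - not guaranteed for every build) is that all
the relevant processes have command names starting with "ctdb":

```shell
# Count processes whose command name begins with "ctdb" (ctdbd plus any
# helper children it forks). The "|| true" keeps the pipeline happy when
# the count is zero, since grep -c exits non-zero on no matches.
count=$(ps -e -o comm= | grep -c '^ctdb' || true)
echo "ctdb processes: $count"
```

Dropped into a cron job or a monitoring agent, that one number was enough
for us to tell healthy (3-7) from failing (>1000).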
--
Orlando
> --
> Orlando
>
> On 15/04/13 10:54, Orlando Richards wrote:
>> On 12/04/13 19:44, Vic Cornell wrote:
>>> Have you tried putting the ctdb files onto a separate gpfs filesystem?
>>
>> No - but considered it. However, the only "live" CTDB file that sits on
>> GPFS is the reclock file, which - I think - is only used as the
>> heartbeat between nodes and for the recovery process. Now, there's
>> mileage in insulating that, certainly, but I don't think that's what
>> we're suffering from here.
>>
>> On a positive note - we took the steps this morning to re-initialise the
>> ctdb databases from current data, and things seem to be stable today so
>> far.
>>
>> Basically - shut down ctdb on all but one node. On all but that node, do:
>> mv /var/ctdb/ /var/ctdb.save.date
>>
>> then start up ctdb on those nodes. Once they've come up, shut down ctdb
>> on the last node, move /var/ctdb out the way, and restart. That brings
>> them all up with freshly compacted databases.
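
Those steps, scripted as a rough per-node sketch (caveats: "service ctdb"
and the /var/ctdb database directory are assumptions from our setup, and
DRYRUN=1 just echoes the commands rather than running them):

```shell
# Re-initialise this node's ctdb databases, keeping the old ones as a
# fallback. Run on every node except the one left serving, then repeat
# on that last node once the others are back up.
DRYRUN=${DRYRUN:-1}
run() { if [ "$DRYRUN" = 1 ]; then echo "+ $*"; else "$@"; fi; }

run service ctdb stop
run mv /var/ctdb "/var/ctdb.save.$(date +%Y%m%d)"
run service ctdb start
```

Unset DRYRUN only once you're happy with what it will do - the saved copy
is your way back if the freshly compacted databases misbehave.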
>>
>> Also, from the samba-technical mailing list came the advice to use a
>> more recent ctdb - specifically, 1.2.61. I've got that built and ready
>> to go (and a rebuilt samba compiled against it too), but if things prove
>> to be stable after today's compacting, then we will probably leave it at
>> that and not deploy this.
>>
>> Interesting that 2.0 wasn't suggested for "stable", and that the current
>> "dev" version is 2.1.
>>
>> For reference, here's the start of the thread:
>> https://lists.samba.org/archive/samba-technical/2013-April/091525.html
>>
>> --
>> Orlando.
>>
>>
>>
>>>
>>> On 12 Apr 2013, at 16:43, Orlando Richards <orlando.richards at ed.ac.uk>
>>> wrote:
>>>
>>>> On 12/04/13 15:43, Bob Cregan wrote:
>>>>> Hi Orlando,
>>>>> We use ctdb/samba for CIFS, and CNFS for NFS
>>>>> (GPFS version 3.4.0-13) . Current versions are
>>>>>
>>>>> ctdb - 1.0.99
>>>>> samba 3.5.15
>>>>>
>>>>> Both compiled from source. We have about 300+ users normally.
>>>>>
>>>>
>>>> We have suspicions that 3.6 has put additional "chatter" into the
>>>> ctdb database stream, which has pushed us over the edge. Barry Evans
>>>> has found that the clustered locking databases, in particular, prove
>>>> to be a scalability/usability limit for ctdb.
>>>>
>>>>
>>>>> We have had no issues with this setup apart from CNFS, which had 2 or 3
>>>>> bad moments over the last year. These have gone away since we fixed a
>>>>> bug in our 10G NIC drivers (Emulex cards, kernel module be2net) which
>>>>> led to occasional dropped packets for jumbo frames. There have been no
>>>>> issues with samba/ctdb.
>>>>>
>>>>> The only comment I can make is that during initial investigations into
>>>>> an upgrade of samba to 3.6.x we discovered that the 3.6 code would not
>>>>> compile against ctdb 1.0.99 (compilation requires the ctdb source)
>>>>> with error messages like:
>>>>>
>>>>> configure: checking whether cluster support is available
>>>>> checking for ctdb.h... yes
>>>>> checking for ctdb_private.h... yes
>>>>> checking for CTDB_CONTROL_TRANS3_COMMIT declaration... yes
>>>>> checking for CTDB_CONTROL_SCHEDULE_FOR_DELETION declaration... no
>>>>> configure: error: "cluster support not available: support for
>>>>> SCHEDULE_FOR_DELETION control missing"
>>>>>
>>>>>
>>>>> What occurs to me is that this message seems to indicate that it is
>>>>> possible to run a ctdb version that is incompatible with samba 3.6.
>>>>> That would imply that an upgrade to a higher version of ctdb might
>>>>> help; of course, it might not, and it could make backing out harder.
>>>>
>>>> Certainly 1.0.114 builds fine - I've not tried 2.0, I'm too scared!
>>>> The versioning in CTDB has proved hard for me to fathom...
>>>>
>>>>>
>>>>> A compile against ctdb 2.0 works fine. We will soon be running this
>>>>> upgrade, but I'm waiting to see what the samba people say at the UG
>>>>> meeting first!
>>>>>
>>>>
>>>> It has to be said - the timing is good!
>>>> Cheers,
>>>> Orlando
>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>> Bob
>>>>>
>>>>>
>>>>> On 12 April 2013 13:37, Orlando Richards <orlando.richards at ed.ac.uk
>>>>> <mailto:orlando.richards at ed.ac.uk>> wrote:
>>>>>
>>>>> Hi folks,
>>>>>
>>>>> We've long been using CTDB and Samba for our NAS service,
>>>>> servicing
>>>>> ~500 users. We've been suffering from some problems with the CTDB
>>>>> performance over the last few weeks, likely triggered either by an
>>>>> upgrade of samba from 3.5 to 3.6 (and enabling of SMB2 as a
>>>>> result),
>>>>> or possibly by additional users coming on with a new workload.
>>>>>
>>>>> We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again,
>>>>> from sernet). Before we roll back, we'd like to make sure we can't
>>>>> fix the problem and stick with Samba 3.6 (and we don't even know
>>>>> that a roll back would fix the issue).
>>>>>
>>>>> The symptoms are a complete freeze of the service for CIFS users
>>>>> for
>>>>> 10-60 seconds, and on the servers a corresponding spawning of
>>>>> large
>>>>> numbers of CTDB processes, which seem to be created in a "big
>>>>> bang",
>>>>> and then do what they do and exit in the subsequent 10-60 seconds.
>>>>>
>>>>> We also serve up NFS from the same ctdb-managed frontends, and
>>>>> GPFS
>>>>> from the cluster - and these are both fine throughout.
>>>>>
>>>>> This was happening 5-10 times per hour, though not at exact
>>>>> intervals. When we added a third node to the CTDB cluster, it "got
>>>>> worse"; when we dropped the CTDB cluster down to a single node,
>>>>> everything started behaving fine - which is where we are now.
>>>>>
>>>>> So, I've got a bunch of questions!
>>>>>
>>>>> - does anyone know why ctdb would be spawning these processes,
>>>>> and
>>>>> if there's anything we can do to stop it needing to do it?
>>>>> - has anyone done any more general performance / config
>>>>> optimisation of CTDB?
>>>>>
>>>>> And - more generally - does anyone else actually use
>>>>> ctdb/samba/gpfs
>>>>> on the scale of ~500 users or higher? If so - how do you find it?
>>>>>
>>>>>
>>>>> --
>>>>> Dr Orlando Richards
>>>>> Information Services
>>>>> IT Infrastructure Division
>>>>> Unix Section
>>>>> Tel: 0131 650 4994
>>>>>
>>>>> The University of Edinburgh is a charitable body, registered in
>>>>> Scotland, with registration number SC005336.
>>>>> _______________________________________________
>>>>> gpfsug-discuss mailing list
>>>>> gpfsug-discuss at gpfsug.org <mailto:gpfsug-discuss at gpfsug.org>
>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Bob Cregan
>>>>>
>>>>> Senior Storage Systems Administrator
>>>>>
>>>>> ACRC
>>>>>
>>>>> Bristol University
>>>>>
>>>>> Tel: +44 (0) 117 331 4406
>>>>>
>>>>> skype: bobcregan
>>>>>
>>>>> Mobile: +44 (0) 7712388129
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>
--
Dr Orlando Richards
Information Services
IT Infrastructure Division
Unix Section
Tel: 0131 650 4994
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.