[gpfsug-discuss] NDS in Two Site scenario

Yuri L Volobuev volobuev at us.ibm.com
Fri Jul 22 18:56:31 BST 2016


There are multiple ways to accomplish active-active two-site synchronous
DR, aka "stretch cluster".

The most common approach is to have 3 sites: two main sites A and B, plus
tiebreaker site C.  The two main sites host all data/metadata disks and
each has an equal number of quorum nodes.  There's no stretched SAN; each
site has its own set of NSDs defined.  The tiebreaker site consists of a
single quorum node with a small descOnly LUN.  In this config, any of the 3
sites can go down or be disconnected from the rest without affecting the
other two.  The tiebreaker site is essential: it provides a quorum node for
node majority quorum to function, and a descOnly disk for the file system
descriptor quorum.  Technically speaking, one can do away with the need for
a quorum node at site C by using "minority quorum", i.e. tiebreaker disks,
but this model is more complex and it is harder to predict its behavior
under various failure conditions.  The basic problem with the minority
quorum is that it allows a minority of nodes to win in a network partition
scenario, just like the name implies.  In the extreme case this leads to
the "dictator problem", when a single partitioned node could manage to win
the disk election and thus kick everyone else out.  And since a tiebreaker
disk needs to be visible from all quorum nodes, you do need a stretched SAN
that extends between sites.  The classic active-active stretch cluster only
requires a good TCP/IP network.
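
To make that concrete, here is a minimal sketch of the node roles and the
tiebreaker-site disk.  All host names, device paths and NSD names below are
made up for illustration; adjust them to your environment.

    # Node file for mmcrcluster / mmaddnode: two quorum nodes per main
    # site, plus a single quorum node at the tiebreaker site C.
    siteA-nsd1:quorum-manager
    siteA-nsd2:quorum-manager
    siteB-nsd1:quorum-manager
    siteB-nsd2:quorum-manager
    siteC-tb1:quorum

    # NSD stanza for the small descOnly LUN at site C.  It holds only a
    # copy of the file system descriptor, no data or metadata.
    %nsd:
      device=/dev/sdx
      nsd=descC
      servers=siteC-tb1
      usage=descOnly
      failureGroup=3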

The question that gets asked a lot is "how good does the network connection
between sites need to be?"  There's no simple answer, unfortunately.  It
would be completely impractical to try to frame this in simple thresholds.
The worse the network connection is, the more pain it produces, but
everyone has a different level of pain tolerance.  And everyone's workload
is different.  In any GPFS configuration that uses data replication, writes
are impacted far more by replication than reads.  So a read-mostly workload
may run fine with a dodgy inter-site link, while a write-heavy workload may
just run into the ground, as IOs may be submitted faster than they could be
completed.  The buffering model could make a big difference.  An
application that does a fair amount of write bursts, with those writes
being buffered in a generously sized pagepool, may perform acceptably,
while a different application that uses O_SYNC or O_DIRECT semantics for
writes may run a lot worse, all other things being equal.  As long as all
nodes can renew their disk leases within the configured disk lease interval
(35 sec by default), GPFS will basically work, so the absolute threshold
for the network link quality is not particularly stringent, but beyond that
it all depends on your workload and your level of pain tolerance.
Practically speaking, you want a network link with low-double-digits RTT at
worst, almost no packet loss, and bandwidth commensurate with your
application IO needs (fudged some to allow for write amplification --
another factor that's entirely workload-dependent).  So a link with, say,
100ms RTT and 2% packet loss is not going to be usable for almost anyone, in
my opinion; a link with 30ms RTT and 0.1% packet loss may work for some
undemanding read-mostly workloads, and so on.  So you pretty much have to
try it out to see.
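
If you want to see how much burst buffering your writer nodes have, the
pagepool size is the knob the paragraph above refers to.  The commands below
are only an illustration: "writerNodes" is a hypothetical node class and 8G
is a placeholder, not a recommendation.

    mmlsconfig pagepool                     # current pagepool setting(s)
    mmchconfig pagepool=8G -N writerNodes   # larger write buffering on selected nodes
    mmdiag --network                        # this node's view of its peer connections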

The disk configuration is another tricky angle.  The simplest approach is
to have two groups of data/metadata NSDs, on sites A and B, and not have
any sort of SAN reaching across sites.  Historically, such a config was
actually preferred over a stretched SAN, because it allowed for a basic
site topology definition.  When multiple replicas of the same logical block
are present, it is obviously better/faster to read the replica that resides
on a disk that's local to a given site.  This is conceptually simple, but
how would GPFS know what a site is and what disks are local vs remote?  To
GPFS, all disks are equal.  Historically, the readReplicaPolicy=local
config parameter was put forward to work around the problem.  The basic
idea was: if the reader node is on the same subnet as the primary NSD
server for a given replica, this replica is "local", and is thus preferred.
This sort of works, but requires a very specific network configuration,
which isn't always practical.  Starting with GPFS 4.1.1, GPFS implements
readReplicaPolicy=fastest, where the best replica for reads is picked based
on observed disk IO latency.  This is more general and works for all disk
topologies, including a stretched SAN.
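
As an illustration of the two-failure-group layout plus the newer read
policy, the stanzas and commands below are a sketch only; disk names, device
paths, server names and the file system name are all hypothetical.

    # Site A disks in failure group 1, site B disks in failure group 2,
    # each served only by the NSD servers at its own site (no stretched SAN).
    %nsd:
      device=/dev/mapper/lunA1
      nsd=nsdA1
      servers=siteA-nsd1,siteA-nsd2
      usage=dataAndMetadata
      failureGroup=1

    %nsd:
      device=/dev/mapper/lunB1
      nsd=nsdB1
      servers=siteB-nsd1,siteB-nsd2
      usage=dataAndMetadata
      failureGroup=2

    # Two-way data and metadata replication across the failure groups,
    # and read the replica with the lowest observed IO latency (4.1.1+).
    mmcrfs fs1 -F nsd.stanza -m 2 -M 2 -r 2 -R 2
    mmchconfig readReplicaPolicy=fastest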

yuri



From:	"Mark.Bush at siriuscom.com" <Mark.Bush at siriuscom.com>
To:	gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>,
Date:	07/21/2016 05:45 AM
Subject:	Re: [gpfsug-discuss] NDS in Two Site scenario
Sent by:	gpfsug-discuss-bounces at spectrumscale.org



This is where my confusion sits.  So if I have two sites, and two NSD nodes
per site with 1 NSD (to keep it simple), do I just present the physical LUN
in Site1 to the Site1 NSD nodes and the physical LUN in Site2 to the Site2
NSD nodes?  Or do I present the physical LUN in Site1 to all 4 NSD nodes, and
the same at Site2?  (Assuming SAN and not direct attached in this case.)  I
know I’m being persistent, but for some reason this confuses me.

Site1
  NSD Node1
      --- NSD1 --- Physical LUN1 from SAN1
  NSD Node2


Site2
  NSD Node3
      --- NSD2 --- Physical LUN2 from SAN2
  NSD Node4


Or


Site1
  NSD Node1
      --- NSD1 --- Physical LUN1 from SAN1
      --- NSD2 --- Physical LUN2 from SAN2
  NSD Node2

Site2
  NSD Node3
      --- NSD2 --- Physical LUN2 from SAN2
      --- NSD1 --- Physical LUN1 from SAN1
  NSD Node4


Site 3
  Node5 (quorum)


From: <gpfsug-discuss-bounces at spectrumscale.org> on behalf of Ken Hill
<kenh at us.ibm.com>
Reply-To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date: Wednesday, July 20, 2016 at 7:02 PM
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Subject: Re: [gpfsug-discuss] NDS in Two Site scenario

Yes - it is a cluster.

The sites should NOT be farther apart than a MAN or campus network.  If
you're looking to do this over a larger distance, it would be best to choose
another GPFS solution (multi-cluster, AFM, etc.).

Regards,

Ken Hill
Technical Sales Specialist | Software Defined Solution Sales
IBM Systems


                                                                                                     
                                                                                                     
                                                                                                     
Phone: 1-540-207-7270
E-mail: kenh at us.ibm.com
2300 Dulles Station Blvd
Herndon, VA 20171-6133
United States

From:        "Mark.Bush at siriuscom.com" <Mark.Bush at siriuscom.com>
To:        gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date:        07/20/2016 07:33 PM
Subject:        Re: [gpfsug-discuss] NDS in Two Site scenario
Sent by:        gpfsug-discuss-bounces at spectrumscale.org




So in this scenario Ken, can server3 see any disks in site1?

From: <gpfsug-discuss-bounces at spectrumscale.org> on behalf of Ken Hill
<kenh at us.ibm.com>
Reply-To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date: Wednesday, July 20, 2016 at 4:15 PM
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Subject: Re: [gpfsug-discuss] NDS in Two Site scenario


Site1                           Site2
  Server1 (quorum 1)              Server3 (quorum 2)
  Server2                         Server4

                SiteX
                  Server5 (quorum 3)




You need to set up another site (or server) that is at least power-isolated
(if not completely infrastructure-isolated) from Site1 and Site2.  You would
then set up a quorum node at that site/location.  This ensures you can
still access your data even if one of your sites goes down.

You can further improve failure tolerance by increasing the number of quorum nodes (keep the count odd).

The way quorum works is: a majority of the quorum nodes needs to be up to
survive an outage.

- With 3 quorum nodes you can have 1 quorum node failure and continue
filesystem operations.
- With 5 quorum nodes you can have 2 quorum node failures and continue
filesystem operations.
- With 7 quorum nodes you can have 3 quorum node failures and continue
filesystem operations.
- etc
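
As a small, hypothetical example of adding that isolated quorum node
(server5 here is a placeholder name):

    mmaddnode -N server5          # add the node at the isolated site
    mmchnode --quorum -N server5  # designate it as a quorum node
    mmlscluster                   # verify the quorum designations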

Please see
http://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.0/ibmspectrumscale42_content.html?view=kc
for more information about quorum and tiebreaker disks.

Ken Hill
Technical Sales Specialist | Software Defined Solution Sales
IBM Systems


                                                                                                     
                                                                                                     
                                                                                                     
From:        "Mark.Bush at siriuscom.com" <Mark.Bush at siriuscom.com>
To:        gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date:        07/20/2016 04:47 PM
Subject:        [gpfsug-discuss] NDS in Two Site scenario
Sent by:        gpfsug-discuss-bounces at spectrumscale.org





For some reason this concept is a round peg that doesn’t fit the square
hole inside my brain.  Can someone please explain the best practice for
setting up two sites in the same cluster?  I get that I would likely have two
NSD nodes in site 1 and two NSD nodes in site 2.  What I don’t understand are
the failure scenarios and what would happen if I lose one node or, worse, a
whole site goes down.  Do I solve this by having Scale replication set to 2
for all my files?  I mean, a single site I think I get; it’s the
two-datacenter case, where I don’t want two clusters, that typically confuses me.



Mark R. Bush| Solutions Architect
Mobile: 210.237.8415 | mark.bush at siriuscom.com
Sirius Computer Solutions | www.siriuscom.com
10100 Reunion Place, Suite 500, San Antonio, TX 78216





_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


