[gpfsug-discuss] GPFS architecture choice: large servers or directly-attached clients?
mark.bergman at uphs.upenn.edu
Mon Mar 11 19:26:58 GMT 2013
I'm in the process of planning a new HPC cluster, and I'd appreciate getting
some feedback on different approaches to the GPFS architecture.
The cluster will have about 25~50 nodes initially (up to 1000 CPU-cores),
expected to grow to about 50~80 nodes.
The jobs are primarily independent, single-threaded, with a mixture of
small- to medium-sized IO, and a lot of random access. It is very common to
have 100s or 1000s of jobs on different cores and nodes each accessing
the same directories, often with an overlap of the same data files.
For example, many jobs on different nodes will use the same executable
and the same baseline data models, but will differ in individual data
files to compare to the model.
My goal is to ensure reasonable performance, particularly when there's a lot
of contention from multiple jobs accessing the same meta-data and some of the
same data.
My question here is in a choice between two GPFS architecture designs
(the storage array configurations, drive types, RAID types, etc. are
also being examined separately). I'd really like to hear any suggestions
about these (or other) configurations:
[1] Large GPFS servers
About 5 GPFS servers with significant RAM. Each GPFS server would
be connected to storage via an 8Gb/s fibre SAN (multiple paths)
to storage arrays.
Each GPFS server would provide NSDs via 10Gb/s and 1Gb/s (for legacy
servers) ethernet to GPFS clients (the compute nodes).
Questions:
Since the GPFS clients would not be SAN attached
with direct access to block storage, and many
clients (~50) will access similar data (and the
same directories) for many jobs, it seems like it
would make sense to do a lot of caching on the
GPFS servers. Multiple clients would benefit by
reading from the same cached data on the servers.
I'm thinking of sizing caches to handle 1~2GB
per core across the compute nodes, divided by the
number of GPFS servers. That works out to roughly
200GB+ of cache (pagepool, plus maxFilesToCache
and maxStatCache entries) on each GPFS server.
Is there any way to configure GPFS so that the
GPFS servers can do a large amount of caching
without requiring the same resources on the
GPFS clients?
Is there any way to configure the GPFS clients
so that their RAM can be used primarily for
computational jobs?
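One way I can see to get that kind of asymmetric caching (assuming a GPFS release with node-class support) is to set the tunables per node class with mmchconfig, so the NSD servers carry a large pagepool while the compute nodes keep theirs small. The class names, node names, and sizes below are placeholders, not a recommendation:

```shell
# Hypothetical node names; substitute your own server and client lists.
mmcrnodeclass nsdServers -N gpfs01,gpfs02,gpfs03,gpfs04,gpfs05
mmcrnodeclass computeNodes -N node001,node002,node003

# Large cache on the NSD servers only.
mmchconfig pagepool=192G,maxFilesToCache=1000000,maxStatCache=4000000 -N nsdServers

# Keep the clients lean so their RAM stays available for jobs.
mmchconfig pagepool=1G,maxFilesToCache=4000 -N computeNodes
```

As I understand it, settings applied with -N only affect the listed nodes/classes, so the clients never inherit the server-sized values.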
[2] Direct-attached GPFS clients
About 3~5 GPFS servers with modest resources (8 CPU-cores, ~60GB RAM).
Each GPFS server and client (HPC compute node) would be directly
connected to the SAN (8Gb/s fibre, iSCSI over 10Gb/s ethernet,
FCoE over 10Gb/s ethernet).
Either 10Gb/s or 1Gb/s ethernet for communication between GPFS nodes.
Since this is a relatively small cluster in terms of the total
node count, the increased cost in terms of HBAs, switches, and
cabling for direct-connecting all nodes to the storage shouldn't
be excessive.
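For the direct-attached design, my understanding is that the usual pattern is to define each NSD with a server list as a fallback: nodes that can see the LUNs over the SAN do block I/O directly, while any node without SAN access routes through the listed NSD servers. A sketch of a stanza file for mmcrnsd, with placeholder device, NSD, and server names:

```
%nsd: device=/dev/mapper/lun01
  nsd=nsd01
  servers=gpfs01,gpfs02
  usage=dataAndMetadata
  failureGroup=1
```

That way a compute node that loses its SAN path (or a legacy node that never had one) degrades to server-mediated access instead of losing the filesystem.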
Ideas? Suggestions? Things I'm overlooking?
Thanks,
Mark