[gpfsug-discuss] GPFS, GPFS-FPO and Hadoop (specifically MapReduce in the first instance)

Wed Jul 16 15:21:31 BST 2014

Dear GPFSUG,

I've been looking into the possibility of using GPFS with Hadoop, especially as we already have experience with a (traditional SAN-based) GPFS cluster for our HPC provision (which is part of the same network fabric, so integration should be possible and would be desirable).

The proof-of-concept Hadoop cluster I've set up has HDFS as well as our current GPFS file system exposed (to allow users to import/export their data between HDFS and the shared filestore).  HDFS is a pain to get data in and out of, and it also precludes us from using many deployment tools to mass-update the nodes by reimage and/or reinstall (I know this would also be a problem with GPFS-FPO).

It appears that the GPFS-FPO product is intended to provide HDFS's performance benefits for highly distributed data-intensive workloads with the same ease of use as a traditional GPFS filesystem.  One of the things I'm wondering is: can we link this with our existing GPFS cluster sanely?  This would avoid the need for additional filesystem gateway servers for our users to import/export their data from outside the system and would allow, as seamlessly as possible, a clear workflow from generating large datasets on the HPC facility to analysing them (e.g. with a MapReduce function) on the Hadoop facility.

Looking at FPO, it appears to require being set up as a separate 'shared-nothing' cluster, with additional FPO and (at least 3) server licensing costs attached.  Presumably we could then use AFM to ingest(/copy/sync) data from a Hadoop-specific fileset on our existing GPFS cluster to the FPO cluster, removing the requirement for additional gateway/head nodes for user (data) access?  At least, based on what I've read so far, this is the way we would have to do it, but it seems convoluted and not ideal.

Or am I completely barking up the wrong tree with FPO?

Has anyone else run Hadoop alongside, or on top of, an existing SAN-based GPFS cluster (and wanted to use data stored on that cluster)?  Any tips, if so?  How does it (traditional GPFS or GPFS-FPO) compare to HDFS, especially as regards performance (I know IBM have produced lots of pretty graphs showing how much more performant than HDFS GPFS-FPO is for particular use cases)?

Many thanks,

Laurence
-- 
Laurence Hurst, IT Services, University of Birmingham, Edgbaston, B15 2TT