[gpfsug-discuss] GPFS GPFS-FPO and Hadoop (specifically MapReduce in the first instance)

Thu Jul 17 10:27:27 BST 2014

Hello,

> Looking at FPO it appears to require being set up as a separate
> 'shared-nothing' cluster, with additional FPO and (at least 3)
> server licensing costs attached.  Presumably we could then use AFM
> to ingest(/copy/sync) data from a Hadoop-specific fileset on our
> existing GPFS cluster to the FPO cluster, removing the requirement
> for additional gateway/heads for user (data) access?  At least,
> based on what I've read so far this would be the way we would have
> to do it but it seems convoluted and not ideal.

GPFS FPO nodes can become part of your existing cluster.
Have you read this document? If not, take a look; it contains quite a lot
of detail on how it's done.
http://public.dhe.ibm.com/common/ssi/ecm/en/dcw03051usen/DCW03051USEN.PDF
Also take a look at the public GPFS FAQs, which contain some recommendations
related to GPFS FPO.

> Has anyone else run Hadoop alongside, or on top of, an existing
> SAN-based GPFS cluster (and wanted to use data stored on that cluster)?
> Any tips, if you have?  How does it (traditional GPFS or GPFS-FPO)
> compare to HDFS, especially as regards performance (I know IBM have
> produced lots of pretty graphs showing how much more performant than
> HDFS GPFS-FPO is for particular use cases)?

Yes, there are GPFS users who run MapReduce workloads against multi-purpose
GPFS clusters that contain both "classic" and FPO filesystems.
Performance-wise, a lot depends on the workload.
Also, don't forget that avoiding the back-and-forth copying and moving of
your data isn't directly measured as better performance, although that too
can make turnaround times faster.

Regards
Sean