[gpfsug-discuss] Follow-up: migrating billions of files

William Abbott babbott at rutgers.edu
Wed Mar 6 15:11:15 GMT 2019


We had a similar situation and ended up using parsyncfp, which generates 
multiple parallel rsyncs based on file lists.  If the two clusters are on 
the same IB fabric (as ours were) you can run the transfers over that 
instead of Ethernet, and it worked pretty well.  One caveat is that you 
need to follow the parallel transfers with a final single rsync, so you 
can use --delete.
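
The invocation ends up looking roughly like this (host name, paths and
tuning numbers here are made up, and the exact flags vary a bit between
parsyncfp versions, so check parsyncfp --help before copying it):

    # ~8 parallel rsyncs over chunked file lists, run on a node that
    # mounts the source file system
    parsyncfp --NP=8 --chunksize=10G --startdir=/gpfs/src projectdir \
        targethost:/gpfs/dst

    # final single rsync to catch stragglers and handle deletions
    rsync -avH --delete /gpfs/src/projectdir/ targethost:/gpfs/dst/projectdir/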

For the initial transfer you can also use bbcp.  It can get very good 
performance but isn't nearly as convenient as rsync for subsequent 
transfers.  The performance isn't good with small files but you can use 
tar on both ends to deal with that, in a similar way to what Uwe 
suggests below.  The bbcp documentation outlines how to do that.
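
The pattern from the bbcp docs is something like the following (I'm going
from memory, so treat the flags, stream counts and paths as illustrative
and check the bbcp man page before using it):

    # ship one big tar stream through bbcp so millions of small files
    # travel as a single sequential transfer
    bbcp -P 2 -s 16 -w 8m -N io \
        "tar -cf - -C /gpfs/src projectdir" \
        "targethost:tar -xf - -C /gpfs/dst"

The small-file penalty largely goes away because bbcp only ever sees one
large stream; the per-file metadata work stays in tar on each end.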

Bill

On 3/6/19 8:13 AM, Uwe Falke wrote:
> Hi, in that case I'd open several tar pipes in parallel, maybe over
> carefully selected directories, like
> 
>    tar -cf - <source_dir> | ssh <target_host> "tar -xf -"
> 
> I am not quite sure whether "-C /" for tar works here ("tar -C / -x"), but
> something along these lines should be a good, efficient method.  The target
> hosts should be all nodes having the target file system mounted, and you
> should start those pipes on the nodes with the source file system.
> It is best to start with the largest directories, and to use a master
> script that launches the tar pipes under semaphore control so as not to
> overload anything.
> 
> 
>   
> Mit freundlichen Grüßen / Kind regards
> 
>   
> Dr. Uwe Falke
>   
> IT Specialist
> High Performance Computing Services / Integrated Technology Services /
> Data Center Services
> -------------------------------------------------------------------------------------------------------------------------------------------
> IBM Deutschland
> Rathausstr. 7
> 09111 Chemnitz
> Phone: +49 371 6978 2165
> Mobile: +49 175 575 2877
> E-Mail: uwefalke at de.ibm.com
> -------------------------------------------------------------------------------------------------------------------------------------------
> IBM Deutschland Business & Technology Services GmbH / Geschäftsführung:
> Thomas Wolter, Sven Schooß
> Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart,
> HRB 17122
> 
> 
> 
> 
> From:   "Oesterlin, Robert" <Robert.Oesterlin at nuance.com>
> To:     gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
> Date:   06/03/2019 13:44
> Subject:        [gpfsug-discuss] Follow-up: migrating billions of files
> Sent by:        gpfsug-discuss-bounces at spectrumscale.org
> 
> 
> 
> Some of you had questions to my original post. More information:
>   
> Source:
> - Files are straight GPFS/POSIX - no extended NFSv4 ACLs
> - A solution that requires $'s to be spent on software (i.e., Aspera) isn't
> a very viable option
> - Both source and target clusters are in the same DC
> - Source is stand-alone NSD servers (bonded 10GbE) and 8Gb FC SAN storage
> - Approx 40 file systems, a few large ones with 300M-400M files each,
> others smaller
> - no independent file sets
> - migration must pose minimal disruption to existing users
>   
> Target architecture is a small number of file systems (2-3) on ESS with
> independent filesets
> - Target (ESS) will have multiple 40GbE links on each NSD server (GS4)
>   
> My current thinking is AFM with a pre-populate of the file space, then
> switching the clients over so they pull the data they need (most of the data
> is older and less active), and then letting AFM populate the rest in the
> background.
>   
>   
> Bob Oesterlin
> Sr Principal Storage Engineer, Nuance
>   _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss

