[gpfsug-discuss] GPFS best practises : end user standpoint

John Hearns john.hearns at asml.com
Wed Jan 17 08:39:35 GMT 2018


My own thought on this one is that when software is being developed, the developers use 'small' use cases.
At the company which develops the software, the developers will probably have a desktop machine with a modest amount of RAM and disk space. Then the company might have a small to medium sized HPC cluster.
I know I am stretching things a bit, but I would imagine a lot of effort goes into verifying the correct operation of software on given data sets.
When the software is released to customers, either commercially or internally within a company, it is suddenly applied to larger datasets, and is applied repetitively. Hence the creation of directories filled with thousands of small files.

My own example of this comes from a former job: wind tunnel data was captured as thousands of PNG files, which were frames from a camera. The data was shipped back to me on a hard drive, and I was asked to store it on an HSM system with tape as the lowest tier.
I knew that the PNG files had all been processed anyway, and that the significant data had been extracted. The engineers wanted to keep the data 'just in case'. I knew that keeping thousands of files is bad for filesystem performance, and that on a tape-based system the files can end up stored on many tapes, so if you ever do trigger a recall you have a festival of robot tape loading. So what I did was zip up all the directories.
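
For illustration only (this is not the script used at the time), the zipping step could look something like the Python sketch below. It assumes a hypothetical layout in which each capture run is an immediate subdirectory of one root directory, and the function name and arguments are made up for the example. PNG data is already compressed, so the point is to cut the file count, not to save space.

```python
import shutil
from pathlib import Path


def zip_subdirectories(root, dest):
    """Create one zip archive per immediate subdirectory of root.

    Both arguments are hypothetical paths; dest should live outside
    root so the archives are not picked up as input.
    """
    root = Path(root)
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    archives = []
    for sub in sorted(p for p in root.iterdir() if p.is_dir()):
        # make_archive appends ".zip" to the base name and returns the path
        archive = shutil.make_archive(str(dest / sub.name), "zip", root_dir=str(sub))
        archives.append(Path(archive))
    return archives
```

Each directory of thousands of frames then becomes a single file as far as the filesystem and the tape tier are concerned, so a recall touches one object per run.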


-----Original Message-----
From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Jonathan Buzzard
Sent: Tuesday, January 16, 2018 6:26 PM
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Subject: Re: [gpfsug-discuss] GPFS best practises : end user standpoint

On Tue, 2018-01-16 at 16:35 +0000, Buterbaugh, Kevin L wrote:

[SNIP]

>
> We’re in Tennessee, so not only do we not speak English, we barely
> speak American … y’all will just have to understand, bless your
> hearts!  ;-).
>
> But seriously, like most Universities, we have a ton of users for whom
> English is not their “primary” language, so dealing with “interesting”
> filenames is pretty hard to avoid.  And users’ problems are our
> problems whether or not they’re our problem.
>

User comes with problem; you investigate, find the problem is due to "wacky" characters, point them to the mandatory training documentation, tell them they need to rename their files to something sane, and take no further action. Sure, English is not their primary language, but *they* have chosen to study in an English-speaking country, so best to actually embrace that.

I do get it, many of our users are not native English speakers as well.
Yes it's a tough policy but on the other hand pandering to them does them no favours either.

[SNIP]

> If you’ve got (bio)medical users using your cluster I don’t see how
> you avoid this … they’re using commercial apps that do this kind of
> stupid stuff (10’s of thousands of files in a directory and the full
> path to each file is longer than the contents of the files
> themselves!).

Well then they have justified the use; aka it's not their fault so you up the quota for them. Though they could use different less brain dead software. The idea is to force a bump in the road so the users are aware that what they are doing is considered bad practice. Most users have no idea that putting a million files in a directory is not sensible and worse that trying to access them using a GUI file manager is positively brain dead.
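
As a rough illustration of how such directories might be spotted, here is a minimal Python sketch that walks a tree and reports directories holding more than a given number of files. The function name and the threshold are assumptions made for the example, and on a large filesystem a plain walk like this is itself expensive, so treat it as a sketch and not a production scan.

```python
import os


def crowded_directories(root, threshold):
    """Yield (directory, file_count) for every directory under root
    that directly contains more than threshold files."""
    for dirpath, _dirnames, filenames in os.walk(root):
        if len(filenames) > threshold:
            yield dirpath, len(filenames)
```

A call such as `list(crowded_directories("/path/to/project", 100_000))` (both values hypothetical) would give a list of directories worth a conversation with the user.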

[SNIP]

> OK, so here’s my main question … you’re right that SSD’s are the
> answer … but how do you charge them more?  SSDs are more expensive
> than hard disks, and enterprise SSDs are stupid expensive … and users
> barely want to pay hard drive prices for their storage.  If you’ve got
> the magic answer to how to charge them enough to pay for SSDs I’m sure
> I’m not the only one who’d love to hear how you do it?!?!
>

Give every user a one million file number quota. Need to store more than one million files, then you are going to have to pay $X per extra million files. Either they cough up the money to continue using their brain dead software or they switch to less stupid software. If they complain you just say that enterprise SSD's are stupidly expensive and you are using that space up at an above average rate and so have to pay the costs.

I am quite sure someone storing 1PB has to pay more than someone storing 1TB, so why should someone storing 20 million files not have to pay more than someone storing 100k files? The only difference is people are used to paying more to store extra bytes and not used to paying more for more files, but that is because most sane people don't store millions and millions of files necessitating the purchase of large amounts of expensive enterprise SSD's.
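
The arithmetic of that charging scheme can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the price per extra million files (the $X above) is left as a parameter, the function name is made up, and billing per started million is one possible reading of "per extra million files".

```python
def extra_file_charge(file_count, price_per_million, free_files=1_000_000):
    """Charge for files above the free quota.

    Assumption: each started block of one million extra files is
    billed in full at price_per_million.
    """
    excess = max(0, file_count - free_files)
    # Integer ceiling division: number of started blocks of one million files
    blocks = (excess + 999_999) // 1_000_000
    return blocks * price_per_million
```

Under these assumptions a user with 100k files pays nothing extra, while a user with 20 million files is billed for 19 extra millions.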


JAB.

--
Jonathan A. Buzzard                         Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss