Hi All,

It would be good for anyone to submit a CCPR when they notice this (and you can let me know too).  That gives scicomp an official record of how many people it's affecting and makes it more likely to be properly addressed.

I think the slow /work (or /volatile) often happens when (new) people submit jobs to the batch farm that do lots of IO on /work (instead of using the input/output tags to let the system stage data for them and keep all intensive job IO on the local filesystem).  Then one day they get a thousand jobs running simultaneously and kill the fileserver.  I know that happened again yesterday, and someone's job queue got limited to 100 as a result.
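
For anyone new to this, here is a minimal sketch of the pattern I mean (the paths and the reconstruction command below are made up for illustration; the real mechanism is the input/output tags in the job submission, which stage data for you):

    #!/usr/bin/env python
    # Rough sketch only: keep the heavy IO on the node-local disk, not on /work.
    import os, shutil, subprocess

    scratch = os.environ.get("TMPDIR", "/scratch")      # node-local disk on the farm node (illustrative)
    local_in = os.path.join(scratch, "run.evio")
    local_out = os.path.join(scratch, "run_recon.evio")

    # copy (or let the farm stage) the input onto the local disk once
    shutil.copy("/work/hps/some_input.evio", local_in)  # hypothetical input path

    # all the intensive reads and writes happen on the local disk
    subprocess.check_call(["hps-recon", "-i", local_in, "-o", local_out])  # placeholder command

    # copy only the final output back to /work (or let the output tag send it to tape)
    shutil.copy(local_out, "/work/hps/outputs/")        # hypothetical output area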

Rafo, is your tarring really too slow to keep up?  We definitely need to keep an eye on that and understand it.  Things look better today, but let me know.  Asking for a dedicated disk is an option, but we'd first need to show that we *really* need it.  I made a similar inquiry for CLAS12 recently, but so far we're working without it.

-Nathan


On Feb 7, 2019, at 23:42, Graf, Norman A. <[log in to unmask]> wrote:

Hello Rafo, Nathan,

Thanks for shepherding and monitoring the reconstruction. 

I, too, have noticed some issues with the /work disks. Transfers to SLAC have been timing out very often in the past few weeks, and I have been noticing the same latency when doing a simple "ls."

Norman



From: [log in to unmask] <[log in to unmask]> on behalf of Rafayel Paremuzyan <[log in to unmask]>
Sent: Thursday, February 7, 2019 7:28 PM
To: Nathan Baltzell
Cc: hps-software
Subject: Re: The new cooking
 
Hi Nathan,

These are the last jobs of the 10% pass.

Usually, yes, I keep the queue filled; I don't wait for all the jobs to finish before submitting the next chunk.


Another thing that is getting more and more annoying is that the ifarm machines are becoming practically unusable: most of the time
they are overloaded, and a simple "ls" command takes forever. This happens on the /work disk too.
This impacts the tarring, i.e. the farm jobs put a lot of output on the work disk, but the ifarms are too slow to keep up with
the farm jobs.
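
(For reference, the tarring step itself is basically just the following; this is a rough sketch with a made-up run number and paths, and the tape command is an assumption whose exact syntax would need to be checked:)

    # Rough sketch of the tar-and-archive step.
    import glob, os, subprocess, tarfile

    run = "008000"                                            # example run number (made up)
    files = sorted(glob.glob("/work/hps/recon/%s/*" % run))   # hypothetical output area on /work

    tarname = "/work/hps/tars/hps_%s.tar" % run
    with tarfile.open(tarname, "w") as tar:                   # plain tar, no compression
        for f in files:
            tar.add(f, arcname=os.path.basename(f))

    # write the tarball to the tape library (assuming a jput-style command; check exact syntax)
    subprocess.check_call(["jput", tarname, "/mss/hallb/hps/recon/"])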

One thing I was thinking: it would be good if HPS had a separate machine dedicated to tarring files and sending them to tape,
or even better, if that machine had about 20 TB of free space, separate from the work disk.
I don't know how reasonable this is...

Rafo




From: Nathan Baltzell
Sent: Thursday, February 7, 2019 9:47:27 PM
To: Rafayel Paremuzyan
Cc: HPS-SOFTWARE
Subject: Re: The new cooking
 
Hi Rafo,

Based on the scicomp.jlab.org website, I see ~800 HPS jobs currently running in SLURM (great, almost 3x more than HPS's previous *average* on the JLab batch farm), but almost none queued or in depend state.  I'm wondering if it might be better to keep the queue more saturated?  I guess you are staging things in batches (including tarballing to tape)?  OK, I was just taking a look and am curious about throughput :)


-Nathan



On Feb 6, 2019, at 23:52, Rafayel Paremuzyan <[log in to unmask]> wrote:

Hi all,

while the new pass BLPass4b is being cooked,
you can find pass-related details on this Confluence page.


Rafo


Use REPLY-ALL to reply to list
To unsubscribe from the HPS-SOFTWARE list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=HPS-SOFTWARE&A=1