From what Maurik said earlier in this email, he already had
scicomp today start transferring them directly to cache first
while waiting on them to make it to tape.  Is this not working,
or did I misunderstand?



On Mar 8, 2016, at 7:47 PM, Sebouh Paul <[log in to unmask]> wrote:

> Stepan has suggested something that might make things work smoother.  Instead of just copying files from the counting house to tape, we should simultaneously also copy them to a location in /volatile/, so that I can run the DQM jobs inputting from the volatile location.  After the jobs are done, I can delete the files from the volatile, or I can set the script up so that the file is deleted when the job is over.   Anyone else think this is a good idea?  
> 
> On Tue, Mar 8, 2016 at 4:11 PM, McCormick, Jeremy I. <[log in to unmask]> wrote:
> Hi,
> 
> Rerouted to software list:
> 
> I don't believe swif has a way to directly monitor directories for new files, but we can do this in python.  I'm working with Sebouh on the procedures for this right now.  We already have several useful scripts; they just need to be cleaned up a bit and added to a cron job.  We can also use the datacat to tell if DQM files exist already (depends on how fancy we want to get).
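
A minimal sketch of what such a cron-driven check might look like, assuming a plain polling approach with a small JSON state file. The function name `find_new_files`, the state-file layout, and the `*.evio.*` pattern are illustrative assumptions, not the actual scripts mentioned above.

```python
import json
from pathlib import Path


def find_new_files(watch_dir, state_file, pattern="*.evio.*"):
    """Return names of files in watch_dir not seen on any earlier call.

    The names already seen are kept in state_file (a JSON list), so the
    function can be called repeatedly, e.g. from a cron job.
    """
    state_path = Path(state_file)
    # Load the set of file names recorded by previous calls, if any.
    seen = set(json.loads(state_path.read_text())) if state_path.exists() else set()
    # Names currently present in the watched directory.
    current = {p.name for p in Path(watch_dir).glob(pattern) if p.is_file()}
    new = sorted(current - seen)
    # Persist the union so the next call reports only later arrivals.
    state_path.write_text(json.dumps(sorted(seen | current)))
    return new
```

Each call returns only the files that arrived since the previous call, so the caller could submit one DQM job per returned name. The datacat lookup mentioned above could replace the state file, but that is not shown here.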
> 
> As far as DQM, I suggested Sebouh process runs 7796 - 7800 to start, because these are marked as very good runs in the spreadsheet (good beam).  If there are additional runs that we want to look at specifically right now, please communicate with him about it.
> 
> --Jeremy
> 
> -----Original Message-----
> From: Maurik Holtrop [mailto:[log in to unmask]]
> Sent: Tuesday, March 08, 2016 11:54 AM
> To: Nathan Baltzell
> Cc: Sebouh Paul; McCormick, Jeremy I.; Bradley Yale; Graham, Mathew Thomas; Uemura, Sho; Holly Vance
> Subject: Re: Keeping up with DQM
> 
> Hello Nathan, Sebouh,
> 
> We are now also seeing the files appear at /cache/mss/hallb/hps/data . There are 287 files there right now.
> I am not sure if the "swif" system is smart enough to recognize that the files are there, and thus not also try to queue them from tape. Not having to grab them from tape will be a big plus.
> Sebouh should be able to change his glob to point to the /cache/mss rather than /mss and get things working directly from the cache.
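
As a rough illustration of the path change, the sketch below maps a tape path under /mss to its counterpart under /cache/mss and falls back to the tape path when no cached copy exists. The helper names `cache_path_for` and `preferred_input` are hypothetical, and the assumption that the cache mirrors the /mss directory layout is taken from the paths quoted in this thread.

```python
from pathlib import Path

MSS_ROOT = "/mss"          # tape stub tree, as quoted in this thread
CACHE_ROOT = "/cache/mss"  # cache tree, as quoted in this thread


def cache_path_for(mss_path, mss_root=MSS_ROOT, cache_root=CACHE_ROOT):
    """Translate a path under mss_root to the same relative path under cache_root."""
    rel = Path(mss_path).relative_to(mss_root)
    return Path(cache_root) / rel


def preferred_input(mss_path, mss_root=MSS_ROOT, cache_root=CACHE_ROOT):
    """Return the cached copy if it exists on disk, otherwise the original tape path."""
    cached = cache_path_for(mss_path, mss_root, cache_root)
    return cached if cached.exists() else Path(mss_path)
```

For example, `cache_path_for("/mss/hallb/hps/data/hps_007799.evio.262")` yields `/cache/mss/hallb/hps/data/hps_007799.evio.262`.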
> 
> I am noticing that although many of last weekend's runs are now on tape, and have been there for a while, there is still no new DQM output. Sebouh, perhaps you can keep your eye on this, and start jobs from the /cache on the files that are appearing there. It will be good to see at least some of the output before too long.
> 
> Best,
>         Maurik
> 
> 
> > On Mar 8, 2016, at 10:56 AM, Nathan Baltzell <[log in to unmask]> wrote:
> >
> > We should see a huge increase in speed of copying data to tape, now
> > that Sergey mounted scicomp's lustre drive on clondaq5.
> >
> > -Nathan
> >
> >
> >
> > On Mar 7, 2016, at 22:35, Maurik Holtrop <[log in to unmask]> wrote:
> >
> >> Hello Sebouh,
> >>
> >> I do wonder if your jobs are stuck because you submitted the job *before* the file actually existed on the tape silo. I see ReconDataQuality_7781 in the job queue, but no 7781 file on the tapes. You can still find those files on clondaq5 in /data/totape, so presumably they haven't been copied yet. This is perhaps strangely slow?
> >>
> >> You may want to double check if jobs that were started before the file existed on tape actually run when that file becomes available on the tape silo, or if these jobs are going to be stuck in perpetuity. You should be able to check.
> >>
> >> I can see on the silo:
> >>
> >> hps@ifarm1102> ls -l /mss/hallb/hps/data/hps_007799.evio.262
> >> -r--r--r-- 1 halldata nobody 441 Mar  7 08:00 /mss/hallb/hps/data/hps_007799.evio.262
> >>
> >> and your job asking for that file (job id 21152089) is still pending....
> >>
> >> If the files are directly put on the /cache drives, in principle this would save a tape operation. I heard that Hall-A is doing this with their data as well.
> >> Making use of these files should not be a lot of changes to your scripts I would think. Just remember that it would be nice to mark the files for deletion when you are done with them.
> >>
> >> Best,
> >>      Maurik
> >>
> >>
> >>
> >>
> >>
> >>> On Mar 7, 2016, at 9:55 PM, Sebouh Paul <[log in to unmask]> wrote:
> >>>
> >>> Jeremy, what do you think of Maurik's suggestion?  It's already Monday night, and all of the DQM jobs from this weekend have been stuck in dependency-limbo, waiting for the files from tape.
> >>>
> >>> On Mon, Mar 7, 2016 at 4:48 PM, Maurik Holtrop <[log in to unmask]> wrote:
> >>> Hello Bradley,
> >>>
> >>> No need to, I think.
> >>>
> >>> If you look at the DQM jobs, you will see that they are all pending on a dependency = the files are not available.
> >>>
> >>> As far as I can see, it is not the job slots but the files, but I would appreciate it if someone else could check that I am coming to the correct conclusion.
> >>>
> >>> I suggest that we move to a slightly different way of processing the data:
> >>>
> >>> * Files that are copied from the counting house to JLab are immediately put on the /cache disk as soon as they are written to tape.
> >>> * Sebouh, or a clever script acting on his behalf, monitors /cache/hallb/hps/data for new files and starts batch jobs to process them immediately.
> >>> * As soon as a file is processed from /cache, the file is marked for deletion. This is needed so that we don't immediately fill our cache quota with raw files and leave no space for other use.
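
The last step of the list above could be sketched as follows, assuming the processing and the release are supplied as callables. Both `run_job` and `release` are hypothetical placeholders: the real batch submission and the real mechanism for marking a cache file for deletion are not specified in this thread and are deliberately not guessed at here.

```python
def process_and_release(files, run_job, release):
    """Process each file and release only the ones that were processed successfully.

    run_job(f) -> bool : hypothetical callable, True if processing f succeeded
    release(f)         : hypothetical callable that marks f for deletion
    Returns (released, failed) as two lists of file names.
    """
    released, failed = [], []
    for f in files:
        try:
            ok = bool(run_job(f))
        except Exception:
            # Treat any error as a failed job so the raw file is kept.
            ok = False
        if ok:
            release(f)
            released.append(f)
        else:
            failed.append(f)
    return released, failed
```

The point of the design is that a file is never released unless its job has finished successfully, so a failed job can be rerun from the cached copy. With real batch jobs the success check would happen after the job completes rather than inline as here.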
> >>>
> >>> Step one has to be arranged with the computer center. I took the jump and already asked Chris to set this up.
> >>>
> >>> Best,
> >>>     Maurik
> >>>
> >>>
> >>>
> >>>
> >>>> On Mar 7, 2016, at 4:11 PM, Bradley T Yale <[log in to unmask]> wrote:
> >>>>
> >>>> Sorry, I'm killing the pending aprime jobs so that yours can start.
> >>>> These were mainly for increasing the statistics for Omar's analysis, but not as high of a priority I think.
> >>>> The farm was also very crowded over the weekend, which did not help things.
> >>>>
> >>>> From: Sebouh Paul <[log in to unmask]>
> >>>> Sent: Monday, March 7, 2016 12:10 PM
> >>>> To: Maurik Holtrop
> >>>> Cc: Bradley T Yale; Nathan Baltzell; Mathew Thomas Graham; Sho Uemura; Holly Vance
> >>>> Subject: Re: Keeping up with DQM
> >>>>
> >>>> If you have any suggestions as to how to increase priority for the DQM jobs (or decrease it for the other hps jobs that can wait, such as monte carlo) let me know.
> >>>> On Mar 7, 2016 12:06 PM, "Sebouh Paul" <[log in to unmask]> wrote:
> >>>> I have submitted jobs to the farm for the runs in which all or at least most of the files have been transferred to tape, but none of them have started running yet.  My guess is the farm is giving higher priority to the slic_aprimes jobs than to my dqm jobs, since those are sometimes running but none of my dqm jobs have started running yet.
> >>>> On Mar 7, 2016 11:53 AM, "Maurik Holtrop" <[log in to unmask]> wrote:
> >>>> Hello Sebouh,
> >>>>
> >>>> How well are you able to keep up with DQM output as data comes out of the counting house?
> >>>>
> >>>> If we had continuous running, the goal would be to have DQM report within 24 hours of the data being taken. I.e. there would be a summary of DQM at each run meeting on the quality of the data taken the previous day. At this point, I am not yet seeing the DQM output from last Friday-Saturday runs, several of which were 100M+ events, in /lustre/expphy/work/hallb/hps/data/physrun2016/pass0/dqm. Can you please let me know what the throughput is of DQM?
> >>>>
> >>>> Not having this output in a timely manner also hinders the experts that should be looking at this output.
> >>>>
> >>>> Best,
> >>>>        Maurik
> >>>
> >>>
> >>
> >
> 
> ########################################################################
> Use REPLY-ALL to reply to list
> 
> To unsubscribe from the HPS-SOFTWARE list, click the following link:
> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=HPS-SOFTWARE&A=1
> 
> 
