Stepan has suggested something that might make things work more smoothly.
Instead of just copying files from the counting house to tape, we should
simultaneously copy them to a location in /volatile/, so that I can run
the DQM jobs reading their input from the volatile location.  After the
jobs are done, I can delete the files from /volatile, or I can set the
script up so that each file is deleted when its job is over.  Does anyone
else think this is a good idea?
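
Something like this minimal sketch is what I have in mind; the staging
directory and the DQM command are placeholders, not our real scripts:

    #!/usr/bin/env python
    # Sketch: stage a raw file to /volatile, run DQM on it, then clean up.
    # VOLATILE_DIR and run_dqm.sh are hypothetical names for illustration.
    import os
    import shutil
    import subprocess

    VOLATILE_DIR = "/volatile/hallb/hps/dqm_staging"  # hypothetical staging area

    def run_dqm_from_volatile(raw_file):
        """Copy raw_file to /volatile, run DQM on the copy, then delete it."""
        if not os.path.isdir(VOLATILE_DIR):
            os.makedirs(VOLATILE_DIR)
        staged = os.path.join(VOLATILE_DIR, os.path.basename(raw_file))
        shutil.copy2(raw_file, staged)
        try:
            # Placeholder for the real DQM invocation (a batch job in practice).
            subprocess.check_call(["run_dqm.sh", staged])
        finally:
            # Delete the staged copy whether or not the job succeeded, so
            # /volatile does not fill up with raw files.
            os.remove(staged)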

On Tue, Mar 8, 2016 at 4:11 PM, McCormick, Jeremy I. <[log in to unmask]> wrote:

> Hi,
>
> Rerouted to software list:
>
> I don't believe swif has a way to directly monitor directories for new
> files, but we can do this in python.  I'm working with Sebouh on the
> procedures for this right now.  We already have several useful scripts;
> they just need to be cleaned up a bit and added to a cron job.  We can also
> use the datacat to tell if DQM files exist already (depends on how fancy we
> want to get).
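>
> Roughly what I have in mind for the cron job is something like the
> sketch below; the glob pattern, bookkeeping file, and submit command
> are all placeholders rather than our real tools:
>
>     #!/usr/bin/env python
>     # Sketch of the cron job: find raw files we have not yet submitted
>     # a DQM job for, and submit one per file. Paths are illustrative.
>     import glob
>     import os
>     import subprocess
>
>     RAW_GLOB = "/cache/mss/hallb/hps/data/hps_*.evio.*"
>     DONE_LIST = os.path.expanduser("~/dqm_processed.txt")  # hypothetical
>
>     def already_processed():
>         """Return the set of files we have already submitted jobs for."""
>         if not os.path.exists(DONE_LIST):
>             return set()
>         with open(DONE_LIST) as f:
>             return set(line.strip() for line in f)
>
>     def main():
>         done = already_processed()
>         for raw in sorted(glob.glob(RAW_GLOB)):
>             if raw in done:
>                 continue
>             # Placeholder submit command; the real one would build a farm job.
>             subprocess.check_call(["submit_dqm_job.sh", raw])
>             with open(DONE_LIST, "a") as f:
>                 f.write(raw + "\n")
>
>     if __name__ == "__main__":
>         main()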
>
> As far as DQM, I suggested Sebouh process runs 7796 - 7800 to start,
> because these are marked as very good runs in the spreadsheet (good beam).
> If there are additional runs that we want to look at specifically right
> now, please communicate with him about it.
>
> --Jeremy
>
> -----Original Message-----
> From: Maurik Holtrop [mailto:[log in to unmask]]
> Sent: Tuesday, March 08, 2016 11:54 AM
> To: Nathan Baltzell
> Cc: Sebouh Paul; McCormick, Jeremy I.; Bradley Yale; Graham, Mathew
> Thomas; Uemura, Sho; Holly Vance
> Subject: Re: Keeping up with DQM
>
> Hello Nathan, Sebouh,
>
> We are now also seeing the files appear at /cache/mss/hallb/hps/data.
> There are 287 files there right now.
> I am not sure if the "swif" system is smart enough to recognize that the
> files are there, and thus not try to stage them from tape as well. Not
> having to fetch them from tape will be a big plus.
> Sebouh should be able to change his glob to point to /cache/mss rather
> than /mss and get things working directly from the cache.
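> Something along these lines in his script; the pattern itself is just
> an example:
>
>     import glob
>     # Read directly from the write-through cache instead of the tape stubs:
>     raw_files = glob.glob("/cache/mss/hallb/hps/data/hps_*.evio.*")
>     # instead of: glob.glob("/mss/hallb/hps/data/hps_*.evio.*")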
>
> I am noticing that although many of last weekend's runs are now on tape,
> and have been there for a while, there is still no new DQM output. Sebouh,
> perhaps you can keep an eye on this, and start jobs from /cache on the
> files that are appearing there. It will be good to see at least some of
> the output before too long.
>
> Best,
>         Maurik
>
>
> > On Mar 8, 2016, at 10:56 AM, Nathan Baltzell <[log in to unmask]> wrote:
> >
> > We should see a huge increase in speed of copying data to tape, now
> > that Sergey mounted scicomp's lustre drive on clondaq5.
> >
> > -Nathan
> >
> >
> >
> > On Mar 7, 2016, at 22:35, Maurik Holtrop <[log in to unmask]> wrote:
> >
> >> Hello Sebouh,
> >>
> >> I do wonder if your jobs are stuck because you submitted the job
> >> *before* the file actually existed on the tape silo. I see
> >> ReconDataQuality_7781 in the job queue, but no 7781 file on the tapes.
> >> You can still find those files on clondaq5 in /data/totape, so
> >> presumably they haven't been copied yet. This seems strangely slow.
> >>
> >> You may want to double-check whether jobs that were started before the
> >> file existed on tape actually run once that file becomes available on
> >> the tape silo, or whether these jobs are going to be stuck in
> >> perpetuity.
> >>
> >> I can see on the silo:
> >>
> >> hps@ifarm1102> ls -l /mss/hallb/hps/data/hps_007799.evio.262
> >> -r--r--r-- 1 halldata nobody 441 Mar  7 08:00 /mss/hallb/hps/data/hps_007799.evio.262
> >>
> >> and your job asking for that file (job id 21152089) is still pending....
> >>
> >> If the files are directly put on the /cache drives, in principle this
> >> would save a tape operation. I heard that Hall-A is doing this with
> >> their data as well.
> >> Making use of these files should not require many changes to your
> >> scripts, I would think. Just remember that it would be nice to mark
> >> the files for deletion when you are done with them.
> >>
> >> Best,
> >>      Maurik
> >>
> >>
> >>
> >>
> >>
> >>> On Mar 7, 2016, at 9:55 PM, Sebouh Paul <[log in to unmask]> wrote:
> >>>
> >>> Jeremy, what do you think of Maurik's suggestion?  It's already Monday
> >>> night, and all of the DQM jobs from this weekend have been stuck in
> >>> dependency limbo, waiting for the files from tape.
> >>>
> >>> On Mon, Mar 7, 2016 at 4:48 PM, Maurik Holtrop <[log in to unmask]>
> wrote:
> >>> Hello Bradley,
> >>>
> >>> No need to, I think.
> >>>
> >>> If you look at the DQM jobs, you will see that they are all pending on
> >>> a dependency, i.e. the files are not available.
> >>>
> >>> As far as I can see, the bottleneck is not the job slots but the
> >>> files, but I would appreciate it if someone else could check that I am
> >>> reaching the correct conclusion.
> >>>
> >>> I suggest that we move to a slightly different way of processing the
> >>> data:
> >>>
> >>> * Files that are copied from the counting house to JLab are
> >>> immediately put on the /cache disk as soon as they are written to tape.
> >>> * Sebouh, or a clever script acting on his behalf, monitors
> >>> /cache/hallb/hps/data for new files and starts batch jobs to process
> >>> them immediately.
> >>> * As soon as a file is processed from /cache, the file is marked for
> >>> deletion (see the sketch below). This is needed so that we don't fill
> >>> our cache quota immediately with raw files, leaving no space for
> >>> other use.
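> >>>
> >>> As a rough sketch of steps two and three (the submit script and the
> >>> cache-release step are stand-ins, not the real tools):
> >>>
> >>>     #!/usr/bin/env python
> >>>     # Sketch: poll the cache for new raw files, submit a DQM job for
> >>>     # each, and rely on the job itself to mark its input for deletion
> >>>     # when done. submit_dqm_job.sh is a hypothetical name.
> >>>     import glob
> >>>     import subprocess
> >>>     import time
> >>>
> >>>     CACHE_GLOB = "/cache/hallb/hps/data/hps_*.evio.*"
> >>>
> >>>     submitted = set()
> >>>     while True:
> >>>         for raw in sorted(glob.glob(CACHE_GLOB)):
> >>>             if raw not in submitted:
> >>>                 submitted.add(raw)
> >>>                 # Hypothetical submit script; its job should end by
> >>>                 # requesting cache deletion of its input file (step 3).
> >>>                 subprocess.check_call(["submit_dqm_job.sh", raw])
> >>>         time.sleep(600)  # check for new files every ten minutes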
> >>>
> >>> Step one has to be arranged with the computer center. I went ahead and
> >>> already asked Chris to set this up.
> >>>
> >>> Best,
> >>>     Maurik
> >>>
> >>>
> >>>
> >>>
> >>>> On Mar 7, 2016, at 4:11 PM, Bradley T Yale <[log in to unmask]>
> wrote:
> >>>>
> >>>> Sorry, I'm killing the pending aprime jobs so that yours can start.
> >>>> These were mainly for increasing the statistics for Omar's analysis,
> >>>> but they are not as high a priority, I think.
> >>>> The farm was also very crowded over the weekend, which did not help
> >>>> things.
> >>>>
> >>>> From: Sebouh Paul <[log in to unmask]>
> >>>> Sent: Monday, March 7, 2016 12:10 PM
> >>>> To: Maurik Holtrop
> >>>> Cc: Bradley T Yale; Nathan Baltzell; Mathew Thomas Graham; Sho
> >>>> Uemura; Holly Vance
> >>>> Subject: Re: Keeping up with DQM
> >>>>
> >>>> If you have any suggestions as to how to increase the priority of the
> >>>> DQM jobs (or decrease it for the other hps jobs that can wait, such
> >>>> as Monte Carlo), let me know.
> >>>> On Mar 7, 2016 12:06 PM, "Sebouh Paul" <[log in to unmask]> wrote:
> >>>> I have submitted jobs to the farm for the runs in which all or at
> >>>> least most of the files have been transferred to tape, but none of
> >>>> them have started running yet.  My guess is the farm is giving higher
> >>>> priority to the slic_aprimes jobs than to my dqm jobs, since those
> >>>> are sometimes running but none of my dqm jobs have started running
> >>>> yet.
> >>>>
> >>>> On Mar 7, 2016 11:53 AM, "Maurik Holtrop" <[log in to unmask]> wrote:
> >>>> Hello Sebouh,
> >>>>
> >>>> How well are you able to keep up with DQM output as data comes out of
> the counting house?
> >>>>
> >>>> If we had continuous running, the goal would be to have a DQM report
> >>>> within 24 hours of the data being taken, i.e. there would be a
> >>>> summary at each run meeting of the quality of the data taken the
> >>>> previous day. At this point, I am not yet seeing the DQM output from
> >>>> last Friday's and Saturday's runs, several of which were 100M+
> >>>> events, in
> >>>> /lustre/expphy/work/hallb/hps/data/physrun2016/pass0/dqm. Can you
> >>>> please let me know what the throughput of DQM is?
> >>>>
> >>>> Not having this output in a timely manner also hinders the experts
> >>>> who should be looking at it.
> >>>>
> >>>> Best,
> >>>>        Maurik
> >>>
> >>>
> >>
> >
>

########################################################################
Use REPLY-ALL to reply to list

To unsubscribe from the HPS-SOFTWARE list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=HPS-SOFTWARE&A=1