Stepan has suggested something that might make things work more smoothly. Instead of just copying files from the counting house to tape, we should also copy them simultaneously to a location in /volatile/, so that I can run the DQM jobs with input from the volatile location. After the jobs are done, I can delete the files from /volatile/, or I can set the script up so that each file is deleted when its job is over. Anyone else think this is a good idea?

On Tue, Mar 8, 2016 at 4:11 PM, McCormick, Jeremy I. <[log in to unmask]> wrote:

> Hi,
>
> Rerouted to software list:
>
> I don't believe swif has a way to directly monitor directories for new files, but we can do this in python. I'm working with Sebouh on the procedures for this right now. We already have several useful scripts; they just need to be cleaned up a bit and added to a cron job. We can also use the datacat to tell if DQM files exist already (depending on how fancy we want to get).
>
> As far as DQM goes, I suggested Sebouh process runs 7796 - 7800 to start, because these are marked as very good runs in the spreadsheet (good beam). If there are additional runs that we want to look at specifically right now, please communicate with him about it.
>
> --Jeremy
>
> -----Original Message-----
> From: Maurik Holtrop [mailto:[log in to unmask]]
> Sent: Tuesday, March 08, 2016 11:54 AM
> To: Nathan Baltzell
> Cc: Sebouh Paul; McCormick, Jeremy I.; Bradley Yale; Graham, Mathew Thomas; Uemura, Sho; Holly Vance
> Subject: Re: Keeping up with DQM
>
> Hello Nathan, Sebouh,
>
> We are now also seeing the files appear at /cache/mss/hallb/hps/data. There are 287 files there right now.
> I am not sure if the "swif" system is smart enough to recognize that the files are there, and thus not also try to queue them from tape. Not having to grab them from tape will be a big plus.
> Sebouh should be able to change his glob to point to /cache/mss rather than /mss and get things working directly from the cache.
>
> I am noticing that although many of last weekend's runs are now on tape, and have been there for a while, there is still no new DQM output. Sebouh, perhaps you can keep an eye on this, and start jobs from /cache on the files that are appearing there. It will be good to see at least some of the output before too long.
>
> Best,
> Maurik
>
>
> > On Mar 8, 2016, at 10:56 AM, Nathan Baltzell <[log in to unmask]> wrote:
> >
> > We should see a huge increase in the speed of copying data to tape, now that Sergey mounted scicomp's lustre drive on clondaq5.
> >
> > -Nathan
> >
> >
> > On Mar 7, 2016, at 22:35, Maurik Holtrop <[log in to unmask]> wrote:
> >
> >> Hello Sebouh,
> >>
> >> I do wonder if your jobs are stuck because you submitted the job *before* the file actually existed on the tape silo. I see ReconDataQuality_7781 in the job queue, but no 7781 file on the tapes. You can still find those files on clondaq5 in /data/totape, so presumably they haven't been copied yet. This seems strangely slow.
> >>
> >> You may want to double-check whether jobs that were started before the file existed on tape actually run once that file becomes available on the tape silo, or whether these jobs are going to be stuck in perpetuity. You should be able to check.
> >>
> >> I can see on the silo:
> >>
> >> hps@ifarm1102> ls -l /mss/hallb/hps/data/hps_007799.evio.262
> >> -r--r--r-- 1 halldata nobody 441 Mar 7 08:00 /mss/hallb/hps/data/hps_007799.evio.262
> >>
> >> and your job asking for that file (job id 21152089) is still pending....
> >>
> >> If the files are put directly on the /cache drives, in principle this would save a tape operation. I heard that Hall-A is doing this with their data as well.
> >> Making use of these files should not require many changes to your scripts, I would think. Just remember that it would be nice to mark the files for deletion when you are done with them.
> >>
> >> Best,
> >> Maurik
> >>
> >>
> >>
> >>
> >>
> >>> On Mar 7, 2016, at 9:55 PM, Sebouh Paul <[log in to unmask]> wrote:
> >>>
> >>> Jeremy, what do you think of Maurik's suggestion? It's already Monday night, and all of the DQM jobs from this weekend have been stuck in dependency limbo, waiting for the files from tape.
> >>>
> >>> On Mon, Mar 7, 2016 at 4:48 PM, Maurik Holtrop <[log in to unmask]> wrote:
> >>> Hello Bradley,
> >>>
> >>> No need to, I think.
> >>>
> >>> If you look at the DQM jobs, you will see that they are all pending on a dependency, i.e. the files are not available.
> >>>
> >>> As far as I can see, it is not the job slots but the files, but I would appreciate it if someone else could check that I am coming to the correct conclusion.
> >>>
> >>> I suggest that we move to a slightly different way of processing the data:
> >>>
> >>> * Files that are copied from the counting house to JLab are put on the /cache disk as soon as they are written to tape.
> >>> * Sebouh, or a clever script acting on his behalf, monitors /cache/hallb/hps/data for new files and starts batch jobs to process them immediately.
> >>> * As soon as a file is processed from /cache, the file is marked for deletion. This is needed so that we don't immediately fill our cache quota with raw files, leaving no space for other use.
> >>>
> >>> Step one has to be arranged with the computer center. I took the leap and already asked Chris to set this up.
> >>>
> >>> Best,
> >>> Maurik
> >>>
> >>>
> >>>
> >>>
> >>>> On Mar 7, 2016, at 4:11 PM, Bradley T Yale <[log in to unmask]> wrote:
> >>>>
> >>>> Sorry, I'm killing the pending aprime jobs so that yours can start.
> >>>> These were mainly for increasing the statistics for Omar's analysis, but not as high a priority, I think.
> >>>> The farm was also very crowded over the weekend, which did not help things.
> >>>>
> >>>> From: Sebouh Paul <[log in to unmask]>
> >>>> Sent: Monday, March 7, 2016 12:10 PM
> >>>> To: Maurik Holtrop
> >>>> Cc: Bradley T Yale; Nathan Baltzell; Mathew Thomas Graham; Sho Uemura; Holly Vance
> >>>> Subject: Re: Keeping up with DQM
> >>>>
> >>>> If you have any suggestions as to how to increase the priority of the DQM jobs (or decrease it for the other hps jobs that can wait, such as Monte Carlo), let me know.
> >>>>
> >>>> On Mar 7, 2016 12:06 PM, "Sebouh Paul" <[log in to unmask]> wrote:
> >>>> I have submitted jobs to the farm for the runs in which all, or at least most, of the files have been transferred to tape, but none of them have started running yet. My guess is that the farm is giving higher priority to the slic_aprimes jobs than to my dqm jobs, since those are sometimes running while none of my dqm jobs have started yet.
> >>>>
> >>>> On Mar 7, 2016 11:53 AM, "Maurik Holtrop" <[log in to unmask]> wrote:
> >>>> Hello Sebouh,
> >>>>
> >>>> How well are you able to keep up with DQM output as data comes out of the counting house?
> >>>>
> >>>> If we had continuous running, the goal would be to have a DQM report within 24 hours of the data being taken; i.e., at each run meeting there would be a DQM summary of the quality of the data taken the previous day. At this point, I am not yet seeing the DQM output from last Friday-Saturday's runs, several of which were 100M+ events, in /lustre/expphy/work/hallb/hps/data/physrun2016/pass0/dqm. Can you please let me know what the throughput of DQM is?
> >>>>
> >>>> Not having this output in a timely manner also hinders the experts who should be looking at this output.
> >>>>
> >>>> Best,
> >>>> Maurik
> >>>
> >>>
> >>
> >
>
> ########################################################################
> Use REPLY-ALL to reply to list
>
> To unsubscribe from the HPS-SOFTWARE list, click the following link:
> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=HPS-SOFTWARE&A=1
> ########################################################################
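The cron-driven monitoring that Jeremy describes above (watch a directory for new raw files, submit a DQM job per new file) could be sketched roughly as follows. This is a minimal illustration, not the group's actual scripts: the cache path, the state-file name, and the idea of recording seen files in a plain text file are all assumptions, and the actual job submission (via swif or similar) is left out.

```python
# Hypothetical sketch of a cron-driven directory monitor for new raw
# data files. Paths and the state-file mechanism are illustrative only.
import glob
import os


def find_new_files(data_dir, state_file, pattern="hps_*.evio.*"):
    """Return files in data_dir matching pattern that are not yet
    recorded in state_file, and append them to state_file so the next
    cron invocation skips them."""
    seen = set()
    if os.path.exists(state_file):
        with open(state_file) as f:
            seen = set(line.strip() for line in f)
    current = sorted(glob.glob(os.path.join(data_dir, pattern)))
    new = [path for path in current if path not in seen]
    with open(state_file, "a") as f:
        for path in new:
            f.write(path + "\n")
    return new
```

From cron, one would call `find_new_files("/cache/mss/hallb/hps/data", ...)` and submit one DQM batch job per returned path; a datacat lookup could additionally filter out runs whose DQM output already exists.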
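Maurik's third step (process a file from /cache, then mark it for deletion so raw files do not exhaust the cache quota) might look like the sketch below. The `dqm_command` and the deletion mechanism are placeholders: here a sidecar `.done` marker stands in for whatever cache-management interface the computer center actually provides.

```python
# Sketch of "process then release": run a DQM command on a cached raw
# file and, on success, flag the file as eligible for cache eviction.
# The marker-file approach is a stand-in for the real cache tools.
import subprocess


def process_and_release(raw_file, dqm_command):
    """Run dqm_command on raw_file; on success, write a sidecar marker
    indicating the file may be deleted from the cache. Returns True on
    success, False otherwise."""
    result = subprocess.run(dqm_command + [raw_file])
    if result.returncode == 0:
        # Placeholder for the real mark-for-deletion call.
        open(raw_file + ".done", "w").close()
        return True
    return False
```

Wiring this into the batch job itself (rather than a separate cleanup pass) would guarantee that files are released exactly once, as soon as their job finishes.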