GT&S Coordination 	  	Wednesday 25 July 2007
from 16:30 to 18:00
at CERN ( 32-S-C22 )
chaired by: Stephen Gowdy
Description:

Phone CERN 76000 code 0124636# (leader is 0114663#)
       or https://audioconf.cern.ch/call/0124636 .

Present: John, Alexei, Stephen, Gilbert, Dietrich, Sanjay, Kaushik,
 	 Simone

16:30  	Minutes of last meeting and action list (05')

 	No corrections.

16:35  	Hot topics (20')
     * Pilot jobs on EGEE (15')				Simone Campana

       Initiated about a month ago. Every slot at CERN for production
       jobs was taken by Sanjay's pilots, which were using no CPU time.
       The problem was worse because the length of the queue was three
       weeks for wall clock and one week for CPU. The problem of long
       queues is independent of the pilot jobs. No other jobs could get
       through because the queue of pilots was long. Sanjay
       investigated and found a problem, so this should not happen
       again. Isn't a bug a bug? Couldn't a normal job also have a bug
       that held a CPU for the whole time allowed?

       In this case there was a gLite upgrade which caused a loss of
       tracking of the number of jobs running. In general a job should
       kill itself in this case. No daemon code was written after
       this. Previously it kept trying to contact the server for a
       job.
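
       As an illustration of the self-terminating behaviour described
       above, here is a minimal sketch of a pilot that asks for work a
       bounded number of times and then exits instead of holding the
       slot. It is not the actual CRONUS or PANDA pilot code; the names
       run_pilot, fetch_job, max_attempts and retry_delay_s are
       hypothetical.

```python
import time
from typing import Callable, Optional

# A "job" here is just a callable; fetch_job stands in for the call to the
# job server and returns None when no work is available (assumption).
Job = Callable[[], None]


def run_pilot(fetch_job: Callable[[], Optional[Job]],
              max_attempts: int = 3,
              retry_delay_s: float = 0.0) -> str:
    """Ask for a job at most max_attempts times, then give up the slot."""
    for attempt in range(max_attempts):
        job = fetch_job()
        if job is not None:
            job()  # run the payload, then release the slot
            return "ran job"
        if attempt < max_attempts - 1:
            time.sleep(retry_delay_s)  # wait before asking again
    # No work found: exit rather than polling the server indefinitely.
    return "exited idle"
```

       The point of the bound is that an idle pilot gives the slot back
       to the batch system rather than occupying it for the full queue
       length.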

       Having shorter queues might be a good idea to avoid accumulating
       pilots. If a job only takes 2 hours the next job would go to the
       same pilot and another one would be idle. Both CERN and NIKHEF
       now have queues of 1 day CPU and 1.5 days wall clock. For
       production at CERN all jobs are mapped to the atlasprd
       account. Might be better to have multiple pool accounts,
       hopefully soon. It will either be today or tomorrow or a couple
       of weeks (person going on holiday tomorrow). Should ask other
       sites to also have pool accounts for production as they do for
       user jobs. One issue we need to worry about is the batch system
       coping with a large number of shorter jobs.

       One difference with PANDA is that it only runs one job per
       pilot. CRONUS runs as many as possible. If the queue is about
       the same length as a job then CRONUS would only run one
       too. PANDA chose this way as site admins preferred it;
       otherwise it could mess up the local priority system. It does
       put a heavy load on PANDA as some jobs can be as short as thirty
       minutes.
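
       The difference between the two pilot policies can be made
       concrete with a small calculation. This is only a sketch under
       the assumption that all jobs have the same length and that a
       pilot stops when the next job would exceed the queue limit; the
       function name and parameters are hypothetical.

```python
def jobs_per_pilot(queue_limit_h: float, job_length_h: float,
                   multi_job: bool) -> int:
    """Number of equal-length jobs one pilot can run within the queue limit."""
    if job_length_h > queue_limit_h:
        return 0  # the job does not fit in the queue at all
    if not multi_job:
        return 1  # one job per pilot, regardless of remaining time
    # Multi-job pilot: keep pulling jobs while they still fit.
    return int(queue_limit_h // job_length_h)
```

       With a 1 day (24 hour) queue and 2 hour jobs, a multi-job pilot
       would fit 12 jobs while a one-job pilot runs 1. When the queue
       is about the same length as a job, both policies run one job.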

       Switching identity might be a bigger issue. GLEXEC should allow
       a switch once only. But if you run analysis for many users then
       (if desired by the site) you might want to change the user more
       than once.

       PANDA will use GLEXEC but doesn't yet. No sites have requested
       it yet. As they only run one job per pilot they don't need to
       switch more than once. Also the pilot kills itself if it doesn't
       find a job.

       Should we perhaps recommend that ATLAS pilot jobs only run one
       job? Sanjay would like to look at the statistics from CERN and
       NIKHEF to see what experience is gained.

       It would hurt the production system to have a draconian limit of
       one day. We should try to avoid this. However, some users also
       have jobs that occupy a CPU due to errors.

       Will talk about this at next week's meeting and try to reach a
       conclusion.

     * CHEP Reviewers (05')

       Laura has put forward Sanjay's name. Do we need another person?
       There are 50 for all of ATLAS. There are only around 6 papers in
       the Grid area. Stephen will also act as a reviewer.

       Perhaps need to have practice talks in the first part of the
       last week in August, as some folk will disappear for the GDB at
       the end of that week.

16:55  	ProdSys issues (10')				Luc GOOSSENS
17:05  	Distributed Analysis issues (10')		Dietrich Liko

       Current issue is the use of SLC4 32-bit machines. FZK seems to be
       where this is an issue just now. They are using the SLC3 kit on
       these machines. The compilation seems to succeed but it fails at
       run time. Hope to solve it for release 12 by shipping the
       compiler with the release. For release 13 will do a
       re-installation to use the SLC4 kits. Running binaries is fine
       (like production).

       Should we move away from compilation on the WN? Currently both
       GANGA and pathena support compilation this way. The reason for
       doing compilation on the WN is to allow adaptation to the
       environment there. Should perhaps worry about undefined
       environments and what they mean for reproducibility.

17:15  	DDM issues (10')
     * Deployment (05')					Massimo Lamanna
     * Development (05')					Miguel Branco
17:25  	Tier-0 (10')					Luc GOOSSENS
17:35  	Job Transformations (05')			Manuel  Gallas
17:40  	Software Integration issues (05')		Alexei Klimentov

       Planning a series of functional tests. One for DQ2 0.3, one for
       LFC and a third one for PANDA. This is on the border between
       operations, GT&S and SWING. For DQ2 want to check that all sites
       are ready for a predefined number of files shipped from Tier-1s
       and Tier-2s. Would measure a metric of time from subscription
       till first file delivered. The recent table of conditions
       doesn't look good. For LFC need the test instance from
       Jean-Phillipe, which tests the new bulk queries and deletes. For
       PANDA will run the server at CERN. Will start discussion on
       preliminary timescale tomorrow, looking like the first week of
       August just now.
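
       The DQ2 metric mentioned above (time from subscription until the
       first file is delivered) could be computed along these lines.
       This is a sketch only: it does not use any DQ2 interface, and
       the function name and the assumption that timestamps are
       available as datetime values are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional


def time_to_first_file(subscribed_at: datetime,
                       file_arrivals: Iterable[datetime]) -> Optional[timedelta]:
    """Delay between the subscription and the earliest file delivered after it."""
    # Ignore any arrival recorded before the subscription was made.
    arrivals = [t for t in file_arrivals if t >= subscribed_at]
    if not arrivals:
        return None  # nothing delivered yet for this subscription
    return min(arrivals) - subscribed_at
```

       Collecting this value per site would give one number per
       subscription that can be compared across sites.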

       M3 data is organised in datasets. The convention is being
       discussed for M4 data. It will be part of a document on the
       dataset naming convention that will be released this month.

17:45  	Grid middleware news (10') 	EGEE/OSG/NG
     * EGEE/LCG (05')					Laura Perini
     * NG (05')						Alexander Read
     * OSG (05')						Michael Ernst
17:55  	A.O.B. (05')

Action Items:

070725 Sanjay	Examine statistics from CERN & NIKHEF for pilot jobs

070620 Stephen	Put together LCG Metric note for further discussion.
        070725 Not done yet.

070606 Kaushik  Ask Ian & Pavel if we can switch the AOD merge to 20:1
 		and if we can do it for everything
        070725 Some discussion during software week. Now put to 10:1, due
 	      perhaps to space on the node, will check. Certainly done.

070523 Stephen	Email Kors about VOBOX Tier-1 Service Level Agreement.
        070606 Not done yet.
        070620 Not done yet.
        070725 Not done yet.

070523 Dietrich Summarise available DA documentation to decide
 		what is needed
        070606 Not done yet by Dietrich. Has been nicely summarised by
 	      Constantine in the analysis model meeting last
 	      week. This gives a good overview. Action is done. Some
 	      discussion about the support model beyond the
 	      documentation. HyperNews seems to be working well for
 	      GANGA, hope that more people can answer questions as the
 	      number of requests scales up.
        070725 Nothing more done yet.