GT&S Coordination Wednesday 25 July 2007
from 16:30 to 18:00
at CERN ( 32-S-C22 )
chaired by: Stephen Gowdy
Description:
Phone CERN 76000 code 0124636# (leader is 0114663#)
or https://audioconf.cern.ch/call/0124636 .
Present: John, Alexei, Stephen, Gilbert, Dietrich, Sanjay, Kaushik,
Simone
Wednesday 25 July 2007
16:30 Minutes of last meeting and action list (05')
No corrections.
16:35 Hot topics (20')
* Pilot jobs on EGEE (15') Simone Campana
Initiated about a month ago. Every slot at CERN for production
jobs was taken by Sanjay's pilots, which were using no CPU time. The
problem was worse because the length of the queue was three
weeks for wall clock and one week for CPU. The problem of long
queues is independent of the pilot jobs. No other jobs could get
through because the queue of pilots was long. Sanjay
investigated and found a problem so this should not happen
again. Isn't a bug a bug? Couldn't a normal job also have a bug
that held a CPU for the whole time allowed?
In this case there was a gLite upgrade which caused a loss of
tracking of the number of jobs running. In general a job should
kill itself in this case. No daemon code was written after
this. Previously it kept trying to contact the server for a
job.
Having shorter queues might be a good idea to avoid accumulating
pilots. If a job only takes 2 hours the next job would go to the
same pilot and another one would be idle. Both CERN and NIKHEF
now have queues of 1 day CPU and 1.5 days wall clock. For
production at CERN all jobs are mapped to the atlasprd
account. Might be better to have multiple pool accounts,
hopefully soon. It will either be today or tomorrow or in a couple
of weeks (person going on holiday tomorrow). Should ask other
sites to also have pool accounts for production as they do for
user jobs. One issue we need to worry about is the batch system
coping with a large number of shorter jobs.
One difference with PANDA is that it only runs one job per
pilot. CRONUS runs as many as possible. If the queue is about
the same length as a job then CRONUS would only run one
too. PANDA chose this way as site admins preferred it; otherwise
it could mess up the local priority system. It does put a heavy
load on PANDA as some jobs can be as short as thirty minutes.
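A minimal sketch of the two pilot policies discussed above, assuming a simplified model in which the pilot knows each job's length in advance and stops before the queue's wall-clock limit. The function name, its parameters and the job lengths are hypothetical; this is not PANDA or CRONUS code.

```python
from typing import List, Optional


def run_pilot(job_hours: List[float], wall_limit_hours: float,
              max_jobs: Optional[int] = None) -> List[float]:
    """Return the lengths of the jobs a single pilot runs before it exits.

    job_hours: lengths of the jobs waiting on the server, in order (made-up input).
    max_jobs:  1 models one job per pilot; None models "as many as fit".
    """
    done: List[float] = []
    used = 0.0
    for length in job_hours:
        if max_jobs is not None and len(done) >= max_jobs:
            break  # per-pilot job limit reached
        if used + length > wall_limit_hours:
            break  # next job would not finish inside the queue limit
        done.append(length)
        used += length
    return done  # an empty list means no job was run and the pilot exits


waiting = [2.0] * 30                      # thirty two-hour jobs
one_per_pilot = run_pilot(waiting, 36.0, max_jobs=1)
many_per_pilot = run_pilot(waiting, 36.0)
long_jobs = run_pilot([30.0, 30.0], 36.0)
```

With a 1.5 day (36 hour) wall-clock limit the multi-job pilot keeps its slot for many short jobs, while for jobs about as long as the queue both policies run a single job.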
Switching identity might be a bigger issue. GLEXEC should allow
a switch once only. But if you run analysis for many users
(if desired by the site) you might want to change the user more
than once.
PANDA will use GLEXEC but doesn't yet. No sites have requested
it yet. As they only run one job per pilot they don't need to switch
more than once. Also the pilot kills itself if it doesn't find a
job.
Should we perhaps recommend that ATLAS pilot jobs only run one
job? Sanjay would like to look at the statistics from CERN and
NIKHEF to see what experience is gained.
It would hurt the production system to have a draconian limit of
one day. We should try to avoid this. However, some users also
have jobs that occupy a CPU due to errors.
Will talk about this at next week's meeting and try to reach a
conclusion.
* CHEP Reviewers (05')
Laura has put forward Sanjay's name. Do we need another person?
There are 50 for all of ATLAS. There are only around 6 papers in
the Grid area. Stephen will also act as a reviewer.
Perhaps need to have practice talks in the first part of the
last week in August, as some folk will disappear for the GDB at the
end of that week.
16:55 ProdSys issues (10') Luc GOOSSENS
17:05 Distributed Analysis issues (10') Dietrich Liko
Current issue is the use of SLC4 32bit machines. FZK seems to be
where this is an issue just now. They are using the SLC3 kit on
these machines. The compilation seems to succeed but it fails at
run time. Hope to solve it for release 12 by shipping the
compiler with the release. For release 13 will do a
re-installation to use the SLC4 kits. Running binaries is fine
(like production).
Should we move away from compilation on the WN? Currently both
GANGA and pathena support compilation this way. The reason for
doing compilation on the WN is to allow adaptation to the environment
there. Should perhaps worry about undefined environments and
what they mean for reproducibility.
17:15 DDM issues (10')
* Deployment (05') Massimo Lamanna
* Development (05') Miguel Branco
17:25 Tier-0 (10') Luc GOOSSENS
17:35 Job Transformations (05') Manuel Gallas
17:40 Software Integration issues (05') Alexei Klimentov
Planning a series of functional tests. One for DQ2 0.3, one for LFC
and a third one for PANDA. This is on the border between
operations, GT&S and SWING. For DQ2 want to check that all sites
are ready for a predefined number of files shipped from Tier-1s
and Tier-2s. Would measure a metric of time from subscription
till first file delivered. The recent table of conditions
doesn't look good. For LFC need the test instance from
Jean-Philippe, which tests the new bulk queries and deletes. For
PANDA will run the server at CERN. Will start discussion on
preliminary timescale tomorrow, looking like the first week of
August just now.
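The metric mentioned for the DQ2 test (time from subscription till first file delivered) could be computed along these lines. This is only a sketch: the function name and the timestamps are hypothetical, and it does not use any DQ2 interface.

```python
from datetime import datetime
from typing import Iterable, Optional


def time_to_first_file(subscribed_at: datetime,
                       arrivals: Iterable[datetime]) -> Optional[float]:
    """Seconds from subscription to the first delivered file.

    Returns None if no file has arrived since the subscription was made.
    """
    after = [t for t in arrivals if t >= subscribed_at]
    if not after:
        return None
    return (min(after) - subscribed_at).total_seconds()


# Made-up timestamps for one site.
subscribed = datetime(2007, 8, 1, 12, 0, 0)
delivered = [datetime(2007, 8, 1, 12, 10, 0), datetime(2007, 8, 1, 12, 5, 0)]
delay_seconds = time_to_first_file(subscribed, delivered)
```

Here the first file arrives five minutes after the subscription, so the metric is 300 seconds.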
M3 data is organised in datasets. The convention is being
discussed for M4 data. Will be part of a document that will be
released this month for the dataset naming convention.
17:45 Grid middleware news (10') EGEE/OSG/NG
* EGEE/LCG (05') Laura Perini
* NG (05') Alexander Read
* OSG (05') Michael Ernst
17:55 A.O.B. (05')
Action Items:
070725 Sanjay Examine statistics from CERN & NIKHEF for pilot jobs
070620 Stephen Put together LCG Metric note for further discussion.
070725 Not done yet.
070606 Kaushik Ask Ian & Pavel if we can switch the AOD merge to 20:1
and if we can do it for everything
070725 Some discussion during software week. Now put to 10:1, perhaps
due to space on the node, will check. Certainly done.
070523 Stephen Email Kors about VOBOX Tier-1 Service Level Agreement.
070606 Not done yet.
070620 Not done yet.
070725 Not done yet.
070523 Dietrich Summarise available DA documentation to decide
what is needed
070606 Not done yet by Dietrich. Has been nicely summarised by
Constantine in the analysis model meeting last
week. This gives a good overview. Action is done. Some
discussion about the support model beyond the
documentation. HyperNews seems to be working well for
GANGA, hope that more people can answer questions as the
number of requests scales up.
070725 Nothing more done yet.