GT&S Coordination
Wednesday 25 July 2007 from 16:30 to 18:00 at CERN (32-S-C22)
chaired by: Stephen Gowdy
Description: Phone CERN 76000 code 0124636# (leader is 0114663#) or https://audioconf.cern.ch/call/0124636 .
Present: John, Alexei, Stephen, Gilbert, Dietrich, Sanjay, Kaushik, Simone

16:30 Minutes of last meeting and action list (05')
No corrections.

16:35 Hot topics (20')
* Pilot jobs on EGEE (15') Simone Campana
Initiated about a month ago. Every slot at CERN for production jobs was taken by Sanjay's pilots, which were using no CPU time. The problem was worse because the queue length was three weeks of wall clock and one week of CPU. The problem of long queues is independent of the pilot jobs. No other jobs could get through because the queue of pilots was long. Sanjay investigated and found a problem, so this should not happen again.
Isn't a bug a bug? Couldn't a normal job also have a bug that held a CPU for the whole time allowed? In this case there was a gLite upgrade which caused a loss of tracking of the number of jobs running. In general the job should kill itself in this case. No daemon code was written after this. Previously it kept trying to contact the server for a job.
Having shorter queues might be a good idea to avoid accumulating pilots. If a job only takes 2 hours, the next job would go to the same pilot and another pilot would be idle. Both CERN and NIKHEF now have queues of 1 day CPU and 1.5 days wall clock.
For production at CERN all jobs are mapped to the atlasprd account. It might be better to have multiple pool accounts, hopefully soon. It will happen either today or tomorrow, or in a couple of weeks (the person responsible goes on holiday tomorrow). We should ask other sites to also have pool accounts for production, as they do for user jobs.
One issue we need to worry about is the batch system coping with a large number of shorter jobs. One difference with PANDA is that it only runs one job per pilot. CRONUS runs as many as possible.
If the queue is about the same length as a job then CRONUS would only run one too. PANDA chose this way as site admins preferred this. It does put a heavy load on PANDA as some jobs can be as short as thirty minutes. Otherwise it could mess up the local priority system.
Switching identity might be a bigger issue. GLEXEC should allow a switch once only. But if you run analysis for many users (if desired by the site) you might want to change the user more than once. PANDA will use GLEXEC but doesn't yet. No sites have requested it yet. As they only run one job per pilot they don't need to switch more than once. Also the pilot kills itself if it doesn't find a job.
Should we perhaps recommend that ATLAS pilot jobs only run one job? Sanjay would like to look at the statistics from CERN and NIKHEF to see what experience is gained. It would hurt the production system to have a draconian limit of one day. We should try to avoid this. However, some users also have jobs that occupy a CPU due to errors. Will talk about this at next week's meeting and try to reach a conclusion.
* CHEP Reviewers (05')
Laura has put forward Sanjay's name. Do we need another person? There are 50 for all of ATLAS. There are only around 6 papers in the Grid area. Stephen will also act as a reviewer. Perhaps need to have practice talks in the first part of the last week in August, as some folk will disappear for the GDB at the end of that week.

16:55 ProdSys issues (10') Luc GOOSSENS

17:05 Distributed Analysis issues (10') Dietrich Liko
Current issue is the use of SLC4 32-bit machines. FZK seems to be where this is an issue just now. They are using the SLC3 kit on these machines. The compilation seems to succeed but it fails at run time. Hope to solve it for release 12 by shipping the compiler with the release. For release 13 will do a re-installation to use the SLC4 kits. Running binaries is fine (like production).
Should we move away from compilation on the WN?
Currently both GANGA and pathena support compilation this way. The reason for doing compilation on the WN is to allow adaptation to the environment there. Should perhaps worry about undefined environments and what they mean for reproducibility.

17:15 DDM issues (10')
* Deployment (05') Massimo Lamanna
* Development (05') Miguel Branco

17:25 Tier-0 (10') Luc GOOSSENS

17:35 Job Transformations (05') Manuel Gallas

17:40 Software Integration issues (05') Alexei Klimentov
Planning a series of functional tests: one for DQ2 0.3, one for LFC and a third one for PANDA. This is on the border between operations, GT&S and SWING.
For DQ2 want to check that all sites are ready for a predefined number of files shipped from Tier-1s and Tier-2s. Would measure a metric of time from subscription till first file delivered. The recent table of conditions doesn't look good.
For LFC need the test instance from Jean-Philippe, to test the new bulk queries and deletes.
For PANDA will run the server at CERN.
Will start discussion on a preliminary timescale tomorrow, looking like the first week of August just now.
M3 data is organised in datasets. The convention is being discussed for M4 data. It will be part of a document on the dataset naming convention that will be released this month.

17:45 Grid middleware news (10') EGEE/OSG/NG
* EGEE/LCG (05') Laura Perini
* NG (05') Alexander Read
* OSG (05') Michael Ernst

17:55 A.O.B. (05')

Action Items:
070725 Sanjay Examine statistics from CERN & NIKHEF for pilot jobs
070620 Stephen Put together LCG Metric note for further discussion.
    070725 Not done yet.
070606 Kaushik Ask Ian & Pavel if we can switch the AOD merge to 20:1 and if we can do it for everything
    070725 Some discussion during software week. Now put to 10:1, perhaps due to space on the node, will check. Certainly done.
070523 Stephen Email Kors about VOBOX Tier-1 Service Level Agreement.
    070606 Not done yet.
    070620 Not done yet.
    070725 Not done yet.
070523 Dietrich Summarise available DA documentation to decide what is needed
    070606 Not done yet by Dietrich. Has been nicely summarised by Constantine in the analysis model meeting last week. This gives a good overview. Action is done. Some discussion about the support model beyond the documentation. HyperNews seems to be working well for GANGA, hope that more people can answer questions as the number of requests scales up.
    070725 Nothing more done yet.