On Aug 23, 2006, at 10:02 AM, Stephen J. Gowdy wrote:

> ATLAS SCCS Planning 23Aug2006
> -----------------------------
>
> SCCS Conf Rm A, to call in +1 510 665 5437, press 1, 3935#
>
> Present: Booker, Stephen, Wei, Len, John, Chuck, Charlie
>
> Agenda:
>
> 1. DQ2 Status/Web Proxy
>
>    Nothing happened since the last meeting. Looks like some jobs are
>    running now, must be very recent as when I looked there were none.
>
>    Looks like there has been one successful DQ2 transfer.
>
>    The main hacker will send Stephen information when he gets back
>    from holiday.
>
>    Need help to figure out why things work sometimes and sometimes
>    they don't.
>
> 2. Trigger Farm Status
>
>    Nothing to discuss this week.
>
> 3. ATLAS Oracle Server
>
>    Waiting for a switch to be installed.
>
> 4. Slots for ATLAS Production jobs and other batch related stuff
>
>    Could open up production without removing the limit per
>    user. Perhaps not, as that is defined by the queue. Could think
>    about using a different queue.
>
>    Might be an issue with the gatekeeper. In the past have seen issues:
>    it spawns four processes per job submitted. It looks like in the new
>    version of CONDOR-G they've dealt with this issue.
>
>    We also need to look at what machines have been assigned to the
>    osgq. It is a good idea to raise the limit gradually. We will
>    probably run into problems before we run out of batch machines.
>    We want to push the limit and fix problems retroactively. There are
>    Tier-2s who are running several hundred workers, so is our setup
>    so different?
>
>    Looking at another solution for scaling, want to be able to react
>    when there are a lot of jobs coming.
>
>    Should raise the limit to something like 20 or 30.
>

Wei just dropped by and asked about this change. I've made the change
to a 30-job limit per user in the LSF configuration file and it will go
into effect during this evening's scheduled LSF reconfiguration at
approx. 19:35 PDT.
--Neal

>    To set up the fair-share, we are waiting to get a number for what
>    the fair-share should be. We're not really stuck without it but it
>    would be good to get the mechanism in place sooner than when we
>    absolutely need it.
>
> 5. Validation of ATLAS jobs on RHEL
>
>    We need to find out more information. Need to determine process for
>    new sites and for upgrades to existing systems.
>
> 6. AOB
>
>    - 10am PST Conference for Tier2s. Primarily people from centres,
>      not users. Talk about issues about what they want to use for
>      storage, why DQ2 runs into troubles. All the folk at SCCS would
>      like to attend, so should try to wrap by 10am.
>
>    - Would like to hear a synopsis of what happened in Boston next
>      week. Not much beyond site reports and discussions about
>      DQ2. They were talking about how the data transfers in the
>      production world. Many people at BNL had a strong interest in
>      using xrootd instead of dCache for storage. The basic xrootd
>      software is already there, what is needed is the SRM interface.
>      There is one available with the Berkeley SRM interface but
>      someone needs to package it. There was also talk about 32bit vs
>      64bit, and SL3 vs SL4. Not interested in validating on 64bit.
>      Should encourage them to continue to build 32bit binaries but
>      validate on both 32- and 64bit platforms. Finally got across the
>      point that their SQL databases were wide open. They are working
>      on a version that will use the grid certificates instead of
>      clear text passwords.
>
> Action Items:
> -------------
>
> 060823 Stephen Find out what current validation processes exist
>
> 060823 Wei     Talk to Neal about raising the osgq limit
>
> 060816 Wei     Set up ATLAS/SLAC Web page
>        060823 Wei circulated a note; try to bring back comments for
>               next week.
>
> 060816 Charlie Talk to SLUO about adding institutions.
>
> 060816 Neal    Set up atlas priority group for LSF
>        060823 Not done yet.
>
> 060816 Chuck   Check with Bob about web server approval need
>        060823 To be done.
>
> 060809 Stephen Ask what dq2user needs to do in MySQL
>        060816 No good answer. Limited so that dq2user from offsite can
>               only SELECT from localreplicas. From onsite can do
>               SELECT, UPDATE, DELETE and INSERT to either localreplicas
>               or queued_transfers_SLAC. We'll see if that works or
>               not. Without onsite privileges production stopped.
>        060823 Sounds like things are working again, but no concrete
>               info.
>
> 060412 Systems Provide Oracle service for ATLAS Trigger testing
>                (RT 46089)
>        060419 No ticket yet, so nothing done.
>        060426 Now have ticket 46089.
>        060503 No news.
>        060524 Steffen has provided configuration information. Now in
>               Chuck's hands.
>        060628 Randy will ask Chuck about status.
>        060726 First on list for V240 but not sure when it will
>               happen. Will put a T3a on it.
>        060802 John checking for rack space.
>        060809 Still needs allocated rack space.
>        060816 Has rack and power, waiting for network.
>        060823 Waiting for switch.
>
> 060224 Richard Discuss ATLAS trigger machines with others in SCCS
>        060301 Only limited response, from John W, was resigned
>               acceptance... need to work on an actual deployment plan
>               as there are real issues to be solved.
>        060308 John aware and in plans as much as anything is. New
>               engineer will take over.
>        060315 No update.
>        060405 No update.
>        060412 No update.
>        060419 No update.
>        060426 No update.
>        060503 No update.
>        060524 RT 45823. Engineer looking at power availability. On
>               track for August.
>        060628 Understand schedule, Randy will make sure John is aware.
>        060726 Need to nail down when power will be available. Steffen
>               thinks he can make it happen with existing equipment.
>        060802 Looks like this will fit in SCCS. Can reuse rack,
>               switches and fibres.
>        060809 Everything looking good for this now.
>        060823 This is Done, will drop it from the agenda for now.
>