On Aug 23, 2006, at 10:02 AM, Stephen J. Gowdy wrote:
> ATLAS SCCS Planning 23Aug2006
> -----------------------------
>
> SCCS Conf Rm A, to call in +1 510 665 5437, press 1, 3935#
>
> Present: Booker, Stephen, Wei, Len, John, Chuck, Charlie
>
> Agenda:
>
> 1. DQ2 Status/Web Proxy
>
> Nothing has happened since last week. Looks like some jobs are running
> now; must be very recent, as there were none when I looked.
>
> Looks like there has been one successful DQ2 transfer.
>
> The main hacker will send Stephen information when he gets back
> from holiday.
>
> Need help to figure out why things work sometimes and sometimes
> they don't.
>
> 2. Trigger Farm Status
>
> Nothing to discuss this week.
>
> 3. ATLAS Oracle Server
>
> Waiting for a switch to be installed.
>
> 4. Slots for ATLAS Production jobs and other batch related stuff
>
> Could open up production without removing the limit per
> user. Perhaps not as that is defined by the queue. Could think
> about using a different queue.
>
> Might be an issue with the gatekeeper. In the past we have seen issues:
> it spawns four processes per job submitted. It looks like they have
> dealt with this issue in the new version of CONDOR-G.
>
> We also need to look at what machines have been assigned to the
> osgq. It is a good idea to raise the limit gradually. We will
> probably run into problems before we run out of batch machines; we
> want to push the limit and fix problems retroactively. There are
> Tier-2s who are running several hundred workers, so is our setup
> so different?
>
> Looking at another solution for scaling; want to be able to react
> when there are a lot of jobs coming.
>
> Should raise the limit to something like 20 or 30.
>
Wei just dropped by and asked about this change. I've made the change
to a 30 job limit per user in the LSF configuration file and it will
go into effect during this evening's scheduled LSF reconfiguration at
approx. 19:35 PDT.
--Neal
> To set up the fair-share, we are waiting to get a number for what the
> fair-share should be. We're not really stuck without it, but it would
> be good to get the mechanism in place sooner than when we absolutely
> need it.
>
> 5. Validation of ATLAS jobs on RHEL
>
> We need to find out more information. Need to determine process for
> new sites and for upgrades to existing systems.
>
> 6. AOB
>
> - 10am PST Conference for Tier2s. Primarily people from centres,
> not users. Will talk about issues such as what they want to use for
> storage and why DQ2 runs into trouble. All the folk at SCCS would
> like to attend, so should try to wrap by 10am.
>
> - Would like to hear a synopsis of what happened in Boston next
> week. Not much beyond site reports and discussions about
> DQ2. They were talking about how the data transfers in the production
> world. Many people at BNL had a strong interest in using xrootd
> instead of dCache for storage. The basic xrootd software is
> already there, what is needed is the SRM interface. There is one
> available with the Berkeley SRM interface but someone needs to
> package it. There was also talk about 32bit vs. 64bit, and SL3
> vs. SL4. Not interested in validating on 64bit. Should encourage
> them to continue to build 32bit binaries but validate on both 32-
> and 64bit platforms. Finally got across the point that their SQL
> databases were wide open. They are working on a version that will
> use the grid certificates instead of clear text passwords.
>
> Action Items:
> -------------
>
> 060823 Stephen Find out what current validation processes exist
>
> 060823 Wei Talk to Neal about raising the osgq limit
>
> 060816 Wei Setup ATLAS/SLAC Web page
> 060823 Wei circulated a note; try to bring back comments for next
> week.
>
> 060816 Charlie Talk to SLUO about adding institutions.
>
> 060816 Neal Setup atlas priority group for LSF
> 060823 Not done yet.
>
> 060816 Chuck Check with Bob about web server approval needs
> 060823 To be done.
>
> 060809 Stephen Ask what dq2user needs to do in MySQL
> 060816 No good answer. Limited so that dq2user from offsite can only
> SELECT from localreplicas. From onsite can do
> SELECT,UPDATE,DELETE and INSERT to either localreplicas
> or queued_transfers_SLAC. We'll see if that works or
> not. Without onsite privileges production stopped.
> 060823 Sounds like things are working again, but no concrete
> info.
>
> 060412 Systems Provide Oracle service for ATLAS Trigger testing
> (RT 46089)
> 060419 No ticket yet, so nothing done.
> 060426 Now have ticket 46089.
> 060503 No news.
> 060524 Steffen has provided configuration information. Now in
> Chuck's hands.
> 060628 Randy will ask Chuck about status.
> 060726 First on list for V240 but not sure when it will
> happen. Will put a T3a on it.
> 060802 John checking for rack space.
> 060809 Still needs allocated rack space.
> 060816 Has rack and power, waiting for network.
> 060823 Waiting for switch.
>
> 060224 Richard Discuss ATLAS trigger machines with others in SCCS
> 060301 Only limited response from John W was resigned
> acceptance... need to work on an actual deployment plan as
> there are real issues to be solved.
> 060308 John aware and in plans as much as anything is. New
> engineer will take over.
> 060315 No update.
> 060405 No update.
> 060412 No update.
> 060419 No update.
> 060426 No update.
> 060503 No update.
> 060524 RT 45823. Engineer looking at power availability. On track
> for August.
> 060628 Understand schedule, Randy will make sure John is aware.
> 060726 Need to nail down when power will be available. Steffen
> thinks he can make it happen with existing equipment.
> 060802 Looks like this will fit in SCCS. Can reuse rack,
> switches and fibres.
> 060809 Everything looking good for this now.
> 060823 This is Done, will drop it from the agenda for now.
>