On Aug 23, 2006, at 10:02 AM, Stephen J. Gowdy wrote:

> ATLAS SCCS Planning 23Aug2006
> -----------------------------
>
>  SCCS Conf Rm A, to call in +1 510 665 5437, press 1, 3935#
>
> Present: Booker, Stephen, Wei, Len, John, Chuck, Charlie
>
> Agenda:
>
> 1. DQ2 Status/Web Proxy
>
>    Nothing has happened since the last meeting. Looks like some jobs
>    are running now; this must be very recent, as there were none when
>    I looked.
>
>    Looks like there has been one successful DQ2 transfer.
>
>    The main hacker will send Stephen information when he gets back
>    from holiday.
>
>    Need help to figure out why things work sometimes and sometimes
>    they don't.
>
> 2. Trigger Farm Status
>
>    Nothing to discuss this week.
>
> 3. ATLAS Oracle Server
>
>    Waiting for a switch to be installed.
>
> 4. Slots for ATLAS Production jobs and other batch related stuff
>
>    Could open up production without removing the limit per
>    user. Perhaps not as that is defined by the queue. Could think
>    about using a different queue.
>
>    Might be an issue with the gatekeeper. In the past we have seen
>    issues: it spawns four processes per job submitted. It looks like
>    they have dealt with this issue in the new version of CONDOR-G.
>
>    We also need to look at what machines have been assigned to the
>    osgq. It is a good idea to raise the limit gradually. We will
>    probably run into problems before we run out of batch machines; we
>    want to push the limit and fix problems retroactively. There are
>    Tier-2s that are running several hundred workers, so is our setup
>    so different?
>
>    Looking at another solution for scaling; we want to be able to
>    react when there are a lot of jobs coming.
>
>    Should raise the limit to something like 20 or 30.
>

Wei just dropped by and asked about this change. I've made the change
to a 30 job limit per user in the LSF configuration file and it will
go into effect during this evening's scheduled LSF reconfiguration at
approx. 19:35 PDT.

--Neal
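
For a sense of scale, the gatekeeper concern from item 4 can be put in
numbers. This is only a rough sketch: the four-processes-per-job figure
is the one quoted in the minutes for the older gatekeeper, while the
function name and the assumption that every job up to the limit goes
through the gatekeeper at once are illustrative, not measured.

```python
# Figure quoted in the minutes for the older gatekeeper behaviour.
PROCESSES_PER_JOB = 4


def gatekeeper_processes(jobs_per_user, users=1, per_job=PROCESSES_PER_JOB):
    """Estimate gatekeeper processes if every user submits up to the limit.

    Hypothetical helper: assumes all jobs are in flight at the same time.
    """
    return jobs_per_user * users * per_job


if __name__ == "__main__":
    # Compare the two limits suggested in the meeting.
    for limit in (20, 30):
        print(f"limit {limit}: {gatekeeper_processes(limit)} processes per user")
```

With a single user at the new 30 job limit this gives 120 gatekeeper
processes, which is why raising the limit gradually seems sensible.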



>    To set up the fair-share we are waiting to get a number for what
>    the fair-share should be. We're not really stuck without it, but it
>    would be good to get the mechanism in place sooner than when we
>    absolutely need it.
>
> 5. Validation of ATLAS jobs on RHEL
>
>    We need to find out more information. Need to determine process for
>    new sites and for upgrades to existing systems.
>
> 6. AOB
>
>    - 10am PST Conference for Tier2s. Primarily people from centres,
>      not users. Talk about issues about what they want to use for
>      storage, why DQ2 runs into troubles. All the folk at SCCS would
>      like to attend, so should try to wrap by 10am.
>
>    - Would like to hear a synopsis of what happened in Boston next
>      week. Not much beyond site reports and discussions about
>      DQ2. They were talking about how data transfers work in the
>      production world. Many people at BNL had a strong interest in
>      using xrootd
>      instead of dCache for storage. The basic xrootd software is
>      already there, what is needed is the SRM interface. There is one
>      available with the Berkeley SRM interface but someone needs to
>      package it. There was also talk about 32bit vs 64bit, and SL3
>      vs SL4. Not interested in validating on 64bit. Should encourage
>      them to continue to build 32bit binaries but validate on both 32-
>      and 64bit platforms. Finally got across the point that their SQL
>      databases were wide open. They are working on a version that will
>      use the grid certificates instead of clear text passwords.
>
> Action Items:
> -------------
>
> 060823 Stephen	Find out what current validation processes exist
>
> 060823 Wei	Talk to Neal about raising the osgq limit
>
> 060816 Wei	Setup ATLAS/SLAC Web page
>        060823 Wei circulated a note; try to bring back comments for next
> 	      week.
>
> 060816 Charlie	Talk to SLUO about adding institutions.
>
> 060816 Neal	Setup atlas priority group for LSF
>        060823 Not done yet.
>
> 060816 Chuck	Check with Bob about web server approval need
>        060823 To be done.
>
> 060809 Stephen Ask what dq2user needs to do in MySQL
>        060816 No good answer. Limited so that dq2user from offsite can only
> 	      SELECT from localreplicas. From onsite can do
> 	      SELECT,UPDATE,DELETE and INSERT to either localreplicas
> 	      or queued_transfers_SLAC. We'll see if that works or
> 	      not. Without onsite privileges production stopped.
>        060823 Sounds like things are working again, but no concrete info.
>
> 060412 Systems  Provide Oracle service for ATLAS Trigger testing (RT 46089)
>        060419 No ticket yet, so nothing done.
>        060426 Now have ticket 46089.
>        060503 No news.
>        060524 Steffen has provided configuration information. Now in
>               Chuck's hands.
>        060628 Randy will ask Chuck about status.
>        060726 First on list for V240 but not sure when it will
>               happen. Will put a T3a on it.
>        060802 John checking for rack space.
>        060809 Still needs allocated rack space.
>        060816 Has rack and power, waiting for network.
>        060823 Waiting for switch.
>
> 060224 Richard	Discuss ATLAS trigger machines with others in SCCS
>        060301 Only limited response from John W was resigned
> 	      acceptance... need to work on an actual deployment plan as
> 	      there are real issues to be solved.
>        060308 John aware and in plans as much as anything is. New
> 	      engineer will take over.
>        060315 No update.
>        060405 No update.
>        060412 No update.
>        060419 No update.
>        060426 No update.
>        060503 No update.
>        060524 RT 45823. Engineer looking at power availability. On track
>               for August.
>        060628 Understand schedule, Randy will make sure John is aware.
>        060726 Need to nail down when power will be available. Steffen
>               thinks he can make it happen with existing equipment.
>        060802 Looks like this will fit in SCCS. Can reuse rack,
>               switches and fibres.
>        060809 Everything looking good for this now.
>        060823 This is Done, will drop it from the agenda for now.
>
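
The dq2user MySQL privileges described in the 060816 note above can be
written down as a small lookup table, which may help when checking
whether a failing DQ2 operation is a permissions problem. This is a
sketch only: the table names and operations are taken from the minutes,
but the "onsite"/"offsite" labels, the function name, and the reading
that onsite access covers both tables are assumptions, and nothing here
talks to a real database.

```python
# Privilege matrix as recorded in the minutes (060816 entry).
# Assumption: "either localreplicas or queued_transfers_SLAC" means the
# onsite privileges apply to both tables.
FULL_ACCESS = {"SELECT", "UPDATE", "DELETE", "INSERT"}

PRIVILEGES = {
    "offsite": {
        "localreplicas": {"SELECT"},
    },
    "onsite": {
        "localreplicas": set(FULL_ACCESS),
        "queued_transfers_SLAC": set(FULL_ACCESS),
    },
}


def is_allowed(origin, table, operation):
    """Return True if dq2user connecting from `origin` may run `operation`.

    Hypothetical helper: unknown origins or tables are treated as denied.
    """
    allowed = PRIVILEGES.get(origin, {}).get(table, set())
    return operation.upper() in allowed


if __name__ == "__main__":
    print(is_allowed("offsite", "localreplicas", "SELECT"))
    print(is_allowed("offsite", "queued_transfers_SLAC", "INSERT"))
    print(is_allowed("onsite", "queued_transfers_SLAC", "INSERT"))
```

An offsite INSERT into queued_transfers_SLAC is denied under this
reading, which would be consistent with the note that production
stopped without onsite privileges.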