ATLAS SCCS Planning 23Aug2006
-----------------------------

  SCCS Conf Rm A, to call in +1 510 665 5437, press 1, 3935#

Present: Booker, Stephen, Wei, Len, John, Chuck, Charlie

Agenda:

1. DQ2 Status/Web Proxy

    Nothing has happened since the last meeting. It looks like some
    jobs are running now; this must be very recent, as there were none
    when I last looked.

    Looks like there has been one successful DQ2 transfer.

    The main hacker will send Stephen information when he gets back
    from holiday.

    Need help to figure out why things work sometimes and sometimes
    they don't.

2. Trigger Farm Status

    Nothing to discuss this week.

3. ATLAS Oracle Server

    Waiting for a switch to be installed.

4. Slots for ATLAS Production jobs and other batch related stuff

    Could open up production without removing the per-user
    limit. Perhaps not, as that limit is defined by the queue. Could
    think about using a different queue.

    There might be an issue with the gatekeeper. In the past we have
    seen issues: it spawns four processes per job submitted. It looks
    like the new version of CONDOR-G has dealt with this.

    We also need to look at what machines have been assigned to the
    osgq. It is a good idea to raise the limit gradually. We will
    probably run into problems before we run out of batch machines, so
    we want to push the limit and fix problems as they appear. There
    are Tier-2s running several hundred workers, so is our setup so
    different?

    Looking at another solution for scaling; want to be able to react
    when there are a lot of jobs coming in.

    Should raise the limit to something like 20 or 30.

    To set up the fair-share, we are waiting to get a number for what
    the fair-share should be. We're not really stuck without it, but it
    would be good to get the mechanism in place before we absolutely
    need it.

5. Validation of ATLAS jobs on RHEL

    We need to find out more information. Need to determine process for
    new sites and for upgrades to existing systems.

6. AOB

    - 10am PST Conference for Tier2s. Primarily people from centres,
      not users. Will talk about what they want to use for storage and
      why DQ2 runs into trouble. All the folk at SCCS would like to
      attend, so should try to wrap up by 10am.

    - Would like to hear a synopsis of what happened in Boston next
      week. Not much beyond site reports and discussions about
      DQ2. They were talking about how the data transfers work in the
      production world. Many people at BNL had a strong interest in
      using xrootd instead of dCache for storage. The basic xrootd
      software is already there; what is needed is the SRM
      interface. There is one available with the Berkeley SRM
      interface but someone needs to package it. There was also talk
      about 32bit vs 64bit, and SL3 vs SL4. Not interested in
      validating on 64bit. Should encourage them to continue to build
      32bit binaries but validate on both 32- and 64bit
      platforms. Finally got across the point that their SQL
      databases were wide open. They are working on a version that will
      use the grid certificates instead of clear text passwords.

Action Items:
-------------

060823 Stephen	Find out what current validation processes exist

060823 Wei	Talk to Neal about raising the osgq limit

060816 Wei	Setup ATLAS/SLAC Web page
        060823 Wei circulated a note; try to bring back comments for
               next week.

060816 Charlie	Talk to SLUO about adding institutions.

060816 Neal	Setup atlas priority group for LSF
        060823 Not done yet.

060816 Chuck	Check with Bob about web server approval need
        060823 To be done.

060809 Stephen	Ask what dq2user needs to do in MySQL
        060816 No good answer. From offsite, dq2user is limited to
               SELECT on localreplicas. From onsite it can do SELECT,
               UPDATE, DELETE and INSERT on either localreplicas or
               queued_transfers_SLAC. We'll see if that works or
               not. Without onsite privileges production stopped.
        060823 Sounds like things are working again, but no concrete info.

060412 Systems  Provide Oracle service for ATLAS Trigger testing (RT 46089)
        060419 No ticket yet, so nothing done.
        060426 Now have ticket 46089.
        060503 No news.
        060524 Steffen has provided configuration information. Now in
               Chuck's hands.
        060628 Randy will ask Chuck about status.
        060726 First on list for V240 but not sure when it will
               happen. Will put a T3a on it.
        060802 John checking for rack space.
        060809 Still needs allocated rack space.
        060816 Has rack and power, waiting for network.
        060823 Waiting for switch.

060224 Richard	Discuss ATLAS trigger machines with others in SCCS
        060301 Only limited response: from John W it was resigned
               acceptance... Need to work on an actual deployment plan
               as there are real issues to be solved.
        060308 John aware and in plans as much as anything is. New
 	      engineer will take over.
        060315 No update.
        060405 No update.
        060412 No update.
        060419 No update.
        060426 No update.
        060503 No update.
        060524 RT 45823. Engineer looking at power availability. On
               track for August.
        060628 Understand schedule, Randy will make sure John is aware.
        060726 Need to nail down when power will be available. Steffen
               thinks he can make it happen with existing equipment.
        060802 Looks like this will fit in SCCS. Can reuse rack,
               switches and fibres.
        060809 Everything looking good for this now.
        060823 This is done; will drop it from the agenda for now.