
ATLAS SCCS Planning 15Aug2007
-----------------------------

  9am, SCCS Conf Rm A, to call in +1 510 665 5437, press 1, 3935#

Present: Wei, Stephen, Len, Richard, Chuck, Booker

Agenda:

1. DQ2 Status/Web Proxy

    BNL is reallocating machines just now, so it has shut down. There
    are also problems at CERN with the central DQ2 services. Things at
    SLAC are running fine.

2. Tier-2 Hardware

    Every site should have the new hardware up and running by 1st
    April. Several sites had different kinds of estimates; SLAC, for
    instance, assumed the 2008 commitments were needed at the end of
    2007. Think that JimS took 20% off the contributions suggested by
    the sites.

    The longer we wait the better value we can get, but it will also
    mean we risk losing the rack space where they would go. Right now
    PPA is deciding who uses what space. We know just now where it
    would go, but may not in six months. Having had a bit of trouble
    this year with the "having the money but no equipment" thing, we
    should think about buying earlier rather than later.

    There are the water-cooled racks and some space on the second
    floor (need to worry a little about weight). After that it isn't
    clear where we'll have space. It might all be occupied by PII,
    BSD, etc. It is first come, first served, with the option that
    anyone can go to the management to get that changed. As it dawns
    on everyone that we run out of space at the end of the year, more
    folk will (need to) go that route.

    The BaBar purchase needs to be up and running by the 1st of the
    year. This needs a new Black Box. The purchase is being determined
    by the total power that can be put in it. Not likely to be the
    fastest and hottest CPUs you can get. Might get more CPU/$$ going
    down a notch or two to fit the power limit of the Black Box. In
    the water-cooled racks you need to get the most expensive CPUs to
    get the most out of them. It is something like $500k for eight
    racks (of around 40U). This will take around $4M in computing
    power. Normally for an outside contract you add 60% without CEF
    actually doing anything, with about $350k going to the
    contractor. The Black Box comes out at something like $3k per rack
    unit, but you are limited by power.

    The infrastructure is getting more expensive than the machines. A
    white paper from Stanford shows this, and also the electrical bill
    getting more expensive than the hardware in the future. It has
    also been the case in the past that people dominated 2:1, but this
    trend is reversing.

    Thinking of building some terraces into the hillside around
    IR8. Try to do it at as low a cost, but as well planned, as
    possible. Looking out three years ahead. This will provide
    facilities to SLAC and Stanford. Need to somehow survive till
    then, perhaps with more water-cooled racks, but the chilled water
    in Building 50 is also at capacity. Black Boxes also can't be
    added in the same place, as they need separate power and cooling
    provided. The industry might learn that it needs to make cooler
    stuff, as people can buy less hardware when they are paying more
    for infrastructure.

    So our purchase should probably be on the same timescale as the
    BaBar one. They are in a hurry though. We should keep open the
    option of using the same purchasing and evaluation effort. Perhaps
    the ATLAS stuff gets the same hardware with faster CPUs in it. So
    will probably tie these together.

3. AOB

    - SLAC ATLAS Group Allocation

      It came to Richard's attention that currently the SLAC ATLAS
      group has a special fairshare on the Tier-2 (the group in
      question isn't actually for the SLAC ATLAS group, but for any
      ATLAS users wanting to use the Tier-2 in batch mode). Cannot run
      a Tier-2 and give your local users special access. There should
      be lab-funded machines for general use by PPA employees. Will
      put a large fraction of Black Box #1 into this. Then all groups
      will get what they expect.

      Need to try to keep separate "local" users and general ATLAS. The
      "local" is the informal consortium of universities that supports
      the Tier-2. Need to be able to support analysis at the Tier-2 by
      giving them enough cycles etc. For local analysis activities can
      only let in folk in the Western Tier-2 Consortium. Many people
      are trying to do their own thing at their institution so the load
      may not be as high as expected.

      As a local person you could use the Tier-2 as a general ATLAS
      person or the Tier-3 as a local user. The same issue is coming up
      at CERN for local analysis there.

      Technically it isn't difficult to set up another LSF group. Not
      quite sure how to separate disk space usage though. Lab
      management has not been asked about an AllUsers disk pool. If
      there will be heavy use, we need some disk space funded, as we
      cannot use the Tier-2 disk space for local usage only. It is
      thought that the 20% reduction by JimS was for US usage. Trying
      to serve a diffuse community with a storage area isn't easy; it
      is best to have a well-identified set of users.

      Something that works for a production activity with one master
      won't necessarily work under more chaotic usage. BaBar is
      looking at using the local disk as a temporary storage area for
      skimming, where data could sit for many hours. Could end up
      having more storage on the nodes due to the more-or-less minimum
      size of disks coming with machines these days.

    - xrootd

      Some issues with redirector and ATLAS software, not sure where
      the problem is.

      Wei thinks there may also be some problems with 64 bit, but
      there are some other issues Andy should address.

Action Items:
-------------

070815	Wei	Think about how we maintain lists of local people etc

070801	Stephen	See when new Tier-2 hardware is needed
 	070815 Received answers independently.

070725	Stephen	Try to test eval01
 	070801 Didn't have access when attempted, Booker fixed that.
 	       Problem with ATLAS software (hopefully trivial).
 	070815 Installed new software to get around the problem. Not
 	       tested yet. Have ordered 128 cousins for it, have bought
 	       the machine.

070711	Stephen	Find out about benchmarks for CPUs for next purchase
 	070718 Not done yet.
 	070801 Extracted data from our Production Database, need to
 	       analyse it still.
 	070815 Not done yet.