ATLAS SCCS Planning 23Jan2008
-----------------------------
9am, SCCS Conf Rm A, to call in +1 510 665 5437, press 1, 3935#
Present: Stephen, Wei, Randy, Chuck, Peter, Richard, Bob
Agenda:
1. DQ2 Status/Web Proxy
There is a problem with DQ2: the number of jobs it reports running
is much higher than the number actually running at SLAC. This is
the third time this problem has been seen, and it is still being
worked on. PANDA thinks there are 800 jobs when there are really
only 500; over the weekend it thought there were more than 1k when
there were only 700 or 800.
2. Tier-2 Hardware
The Tier-2 CPUs are there but not installed yet. All the Dell
machines are being installed, but at the same time the number of
machines previously installed is being reduced as they are moved
to the new Black Box. Will be increasing the fair shares a little,
but not to the final number for a month or so. The Dell machines
are too long to fit in the Black Box, so machines need to be
deinstalled from the water-cooled racks and moved to the Black Box
at IR12; the new systems will be put into the water-cooled
racks. While installing the new machines, will cable up the older
machines in the Black Box. Not sure what staff levels will be, so
it is hard to give a concrete schedule.
Making sure that the storage will be able to be plugged in. Looking
at what is needed from GLAST and BaBar to put an order together;
have a fairly reasonable offer in hand. The Thumpers will use 1TB
disks. Each server will use 4 Gb ports, so the storage will use
almost as many network ports as the CPUs. If this continues, the
Tier-2 will not lean on the rest of the lab for hardware, but it
will for support. As we are buying the same hardware for everyone,
we can make priority-based decisions on who needs it at any
particular time. It is not evident that there is nowhere to put
them. An issue with 10gigE Intel cards in the Thumpers has been
reported, but it shouldn't affect us. Might see some Thumper-like
device that is really a JBOD with 10gigE on the motherboard.
3. AOB
- Gatekeeper
Could probably use a faster machine. We also need more gridFTP
machines; three 20Zs would probably be enough. Not sure if
multicore machines would help. The best way to get the fastest
machines is to use machines bought for batch, but there is no
money earmarked for this. It was wondered whether we could use the
BaBar machines for ATLAS instead of buying new ones, but it was
too late. Perhaps some of the Dell 1950s could be sold if they are
not needed. For gridFTP it is not completely clear where the CPU
time is spent; it might be interesting to find out what other
Tier-2s are using. Could take some of the Opteron machines bought
for ATLAS to do this. Memory is one of the issues just now; more
could be added.
Will take some machines permanently for these services. Need four
machines; taking the last four will take it down to 135.
- External Security Review
In late February an external person will be doing an external
scan. In the week of March 15th, attempts at penetration from
inside will start. Will start having meetings this afternoon to
tighten up security at the lab. Some examples will be screen savers with
passwords, running crack on local passwords, etc. This will be a
pretty tough review, so everyone needs to help. Some extra staff
time will need to be put on security.
For ATLAS, need to make sure things like the MySQL server are up
to date.
- Network
Working on upgrading the external network to 10gigE.
- Job slots on xxl
At the moment the limit is 62. xlong is about a day, and many jobs
are running longer than that, so they need to be moved to
xxl. Would like to increase the number of slots in xxl to
something like 256. Wei has a graph showing that 400 jobs had been
killed due to this. Could also increase the length of the xlong
queue, trying to keep that around a day; ATLAS jobs are meant to
take around a day. Also need to make sure jobs don't run
forever. US ATLAS should define an amount of CPU time that is
needed at the Tier-2 queues. Will discuss changing the length
limits by email. Will also have a meeting at SLAC to see if xxl
could accept more jobs.
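The queue-routing idea above can be sketched as a simple rule: a
job goes to xlong if it fits under the roughly one-day limit,
otherwise to xxl. The constants here are illustrative assumptions
drawn from the discussion, not the actual SLAC batch configuration:

```python
# Hypothetical sketch of routing jobs by requested CPU time.
# XLONG_LIMIT_H (about a day) and the 256-slot xxl figure come from
# the discussion; they are assumptions, not real queue settings.

XLONG_LIMIT_H = 24   # assumed xlong CPU-time limit, in hours
XXL_SLOTS = 256      # proposed xxl slot count

def choose_queue(requested_cpu_hours):
    """Pick the queue whose limit the job fits under, so it isn't killed."""
    if requested_cpu_hours <= XLONG_LIMIT_H:
        return "xlong"
    return "xxl"
```

A job asking for 20 CPU hours would go to xlong; one asking for 30
would go to xxl rather than be killed when xlong's limit is hit.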
Is there one type of machine that kills jobs? BaBar is asking to
decrease the CPUF on the Dells as they are not performing as well
as expected.
Working on a proposal to submit jobs to a single queue
("general") where you would specify the CPU time needed. Short
jobs would be run first, but as longer jobs aged they would gain
priority to start. This would be useful when jobs start requesting
multiple cores: short jobs could be run on spare cores while a
whole machine is freed up.
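A minimal sketch of how such aging might work, assuming a linear
aging weight (the 0.5 factor and the scoring function are purely
illustrative, not part of the actual proposal):

```python
# Sketch of the proposed single "general" queue: each job declares a
# CPU-time request; short jobs start first, but waiting jobs gain
# priority with age so long jobs are not starved indefinitely.

AGE_WEIGHT = 0.5  # assumed: priority gained per hour of waiting

def priority(cpu_hours_requested, hours_waiting):
    # Lower score runs first: short requests win initially, but each
    # hour of waiting offsets AGE_WEIGHT hours of requested time.
    return cpu_hours_requested - AGE_WEIGHT * hours_waiting

def next_job(pending, now):
    """pending: list of (submit_time, cpu_hours); return index to run next."""
    return min(range(len(pending)),
               key=lambda i: priority(pending[i][1], now - pending[i][0]))
```

With this scoring, a fresh 2-hour job beats a fresh 48-hour job,
but after the 48-hour job has waited long enough it outranks newly
submitted short jobs, so long jobs eventually start.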
- Stephen Leaving
Stephen is leaving ATLAS and SLAC and moving to CERN and CMS. It
has been talked about for a long time but is finally happening
quickly. Peter will be taking over this meeting. Stephen and Peter
need to talk to come up with a transition plan; expect Stephen
will still run at least the next meeting.
- SL4 status
The new machines are being installed with RHEL4. Have had the green
light from GLAST on moving to RHEL4 and 64 bit. BaBar have been
running on it but not compiling on it. ATLAS also doesn't compile
in 64-bit mode.
- FDR
All US ATLAS Tier-2 sites are participating. Need to upgrade the
network. Could perhaps move the switches needed first. Have done
two already but those were the easy ones to schedule. The next ones
will not be so easy as they affect more people. Would probably be
too much work to do them all at the same time. Small steps are
generally better. Have learnt that running name servers on local
machines has caused problems, could perhaps stop doing that. Do now
have enough UPS to support the first module for the 10gigE uplink
to ESnet.
Action Items:
-------------
080123 Stephen Find out if there is or can be a US ATLAS batch limit(s)
--
/------------------------------------+-------------------------\
|Stephen J. Gowdy, SLAC | CERN Office: 32-2-A22|
|http://www.slac.stanford.edu/~gowdy/ | CH-1211 Geneva 23 |
| | Switzerland |
|EMail: [log in to unmask] | Tel: +41 22 767 5840 |
\------------------------------------+-------------------------/