Hi,
  We finished our initial testing of 8-core machines at DZERO for the Level-3 trigger/DAQ (these are configured as a large farm running a software-only trigger). Unfortunately, before we could run careful memory usage tests we had to return the machine. The node had 8 cores and 8 gigs of memory (x64, obviously).

  We did have to make one change to the Linux kernel (easy): /proc/sys/kernel/msgmni -- we upped that to allow 64 message queues. But I doubt normal usage will have to deal with that -- we use these message queues to coordinate the movement of data through the system. We don't have exactly 8 independent executables; rather, we have an event builder in each node which sends fully built events to one of 8 trigger executables. These message queues do the coordination.
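
  In case a concrete picture helps, below is a minimal sketch (not our actual D0 trigger code -- the queue keys, payload size, and the single send/receive are purely illustrative) of an event builder handing a fully built event to a trigger executable over SysV message queues. With one queue per trigger process, kernel.msgmni has to be at least the number of queues you create per node.

/* Sketch only: the event builder side creates one SysV message queue per
 * trigger executable and pushes built events onto them.  Raise the queue
 * limit first, e.g. "sysctl -w kernel.msgmni=64" or write 64 into
 * /proc/sys/kernel/msgmni. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>

#define NUM_TRIGGERS 8        /* one trigger executable per core (illustrative) */
#define MAX_EVENT    4096     /* illustrative event payload size                */

struct event_msg {
    long mtype;               /* SysV requires this first field, > 0 */
    char payload[MAX_EVENT];  /* fully built event data              */
};

int main(void)
{
    int qid[NUM_TRIGGERS];

    /* Create (or attach to) one queue per trigger executable. */
    for (int i = 0; i < NUM_TRIGGERS; i++) {
        qid[i] = msgget(ftok("/tmp", 'A' + i), IPC_CREAT | 0666);
        if (qid[i] < 0) {
            perror("msgget");  /* fails if kernel.msgmni is too low */
            return EXIT_FAILURE;
        }
    }

    /* Event builder: send one dummy "event" to the first trigger's queue. */
    struct event_msg msg;
    msg.mtype = 1;
    memset(msg.payload, 0, sizeof msg.payload);
    if (msgsnd(qid[0], &msg, sizeof msg.payload, 0) < 0)
        perror("msgsnd");

    /* A trigger executable (a separate process attached to the same key)
     * would block here waiting for its next event. */
    if (msgrcv(qid[0], &msg, sizeof msg.payload, 1, 0) < 0)
        perror("msgrcv");

    return EXIT_SUCCESS;
}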

  As far as the simple memory pressure tests we did do, with 7 of the images running there didn't seem to be much of a problem. Certainly it was not the bottleneck.

  Sorry this isn't more helpful. But at least for the DØ Trigger we've not hit any limits.

	Cheers,
		Gordon.

P.S. I remember the VAX where we talked about never _ever_ using up all the address space that 32 bit machine had. Ha!

> -----Original Message-----
> From: [log in to unmask] [mailto:owner-
> [log in to unmask]] On Behalf Of Stephen J. Gowdy
> Sent: Wednesday, August 15, 2007 7:07 PM
> To: ATLAS SCCS Planning
> Subject: Minutes of ATLAS/SCCS Planning Meeting 15th August 2007
> 
> ATLAS SCCS Planning 15Aug2007
> -----------------------------
> 
>   9am, SCCS Conf Rm A, to call in +1 510 665 5437, press 1, 3935#
> 
> Present: Wei, Stephen, Len, Richard, Chuck, Booker
> 
> Agenda:
> 
> 1. DQ2 Status/Web Proxy
> 
>     BNL is reallocating machines just now so have shut down. Also
>     problems at CERN with central DQ2 services. Things at SLAC are
>     running fine.
> 
> 2. Tier-2 Hardware
> 
>     Every site should have the new hardware up and running by 1st
>     April. Several sites had different kinds of estimates; SLAC, for
>     instance, assumed the 2008 commitments were needed at the end of
>     2007. Think that JimS took 20% off the contributions suggested by
>     the sites.
> 
>     The longer we wait the better value we can get, but it will also
>     mean we risk losing the rack space where they would go. Right now
>     PPA is deciding who uses what space. We know just now where it
>     would go, but not in six months. Having had a bit of trouble this
>     year with having the money but no equipment, we should think about
>     buying earlier rather than later.
> 
>     There are the water-cooled racks and some space on the second floor
>     (need to worry a little about weight). After that it isn't clear
>     where we'll have space. It might all be occupied by PII, BSD,
>     etc. It is first come, first served, with the option that anyone can
>     go to the management to get that changed. When it dawns on everyone
>     that we run out of space at the end of the year, more folk will
>     (need to) go that route.
> 
>     The BaBar purchase needs to be up and running by the 1st of the
>     year. This needs a new Black Box. The purchase is being determined
>     based on the total power that can be put in it. Not likely to be
>     the fastest and hottest CPUs you can get; might get more CPU/$$
>     going down a notch or two for the power limit of the Black Box. In
>     the water-cooled racks you need to get the most expensive CPUs to
>     get the most out of them. It is something like $500k for eight racks
>     (of around 40U). This will take around $4M in computing
>     power. Normally to an outside contract you add 60% without CEF
>     actually doing anything, with about $350k going to the
>     contractor. The Black Box comes out at something like $3k per rack
>     unit, but you are limited by power.
> 
>     The infrastructure is getting more expensive than the machines. A
>     white paper from Stanford shows this, and also that the electricity
>     bill will become more expensive than the hardware in the future. In
>     the past, people costs dominated 2:1, but this trend is reversing.
> 
>     Thinking of building some terraces into the hillside around
>     IR8. Try to do it as low-cost but as well planned as possible.
>     Looking three years ahead. This will provide facilities to SLAC and
>     Stanford. Need to somehow survive till then, perhaps with more
>     water-cooled racks, but the chilled water in Building 50 is also at
>     capacity. Black Boxes also can't be added in the same place, as they
>     need separate power and cooling provided. The industry might learn,
>     as people buy less hardware because they are paying more for
>     infrastructure, that they need to make cooler stuff.
> 
>     So our purchase should probably be on the same timescale as the
>     BaBar one. They are in a hurry though. We should keep open the
>     option of using the same purchasing and evaluation effort. Perhaps
>     the ATLAS purchase gets the same hardware with faster CPUs in it. So
>     will probably tie these together.
> 
> 3. AOB
> 
>     - SLAC ATLAS Group Allocation
> 
>       Came to Richard's knowledge that currently the SLAC ATLAS group
>       has a special fairshare (the group in question isn't actually for
>       the SLAC ATLAS group, but for any ATLAS users wanting to use the
>       Tier-2 in batch mode) on the Tier-2. Cannot run a Tier-2 and
>       give your local users special access. There should be lab-funded
>       machines for general use by PPA employees. Will put a large
>       fraction of Black Box #1 into this. Then all groups will get
>       what they expect.
> 
>       Need to try to keep separate "local" users and general ATLAS. The
>       "local" is the informal consortium of universities that supports
>       the Tier-2. Need to be able to support analysis at the Tier-2 by
>       giving them enough cycles etc. For local analysis activities can
>       only let in folk in the Western Tier-2 Consortium. Many people
>       are trying to do their own thing at their institution so the load
>       may not be as high as expected.
> 
>       As a local person you could use the Tier-2 as a general ATLAS
>       person or the Tier-3 as a local user. The same issue is coming up
>       at CERN for local analysis here.
> 
>       Technically it isn't difficult to set up another LSF group. Not
>       quite sure how to separate disk space usage though. Lab
>       management has not been asked about an AllUsers disk pool. If
>       there will be heavy use, need some disk space funded, as we cannot
>       use the Tier-2 disk space for local usage only. It is thought that
>       the 20% reduction by JimS was for US usage. Trying to serve a
>       diffuse community with a storage area isn't easy; best to have a
>       well-identified set of users.
> 
>       Something that works for a production activity with one master
>       won't necessarily work in a more chaotic usage. BaBar is looking
>       at using the local disk as a temporary storage area for skimming,
>       which could be there for many hours. Could end up having more
>       storage on the nodes due to the more-or-less minimum size disks
>       coming with machines these days.
> 
>     - xrootd
> 
>       Some issues with redirector and ATLAS software, not sure where
>       the problem is.
> 
>       Wei thinks there may also be some problems with 64-bit, but there
>       are some other issues Andy should address.
> 
> Action Items:
> -------------
> 
> 070815	Wei	Think about how we maintain lists of local people etc
> 
> 070801	Stephen See when new Tier-2 hardware is needed
>  	070815 Received answers independently.
> 
> 070725	Stephen	Try to test eval01
>  	070801 Didn't have access when attempted, Booker fixed that.
>  	       Problem with ATLAS software (hopefully trivial).
>  	070815 Installed new software to get around the problem. Not
>  	       tested yet. Have ordered 128 cousins for it, have bought
>  	       the machine.
> 
> 070711	Stephen	Find out about benchmarks for CPUs for next purchase
>  	070718 Not done yet.
>  	070801 Extracted data from our Production Database, need to
>  	       analyse it still.
>  	070815 Not done yet.