ATLAS SCCS Planning 23Jan2008
-----------------------------
9am, SCCS Conf Rm A, to call in +1 510 665 5437, press 1, 3935#
Present: Stephen, Wei, Randy, Chuck, Peter, Richard, Bob
Agenda:
1. DQ2 Status/Web Proxy
There is a problem with DQ2: the number of jobs it reports running
is much higher than the number actually running at SLAC. This is
the third time this problem has been seen, and it is still being
worked on. PANDA thinks there are 800 jobs when there are really
only 500; over the weekend it thought there were more than 1k when
there were only 700 or 800.
2. Tier-2 Hardware
The Tier-2 CPUs are there but not installed yet. All the Dell
machines are being installed, but at the same time the number of
machines previously installed is being reduced as they are moved
to the new Black Box. Will be increasing the fair shares a little,
but not to the final number for a month or so. The Dell machines
are too long to fit in the Black Box, so machines need to be
deinstalled from the water-cooled racks and moved to the Black Box
at IR12; the new systems will be put into the water-cooled
racks. While installing the new machines, will cable up the older
machines in the Black Box. Not sure what staff levels will be, so
it is hard to give a concrete schedule.
Making sure that the storage will be able to be plugged in. Looking
at what is needed from GLAST and BaBar to put an order together;
have a fairly reasonable offer in hand. The Thumpers will use 1TB
disks. Each server will use 4 Gb ports, so the storage will use
almost as many network ports as the CPUs. If this continues, the
Tier-2 will not lean on the rest of the lab for hardware, but it
will for support. As we are buying the same hardware for everyone,
we can make priority-based decisions on who needs it at any
particular time. It is not evident that there is nowhere to put
them. An issue with 10gigE Intel cards in the Thumpers has been
reported, but it shouldn't affect us. Might see some Thumper-like
device that is really a JBOD with 10gigE on the motherboard.
3. AOB
- Gatekeeper
Could probably use a faster machine. We also need more gridFTP
machines; three 20Zs would probably be enough. Not sure if
multicore machines would help. The best way to get the fastest
machines is to use machines bought for batch, but there is no
money earmarked for this. It was wondered whether we could use the
BaBar machines for ATLAS instead of buying new ones, but it was
too late. Perhaps some of the Dell 1950s could be sold if they are
not needed. For gridFTP it is not completely clear where the CPU
time is spent; it might be interesting to find out what other
Tier-2s are using. Could take some of the Opteron machines bought
for ATLAS to do this. Memory is one of the issues just now; more
could be added.
Will take some machines permanently for these services. Need four
machines; taking the last four will take it down to 135.
- External Security Review
In late February an external person will be doing an external
scan. In the week of March 15th, attempts at penetration from
inside will start. Will start having meetings this afternoon to
tighten up security at the lab. Some examples will be screen savers with
passwords, running crack on local passwords, etc. This will be a
pretty tough review, so everyone needs to help. Some extra staff
time will need to be put on security.
For ATLAS, need to make sure things like the MySQL server are up
to date.
- Network
Working on upgrading the external network to 10gigE.
- Job slots on xxl
At the moment the limit is 62. xlong is about a day, and many jobs
are running longer than that, so they need to be moved to
xxl. Would like to increase the number of slots in xxl to
something like 256. Wei has a graph showing that 400 jobs had been
killed due to this. Could also increase the length of the xlong
queue, trying to keep that around a day; ATLAS jobs are meant to
take around a day. Also need to make sure jobs don't run
forever. US ATLAS should define an amount of CPU time that is
needed at the Tier-2 queues. Will discuss changing the length
limits by email. Will also have a meeting at SLAC to see if xxl
could accept more jobs.
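The queue-routing idea above can be sketched as a simple rule: a
job goes to xlong if it fits under the roughly one-day limit,
otherwise to xxl. The constants here are illustrative assumptions
drawn from the discussion, not the actual SLAC batch configuration:

```python
# Hypothetical sketch of routing jobs by requested CPU time.
# XLONG_LIMIT_H (about a day) and the 256-slot xxl figure come from
# the discussion; they are assumptions, not real queue settings.

XLONG_LIMIT_H = 24   # assumed xlong CPU-time limit, in hours
XXL_SLOTS = 256      # proposed xxl slot count

def choose_queue(requested_cpu_hours):
    """Pick the queue whose limit the job fits under, so it isn't killed."""
    if requested_cpu_hours <= XLONG_LIMIT_H:
        return "xlong"
    return "xxl"
```

A job asking for 20 CPU hours would go to xlong; one asking for 30
would go to xxl rather than be killed when xlong's limit is hit.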
Is there one type of machine that kills jobs? BaBar is asking to
decrease the CPUF on the Dells as they are not performing as well
as expected.
Working on a proposal to submit jobs to a single queue
("general") where you would specify the CPU time needed. Short
jobs would be run first, but as longer jobs aged they would gain
priority to start. This would be useful when jobs start requesting
multiple cores: short jobs could be run on spare cores while a
whole machine is freed up.
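A minimal sketch of how such aging might work, assuming a linear
aging weight (the 0.5 factor and the scoring function are purely
illustrative, not part of the actual proposal):

```python
# Sketch of the proposed single "general" queue: each job declares a
# CPU-time request; short jobs start first, but waiting jobs gain
# priority with age so long jobs are not starved indefinitely.

AGE_WEIGHT = 0.5  # assumed: priority gained per hour of waiting

def priority(cpu_hours_requested, hours_waiting):
    # Lower score runs first: short requests win initially, but each
    # hour of waiting offsets AGE_WEIGHT hours of requested time.
    return cpu_hours_requested - AGE_WEIGHT * hours_waiting

def next_job(pending, now):
    """pending: list of (submit_time, cpu_hours); return index to run next."""
    return min(range(len(pending)),
               key=lambda i: priority(pending[i][1], now - pending[i][0]))
```

With this scoring, a fresh 2-hour job beats a fresh 48-hour job,
but after the 48-hour job has waited long enough it outranks newly
submitted short jobs, so long jobs eventually start.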
- Stephen Leaving
Stephen is leaving ATLAS and SLAC and moving to CERN and CMS. It
has been talked about for a long time but is finally happening
quickly. Peter will be taking over this meeting. Stephen and Peter
need to talk to come up with a transition plan; expect Stephen
will still run at least the next meeting.
- SL4 status
The new machines are being installed with RHEL4. Have had the green
light from GLAST on moving to RHEL4 and 64 bit. BaBar have been
running on it but not compiling on it. ATLAS also doesn't compile
in 64-bit mode.
- FDR
All US ATLAS Tier-2 sites are participating. Need to upgrade the
network. Could perhaps move the switches needed first. Have done
two already but those were the easy ones to schedule. The next ones
will not be so easy as they affect more people. Would probably be
too much work to do them all at the same time. Small steps are
generally better. Have learnt that running name servers on local
machines has caused problems, could perhaps stop doing that. Do now
have enough UPS to support the first module for the 10gigE uplink
to ESnet.
Action Items:
-------------
080123 Stephen Find out if there is or can be a US ATLAS batch limit(s)
--
/------------------------------------+-------------------------\
|Stephen J. Gowdy, SLAC | CERN Office: 32-2-A22|
|http://www.slac.stanford.edu/~gowdy/ | CH-1211 Geneva 23 |
| | Switzerland |
|EMail: [log in to unmask] | Tel: +41 22 767 5840 |
\------------------------------------+-------------------------/