Production status:
- All fine.
- dCache problems over the weekend:
  o Write pool nodes went down due to a mistaken maintenance script. Problem identified and resolved (the fix is presumed available in v1.7).
  o Network configuration over-provisioning used all available system memory; identified and rectified. Network optimization efforts had been misdirected by deficiencies in the monitoring hardware.
  o Production jobs recovered automatically.
- (Action Item, due next Wednesday) Set up a status page with updates on issues in progress, preferably including the history of each problem. We currently have an announcement page (http://www.rhic.bnl.gov/RCF/Announcements/announce.html), but we'll try something better.
- Friday @ BU: controller problems -> switched production to "data2"; "data3" and "data4" became unavailable and catalog inconsistencies were detected. All files were removed from "data3" and "data4" and the catalog was resynchronized.
- New GLTier2 site using OSG 0.6 is up and running fine.
- Tadashi is reworking the brokering algorithm to better account for sites with multiple (virtual) CPUs per node. The new version was released this morning and is under close monitoring; watch for better site utilization factors. (An illustrative sketch follows this list.)
- Test jobs worked fine at the first "non-DQ2" site (OSCER@OU), a prototype for utilization of non-USATLAS (OSG) resources.
- The Resource Allocation Committee (RAC) allows regional use of up to 20% of site resources (equivalent to about 50,000 events per day). So far there has been little regional demand; please consider using it.
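
The minutes do not record the details of the reworked brokering algorithm, so the following is a minimal illustrative sketch only, assuming per-vCPU slot accounting. The names (Site, broker) and example numbers are hypothetical and are not taken from the actual PanDA code.

    # Hypothetical illustration only -- not the actual PanDA brokering code.
    # Counts every (virtual) CPU on a node as a job slot, rather than
    # treating each node as a single slot.
    from dataclasses import dataclass

    @dataclass
    class Site:
        name: str
        nodes: int          # worker nodes at the site
        cpus_per_node: int  # virtual CPUs per node
        running_jobs: int   # jobs currently occupying slots

        @property
        def total_slots(self) -> int:
            # Per-vCPU accounting: 40 nodes x 8 vCPUs = 320 slots, not 40.
            return self.nodes * self.cpus_per_node

        @property
        def free_slots(self) -> int:
            return max(self.total_slots - self.running_jobs, 0)

    def broker(sites: list[Site]) -> Site:
        """Pick the site with the most free slots (hypothetical policy)."""
        return max(sites, key=lambda s: s.free_slots)

    if __name__ == "__main__":
        sites = [
            Site("SITE_A", nodes=100, cpus_per_node=2, running_jobs=150),
            Site("SITE_B", nodes=40, cpus_per_node=8, running_jobs=200),
        ]
        chosen = broker(sites)
        print(f"dispatch to {chosen.name} ({chosen.free_slots} free slots)")

Under per-node accounting SITE_B would look saturated (200 jobs on 40 nodes); per-vCPU accounting shows 120 free slots there, which is the kind of utilization gain to watch for.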
OSG 0.6 production readiness status:
- The GL T2 installation is certainly a good verification. However, considering the GLT2 configuration specifics, a more thorough validation would be desirable before committing to full deployment. Does UC (Marco) have additional results? Marco?
- We'll wait until next week to gain more experience and more feedback, and then we'll reassess the situation.
Data Management:
- Note from Alexei: he'll start testing DQ2 v0.3 using the Lyon (IN2P3) cloud and AOD distribution only. For the moment the DQ2 developers are busy with Castor and network optimizations, and there is some concern about the current TCP/IP parameterization (high packet drop rate). Unfortunately, this diverts their attention away from DQ2 issues.
- (Action Item) How to split AOD and production storage?
  o Multiple ToA "sites" were deemed undesirable, for reasons that are not entirely clear. Needs further discussion.
  o Multiple "storage destinations" within the same ToA site, plus a patch in the site services hardcoding a different processing method selected by file name, is not very attractive to many participants: it is by definition not completely reliable (there is no generally accepted agreement on file-name coding) and quite difficult to maintain -> also undesirable.
  o A metadata tag defined during the subscription phase (and stored in the LRC) is more attractive to many sites. How to proceed? (An illustrative sketch follows this list.)
  o Action: Reopen the discussion with Miguel (including Wensheng, Saul, Shawn, Dan, Hiro, and others) to try to find an acceptable compromise.
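
To make the metadata-tag option concrete, here is a minimal sketch assuming a tag recorded at subscription time and kept with the catalog entry. The tag names, storage paths, and functions (subscribe, destination) are hypothetical stand-ins for the real DQ2 site services and LRC, not their actual interfaces.

    # Hypothetical sketch of the metadata-tag proposal -- not DQ2/LRC code.
    # A tag set at subscription time selects the storage destination;
    # no file-name parsing is involved.

    STORAGE_AREAS = {            # assumed tag -> storage-area mapping
        "aod": "/dcache/aod",
        "production": "/dcache/prod",
    }

    lrc_metadata: dict[str, str] = {}   # stand-in for the LRC: lfn -> tag

    def subscribe(lfn: str, tag: str) -> None:
        """Record the destination tag when the subscription is made."""
        if tag not in STORAGE_AREAS:
            raise ValueError(f"unknown storage tag: {tag}")
        lrc_metadata[lfn] = tag

    def destination(lfn: str) -> str:
        """Resolve the storage path from the stored tag, not the name."""
        tag = lrc_metadata[lfn]   # KeyError if never subscribed
        return f"{STORAGE_AREAS[tag]}/{lfn}"

    if __name__ == "__main__":
        subscribe("AOD.pool.root.1", "aod")
        subscribe("EVNT.pool.root.1", "production")
        print(destination("AOD.pool.root.1"))   # /dcache/aod/AOD.pool.root.1
        print(destination("EVNT.pool.root.1"))  # /dcache/prod/EVNT.pool.root.1

Unlike file-name parsing, the mapping lives in one place and can change without touching any naming convention, which is the maintainability point raised above.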
Site updates:
- GL: All running well (OSG 0.6). Harmless "reached max no of agents" entries in the (site services) log (I forgot which log).
- MW: All good. MWT2_UC dCache upgrade in progress.
- NE: 2 filesystems lost (see above). 60 TB to become operational soon; 5 new blade chassis.
- SW: Production fine, no issues. New deployment of an OSG 0.6 gateway for testing. Downtime on 4/28-29!
- OU: All good. New OSG ITB in progress.
- SLAC: Production fine, but will run out of space soon. New hardware has arrived: computing will be operational by end of June (requires plumbing and a power outage); storage (54 TB) will be operational by end of month (or sooner), running xrootd and gridftp.
- BNL: dCache problems (see above). Site unavailable Thursday (tomorrow) at 10:00am (for 2 hours) for network maintenance. Production should recover.