From: [log in to unmask] [mailto:[log in to unmask]] On Behalf Of Popescu, Razvan
Sent: Wednesday, April 18, 2007 12:20 PM
To: [log in to unmask]
Subject: [Usatlas-grid-l] minutes F&O meeting 4/18

Production status:

 

-          All fine

-          dCache problems over the weekend:

o        Write pool nodes went down due to a mistaken maintenance script. Problem identified and resolved (feature now presumed available in v1.7).

o        Over-provisioning in the network configuration used all available system memory. Identified and rectified. Network optimization efforts were misdirected by deficiencies of the monitoring hardware.

o        Production jobs recovered automatically.

-          (Action Item – due next Wed) Status page with updates regarding issues in progress (preferably including the history of the problem). Currently we have an announcement page http://www.rhic.bnl.gov/RCF/Announcements/announce.html but we’ll try something better.

-          Friday @ BU: controller problems -> switched production to “data2”; “data3” and “data4” became unavailable and catalog inconsistencies were detected. Removed all files from “data3” and “data4” and resynchronized the catalog.

-          New GLTier2 site using OSG 0.6 is up and running fine.

-          Tadashi is reworking the brokering algorithm to better account for sites with multiple (virtual) CPUs per node. The new version was released this morning and is under close monitoring; watch for better site utilization factors.

-          Test jobs worked fine at the first “non-DQ2” site (OSCER@OU). Prototype for utilization of non-USATLAS resources (OSG).

-          Resource Allocation Committee (RAC) allows for regional use of up to 20% of site resources (equivalent to about 50,000 events per day). So far not much regional demand. Consider it.

 

OSG 0.6 production readiness status:

 

-          The GLT2 installation is certainly a good verification. However, considering the GLT2 configuration specifics, a more thorough validation would be desirable before committing to full deployment. Does UC (Marco) have additional results? Marco?

-          We’ll wait until next week to gain more experience and more feedback, and then we’ll reassess the situation.

 

Data Management:

 

-          Note from Alexei: he’ll start testing DQ2 v0.3 using the Lyon (IN2P3) cloud and AOD distribution only. For the moment the DQ2 developers are busy with Castor and network optimizations. Some concern with the current TCP/IP parameterization (high packet drop rate). Unfortunately it diverts their attention away from DQ2 issues.

-          (Action Item) How to split AOD and production storage?

o        Multiple ToA “sites” were deemed undesirable, for reasons that are not entirely clear. Need further discussion.

o        Multiple “storage destinations” within the same ToA site, plus a patch in site services that hardcodes a different processing method selected by file name, is not very attractive to many participants. It is by definition not completely reliable (there is no generally accepted agreement on file name coding) and quite difficult to maintain -> undesirable, too.

o        A metadata tag to be defined during the subscription phase (and stored in the LRC) is more attractive to many sites. How to proceed?

o        Action: Reopen the discussion with Miguel (including, but not limited to, Wensheng, Saul, Shawn, Dan and Hiro) to try to find an acceptable compromise.

 

Site updates:

 

-          GL: All running well (OSG 0.6). Harmless entries “reached max no of agents” in (site services) log (I forgot which log).

-          MW: All good. MWT2_UC dCache upgrade in progress.

-          NE: 2 filesystems lost (see above). 60TB to become operational soon. 5 new blade chassis.

-          SW: Prod fine. No issues. New deployment of OSG 0.6 gateway for testing. Downtime on 4/28 and 4/29!

-          OU: All good. New OSG ITB in progress.

-          SLAC: Prod fine. Will run out of space soon. New HW arrived. Computing will be operational by end of June (requires plumbing and a power outage). Storage (54TB) will be operational by end of month (or sooner), running xrootd and gridftp.

-          BNL: dCache problems (see above). Site unavailable on Thu (tomorrow) at 10:00am (for 2hr) for network maintenance. Prod should recover.

 
