________________________________

From: [log in to unmask] [mailto:[log in to unmask]] On Behalf Of Razvan Popescu
Sent: Wednesday, April 04, 2007 3:00 PM
To: [log in to unmask]
Subject: [Usatlas-grid-l] F&O Meeting Notes 4/4



Production/Operation Tutorial session(s):

 

*       Very welcome!

*       Tutorial session with operational procedures for site administrators: how to monitor production and data transfers, how to debug problems, routine procedures, and more.

*       Two deliveries: one "virtual" session (via a collaborative tool to be announced) on or around May 4th (3 hours, 1-4 pm). Tentative schedule.

*       A second session under consideration - June 22nd (after the next T2 quarterly meeting at IndianaU). Watch for confirmation or changes.

*       Content to be defined and prepared by UTA/production operations and data mgmt operations.

*       Alexei - not yet contacted.

 

Production: 

 

*       All fine. 

*       Enough job supply for months.

*       Sites: watch disk utilization and resource availability - we're in for a long production cycle.
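
As an illustration of the "watch disk utilization" item above, here is a minimal Python sketch; the /data mount point and the 85% threshold are assumptions for illustration, not site policy:

    import shutil

    # Minimal sketch: warn when a data partition crosses an assumed threshold.
    THRESHOLD = 0.85          # assumed warning level, not an agreed USATLAS policy
    DATA_PATH = "/data"       # hypothetical mount point; adjust per site

    def check_disk(path=DATA_PATH):
        usage = shutil.disk_usage(path)        # total/used/free in bytes
        used_fraction = usage.used / usage.total
        if used_fraction > THRESHOLD:
            print("WARNING: %s is %.0f%% full" % (path, 100 * used_fraction))
        return used_fraction

    if __name__ == "__main__":
        check_disk()

Something like this could run from cron; it only makes the disk watch concrete and is not a replacement for the sites' existing monitoring.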

 

Data Management:

 

*       Overall ATLAS disk storage shortage - BNL offered help (additional storage). Alexei will coordinate relocation of data to tape to free up disk storage.

*       Due to dCache data corruption we'll need to resubscribe all AOD (and Ntuple) data brought in during the recent AOD replication exercise. The data will be deleted and resubscribed, first at the T1, then at the T2s, following a schedule driven by regional needs.

*       Alexei will coordinate.

*       dCache issue details will be sent to this list by Michael.

*       Question: can we have separate storage for production and AOD? Is it possible to operate multiple storage endpoints under the same TiersOfAtlas location? - It is expected to be possible; however, BNL (Hiro, Wensheng, Xin) will investigate and provide the details. Guys, please follow up!
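
To make the multi-endpoint question concrete, here is a purely hypothetical Python sketch; it is not the actual TiersOfAtlas syntax (that is what BNL will confirm), only an illustration of one site location exposing separate endpoints for production output and for AOD replicas:

    # Hypothetical illustration only -- not the real TiersOfAtlas format.
    # One site location maps to two storage endpoints, selected by data class.
    SITE_ENDPOINTS = {
        "EXAMPLE_T2": {   # made-up site name; endpoint URLs are placeholders
            "production": "srm://prod.example.edu/pnfs/example/prod/",
            "aod":        "srm://aod.example.edu/pnfs/example/aod/",
        },
    }

    def endpoint_for(site, data_class):
        """Return the endpoint a subscription of this data class would use."""
        return SITE_ENDPOINTS[site][data_class]

    print(endpoint_for("EXAMPLE_T2", "aod"))   # -> the AOD endpoint

Whether DQ2/TiersOfAtlas can actually be configured this way per location is exactly the point Hiro, Wensheng, and Xin are asked to follow up on.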

 

Network monitoring:

 

*       A network monitoring system developed at UMichigan has been available as a bare-bones install disk for some time. So far it has been used at UM and UTA, as well as at UMinnesota and during SuperComputing 2006.

*       We should proceed with deployment of the system at all sites. Please provide the IP address/location to Shawn.

*       UMich will put instructions on the network twiki.

 

Capacity ramp-up:

 

*       NE: 328 cores, 492kSi2k, 129TB 

*       GL: new GW (OSG 0.6) will be available today. 160 new job slots (under Condor). Two more purchases: June/Tier3, August/rest of equipment. This year will exceed requirements: 784kSi2k, 220TB.

*       MW: in progress: additional 100TB to bring the total to 160TB. Phase 4 of procurement: additional 50TB and 140kSi2k. Late summer totals: 600kSi2k, 200TB.

*       SLAC: June: 421kSi2k, 54TB. End of year: 660kSi2k, 250TB.

*       SW: Now 700 cores (UTA+OU) ~ 800kSi2k, 95TB (storage not completely operational, expected completion by end of month). New purchase: 200-400 cores + 150-200TB. Oklahoma: 160 cores soon + 15TB on order.

 

AOB:

 

*       Is OSG 0.6 ready for USATLAS deployment? Wait for confirmation from production/data_mgmt people.

*       UM will give early results. No anticipated difficulties.

*       In general, AOD replication stressed systems at many sites. Limited I/O bandwidth (at either the network level or the storage-system level) made a few sites shut down the replication to ensure enough resources for production data flows. To a certain extent, the DQ2 modification enabling the use of (two-class) fair share helped - when sites were not resource constrained. (A sketch of the two-class idea follows at the end of these notes.)

*       The AOD flows will continue to be a challenge - see the need to re-replicate and the future redistribution of new AODs based on a new software release. Sufficient end-to-end I/O bandwidth must remain a first-level priority.
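
For reference, here is a minimal Python sketch of the two-class fair-share idea mentioned above; the 70/30 split and the class names are assumptions for illustration, not the actual DQ2 implementation:

    import random

    # Two-class fair share: transfer slots are split between production traffic
    # and AOD replication by configured shares; an idle class yields to the other.
    SHARES = {"production": 0.7, "aod_replication": 0.3}   # assumed split

    def pick_class(queues):
        """Pick the next class to serve, weighted by share, among non-empty queues."""
        candidates = [(cls, SHARES[cls]) for cls in queues if queues[cls]]
        if not candidates:
            return None
        r = random.uniform(0, sum(w for _, w in candidates))
        upto = 0.0
        for cls, weight in candidates:
            upto += weight
            if r <= upto:
                return cls
        return candidates[-1][0]

    # While both queues are busy, production is served ~70% of the time;
    # when production is idle, AOD replication gets the full capacity.
    print(pick_class({"production": ["task1"], "aod_replication": ["dataset1"]}))

This also shows why fair share stops helping once a site is resource constrained: the shares only divide the available capacity, they do not create more of it.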