LISTSERV 16.5 - ATLAS-SCCS-PLANNING-L Archives

Hi Wei, 

If I understood you, the probe jobs failed and that is what the statistics
are based on. At the same time, ATLAS jobs from the grid ran normally
because they are insensitive to the problem. So the numbers in the report
present a distorted picture of reality in that availability for production
jobs was at a higher level than indicated in the report. Of course we
cannot easily quantify what that level is, and it is not worth the effort.
Thanks for the explanation!

Charlie
-- 
Charles C. Young
M.S. 43, Stanford Linear Accelerator Center
P.O. Box 20450     
Stanford, CA 94309 
[log in to unmask]
voice  (650) 926 2669
fax    (650) 926 2923
CERN GSM +41 76 487 2069




On 7/19/14 10:06 AM, "Yang, Wei" <[log in to unmask]> wrote:

>Hi Charlie,
>
>The LSF event log file is hosted by the LSF master. We switched the LSF
>master host during that time and I didn't notice that. As a result, Grid
>jobs could continue to come to SLAC but they were not able to get their
>status updates (because the Grid couldn't access the LSF event log). ATLAS
>jobs are generic pilots and don't care about this status (a pilot updates
>its status with Panda). WLCG submits a probe job to SLAC once an hour and
>checks its status. So that is why we saw no problem from the ATLAS side, but
>the WLCG reliability and availability tests failed for those 6 days.
>
>regards,
>Wei Yang  |  [log in to unmask]  |  650-926-3338(O)
>
>
>
>On Jul 19, 2014, at 12:51 AM, Young, Charles C. <[log in to unmask]>
>wrote:
>
>> Hi Wei, 
>> 
>> Just to understand the point about LSF scheduler. Are you saying that
>>issues with accessing LSF log files biased the numbers in the report and
>>they are actually better? Or are you saying that LSF scheduler problem
>>led to lower availability?
>> 
>> The numbers for June are about 20% down, which translates to 6 days out of
>>the month. I wasn't paying close attention but did not notice jobs not
>>running for a week. Nor did I get complaints from others; someone must
>>have been running batch jobs separate from Tier-2 production. Was the
>>drop-off not a global one but somehow a reduction in the number of machines
>>available? Cheers.
>> 
>> Charlie
>> --
>> Charles C. Young
>> M.S. 43, Stanford Linear Accelerator Center
>> P.O. Box 20450  
>> Stanford, CA 94309
>> [log in to unmask]
>> voice  (650) 926 2669
>> fax    (650) 926 2923
>> CERN GSM +41 76 487 2069
>> 
>> From: "Yang, Wei" <[log in to unmask]>
>> Date: Thursday, July 17, 2014 7:33 PM
>> To: atlas-sccs-planning-l <[log in to unmask]>
>> Subject: Fwd: T2 Reliability & Availability - June 2014
>> 
>> FYI, we were pretty low in June. We changed the LSF scheduler master
>>host from Solaris to Linux, and ran into subtle issues in accessing LSF
>>log files via NFS. It all happened when I took a few days of sick
>>leave... There was also a scheduled outage in June.
>> 
>> Wei Yang  |  [log in to unmask]  |  1-650-926-3338
>> 
>> 
>> 
>> 
>> Begin forwarded message:
>> 
>>> From: WLCG Office <[log in to unmask]>
>>> Subject: RE: T2 Reliability & Availability - June 2014
>>> Date: July 17, 2014 at 8:27:43 AM PDT
>>> To: "project-wlcg-cb (Members of the WLCG CB)"
>>><[log in to unmask]>
>>> Cc: "project-lcg-gdb (LCG - Grid Deployment Board)"
>>><[log in to unmask]>, "sam-support (SAM support)"
>>><[log in to unmask]>, "[log in to unmask]" <[log in to unmask]>,
>>>"[log in to unmask]" <[log in to unmask]>
>>> 
>>> Dear all,
>>> 
>>> The final T2 reliability & availability for June 2014 is now available
>>>at:
>>> 
>>> 
>>>https://espace2013.cern.ch/WLCG-document-repository/ReliabilityAvailability/2014/june-14/
>>>under titles starting with "WLCG_All_Sites..."
>>> 
>>> The reports take into consideration all re-computation requests
>>>received in the last 10 calendar days as described in the
>>>re-computation policy.
>>> 
>>> Kind regards,
>>> Cath
>>> 
>>> 
>>> -----------------------------------------------
>>> WLCG Office
>>> IT Dept - CERN
>>> CH-1211 Genève, Switzerland
>>> www.cern.ch/wlcg
>>> From: WLCG Office
>>> Sent: 02 July 2014 11:02
>>> To: project-wlcg-cb (Members of the WLCG CB)
>>> Cc: project-lcg-gdb (LCG - Grid Deployment Board); sam-support (SAM
>>>support); [log in to unmask]; [log in to unmask]
>>> Subject: T2 Reliability & Availability - June 2014
>>> 
>>> Dear all,
>>> 
>>> The draft T2 reliability & availability reports for June 2014 are now
>>>available at:
>>> 
>>> http://sam-reports.web.cern.ch/sam-reports/2014/201406/wlcg/ under
>>>titles starting with "WLCG_All_Sites..."
>>> 
>>> Please verify your data and send any comments to WLCG Office by 12
>>>July.
>>> 
>>> Any requests for recomputation must be submitted via GGUS within the
>>>next 10 calendar days; full details here.
>>> 
>>> Kind regards,
>>> Cath
>>> 
>>> 
>>> -----------------------------------------------
>>> WLCG Office
>>> IT Dept - CERN
>>> CH-1211 Genève, Switzerland
>>> www.cern.ch/wlcg
>> 
>> 
>> Use REPLY-ALL to reply to list
>> To unsubscribe from the ATLAS-SCCS-PLANNING-L list, click the following
>>link:
>> 
>>https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=ATLAS-SCCS-PLANNING-L&A=1
>
