@osschar Thanks for the nice plots and, more importantly, the summary. I am still trying to understand them and their implications. To answer your questions, the following points may help:

  1. All test jobs (feeding into this plot) are identical except for the input files - 12 files are randomly picked from a list of about 2000.
  2. Final status of the jobs: some jobs succeed. Some jobs are killed by DIRAC after being found to be stuck (CPU efficiency < 5%, "Watchdog identified this job as stalled"). The remaining jobs fail with the error we are concerned with here; this is the "application finished with errors" category on the following monitoring page:
    http://lhcb-project-dirac.web.cern.ch/lhcb-project-dirac/RALEchoTestMonitoring.html
  3. There is no way I can guarantee that a job will stream its data successfully. The real problem is that a given job can fail once and succeed another time on the same machine with the same input files, which has led to so much puzzlement and time spent on this issue.

I am guessing that the monitoring information you get is not broken down by job. As a result, there is no easy way for me to tell you whether a given file was fully streamed or not. All I can tell you is that about 3% of jobs succeeded recently (last 24 hours); for a job that did not succeed, this implies that at least one of its 12 input files failed to stream fully. From experience, the failure can happen at any point in the streaming of a file.
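
To put the 3% figure in per-file terms, here is a rough back-of-the-envelope sketch. It rests on assumptions that are not measurements: that every job that did not succeed lost at least one file (stalled jobs may not fit this), and that the 12 files in a job fail independently with the same probability. The function name `per_file_success_rate` is made up for this illustration.

```python
def per_file_success_rate(job_success_rate: float, files_per_job: int) -> float:
    """Estimate the per-file streaming success rate from the per-job success rate.

    Assumption (not a measurement): a job succeeds only if all of its files
    stream fully, and files succeed independently with equal probability p,
    so job_success_rate = p ** files_per_job.
    """
    if not 0.0 <= job_success_rate <= 1.0:
        raise ValueError("job_success_rate must be between 0 and 1")
    if files_per_job < 1:
        raise ValueError("files_per_job must be at least 1")
    # Invert p ** n = job_success_rate to recover p.
    return job_success_rate ** (1.0 / files_per_job)


if __name__ == "__main__":
    # Numbers quoted above: about 3% of jobs succeeded, 12 input files per job.
    p = per_file_success_rate(0.03, 12)
    print(f"estimated per-file success rate: {p:.3f}")
    print(f"estimated per-file failure rate: {1.0 - p:.3f}")
```

Under those assumptions, a 3% job success rate corresponds to roughly three out of four files streaming fully, so even a modest per-file failure rate is enough to sink almost every 12-file job.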
