In CMS, we're hitting a rather bizarre (but nightmarish!) problem with Xrootd that appears to originate somewhere in the redirection code. See here:

dmwm/WMCore#9432

The basic symptom is that, when file A is opened, the contents of file B are returned. This is a nightmare because the contents of file B are sometimes "similar enough" (differing, say, only by physics process) to allow the job to succeed. In the most common case, file A is not actually present at the site the client is redirected to, yet file B is served anyway.

The problem seems limited to sites (primarily, one DPM site - but there are others!) that include significant CGI in the redirection hostname. Example of one such hostname in the redirection response (not from an affected job, just to illustrate what I mean by "CGI in the hostname"):

YYYYYY.ac.uk?&dpm.time=1575473136&dpm.dhost=YYYYYY.ac.uk&dpm.loc=1&dpm.chunk0=0,0,YYYYYY.ac.uk:/data0/dpmfs/cms/2019-12-04/30D1B46E-D055-704E-A492-4B364B10F13C.root.125675.1575462690&dpm.nonce=XXXXXX&dpm.hv2=YYYYYYY
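
To make "CGI in the hostname" concrete, here is a minimal Python sketch that splits a redirection target like the one above into the bare hostname and its CGI parameters. The helper names (`split_redirect_target`, `chunk_location`) are hypothetical and only for illustration; this is not how the Xrootd client or DPM parses the string. Reading the third comma-separated field of `dpm.chunk0` as `host:path` is my assumption based on how the example looks.

```python
from urllib.parse import parse_qsl


def split_redirect_target(target):
    """Split 'host?cgi' into (host, dict of CGI parameters)."""
    host, _, cgi = target.partition("?")
    # The example starts its CGI with a stray '&', so strip it first.
    params = dict(parse_qsl(cgi.lstrip("&"), keep_blank_values=True))
    return host, params


def chunk_location(params, key="dpm.chunk0"):
    """Return (host, path) from a chunk entry.

    ASSUMPTION: the value looks like '<n>,<n>,<host>:<path>'; the meaning
    of the first two fields is not known from the example alone.
    """
    fields = params[key].split(",", 2)
    chunk_host, _, path = fields[2].partition(":")
    return chunk_host, path


# Same anonymized example as above, joined into one string.
example = (
    "YYYYYY.ac.uk?&dpm.time=1575473136&dpm.dhost=YYYYYY.ac.uk&dpm.loc=1"
    "&dpm.chunk0=0,0,YYYYYY.ac.uk:/data0/dpmfs/cms/2019-12-04/"
    "30D1B46E-D055-704E-A492-4B364B10F13C.root.125675.1575462690"
    "&dpm.nonce=XXXXXX&dpm.hv2=YYYYYYY"
)

host, params = split_redirect_target(example)
chunk_host, chunk_path = chunk_location(params)
print(host)
print(sorted(params))
print(chunk_host, chunk_path)
```

The point is that the physical replica path travels inside the redirection target itself, so logging the parsed pieces next to the originally requested filename on the client side would at least show which layer first disagrees about the file.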

Honestly, I'm at a loss as to how this could happen -- whether the redirector is serving up bad redirection URLs, the site is serving the wrong files, or some memory bug in the client causes it to make the wrong requests. All three seem remarkably unlikely, yet something is happening.

For example, how would a memory bug in the client cause it to serve data from the file /store/unmerged/RunIIFall18wmLHEGS/JJH0PMToTauTauPlusTwoJets_Filtered_M125_TuneCP5_13TeV-mcatnloFXFX-pythia8/GEN-SIM/102X_upgrade2018_realistic_v11-v1/230000/B04C9BFD-09C6-0D4D-8E67-2D4DDCE88027.root when we request /store/unmerged/RunIISummer19UL17wmLHEGEN/QCD_HT100to200_TuneCP5_PSWeights_13TeV-madgraphMLM-pythia8/GEN/106X_mc2017_realistic_v6-v1/40017/81CB9803-B71C-464B-AD3F-5B53605EF41E.root? These are from unrelated datasets, different campaigns, and different versions -- and the incorrect filename (the one ending with B04C9BFD-09C6-0D4D-8E67-2D4DDCE88027.root) is nowhere within the job!

On the other hand, DPM seems to validate that the CGI presented matches the filename requested (@ffurano - can you confirm?). So it's unlikely that the redirector simply cached some CGI in the hostname. Further, if it was a DPM bug, why would the DPM site receive a redirection for a file it doesn't have?

Finally, for the culprit to be the redirector, why would it be changing the filename in the redirection (honestly, I thought that was impossible)? Why does it affect one modest-sized T2 and not huge sites like CERN? If it was a random cache corruption, wouldn't it affect all subscribed sites equally?


