When XRD_RUNFORKHANDLER is set, launching new processes can fail due to a race between the main thread and XRootD's polling thread.

This isn't Python-specific (I'm using GFAL2 rather than the XRootD Python bindings), but I'll use Python's subprocess module to explain how it happens. subprocess creates a new process by:

  1. Calling clone on the current process
  2. If close_fds=True, calling close on all open file descriptors (Python 2.7 calls it on all possible file descriptors, but that's not relevant here)
  3. Starting the requested process using execve
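
The three steps above can be sketched roughly as follows. This is a minimal illustration, not subprocess's actual implementation: the function name spawn is hypothetical, os.fork stands in for the clone call (on Linux, fork is typically implemented on top of clone), os.execvp is used for the PATH lookup, and the /proc/self/fd listing assumes Linux.

```python
import os


def spawn(argv, close_fds=True):
    """Hypothetical sketch of fork, close descriptors, then exec."""
    pid = os.fork()  # step 1: duplicate the current process
    if pid == 0:
        # Child process. Any pthread_atfork child handlers have already run here.
        try:
            if close_fds:
                # Step 2: close every open descriptor above stderr.
                for name in os.listdir("/proc/self/fd"):
                    fd = int(name)
                    if fd > 2:
                        try:
                            os.close(fd)
                        except OSError:
                            pass  # already closed (e.g. the fd used for the listing)
            # Step 3: replace the process image with the requested program.
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # only reached if exec failed
    # Parent: wait for the child and report how it ended.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)
```

The window between step 1 and step 3 is where the race lives: anything created in the child by a fork handler is closed in step 2, and any thread started there keeps running until the exec in step 3.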

If XRD_RUNFORKHANDLER is set, the pthread_atfork hook causes XrdSys::IOEvents::Poller::newPoller to call epoll_create1 to create a new epoll file descriptor in the child. The newly cloned process then proceeds to close all of the open file descriptors, including that one. If XrdSys::IOEvents::PollE::Begin runs in the polling thread before the main thread calls execve, it finds the file descriptor is closed and aborts after printing EPoll: Bad file descriptor polling for events.
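
The failure mode itself (polling an epoll descriptor that something else has already closed) can be reproduced in isolation. The sketch below does not involve XRootD at all; it is a single-threaded, Linux-only illustration, and the function name poll_after_close is hypothetical. It closes the descriptor behind an epoll object, as the close_fds loop would, and reports the errno from the next poll.

```python
import errno
import os
import select


def poll_after_close():
    """Return the errno raised when polling an epoll fd that was closed underneath us."""
    ep = select.epoll()
    fd = ep.fileno()
    saved = os.dup(fd)  # keep the epoll instance alive so the fd can be restored
    os.close(fd)        # what the close-all-descriptors loop does in the child
    try:
        ep.poll(0)      # the poller still believes fd is valid
        result = None
    except OSError as exc:
        result = exc.errno
    finally:
        os.dup2(saved, fd)  # restore the fd so ep.close() closes the right thing
        os.close(saved)
        ep.close()
    return result


if __name__ == "__main__":
    print(errno.errorcode.get(poll_after_close()))
```

On Linux the poll fails with EBADF, which corresponds to the "Bad file descriptor" text in the abort message.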

I've struggled to make it crash locally; however, I see it ~20% of the time in real jobs submitted to some WLCG sites.

While I can come up with workarounds, I think the underlying issue needs to be fixed: quite a few libraries set XRD_RUNFORKHANDLER for good reasons, and when they do, perfectly correct user code will randomly fail.



