When `XRD_RUNFORKHANDLER` is set, launching new processes can fail due to a race between the main thread and XRootD's polling thread.
This isn't Python specific (and I'm using GFAL2 instead of the XRootD Python bindings), but I'll explain how it happens using Python as an example. When `subprocess` creates a new process, it works by:

1. Calling `clone` on the current process.
2. If `close_fds=True`, calling `close` on all open file descriptors (Python 2.7 does it for all possible file descriptors, but that's not relevant here).
3. Calling `execve`.
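
To make the sequence concrete, here is a rough Python model of those three steps. It is only a sketch, not the real `subprocess` implementation (which does this work in C); `spawn` and `max_fd` are hypothetical names, and the fixed upper bound on descriptor numbers is an assumption made to keep the example short.

```python
import os


def spawn(argv, close_fds=True, max_fd=1024):
    """Simplified model of how a child process is launched on POSIX.

    `spawn` and `max_fd` are illustrative names, not part of any library.
    """
    pid = os.fork()  # step 1: clone the current process
    if pid == 0:
        # We are now in the child.
        try:
            if close_fds:
                # step 2: close every descriptor above stderr; errors for
                # descriptors that are not open are ignored by closerange
                os.closerange(3, max_fd)
            # step 3: replace the process image with the requested program
            os.execvp(argv[0], argv)
        finally:
            # Only reached if the exec failed
            os._exit(127)
    # Parent: wait for the child and report its exit code
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)
```

The window that matters for this issue is between step 2 and step 3: any other thread that is running in the child at that point sees its file descriptors disappear.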
If `XRD_RUNFORKHANDLER` is set, the `pthread_atfork` hook causes `XrdSys::IOEvents::Poller::newPoller` to call `epoll_create1` to create a new epoll file descriptor. The newly cloned process then proceeds to close all of the open file descriptors. If `XrdSys::IOEvents::PollE::Begin` runs before the main thread calls `execvp`, it finds the file descriptor is closed and aborts after printing `EPoll: Bad file descriptor polling for events`.
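
The error condition itself (though not the race) can be reproduced deterministically on Linux. The sketch below is an assumption-laden stand-in: it uses Python's `select.epoll` rather than XRootD's poller, runs everything in one thread, and closes a single descriptor instead of all of them. `child_poll_after_close` is a hypothetical helper name.

```python
import errno
import os
import select


def child_poll_after_close():
    """Return the errno that epoll_wait reports in a forked child whose
    epoll descriptor was closed before polling (Linux only)."""
    pid = os.fork()
    if pid == 0:
        code = 0
        try:
            # Stand-in for the fork handler creating a fresh epoll descriptor
            ep = select.epoll()
            # Stand-in for the close-all-descriptors step in the child
            os.close(ep.fileno())
            # What the polling thread does next: wait on the stale descriptor
            ep.poll(0)
        except OSError as exc:
            code = exc.errno
        finally:
            # Exit without running destructors; report errno as the exit code
            os._exit(code)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)


if __name__ == "__main__":
    result = child_poll_after_close()
    print(errno.errorcode.get(result, result))
```

In the real failure the close and the poll happen on different threads, so whether the poll sees a closed descriptor depends on scheduling.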
I've struggled to make it crash locally; however, I see it ~20% of the time in real jobs submitted to some WLCG sites.
While I can come up with workarounds, I think the underlying issue needs to be fixed: quite a few libraries set `XRD_RUNFORKHANDLER` for good reasons, and perfectly correct user code then fails randomly.