About 25000 "xrdcp" core dumps were recorded in ntofpro Batch jobs this
morning, on 9 worker nodes between 01:34 and 03:56. The commands issued
look like this:

xrdcp -f -N root://eospublic.cern.ch//eos/experiment/ntof/processing/TAC_DATA/114471/run114471.idx.finished /tmp/ntofpro/run114471.idx.finished

Note that we did not record any "xrdcp" coredumps from other users.

A similar burst of "xrdcp" coredumps happened on Sep 28, when we
recorded >12000 cores on 2 servers, between 02:42 and 03:50.

Below are some details of one crash, including a backtrace:

           PID: 125003 (xrdcp)
           UID: 95759 (ntofpro)
           GID: 2348 (za)
        Signal: 11 (SEGV)
     Timestamp: Tue 2022-10-04 01:34:49 CEST (9h ago)
  Command Line: xrdcp -f -N root://eospublic.cern.ch//eos/experiment/ntof/processing/TAC_DATA/114471/run114471.idx.finished /tmp/ntofpro/run114471.idx.finished
    Executable: /usr/bin/xrdcp
 Control Group: /system.slice/condor.service
          Unit: condor.service
         Slice: system.slice
       Boot ID: 1615e66df0174904b3821ad44a980f28
    Machine ID: 7aea158db12945aebe58d21e3d64855a
      Hostname: b7g02p8793.cern.ch
       Message: Process 125003 (xrdcp) of user 95759 dumped core.
                
                Stack trace of thread 30919:
                #0  0x00002b1fa7f93b30 _ZN5XrdCl10PostMaster13GetJobManagerEv (libXrdCl.so.3)
                #1  0x00002b1fa804de17 _ZN5XrdCl16LocalFileHandlerC1Ev (libXrdCl.so.3)
                #2  0x00002b1fa7febce6 _ZN5XrdCl16FileStateHandlerC1ERPNS_10FilePlugInE (libXrdCl.so.3)
                #3  0x00002b1fa7fdeb75 _ZN5XrdCl4FileC2Eb (libXrdCl.so.3)
                #4  0x00002b1fa801107f _ZN12_GLOBAL__N_112XRootDSourceC2EPKN5XrdCl3URLEjhRKSsRKSt6vectorISsSaISsEEb (libXrdCl.so.3)
                #5  0x00002b1fa8014feb _ZN5XrdCl14ClassicCopyJob3RunEPNS_19CopyProgressHandlerE (libXrdCl.so.3)
                #6  0x00002b1fa7ffca02 _ZN12_GLOBAL__N_113QueuedCopyJob3RunEPv (libXrdCl.so.3)
                #7  0x00002b1fa7ffe808 _ZN5XrdCl11CopyProcess3RunEPNS_19CopyProgressHandlerE (libXrdCl.so.3)
                #8  0x000000000040b5c3 main (xrdcp)
                #9  0x00002b1fa9042555 __libc_start_main (libc.so.6)
                #10 0x000000000040cdc3 _start (xrdcp)
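
With this many cores, it may help to check that they all share the same top frames before treating them as one bug. The following is only a rough sketch, assuming the stack trace text has already been saved per core in the format shown above; `parse_frames` and `crash_signature` are hypothetical helper names, not part of xrootd or systemd.

```python
import re

# Matches frame lines in the format shown above, e.g.
# "#0  0x00002b1fa7f93b30 _ZN5XrdCl10PostMaster13GetJobManagerEv (libXrdCl.so.3)"
FRAME_RE = re.compile(
    r"^\s*#(\d+)\s+(0x[0-9a-fA-F]+)\s+(\S+)\s+\(([^)]+)\)\s*$"
)


def parse_frames(text):
    """Return a list of (index, address, symbol, module) tuples."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.match(line)
        if m:
            idx, addr, symbol, module = m.groups()
            frames.append((int(idx), int(addr, 16), symbol, module))
    return frames


def crash_signature(frames, depth=3):
    """Join the innermost `depth` symbols into a key for grouping cores."""
    return " <- ".join(symbol for _, _, symbol, _ in frames[:depth])


if __name__ == "__main__":
    sample = """
    #0  0x00002b1fa7f93b30 _ZN5XrdCl10PostMaster13GetJobManagerEv (libXrdCl.so.3)
    #1  0x00002b1fa804de17 _ZN5XrdCl16LocalFileHandlerC1Ev (libXrdCl.so.3)
    """
    print(crash_signature(parse_frames(sample)))
```

Counting the resulting signatures (for example with `collections.Counter`) would show whether every core crashes in `XrdCl::PostMaster::GetJobManager()` or whether several distinct crashes are mixed together.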

(gdb) bt
#0  0x00002b1fa7f93b30 in XrdCl::PostMaster::GetJobManager() () from /lib64/libXrdCl.so.3
#1  0x00002b1fa804de17 in XrdCl::LocalFileHandler::LocalFileHandler() () from /lib64/libXrdCl.so.3
#2  0x00002b1fa7febce6 in XrdCl::FileStateHandler::FileStateHandler(XrdCl::FilePlugIn*&) () from /lib64/libXrdCl.so.3
#3  0x00002b1fa7fdeb75 in XrdCl::File::File(bool) () from /lib64/libXrdCl.so.3
#4  0x00002b1fa801107f in (anonymous namespace)::XRootDSource::XRootDSource(XrdCl::URL const*, unsigned int, unsigned char, std::string const&, std::vector<std::string, std::allocator<std::string> > const&, bool) () from /lib64/libXrdCl.so.3
#5  0x00002b1fa8014feb in XrdCl::ClassicCopyJob::Run(XrdCl::CopyProgressHandler*) () from /lib64/libXrdCl.so.3
#6  0x00002b1fa7ffca02 in (anonymous namespace)::QueuedCopyJob::Run(void*) () from /lib64/libXrdCl.so.3
#7  0x00002b1fa7ffe808 in XrdCl::CopyProcess::Run(XrdCl::CopyProgressHandler*) () from /lib64/libXrdCl.so.3
#8  0x000000000040b5c3 in main ()


GitHub issue: https://github.com/xrootd/xrootd/issues/1797
