I believe we have seen this before. I looked into it some time ago, and some additional information is below. I also have a 5 GB core file; if you need it, let me know and I can put it in my public AFS directory.

A thread was looping in the monitoring code around this point:
#0 0x000000000041a1c0 in do_Shift (this=0x7f11fb80e600, dictid=3290236416, rTot=, wTot=670147) at /usr/src/debug/xrootd/xrootd/src/XrdXrootd/XrdXrootdMonitor.cc:882

The loop never terminates because a right shift applied to a negative value fills the vacated bits with 1s. The thread therefore never exits the while loop, and the enclosing Close method holds the lock forever.
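To illustrate the failure mode, here is a minimal sketch. It is not the actual XrdXrootdMonitor code: it assumes the loop shifts a signed 64-bit counter right until its upper bits are clear, and the names `shiftUntilFits`, `kHighMask`, and `maxIter` are hypothetical. The iteration cap exists only so that the demonstration returns instead of spinning.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mask covering bits 32..62 of a signed 64-bit value.
constexpr std::int64_t kHighMask = 0x7fffffff00000000LL;

// Shifts `total` right until no bit under kHighMask is set and returns the
// number of shifts. Returns -1 if the loop has not finished after maxIter
// shifts; a loop without such a cap would simply spin.
int shiftUntilFits(std::int64_t total, int maxIter = 1000)
{
    assert(maxIter > 0);
    int shifts = 0;
    while (total & kHighMask) {
        if (shifts == maxIter) return -1;  // gave up: value never clears
        // Arithmetic shift on mainstream compilers (guaranteed since C++20):
        // the sign bit is copied into the vacated position.
        total >>= 1;
        ++shifts;
    }
    return shifts;
}
```

Under these assumptions, `shiftUntilFits(670147)` returns 0 right away, while `shiftUntilFits(-205550562)` hits the cap: the value converges to -1 (all bits set) and the masked test stays true.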

The underlying question is why the xTot value is negative. I couldn't find a better explanation than memory corruption or an overflow.
Looking at the XrdXrootdFileTable object confirms that the xfr.read value is negative. In principle this should never be negative, since it represents the number of bytes read from the file, so it would probably be best to make this value an unsigned type.
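As a rough sketch of why an unsigned type would at least remove the hang, here is a loop of the same assumed shape operating on an unsigned counter. The names `shiftUntilFitsUnsigned` and `kHighMaskU` are hypothetical, and this does not explain how the counter went bad in the first place.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mask covering the upper 32 bits of an unsigned 64-bit value.
constexpr std::uint64_t kHighMaskU = 0xffffffff00000000ULL;

// Shifts `total` right until it fits in 32 bits and returns the shift count.
int shiftUntilFitsUnsigned(std::uint64_t total)
{
    int shifts = 0;
    while (total & kHighMaskU) {
        total >>= 1;  // logical shift: zeros enter from the left
        ++shifts;
    }
    assert(shifts <= 32);  // at most 32 shifts are needed to clear the upper half
    return shifts;
}
```

With an unsigned operand the loop is bounded: even the bit pattern of -205550562 reinterpreted as unsigned finishes after 32 shifts. The reported byte count would still be wrong in that case, but the lock would be released.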

(gdb) f 2
#2  0x000000000041485a in XrdXrootdFileTable::Recycle (this=0x7f119ab8c0c0, monP=0x7f11fb80e600, monF=false) at /usr/src/debug/xrootd/xrootd/src/XrdXrootd/XrdXrootdFile.cc:234
234                                     FTab[i]->Stats.xfr.write);
(gdb) print *FTab[1]
$21 = {XrdSfsp = 0x7f116b7dc000, mmAddr = 0x0, FileKey = "3141", '0' <repeats 12 times>, "9099e81c00\000ame=mam", Reserved = "ar", FileMode = 119 'w', AsyncMode = 0 '\000', isMMapped = 0 '\000', sfEnabled = 0 '\000', fdNum = 32506,
  ID = 0x7f11ab5381fc "daemon.8151:167@p05614923e75338", Stats = {FileID = 3290236416, MonEnt = -1, monLvl = 1 '\001', xfrXeq = 1 '\001', fSize = 31895106, xfr = {read = -205550562, readv = 0, write = 670147}, ops = {read = 1,
      readv = 0, write = 51, rsMin = 32767, rsMax = 0, rsegs = 0, rdMin = 2147483647, rdMax = 0, rvMin = 2147483647, rvMax = 0, wrMin = 2147483647, wrMax = 0}, ssq = {read = 0, readv = 0, rsegs = 0, write = 0}},
  static Locker = 0x7f11fb8306f0, static sfOK = 0, static TraceID = 0x4299ee "File"}

Note that the xfr.read value is negative.

The only information I could extract from the core dump is the identity of the client that opened the file. Correlating it with today's log file, it seems the connection was quite long-lived:
150526 09:27:07 28615 XrootdXeq: daemon.8151:167@p05614923e75338 disc 1794:27:30

