Hi David,

Ah, now I see your point. I forgot that hName was not a pure host name but the actual redirect line with opaque information. Indeed, an strncpy() should be used here as, yes, your particular redirection mechanism has a good chance of overrunning the buffer. I will fix that. Thanks again.

Andy

From: smithdh
Sent: Monday, December 09, 2013 1:19 AM
To: xrootd/xrootd
Cc: Andrew Hanushevsky
Subject: Re: [xrootd] crash of xrootd 3.3.3 redirector (#64)

Hi Andy,

I think the point that makes the difference is that hName, here, is the redirect target plus any opaque information the plugin has supplied (i.e. "host.example.com?opaque=xyz"). In this case with dpm-xrootd it's quite long, about 400 chars, with Path perhaps another 100. I think an overwrite in XrdXrootdMonitor::Redirect is still quite compelling:

The core files are in:

/afs/cern.ch/user/d/dhsmith/xrd_crash/

(you should have read access with your 'abh' AFS CERN credential; let me know if not). The SL5 cores are in the 'core-sl5' subdirectory and the SL6 cores in 'core-sl6'. The SL5 site was running SL 5.5, and the SL6 site SL 6.3 (64-bit). The corresponding xrootd sources are also in a subdirectory; it was version 3.3.3, the EPEL build (which was just replaced on Saturday by version 3.3.4 in the EPEL repository). On my SL6 machine I had to make sure some of the dependent libraries corresponded, so I include those there too (glibc and openssl).

There are 10 core files. In summary:

SL5: core.25962 (in malloc_consolidate)
SL6: core.9502, core.12282 (in libcrypto.so, called from XrdCryptosslX509)
     core.20320, core.24753, core.26680, core.27077, core.30822, core.32384, core.4370 (in XrdScheduler)

Often in XrdScheduler the failure is in calling DoIt(): the vtable seems to hold a corrupt address. It is not always straightforward to know which class was originally being called, but in some cases I think it was XrdStatsJob (e.g. core.4370), although the job type varies between the core files.

The xrootd.monitor settings at the sites:

sl5:
xrootd.monitor all rbuff 32k auth flush 30s window 5s dest files info user io redir atl-prod05.slac.stanford.edu:9930

sl6:
xrootd.monitor all auth flush io 60s ident 5m mbuff 8k rbuff 4k rnums 3 window 10s dest files io info user redir xrootd.t2.ucsd.edu:9930

I think in most cases (all except core.9502) it appears to be the same type of fault. Consider the SL6 core core.4370:

Program terminated with signal 11, Segmentation fault.
#0 0x000000368a654403 in XrdScheduler::Run (this=0x368b00b100)
at /usr/src/debug/xrootd-3.3.3/src/Xrd/XrdScheduler.cc:307
307 jp->DoIt();
(gdb) where
#0 0x000000368a654403 in XrdScheduler::Run (this=0x368b00b100)
at /usr/src/debug/xrootd-3.3.3/src/Xrd/XrdScheduler.cc:307
#1 0x000000368a6545f9 in XrdStartWorking (carg=<value optimized out>)
at /usr/src/debug/xrootd-3.3.3/src/Xrd/XrdScheduler.cc:84
#2 0x000000368a621aef in XrdSysThread_Xeq (myargs=0x7fd9fc02f210)
at /usr/src/debug/xrootd-3.3.3/src/XrdSys/XrdSysPthread.cc:86
#3 0x000000381e407851 in start_thread () from /lib64/libpthread.so.0
#4 0x000000381dce811d in clone () from /lib64/libc.so.6
(gdb) set print elements 0
(gdb) disas $pc-0x10,$pc+1
Dump of assembler code from 0x368a6543f3 to 0x368a654404:
0x000000368a6543f3 <XrdScheduler::Run()+195>: mov 0x40(%rbp),%rax
0x000000368a6543f7 <XrdScheduler::Run()+199>: testb $0x40,(%rax)
0x000000368a6543fa <XrdScheduler::Run()+202>: jne 0x368a65444f <XrdScheduler::Run()+287>
0x000000368a6543fc <XrdScheduler::Run()+204>: mov (%r12),%rax
0x000000368a654400 <XrdScheduler::Run()+208>: mov %r12,%rdi
=> 0x000000368a654403 <XrdScheduler::Run()+211>: callq *(%rax)

(gdb) print/x $r12
$1 = 0xcf2010

jp is 0xcf2010. I assert that 0xcf2000 is the start of a malloc chunk, and that the previous chunk started 0x1010 bytes before that.

(gdb) x/16xw $r12-0x1020
0xcf0ff0: 0x00000ff0 0x00000000 0x00001010 0x00000000
0xcf1000: 0x280f3a72 0x27f98652 0x000000f0 0x46041211
0xcf1010: 0x0a000000 0x03fa8652 0x46043085 0x2e000000
0xcf1020: 0x2d326364 0x64697267 0x6f6f702d 0x34612d6c

From what little I know of the various Linux malloc structures, the first 16 bytes of this are the malloc chunk header. In this case the 0x1010 is the length ((16-byte chunk header + size of user area) | malloc flags in the low bits). This makes the user area 0x1000 (4k) in length. I think this is a chunk for an XrdXrootdMonBurr. (At the SL6 site rbuff is 4k.)

The next 8 bytes are the XrdXrootdMonHeader, and the next 8 are:

union {kXR_int64 sID;
       kXR_char  sXX[8]; };

Then start the XrdXrootdMonRedir slots; the first slot is a timing mark (e.g. 0x03fa8652 is network-order unix time for 16 Nov 2013, 04:52:19 UTC). The next 8 bytes are:

mtP->arg0.rdr.Type = XROOTD_MON_REDIRECT | opC;
mtP->arg0.rdr.Dent = static_cast<char>(slots);
mtP->arg0.rdr.Port = htons(static_cast<short>(Port));
mtP->arg1.dictid = mID;

and the next 8 are the message:

0xcf1020: "dc2-grid-pool-a4-02.brunel.ac.uk?&dpm.time=1384577547&dpm.voms=.&dpm.dhost=dc2-grid-pool-a4-02.brunel.ac.uk&dpm.sfn=/data2/dpmfs/cms/2013-02-15/362AC00A-AC71-E211-AA9F-002618943935.root.25731330.0&dpm.loc=1&dpm.chunk0=0,3966230341,dc2-grid-pool-a4-02.brune:/dpm/brunel.ac.uk/home/cms/store/data/Run2012B/SinglePhoton/AOD/22Jan2013-v1/30000/362AC00A-AC71-E211-AA9F-002618943935.root"

(Note that after 256 characters there is ':/dpm/br', which I interpret as the start of Path following hName truncated to 256 characters.)

(gdb) x/32xw 0xcf0ff0+0x1010-0x20
0xcf1fe0: 0x6f6f722e 0x00000074 0x4e4c3251 0x55674f38
0xcf1ff0: 0x3d3d6749 0x6d706426 0x3276682e 0x374b703d
0xcf2000: 0x424a6d6e 0x6d573362 0x4d705878 0x39643947
0xcf2010: 0x3d41456a 0x0000003d 0x00632740 0x00000000
0xcf2020: 0x8a65cfdd 0x00000036 0x5286fb07 0x00000000
0xcf2030: 0x8b00b100 0x00000036 0x00b75e90 0x00000000
0xcf2040: 0x0000003c 0x00000000 0x00001411 0x00000000
0xcf2050: 0x00000008 0x00000000 0x8a86a6b0 0x00000036

The next malloc chunk should start at 0xcf2000, with the user data starting at 0xcf2010 (which was the value of the r12 register; it should be an XrdJob). It was the dereference of the content of 0xcf2010 (which should lead to the vtable entry for DoIt()) that caused the segmentation fault. The pointer at 0xcf2020 should be "Comment" in the XrdJob class:

(gdb) x/s 0x368a65cfdd
0x368a65cfdd: "stats reporter"

(so I think this was the stats reporter job). The data starting at 0xcf2000 is clearly not a malloc chunk header. Assuming this is the corruption:

(gdb) x/s 0xcf2000
0xcf2000: "nmJBb3WmxXpMG9d9jEA=="

looks like the last part of dpm-xrootd opaque data. (So my guess is that this is the trailing part of a previous hName.)

Most of the core files are similar, although I couldn't easily tie core.9502 to this.

The SL6 site has been running without the 'xrootd.monitor' setting on its redirector(s) since Thursday, and hasn't crashed since, as far as I know. This does not confirm the problem, but it is consistent with it.


Reply to this email directly or view it on GitHub.





Use REPLY-ALL to reply to list

To unsubscribe from the XROOTD-DEV list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-DEV&A=1