Hi Nikolai,

Thank you, this helped. Andy and I have figured it out ... it is a bug (uninitialized variable) in xcache that seems to get activated when authorization is configured. We're working on a fix / workaround for 5.0.0.

Thank you for your patience and help ... and sorry for the trouble.

Cheers,
Matevz

On 2020-07-09 08:28, Nikolai Hartmann wrote:
> Hi,
>
> Unfortunately it seems to occur also for release 5 (5.0.0-1). I tried
> with this container:
>
> https://urldefense.com/v3/__https://gitlab.physik.uni-muenchen.de/Nikolai.Hartmann/xcache-singularity-lrz/-/blob/v5/xcache.singularity.def__;!!Mih3wA!R_VQAVZ4HLYGZ4lLNwZGyqUy1mEw60Sbs0rQ7_uScOv-qSpeAMn8vxlSvXKSGPtPAW8n$
>
> using basically the same configuration as before, except that i
> commented out the `namelib` directive since the rucioN2N plugin rpm i was
> using required xrootd < 5.0.0-1.
>
> In case it helps, here is the new image:
>
> https://urldefense.com/v3/__https://cloud.physik.lmu.de/index.php/s/4AcEcFRPwXqZb9b__;!!Mih3wA!R_VQAVZ4HLYGZ4lLNwZGyqUy1mEw60Sbs0rQ7_uScOv-qSpeAMn8vxlSvXKSGHGlL74g$
>
> Thanks,
> Nikolai
>
> On 7/9/20 1:31 AM, Matevz Tadel wrote:
>> On 2020-07-08 15:00, Andrew Hanushevsky wrote:
>>> Could you try git head? There were issues in RC4. There might still be
>>> issues, but I never saw the one you and Nikolai tripped over.
>>
>> Yes, I managed to reproduce it on rc4 tag with just basic gsi enabled
>> ... on master / 5.0.0 it doesn't happen.
>>
>> Nikolai, please update to 5.0.0.
>>
>> Matevz
>>
>>> Andy
>>>
>>> On Wed, 8 Jul 2020, Matevz Tadel wrote:
>>>
>>>> On 2020-07-08 14:19, Andrew Hanushevsky wrote:
>>>>> What release? Git head?
>>>>
>>>> This was from Nikolai's image, 5-rc4.
>>>>
>>>> \m
>>>>
>>>>> On Wed, 8 Jul 2020, Matevz Tadel wrote:
>>>>>
>>>>>> Hi Andy,
>>>>>>
>>>>>> On 2020-07-08 13:49, Andrew Hanushevsky wrote:
>>>>>>> Hi Matevz,
>>>>>>>
>>>>>>> Well, what kind of authentication? Clearly, the kind we use
>>>>>>> doesn't cause this
>>>>>>> problem.
>>>>>>> It could be just a random core smash but if it's random
>>>>>>> we should see various effects, not just a crash in this
>>>>>>> particular code path, right?
>>>>>>
>>>>>> xcache without any security config, everything works smooth.
>>>>>>
>>>>>> xcache with sec.protocol /usr/lib64 gsi --- trouble:
>>>>>>
>>>>>> 200708 13:38:44 240995 XrootdXeq: matevz.241046:31@uaf-7 pub IPv4
>>>>>> login as d0ba0e6c.0
>>>>>> 200708 13:38:44 240995 Posix_P2L: file
>>>>>> /eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root
>>>>>> pfn2lfn
>>>>>> /eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root
>>>>>>
>>>>>> [2020-07-08 13:38:44.739012 -0700][Error ][AsyncSock ]
>>>>>> [[log in to unmask]:1094.0] Unable to connect: network is
>>>>>> unreachable
>>>>>> [2020-07-08 13:38:44.739092 -0700][Error ][PostMaster ]
>>>>>> [[log in to unmask]:1094] elapsed = 0, pConnectionWindow = 120
>>>>>> seconds.
>>>>>> [2020-07-08 13:38:45.637583 -0700][Error ][XRootDTransport ]
>>>>>> [[log in to unmask]:1094.0] Authentication with gsi failed:
>>>>>> [2020-07-08 13:38:45.974332 -0700][Error ][AsyncSock ]
>>>>>> [[log in to unmask]:1095.0] Unable to connect: network is
>>>>>> unreachable
>>>>>> [2020-07-08 13:38:45.974400 -0700][Error ][PostMaster ]
>>>>>> [[log in to unmask]:1095] elapsed = 0, pConnectionWindow =
>>>>>> 120 seconds.
>>>>>> 200708 13:38:46 240995 XrdPfc_Manager: info Cache::Attach()
>>>>>> root:[log in to unmask]
>>>>>> 200708 13:38:46 240995 XrdPfc_Manager: debug Cache::GetFile
>>>>>> eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root,
>>>>>> io 0xe30f50
>>>>>> 200708 13:38:46 240995 XrdPfc_IO: debug
>>>>>> IOEntireFile::initCachedStat get stat from client res = 0, size =
>>>>>> 2272072
>>>>>> root:[log in to unmask]
>>>>>> 200708 13:38:46 240995 XrdPfc_File: debug Creating new file info,
>>>>>> data size = 2272072 num blocks = 3
>>>>>> eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root
>>>>>> 200708 13:38:46 240995 XrdPfc_Manager: debug Cache::inc_ref_cnt
>>>>>> eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root,
>>>>>> cnt at exit = 1
>>>>>> 200708 13:38:46 240995 XrdPfc_File: debug File::AddIO() io =
>>>>>> 0xe30f50
>>>>>> eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root
>>>>>> 200708 13:38:46 240995 XrdPfc_Manager: debug Cache::Attach()
>>>>>> root:[log in to unmask]
>>>>>> location: [log in to unmask]:1095
>>>>>> [2020-07-08 13:38:47.022428 -0700][Error ][AsyncSock ]
>>>>>> [[log in to unmask]:1095.0] Socket error encountered:
>>>>>> [ERROR] Invalid arguments
>>>>>> [2020-07-08 13:38:47.022506 -0700][Error ][XRootD ]
>>>>>> [[log in to unmask]:1095] Unable to get the response to
>>>>>> request kXR_read (handle: 0x00000000, offset: 0, size: 1048576)
>>>>>> [2020-07-08 13:38:47.022625 -0700][Error ][File ]
>>>>>> [0xf0b040@root:[log in to unmask]:1094//eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root?xrdcl.requuid=6730e10b-8b40-43bf-9d0a-75da982939e8]
>>>>>> Fatal file state error.
>>>>>> Message kXR_read (handle: 0x00000000,
>>>>>> offset: 0, size: 1048576) returned with [ERROR] Invalid arguments
>>>>>> 200708 13:38:47 241052 XrdPfc_File: error
>>>>>> File::ProcessBlockResponse block 0xff3440 0 error=-22
>>>>>> eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root
>>>>>> 200708 13:38:47 240995 XrdPfc_File: error File::Read() io 0xe30f50,
>>>>>> block 0 finished with error 22 invalid argument
>>>>>> eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root
>>>>>> src/tcmalloc.cc:284] Attempt to free invalid pointer
>>>>>> 0x313262003543620a
>>>>>>
>>>>>> Note that while we get all these network errors at the start, cache
>>>>>> still got the stat info from the server (knows the size of the file).
>>>>>>
>>>>>> I must admit I never test xcache with auth on :( I'll try it out
>>>>>> now, well, after lunch :)
>>>>>>
>>>>>> Matevz
>>>>>>
>>>>>>> Andy
>>>>>>>
>>>>>>> On Wed, 8 Jul 2020, Matevz Tadel wrote:
>>>>>>>
>>>>>>>> Yay, that was a journey ... but I can reproduce it now!
>>>>>>>>
>>>>>>>> It is super strange this happens with xcache with authentication
>>>>>>>> on only ... this really should have no effect. I first tried
>>>>>>>> without it and it worked and then something rang a bell that you
>>>>>>>> said so in the email :).
>>>>>>>>
>>>>>>>> Andy, does this ring any bells for you? It looks like interaction
>>>>>>>> between server / client usage of X509 stuff.
>>>>>>>>
>>>>>>>> Anyway, I'm digging on the xcache side ...
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Matevz
>>>>>>>>
>>>>>>>> On 2020-07-08 07:41, Nikolai Hartmann wrote:
>>>>>>>>> Hi Matevz,
>>>>>>>>>
>>>>>>>>> I might have something like a "minimal failing example".
>>>>>>>>> Unfortunately the problem only appears when authentication is
>>>>>>>>> required, so the example will only work on a machine that has a
>>>>>>>>> valid host certificate and the corresponding directory has to be
>>>>>>>>> bind-mounted into the container.
>>>>>>>>>
>>>>>>>>> I uploaded my container image here:
>>>>>>>>>
>>>>>>>>> https://urldefense.com/v3/__https://cloud.physik.lmu.de/index.php/s/RFC6Q89FBxxNMXF__;!!Mih3wA!S4S4O0y7f1Z5oNAgkr2EZ2J5683bZ5LRbG55GbcoHhyJTwOzaS2lABcIifddJxDGMy-N$
>>>>>>>>>
>>>>>>>>> and made a directory structure (tar archive attached) to bind
>>>>>>>>> mount into the container (and containing the minimal failing
>>>>>>>>> xcache config and a script for starting gdb inside the container)
>>>>>>>>>
>>>>>>>>> To reproduce, extract the archive, enter the directory and run (as
>>>>>>>>> non-root user)
>>>>>>>>>
>>>>>>>>> singularity run -B $(pwd)/data:/data -B
>>>>>>>>> $(pwd)/config:/etc/xrootd:ro -B
>>>>>>>>> <hostkey-dir>:/etc/grid-security:ro <singularity-image>
>>>>>>>>>
>>>>>>>>> where <hostkey-dir> is a directory that contains
>>>>>>>>>
>>>>>>>>> hostkey.pem
>>>>>>>>> hostcert.pem
>>>>>>>>> vomsdir (will become X509_VOMS_DIR)
>>>>>>>>> certificates (will become X509_CERT_DIR)
>>>>>>>>>
>>>>>>>>> and <singularity-image> is the path to the singularity image.
>>>>>>>>>
>>>>>>>>> That should run xrootd and the log should appear in
>>>>>>>>> data/xrd/var/log/xrootd.log
>>>>>>>>>
>>>>>>>>> I used this example to produce the failure:
>>>>>>>>>
>>>>>>>>> xrdcp -f
>>>>>>>>> root://lcg-lrz-xcache0.grid.lrz.de:1094//root://eospublic.cern.ch//eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root
>>>>>>>>> /dev/null
>>>>>>>>>
>>>>>>>>> The simplest way to run gdb seemed to be to start xrootd directly
>>>>>>>>> under gdb. This can be done with the script run_xcache_debug.sh
>>>>>>>>> in the attached archive.
>>>>>>>>> Instead of the command above just use
>>>>>>>>>
>>>>>>>>> singularity exec -B $(pwd)/data:/data -B
>>>>>>>>> $(pwd)/config:/etc/xrootd:ro -B
>>>>>>>>> <hostkey-dir>:/etc/grid-security:ro <singularity-image>
>>>>>>>>> ./run_xcache_debug.sh
>>>>>>>>>
>>>>>>>>> Note: Before restarting, best delete the content of the data
>>>>>>>>> directory since the bug also did not seem to occur when the file
>>>>>>>>> was already cached (e.g. after testing without authentication)
>>>>>>>>>
>>>>>>>>> Sorry for the overly complicated reproducing steps, but since it
>>>>>>>>> only happened when authentication was enabled i didn't know how
>>>>>>>>> to do it simpler. I hope it helps.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Nikolai
>>>>>>>>>
>>>>>>>>> On 7/7/20 8:42 PM, Matevz Tadel wrote:
>>>>>>>>>> Thanks Nikolai, I shall continue my investigation :)
>>>>>>>>>>
>>>>>>>>>> Matevz
>>>>>>>>>>
>>>>>>>>>> On 2020-07-06 23:59, Nikolai Hartmann wrote:
>>>>>>>>>>> Hi Matevz,
>>>>>>>>>>>
>>>>>>>>>>> Thanks a lot for looking into this.
>>>>>>>>>>>
>>>>>>>>>>> - The crash seems to happen always when i make a request
>>>>>>>>>>> - Currently prefetching is disabled
>>>>>>>>>>> - Yes, i think it is direct proxy mode
>>>>>>>>>>> - stack trace is attached
>>>>>>>>>>>
>>>>>>>>>>> A similar setup seems to work for Ilija without issues with
>>>>>>>>>>> the xcaches using slate - i tried to mimic that setup closely.
>>>>>>>>>>> Running xrootd from this container image:
>>>>>>>>>>>
>>>>>>>>>>> https://urldefense.com/v3/__https://gitlab.physik.uni-muenchen.de/Nikolai.Hartmann/xcache-singularity-lrz/-/blob/51d2da52829eb6d8ea377539884f337208141aca/xcache.singularity.def__;!!Mih3wA!SJibOzmy2P3rdD8Ut7m7gYp_bah2pQX2dR2V9U6xiTq9PoQtfjb_MHHDljpOV0aWvVYj$
>>>>>>>>>>>
>>>>>>>>>>> using this config
>>>>>>>>>>>
>>>>>>>>>>> https://urldefense.com/v3/__https://gitlab.physik.uni-muenchen.de/Nikolai.Hartmann/xcache-singularity-lrz/-/blob/51d2da52829eb6d8ea377539884f337208141aca/etc/xrootd/xcache.cfg__;!!Mih3wA!SJibOzmy2P3rdD8Ut7m7gYp_bah2pQX2dR2V9U6xiTq9PoQtfjb_MHHDljpOVzHQF5CU$
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Nikolai
>>>>>>>>>>>
>>>>>>>>>>> On 7/7/20 1:38 AM, Matevz Tadel wrote:
>>>>>>>>>>>> Hi Nikolai,
>>>>>>>>>>>>
>>>>>>>>>>>> I tried to reproduce it with current master in nearly all ways,
>>>>>>>>>>>> with/without prefetching and with direct/forwarding mode.
>>>>>>>>>>>> Also, with std malloc and tcmalloc. No luck :(
>>>>>>>>>>>>
>>>>>>>>>>>> Backtrace or core would help a lot at this point.
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Matevz
>>>>>>>>>>>>
>>>>>>>>>>>> On 2020-07-03 00:54, Nikolai Hartmann wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm trying to upgrade to xrootd5 rc4 for our xcache server to
>>>>>>>>>>>>> mitigate a problem with dCache.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Now when i try to read a file through xcache it crashes with
>>>>>>>>>>>>> "Attempt to free invalid pointer". I attached the
>>>>>>>>>>>>> corresponding part of the log.
>>>>>>>>>>>>> Any ideas?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Nikolai
>>>>>>>>>>>>>
>>>>>>>>>>>>> ########################################################################
>>>>>>>>>>>>> Use REPLY-ALL to reply to list
>>>>>>>>>>>>>
>>>>>>>>>>>>> To unsubscribe from the XROOTD-L list, click the following
>>>>>>>>>>>>> link:
>>>>>>>>>>>>> https://urldefense.com/v3/__https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1__;!!Mih3wA!Xzk53aW-mEg2pavzme9Hd49MPmno8frpbkh2YetRsquNyAt5jiVsDB91pTNUHA$

########################################################################
Use REPLY-ALL to reply to list

To unsubscribe from the XROOTD-L list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1
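
A side note on the tcmalloc message quoted in the log above: the rejected pointer, 0x313262003543620a, looks more like text than a heap address. The Python sketch below only decodes its eight bytes. The helper names are made up for illustration, and little-endian byte order is an assumption (typical for x86-64, which the thread does not state). Most of the bytes fall in the printable ASCII range, which would fit a pointer field holding leftover string data, in line with the uninitialized-variable diagnosis, though on its own it proves nothing.

```python
def pointer_bytes(value: int, width: int = 8, byteorder: str = "little") -> bytes:
    """Return the raw bytes of a pointer-sized integer.

    Little-endian is an assumption here, not something the log states.
    """
    return value.to_bytes(width, byteorder)


def printable_count(raw: bytes) -> int:
    """Count the bytes that fall in the printable ASCII range 0x20..0x7e."""
    return sum(1 for b in raw if 0x20 <= b <= 0x7E)


if __name__ == "__main__":
    # Value copied from the "Attempt to free invalid pointer" log line.
    bad_pointer = 0x313262003543620A
    raw = pointer_bytes(bad_pointer)
    print(raw)
    print(printable_count(raw), "of", len(raw), "bytes are printable ASCII")
```

A real heap address on a 64-bit Linux machine would normally not decode to mostly letters and digits, so a value like this is a reasonable hint to look for a pointer that was never set before being freed.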