On 2020-07-08 15:00, Andrew Hanushevsky wrote:
> Could you try git head? There were issues in RC4. There might still be issues,
> but I never saw the one you and Nikolai tripped over.

Yes, I managed to reproduce it on rc4 tag with just basic gsi enabled ... on
master / 5.0.0 it doesn't happen.

Nikolai, please update to 5.0.0.

Matevz

> Andy
>
>
> On Wed, 8 Jul 2020, Matevz Tadel wrote:
>
>> On 2020-07-08 14:19, Andrew Hanushevsky wrote:
>>> What release? Git head?
>>
>> This was from Nikolai's image, 5-rc4.
>>
>> \m
>>
>>> On Wed, 8 Jul 2020, Matevz Tadel wrote:
>>>
>>>> Hi Andy,
>>>>
>>>> On 2020-07-08 13:49, Andrew Hanushevsky wrote:
>>>>> Hi Matevz,
>>>>>
>>>>> Well, what kind of authentication? Clearly, the kind we use doesn't cause this
>>>>> problem. It could be just a random core smash but if it's random we should see
>>>>> various effects not just a crash in this particular code path, right?
>>>>
>>>> xcache without any security config, everything works smooth.
>>>>
>>>> xcache with sec.protocol /usr/lib64 gsi --- trouble:
>>>>
>>>> 200708 13:38:44 240995 XrootdXeq: matevz.241046:31@uaf-7 pub IPv4 login as
>>>> d0ba0e6c.0
>>>> 200708 13:38:44 240995 Posix_P2L: file
>>>> /eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root
>>>> pfn2lfn /eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root
>>>> [2020-07-08 13:38:44.739012 -0700][Error ][AsyncSock ]
>>>> [[log in to unmask]:1094.0] Unable to connect: network is unreachable
>>>> [2020-07-08 13:38:44.739092 -0700][Error ][PostMaster ]
>>>> [[log in to unmask]:1094] elapsed = 0, pConnectionWindow = 120 seconds.
>>>> [2020-07-08 13:38:45.637583 -0700][Error ][XRootDTransport ]
>>>> [[log in to unmask]:1094.0] Authentication with gsi failed:
>>>> [2020-07-08 13:38:45.974332 -0700][Error ][AsyncSock ]
>>>> [[log in to unmask]:1095.0] Unable to connect: network is unreachable
>>>> [2020-07-08 13:38:45.974400 -0700][Error ][PostMaster ]
>>>> [[log in to unmask]:1095] elapsed = 0, pConnectionWindow = 120
>>>> seconds.
>>>> 200708 13:38:46 240995 XrdPfc_Manager: info Cache::Attach()
>>>> root:[log in to unmask]
>>>> 200708 13:38:46 240995 XrdPfc_Manager: debug Cache::GetFile
>>>> eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root, io
>>>> 0xe30f50
>>>> 200708 13:38:46 240995 XrdPfc_IO: debug IOEntireFile::initCachedStat get
>>>> stat from client res = 0, size = 2272072
>>>> root:[log in to unmask]
>>>> 200708 13:38:46 240995 XrdPfc_File: debug Creating new file info, data size
>>>> = 2272072 num blocks = 3
>>>> eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root
>>>> 200708 13:38:46 240995 XrdPfc_Manager: debug Cache::inc_ref_cnt
>>>> eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root, cnt
>>>> at exit = 1
>>>> 200708 13:38:46 240995 XrdPfc_File: debug File::AddIO() io = 0xe30f50
>>>> eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root
>>>> 200708 13:38:46 240995 XrdPfc_Manager: debug Cache::Attach()
>>>> root:[log in to unmask]
>>>> location: [log in to unmask]:1095
>>>> [2020-07-08 13:38:47.022428 -0700][Error ][AsyncSock ]
>>>> [[log in to unmask]:1095.0] Socket error encountered: [ERROR]
>>>> Invalid arguments
>>>> [2020-07-08 13:38:47.022506 -0700][Error ][XRootD ]
>>>> [[log in to unmask]:1095] Unable to get the response to request
>>>> kXR_read (handle: 0x00000000, offset: 0, size: 1048576)
>>>> [2020-07-08 13:38:47.022625 -0700][Error ][File ]
>>>> [0xf0b040@root:[log in to unmask]:1094//eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root?xrdcl.requuid=6730e10b-8b40-43bf-9d0a-75da982939e8]
>>>> Fatal file state error. Message kXR_read (handle: 0x00000000, offset: 0,
>>>> size: 1048576) returned with [ERROR] Invalid arguments
>>>> 200708 13:38:47 241052 XrdPfc_File: error File::ProcessBlockResponse block
>>>> 0xff3440 0 error=-22
>>>> eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root
>>>> 200708 13:38:47 240995 XrdPfc_File: error File::Read() io 0xe30f50, block 0
>>>> finished with error 22 invalid argument
>>>> eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root
>>>> src/tcmalloc.cc:284] Attempt to free invalid pointer 0x313262003543620a
>>>>
>>>> Note that while we get all these network errors at the start, cache still
>>>> got the stat info from the server (knows the size of the file).
>>>>
>>>> I must admit I never test xcache with auth on :( I'll try it out now, well,
>>>> after lunch :)
>>>>
>>>> Matevz
>>>>
>>>>> Andy
>>>>>
>>>>>
>>>>> On Wed, 8 Jul 2020, Matevz Tadel wrote:
>>>>>
>>>>>> Yay, that was a journey ... but I can reproduce it now!
>>>>>>
>>>>>> It is super strange this happens with xcache with authentication on only ...
>>>>>> this really should have no effect. I first tried without it and it worked and
>>>>>> then something rang a bell that you said so in the email :).
>>>>>>
>>>>>> Andy, does this ring any bells for you? It looks like interaction between
>>>>>> server / client usage of X509 stuff.
>>>>>>
>>>>>> Anyway, I'm digging on the xcache side ...
>>>>>>
>>>>>> Cheers,
>>>>>> Matevz
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2020-07-08 07:41, Nikolai Hartmann wrote:
>>>>>>> Hi Matevz,
>>>>>>>
>>>>>>> I might have something like a "minimal failing example".
>>>>>>> Unfortunately the problem only appears when authentication is required,
>>>>>>> so the example will only work on a machine that has a valid host
>>>>>>> certificate and the corresponding directory has to be bind-mounted into
>>>>>>> the container.
>>>>>>>
>>>>>>> I uploaded my container image here:
>>>>>>>
>>>>>>> https://urldefense.com/v3/__https://cloud.physik.lmu.de/index.php/s/RFC6Q89FBxxNMXF__;!!Mih3wA!S4S4O0y7f1Z5oNAgkr2EZ2J5683bZ5LRbG55GbcoHhyJTwOzaS2lABcIifddJxDGMy-N$
>>>>>>>
>>>>>>> and made a directory structure (tar archive attached) to bind mount into
>>>>>>> the container (and containing the minimal failing xcache config and a
>>>>>>> script for starting gdb inside the container)
>>>>>>>
>>>>>>> To reproduce, extract the archive, enter the directory and run (as
>>>>>>> non-root user)
>>>>>>>
>>>>>>> singularity run -B $(pwd)/data:/data -B $(pwd)/config:/etc/xrootd:ro -B
>>>>>>> <hostkey-dir>:/etc/grid-security:ro <singularity-image>
>>>>>>>
>>>>>>> where <hostkey-dir> is a directory that contains
>>>>>>>
>>>>>>> hostkey.pem
>>>>>>> hostcert.pem
>>>>>>> vomsdir (will become X509_VOMS_DIR)
>>>>>>> certificates (will become X509_CERT_DIR)
>>>>>>>
>>>>>>> and <singularity-image> is the path to the singularity image.
>>>>>>>
>>>>>>> That should run xrootd and the log should appear in
>>>>>>> data/xrd/var/log/xrootd.log
>>>>>>>
>>>>>>> I used this example to produce the failure:
>>>>>>>
>>>>>>> xrdcp -f
>>>>>>> root://lcg-lrz-xcache0.grid.lrz.de:1094//root://eospublic.cern.ch//eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root
>>>>>>> /dev/null
>>>>>>>
>>>>>>> The simplest way to run gdb seemed to directly start xrootd with gdb.
>>>>>>> This can be done with the script run_xcache_debug.sh in the attached
>>>>>>> archive.
>>>>>>> Instead of the command above just use
>>>>>>>
>>>>>>> singularity exec -B $(pwd)/data:/data -B $(pwd)/config:/etc/xrootd:ro -B
>>>>>>> <hostkey-dir>:/etc/grid-security:ro <singularity-image>
>>>>>>> ./run_xcache_debug.sh
>>>>>>>
>>>>>>> Note: Before restarting, best delete the content of the data directory
>>>>>>> since the bug also did not seem to occur when the file was already
>>>>>>> cached (e.g. after testing without authentication)
>>>>>>>
>>>>>>> Sorry for the overly complicated reproducing steps, but since it only
>>>>>>> happened when authentication was enabled I didn't know how to do it
>>>>>>> simpler. I hope it helps.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Nikolai
>>>>>>>
>>>>>>> On 7/7/20 8:42 PM, Matevz Tadel wrote:
>>>>>>>> Thanks Nikolai, I shall continue my investigation :)
>>>>>>>>
>>>>>>>> Matevz
>>>>>>>>
>>>>>>>> On 2020-07-06 23:59, Nikolai Hartmann wrote:
>>>>>>>>> Hi Matevz,
>>>>>>>>>
>>>>>>>>> Thanks a lot for looking into this.
>>>>>>>>>
>>>>>>>>> - The crash seems to happen always when I make a request
>>>>>>>>> - Currently prefetching is disabled
>>>>>>>>> - Yes, I think it is direct proxy mode
>>>>>>>>> - stack trace is attached
>>>>>>>>>
>>>>>>>>> A similar setup seems to work for Ilija without issues with the xcaches
>>>>>>>>> using slate - I tried to mimic that setup closely.
>>>>>>>>> Running xrootd from this container image:
>>>>>>>>>
>>>>>>>>> https://urldefense.com/v3/__https://gitlab.physik.uni-muenchen.de/Nikolai.Hartmann/xcache-singularity-lrz/-/blob/51d2da52829eb6d8ea377539884f337208141aca/xcache.singularity.def__;!!Mih3wA!SJibOzmy2P3rdD8Ut7m7gYp_bah2pQX2dR2V9U6xiTq9PoQtfjb_MHHDljpOV0aWvVYj$
>>>>>>>>>
>>>>>>>>> using this config
>>>>>>>>>
>>>>>>>>> https://urldefense.com/v3/__https://gitlab.physik.uni-muenchen.de/Nikolai.Hartmann/xcache-singularity-lrz/-/blob/51d2da52829eb6d8ea377539884f337208141aca/etc/xrootd/xcache.cfg__;!!Mih3wA!SJibOzmy2P3rdD8Ut7m7gYp_bah2pQX2dR2V9U6xiTq9PoQtfjb_MHHDljpOVzHQF5CU$
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Nikolai
>>>>>>>>>
>>>>>>>>> On 7/7/20 1:38 AM, Matevz Tadel wrote:
>>>>>>>>>> Hi Nikolai,
>>>>>>>>>>
>>>>>>>>>> I tried to reproduce it with current master in nearly all ways,
>>>>>>>>>> with/without prefetching and with direct/forwarding mode. Also, with std
>>>>>>>>>> malloc and tcmalloc. No luck :(
>>>>>>>>>>
>>>>>>>>>> Backtrace or core would help a lot at this point.
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Matevz
>>>>>>>>>>
>>>>>>>>>> On 2020-07-03 00:54, Nikolai Hartmann wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I'm trying to upgrade to xrootd5 rc4 for our xcache server to
>>>>>>>>>>> mitigate a problem with dCache.
>>>>>>>>>>>
>>>>>>>>>>> Now when I try to read a file through xcache it crashes with
>>>>>>>>>>> "Attempt to free invalid pointer". I attached the corresponding part
>>>>>>>>>>> of the log. Any ideas?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Nikolai
>>>>>>>>>>>
>>>>>>>>>>> ########################################################################
>>>>>>>>>>>
>>>>>>>>>>> Use REPLY-ALL to reply to list
>>>>>>>>>>>
>>>>>>>>>>> To unsubscribe from the XROOTD-L list, click the following link:
>>>>>>>>>>> https://urldefense.com/v3/__https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1__;!!Mih3wA!Xzk53aW-mEg2pavzme9Hd49MPmno8frpbkh2YetRsquNyAt5jiVsDB91pTNUHA$
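
Since the reproduction steps in this thread depend on <hostkey-dir> containing four specific entries, a small pre-flight check may save a failed container start. The following is only a sketch under assumptions: the function names and the image name "image.sif" are hypothetical and not part of the thread; the entry names and the bind-mount layout are taken from Nikolai's instructions.

```python
from pathlib import Path

# Entries the reproduction steps say <hostkey-dir> must contain.
REQUIRED_ENTRIES = ("hostkey.pem", "hostcert.pem", "vomsdir", "certificates")


def missing_entries(hostkey_dir):
    """Return the required entries that are absent from hostkey_dir."""
    base = Path(hostkey_dir)
    return [name for name in REQUIRED_ENTRIES if not (base / name).exists()]


def singularity_args(hostkey_dir, image, workdir="."):
    """Build the argv list for the 'singularity run' command quoted in the thread.

    'image' is the path to the singularity image; 'workdir' is the extracted
    archive directory holding the data/ and config/ subdirectories.
    """
    work = Path(workdir).resolve()
    return [
        "singularity", "run",
        "-B", f"{work / 'data'}:/data",
        "-B", f"{work / 'config'}:/etc/xrootd:ro",
        "-B", f"{Path(hostkey_dir).resolve()}:/etc/grid-security:ro",
        str(image),
    ]
```

One possible use: call missing_entries("/path/to/hostkey-dir") first and only pass the result of singularity_args(...) to subprocess.run when the returned list is empty.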