On 2020-07-08 15:00, Andrew Hanushevsky wrote:
> Could you try git head? There were issues in RC4. There might still be issues, 
> but I never saw the one you and Nikolai tripped over.

Yes, I managed to reproduce it on rc4 tag with just basic gsi enabled ... on 
master / 5.0.0 it doesn't happen.

Nikolai, please update to 5.0.0.

Matevz

> Andy
> 
> 
> On Wed, 8 Jul 2020, Matevz Tadel wrote:
> 
>> On 2020-07-08 14:19, Andrew Hanushevsky wrote:
>>> What release? Git head?
>>
>> This was from Nikolai's image, 5-rc4.
>>
>> \m
>>
>>> On Wed, 8 Jul 2020, Matevz Tadel wrote:
>>>
>>>> Hi Andy,
>>>>
>>>> On 2020-07-08 13:49, Andrew Hanushevsky wrote:
>>>>> Hi Matevz,
>>>>>
>>>>> Well, what kind of authentication? Clearly, the kind we use doesn't cause this
>>>>> problem. It could be just a random core smash, but if it's random we should see
>>>>> various effects, not just a crash in this particular code path, right?
>>>>
>>>> xcache without any security config, everything works smooth.
>>>>
>>>> xcache with sec.protocol /usr/lib64 gsi --- trouble:
>>>>
>>>> 200708 13:38:44 240995 XrootdXeq: matevz.241046:31@uaf-7 pub IPv4 login as 
>>>> d0ba0e6c.0
>>>> 200708 13:38:44 240995 Posix_P2L: file 
>>>> /eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root 
>>>> pfn2lfn /eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root
>>>> [2020-07-08 13:38:44.739012 -0700][Error  ][AsyncSock         ] 
>>>> [[log in to unmask]:1094.0] Unable to connect: network is unreachable
>>>> [2020-07-08 13:38:44.739092 -0700][Error  ][PostMaster        ] 
>>>> [[log in to unmask]:1094] elapsed = 0, pConnectionWindow = 120 seconds.
>>>> [2020-07-08 13:38:45.637583 -0700][Error  ][XRootDTransport   ] 
>>>> [[log in to unmask]:1094.0] Authentication with gsi failed:
>>>> [2020-07-08 13:38:45.974332 -0700][Error  ][AsyncSock         ] 
>>>> [[log in to unmask]:1095.0] Unable to connect: network is unreachable
>>>> [2020-07-08 13:38:45.974400 -0700][Error  ][PostMaster        ] 
>>>> [[log in to unmask]:1095] elapsed = 0, pConnectionWindow = 120 
>>>> seconds.
>>>> 200708 13:38:46 240995 XrdPfc_Manager: info Cache::Attach() 
>>>> root:[log in to unmask] 
>>>> 200708 13:38:46 240995 XrdPfc_Manager: debug Cache::GetFile 
>>>> eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root, io 
>>>> 0xe30f50
>>>> 200708 13:38:46 240995 XrdPfc_IO: debug IOEntireFile::initCachedStat get 
>>>> stat from client res = 0, size = 2272072 
>>>> root:[log in to unmask] 
>>>> 200708 13:38:46 240995 XrdPfc_File: debug Creating new file info, data size 
>>>> = 2272072 num blocks = 3 
>>>> eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root
>>>> 200708 13:38:46 240995 XrdPfc_Manager: debug Cache::inc_ref_cnt 
>>>> eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root, cnt 
>>>> at exit = 1
>>>> 200708 13:38:46 240995 XrdPfc_File: debug File::AddIO() io = 0xe30f50 
>>>> eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root
>>>> 200708 13:38:46 240995 XrdPfc_Manager: debug Cache::Attach() 
>>>> root:[log in to unmask] 
>>>> location: [log in to unmask]:1095
>>>> [2020-07-08 13:38:47.022428 -0700][Error  ][AsyncSock         ] 
>>>> [[log in to unmask]:1095.0] Socket error encountered: [ERROR] 
>>>> Invalid arguments
>>>> [2020-07-08 13:38:47.022506 -0700][Error  ][XRootD            ] 
>>>> [[log in to unmask]:1095] Unable to get the response to request 
>>>> kXR_read (handle: 0x00000000, offset: 0, size: 1048576)
>>>> [2020-07-08 13:38:47.022625 -0700][Error  ][File              ] 
>>>> [0xf0b040@root:[log in to unmask]:1094//eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root?xrdcl.requuid=6730e10b-8b40-43bf-9d0a-75da982939e8] 
>>>> Fatal file state error. Message kXR_read (handle: 0x00000000, offset: 0, 
>>>> size: 1048576) returned with [ERROR] Invalid arguments
>>>> 200708 13:38:47 241052 XrdPfc_File: error File::ProcessBlockResponse block 
>>>> 0xff3440  0 error=-22 
>>>> eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root
>>>> 200708 13:38:47 240995 XrdPfc_File: error File::Read() io 0xe30f50, block 0 
>>>> finished with error 22 invalid argument 
>>>> eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root
>>>> src/tcmalloc.cc:284] Attempt to free invalid pointer 0x313262003543620a
>>>>
>>>> Note that while we get all these network errors at the start, cache still 
>>>> got the stat info from the server (knows the size of the file).
>>>>
>>>> I must admit I never test xcache with auth on :( I'll try it out now, well, 
>>>> after lunch :)
>>>>
>>>> Matevz
>>>>
>>>>> Andy
>>>>>
>>>>>
>>>>> On Wed, 8 Jul 2020, Matevz Tadel wrote:
>>>>>
>>>>>> Yay, that was a journey ... but I can reproduce it now!
>>>>>>
>>>>>> It is super strange this happens with xcache with authentication on only ...
>>>>>> this really should have no effect. I first tried without it and it worked and
>>>>>> then something rang a bell that you said so in the email :).
>>>>>>
>>>>>> Andy, does this ring any bells for you? It looks like interaction between
>>>>>> server / client usage of X509 stuff.
>>>>>>
>>>>>> Anyway, I'm digging on the xcache side ...
>>>>>>
>>>>>> Cheers,
>>>>>> Matevz
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2020-07-08 07:41, Nikolai Hartmann wrote:
>>>>>>> Hi Matevz,
>>>>>>>
>>>>>>> I might have something like a "minimal failing example". Unfortunately
>>>>>>> the problem only appears when authentication is required, so the example
>>>>>>> will only work on a machine that has a valid host certificate and the
>>>>>>> corresponding directory has to be bind-mounted into the container.
>>>>>>>
>>>>>>> I uploaded my container image here:
>>>>>>>
>>>>>>> https://urldefense.com/v3/__https://cloud.physik.lmu.de/index.php/s/RFC6Q89FBxxNMXF__;!!Mih3wA!S4S4O0y7f1Z5oNAgkr2EZ2J5683bZ5LRbG55GbcoHhyJTwOzaS2lABcIifddJxDGMy-N$ 
>>>>>>>
>>>>>>>
>>>>>>> and made a directory structure (tar archive attached) to bind mount into
>>>>>>> the container (and containing the minimal failing xcache config and a
>>>>>>> script for starting gdb inside the container)
>>>>>>>
>>>>>>> To reproduce, extract the archive, enter the directory and run (as
>>>>>>> non-root user)
>>>>>>>
>>>>>>> singularity run -B $(pwd)/data:/data -B $(pwd)/config:/etc/xrootd:ro -B
>>>>>>> <hostkey-dir>:/etc/grid-security:ro <singularity-image>
>>>>>>>
>>>>>>> where <hostkey-dir> is a directory that contains
>>>>>>>
>>>>>>> hostkey.pem
>>>>>>> hostcert.pem
>>>>>>> vomsdir (will become X509_VOMS_DIR)
>>>>>>> certificates (will become X509_CERT_DIR)
>>>>>>>
>>>>>>> and <singularity-image> is the path to the singularity image.
>>>>>>>
>>>>>>> That should run xrootd and the log should appear in
>>>>>>> data/xrd/var/log/xrootd.log
>>>>>>>
>>>>>>> I used this example to produce the failure:
>>>>>>>
>>>>>>> xrdcp -f root://lcg-lrz-xcache0.grid.lrz.de:1094//root://eospublic.cern.ch//eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root /dev/null
>>>>>>>
>>>>>>> The simplest way to run gdb seemed to be to start xrootd directly with gdb.
>>>>>>> This can be done with the script run_xcache_debug.sh in the attached
>>>>>>> archive. Instead of the command above just use
>>>>>>>
>>>>>>> singularity exec -B $(pwd)/data:/data -B $(pwd)/config:/etc/xrootd:ro -B
>>>>>>> <hostkey-dir>:/etc/grid-security:ro <singularity-image>
>>>>>>> ./run_xcache_debug.sh
>>>>>>>
>>>>>>> Note: Before restarting, best delete the content of the data directory
>>>>>>> since the bug also did not seem to occur when the file was already
>>>>>>> cached (e.g. after testing without authentication)
>>>>>>>
>>>>>>> Sorry for the overly complicated reproducing steps, but since it only
>>>>>>> happened when authentication was enabled i didn't know how to do it
>>>>>>> simpler. I hope it helps.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Nikolai
>>>>>>>
>>>>>>> On 7/7/20 8:42 PM, Matevz Tadel wrote:
>>>>>>>> Thanks Nikolai, I shall continue my investigation :)
>>>>>>>>
>>>>>>>> Matevz
>>>>>>>>
>>>>>>>> On 2020-07-06 23:59, Nikolai Hartmann wrote:
>>>>>>>>> Hi Matevz,
>>>>>>>>>
>>>>>>>>> Thanks a lot for looking into this.
>>>>>>>>>
>>>>>>>>> - The crash seems to happen always when i make a request
>>>>>>>>> - Currently prefetching is disabled
>>>>>>>>> - Yes, i think it is direct proxy mode
>>>>>>>>> - stack trace is attached
>>>>>>>>>
>>>>>>>>> A similar setup seems to work for Ilija without issues with the xcaches
>>>>>>>>> using slate - i tried to mimic that setup closely. Running xrootd from
>>>>>>>>> this container image:
>>>>>>>>>
>>>>>>>>> https://urldefense.com/v3/__https://gitlab.physik.uni-muenchen.de/Nikolai.Hartmann/xcache-singularity-lrz/-/blob/51d2da52829eb6d8ea377539884f337208141aca/xcache.singularity.def__;!!Mih3wA!SJibOzmy2P3rdD8Ut7m7gYp_bah2pQX2dR2V9U6xiTq9PoQtfjb_MHHDljpOV0aWvVYj$ 
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> using this config
>>>>>>>>>
>>>>>>>>> https://urldefense.com/v3/__https://gitlab.physik.uni-muenchen.de/Nikolai.Hartmann/xcache-singularity-lrz/-/blob/51d2da52829eb6d8ea377539884f337208141aca/etc/xrootd/xcache.cfg__;!!Mih3wA!SJibOzmy2P3rdD8Ut7m7gYp_bah2pQX2dR2V9U6xiTq9PoQtfjb_MHHDljpOVzHQF5CU$ 
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Nikolai
>>>>>>>>>
>>>>>>>>> On 7/7/20 1:38 AM, Matevz Tadel wrote:
>>>>>>>>>> Hi Nikolai,
>>>>>>>>>>
>>>>>>>>>> I tried to reproduce it with current master in nearly all ways,
>>>>>>>>>> with/without prefetching and with direct/forwarding mode. Also, with std
>>>>>>>>>> malloc and tcmalloc. No luck :(
>>>>>>>>>>
>>>>>>>>>> Backtrace or core would help a lot at this point.
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Matevz
>>>>>>>>>>
>>>>>>>>>> On 2020-07-03 00:54, Nikolai Hartmann wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I'm trying to upgrade to xrootd5 rc4 for our xcache server to mitigate a
>>>>>>>>>>> problem with dCache.
>>>>>>>>>>>
>>>>>>>>>>> Now when i try to read a file through xcache it crashes with "Attempt to
>>>>>>>>>>> free invalid pointer". I attached the corresponding part of the log.
>>>>>>>>>>> Any ideas?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Nikolai
>>>>>>>>>>>
>>>>>>>>>>> ########################################################################
>>>>>>>>>>>
>>>>>>>>>>> Use REPLY-ALL to reply to list
>>>>>>>>>>>
>>>>>>>>>>> To unsubscribe from the XROOTD-L list, click the following link:
>>>>>>>>>>> https://urldefense.com/v3/__https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1__;!!Mih3wA!Xzk53aW-mEg2pavzme9Hd49MPmno8frpbkh2YetRsquNyAt5jiVsDB91pTNUHA$ 
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>
>>>>
>>>
>>>
>>
>>
