Hi Andy,

On 2020-07-08 13:49, Andrew Hanushevsky wrote:
> Hi Matevz,
> 
> Well, what kind of authentication? Clearly, the kind we use doesn't cause this 
> problem. It could be just a random core smash but if it's random we should be 
> seeing various effects, not just a crash in this particular code path, right?

xcache without any security config: everything works smoothly.

xcache with sec.protocol /usr/lib64 gsi --- trouble:

200708 13:38:44 240995 XrootdXeq: matevz.241046:31@uaf-7 pub IPv4 login as d0ba0e6c.0
200708 13:38:44 240995 Posix_P2L: file /eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root pfn2lfn /eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root
[2020-07-08 13:38:44.739012 -0700][Error  ][AsyncSock         ] [[log in to unmask]:1094.0] Unable to connect: network is unreachable
[2020-07-08 13:38:44.739092 -0700][Error  ][PostMaster        ] [[log in to unmask]:1094] elapsed = 0, pConnectionWindow = 120 seconds.
[2020-07-08 13:38:45.637583 -0700][Error  ][XRootDTransport   ] [[log in to unmask]:1094.0] Authentication with gsi failed:
[2020-07-08 13:38:45.974332 -0700][Error  ][AsyncSock         ] [[log in to unmask]:1095.0] Unable to connect: network is unreachable
[2020-07-08 13:38:45.974400 -0700][Error  ][PostMaster        ] [[log in to unmask]:1095] elapsed = 0, pConnectionWindow = 120 seconds.
200708 13:38:46 240995 XrdPfc_Manager: info Cache::Attach() root:[log in to unmask]
200708 13:38:46 240995 XrdPfc_Manager: debug Cache::GetFile eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root, io 0xe30f50
200708 13:38:46 240995 XrdPfc_IO: debug IOEntireFile::initCachedStat get stat from client res = 0, size = 2272072 root:[log in to unmask]
200708 13:38:46 240995 XrdPfc_File: debug Creating new file info, data size = 2272072 num blocks = 3 eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root
200708 13:38:46 240995 XrdPfc_Manager: debug Cache::inc_ref_cnt eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root, cnt at exit = 1
200708 13:38:46 240995 XrdPfc_File: debug File::AddIO() io = 0xe30f50 eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root
200708 13:38:46 240995 XrdPfc_Manager: debug Cache::Attach() root:[log in to unmask] location: [log in to unmask]:1095
[2020-07-08 13:38:47.022428 -0700][Error  ][AsyncSock         ] [[log in to unmask]:1095.0] Socket error encountered: [ERROR] Invalid arguments
[2020-07-08 13:38:47.022506 -0700][Error  ][XRootD            ] [[log in to unmask]:1095] Unable to get the response to request kXR_read (handle: 0x00000000, offset: 0, size: 1048576)
[2020-07-08 13:38:47.022625 -0700][Error  ][File              ] [0xf0b040@root:[log in to unmask]:1094//eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root?xrdcl.requuid=6730e10b-8b40-43bf-9d0a-75da982939e8] Fatal file state error. Message kXR_read (handle: 0x00000000, offset: 0, size: 1048576) returned with [ERROR] Invalid arguments
200708 13:38:47 241052 XrdPfc_File: error File::ProcessBlockResponse block 0xff3440  0 error=-22 eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root
200708 13:38:47 240995 XrdPfc_File: error File::Read() io 0xe30f50, block 0 finished with error 22 invalid argument eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root
src/tcmalloc.cc:284] Attempt to free invalid pointer 0x313262003543620a
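
That pointer value looks more like text than like an address. As a rough sketch (assuming a little-endian x86-64 host; ptr_bytes is a made-up helper name, not anything from xrootd or tcmalloc), its bytes can be printed in memory order like this:

```shell
# Print the bytes of a 64-bit value in little-endian (memory) order.
# Printable ASCII is shown as characters, everything else as \xNN.
# ptr_bytes is a hypothetical helper for illustration only.
ptr_bytes() {
    hex=$(printf '%016x' "$1")      # zero-padded to 16 hex digits
    out=""
    i=14
    while [ "$i" -ge 0 ]; do
        # take one byte (two hex digits), starting from the low-order end
        byte=$(printf '%s' "$hex" | cut -c$((i + 1))-$((i + 2)))
        dec=$((0x$byte))
        if [ "$dec" -ge 32 ] && [ "$dec" -le 126 ]; then
            # printable: emit the character via an octal escape
            out="$out$(printf '%b' "\\0$(printf '%03o' "$dec")")"
        else
            out="$out\\x$byte"
        fi
        i=$((i - 2))
    done
    printf '%s\n' "$out"
}

ptr_bytes 0x313262003543620a
```

For the value in the log this prints \x0abC5\x00b21, so six of the eight bytes are printable ASCII. That would be consistent with string data having been written over a pointer, but it does not prove it.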

Note that while we get all these network errors at the start, the cache still got the stat info from the server (it knows the size of the file).
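
To line the client errors up against the point where the stat succeeds, something like the following filter can help. It is only a sketch: the patterns are guessed from the log excerpt above, xcache_log_errors is a made-up name, and the log path is the one from Nikolai's reproduction steps quoted below.

```shell
# Show only the interesting lines of an xrootd/xcache log, with line numbers:
#  - client-side "[Error" records
#  - XrdPfc "error" records
#  - the initCachedStat line (where the stat result is reported)
#  - the allocator abort
# xcache_log_errors is a hypothetical helper; patterns come from the excerpt.
xcache_log_errors() {
    grep -n -E '\]\[Error |XrdPfc_[A-Za-z]+: error |initCachedStat|Attempt to free invalid pointer' "$1"
}

# Log location as given in the reproduction instructions; adjust as needed.
LOG=data/xrd/var/log/xrootd.log
if [ -f "$LOG" ]; then
    xcache_log_errors "$LOG"
fi
```

The debug lines are dropped on purpose, so the order of connection failures, the successful stat, the failed read and the crash is easier to see at a glance.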

I must admit I never test xcache with auth on :( I'll try it out now, well, after lunch :)

Matevz

> Andy
> 
> 
> On Wed, 8 Jul 2020, Matevz Tadel wrote:
> 
>> Yay, that was a journey ... but I can reproduce it now!
>>
>> It is super strange that this happens only when xcache has authentication on ... 
>> this really should have no effect. I first tried without it and it worked, and 
>> then something rang a bell that you had said so in the email :).
>>
>> Andy, does this ring any bells for you? It looks like an interaction between 
>> server and client usage of X509 stuff.
>>
>> Anyway, I'm digging in on the xcache side ...
>>
>> Cheers,
>> Matevz
>>
>>
>>
>> On 2020-07-08 07:41, Nikolai Hartmann wrote:
>>> Hi Matevz,
>>>
>>> I might have something like a "minimal failing example". Unfortunately
>>> the problem only appears when authentication is required, so the example
>>> will only work on a machine that has a valid host certificate and the
>>> corresponding directory has to be bind-mounted into the container.
>>>
>>> I uploaded my container image here:
>>>
>>> https://cloud.physik.lmu.de/index.php/s/RFC6Q89FBxxNMXF
>>>
>>>
>>> and made a directory structure (tar archive attached) to bind-mount into
>>> the container. It contains the minimal failing xcache config and a
>>> script for starting gdb inside the container.
>>>
>>> To reproduce, extract the archive, enter the directory and run (as
>>> non-root user)
>>>
>>> singularity run -B $(pwd)/data:/data -B $(pwd)/config:/etc/xrootd:ro \
>>>     -B <hostkey-dir>:/etc/grid-security:ro <singularity-image>
>>>
>>> where <hostkey-dir> is a directory that contains
>>>
>>> hostkey.pem
>>> hostcert.pem
>>> vomsdir (will become X509_VOMS_DIR)
>>> certificates (will become X509_CERT_DIR)
>>>
>>> and <singularity-image> is the path to the singularity image.
>>>
>>> That should run xrootd and the log should appear in
>>> data/xrd/var/log/xrootd.log
>>>
>>> I used this example to produce the failure:
>>>
>>> xrdcp -f \
>>>     root://lcg-lrz-xcache0.grid.lrz.de:1094//root://eospublic.cern.ch//eos/opendata/lhcb/AntimatterMatters2017/data/PhaseSpaceSimulation.root \
>>>     /dev/null
>>>
>>> The simplest way to run gdb seemed to be to start xrootd directly under gdb.
>>> This can be done with the script run_xcache_debug.sh in the attached
>>> archive. Instead of the command above just use
>>>
>>> singularity exec -B $(pwd)/data:/data -B $(pwd)/config:/etc/xrootd:ro \
>>>     -B <hostkey-dir>:/etc/grid-security:ro <singularity-image> \
>>>     ./run_xcache_debug.sh
>>>
>>> Note: Before restarting, it is best to delete the contents of the data
>>> directory, since the bug did not seem to occur when the file was already
>>> cached (e.g. after testing without authentication).
>>>
>>> Sorry for the overly complicated reproduction steps, but since it only
>>> happened when authentication was enabled, I didn't know how to make it
>>> simpler. I hope it helps.
>>>
>>> Thanks,
>>> Nikolai
>>>
>>> On 7/7/20 8:42 PM, Matevz Tadel wrote:
>>>> Thanks Nikolai, I shall continue my investigation :)
>>>>
>>>> Matevz
>>>>
>>>> On 2020-07-06 23:59, Nikolai Hartmann wrote:
>>>>> Hi Matevz,
>>>>>
>>>>> Thanks a lot for looking into this.
>>>>>
>>>>> - The crash seems to happen every time I make a request
>>>>> - Currently prefetching is disabled
>>>>> - Yes, I think it is direct proxy mode
>>>>> - stack trace is attached
>>>>>
>>>>> A similar setup seems to work without issues for Ilija, with the xcaches
>>>>> using slate. I tried to mimic that setup closely, running xrootd from
>>>>> this container image:
>>>>>
>>>>> https://gitlab.physik.uni-muenchen.de/Nikolai.Hartmann/xcache-singularity-lrz/-/blob/51d2da52829eb6d8ea377539884f337208141aca/xcache.singularity.def
>>>>>
>>>>>
>>>>>
>>>>> using this config
>>>>>
>>>>> https://gitlab.physik.uni-muenchen.de/Nikolai.Hartmann/xcache-singularity-lrz/-/blob/51d2da52829eb6d8ea377539884f337208141aca/etc/xrootd/xcache.cfg
>>>>>
>>>>>
>>>>>
>>>>> Cheers,
>>>>> Nikolai
>>>>>
>>>>> On 7/7/20 1:38 AM, Matevz Tadel wrote:
>>>>>> Hi Nikolai,
>>>>>>
>>>>>> I tried to reproduce it with current master in nearly all ways,
>>>>>> with/without prefetching and with direct/forwarding mode. Also, with std
>>>>>> malloc and tcmalloc. No luck :(
>>>>>>
>>>>>> Backtrace or core would help a lot at this point.
>>>>>>
>>>>>> Cheers,
>>>>>> Matevz
>>>>>>
>>>>>> On 2020-07-03 00:54, Nikolai Hartmann wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm trying to upgrade to xrootd5 rc4 for our xcache server to mitigate
>>>>>>> a problem with dCache.
>>>>>>>
>>>>>>> Now when I try to read a file through xcache it crashes with "Attempt
>>>>>>> to free invalid pointer". I attached the corresponding part of the log.
>>>>>>> Any ideas?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Nikolai
>>>>>>>
>>>>>>> ########################################################################
>>>>>>>
>>>>>>> Use REPLY-ALL to reply to list
>>>>>>>
>>>>>>> To unsubscribe from the XROOTD-L list, click the following link:
>>>>>>> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>
