> Ciao Andrea,
>
> we practically do this:
>
> ctx = creat_context(...
> ctx.set_opt_string("XROOTD PLUGIN", "XRD.WANTPROT", "gsi,unix")
> ctx.mkdir_rec(...
> ctx.filecopy(..
>
> Is that the gfal transfer API?

yup, internally in gfal2 we then use XrdCl to perform the copy (both for
local files and TPC). Today is a holiday, but we can start investigating
the possible issue with Michal tomorrow or Monday.

cheers
Andrea

> Cheers,
> Mario
>
> On 05/09/2019 10:35, andrea manzi wrote:
>> ciao Mario,
>>
>> I guess you are using the gfal2 transfer API in Rucio, right? (the
>> equivalent of gfal-copy)
>>
>> In gfal2, in the case of the transfer API for xrootd, we don't do much:
>> we just use the xrootd client (XrdCl::CopyProcess), so we don't touch
>> file offsets.
>>
>> So we may want to investigate whether there is a problem in XrdCl. Do
>> you have the client logs?
>>
>> If you are using the gfal2 POSIX API instead, we need to investigate a
>> possible bug there.
>>
>> cheers
>>
>> Andrea
>>
>>
>>> Hi Petr, Andy,
>>>
>>> one extra piece of information here: Rucio uses the Python bindings to
>>> the gfal library to upload these files, and de facto always uses the
>>> current stable gfal version published on ALRB. Rucio doesn't implement
>>> any other upload by itself.
>>>
>>> Cheers,
>>> Mario
>>>
>>> On 05/09/2019 04:21, Andrew Hanushevsky wrote:
>>>> Hi Petr,
>>>>
>>>> So, look at this line:
>>>>
>>>> 190904 17:02:16 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82] ofs_write: 1784104010@6506989258351653185 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>>
>>>> Why did the application doing the write specify an offset of
>>>> 6506989258351653185? That is the problem and, as you noted, it doesn't
>>>> matter whether POSC is on or not. Indeed, it's the first write to the
>>>> file, and only the first write suffers from this problem. So, there
>>>> are two possibilities here: a) there is a bug in the uploader, or
>>>> b) there was network data corruption that managed to hit the offset
>>>> (somewhat unlikely).
>>>>
>>>> Andy
>>>>
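For reference, the Rucio upload path Mario sketches above corresponds
roughly to the following (a minimal sketch using the gfal2 Python
bindings; the source/destination URLs and the transfer-parameter settings
are illustrative, not necessarily what Rucio actually sets):

    import gfal2

    # illustrative source and destination; Rucio builds these itself
    src = "file:///scratch/job/EVNT.example.pool.root.1"
    dst = "root://golias100.farm.particle.cz:1094//dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/example/EVNT.example.pool.root.1.rucio.upload"

    ctx = gfal2.creat_context()
    ctx.set_opt_string("XROOTD PLUGIN", "XRD.WANTPROT", "gsi,unix")

    # create the destination directory tree first
    ctx.mkdir_rec(dst.rsplit("/", 1)[0], 0o775)

    # the copy itself; for root:// URLs gfal2 hands this to XrdCl::CopyProcess
    params = ctx.transfer_parameters()
    params.overwrite = False
    params.timeout = 3600
    ctx.filecopy(params, src, dst)
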
>>>> From: Petr Vokac [log in to unmask]
>>>> Sent: Wednesday, September 04, 2019 12:19 PM
>>>> To: Andrew Hanushevsky [log in to unmask]
>>>> Cc: dpm-devel [log in to unmask]; [log in to unmask]
>>>> Subject: Re: Uploaded file 128TB on filesystem
>>>>
>>>> I configured our diskservers with "ofs.trace write" (I have not yet
>>>> added "ofs.persist hold 0" to the xrootd configuration) and today a new
>>>> file appeared on one of our storage servers:
>>>>
>>>> [root@dpmpool1 ~]# grep atlasprd.2817:152 /var/log/xrootd/dpmdisk/xrootd.log
>>>> 190904 17:02:08 1894 XrootdXeq: atlasprd.2817:152@[2001:718:401:6e03::1:82] pub IP64 login as /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1
>>>> 190904 17:02:16 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82] ofs_write: 1784104010@6506989258351653185 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>> 190904 17:02:16 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82] ofs_write: 262144@262144 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>> 190904 17:02:17 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82] ofs_write: 262144@524288 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>> 190904 17:02:17 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82] ofs_write: 262144@786432 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>> 190904 17:02:17 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82] ofs_write: 262144@1048576 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>> 190904 17:02:17 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82] ofs_write: 262144@1310720 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>> 190904 17:02:17 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82] ofs_write: 262144@1572864 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>> 190904 17:02:17 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82] ofs_write: 262144@1835008 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>> 190904 17:02:17 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82] ofs_write: 135000@2097152 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>> 190904 17:07:32 2163 XrootdXeq: atlasprd.2817:152@[2001:718:401:6e03::1:82] disc 0:05:24
>>>>
>>>> [root@dpmpool1 ~]# ls -la /mnt/fs2/atlas/2019-09-04/EVNT.18960107._035789.pool.root.1.rucio.upload.309650.1567609328
>>>> -rw-rw---- 1 dpmmgr dpmmgr 6506989258351919104 Sep 4 17:02 /mnt/fs2/atlas/2019-09-04/EVNT.18960107._035789.pool.root.1.rucio.upload.309650.1567609328
>>>>
>>>> [root@dpmpool1 ~]# du --block-size=1 /mnt/fs2/atlas/2019-09-04/EVNT.18960107._035789.pool.root.1.rucio.upload.309650.1567609328
>>>> 2236416 /mnt/fs2/atlas/2019-09-04/EVNT.18960107._035789.pool.root.1.rucio.upload.309650.1567609328
>>>>
>>>> This file has its first 256 kB filled with 0x00 (the adler32 computed
>>>> up to the original size 2232152 doesn't match), and this time there is
>>>> no ofs_Posc entry in the xrootd.log file.
>>>>
>>>> Petr Vokac
>>>>
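The check Petr describes (adler32 over just the first 2232152 bytes of the
stored replica, i.e. the original file size) can be reproduced with a few
lines of Python; a minimal sketch, with the replica path shortened for
illustration:

    import zlib

    def adler32_prefix(path, nbytes, chunk=1 << 20):
        """Adler32 of the first nbytes bytes of path, as 8 hex digits."""
        value = zlib.adler32(b"")          # adler32 start value (1)
        remaining = nbytes
        with open(path, "rb") as f:
            while remaining > 0:
                data = f.read(min(chunk, remaining))
                if not data:
                    break
                value = zlib.adler32(data, value)
                remaining -= len(data)
        return "%08x" % (value & 0xFFFFFFFF)

    # compare against the catalogue adler32 for the original 2232152 bytes
    print(adler32_prefix("/mnt/fs2/atlas/2019-09-04/EVNT.18960107._035789...", 2232152))
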
>>>> On 8/19/19 3:24 AM, Andrew Hanushevsky wrote:
>>>>> Hi Petr,
>>>>>
>>>>> The issue here is that the python script (I believe it's python) is
>>>>> disconnecting without closing the file and then reconnecting and
>>>>> writing at a high offset. The log is pretty clear on that. You can
>>>>> verify by adding "ofs.trace write" to see the write offsets (though
>>>>> it will produce a lot of output). You can also prohibit reconnection
>>>>> by adding this to the persist directive:
>>>>>
>>>>> ofs.persist hold 0
>>>>>
>>>>> which would delete the file the moment the client disconnects without
>>>>> closing the file.
>>>>>
>>>>> Andy
>>>>>
>>>>> On Sat, 17 Aug 2019, Petr Vokac wrote:
>>>>>
>>>>>> Today a new "65PB" file appeared on our storage, again transferred
>>>>>> with the xrootd protocol, and I can see the same error in the log
>>>>>> file again:
>>>>>>
>>>>>> [root@dpmpool21 ~]# ls -la /mnt/fs4/atlas/2019-08-17/EVNT.18893579._003197.pool.root.1.rucio.upload.234855.1566031522
>>>>>> -rw-rw---- 1 dpmmgr dpmmgr 72340172838077281 Aug 17 10:48 /mnt/fs4/atlas/2019-08-17/EVNT.18893579._003197.pool.root.1.rucio.upload.234855.1566031522
>>>>>>
>>>>>> [root@dpmpool21 ~]# grep atlasprd.2757 /var/log/xrootd/dpmdisk/xrootd.log
>>>>>> ...
>>>>>> 190817 10:47:55 57803 XrootdXeq: atlasprd.2757:65@[2001:718:401:6e03::1:43] pub IP64 login as /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1
>>>>>> 190817 10:47:55 61859 XrootdXeq: atlasprd.2757:66@[2001:718:401:6e03::1:83] pub IP64 login as /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1
>>>>>> 190817 10:48:52 62233 XrootdXeq: atlasprd.2757:46@[2001:718:401:6e03::1:72] disc 0:00:58
>>>>>> 190817 10:48:52 62954 XrootdXeq: atlasprd.2757:25@[2001:718:401:6e03::1:72] disc 0:03:30
>>>>>> 190817 10:48:52 57804 XrootdXeq: atlasprd.2757:28@[2001:718:401:6e03::1:72] pub IP64 login as /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1
>>>>>> 190817 10:48:52 57804 ofs_Posc: Creator changed from atlasprd.2757:25@[2001:718:401:6e03::1:72] to atlasprd.2757:28@[2001:718:401:6e03::1:72] for /dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_13TeV/8b/f3/EVNT.18893579._003197.pool.root.1.rucio.upload
>>>>>> 190817 10:48:53 57803 XrootdXeq: atlasprd.2757:46@[2001:718:401:6017:20:0:14:44] pub IP64 login as /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1
>>>>>> 190817 10:49:01 57818 XrootdXeq: atlasprd.2757:70@[2001:718:401:6017:20:0:11:11] pub IP64 login as /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1
>>>>>> 190817 10:49:07 62954 XrootdXeq: atlasprd.2757:55@[2001:718:401:6e03::1:81] disc 0:10:17
>>>>>> ...
>>>>>>
>>>>>> Any suggestion on how to trace what is causing this behavior?
>>>>>>
>>>>>> Petr
>>>>>>
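With "ofs.trace write" enabled the disk servers log every write, which, as
Andy notes above, produces a lot of output. A small filter along these
lines (a sketch; the 1 TiB threshold and the script name are arbitrary)
can pull out just the writes with implausible offsets:

    import re
    import sys

    # matches e.g. "ofs_write: 1784104010@6506989258351653185 fn=/dpm/..."
    WRITE_RE = re.compile(r"ofs_write:\s+(\d+)@(\d+)\s+fn=(\S+)")
    THRESHOLD = 1 << 40   # flag writes starting above 1 TiB (arbitrary cut-off)

    for line in sys.stdin:
        m = WRITE_RE.search(line)
        if m and int(m.group(2)) > THRESHOLD:
            print(line.rstrip())

Usage would be something like:

    grep ofs_write /var/log/xrootd/dpmdisk/xrootd.log | python3 find_bad_offsets.py
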
>>>>>> On 7/20/19 7:48 AM, Petr Vokac wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I ran the command you sent me
>>>>>>>
>>>>>>> grep "atlasprd.2757:" <logfile> | egrep '2001:718:401:6e03::1:21|2001:718:401:6e03::1:32' > <excerpt>
>>>>>>>
>>>>>>> and because the output was more than just a few lines long, I
>>>>>>> uploaded it to
>>>>>>>
>>>>>>> https://vokac.web.cern.ch/vokac/tmp/dpmpool22.xrootd.log
>>>>>>> https://vokac.web.cern.ch/vokac/tmp/dpmpool24.xrootd.log
>>>>>>>
>>>>>>> These machines on the 2001:718:401:6e03::/64 subnet are in a remote
>>>>>>> location (but relatively close, with RTT ~0.7 ms). I can see several
>>>>>>> ofs_Posc occurrences in the logs of our disknodes during the
>>>>>>> "190717 17:xx:xx" time period:
>>>>>>>
>>>>>>> http://vokac.web.cern.ch/vokac/tmp/POSC.txt
>>>>>>>
>>>>>>> Petr
>>>>>>>
>>>>>>> On 7/19/19 8:51 PM, Andrew Hanushevsky wrote:
>>>>>>>> Hi Petr,
>>>>>>>>
>>>>>>>> Well, POSC says that the client reconnected for writing, so not all
>>>>>>>> went well. Can I get the log output for that occurrence, as I
>>>>>>>> previously described?
>>>>>>>>
>>>>>>>> Andy
>>>>>>>>
>>>>>>>> On Fri, 19 Jul 2019, Petr Vokac wrote:
>>>>>>>>
>>>>>>>>> We are using the Rucio download client to transfer job outputs (it
>>>>>>>>> is built on top of the gfal2 Python interface).
>>>>>>>>>
>>>>>>>>> https://vokac.web.cern.ch/vokac/tmp/dpmpool22.xrootd.log
>>>>>>>>> https://vokac.web.cern.ch/vokac/tmp/dpmpool24.xrootd.log
>>>>>>>>>
>>>>>>>>> This is the full job output, unfortunately with minimal gfal2
>>>>>>>>> logging output:
>>>>>>>>>
>>>>>>>>> https://vokac.web.cern.ch/vokac/tmp/grid.2258136.4.out-from-2001:718:401:6e03::1:21-to-dpmpool22.txt
>>>>>>>>> https://vokac.web.cern.ch/vokac/tmp/grid.2255921.7.out-from-2001:718:401:6e03::1:32-to-dpmpool24.txt
>>>>>>>>>
>>>>>>>>> It looks like, according to gfal2, the transfer was OK
>>>>>>>>>
>>>>>>>>> 2019-07-17 15:09:17,782 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_xrootd.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,782 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_rfio.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,783 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_gridftp.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,783 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_http.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,783 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_lfc.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,783 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_file.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,784 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_sftp.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,784 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_dcap.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,784 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_srm.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,793 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_xrootd.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,793 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_rfio.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,793 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_gridftp.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,794 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_http.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,794 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_lfc.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,794 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_file.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,794 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_sftp.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,794 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_dcap.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,794 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_srm.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,795 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_xrootd.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,795 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_rfio.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,795 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_gridftp.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,795 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_http.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,796 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_lfc.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,796 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_file.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,796 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_sftp.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,796 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_dcap.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,796 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_srm.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,810 | INFO | copytool_out | gfal2 | __gfal2_copy | Event triggered: BOTH GFAL2:CORE:COPY LIST:ENTER
>>>>>>>>> 2019-07-17 15:09:17,810 | INFO | copytool_out | gfal2 | __gfal2_copy | Event triggered: BOTH GFAL2:CORE:COPY LIST:ITEM file:///scratch/condor/dir_24031/atlas_HudmF2he/PanDA_Pilot-4419868006/log.18636890._001059.job.log.tgz.1 => root://golias100.farm.particle.cz:1094//dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_13TeV/26/98/log.18636890._001059.job.log.tgz.1.rucio.upload
>>>>>>>>> 2019-07-17 15:09:17,810 | INFO | copytool_out | gfal2 | __gfal2_copy | Event triggered: BOTH GFAL2:CORE:COPY LIST:EXIT
>>>>>>>>> 2019-07-17 15:09:17,811 | INFO | Dummy-2 | gfal2 | (unknown function) | Event triggered: BOTH xroot TRANSFER:ENTER file://localhost///scratch/condor/dir_24031/atlas_HudmF2he/PanDA_Pilot-4419868006/log.18636890._001059.job.log.tgz.1?xrd.gsiusrpxy=/scratch/condor/dir_24031/job.9XONDm6yx7unnoBGSqYX2MjnABFKDmABFKDmSyhTDmEBFKDmqAGN3n.proxy&xrd.wantprot=gsi,unix => root://golias100.farm.particle.cz:1094///dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_13TeV/26/98/log.18636890._001059.job.log.tgz.1.rucio.upload?xrd.gsiusrpxy=/scratch/condor/dir_24031/job.9XONDm6yx7unnoBGSqYX2MjnABFKDmABFKDmSyhTDmEBFKDmqAGN3n.proxy&x
>>>>>>>>> 2019-07-17 15:09:17,811 | INFO | Dummy-2 | gfal2 | (unknown function) | Event triggered: BOTH xroot TRANSFER:TYPE streamed
>>>>>>>>> 2019-07-17 15:11:23,914 | INFO | Dummy-2 | gfal2 | (unknown function) | Event triggered: BOTH xroot TRANSFER:EXIT Job finished, [SUCCESS]
>>>>>>>>>
>>>>>>>>> but Rucio then probably tries to verify the checksum (gfal-sum via
>>>>>>>>> the Python bindings) and that fails
>>>>>>>>>
>>>>>>>>> 2019-07-17 15:15:23,774 | INFO | copytool_out | pilot.copytool.rucio | copy_out | stderr = An unknown exception occurred.
>>>>>>>>> Details: Checksum not validated
>>>>>>>>>
>>>>>>>>> because DPM is unable to calculate the checksum for a 128TB file
>>>>>>>>> within the timeout.
>>>>>>>>>
>>>>>>>>> Petr
>>>>>>>>>
>>>>>>>>> On 7/19/19 5:07 AM, Andrew Hanushevsky wrote:
>>>>>>>>>> Hi Petr,
>>>>>>>>>>
>>>>>>>>>> Indeed, it looks like whatever was used to upload the files
>>>>>>>>>> (xrdcp?) created a very large sparse file by writing with a very
>>>>>>>>>> large offset. The clue here is that an anomaly occurred during
>>>>>>>>>> the write, and these two messages (likely one for each file) tell
>>>>>>>>>> us something....
>>>>>>>>>>
>>>>>>>>>>> 190717 17:10:39 332092 ofs_Posc: Creator changed from atlasprd.2757:90@[2001:718:401:6e03::1:21] to atlasprd.2757:39@[2001:718:401:6e03::1:21] for /dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_13TeV/26/98/log.18636890._001059.job.log.tgz.1.rucio.upload
>>>>>>>>>>
>>>>>>>>>>> 190717 17:29:26 14204 ofs_Posc: Creator changed from atlasprd.2757:33@[2001:718:401:6e03::1:32] to atlasprd.2757:95@[2001:718:401:6e03::1:32] for /dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_5TeV/09/bc/AOD.18595963._000786.pool.root.1.rucio.upload
>>>>>>>>>>
>>>>>>>>>> What the POSC system is saying is that in each case the original
>>>>>>>>>> creator of the file lost contact with the server; it then
>>>>>>>>>> reconnected, resumed writing, and closed successfully. POSC
>>>>>>>>>> allows this so that a network glitch does not cause a file to be
>>>>>>>>>> deleted: a glitch will make the client reconnect and resume where
>>>>>>>>>> it left off. This is only allowed when the resuming client comes
>>>>>>>>>> from the same IP address, has the same process ID, and has the
>>>>>>>>>> same logical name. In this case it was "atlasprd.2757". I find
>>>>>>>>>> that rather odd because it's next to impossible for two processes
>>>>>>>>>> on two different machines to have the same process number (i.e.
>>>>>>>>>> 2757). So, this is why I am wondering whether this was really an
>>>>>>>>>> xrdcp or some other program.
>>>>>>>>>>
>>>>>>>>>> We should look to see what these clients were actually doing
>>>>>>>>>> during this period, as they clearly did not do the right thing
>>>>>>>>>> when they reconnected. So, could you grep out all interactions of
>>>>>>>>>> these two clients from the log, something like:
>>>>>>>>>>
>>>>>>>>>> grep "atlasprd.2757:" <logfile> | egrep '2001:718:401:6e03::1:21|2001:718:401:6e03::1:32' > <excerpt>
>>>>>>>>>>
>>>>>>>>>> and send the excerpt to me?
>>>>>>>>>>
>>>>>>>>>> Andy
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, 18 Jul 2019, Petr Vokac wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> our DPM, which is using native xrootd 4.9.1, yesterday created
>>>>>>>>>>> two different 128TB files on two different diskservers; they
>>>>>>>>>>> were received as job output from two different worker nodes. The
>>>>>>>>>>> original files were a few megabytes, and when I check the
>>>>>>>>>>> beginning of the stored files up to the original size and
>>>>>>>>>>> checksum, everything seems to be OK - except that these are
>>>>>>>>>>> sparse files with a size of 128TB.
>>>>>>>>>>>
>>>>>>>>>>> The original files transferred from the WN looked like:
>>>>>>>>>>>
>>>>>>>>>>> mc15_13TeV:log.18636890._001059.job.log.tgz.1 - original size 3101899, adler32 742e1336
>>>>>>>>>>> mc16_5TeV:AOD.18595963._000786.pool.root.1 - original size 737474514, adler32 dc68cb21
>>>>>>>>>>>
>>>>>>>>>>> but on our diskservers they were stored as
>>>>>>>>>>>
>>>>>>>>>>> [root@dpmpool22 ~]# ls -l /mnt/fs1/atlas/2019-07-17/log.18636890._001059.job.log.tgz.1.rucio.upload.209928.1563376157
>>>>>>>>>>> -rw-rw---- 1 dpmmgr dpmmgr 140560257079236 Jul 17 17:11 /mnt/fs1/atlas/2019-07-17/log.18636890._001059.job.log.tgz.1.rucio.upload.209928.1563376157
>>>>>>>>>>>
>>>>>>>>>>> [root@dpmpool22 ~]# du --block-size=1 /mnt/fs1/atlas/2019-07-17/log.18636890._001059.job.log.tgz.1.rucio.upload.209928.1563376157
>>>>>>>>>>> 3108864 /mnt/fs1/atlas/2019-07-17/log.18636890._001059.job.log.tgz.1.rucio.upload.209928.1563376157
>>>>>>>>>>>
>>>>>>>>>>> [root@dpmpool24 ~]# ls -l /mnt/fs2/atlas/2019-07-17/AOD.18595963._000786.pool.root.1.rucio.upload.210480.1563377083
>>>>>>>>>>> -rw-rw---- 1 dpmmgr dpmmgr 139734345056408 Jul 17 17:31 /mnt/fs2/atlas/2019-07-17/AOD.18595963._000786.pool.root.1.rucio.upload.210480.1563377083
>>>>>>>>>>>
>>>>>>>>>>> [root@dpmpool24 tmp]# du --block-size=1 /mnt/fs2/atlas/2019-07-17/AOD.18595963._000786.pool.root.1.rucio.upload.210480.1563377083
>>>>>>>>>>> 737480704 /mnt/fs2/atlas/2019-07-17/AOD.18595963._000786.pool.root.1.rucio.upload.210480.1563377083
>>>>>>>>>>>
>>>>>>>>>>> I was told that DPM is not really involved in transferring /
>>>>>>>>>>> storing files with the xrootd protocol, so I guess something
>>>>>>>>>>> bad must have happened inside the XRootD libraries. Do you have
>>>>>>>>>>> any idea why these two transfers generated such extreme files?
>>>>>>>>>>> Unfortunately our xrootd logging level is pretty low, and I was
>>>>>>>>>>> able to find only the info about file creation:
>>>>>>>>>>>
>>>>>>>>>>> [root@dpmpool22 ~]# grep log.18636890._001059.job.log.tgz /var/log/xrootd/dpmdisk/xrootd.log
>>>>>>>>>>> 190717 17:10:39 332092 ofs_Posc: Creator changed from atlasprd.2757:90@[2001:718:401:6e03::1:21] to atlasprd.2757:39@[2001:718:401:6e03::1:21] for /dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_13TeV/26/98/log.18636890._001059.job.log.tgz.1.rucio.upload
>>>>>>>>>>>
>>>>>>>>>>> [root@dpmpool24 tmp]# grep AOD.18595963._000786.pool.root /var/log/xrootd/dpmdisk/xrootd.log
>>>>>>>>>>> 190717 17:29:26 14204 ofs_Posc: Creator changed from atlasprd.2757:33@[2001:718:401:6e03::1:32] to atlasprd.2757:95@[2001:718:401:6e03::1:32] for /dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_5TeV/09/bc/AOD.18595963._000786.pool.root.1.rucio.upload
>>>>>>>>>>>
>>>>>>>>>>> Petr
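The ls/du discrepancy shown in Petr's listings (an apparent size of ~128TB
but only a few MB actually allocated) is the signature of a sparse file. A
minimal sketch for spotting such replicas on a disk server, comparing the
apparent size with the allocated blocks (the script name and the size/ratio
cut-offs are arbitrary):

    import os
    import sys

    def looks_sparse(path, min_size=1 << 40):
        """Flag files whose apparent size vastly exceeds the allocated space."""
        st = os.stat(path)
        allocated = st.st_blocks * 512        # st_blocks is in 512-byte units
        return st.st_size >= min_size and allocated < st.st_size // 1000

    for root, _dirs, files in os.walk(sys.argv[1]):
        for name in files:
            path = os.path.join(root, name)
            if os.path.isfile(path) and looks_sparse(path):
                print(path)

Usage would be something like:

    python3 find_sparse.py /mnt/fs1/atlas
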
########################################################################

Use REPLY-ALL to reply to list

To unsubscribe from the XROOTD-L list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1