> Ciao Andrea,
>
> we practically do this:
>
> ctx = creat_context(...
> ctx.set_opt_string("XROOTD PLUGIN", "XRD.WANTPROT", "gsi,unix")
> ctx.mkdir_rec(...
> ctx.filecopy(..
>
> Is that the gfal transfer API?

yup, internally in gfal2 we then use XrdCl to perform the copy (both for
local files and TPC). Today is a holiday, but we can start investigating
the possible issue with Michal tomorrow or Monday.

cheers
Andrea

> Cheers,
> Mario
>
> On 05/09/2019 10:35, andrea manzi wrote:
>> ciao Mario,
>>
>> I guess you are using the gfal2 transfer API in Rucio, right? (the
>> equivalent of gfal-copy)
>>
>> In gfal2, in the case of the transfer API for xrootd, we don't do much:
>> we just use the xrootd client (XrdCl::CopyProcess), so we don't touch
>> file offsets.
>>
>> So we may want to investigate whether there is a problem in XrdCl. Do
>> you have the client logs?
>>
>> If you are using the gfal2 POSIX API instead, we need to investigate a
>> possible bug there.
>>
>> cheers
>>
>> Andrea
>>
>>
>>> Hi Petr, Andy,
>>>
>>> one extra piece of information here: Rucio uses the Python bindings to
>>> the gfal library to upload these files, and de facto always uses the
>>> current stable gfal version published on ALRB. Rucio doesn't implement
>>> any other upload by itself.
>>>
>>> Cheers,
>>> Mario
>>>
>>> On 05/09/2019 04:21, Andrew Hanushevsky wrote:
>>>> Hi Petr,
>>>>
>>>> So, look at this line:
>>>>
>>>> 190904 17:02:16 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82] ofs_write: 1784104010@6506989258351653185 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>>
>>>> Why did the application doing the write specify an offset of
>>>> 6506989258351653185? That is the problem and, as you noted, it doesn't
>>>> matter whether POSC is on or not. Indeed, it's the first write to the
>>>> file, and only the first write suffers from this problem. So, there
>>>> are two possibilities here: a) there is a bug in the uploader, or
>>>> b) there was network data corruption that managed to hit the offset
>>>> (somewhat unlikely).
>>>>
>>>> Andy
>>>>
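For reference, the Rucio upload path Mario sketches above corresponds
roughly to the following (a minimal sketch using the gfal2 Python
bindings; the source/destination URLs and the transfer-parameter settings
are illustrative, not necessarily what Rucio actually sets):

    import gfal2

    # illustrative source and destination; Rucio builds these itself
    src = "file:///scratch/job/EVNT.example.pool.root.1"
    dst = "root://golias100.farm.particle.cz:1094//dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/example/EVNT.example.pool.root.1.rucio.upload"

    ctx = gfal2.creat_context()
    ctx.set_opt_string("XROOTD PLUGIN", "XRD.WANTPROT", "gsi,unix")

    # create the destination directory tree first
    ctx.mkdir_rec(dst.rsplit("/", 1)[0], 0o775)

    # the copy itself; for root:// URLs gfal2 hands this to XrdCl::CopyProcess
    params = ctx.transfer_parameters()
    params.overwrite = False
    params.timeout = 3600
    ctx.filecopy(params, src, dst)
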
>>>> From: Petr Vokac [log in to unmask]
>>>> Sent: Wednesday, September 04, 2019 12:19 PM
>>>> To: Andrew Hanushevsky [log in to unmask]
>>>> Cc: dpm-devel [log in to unmask]; [log in to unmask]
>>>> Subject: Re: Uploaded file 128TB on filesystem
>>>>
>>>> I configured our diskservers with "ofs.trace write" (I have not yet
>>>> added "ofs.persist hold 0" to the xrootd configuration) and today a new
>>>> file appeared on one of our storage servers:
>>>>
>>>> [root@dpmpool1 ~]# grep atlasprd.2817:152 /var/log/xrootd/dpmdisk/xrootd.log
>>>> 190904 17:02:08 1894 XrootdXeq: atlasprd.2817:152@[2001:718:401:6e03::1:82] pub IP64 login as /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1
>>>> 190904 17:02:16 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82] ofs_write: 1784104010@6506989258351653185 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>> 190904 17:02:16 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82] ofs_write: 262144@262144 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>> 190904 17:02:17 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82] ofs_write: 262144@524288 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>> 190904 17:02:17 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82] ofs_write: 262144@786432 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>> 190904 17:02:17 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82] ofs_write: 262144@1048576 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>> 190904 17:02:17 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82] ofs_write: 262144@1310720 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>> 190904 17:02:17 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82] ofs_write: 262144@1572864 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>> 190904 17:02:17 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82] ofs_write: 262144@1835008 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>> 190904 17:02:17 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82] ofs_write: 135000@2097152 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>> 190904 17:07:32 2163 XrootdXeq: atlasprd.2817:152@[2001:718:401:6e03::1:82] disc 0:05:24
>>>>
>>>> [root@dpmpool1 ~]# ls -la /mnt/fs2/atlas/2019-09-04/EVNT.18960107._035789.pool.root.1.rucio.upload.309650.1567609328
>>>> -rw-rw---- 1 dpmmgr dpmmgr 6506989258351919104 Sep 4 17:02 /mnt/fs2/atlas/2019-09-04/EVNT.18960107._035789.pool.root.1.rucio.upload.309650.1567609328
>>>>
>>>> [root@dpmpool1 ~]# du --block-size=1 /mnt/fs2/atlas/2019-09-04/EVNT.18960107._035789.pool.root.1.rucio.upload.309650.1567609328
>>>> 2236416 /mnt/fs2/atlas/2019-09-04/EVNT.18960107._035789.pool.root.1.rucio.upload.309650.1567609328
>>>>
>>>> This file has its first 256 kB filled with 0x00 (the adler32 computed
>>>> up to the original size 2232152 doesn't match), and this time there is
>>>> no ofs_Posc entry in the xrootd.log file.
>>>>
>>>> Petr Vokac
>>>>
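The check Petr describes (adler32 over just the first 2232152 bytes of the
stored replica, i.e. the original file size) can be reproduced with a few
lines of Python; a minimal sketch, with the replica path shortened for
illustration:

    import zlib

    def adler32_prefix(path, nbytes, chunk=1 << 20):
        """Adler32 of the first nbytes bytes of path, as 8 hex digits."""
        value = zlib.adler32(b"")          # adler32 start value (1)
        remaining = nbytes
        with open(path, "rb") as f:
            while remaining > 0:
                data = f.read(min(chunk, remaining))
                if not data:
                    break
                value = zlib.adler32(data, value)
                remaining -= len(data)
        return "%08x" % (value & 0xFFFFFFFF)

    # compare against the catalogue adler32 for the original 2232152 bytes
    print(adler32_prefix("/mnt/fs2/atlas/2019-09-04/EVNT.18960107._035789...", 2232152))
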
>>>> On 8/19/19 3:24 AM, Andrew Hanushevsky wrote:
>>>>> Hi Petr,
>>>>>
>>>>> The issue here is that the python script (I believe it's python) is
>>>>> disconnecting without closing the file and then reconnecting and
>>>>> writing at a high offset. The log is pretty clear on that. You can
>>>>> verify by adding "ofs.trace write" to see the write offsets (though
>>>>> it will produce a lot of output). You can also prohibit reconnection
>>>>> by adding this to the persist directive:
>>>>>
>>>>> ofs.persist hold 0
>>>>>
>>>>> which would delete the file the moment the client disconnects without
>>>>> closing the file.
>>>>>
>>>>> Andy
>>>>>
>>>>> On Sat, 17 Aug 2019, Petr Vokac wrote:
>>>>>
>>>>>> Today a new "65PB" file appeared on our storage, again transferred
>>>>>> with the xrootd protocol, and I can see the same error in the log
>>>>>> file again:
>>>>>>
>>>>>> [root@dpmpool21 ~]# ls -la /mnt/fs4/atlas/2019-08-17/EVNT.18893579._003197.pool.root.1.rucio.upload.234855.1566031522
>>>>>> -rw-rw---- 1 dpmmgr dpmmgr 72340172838077281 Aug 17 10:48 /mnt/fs4/atlas/2019-08-17/EVNT.18893579._003197.pool.root.1.rucio.upload.234855.1566031522
>>>>>>
>>>>>> [root@dpmpool21 ~]# grep atlasprd.2757 /var/log/xrootd/dpmdisk/xrootd.log
>>>>>> ...
>>>>>> 190817 10:47:55 57803 XrootdXeq: atlasprd.2757:65@[2001:718:401:6e03::1:43] pub IP64 login as /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1
>>>>>> 190817 10:47:55 61859 XrootdXeq: atlasprd.2757:66@[2001:718:401:6e03::1:83] pub IP64 login as /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1
>>>>>> 190817 10:48:52 62233 XrootdXeq: atlasprd.2757:46@[2001:718:401:6e03::1:72] disc 0:00:58
>>>>>> 190817 10:48:52 62954 XrootdXeq: atlasprd.2757:25@[2001:718:401:6e03::1:72] disc 0:03:30
>>>>>> 190817 10:48:52 57804 XrootdXeq: atlasprd.2757:28@[2001:718:401:6e03::1:72] pub IP64 login as /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1
>>>>>> 190817 10:48:52 57804 ofs_Posc: Creator changed from atlasprd.2757:25@[2001:718:401:6e03::1:72] to atlasprd.2757:28@[2001:718:401:6e03::1:72] for /dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_13TeV/8b/f3/EVNT.18893579._003197.pool.root.1.rucio.upload
>>>>>> 190817 10:48:53 57803 XrootdXeq: atlasprd.2757:46@[2001:718:401:6017:20:0:14:44] pub IP64 login as /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1
>>>>>> 190817 10:49:01 57818 XrootdXeq: atlasprd.2757:70@[2001:718:401:6017:20:0:11:11] pub IP64 login as /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1
>>>>>> 190817 10:49:07 62954 XrootdXeq: atlasprd.2757:55@[2001:718:401:6e03::1:81] disc 0:10:17
>>>>>> ...
>>>>>>
>>>>>> Any suggestion on how to trace what is causing this behavior?
>>>>>>
>>>>>> Petr
>>>>>>
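With "ofs.trace write" enabled the disk servers log every write, which, as
Andy notes above, produces a lot of output. A small filter along these
lines (a sketch; the 1 TiB threshold and the script name are arbitrary)
can pull out just the writes with implausible offsets:

    import re
    import sys

    # matches e.g. "ofs_write: 1784104010@6506989258351653185 fn=/dpm/..."
    WRITE_RE = re.compile(r"ofs_write:\s+(\d+)@(\d+)\s+fn=(\S+)")
    THRESHOLD = 1 << 40   # flag writes starting above 1 TiB (arbitrary cut-off)

    for line in sys.stdin:
        m = WRITE_RE.search(line)
        if m and int(m.group(2)) > THRESHOLD:
            print(line.rstrip())

Usage would be something like:

    grep ofs_write /var/log/xrootd/dpmdisk/xrootd.log | python3 find_bad_offsets.py
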
>>>>>> On 7/20/19 7:48 AM, Petr Vokac wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I ran the command you sent me
>>>>>>>
>>>>>>> grep "atlasprd.2757:" <logfile> | egrep '2001:718:401:6e03::1:21|2001:718:401:6e03::1:32' > <excerpt>
>>>>>>>
>>>>>>> and because the output was more than just a few lines long, I
>>>>>>> uploaded it to
>>>>>>>
>>>>>>> https://vokac.web.cern.ch/vokac/tmp/dpmpool22.xrootd.log
>>>>>>> https://vokac.web.cern.ch/vokac/tmp/dpmpool24.xrootd.log
>>>>>>>
>>>>>>> These machines on the 2001:718:401:6e03::/64 subnet are in a remote
>>>>>>> location (but relatively close, with RTT ~0.7 ms). I can see several
>>>>>>> ofs_Posc occurrences in the logs of our disknodes during the
>>>>>>> "190717 17:xx:xx" time period:
>>>>>>>
>>>>>>> http://vokac.web.cern.ch/vokac/tmp/POSC.txt
>>>>>>>
>>>>>>> Petr
>>>>>>>
>>>>>>> On 7/19/19 8:51 PM, Andrew Hanushevsky wrote:
>>>>>>>> Hi Petr,
>>>>>>>>
>>>>>>>> Well, POSC says that the client reconnected for writing, so not all
>>>>>>>> went well. Can I get the log output for that occurrence, as I
>>>>>>>> previously described?
>>>>>>>>
>>>>>>>> Andy
>>>>>>>>
>>>>>>>> On Fri, 19 Jul 2019, Petr Vokac wrote:
>>>>>>>>
>>>>>>>>> We are using the Rucio download client to transfer job outputs (it
>>>>>>>>> is built on top of the gfal2 Python interface).
>>>>>>>>>
>>>>>>>>> https://vokac.web.cern.ch/vokac/tmp/dpmpool22.xrootd.log
>>>>>>>>> https://vokac.web.cern.ch/vokac/tmp/dpmpool24.xrootd.log
>>>>>>>>>
>>>>>>>>> This is the full job output, unfortunately with minimal gfal2
>>>>>>>>> logging output:
>>>>>>>>>
>>>>>>>>> https://vokac.web.cern.ch/vokac/tmp/grid.2258136.4.out-from-2001:718:401:6e03::1:21-to-dpmpool22.txt
>>>>>>>>> https://vokac.web.cern.ch/vokac/tmp/grid.2255921.7.out-from-2001:718:401:6e03::1:32-to-dpmpool24.txt
>>>>>>>>>
>>>>>>>>> It looks like, according to gfal2, the transfer was OK
>>>>>>>>>
>>>>>>>>> 2019-07-17 15:09:17,782 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_xrootd.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,782 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_rfio.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,783 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_gridftp.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,783 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_http.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,783 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_lfc.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,783 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_file.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,784 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_sftp.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,784 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_dcap.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,784 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_srm.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,793 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_xrootd.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,793 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_rfio.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,793 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_gridftp.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,794 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_http.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,794 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_lfc.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,794 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_file.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,794 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_sftp.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,794 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_dcap.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,794 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_srm.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,795 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_xrootd.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,795 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_rfio.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,795 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_gridftp.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,795 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_http.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,796 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_lfc.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,796 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_file.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,796 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_sftp.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,796 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_dcap.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,796 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_srm.so loaded with success
>>>>>>>>> 2019-07-17 15:09:17,810 | INFO | copytool_out | gfal2 | __gfal2_copy | Event triggered: BOTH GFAL2:CORE:COPY LIST:ENTER
>>>>>>>>> 2019-07-17 15:09:17,810 | INFO | copytool_out | gfal2 | __gfal2_copy | Event triggered: BOTH GFAL2:CORE:COPY LIST:ITEM file:///scratch/condor/dir_24031/atlas_HudmF2he/PanDA_Pilot-4419868006/log.18636890._001059.job.log.tgz.1 => root://golias100.farm.particle.cz:1094//dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_13TeV/26/98/log.18636890._001059.job.log.tgz.1.rucio.upload
>>>>>>>>> 2019-07-17 15:09:17,810 | INFO | copytool_out | gfal2 | __gfal2_copy | Event triggered: BOTH GFAL2:CORE:COPY LIST:EXIT
>>>>>>>>> 2019-07-17 15:09:17,811 | INFO | Dummy-2 | gfal2 | (unknown function) | Event triggered: BOTH xroot TRANSFER:ENTER file://localhost///scratch/condor/dir_24031/atlas_HudmF2he/PanDA_Pilot-4419868006/log.18636890._001059.job.log.tgz.1?xrd.gsiusrpxy=/scratch/condor/dir_24031/job.9XONDm6yx7unnoBGSqYX2MjnABFKDmABFKDmSyhTDmEBFKDmqAGN3n.proxy&xrd.wantprot=gsi,unix => root://golias100.farm.particle.cz:1094///dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_13TeV/26/98/log.18636890._001059.job.log.tgz.1.rucio.upload?xrd.gsiusrpxy=/scratch/condor/dir_24031/job.9XONDm6yx7unnoBGSqYX2MjnABFKDmABFKDmSyhTDmEBFKDmqAGN3n.proxy&x
>>>>>>>>> 2019-07-17 15:09:17,811 | INFO | Dummy-2 | gfal2 | (unknown function) | Event triggered: BOTH xroot TRANSFER:TYPE streamed
>>>>>>>>> 2019-07-17 15:11:23,914 | INFO | Dummy-2 | gfal2 | (unknown function) | Event triggered: BOTH xroot TRANSFER:EXIT Job finished, [SUCCESS]
>>>>>>>>>
>>>>>>>>> but Rucio then probably tries to verify the checksum (gfal-sum via
>>>>>>>>> the Python bindings) and that fails
>>>>>>>>>
>>>>>>>>> 2019-07-17 15:15:23,774 | INFO | copytool_out | pilot.copytool.rucio | copy_out | stderr = An unknown exception occurred.
>>>>>>>>> Details: Checksum not validated
>>>>>>>>>
>>>>>>>>> because DPM is unable to calculate the checksum for a 128TB file
>>>>>>>>> within the timeout.
>>>>>>>>>
>>>>>>>>> Petr
>>>>>>>>>
>>>>>>>>> On 7/19/19 5:07 AM, Andrew Hanushevsky wrote:
>>>>>>>>>> Hi Petr,
>>>>>>>>>>
>>>>>>>>>> Indeed, it looks like whatever was used to upload the files
>>>>>>>>>> (xrdcp?) created a very large sparse file by writing with a very
>>>>>>>>>> large offset. The clue here is that an anomaly occurred during
>>>>>>>>>> the write, and these two messages (likely one for each file) tell
>>>>>>>>>> us something....
>>>>>>>>>>
>>>>>>>>>>> 190717 17:10:39 332092 ofs_Posc: Creator changed from atlasprd.2757:90@[2001:718:401:6e03::1:21] to atlasprd.2757:39@[2001:718:401:6e03::1:21] for /dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_13TeV/26/98/log.18636890._001059.job.log.tgz.1.rucio.upload
>>>>>>>>>>
>>>>>>>>>>> 190717 17:29:26 14204 ofs_Posc: Creator changed from atlasprd.2757:33@[2001:718:401:6e03::1:32] to atlasprd.2757:95@[2001:718:401:6e03::1:32] for /dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_5TeV/09/bc/AOD.18595963._000786.pool.root.1.rucio.upload
>>>>>>>>>>
>>>>>>>>>> What the POSC system is saying is that in each case the original
>>>>>>>>>> creator of the file lost contact with the server; it then
>>>>>>>>>> reconnected, resumed writing, and closed successfully. POSC
>>>>>>>>>> allows this so that a network glitch does not cause a file to be
>>>>>>>>>> deleted: a glitch will make the client reconnect and resume where
>>>>>>>>>> it left off. This is only allowed when the resuming client comes
>>>>>>>>>> from the same IP address, has the same process ID, and has the
>>>>>>>>>> same logical name. In this case it was "atlasprd.2757". I find
>>>>>>>>>> that rather odd because it's next to impossible for two processes
>>>>>>>>>> on two different machines to have the same process number (i.e.
>>>>>>>>>> 2757). So, this is why I am wondering whether this was really an
>>>>>>>>>> xrdcp or some other program.
>>>>>>>>>>
>>>>>>>>>> We should look to see what these clients were actually doing
>>>>>>>>>> during this period, as they clearly did not do the right thing
>>>>>>>>>> when they reconnected. So, could you grep out all interactions of
>>>>>>>>>> these two clients from the log, something like:
>>>>>>>>>>
>>>>>>>>>> grep "atlasprd.2757:" <logfile> | egrep '2001:718:401:6e03::1:21|2001:718:401:6e03::1:32' > <excerpt>
>>>>>>>>>>
>>>>>>>>>> and send the excerpt to me?
>>>>>>>>>>
>>>>>>>>>> Andy
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, 18 Jul 2019, Petr Vokac wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> our DPM, which is using native xrootd 4.9.1, yesterday created
>>>>>>>>>>> two different 128TB files on two different diskservers; they
>>>>>>>>>>> were received as job output from two different worker nodes. The
>>>>>>>>>>> original files were a few megabytes, and when I check the
>>>>>>>>>>> beginning of the stored files up to the original size and
>>>>>>>>>>> checksum, everything seems to be OK - except that these are
>>>>>>>>>>> sparse files with a size of 128TB.
>>>>>>>>>>>
>>>>>>>>>>> The original files transferred from the WN looked like:
>>>>>>>>>>>
>>>>>>>>>>> mc15_13TeV:log.18636890._001059.job.log.tgz.1 - original size 3101899, adler32 742e1336
>>>>>>>>>>> mc16_5TeV:AOD.18595963._000786.pool.root.1 - original size 737474514, adler32 dc68cb21
>>>>>>>>>>>
>>>>>>>>>>> but on our diskservers they were stored as
>>>>>>>>>>>
>>>>>>>>>>> [root@dpmpool22 ~]# ls -l /mnt/fs1/atlas/2019-07-17/log.18636890._001059.job.log.tgz.1.rucio.upload.209928.1563376157
>>>>>>>>>>> -rw-rw---- 1 dpmmgr dpmmgr 140560257079236 Jul 17 17:11 /mnt/fs1/atlas/2019-07-17/log.18636890._001059.job.log.tgz.1.rucio.upload.209928.1563376157
>>>>>>>>>>>
>>>>>>>>>>> [root@dpmpool22 ~]# du --block-size=1 /mnt/fs1/atlas/2019-07-17/log.18636890._001059.job.log.tgz.1.rucio.upload.209928.1563376157
>>>>>>>>>>> 3108864 /mnt/fs1/atlas/2019-07-17/log.18636890._001059.job.log.tgz.1.rucio.upload.209928.1563376157
>>>>>>>>>>>
>>>>>>>>>>> [root@dpmpool24 ~]# ls -l /mnt/fs2/atlas/2019-07-17/AOD.18595963._000786.pool.root.1.rucio.upload.210480.1563377083
>>>>>>>>>>> -rw-rw---- 1 dpmmgr dpmmgr 139734345056408 Jul 17 17:31 /mnt/fs2/atlas/2019-07-17/AOD.18595963._000786.pool.root.1.rucio.upload.210480.1563377083
>>>>>>>>>>>
>>>>>>>>>>> [root@dpmpool24 tmp]# du --block-size=1 /mnt/fs2/atlas/2019-07-17/AOD.18595963._000786.pool.root.1.rucio.upload.210480.1563377083
>>>>>>>>>>> 737480704 /mnt/fs2/atlas/2019-07-17/AOD.18595963._000786.pool.root.1.rucio.upload.210480.1563377083
>>>>>>>>>>>
>>>>>>>>>>> I was told that DPM is not really involved in transferring /
>>>>>>>>>>> storing files with the xrootd protocol, so I guess something
>>>>>>>>>>> bad must have happened inside the XRootD libraries. Do you have
>>>>>>>>>>> any idea why these two transfers generated such extreme files?
>>>>>>>>>>> Unfortunately our xrootd logging level is pretty low, and I was
>>>>>>>>>>> able to find only the info about file creation:
>>>>>>>>>>>
>>>>>>>>>>> [root@dpmpool22 ~]# grep log.18636890._001059.job.log.tgz /var/log/xrootd/dpmdisk/xrootd.log
>>>>>>>>>>> 190717 17:10:39 332092 ofs_Posc: Creator changed from atlasprd.2757:90@[2001:718:401:6e03::1:21] to atlasprd.2757:39@[2001:718:401:6e03::1:21] for /dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_13TeV/26/98/log.18636890._001059.job.log.tgz.1.rucio.upload
>>>>>>>>>>>
>>>>>>>>>>> [root@dpmpool24 tmp]# grep AOD.18595963._000786.pool.root /var/log/xrootd/dpmdisk/xrootd.log
>>>>>>>>>>> 190717 17:29:26 14204 ofs_Posc: Creator changed from atlasprd.2757:33@[2001:718:401:6e03::1:32] to atlasprd.2757:95@[2001:718:401:6e03::1:32] for /dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_5TeV/09/bc/AOD.18595963._000786.pool.root.1.rucio.upload
>>>>>>>>>>>
>>>>>>>>>>> Petr
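The ls/du discrepancy shown in Petr's listings (an apparent size of ~128TB
but only a few MB actually allocated) is the signature of a sparse file. A
minimal sketch for spotting such replicas on a disk server, comparing the
apparent size with the allocated blocks (the script name and the size/ratio
cut-offs are arbitrary):

    import os
    import sys

    def looks_sparse(path, min_size=1 << 40):
        """Flag files whose apparent size vastly exceeds the allocated space."""
        st = os.stat(path)
        allocated = st.st_blocks * 512        # st_blocks is in 512-byte units
        return st.st_size >= min_size and allocated < st.st_size // 1000

    for root, _dirs, files in os.walk(sys.argv[1]):
        for name in files:
            path = os.path.join(root, name)
            if os.path.isfile(path) and looks_sparse(path):
                print(path)

Usage would be something like:

    python3 find_sparse.py /mnt/fs1/atlas
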
########################################################################

Use REPLY-ALL to reply to list

To unsubscribe from the XROOTD-L list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1