Hi Petr,

The issue here is that the python script (I believe it's python) is 
disconnecting without closing the file and then reconnecting and writing 
at a high offset. The log is pretty clear on that. You can verify this by 
adding "ofs.trace write" to see the write offsets (though it will produce 
a lot of output). You can also prohibit reconnection by adding this to 
the persist directive:

ofs.persist hold 0

which would delete the file the moment the client disconnects without 
closing the file.
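
For reference, a minimal sketch of how that could look in the disk server 
configuration (the file name below is only an example; adjust it to 
whatever config your disk servers actually load):

   # e.g. added to /etc/xrootd/xrootd-dpmdisk.cfg (file name is an assumption)
   # show every write offset - very verbose, enable only while debugging
   ofs.trace write
   # delete the file the moment its creator disconnects without closing it
   ofs.persist hold 0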

Andy

On Sat, 17 Aug 2019, Petr Vokac wrote:

> Today a new "65PB" file appeared on our storage and it was again
> transferred with the xrootd protocol. I can again see the same error in
> the log file:
>
> [root@dpmpool21 ~]# ls -la /mnt/fs4/atlas/2019-08-17/EVNT.18893579._003197.pool.root.1.rucio.upload.234855.1566031522
> -rw-rw---- 1 dpmmgr dpmmgr 72340172838077281 Aug 17 10:48 /mnt/fs4/atlas/2019-08-17/EVNT.18893579._003197.pool.root.1.rucio.upload.234855.1566031522
> [root@dpmpool21 ~]# grep atlasprd.2757 /var/log/xrootd/dpmdisk/xrootd.log
> ...
> 190817 10:47:55 57803 XrootdXeq: atlasprd.2757:65@[2001:718:401:6e03::1:43] pub IP64 login as /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1
> 190817 10:47:55 61859 XrootdXeq: atlasprd.2757:66@[2001:718:401:6e03::1:83] pub IP64 login as /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1
> 190817 10:48:52 62233 XrootdXeq: atlasprd.2757:46@[2001:718:401:6e03::1:72] disc 0:00:58
> 190817 10:48:52 62954 XrootdXeq: atlasprd.2757:25@[2001:718:401:6e03::1:72] disc 0:03:30
> 190817 10:48:52 57804 XrootdXeq: atlasprd.2757:28@[2001:718:401:6e03::1:72] pub IP64 login as /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1
> 190817 10:48:52 57804 ofs_Posc: Creator changed from atlasprd.2757:25@[2001:718:401:6e03::1:72] to atlasprd.2757:28@[2001:718:401:6e03::1:72] for /dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_13TeV/8b/f3/EVNT.18893579._003197.pool.root.1.rucio.upload
> 190817 10:48:53 57803 XrootdXeq: atlasprd.2757:46@[2001:718:401:6017:20:0:14:44] pub IP64 login as /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1
> 190817 10:49:01 57818 XrootdXeq: atlasprd.2757:70@[2001:718:401:6017:20:0:11:11] pub IP64 login as /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1
> 190817 10:49:07 62954 XrootdXeq: atlasprd.2757:55@[2001:718:401:6e03::1:81] disc 0:10:17
> ...
>
> Any suggestion on how to trace what is causing this behavior?
>
> Petr
>
>
> On 7/20/19 7:48 AM, Petr Vokac wrote:
>> Hi,
>>
>> I ran the command you sent me
>>
>> grep "atlasprd.2757:" <logfile> | egrep 
>> '2001:718:401:6e03::1:21|2001:718:401:6e03::1:32' > <excerpt>
>>
>> and because the output was not just a few lines long I uploaded it here:
>>
>> https://vokac.web.cern.ch/vokac/tmp/dpmpool22.xrootd.log
>> https://vokac.web.cern.ch/vokac/tmp/dpmpool24.xrootd.log
>>
>> These machines on the 2001:718:401:6e03::/64 subnet are in a remote
>> location (but relatively close, with RTT ~ 0.7 ms). I can see several
>> ofs_Posc occurrences in the logs of our disk nodes during the "190717
>> 17:xx:xx" time period:
>>
>> http://vokac.web.cern.ch/vokac/tmp/POSC.txt
>>
>> Petr
>>
>> On 7/19/19 8:51 PM, Andrew Hanushevsky wrote:
>>> Hi Petr,
>>>
>>> Well, POSC says that the client reconnected for writing so not all
>>> went well. Can I get the log output for that occurrence as I
>>> previously described?
>>>
>>> Andy
>>>
>>> On Fri, 19 Jul 2019, Petr Vokac wrote:
>>>
>>>> We are using the Rucio download client to transfer job outputs (it is
>>>> built on top of the gfal2 python interface).
>>>>
>>>> https://vokac.web.cern.ch/vokac/tmp/dpmpool22.xrootd.log
>>>> https://vokac.web.cern.ch/vokac/tmp/dpmpool24.xrootd.log
>>>>
>>>> This is the full job output, unfortunately with minimal gfal2 logging
>>>> output:
>>>>
>>>> https://vokac.web.cern.ch/vokac/tmp/grid.2258136.4.out-from-2001:718:401:6e03::1:21-to-dpmpool22.txt
>>>>
>>>> https://vokac.web.cern.ch/vokac/tmp/grid.2255921.7.out-from-2001:718:401:6e03::1:32-to-dpmpool24.txt
>>>>
>>>>
>>>> It looks like, according to gfal2, the transfer was OK:
>>>>
>>>> 2019-07-17 15:09:17,782 | INFO     | copytool_out        |
>>>> gfal2                            | connect                   |
>>>> [gfal_module_load] plugin
>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_xrootd.so loaded with success
>>>> 2019-07-17 15:09:17,782 | INFO     | copytool_out        |
>>>> gfal2                            | connect                   |
>>>> [gfal_module_load] plugin
>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_rfio.so loaded with success
>>>> 2019-07-17 15:09:17,783 | INFO     | copytool_out        |
>>>> gfal2                            | connect                   |
>>>> [gfal_module_load] plugin
>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_gridftp.so loaded with success
>>>> 2019-07-17 15:09:17,783 | INFO     | copytool_out        |
>>>> gfal2                            | connect                   |
>>>> [gfal_module_load] plugin
>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_http.so loaded with success
>>>> 2019-07-17 15:09:17,783 | INFO     | copytool_out        |
>>>> gfal2                            | connect                   |
>>>> [gfal_module_load] plugin
>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_lfc.so loaded with success
>>>> 2019-07-17 15:09:17,783 | INFO     | copytool_out        |
>>>> gfal2                            | connect                   |
>>>> [gfal_module_load] plugin
>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_file.so loaded with success
>>>> 2019-07-17 15:09:17,784 | INFO     | copytool_out        |
>>>> gfal2                            | connect                   |
>>>> [gfal_module_load] plugin
>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_sftp.so loaded with success
>>>> 2019-07-17 15:09:17,784 | INFO     | copytool_out        |
>>>> gfal2                            | connect                   |
>>>> [gfal_module_load] plugin
>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_dcap.so loaded with success
>>>> 2019-07-17 15:09:17,784 | INFO     | copytool_out        |
>>>> gfal2                            | connect                   |
>>>> [gfal_module_load] plugin
>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_srm.so loaded with success
>>>> 2019-07-17 15:09:17,793 | INFO     | copytool_out        |
>>>> gfal2                            | connect                   |
>>>> [gfal_module_load] plugin
>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_xrootd.so loaded with success
>>>> 2019-07-17 15:09:17,793 | INFO     | copytool_out        |
>>>> gfal2                            | connect                   |
>>>> [gfal_module_load] plugin
>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_rfio.so loaded with success
>>>> 2019-07-17 15:09:17,793 | INFO     | copytool_out        |
>>>> gfal2                            | connect                   |
>>>> [gfal_module_load] plugin
>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_gridftp.so loaded with success
>>>> 2019-07-17 15:09:17,794 | INFO     | copytool_out        |
>>>> gfal2                            | connect                   |
>>>> [gfal_module_load] plugin
>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_http.so loaded with success
>>>> 2019-07-17 15:09:17,794 | INFO     | copytool_out        |
>>>> gfal2                            | connect                   |
>>>> [gfal_module_load] plugin
>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_lfc.so loaded with success
>>>> 2019-07-17 15:09:17,794 | INFO     | copytool_out        |
>>>> gfal2                            | connect                   |
>>>> [gfal_module_load] plugin
>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_file.so loaded with success
>>>> 2019-07-17 15:09:17,794 | INFO     | copytool_out        |
>>>> gfal2                            | connect                   |
>>>> [gfal_module_load] plugin
>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_sftp.so loaded with success
>>>> 2019-07-17 15:09:17,794 | INFO     | copytool_out        |
>>>> gfal2                            | connect                   |
>>>> [gfal_module_load] plugin
>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_dcap.so loaded with success
>>>> 2019-07-17 15:09:17,794 | INFO     | copytool_out        |
>>>> gfal2                            | connect                   |
>>>> [gfal_module_load] plugin
>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_srm.so loaded with success
>>>> 2019-07-17 15:09:17,795 | INFO     | copytool_out        |
>>>> gfal2                            | connect                   |
>>>> [gfal_module_load] plugin
>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_xrootd.so loaded with success
>>>> 2019-07-17 15:09:17,795 | INFO     | copytool_out        |
>>>> gfal2                            | connect                   |
>>>> [gfal_module_load] plugin
>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_rfio.so loaded with success
>>>> 2019-07-17 15:09:17,795 | INFO     | copytool_out        |
>>>> gfal2                            | connect                   |
>>>> [gfal_module_load] plugin
>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_gridftp.so loaded with success
>>>> 2019-07-17 15:09:17,795 | INFO     | copytool_out        |
>>>> gfal2                            | connect                   |
>>>> [gfal_module_load] plugin
>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_http.so loaded with success
>>>> 2019-07-17 15:09:17,796 | INFO     | copytool_out        |
>>>> gfal2                            | connect                   |
>>>> [gfal_module_load] plugin
>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_lfc.so loaded with success
>>>> 2019-07-17 15:09:17,796 | INFO     | copytool_out        |
>>>> gfal2                            | connect                   |
>>>> [gfal_module_load] plugin
>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_file.so loaded with success
>>>> 2019-07-17 15:09:17,796 | INFO     | copytool_out        |
>>>> gfal2                            | connect                   |
>>>> [gfal_module_load] plugin
>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_sftp.so loaded with success
>>>> 2019-07-17 15:09:17,796 | INFO     | copytool_out        |
>>>> gfal2                            | connect                   |
>>>> [gfal_module_load] plugin
>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_dcap.so loaded with success
>>>> 2019-07-17 15:09:17,796 | INFO     | copytool_out        |
>>>> gfal2                            | connect                   |
>>>> [gfal_module_load] plugin
>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_srm.so loaded with success
>>>> 2019-07-17 15:09:17,810 | INFO     | copytool_out        |
>>>> gfal2                            | __gfal2_copy              | Event
>>>> triggered: BOTH GFAL2:CORE:COPY LIST:ENTER
>>>> 2019-07-17 15:09:17,810 | INFO     | copytool_out        |
>>>> gfal2                            | __gfal2_copy              | Event
>>>> triggered: BOTH GFAL2:CORE:COPY LIST:ITEM
>>>> file:///scratch/condor/dir_24031/atlas_HudmF2he/PanDA_Pilot-4419868006/log.18636890._001059.job.log.tgz.1
>>>> =>
>>>> root://golias100.farm.particle.cz:1094//dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_13TeV/26/98/log.18636890._001059.job.log.tgz.1.rucio.upload
>>>> 2019-07-17 15:09:17,810 | INFO     | copytool_out        |
>>>> gfal2                            | __gfal2_copy              | Event
>>>> triggered: BOTH GFAL2:CORE:COPY LIST:EXIT
>>>> 2019-07-17 15:09:17,811 | INFO     | Dummy-2             |
>>>> gfal2                            | (unknown function)        | Event
>>>> triggered: BOTH xroot TRANSFER:ENTER
>>>> file://localhost///scratch/condor/dir_24031/atlas_HudmF2he/PanDA_Pilot-4419868006/log.18636890._001059.job.log.tgz.1?xrd.gsiusrpxy=/scratch/condor/dir_24031/job.9XONDm6yx7unnoBGSqYX2MjnABFKDmABFKDmSyhTDmEBFKDmqAGN3n.proxy&xrd.wantprot=gsi,unix
>>>> =>
>>>> root://golias100.farm.particle.cz:1094///dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_13TeV/26/98/log.18636890._001059.job.log.tgz.1.rucio.upload?xrd.gsiusrpxy=/scratch/condor/dir_24031/job.9XONDm6yx7unnoBGSqYX2MjnABFKDmABFKDmSyhTDmEBFKDmqAGN3n.proxy&x
>>>> 2019-07-17 15:09:17,811 | INFO     | Dummy-2             |
>>>> gfal2                            | (unknown function)        | Event
>>>> triggered: BOTH xroot TRANSFER:TYPE streamed
>>>> 2019-07-17 15:11:23,914 | INFO     | Dummy-2             |
>>>> gfal2                            | (unknown function)        | Event
>>>> triggered: BOTH xroot TRANSFER:EXIT Job finished, [SUCCESS]
>>>>
>>>> but rucio probably tries to verify the checksum (gfal-sum via python
>>>> bindings) and it fails
>>>>
>>>> 2019-07-17 15:15:23,774 | INFO     | copytool_out        |
>>>> pilot.copytool.rucio             | copy_out                  |
>>>> stderr = An unknown exception occurred.
>>>> Details: Checksum not validated
>>>>
>>>> because DPM is unable to calculate the checksum for a 128TB file
>>>> within the timeout.
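>>>>
>>>> (For reference, the same check can be reproduced by hand; the command
>>>> below is only an illustrative sketch using gfal2-util against the
>>>> destination URL from the transfer log above:
>>>>
>>>> gfal-sum root://golias100.farm.particle.cz:1094//dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_13TeV/26/98/log.18636890._001059.job.log.tgz.1.rucio.upload ADLER32
>>>>
>>>> which should either print the adler32 checksum or time out the same
>>>> way the python bindings do.)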
>>>>
>>>> Petr
>>>>
>>>> On 7/19/19 5:07 AM, Andrew Hanushevsky wrote:
>>>>> Hi Petr,
>>>>>
>>>>> Indeed, it looks like what was used to upload the files (xrdcp?)
>>>>> created a very large sparse file by writing with a very large offset.
>>>>> The clue here is that an anomaly occurred during the write and these
>>>>> two messages (likely one for each file) tell us something....
>>>>>
>>>>>> 190717 17:10:39 332092 ofs_Posc: Creator changed from atlasprd.2757:90@[2001:718:401:6e03::1:21] to atlasprd.2757:39@[2001:718:401:6e03::1:21] for /dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_13TeV/26/98/log.18636890._001059.job.log.tgz.1.rucio.upload
>>>>>
>>>>>> 190717 17:29:26 14204 ofs_Posc: Creator changed from atlasprd.2757:33@[2001:718:401:6e03::1:32] to atlasprd.2757:95@[2001:718:401:6e03::1:32] for /dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_5TeV/09/bc/AOD.18595963._000786.pool.root.1.rucio.upload
>>>>>
>>>>> What the POSC system is saying is that in each case the original
>>>>> creator of the file lost contact with the server. It then reconnected
>>>>> and resumed writing with a successful close. POSC allows this so that
>>>>> a network glitch does not get a file deleted; the client simply
>>>>> reconnects and resumes where it left off. This is only allowed when
>>>>> the resuming client comes from the same IP address, has the same
>>>>> process ID, and the same logical name. In this case it was
>>>>> "atlasprd.2757". I find that rather odd because it's next to
>>>>> impossible for two processes on two different machines to have the
>>>>> same process number (i.e. 2757). So, this is why I am wondering if
>>>>> this was really an xrdcp or some other program.
>>>>>
>>>>> We should look to see what these clients were actually doing during
>>>>> this period as they clearly did not do the right thing when they
>>>>> reconnected. So, could you grep out all interactions of these two
>>>>> clients from the log? Something like:
>>>>>
>>>>> grep "atlasprd.2757:" <logfile> | egrep
>>>>> '2001:718:401:6e03::1:21|2001:718:401:6e03::1:32' > <excerpt>
>>>>>
>>>>> and send the excerpt to me?
>>>>>
>>>>> Andy
>>>>>
>>>>>
>>>>> On Thu, 18 Jul 2019, Petr Vokac wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> our DPM, which is using native xrootd 4.9.1, yesterday created two
>>>>>> different 128TB files on two different diskservers; they were
>>>>>> received as job outputs from two different worker nodes. The size of
>>>>>> the original file was a few megabytes, and when I check the beginning
>>>>>> of the stored files up to the original size & checksum, everything
>>>>>> seems to be OK - except these are sparse files with a size of 128TB.
>>>>>> The original files transferred from the WNs looked like:
>>>>>>
>>>>>> mc15_13TeV:log.18636890._001059.job.log.tgz.1 - original size
>>>>>> 3101899, adler32 742e1336
>>>>>> mc16_5TeV:AOD.18595963._000786.pool.root.1 - original size 737474514,
>>>>>> adler32 dc68cb21
>>>>>>
>>>>>> but on our diskservers they were stored as
>>>>>>
>>>>>> [root@dpmpool22 ~]# ls -l
>>>>>> /mnt/fs1/atlas/2019-07-17/log.18636890._001059.job.log.tgz.1.rucio.upload.209928.1563376157
>>>>>>
>>>>>> -rw-rw---- 1 dpmmgr dpmmgr 140560257079236 Jul 17 17:11
>>>>>> /mnt/fs1/atlas/2019-07-17/log.18636890._001059.job.log.tgz.1.rucio.upload.209928.1563376157
>>>>>>
>>>>>> [root@dpmpool22 ~]# du --block-size=1
>>>>>> /mnt/fs1/atlas/2019-07-17/log.18636890._001059.job.log.tgz.1.rucio.upload.209928.1563376157
>>>>>>
>>>>>> 3108864   
>>>>>> /mnt/fs1/atlas/2019-07-17/log.18636890._001059.job.log.tgz.1.rucio.upload.209928.1563376157
>>>>>>
>>>>>>
>>>>>> [root@dpmpool24 ~]# ls -l
>>>>>> /mnt/fs2/atlas/2019-07-17/AOD.18595963._000786.pool.root.1.rucio.upload.210480.1563377083
>>>>>>
>>>>>> -rw-rw---- 1 dpmmgr dpmmgr 139734345056408 Jul 17 17:31
>>>>>> /mnt/fs2/atlas/2019-07-17/AOD.18595963._000786.pool.root.1.rucio.upload.210480.1563377083
>>>>>>
>>>>>> [root@dpmpool24 tmp]# du --block-size=1
>>>>>> /mnt/fs2/atlas/2019-07-17/AOD.18595963._000786.pool.root.1.rucio.upload.210480.1563377083
>>>>>>
>>>>>> 737480704   
>>>>>> /mnt/fs2/atlas/2019-07-17/AOD.18595963._000786.pool.root.1.rucio.upload.210480.1563377083
>>>>>>
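>>>>>> (As an illustration, one way to see the apparent size and the
>>>>>> actually allocated space side by side is plain GNU stat, e.g.:
>>>>>>
>>>>>> stat --format='%s bytes apparent, %b blocks of %B bytes allocated' /mnt/fs1/atlas/2019-07-17/log.18636890._001059.job.log.tgz.1.rucio.upload.209928.1563376157
>>>>>>
>>>>>> which confirms that only a few MB are really on disk.)
>>>>>>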
>>>>>>
>>>>>> I was told that DPM is not really involved in transferring / storing
>>>>>> files with the xrootd protocol, so I guess something bad must have
>>>>>> happened inside the XRootD libraries. Do you have any idea why these
>>>>>> two transfers generated such extreme files? Unfortunately our xrootd
>>>>>> logging level is pretty low and I was only able to find info about
>>>>>> the file creation:
>>>>>>
>>>>>> [root@dpmpool22 ~]# grep log.18636890._001059.job.log.tgz
>>>>>> /var/log/xrootd/dpmdisk/xrootd.log
>>>>>> 190717 17:10:39 332092 ofs_Posc: Creator changed from atlasprd.2757:90@[2001:718:401:6e03::1:21] to atlasprd.2757:39@[2001:718:401:6e03::1:21] for /dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_13TeV/26/98/log.18636890._001059.job.log.tgz.1.rucio.upload
>>>>>>
>>>>>> [root@dpmpool24 tmp]# grep AOD.18595963._000786.pool.root
>>>>>> /var/log/xrootd/dpmdisk/xrootd.log
>>>>>> 190717 17:29:26 14204 ofs_Posc: Creator changed from atlasprd.2757:33@[2001:718:401:6e03::1:32] to atlasprd.2757:95@[2001:718:401:6e03::1:32] for /dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_5TeV/09/bc/AOD.18595963._000786.pool.root.1.rucio.upload
>>>>>>
>>>>>>
>>>>>> Petr
>>>>>>
>>>>
>>>>
>>>
>>
>>
>>
>
>

########################################################################
Use REPLY-ALL to reply to list

To unsubscribe from the XROOTD-L list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1