Print

Print


We are using Rucio download client to transfer job outputs (it is build
on top of gfal2 python interface).

https://vokac.web.cern.ch/vokac/tmp/dpmpool22.xrootd.log
https://vokac.web.cern.ch/vokac/tmp/dpmpool24.xrootd.log

this is the full job output unfortunately with minimal gfal2 logging output.

https://vokac.web.cern.ch/vokac/tmp/grid.2258136.4.out-from-2001:718:401:6e03::1:21-to-dpmpool22.txt
https://vokac.web.cern.ch/vokac/tmp/grid.2255921.7.out-from-2001:718:401:6e03::1:32-to-dpmpool24.txt

It looks like according gfal2 transfer was OK

2019-07-17 15:09:17,782 | INFO     | copytool_out        | gfal2                            | connect                   | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_xrootd.so loaded with success 
2019-07-17 15:09:17,782 | INFO     | copytool_out        | gfal2                            | connect                   | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_rfio.so loaded with success 
2019-07-17 15:09:17,783 | INFO     | copytool_out        | gfal2                            | connect                   | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_gridftp.so loaded with success 
2019-07-17 15:09:17,783 | INFO     | copytool_out        | gfal2                            | connect                   | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_http.so loaded with success 
2019-07-17 15:09:17,783 | INFO     | copytool_out        | gfal2                            | connect                   | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_lfc.so loaded with success 
2019-07-17 15:09:17,783 | INFO     | copytool_out        | gfal2                            | connect                   | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_file.so loaded with success 
2019-07-17 15:09:17,784 | INFO     | copytool_out        | gfal2                            | connect                   | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_sftp.so loaded with success 
2019-07-17 15:09:17,784 | INFO     | copytool_out        | gfal2                            | connect                   | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_dcap.so loaded with success 
2019-07-17 15:09:17,784 | INFO     | copytool_out        | gfal2                            | connect                   | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_srm.so loaded with success 
2019-07-17 15:09:17,793 | INFO     | copytool_out        | gfal2                            | connect                   | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_xrootd.so loaded with success 
2019-07-17 15:09:17,793 | INFO     | copytool_out        | gfal2                            | connect                   | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_rfio.so loaded with success 
2019-07-17 15:09:17,793 | INFO     | copytool_out        | gfal2                            | connect                   | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_gridftp.so loaded with success 
2019-07-17 15:09:17,794 | INFO     | copytool_out        | gfal2                            | connect                   | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_http.so loaded with success 
2019-07-17 15:09:17,794 | INFO     | copytool_out        | gfal2                            | connect                   | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_lfc.so loaded with success 
2019-07-17 15:09:17,794 | INFO     | copytool_out        | gfal2                            | connect                   | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_file.so loaded with success 
2019-07-17 15:09:17,794 | INFO     | copytool_out        | gfal2                            | connect                   | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_sftp.so loaded with success 
2019-07-17 15:09:17,794 | INFO     | copytool_out        | gfal2                            | connect                   | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_dcap.so loaded with success 
2019-07-17 15:09:17,794 | INFO     | copytool_out        | gfal2                            | connect                   | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_srm.so loaded with success 
2019-07-17 15:09:17,795 | INFO     | copytool_out        | gfal2                            | connect                   | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_xrootd.so loaded with success 
2019-07-17 15:09:17,795 | INFO     | copytool_out        | gfal2                            | connect                   | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_rfio.so loaded with success 
2019-07-17 15:09:17,795 | INFO     | copytool_out        | gfal2                            | connect                   | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_gridftp.so loaded with success 
2019-07-17 15:09:17,795 | INFO     | copytool_out        | gfal2                            | connect                   | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_http.so loaded with success 
2019-07-17 15:09:17,796 | INFO     | copytool_out        | gfal2                            | connect                   | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_lfc.so loaded with success 
2019-07-17 15:09:17,796 | INFO     | copytool_out        | gfal2                            | connect                   | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_file.so loaded with success 
2019-07-17 15:09:17,796 | INFO     | copytool_out        | gfal2                            | connect                   | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_sftp.so loaded with success 
2019-07-17 15:09:17,796 | INFO     | copytool_out        | gfal2                            | connect                   | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_dcap.so loaded with success 
2019-07-17 15:09:17,796 | INFO     | copytool_out        | gfal2                            | connect                   | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_srm.so loaded with success 
2019-07-17 15:09:17,810 | INFO     | copytool_out        | gfal2                            | __gfal2_copy              | Event triggered: BOTH GFAL2:CORE:COPY LIST:ENTER 
2019-07-17 15:09:17,810 | INFO     | copytool_out        | gfal2                            | __gfal2_copy              | Event triggered: BOTH GFAL2:CORE:COPY LIST:ITEM file:///scratch/condor/dir_24031/atlas_HudmF2he/PanDA_Pilot-4419868006/log.18636890._001059.job.log.tgz.1 => root://golias100.farm.particle.cz:1094//dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_13TeV/26/98/log.18636890._001059.job.log.tgz.1.rucio.upload
2019-07-17 15:09:17,810 | INFO     | copytool_out        | gfal2                            | __gfal2_copy              | Event triggered: BOTH GFAL2:CORE:COPY LIST:EXIT 
2019-07-17 15:09:17,811 | INFO     | Dummy-2             | gfal2                            | (unknown function)        | Event triggered: BOTH xroot TRANSFER:ENTER file://localhost///scratch/condor/dir_24031/atlas_HudmF2he/PanDA_Pilot-4419868006/log.18636890._001059.job.log.tgz.1?xrd.gsiusrpxy=/scratch/condor/dir_24031/job.9XONDm6yx7unnoBGSqYX2MjnABFKDmABFKDmSyhTDmEBFKDmqAGN3n.proxy&xrd.wantprot=gsi,unix => root://golias100.farm.particle.cz:1094///dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_13TeV/26/98/log.18636890._001059.job.log.tgz.1.rucio.upload?xrd.gsiusrpxy=/scratch/condor/dir_24031/job.9XONDm6yx7unnoBGSqYX2MjnABFKDmABFKDmSyhTDmEBFKDmqAGN3n.proxy&x
2019-07-17 15:09:17,811 | INFO     | Dummy-2             | gfal2                            | (unknown function)        | Event triggered: BOTH xroot TRANSFER:TYPE streamed
2019-07-17 15:11:23,914 | INFO     | Dummy-2             | gfal2                            | (unknown function)        | Event triggered: BOTH xroot TRANSFER:EXIT Job finished, [SUCCESS] 

but probably rucio tries to verify checksum (gfal-sum via python
bindings) and it fails

2019-07-17 15:15:23,774 | INFO     | copytool_out        | pilot.copytool.rucio             | copy_out                  | stderr = An unknown exception occurred.
Details: Checksum not validated

because DPM is unable to calculate checksum for 128TB file within timeout.

Petr

On 7/19/19 5:07 AM, Andrew Hanushevsky wrote:
> Hi Petr,
>
> Indeed, it looks like what was used to upload the files (xrdcp?)
> created a very large sparse file by writing with a very large offset.
> The clue here is that an anomaly occurred during the write and these
> two messages (lkely each corresponding to each file) tell us
> something....
>
>> 190717 17:10:39 332092 ofs_Posc: Creator changed from
>   atlasprd.2757:90@[2001:718:401:6e03::1:21] to
>   atlasprd.2757:39@[2001:718:401:6e03::1:21] for
>   /dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_13TeV/26/98/
>   log.18636890._001059.job.log.tgz.1.rucio.upload
>
>> 190717 17:29:26 14204 ofs_Posc: Creator changed from
>   atlasprd.2757:33@[2001:718:401:6e03::1:32] to
>   atlasprd.2757:95@[2001:718:401:6e03::1:32] for
>   /dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_5TeV/09/bc/
>   AOD.18595963._000786.pool.root.1.rucio.upload
>
> What the POSC system is saying is that in each case the original
> creator of the file lost contact with the server. It then reconnected
> and resumed writing with a successful close. POSC allows this to avoid
> network glitches from deleting a file because a network glitch will
> cause the client to reconnect and resume where it left off. This is
> only allowed when the resuming client comes from the same IP address,
> has the same process ID, and the same logical name. In this case it
> was "atlasprd.2757". I find that rather odd because it's next to
> impossible for two processes on two different machines to have the
> same process number (i.e. 2757). So, this is why I am wondering if
> this was really an xrdcp or some other program.
>
> We should look to see what these client were actually doing during
> this period as they clearly did not do the right thing when they
> reconnected. So, could you grep out all interactions of these two
> client from the log. Something like:
>
> grep "atlasprd.2757:" <logfile> | egrep
> '2001:718:401:6e03::1:21|2001:718:401:6e03::1:32' > <excerpt>
>
> and send the excerpt to me?
>
> Andy
>
>
> On Thu, 18 Jul 2019, Petr Vokac wrote:
>
>> Hi,
>>
>> our DPM which is using native xrootd 4.9.1 yesterday created two
>> different 128TB files on two different diskservers that were received as
>> a job output from two different worker nodes. The size of original file
>> was few megabytes and when I check beginning of stored files up to
>> original size & checksum everything seems to be OK - except these are
>> sparse files with size 128TB. Original transferred files from WN looked
>> like:
>>
>> mc15_13TeV:log.18636890._001059.job.log.tgz.1 - original size
>> 3101899, adler32 742e1336
>> mc16_5TeV:AOD.18595963._000786.pool.root.1 - original size 737474514,
>> adler32 dc68cb21
>>
>> but on our diskservers they were stored as
>>
>> [root@dpmpool22 ~]# ls -l
>> /mnt/fs1/atlas/2019-07-17/log.18636890._001059.job.log.tgz.1.rucio.upload.209928.1563376157
>> -rw-rw---- 1 dpmmgr dpmmgr 140560257079236 Jul 17 17:11
>> /mnt/fs1/atlas/2019-07-17/log.18636890._001059.job.log.tgz.1.rucio.upload.209928.1563376157
>> [root@dpmpool22 ~]# du --block-size=1
>> /mnt/fs1/atlas/2019-07-17/log.18636890._001059.job.log.tgz.1.rucio.upload.209928.1563376157
>> 3108864   
>> /mnt/fs1/atlas/2019-07-17/log.18636890._001059.job.log.tgz.1.rucio.upload.209928.1563376157
>>
>> [root@dpmpool24 ~]# ls -l
>> /mnt/fs2/atlas/2019-07-17/AOD.18595963._000786.pool.root.1.rucio.upload.210480.1563377083
>> -rw-rw---- 1 dpmmgr dpmmgr 139734345056408 Jul 17 17:31
>> /mnt/fs2/atlas/2019-07-17/AOD.18595963._000786.pool.root.1.rucio.upload.210480.1563377083
>> [root@dpmpool24 tmp]# du --block-size=1
>> /mnt/fs2/atlas/2019-07-17/AOD.18595963._000786.pool.root.1.rucio.upload.210480.1563377083
>> 737480704   
>> /mnt/fs2/atlas/2019-07-17/AOD.18595963._000786.pool.root.1.rucio.upload.210480.1563377083
>>
>> I was told that DPM is not really involved in transfering / storing
>> files with xrootd protocol, so I guess something bad must happened
>> inside XRootD libraries. Do you have any idea why these two transfers
>> generated such extreme files? Unfortunately our xrootd logging is pretty
>> low and I was able to find only info about file creation
>>
>> [root@dpmpool22 ~]# grep log.18636890._001059.job.log.tgz
>> /var/log/xrootd/dpmdisk/xrootd.log
>> 190717 17:10:39 332092 ofs_Posc: Creator changed from
>   atlasprd.2757:90@[2001:718:401:6e03::1:21] to
>   atlasprd.2757:39@[2001:718:401:6e03::1:21] for
>   /dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_13TeV/26/98/
>   log.18636890._001059.job.log.tgz.1.rucio.upload
>>
>> [root@dpmpool24 tmp]# grep AOD.18595963._000786.pool.root
>> /var/log/xrootd/dpmdisk/xrootd.log
>> 190717 17:29:26 14204 ofs_Posc: Creator changed from
>   atlasprd.2757:33@[2001:718:401:6e03::1:32] to
>   atlasprd.2757:95@[2001:718:401:6e03::1:32] for
>  
> /dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_5TeV/09/bc/AOD.18595963.
>   _000786.pool.root.1.rucio.upload
>>
>>
>> Petr
>>
>> ########################################################################
>> Use REPLY-ALL to reply to list
>>
>> To unsubscribe from the XROOTD-L list, click the following link:
>> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1
>>
>
> ########################################################################
> Use REPLY-ALL to reply to list
>
> To unsubscribe from the XROOTD-L list, click the following link:
> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1


########################################################################
Use REPLY-ALL to reply to list

To unsubscribe from the XROOTD-L list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1