Hi, I ran the command you sent me grep "atlasprd.2757:" <logfile> | egrep '2001:718:401:6e03::1:21|2001:718:401:6e03::1:32' > <excerpt> and because the output was not just few lines long I uploaded it to the https://vokac.web.cern.ch/vokac/tmp/dpmpool22.xrootd.log https://vokac.web.cern.ch/vokac/tmp/dpmpool24.xrootd.log These machines on 2001:718:401:6e03::/64 subnet are in remote location (but relatively close with RTT ~ 0.7ms). I can see several ofs_Posc occurrences in the logs of our disknodes during "190717 17:xx:xx" time period: http://vokac.web.cern.ch/vokac/tmp/POSC.txt Petr On 7/19/19 8:51 PM, Andrew Hanushevsky wrote: > Hi Petr, > > Well, POSC says that the client reconnected for writing so not all > went well. Can I get the log output for that occurrence as I > previously described? > > Andy > > On Fri, 19 Jul 2019, Petr Vokac wrote: > >> We are using Rucio download client to transfer job outputs (it is build >> on top of gfal2 python interface). >> >> https://vokac.web.cern.ch/vokac/tmp/dpmpool22.xrootd.log >> https://vokac.web.cern.ch/vokac/tmp/dpmpool24.xrootd.log >> >> this is the full job output unfortunately with minimal gfal2 logging >> output. >> >> https://vokac.web.cern.ch/vokac/tmp/grid.2258136.4.out-from-2001:718:401:6e03::1:21-to-dpmpool22.txt >> >> https://vokac.web.cern.ch/vokac/tmp/grid.2255921.7.out-from-2001:718:401:6e03::1:32-to-dpmpool24.txt >> >> >> It looks like according gfal2 transfer was OK >> >> 2019-07-17 15:09:17,782 | INFO | copytool_out | >> gfal2 | connect | >> [gfal_module_load] plugin >> /usr/lib64/gfal2-plugins//libgfal_plugin_xrootd.so loaded with success >> 2019-07-17 15:09:17,782 | INFO | copytool_out | >> gfal2 | connect | >> [gfal_module_load] plugin >> /usr/lib64/gfal2-plugins//libgfal_plugin_rfio.so loaded with success >> 2019-07-17 15:09:17,783 | INFO | copytool_out | >> gfal2 | connect | >> [gfal_module_load] plugin >> /usr/lib64/gfal2-plugins//libgfal_plugin_gridftp.so loaded with success >> 2019-07-17 15:09:17,783 | INFO | copytool_out | >> gfal2 | connect | >> [gfal_module_load] plugin >> /usr/lib64/gfal2-plugins//libgfal_plugin_http.so loaded with success >> 2019-07-17 15:09:17,783 | INFO | copytool_out | >> gfal2 | connect | >> [gfal_module_load] plugin >> /usr/lib64/gfal2-plugins//libgfal_plugin_lfc.so loaded with success >> 2019-07-17 15:09:17,783 | INFO | copytool_out | >> gfal2 | connect | >> [gfal_module_load] plugin >> /usr/lib64/gfal2-plugins//libgfal_plugin_file.so loaded with success >> 2019-07-17 15:09:17,784 | INFO | copytool_out | >> gfal2 | connect | >> [gfal_module_load] plugin >> /usr/lib64/gfal2-plugins//libgfal_plugin_sftp.so loaded with success >> 2019-07-17 15:09:17,784 | INFO | copytool_out | >> gfal2 | connect | >> [gfal_module_load] plugin >> /usr/lib64/gfal2-plugins//libgfal_plugin_dcap.so loaded with success >> 2019-07-17 15:09:17,784 | INFO | copytool_out | >> gfal2 | connect | >> [gfal_module_load] plugin >> /usr/lib64/gfal2-plugins//libgfal_plugin_srm.so loaded with success >> 2019-07-17 15:09:17,793 | INFO | copytool_out | >> gfal2 | connect | >> [gfal_module_load] plugin >> /usr/lib64/gfal2-plugins//libgfal_plugin_xrootd.so loaded with success >> 2019-07-17 15:09:17,793 | INFO | copytool_out | >> gfal2 | connect | >> [gfal_module_load] plugin >> /usr/lib64/gfal2-plugins//libgfal_plugin_rfio.so loaded with success >> 2019-07-17 15:09:17,793 | INFO | copytool_out | >> gfal2 | connect | >> [gfal_module_load] plugin >> /usr/lib64/gfal2-plugins//libgfal_plugin_gridftp.so loaded with success >> 2019-07-17 15:09:17,794 | INFO | copytool_out | >> gfal2 | connect | >> [gfal_module_load] plugin >> /usr/lib64/gfal2-plugins//libgfal_plugin_http.so loaded with success >> 2019-07-17 15:09:17,794 | INFO | copytool_out | >> gfal2 | connect | >> [gfal_module_load] plugin >> /usr/lib64/gfal2-plugins//libgfal_plugin_lfc.so loaded with success >> 2019-07-17 15:09:17,794 | INFO | copytool_out | >> gfal2 | connect | >> [gfal_module_load] plugin >> /usr/lib64/gfal2-plugins//libgfal_plugin_file.so loaded with success >> 2019-07-17 15:09:17,794 | INFO | copytool_out | >> gfal2 | connect | >> [gfal_module_load] plugin >> /usr/lib64/gfal2-plugins//libgfal_plugin_sftp.so loaded with success >> 2019-07-17 15:09:17,794 | INFO | copytool_out | >> gfal2 | connect | >> [gfal_module_load] plugin >> /usr/lib64/gfal2-plugins//libgfal_plugin_dcap.so loaded with success >> 2019-07-17 15:09:17,794 | INFO | copytool_out | >> gfal2 | connect | >> [gfal_module_load] plugin >> /usr/lib64/gfal2-plugins//libgfal_plugin_srm.so loaded with success >> 2019-07-17 15:09:17,795 | INFO | copytool_out | >> gfal2 | connect | >> [gfal_module_load] plugin >> /usr/lib64/gfal2-plugins//libgfal_plugin_xrootd.so loaded with success >> 2019-07-17 15:09:17,795 | INFO | copytool_out | >> gfal2 | connect | >> [gfal_module_load] plugin >> /usr/lib64/gfal2-plugins//libgfal_plugin_rfio.so loaded with success >> 2019-07-17 15:09:17,795 | INFO | copytool_out | >> gfal2 | connect | >> [gfal_module_load] plugin >> /usr/lib64/gfal2-plugins//libgfal_plugin_gridftp.so loaded with success >> 2019-07-17 15:09:17,795 | INFO | copytool_out | >> gfal2 | connect | >> [gfal_module_load] plugin >> /usr/lib64/gfal2-plugins//libgfal_plugin_http.so loaded with success >> 2019-07-17 15:09:17,796 | INFO | copytool_out | >> gfal2 | connect | >> [gfal_module_load] plugin >> /usr/lib64/gfal2-plugins//libgfal_plugin_lfc.so loaded with success >> 2019-07-17 15:09:17,796 | INFO | copytool_out | >> gfal2 | connect | >> [gfal_module_load] plugin >> /usr/lib64/gfal2-plugins//libgfal_plugin_file.so loaded with success >> 2019-07-17 15:09:17,796 | INFO | copytool_out | >> gfal2 | connect | >> [gfal_module_load] plugin >> /usr/lib64/gfal2-plugins//libgfal_plugin_sftp.so loaded with success >> 2019-07-17 15:09:17,796 | INFO | copytool_out | >> gfal2 | connect | >> [gfal_module_load] plugin >> /usr/lib64/gfal2-plugins//libgfal_plugin_dcap.so loaded with success >> 2019-07-17 15:09:17,796 | INFO | copytool_out | >> gfal2 | connect | >> [gfal_module_load] plugin >> /usr/lib64/gfal2-plugins//libgfal_plugin_srm.so loaded with success >> 2019-07-17 15:09:17,810 | INFO | copytool_out | >> gfal2 | __gfal2_copy | Event >> triggered: BOTH GFAL2:CORE:COPY LIST:ENTER >> 2019-07-17 15:09:17,810 | INFO | copytool_out | >> gfal2 | __gfal2_copy | Event >> triggered: BOTH GFAL2:CORE:COPY LIST:ITEM >> file:///scratch/condor/dir_24031/atlas_HudmF2he/PanDA_Pilot-4419868006/log.18636890._001059.job.log.tgz.1 >> => >> root://golias100.farm.particle.cz:1094//dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_13TeV/26/98/log.18636890._001059.job.log.tgz.1.rucio.upload >> 2019-07-17 15:09:17,810 | INFO | copytool_out | >> gfal2 | __gfal2_copy | Event >> triggered: BOTH GFAL2:CORE:COPY LIST:EXIT >> 2019-07-17 15:09:17,811 | INFO | Dummy-2 | >> gfal2 | (unknown function) | Event >> triggered: BOTH xroot TRANSFER:ENTER >> file://localhost///scratch/condor/dir_24031/atlas_HudmF2he/PanDA_Pilot-4419868006/log.18636890._001059.job.log.tgz.1?xrd.gsiusrpxy=/scratch/condor/dir_24031/job.9XONDm6yx7unnoBGSqYX2MjnABFKDmABFKDmSyhTDmEBFKDmqAGN3n.proxy&xrd.wantprot=gsi,unix >> => >> root://golias100.farm.particle.cz:1094///dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_13TeV/26/98/log.18636890._001059.job.log.tgz.1.rucio.upload?xrd.gsiusrpxy=/scratch/condor/dir_24031/job.9XONDm6yx7unnoBGSqYX2MjnABFKDmABFKDmSyhTDmEBFKDmqAGN3n.proxy&x >> 2019-07-17 15:09:17,811 | INFO | Dummy-2 | >> gfal2 | (unknown function) | Event >> triggered: BOTH xroot TRANSFER:TYPE streamed >> 2019-07-17 15:11:23,914 | INFO | Dummy-2 | >> gfal2 | (unknown function) | Event >> triggered: BOTH xroot TRANSFER:EXIT Job finished, [SUCCESS] >> >> but probably rucio tries to verify checksum (gfal-sum via python >> bindings) and it fails >> >> 2019-07-17 15:15:23,774 | INFO | copytool_out | >> pilot.copytool.rucio | copy_out | stderr >> = An unknown exception occurred. >> Details: Checksum not validated >> >> because DPM is unable to calculate checksum for 128TB file within >> timeout. >> >> Petr >> >> On 7/19/19 5:07 AM, Andrew Hanushevsky wrote: >>> Hi Petr, >>> >>> Indeed, it looks like what was used to upload the files (xrdcp?) >>> created a very large sparse file by writing with a very large offset. >>> The clue here is that an anomaly occurred during the write and these >>> two messages (lkely each corresponding to each file) tell us >>> something.... >>> >>>> 190717 17:10:39 332092 ofs_Posc: Creator changed from >>> atlasprd.2757:90@[2001:718:401:6e03::1:21] to >>> atlasprd.2757:39@[2001:718:401:6e03::1:21] for >>> >>> /dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_13TeV/26/98/ >>> log.18636890._001059.job.log.tgz.1.rucio.upload >>> >>>> 190717 17:29:26 14204 ofs_Posc: Creator changed from >>> atlasprd.2757:33@[2001:718:401:6e03::1:32] to >>> atlasprd.2757:95@[2001:718:401:6e03::1:32] for >>> /dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_5TeV/09/bc/ >>> AOD.18595963._000786.pool.root.1.rucio.upload >>> >>> What the POSC system is saying is that in each case the original >>> creator of the file lost contact with the server. It then reconnected >>> and resumed writing with a successful close. POSC allows this to avoid >>> network glitches from deleting a file because a network glitch will >>> cause the client to reconnect and resume where it left off. This is >>> only allowed when the resuming client comes from the same IP address, >>> has the same process ID, and the same logical name. In this case it >>> was "atlasprd.2757". I find that rather odd because it's next to >>> impossible for two processes on two different machines to have the >>> same process number (i.e. 2757). So, this is why I am wondering if >>> this was really an xrdcp or some other program. >>> >>> We should look to see what these client were actually doing during >>> this period as they clearly did not do the right thing when they >>> reconnected. So, could you grep out all interactions of these two >>> client from the log. Something like: >>> >>> grep "atlasprd.2757:" <logfile> | egrep >>> '2001:718:401:6e03::1:21|2001:718:401:6e03::1:32' > <excerpt> >>> >>> and send the excerpt to me? >>> >>> Andy >>> >>> >>> On Thu, 18 Jul 2019, Petr Vokac wrote: >>> >>>> Hi, >>>> >>>> our DPM which is using native xrootd 4.9.1 yesterday created two >>>> different 128TB files on two different diskservers that were >>>> received as >>>> a job output from two different worker nodes. The size of original >>>> file >>>> was few megabytes and when I check beginning of stored files up to >>>> original size & checksum everything seems to be OK - except these are >>>> sparse files with size 128TB. Original transferred files from WN >>>> looked >>>> like: >>>> >>>> mc15_13TeV:log.18636890._001059.job.log.tgz.1 - original size >>>> 3101899, adler32 742e1336 >>>> mc16_5TeV:AOD.18595963._000786.pool.root.1 - original size 737474514, >>>> adler32 dc68cb21 >>>> >>>> but on our diskservers they were stored as >>>> >>>> [root@dpmpool22 ~]# ls -l >>>> /mnt/fs1/atlas/2019-07-17/log.18636890._001059.job.log.tgz.1.rucio.upload.209928.1563376157 >>>> >>>> -rw-rw---- 1 dpmmgr dpmmgr 140560257079236 Jul 17 17:11 >>>> /mnt/fs1/atlas/2019-07-17/log.18636890._001059.job.log.tgz.1.rucio.upload.209928.1563376157 >>>> >>>> [root@dpmpool22 ~]# du --block-size=1 >>>> /mnt/fs1/atlas/2019-07-17/log.18636890._001059.job.log.tgz.1.rucio.upload.209928.1563376157 >>>> >>>> 3108864 >>>> /mnt/fs1/atlas/2019-07-17/log.18636890._001059.job.log.tgz.1.rucio.upload.209928.1563376157 >>>> >>>> >>>> [root@dpmpool24 ~]# ls -l >>>> /mnt/fs2/atlas/2019-07-17/AOD.18595963._000786.pool.root.1.rucio.upload.210480.1563377083 >>>> >>>> -rw-rw---- 1 dpmmgr dpmmgr 139734345056408 Jul 17 17:31 >>>> /mnt/fs2/atlas/2019-07-17/AOD.18595963._000786.pool.root.1.rucio.upload.210480.1563377083 >>>> >>>> [root@dpmpool24 tmp]# du --block-size=1 >>>> /mnt/fs2/atlas/2019-07-17/AOD.18595963._000786.pool.root.1.rucio.upload.210480.1563377083 >>>> >>>> 737480704 >>>> /mnt/fs2/atlas/2019-07-17/AOD.18595963._000786.pool.root.1.rucio.upload.210480.1563377083 >>>> >>>> >>>> I was told that DPM is not really involved in transfering / storing >>>> files with xrootd protocol, so I guess something bad must happened >>>> inside XRootD libraries. Do you have any idea why these two transfers >>>> generated such extreme files? Unfortunately our xrootd logging is >>>> pretty >>>> low and I was able to find only info about file creation >>>> >>>> [root@dpmpool22 ~]# grep log.18636890._001059.job.log.tgz >>>> /var/log/xrootd/dpmdisk/xrootd.log >>>> 190717 17:10:39 332092 ofs_Posc: Creator changed from >>> atlasprd.2757:90@[2001:718:401:6e03::1:21] to >>> atlasprd.2757:39@[2001:718:401:6e03::1:21] for >>> >>> /dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_13TeV/26/98/ >>> log.18636890._001059.job.log.tgz.1.rucio.upload >>>> >>>> [root@dpmpool24 tmp]# grep AOD.18595963._000786.pool.root >>>> /var/log/xrootd/dpmdisk/xrootd.log >>>> 190717 17:29:26 14204 ofs_Posc: Creator changed from >>> atlasprd.2757:33@[2001:718:401:6e03::1:32] to >>> atlasprd.2757:95@[2001:718:401:6e03::1:32] for >>> >>> /dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_5TeV/09/bc/AOD.18595963. >>> >>> _000786.pool.root.1.rucio.upload >>>> >>>> >>>> Petr >>>> >>>> ######################################################################## >>>> >>>> Use REPLY-ALL to reply to list >>>> >>>> To unsubscribe from the XROOTD-L list, click the following link: >>>> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1 >>>> >>> >>> ######################################################################## >>> >>> Use REPLY-ALL to reply to list >>> >>> To unsubscribe from the XROOTD-L list, click the following link: >>> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1 >> >> > > ######################################################################## > Use REPLY-ALL to reply to list > > To unsubscribe from the XROOTD-L list, click the following link: > https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1 ######################################################################## Use REPLY-ALL to reply to list To unsubscribe from the XROOTD-L list, click the following link: https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1