I updated "xrootd.async off" on most of our diskservers, ran test transfers for the last few days, and I can confirm that diskservers with the updated config no longer produce corrupted files. The default DPM configuration has used "xrootd.async on", probably since xrootd ~3.6 (https://twiki.cern.ch/twiki/bin/view/DPM/DPMComponents_Dpm-Xrootd#Manual_Configuration_Configurati), but the "on" parameter does not even appear in the documentation: https://xrootd.slac.stanford.edu/doc/dev48/xrd_config.htm#_Toc496911332

Does this mean we should change the xrootd.async default configuration in the upcoming DPM DOME 1.13.2?

The DPM DOME interface loaded as an XRootD plugin doesn't bring in a globus dependency:

*[root@dpmpool22 ~]# lsof -p 16232 | grep REG*
xrootd 16232 dpmmgr txt REG 9,2 70576 662786 /usr/bin/xrootd
xrootd 16232 dpmmgr mem REG 9,2 31408 673487 /usr/lib64/libnss_dns-2.17.so
xrootd 16232 dpmmgr mem REG 9,2 217032 395689 /var/db/nscd/services
xrootd 16232 dpmmgr mem REG 9,2 3135712 660855 /usr/lib64/mysql/libmysqlclient.so.18.0.0
xrootd 16232 dpmmgr mem REG 9,2 1975448 657235 /usr/lib64/libdome-4.so
xrootd 16232 dpmmgr mem REG 9,2 217032 394245 /var/db/nscd/passwd
xrootd 16232 dpmmgr mem REG 9,2 211240 660390 /usr/lib64/libXrdDPMCks-4.so
xrootd 16232 dpmmgr mem REG 9,2 170024 659024 /usr/lib64/libXrdDPMDiskAcc-4.so
xrootd 16232 dpmmgr mem REG 9,2 20112 676196 /usr/lib64/libuuid.so.1.3.0
xrootd 16232 dpmmgr mem REG 9,2 1509376 656808 /usr/lib64/libxml2.so.2.9.1
xrootd 16232 dpmmgr mem REG 9,2 1176008 661816 /usr/lib64/libdavix.so.0.7.4
xrootd 16232 dpmmgr mem REG 9,2 832600 662787 /usr/lib64/dmlite/plugin_domeadapter.so
xrootd 16232 dpmmgr mem REG 9,2 36552 660872 /usr/lib64/dmlite/plugin_config.so
xrootd 16232 dpmmgr mem REG 9,2 20781704 672658 /usr/lib64/libicudata.so.50.1.2
xrootd 16232 dpmmgr mem REG 9,2 2096104 672660 /usr/lib64/libicui18n.so.50.1.2
xrootd 16232 dpmmgr mem REG 9,2 1539544 672672 /usr/lib64/libicuuc.so.50.1.2
xrootd 16232 dpmmgr mem REG 9,2 6856 656621 /usr/lib64/libboost_atomic-mt.so.1.53.0
xrootd 16232 dpmmgr mem REG 9,2 32056 656588 /usr/lib64/libboost_chrono-mt.so.1.53.0
xrootd 16232 dpmmgr mem REG 9,2 1060624 672677 /usr/lib64/libboost_regex-mt.so.1.53.0
xrootd 16232 dpmmgr mem REG 9,2 94320 658557 /usr/lib64/libboost_thread-mt.so.1.53.0
xrootd 16232 dpmmgr mem REG 9,2 15608 658186 /usr/lib64/libboost_system-mt.so.1.53.0
xrootd 16232 dpmmgr mem REG 9,2 69184 658927 /usr/lib64/libboost_date_time-mt.so.1.53.0
xrootd 16232 dpmmgr mem REG 9,2 1213536 656626 /usr/lib64/libdmlite.so.1.13.1
xrootd 16232 dpmmgr mem REG 9,2 232064 659448 /usr/lib64/libXrdDPMOss-4.so
xrootd 16232 dpmmgr mem REG 9,2 15688 656791 /usr/lib64/libXrdSecunix-4.so
xrootd 16232 dpmmgr mem REG 9,2 173320 657303 /usr/lib64/libexpat.so.1.6.0
xrootd 16232 dpmmgr mem REG 9,2 437216 672545 /usr/lib64/libvomsapi.so.1.0.0
xrootd 16232 dpmmgr mem REG 9,2 32536 673484 /usr/lib64/libXrdSecgsiVOMS-4.so
xrootd 16232 dpmmgr mem REG 9,2 174352 656623 /usr/lib64/libXrdCryptossl-4.so
xrootd 16232 dpmmgr mem REG 9,2 185592 679728 /usr/lib64/libXrdSecgsi-4.so
xrootd 16232 dpmmgr mem REG 9,2 61792 679746 /usr/lib64/libXrdSec-4.so
xrootd 16232 dpmmgr mem REG 9,2 15688 658274 /usr/lib64/libkeyutils.so.1.5
xrootd 16232 dpmmgr mem REG 9,2 67104 675723 /usr/lib64/libkrb5support.so.0.1
xrootd 16232 dpmmgr mem REG 9,2 210824 675676 /usr/lib64/libk5crypto.so.3.1
xrootd 16232 dpmmgr mem REG 9,2 15920 655385 /usr/lib64/libcom_err.so.2.1
xrootd 16232 dpmmgr mem REG 9,2 967848 659013 /usr/lib64/libkrb5.so.3.3
xrootd 16232 dpmmgr mem REG 9,2 320400 672019 /usr/lib64/libgssapi_krb5.so.2.2
xrootd 16232 dpmmgr mem REG 9,2 2516624 659028 /usr/lib64/libcrypto.so.1.0.2k
xrootd 16232 dpmmgr mem REG 9,2 470360 672745 /usr/lib64/libssl.so.1.0.2k
xrootd 16232 dpmmgr mem REG 9,2 98496 656613 /usr/lib64/libXrdCrypto.so.1.0.0
xrootd 16232 dpmmgr mem REG 9,2 164664 659693 /usr/lib64/libXrdHttpUtils.so.1.0.0
xrootd 16232 dpmmgr mem REG 9,2 11536 659387 /usr/lib64/libXrdHttp-4.so
xrootd 16232 dpmmgr mem REG 9,2 61624 673959 /usr/lib64/libnss_files-2.17.so
xrootd 16232 dpmmgr mem REG 9,2 68192 656816 /usr/lib64/libbz2.so.1.0.6
xrootd 16232 dpmmgr mem REG 9,2 90248 656792 /usr/lib64/libz.so.1.2.7
xrootd 16232 dpmmgr mem REG 9,2 100008 676257 /usr/lib64/libelf-0.172.so
xrootd 16232 dpmmgr mem REG 9,2 402384 656781 /usr/lib64/libpcre.so.1.2.0
xrootd 16232 dpmmgr mem REG 9,2 19896 657215 /usr/lib64/libattr.so.1.1.0
xrootd 16232 dpmmgr mem REG 9,2 330464 675819 /usr/lib64/libdw-0.172.so
xrootd 16232 dpmmgr mem REG 9,2 105824 673967 /usr/lib64/libresolv-2.17.so
xrootd 16232 dpmmgr mem REG 9,2 19384 656979 /usr/lib64/libgpg-error.so.0.10.0
xrootd 16232 dpmmgr mem REG 9,2 535064 656996 /usr/lib64/libgcrypt.so.11.8.2
xrootd 16232 dpmmgr mem REG 9,2 85952 675689 /usr/lib64/liblz4.so.1.7.5
xrootd 16232 dpmmgr mem REG 9,2 157424 656805 /usr/lib64/liblzma.so.5.2.2
xrootd 16232 dpmmgr mem REG 9,2 155784 676236 /usr/lib64/libselinux.so.1
xrootd 16232 dpmmgr mem REG 9,2 20032 657219 /usr/lib64/libcap.so.2.22
xrootd 16232 dpmmgr mem REG 9,2 203800 669579 /usr/lib64/libsystemd.so.0.6.0
xrootd 16232 dpmmgr mem REG 9,2 2151672 656593 /usr/lib64/libc-2.17.so
xrootd 16232 dpmmgr mem REG 9,2 88776 656524 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
xrootd 16232 dpmmgr mem REG 9,2 1137024 673192 /usr/lib64/libm-2.17.so
xrootd 16232 dpmmgr mem REG 9,2 991616 678393 /usr/lib64/libstdc++.so.6.0.19
xrootd 16232 dpmmgr mem REG 9,2 43776 673969 /usr/lib64/librt-2.17.so
xrootd 16232 dpmmgr mem REG 9,2 141968 656619 /usr/lib64/libpthread-2.17.so
xrootd 16232 dpmmgr mem REG 9,2 19288 673191 /usr/lib64/libdl-2.17.so
xrootd 16232 dpmmgr mem REG 9,2 635008 679730 /usr/lib64/libXrdUtils.so.2.0.0
xrootd 16232 dpmmgr mem REG 9,2 843968 660388 /usr/lib64/libXrdServer.so.2.0.0
xrootd 16232 dpmmgr mem REG 9,2 163400 656586 /usr/lib64/ld-2.17.so
xrootd 16232 dpmmgr mem REG 9,2 217032 395688 /var/db/nscd/hosts
xrootd 16232 dpmmgr 2w REG 9,3 617903553 8388624 /var/log/xrootd/dpmdisk/xrootd.log
xrootd 16232 dpmmgr 16w REG 9,3 27753273859 8388648 /var/log/xrootd/dpmdisk/xrootd.log-20190914 (deleted)
xrootd 16232 dpmmgr 19w REG 9,3 27753273859 8388648 /var/log/xrootd/dpmdisk/xrootd.log-20190914 (deleted)
xrootd 16232 dpmmgr 20w REG 9,3 27753273859 8388648 /var/log/xrootd/dpmdisk/xrootd.log-20190914 (deleted)
xrootd 16232 dpmmgr 21w REG 9,3 27753273859 8388648 /var/log/xrootd/dpmdisk/xrootd.log-20190914 (deleted)
xrootd 16232 dpmmgr 24u REG 9,2 6784 527318 /var/spool/xrootd/dpmdisk/.ofs/posc.log
xrootd 16232 dpmmgr *768r REG 8,0 2998746176 77855990843 /mnt/fs1/atlas/2018-06-05/DAOD_TOPQ4.14324106._000186.pool.root.1.560762872.0
xrootd 16232 dpmmgr *769r REG 8,64 2733448798 88068008979 /mnt/fs4/atlas/2018-06-06/DAOD_TOPQ4.14324106._000146.pool.root.1.560814999.0
xrootd 16232 dpmmgr *770r REG 8,64 2553846404 84146182189 /mnt/fs4/atlas/2018-06-05/DAOD_TOPQ4.14324106._000020.pool.root.1.560750281.0
xrootd 16232 dpmmgr *771r REG 8,16 2396806963 80556399667 /mnt/fs2/atlas/2018-06-05/DAOD_TOPQ4.14324106._000154.pool.root.1.560808076.0
xrootd 16232 dpmmgr *772r REG 8,16 2453390567 79955741742 /mnt/fs2/atlas/2018-06-05/DAOD_TOPQ4.14324106._000203.pool.root.1.560760348.0
xrootd 16232 dpmmgr *773r REG 8,48 2837420884 80029443154 /mnt/fs3/atlas/2018-06-05/DAOD_TOPQ4.14324106._000176.pool.root.1.560763754.0
xrootd 16232 dpmmgr *774r REG 8,16 128181236 128888907798 /mnt/fs2/atlas/2018-11-24/DAOD_SUSY2.16167755._000007.pool.root.1.592538183.0
xrootd 16232 dpmmgr *776r REG 8,16 2411255622 171819183163 /mnt/fs2/atlas/2018-11-28/DAOD_SUSY2.16167755._000017.pool.root.1.39213.1543405080
xrootd 16232 dpmmgr *779r REG 8,48 4061471002 176141614133 /mnt/fs3/atlas/2019-06-05/DAOD_TOPQ1.15332443._000021.pool.root.1.4984.1559719508
xrootd 16232 dpmmgr *780u REG 8,16 167772160 191362651046 /mnt/fs2/atlas/2019-09-16/AOD.19091690._002701.pool.root.1.rucio.upload.365403.1568617954

Petr

On 9/12/19 11:53 PM, Andrew Hanushevsky wrote:
>
> Hi Petr,
>
> OK, I see now, it wasn't clear before.
> So, the corruption happens at
> the xrootd-ofs interface. So, could you add the following config option:
>
> xrootd.async off
>
> to the disk node servers and see what happens afterwards. One more
> question, do the disk servers here use the globus library at all? That
> is, is that library linked with anything that the disk server
> uses? Usually, the lsof output on a running server will tell you which
> libraries are in use.
>
> Andy
>
> *From:* Petr Vokac <mailto:[log in to unmask]>
> *Sent:* Thursday, September 12, 2019 2:14 PM
> *To:* Andrew Hanushevsky <mailto:[log in to unmask]> ; andrea manzi <mailto:[log in to unmask]> ; [log in to unmask] <mailto:[log in to unmask]>
> *Cc:* dpm-devel <mailto:[log in to unmask]> ; [log in to unmask] <mailto:[log in to unmask]>
> *Subject:* Re: Uploaded file 128TB on filesystem
>
> On 9/11/19 10:26 PM, Andrew Hanushevsky wrote:
>> I have yet to look at the logs to see where the corruption occurs. I
>> assume the last email contains an instance of the problem with the
>> tracing on. Once I see that I will have additional suggestions.
>> Anyway, could you tell me what the crash was here, SEGV?
>
> Yes, the core file tells me "Program terminated with signal 11,
> Segmentation fault."
>
> On 9/12/19 2:44 AM, Andrew Hanushevsky wrote:
>> This isn't exactly what I wanted. With the two directives in the
>> config file for all servers, please find the server that created the
>> huge file and give me its log file so I can see where the large
>> offset is introduced. That's the only way I can narrow down where the
>> problem first occurs.
>
> Do you mean that the disknode xrootd.log file filtered by
> "condor.114208:47" for the specific transfer that created the big file is not
> sufficient and you need the whole file?
> The transfers that were done with
> xrdcp don't log as much information on the server side as transfers with
> gfal-copy:
> *[root@dpmpool9 ~]# zgrep condor.44861: /var/log/xrootd/dpmdisk/xrootd.log-20190912.gz*
> 190912 02:04:06 3177 XrootdXeq: condor.44861:47@[2001:718:401:6e03::1:82] pub IP64 login as /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=vokac/CN=610071/CN=Petr Vokac
> 190912 02:04:06 3177 condor.44861:47@[2001:718:401:6e03::1:82] XrootdProtocol: open unmatp //dpm/farm.particle.cz/home/atlas/1M.00381.20190912020406.296583308.00000?dpm.chunk0=0,0,dpmpool9.farm.particle.cz:/mnt/fs2/atlas/2019-09-12/1M.00381.20190912020406.296583308.00000.219576.1568246646%3fsfn=%252Fdpm%252Ffarm.particle.cz%252Fhome%252Fatlas%252F1M.00381.20190912020406.296583308.00000&dpm.dhost=dpmpool9.farm.particle.cz&dpm.hv2=cpcbXGYCWjKth6eVgQFU8A==&dpm.loc=1&dpm.nonce=b7678c5d&dpm.put=1&dpm.time=1568246646&oss.asize=1048576&xrd.gsiusrpxy=/tmp/x509up_u996&xrd.wantprot=gsi,unix
> 190912 02:04:06 3177 condor.44861:47@[2001:718:401:6e03::1:82] XrootdProtocol: fh=0 write 1048576@0
> 190912 02:04:11 21932 condor.44861:47@[2001:718:401:6e03::1:82] ofs_write: 209@3909485545868230665 fn=/dpm/farm.particle.cz/home/atlas/1M.00381.20190912020406.296583308.00000
> 190912 02:04:11 21932 condor.44861:47@[2001:718:401:6e03::1:82] ofs_write: 131072@131072 fn=/dpm/farm.particle.cz/home/atlas/1M.00381.20190912020406.296583308.00000
> 190912 02:04:11 21932 condor.44861:47@[2001:718:401:6e03::1:82] ofs_write: 131072@262144 fn=/dpm/farm.particle.cz/home/atlas/1M.00381.20190912020406.296583308.00000
> 190912 02:04:12 21932 condor.44861:47@[2001:718:401:6e03::1:82] ofs_write: 131072@393216 fn=/dpm/farm.particle.cz/home/atlas/1M.00381.20190912020406.296583308.00000
> 190912 02:04:12 21932 condor.44861:47@[2001:718:401:6e03::1:82] ofs_write: 131072@524288 fn=/dpm/farm.particle.cz/home/atlas/1M.00381.20190912020406.296583308.00000
> 190912 02:04:12 21932 condor.44861:47@[2001:718:401:6e03::1:82] ofs_write: 131072@655360 fn=/dpm/farm.particle.cz/home/atlas/1M.00381.20190912020406.296583308.00000
> 190912 02:04:12 21932 condor.44861:47@[2001:718:401:6e03::1:82] ofs_write: 131072@786432 fn=/dpm/farm.particle.cz/home/atlas/1M.00381.20190912020406.296583308.00000
> 190912 02:04:12 21932 condor.44861:47@[2001:718:401:6e03::1:82] ofs_write: 131072@917504 fn=/dpm/farm.particle.cz/home/atlas/1M.00381.20190912020406.296583308.00000
> 190912 02:05:22 22468 XrootdXeq: condor.44861:47@[2001:718:401:6e03::1:82] disc 0:01:16
> 190912 02:05:22 22468 condor.44861:47@[2001:718:401:6e03::1:82] XrootdFile: closing w /dpm/farm.particle.cz/home/atlas/1M.00381.20190912020406.296583308.00000
> 190912 02:05:22 23332 XrootdXeq: condor.44861:48@[2001:718:401:6e03::1:82] pub IP64 login as /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=vokac/CN=610071/CN=Petr Vokac
> 190912 02:05:22 23332 condor.44861:48@[2001:718:401:6e03::1:82] XrootdProtocol: open up //dpm/farm.particle.cz/home/atlas/1M.00381.20190912020406.296583308.00000?dpm.chunk0=0,0,dpmpool9.farm.particle.cz:/mnt/fs2/atlas/2019-09-12/1M.00381.20190912020406.296583308.00000.219576.1568246646%3fsfn=%252Fdpm%252Ffarm.particle.cz%252Fhome%252Fatlas%252F1M.00381.20190912020406.296583308.00000&dpm.dhost=dpmpool9.farm.particle.cz&dpm.hv2=cpcbXGYCWjKth6eVgQFU8A==&dpm.loc=1&dpm.nonce=b7678c5d&dpm.put=1&dpm.time=1568246646&oss.asize=1048576&xrd.gsiusrpxy=/tmp/x509up_u996&xrd.wantprot=gsi,unix
> 190912 02:05:22 23332 ofs_Posc: Creator changed from condor.44861:47@[2001:718:401:6e03::1:82] to condor.44861:48@[2001:718:401:6e03::1:82] for /dpm/farm.particle.cz/home/atlas/1M.00381.20190912020406.296583308.00000
> 190912 02:05:22 23332 condor.44861:48@[2001:718:401:6e03::1:82] XrootdProtocol: fh=0 write 1048576@0
> 190912 02:05:36 23332 condor.44861:48@[2001:718:401:6e03::1:82] ofs_write: 1048576@0 fn=/dpm/farm.particle.cz/home/atlas/1M.00381.20190912020406.296583308.00000
> 190912 02:05:36 23332 condor.44861:48@[2001:718:401:6e03::1:82] XrootdProtocol: close rc=0 fh=0
> 190912 02:05:36 23332 condor.44861:48@[2001:718:401:6e03::1:82] XrootdFile: closing w /dpm/farm.particle.cz/home/atlas/1M.00381.20190912020406.296583308.00000
> 190912 02:05:36 23332 XrootdXeq: condor.44861:48@[2001:718:401:6e03::1:82] disc 0:00:14
>
> Transfers with xrdcp have a much shorter log on the diskserver:
> *[root@dpmpool1 ~]# zgrep -E 'condor.114208:|1M.xrdcp.20190911070503.057846787' /var/log/xrootd/dpmdisk/xrootd.log-20190912.gz*
> 190911 07:05:12 31717 XrootdXeq: condor.114208:47@[2001:718:401:6e03::1:82] pub IP64 login as /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=vokac/CN=610071/CN=Petr Vokac
> 190911 07:05:12 31717 condor.114208:47@[2001:718:401:6e03::1:82] XrootdProtocol: open unmat /dpm/farm.particle.cz/home/atlas/1M.xrdcp.20190911070503.057846787?dpm.chunk0=0,0,dpmpool1.farm.particle.cz:/mnt/fs1/atlas/2019-09-11/1M.xrdcp.20190911070503.057846787.143302.1568178311%3fsfn=%252Fdpm%252Ffarm.particle.cz%252Fhome%252Fatlas%252F1M.xrdcp.20190911070503.057846787&dpm.dhost=dpmpool1.farm.particle.cz&dpm.hv2=NBoSW+W5T7Eti/8yeLbu0w==&dpm.loc=1&dpm.nonce=870b737&dpm.put=1&dpm.time=1568178311&oss.asize=1048576
> 190911 07:05:12 31717 condor.114208:47@[2001:718:401:6e03::1:82] XrootdProtocol: fh=0 write 1048576@0
> 190911 07:05:20 23385 condor.114208:47@[2001:718:401:6e03::1:82] XrootdProtocol: close rc=0 fh=0
> 190911 07:05:20 23385 condor.114208:47@[2001:718:401:6e03::1:82] XrootdFile: closing w /dpm/farm.particle.cz/home/atlas/1M.xrdcp.20190911070503.057846787
> 190911 07:05:20 23385 XrootdXeq: condor.114208:47@[2001:718:401:6e03::1:82] disc 0:00:09
>
> I don't know why the gfal transfer has more information in the server
> xrootd.log file; probably something in the environment or transfer
> parameters differs from a pure `xrdcp file:///... root://...` command.
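[Editor's aside: the `ofs_write: <length>@<offset>` trace lines quoted above can be checked mechanically. A minimal sketch, not part of DPM or XRootD tooling; the sample log lines are shortened from the excerpts in this thread and the 1 MiB expected size comes from `oss.asize=1048576` in the open URL:]

```python
import re

# "ofs_write: <length>@<offset> fn=<path>" as emitted by "ofs.trace write"
OFS_WRITE_RE = re.compile(r"ofs_write: (\d+)@(\d+) fn=(\S+)")

def suspicious_writes(log_text, expected_size):
    """Return (length, offset, path) for every ofs_write record whose
    byte range cannot belong to a file of expected_size bytes."""
    bad = []
    for m in OFS_WRITE_RE.finditer(log_text):
        length, offset, path = int(m[1]), int(m[2]), m[3]
        if offset + length > expected_size:
            bad.append((length, offset, path))
    return bad

sample = """\
190912 02:04:11 21932 ofs_write: 209@3909485545868230665 fn=/dpm/farm.particle.cz/home/atlas/1M.00381
190912 02:04:11 21932 ofs_write: 131072@131072 fn=/dpm/farm.particle.cz/home/atlas/1M.00381
"""
# flags only the first record, the one with the absurd offset
print(suspicious_writes(sample, 1048576))
```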
>
> Petr
>
>> *From:* Petr Vokac <mailto:[log in to unmask]>
>> *Sent:* Wednesday, September 11, 2019 12:35 AM
>> *To:* Andrew Hanushevsky <mailto:[log in to unmask]> ; andrea manzi <mailto:[log in to unmask]> ; [log in to unmask] <mailto:[log in to unmask]>
>> *Cc:* dpm-devel <mailto:[log in to unmask]> ; [log in to unmask] <mailto:[log in to unmask]>
>> *Subject:* Re: Uploaded file 128TB on filesystem
>>
>> Now I have the following lines on all our DPM diskservers:
>> ofs.trace write
>> xrootd.trace fs
>> and I created a simple script [1] that uses just xrdcp to copy a 1MB file:
>> CURR=$(date +%Y%m%d%H%M%S.%N)
>> export XRD_LOGLEVEL=Dump
>> export XRD_LOGMASK=All
>> export XRD_LOGFILE=1M.${CURR}.dump
>> xrdcp "file:///tmp/1M" "root://golias100.farm.particle.cz//dpm/farm.particle.cz/home/atlas/1M.xrdcp.${CURR}"
>> Even this simple code creates corrupted files at our DPM storage, e.g. the
>> following 1.5GB file:
>> *[root@dpmpool1 ~]# ls -la /mnt/fs1/atlas/2019-09-11/1M.xrdcp.20190911070503.057846787.143302.1568178311*
>> -rw-rw---- 1 dpmmgr dpmmgr 1761607680 Sep 11 07:05 /mnt/fs1/atlas/2019-09-11/1M.xrdcp.20190911070503.057846787.143302.1568178311
>> *[root@dpmpool1 ~]# du --block-size=1 /mnt/fs1/atlas/2019-09-11/1M.xrdcp.20190911070503.057846787.143302.1568178311*
>> 1048576 /mnt/fs1/atlas/2019-09-11/1M.xrdcp.20190911070503.057846787.143302.1568178311
>> *[root@dpmpool1 ~]# grep condor.114208:47 /var/log/xrootd/dpmdisk/xrootd.log*
>> 190911 07:05:12 31717 XrootdXeq: condor.114208:47@[2001:718:401:6e03::1:82] pub IP64 login as /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=vokac/CN=610071/CN=Petr Vokac
>> 190911 07:05:12 31717 condor.114208:47@[2001:718:401:6e03::1:82] XrootdProtocol: open unmat /dpm/farm.particle.cz/home/atlas/1M.xrdcp.20190911070503.057846787?dpm.chunk0=0,0,dpmpool1.farm.particle.cz:/mnt/fs1/atlas/2019-09-11/1M.xrdcp.20190911070503.057846787.143302.1568178311%3fsfn=%252Fdpm%252Ffarm.particle.cz%252Fhome%252Fatlas%252F1M.xrdcp.20190911070503.057846787&dpm.dhost=dpmpool1.farm.particle.cz&dpm.hv2=NBoSW+W5T7Eti/8yeLbu0w==&dpm.loc=1&dpm.nonce=870b737&dpm.put=1&dpm.time=1568178311&oss.asize=1048576
>> 190911 07:05:12 31717 condor.114208:47@[2001:718:401:6e03::1:82] XrootdProtocol: fh=0 write 1048576@0
>> 190911 07:05:20 23385 condor.114208:47@[2001:718:401:6e03::1:82] XrootdProtocol: close rc=0 fh=0
>> 190911 07:05:20 23385 condor.114208:47@[2001:718:401:6e03::1:82] XrootdFile: closing w /dpm/farm.particle.cz/home/atlas/1M.xrdcp.20190911070503.057846787
>> 190911 07:05:20 23385 XrootdXeq: condor.114208:47@[2001:718:401:6e03::1:82] disc 0:00:09
>> Client-side details are in the attachment. I have more examples [2] with
>> the following wrong sizes:
>> 1M.20190911005425.364456793.bad ... Wed Sep 11 00:54:31 CEST 2019 ERROR BAD SIZE 1124073472
>> 1M.20190911010555.932867187.bad ... Wed Sep 11 01:06:10 CEST 2019 ERROR BAD SIZE 0
>> 1M.20190911012755.924566113.bad ... Wed Sep 11 01:28:08 CEST 2019 ERROR BAD SIZE 917504
>> 1M.20190911014400.800640006.bad ... Wed Sep 11 01:44:24 CEST 2019 ERROR BAD SIZE 917504
>> 1M.20190911014622.520944167.bad ... Wed Sep 11 01:46:40 CEST 2019 ERROR BAD SIZE 3741319168
>> 1M.20190911014901.149846614.bad ... Wed Sep 11 01:49:26 CEST 2019 ERROR BAD SIZE 917504
>> 1M.20190911015203.831921517.bad ... Wed Sep 11 01:52:29 CEST 2019 ERROR BAD SIZE 917504
>> 1M.20190911020907.396911618.bad ... Wed Sep 11 02:09:32 CEST 2019 ERROR BAD SIZE 3791650816
>> 1M.20190911021159.705868237.bad ... Wed Sep 11 02:12:27 CEST 2019 ERROR BAD SIZE 917504
>> 1M.20190911021251.426135326.bad ... Wed Sep 11 02:13:13 CEST 2019 ERROR BAD SIZE 917504
>> 1M.20190911021737.228623622.bad ... Wed Sep 11 02:18:01 CEST 2019 ERROR BAD SIZE 771751936
>> 1M.20190911022813.246485614.bad ... Wed Sep 11 02:28:40 CEST 2019 ERROR BAD SIZE 917504
>> 1M.20190911025601.833862548.bad ... Wed Sep 11 02:56:20 CEST 2019 ERROR BAD SIZE 1694498816
>> 1M.20190911030159.405097568.bad ... Wed Sep 11 03:02:28 CEST 2019 ERROR BAD SIZE 5016387584
>> 1M.20190911030913.965667783.bad ... Wed Sep 11 03:09:54 CEST 2019 ERROR BAD SIZE 218103808
>> 1M.20190911030955.983267709.bad ... Wed Sep 11 03:10:01 CEST 2019 ERROR BAD SIZE 15946576
>> 1M.20190911031027.126885413.bad ... Wed Sep 11 03:10:55 CEST 2019 ERROR BAD SIZE 1459617792
>> 1M.20190911031535.972442346.bad ... Wed Sep 11 03:16:03 CEST 2019 ERROR BAD SIZE 24848976
>> 1M.20190911032921.385479218.bad ... Wed Sep 11 03:29:37 CEST 2019 ERROR BAD SIZE 352321536
>> 1M.20190911050741.513088940.bad ... Wed Sep 11 05:07:57 CEST 2019 ERROR BAD SIZE 917504
>> 1M.20190911053936.964249354.bad ... Wed Sep 11 05:39:56 CEST 2019 ERROR BAD SIZE 917504
>> 1M.20190911055721.511952376.bad ... Wed Sep 11 05:57:39 CEST 2019 ERROR BAD SIZE 917504
>> 1M.20190911060520.998050566.bad ... Wed Sep 11 06:05:27 CEST 2019 ERROR BAD SIZE 2292134
>> 1M.20190911061254.989202949.bad ... Wed Sep 11 06:13:20 CEST 2019 ERROR BAD SIZE 917504
>> 1M.20190911062609.645848195.bad ... Wed Sep 11 06:26:21 CEST 2019 ERROR BAD SIZE 1207959552
>> 1M.20190911063849.007385225.bad ... Wed Sep 11 06:38:55 CEST 2019 ERROR BAD SIZE 917504
>> 1M.20190911063907.064936962.bad ... Wed Sep 11 06:39:15 CEST 2019 ERROR BAD SIZE 26413431
>> 1M.20190911065351.412877379.bad ... Wed Sep 11 06:54:00 CEST 2019 ERROR BAD SIZE 917504
>> 1M.20190911070503.057846787.bad ... Wed Sep 11 07:05:20 CEST 2019 ERROR BAD SIZE 1761607680
>> 1M.20190911070525.814871895.bad ... Wed Sep 11 07:05:36 CEST 2019 ERROR BAD SIZE 917504
>> 1M.20190911070634.781534227.bad ... Wed Sep 11 07:06:42 CEST 2019 ERROR BAD SIZE 3758096384
>> 1M.20190911072040.275437982.bad ... Wed Sep 11 07:21:07 CEST 2019 ERROR BAD SIZE 786432
>> 1M.20190911072340.253938152.bad ... Wed Sep 11 07:24:00 CEST 2019 ERROR BAD SIZE 917504
>> 1M.20190911074911.208877064.bad ... Wed Sep 11 07:49:40 CEST 2019 ERROR BAD SIZE 388133552
>> 1M.20190911080141.583691537.bad ... Wed Sep 11 08:01:56 CEST 2019 ERROR BAD SIZE 917504
>> 1M.20190911082145.247040374.bad ... Wed Sep 11 08:22:17 CEST 2019 ERROR BAD SIZE 786432
>> 1M.20190911085008.626914519.bad ... Wed Sep 11 08:50:29 CEST 2019 ERROR BAD SIZE 917504
>>
>> Petr
>>
>> [1] http://novastore.farm.particle.cz/tmp/xrdcp-20190911/test_dpm_xroot_file_copy.sh
>> [2] http://novastore.farm.particle.cz/tmp/xrdcp-20190911/
>>
>> On 9/10/19 2:12 AM, Andrew Hanushevsky wrote:
>>> Hi Petr,
>>>
>>> If you also specify in the config file
>>>
>>> xrootd.trace fs
>>>
>>> then we will get a much earlier snapshot of the offset and length.
>>> This will allow us to tell whether the data is getting corrupted
>>> before it reaches the server or afterward. Could you do that (at
>>> least for one test) to see where the corruption may happen?
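[Editor's aside: the `ls -la` vs `du --block-size=1` mismatch quoted earlier in the thread (apparent size 1761607680, only 1048576 bytes allocated) is the classic signature of a sparse file. The same check can be done with a single `os.stat`; a minimal illustrative sketch, with a deliberately sparse demo file rather than a real replica path:]

```python
import os
import tempfile

def sparse_check(path):
    """Apparent size (what ls -l shows) vs allocated bytes (what du
    reports). A large gap means the file is sparse, e.g. inflated by a
    single write at a bogus huge offset."""
    st = os.stat(path)
    allocated = st.st_blocks * 512  # POSIX: st_blocks counts 512-byte units
    return st.st_size, allocated, st.st_size > allocated

# demo: create a sparse file by extending it without writing any data
demo = os.path.join(tempfile.mkdtemp(), "sparse_demo")
with open(demo, "wb") as f:
    f.truncate(16 * 1024 * 1024)  # 16 MiB apparent size, ~0 allocated

size, allocated, is_sparse = sparse_check(demo)
print(size, allocated, is_sparse)
```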
>>>
>>> Andy
>>>
>>> *From:* Petr Vokac <mailto:[log in to unmask]>
>>> *Sent:* Sunday, September 08, 2019 1:46 AM
>>> *To:* andrea manzi <mailto:[log in to unmask]> ; [log in to unmask] <mailto:[log in to unmask]> ; Andrew Hanushevsky <mailto:[log in to unmask]>
>>> *Cc:* dpm-devel <mailto:[log in to unmask]> ; [log in to unmask] <mailto:[log in to unmask]>
>>> *Subject:* Re: Uploaded file 128TB on filesystem
>>>
>>> I just found one file where the problem with the offset occurred in the
>>> middle of the transfer (not at the beginning):
>>> 190907 20:03:53 25211 XrootdXeq: atlasprd.2817:168@[2001:718:401:6e03::1:51] pub IP64 login as /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1
>>> 190907 20:03:55 25211 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 262144@0 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
>>> 190907 20:03:55 25211 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 262144@262144 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
>>> 190907 20:03:55 25211 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 262144@524288 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
>>> 190907 20:03:55 25211 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 262144@786432 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
>>> 190907 20:03:55 25211 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 262144@1048576 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
>>> ...
>>> 190907 20:40:35 31589 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 262144@2952003584 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
>>> 190907 20:40:53 31589 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 262144@2952265728 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
>>> 190907 20:41:55 31609 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 262144@2952265728 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
>>> *190907 20:42:05 25089 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 1919238960@7379557499097268579 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload*
>>> *190907 20:42:13 25118 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 1919238960@7379557499097268579 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload*
>>> 190907 20:42:13 25118 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 262144@2953314304 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
>>> 190907 20:42:13 25118 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 262144@2953576448 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
>>> 190907 20:42:20 25118 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 262144@2953576448 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
>>> The whole xrootd.log for this transfer is available at
>>> http://novastore.farm.particle.cz/tmp/xrootd_bigfile_wrong_offset_in_the_middle_of_transfer.txt
>>>
>>> Petr
>>>
>>> On 9/8/19 10:25 AM, Petr Vokac wrote:
>>>>
>>>> I executed basically the following python code from a bash script in
>>>> a loop:
>>>> import gfal2
>>>> ctx = gfal2.creat_context()
>>>> ctx.set_opt_string("XROOTD PLUGIN", "XRD.WANTPROT", "gsi,unix")
>>>> params = ctx.transfer_parameters()
>>>> params.timeout = 60
>>>> ctx.filecopy(params, "file:///tmp/1M", "root://golias100.farm.particle.cz/dpm/farm.particle.cz/home/atlas/1M")
>>>> and after a while new corrupted files appeared at our storage,
>>>> including one big file:
>>>> [vokac@ui1 ~]$ gfal-ls -l 'davs://golias100.farm.particle.cz/dpm/farm.particle.cz/home/atlas/1M.01556.20190907190118.646902567.00000'
>>>> -rw-rw-r-- 0 1310 5281 139703410778112 Sep 7 19:01 davs://golias100.farm.particle.cz/dpm/farm.particle.cz/home/atlas/1M.01556.20190907190118.646902567.00000
>>>> Now I also have debug output from the gfal client, including
>>>> XRD_LOGLEVEL=Dump + XRD_LOGMASK=All, for this file, together with the xrootd.log
>>>> on the server side:
>>>> [root@dpmpool19 ~]$ grep condor.39253:171 /var/log/xrootd/dpmdisk/xrootd.log
>>>> 190907 19:01:19 29645 XrootdXeq: condor.39253:171@[2001:718:401:6e03::1:82] pub IP64 login as /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=vokac/CN=610071/CN=Petr Vokac
>>>> 190907 19:01:22 32625 condor.39253:171@[2001:718:401:6e03::1:82] ofs_write: 1009440128@139703410643808 fn=/dpm/farm.particle.cz/home/atlas/1M.01556.20190907190118.646902567.00000
>>>> 190907 19:01:23 32625 condor.39253:171@[2001:718:401:6e03::1:82] ofs_write: 131072@131072 fn=/dpm/farm.particle.cz/home/atlas/1M.01556.20190907190118.646902567.00000
>>>> 190907 19:01:23 32625 condor.39253:171@[2001:718:401:6e03::1:82] ofs_write: 131072@262144 fn=/dpm/farm.particle.cz/home/atlas/1M.01556.20190907190118.646902567.00000
>>>> 190907 19:01:24 32625 condor.39253:171@[2001:718:401:6e03::1:82] ofs_write: 131072@393216 fn=/dpm/farm.particle.cz/home/atlas/1M.01556.20190907190118.646902567.00000
>>>> 190907 19:01:24 32625 condor.39253:171@[2001:718:401:6e03::1:82] ofs_write: 131072@524288 fn=/dpm/farm.particle.cz/home/atlas/1M.01556.20190907190118.646902567.00000
>>>> 190907 19:01:25 32625 condor.39253:171@[2001:718:401:6e03::1:82] ofs_write: 131072@655360 fn=/dpm/farm.particle.cz/home/atlas/1M.01556.20190907190118.646902567.00000
>>>> 190907 19:01:25 32625 condor.39253:171@[2001:718:401:6e03::1:82] ofs_write: 131072@786432 fn=/dpm/farm.particle.cz/home/atlas/1M.01556.20190907190118.646902567.00000
>>>> 190907 19:01:25 32625 condor.39253:171@[2001:718:401:6e03::1:82] ofs_write: 131072@917504 fn=/dpm/farm.particle.cz/home/atlas/1M.01556.20190907190118.646902567.00000
>>>> 190907 19:01:26 32625 XrootdXeq: condor.39253:171@[2001:718:401:6e03::1:82] disc 0:00:07
>>>> This file was transferred from mff82.farm.particle.cz
>>>> (2001:718:401:6e03::1:82) to the storage server dpmpool19.farm.particle.cz
>>>> (2001:718:401:6017:2::19); both machines use
>>>> xrootd-4.9.1-1.el7.x86_64 and mff82 uses gfal2-2.16.3-1.el7.x86_64.
>>>>
>>>> I also tried to calculate adler32 for the data blocks from the positions
>>>> logged by the server and the result matches the source data checksum. When I look
>>>> at other transfers that finished with "SUCCESS" but with a filesize
>>>> different from the source file, I can see that the problem always occurs
>>>> only in the first "ofs_write" log line, where size@offset contains
>>>> wrong numbers.
>>>>
>>>> Attached files:
>>>>
>>>> 1M.01556.20190907190118.646902567.00000.dump ... xrd with dump log level
>>>> 1M.01556.20190907190118.646902567.00000.log ... gfal2 verbose logging (+ a few log lines from my script)
>>>> 1M.01556.20190907190118.646902567.00000.xrootd.long ... dpmpool19 xrootd.log
>>>>
>>>> I have more examples of successfully transferred files with a wrong
>>>> size (sometimes bigger than the original 1M size and sometimes smaller):
>>>>
>>>> 07 Sep 18:39:33.879 [ERROR] dpmpool1.farm.particle.cz BAD SIZE 917504
>>>> 07 Sep 18:40:45.039 [ERROR] dpmpool21.farm.particle.cz BAD SIZE 917504
>>>> 07 Sep 18:41:22.555 [ERROR] dpmpool21.farm.particle.cz BAD SIZE 917504
>>>> 07 Sep 18:41:39.248 [ERROR] dpmpool21.farm.particle.cz BAD SIZE 917504
>>>> 07 Sep 18:56:51.113 [ERROR] dpmpool24.farm.particle.cz BAD SIZE 917504
>>>> 07 Sep 18:58:04.436 [ERROR] dpmpool1.farm.particle.cz BAD SIZE 393194936
>>>> 07 Sep 19:01:25.908 [ERROR] dpmpool19.farm.particle.cz BAD SIZE 139703410778112
>>>> 07 Sep 19:02:17.183 [ERROR] dpmpool24.farm.particle.cz BAD SIZE 0
>>>> 07 Sep 19:02:53.014 [ERROR] dpmpool16.farm.particle.cz BAD SIZE 917504
>>>> 07 Sep 19:03:49.424 [ERROR] dpmpool1.farm.particle.cz BAD SIZE 917504
>>>> 07 Sep 19:05:47.743 [ERROR] dpmpool21.farm.particle.cz BAD SIZE 13983727
>>>> 07 Sep 19:15:15.761 [ERROR] dpmpool22.farm.particle.cz BAD SIZE 12752345
>>>> 07 Sep 19:16:58.898 [ERROR] dpmpool20.farm.particle.cz BAD SIZE 917504
>>>> 07 Sep 19:18:51.590 [ERROR] dpmpool17.farm.particle.cz BAD SIZE 6302080
>>>> 07 Sep 19:20:37.006 [ERROR] dpmpool21.farm.particle.cz BAD SIZE 1456153896
>>>> 07 Sep 19:52:32.553 [ERROR] dpmpool16.farm.particle.cz BAD SIZE 917504
>>>> 07 Sep 19:53:45.155 [ERROR] dpmpool22.farm.particle.cz BAD SIZE 1000000000
>>>> 07 Sep 20:11:14.497 [ERROR] dpmpool22.farm.particle.cz BAD SIZE 917504
>>>> 07 Sep 20:12:51.081 [ERROR] dpmpool20.farm.particle.cz BAD SIZE 917504
>>>> 07 Sep 20:14:58.090 [ERROR] dpmpool20.farm.particle.cz BAD SIZE 100663296
>>>> 07 Sep 20:24:57.457 [ERROR] dpmpool1.farm.particle.cz BAD SIZE 917504
>>>> 07 Sep 20:27:37.596 [ERROR] dpmpool19.farm.particle.cz BAD SIZE 917504
>>>> 07 Sep 20:28:21.617 [ERROR] dpmpool16.farm.particle.cz BAD SIZE 0
>>>>
>>>> Petr
>>>>
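[Editor's aside: the per-block adler32 comparison mentioned above needs nothing beyond the standard zlib module. A minimal sketch; the demo file is a throwaway, and in the real check one would compare the same region of the source file and the on-disk replica:]

```python
import os
import tempfile
import zlib

def adler32_region(path, offset, length):
    """Adler-32 of one byte range of a file, e.g. a 131072-byte block at
    an offset taken from an ofs_write trace line."""
    with open(path, "rb") as f:
        f.seek(offset)
        return zlib.adler32(f.read(length)) & 0xFFFFFFFF

# demo with a throwaway file of known content; for the real check one
# would compare adler32_region(source, off, ln) == adler32_region(replica, off, ln)
demo = os.path.join(tempfile.mkdtemp(), "blocks")
data = bytes(range(256)) * 1024  # 256 KiB of known content
with open(demo, "wb") as f:
    f.write(data)

# checksum of the second 128 KiB block matches the in-memory slice
print(adler32_region(demo, 131072, 131072) == (zlib.adler32(data[131072:262144]) & 0xFFFFFFFF))
```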
>>>> On 9/5/19 2:44 PM, Petr Vokac wrote: >>>>> I'm going to test in a loop uploads with this code and also enabled gfal >>>>> debug logging from our remote machines that talks with storage through >>>>> quite saturated network (we never saw same issue from cluster worker >>>>> nodes that shares room with diskservers). I also saw few transfers that >>>>> did not create big sparse file, but just failed while comparing >>>>> checksums after upload and this is also quite suspicious behavior. >>>>> >>>>> Petr >>>>> >>>>> On 9/5/19 12:57 PM, andrea manzi wrote: >>>>>>> Ciao Andrea, >>>>>>> >>>>>>> we practically do this: ctx = creat_context(... ctx.set_opt_string("XROOTD PLUGIN", "XRD.WANTPROT", "gsi,unix") ctx.mkdir_rec(... ctx.filecopy(.. >>>>>>> >>>>>>> Is that the gfal transfer API? >>>>>> yup, internally in gfal2 we then use the XrdCl to perform the copy ( >>>>>> both for local files and TPC) >>>>>> >>>>>> today is holiday but we can start investigating with Michal tomorrow >>>>>> or Monday the possible issue >>>>>> >>>>>> cheers >>>>>> >>>>>> Andrea >>>>>> >>>>>>> Cheers, >>>>>>> Mario >>>>>>> >>>>>>> >>>>>>> >>>>>>> On 05/09/2019 10:35, andrea manzi wrote: >>>>>>>> ciao Mario, >>>>>>>> >>>>>>>> i guess you are using the gfal2 transfer API in Rucio right ? ( the >>>>>>>> equivalent to gfal-copy) >>>>>>>> >>>>>>>> in gfal2, in the case of the transfer API for xrootd, we don't do >>>>>>>> much >>>>>>>> as we just use the xrootd client ( XrdCl::CopyProcess), so we don't >>>>>>>> touch file offsets. >>>>>>>> >>>>>>>> so we may want to investigate if there is a problem in XrdCl. Do you >>>>>>>> have the logs of the client? 
>>>>>>>> If you are using the gfal2 Posix API instead, we need to investigate
>>>>>>>> a possible bug there
>>>>>>>>
>>>>>>>> cheers
>>>>>>>>
>>>>>>>> Andrea
>>>>>>>>
>>>>>>>>> Hi Petr, Andy,
>>>>>>>>>
>>>>>>>>> one extra piece of information here: Rucio uses the python bindings
>>>>>>>>> to the gfal library to upload these files, and de facto always uses
>>>>>>>>> the current stable gfal version that is published on ALRB. Rucio
>>>>>>>>> doesn't implement any other upload by itself.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Mario
>>>>>>>>>
>>>>>>>>> On 05/09/2019 04:21, Andrew Hanushevsky wrote:
>>>>>>>>>> Hi Petr,
>>>>>>>>>>
>>>>>>>>>> So, look at this line:
>>>>>>>>>>
>>>>>>>>>> 190904 17:02:16 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82]
>>>>>>>>>> ofs_write: 1784104010@6506989258351653185
>>>>>>>>>> fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>>>>>>>>
>>>>>>>>>> Why did the application doing the write specify an offset of
>>>>>>>>>> 6506989258351653185? So, that is the problem and, as you noted, it
>>>>>>>>>> doesn't matter whether POSC is on or not. Indeed, it's the first
>>>>>>>>>> write to the file and only the first write that suffers from this
>>>>>>>>>> problem. So, there are two possibilities here: a) there is a bug in
>>>>>>>>>> the uploader, or b) there was a network data corruption that managed
>>>>>>>>>> to hit the offset (somewhat unlikely).
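[Editor's note] Andy's reading — a single first write at a corrupted offset immediately materializes as a huge sparse file — is easy to reproduce on any filesystem that supports holes. A small illustration; the 10 MB offset is just a stand-in for the ~6.5 EB offset in the log, which no real filesystem would accept in a test:

```python
import os
import tempfile

# A few bytes written at a large offset create a sparse file whose
# apparent size is offset + len(data) while almost nothing is allocated.
path = os.path.join(tempfile.mkdtemp(), "demo.bin")
offset = 10 * 1024 * 1024  # stand-in for 6506989258351653185 in the log

with open(path, "wb") as f:
    f.seek(offset)       # like the corrupted first ofs_write offset
    f.write(b"payload")  # the real data lands far past the hole

st = os.stat(path)
print("apparent size:", st.st_size)          # offset + 7, what ls -l shows
print("allocated    :", st.st_blocks * 512)  # tiny, what du reports
```

This matches the ls vs. du discrepancy seen on the diskservers: a multi-TB `st_size` with only a few MB actually allocated.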
Andy

>>>>>>>>>> *From:* Petr Vokac mailto:[log in to unmask]
>>>>>>>>>> *Sent:* Wednesday, September 04, 2019 12:19 PM
>>>>>>>>>> *To:* Andrew Hanushevsky mailto:[log in to unmask]
>>>>>>>>>> *Cc:* dpm-devel mailto:[log in to unmask]; [log in to unmask] mailto:[log in to unmask]
>>>>>>>>>> *Subject:* Re: Uploaded file 128TB on filesystem
>>>>>>>>>>
>>>>>>>>>> I configured our diskservers with "ofs.trace write" (I did not yet
>>>>>>>>>> add "ofs.persist hold 0" to the xrootd configuration) and today a
>>>>>>>>>> new file appeared on one of our storage servers:
>>>>>>>>>>
>>>>>>>>>> *[root@dpmpool1 ~]# grep atlasprd.2817:152 /var/log/xrootd/dpmdisk/xrootd.log*
>>>>>>>>>> 190904 17:02:08 1894 XrootdXeq: atlasprd.2817:152@[2001:718:401:6e03::1:82] pub IP64 login as /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1
>>>>>>>>>> 190904 17:02:16 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82] ofs_write: 1784104010@6506989258351653185 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>>>>>>>> 190904 17:02:16 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82] ofs_write: 262144@262144 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>>>>>>>> 190904 17:02:17 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82] ofs_write: 262144@524288 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>>>>>>>> 190904 17:02:17 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82] ofs_write: 262144@786432 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>>>>>>>> 190904 17:02:17 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82] ofs_write: 262144@1048576 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>>>>>>>> 190904 17:02:17 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82] ofs_write: 262144@1310720 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>>>>>>>> 190904 17:02:17 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82] ofs_write: 262144@1572864 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>>>>>>>> 190904 17:02:17 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82] ofs_write: 262144@1835008 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>>>>>>>> 190904 17:02:17 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82] ofs_write: 135000@2097152 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>>>>>>>> 190904 17:07:32 2163 XrootdXeq: atlasprd.2817:152@[2001:718:401:6e03::1:82] disc 0:05:24
>>>>>>>>>>
>>>>>>>>>> *[root@dpmpool1 ~]# ls -la /mnt/fs2/atlas/2019-09-04/EVNT.18960107._035789.pool.root.1.rucio.upload.309650.1567609328*
>>>>>>>>>> -rw-rw---- 1 dpmmgr dpmmgr 6506989258351919104 Sep 4 17:02 /mnt/fs2/atlas/2019-09-04/EVNT.18960107._035789.pool.root.1.rucio.upload.309650.1567609328
>>>>>>>>>>
>>>>>>>>>> *[root@dpmpool1 ~]# du --block-size=1 /mnt/fs2/atlas/2019-09-04/EVNT.18960107._035789.pool.root.1.rucio.upload.309650.1567609328*
>>>>>>>>>> 2236416 /mnt/fs2/atlas/2019-09-04/EVNT.18960107._035789.pool.root.1.rucio.upload.309650.1567609328
>>>>>>>>>>
>>>>>>>>>> This file has its first 256kB filled with 0x00 (the adler32 up to
>>>>>>>>>> the original size 2232152 doesn't match) and this time there is no
>>>>>>>>>> ofs_Posc in the xrootd.log file.
>>>>>>>>>>
>>>>>>>>>> Petr Vokac
>>>>>>>>>>
>>>>>>>>>> On 8/19/19 3:24 AM, Andrew Hanushevsky wrote:
>>>>>>>>>>> Hi Petr,
>>>>>>>>>>>
>>>>>>>>>>> The issue here is that the python script (I believe it's python)
>>>>>>>>>>> is disconnecting without closing the file and then reconnecting
>>>>>>>>>>> and writing at a high offset. The log is pretty clear on that. You
>>>>>>>>>>> can verify by adding "ofs.trace write" to see the write offsets
>>>>>>>>>>> (though it will produce a lot of output). You can also prohibit
>>>>>>>>>>> reconnection by adding this to the persist directive:
>>>>>>>>>>>
>>>>>>>>>>> ofs.persist hold 0
>>>>>>>>>>>
>>>>>>>>>>> which would delete the file the moment the client disconnects
>>>>>>>>>>> without closing the file.
>>>>>>>>>>>
>>>>>>>>>>> Andy
>>>>>>>>>>>
>>>>>>>>>>> On Sat, 17 Aug 2019, Petr Vokac wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Today a new "65PB" file appeared on our storage and it was again
>>>>>>>>>>>> transferred with the xrootd protocol. I can see again the same
>>>>>>>>>>>> error in the log file:
>>>>>>>>>>>>
>>>>>>>>>>>> *[root@dpmpool21 ~]# ls -la /mnt/fs4/atlas/2019-08-17/EVNT.18893579._003197.pool.root.1.rucio.upload.234855.1566031522*
>>>>>>>>>>>> -rw-rw---- 1 dpmmgr dpmmgr 72340172838077281 Aug 17 10:48 /mnt/fs4/atlas/2019-08-17/EVNT.18893579._003197.pool.root.1.rucio.upload.234855.1566031522
>>>>>>>>>>>>
>>>>>>>>>>>> *[root@dpmpool21 ~]# grep atlasprd.2757 /var/log/xrootd/dpmdisk/xrootd.log*
>>>>>>>>>>>> ...
>>>>>>>>>>>> 190817 10:47:55 57803 XrootdXeq: >>>>>>>>>>>> atlasprd.2757:65@[2001:718:401:6e03::1:43] pub IP64 login as >>>>>>>>>>>> /DC=ch/DC=cern/OU=Organic >>>>>>>>>>>> Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1 >>>>>>>>>>>> 190817 10:47:55 61859 XrootdXeq: >>>>>>>>>>>> atlasprd.2757:66@[2001:718:401:6e03::1:83] pub IP64 login as >>>>>>>>>>>> /DC=ch/DC=cern/OU=Organic >>>>>>>>>>>> Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1 >>>>>>>>>>>> 190817 10:48:52 62233 XrootdXeq: >>>>>>>>>>>> atlasprd.2757:46@[2001:718:401:6e03::1:72] disc 0:00:58 >>>>>>>>>>>> 190817 10:48:52 62954 XrootdXeq: >>>>>>>>>>>> atlasprd.2757:25@[2001:718:401:6e03::1:72] disc 0:03:30 >>>>>>>>>>>> 190817 10:48:52 57804 XrootdXeq: >>>>>>>>>>>> atlasprd.2757:28@[2001:718:401:6e03::1:72] pub IP64 login as >>>>>>>>>>>> /DC=ch/DC=cern/OU=Organic >>>>>>>>>>>> Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1 >>>>>>>>>>>> 190817 10:48:52 57804 ofs_Posc: Creator changed from >>>>>>>>>>>> atlasprd.2757:25@[2001:718:401:6e03::1:72] to >>>>>>>>>>>> atlasprd.2757:28@[2001:718:401:6e03::1:72] for >>>>>>>>>>>> /dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_13TeV/8b/f3/EVNT.18893579._003197.pool.root.1.rucio.upload >>>>>>>>>>>> >>>>>>>>>>>> 190817 10:48:53 57803 XrootdXeq: >>>>>>>>>>>> atlasprd.2757:46@[2001:718:401:6017:20:0:14:44] pub IP64 login as >>>>>>>>>>>> /DC=ch/DC=cern/OU=Organic >>>>>>>>>>>> Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1 >>>>>>>>>>>> 190817 10:49:01 57818 XrootdXeq: >>>>>>>>>>>> atlasprd.2757:70@[2001:718:401:6017:20:0:11:11] pub IP64 login as >>>>>>>>>>>> /DC=ch/DC=cern/OU=Organic >>>>>>>>>>>> Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1 >>>>>>>>>>>> 190817 10:49:07 62954 XrootdXeq: >>>>>>>>>>>> atlasprd.2757:55@[2001:718:401:6e03::1:81] disc 0:10:17 >>>>>>>>>>>> ... >>>>>>>>>>>> >>>>>>>>>>>> Any suggestion how to trace what is causing this behavior? 
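[Editor's note] On the question of how to trace this: with "ofs.trace write" enabled (as Andy suggests elsewhere in the thread), the corrupt first write is visible in the log, so it can be caught mechanically by scanning for `ofs_write` records with an implausible offset. A minimal sketch; the regex follows the `len@offset fn=path` form shown in the logs above, and the 16 TB cutoff is an arbitrary site choice:

```python
import re

# Matches the "ofs_write: <length>@<offset> fn=<path>" records that
# "ofs.trace write" produces (format taken from the logs in this thread).
OFS_WRITE = re.compile(r"ofs_write:\s*(\d+)@(\d+)\s+fn=(\S+)")

def suspicious_writes(lines, max_offset=16 * 2**40):
    """Yield (offset, path) for writes beyond max_offset (16 TB default)."""
    for line in lines:
        m = OFS_WRITE.search(line)
        if m and int(m.group(2)) > max_offset:
            yield int(m.group(2)), m.group(3)
```

Feeding a tailed xrootd.log through this would flag the bad upload within seconds of the first write, instead of days later at checksum time.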
>>>>>>>>>>>> Petr
>>>>>>>>>>>>
>>>>>>>>>>>> On 7/20/19 7:48 AM, Petr Vokac wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I ran the command you sent me
>>>>>>>>>>>>>
>>>>>>>>>>>>> grep "atlasprd.2757:" <logfile> | egrep '2001:718:401:6e03::1:21|2001:718:401:6e03::1:32' > <excerpt>
>>>>>>>>>>>>>
>>>>>>>>>>>>> and because the output was more than just a few lines long I
>>>>>>>>>>>>> uploaded it to
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://vokac.web.cern.ch/vokac/tmp/dpmpool22.xrootd.log
>>>>>>>>>>>>> https://vokac.web.cern.ch/vokac/tmp/dpmpool24.xrootd.log
>>>>>>>>>>>>>
>>>>>>>>>>>>> These machines on the 2001:718:401:6e03::/64 subnet are in a
>>>>>>>>>>>>> remote location (but relatively close, with RTT ~ 0.7ms). I can
>>>>>>>>>>>>> see several ofs_Posc occurrences in the logs of our disknodes
>>>>>>>>>>>>> during the "190717 17:xx:xx" time period:
>>>>>>>>>>>>>
>>>>>>>>>>>>> http://vokac.web.cern.ch/vokac/tmp/POSC.txt
>>>>>>>>>>>>>
>>>>>>>>>>>>> Petr
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 7/19/19 8:51 PM, Andrew Hanushevsky wrote:
>>>>>>>>>>>>>> Hi Petr,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Well, POSC says that the client reconnected for writing, so not
>>>>>>>>>>>>>> all went well. Can I get the log output for that occurrence, as
>>>>>>>>>>>>>> I previously described?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, 19 Jul 2019, Petr Vokac wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We are using the Rucio download client to transfer job outputs
>>>>>>>>>>>>>>> (it is built on top of the gfal2 python interface).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> https://vokac.web.cern.ch/vokac/tmp/dpmpool22.xrootd.log
>>>>>>>>>>>>>>> https://vokac.web.cern.ch/vokac/tmp/dpmpool24.xrootd.log
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> this is the full job output, unfortunately with minimal gfal2
>>>>>>>>>>>>>>> logging output.
>>>>>>>>>>>>>>> https://vokac.web.cern.ch/vokac/tmp/grid.2258136.4.out-from-2001:718:401:6e03::1:21-to-dpmpool22.txt
>>>>>>>>>>>>>>> https://vokac.web.cern.ch/vokac/tmp/grid.2255921.7.out-from-2001:718:401:6e03::1:32-to-dpmpool24.txt
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It looks like according to gfal2 the transfer was OK
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2019-07-17 15:09:17,782 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_xrootd.so loaded with success
>>>>>>>>>>>>>>> 2019-07-17 15:09:17,782 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_rfio.so loaded with success
>>>>>>>>>>>>>>> 2019-07-17 15:09:17,783 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_gridftp.so loaded with success
>>>>>>>>>>>>>>> 2019-07-17 15:09:17,783 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_http.so loaded with success
>>>>>>>>>>>>>>> 2019-07-17 15:09:17,783 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_lfc.so loaded with success
>>>>>>>>>>>>>>> 2019-07-17 15:09:17,783 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_file.so loaded with success
>>>>>>>>>>>>>>> 2019-07-17 15:09:17,784 | INFO | copytool_out | gfal2 | connect |
>>>>>>>>>>>>>>> [gfal_module_load] plugin >>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_sftp.so loaded with >>>>>>>>>>>>>>> success >>>>>>>>>>>>>>> 2019-07-17 15:09:17,784 | INFO | copytool_out | >>>>>>>>>>>>>>> gfal2 | connect | >>>>>>>>>>>>>>> [gfal_module_load] plugin >>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_dcap.so loaded with >>>>>>>>>>>>>>> success >>>>>>>>>>>>>>> 2019-07-17 15:09:17,784 | INFO | copytool_out | >>>>>>>>>>>>>>> gfal2 | connect | >>>>>>>>>>>>>>> [gfal_module_load] plugin >>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_srm.so loaded with >>>>>>>>>>>>>>> success >>>>>>>>>>>>>>> 2019-07-17 15:09:17,793 | INFO | copytool_out | >>>>>>>>>>>>>>> gfal2 | connect | >>>>>>>>>>>>>>> [gfal_module_load] plugin >>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_xrootd.so loaded with >>>>>>>>>>>>>>> success >>>>>>>>>>>>>>> 2019-07-17 15:09:17,793 | INFO | copytool_out | >>>>>>>>>>>>>>> gfal2 | connect | >>>>>>>>>>>>>>> [gfal_module_load] plugin >>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_rfio.so loaded with >>>>>>>>>>>>>>> success >>>>>>>>>>>>>>> 2019-07-17 15:09:17,793 | INFO | copytool_out | >>>>>>>>>>>>>>> gfal2 | connect | >>>>>>>>>>>>>>> [gfal_module_load] plugin >>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_gridftp.so loaded with >>>>>>>>>>>>>>> success >>>>>>>>>>>>>>> 2019-07-17 15:09:17,794 | INFO | copytool_out | >>>>>>>>>>>>>>> gfal2 | connect | >>>>>>>>>>>>>>> [gfal_module_load] plugin >>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_http.so loaded with >>>>>>>>>>>>>>> success >>>>>>>>>>>>>>> 2019-07-17 15:09:17,794 | INFO | copytool_out | >>>>>>>>>>>>>>> gfal2 | connect | >>>>>>>>>>>>>>> [gfal_module_load] plugin >>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_lfc.so loaded with >>>>>>>>>>>>>>> success >>>>>>>>>>>>>>> 2019-07-17 15:09:17,794 | INFO | copytool_out | >>>>>>>>>>>>>>> gfal2 | connect | >>>>>>>>>>>>>>> [gfal_module_load] plugin >>>>>>>>>>>>>>> 
/usr/lib64/gfal2-plugins//libgfal_plugin_file.so loaded with >>>>>>>>>>>>>>> success >>>>>>>>>>>>>>> 2019-07-17 15:09:17,794 | INFO | copytool_out | >>>>>>>>>>>>>>> gfal2 | connect | >>>>>>>>>>>>>>> [gfal_module_load] plugin >>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_sftp.so loaded with >>>>>>>>>>>>>>> success >>>>>>>>>>>>>>> 2019-07-17 15:09:17,794 | INFO | copytool_out | >>>>>>>>>>>>>>> gfal2 | connect | >>>>>>>>>>>>>>> [gfal_module_load] plugin >>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_dcap.so loaded with >>>>>>>>>>>>>>> success >>>>>>>>>>>>>>> 2019-07-17 15:09:17,794 | INFO | copytool_out | >>>>>>>>>>>>>>> gfal2 | connect | >>>>>>>>>>>>>>> [gfal_module_load] plugin >>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_srm.so loaded with >>>>>>>>>>>>>>> success >>>>>>>>>>>>>>> 2019-07-17 15:09:17,795 | INFO | copytool_out | >>>>>>>>>>>>>>> gfal2 | connect | >>>>>>>>>>>>>>> [gfal_module_load] plugin >>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_xrootd.so loaded with >>>>>>>>>>>>>>> success >>>>>>>>>>>>>>> 2019-07-17 15:09:17,795 | INFO | copytool_out | >>>>>>>>>>>>>>> gfal2 | connect | >>>>>>>>>>>>>>> [gfal_module_load] plugin >>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_rfio.so loaded with >>>>>>>>>>>>>>> success >>>>>>>>>>>>>>> 2019-07-17 15:09:17,795 | INFO | copytool_out | >>>>>>>>>>>>>>> gfal2 | connect | >>>>>>>>>>>>>>> [gfal_module_load] plugin >>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_gridftp.so loaded with >>>>>>>>>>>>>>> success >>>>>>>>>>>>>>> 2019-07-17 15:09:17,795 | INFO | copytool_out | >>>>>>>>>>>>>>> gfal2 | connect | >>>>>>>>>>>>>>> [gfal_module_load] plugin >>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_http.so loaded with >>>>>>>>>>>>>>> success >>>>>>>>>>>>>>> 2019-07-17 15:09:17,796 | INFO | copytool_out | >>>>>>>>>>>>>>> gfal2 | connect | >>>>>>>>>>>>>>> [gfal_module_load] plugin >>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_lfc.so loaded with 
>>>>>>>>>>>>>>> success >>>>>>>>>>>>>>> 2019-07-17 15:09:17,796 | INFO | copytool_out | >>>>>>>>>>>>>>> gfal2 | connect | >>>>>>>>>>>>>>> [gfal_module_load] plugin >>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_file.so loaded with >>>>>>>>>>>>>>> success >>>>>>>>>>>>>>> 2019-07-17 15:09:17,796 | INFO | copytool_out | >>>>>>>>>>>>>>> gfal2 | connect | >>>>>>>>>>>>>>> [gfal_module_load] plugin >>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_sftp.so loaded with >>>>>>>>>>>>>>> success >>>>>>>>>>>>>>> 2019-07-17 15:09:17,796 | INFO | copytool_out | >>>>>>>>>>>>>>> gfal2 | connect | >>>>>>>>>>>>>>> [gfal_module_load] plugin >>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_dcap.so loaded with >>>>>>>>>>>>>>> success >>>>>>>>>>>>>>> 2019-07-17 15:09:17,796 | INFO | copytool_out | >>>>>>>>>>>>>>> gfal2 | connect | >>>>>>>>>>>>>>> [gfal_module_load] plugin >>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_srm.so loaded with >>>>>>>>>>>>>>> success >>>>>>>>>>>>>>> 2019-07-17 15:09:17,810 | INFO | copytool_out | >>>>>>>>>>>>>>> gfal2 | __gfal2_copy | Event >>>>>>>>>>>>>>> triggered: BOTH GFAL2:CORE:COPY LIST:ENTER >>>>>>>>>>>>>>> 2019-07-17 15:09:17,810 | INFO | copytool_out | >>>>>>>>>>>>>>> gfal2 | __gfal2_copy | Event >>>>>>>>>>>>>>> triggered: BOTH GFAL2:CORE:COPY LIST:ITEM >>>>>>>>>>>>>>> file:///scratch/condor/dir_24031/atlas_HudmF2he/PanDA_Pilot-4419868006/log.18636890._001059.job.log.tgz.1 >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> => >>>>>>>>>>>>>>> root://golias100.farm.particle.cz:1094//dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_13TeV/26/98/log.18636890._001059.job.log.tgz.1.rucio.upload >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019-07-17 15:09:17,810 | INFO | copytool_out | >>>>>>>>>>>>>>> gfal2 | __gfal2_copy | Event >>>>>>>>>>>>>>> triggered: BOTH GFAL2:CORE:COPY LIST:EXIT >>>>>>>>>>>>>>> 2019-07-17 15:09:17,811 | INFO | Dummy-2 | >>>>>>>>>>>>>>> gfal2 | (unknown function) | Event >>>>>>>>>>>>>>> 
triggered: BOTH xroot TRANSFER:ENTER
>>>>>>>>>>>>>>> file://localhost///scratch/condor/dir_24031/atlas_HudmF2he/PanDA_Pilot-4419868006/log.18636890._001059.job.log.tgz.1?xrd.gsiusrpxy=/scratch/condor/dir_24031/job.9XONDm6yx7unnoBGSqYX2MjnABFKDmABFKDmSyhTDmEBFKDmqAGN3n.proxy&xrd.wantprot=gsi,unix
>>>>>>>>>>>>>>> =>
>>>>>>>>>>>>>>> root://golias100.farm.particle.cz:1094///dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_13TeV/26/98/log.18636890._001059.job.log.tgz.1.rucio.upload?xrd.gsiusrpxy=/scratch/condor/dir_24031/job.9XONDm6yx7unnoBGSqYX2MjnABFKDmABFKDmSyhTDmEBFKDmqAGN3n.proxy&x
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2019-07-17 15:09:17,811 | INFO | Dummy-2 | gfal2 | (unknown function) | Event triggered: BOTH xroot TRANSFER:TYPE streamed
>>>>>>>>>>>>>>> 2019-07-17 15:11:23,914 | INFO | Dummy-2 | gfal2 | (unknown function) | Event triggered: BOTH xroot TRANSFER:EXIT Job finished, [SUCCESS]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> but rucio probably then tries to verify the checksum (gfal-sum
>>>>>>>>>>>>>>> via the python bindings) and it fails
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2019-07-17 15:15:23,774 | INFO | copytool_out | pilot.copytool.rucio | copy_out | stderr = An unknown exception occurred. Details: Checksum not validated
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> because DPM is unable to calculate the checksum for a 128TB
>>>>>>>>>>>>>>> file within the timeout.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Petr
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 7/19/19 5:07 AM, Andrew Hanushevsky wrote:
>>>>>>>>>>>>>>>> Hi Petr,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Indeed, it looks like what was used to upload the files
>>>>>>>>>>>>>>>> (xrdcp?) created a very large sparse file by writing with a
>>>>>>>>>>>>>>>> very large offset.
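[Editor's note] The checksum mismatch Petr mentions is cheap to reproduce locally: adler32 over the first `original_size` bytes of the stored replica, streamed so a multi-TB sparse file is never read past that point. A minimal sketch using `zlib.adler32` (paths and the expected catalogue checksum are placeholders):

```python
import zlib

def adler32_prefix(path, nbytes, chunk=1 << 20):
    """Hex adler32 of the first nbytes of path, streamed in 1 MB chunks
    (so a 128TB sparse file never has to be read past nbytes)."""
    value = 1  # adler32 starting value
    remaining = nbytes
    with open(path, "rb") as f:
        while remaining > 0:
            data = f.read(min(chunk, remaining))
            if not data:
                break
            value = zlib.adler32(data, value)
            remaining -= len(data)
    return format(value & 0xFFFFFFFF, "08x")
```

Comparing this against the catalogue value (e.g. 742e1336 for the log tarball earlier in the thread) would flag the zero-filled uploads immediately, without waiting for DPM's server-side checksum to time out.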
>>>>>>>>>>>>>>>> The clue here is that an anomaly occurred during the write,
>>>>>>>>>>>>>>>> and these two messages (likely each corresponding to one of
>>>>>>>>>>>>>>>> the files) tell us something....
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 190717 17:10:39 332092 ofs_Posc: Creator changed from atlasprd.2757:90@[2001:718:401:6e03::1:21] to atlasprd.2757:39@[2001:718:401:6e03::1:21] for /dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_13TeV/26/98/log.18636890._001059.job.log.tgz.1.rucio.upload
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 190717 17:29:26 14204 ofs_Posc: Creator changed from atlasprd.2757:33@[2001:718:401:6e03::1:32] to atlasprd.2757:95@[2001:718:401:6e03::1:32] for /dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_5TeV/09/bc/AOD.18595963._000786.pool.root.1.rucio.upload
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> What the POSC system is saying is that in each case the
>>>>>>>>>>>>>>>> original creator of the file lost contact with the server. It
>>>>>>>>>>>>>>>> then reconnected and resumed writing with a successful close.
>>>>>>>>>>>>>>>> POSC allows this to keep network glitches from deleting a
>>>>>>>>>>>>>>>> file, because a network glitch will cause the client to
>>>>>>>>>>>>>>>> reconnect and resume where it left off. This is only allowed
>>>>>>>>>>>>>>>> when the resuming client comes from the same IP address, has
>>>>>>>>>>>>>>>> the same process ID, and the same logical name. In this case
>>>>>>>>>>>>>>>> it was "atlasprd.2757". I find that rather odd because it's
>>>>>>>>>>>>>>>> next to impossible for two processes on two different machines
>>>>>>>>>>>>>>>> to have the same process number (i.e. 2757). So, this is why I
>>>>>>>>>>>>>>>> am wondering if this was really an xrdcp or some other program.
>>>>>>>>>>>>>>>> We should look to see what these clients were actually doing
>>>>>>>>>>>>>>>> during this period, as they clearly did not do the right thing
>>>>>>>>>>>>>>>> when they reconnected. So, could you grep out all interactions
>>>>>>>>>>>>>>>> of these two clients from the log, something like:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> grep "atlasprd.2757:" <logfile> | egrep '2001:718:401:6e03::1:21|2001:718:401:6e03::1:32' > <excerpt>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> and send the excerpt to me?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, 18 Jul 2019, Petr Vokac wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> our DPM, which is using native xrootd 4.9.1, yesterday
>>>>>>>>>>>>>>>>> created two different 128TB files on two different
>>>>>>>>>>>>>>>>> diskservers that were received as job output from two
>>>>>>>>>>>>>>>>> different worker nodes. The size of the original file was a
>>>>>>>>>>>>>>>>> few megabytes, and when I check the beginning of the stored
>>>>>>>>>>>>>>>>> files up to the original size & checksum everything seems to
>>>>>>>>>>>>>>>>> be OK - except these are sparse files with size 128TB.
Original transferred files from WN looked like:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> mc15_13TeV:log.18636890._001059.job.log.tgz.1 - original size 3101899, adler32 742e1336
>>>>>>>>>>>>>>>>> mc16_5TeV:AOD.18595963._000786.pool.root.1 - original size 737474514, adler32 dc68cb21
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> but on our diskservers they were stored as
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [root@dpmpool22 ~]# ls -l /mnt/fs1/atlas/2019-07-17/log.18636890._001059.job.log.tgz.1.rucio.upload.209928.1563376157
>>>>>>>>>>>>>>>>> -rw-rw---- 1 dpmmgr dpmmgr 140560257079236 Jul 17 17:11 /mnt/fs1/atlas/2019-07-17/log.18636890._001059.job.log.tgz.1.rucio.upload.209928.1563376157
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [root@dpmpool22 ~]# du --block-size=1 /mnt/fs1/atlas/2019-07-17/log.18636890._001059.job.log.tgz.1.rucio.upload.209928.1563376157
>>>>>>>>>>>>>>>>> 3108864 /mnt/fs1/atlas/2019-07-17/log.18636890._001059.job.log.tgz.1.rucio.upload.209928.1563376157
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [root@dpmpool24 ~]# ls -l /mnt/fs2/atlas/2019-07-17/AOD.18595963._000786.pool.root.1.rucio.upload.210480.1563377083
>>>>>>>>>>>>>>>>> -rw-rw---- 1 dpmmgr dpmmgr 139734345056408 Jul 17 17:31 /mnt/fs2/atlas/2019-07-17/AOD.18595963._000786.pool.root.1.rucio.upload.210480.1563377083
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [root@dpmpool24 tmp]# du --block-size=1 /mnt/fs2/atlas/2019-07-17/AOD.18595963._000786.pool.root.1.rucio.upload.210480.1563377083
>>>>>>>>>>>>>>>>> 737480704 /mnt/fs2/atlas/2019-07-17/AOD.18595963._000786.pool.root.1.rucio.upload.210480.1563377083
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I was told that DPM is not really involved in transferring /
>>>>>>>>>>>>>>>>> storing files with the xrootd protocol, so I guess something
>>>>>>>>>>>>>>>>> bad must have happened inside the XRootD libraries. Do you
>>>>>>>>>>>>>>>>> have any idea why these two transfers generated such extreme
>>>>>>>>>>>>>>>>> files? Unfortunately our xrootd logging is pretty low and I
>>>>>>>>>>>>>>>>> was able to find only info about the file creation:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [root@dpmpool22 ~]# grep log.18636890._001059.job.log.tgz /var/log/xrootd/dpmdisk/xrootd.log
>>>>>>>>>>>>>>>>> 190717 17:10:39 332092 ofs_Posc: Creator changed from atlasprd.2757:90@[2001:718:401:6e03::1:21] to atlasprd.2757:39@[2001:718:401:6e03::1:21] for /dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_13TeV/26/98/log.18636890._001059.job.log.tgz.1.rucio.upload
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [root@dpmpool24 tmp]# grep AOD.18595963._000786.pool.root /var/log/xrootd/dpmdisk/xrootd.log
>>>>>>>>>>>>>>>>> 190717 17:29:26 14204 ofs_Posc: Creator changed from atlasprd.2757:33@[2001:718:401:6e03::1:32] to atlasprd.2757:95@[2001:718:401:6e03::1:32] for /dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_5TeV/09/bc/AOD.18595963.
_000786.pool.root.1.rucio.upload
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Petr

########################################################################

Use REPLY-ALL to reply to list

To unsubscribe from the XROOTD-L list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1