I set "xrootd.async off" on most of our diskservers, ran test
transfers for the last few days, and I can confirm that diskservers with
the updated config no longer produce corrupted files. The default DPM
configuration has used "xrootd.async on" probably since xrootd ~3.6
(https://twiki.cern.ch/twiki/bin/view/DPM/DPMComponents_Dpm-Xrootd#Manual_Configuration_Configurati),
but the "on" parameter is not even mentioned in the documentation:
https://xrootd.slac.stanford.edu/doc/dev48/xrd_config.htm#_Toc496911332

Does this mean we should change the default xrootd.async configuration in
the upcoming DPM DOME 1.13.2?
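
A minimal sketch of how the override was applied per disk node (the config
file path and service instance name below are assumptions based on the usual
DPM layout, not taken from this thread; a puppet-managed node may need the
change in its template instead):

# append the directive to the disk-node xrootd config and restart the instance
echo "xrootd.async off" >> /etc/xrootd/xrootd-dpmdisk.cfg   # path assumed
systemctl restart xrootd@dpmdisk                            # instance name assumed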

The DPM DOME interface loaded as an XRootD plugin doesn't bring in a Globus
dependency; no Globus library shows up in lsof for a running disk-server process:

*[root@dpmpool22 ~]# lsof -p 16232 | grep REG*
xrootd  16232 dpmmgr  txt       REG                9,2       70576       662786 /usr/bin/xrootd
xrootd  16232 dpmmgr  mem       REG                9,2       31408       673487 /usr/lib64/libnss_dns-2.17.so
xrootd  16232 dpmmgr  mem       REG                9,2      217032       395689 /var/db/nscd/services
xrootd  16232 dpmmgr  mem       REG                9,2     3135712       660855 /usr/lib64/mysql/libmysqlclient.so.18.0.0
xrootd  16232 dpmmgr  mem       REG                9,2     1975448       657235 /usr/lib64/libdome-4.so
xrootd  16232 dpmmgr  mem       REG                9,2      217032       394245 /var/db/nscd/passwd
xrootd  16232 dpmmgr  mem       REG                9,2      211240       660390 /usr/lib64/libXrdDPMCks-4.so
xrootd  16232 dpmmgr  mem       REG                9,2      170024       659024 /usr/lib64/libXrdDPMDiskAcc-4.so
xrootd  16232 dpmmgr  mem       REG                9,2       20112       676196 /usr/lib64/libuuid.so.1.3.0
xrootd  16232 dpmmgr  mem       REG                9,2     1509376       656808 /usr/lib64/libxml2.so.2.9.1
xrootd  16232 dpmmgr  mem       REG                9,2     1176008       661816 /usr/lib64/libdavix.so.0.7.4
xrootd  16232 dpmmgr  mem       REG                9,2      832600       662787 /usr/lib64/dmlite/plugin_domeadapter.so
xrootd  16232 dpmmgr  mem       REG                9,2       36552       660872 /usr/lib64/dmlite/plugin_config.so
xrootd  16232 dpmmgr  mem       REG                9,2    20781704       672658 /usr/lib64/libicudata.so.50.1.2
xrootd  16232 dpmmgr  mem       REG                9,2     2096104       672660 /usr/lib64/libicui18n.so.50.1.2
xrootd  16232 dpmmgr  mem       REG                9,2     1539544       672672 /usr/lib64/libicuuc.so.50.1.2
xrootd  16232 dpmmgr  mem       REG                9,2        6856       656621 /usr/lib64/libboost_atomic-mt.so.1.53.0
xrootd  16232 dpmmgr  mem       REG                9,2       32056       656588 /usr/lib64/libboost_chrono-mt.so.1.53.0
xrootd  16232 dpmmgr  mem       REG                9,2     1060624       672677 /usr/lib64/libboost_regex-mt.so.1.53.0
xrootd  16232 dpmmgr  mem       REG                9,2       94320       658557 /usr/lib64/libboost_thread-mt.so.1.53.0
xrootd  16232 dpmmgr  mem       REG                9,2       15608       658186 /usr/lib64/libboost_system-mt.so.1.53.0
xrootd  16232 dpmmgr  mem       REG                9,2       69184       658927 /usr/lib64/libboost_date_time-mt.so.1.53.0
xrootd  16232 dpmmgr  mem       REG                9,2     1213536       656626 /usr/lib64/libdmlite.so.1.13.1
xrootd  16232 dpmmgr  mem       REG                9,2      232064       659448 /usr/lib64/libXrdDPMOss-4.so
xrootd  16232 dpmmgr  mem       REG                9,2       15688       656791 /usr/lib64/libXrdSecunix-4.so
xrootd  16232 dpmmgr  mem       REG                9,2      173320       657303 /usr/lib64/libexpat.so.1.6.0
xrootd  16232 dpmmgr  mem       REG                9,2      437216       672545 /usr/lib64/libvomsapi.so.1.0.0
xrootd  16232 dpmmgr  mem       REG                9,2       32536       673484 /usr/lib64/libXrdSecgsiVOMS-4.so
xrootd  16232 dpmmgr  mem       REG                9,2      174352       656623 /usr/lib64/libXrdCryptossl-4.so
xrootd  16232 dpmmgr  mem       REG                9,2      185592       679728 /usr/lib64/libXrdSecgsi-4.so
xrootd  16232 dpmmgr  mem       REG                9,2       61792       679746 /usr/lib64/libXrdSec-4.so
xrootd  16232 dpmmgr  mem       REG                9,2       15688       658274 /usr/lib64/libkeyutils.so.1.5
xrootd  16232 dpmmgr  mem       REG                9,2       67104       675723 /usr/lib64/libkrb5support.so.0.1
xrootd  16232 dpmmgr  mem       REG                9,2      210824       675676 /usr/lib64/libk5crypto.so.3.1
xrootd  16232 dpmmgr  mem       REG                9,2       15920       655385 /usr/lib64/libcom_err.so.2.1
xrootd  16232 dpmmgr  mem       REG                9,2      967848       659013 /usr/lib64/libkrb5.so.3.3
xrootd  16232 dpmmgr  mem       REG                9,2      320400       672019 /usr/lib64/libgssapi_krb5.so.2.2
xrootd  16232 dpmmgr  mem       REG                9,2     2516624       659028 /usr/lib64/libcrypto.so.1.0.2k
xrootd  16232 dpmmgr  mem       REG                9,2      470360       672745 /usr/lib64/libssl.so.1.0.2k
xrootd  16232 dpmmgr  mem       REG                9,2       98496       656613 /usr/lib64/libXrdCrypto.so.1.0.0
xrootd  16232 dpmmgr  mem       REG                9,2      164664       659693 /usr/lib64/libXrdHttpUtils.so.1.0.0
xrootd  16232 dpmmgr  mem       REG                9,2       11536       659387 /usr/lib64/libXrdHttp-4.so
xrootd  16232 dpmmgr  mem       REG                9,2       61624       673959 /usr/lib64/libnss_files-2.17.so
xrootd  16232 dpmmgr  mem       REG                9,2       68192       656816 /usr/lib64/libbz2.so.1.0.6
xrootd  16232 dpmmgr  mem       REG                9,2       90248       656792 /usr/lib64/libz.so.1.2.7
xrootd  16232 dpmmgr  mem       REG                9,2      100008       676257 /usr/lib64/libelf-0.172.so
xrootd  16232 dpmmgr  mem       REG                9,2      402384       656781 /usr/lib64/libpcre.so.1.2.0
xrootd  16232 dpmmgr  mem       REG                9,2       19896       657215 /usr/lib64/libattr.so.1.1.0
xrootd  16232 dpmmgr  mem       REG                9,2      330464       675819 /usr/lib64/libdw-0.172.so
xrootd  16232 dpmmgr  mem       REG                9,2      105824       673967 /usr/lib64/libresolv-2.17.so
xrootd  16232 dpmmgr  mem       REG                9,2       19384       656979 /usr/lib64/libgpg-error.so.0.10.0
xrootd  16232 dpmmgr  mem       REG                9,2      535064       656996 /usr/lib64/libgcrypt.so.11.8.2
xrootd  16232 dpmmgr  mem       REG                9,2       85952       675689 /usr/lib64/liblz4.so.1.7.5
xrootd  16232 dpmmgr  mem       REG                9,2      157424       656805 /usr/lib64/liblzma.so.5.2.2
xrootd  16232 dpmmgr  mem       REG                9,2      155784       676236 /usr/lib64/libselinux.so.1
xrootd  16232 dpmmgr  mem       REG                9,2       20032       657219 /usr/lib64/libcap.so.2.22
xrootd  16232 dpmmgr  mem       REG                9,2      203800       669579 /usr/lib64/libsystemd.so.0.6.0
xrootd  16232 dpmmgr  mem       REG                9,2     2151672       656593 /usr/lib64/libc-2.17.so
xrootd  16232 dpmmgr  mem       REG                9,2       88776       656524 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
xrootd  16232 dpmmgr  mem       REG                9,2     1137024       673192 /usr/lib64/libm-2.17.so
xrootd  16232 dpmmgr  mem       REG                9,2      991616       678393 /usr/lib64/libstdc++.so.6.0.19
xrootd  16232 dpmmgr  mem       REG                9,2       43776       673969 /usr/lib64/librt-2.17.so
xrootd  16232 dpmmgr  mem       REG                9,2      141968       656619 /usr/lib64/libpthread-2.17.so
xrootd  16232 dpmmgr  mem       REG                9,2       19288       673191 /usr/lib64/libdl-2.17.so
xrootd  16232 dpmmgr  mem       REG                9,2      635008       679730 /usr/lib64/libXrdUtils.so.2.0.0
xrootd  16232 dpmmgr  mem       REG                9,2      843968       660388 /usr/lib64/libXrdServer.so.2.0.0
xrootd  16232 dpmmgr  mem       REG                9,2      163400       656586 /usr/lib64/ld-2.17.so
xrootd  16232 dpmmgr  mem       REG                9,2      217032       395688 /var/db/nscd/hosts
xrootd  16232 dpmmgr    2w      REG                9,3   617903553      8388624 /var/log/xrootd/dpmdisk/xrootd.log
xrootd  16232 dpmmgr   16w      REG                9,3 27753273859      8388648 /var/log/xrootd/dpmdisk/xrootd.log-20190914 (deleted)
xrootd  16232 dpmmgr   19w      REG                9,3 27753273859      8388648 /var/log/xrootd/dpmdisk/xrootd.log-20190914 (deleted)
xrootd  16232 dpmmgr   20w      REG                9,3 27753273859      8388648 /var/log/xrootd/dpmdisk/xrootd.log-20190914 (deleted)
xrootd  16232 dpmmgr   21w      REG                9,3 27753273859      8388648 /var/log/xrootd/dpmdisk/xrootd.log-20190914 (deleted)
xrootd  16232 dpmmgr   24u      REG                9,2        6784       527318 /var/spool/xrootd/dpmdisk/.ofs/posc.log
xrootd  16232 dpmmgr *768r      REG                8,0  2998746176  77855990843 /mnt/fs1/atlas/2018-06-05/DAOD_TOPQ4.14324106._000186.pool.root.1.560762872.0
xrootd  16232 dpmmgr *769r      REG               8,64  2733448798  88068008979 /mnt/fs4/atlas/2018-06-06/DAOD_TOPQ4.14324106._000146.pool.root.1.560814999.0
xrootd  16232 dpmmgr *770r      REG               8,64  2553846404  84146182189 /mnt/fs4/atlas/2018-06-05/DAOD_TOPQ4.14324106._000020.pool.root.1.560750281.0
xrootd  16232 dpmmgr *771r      REG               8,16  2396806963  80556399667 /mnt/fs2/atlas/2018-06-05/DAOD_TOPQ4.14324106._000154.pool.root.1.560808076.0
xrootd  16232 dpmmgr *772r      REG               8,16  2453390567  79955741742 /mnt/fs2/atlas/2018-06-05/DAOD_TOPQ4.14324106._000203.pool.root.1.560760348.0
xrootd  16232 dpmmgr *773r      REG               8,48  2837420884  80029443154 /mnt/fs3/atlas/2018-06-05/DAOD_TOPQ4.14324106._000176.pool.root.1.560763754.0
xrootd  16232 dpmmgr *774r      REG               8,16   128181236 128888907798 /mnt/fs2/atlas/2018-11-24/DAOD_SUSY2.16167755._000007.pool.root.1.592538183.0
xrootd  16232 dpmmgr *776r      REG               8,16  2411255622 171819183163 /mnt/fs2/atlas/2018-11-28/DAOD_SUSY2.16167755._000017.pool.root.1.39213.1543405080
xrootd  16232 dpmmgr *779r      REG               8,48  4061471002 176141614133 /mnt/fs3/atlas/2019-06-05/DAOD_TOPQ1.15332443._000021.pool.root.1.4984.1559719508
xrootd  16232 dpmmgr *780u      REG               8,16   167772160 191362651046 /mnt/fs2/atlas/2019-09-16/AOD.19091690._002701.pool.root.1.rucio.upload.365403.1568617954
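
A quicker way to answer the Globus question directly (a sketch; the pgrep
pattern assumes a single xrootd process running under the dpmmgr account):

# check whether any Globus library is mapped into the running disk-server process
pid=$(pgrep -u dpmmgr -x xrootd | head -n1)
lsof -p "$pid" | grep -i globus || echo "no globus libraries mapped"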


Petr

On 9/12/19 11:53 PM, Andrew Hanushevsky wrote:
>  
> Hi Petr,
>  
> OK, I see now, it wasn’t clear before. So, the corruption happens at
> the xrootd-ofs interface. So, could you add the following config option:
>  
> xrootd.async off
>  
> to the disk node servers and see what happens afterwards. One more
> question, do the disk servers here use the globus library at all? That
> is, is that library linked with anything that the disk server
> uses? Usually, the lsof output on a running server will tell you which
> libraries are in use.
>  
> Andy
>  
> *From:* Petr Vokac <mailto:[log in to unmask]>
> *Sent:* Thursday, September 12, 2019 2:14 PM
> *To:* Andrew Hanushevsky <mailto:[log in to unmask]> ; andrea manzi
> <mailto:[log in to unmask]> ; [log in to unmask]
> <mailto:[log in to unmask]>
> *Cc:* dpm-devel <mailto:[log in to unmask]> ;
> [log in to unmask] <mailto:[log in to unmask]>
> *Subject:* Re: Uploaded file 128TB on filesystem
>  
> On 9/11/19 10:26 PM, Andrew Hanushevsky wrote:
>> I have yet to look at the logs to see where the corruption occurs. I
>> assume the last email contains an instance of the problem with the
>> tracing on. Once I see that I will have additional suggestions.
>> Anyway, could you tell me what the crash was here, SEGV? 
>
> Yes, core file tells me "Program terminated with signal 11,
> Segmentation fault."
>
>
> On 9/12/19 2:44 AM, Andrew Hanushevsky wrote:
>> This isn’t exactly what I wanted. With the two directives in the
>> config file for all servers, please find the server that created the
> huge file and give me its log file so I can see where the large
>> offset is introduced. That’s the only way I can narrow down where the
>> problem first occurs.
>
> Do you mean that the disknode xrootd.log file filtered by
> "condor.114208:47" for the specific transfer that created the big file is
> not sufficient and you need the whole file? Transfers done with xrdcp
> don't log as much information on the server side as transfers with
> gfal-copy:
> *[root@dpmpool9 ~]# zgrep condor.44861:
> /var/log/xrootd/dpmdisk/xrootd.log-20190912.gz *
> 190912 02:04:06 3177 XrootdXeq: condor.44861:47@[2001:718:401:6e03::1:82] pub IP64 login as /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=vokac/CN=610071/CN=Petr Vokac
> 190912 02:04:06 3177 condor.44861:47@[2001:718:401:6e03::1:82] XrootdProtocol: open unmatp //dpm/farm.particle.cz/home/atlas/1M.00381.20190912020406.296583308.00000?dpm.chunk0=0,0,dpmpool9.farm.particle.cz:/mnt/fs2/atlas/2019-09-12/1M.00381.20190912020406.296583308.00000.219576.1568246646%3fsfn=%252Fdpm%252Ffarm.particle.cz%252Fhome%252Fatlas%252F1M.00381.20190912020406.296583308.00000&dpm.dhost=dpmpool9.farm.particle.cz&dpm.hv2=cpcbXGYCWjKth6eVgQFU8A==&dpm.loc=1&dpm.nonce=b7678c5d&dpm.put=1&dpm.time=1568246646&oss.asize=1048576&xrd.gsiusrpxy=/tmp/x509up_u996&xrd.wantprot=gsi,unix
> 190912 02:04:06 3177 condor.44861:47@[2001:718:401:6e03::1:82] XrootdProtocol: fh=0 write 1048576@0
> 190912 02:04:11 21932 condor.44861:47@[2001:718:401:6e03::1:82] ofs_write: 209@3909485545868230665 fn=/dpm/farm.particle.cz/home/atlas/1M.00381.20190912020406.296583308.00000
> 190912 02:04:11 21932 condor.44861:47@[2001:718:401:6e03::1:82] ofs_write: 131072@131072 fn=/dpm/farm.particle.cz/home/atlas/1M.00381.20190912020406.296583308.00000
> 190912 02:04:11 21932 condor.44861:47@[2001:718:401:6e03::1:82] ofs_write: 131072@262144 fn=/dpm/farm.particle.cz/home/atlas/1M.00381.20190912020406.296583308.00000
> 190912 02:04:12 21932 condor.44861:47@[2001:718:401:6e03::1:82] ofs_write: 131072@393216 fn=/dpm/farm.particle.cz/home/atlas/1M.00381.20190912020406.296583308.00000
> 190912 02:04:12 21932 condor.44861:47@[2001:718:401:6e03::1:82] ofs_write: 131072@524288 fn=/dpm/farm.particle.cz/home/atlas/1M.00381.20190912020406.296583308.00000
> 190912 02:04:12 21932 condor.44861:47@[2001:718:401:6e03::1:82] ofs_write: 131072@655360 fn=/dpm/farm.particle.cz/home/atlas/1M.00381.20190912020406.296583308.00000
> 190912 02:04:12 21932 condor.44861:47@[2001:718:401:6e03::1:82] ofs_write: 131072@786432 fn=/dpm/farm.particle.cz/home/atlas/1M.00381.20190912020406.296583308.00000
> 190912 02:04:12 21932 condor.44861:47@[2001:718:401:6e03::1:82] ofs_write: 131072@917504 fn=/dpm/farm.particle.cz/home/atlas/1M.00381.20190912020406.296583308.00000
> 190912 02:05:22 22468 XrootdXeq: condor.44861:47@[2001:718:401:6e03::1:82] disc 0:01:16
> 190912 02:05:22 22468 condor.44861:47@[2001:718:401:6e03::1:82] XrootdFile: closing w /dpm/farm.particle.cz/home/atlas/1M.00381.20190912020406.296583308.00000
> 190912 02:05:22 23332 XrootdXeq: condor.44861:48@[2001:718:401:6e03::1:82] pub IP64 login as /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=vokac/CN=610071/CN=Petr Vokac
> 190912 02:05:22 23332 condor.44861:48@[2001:718:401:6e03::1:82] XrootdProtocol: open up //dpm/farm.particle.cz/home/atlas/1M.00381.20190912020406.296583308.00000?dpm.chunk0=0,0,dpmpool9.farm.particle.cz:/mnt/fs2/atlas/2019-09-12/1M.00381.20190912020406.296583308.00000.219576.1568246646%3fsfn=%252Fdpm%252Ffarm.particle.cz%252Fhome%252Fatlas%252F1M.00381.20190912020406.296583308.00000&dpm.dhost=dpmpool9.farm.particle.cz&dpm.hv2=cpcbXGYCWjKth6eVgQFU8A==&dpm.loc=1&dpm.nonce=b7678c5d&dpm.put=1&dpm.time=1568246646&oss.asize=1048576&xrd.gsiusrpxy=/tmp/x509up_u996&xrd.wantprot=gsi,unix
> 190912 02:05:22 23332 ofs_Posc: Creator changed from condor.44861:47@[2001:718:401:6e03::1:82] to condor.44861:48@[2001:718:401:6e03::1:82] for /dpm/farm.particle.cz/home/atlas/1M.00381.20190912020406.296583308.00000
> 190912 02:05:22 23332 condor.44861:48@[2001:718:401:6e03::1:82] XrootdProtocol: fh=0 write 1048576@0
> 190912 02:05:36 23332 condor.44861:48@[2001:718:401:6e03::1:82] ofs_write: 1048576@0 fn=/dpm/farm.particle.cz/home/atlas/1M.00381.20190912020406.296583308.00000
> 190912 02:05:36 23332 condor.44861:48@[2001:718:401:6e03::1:82] XrootdProtocol: close rc=0 fh=0
> 190912 02:05:36 23332 condor.44861:48@[2001:718:401:6e03::1:82] XrootdFile: closing w /dpm/farm.particle.cz/home/atlas/1M.00381.20190912020406.296583308.00000
> 190912 02:05:36 23332 XrootdXeq: condor.44861:48@[2001:718:401:6e03::1:82] disc 0:00:14
> Transfers with xrdcp have a much shorter log on the diskserver:
> *[root@dpmpool1 ~]# zgrep -E
> 'condor.114208:|1M.xrdcp.20190911070503.057846787'
> /var/log/xrootd/dpmdisk/xrootd.log-20190912.gz *
> 190911 07:05:12 31717 XrootdXeq: condor.114208:47@[2001:718:401:6e03::1:82] pub IP64 login as /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=vokac/CN=610071/CN=Petr Vokac
> 190911 07:05:12 31717 condor.114208:47@[2001:718:401:6e03::1:82] XrootdProtocol: open unmat /dpm/farm.particle.cz/home/atlas/1M.xrdcp.20190911070503.057846787?dpm.chunk0=0,0,dpmpool1.farm.particle.cz:/mnt/fs1/atlas/2019-09-11/1M.xrdcp.20190911070503.057846787.143302.1568178311%3fsfn=%252Fdpm%252Ffarm.particle.cz%252Fhome%252Fatlas%252F1M.xrdcp.20190911070503.057846787&dpm.dhost=dpmpool1.farm.particle.cz&dpm.hv2=NBoSW+W5T7Eti/8yeLbu0w==&dpm.loc=1&dpm.nonce=870b737&dpm.put=1&dpm.time=1568178311&oss.asize=1048576
> 190911 07:05:12 31717 condor.114208:47@[2001:718:401:6e03::1:82] XrootdProtocol: fh=0 write 1048576@0
> 190911 07:05:20 23385 condor.114208:47@[2001:718:401:6e03::1:82] XrootdProtocol: close rc=0 fh=0
> 190911 07:05:20 23385 condor.114208:47@[2001:718:401:6e03::1:82] XrootdFile: closing w /dpm/farm.particle.cz/home/atlas/1M.xrdcp.20190911070503.057846787
> 190911 07:05:20 23385 XrootdXeq: condor.114208:47@[2001:718:401:6e03::1:82] disc 0:00:09
> I don't know why the gfal transfer leaves more information in the server
> xrootd.log file; probably something in the environment or transfer
> parameters differs from the pure `xrdcp file:///... root://...` command.
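>
> One speculative check (a sketch only, not something verified here): repeat
> the plain xrdcp with the same opaque setting seen in the gfal-driven open
> above, to see whether it accounts for the extra server-side logging:
>
> # xrd.wantprot value copied from the gfal open URL; target name is illustrative
> xrdcp file:///tmp/1M "root://golias100.farm.particle.cz//dpm/farm.particle.cz/home/atlas/1M.wantprot.test?xrd.wantprot=gsi,unix"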
>
> Petr
>
>>  
>> *From:* Petr Vokac <mailto:[log in to unmask]>
>> *Sent:* Wednesday, September 11, 2019 12:35 AM
>> *To:* Andrew Hanushevsky <mailto:[log in to unmask]> ; andrea
>> manzi <mailto:[log in to unmask]> ; [log in to unmask]
>> <mailto:[log in to unmask]>
>> *Cc:* dpm-devel <mailto:[log in to unmask]> ;
>> [log in to unmask] <mailto:[log in to unmask]>
>> *Subject:* Re: Uploaded file 128TB on filesystem
>>  
>> Now I have the following lines on all our DPM diskservers:
>> ofs.trace write
>> xrootd.trace fs
>> and I created a simple script [1] that uses just xrdcp to copy a 1MB file
>> (its size check is sketched after the snippet):
>> CURR=$(date +%Y%m%d%H%M%S.%N)
>> export XRD_LOGLEVEL=Dump
>> export XRD_LOGMASK=All
>> export XRD_LOGFILE=1M.${CURR}.dump
>> xrdcp "file:///tmp/1M" "root://golias100.farm.particle.cz//dpm/farm.particle.cz/home/atlas/1M.xrdcp.${CURR}"
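>> The size check the wrapper does is roughly the following (a sketch only;
>> the real script [1] is linked below and its exact logic may differ):
>> # compare the remote size reported by xrdfs against the 1M source file
>> SRC_SIZE=$(stat -c %s /tmp/1M)
>> DST_SIZE=$(xrdfs golias100.farm.particle.cz stat "/dpm/farm.particle.cz/home/atlas/1M.xrdcp.${CURR}" | awk '/^Size:/ {print $2}')
>> [ "$SRC_SIZE" = "$DST_SIZE" ] || echo "$(date) ERROR BAD SIZE ${DST_SIZE}"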
>> Even this simple code creates corrupted files at our DPM storage, e.g.
>> the following 1.5GB file:
>> *[root@dpmpool1 ~]# ls -la
>> /mnt/fs1/atlas/2019-09-11/1M.xrdcp.20190911070503.057846787.143302.1568178311*
>> -rw-rw---- 1 dpmmgr dpmmgr 1761607680 Sep 11 07:05 /mnt/fs1/atlas/2019-09-11/1M.xrdcp.20190911070503.057846787.143302.1568178311
>> *[root@dpmpool1 ~]# du --block-size=1
>> /mnt/fs1/atlas/2019-09-11/1M.xrdcp.20190911070503.057846787.143302.1568178311*
>> 1048576	/mnt/fs1/atlas/2019-09-11/1M.xrdcp.20190911070503.057846787.143302.1568178311
>> *[root@dpmpool1 ~]# grep condor.114208:47
>> /var/log/xrootd/dpmdisk/xrootd.log*
>> 190911 07:05:12 31717 XrootdXeq: condor.114208:47@[2001:718:401:6e03::1:82] pub IP64 login as /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=vokac/CN=610071/CN=Petr Vokac
>> 190911 07:05:12 31717 condor.114208:47@[2001:718:401:6e03::1:82] XrootdProtocol: open unmat /dpm/farm.particle.cz/home/atlas/1M.xrdcp.20190911070503.057846787?dpm.chunk0=0,0,dpmpool1.farm.particle.cz:/mnt/fs1/atlas/2019-09-11/1M.xrdcp.20190911070503.057846787.143302.1568178311%3fsfn=%252Fdpm%252Ffarm.particle.cz%252Fhome%252Fatlas%252F1M.xrdcp.20190911070503.057846787&dpm.dhost=dpmpool1.farm.particle.cz&dpm.hv2=NBoSW+W5T7Eti/8yeLbu0w==&dpm.loc=1&dpm.nonce=870b737&dpm.put=1&dpm.time=1568178311&oss.asize=1048576
>> 190911 07:05:12 31717 condor.114208:47@[2001:718:401:6e03::1:82] XrootdProtocol: fh=0 write 1048576@0
>> 190911 07:05:20 23385 condor.114208:47@[2001:718:401:6e03::1:82] XrootdProtocol: close rc=0 fh=0
>> 190911 07:05:20 23385 condor.114208:47@[2001:718:401:6e03::1:82] XrootdFile: closing w /dpm/farm.particle.cz/home/atlas/1M.xrdcp.20190911070503.057846787
>> 190911 07:05:20 23385 XrootdXeq: condor.114208:47@[2001:718:401:6e03::1:82] disc 0:00:09
>> Client-side details are in the attachment. I have more examples [2] with
>> the following wrong sizes:
>> 1M.20190911005425.364456793.bad ... Wed Sep 11 00:54:31 CEST 2019 ERROR BAD SIZE 1124073472
>> 1M.20190911010555.932867187.bad ... Wed Sep 11 01:06:10 CEST 2019 ERROR BAD SIZE 0
>> 1M.20190911012755.924566113.bad ... Wed Sep 11 01:28:08 CEST 2019 ERROR BAD SIZE 917504
>> 1M.20190911014400.800640006.bad ... Wed Sep 11 01:44:24 CEST 2019 ERROR BAD SIZE 917504
>> 1M.20190911014622.520944167.bad ... Wed Sep 11 01:46:40 CEST 2019 ERROR BAD SIZE 3741319168
>> 1M.20190911014901.149846614.bad ... Wed Sep 11 01:49:26 CEST 2019 ERROR BAD SIZE 917504
>> 1M.20190911015203.831921517.bad ... Wed Sep 11 01:52:29 CEST 2019 ERROR BAD SIZE 917504
>> 1M.20190911020907.396911618.bad ... Wed Sep 11 02:09:32 CEST 2019 ERROR BAD SIZE 3791650816
>> 1M.20190911021159.705868237.bad ... Wed Sep 11 02:12:27 CEST 2019 ERROR BAD SIZE 917504
>> 1M.20190911021251.426135326.bad ... Wed Sep 11 02:13:13 CEST 2019 ERROR BAD SIZE 917504
>> 1M.20190911021737.228623622.bad ... Wed Sep 11 02:18:01 CEST 2019 ERROR BAD SIZE 771751936
>> 1M.20190911022813.246485614.bad ... Wed Sep 11 02:28:40 CEST 2019 ERROR BAD SIZE 917504
>> 1M.20190911025601.833862548.bad ... Wed Sep 11 02:56:20 CEST 2019 ERROR BAD SIZE 1694498816
>> 1M.20190911030159.405097568.bad ... Wed Sep 11 03:02:28 CEST 2019 ERROR BAD SIZE 5016387584
>> 1M.20190911030913.965667783.bad ... Wed Sep 11 03:09:54 CEST 2019 ERROR BAD SIZE 218103808
>> 1M.20190911030955.983267709.bad ... Wed Sep 11 03:10:01 CEST 2019 ERROR BAD SIZE 15946576
>> 1M.20190911031027.126885413.bad ... Wed Sep 11 03:10:55 CEST 2019 ERROR BAD SIZE 1459617792
>> 1M.20190911031535.972442346.bad ... Wed Sep 11 03:16:03 CEST 2019 ERROR BAD SIZE 24848976
>> 1M.20190911032921.385479218.bad ... Wed Sep 11 03:29:37 CEST 2019 ERROR BAD SIZE 352321536
>> 1M.20190911050741.513088940.bad ... Wed Sep 11 05:07:57 CEST 2019 ERROR BAD SIZE 917504
>> 1M.20190911053936.964249354.bad ... Wed Sep 11 05:39:56 CEST 2019 ERROR BAD SIZE 917504
>> 1M.20190911055721.511952376.bad ... Wed Sep 11 05:57:39 CEST 2019 ERROR BAD SIZE 917504
>> 1M.20190911060520.998050566.bad ... Wed Sep 11 06:05:27 CEST 2019 ERROR BAD SIZE 2292134
>> 1M.20190911061254.989202949.bad ... Wed Sep 11 06:13:20 CEST 2019 ERROR BAD SIZE 917504
>> 1M.20190911062609.645848195.bad ... Wed Sep 11 06:26:21 CEST 2019 ERROR BAD SIZE 1207959552
>> 1M.20190911063849.007385225.bad ... Wed Sep 11 06:38:55 CEST 2019 ERROR BAD SIZE 917504
>> 1M.20190911063907.064936962.bad ... Wed Sep 11 06:39:15 CEST 2019 ERROR BAD SIZE 26413431
>> 1M.20190911065351.412877379.bad ... Wed Sep 11 06:54:00 CEST 2019 ERROR BAD SIZE 917504
>> 1M.20190911070503.057846787.bad ... Wed Sep 11 07:05:20 CEST 2019 ERROR BAD SIZE 1761607680
>> 1M.20190911070525.814871895.bad ... Wed Sep 11 07:05:36 CEST 2019 ERROR BAD SIZE 917504
>> 1M.20190911070634.781534227.bad ... Wed Sep 11 07:06:42 CEST 2019 ERROR BAD SIZE 3758096384
>> 1M.20190911072040.275437982.bad ... Wed Sep 11 07:21:07 CEST 2019 ERROR BAD SIZE 786432
>> 1M.20190911072340.253938152.bad ... Wed Sep 11 07:24:00 CEST 2019 ERROR BAD SIZE 917504
>> 1M.20190911074911.208877064.bad ... Wed Sep 11 07:49:40 CEST 2019 ERROR BAD SIZE 388133552
>> 1M.20190911080141.583691537.bad ... Wed Sep 11 08:01:56 CEST 2019 ERROR BAD SIZE 917504
>> 1M.20190911082145.247040374.bad ... Wed Sep 11 08:22:17 CEST 2019 ERROR BAD SIZE 786432
>> 1M.20190911085008.626914519.bad ... Wed Sep 11 08:50:29 CEST 2019 ERROR BAD SIZE 917504
>> Petr
>>
>> [1]
>> http://novastore.farm.particle.cz/tmp/xrdcp-20190911/test_dpm_xroot_file_copy.sh
>> [2] http://novastore.farm.particle.cz/tmp/xrdcp-20190911/
>>
>> On 9/10/19 2:12 AM, Andrew Hanushevsky wrote:
>>> Hi Petr,
>>>  
>>> If you also specify in the config file
>>>  
>>> xrootd.trace fs
>>>  
>>> then we will get a much earlier snapshot of the offset and length.
>>> This will allow us to tell whether the data is getting corrupted
>>> before it reaches the server or afterward. Could you do that (at
>>> least for one test) to see where the corruption may happen?
>>>  
>>> Andy
>>>  
>>> *From:* Petr Vokac <mailto:[log in to unmask]>
>>> *Sent:* Sunday, September 08, 2019 1:46 AM
>>> *To:* andrea manzi <mailto:[log in to unmask]> ;
>>> [log in to unmask] <mailto:[log in to unmask]> ; Andrew
>>> Hanushevsky <mailto:[log in to unmask]>
>>> *Cc:* dpm-devel <mailto:[log in to unmask]> ;
>>> [log in to unmask] <mailto:[log in to unmask]>
>>> *Subject:* Re: Uploaded file 128TB on filesystem
>>>  
>>> I just found one file where the problem with the offset occurred in the
>>> middle of the transfer (not at the beginning):
>>> 190907 20:03:53 25211 XrootdXeq: atlasprd.2817:168@[2001:718:401:6e03::1:51] pub IP64 login as /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1
>>> 190907 20:03:55 25211 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 262144@0 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
>>> 190907 20:03:55 25211 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 262144@262144 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
>>> 190907 20:03:55 25211 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 262144@524288 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
>>> 190907 20:03:55 25211 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 262144@786432 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
>>> 190907 20:03:55 25211 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 262144@1048576 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
>>> ...
>>> 190907 20:40:35 31589 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 262144@2952003584 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
>>> 190907 20:40:53 31589 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 262144@2952265728 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
>>> 190907 20:41:55 31609 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 262144@2952265728 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
>>> 190907 20:42:05 25089 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 1919238960@7379557499097268579 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
>>> 190907 20:42:13 25118 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 1919238960@7379557499097268579 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
>>> 190907 20:42:13 25118 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 262144@2953314304 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
>>> 190907 20:42:13 25118 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 262144@2953576448 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
>>> 190907 20:42:20 25118 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 262144@2953576448 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
>>> Whole xrootd.log for this transfer available at
>>> http://novastore.farm.particle.cz/tmp/xrootd_bigfile_wrong_offset_in_the_middle_of_transfer.txt
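>>>
>>> To spot such offset jumps without reading the whole log, something like
>>> this works (a sketch; the field positions assume the ofs_write format
>>> above and the 2^40 threshold is arbitrary):
>>>
>>> awk '/ofs_write:/ { split($6, a, "@"); if (a[2] + 0 > 2^40) print }' /var/log/xrootd/dpmdisk/xrootd.log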
>>>
>>> Petr
>>>
>>> On 9/8/19 10:25 AM, Petr Vokac wrote:
>>>> I basically executed the following Python code from a bash script in
>>>> a loop:
>>>> import gfal2
>>>> ctx = gfal2.creat_context()
>>>> ctx.set_opt_string("XROOTD PLUGIN", "XRD.WANTPROT", "gsi,unix")
>>>> params = ctx.transfer_parameters()
>>>> params.timeout = 60
>>>> ctx.filecopy(params, "file:///tmp/1M", "root://golias100.farm.particle.cz/dpm/farm.particle.cz/home/atlas/1M")
>>>> and after a while new corrupted files appeared at our storage
>>>> including one big file:
>>>> [vokac@ui1 ~]$ gfal-ls -l 'davs://golias100.farm.particle.cz/dpm/farm.particle.cz/home/atlas/1M.01556.20190907190118.646902567.00000'
>>>> -rw-rw-r--   0 1310  5281  139703410778112 Sep  7 19:01 davs://golias100.farm.particle.cz/dpm/farm.particle.cz/home/atlas/1M.01556.20190907190118.646902567.00000	
>>>> Now I also have debug output from the gfal client (including
>>>> XRD_LOGLEVEL=Dump + XRD_LOGMASK=All) for this file, together with the
>>>> xrootd.log on the server side:
>>>> [root@dpmpool19 ~]$ grep condor.39253:171 /var/log/xrootd/dpmdisk/xrootd.log
>>>> 190907 19:01:19 29645 XrootdXeq: condor.39253:171@[2001:718:401:6e03::1:82] pub IP64 login as /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=vokac/CN=610071/CN=Petr Vokac
>>>> 190907 19:01:22 32625 condor.39253:171@[2001:718:401:6e03::1:82] ofs_write: 1009440128@139703410643808 fn=/dpm/farm.particle.cz/home/atlas/1M.01556.20190907190118.646902567.00000
>>>> 190907 19:01:23 32625 condor.39253:171@[2001:718:401:6e03::1:82] ofs_write: 131072@131072 fn=/dpm/farm.particle.cz/home/atlas/1M.01556.20190907190118.646902567.00000
>>>> 190907 19:01:23 32625 condor.39253:171@[2001:718:401:6e03::1:82] ofs_write: 131072@262144 fn=/dpm/farm.particle.cz/home/atlas/1M.01556.20190907190118.646902567.00000
>>>> 190907 19:01:24 32625 condor.39253:171@[2001:718:401:6e03::1:82] ofs_write: 131072@393216 fn=/dpm/farm.particle.cz/home/atlas/1M.01556.20190907190118.646902567.00000
>>>> 190907 19:01:24 32625 condor.39253:171@[2001:718:401:6e03::1:82] ofs_write: 131072@524288 fn=/dpm/farm.particle.cz/home/atlas/1M.01556.20190907190118.646902567.00000
>>>> 190907 19:01:25 32625 condor.39253:171@[2001:718:401:6e03::1:82] ofs_write: 131072@655360 fn=/dpm/farm.particle.cz/home/atlas/1M.01556.20190907190118.646902567.00000
>>>> 190907 19:01:25 32625 condor.39253:171@[2001:718:401:6e03::1:82] ofs_write: 131072@786432 fn=/dpm/farm.particle.cz/home/atlas/1M.01556.20190907190118.646902567.00000
>>>> 190907 19:01:25 32625 condor.39253:171@[2001:718:401:6e03::1:82] ofs_write: 131072@917504 fn=/dpm/farm.particle.cz/home/atlas/1M.01556.20190907190118.646902567.00000
>>>> 190907 19:01:26 32625 XrootdXeq: condor.39253:171@[2001:718:401:6e03::1:82] disc 0:00:07
>>>> This file was transferred from mff82.farm.particle.cz
>>>> (2001:718:401:6e03::1:82) to the storage dpmpool19.farm.particle.cz
>>>> (2001:718:401:6017:2::19); both machines use
>>>> xrootd-4.9.1-1.el7.x86_64, and mff82 uses gfal2-2.16.3-1.el7.x86_64.
>>>>
>>>> I also tried to calculate adler32 for the data blocks at the positions
>>>> logged by the server, and the result matches the source data checksum.
>>>> When I look at other transfers that finished with "SUCCESS" but with a
>>>> filesize different from the source file, I can see that the problem
>>>> always occurs only with the first "ofs_write" log line, where
>>>> size@offset contains wrong numbers.
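>>>> (For reference, the per-block comparison was along these lines, a sketch
>>>> only with example offset/length values rather than the exact commands:)
>>>> # REPLICA is the on-disk replica path under /mnt/fsN/...; offset/length
>>>> # are taken from an ofs_write size@offset entry
>>>> dd if="$REPLICA" of=/tmp/block bs=1 skip=131072 count=131072 2>/dev/null
>>>> xrdadler32 /tmp/block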
>>>>
>>>> Attached files:
>>>>
>>>> 1M.01556.20190907190118.646902567.00000.dump ... xrd with dump log
>>>> level
>>>> 1M.01556.20190907190118.646902567.00000.log ... gfal2 verbose
>>>> logging (+ few log lines from my script)
>>>> 1M.01556.20190907190118.646902567.00000.xrootd.long ... dpmpool19
>>>> xrootd.log
>>>>
>>>> I have more examples of successfully transferred files with the wrong
>>>> size (sometimes bigger than the original 1M size and sometimes smaller):
>>>>
>>>> 07 Sep 18:39:33.879 [ERROR] dpmpool1.farm.particle.cz BAD SIZE 917504
>>>> 07 Sep 18:40:45.039 [ERROR] dpmpool21.farm.particle.cz BAD SIZE 917504
>>>> 07 Sep 18:41:22.555 [ERROR] dpmpool21.farm.particle.cz BAD SIZE 917504
>>>> 07 Sep 18:41:39.248 [ERROR] dpmpool21.farm.particle.cz BAD SIZE 917504
>>>> 07 Sep 18:56:51.113 [ERROR] dpmpool24.farm.particle.cz BAD SIZE 917504
>>>> 07 Sep 18:58:04.436 [ERROR] dpmpool1.farm.particle.cz BAD SIZE 393194936
>>>> 07 Sep 19:01:25.908 [ERROR] dpmpool19.farm.particle.cz BAD SIZE 139703410778112
>>>> 07 Sep 19:02:17.183 [ERROR] dpmpool24.farm.particle.cz BAD SIZE 0
>>>> 07 Sep 19:02:53.014 [ERROR] dpmpool16.farm.particle.cz BAD SIZE 917504
>>>> 07 Sep 19:03:49.424 [ERROR] dpmpool1.farm.particle.cz BAD SIZE 917504
>>>> 07 Sep 19:05:47.743 [ERROR] dpmpool21.farm.particle.cz BAD SIZE 13983727
>>>> 07 Sep 19:15:15.761 [ERROR] dpmpool22.farm.particle.cz BAD SIZE 12752345
>>>> 07 Sep 19:16:58.898 [ERROR] dpmpool20.farm.particle.cz BAD SIZE 917504
>>>> 07 Sep 19:18:51.590 [ERROR] dpmpool17.farm.particle.cz BAD SIZE 6302080
>>>> 07 Sep 19:20:37.006 [ERROR] dpmpool21.farm.particle.cz BAD SIZE 1456153896
>>>> 07 Sep 19:52:32.553 [ERROR] dpmpool16.farm.particle.cz BAD SIZE 917504
>>>> 07 Sep 19:53:45.155 [ERROR] dpmpool22.farm.particle.cz BAD SIZE 1000000000
>>>> 07 Sep 20:11:14.497 [ERROR] dpmpool22.farm.particle.cz BAD SIZE 917504
>>>> 07 Sep 20:12:51.081 [ERROR] dpmpool20.farm.particle.cz BAD SIZE 917504
>>>> 07 Sep 20:14:58.090 [ERROR] dpmpool20.farm.particle.cz BAD SIZE 100663296
>>>> 07 Sep 20:24:57.457 [ERROR] dpmpool1.farm.particle.cz BAD SIZE 917504
>>>> 07 Sep 20:27:37.596 [ERROR] dpmpool19.farm.particle.cz BAD SIZE 917504
>>>> 07 Sep 20:28:21.617 [ERROR] dpmpool16.farm.particle.cz BAD SIZE 0
>>>>
>>>> Petr
>>>>
>>>> On 9/5/19 2:44 PM, Petr Vokac wrote:
>>>>> I'm going to test uploads in a loop with this code and also enable gfal
>>>>> debug logging on our remote machines, which talk to the storage over a
>>>>> quite saturated network (we never saw the same issue from cluster worker
>>>>> nodes that share a room with the diskservers). I also saw a few transfers
>>>>> that did not create a big sparse file but just failed while comparing
>>>>> checksums after upload, which is also quite suspicious behavior.
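>>>>>
>>>>> (Roughly the CLI equivalent of that loop, a sketch only; gfal-copy's -vvv
>>>>> verbosity and the target name are illustrative, not the exact pilot code:)
>>>>> while true; do
>>>>>   gfal-copy -vvv file:///tmp/1M "root://golias100.farm.particle.cz/dpm/farm.particle.cz/home/atlas/1M.$(date +%s)" || break
>>>>> done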
>>>>>
>>>>> Petr
>>>>>
>>>>> On 9/5/19 12:57 PM, andrea manzi wrote:
>>>>>>> Ciao Andrea,
>>>>>>>
>>>>>>> we practically do this:
>>>>>>>     ctx = creat_context(...
>>>>>>>     ctx.set_opt_string("XROOTD PLUGIN", "XRD.WANTPROT", "gsi,unix")
>>>>>>>     ctx.mkdir_rec(...
>>>>>>>     ctx.filecopy(..
>>>>>>>
>>>>>>> Is that the gfal transfer API?
>>>>>> yup, internally in gfal2 we then use the XrdCl to perform the copy (
>>>>>> both for local files and TPC)
>>>>>>
>>>>>> today is a holiday but we can start investigating the possible issue
>>>>>> with Michal tomorrow or Monday
>>>>>>
>>>>>> cheers
>>>>>>
>>>>>> Andrea
>>>>>>
>>>>>>> Cheers,
>>>>>>> Mario
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 05/09/2019 10:35, andrea manzi wrote:
>>>>>>>> ciao Mario,
>>>>>>>>
>>>>>>>> i guess you are using the gfal2 transfer API in Rucio right ? ( the
>>>>>>>> equivalent to gfal-copy)
>>>>>>>>
>>>>>>>> in gfal2,  in the case of the transfer API  for xrootd, we don't do
>>>>>>>> much
>>>>>>>> as we just use the xrootd client (  XrdCl::CopyProcess), so we don't
>>>>>>>> touch file offsets.
>>>>>>>>
>>>>>>>> so we may want to investigate if there is a problem in XrdCl. Do you
>>>>>>>> have the logs of the client?
>>>>>>>>
>>>>>>>> If you are using the gfal2 Posix API instead we need to investigate a
>>>>>>>> possible bug there
>>>>>>>>
>>>>>>>> cheers
>>>>>>>>
>>>>>>>> Andrea
>>>>>>>>
>>>>>>>>
>>>>>>>>> Hi Petr, Andy,
>>>>>>>>>
>>>>>>>>> one extra piece of information here: Rucio uses the python
>>>>>>>>> bindings to
>>>>>>>>> the gfal library to upload these files, and de-facto always uses the
>>>>>>>>> current stable gfal version that is published on ALRB. Rucio doesn't
>>>>>>>>> implement any other upload by itself.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Mario
>>>>>>>>>
>>>>>>>>> On 05/09/2019 04:21, Andrew Hanushevsky wrote:
>>>>>>>>>> Hi Petr,
>>>>>>>>>>
>>>>>>>>>> So, look at this line:
>>>>>>>>>>
>>>>>>>>>> 190904 17:02:16 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82] ofs_write: 1784104010@6506989258351653185 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>>>>>>>>
>>>>>>>>>> Why did the application doing the write specify an offset of
>>>>>>>>>> 5506989258351653185? So, that is the problem and, as you noted, it
>>>>>>>>>> doesn’t matter whether POSC is on or not. Indeed, it’s the first
>>>>>>>>>> write
>>>>>>>>>> to the file and only the first write that suffers from this
>>>>>>>>>> problem. So,
>>>>>>>>>> there are two possibilities here: a) there is a bug in the
>>>>>>>>>> uploader, or
>>>>>>>>>> b) there was a network data corruption that managed to hit the offset
>>>>>>>>>> (somewhat unlikely).
>>>>>>>>>>
>>>>>>>>>> Andy
>>>>>>>>>>
>>>>>>>>>> *From:* Petr Vokac mailto:[log in to unmask]
>>>>>>>>>> *Sent:* Wednesday, September 04, 2019 12:19 PM
>>>>>>>>>> *To:* Andrew Hanushevsky mailto:[log in to unmask]
>>>>>>>>>> *Cc:* dpm-devel mailto:[log in to unmask] ;
>>>>>>>>>> [log in to unmask]
>>>>>>>>>> mailto:[log in to unmask]
>>>>>>>>>> *Subject:* Re: Uploaded file 128TB on filesystem
>>>>>>>>>>
>>>>>>>>>> I configured our diskservers with "ofs.trace write" (I did not yet
>>>>>>>>>> add "ofs.persist hold 0" to the xrootd configuration) and today a
>>>>>>>>>> new file appeared on one of our storage servers:
>>>>>>>>>>
>>>>>>>>>> *[root@dpmpool1 ~]# grep atlasprd.2817:152
>>>>>>>>>> /var/log/xrootd/dpmdisk/xrootd.log*
>>>>>>>>>> 190904 17:02:08 1894 XrootdXeq:
>>>>>>>>>> atlasprd.2817:152@[2001:718:401:6e03::1:82] pub IP64 login as
>>>>>>>>>> /DC=ch/DC=cern/OU=Organic
>>>>>>>>>> Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1
>>>>>>>>>> 190904 17:02:16 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82]
>>>>>>>>>> ofs_write: 1784104010@6506989258351653185
>>>>>>>>>> fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>>>>>>>> 190904 17:02:16 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82]
>>>>>>>>>> ofs_write: 262144@262144
>>>>>>>>>> fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>>>>>>>> 190904 17:02:17 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82]
>>>>>>>>>> ofs_write: 262144@524288
>>>>>>>>>> fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>>>>>>>> 190904 17:02:17 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82]
>>>>>>>>>> ofs_write: 262144@786432
>>>>>>>>>> fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>>>>>>>> 190904 17:02:17 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82]
>>>>>>>>>> ofs_write: 262144@1048576
>>>>>>>>>> fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>>>>>>>> 190904 17:02:17 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82]
>>>>>>>>>> ofs_write: 262144@1310720
>>>>>>>>>> fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>>>>>>>> 190904 17:02:17 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82]
>>>>>>>>>> ofs_write: 262144@1572864
>>>>>>>>>> fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>>>>>>>> 190904 17:02:17 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82]
>>>>>>>>>> ofs_write: 262144@1835008
>>>>>>>>>> fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>>>>>>>> 190904 17:02:17 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82]
>>>>>>>>>> ofs_write: 135000@2097152
>>>>>>>>>> fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>>>>>>>> 190904 17:07:32 2163 XrootdXeq:
>>>>>>>>>> atlasprd.2817:152@[2001:718:401:6e03::1:82] disc 0:05:24
>>>>>>>>>>
>>>>>>>>>> *[root@dpmpool1 ~]# ls -la
>>>>>>>>>> /mnt/fs2/atlas/2019-09-04/EVNT.18960107._035789.pool.root.1.rucio.upload.309650.1567609328*
>>>>>>>>>>
>>>>>>>>>> -rw-rw---- 1 dpmmgr dpmmgr 6506989258351919104 Sep  4 17:02
>>>>>>>>>> /mnt/fs2/atlas/2019-09-04/EVNT.18960107._035789.pool.root.1.rucio.upload.309650.1567609328
>>>>>>>>>>
>>>>>>>>>> *[root@dpmpool1 ~]# du --block-size=1
>>>>>>>>>> /mnt/fs2/atlas/2019-09-04/EVNT.18960107._035789.pool.root.1.rucio.upload.309650.1567609328*
>>>>>>>>>>
>>>>>>>>>> 2236416    /mnt/fs2/atlas/2019-09-04/EVNT.18960107._035789.pool.root.1.rucio.upload.309650.1567609328
>>>>>>>>>>
>>>>>>>>>> This file has first 256kb filled with 0x00 (adler32 up to original
>>>>>>>>>> size
>>>>>>>>>> 2232152 doesn't match) and this time there is no ofs_Posc in the
>>>>>>>>>> xrootd.log file.
>>>>>>>>>>
>>>>>>>>>> Petr Vokac
>>>>>>>>>>
>>>>>>>>>> On 8/19/19 3:24 AM, Andrew Hanushevsky wrote:
>>>>>>>>>>> Hi Petr,
>>>>>>>>>>>
>>>>>>>>>>> The issue here is that the python script (I believe it's python) is
>>>>>>>>>>> disconnecting without closing the file and then reconnecting and
>>>>>>>>>>> writing at a high offset. The log is pretty clear on that. You can
>>>>>>>>>>> verify by adding "ofs.trace write" to see the write offsets
>>>>>>>>>>> (hough it
>>>>>>>>>>> will produce a lot of output). You can also prohibit reconnection by
>>>>>>>>>>> adding this to the persist directive.
>>>>>>>>>>>
>>>>>>>>>>> ofs.persist hold 0
>>>>>>>>>>>
>>>>>>>>>>> which would delete the file the moment the client disconnects
>>>>>>>>>>> without
>>>>>>>>>>> closing the file.
>>>>>>>>>>>
>>>>>>>>>>> Andy
>>>>>>>>>>>
>>>>>>>>>>> On Sat, 17 Aug 2019, Petr Vokac wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Today a new "65PB" file appeared on our storage and it was again
>>>>>>>>>>>> transferred with the xrootd protocol. I can see the same error in
>>>>>>>>>>>> the log file again:
>>>>>>>>>>>>
>>>>>>>>>>>> **[root@dpmpool21 ~]# ls -la
>>>>>>>>>>>> /mnt/fs4/atlas/2019-08-17/EVNT.18893579._003197.pool.root.1.rucio.upload.234855.1566031522*
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> *-rw-rw---- 1 dpmmgr dpmmgr 72340172838077281 Aug 17 10:48
>>>>>>>>>>>> /mnt/fs4/atlas/2019-08-17/EVNT.18893579._003197.pool.root.1.rucio.upload.234855.1566031522
>>>>>>>>>>>>
>>>>>>>>>>>> ***[root@dpmpool21 ~]# grep atlasprd.2757
>>>>>>>>>>>> /var/log/xrootd/dpmdisk/xrootd.log*
>>>>>>>>>>>> ...
>>>>>>>>>>>> 190817 10:47:55 57803 XrootdXeq:
>>>>>>>>>>>> atlasprd.2757:65@[2001:718:401:6e03::1:43] pub IP64 login as
>>>>>>>>>>>> /DC=ch/DC=cern/OU=Organic
>>>>>>>>>>>> Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1
>>>>>>>>>>>> 190817 10:47:55 61859 XrootdXeq:
>>>>>>>>>>>> atlasprd.2757:66@[2001:718:401:6e03::1:83] pub IP64 login as
>>>>>>>>>>>> /DC=ch/DC=cern/OU=Organic
>>>>>>>>>>>> Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1
>>>>>>>>>>>> 190817 10:48:52 62233 XrootdXeq:
>>>>>>>>>>>> atlasprd.2757:46@[2001:718:401:6e03::1:72] disc 0:00:58
>>>>>>>>>>>> 190817 10:48:52 62954 XrootdXeq:
>>>>>>>>>>>> atlasprd.2757:25@[2001:718:401:6e03::1:72] disc 0:03:30
>>>>>>>>>>>> 190817 10:48:52 57804 XrootdXeq:
>>>>>>>>>>>> atlasprd.2757:28@[2001:718:401:6e03::1:72] pub IP64 login as
>>>>>>>>>>>> /DC=ch/DC=cern/OU=Organic
>>>>>>>>>>>> Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1
>>>>>>>>>>>> 190817 10:48:52 57804 ofs_Posc: Creator changed from
>>>>>>>>>>>> atlasprd.2757:25@[2001:718:401:6e03::1:72] to
>>>>>>>>>>>> atlasprd.2757:28@[2001:718:401:6e03::1:72] for
>>>>>>>>>>>> /dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_13TeV/8b/f3/EVNT.18893579._003197.pool.root.1.rucio.upload
>>>>>>>>>>>>
>>>>>>>>>>>> 190817 10:48:53 57803 XrootdXeq:
>>>>>>>>>>>> atlasprd.2757:46@[2001:718:401:6017:20:0:14:44] pub IP64 login as
>>>>>>>>>>>> /DC=ch/DC=cern/OU=Organic
>>>>>>>>>>>> Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1
>>>>>>>>>>>> 190817 10:49:01 57818 XrootdXeq:
>>>>>>>>>>>> atlasprd.2757:70@[2001:718:401:6017:20:0:11:11] pub IP64 login as
>>>>>>>>>>>> /DC=ch/DC=cern/OU=Organic
>>>>>>>>>>>> Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1
>>>>>>>>>>>> 190817 10:49:07 62954 XrootdXeq:
>>>>>>>>>>>> atlasprd.2757:55@[2001:718:401:6e03::1:81] disc 0:10:17
>>>>>>>>>>>> ...
>>>>>>>>>>>>
>>>>>>>>>>>> Any suggestion how to trace what is causing this behavior?
>>>>>>>>>>>>
>>>>>>>>>>>> Petr
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 7/20/19 7:48 AM, Petr Vokac wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I ran the command you sent me
>>>>>>>>>>>>>
>>>>>>>>>>>>> grep "atlasprd.2757:" <logfile> | egrep
>>>>>>>>>>>>> '2001:718:401:6e03::1:21|2001:718:401:6e03::1:32' > <excerpt>
>>>>>>>>>>>>>
>>>>>>>>>>>>> and because the output was not just few lines long I uploaded
>>>>>>>>>>>>> it to the
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://vokac.web.cern.ch/vokac/tmp/dpmpool22.xrootd.log
>>>>>>>>>>>>> https://vokac.web.cern.ch/vokac/tmp/dpmpool24.xrootd.log
>>>>>>>>>>>>>
>>>>>>>>>>>>> These machines on 2001:718:401:6e03::/64 subnet are in remote
>>>>>>>>>>>>> location
>>>>>>>>>>>>> (but relatively close with RTT ~ 0.7ms). I can see several
>>>>>>>>>>>>> ofs_Posc
>>>>>>>>>>>>> occurrences in the logs of our disknodes during "190717
>>>>>>>>>>>>> 17:xx:xx" time
>>>>>>>>>>>>> period:
>>>>>>>>>>>>>
>>>>>>>>>>>>> http://vokac.web.cern.ch/vokac/tmp/POSC.txt
>>>>>>>>>>>>>
>>>>>>>>>>>>> Petr
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 7/19/19 8:51 PM, Andrew Hanushevsky wrote:
>>>>>>>>>>>>>> Hi Petr,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Well, POSC says that the client reconnected for writing so not
>>>>>>>>>>>>>> all
>>>>>>>>>>>>>> went well. Can I get the log output for that occurrence as I
>>>>>>>>>>>>>> previously described?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, 19 Jul 2019, Petr Vokac wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We are using Rucio download client to transfer job outputs
>>>>>>>>>>>>>>> (it is
>>>>>>>>>>>>>>> build
>>>>>>>>>>>>>>> on top of gfal2 python interface).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> https://vokac.web.cern.ch/vokac/tmp/dpmpool22.xrootd.log
>>>>>>>>>>>>>>> https://vokac.web.cern.ch/vokac/tmp/dpmpool24.xrootd.log
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> this is the full job output unfortunately with minimal gfal2
>>>>>>>>>>>>>>> logging
>>>>>>>>>>>>>>> output.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> https://vokac.web.cern.ch/vokac/tmp/grid.2258136.4.out-from-2001:718:401:6e03::1:21-to-dpmpool22.txt
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> https://vokac.web.cern.ch/vokac/tmp/grid.2255921.7.out-from-2001:718:401:6e03::1:32-to-dpmpool24.txt
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It looks like according to gfal2 the transfer was OK
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2019-07-17 15:09:17,782 | INFO     | copytool_out        |
>>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_xrootd.so loaded with
>>>>>>>>>>>>>>> success
>>>>>>>>>>>>>>> 2019-07-17 15:09:17,782 | INFO     | copytool_out        |
>>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_rfio.so loaded with
>>>>>>>>>>>>>>> success
>>>>>>>>>>>>>>> 2019-07-17 15:09:17,783 | INFO     | copytool_out        |
>>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_gridftp.so loaded with
>>>>>>>>>>>>>>> success
>>>>>>>>>>>>>>> 2019-07-17 15:09:17,783 | INFO     | copytool_out        |
>>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_http.so loaded with
>>>>>>>>>>>>>>> success
>>>>>>>>>>>>>>> 2019-07-17 15:09:17,783 | INFO     | copytool_out        |
>>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_lfc.so loaded with
>>>>>>>>>>>>>>> success
>>>>>>>>>>>>>>> 2019-07-17 15:09:17,783 | INFO     | copytool_out        |
>>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_file.so loaded with
>>>>>>>>>>>>>>> success
>>>>>>>>>>>>>>> 2019-07-17 15:09:17,784 | INFO     | copytool_out        |
>>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_sftp.so loaded with
>>>>>>>>>>>>>>> success
>>>>>>>>>>>>>>> 2019-07-17 15:09:17,784 | INFO     | copytool_out        |
>>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_dcap.so loaded with
>>>>>>>>>>>>>>> success
>>>>>>>>>>>>>>> 2019-07-17 15:09:17,784 | INFO     | copytool_out        |
>>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_srm.so loaded with
>>>>>>>>>>>>>>> success
>>>>>>>>>>>>>>> 2019-07-17 15:09:17,793 | INFO     | copytool_out        |
>>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_xrootd.so loaded with
>>>>>>>>>>>>>>> success
>>>>>>>>>>>>>>> 2019-07-17 15:09:17,793 | INFO     | copytool_out        |
>>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_rfio.so loaded with
>>>>>>>>>>>>>>> success
>>>>>>>>>>>>>>> 2019-07-17 15:09:17,793 | INFO     | copytool_out        |
>>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_gridftp.so loaded with
>>>>>>>>>>>>>>> success
>>>>>>>>>>>>>>> 2019-07-17 15:09:17,794 | INFO     | copytool_out        |
>>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_http.so loaded with
>>>>>>>>>>>>>>> success
>>>>>>>>>>>>>>> 2019-07-17 15:09:17,794 | INFO     | copytool_out        |
>>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_lfc.so loaded with
>>>>>>>>>>>>>>> success
>>>>>>>>>>>>>>> 2019-07-17 15:09:17,794 | INFO     | copytool_out        |
>>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_file.so loaded with
>>>>>>>>>>>>>>> success
>>>>>>>>>>>>>>> 2019-07-17 15:09:17,794 | INFO     | copytool_out        |
>>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_sftp.so loaded with
>>>>>>>>>>>>>>> success
>>>>>>>>>>>>>>> 2019-07-17 15:09:17,794 | INFO     | copytool_out        |
>>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_dcap.so loaded with
>>>>>>>>>>>>>>> success
>>>>>>>>>>>>>>> 2019-07-17 15:09:17,794 | INFO     | copytool_out        |
>>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_srm.so loaded with
>>>>>>>>>>>>>>> success
>>>>>>>>>>>>>>> 2019-07-17 15:09:17,795 | INFO     | copytool_out        |
>>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_xrootd.so loaded with
>>>>>>>>>>>>>>> success
>>>>>>>>>>>>>>> 2019-07-17 15:09:17,795 | INFO     | copytool_out        |
>>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_rfio.so loaded with
>>>>>>>>>>>>>>> success
>>>>>>>>>>>>>>> 2019-07-17 15:09:17,795 | INFO     | copytool_out        | gfal2                            | connect                   | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_gridftp.so loaded with success
>>>>>>>>>>>>>>> 2019-07-17 15:09:17,795 | INFO     | copytool_out        | gfal2                            | connect                   | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_http.so loaded with success
>>>>>>>>>>>>>>> 2019-07-17 15:09:17,796 | INFO     | copytool_out        | gfal2                            | connect                   | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_lfc.so loaded with success
>>>>>>>>>>>>>>> 2019-07-17 15:09:17,796 | INFO     | copytool_out        | gfal2                            | connect                   | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_file.so loaded with success
>>>>>>>>>>>>>>> 2019-07-17 15:09:17,796 | INFO     | copytool_out        | gfal2                            | connect                   | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_sftp.so loaded with success
>>>>>>>>>>>>>>> 2019-07-17 15:09:17,796 | INFO     | copytool_out        | gfal2                            | connect                   | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_dcap.so loaded with success
>>>>>>>>>>>>>>> 2019-07-17 15:09:17,796 | INFO     | copytool_out        | gfal2                            | connect                   | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_srm.so loaded with success
>>>>>>>>>>>>>>> 2019-07-17 15:09:17,810 | INFO     | copytool_out        | gfal2                            | __gfal2_copy              | Event triggered: BOTH GFAL2:CORE:COPY LIST:ENTER
>>>>>>>>>>>>>>> 2019-07-17 15:09:17,810 | INFO     | copytool_out        | gfal2                            | __gfal2_copy              | Event triggered: BOTH GFAL2:CORE:COPY LIST:ITEM file:///scratch/condor/dir_24031/atlas_HudmF2he/PanDA_Pilot-4419868006/log.18636890._001059.job.log.tgz.1 => root://golias100.farm.particle.cz:1094//dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_13TeV/26/98/log.18636890._001059.job.log.tgz.1.rucio.upload
>>>>>>>>>>>>>>> 2019-07-17 15:09:17,810 | INFO     | copytool_out        | gfal2                            | __gfal2_copy              | Event triggered: BOTH GFAL2:CORE:COPY LIST:EXIT
>>>>>>>>>>>>>>> 2019-07-17 15:09:17,811 | INFO     | Dummy-2             | gfal2                            | (unknown function)        | Event triggered: BOTH xroot TRANSFER:ENTER file://localhost///scratch/condor/dir_24031/atlas_HudmF2he/PanDA_Pilot-4419868006/log.18636890._001059.job.log.tgz.1?xrd.gsiusrpxy=/scratch/condor/dir_24031/job.9XONDm6yx7unnoBGSqYX2MjnABFKDmABFKDmSyhTDmEBFKDmqAGN3n.proxy&xrd.wantprot=gsi,unix => root://golias100.farm.particle.cz:1094///dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_13TeV/26/98/log.18636890._001059.job.log.tgz.1.rucio.upload?xrd.gsiusrpxy=/scratch/condor/dir_24031/job.9XONDm6yx7unnoBGSqYX2MjnABFKDmABFKDmSyhTDmEBFKDmqAGN3n.proxy&x
>>>>>>>>>>>>>>> 2019-07-17 15:09:17,811 | INFO     | Dummy-2             | gfal2                            | (unknown function)        | Event triggered: BOTH xroot TRANSFER:TYPE streamed
>>>>>>>>>>>>>>> 2019-07-17 15:11:23,914 | INFO     | Dummy-2             | gfal2                            | (unknown function)        | Event triggered: BOTH xroot TRANSFER:EXIT Job finished, [SUCCESS]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> but Rucio probably tries to verify the checksum (gfal-sum via the
>>>>>>>>>>>>>>> Python bindings) and it fails:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2019-07-17 15:15:23,774 | INFO     | copytool_out        | pilot.copytool.rucio             | copy_out                  | stderr = An unknown exception occurred.
>>>>>>>>>>>>>>> Details: Checksum not validated
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> because DPM is unable to calculate the checksum of a 128TB file
>>>>>>>>>>>>>>> within the timeout.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Petr
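For reference, this is roughly the kind of request gfal-sum issues through the gfal2 Python bindings; a minimal sketch only, with the URL taken from the transfer quoted above and everything else illustrative of the failure mode rather than the pilot's actual code:

# Illustrative sketch: ask the storage endpoint for an adler32 checksum the
# way "gfal-sum <url> adler32" does via the gfal2 Python bindings. On a
# sparse ~128TB replica the server-side calculation cannot finish before the
# operation times out, so the call fails and the upload ends up flagged as
# "Checksum not validated".
import gfal2

ctx = gfal2.creat_context()
url = ("root://golias100.farm.particle.cz:1094//dpm/farm.particle.cz/home/"
       "atlas/atlasdatadisk/rucio/mc15_13TeV/26/98/"
       "log.18636890._001059.job.log.tgz.1.rucio.upload")
try:
    print(ctx.checksum(url, "adler32"))
except gfal2.GError as err:
    print("checksum failed:", err)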
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 7/19/19 5:07 AM, Andrew Hanushevsky wrote:
>>>>>>>>>>>>>>>> Hi Petr,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Indeed, it looks like whatever was used to upload the files
>>>>>>>>>>>>>>>> (xrdcp?) created a very large sparse file by writing at a very
>>>>>>>>>>>>>>>> large offset. The clue here is that an anomaly occurred during the
>>>>>>>>>>>>>>>> write, and these two messages (likely one for each file) tell us
>>>>>>>>>>>>>>>> something....
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 190717 17:10:39 332092 ofs_Posc: Creator changed from atlasprd.2757:90@[2001:718:401:6e03::1:21] to atlasprd.2757:39@[2001:718:401:6e03::1:21] for /dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_13TeV/26/98/log.18636890._001059.job.log.tgz.1.rucio.upload
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 190717 17:29:26 14204 ofs_Posc: Creator changed from atlasprd.2757:33@[2001:718:401:6e03::1:32] to atlasprd.2757:95@[2001:718:401:6e03::1:32] for /dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_5TeV/09/bc/AOD.18595963._000786.pool.root.1.rucio.upload
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> What the POSC system is saying is that in each case the original
>>>>>>>>>>>>>>>> creator of the file lost contact with the server, then reconnected
>>>>>>>>>>>>>>>> and resumed writing, ending with a successful close. POSC allows
>>>>>>>>>>>>>>>> this so that a network glitch does not cause the file to be
>>>>>>>>>>>>>>>> deleted; after a glitch the client simply reconnects and resumes
>>>>>>>>>>>>>>>> where it left off. This is only allowed when the resuming client
>>>>>>>>>>>>>>>> comes from the same IP address, has the same process ID, and uses
>>>>>>>>>>>>>>>> the same logical name. In this case it was "atlasprd.2757". I find
>>>>>>>>>>>>>>>> that rather odd because it is next to impossible for two processes
>>>>>>>>>>>>>>>> on two different machines to have the same process number (i.e.
>>>>>>>>>>>>>>>> 2757). So, this is why I am wondering whether this was really an
>>>>>>>>>>>>>>>> xrdcp or some other program.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We should look at what these clients were actually doing during
>>>>>>>>>>>>>>>> this period, as they clearly did not do the right thing when they
>>>>>>>>>>>>>>>> reconnected. So, could you grep out all interactions of these two
>>>>>>>>>>>>>>>> clients from the log? Something like:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> grep "atlasprd.2757:" <logfile> | egrep '2001:718:401:6e03::1:21|2001:718:401:6e03::1:32' > <excerpt>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> and send the excerpt to me?
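Besides the grep Andy suggests, all such resumed-write events on a diskserver can be pulled out by scanning the log for the ofs_Posc message. A rough sketch; the log path and message layout are taken from the lines quoted in this thread, the regex details are assumptions:

# Rough sketch: list every "ofs_Posc: Creator changed" event in a diskserver
# xrootd log, i.e. every upload where the original writer lost its connection
# and another client identity resumed writing the file.
import re

LOG = "/var/log/xrootd/dpmdisk/xrootd.log"
PATTERN = re.compile(
    r"^(\d{6} \d{2}:\d{2}:\d{2}) \d+ ofs_Posc: Creator changed from "
    r"(\S+) to (\S+) for (\S+)")

with open(LOG) as log:
    for line in log:
        m = PATTERN.match(line)
        if m:
            when, old, new, path = m.groups()
            print(when, old, "->", new, path)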
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, 18 Jul 2019, Petr Vokac wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> our DPM, which uses native xrootd 4.9.1, yesterday created two
>>>>>>>>>>>>>>>>> different 128TB files on two different diskservers; both were
>>>>>>>>>>>>>>>>> received as job output from two different worker nodes. The
>>>>>>>>>>>>>>>>> original files were only a few megabytes, and when I check the
>>>>>>>>>>>>>>>>> beginning of the stored files up to the original size & checksum,
>>>>>>>>>>>>>>>>> everything seems to be OK - except that these are sparse files
>>>>>>>>>>>>>>>>> with a size of 128TB. The original files transferred from the WN
>>>>>>>>>>>>>>>>> looked like:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> mc15_13TeV:log.18636890._001059.job.log.tgz.1 - original size 3101899, adler32 742e1336
>>>>>>>>>>>>>>>>> mc16_5TeV:AOD.18595963._000786.pool.root.1 - original size 737474514, adler32 dc68cb21
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> but on our diskservers they were stored as
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [root@dpmpool22 ~]# ls -l /mnt/fs1/atlas/2019-07-17/log.18636890._001059.job.log.tgz.1.rucio.upload.209928.1563376157
>>>>>>>>>>>>>>>>> -rw-rw---- 1 dpmmgr dpmmgr 140560257079236 Jul 17 17:11 /mnt/fs1/atlas/2019-07-17/log.18636890._001059.job.log.tgz.1.rucio.upload.209928.1563376157
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [root@dpmpool22 ~]# du --block-size=1 /mnt/fs1/atlas/2019-07-17/log.18636890._001059.job.log.tgz.1.rucio.upload.209928.1563376157
>>>>>>>>>>>>>>>>> 3108864 /mnt/fs1/atlas/2019-07-17/log.18636890._001059.job.log.tgz.1.rucio.upload.209928.1563376157
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [root@dpmpool24 ~]# ls -l /mnt/fs2/atlas/2019-07-17/AOD.18595963._000786.pool.root.1.rucio.upload.210480.1563377083
>>>>>>>>>>>>>>>>> -rw-rw---- 1 dpmmgr dpmmgr 139734345056408 Jul 17 17:31 /mnt/fs2/atlas/2019-07-17/AOD.18595963._000786.pool.root.1.rucio.upload.210480.1563377083
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [root@dpmpool24 tmp]# du --block-size=1 /mnt/fs2/atlas/2019-07-17/AOD.18595963._000786.pool.root.1.rucio.upload.210480.1563377083
>>>>>>>>>>>>>>>>> 737480704 /mnt/fs2/atlas/2019-07-17/AOD.18595963._000786.pool.root.1.rucio.upload.210480.1563377083
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I was told that DPM is not really involved in transferring /
>>>>>>>>>>>>>>>>> storing files with the xrootd protocol, so I guess something bad
>>>>>>>>>>>>>>>>> must have happened inside the XRootD libraries. Do you have any
>>>>>>>>>>>>>>>>> idea why these two transfers generated such extreme files?
>>>>>>>>>>>>>>>>> Unfortunately our xrootd log level is pretty low and I was able
>>>>>>>>>>>>>>>>> to find only the info about file creation:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [root@dpmpool22 ~]# grep log.18636890._001059.job.log.tgz /var/log/xrootd/dpmdisk/xrootd.log
>>>>>>>>>>>>>>>>> 190717 17:10:39 332092 ofs_Posc: Creator changed from atlasprd.2757:90@[2001:718:401:6e03::1:21] to atlasprd.2757:39@[2001:718:401:6e03::1:21] for /dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_13TeV/26/98/log.18636890._001059.job.log.tgz.1.rucio.upload
>>>>>>>>>>>>>>>>> [root@dpmpool24 tmp]# grep AOD.18595963._000786.pool.root /var/log/xrootd/dpmdisk/xrootd.log
>>>>>>>>>>>>>>>>> 190717 17:29:26 14204 ofs_Posc: Creator changed from atlasprd.2757:33@[2001:718:401:6e03::1:32] to atlasprd.2757:95@[2001:718:401:6e03::1:32] for /dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_5TeV/09/bc/AOD.18595963._000786.pool.root.1.rucio.upload
>>>>>>>>>>>>>>>>> Petr
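The check Petr describes (adler32 over the first original-size bytes, plus the ls vs. du discrepancy) can be reproduced directly on a diskserver. A rough sketch; the path, size and expected adler32 come from the listing above, the chunk size and output format are illustrative:

# Rough sketch: verify that the first <original size> bytes of the stored
# replica still match the catalogue adler32, and show how sparse the file is
# (bytes actually allocated via st_blocks*512 vs. the size reported by ls -l).
import os
import zlib

path = ("/mnt/fs1/atlas/2019-07-17/"
        "log.18636890._001059.job.log.tgz.1.rucio.upload.209928.1563376157")
original_size = 3101899          # catalogue size of log.18636890._001059...
expected_adler32 = "742e1336"    # catalogue adler32

cksum = 1
remaining = original_size
with open(path, "rb") as f:
    while remaining > 0:
        chunk = f.read(min(1024 * 1024, remaining))
        if not chunk:
            break
        cksum = zlib.adler32(chunk, cksum)
        remaining -= len(chunk)

st = os.stat(path)
print("prefix adler32:", format(cksum & 0xFFFFFFFF, "08x"),
      "expected:", expected_adler32)
print("apparent size:", st.st_size, "allocated bytes:", st.st_blocks * 512)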
>>>>>>>>>>>>>>>>>


########################################################################
Use REPLY-ALL to reply to list

To unsubscribe from the XROOTD-L list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1