I found that the xrootd daemon occasionally crashes (it is automatically
restarted by systemd), and for the problematic file mentioned below I can
find a core file corresponding to this transfer:

(gdb) bt
#0  XrdXrootdAioReq::sendError (this=0x2069cd8, tident=0x7fbcfc00a547 "condor.114208:47@[2001:718:401:6e03::1:82]") at /usr/src/debug/xrootd-4.9.1/src/XrdXrootd/XrdXrootdAio.cc:652
#1  0x00007fbd138c020c in XrdXrootdAioReq::endWrite (this=0x2069cd8) at /usr/src/debug/xrootd-4.9.1/src/XrdXrootd/XrdXrootdAio.cc:601
#2  0x00007fbd1365699f in XrdScheduler::Run (this=0x610e98 <XrdMain::Config+440>) at /usr/src/debug/xrootd-4.9.1/src/Xrd/XrdScheduler.cc:357
#3  0x00007fbd13656ae9 in XrdStartWorking (carg=<optimized out>) at /usr/src/debug/xrootd-4.9.1/src/Xrd/XrdScheduler.cc:87
#4  0x00007fbd13619617 in XrdSysThread_Xeq (myargs=0x7fbcfc010d20) at /usr/src/debug/xrootd-4.9.1/src/XrdSys/XrdSysPthread.cc:86
#5  0x00007fbd131ccdd5 in start_thread (arg=0x7fbd018fa700) at pthread_create.c:307
#6  0x00007fbd124cf02d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

http://novastore.farm.particle.cz/tmp/xrdcp-20190911/1M.20190911070503.057846787.thread_apply_all_bt

Petr

On 9/11/19 9:35 AM, Petr Vokac wrote:
> Now I have the following lines on all our DPM diskservers:
> ofs.trace write
> xrootd.trace fs
> and I created a simple script [1] that uses just xrdcp to copy a 1MB file:
> CURR=$(date +%Y%m%d%H%M%S.%N)
> export XRD_LOGLEVEL=Dump
> export XRD_LOGMASK=All
> export XRD_LOGFILE=1M.${CURR}.dump
> xrdcp "file:///tmp/1M" "root://golias100.farm.particle.cz//dpm/farm.particle.cz/home/atlas/1M.xrdcp.${CURR}"
> Even this simple code creates corrupted files on our DPM storage, e.g.
> the following 1.5GB file:
> [root@dpmpool1 ~]# ls -la /mnt/fs1/atlas/2019-09-11/1M.xrdcp.20190911070503.057846787.143302.1568178311
> -rw-rw---- 1 dpmmgr dpmmgr 1761607680 Sep 11 07:05 /mnt/fs1/atlas/2019-09-11/1M.xrdcp.20190911070503.057846787.143302.1568178311
> [root@dpmpool1 ~]# du --block-size=1 /mnt/fs1/atlas/2019-09-11/1M.xrdcp.20190911070503.057846787.143302.1568178311
> 1048576	/mnt/fs1/atlas/2019-09-11/1M.xrdcp.20190911070503.057846787.143302.1568178311
> [root@dpmpool1 ~]# grep condor.114208:47 /var/log/xrootd/dpmdisk/xrootd.log
> 190911 07:05:12 31717 XrootdXeq: condor.114208:47@[2001:718:401:6e03::1:82] pub IP64 login as /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=vokac/CN=610071/CN=Petr Vokac
> 190911 07:05:12 31717 condor.114208:47@[2001:718:401:6e03::1:82] XrootdProtocol: open unmat /dpm/farm.particle.cz/home/atlas/1M.xrdcp.20190911070503.057846787?dpm.chunk0=0,0,dpmpool1.farm.particle.cz:/mnt/fs1/atlas/2019-09-11/1M.xrdcp.20190911070503.057846787.143302.1568178311%3fsfn=%252Fdpm%252Ffarm.particle.cz%252Fhome%252Fatlas%252F1M.xrdcp.20190911070503.057846787&dpm.dhost=dpmpool1.farm.particle.cz&dpm.hv2=NBoSW+W5T7Eti/8yeLbu0w==&dpm.loc=1&dpm.nonce=870b737&dpm.put=1&dpm.time=1568178311&oss.asize=1048576
> 190911 07:05:12 31717 condor.114208:47@[2001:718:401:6e03::1:82] XrootdProtocol: fh=0 write 1048576@0
> 190911 07:05:20 23385 condor.114208:47@[2001:718:401:6e03::1:82] XrootdProtocol: close rc=0 fh=0
> 190911 07:05:20 23385 condor.114208:47@[2001:718:401:6e03::1:82] XrootdFile: closing w /dpm/farm.particle.cz/home/atlas/1M.xrdcp.20190911070503.057846787
> 190911 07:05:20 23385 XrootdXeq: condor.114208:47@[2001:718:401:6e03::1:82] disc 0:00:09
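>
> (Side note on the ls vs du mismatch above: a single write landing at a huge offset is enough to produce this kind of sparse file - the apparent size balloons while the allocated space stays around the real 1MB. A minimal, purely illustrative python sketch of the effect; nothing below is taken from the actual transfer code:)
>
> import os
>
> # one 1 MiB write at a large offset into a fresh file
> fd = os.open("/tmp/sparse_demo", os.O_CREAT | os.O_WRONLY, 0o644)
> os.pwrite(fd, b"x" * 1048576, 1760559104)    # offset picked so the sizes match the example above
> os.close(fd)
>
> st = os.stat("/tmp/sparse_demo")
> print("apparent size :", st.st_size)           # 1761607680, what ls -l reports
> print("allocated size:", st.st_blocks * 512)   # ~1 MiB, what du --block-size=1 reports
>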
> Client side details are in the attachment. I have more examples [2] with
> the following wrong sizes:
> 1M.20190911005425.364456793.bad ... Wed Sep 11 00:54:31 CEST 2019 ERROR BAD SIZE 1124073472
> 1M.20190911010555.932867187.bad ... Wed Sep 11 01:06:10 CEST 2019 ERROR BAD SIZE 0
> 1M.20190911012755.924566113.bad ... Wed Sep 11 01:28:08 CEST 2019 ERROR BAD SIZE 917504
> 1M.20190911014400.800640006.bad ... Wed Sep 11 01:44:24 CEST 2019 ERROR BAD SIZE 917504
> 1M.20190911014622.520944167.bad ... Wed Sep 11 01:46:40 CEST 2019 ERROR BAD SIZE 3741319168
> 1M.20190911014901.149846614.bad ... Wed Sep 11 01:49:26 CEST 2019 ERROR BAD SIZE 917504
> 1M.20190911015203.831921517.bad ... Wed Sep 11 01:52:29 CEST 2019 ERROR BAD SIZE 917504
> 1M.20190911020907.396911618.bad ... Wed Sep 11 02:09:32 CEST 2019 ERROR BAD SIZE 3791650816
> 1M.20190911021159.705868237.bad ... Wed Sep 11 02:12:27 CEST 2019 ERROR BAD SIZE 917504
> 1M.20190911021251.426135326.bad ... Wed Sep 11 02:13:13 CEST 2019 ERROR BAD SIZE 917504
> 1M.20190911021737.228623622.bad ... Wed Sep 11 02:18:01 CEST 2019 ERROR BAD SIZE 771751936
> 1M.20190911022813.246485614.bad ... Wed Sep 11 02:28:40 CEST 2019 ERROR BAD SIZE 917504
> 1M.20190911025601.833862548.bad ... Wed Sep 11 02:56:20 CEST 2019 ERROR BAD SIZE 1694498816
> 1M.20190911030159.405097568.bad ... Wed Sep 11 03:02:28 CEST 2019 ERROR BAD SIZE 5016387584
> 1M.20190911030913.965667783.bad ... Wed Sep 11 03:09:54 CEST 2019 ERROR BAD SIZE 218103808
> 1M.20190911030955.983267709.bad ... Wed Sep 11 03:10:01 CEST 2019 ERROR BAD SIZE 15946576
> 1M.20190911031027.126885413.bad ... Wed Sep 11 03:10:55 CEST 2019 ERROR BAD SIZE 1459617792
> 1M.20190911031535.972442346.bad ... Wed Sep 11 03:16:03 CEST 2019 ERROR BAD SIZE 24848976
> 1M.20190911032921.385479218.bad ... Wed Sep 11 03:29:37 CEST 2019 ERROR BAD SIZE 352321536
> 1M.20190911050741.513088940.bad ... Wed Sep 11 05:07:57 CEST 2019 ERROR BAD SIZE 917504
> 1M.20190911053936.964249354.bad ... Wed Sep 11 05:39:56 CEST 2019 ERROR BAD SIZE 917504
> 1M.20190911055721.511952376.bad ... Wed Sep 11 05:57:39 CEST 2019 ERROR BAD SIZE 917504
> 1M.20190911060520.998050566.bad ... Wed Sep 11 06:05:27 CEST 2019 ERROR BAD SIZE 2292134
> 1M.20190911061254.989202949.bad ... Wed Sep 11 06:13:20 CEST 2019 ERROR BAD SIZE 917504
> 1M.20190911062609.645848195.bad ... Wed Sep 11 06:26:21 CEST 2019 ERROR BAD SIZE 1207959552
> 1M.20190911063849.007385225.bad ... Wed Sep 11 06:38:55 CEST 2019 ERROR BAD SIZE 917504
> 1M.20190911063907.064936962.bad ... Wed Sep 11 06:39:15 CEST 2019 ERROR BAD SIZE 26413431
> 1M.20190911065351.412877379.bad ... Wed Sep 11 06:54:00 CEST 2019 ERROR BAD SIZE 917504
> 1M.20190911070503.057846787.bad ... Wed Sep 11 07:05:20 CEST 2019 ERROR BAD SIZE 1761607680
> 1M.20190911070525.814871895.bad ... Wed Sep 11 07:05:36 CEST 2019 ERROR BAD SIZE 917504
> 1M.20190911070634.781534227.bad ... Wed Sep 11 07:06:42 CEST 2019 ERROR BAD SIZE 3758096384
> 1M.20190911072040.275437982.bad ... Wed Sep 11 07:21:07 CEST 2019 ERROR BAD SIZE 786432
> 1M.20190911072340.253938152.bad ... Wed Sep 11 07:24:00 CEST 2019 ERROR BAD SIZE 917504
> 1M.20190911074911.208877064.bad ... Wed Sep 11 07:49:40 CEST 2019 ERROR BAD SIZE 388133552
> 1M.20190911080141.583691537.bad ... Wed Sep 11 08:01:56 CEST 2019 ERROR BAD SIZE 917504
> 1M.20190911082145.247040374.bad ... Wed Sep 11 08:22:17 CEST 2019 ERROR BAD SIZE 786432
> 1M.20190911085008.626914519.bad ... Wed Sep 11 08:50:29 CEST 2019 ERROR BAD SIZE 917504
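>
> (For reference, the "ERROR BAD SIZE" check boils down to comparing the local source size with what the storage reports after the copy. A minimal sketch of that idea using the gfal2 python bindings; the destination name and options are illustrative and this is not the actual script [1]:)
>
> import os, subprocess, gfal2
>
> src = "/tmp/1M"
> dst = "root://golias100.farm.particle.cz//dpm/farm.particle.cz/home/atlas/1M.xrdcp.test"  # illustrative name
>
> # run xrdcp with the same client-side dump logging as above
> env = dict(os.environ, XRD_LOGLEVEL="Dump", XRD_LOGMASK="All", XRD_LOGFILE="1M.test.dump")
> subprocess.check_call(["xrdcp", "file://" + src, dst], env=env)
>
> # ask the storage for the size of the uploaded replica and compare with the source
> ctx = gfal2.creat_context()
> remote_size = ctx.stat(dst).st_size
> if remote_size != os.stat(src).st_size:
>     print("ERROR BAD SIZE", remote_size)
>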
> Petr
>
> [1]
> http://novastore.farm.particle.cz/tmp/xrdcp-20190911/test_dpm_xroot_file_copy.sh
> [2] http://novastore.farm.particle.cz/tmp/xrdcp-20190911/
>
> On 9/10/19 2:12 AM, Andrew Hanushevsky wrote:
>> Hi Petr,
>>  
>> If you also specify in the config file
>>  
>> xrootd.trace fs
>>  
>> then we will get a much earlier snapshot of the offset and length.
>> This will allow us to tell whether the data is getting corrupted
>> before it reaches the server or afterward. Could you do that (at
>> least for one test) to see where the corruption may happen?
>>  
>> Andy
>>  
>> *From:* Petr Vokac <mailto:[log in to unmask]>
>> *Sent:* Sunday, September 08, 2019 1:46 AM
>> *To:* andrea manzi <mailto:[log in to unmask]> ;
>> [log in to unmask] <mailto:[log in to unmask]> ; Andrew
>> Hanushevsky <mailto:[log in to unmask]>
>> *Cc:* dpm-devel <mailto:[log in to unmask]> ;
>> [log in to unmask] <mailto:[log in to unmask]>
>> *Subject:* Re: Uploaded file 128TB on filesystem
>>  
>> I just found one file where the problem with the offset occurred in the
>> middle of the transfer (not at the beginning):
>> 190907 20:03:53 25211 XrootdXeq: atlasprd.2817:168@[2001:718:401:6e03::1:51] pub IP64 login as /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1
>> 190907 20:03:55 25211 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 262144@0 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
>> 190907 20:03:55 25211 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 262144@262144 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
>> 190907 20:03:55 25211 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 262144@524288 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
>> 190907 20:03:55 25211 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 262144@786432 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
>> 190907 20:03:55 25211 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 262144@1048576 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
>> ...
>> 190907 20:40:35 31589 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 262144@2952003584 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
>> 190907 20:40:53 31589 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 262144@2952265728 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
>> 190907 20:41:55 31609 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 262144@2952265728 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
>> 190907 20:42:05 25089 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 1919238960@7379557499097268579 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
>> 190911 20:42:13 25118 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 1919238960@7379557499097268579 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
>> 190907 20:42:13 25118 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 262144@2953314304 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
>> 190907 20:42:13 25118 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 262144@2953576448 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
>> 190907 20:42:20 25118 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 262144@2953576448 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
>> The whole xrootd.log for this transfer is available at
>> http://novastore.farm.particle.cz/tmp/xrootd_bigfile_wrong_offset_in_the_middle_of_transfer.txt
>>
>> Petr
>>
>> On 9/8/19 10:25 AM, Petr Vokac wrote:
>>> I executed basically the following python code from a bash script in
>>> a loop:
>>> import gfal2
>>> ctx = gfal2.creat_context()
>>> ctx.set_opt_string("XROOTD PLUGIN", "XRD.WANTPROT", "gsi,unix")
>>> params = ctx.transfer_parameters()
>>> params.timeout = 60
>>> ctx.filecopy(params, "file:///tmp/1M", "root://golias100.farm.particle.cz/dpm/farm.particle.cz/home/atlas/1M")
>>> and after a while new corrupted files appeared on our storage,
>>> including one big file:
>>> [vokac@ui1 ~]$ gfal-ls -l 'davs://golias100.farm.particle.cz/dpm/farm.particle.cz/home/atlas/1M.01556.20190907190118.646902567.00000'
>>> -rw-rw-r--   0 1310  5281  139703410778112 Sep  7 19:01 davs://golias100.farm.particle.cz/dpm/farm.particle.cz/home/atlas/1M.01556.20190907190118.646902567.00000	
>>> Now I also have debug output from the gfal client (including
>>> XRD_LOGLEVEL=Dump + XRD_LOGMASK=All) for this file, together with the
>>> xrootd.log on the server side:
>>> [root@dpmpool19 ~]$ grep condor.39253:171 /var/log/xrootd/dpmdisk/xrootd.log
>>> 190907 19:01:19 29645 XrootdXeq: condor.39253:171@[2001:718:401:6e03::1:82] pub IP64 login as /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=vokac/CN=610071/CN=Petr Vokac
>>> 190907 19:01:22 32625 condor.39253:171@[2001:718:401:6e03::1:82] ofs_write: 1009440128@139703410643808 fn=/dpm/farm.particle.cz/home/atlas/1M.01556.20190907190118.646902567.00000
>>> 190907 19:01:23 32625 condor.39253:171@[2001:718:401:6e03::1:82] ofs_write: 131072@131072 fn=/dpm/farm.particle.cz/home/atlas/1M.01556.20190907190118.646902567.00000
>>> 190907 19:01:23 32625 condor.39253:171@[2001:718:401:6e03::1:82] ofs_write: 131072@262144 fn=/dpm/farm.particle.cz/home/atlas/1M.01556.20190907190118.646902567.00000
>>> 190907 19:01:24 32625 condor.39253:171@[2001:718:401:6e03::1:82] ofs_write: 131072@393216 fn=/dpm/farm.particle.cz/home/atlas/1M.01556.20190907190118.646902567.00000
>>> 190907 19:01:24 32625 condor.39253:171@[2001:718:401:6e03::1:82] ofs_write: 131072@524288 fn=/dpm/farm.particle.cz/home/atlas/1M.01556.20190907190118.646902567.00000
>>> 190907 19:01:25 32625 condor.39253:171@[2001:718:401:6e03::1:82] ofs_write: 131072@655360 fn=/dpm/farm.particle.cz/home/atlas/1M.01556.20190907190118.646902567.00000
>>> 190907 19:01:25 32625 condor.39253:171@[2001:718:401:6e03::1:82] ofs_write: 131072@786432 fn=/dpm/farm.particle.cz/home/atlas/1M.01556.20190907190118.646902567.00000
>>> 190907 19:01:25 32625 condor.39253:171@[2001:718:401:6e03::1:82] ofs_write: 131072@917504 fn=/dpm/farm.particle.cz/home/atlas/1M.01556.20190907190118.646902567.00000
>>> 190907 19:01:26 32625 XrootdXeq: condor.39253:171@[2001:718:401:6e03::1:82] disc 0:00:07
>>> This file was transferred from mff82.farm.particle.cz
>>> (2001:718:401:6e03::1:82) to the storage node dpmpool19.farm.particle.cz
>>> (2001:718:401:6017:2::19); both machines use xrootd-4.9.1-1.el7.x86_64
>>> and mff82 uses gfal2-2.16.3-1.el7.x86_64.
>>>
>>> I also tried to calculate the adler32 for the data blocks at the
>>> positions logged by the server, and the result matches the source data
>>> checksum. When I look at other transfers that finished with "SUCCESS"
>>> but with a file size different from the source file, I can see that the
>>> problem always occurs only in the first "ofs_write" log line, where
>>> size@offset contains wrong numbers.
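>>>
>>> (For reference, the per-block verification can be sketched like this; a minimal illustration, where the replica path is a placeholder and the size@offset values are just examples in the style of the ofs_write lines:)
>>>
>>> import zlib
>>>
>>> def adler32_of_range(path, size, offset):
>>>     """adler32 of `size` bytes at `offset`, i.e. one ofs_write size@offset block."""
>>>     with open(path, "rb") as f:
>>>         f.seek(offset)
>>>         return zlib.adler32(f.read(size)) & 0xffffffff
>>>
>>> # e.g. the block logged as 131072@131072 should match the same range of the source
>>> stored = "/mnt/fsN/atlas/replica-path"   # path of the replica on the diskserver (illustrative)
>>> assert adler32_of_range(stored, 131072, 131072) == adler32_of_range("/tmp/1M", 131072, 131072)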
>>>
>>> Attached files:
>>>
>>> 1M.01556.20190907190118.646902567.00000.dump ... xrd with dump log level
>>> 1M.01556.20190907190118.646902567.00000.log ... gfal2 verbose
>>> logging (+ few log lines from my script)
>>> 1M.01556.20190907190118.646902567.00000.xrootd.long ... dpmpool19
>>> xrootd.log
>>>
>>> I have more examples of successfully transferred files with a wrong
>>> size (sometimes bigger than the original 1M size and sometimes smaller):
>>>
>>> 07 Sep 18:39:33.879 [ERROR] dpmpool1.farm.particle.cz BAD SIZE 917504
>>> 07 Sep 18:40:45.039 [ERROR] dpmpool21.farm.particle.cz BAD SIZE 917504
>>> 07 Sep 18:41:22.555 [ERROR] dpmpool21.farm.particle.cz BAD SIZE 917504
>>> 07 Sep 18:41:39.248 [ERROR] dpmpool21.farm.particle.cz BAD SIZE 917504
>>> 07 Sep 18:56:51.113 [ERROR] dpmpool24.farm.particle.cz BAD SIZE 917504
>>> 07 Sep 18:58:04.436 [ERROR] dpmpool1.farm.particle.cz BAD SIZE 393194936
>>> 07 Sep 19:01:25.908 [ERROR] dpmpool19.farm.particle.cz BAD SIZE 139703410778112
>>> 07 Sep 19:02:17.183 [ERROR] dpmpool24.farm.particle.cz BAD SIZE 0
>>> 07 Sep 19:02:53.014 [ERROR] dpmpool16.farm.particle.cz BAD SIZE 917504
>>> 07 Sep 19:03:49.424 [ERROR] dpmpool1.farm.particle.cz BAD SIZE 917504
>>> 07 Sep 19:05:47.743 [ERROR] dpmpool21.farm.particle.cz BAD SIZE 13983727
>>> 07 Sep 19:15:15.761 [ERROR] dpmpool22.farm.particle.cz BAD SIZE 12752345
>>> 07 Sep 19:16:58.898 [ERROR] dpmpool20.farm.particle.cz BAD SIZE 917504
>>> 07 Sep 19:18:51.590 [ERROR] dpmpool17.farm.particle.cz BAD SIZE 6302080
>>> 07 Sep 19:20:37.006 [ERROR] dpmpool21.farm.particle.cz BAD SIZE 1456153896
>>> 07 Sep 19:52:32.553 [ERROR] dpmpool16.farm.particle.cz BAD SIZE 917504
>>> 07 Sep 19:53:45.155 [ERROR] dpmpool22.farm.particle.cz BAD SIZE 1000000000
>>> 07 Sep 20:11:14.497 [ERROR] dpmpool22.farm.particle.cz BAD SIZE 917504
>>> 07 Sep 20:12:51.081 [ERROR] dpmpool20.farm.particle.cz BAD SIZE 917504
>>> 07 Sep 20:14:58.090 [ERROR] dpmpool20.farm.particle.cz BAD SIZE 100663296
>>> 07 Sep 20:24:57.457 [ERROR] dpmpool1.farm.particle.cz BAD SIZE 917504
>>> 07 Sep 20:27:37.596 [ERROR] dpmpool19.farm.particle.cz BAD SIZE 917504
>>> 07 Sep 20:28:21.617 [ERROR] dpmpool16.farm.particle.cz BAD SIZE 0
>>>
>>> Petr
>>>
>>> On 9/5/19 2:44 PM, Petr Vokac wrote:
>>>> I'm going to test uploads in a loop with this code, with gfal debug
>>>> logging also enabled, from our remote machines that talk to the storage
>>>> through a quite saturated network (we never saw the same issue from
>>>> cluster worker nodes that share a room with the diskservers). I also saw
>>>> a few transfers that did not create a big sparse file, but just failed
>>>> while comparing checksums after upload, which is also quite suspicious
>>>> behavior.
>>>>
>>>> Petr
>>>>
>>>> On 9/5/19 12:57 PM, andrea manzi wrote:
>>>>>> Ciao Andrea,
>>>>>>
>>>>>> we practically do this:
>>>>>>     ctx = creat_context(...
>>>>>>     ctx.set_opt_string("XROOTD PLUGIN", "XRD.WANTPROT", "gsi,unix")
>>>>>>     ctx.mkdir_rec(...
>>>>>>     ctx.filecopy(..
>>>>>>
>>>>>> Is that the gfal transfer API?
>>>>> yup, internally in gfal2 we then use XrdCl to perform the copy
>>>>> (both for local files and TPC)
>>>>>
>>>>> today is a holiday, but we can start investigating the possible issue
>>>>> with Michal tomorrow or on Monday
>>>>>
>>>>> cheers
>>>>>
>>>>> Andrea
>>>>>
>>>>>> Cheers,
>>>>>> Mario
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 05/09/2019 10:35, andrea manzi wrote:
>>>>>>> ciao Mario,
>>>>>>>
>>>>>>> i guess you are using the gfal2 transfer API in Rucio right ? ( the
>>>>>>> equivalent to gfal-copy)
>>>>>>>
>>>>>>> in gfal2, in the case of the transfer API for xrootd, we don't do much,
>>>>>>> as we just use the xrootd client (XrdCl::CopyProcess), so we don't
>>>>>>> touch file offsets.
>>>>>>>
>>>>>>> so we may want to investigate if there is a problem in XrdCl. Do you
>>>>>>> have the logs of the client?
>>>>>>>
>>>>>>> If you are using the gfal2 Posix API instead we need to investigate a
>>>>>>> possible bug there
>>>>>>>
>>>>>>> cheers
>>>>>>>
>>>>>>> Andrea
>>>>>>>
>>>>>>>
>>>>>>>> Hi Petr, Andy,
>>>>>>>>
>>>>>>>> one extra piece of information here: Rucio uses the python bindings to
>>>>>>>> the gfal library to upload these files, and de facto always uses the
>>>>>>>> current stable gfal version that is published on ALRB. Rucio doesn't
>>>>>>>> implement any other upload by itself.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Mario
>>>>>>>>
>>>>>>>> On 05/09/2019 04:21, Andrew Hanushevsky wrote:
>>>>>>>>> Hi Petr,
>>>>>>>>>
>>>>>>>>> So, look at this line:
>>>>>>>>>
>>>>>>>>> 190904 17:02:16 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82] ofs_write: 1784104010@6506989258351653185 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>>>>>>>
>>>>>>>>> Why did the application doing the write specify an offset of
>>>>>>>>> 6506989258351653185? So, that is the problem and, as you noted, it
>>>>>>>>> doesn’t matter whether POSC is on or not. Indeed, it’s the first write
>>>>>>>>> to the file and only the first write that suffers from this problem. So,
>>>>>>>>> there are two possibilities here: a) there is a bug in the uploader, or
>>>>>>>>> b) there was a network data corruption that managed to hit the offset
>>>>>>>>> (somewhat unlikely).
>>>>>>>>>
>>>>>>>>> Andy
>>>>>>>>>
>>>>>>>>> *From:* Petr Vokac mailto:[log in to unmask]
>>>>>>>>> *Sent:* Wednesday, September 04, 2019 12:19 PM
>>>>>>>>> *To:* Andrew Hanushevsky mailto:[log in to unmask]
>>>>>>>>> *Cc:* dpm-devel mailto:[log in to unmask] ;
>>>>>>>>> [log in to unmask] mailto:[log in to unmask]
>>>>>>>>> *Subject:* Re: Uploaded file 128TB on filesystem
>>>>>>>>>
>>>>>>>>> I configured our diskservers with "ofs.trace write" (I did not yet add
>>>>>>>>> "ofs.persist hold 0" to the xrootd configuration) and today a new file
>>>>>>>>> appeared on one of our storage servers:
>>>>>>>>>
>>>>>>>>> *[root@dpmpool1 ~]# grep atlasprd.2817:152
>>>>>>>>> /var/log/xrootd/dpmdisk/xrootd.log*
>>>>>>>>> 190904 17:02:08 1894 XrootdXeq:
>>>>>>>>> atlasprd.2817:152@[2001:718:401:6e03::1:82] pub IP64 login as
>>>>>>>>> /DC=ch/DC=cern/OU=Organic
>>>>>>>>> Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1
>>>>>>>>> 190904 17:02:16 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82]
>>>>>>>>> ofs_write: 1784104010@6506989258351653185
>>>>>>>>> fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>>>>>>> 190904 17:02:16 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82]
>>>>>>>>> ofs_write: 262144@262144
>>>>>>>>> fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>>>>>>> 190904 17:02:17 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82]
>>>>>>>>> ofs_write: 262144@524288
>>>>>>>>> fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>>>>>>> 190904 17:02:17 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82]
>>>>>>>>> ofs_write: 262144@786432
>>>>>>>>> fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>>>>>>> 190904 17:02:17 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82]
>>>>>>>>> ofs_write: 262144@1048576
>>>>>>>>> fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>>>>>>> 190904 17:02:17 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82]
>>>>>>>>> ofs_write: 262144@1310720
>>>>>>>>> fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>>>>>>> 190904 17:02:17 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82]
>>>>>>>>> ofs_write: 262144@1572864
>>>>>>>>> fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>>>>>>> 190904 17:02:17 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82]
>>>>>>>>> ofs_write: 262144@1835008
>>>>>>>>> fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>>>>>>> 190904 17:02:17 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82]
>>>>>>>>> ofs_write: 135000@2097152
>>>>>>>>> fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>>>>>>> 190904 17:07:32 2163 XrootdXeq:
>>>>>>>>> atlasprd.2817:152@[2001:718:401:6e03::1:82] disc 0:05:24
>>>>>>>>>
>>>>>>>>> *[root@dpmpool1 ~]# ls -la
>>>>>>>>> /mnt/fs2/atlas/2019-09-04/EVNT.18960107._035789.pool.root.1.rucio.upload.309650.1567609328*
>>>>>>>>>
>>>>>>>>> -rw-rw---- 1 dpmmgr dpmmgr 6506989258351919104 Sep  4 17:02
>>>>>>>>> /mnt/fs2/atlas/2019-09-04/EVNT.18960107._035789.pool.root.1.rucio.upload.309650.1567609328
>>>>>>>>>
>>>>>>>>> *[root@dpmpool1 ~]# du --block-size=1
>>>>>>>>> /mnt/fs2/atlas/2019-09-04/EVNT.18960107._035789.pool.root.1.rucio.upload.309650.1567609328*
>>>>>>>>>
>>>>>>>>> 2236416    /mnt/fs2/atlas/2019-09-04/EVNT.18960107._035789.pool.root.1.rucio.upload.309650.1567609328
>>>>>>>>>
>>>>>>>>> This file has its first 256kB filled with 0x00 (the adler32 up to the
>>>>>>>>> original size 2232152 doesn't match) and this time there is no ofs_Posc
>>>>>>>>> in the xrootd.log file.
>>>>>>>>>
>>>>>>>>> Petr Vokac
>>>>>>>>>
>>>>>>>>> On 8/19/19 3:24 AM, Andrew Hanushevsky wrote:
>>>>>>>>>> Hi Petr,
>>>>>>>>>>
>>>>>>>>>> The issue here is that the python script (I believe it's python) is
>>>>>>>>>> disconnecting without closing the file and then reconnecting and
>>>>>>>>>> writing at a high offset. The log is pretty clear on that. You can
>>>>>>>>>> verify by adding "ofs.trace write" to see the write offsets
>>>>>>>>>> (though it
>>>>>>>>>> will produce a lot of output). You can also prohibit reconnection by
>>>>>>>>>> adding this to the persist directive.
>>>>>>>>>>
>>>>>>>>>> ofs.persist hold 0
>>>>>>>>>>
>>>>>>>>>> which would delete the file the moment the client disconnects
>>>>>>>>>> without
>>>>>>>>>> closing the file.
>>>>>>>>>>
>>>>>>>>>> Andy
>>>>>>>>>>
>>>>>>>>>> On Sat, 17 Aug 2019, Petr Vokac wrote:
>>>>>>>>>>
>>>>>>>>>>> Today a new "65PB" file appeared on our storage and it was again
>>>>>>>>>>> transferred with the xrootd protocol. I can again see the same error
>>>>>>>>>>> in the log file:
>>>>>>>>>>>
>>>>>>>>>>> [root@dpmpool21 ~]# ls -la /mnt/fs4/atlas/2019-08-17/EVNT.18893579._003197.pool.root.1.rucio.upload.234855.1566031522
>>>>>>>>>>> -rw-rw---- 1 dpmmgr dpmmgr 72340172838077281 Aug 17 10:48 /mnt/fs4/atlas/2019-08-17/EVNT.18893579._003197.pool.root.1.rucio.upload.234855.1566031522
>>>>>>>>>>>
>>>>>>>>>>> [root@dpmpool21 ~]# grep atlasprd.2757 /var/log/xrootd/dpmdisk/xrootd.log
>>>>>>>>>>> ...
>>>>>>>>>>> 190817 10:47:55 57803 XrootdXeq:
>>>>>>>>>>> atlasprd.2757:65@[2001:718:401:6e03::1:43] pub IP64 login as
>>>>>>>>>>> /DC=ch/DC=cern/OU=Organic
>>>>>>>>>>> Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1
>>>>>>>>>>> 190817 10:47:55 61859 XrootdXeq:
>>>>>>>>>>> atlasprd.2757:66@[2001:718:401:6e03::1:83] pub IP64 login as
>>>>>>>>>>> /DC=ch/DC=cern/OU=Organic
>>>>>>>>>>> Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1
>>>>>>>>>>> 190817 10:48:52 62233 XrootdXeq:
>>>>>>>>>>> atlasprd.2757:46@[2001:718:401:6e03::1:72] disc 0:00:58
>>>>>>>>>>> 190817 10:48:52 62954 XrootdXeq:
>>>>>>>>>>> atlasprd.2757:25@[2001:718:401:6e03::1:72] disc 0:03:30
>>>>>>>>>>> 190817 10:48:52 57804 XrootdXeq:
>>>>>>>>>>> atlasprd.2757:28@[2001:718:401:6e03::1:72] pub IP64 login as
>>>>>>>>>>> /DC=ch/DC=cern/OU=Organic
>>>>>>>>>>> Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1
>>>>>>>>>>> 190817 10:48:52 57804 ofs_Posc: Creator changed from
>>>>>>>>>>> atlasprd.2757:25@[2001:718:401:6e03::1:72] to
>>>>>>>>>>> atlasprd.2757:28@[2001:718:401:6e03::1:72] for
>>>>>>>>>>> /dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_13TeV/8b/f3/EVNT.18893579._003197.pool.root.1.rucio.upload
>>>>>>>>>>>
>>>>>>>>>>> 190817 10:48:53 57803 XrootdXeq:
>>>>>>>>>>> atlasprd.2757:46@[2001:718:401:6017:20:0:14:44] pub IP64 login as
>>>>>>>>>>> /DC=ch/DC=cern/OU=Organic
>>>>>>>>>>> Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1
>>>>>>>>>>> 190817 10:49:01 57818 XrootdXeq:
>>>>>>>>>>> atlasprd.2757:70@[2001:718:401:6017:20:0:11:11] pub IP64 login as
>>>>>>>>>>> /DC=ch/DC=cern/OU=Organic
>>>>>>>>>>> Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1
>>>>>>>>>>> 190817 10:49:07 62954 XrootdXeq:
>>>>>>>>>>> atlasprd.2757:55@[2001:718:401:6e03::1:81] disc 0:10:17
>>>>>>>>>>> ...
>>>>>>>>>>>
>>>>>>>>>>> Any suggestion how to trace what is causing this behavior?
>>>>>>>>>>>
>>>>>>>>>>> Petr
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 7/20/19 7:48 AM, Petr Vokac wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I ran the command you sent me
>>>>>>>>>>>>
>>>>>>>>>>>> grep "atlasprd.2757:" <logfile> | egrep
>>>>>>>>>>>> '2001:718:401:6e03::1:21|2001:718:401:6e03::1:32' > <excerpt>
>>>>>>>>>>>>
>>>>>>>>>>>> and because the output was not just a few lines long, I uploaded it to:
>>>>>>>>>>>>
>>>>>>>>>>>> https://vokac.web.cern.ch/vokac/tmp/dpmpool22.xrootd.log
>>>>>>>>>>>> https://vokac.web.cern.ch/vokac/tmp/dpmpool24.xrootd.log
>>>>>>>>>>>>
>>>>>>>>>>>> These machines on the 2001:718:401:6e03::/64 subnet are in a remote
>>>>>>>>>>>> location (but relatively close, with RTT ~ 0.7ms). I can see several
>>>>>>>>>>>> ofs_Posc occurrences in the logs of our disknodes during the
>>>>>>>>>>>> "190717 17:xx:xx" time period:
>>>>>>>>>>>>
>>>>>>>>>>>> http://vokac.web.cern.ch/vokac/tmp/POSC.txt
>>>>>>>>>>>>
>>>>>>>>>>>> Petr
>>>>>>>>>>>>
>>>>>>>>>>>> On 7/19/19 8:51 PM, Andrew Hanushevsky wrote:
>>>>>>>>>>>>> Hi Petr,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Well, POSC says that the client reconnected for writing so not
>>>>>>>>>>>>> all
>>>>>>>>>>>>> went well. Can I get the log output for that occurrence as I
>>>>>>>>>>>>> previously described?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, 19 Jul 2019, Petr Vokac wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> We are using the Rucio download client to transfer job outputs (it is
>>>>>>>>>>>>>> built on top of the gfal2 python interface).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://vokac.web.cern.ch/vokac/tmp/dpmpool22.xrootd.log
>>>>>>>>>>>>>> https://vokac.web.cern.ch/vokac/tmp/dpmpool24.xrootd.log
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> this is the full job output unfortunately with minimal gfal2
>>>>>>>>>>>>>> logging
>>>>>>>>>>>>>> output.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://vokac.web.cern.ch/vokac/tmp/grid.2258136.4.out-from-2001:718:401:6e03::1:21-to-dpmpool22.txt
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://vokac.web.cern.ch/vokac/tmp/grid.2255921.7.out-from-2001:718:401:6e03::1:32-to-dpmpool24.txt
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It looks like, according to gfal2, the transfer was OK:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2019-07-17 15:09:17,782 | INFO     | copytool_out        |
>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_xrootd.so loaded with
>>>>>>>>>>>>>> success
>>>>>>>>>>>>>> 2019-07-17 15:09:17,782 | INFO     | copytool_out        |
>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_rfio.so loaded with
>>>>>>>>>>>>>> success
>>>>>>>>>>>>>> 2019-07-17 15:09:17,783 | INFO     | copytool_out        |
>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_gridftp.so loaded with
>>>>>>>>>>>>>> success
>>>>>>>>>>>>>> 2019-07-17 15:09:17,783 | INFO     | copytool_out        |
>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_http.so loaded with
>>>>>>>>>>>>>> success
>>>>>>>>>>>>>> 2019-07-17 15:09:17,783 | INFO     | copytool_out        |
>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_lfc.so loaded with
>>>>>>>>>>>>>> success
>>>>>>>>>>>>>> 2019-07-17 15:09:17,783 | INFO     | copytool_out        |
>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_file.so loaded with
>>>>>>>>>>>>>> success
>>>>>>>>>>>>>> 2019-07-17 15:09:17,784 | INFO     | copytool_out        |
>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_sftp.so loaded with
>>>>>>>>>>>>>> success
>>>>>>>>>>>>>> 2019-07-17 15:09:17,784 | INFO     | copytool_out        |
>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_dcap.so loaded with
>>>>>>>>>>>>>> success
>>>>>>>>>>>>>> 2019-07-17 15:09:17,784 | INFO     | copytool_out        |
>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_srm.so loaded with
>>>>>>>>>>>>>> success
>>>>>>>>>>>>>> 2019-07-17 15:09:17,793 | INFO     | copytool_out        |
>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_xrootd.so loaded with
>>>>>>>>>>>>>> success
>>>>>>>>>>>>>> 2019-07-17 15:09:17,793 | INFO     | copytool_out        |
>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_rfio.so loaded with
>>>>>>>>>>>>>> success
>>>>>>>>>>>>>> 2019-07-17 15:09:17,793 | INFO     | copytool_out        |
>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_gridftp.so loaded with
>>>>>>>>>>>>>> success
>>>>>>>>>>>>>> 2019-07-17 15:09:17,794 | INFO     | copytool_out        |
>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_http.so loaded with
>>>>>>>>>>>>>> success
>>>>>>>>>>>>>> 2019-07-17 15:09:17,794 | INFO     | copytool_out        |
>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_lfc.so loaded with
>>>>>>>>>>>>>> success
>>>>>>>>>>>>>> 2019-07-17 15:09:17,794 | INFO     | copytool_out        |
>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_file.so loaded with
>>>>>>>>>>>>>> success
>>>>>>>>>>>>>> 2019-07-17 15:09:17,794 | INFO     | copytool_out        |
>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_sftp.so loaded with
>>>>>>>>>>>>>> success
>>>>>>>>>>>>>> 2019-07-17 15:09:17,794 | INFO     | copytool_out        |
>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_dcap.so loaded with
>>>>>>>>>>>>>> success
>>>>>>>>>>>>>> 2019-07-17 15:09:17,794 | INFO     | copytool_out        |
>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_srm.so loaded with
>>>>>>>>>>>>>> success
>>>>>>>>>>>>>> 2019-07-17 15:09:17,795 | INFO     | copytool_out        |
>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_xrootd.so loaded with
>>>>>>>>>>>>>> success
>>>>>>>>>>>>>> 2019-07-17 15:09:17,795 | INFO     | copytool_out        |
>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_rfio.so loaded with
>>>>>>>>>>>>>> success
>>>>>>>>>>>>>> 2019-07-17 15:09:17,795 | INFO     | copytool_out        |
>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_gridftp.so loaded with
>>>>>>>>>>>>>> success
>>>>>>>>>>>>>> 2019-07-17 15:09:17,795 | INFO     | copytool_out        |
>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_http.so loaded with
>>>>>>>>>>>>>> success
>>>>>>>>>>>>>> 2019-07-17 15:09:17,796 | INFO     | copytool_out        |
>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_lfc.so loaded with
>>>>>>>>>>>>>> success
>>>>>>>>>>>>>> 2019-07-17 15:09:17,796 | INFO     | copytool_out        |
>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_file.so loaded with
>>>>>>>>>>>>>> success
>>>>>>>>>>>>>> 2019-07-17 15:09:17,796 | INFO     | copytool_out        |
>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_sftp.so loaded with
>>>>>>>>>>>>>> success
>>>>>>>>>>>>>> 2019-07-17 15:09:17,796 | INFO     | copytool_out        |
>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_dcap.so loaded with
>>>>>>>>>>>>>> success
>>>>>>>>>>>>>> 2019-07-17 15:09:17,796 | INFO     | copytool_out        |
>>>>>>>>>>>>>> gfal2                            | connect                   |
>>>>>>>>>>>>>> [gfal_module_load] plugin
>>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_srm.so loaded with
>>>>>>>>>>>>>> success
>>>>>>>>>>>>>> 2019-07-17 15:09:17,810 | INFO     | copytool_out        |
>>>>>>>>>>>>>> gfal2                            | __gfal2_copy              | Event
>>>>>>>>>>>>>> triggered: BOTH GFAL2:CORE:COPY LIST:ENTER
>>>>>>>>>>>>>> 2019-07-17 15:09:17,810 | INFO     | copytool_out        |
>>>>>>>>>>>>>> gfal2                            | __gfal2_copy              | Event
>>>>>>>>>>>>>> triggered: BOTH GFAL2:CORE:COPY LIST:ITEM
>>>>>>>>>>>>>> file:///scratch/condor/dir_24031/atlas_HudmF2he/PanDA_Pilot-4419868006/log.18636890._001059.job.log.tgz.1
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> =>
>>>>>>>>>>>>>> root://golias100.farm.particle.cz:1094//dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_13TeV/26/98/log.18636890._001059.job.log.tgz.1.rucio.upload
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2019-07-17 15:09:17,810 | INFO     | copytool_out        |
>>>>>>>>>>>>>> gfal2                            | __gfal2_copy              | Event
>>>>>>>>>>>>>> triggered: BOTH GFAL2:CORE:COPY LIST:EXIT
>>>>>>>>>>>>>> 2019-07-17 15:09:17,811 | INFO     | Dummy-2             |
>>>>>>>>>>>>>> gfal2                            | (unknown function)        | Event
>>>>>>>>>>>>>> triggered: BOTH xroot TRANSFER:ENTER
>>>>>>>>>>>>>> file://localhost///scratch/condor/dir_24031/atlas_HudmF2he/PanDA_Pilot-4419868006/log.18636890._001059.job.log.tgz.1?xrd.gsiusrpxy=/scratch/condor/dir_24031/job.9XONDm6yx7unnoBGSqYX2MjnABFKDmABFKDmSyhTDmEBFKDmqAGN3n.proxy&xrd.wantprot=gsi,unix
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> =>
>>>>>>>>>>>>>> root://golias100.farm.particle.cz:1094///dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_13TeV/26/98/log.18636890._001059.job.log.tgz.1.rucio.upload?xrd.gsiusrpxy=/scratch/condor/dir_24031/job.9XONDm6yx7unnoBGSqYX2MjnABFKDmABFKDmSyhTDmEBFKDmqAGN3n.proxy&x
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2019-07-17 15:09:17,811 | INFO     | Dummy-2             |
>>>>>>>>>>>>>> gfal2                            | (unknown function)        | Event
>>>>>>>>>>>>>> triggered: BOTH xroot TRANSFER:TYPE streamed
>>>>>>>>>>>>>> 2019-07-17 15:11:23,914 | INFO     | Dummy-2             |
>>>>>>>>>>>>>> gfal2                            | (unknown function)        | Event
>>>>>>>>>>>>>> triggered: BOTH xroot TRANSFER:EXIT Job finished, [SUCCESS]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> but then rucio probably tries to verify the checksum (gfal-sum via the
>>>>>>>>>>>>>> python bindings) and it fails:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2019-07-17 15:15:23,774 | INFO     | copytool_out        |
>>>>>>>>>>>>>> pilot.copytool.rucio             | copy_out                  |
>>>>>>>>>>>>>> stderr = An unknown exception occurred.
>>>>>>>>>>>>>> Details: Checksum not validated
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> because DPM is unable to calculate the checksum of a 128TB file within
>>>>>>>>>>>>>> the timeout.
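>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (What the pilot does through the bindings is roughly a remote checksum request, something like the sketch below; the URL and checksum type here are illustrative. On a sparse "128TB" file the server-side adler32 calculation simply cannot finish within the client timeout.)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> import gfal2
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ctx = gfal2.creat_context()
>>>>>>>>>>>>>> # equivalent of `gfal-sum <url> adler32`; this is what times out on the huge sparse file
>>>>>>>>>>>>>> print(ctx.checksum("root://golias100.farm.particle.cz//dpm/farm.particle.cz/home/atlas/some_uploaded_file", "adler32"))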
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Petr
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 7/19/19 5:07 AM, Andrew Hanushevsky wrote:
>>>>>>>>>>>>>>> Hi Petr,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Indeed, it looks like what was used to upload the files
>>>>>>>>>>>>>>> (xrdcp?)
>>>>>>>>>>>>>>> created a very large sparse file by writing with a very large
>>>>>>>>>>>>>>> offset.
>>>>>>>>>>>>>>> The clue here is that an anomaly occurred during the write
>>>>>>>>>>>>>>> and these
>>>>>>>>>>>>>>> two messages (likely one for each file) tell us
>>>>>>>>>>>>>>> something....
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 190717 17:10:39 332092 ofs_Posc: Creator changed from atlasprd.2757:90@[2001:718:401:6e03::1:21] to atlasprd.2757:39@[2001:718:401:6e03::1:21] for /dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_13TeV/26/98/log.18636890._001059.job.log.tgz.1.rucio.upload
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 190717 17:29:26 14204 ofs_Posc: Creator changed from atlasprd.2757:33@[2001:718:401:6e03::1:32] to atlasprd.2757:95@[2001:718:401:6e03::1:32] for /dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_5TeV/09/bc/AOD.18595963._000786.pool.root.1.rucio.upload
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> What the POSC system is saying is that in each case the original
>>>>>>>>>>>>>>> creator of the file lost contact with the server. It then reconnected
>>>>>>>>>>>>>>> and resumed writing with a successful close. POSC allows this to
>>>>>>>>>>>>>>> prevent a network glitch from deleting the file, because a network
>>>>>>>>>>>>>>> glitch will cause the client to reconnect and resume where it left off.
>>>>>>>>>>>>>>> This is only allowed when the resuming client comes from the same IP
>>>>>>>>>>>>>>> address, has the same process ID, and the same logical name. In this
>>>>>>>>>>>>>>> case it was "atlasprd.2757". I find that rather odd because it's next
>>>>>>>>>>>>>>> to impossible for two processes on two different machines to have the
>>>>>>>>>>>>>>> same process number (i.e. 2757). So, this is why I am wondering if
>>>>>>>>>>>>>>> this was really an xrdcp or some other program.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We should look to see what these clients were actually doing during
>>>>>>>>>>>>>>> this period, as they clearly did not do the right thing when they
>>>>>>>>>>>>>>> reconnected. So, could you grep out all interactions of these two
>>>>>>>>>>>>>>> clients from the log, something like:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> grep "atlasprd.2757:" <logfile> | egrep
>>>>>>>>>>>>>>> '2001:718:401:6e03::1:21|2001:718:401:6e03::1:32' > <excerpt>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> and send the excerpt to me?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Andy
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, 18 Jul 2019, Petr Vokac wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> our DPM, which is using native xrootd 4.9.1, yesterday created two
>>>>>>>>>>>>>>>> different 128TB files on two different diskservers; they were received
>>>>>>>>>>>>>>>> as job output from two different worker nodes. The size of the original
>>>>>>>>>>>>>>>> files was a few megabytes, and when I check the beginning of the stored
>>>>>>>>>>>>>>>> files up to the original size & checksum everything seems to be OK -
>>>>>>>>>>>>>>>> except that these are sparse files with a size of 128TB. The original
>>>>>>>>>>>>>>>> transferred files from the WN looked like:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> mc15_13TeV:log.18636890._001059.job.log.tgz.1 - original size
>>>>>>>>>>>>>>>> 3101899, adler32 742e1336
>>>>>>>>>>>>>>>> mc16_5TeV:AOD.18595963._000786.pool.root.1 - original size
>>>>>>>>>>>>>>>> 737474514,
>>>>>>>>>>>>>>>> adler32 dc68cb21
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> but on our diskservers they were stored as
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [root@dpmpool22 ~]# ls -l
>>>>>>>>>>>>>>>> /mnt/fs1/atlas/2019-07-17/log.18636890._001059.job.log.tgz.1.rucio.upload.209928.1563376157
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -rw-rw---- 1 dpmmgr dpmmgr 140560257079236 Jul 17 17:11
>>>>>>>>>>>>>>>> /mnt/fs1/atlas/2019-07-17/log.18636890._001059.job.log.tgz.1.rucio.upload.209928.1563376157
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [root@dpmpool22 ~]# du --block-size=1
>>>>>>>>>>>>>>>> /mnt/fs1/atlas/2019-07-17/log.18636890._001059.job.log.tgz.1.rucio.upload.209928.1563376157
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 3108864
>>>>>>>>>>>>>>>> /mnt/fs1/atlas/2019-07-17/log.18636890._001059.job.log.tgz.1.rucio.upload.209928.1563376157
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [root@dpmpool24 ~]# ls -l
>>>>>>>>>>>>>>>> /mnt/fs2/atlas/2019-07-17/AOD.18595963._000786.pool.root.1.rucio.upload.210480.1563377083
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -rw-rw---- 1 dpmmgr dpmmgr 139734345056408 Jul 17 17:31
>>>>>>>>>>>>>>>> /mnt/fs2/atlas/2019-07-17/AOD.18595963._000786.pool.root.1.rucio.upload.210480.1563377083
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [root@dpmpool24 tmp]# du --block-size=1
>>>>>>>>>>>>>>>> /mnt/fs2/atlas/2019-07-17/AOD.18595963._000786.pool.root.1.rucio.upload.210480.1563377083
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 737480704
>>>>>>>>>>>>>>>> /mnt/fs2/atlas/2019-07-17/AOD.18595963._000786.pool.root.1.rucio.upload.210480.1563377083
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I was told that DPM is not really involved in transferring / storing
>>>>>>>>>>>>>>>> files with the xrootd protocol, so I guess something bad must have
>>>>>>>>>>>>>>>> happened inside the XRootD libraries. Do you have any idea why these
>>>>>>>>>>>>>>>> two transfers generated such extreme files? Unfortunately our xrootd
>>>>>>>>>>>>>>>> logging level is pretty low and I was able to find only info about the
>>>>>>>>>>>>>>>> file creation:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [root@dpmpool22 ~]# grep log.18636890._001059.job.log.tgz /var/log/xrootd/dpmdisk/xrootd.log
>>>>>>>>>>>>>>>> 190717 17:10:39 332092 ofs_Posc: Creator changed from atlasprd.2757:90@[2001:718:401:6e03::1:21] to atlasprd.2757:39@[2001:718:401:6e03::1:21] for /dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_13TeV/26/98/log.18636890._001059.job.log.tgz.1.rucio.upload
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [root@dpmpool24 tmp]# grep AOD.18595963._000786.pool.root /var/log/xrootd/dpmdisk/xrootd.log
>>>>>>>>>>>>>>>> 190717 17:29:26 14204 ofs_Posc: Creator changed from atlasprd.2757:33@[2001:718:401:6e03::1:32] to atlasprd.2757:95@[2001:718:401:6e03::1:32] for /dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_5TeV/09/bc/AOD.18595963._000786.pool.root.1.rucio.upload
>>>>>>>>>>>>>>>> Petr
>>>>>>>>>>>>>>>>