Hi Petr, thanks a lot for your investigation! From the xrootd client dump logs you attached, the portion that logs the write to your diskserver is the following:

[2019-09-07 19:01:19.397348 +0200][Dump ][XRootD ] [dpmpool19.farm.particle.cz:1095] Sending message kXR_write (handle: 0x00000000, offset: 0, size: 1048576)
[2019-09-07 19:01:19.397371 +0200][Debug ][ExDbgMsg ] [dpmpool19.farm.particle.cz:1095] MsgHandler created: 0x8c008230 (message: kXR_write (handle: 0x00000000, offset: 0, size: 1048576) ).
[2019-09-07 19:01:19.397420 +0200][Dump ][PostMaster ] [dpmpool19.farm.particle.cz:1095 #0] Sending message kXR_write (handle: 0x00000000, offset: 0, size: 1048576) (0x8c007020) through substream 0 expecting answer at 0
[2019-09-07 19:01:19.397438 +0200][Dump ][Poller ] <[2001:718:401:6e03::1:82]:35474><--><[2001:718:401:6017:2::19]:1095> Enable write notifications, timeout: 15
[2019-09-07 19:01:19.397477 +0200][Dump ][Poller ] <[2001:718:401:6e03::1:82]:35474><--><[2001:718:401:6017:2::19]:1095> Got an event: ReadyToWrite
[2019-09-07 19:01:19.397505 +0200][Dump ][AsyncSock ] [dpmpool19.farm.particle.cz:1095 #0.0] Will write the payload in one go with the header for message: kXR_write (handle: 0x00000000, offset: 0, size: 1048576) (0x8c007020).
[2019-09-07 19:01:19.727585 +0200][Dump ][TaskMgr ] Running task: "FileTimer task"
[2019-09-07 19:01:19.727646 +0200][Dump ][TaskMgr ] Will rerun task "FileTimer task" at [2019-09-07 19:01:34 +0200]
[2019-09-07 19:01:23.306001 +0200][Dump ][Poller ] <[2001:718:401:6e03::1:82]:35474><--><[2001:718:401:6017:2::19]:1095> Got an event: ReadyToWrite
[2019-09-07 19:01:23.306078 +0200][Dump ][AsyncSock ] [dpmpool19.farm.particle.cz:1095 #0.0] Will write the payload in one go with the header for message: kXR_write (handle: 0x00000000, offset: 0, size: 1048576) (0x8c007020).
[2019-09-07 19:01:23.306103 +0200][Dump ][AsyncSock ] [dpmpool19.farm.particle.cz:1095 #0.0] WroteV a message: kXR_write (handle: 0x00000000, offset: 0, size: 1048576) (0x8c007020), 24 bytes
[2019-09-07 19:01:23.306129 +0200][Dump ][AsyncSock ] [dpmpool19.farm.particle.cz:1095 #0.0] WroteV raw data: 448 bytes
[2019-09-07 19:01:23.306141 +0200][Dump ][AsyncSock ] [dpmpool19.farm.particle.cz:1095 #0.0] Successfully sent message: kXR_write (handle: 0x00000000, offset: 0, size: 1048576) (0x8c007020).
[2019-09-07 19:01:23.306156 +0200][Dump ][XRootD ] [dpmpool19.farm.particle.cz:1095] Message kXR_write (handle: 0x00000000, offset: 0, size: 1048576) has been successfully sent.
[2019-09-07 19:01:23.306169 +0200][Debug ][ExDbgMsg ] [dpmpool19.farm.particle.cz:1095] Moving MsgHandler: 0x8c008230 (message: kXR_write (handle: 0x00000000, offset: 0, size: 1048576) ) from out-queu to in-queue.
[2019-09-07 19:01:23.306185 +0200][Dump ][PostMaster ] [dpmpool19.farm.particle.cz:1095 #0.0] All messages consumed, disable uplink
[2019-09-07 19:01:23.306198 +0200][Dump ][Poller ] <[2001:718:401:6e03::1:82]:35474><--><[2001:718:401:6017:2::19]:1095> Disable write notifications
[2019-09-07 19:01:25.890592 +0200][Dump ][Poller ] <[2001:718:401:6e03::1:82]:35474><--><[2001:718:401:6017:2::19]:1095> Got an event: ReadyToRead
[2019-09-07 19:01:25.890665 +0200][Dump ][XRootDTransport ] [msg: 0x7c005610] Expecting 0 bytes of message body
[2019-09-07 19:01:25.890682 +0200][Dump ][AsyncSock ] [dpmpool19.farm.particle.cz:1095 #0.0] Received message header for 0x7c005610 size: 8
[2019-09-07 19:01:25.890701 +0200][Dump ][AsyncSock ] [dpmpool19.farm.particle.cz:1095 #0.0] Received message 0x7c005610 of 8 bytes
[2019-09-07 19:01:25.890717 +0200][Dump ][PostMaster ] [dpmpool19.farm.particle.cz:1095 #0] Handling received message: 0x7c005610.
[2019-09-07 19:01:25.890816 +0200][Dump ][XRootD ] [dpmpool19.farm.particle.cz:1095] Got a kXR_ok response to request kXR_write (handle: 0x00000000, offset: 0, size: 1048576)
[2019-09-07 19:01:25.890868 +0200][Debug ][ExDbgMsg ] [dpmpool19.farm.particle.cz:1095] Calling MsgHandler: 0x8c008230 (message: kXR_write (handle: 0x00000000, offset: 0, size: 1048576) ) with status: [SUCCESS] .
[2019-09-07 19:01:25.890889 +0200][Dump ][File ] [0x8c006770@root://golias100.farm.particle.cz:1094///dpm/farm.particle.cz/home/atlas/1M.01556.20190907190118.646902567.00000?oss.asize=1048576&xrd.gsiusrpxy=/tmp/x509up_u996&xrd.wantprot=gsi,unix] Got state response for message kXR_write (handle: 0x00000000, offset: 0, size: 1048576)
[2019-09-07 19:01:25.890924 +0200][Debug ][ExDbgMsg ] [dpmpool19.farm.particle.cz:1095] Destroying MsgHandler: 0x8c008230.
[2019-09-07 19:01:25.891157 +0200][Debug ][File ] [0x8c006770@root://golias100.farm.particle.cz:1094///dpm/farm.particle.cz/home/atlas/1M.01556.20190907190118.646902567.00000?oss.asize=1048576&xrd.gsiusrpxy=/tmp/x509up_u996&xrd.wantprot=gsi,unix] Sending a close command for handle 0x0 to dpmpool19.farm.particle.cz:1095
[2019-09-07 19:01:25.891193 +0200][Dump ][XRootD ] [dpmpool19.farm.particle.cz:1095] Sending message kXR_close (handle: 0x00000000)

I don't see any "strange" offset or size there.
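As a side note, since the bad numbers only show up in occasional ofs_write lines, a small log filter can pull them out of large xrootd.log files automatically. This is just a sketch, not part of gfal2 or xrootd; the function name and the 1 TiB plausibility bound are my own assumptions (in practice you would derive the bound from the expected file size):

```python
import re

# Matches server-side trace lines such as:
#   ... ofs_write: 262144@1048576 fn=/dpm/farm.particle.cz/...
OFS_WRITE_RE = re.compile(r"ofs_write: (\d+)@(\d+) fn=(\S+)")

def suspicious_writes(lines, max_plausible_offset=1 << 40):
    """Yield (size, offset, path) for ofs_write entries whose offset
    exceeds a sanity bound (1 TiB by default, an arbitrary choice)."""
    for line in lines:
        m = OFS_WRITE_RE.search(line)
        if not m:
            continue
        size, offset = int(m.group(1)), int(m.group(2))
        if offset > max_plausible_offset:
            yield size, offset, m.group(3)
```

Run over the server-side excerpts in this thread, a filter like this would flag the anomalous entries while passing the regular 256 kB writes.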
I'm waiting for Michal so we can have a look together.

Cheers,
Andrea

On 08.09.19 10:46, Petr Vokac wrote:
> I just found one file where the problem with the offset occurred in the
> middle of the transfer (not at the beginning):
>
> 190907 20:03:53 25211 XrootdXeq: atlasprd.2817:168@[2001:718:401:6e03::1:51] pub IP64 login as /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1
> 190907 20:03:55 25211 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 262144@0 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
> 190907 20:03:55 25211 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 262144@262144 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
> 190907 20:03:55 25211 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 262144@524288 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
> 190907 20:03:55 25211 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 262144@786432 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
> 190907 20:03:55 25211 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 262144@1048576 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
> ...
> 190907 20:40:35 31589 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 262144@2952003584 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
> 190907 20:40:53 31589 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 262144@2952265728 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
> 190907 20:41:55 31609 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 262144@2952265728 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
> 190907 20:42:05 25089 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 1919238960@7379557499097268579 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
> 190907 20:42:13 25118 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 1919238960@7379557499097268579 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
> 190907 20:42:13 25118 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 262144@2953314304 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
> 190907 20:42:13 25118 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 262144@2953576448 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
> 190907 20:42:20 25118 atlasprd.2817:168@[2001:718:401:6e03::1:51] ofs_write: 262144@2953576448 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_13TeV/9c/6a/AOD.18892557._000593.pool.root.1.rucio.upload
>
> The whole xrootd.log for this transfer is available at
> http://novastore.farm.particle.cz/tmp/xrootd_bigfile_wrong_offset_in_the_middle_of_transfer.txt
>
> Petr
>
> On 9/8/19 10:25 AM, Petr Vokac wrote:
>> I executed basically the following
>> Python code from a bash script in a loop:
>>
>> import gfal2
>> ctx = gfal2.creat_context()
>> ctx.set_opt_string("XROOTD PLUGIN", "XRD.WANTPROT", "gsi,unix")
>> params = ctx.transfer_parameters()
>> params.timeout = 60
>> ctx.filecopy(params, "file:///tmp/1M", "root://golias100.farm.particle.cz/dpm/farm.particle.cz/home/atlas/1M")
>>
>> and after a while new corrupted files appeared at our storage,
>> including one big file:
>>
>> [vokac@ui1 ~]$ gfal-ls -l 'davs://golias100.farm.particle.cz/dpm/farm.particle.cz/home/atlas/1M.01556.20190907190118.646902567.00000'
>> -rw-rw-r-- 0 1310 5281 139703410778112 Sep 7 19:01 davs://golias100.farm.particle.cz/dpm/farm.particle.cz/home/atlas/1M.01556.20190907190118.646902567.00000
>>
>> Now I also have debug output from the gfal client, including
>> XRD_LOGLEVEL=Dump + XRD_LOGMASK=All, for this file, together with the
>> xrootd.log on the server side:
>>
>> [root@dpmpool19 ~]$ grep condor.39253:171 /var/log/xrootd/dpmdisk/xrootd.log
>> 190907 19:01:19 29645 XrootdXeq: condor.39253:171@[2001:718:401:6e03::1:82] pub IP64 login as /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=vokac/CN=610071/CN=Petr Vokac
>> 190907 19:01:22 32625 condor.39253:171@[2001:718:401:6e03::1:82] ofs_write: 1009440128@139703410643808 fn=/dpm/farm.particle.cz/home/atlas/1M.01556.20190907190118.646902567.00000
>> 190907 19:01:23 32625 condor.39253:171@[2001:718:401:6e03::1:82] ofs_write: 131072@131072 fn=/dpm/farm.particle.cz/home/atlas/1M.01556.20190907190118.646902567.00000
>> 190907 19:01:23 32625 condor.39253:171@[2001:718:401:6e03::1:82] ofs_write: 131072@262144 fn=/dpm/farm.particle.cz/home/atlas/1M.01556.20190907190118.646902567.00000
>> 190907 19:01:24 32625 condor.39253:171@[2001:718:401:6e03::1:82] ofs_write: 131072@393216 fn=/dpm/farm.particle.cz/home/atlas/1M.01556.20190907190118.646902567.00000
>> 190907 19:01:24 32625 condor.39253:171@[2001:718:401:6e03::1:82] ofs_write: 131072@524288 fn=/dpm/farm.particle.cz/home/atlas/1M.01556.20190907190118.646902567.00000
>> 190907 19:01:25 32625 condor.39253:171@[2001:718:401:6e03::1:82] ofs_write: 131072@655360 fn=/dpm/farm.particle.cz/home/atlas/1M.01556.20190907190118.646902567.00000
>> 190907 19:01:25 32625 condor.39253:171@[2001:718:401:6e03::1:82] ofs_write: 131072@786432 fn=/dpm/farm.particle.cz/home/atlas/1M.01556.20190907190118.646902567.00000
>> 190907 19:01:25 32625 condor.39253:171@[2001:718:401:6e03::1:82] ofs_write: 131072@917504 fn=/dpm/farm.particle.cz/home/atlas/1M.01556.20190907190118.646902567.00000
>> 190907 19:01:26 32625 XrootdXeq: condor.39253:171@[2001:718:401:6e03::1:82] disc 0:00:07
>>
>> This file was transferred from mff82.farm.particle.cz
>> (2001:718:401:6e03::1:82) to the storage node dpmpool19.farm.particle.cz
>> (2001:718:401:6017:2::19); both machines use xrootd-4.9.1-1.el7.x86_64,
>> and mff82 uses gfal2-2.16.3-1.el7.x86_64.
>>
>> I also tried to calculate adler32 for the data blocks from the positions
>> logged by the server, and the result matches the source data checksum.
>> When I look at other transfers that finished with "SUCCESS" but with a
>> filesize different from the source file, I can see that the problem
>> always occurs only in the first "ofs_write" log line, where size@offset
>> contains wrong numbers.
>>
>> Attached files:
>>
>> 1M.01556.20190907190118.646902567.00000.dump ... xrd with dump log level
>> 1M.01556.20190907190118.646902567.00000.log ... gfal2 verbose logging
>> (+ a few log lines from my script)
>> 1M.01556.20190907190118.646902567.00000.xrootd.long ...
dpmpool19 xrootd.log
>>
>> I have more examples of successfully transferred files with a wrong
>> size (sometimes bigger than the original 1M size and sometimes smaller):
>>
>> 07 Sep 18:39:33.879 [ERROR] dpmpool1.farm.particle.cz BAD SIZE 917504
>> 07 Sep 18:40:45.039 [ERROR] dpmpool21.farm.particle.cz BAD SIZE 917504
>> 07 Sep 18:41:22.555 [ERROR] dpmpool21.farm.particle.cz BAD SIZE 917504
>> 07 Sep 18:41:39.248 [ERROR] dpmpool21.farm.particle.cz BAD SIZE 917504
>> 07 Sep 18:56:51.113 [ERROR] dpmpool24.farm.particle.cz BAD SIZE 917504
>> 07 Sep 18:58:04.436 [ERROR] dpmpool1.farm.particle.cz BAD SIZE 393194936
>> 07 Sep 19:01:25.908 [ERROR] dpmpool19.farm.particle.cz BAD SIZE 139703410778112
>> 07 Sep 19:02:17.183 [ERROR] dpmpool24.farm.particle.cz BAD SIZE 0
>> 07 Sep 19:02:53.014 [ERROR] dpmpool16.farm.particle.cz BAD SIZE 917504
>> 07 Sep 19:03:49.424 [ERROR] dpmpool1.farm.particle.cz BAD SIZE 917504
>> 07 Sep 19:05:47.743 [ERROR] dpmpool21.farm.particle.cz BAD SIZE 13983727
>> 07 Sep 19:15:15.761 [ERROR] dpmpool22.farm.particle.cz BAD SIZE 12752345
>> 07 Sep 19:16:58.898 [ERROR] dpmpool20.farm.particle.cz BAD SIZE 917504
>> 07 Sep 19:18:51.590 [ERROR] dpmpool17.farm.particle.cz BAD SIZE 6302080
>> 07 Sep 19:20:37.006 [ERROR] dpmpool21.farm.particle.cz BAD SIZE 1456153896
>> 07 Sep 19:52:32.553 [ERROR] dpmpool16.farm.particle.cz BAD SIZE 917504
>> 07 Sep 19:53:45.155 [ERROR] dpmpool22.farm.particle.cz BAD SIZE 1000000000
>> 07 Sep 20:11:14.497 [ERROR] dpmpool22.farm.particle.cz BAD SIZE 917504
>> 07 Sep 20:12:51.081 [ERROR] dpmpool20.farm.particle.cz BAD SIZE 917504
>> 07 Sep 20:14:58.090 [ERROR] dpmpool20.farm.particle.cz BAD SIZE 100663296
>> 07 Sep 20:24:57.457 [ERROR] dpmpool1.farm.particle.cz BAD SIZE 917504
>> 07 Sep 20:27:37.596 [ERROR] dpmpool19.farm.particle.cz BAD SIZE 917504
>> 07 Sep 20:28:21.617 [ERROR] dpmpool16.farm.particle.cz BAD SIZE 0
>>
>> Petr
>>
>> On 9/5/19 2:44 PM, Petr Vokac wrote:
>>> I'm going to test uploads in a loop with this
code, with gfal debug logging also enabled, from our remote machines
>>> that talk to the storage through a quite saturated network (we never
>>> saw the same issue from cluster worker nodes that share a room with
>>> the diskservers). I also saw a few transfers that did not create a big
>>> sparse file, but just failed while comparing checksums after upload,
>>> and this is also quite suspicious behavior.
>>>
>>> Petr
>>>
>>> On 9/5/19 12:57 PM, andrea manzi wrote:
>>>>> Ciao Andrea,
>>>>>
>>>>> we practically do this:
>>>>>
>>>>> ctx = creat_context(...
>>>>> ctx.set_opt_string("XROOTD PLUGIN", "XRD.WANTPROT", "gsi,unix")
>>>>> ctx.mkdir_rec(...
>>>>> ctx.filecopy(..
>>>>>
>>>>> Is that the gfal transfer API?
>>>> yup, internally in gfal2 we then use XrdCl to perform the copy
>>>> (both for local files and TPC)
>>>>
>>>> today is a holiday, but we can start investigating the possible issue
>>>> with Michal tomorrow or Monday
>>>>
>>>> cheers
>>>>
>>>> Andrea
>>>>
>>>>> Cheers,
>>>>> Mario
>>>>>
>>>>> On 05/09/2019 10:35, andrea manzi wrote:
>>>>>> ciao Mario,
>>>>>>
>>>>>> I guess you are using the gfal2 transfer API in Rucio, right? (the
>>>>>> equivalent of gfal-copy)
>>>>>>
>>>>>> in gfal2, in the case of the transfer API for xrootd, we don't do
>>>>>> much, as we just use the xrootd client (XrdCl::CopyProcess), so we
>>>>>> don't touch file offsets.
>>>>>>
>>>>>> so we may want to investigate whether there is a problem in XrdCl.
>>>>>> Do you have the logs of the client?
>>>>>>
>>>>>> If you are using the gfal2 Posix API instead, we need to investigate
>>>>>> a possible bug there
>>>>>>
>>>>>> cheers
>>>>>>
>>>>>> Andrea
>>>>>>
>>>>>>> Hi Petr, Andy,
>>>>>>>
>>>>>>> one extra piece of information here: Rucio uses the python bindings
>>>>>>> to the gfal library to upload these files, and de facto always uses
>>>>>>> the current stable gfal version that is published on ALRB. Rucio
>>>>>>> doesn't implement any other upload by itself.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Mario
>>>>>>>
>>>>>>> On 05/09/2019 04:21, Andrew Hanushevsky wrote:
>>>>>>>> Hi Petr,
>>>>>>>>
>>>>>>>> So, look at this line:
>>>>>>>>
>>>>>>>> 190904 17:02:16 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82] ofs_write: 1784104010@6506989258351653185 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>>>>>>
>>>>>>>> Why did the application doing the write specify an offset of
>>>>>>>> 6506989258351653185? So, that is the problem and, as you noted, it
>>>>>>>> doesn’t matter whether POSC is on or not. Indeed, it’s the first
>>>>>>>> write to the file, and only the first write, that suffers from this
>>>>>>>> problem. So, there are two possibilities here: a) there is a bug in
>>>>>>>> the uploader, or b) there was a network data corruption that managed
>>>>>>>> to hit the offset (somewhat unlikely).
>>>>>>>>
>>>>>>>> Andy
>>>>>>>>
>>>>>>>> *From:* Petr Vokac
>>>>>>>> *Sent:* Wednesday, September 04, 2019 12:19 PM
>>>>>>>> *To:* Andrew Hanushevsky
>>>>>>>> *Cc:* dpm-devel; [log in to unmask]
>>>>>>>> *Subject:* Re: Uploaded file 128TB on filesystem
>>>>>>>>
>>>>>>>> I configured our diskservers with "ofs.trace write" (I did not yet
>>>>>>>> add "ofs.persist hold 0" to the xrootd configuration), and today a
>>>>>>>> new file appeared on one of our storage servers:
>>>>>>>>
>>>>>>>> [root@dpmpool1 ~]# grep atlasprd.2817:152 /var/log/xrootd/dpmdisk/xrootd.log
>>>>>>>> 190904 17:02:08 1894 XrootdXeq: atlasprd.2817:152@[2001:718:401:6e03::1:82] pub IP64 login as /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1
>>>>>>>> 190904 17:02:16 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82] ofs_write: 1784104010@6506989258351653185 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>>>>>> 190904 17:02:16 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82] ofs_write: 262144@262144 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>>>>>> 190904 17:02:17 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82] ofs_write: 262144@524288 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>>>>>> 190904 17:02:17 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82] ofs_write: 262144@786432 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>>>>>> 190904 17:02:17 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82] ofs_write: 262144@1048576 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>>>>>> 190904 17:02:17 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82] ofs_write: 262144@1310720 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>>>>>> 190904 17:02:17 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82] ofs_write: 262144@1572864 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>>>>>> 190904 17:02:17 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82] ofs_write: 262144@1835008 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>>>>>> 190904 17:02:17 22915 atlasprd.2817:152@[2001:718:401:6e03::1:82] ofs_write: 135000@2097152 fn=/dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_5TeV/90/9f/EVNT.18960107._035789.pool.root.1.rucio.upload
>>>>>>>> 190904 17:07:32 2163 XrootdXeq: atlasprd.2817:152@[2001:718:401:6e03::1:82] disc 0:05:24
>>>>>>>>
>>>>>>>> [root@dpmpool1 ~]# ls -la /mnt/fs2/atlas/2019-09-04/EVNT.18960107._035789.pool.root.1.rucio.upload.309650.1567609328
>>>>>>>> -rw-rw---- 1 dpmmgr dpmmgr 6506989258351919104 Sep 4 17:02 /mnt/fs2/atlas/2019-09-04/EVNT.18960107._035789.pool.root.1.rucio.upload.309650.1567609328
>>>>>>>>
>>>>>>>> [root@dpmpool1 ~]# du --block-size=1 /mnt/fs2/atlas/2019-09-04/EVNT.18960107._035789.pool.root.1.rucio.upload.309650.1567609328
>>>>>>>> 2236416 /mnt/fs2/atlas/2019-09-04/EVNT.18960107._035789.pool.root.1.rucio.upload.309650.1567609328
>>>>>>>>
>>>>>>>> This file has its first 256kB filled with 0x00 (adler32 up to the
>>>>>>>> original size 2232152 doesn't match), and this time there is no
>>>>>>>> ofs_Posc in the xrootd.log file.
>>>>>>>>
>>>>>>>> Petr Vokac
>>>>>>>>
>>>>>>>> On 8/19/19 3:24 AM, Andrew Hanushevsky wrote:
>>>>>>>>> Hi Petr,
>>>>>>>>>
>>>>>>>>> The issue here is that the python script (I believe it's python) is
>>>>>>>>> disconnecting without closing the file and then reconnecting and
>>>>>>>>> writing at a high offset. The log is pretty clear on that. You can
>>>>>>>>> verify by adding "ofs.trace write" to see the write offsets (though
>>>>>>>>> it will produce a lot of output). You can also prohibit reconnection
>>>>>>>>> by adding this to the persist directive:
>>>>>>>>>
>>>>>>>>> ofs.persist hold 0
>>>>>>>>>
>>>>>>>>> which would delete the file the moment the client disconnects
>>>>>>>>> without closing the file.
>>>>>>>>>
>>>>>>>>> Andy
>>>>>>>>>
>>>>>>>>> On Sat, 17 Aug 2019, Petr Vokac wrote:
>>>>>>>>>
>>>>>>>>>> Today a new "65PB" file appeared on our storage, and it was again
>>>>>>>>>> transferred with the xrootd protocol.
I can again see the same error in the log file:
>>>>>>>>>>
>>>>>>>>>> [root@dpmpool21 ~]# ls -la /mnt/fs4/atlas/2019-08-17/EVNT.18893579._003197.pool.root.1.rucio.upload.234855.1566031522
>>>>>>>>>> -rw-rw---- 1 dpmmgr dpmmgr 72340172838077281 Aug 17 10:48 /mnt/fs4/atlas/2019-08-17/EVNT.18893579._003197.pool.root.1.rucio.upload.234855.1566031522
>>>>>>>>>>
>>>>>>>>>> [root@dpmpool21 ~]# grep atlasprd.2757 /var/log/xrootd/dpmdisk/xrootd.log
>>>>>>>>>> ...
>>>>>>>>>> 190817 10:47:55 57803 XrootdXeq: atlasprd.2757:65@[2001:718:401:6e03::1:43] pub IP64 login as /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1
>>>>>>>>>> 190817 10:47:55 61859 XrootdXeq: atlasprd.2757:66@[2001:718:401:6e03::1:83] pub IP64 login as /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1
>>>>>>>>>> 190817 10:48:52 62233 XrootdXeq: atlasprd.2757:46@[2001:718:401:6e03::1:72] disc 0:00:58
>>>>>>>>>> 190817 10:48:52 62954 XrootdXeq: atlasprd.2757:25@[2001:718:401:6e03::1:72] disc 0:03:30
>>>>>>>>>> 190817 10:48:52 57804 XrootdXeq: atlasprd.2757:28@[2001:718:401:6e03::1:72] pub IP64 login as /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1
>>>>>>>>>> 190817 10:48:52 57804 ofs_Posc: Creator changed from atlasprd.2757:25@[2001:718:401:6e03::1:72] to atlasprd.2757:28@[2001:718:401:6e03::1:72] for /dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_13TeV/8b/f3/EVNT.18893579._003197.pool.root.1.rucio.upload
>>>>>>>>>> 190817 10:48:53 57803 XrootdXeq: atlasprd.2757:46@[2001:718:401:6017:20:0:14:44] pub IP64 login as /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1
>>>>>>>>>> 190817 10:49:01 57818 XrootdXeq: atlasprd.2757:70@[2001:718:401:6017:20:0:11:11] pub IP64 login as /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1
>>>>>>>>>> 190817 10:49:07 62954 XrootdXeq: atlasprd.2757:55@[2001:718:401:6e03::1:81] disc 0:10:17
>>>>>>>>>> ...
>>>>>>>>>>
>>>>>>>>>> Any suggestions on how to trace what is causing this behavior?
>>>>>>>>>>
>>>>>>>>>> Petr
>>>>>>>>>>
>>>>>>>>>> On 7/20/19 7:48 AM, Petr Vokac wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I ran the command you sent me
>>>>>>>>>>>
>>>>>>>>>>> grep "atlasprd.2757:" <logfile> | egrep '2001:718:401:6e03::1:21|2001:718:401:6e03::1:32' > <excerpt>
>>>>>>>>>>>
>>>>>>>>>>> and because the output was not just a few lines long, I uploaded
>>>>>>>>>>> it to
>>>>>>>>>>>
>>>>>>>>>>> https://vokac.web.cern.ch/vokac/tmp/dpmpool22.xrootd.log
>>>>>>>>>>> https://vokac.web.cern.ch/vokac/tmp/dpmpool24.xrootd.log
>>>>>>>>>>>
>>>>>>>>>>> These machines on the 2001:718:401:6e03::/64 subnet are in a
>>>>>>>>>>> remote location (but relatively close, with RTT ~ 0.7ms). I can
>>>>>>>>>>> see several ofs_Posc occurrences in the logs of our disknodes
>>>>>>>>>>> during the "190717 17:xx:xx" time period:
>>>>>>>>>>>
>>>>>>>>>>> http://vokac.web.cern.ch/vokac/tmp/POSC.txt
>>>>>>>>>>>
>>>>>>>>>>> Petr
>>>>>>>>>>>
>>>>>>>>>>> On 7/19/19 8:51 PM, Andrew Hanushevsky wrote:
>>>>>>>>>>>> Hi Petr,
>>>>>>>>>>>>
>>>>>>>>>>>> Well, POSC says that the client reconnected for writing, so not
>>>>>>>>>>>> all went well. Can I get the log output for that occurrence, as
>>>>>>>>>>>> I previously described?
>>>>>>>>>>>>
>>>>>>>>>>>> Andy
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, 19 Jul 2019, Petr Vokac wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> We are using the Rucio download client to transfer job outputs
>>>>>>>>>>>>> (it is built on top of the gfal2 python interface).
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://vokac.web.cern.ch/vokac/tmp/dpmpool22.xrootd.log
>>>>>>>>>>>>> https://vokac.web.cern.ch/vokac/tmp/dpmpool24.xrootd.log
>>>>>>>>>>>>>
>>>>>>>>>>>>> this is the full job output, unfortunately with minimal gfal2
>>>>>>>>>>>>> logging output:
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://vokac.web.cern.ch/vokac/tmp/grid.2258136.4.out-from-2001:718:401:6e03::1:21-to-dpmpool22.txt
>>>>>>>>>>>>> https://vokac.web.cern.ch/vokac/tmp/grid.2255921.7.out-from-2001:718:401:6e03::1:32-to-dpmpool24.txt
>>>>>>>>>>>>>
>>>>>>>>>>>>> It looks like, according to gfal2, the transfer was OK
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2019-07-17 15:09:17,782 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_xrootd.so loaded with success
>>>>>>>>>>>>> 2019-07-17 15:09:17,782 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_rfio.so loaded with success
>>>>>>>>>>>>> 2019-07-17 15:09:17,783 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_gridftp.so loaded with success
>>>>>>>>>>>>> 2019-07-17 15:09:17,783 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_http.so loaded with success
>>>>>>>>>>>>> 2019-07-17 15:09:17,783 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_lfc.so loaded with success
>>>>>>>>>>>>> 2019-07-17 15:09:17,783 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin
>>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_file.so loaded with >>>>>>>>>>>>> success >>>>>>>>>>>>> 2019-07-17 15:09:17,784 | INFO | copytool_out | >>>>>>>>>>>>> gfal2 | connect | >>>>>>>>>>>>> [gfal_module_load] plugin >>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_sftp.so loaded with >>>>>>>>>>>>> success >>>>>>>>>>>>> 2019-07-17 15:09:17,784 | INFO | copytool_out | >>>>>>>>>>>>> gfal2 | connect | >>>>>>>>>>>>> [gfal_module_load] plugin >>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_dcap.so loaded with >>>>>>>>>>>>> success >>>>>>>>>>>>> 2019-07-17 15:09:17,784 | INFO | copytool_out | >>>>>>>>>>>>> gfal2 | connect | >>>>>>>>>>>>> [gfal_module_load] plugin >>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_srm.so loaded with >>>>>>>>>>>>> success >>>>>>>>>>>>> 2019-07-17 15:09:17,793 | INFO | copytool_out | >>>>>>>>>>>>> gfal2 | connect | >>>>>>>>>>>>> [gfal_module_load] plugin >>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_xrootd.so loaded with >>>>>>>>>>>>> success >>>>>>>>>>>>> 2019-07-17 15:09:17,793 | INFO | copytool_out | >>>>>>>>>>>>> gfal2 | connect | >>>>>>>>>>>>> [gfal_module_load] plugin >>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_rfio.so loaded with >>>>>>>>>>>>> success >>>>>>>>>>>>> 2019-07-17 15:09:17,793 | INFO | copytool_out | >>>>>>>>>>>>> gfal2 | connect | >>>>>>>>>>>>> [gfal_module_load] plugin >>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_gridftp.so loaded with >>>>>>>>>>>>> success >>>>>>>>>>>>> 2019-07-17 15:09:17,794 | INFO | copytool_out | >>>>>>>>>>>>> gfal2 | connect | >>>>>>>>>>>>> [gfal_module_load] plugin >>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_http.so loaded with >>>>>>>>>>>>> success >>>>>>>>>>>>> 2019-07-17 15:09:17,794 | INFO | copytool_out | >>>>>>>>>>>>> gfal2 | connect | >>>>>>>>>>>>> [gfal_module_load] plugin >>>>>>>>>>>>> /usr/lib64/gfal2-plugins//libgfal_plugin_lfc.so loaded with >>>>>>>>>>>>> success >>>>>>>>>>>>> 2019-07-17 15:09:17,794 | INFO | 
copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_file.so loaded with success
2019-07-17 15:09:17,794 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_sftp.so loaded with success
2019-07-17 15:09:17,794 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_dcap.so loaded with success
2019-07-17 15:09:17,794 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_srm.so loaded with success
2019-07-17 15:09:17,795 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_xrootd.so loaded with success
2019-07-17 15:09:17,795 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_rfio.so loaded with success
2019-07-17 15:09:17,795 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_gridftp.so loaded with success
2019-07-17 15:09:17,795 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_http.so loaded with success
2019-07-17 15:09:17,796 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_lfc.so loaded with success
2019-07-17 15:09:17,796 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_file.so loaded with success
2019-07-17 15:09:17,796 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_sftp.so loaded with success
2019-07-17 15:09:17,796 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_dcap.so loaded with success
2019-07-17 15:09:17,796 | INFO | copytool_out | gfal2 | connect | [gfal_module_load] plugin /usr/lib64/gfal2-plugins//libgfal_plugin_srm.so loaded with success
2019-07-17 15:09:17,810 | INFO | copytool_out | gfal2 | __gfal2_copy | Event triggered: BOTH GFAL2:CORE:COPY LIST:ENTER
2019-07-17 15:09:17,810 | INFO | copytool_out | gfal2 | __gfal2_copy | Event triggered: BOTH GFAL2:CORE:COPY LIST:ITEM file:///scratch/condor/dir_24031/atlas_HudmF2he/PanDA_Pilot-4419868006/log.18636890._001059.job.log.tgz.1 => root://golias100.farm.particle.cz:1094//dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_13TeV/26/98/log.18636890._001059.job.log.tgz.1.rucio.upload
2019-07-17 15:09:17,810 | INFO | copytool_out | gfal2 | __gfal2_copy | Event triggered: BOTH GFAL2:CORE:COPY LIST:EXIT
2019-07-17 15:09:17,811 | INFO | Dummy-2 | gfal2 | (unknown function) | Event triggered: BOTH xroot TRANSFER:ENTER file://localhost///scratch/condor/dir_24031/atlas_HudmF2he/PanDA_Pilot-4419868006/log.18636890._001059.job.log.tgz.1?xrd.gsiusrpxy=/scratch/condor/dir_24031/job.9XONDm6yx7unnoBGSqYX2MjnABFKDmABFKDmSyhTDmEBFKDmqAGN3n.proxy&xrd.wantprot=gsi,unix => root://golias100.farm.particle.cz:1094///dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_13TeV/26/98/log.18636890._001059.job.log.tgz.1.rucio.upload?xrd.gsiusrpxy=/scratch/condor/dir_24031/job.9XONDm6yx7unnoBGSqYX2MjnABFKDmABFKDmSyhTDmEBFKDmqAGN3n.proxy&x
2019-07-17 15:09:17,811 | INFO | Dummy-2 | gfal2 | (unknown function) | Event triggered: BOTH xroot TRANSFER:TYPE streamed
2019-07-17 15:11:23,914 | INFO | Dummy-2 | gfal2 | (unknown function) | Event triggered: BOTH xroot TRANSFER:EXIT Job finished, [SUCCESS]

but rucio then probably tries to verify the checksum (gfal-sum via the Python bindings), and that fails:

2019-07-17 15:15:23,774 | INFO | copytool_out | pilot.copytool.rucio | copy_out | stderr = An unknown exception occurred.
Details: Checksum not validated

because DPM is unable to calculate the checksum for a 128TB file within the timeout.

Petr

On 7/19/19 5:07 AM, Andrew Hanushevsky wrote:

Hi Petr,

Indeed, it looks like whatever was used to upload the files (xrdcp?) created a very large sparse file by writing with a very large offset.
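[Editor's note: Andy's sparse-file hypothesis is easy to demonstrate locally. A single write at a large offset produces a file whose apparent size (what `ls -l` shows) vastly exceeds its allocated blocks (what `du` shows), which is exactly the signature reported on the pool nodes. A minimal, illustrative Python sketch, assuming a filesystem with hole support such as ext4 or XFS:]

```python
import os
import tempfile

# A single write far past the current end of file creates a sparse file:
# the apparent size jumps to offset + length, but the hole in between
# gets no allocated blocks on filesystems that support holes.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"real payload")    # the actual content at offset 0
    f.seek(1 << 30)             # seek 1 GiB past the start of the file
    f.write(b"\0")              # one stray byte at a huge offset

st = os.stat(path)
apparent = st.st_size           # what "ls -l" reports: 1 GiB + 1 byte
allocated = st.st_blocks * 512  # what "du --block-size=1" reports: a few KiB
os.remove(path)
```

This matches the observations below: apparent sizes around 128 TB, while `du` reports roughly the size of the original payload.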
The clue here is that an anomaly occurred during the write, and these two messages (likely one per file) tell us something:

190717 17:10:39 332092 ofs_Posc: Creator changed from atlasprd.2757:90@[2001:718:401:6e03::1:21] to atlasprd.2757:39@[2001:718:401:6e03::1:21] for /dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_13TeV/26/98/log.18636890._001059.job.log.tgz.1.rucio.upload

190717 17:29:26 14204 ofs_Posc: Creator changed from atlasprd.2757:33@[2001:718:401:6e03::1:32] to atlasprd.2757:95@[2001:718:401:6e03::1:32] for /dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_5TeV/09/bc/AOD.18595963._000786.pool.root.1.rucio.upload

What the POSC system is saying is that in each case the original creator of the file lost contact with the server, then reconnected and resumed writing, ending with a successful close. POSC allows this so that a network glitch does not cause a file to be deleted, since after a glitch the client will reconnect and resume where it left off. Resumption is only allowed when the reconnecting client comes from the same IP address, has the same process ID, and uses the same logical name; in this case it was "atlasprd.2757". I find that rather odd, because it is next to impossible for two processes on two different machines to have the same process number (i.e. 2757). So this is why I am wondering whether this was really xrdcp or some other program.

We should look at what these clients were actually doing during this period, as they clearly did not do the right thing when they reconnected. So, could you grep out all interactions of these two clients from the log, something like:

grep "atlasprd.2757:" <logfile> | egrep '2001:718:401:6e03::1:21|2001:718:401:6e03::1:32' > <excerpt>

and send the excerpt to me?

Andy

On Thu, 18 Jul 2019, Petr Vokac wrote:

Hi,

our DPM, which runs native xrootd 4.9.1, yesterday created two different 128TB files on two different diskservers; both were received as job output from two different worker nodes. The original files were only a few megabytes, and when I check the beginning of the stored files up to the original size & checksum, everything seems to be OK, except that these are sparse files with an apparent size of 128TB.
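[Editor's note: the POSC resumption rule Andy describes (same logical name, same process ID, same originating address; only the per-connection instance number after the colon may differ) can be sketched as below. `same_creator` is a hypothetical helper written for illustration, not actual XRootD code:]

```python
import re

# Client identifiers in the log look like:
#   atlasprd.2757:90@[2001:718:401:6e03::1:21]
# i.e. <logical name>.<pid>:<connection instance>@<host>
_IDENT = re.compile(r"^(?P<name>[^.]+)\.(?P<pid>\d+):(?P<inst>\d+)@(?P<host>.+)$")

def same_creator(a: str, b: str) -> bool:
    """True if b would be allowed to resume a's POSC write: same logical
    name, same pid, and same host; the instance number may change."""
    ma, mb = _IDENT.match(a), _IDENT.match(b)
    if not (ma and mb):
        return False
    return (ma["name"], ma["pid"], ma["host"]) == (mb["name"], mb["pid"], mb["host"])
```

Applied to the two log excerpts: the creator changes on each diskserver pass this check (same host, only the instance differs), while the two uploads came from different worker nodes and so could never resume each other.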
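[Editor's note: the 128TB apparent size also explains the checksum timeout mentioned earlier in the thread: adler32 is a streaming checksum, so the server has to read the full apparent content, mostly zero bytes, to produce it. A sketch of the incremental computation using Python's `zlib`; this is illustrative, not DPM's actual checksum code:]

```python
import zlib

def adler32_of(path: str, chunk: int = 1 << 20) -> str:
    """Stream adler32 over a file; the cost is proportional to the
    apparent size, so a 128 TB sparse file takes far longer than a
    few-MB file with the same real content."""
    csum = 1  # adler32 initial value
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk)
            if not block:
                break
            csum = zlib.adler32(block, csum)
    return format(csum & 0xFFFFFFFF, "08x")
```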
The original transferred files from the WN looked like:

mc15_13TeV:log.18636890._001059.job.log.tgz.1 - original size 3101899, adler32 742e1336
mc16_5TeV:AOD.18595963._000786.pool.root.1 - original size 737474514, adler32 dc68cb21

but on our diskservers they were stored as:

[root@dpmpool22 ~]# ls -l /mnt/fs1/atlas/2019-07-17/log.18636890._001059.job.log.tgz.1.rucio.upload.209928.1563376157
-rw-rw---- 1 dpmmgr dpmmgr 140560257079236 Jul 17 17:11 /mnt/fs1/atlas/2019-07-17/log.18636890._001059.job.log.tgz.1.rucio.upload.209928.1563376157
[root@dpmpool22 ~]# du --block-size=1 /mnt/fs1/atlas/2019-07-17/log.18636890._001059.job.log.tgz.1.rucio.upload.209928.1563376157
3108864 /mnt/fs1/atlas/2019-07-17/log.18636890._001059.job.log.tgz.1.rucio.upload.209928.1563376157

[root@dpmpool24 ~]# ls -l /mnt/fs2/atlas/2019-07-17/AOD.18595963._000786.pool.root.1.rucio.upload.210480.1563377083
-rw-rw---- 1 dpmmgr dpmmgr 139734345056408 Jul 17 17:31 /mnt/fs2/atlas/2019-07-17/AOD.18595963._000786.pool.root.1.rucio.upload.210480.1563377083
[root@dpmpool24 tmp]# du --block-size=1 /mnt/fs2/atlas/2019-07-17/AOD.18595963._000786.pool.root.1.rucio.upload.210480.1563377083
737480704 /mnt/fs2/atlas/2019-07-17/AOD.18595963._000786.pool.root.1.rucio.upload.210480.1563377083

I was told that DPM is not really involved in transferring / storing files with the xrootd protocol, so I guess something bad must have happened inside the XRootD libraries. Do you have any idea why these two transfers generated such extreme files? Unfortunately our xrootd logging level is pretty low and I was only able to find info about the file creation:

[root@dpmpool22 ~]# grep log.18636890._001059.job.log.tgz /var/log/xrootd/dpmdisk/xrootd.log
190717 17:10:39 332092 ofs_Posc: Creator changed from atlasprd.2757:90@[2001:718:401:6e03::1:21] to atlasprd.2757:39@[2001:718:401:6e03::1:21] for /dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc15_13TeV/26/98/log.18636890._001059.job.log.tgz.1.rucio.upload
[root@dpmpool24 tmp]# grep AOD.18595963._000786.pool.root /var/log/xrootd/dpmdisk/xrootd.log
190717 17:29:26 14204 ofs_Posc: Creator changed from atlasprd.2757:33@[2001:718:401:6e03::1:32] to atlasprd.2757:95@[2001:718:401:6e03::1:32] for /dpm/farm.particle.cz/home/atlas/atlasdatadisk/rucio/mc16_5TeV/09/bc/AOD.18595963._000786.pool.root.1.rucio.upload

Petr

########################################################################

Use REPLY-ALL to reply to list

To unsubscribe from the XROOTD-L list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1