Are there any known issues with TPC PULL where the final checksum does not match the user-provided one?
We are running EOS on top of XRootD 4.12.7. LHCb started using HTTP TPC in production, but some of the TPC PULL transfers eventually fail because the final checksum does not match the user-provided one. FTS later retries, probably using TPC push, and the transfer succeeds with the correct checksum.

In the EOS logs we don't see any errors whatsoever. For example, on the disk node where such a file with the wrong checksum is written, we get a clean and successful transfer log:

210219 05:10:34 26159 TPC_PullRequest: event=PULL_START, local=/eos/lhcb/grid/prod/lhcb/MC/2012/RXCHAD.STRIP.DST/00122396/0000/00122396_00001840_1.rxchad.strip.dst, remote=https://lhcbwebdav-kit.gridka.de:2880/pnfs/gridka.de/lhcb/LHCb-Disk/lhcb/MC/2012/RXCHAD.STRIP.DST/00122396/0000/00122396_00001840_1.rxchad.strip.dst, user=(anonymous); Starting a push request
210219 05:10:34 time=1613707834.930317 func=open                     level=INFO  logid=6a336346-7268-11eb-a7ac-a4bf0114cb20 [log in to unmask]:1095 tid=00007f80735fb700 source=XrdFstOfsFile:120              tident=? sec=      uid=0 gid=0 name= geo="" path=/eos/lhcb/grid/prod/lhcb/MC/2012/RXCHAD.STRIP.DST/00122396/0000/00122396_00001840_1.rxchad.strip.dst info=cap.sym=<...>&cap.msg=<...>&mgm.logid=6a336346-7268-11eb-a7ac-a4bf0114cb20&mgm.replicaindex=0&mgm.replicahead=0&mgm.etag="342790846266998784:00000000"&mgm.id=4c1d6756&authz=<...> open_mode=201
210219 05:10:34 time=1613707834.930530 func=ProcessCapOpaque         level=INFO  logid=6a336346-7268-11eb-a7ac-a4bf0114cb20 [log in to unmask]:1095 tid=00007f80735fb700 source=XrdFstOfsFile:2253             tident=? sec=(null) uid=99 gid=99 name=(null) geo="" capability=&mgm.access=create&mgm.ruid=7947&mgm.rgid=1470&mgm.uid=99&mgm.gid=99&mgm.path=/eos/lhcb/grid/prod/lhcb/MC/2012/RXCHAD.STRIP.DST/00122396/0000/00122396_00001840_1.rxchad.strip.dst&mgm.manager=eoslhcb-qdb-52bedb7c30.cern.ch:1094&mgm.fid=4c1d6756&mgm.cid=113766179&mgm.sec=https|lhcbprod|[2001:1458:301:cd::100:1ab]||||/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=fstagni/CN=693025/CN=Federico Stagni|&mgm.lid=1048850&mgm.bookingsize=1000000000&mgm.fsid=11575&mgm.url0=root://st-048-bbead64b.cern.ch:1095//&mgm.fsid0=11575&mgm.url1=root://p06636710f33060.cern.ch:1095//&mgm.fsid1=10611&cap.valid=1613711434
210219 05:10:34 time=1613707834.930577 func=open                     level=INFO  logid=6a336346-7268-11eb-a7ac-a4bf0114cb20 [log in to unmask]:1095 tid=00007f80735fb700 source=XrdFstOfsFile:198              tident=? sec=(null) uid=7947 gid=1470 name=nobody geo="" ns_path=/eos/lhcb/grid/prod/lhcb/MC/2012/RXCHAD.STRIP.DST/00122396/0000/00122396_00001840_1.rxchad.strip.dst fst_path=/data05/0001f2d3/4c1d6756
210219 05:10:34 time=1613707834.930926 func=open                     level=INFO  logid=6a336346-7268-11eb-a7ac-a4bf0114cb20 [log in to unmask]:1095 tid=00007f80735fb700 source=XrdFstOfsFile:468              tident=? sec=(null) uid=7947 gid=1470 name=nobody geo="" fst_path=/data05/0001f2d3/4c1d6756 open-mode=301 create-mode=41a4 layout-name=replica oss-opaque=&mgm.lid=1048850&mgm.bookingsize=1000000000
210219 05:10:34 time=1613707834.930943 func=Open                     level=INFO  logid=6a336346-7268-11eb-a7ac-a4bf0114cb20 [log in to unmask]:1095 tid=00007f80735fb700 source=ReplicaParLayout:104           tident=? sec=      uid=0 gid=0 name= geo="" replica_head=0, replica_index=0
....
210219 05:10:41 time=1613707841.983478 func=VerifyChecksum           level=INFO  logid=6a336346-7268-11eb-a7ac-a4bf0114cb20 [log in to unmask]:1095 tid=00007f80735fb700 source=XrdFstOfsFile:3017             tident=? sec=      uid=7947 gid=1470 name=nobody geo="" (write) checksum type: adler checksum hex: 1edadd8c requested-checksum hex: -none-
210219 05:10:42 time=1613707842.071020 func=_close                   level=INFO  logid=6a336346-7268-11eb-a7ac-a4bf0114cb20 [log in to unmask]:1095 tid=00007f80735fb700 source=XrdFstOfsFile:1807             tident=? sec=      uid=7947 gid=1470 name=nobody geo="" msg="done close" rc=0 errc=0
210219 05:10:42 26159 TPC_PullRequest: event=TRANSFER_SUCCESS, local=/eos/lhcb/grid/prod/lhcb/MC/2012/RXCHAD.STRIP.DST/00122396/0000/00122396_00001840_1.rxchad.strip.dst, remote=https://lhcbwebdav-kit.gridka.de:2880/pnfs/gridka.de/lhcb/LHCb-Disk/lhcb/MC/2012/RXCHAD.STRIP.DST/00122396/0000/00122396_00001840_1.rxchad.strip.dst, user=(anonymous), bytes_transferred=180232171, tpc_status=200

And the corresponding error in FTS reads:

INFO Fri, 19 Feb 2021 05:10:42 +0100; [1613707842071] BOTH http_plugin TRANSFER:EXIT https://lhcbwebdav-kit.gridka.de:2880/pnfs/gridka.de/lhcb/LHCb-Disk/lhcb/MC/2012/RXCHAD.STRIP.DST/00122396/0000/00122396_00001840_1.rxchad.strip.dst => https://eoslhcb.cern.ch/eos/lhcb/grid/prod/lhcb/MC/2012/RXCHAD.STRIP.DST/00122396/0000/00122396_00001840_1.rxchad.strip.dst
ERR Fri, 19 Feb 2021 05:10:42 +0100; Non recoverable error: [5] DESTINATION CHECKSUM MISMATCH User-defined and destination ADLER32 do not match (fc21471b != 1edadd8c)

The EOS OFS layer is quite robust when it comes to such write operations, so I would suspect a possible issue in the XrdTpc layer related to curl reading from the remote endpoint, buffering, etc. Does this ring a bell?
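
One way to narrow this down is to recompute the Adler-32 of the replica as it sits on the disk node and see whether it matches the checksum EOS recorded at close, the user-provided one, or neither. The following is only a minimal sketch using Python's standard zlib module; the helper names adler32_hex and classify are made up for illustration and are not part of EOS, XRootD or FTS.

```python
import zlib


def adler32_hex(path, chunk_size=1024 * 1024):
    """Return the Adler-32 of a file as 8 lowercase hex digits."""
    value = 1  # Adler-32 starts from 1, not 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # Feed the running value back in so the file is checksummed
            # incrementally instead of being loaded into memory at once.
            value = zlib.adler32(chunk, value)
    return format(value & 0xFFFFFFFF, "08x")


def classify(path, expected_hex, recorded_hex):
    """Hypothetical helper: say which checksum the bytes on disk agree with."""
    actual = adler32_hex(path)
    if actual == expected_hex.lower():
        return "matches-expected"
    if actual == recorded_hex.lower():
        return "matches-recorded"
    return "matches-neither"
```

For the transfer above this would be run on the FST against fst_path=/data05/0001f2d3/4c1d6756 with expected fc21471b and recorded 1edadd8c. If the bytes on disk agree with the recorded value, the data was most likely already wrong or incomplete when it reached the OFS write path. Comparing the file size with bytes_transferred=180232171 and with the size at the source would be a complementary check.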

Thanks!


View this issue on GitHub: https://github.com/xrootd/xrootd/issues/1404