Hi guys, sorry for the late reply, but I took a few days off.

> Otoh it seems to try for every operation (given the overall slowdown of the reading to >40 minutes, from originally 30 seconds), 

@Axel-Naumann : I had a look at the logs and here's what I see:

The first request is sent at:
```
[2022-09-06 20:06:06.225956 +0200][Dump   ][XRootD            ] [eospublic.cern.ch:1094] Sending message kXR_open (file: /eos/root-eos/cms_opendata_2012_nanoaod_skimmed/ZZTo4mu.root, mode: 00, flags: kXR_open_read kXR_async kXR_retstat )
```

This triggers the client to open a connection. The client first fails to connect over IPv6, then retries and succeeds over IPv4 (this takes about 1.5 min):
```
[2022-09-06 20:06:06.228241 +0200][Debug  ][AsyncSock         ] [eospublic.cern.ch:1094.0] Attempting connection to [2001:1458:301:17::100:e]:1094
...
[2022-09-06 20:07:21.227311 +0200][Error  ][AsyncSock         ] [eospublic.cern.ch:1094.0] Unable to connect: operation timed out
...
[2022-09-06 20:07:21.227431 +0200][Debug  ][AsyncSock         ] [eospublic.cern.ch:1094.0] Attempting connection to [::ffff:128.142.160.145]:1094
...
[2022-09-06 20:07:27.591630 +0200][Debug  ][PostMaster        ] [eospublic.cern.ch:1094] Stream 0 connected.
```

Then the client gets redirected to a data server, again fails to connect to the IPv6 address, and again succeeds over IPv4 (this once more takes less than 1.5 min):
```
[2022-09-06 20:07:27.606806 +0200][Debug  ][AsyncSock         ] [p06636710d91266.cern.ch:1095.0] Attempting connection to [2001:1458:301:c4::100:8]:1095
...
[2022-09-06 20:08:42.604962 +0200][Error  ][AsyncSock         ] [p06636710d91266.cern.ch:1095.0] Unable to connect: operation timed out
...
[2022-09-06 20:08:42.605062 +0200][Debug  ][AsyncSock         ] [p06636710d91266.cern.ch:1095.0] Attempting connection to [::ffff:128.142.215.72]:1095
...
[2022-09-06 20:08:48.829024 +0200][Debug  ][PostMaster        ] [p06636710d91266.cern.ch:1095] Stream 0 connected.
```
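The pattern is the same in both cases: the resolved addresses are tried one after the other, and an unreachable IPv6 address can cost up to the full connection window before the IPv4 address is attempted. Below is a minimal, self-contained sketch of that sequential fallback using plain sockets (this is an illustration only, not the actual XrdCl code; the host, port and 120-second value are taken from the log above):
```
#include <cstdio>
#include <fcntl.h>
#include <netdb.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

// Try one address, giving up after timeoutMs milliseconds.
static bool tryConnect(const addrinfo *ai, int timeoutMs) {
  int fd = socket(ai->ai_family, SOCK_STREAM, 0);
  if (fd < 0) return false;
  fcntl(fd, F_SETFL, O_NONBLOCK);
  connect(fd, ai->ai_addr, ai->ai_addrlen);   // returns immediately (EINPROGRESS)
  pollfd pfd;
  pfd.fd = fd; pfd.events = POLLOUT; pfd.revents = 0;
  bool ok = poll(&pfd, 1, timeoutMs) > 0;     // wait for the handshake or the timeout
  if (ok) {                                   // verify the connect really succeeded
    int err = 0; socklen_t len = sizeof(err);
    getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len);
    ok = (err == 0);
  }
  close(fd);
  return ok;
}

int main() {
  addrinfo hints{}, *res = nullptr;
  hints.ai_socktype = SOCK_STREAM;
  hints.ai_family = AF_UNSPEC;                // returns both the AAAA and the A record
  if (getaddrinfo("eospublic.cern.ch", "1094", &hints, &res)) return 1;

  const int connectionWindowMs = 120000;      // XRD_CONNECTIONWINDOW default: 120 s
  for (addrinfo *ai = res; ai; ai = ai->ai_next) {
    // Each address may consume up to the full window before the next is tried.
    if (tryConnect(ai, connectionWindowMs)) { std::puts("connected"); break; }
  }
  freeaddrinfo(res);
  return 0;
}
```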

After a successful open, the client sends 3 read requests and a query, and finally closes the file at:
```
[2022-09-06 20:08:48.922313 +0200][Dump   ][XRootD            ] [p06636710d91266.cern.ch:1095] Got a kXR_ok response to request kXR_close (handle: 0x00000000)
```

In total it takes about 2 minutes 40 seconds, so I don't understand where the 40 minutes you mention are coming from. Once a connection is established, it is reused for all the requests the client issues.


> and I'd hope that the resilience could happen a bit faster than 2.5 minutes - but I do not know the details nor whether that's cause by our usage in ROOT or xrootd.

It is tunable: you can set the connection window with the `XRD_CONNECTIONWINDOW` environment variable (the default is 120 seconds).
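For example, a minimal sketch using the XrdCl C++ client (exporting `XRD_CONNECTIONWINDOW=10` in the shell before starting the job has the same effect; the 10-second value is only an example):
```
#include <XrdCl/XrdClDefaultEnv.hh>
#include <XrdCl/XrdClFile.hh>

int main() {
  // Must be done before the first connection is attempted;
  // equivalent to exporting XRD_CONNECTIONWINDOW=10 beforehand.
  XrdCl::DefaultEnv::GetEnv()->PutInt("ConnectionWindow", 10);

  XrdCl::File f;
  auto st = f.Open("root://eospublic.cern.ch//eos/root-eos/"
                   "cms_opendata_2012_nanoaod_skimmed/ZZTo4mu.root",
                   XrdCl::OpenFlags::Read);
  return st.IsOK() ? 0 : 1;
}
```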


> Perhaps a simpler "fail fast" algorithm than Happy Eyeballs is, for hostnames which resolve to N addresses, have a "short connection timeout" for the first N-1 addresses and use the standard connection timeout for the final address.
> 
> The downside of the latter idea is "complexity kills" for what is ultimately an end-user misconfiguration. The potential upside is that it'd help immensely with cases where the N independent addresses represent N independent servers -- an unresponsive server (which users aren't at fault for) would be quickly ignored instead of having the client wait for the full timeout window.

@bbockelm: we could implement this as an opt-in feature, enabled by the user; something along the lines of the sketch below.
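A rough sketch only, not actual XrdCl code; `tryConnect` (as in the earlier sketch) and the example timeout values are hypothetical:
```
#include <netdb.h>
#include <vector>

bool tryConnect(const addrinfo *ai, int timeoutMs);  // e.g. the helper sketched above

// Short timeout for all but the last address, full window only for the last one.
bool connectFailFast(const std::vector<const addrinfo *> &addrs,
                     int shortTimeoutMs,        // e.g. 5000
                     int connectionWindowMs) {  // e.g. 120000
  for (size_t i = 0; i < addrs.size(); ++i) {
    const bool last = (i + 1 == addrs.size());
    // A dead IPv6 address (or one unresponsive server out of N) now costs
    // a few seconds instead of the full 2-minute window.
    if (tryConnect(addrs[i], last ? connectionWindowMs : shortTimeoutMs))
      return true;
  }
  return false;
}
```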
