Hi Adrian,

The errors you get reflect the standard behaviour of an XRootD cluster.
To prevent you from leaving file turds around, once a write has started
at a server the client is bound to that server. You can't simply ask for
another one, because you would proliferate multiple files that differ in
content but share the same filename; a nightmare scenario. This doesn't
apply to clusters that offer write recovery, like EOS, where the client
is able to recover the write at another server.

Ah, you say, but I enabled POSC (persist on successful close). Indeed,
if you did that you would technically be able to recover once that file
gets deleted, but that happens at an indeterminate time and the
redirector won't let you recover elsewhere until it does (which for a
dead server might not happen for a long time). Even if the server were
not dead, you would have to tune POSC to delete the file within the
client's retry window, which is essentially 0 seconds. From a practical
standpoint that just creates a race condition and makes recovery
non-deterministic.
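
The binding behaviour described above can be illustrated with a toy
model. This is only a sketch and not XRootD code: `ToyRedirector`, its
methods, the server names and the error string are all hypothetical,
chosen to show why a retry cannot land on another server until the
partial file is gone.

```python
class ToyRedirector:
    """Toy model (not XRootD code) of binding a write to one server."""

    def __init__(self, servers):
        # Hypothetical server names mapped to an "is alive" flag.
        self.alive = {name: True for name in servers}
        # path -> server where a write for that path was started
        self.bound = {}

    def open_for_write(self, path):
        owner = self.bound.get(path)
        if owner is not None:
            # A write already started somewhere, so only that server is
            # eligible; otherwise two different files would share a name.
            if self.alive[owner]:
                return owner
            # Stand-in for an error like "[3011] ... eligible servers shunned".
            raise RuntimeError("eligible servers shunned")
        for name, up in self.alive.items():
            if up:
                self.bound[path] = name
                return name
        raise RuntimeError("no servers available")

    def partial_file_removed(self, path):
        # Models the incomplete file finally being deleted (for example
        # by POSC cleanup), which happens at an indeterminate time.
        self.bound.pop(path, None)


redirector = ToyRedirector(["server-a", "server-b"])
first = redirector.open_for_write("/data/file")  # write starts on server-a
redirector.alive["server-a"] = False             # server-a dies mid-write

try:
    redirector.open_for_write("/data/file")      # immediate client retry
    retry_outcome = "redirected"
except RuntimeError as err:
    retry_outcome = str(err)

redirector.partial_file_removed("/data/file")    # cleanup, some time later
second = redirector.open_for_write("/data/file") # only now can another server be chosen
```

In this model the immediate retry fails no matter how many times it is
repeated; it only succeeds after the cleanup step, which is the race
against the retry window mentioned above.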

Andy

On Thu, 3 Jun 2021, Adrian Sevcenco wrote:

> @simonmichal
>> @adriansev : I'm not sure I got your example ;-)
>>
>> Let's say we have a `CopyProcess` with 2 `CopyJob`s (be the source a Metalink or an ordinary file, doesn't matter), I would assume that `retry=1` means that in case of failure each of the `CopyJob`s should be retried once.
>
> yeah, exactly!
> but it would be fantastic if for metafiles also the internal list would also be retried with the same number of retries
>
>> Also, I suppose it would be wise to come up with a criteria when it makes sense to retry a failed `CopyJob`, e.g. I suppose it doesn't make sense to retry file-does-not-exist or authentication-failed error.
>
> yeah exactly .. i think that the only error conditions that make sense to retry are the timeouts .. maybe also for disk full when writing? because at retry another data server could be selected.. also i usually have in the external write tests an error like `[3011] Unable to access file; eligible servers shunned. (destination)` maybe the retry could help?
> also dropped connections i think that also go to timeout isn't it?
>
>> Let me know your thoughts!
>
> What do you think? Thanks a lot!!
>
>
>
> --
> You are receiving this because you are subscribed to this thread.
> Reply to this email directly or view it on GitHub:
> https://github.com/xrootd/xrootd/issues/1139#issuecomment-853951266


