Hi Gregory,

After some discussion with other people, it would probably be better if the
olbd had a somewhat smarter safe mode. That is, if the file can be
found, then serve it. If it cannot be found, then delay the client. We can
add a "full" safe mode that is command driven. As for the defaults for
safe mode, I'm more than happy to have a suggestion for an algorithm that
works well at both ends of the spectrum (i.e., few servers and many
servers).
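
To make the idea concrete, here is a minimal sketch in Python of the
per-request decision. It is not the olbd implementation: the names
select_safe_mode, Selection, locations and live_servers are hypothetical,
and the 15-second default is only borrowed from the "Select delay ... 15"
message in the report below.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class Selection:
    """Outcome of one lookup (hypothetical type, for illustration only)."""
    action: str                   # "redirect" or "delay"
    server: Optional[str] = None  # set only when action == "redirect"
    delay_seconds: int = 0        # set only when action == "delay"


def select_safe_mode(path: str,
                     locations: Dict[str, Set[str]],
                     live_servers: Set[str],
                     delay_seconds: int = 15) -> Selection:
    """Serve the file if a live server is known to hold it, else delay.

    locations maps a file path to the servers believed to hold it;
    live_servers is the set of servers currently connected. Both are
    assumed inputs, not real olbd data structures.
    """
    # Only servers that are both known holders and currently alive count.
    candidates = sorted(locations.get(path, set()) & live_servers)
    if candidates:
        # File can be found: serve it from a server that has it.
        return Selection("redirect", server=candidates[0])
    # File cannot be found: ask the client to wait and retry.
    return Selection("delay", delay_seconds=delay_seconds)
```

With locations = {"/a": {"S1", "S3"}, "/b": {"S1"}} and only S3 alive,
"/a" is redirected to S3 while "/b" gets a delay. A "full" safe mode
would presumably replace the delay branch with whatever the command
selects.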

Andy

----- Original Message ----- 
From: "Gregory J. Sharp" <[log in to unmask]>
To: "Xrootd Mailing List" <[log in to unmask]>
Sent: Friday, December 10, 2004 12:13 PM
Subject: tolerating server failures


> I have a problem with xrootd which I can't really explain, although I
> have some suspicions.
> My systems run RHEL 3. As a result, I am using round-robin scheduling
> rather than load balancing.
> I have one director (D) and three data servers (S1, S2 & S3).  I have
> set olbd.delay drop 1m.
> I have three files cached, one on each of the data servers. The HSM
> system is up and running, and
> able to deliver all three files to any server that requests them.
>
> Now if I run a program that reads all three files, then D caches the
> location status. If I now kill S1 and S2 and try to run my program
> again, it hangs for 1 minute waiting for D to drop S1 and S2. After
> that, D sits in an (apparently) endless loop of
>
>      041210 15:05:38 010 do_Select Select delay XXXXXX.lns.cornell.edu 15
>
> (where XXXXXX is the deleted hostname) instead of caching the two files
> to S3 that are not already cached there.
>
> I suspect that this is because it cached the negative responses from S3
> regarding the two files not cached on it, which were made when S1 and
> S2 were still running. I have to hope that when S1 and S2 were dropped
> that all the files that D believed they cached were removed from the
> olb cache. But when that happens, it might also be a good idea to drop
> all negative cache information.
>
> Is there a parameter to shorten the lifetime of the negative cache
> responses? olb.fxhold seems like a candidate, but that deletes positive
> caching information as well, which I don't want to do.
>
> Am I barking up the wrong tree here, and do I need to tweak something
> else to stop this aberrant behavior?
> Any help would be greatly appreciated.
>
> --
> Gregory J. Sharp                   email: [log in to unmask]
> Wilson Synchrotron Laboratory      url:
> http://www.lepp.cornell.edu/~gregor
> Dryden Rd                          ph:  +1 607 255 4882
> Ithaca, NY 14853                   fax: +1 607 255 8062
>
>