Hi Gregory,

After some discussion with other people, it would probably be better if
the olbd had a somewhat more clever safe mode. That is, if the file can
be found, then serve it. If it cannot be found, then delay the client.
We can add a "full" safe mode that is command driven. As for the
safe-mode defaults, I'm more than happy to take a suggestion for an
algorithm that works well at both ends of the spectrum (i.e., few
servers and many servers).

Andy

----- Original Message -----
From: "Gregory J. Sharp" <[log in to unmask]>
To: "Xrootd Mailing List" <[log in to unmask]>
Sent: Friday, December 10, 2004 12:13 PM
Subject: tolerating server failures

> I have a problem with xrootd which I can't really explain, although I
> have some suspicions.
> My systems run RHEL 3. As a result, I am using round-robin scheduling
> rather than load balancing.
> I have one director (D) and three data servers (S1, S2 & S3). I have
> set olbd.delay drop 1m.
> I have three files cached, one on each of the data servers. The HSM
> system is up and running, and
> able to deliver all three files to any server that requests it.
>
> Now if I run a program that reads all three files, then D caches the
> location status. If I now kill S1 and S2 and try to run my program
> again, it hangs for 1 minute waiting for D to drop S1 and S2. After
> that, D sits in an (apparently) endless loop of
>
> 041210 15:05:38 010 do_Select Select delay XXXXXX.lns.cornell.edu 15
>
> (where XXXXXX is the deleted hostname) instead of caching the two files
> to S3 that are not already cached there.
>
> I suspect that this is because it cached the negative responses from S3
> regarding the two files not cached on it, which were made when S1 and
> S2 were still running. I have to hope that when S1 and S2 were dropped
> that all the files that D believed they cached were removed from the
> olb cache. But when that happens, it might also be a good idea to drop
> all negative cache information.
>
> Is there a parameter to shorten the lifetime of the negative cache
> responses? olb.fxhold seems like a candidate, but that deletes positive
> caching information as well, which I don't want to do.
>
> Am I barking up the wrong tree here and I need to tweak something else
> to stop this aberrant behavior?
> Any help would be greatly appreciated.
>
> --
> Gregory J. Sharp                 email: [log in to unmask]
> Wilson Synchrotron Laboratory    url:   http://www.lepp.cornell.edu/~gregor
> Dryden Rd                        ph:    +1 607 255 4882
> Ithaca, NY 14853                 fax:   +1 607 255 8062
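
For the sake of discussion, here is a rough sketch of how the two ideas
in this thread might fit together: the "serve if found, otherwise delay"
safe mode, and purging negative cache entries when a server is dropped.
This is not olbd code. LocationCache, drop_server, select, and the
"query" action are all hypothetical names invented for illustration, and
the 15-second delay is only borrowed from the log line quoted above.

```python
from dataclasses import dataclass, field


@dataclass
class LocationCache:
    """Toy director-side cache (hypothetical, not the olbd data structure)."""
    has: dict = field(default_factory=dict)    # path -> servers known to hold it
    lacks: dict = field(default_factory=dict)  # path -> servers known NOT to hold it

    def record(self, path, server, present):
        """Store a positive or negative response from one server."""
        target, other = (self.has, self.lacks) if present else (self.lacks, self.has)
        target.setdefault(path, set()).add(server)
        other.get(path, set()).discard(server)

    def drop_server(self, server):
        """Forget a dead server and, as suggested in the thread, purge
        all negative entries so the surviving servers get asked again."""
        for servers in self.has.values():
            servers.discard(server)
        self.lacks.clear()


def select(cache, path, live_servers, delay_seconds=15):
    """Safe-mode selection: serve if a live holder is known, else delay.

    The intermediate "query" result is an assumption of this sketch: it
    stands for re-asking live servers that have no cached negative answer.
    """
    holders = cache.has.get(path, set()) & live_servers
    if holders:
        return ("serve", sorted(holders)[0])
    unknown = live_servers - cache.lacks.get(path, set())
    if unknown:
        return ("query", sorted(unknown))
    return ("delay", delay_seconds)
```

In the scenario described in the quoted message, a stale negative entry
for S3 would keep select returning a delay; once drop_server has cleared
the negative entries, S3 becomes a candidate to be asked again.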