Hi Gregory,

Ah, no, you ran into the olbd going into safe-mode. The problem is that you
killed too many servers. When that happens, the olbd becomes paranoid and
goes into safe-mode (delaying clients) until enough servers come back. By
default, if more than 20% of the servers that the olbd has known about
disappear, it goes into safe-mode. This works well if you have enough
servers but is rather poor if you have only a few (suggestions welcome).
You can change the default using olb.delay servers <n | n%>, where you
specify either the minimum number of servers you need to continue or a
percentage of the maximum number of servers you had. In your case, you
should specify 1.
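
To make the rule above concrete, here is a minimal sketch of the safe-mode
check in Python. It is not the olbd source. The function names, the
rounding up of percentages to whole servers, and the reading of the default
as "80% of the maximum must remain" are all assumptions for illustration.

```python
def min_servers_needed(setting: str, max_seen: int) -> int:
    """Turn an 'n' or 'n%' setting into a required number of servers.

    Hypothetical helper: 'n%' is taken as a percentage of the largest
    number of servers seen so far, rounded up (an assumption).
    """
    setting = setting.strip()
    if setting.endswith("%"):
        pct = int(setting[:-1])
        # Integer ceiling of max_seen * pct / 100.
        return (max_seen * pct + 99) // 100
    return int(setting)


def in_safe_mode(setting: str, max_seen: int, alive: int) -> bool:
    """True when fewer servers are alive than the setting requires."""
    return alive < min_servers_needed(setting, max_seen)


# Three servers known, two killed, one left alive:
print(in_safe_mode("80%", 3, 1))  # True: the assumed default wants all 3
print(in_safe_mode("1", 3, 1))    # False: one server is enough
```

With only three servers, losing even one is already more than 20%, which is
why a small cluster trips the default so easily and why an absolute count
of 1 fits your setup better.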

Andy

----- Original Message ----- 
From: "Gregory J. Sharp" <[log in to unmask]>
To: "Xrootd Mailing List" <[log in to unmask]>
Sent: Friday, December 10, 2004 12:13 PM
Subject: tolerating server failures


> I have a problem with xrootd which I can't really explain, although I
> have some suspicions.
> My systems run RHEL 3. As a result, I am using round-robin scheduling
> rather than load balancing.
> I have one director (D) and three data servers (S1, S2 & S3).  I have
> set olbd.delay drop 1m.
> I have three files cached, one on each of the data servers. The HSM
> system is up and running, and
> able to deliver all three files to any server that requests it.
>
> Now if I run a program that reads all three files, then D caches the
> location status. If I now kill S1 and S2 and try to run my program
> again, it hangs for 1 minute waiting for D to drop S1 and S2. After
> that, D sits in an (apparently) endless loop of
>
>      041210 15:05:38 010 do_Select Select delay XXXXXX.lns.cornell.edu 15
>
> (where XXXXXX is the deleted hostname) instead of caching the two files
> to S3 that are not already cached there.
>
> I suspect that this is because it cached the negative responses from S3
> regarding the two files not cached on it, which were made when S1 and
> S2 were still running. I have to hope that when S1 and S2 were dropped
> that all the files that D believed they cached were removed from the
> olb cache. But when that happens, it might also be a good idea to drop
> all negative cache information.
>
> Is there a parameter to shorten the lifetime of the negative cache
> responses? olb.fxhold seems like a candidate, but that deletes positive
> caching information as well, which I don't want to do.
>
> Am I barking up the wrong tree here and I need to tweak something else
> to stop this aberrant behavior?
> Any help would be greatly appreciated.
>
> --
> Gregory J. Sharp                   email: [log in to unmask]
> Wilson Synchrotron Laboratory      url:
> http://www.lepp.cornell.edu/~gregor
> Dryden Rd                          ph:  +1 607 255 4882
> Ithaca, NY 14853                   fax: +1 607 255 8062
>
>