Hi Dan,
OK, you have uncovered a bug. Indeed, the stalls are coming from
supervisors that don't have data servers. It's an xrdadmin command problem,
as it's trying to do a deep locate irrespective of file presence. Normal
stats, opens, etc. won't show this behaviour. It's isolated to the locate
command.
As you guessed, the distribution of servers to supervisors is random (it's
designed that way) and you happened to get into a situation where one or
more didn't get any subscribers. This becomes more likely as you increase
the number of supervisors beyond what is necessary. The immediate solution
is to reduce the number of supervisors. The real solution is to allow
supervisors the flexibility to not have any subscribers. So, could you
post a bug report about this problem so we don't forget to fix it? Thanks
for bringing this to our attention.
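
To put rough numbers on "beyond what is necessary", here is a minimal
sketch using the counts from your message. It assumes the five stalling
supervisors hold no data servers, and it assumes each supervisor can take
up to 64 subscribers (inferred from your "more than 64 data servers"
remark, not checked against the configuration). The names
MAX_SUBSCRIBERS and min_supervisors are made up for illustration.

```python
import math

# Counts reported by the supervisors that answered locateall; the five
# that stalled are assumed (not confirmed) to have zero data servers.
observed = [61, 26, 16, 8] + [0] * 5

# Assumption: a supervisor accepts at most 64 subscribers.
MAX_SUBSCRIBERS = 64

def min_supervisors(n_servers, capacity=MAX_SUBSCRIBERS):
    """Smallest number of supervisors that can hold n_servers."""
    return math.ceil(n_servers / capacity)

total = sum(observed)                        # data servers seen in total
empty = sum(1 for c in observed if c == 0)   # supervisors with no subscribers
needed = min_supervisors(total)

print(f"{total} servers, {len(observed)} supervisors, "
      f"{empty} empty, minimum needed: {needed}")
```

This is only a lower bound under those assumptions; you may well want
some headroom above it.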
Andy
On Fri, 30 Mar 2012, Dan Bradley wrote:
> Hello,
>
> At our site (Wisconsin CMS T2), we have more than 64 data servers, so we
> configured some supervisors. We added more than currently needed, under the
> assumption that this couldn't hurt. However, after a recent upgrade (and
> restart) of xrootd, I find that the xrd locateall command now hangs forever.
> It gets a response from the central redirector, but then it connects to one
> of the supervisors and hangs. The supervisor logs indicate that it is
> stalling the client over and over:
>
> XrootdProtocol: stalling client for 10 sec
> XrootdProtocol: stalling client for 10 sec
> XrootdProtocol: stalling client for 10 sec
>
> By querying the supervisors directly with locateall, I find that some of them
> respond and some of them keep stalling. I'm guessing that the ones that are
> stalling don't have any data servers associated with them. The supervisors
> that do respond to locateall show me that there is a very uneven distribution
> of data servers associated with each supervisor: 61, 26, 16, 8. The other
> five supervisors stall and therefore presumably have none.
>
> I do not know why this behavior did not show up before the upgrade. Perhaps
> it is a feature of the new version? Or perhaps it is an effect of the
> startup order?
>
> We are currently running xrootd-server-3.1.1-1.osg.el5.xu.
>
> Any ideas/suggestions? I can easily reduce the number of supervisors, but
> I'd like to know if the behavior I am seeing is expected.
>
> Thanks,
> --Dan
>
> ########################################################################
> Use REPLY-ALL to reply to list
>
> To unsubscribe from the XROOTD-L list, click the following link:
> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1
>