Hi Gregory,
Rereading this message, I realized that this scenario is very similar
to the tricky one that Andy and I debugged and fixed together last week.
So, it might be useful for us to know:
- which client/server version you are using (or the date of the CVS
head you checked out, if you usually build from the CVS head)
- what the client side is doing (xrdcp or some other program?), and in
particular which flags/options you specified in the Open request.
In any case, from the logs you provided, I see no evidence that the
client is creating many connections. I suggest increasing the debug
level on both sides, just to make things (or bugs) clearer.
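One rough way to check the connection count from a server log is to pair
each "User logged in" line with its "disconnected" line and see which
sessions are left over. The script below is only a sketch: it assumes each
log entry sits on a single line in the format shown in your excerpts, and
the function name open_sessions is made up for illustration.

```python
import re

# Patterns based on the log excerpts quoted in this thread (assumed format).
LOGIN = re.compile(r"XrootdXeq: User logged in as (\S+)")
LOGOUT = re.compile(r"XrdLink: (\S+) disconnected")


def open_sessions(lines):
    """Return session ids that logged in but have no matching disconnect."""
    pending = []
    for line in lines:
        match = LOGIN.search(line)
        if match:
            pending.append(match.group(1))
            continue
        match = LOGOUT.search(line)
        if match and match.group(1) in pending:
            pending.remove(match.group(1))
    return pending


if __name__ == "__main__":
    import sys

    # Usage: python open_sessions.py <path-to-server-log>
    with open(sys.argv[1]) as log:
        for session in open_sessions(log):
            print(session)
```

If the list it prints keeps growing on the director while staying empty on
the data server, that would point at the director side.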
Fabrizio
Gregory J. Sharp wrote:
> I have stared at the code for nearly a day, and I can't figure this one
> out. (Maybe 4 hours sleep last night just wasn't enough?) The message is
> long, but hopefully the log extracts hold the clues to solve it.
>
> My setup is that sol199 is the director and lnx6211 is the dataserver.
>
> My xrootd data director on sol199 produces the following messages for
> every connection. It looks to my naive eye that the connections are not
> being closed cleanly, but perhaps "link read error" is just a poor
> choice of error message. It occurs in two places in the code, so it
> isn't clear which piece of code produces the error. Anyway, things
> pretty much work okay while this is going on...
>
> 041216 12:46:49 017 XrootdXeq: User logged in as gregor.31733:17@lnx7108
> 041216 12:47:28 020 XrdLink: gregor.31733:17@lnx7108 disconnected after
> 0:00:39
> (link read error)
> 041216 12:48:53 019 XrootdXeq: User logged in as gregor.31739:18@lnx7108
> 041216 12:48:54 019 XrdLink: gregor.31739:18@lnx7108 disconnected after
> 0:00:01
> (link read error)
>
> Then suddenly I get this in the xrootd data server log... lots of
> connections being made but never terminated.
>
> 041216 12:49:14 020 XrootdXeq: User logged in as gregor.31754:17@lnx7108
> 041216 12:51:22 016 XrootdXeq: User logged in as gregor.31754:18@lnx7108
> 041216 12:53:22 018 XrootdXeq: User logged in as gregor.31754:19@lnx7108
> 041216 12:53:51 017 XrootdXeq: User logged in as gregor.31764:20@lnx7108
> 041216 12:55:51 019 XrootdXeq: User logged in as gregor.31764:21@lnx7108
> 041216 12:55:54 021 XrootdXeq: User logged in as gregor.31769:22@lnx7108
> 041216 12:55:56 022 XrootdXeq: User logged in as gregor.31773:23@lnx7108
>
> Meanwhile, the client doing the connecting keeps printing
>
> 041216 12:51:22 001 Xrd: ReadPartialAnswer Error reading msg from
> connmgr (server [sol199.lns.cornell.edu:1094]).
> 041216 12:53:22 001 Xrd: ReadPartialAnswer Error reading msg from
> connmgr (server [sol199.lns.cornell.edu:1094]).
>
> until I kill it.
>
> When I do an ls /proc/21843/fd on lnx6211 I see the following:
> total 0
> lr-x------ 1 gregor cleo 64 Dec 16 13:00 0 -> /dev/null
> l-wx------ 1 gregor cleo 64 Dec 16 13:00 1 ->
> /A/lns101/nfs/homes/cleo/gregor/xrootd.inst/logs/xrd.nohup.lnx6211
> lrwx------ 1 gregor cleo 64 Dec 16 13:00 10 ->
> socket:[10855876]
> lrwx------ 1 gregor cleo 64 Dec 16 13:00 11 ->
> socket:[10855893]
> l-wx------ 1 gregor cleo 64 Dec 16 13:00 2 ->
> /A/lns101/nfs/homes/cleo/gregor/xrootd.inst/logs/xrootd-lnx6211
> l-wx------ 1 gregor cleo 64 Dec 16 13:00 3 ->
> /A/lns101/nfs/homes/cleo/gregor/xrootd.inst/logs/xrd.nohup.lnx6211
> lr-x------ 1 gregor cleo 64 Dec 16 13:00 4 ->
> pipe:[10855851]
> l-wx------ 1 gregor cleo 64 Dec 16 13:00 5 ->
> pipe:[10855851]
> lr-x------ 1 gregor cleo 64 Dec 16 13:00 6 ->
> pipe:[10855852]
> l-wx------ 1 gregor cleo 64 Dec 16 13:00 7 ->
> pipe:[10855852]
> lr-x------ 1 gregor cleo 64 Dec 16 13:00 8 ->
> pipe:[10855853]
> l-wx------ 1 gregor cleo 64 Dec 16 13:00 9 ->
> pipe:[10855853]
>
> But ALL the socket and pipe lines are flashing red to indicate broken
> symlinks. It would seem to have lost contact with the director, since it
> received no new messages after the point where the client started to
> complain.
>
> On sol199 (solaris 8) there are 24 open files, but none of them
> particularly enlightening to me.
>
> total 32
> c--------- 1 root sys 13, 2 Dec 16 11:07 0
> --w------- 1 gregor cleo 0 Dec 16 11:32 1
> p--------- 0 gregor cleo 0 Dec 16 12:29 10
> c--------- 0 root root 138, 2 Dec 16 12:30 11
> p--------- 0 gregor cleo 0 Dec 16 12:29 12
> p--------- 0 gregor cleo 0 Dec 16 12:29 13
> c--------- 0 root root 41,997 Dec 16 11:32 14
> s--------- 0 root root 0 Dec 16 11:32 15
> s--------- 0 root root 0 Dec 16 12:49 16
> s--------- 0 root root 0 Dec 16 12:49 17
> s--------- 0 root root 0 Dec 16 12:51 18
> s--------- 0 root root 0 Dec 16 12:53 19
> --w------- 1 gregor cleo 16282 Dec 16 12:55 2
> s--------- 0 root root 0 Dec 16 12:53 20
> s--------- 0 root root 0 Dec 16 12:55 21
> s--------- 0 root root 0 Dec 16 12:55 22
> s--------- 0 root root 0 Dec 16 12:55 23
> s--------- 0 root root 0 Dec 16 12:55 24
> D--------- 1 root root 0 Jul 17 2002 3
> --w------- 1 gregor cleo 0 Dec 16 11:32 4
> c--------- 1 root sys 138, 0 Dec 16 12:49 5
> p--------- 0 gregor cleo 0 Dec 16 12:49 6
> p--------- 0 gregor cleo 0 Dec 16 12:49 7
> c--------- 0 root root 138, 1 Dec 16 12:30 8
> p--------- 0 gregor cleo 0 Dec 16 12:29 9
>
> --
> Gregory J. Sharp email: [log in to unmask]
> Wilson Synchrotron Laboratory url: http://www.lepp.cornell.edu/~gregor
> Dryden Rd ph: +1 607 255 4882
> Ithaca, NY 14853 fax: +1 607 255 8062