Hi Gregory,

  Rereading this message, I realized that this scenario is very similar 
to the tricky one that Andy and I debugged and fixed together last week. 
So it would be useful for us to know:
- which client/server version you are using (or the date of the CVS head 
you checked out, if you usually build from the head)
- what the client side is doing (xrdcp or some other program?), and in 
particular which flags/options you specified in the Open request.

  In any case, I see no evidence in the logs you provided that the 
client is creating many connections. I suggest you increase the debug 
level on both sides, just to make things (or bugs) clearer.
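
  To check the connection count objectively, the logins and disconnects 
in a server log can be paired up. The following is only a minimal sketch 
in Python: the two patterns are inferred from the log lines quoted below, 
not taken from the xrootd sources, and the function name is made up for 
illustration. It matches only the start of each message, so it does not 
matter that the mailer wrapped the "disconnected after" lines.

```python
import re

# Patterns inferred from the quoted log extracts (an assumption, not an
# official description of the xrootd log format).
LOGIN = re.compile(r"XrootdXeq: User logged in as (\S+)")
DISCONNECT = re.compile(r"XrdLink: (\S+) disconnected after")


def unmatched_logins(lines):
    """Return the link names that logged in but never disconnected."""
    open_links = []
    for line in lines:
        m = LOGIN.search(line)
        if m:
            open_links.append(m.group(1))
            continue
        m = DISCONNECT.search(line)
        if m and m.group(1) in open_links:
            open_links.remove(m.group(1))
    return open_links


# A few lines copied from the quoted message, as a usage example.
sample = [
    "041216 12:46:49 017 XrootdXeq: User logged in as gregor.31733:17@lnx7108",
    "041216 12:47:28 020 XrdLink: gregor.31733:17@lnx7108 disconnected after",
    "041216 12:49:14 020 XrootdXeq: User logged in as gregor.31754:17@lnx7108",
    "041216 12:51:22 016 XrootdXeq: User logged in as gregor.31754:18@lnx7108",
]

for link in unmatched_logins(sample):
    print("still open:", link)
```

  Run against the full log file (for example with 
unmatched_logins(open(path)), where path is wherever your log lives), 
this lists the links that were opened but never reported as closed.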

  Fabrizio

Gregory J. Sharp wrote:
> I have stared at the code for nearly a day, and I can't figure this one 
> out. (Maybe 4 hours sleep last night just wasn't enough?) The message is 
> long, but hopefully the log extracts hold the clues to solve it.
> 
> My setup is that sol199 is the director and lnx6211 is the dataserver.
> 
> My xrootd data director on sol199 produces the following messages for 
> every connection. It looks to my naive eye that the connections are not 
> being closed cleanly, but perhaps "link read error" is just a poor 
> choice of error message. It occurs in two places in the code, so it 
> isn't clear which piece of code produces the error.  Anyway, things 
> pretty much work okay while this is going on...
> 
> 041216 12:46:49 017 XrootdXeq: User logged in as gregor.31733:17@lnx7108
> 041216 12:47:28 020 XrdLink: gregor.31733:17@lnx7108 disconnected after 
> 0:00:39
> (link read error)
> 041216 12:48:53 019 XrootdXeq: User logged in as gregor.31739:18@lnx7108
> 041216 12:48:54 019 XrdLink: gregor.31739:18@lnx7108 disconnected after 
> 0:00:01
> (link read error)
> 
> Then suddenly I get this in the xrootd data server log... lots of 
> connections being made but never terminated.
> 
> 041216 12:49:14 020 XrootdXeq: User logged in as gregor.31754:17@lnx7108
> 041216 12:51:22 016 XrootdXeq: User logged in as gregor.31754:18@lnx7108
> 041216 12:53:22 018 XrootdXeq: User logged in as gregor.31754:19@lnx7108
> 041216 12:53:51 017 XrootdXeq: User logged in as gregor.31764:20@lnx7108
> 041216 12:55:51 019 XrootdXeq: User logged in as gregor.31764:21@lnx7108
> 041216 12:55:54 021 XrootdXeq: User logged in as gregor.31769:22@lnx7108
> 041216 12:55:56 022 XrootdXeq: User logged in as gregor.31773:23@lnx7108
> 
> Meanwhile, the client doing the connecting keeps printing
> 
> 041216 12:51:22 001 Xrd: ReadPartialAnswer Error reading msg from connmgr (server [sol199.lns.cornell.edu:1094]).
> 041216 12:53:22 001 Xrd: ReadPartialAnswer Error reading msg from connmgr (server [sol199.lns.cornell.edu:1094]).
> 
> until I kill it.
> 
> When I do an ls /proc/21843/fd on lnx6211 I see the following:
> total 0
> lr-x------    1 gregor   cleo           64 Dec 16 13:00 0 -> /dev/null
> l-wx------    1 gregor   cleo           64 Dec 16 13:00 1 -> /A/lns101/nfs/homes/cleo/gregor/xrootd.inst/logs/xrd.nohup.lnx6211
> lrwx------    1 gregor   cleo           64 Dec 16 13:00 10 -> socket:[10855876]
> lrwx------    1 gregor   cleo           64 Dec 16 13:00 11 -> socket:[10855893]
> l-wx------    1 gregor   cleo           64 Dec 16 13:00 2 -> /A/lns101/nfs/homes/cleo/gregor/xrootd.inst/logs/xrootd-lnx6211
> l-wx------    1 gregor   cleo           64 Dec 16 13:00 3 -> /A/lns101/nfs/homes/cleo/gregor/xrootd.inst/logs/xrd.nohup.lnx6211
> lr-x------    1 gregor   cleo           64 Dec 16 13:00 4 -> pipe:[10855851]
> l-wx------    1 gregor   cleo           64 Dec 16 13:00 5 -> pipe:[10855851]
> lr-x------    1 gregor   cleo           64 Dec 16 13:00 6 -> pipe:[10855852]
> l-wx------    1 gregor   cleo           64 Dec 16 13:00 7 -> pipe:[10855852]
> lr-x------    1 gregor   cleo           64 Dec 16 13:00 8 -> pipe:[10855853]
> l-wx------    1 gregor   cleo           64 Dec 16 13:00 9 -> pipe:[10855853]
> 
> But ALL the socket and pipe lines are flashing red to indicate broken 
> symlinks. It would seem to have lost contact with the director, since it 
> received no new messages after the point where the client started to 
> complain.
> 
> On sol199 (solaris 8) there are 24 open files, but none of them 
> particularly enlightening to me.
> 
> total 32
> c---------   1 root     sys       13,  2 Dec 16 11:07 0
> --w-------   1 gregor   cleo           0 Dec 16 11:32 1
> p---------   0 gregor   cleo           0 Dec 16 12:29 10
> c---------   0 root     root     138,  2 Dec 16 12:30 11
> p---------   0 gregor   cleo           0 Dec 16 12:29 12
> p---------   0 gregor   cleo           0 Dec 16 12:29 13
> c---------   0 root     root      41,997 Dec 16 11:32 14
> s---------   0 root     root           0 Dec 16 11:32 15
> s---------   0 root     root           0 Dec 16 12:49 16
> s---------   0 root     root           0 Dec 16 12:49 17
> s---------   0 root     root           0 Dec 16 12:51 18
> s---------   0 root     root           0 Dec 16 12:53 19
> --w-------   1 gregor   cleo       16282 Dec 16 12:55 2
> s---------   0 root     root           0 Dec 16 12:53 20
> s---------   0 root     root           0 Dec 16 12:55 21
> s---------   0 root     root           0 Dec 16 12:55 22
> s---------   0 root     root           0 Dec 16 12:55 23
> s---------   0 root     root           0 Dec 16 12:55 24
> D---------   1 root     root           0 Jul 17  2002 3
> --w-------   1 gregor   cleo           0 Dec 16 11:32 4
> c---------   1 root     sys      138,  0 Dec 16 12:49 5
> p---------   0 gregor   cleo           0 Dec 16 12:49 6
> p---------   0 gregor   cleo           0 Dec 16 12:49 7
> c---------   0 root     root     138,  1 Dec 16 12:30 8
> p---------   0 gregor   cleo           0 Dec 16 12:29 9
> 
> -- 
> Gregory J. Sharp                   email: [log in to unmask]
> Wilson Synchrotron Laboratory      url: http://www.lepp.cornell.edu/~gregor
> Dryden Rd                          ph:  +1 607 255 4882
> Ithaca, NY 14853                   fax: +1 607 255 8062