I have stared at the code for nearly a day, and I can't figure this one
out. (Maybe 4 hours' sleep last night just wasn't enough?) The message
is long, but hopefully the log extracts hold the clues to solve it.
My setup is that sol199 is the director and lnx6211 is the data server.
My xrootd data director on sol199 produces the following messages for
every connection. To my naive eye it looks as though the connections are
not being closed cleanly, but perhaps "link read error" is just a poor
choice of error message. The message occurs in two places in the code, so
it isn't clear which piece of code produces it. Anyway, things
pretty much work okay while this is going on...
041216 12:46:49 017 XrootdXeq: User logged in as gregor.31733:17@lnx7108
041216 12:47:28 020 XrdLink: gregor.31733:17@lnx7108 disconnected after 0:00:39 (link read error)
041216 12:48:53 019 XrootdXeq: User logged in as gregor.31739:18@lnx7108
041216 12:48:54 019 XrdLink: gregor.31739:18@lnx7108 disconnected after 0:00:01 (link read error)
Then suddenly I get this in the xrootd data server log... lots of
connections being made but never terminated.
041216 12:49:14 020 XrootdXeq: User logged in as gregor.31754:17@lnx7108
041216 12:51:22 016 XrootdXeq: User logged in as gregor.31754:18@lnx7108
041216 12:53:22 018 XrootdXeq: User logged in as gregor.31754:19@lnx7108
041216 12:53:51 017 XrootdXeq: User logged in as gregor.31764:20@lnx7108
041216 12:55:51 019 XrootdXeq: User logged in as gregor.31764:21@lnx7108
041216 12:55:54 021 XrootdXeq: User logged in as gregor.31769:22@lnx7108
041216 12:55:56 022 XrootdXeq: User logged in as gregor.31773:23@lnx7108
Meanwhile, the client doing the connecting keeps printing
041216 12:51:22 001 Xrd: ReadPartialAnswer Error reading msg from connmgr (server [sol199.lns.cornell.edu:1094]).
041216 12:53:22 001 Xrd: ReadPartialAnswer Error reading msg from connmgr (server [sol199.lns.cornell.edu:1094]).
until I kill it.
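One way to confirm which connections were never terminated is to pair each login line with its disconnect line. The following is only a minimal sketch, assuming the log format is exactly as in the extracts above; the function name `unmatched_logins` and the sample lines are illustrative and not part of xrootd.

```python
import re

# Matches lines such as:
#   041216 12:46:49 017 XrootdXeq: User logged in as gregor.31733:17@lnx7108
LOGIN = re.compile(r"XrootdXeq: User logged in as (\S+)")
# Matches lines such as:
#   041216 12:47:28 020 XrdLink: gregor.31733:17@lnx7108 disconnected after 0:00:39
DISCONNECT = re.compile(r"XrdLink: (\S+) disconnected after")


def unmatched_logins(lines):
    """Return {link name: login time} for logins with no later disconnect."""
    open_links = {}
    for line in lines:
        m = LOGIN.search(line)
        if m:
            # The second whitespace-separated field is the time of day.
            open_links[m.group(1)] = line.split()[1]
            continue
        m = DISCONNECT.search(line)
        if m:
            open_links.pop(m.group(1), None)
    return open_links


# Sample lines copied from the extracts in this message.
sample = """\
041216 12:46:49 017 XrootdXeq: User logged in as gregor.31733:17@lnx7108
041216 12:47:28 020 XrdLink: gregor.31733:17@lnx7108 disconnected after 0:00:39
041216 12:49:14 020 XrootdXeq: User logged in as gregor.31754:17@lnx7108
041216 12:51:22 016 XrootdXeq: User logged in as gregor.31754:18@lnx7108
""".splitlines()

for link, when in unmatched_logins(sample).items():
    print(when, link)
```

Run against the full log file, this would list every link that logged in and is still considered open, which can then be compared with the socket timestamps on the server.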
When I do an ls /proc/21843/fd on lnx6211 I see the following:
total 0
lr-x------ 1 gregor cleo 64 Dec 16 13:00 0 -> /dev/null
l-wx------ 1 gregor cleo 64 Dec 16 13:00 1 -> /A/lns101/nfs/homes/cleo/gregor/xrootd.inst/logs/xrd.nohup.lnx6211
lrwx------ 1 gregor cleo 64 Dec 16 13:00 10 -> socket:[10855876]
lrwx------ 1 gregor cleo 64 Dec 16 13:00 11 -> socket:[10855893]
l-wx------ 1 gregor cleo 64 Dec 16 13:00 2 -> /A/lns101/nfs/homes/cleo/gregor/xrootd.inst/logs/xrootd-lnx6211
l-wx------ 1 gregor cleo 64 Dec 16 13:00 3 -> /A/lns101/nfs/homes/cleo/gregor/xrootd.inst/logs/xrd.nohup.lnx6211
lr-x------ 1 gregor cleo 64 Dec 16 13:00 4 -> pipe:[10855851]
l-wx------ 1 gregor cleo 64 Dec 16 13:00 5 -> pipe:[10855851]
lr-x------ 1 gregor cleo 64 Dec 16 13:00 6 -> pipe:[10855852]
l-wx------ 1 gregor cleo 64 Dec 16 13:00 7 -> pipe:[10855852]
lr-x------ 1 gregor cleo 64 Dec 16 13:00 8 -> pipe:[10855853]
l-wx------ 1 gregor cleo 64 Dec 16 13:00 9 -> pipe:[10855853]
But ALL the socket and pipe lines are flashing red to indicate broken
symlinks. The data server would seem to have lost contact with the
director, since it received no new messages after the point where the
client started to complain.
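A caveat on the red entries: on Linux, the links in /proc/<pid>/fd for sockets and pipes point at names of the form socket:[inode] or pipe:[inode], which are not real paths, so a colorized ls marks them as broken even when the descriptors are healthy. The color alone therefore may not say anything about the connection state. A minimal sketch for summarizing a process's descriptors by kind instead, assuming a Linux /proc filesystem (the helper names `classify` and `fd_summary` are made up for illustration):

```python
import os
import re
from collections import Counter

# Anonymous objects appear as "socket:[10855876]" or "pipe:[10855851]".
ANON = re.compile(r"^(socket|pipe):\[(\d+)\]$")


def classify(target):
    """Classify a /proc/<pid>/fd link target as socket, pipe, path, or other."""
    m = ANON.match(target)
    if m:
        return m.group(1)
    if target.startswith("/"):
        return "path"
    return "other"


def fd_summary(pid):
    """Count the open descriptors of a process by kind (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    counts = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed between listdir and readlink
        counts[classify(target)] += 1
    return counts
```

For the listing above, `fd_summary(21843)` run as the owning user on lnx6211 would be expected to report the two sockets and six pipe ends separately from the log files, and repeating it over time shows whether the socket count grows.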
On sol199 (Solaris 8) there are 24 open files, but none of them is
particularly enlightening to me.
total 32
c--------- 1 root sys 13, 2 Dec 16 11:07 0
--w------- 1 gregor cleo 0 Dec 16 11:32 1
p--------- 0 gregor cleo 0 Dec 16 12:29 10
c--------- 0 root root 138, 2 Dec 16 12:30 11
p--------- 0 gregor cleo 0 Dec 16 12:29 12
p--------- 0 gregor cleo 0 Dec 16 12:29 13
c--------- 0 root root 41,997 Dec 16 11:32 14
s--------- 0 root root 0 Dec 16 11:32 15
s--------- 0 root root 0 Dec 16 12:49 16
s--------- 0 root root 0 Dec 16 12:49 17
s--------- 0 root root 0 Dec 16 12:51 18
s--------- 0 root root 0 Dec 16 12:53 19
--w------- 1 gregor cleo 16282 Dec 16 12:55 2
s--------- 0 root root 0 Dec 16 12:53 20
s--------- 0 root root 0 Dec 16 12:55 21
s--------- 0 root root 0 Dec 16 12:55 22
s--------- 0 root root 0 Dec 16 12:55 23
s--------- 0 root root 0 Dec 16 12:55 24
D--------- 1 root root 0 Jul 17 2002 3
--w------- 1 gregor cleo 0 Dec 16 11:32 4
c--------- 1 root sys 138, 0 Dec 16 12:49 5
p--------- 0 gregor cleo 0 Dec 16 12:49 6
p--------- 0 gregor cleo 0 Dec 16 12:49 7
c--------- 0 root root 138, 1 Dec 16 12:30 8
p--------- 0 gregor cleo 0 Dec 16 12:29 9
--
Gregory J. Sharp email: [log in to unmask]
Wilson Synchrotron Laboratory url:
http://www.lepp.cornell.edu/~gregor
Dryden Rd ph: +1 607 255 4882
Ithaca, NY 14853 fax: +1 607 255 8062