I have stared at the code for nearly a day, and I can't figure this one 
out. (Maybe 4 hours sleep last night just wasn't enough?) The message 
is long, but hopefully the log extracts hold the clues to solve it.

My setup is that sol199 is the director and lnx6211 is the data server.

My xrootd data director on sol199 produces the following messages for 
every connection. It looks to my naive eye that the connections are not 
being closed cleanly, but perhaps "link read error" is just a poor 
choice of error message. It occurs in two places in the code, so it 
isn't clear which piece of code produces the error.  Anyway, things 
pretty much work okay while this is going on...

041216 12:46:49 017 XrootdXeq: User logged in as gregor.31733:17@lnx7108
041216 12:47:28 020 XrdLink: gregor.31733:17@lnx7108 disconnected after 0:00:39 (link read error)
041216 12:48:53 019 XrootdXeq: User logged in as gregor.31739:18@lnx7108
041216 12:48:54 019 XrdLink: gregor.31739:18@lnx7108 disconnected after 0:00:01 (link read error)

Then suddenly I get this in the xrootd data server log... lots of 
connections being made but never terminated.

041216 12:49:14 020 XrootdXeq: User logged in as gregor.31754:17@lnx7108
041216 12:51:22 016 XrootdXeq: User logged in as gregor.31754:18@lnx7108
041216 12:53:22 018 XrootdXeq: User logged in as gregor.31754:19@lnx7108
041216 12:53:51 017 XrootdXeq: User logged in as gregor.31764:20@lnx7108
041216 12:55:51 019 XrootdXeq: User logged in as gregor.31764:21@lnx7108
041216 12:55:54 021 XrootdXeq: User logged in as gregor.31769:22@lnx7108
041216 12:55:56 022 XrootdXeq: User logged in as gregor.31773:23@lnx7108
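
For anyone who wants to check a full log for the same pattern, here is a minimal Python sketch that lists sessions with a login line but no later disconnect line. It assumes the log format is exactly what appears in the extracts above ("XrootdXeq: User logged in as <id>" and "XrdLink: <id> disconnected"); the function name open_sessions is my own invention and not part of xrootd.

```python
import re

# Patterns taken from the log extracts in this message; the exact
# format in other xrootd versions is an assumption.
LOGIN_RE = re.compile(r"XrootdXeq: User logged in as (\S+)")
DISCONNECT_RE = re.compile(r"XrdLink: (\S+) disconnected")


def open_sessions(log_text):
    """Return session IDs that logged in but have no later disconnect line."""
    still_open = []
    for line in log_text.splitlines():
        match = LOGIN_RE.search(line)
        if match:
            still_open.append(match.group(1))
            continue
        match = DISCONNECT_RE.search(line)
        if match and match.group(1) in still_open:
            still_open.remove(match.group(1))
    return still_open
```

Run against the extract above, every session from gregor.31754:17 onward would be reported, since none of them has a matching "disconnected" line.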

Meanwhile, the client doing the connecting keeps printing

041216 12:51:22 001 Xrd: ReadPartialAnswer Error reading msg from connmgr (server [sol199.lns.cornell.edu:1094]).
041216 12:53:22 001 Xrd: ReadPartialAnswer Error reading msg from connmgr (server [sol199.lns.cornell.edu:1094]).

until I kill it.

When I do an ls /proc/21843/fd on lnx6211 I see the following:
total 0
lr-x------    1 gregor   cleo           64 Dec 16 13:00 0 -> /dev/null
l-wx------    1 gregor   cleo           64 Dec 16 13:00 1 -> /A/lns101/nfs/homes/cleo/gregor/xrootd.inst/logs/xrd.nohup.lnx6211
lrwx------    1 gregor   cleo           64 Dec 16 13:00 10 -> socket:[10855876]
lrwx------    1 gregor   cleo           64 Dec 16 13:00 11 -> socket:[10855893]
l-wx------    1 gregor   cleo           64 Dec 16 13:00 2 -> /A/lns101/nfs/homes/cleo/gregor/xrootd.inst/logs/xrootd-lnx6211
l-wx------    1 gregor   cleo           64 Dec 16 13:00 3 -> /A/lns101/nfs/homes/cleo/gregor/xrootd.inst/logs/xrd.nohup.lnx6211
lr-x------    1 gregor   cleo           64 Dec 16 13:00 4 -> pipe:[10855851]
l-wx------    1 gregor   cleo           64 Dec 16 13:00 5 -> pipe:[10855851]
lr-x------    1 gregor   cleo           64 Dec 16 13:00 6 -> pipe:[10855852]
l-wx------    1 gregor   cleo           64 Dec 16 13:00 7 -> pipe:[10855852]
lr-x------    1 gregor   cleo           64 Dec 16 13:00 8 -> pipe:[10855853]
l-wx------    1 gregor   cleo           64 Dec 16 13:00 9 -> pipe:[10855853]

But ALL the socket and pipe lines are flashing red to indicate broken 
symlinks. It would seem to have lost contact with the director, since 
it received no new messages after the point where the client started to 
complain.
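
One caveat on the listing above: on Linux the socket:[inode] and pipe:[inode] targets under /proc/<pid>/fd are not filesystem paths, so a colourised ls can flag them as dangling even when the descriptors are perfectly healthy. The colouring alone may therefore not prove that the connections are dead. A minimal Python sketch that reads the link targets directly, without trying to resolve them, is below. It is Linux-only, and the function name classify_fds is hypothetical.

```python
import os


def classify_fds(pid):
    """Map each open descriptor of `pid` to (kind, target) via /proc (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    result = {}
    for name in os.listdir(fd_dir):
        try:
            # readlink() returns the link text and does not need the
            # target to exist as a path.
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed between listdir() and readlink()
        if target.startswith("socket:["):
            kind = "socket"
        elif target.startswith("pipe:["):
            kind = "pipe"
        else:
            kind = "file"
        result[int(name)] = (kind, target)
    return result
```

Calling classify_fds(21843) on lnx6211 should report descriptors 10 and 11 as sockets and 4 through 9 as pipes, matching the listing. Whether those sockets are still connected would need a separate check, for example with netstat.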

On sol199 (Solaris 8) there are 24 open files, but none of them is 
particularly enlightening to me.

total 32
c---------   1 root     sys       13,  2 Dec 16 11:07 0
--w-------   1 gregor   cleo           0 Dec 16 11:32 1
p---------   0 gregor   cleo           0 Dec 16 12:29 10
c---------   0 root     root     138,  2 Dec 16 12:30 11
p---------   0 gregor   cleo           0 Dec 16 12:29 12
p---------   0 gregor   cleo           0 Dec 16 12:29 13
c---------   0 root     root      41,997 Dec 16 11:32 14
s---------   0 root     root           0 Dec 16 11:32 15
s---------   0 root     root           0 Dec 16 12:49 16
s---------   0 root     root           0 Dec 16 12:49 17
s---------   0 root     root           0 Dec 16 12:51 18
s---------   0 root     root           0 Dec 16 12:53 19
--w-------   1 gregor   cleo       16282 Dec 16 12:55 2
s---------   0 root     root           0 Dec 16 12:53 20
s---------   0 root     root           0 Dec 16 12:55 21
s---------   0 root     root           0 Dec 16 12:55 22
s---------   0 root     root           0 Dec 16 12:55 23
s---------   0 root     root           0 Dec 16 12:55 24
D---------   1 root     root           0 Jul 17  2002 3
--w-------   1 gregor   cleo           0 Dec 16 11:32 4
c---------   1 root     sys      138,  0 Dec 16 12:49 5
p---------   0 gregor   cleo           0 Dec 16 12:49 6
p---------   0 gregor   cleo           0 Dec 16 12:49 7
c---------   0 root     root     138,  1 Dec 16 12:30 8
p---------   0 gregor   cleo           0 Dec 16 12:29 9

--
Gregory J. Sharp                   email: [log in to unmask]
Wilson Synchrotron Laboratory      url: http://www.lepp.cornell.edu/~gregor
Dryden Rd                          ph:  +1 607 255 4882
Ithaca, NY 14853                   fax: +1 607 255 8062