On Oct 8, 2010, at 4:48 PM, Andrew Hanushevsky wrote: > Hi Brian, > > Could you also send the xrootd log snipped that includes the first time the client reported the error and that includes all traffic? What I see below is that the log is restricted to only the particular client (not all messages contain the client's ID) and starts 11:33:20 while the client complained at 11:32:41 to 11:32:48. > > Andy > > On Fri, 8 Oct 2010, Brian Bockelman wrote: > >> Hi Andy, >> >> Nope, this is a user analysis job running on EC2. The job is ntuple-like and runs at about 1KHz. So, each file takes about 2 minutes to process and the connection is trashed by the client in the "normal" manner after a few minutes. >> >> The sequence, as far as I can tell is: >> >> 1) Client reads file from red-gridftp9. >> 2) About 5 minutes after the file is closed and the connection idles, the client disconnects >> 3) Later on, the client hits another file at red-gridftp9. >> 4) Client instructs red-gridftp9 to close the session. red-gridftp9 says "no such session". >> 5) Client continues to puke error messages once a second, regardless if there are any outstanding requests. >> >> Note that the first two dumps below have the following error message: >> >>>> Xrd: Read: Timeout or error waiting outstanding blocks. Retrying sync! List of outstanding reqs follows >> >> but the list of requests are empty! >> >> Brian >> >> On Oct 8, 2010, at 3:06 PM, Andrew Hanushevsky wrote: >> >>> Hi Brian, >>> >>> Looks like the client starts to hiccup every 7 minutes on the dot. The endsess is a normal part of the reconnect process and a recommendation in the protocol. The reason that the previous session wasn't found is that the client actually disconnected ending the previous session. I assume the reason is that the client thinks the server is not responding and so it tries to simply reconnect. It would seem that the client lost several blocks that it wanted to transfer. I assume this is related to xtreme copy. >>> >>> Andy >>> >>> On Fri, 8 Oct 2010, Brian Bockelman wrote: >>> >>>> Hi all, >>>> >>>> I'm seeing some issues with xrootd in CMSSW jobs that are causing severe performance issues. See below for a snippet from the client and relevant server. >>>> >>>> It seems that the client is trying to end a previously terminated security session and, after this happens the first time, I always get these "Timeout or error waiting on outstanding blocks". This happens about every second of a multi-hour job, making the log file a touch verbose. >>>> >>>> In this case, the client was talking to red-gridftp9. I tried a few test transfers, and the server is otherwise responsive (although by no means idle, xrootd is exporting about 4Gbps today). >>>> >>>> Any ideas? I've only seen these timeout issues when there's the server error. My search has not been exhaustive, so it could just be a coincidence. >>>> >>>> Brian >>>> >>>> 101008 11:32:41 001 Xrd: CheckErrorStatus: Server declared: session not found(error code: 3011) >>>> 101008 11:32:41 001 Xrd: Open: File opened succesfully. >>>> 08-Oct-2010 11:32:42 CDT Successfully opened file root://xrootd.unl.edu//store/data/Run2010A/JetMETTau/RECO/Sep17ReReco_v2/0051/BC729A70-BEC6-DF11-AFF4-0018F3D096EE.root >>>> 101008 11:32:48 001 Xrd: Read: Timeout or error waiting outstanding blocks. Retrying sync! List of outstanding reqs follows. >>>> -------------------------------------------------- start outstanding reqs dump. freesids: 65533 >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++ end outstanding reqs dump. >>>> 101008 11:32:48 001 Xrd: Read: Timeout or error waiting outstanding blocks. Retrying sync! List of outstanding reqs follows. >>>> -------------------------------------------------- start outstanding reqs dump. freesids: 65533 >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++ end outstanding reqs dump. >>>> 101008 11:32:48 001 Xrd: Read: Timeout or error waiting outstanding blocks. Retrying sync! List of outstanding reqs follows. >>>> -------------------------------------------------- start outstanding reqs dump. freesids: 65532 >>>> >>>> >>>> ================= DUMPING CLIENT REQUEST HEADER ================= >>>> ClientHeader.streamid = 0x75 0x0d >>>> ClientHeader.requestid = kXR_read (3013) >>>> ClientHeader.read.fhandle = 0x00 0x00 0x00 0x00 >>>> ClientHeader.read.offset = 11937606541 >>>> ClientHeader.read.rlen = 562291 >>>> ClientHeader.header.dlen = 0 >>>> =================== END CLIENT HEADER DUMPING =================== >>>> >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++ end outstanding reqs dump. >>>> >>>> [root@red-gridftp9 ~]# grep glidein_.10759 /var/log/xrootd/xrootd.log >>>> 101008 11:33:20 8224 glidein_.10759:[log in to unmask] XrootdProtocol: more auth requested; sz=2082 >>>> 101008 11:33:20 8224 XrootdXeq: glidein_.10759:[log in to unmask] login as 92a2e9e2.0 >>>> 101008 11:33:20 8224 acc_Audit: glidein_.10759:[log in to unmask] grant gsi [log in to unmask] read /store/data/Run2010A/JetMETTau/RECO/Sep17ReReco_v2/0050/D6E7ADBB-B7C6-DF11-AFC6-002618943975.root >>>> 101008 11:40:24 8224 XrootdXeq: glidein_.10759:[log in to unmask] disc 0:07:04 >>>> 101008 12:26:57 8224 glidein_.10759:[log in to unmask] XrootdProtocol: more auth requested; sz=2082 >>>> 101008 12:26:58 8224 XrootdXeq: glidein_.10759:[log in to unmask] login as 92a2e9e2.0 >>>> 101008 12:26:58 8224 glidein_.10759:[log in to unmask] XrootdProtocol: endsess 8222:371.14718 >>>> 101008 12:26:58 8224 glidein_.10759:[log in to unmask] XrootdProtocol: endsess 8222:371.14718 rc=-3 (No such process) >>>> 101008 12:26:58 8224 glidein_.10759:[log in to unmask] XrootdResponse: sending err 3011: session not found >>>> 101008 12:26:58 8224 acc_Audit: glidein_.10759:[log in to unmask] grant gsi [log in to unmask] read /store/data/Run2010A/JetMETTau/RECO/Sep17ReReco_v2/0050/E62D2728-B8C6-DF11-8E71-0018F3D0962C.root >>>> 101008 12:33:59 8224 XrootdXeq: glidein_.10759:[log in to unmask] disc 0:07:02 >>>> 101008 12:56:01 8224 glidein_.10759:[log in to unmask] XrootdProtocol: more auth requested; sz=2082 >>>> 101008 12:56:01 8224 XrootdXeq: glidein_.10759:[log in to unmask] login as 92a2e9e2.0 >>>> 101008 12:56:01 8224 glidein_.10759:[log in to unmask] XrootdProtocol: endsess 8222:451.14760 >>>> 101008 12:56:01 8224 glidein_.10759:[log in to unmask] XrootdProtocol: endsess 8222:451.14760 rc=-3 (No such process) >>>> 101008 12:56:01 8224 glidein_.10759:[log in to unmask] XrootdResponse: sending err 3011: session not found >>>> 101008 12:56:01 8224 acc_Audit: glidein_.10759:[log in to unmask] grant gsi [log in to unmask] read /store/data/Run2010A/JetMETTau/RECO/Sep17ReReco_v2/0047/884E3212-6AC6-DF11-B1C9-0026189438AF.root >>>> >>>> >> >>