Print

Print


Hi Lincoln,

Well, that server certainly is not acting like a regular disk server. When 
I see the message "delaying client up to 1800 sec" it typically means it's 
trying to fetch the file from a remote location. So, something is amiss. 
Is it acting that way for any file you try to read?

Also, unfortunately, there is no way to display the name mapping.

Andy

On Tue, 7 Sep 2021, Lincoln Bryant wrote:

> It's a plain server with attached disks configured in a RAID-6. Is there a simple way to take the logical name and get the physical name on disk?
>
> Thanks,
> Lincoln
> ________________________________
> From: [log in to unmask] <[log in to unmask]> on behalf of Andrew Hanushevsky <[log in to unmask]>
> Sent: Tuesday, September 7, 2021 2:49 PM
> To: Lincoln Bryant <[log in to unmask]>
> Cc: Yang, Wei <[log in to unmask]>; [log in to unmask] <[log in to unmask]>
> Subject: Re: Can't remove missing file
>
> Hi Lincoln,
>
> What is /uct3-s1.mwt2.org? Is it an Xcache server, a plain server, a tape
> backed server? Clearly, that server is having issues and it would be
> intgersting to find out why it can't provide the file even though it
> thinks it can. I say that because by delaying the client it means it is
> attempting to get the file.
>
> Andy
>
>
> On Tue, 7 Sep 2021, Lincoln Bryant wrote:
>
>> I tried to xrdcp this file. Seems the redirector forwards the client to uct3-s1, where it believes the file is stored, and then the client sits for 1800s and times out.
>>
>>> From the xrdcp debug:
>>
>> [2021-09-07 14:01:09.141934 -0500][Dump   ][XRootDTransport   ] [uct3-s1.mwt2.org:1094.0] Stream inactive since 1787 seconds, TTL: 300, allocated SIDs: 1, open files: 0, bound file objects: 0
>> [2021-09-07 14:01:09.141984 -0500][Dump   ][XRootDTransport   ] [uct3-s1.mwt2.org:1094.0] Stream inactive since 1787 seconds, stream timeout: 60, allocated SIDs: 1, wait barrier: 2021-09-07 14:01:22 -0500
>> [2021-09-07 14:01:22.637006 -0500][Dump   ][TaskMgr           ] Running task: "TickGeneratorTask for: root://uct3-s1.mwt2.org:1094"
>> [2021-09-07 14:01:22.637065 -0500][Dump   ][XRootD            ] [uct3-s1.mwt2.org:1094] Stream event reported for msg kXR_open (file: /atlas/uct3/data/users/jano/qual/matched/truth/ntuple_0.root?, mode: 00, flags: kXR_open_read kXR_async kXR_retstat )
>> [2021-09-07 14:01:22.637100 -0500][Debug  ][XRootD            ] [uct3-s1.mwt2.org:1094] Handling error while processing kXR_open (file: /atlas/uct3/data/users/jano/qual/matched/truth/ntuple_0.root?, mode: 00, flags: kXR_open_read kXR_async kXR_retstat ): [ERROR] Operation expired.
>> [2021-09-07 14:01:22.637125 -0500][Error  ][XRootD            ] [uct3-s1.mwt2.org:1094] Unable to get the response to request kXR_open (file: /atlas/uct3/data/users/jano/qual/matched/truth/ntuple_0.root?, mode: 00, flags: kXR_open_read kXR_async kXR_retstat )
>> [2021-09-07 14:01:22.637157 -0500][Debug  ][ExDbgMsg          ] [uct3-s1.mwt2.org:1094] Passing to the thread-pool MsgHandler: 0x1205420 (message: kXR_open (file: /atlas/uct3/data/users/jano/qual/matched/truth/ntuple_0.root?, mode: 00, flags: kXR_open_read kXR_async kXR_retstat ) ).
>> [2021-09-07 14:01:22.637182 -0500][Dump   ][TaskMgr           ] Will rerun task "TickGeneratorTask for: root://uct3-s1.mwt2.org:1094" at [2021-09-07 14:01:37 -0500]
>> [2021-09-07 14:01:22.637283 -0500][Debug  ][ExDbgMsg          ] [uct3-s1.mwt2.org:1094] Calling MsgHandler: 0x1205420 (message: kXR_open (file: /atlas/uct3/data/users/jano/qual/matched/truth/ntuple_0.root?, mode: 00, flags: kXR_open_read kXR_async kXR_retstat ) ) with status: [ERROR] Operation expired.
>> [2021-09-07 14:01:22.637373 -0500][Debug  ][File              ] [0x12005a0@root://uct3-xrd.mwt2.org:1094//atlas/uct3/data/users/jano/qual/matched/truth/ntuple_0.root?xrdcl.requuid=2686890d-010a-42b5-bcf6-c5d77f0213e0] Open has returned with status [ERROR] Operation expired
>> [2021-09-07 14:01:22.637400 -0500][Debug  ][File              ] [0x12005a0@root://uct3-xrd.mwt2.org:1094//atlas/uct3/data/users/jano/qual/matched/truth/ntuple_0.root?xrdcl.requuid=2686890d-010a-42b5-bcf6-c5d77f0213e0] Error while opening at uct3-s1.mwt2.org:1094: [ERROR] Operation expired
>> [2021-09-07 14:01:22.637441 -0500][Debug  ][XRootD            ] Redirect trace-back:
>> [2021-09-07 14:01:22.637441 -0500][Debug  ][XRootD            ]         0. Redirected from: root://uct3-xrd.mwt2.org:1094//atlas/uct3/data/users/jano/qual/matched/truth/ntuple_0.root to: root://uct3-s1.mwt2.org:1094/
>> [2021-09-07 14:01:22.637468 -0500][Debug  ][ExDbgMsg          ] [uct3-s1.mwt2.org:1094] Destroying MsgHandler: 0x1205420.
>> [0B/0B][100%][==================================================][0B/s]
>> Run: [ERROR] Operation expired:  (source)
>> [2021-09-07 14:01:22.638020 -0500][Debug  ][JobMgr            ] Stopping the job manager...
>> [2021-09-07 14:01:22.638040 -0500][Dump   ][JobMgr            ] Stopping worker #0...
>> [2021-09-07 14:01:22.638245 -0500][Dump   ][JobMgr            ] Worker #0 stopped
>> [2021-09-07 14:01:22.638273 -0500][Dump   ][JobMgr            ] Stopping worker #1...
>> [2021-09-07 14:01:22.638480 -0500][Dump   ][JobMgr            ] Worker #1 stopped
>> [2021-09-07 14:01:22.638503 -0500][Dump   ][JobMgr            ] Stopping worker #2...
>> [2021-09-07 14:01:22.638678 -0500][Dump   ][JobMgr            ] Worker #2 stopped
>> [2021-09-07 14:01:22.638701 -0500][Debug  ][JobMgr            ] Job manager stopped
>> [2021-09-07 14:01:22.638718 -0500][Debug  ][TaskMgr           ] Stopping the task manager...
>> [2021-09-07 14:01:22.638945 -0500][Debug  ][TaskMgr           ] Task manager stopped
>> [2021-09-07 14:01:22.638969 -0500][Debug  ][Poller            ] Stopping the poller...
>> [2021-09-07 14:01:22.639154 -0500][Debug  ][TaskMgr           ] Requesting unregistration of: "TickGeneratorTask for: root://uct3-s1.mwt2.org:1094"
>> [2021-09-07 14:01:22.639178 -0500][Debug  ][AsyncSock         ] [uct3-s1.mwt2.org:1094.0] Closing the socket
>> [2021-09-07 14:01:22.639190 -0500][Debug  ][Poller            ] <[::ffff:192.170.227.124]:38632><--><[::ffff:192.170.227.124]:1094> Removing socket from the poller
>> [2021-09-07 14:01:22.639225 -0500][Debug  ][PostMaster        ] [uct3-s1.mwt2.org:1094] Destroying stream
>> [2021-09-07 14:01:22.639237 -0500][Debug  ][AsyncSock         ] [uct3-s1.mwt2.org:1094.0] Closing the socket
>>
>> I believe these are the relevant logs from the xrootd.log on uct3-s1:
>> 210907 13:31:22 1365782 XrootdXeq: root.1372035:26@uct3-s1 pub IPv4 login
>> 210907 13:31:22 1365782 root.1372035:26@uct3-s1 ofs_open: 0-600 fn=/atlas/uct3/data/users/jano/qual/matched/truth/ntuple_0.root
>> 210907 13:31:22 1365782 root.1372035:26@uct3-s1 XrootdProtocol: delaying client up to 1800 sec
>> 210907 13:31:22 1365782 root.1372035:26@uct3-s1 ofs_close: use=0 fn=dummy
>> 210907 13:31:22 1365782 XrdPoll: Poller 0 enabled root.1372035:26@uct3-s1
>> 210907 13:31:23 1366108 XrdSched: running underused thread monitor inq=0
>> 210907 13:31:23 1366108 XrdSched: 3 threads; 1 idle
>> 210907 13:31:23 1366108 XrdSched: scheduling underused thread monitor in 780 seconds
>> 210907 13:44:23 1365782 XrdSched: running underused thread monitor inq=0
>> 210907 13:44:23 1365782 XrdSched: 3 threads; 1 idle
>> 210907 13:44:23 1365782 XrdSched: scheduling underused thread monitor in 780 seconds
>> 210907 13:45:23 1365776 XrdBuffManager: Reshaper has 2K; target 2450580K
>> 210907 13:57:23 1366108 XrdSched: running underused thread monitor inq=0
>> 210907 13:57:23 1366108 XrdSched: 3 threads; 1 idle
>> 210907 13:57:23 1366108 XrdSched: scheduling underused thread monitor in 780 seconds
>>
>> Thoughts?
>>
>> Thanks,
>> Lincoln
>>
>> ________________________________
>> From: Andrew Hanushevsky <[log in to unmask]>
>> Sent: Tuesday, September 7, 2021 1:15 PM
>> To: Lincoln Bryant <[log in to unmask]>
>> Cc: Yang, Wei <[log in to unmask]>; [log in to unmask] <[log in to unmask]>
>> Subject: Re: Can't remove missing file
>>
>> Hi Lincoln,
>>
>> Now that I have read the original email more closely, I think you can
>> ignore the problem. Notice that the redirector says the file is "pending".
>> That means it made a commitment to place the file on the server. Future
>> access to the file will vector the client to that server and the cache
>> should clean up it's act. Since the file doesn't exist (yet) you get the
>> error that it cannot be removed (because it can't). I'd suggest trying to
>> read it and see what happens.
>>
>> Andy
>>
>>
>> On Tue, 7 Sep 2021, Lincoln Bryant wrote:
>>
>>> Hi Andy,
>>>
>>> This was first noticed with an xrootd-fuse client:
>>> # ls /atlas/uct3/data/users/jano/qual/matched/truth
>>> ls: cannot access /atlas/uct3/data/users/jano/qual/matched/truth/ntuple_0.root: No such file or directory
>>> ntuple_0.root
>>> # rpm -q xrootd-fuse
>>> xrootd-fuse-5.3.1-1.el7.x86_64
>>>
>>> Then I also tried with the 'xrdfs' tool:
>>> # xrdfs root://uct3-xrd.mwt2.org:1094 ls /atlas/uct3/data/users/jano/qual/matched/truth/ntuple_0.root
>>> [ERROR] Server responded with an error: [3011] No servers have read access to the file
>>> # rpm -q xrootd-client
>>> xrootd-client-5.3.1-1.el7.x86_64
>>>
>>> I have tried the same clients on different hosts, which report the same problem.
>>>
>>> I have tried restarting the cmsd/xrootd on all 3 servers and the redirector. File is still in the same state.
>>>
>>> Advice on how to proceed?
>>>
>>> Thanks much,
>>> Lincoln
>>> ________________________________
>>> From: Andrew Hanushevsky <[log in to unmask]>
>>> Sent: Tuesday, September 7, 2021 1:03 PM
>>> To: Yang, Wei <[log in to unmask]>
>>> Cc: Lincoln Bryant <[log in to unmask]>; [log in to unmask] <[log in to unmask]>
>>> Subject: Re: Can't remove missing file
>>>
>>> For an xroot client, when this occurs, the client automatically asks the
>>> redirector to correct the cache. So, this should not be an issue. On the
>>> other hand, http has no such protocol so the rror persists until the cache
>>> times out. So, what client was getting this error?
>>>
>>> On Tue, 7 Sep 2021, Yang, Wei wrote:
>>>
>>>> Hi Lincoln,
>>>>
>>>> Xrootd?s redirector will cache the info of this file for 8 hours (I think it can be shortened). So the ghost will disappear after that time. Alternatively you can restart the redirector.
>>>>
>>>> I don?t know if there is a way to tell the redirector to refresh the cache (or cache of this particular entry).
>>>>
>>>> regards,
>>>> --
>>>> Wei Yang  |  [log in to unmask]<mailto:[log in to unmask]>  |  650-926-3338(O)
>>>>
>>>>
>>>>
>>>> ########################################################################
>>>> Use REPLY-ALL to reply to list
>>>>
>>>> To unsubscribe from the XROOTD-L list, click the following link:
>>>> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1
>>>>
>>>
>>> ########################################################################
>>> Use REPLY-ALL to reply to list
>>>
>>> To unsubscribe from the XROOTD-L list, click the following link:
>>> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1
>>>
>>
>
> ########################################################################
> Use REPLY-ALL to reply to list
>
> To unsubscribe from the XROOTD-L list, click the following link:
> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1
>
> ########################################################################
> Use REPLY-ALL to reply to list
>
> To unsubscribe from the XROOTD-L list, click the following link:
> https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1
>

########################################################################
Use REPLY-ALL to reply to list

To unsubscribe from the XROOTD-L list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-L&A=1