Hi,

Here is today's update, using version 20041124-0752 for client and
server (currently on oprserv08 and bbrprod05; the rest will be updated
shortly).


On Nov 24, 2004, at 3:50 AM, Peter Elmer wrote:

>   Hi Remi,
>
> On Tue, Nov 23, 2004 at 08:28:59PM -0800, Remi Mommsen wrote:
>
>> With the load balancer (oprserv08)
>> ==================================
>>
>> - reading collections with standard access methods works *except* for
>> files written in the last 8(?) hours, when the load balancer was asked
>> for the file before it existed. This collection was copied to
>> bbrprod01 and KanCollUtil finds it:
>>
>> KanCollUtil root://bbrprod01:1094//prod/store/SPskims/R14/16.0.1a/BCharmoniumToHad/23/BCharmoniumToHad_2379
>> root://bbrprod01:1094//prod/store/SPskims/R14/16.0.1a/BCharmoniumToHad/23/BCharmoniumToHad_2379 (48609 events)
>>
>> but asking the load balancer does not return the file:
> <....>
>
>   This is presumably because you are xrdcp-ing the file directly to an
> individual machine. When you switch to copying things in via the load
> balancer itself, this should not be a problem. Why aren't you copying
> things into the buffer via the load balancer? (i.e. what is the last
> outstanding problem for doing that?)

Well, xrdcp cannot cope with a load balancer, as it tries to create the
directories first, which is not possible on that server. To work, xrdcp
would first need to start writing the file, be redirected, and fail
because the directory does not exist. It would then have to take the
server it was redirected to and create the directories there.

[noric01] /u/br/bbrskim/releases/test-16.0.1a/workdir > xrdcp -d1  
test.root root://oprserv08:1094//prod/bar/foo/
041124 19:37:38 001 Xrd: main (C) 2004 SLAC INFN xrdcp 0.2 beta
041124 19:37:38 001 Xrd: main test.root -->  
root://oprserv08:1094//prod/bar/foo/
041124 19:37:38 001 Xrd:  (C) 2004 SLAC XrdClientAdmin 0.2 beta
041124 19:37:38 001 Xrd: XrdClientUrlSet List of servers to connect to  
is [oprserv08:1094]
041124 19:37:38 001 Xrd: ShowUrls The converted URLs count is 1
041124 19:37:38 001 Xrd: ShowUrls URL n.1:  
oprserv08.slac.stanford.edu:1094//.
041124 19:37:38 001 Xrd: Create Access to server granted.
041124 19:37:38 001 Xrd: Connect Connected.
041124 19:37:38 001 Xrd: Stat Server [oprserv08.slac.stanford.edu:1094]  
did not return OK message for last request.
041124 19:37:38 001 Xrd: SendGenCommand Server declared error 3005:No  
servers are available to read the file.
041124 19:37:38 001 Xrd: Stat Server [oprserv08.slac.stanford.edu:1094]  
did not return OK message for last request.
041124 19:37:38 001 Xrd: SendGenCommand Server declared error 3005:No  
servers are available to read the file.
041124 19:37:38 001 Xrd: Mkdir Server  
[oprserv08.slac.stanford.edu:1094] did not return OK message for last  
request.
041124 19:37:38 001 Xrd: SendGenCommand Server declared error 3005:No  
servers are available to write the file.
041124 19:37:38 001 Xrd: Mkdir Server  
[oprserv08.slac.stanford.edu:1094] did not return OK message for last  
request.
041124 19:37:38 001 Xrd: SendGenCommand Server declared error 3005:No  
servers are available to write the file.
Caching info: MissRate=0 MissCount=0 ReadsCounter=0
Caching info: BytesUsefulness=0 BytesSubmitted=0 BytesHit=0
041124 19:37:38 001 Xrd: Create (C) 2004 SLAC INFN XrdClient 0.2 beta
041124 19:37:38 001 Xrd: XrdClientUrlSet List of servers to connect to  
is [oprserv08:1094]
041124 19:37:38 001 Xrd: ShowUrls The converted URLs count is 1
041124 19:37:38 001 Xrd: ShowUrls URL n.1:  
oprserv08.slac.stanford.edu:1094//.
041124 19:37:38 001 Xrd: Create Access to server granted.
041124 19:37:38 001 Xrd: Create Opening the remote file  
/prod/bar/foo//test.root
041124 19:37:38 001 Xrd: Open Server [oprserv08.slac.stanford.edu:1094]  
did not return OK message for last request.
041124 19:37:38 001 Xrd: SendGenCommand Server declared error 3005:No  
servers are available to write the file.
041124 19:37:38 001 Xrd: Create Error opening the file  
/prod/bar/foo//test.root on host oprserv08:1094
041124 19:37:38 001 Xrd: xrdcp Error opening remote destination file  
root://oprserv08:1094//prod/bar/foo//test.root
Caching info: MissRate=0 MissCount=0 ReadsCounter=0
Caching info: BytesUsefulness=0 BytesSubmitted=0 BytesHit=0
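
The order of operations described before the log (open first, follow
the redirect, create directories only on the server that was chosen)
can be sketched as a small in-memory model. This is only an
illustration: DataServer, LoadBalancer and copy_via_balancer are
hypothetical names, not part of xrdcp or the XrdClient API, and the
server selection policy is an assumption.

```python
import posixpath


class DataServer:
    """Hypothetical data server holding directories and files in memory."""

    def __init__(self, name):
        self.name = name
        self.dirs = {"/"}
        self.files = {}

    def mkdir_p(self, path):
        # Create the directory and all of its missing parents.
        current = ""
        for part in (p for p in path.split("/") if p):
            current += "/" + part
            self.dirs.add(current)

    def write(self, path, data):
        # Writing fails when the parent directory does not exist yet.
        parent = posixpath.dirname(path)
        if parent not in self.dirs:
            raise FileNotFoundError(parent)
        self.files[path] = data


class LoadBalancer:
    """Hypothetical redirector: it stores nothing and only picks a server."""

    def __init__(self, servers):
        self.servers = servers

    def redirect_for_write(self, path):
        # Assumption: simply pick the first server.
        return self.servers[0]


def copy_via_balancer(balancer, path, data):
    """Ask the balancer first; create directories only after a failed write."""
    target = balancer.redirect_for_write(path)
    try:
        target.write(path, data)
    except FileNotFoundError:
        # The directories are made on the redirected server, never on
        # the balancer itself.
        target.mkdir_p(posixpath.dirname(path))
        target.write(path, data)
    return target.name
```

In this model the balancer is never asked to make a directory, which
is the step that fails with error 3005 in the log above.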



>> Not working at all:
>> - XrdClientAdmin::XrdExistFiles, XrdClientAdmin::XrdDirList and
>> XrdClientAdmin::XrdGetChecksum just look on the load balancer itself,
>> i.e. they do not find anything by definition
>
>   Andy committed a fix for this (server side) last night. I'll make a
> version later today which can be started.

This is indeed solved, but the load balancer caches the information
once it has retrieved it.
I used xrdcp to copy a file test2.root directly to bbrprod05 into
/prod/foo and then asked oprserv08 for it:
041124 14:30:35 001 Xrd: HandleServerError Received redirection to  
[bbrprod05.slac.stanford.edu:1094]. Token=[].
chmod: 1  /prod/foo/test2.root
Checksum local: 2271761656 - xrd: crc32 2271761656
file exist: 1  /prod/foo/test2.root
Delete /prod/foo/test2.root: 1

which works, and the file is indeed deleted. However, oprserv08 caches
the result. On any later invocation I get the same answer even though
the file is gone:
041124 14:31:00 001 Xrd: HandleServerError Received redirection to  
[bbrprod05.slac.stanford.edu:1094]. Token=[].
chmod: 1  /prod/foo/test2.root
Checksum local: 2271761656 - xrd: crc32 2271761656
file exist: 1  /prod/foo/test2.root
Delete /prod/foo/test2.root: 1
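
A toy model of the symptom, assuming the load balancer keeps a
path-to-server map that is filled on first lookup. How the real server
caches locations is not something I know; CachingBalancer and its
"invalidate" switch are hypothetical and only show why a stale entry
keeps answering after the file is gone.

```python
class DataServer:
    """Hypothetical data server; only tracks which paths it holds."""

    def __init__(self, name):
        self.name = name
        self.files = set()


class CachingBalancer:
    """Hypothetical balancer that remembers where it last found a file."""

    def __init__(self, servers):
        self.servers = servers
        self.location = {}  # path -> server, filled on first lookup

    def locate(self, path):
        # Only ask the data servers when the path is not cached yet.
        if path not in self.location:
            for server in self.servers:
                if path in server.files:
                    self.location[path] = server
                    break
        return self.location.get(path)

    def remove(self, path, invalidate=True):
        server = self.locate(path)
        if server is None:
            return False
        server.files.discard(path)
        if invalidate:
            # Dropping the cached entry forces a fresh lookup next time.
            del self.location[path]
        return True
```

With invalidate=False a later locate() still returns bbrprod05 although
the file is gone, which matches the second log; with invalidate=True it
returns None.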


>> - xrdcp, XrdClientAdmin::XrdChmod and XrdClientAdmin::XrdRm hang:
>> 041123 16:36:20 001 Xrd: SendGenCommand Server
>> [oprserv08.slac.stanford.edu:1094] requested 5 seconds of wait
>> 041123 16:36:25 001 Xrd: SendGenCommand Server
>> [oprserv08.slac.stanford.edu:1094] requested 5 seconds of wait
>> 041123 16:36:30 001 Xrd: SendGenCommand Server
>> [oprserv08.slac.stanford.edu:1094] requested 5 seconds of wait
>> 041123 16:36:35 001 Xrd: SendGenCommand Server
>> [oprserv08.slac.stanford.edu:1094] requested 5 seconds of wait
>> 041123 16:36:40 001 Xrd: SendGenCommand Server
>> [oprserv08.slac.stanford.edu:1094] requested 5 seconds of wait
>> 041123 16:36:45 001 Xrd: SendGenCommand Server
>> [oprserv08.slac.stanford.edu:1094] requested 5 seconds of wait
>
>   It sounds like this one should also be fixed with the new server  
> version.

This is fixed.

>> In summary, we are close to getting the production going if we talk
>> directly to individual servers. This is clearly less than we wished
>> for. Load balancing is done on a random basis, not taking server load
>> and disk space into consideration. Finding files, getting checksums
>> and removing files requires asking all servers. This increases the
>> overall load on the system.
>
>   Let's see if the latest round of fixes allow you to do everything via
> the load balancer...

Unfortunately, this is not the case.

>> As currently no wildcard operations work, we
>> need to get a list of files from the server and then ask the server to
>> remove each file individually.
>
>   This shouldn't be a big deal. Actually I don't understand why you
> need to ask the server for the list of files. Don't you already know
> that as part of the skim bookkeeping anyway? Or do you mean: we need
> to get the list of servers the files are on since rm wasn't working
> via the load balancer and then delete them manually? (That shouldn't
> be a problem now as mentioned above.)

In principle this information is around, but the pieces (server name,
collection name, and individual filename extensions) have to be
gathered from places that are often not very obvious. The current
implementation of the wrapper scripts just does
"rm /path/to/my/temporary/files/*root". As we are still using the
soon-to-be-replaced split-skim wrappers, the easiest translation is
just to put "root://bbrprod0X:1094/" in front and delete all files
found on the given server. This works quite well right now. A better
implementation will hopefully be part of the skim task management v.2.
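
As a rough sketch of that translation: since no wildcard operations
work on the server, the pattern has to be expanded on the client
against a directory listing, and each match removed individually. The
function name expand_remote_rm is hypothetical, and the listing is
passed in as a plain list of file names so that no client API call has
to be assumed here.

```python
import fnmatch
import posixpath


def expand_remote_rm(server, pattern, listing, port=1094):
    """Turn a shell-style rm pattern into one root:// URL per matching file.

    server  -- host name of the data server, e.g. "bbrprod01"
    pattern -- absolute path whose last component may contain wildcards
    listing -- file names found in that directory on the server
    """
    directory, name_pattern = posixpath.split(pattern)
    urls = []
    for name in sorted(listing):
        if fnmatch.fnmatchcase(name, name_pattern):
            # "root://host:port/" goes in front of the absolute path,
            # which gives the double slash seen in the URLs above.
            urls.append(f"root://{server}:{port}/{posixpath.join(directory, name)}")
    return urls


if __name__ == "__main__":
    for url in expand_remote_rm(
        "bbrprod01",
        "/path/to/my/temporary/files/*root",
        ["a.root", "b.root", "notes.txt"],
    ):
        print(url)
```

Each URL would then be handed to the remove call one at a time.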

Cheers,
		Remi


---------------------------------------------------------------------
Intelligence is like a four-wheel drive vehicle: it allows you to get
stuck in much more remote places.

*********************************************************************
Remigius K. Mommsen                 e-mail: [log in to unmask]
University of California, Irvine       URL:    http://cern.ch/mommsen
c/o SLAC                             voice:        ++1 (650) 926-3595
2575 Sand Hill Road #35                fax:        ++1 (650) 926-3882
Menlo Park, CA 94025, US              home:        ++1 (650) 233-9041
*********************************************************************