Hi Remi,
   |
   |
   V

Remi Mommsen wrote:
> Hi,
> 
> Here is a summary of what is and is not working, using today's xrootd CVS
> version for the client code and version 20041118-0948_dbg for the
> servers at SLAC (restarted this afternoon).
> 
> 
> With a given server (bbrprod0X)
> ===============================
> 
> OK:
> - xrdcp, including creation of nonexistent directories and file
> permissions (thanks Fabrizio)
> 
> - checksum using XrdClientAdmin::XrdGetChecksum, buffer overrun is  
> fixed (thanks Fabrizio)
> 
>   CAVEAT: the connection times out while the checksum is being calculated.
>    http://www.slac.stanford.edu/cgi-bin/lwgate/XROOTD-L/archives/xrootd-l.200411/Subject/article-88.html
>   This will likely become an issue once we start production with 100s
> of jobs.
> 
> - changing permissions and removing files
> 
> - listing directories (thanks Wilko)
> 
> Not working:
> - removing directories
> 
> 
> 
> With the load balancer (oprserv08)
> ==================================
> 
> - reading collections with standard access methods works *except* for
> files written in the last 8(?) hours, when the load balancer was asked
> for the file before it existed. This collection was copied to
> bbrprod01 and KanCollUtil finds it:
> 
> KanCollUtil root://bbrprod01:1094//prod/store/SPskims/R14/16.0.1a/BCharmoniumToHad/23/BCharmoniumToHad_2379
> root://bbrprod01:1094//prod/store/SPskims/R14/16.0.1a/BCharmoniumToHad/23/BCharmoniumToHad_2379 (48609 events)
> 
> 
> but asking the load balancer does not return the file:
> 
> KanCollUtil root://oprserv08:1094//prod/store/SPskims/R14/16.0.1a/BCharmoniumToHad/23/BCharmoniumToHad_2379
> 2004-11-23 16:23:22 7642 Err : TXNetConn::Open                - Server [oprserv08.slac.stanford.edu:1094] did not return OK message for last request.
> 2004-11-23 16:23:22 7642 Err : TXNetConn::SendGenCommand      - Server declared error 3005: 'No servers are available to read the file.'
> 2004-11-23 16:23:22 7642 Err : TXNetFile::CreateTXNf          - Error opening the file  //prod/store/SPskims/R14/16.0.1a/BCharmoniumToHad/23/BCharmoniumToHad_2379.01.root on host oprserv08.slac.stanford.edu:1094
> ERR Could not open a file expected to contain the event header:
> ERR    LFN = root://oprserv08:1094//prod/store/SPskims/R14/16.0.1a/BCharmoniumToHad/23/BCharmoniumToHad_2379.01.root
> ERR    PFN = root://oprserv08:1094//prod/store/SPskims/R14/16.0.1a/BCharmoniumToHad/23/BCharmoniumToHad_2379.01.root
> ERR Check collection name and access method...
> 
> 
> 
> 
> Not working at all:
> - XrdClientAdmin::XrdExistFiles, XrdClientAdmin::XrdDirList and
> XrdClientAdmin::XrdGetChecksum look only at the load balancer itself,
> so by definition they do not find anything
> 
> - xrdcp, XrdClientAdmin::XrdChmod and XrdClientAdmin::XrdRm hang:
> 041123 16:36:20 001 Xrd: SendGenCommand Server [oprserv08.slac.stanford.edu:1094] requested 5 seconds of wait
> 041123 16:36:25 001 Xrd: SendGenCommand Server [oprserv08.slac.stanford.edu:1094] requested 5 seconds of wait
> 041123 16:36:30 001 Xrd: SendGenCommand Server [oprserv08.slac.stanford.edu:1094] requested 5 seconds of wait
> 041123 16:36:35 001 Xrd: SendGenCommand Server [oprserv08.slac.stanford.edu:1094] requested 5 seconds of wait
> 041123 16:36:40 001 Xrd: SendGenCommand Server [oprserv08.slac.stanford.edu:1094] requested 5 seconds of wait
> 041123 16:36:45 001 Xrd: SendGenCommand Server [oprserv08.slac.stanford.edu:1094] requested 5 seconds of wait
> 
> 

  I do not understand the trouble here. I thought this problem had
been fixed (it was an issue related to the client repeatedly trying to
refresh the load balancer's cache).

  If you can trigger this problem easily, could you please
send me a level 3 client log?


Fabrizio


> 
> In summary, we are close to getting production going if we talk directly
> to individual servers. This is clearly less than we wished for. Load
> balancing is done on a random basis, without taking server load or disk
> space into consideration. Finding files, getting checksums and removing
> files require asking all servers, which increases the overall load on
> the system.
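
To make the cost of "asking all servers" concrete, here is a minimal Python sketch of the fan-out a client has to do when the load balancer cannot answer. It is not xrootd client code: `find_replicas` and the `exists_on` callable are hypothetical names, and `exists_on` stands in for whatever per-server existence query is used in practice. The host bbrprod02 is likewise assumed for illustration only.

```python
from typing import Callable, Iterable, List


def find_replicas(servers: Iterable[str], path: str,
                  exists_on: Callable[[str, str], bool]) -> List[str]:
    """Return the servers that report holding `path`.

    `exists_on(server, path)` is a hypothetical callable that wraps a
    per-server existence query; it is not part of any real client API.
    """
    holders = []
    for server in servers:
        try:
            if exists_on(server, path):
                holders.append(server)
        except OSError:
            # An unreachable server is treated as "does not hold the file".
            continue
    return holders


# Stand-in catalogue: server name -> set of paths it holds (made-up data).
catalogue = {
    "bbrprod01": {"/prod/store/a.root"},
    "bbrprod02": set(),
}
servers = sorted(catalogue)
found = find_replicas(servers, "/prod/store/a.root",
                      lambda s, p: p in catalogue[s])
print(found)
```

Every lookup costs one request per server, so the total number of requests grows with both the number of jobs and the number of servers.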
> 
> In general, a lot of functionality is implemented on the client side
> which IMHO should be provided by the server. For example, xrdcp issues
> a mkdir command for each directory instead of just asking the server to
> create the whole path. As no wildcard operations currently work, we
> need to get a list of files from the server and then ask the server to
> remove each file individually.
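
The two client-side workarounds described above can be sketched as follows. This is illustrative Python under stated assumptions, not the xrdcp or XrdClientAdmin implementation: `directory_chain`, `remove_matching`, `list_dir` and `remove` are hypothetical names, with `list_dir` and `remove` standing in for the per-directory listing and per-file removal requests.

```python
import fnmatch
import posixpath
from typing import Callable, List


def directory_chain(path: str) -> List[str]:
    """List every ancestor directory of `path`, shallowest first.

    A client without a server-side "create whole path" request has to
    issue one mkdir per entry in this list.
    """
    parts = [p for p in posixpath.dirname(path).split("/") if p]
    return ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]


def remove_matching(list_dir: Callable[[str], List[str]],
                    remove: Callable[[str], None],
                    directory: str, pattern: str) -> List[str]:
    """Emulate a wildcard remove: list, filter locally, remove one by one.

    `list_dir` and `remove` are hypothetical callables standing in for the
    real listing and removal requests.
    """
    names = [n for n in list_dir(directory) if fnmatch.fnmatchcase(n, pattern)]
    for name in names:
        remove(posixpath.join(directory, name))  # one request per file
    return names


print(directory_chain("/prod/store/SPskims/file.root"))
```

For a path that is N directories deep with M matching files, this costs N mkdir requests on the way in and one listing plus M remove requests on the way out, which is the overhead a server-side implementation would avoid.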
> 
> Cheers,
>         Remi
> 
> 
> ---------------------------------------------------------------------
> Q: How many kinds of physicists are there?
> A: Three. Those who can count and those who can't.
> 
> *********************************************************************
> Remigius K. Mommsen                 e-mail: [log in to unmask]
> University of California, Irvine       URL:    http://cern.ch/mommsen
> c/o SLAC                             voice:        ++1 (650) 926-3595
> 2575 Sand Hill Road #35                fax:        ++1 (650) 926-3882
> Menlo Park, CA 94025, US              home:        ++1 (650) 233-9041
> *********************************************************************