Hi Fons,

It's pretty easy to do. I would assume you'd roll that into your PROOF protocol, yes?

Andy

----- Original Message -----
From: "Fons Rademakers" <[log in to unmask]>
To: "Andrew Hanushevsky" <[log in to unmask]>
Cc: "Fabrizio Furano" <[log in to unmask]>; "Jan Iwaszkiewicz" <[log in to unmask]>; <[log in to unmask]>; <[log in to unmask]>; "Gerri Ganis" <[log in to unmask]>
Sent: Tuesday, August 22, 2006 2:26 AM
Subject: Re: Querying locations of a vector of files

> Hi Andy,
>
> that would be no problem, assuming we can easily query the OLB admin
> interface. How would this be done -- via a popen/pclose, or is there an API?
>
> Cheers, Fons.
>
>
> Andrew Hanushevsky wrote:
>> Hi Fons,
>>
>> It would probably be relatively easy to do if the query was entered via
>> the OLB admin interface. It's more difficult to do via an xroot protocol
>> query request. Would that satisfy you?
>>
>> Andy
>>
>> ----- Original Message -----
>> From: "Fons Rademakers" <[log in to unmask]>
>> To: "Fabrizio Furano" <[log in to unmask]>
>> Cc: "Jan Iwaszkiewicz" <[log in to unmask]>; <[log in to unmask]>; <[log in to unmask]>; "Gerri Ganis" <[log in to unmask]>
>> Sent: Saturday, August 19, 2006 3:02 PM
>> Subject: Re: Querying locations of a vector of files
>>
>>
>>> Hi Andy, Fabrizio,
>>>
>>> what we really urgently would like to have is an xrootd command that
>>> takes as input a vector of generic xrootd URLs and returns a vector of
>>> resolved URLs (including multiple URLs in case the same file exists on
>>> more than one leaf node). Of course the first time this will take some
>>> time, since the head node will have to ask the leaf nodes, but from then
>>> on this info lives in the xrootd head node cache, so it should be very
>>> quick. We need the final location in PROOF to submit work packets with
>>> priority to the nodes that have the data local.
>>>
>>> Can you tell me if this feature is possible and if we can get it soon?
>>>
>>> Cheers, Fons.
>>>
>>>
>>>
>>> Fabrizio Furano wrote:
>>>> Hi Jan,
>>>>
>>>> I see. IMHO this means that there is very little overhead you can
>>>> overlap, at least on the client side. Either that, or you are opening
>>>> all those files against very few servers, or the same one. I hope not.
>>>>
>>>> Anyway, the async open was not meant as a way to speed up the open
>>>> primitive, but as a way to do other things while the open is in
>>>> progress, or to stage many files in parallel without serializing the
>>>> waits. But in your situation it seems that there are not so many waits
>>>> to parallelize.
>>>>
>>>> Fabrizio
>>>>
>>>>
>>>> Jan Iwaszkiewicz wrote:
>>>>> Hi!
>>>>>
>>>>> I have done some tests as Fabrizio advised.
>>>>> The results of the tests with asynchronous open are similar to those
>>>>> with the standard open.
>>>>>
>>>>> I used the following code:
>>>>>
>>>>>    TTime starttime = gSystem->Now();
>>>>>    TList *toOpenList = new TList();
>>>>>    toOpenList->SetOwner(kFALSE);
>>>>>    TIter nextElem(fDset->GetListOfElements());
>>>>>    while (TDSetElement *elem = dynamic_cast<TDSetElement*>(nextElem())) {
>>>>>       TFile::AsyncOpen(elem->GetFileName());
>>>>>       toOpenList->Add(elem);
>>>>>    }
>>>>>
>>>>>    TFile::EAsyncOpenStatus aos;
>>>>>    TIter nextToOpen(toOpenList);
>>>>>    while (toOpenList->GetSize() > 0) {
>>>>>       while (TDSetElement *elem = dynamic_cast<TDSetElement*>(nextToOpen())) {
>>>>>          aos = TFile::GetAsyncOpenStatus(elem->GetFileName());
>>>>>          if (aos == TFile::kAOSSuccess || aos == TFile::kAOSNotAsync
>>>>>              || aos == TFile::kAOSFailure) {
>>>>>             elem->Lookup();
>>>>>             toOpenList->Remove(elem);
>>>>>          }
>>>>>          else if (aos != TFile::kAOSInProgress)
>>>>>             Error("fileOpenTestTmp", "unknown aos");
>>>>>       }
>>>>>       nextToOpen.Reset();
>>>>>    }
>>>>>    toOpenList->Delete();
>>>>>
>>>>>    TTime endtime = gSystem->Now();
>>>>>    Float_t time_holder = Long_t(endtime - starttime) / Float_t(1000);
>>>>>    cout << "Opening time was " << time_holder << " seconds" << endl;
>>>>>
>>>>> The result is:
>>>>>
>>>>>    #files   asynchronous   standard TFile::Open
>>>>>    300      12.5           11.7
>>>>>    240       9.68           9.4
>>>>>    120       4.5            4.6
>>>>>
>>>>> Have a nice weekend!
>>>>> Jan
>>>>>
>>>>> Jan Iwaszkiewicz wrote:
>>>>>> Hi Fabrizio, Hi Andy!
>>>>>>
>>>>>> Thank you for the answers.
>>>>>> I'm running tests with TFile::AsyncOpen and will keep you informed.
>>>>>> Maybe I should clarify that we want to look up the locations of the
>>>>>> files on the PROOF master node, but then open the files on the worker
>>>>>> nodes. The point of the lookup is to determine which files each
>>>>>> worker will open/process. Regarding the problems that Andy described:
>>>>>> 1) I agree. 2) That makes it seem even more important to parallelize it.
>>>>>>
>>>>>> In fact, the possibility to get all locations of a file is also high
>>>>>> on our wish-list. It would prevent us from opening a remote copy of a
>>>>>> file while another copy is on one of our workers; we currently have
>>>>>> no mechanism to avoid that. I think it's quite a different use case
>>>>>> from file serving: we want to make the best use of the set of nodes
>>>>>> belonging to a PROOF session. It would be very useful to have this
>>>>>> functionality!
>>>>>> Cheers,
>>>>>> Jan
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Andrew Hanushevsky [mailto:[log in to unmask]]
>>>>>> Sent: Wed 8/16/2006 10:47 PM
>>>>>> To: Fabrizio Furano; Jan Iwaszkiewicz
>>>>>> Cc: [log in to unmask]; [log in to unmask]; Gerardo Ganis
>>>>>> Subject: Re: Querying locations of a vector of files
>>>>>>
>>>>>> Hi Jan,
>>>>>>
>>>>>> Another way to speed up the processing is to use the Prepare method,
>>>>>> which allows you to set in motion all the steps needed to get file
>>>>>> location information. As for finding out the locations of a list of
>>>>>> files, that may be doable but has problems of its own.
>>>>>> In your case it probably doesn't matter, but in the general case two
>>>>>> things may happen: 1) the location may be incorrect by the time you
>>>>>> get the information (i.e., the file has been moved or deleted), and
>>>>>> 2) there is no particular location for files that don't exist yet
>>>>>> (this includes files that may be in an MSS but not yet on disk). The
>>>>>> latter is more problematic, as it takes a while to determine. Anyway,
>>>>>> we'll look into a mechanism to get you file location information (one
>>>>>> of n for each file) using a list.
>>>>>>
>>>>>> Andy
>>>>>>
>>>>>> ----- Original Message -----
>>>>>> From: "Fabrizio Furano" <[log in to unmask]>
>>>>>> To: "Jan Iwaszkiewicz" <[log in to unmask]>
>>>>>> Cc: <[log in to unmask]>; "Maarten Ballintijn" <[log in to unmask]>; "Gerri Ganis" <[log in to unmask]>
>>>>>> Sent: Wednesday, August 16, 2006 10:09 AM
>>>>>> Subject: Re: Querying locations of a vector of files
>>>>>>
>>>>>>
>>>>>>> Hi Jan,
>>>>>>>
>>>>>>> at the moment such a primitive is not part of the protocol. The
>>>>>>> simplest way of doing it is to call Stat for each file, but this
>>>>>>> reduces the per-file overhead only by a small amount with respect
>>>>>>> to an Open call.
>>>>>>> In fact, both primitives actually drive the client to the final
>>>>>>> endpoint (the file), so you cannot avoid the overhead (mainly
>>>>>>> communication latencies) of being redirected to other servers.
>>>>>>>
>>>>>>> Since you say it's critical for you, my suggestion is to open as
>>>>>>> many files as you can in parallel. That way all the latencies
>>>>>>> overlap, and you can expect much higher performance.
>>>>>>>
>>>>>>> To do this, just call TFile::AsyncOpen(fname) for each file you
>>>>>>> need to open (in one loop), and then, later, call the regular
>>>>>>> TFile::Open (in another loop).
>>>>>>> The async call is non-blocking and very fast.
>>>>>>> You can find an example of its ROOT-based usage here:
>>>>>>>
>>>>>>> http://root.cern.ch/root/Version512.news.html
>>>>>>>
>>>>>>> The downside is that doing this uses a lot of resources, so if you
>>>>>>> have a really large number of files to open (say, 5000) and the
>>>>>>> resources are a problem, a possible workaround is to open them in
>>>>>>> bunches of a fixed size.
>>>>>>>
>>>>>>> Fabrizio
>>>>>>>
>>>>>>> Jan Iwaszkiewicz wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> In PROOF we have realized that we need a way to query the exact
>>>>>>>> locations of a set of files. As far as I have seen in the xrootd
>>>>>>>> protocol, there is no way to ask for the locations of a vector of
>>>>>>>> files.
>>>>>>>>
>>>>>>>> At the beginning of a query, we want to check the exact locations
>>>>>>>> of all the files from a data set. The current implementation does
>>>>>>>> this by opening all the files, one by one.
>>>>>>>> The speed is about 30 files/sec. For many queries, the lookup takes
>>>>>>>> much longer than the processing.
>>>>>>>> It is a critical problem for us.
>>>>>>>>
>>>>>>>> The bool XrdClientAdmin::SysStatX(const char *paths_list, kXR_char
>>>>>>>> *binInfo) method can check multiple files, but it only verifies
>>>>>>>> whether the files exist.
>>>>>>>> I imagine that it would be best for us to have something similar
>>>>>>>> but returning file locations. Is such an extension to the protocol
>>>>>>>> possible/reasonable to implement?
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Jan
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>
>>> --
>>> Org:    CERN, European Laboratory for Particle Physics.
>>> Mail:   1211 Geneve 23, Switzerland
>>> E-Mail: [log in to unmask]            Phone: +41 22 7679248
>>> WWW:    http://fons.rademakers.org    Fax:   +41 22 7669640
>>>
>>
>
> --
> Org:    CERN, European Laboratory for Particle Physics.
> Mail:   1211 Geneve 23, Switzerland
> E-Mail: [log in to unmask]            Phone: +41 22 7679248
> WWW:    http://fons.rademakers.org    Fax:   +41 22 7669640
>