Hi Andy,

On Nov 25, 2004, at 2:31 AM, Andrew Hanushevsky wrote:

> Hi Remi,
>
>> This is indeed solved, but the load balancer cashes the information
>> once it has retrieved it.
> I don't understand this. There is no core file and there is nothing in
> the log indicating that a crash/restart occurred. Hence, there is no
> trail of a crash. So what is really happening?

I am not saying that it crashes, but that it caches the results (sorry
for the typo, but it should have been clear from the context).

If you ask the load balancer for a file at a time when this file does
not exist, it claims the file does not exist for several hours (8?),
even if the file was created in the meantime. The opposite is true as
well: if you delete a file, the load balancer still reports that the
file exists, and even reports a checksum for files which were deleted
hours before.
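
To make the behaviour concrete, here is a minimal toy model in Python of
what I believe we are seeing. It is only a sketch: the class
CachingLookup, the file name and the fake clock are made up for
illustration and are not the load balancer's actual code or interface,
and the 8 hour lifetime is just my guess from above.

```python
import time
from typing import Callable, Dict, Set, Tuple


class CachingLookup:
    """Toy lookup service that caches every answer for a fixed lifetime."""

    def __init__(self, backend: Set[str], ttl_seconds: float,
                 clock: Callable[[], float] = time.time) -> None:
        self.backend = backend      # the files that really exist
        self.ttl = ttl_seconds      # assumed cache lifetime
        self.clock = clock
        self._cache: Dict[str, Tuple[bool, float]] = {}  # name -> (answer, expiry)

    def exists(self, name: str) -> bool:
        now = self.clock()
        cached = self._cache.get(name)
        if cached is not None and now < cached[1]:
            return cached[0]        # possibly stale answer
        answer = name in self.backend
        self._cache[name] = (answer, now + self.ttl)
        return answer


# A fake clock, so that the example runs instantly.
now = [0.0]
files: Set[str] = set()
lookup = CachingLookup(files, ttl_seconds=8 * 3600, clock=lambda: now[0])

name = "/prod/hypothetical-merge-001.root"
first = lookup.exists(name)     # file is missing; the answer is cached
files.add(name)                 # the file is created
now[0] += 3600                  # one hour later
stale = lookup.exists(name)     # the cached "missing" answer is returned
now[0] += 8 * 3600              # the cache entry has expired
fresh = lookup.exists(name)     # the backend is asked again
```

The same model reproduces the opposite case: removing the name from
`files` right after the last call still gives a positive answer until
that cache entry expires as well.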

At best this is confusing, but for the skim production it is also very
inconvenient. We need to merge several collections from the skim
production which were copied to the /prod area. A merge operation fails
from time to time and we need to redo it. As we create roughly 170
merges for different skims in parallel, we do not want to redo all of
them because just one merge failed. Thus, we ask the load balancer at
the beginning of the job if it already has the file (which is uniquely
named). If it does not have the file, we start the merge and create the
file, which is then copied to xrootd. However, due to the cache of the
load balancer, any further check on that file fails for the next 8
hours, as the load balancer claims the file does not exist. In the
worst case, the merge is restarted before the cache is cleared and, as
no merge output existed at all in the first round, the load balancer
will report that none of the files exist. The whole merge is redone and
the output files are reproduced, duplicating the files already in
xrootd.
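
The worst case can be sketched in a few lines of Python. This is only an
illustration under assumptions: merges_to_run, the output names and the
dictionary standing in for the cached answers are hypothetical, and the
check is modelled as a plain function rather than a real query to the
load balancer.

```python
from typing import Callable, Iterable, List


def merges_to_run(outputs: Iterable[str],
                  exists: Callable[[str], bool]) -> List[str]:
    """Return the merge outputs that the existence check reports as missing."""
    return [name for name in outputs if not exists(name)]


# Hypothetical output names; the real ones are unique per merge.
outputs = [f"/prod/hypothetical-merge-{i:03d}.root" for i in range(170)]
stored = set()                  # what is really in xrootd

# Round 1: nothing exists yet, and every "missing" answer gets cached.
cached_answers = {name: (name in stored) for name in outputs}
round1 = merges_to_run(outputs, lambda n: n in stored)

# All merges but one succeed and are copied into xrootd.
failed = outputs[42]
stored.update(n for n in outputs if n != failed)

# Round 2 within the cache lifetime: the stale answers say nothing exists.
round2_stale = merges_to_run(outputs, lambda n: cached_answers[n])

# Round 2 with up-to-date answers: only the failed merge would be redone.
round2_fresh = merges_to_run(outputs, lambda n: n in stored)
```

With stale answers the second round schedules all 170 merges again,
whereas an up-to-date answer would schedule only the one that failed.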

Another issue arises when the copy of files into xrootd fails. We copy
the file (with xrdcp) and then ask for the checksum of the copied file
and compare it to the local checksum. Assume that the copy corrupted
the file and the checksums do not match. We want to delete the file and
retransfer it. But the load balancer will keep reporting the cached
(wrong) checksum regardless of whether the retransfer copied the file
correctly or not.
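
Again as a rough sketch in Python, assuming an in-memory stand-in for
the remote store: transfer_with_verify, flaky_copy and the two checksum
functions are made-up names, the copy is not a real xrdcp call, and
Adler-32 is picked only for illustration, not because it is what the
system actually uses.

```python
import zlib
from typing import Callable, Dict


def adler32_hex(data: bytes) -> str:
    """Checksum used for illustration only."""
    return f"{zlib.adler32(data) & 0xffffffff:08x}"


def transfer_with_verify(data: bytes,
                         copy: Callable[[bytes], None],
                         remote_checksum: Callable[[], str],
                         delete: Callable[[], None],
                         max_attempts: int = 3) -> bool:
    """Copy, compare checksums, and delete and retry on a mismatch."""
    local = adler32_hex(data)
    for _ in range(max_attempts):
        copy(data)
        if remote_checksum() == local:
            return True
        delete()
    return False


remote: Dict[str, bytes] = {}       # stand-in for the file in xrootd
attempts = {"n": 0}


def flaky_copy(data: bytes) -> None:
    # The first transfer corrupts the last byte; later ones are clean.
    attempts["n"] += 1
    remote["file"] = (data[:-1] + b"?") if attempts["n"] == 1 else data


def delete() -> None:
    remote.pop("file", None)


def live_checksum() -> str:
    return adler32_hex(remote["file"])


checksum_cache: Dict[str, str] = {}


def cached_checksum() -> str:
    # The first answer is remembered and never refreshed.
    if "file" not in checksum_cache:
        checksum_cache["file"] = adler32_hex(remote["file"])
    return checksum_cache["file"]


payload = b"skim collection payload"
ok_live = transfer_with_verify(payload, flaky_copy, live_checksum, delete)
attempts["n"] = 0                   # replay the same corrupted first copy
ok_cached = transfer_with_verify(payload, flaky_copy, cached_checksum, delete)
```

With a live checksum the second attempt succeeds. With the cached one,
every retry is compared against the checksum of the corrupted first
copy, so the loop gives up although the later transfers were clean.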

I hope it is now clear that we either have to switch off the caching
(if that is possible) or make the caching more clever. I guess the
latter will be hard as long as xrdcp does not work via the load
balancer.

Cheers,
		Remi

---------------------------------------------------------------------
Computers are like air-conditioners, they stop working properly when
you open Windows.                                         (Anonymous)

*********************************************************************
Remigius K. Mommsen                 e-mail: [log in to unmask]
University of California, Irvine       URL:    http://cern.ch/mommsen
c/o SLAC                             voice:        ++1 (650) 926-3595
2575 Sand Hill Road #35                fax:        ++1 (650) 926-3882
Menlo Park, CA 94025, US              home:        ++1 (650) 233-9041
*********************************************************************