Hi Andy,
On Nov 25, 2004, at 2:31 AM, Andrew Hanushevsky wrote:
> Hi Remi,
>
>> This is indeed solved, but the load balancer cashes the information
>> once it has retrieved it.
> I don't understand this. There is no core file and there is nothing
> in the log indicating that a crash/restart occured. Hence, there is
> no trail of a crash. So what is really happening?
I am not saying that it crashes, but that it caches the results (sorry
for the typo, though it should have been clear from the context).
If you ask the load balancer for a file at a time when the file does
not exist, it claims that the file does not exist for several hours
(8?), even if the file was created in the meantime. The opposite is
true as well: if you delete a file, the load balancer still reports
that the file exists, and even reports a checksum for files that were
deleted hours before.
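To make the behaviour concrete, here is a minimal sketch in Python of
what I believe we are seeing. It is not the load balancer's actual
code: the class name, the injected clock, the file path and the 8 hour
lifetime are all assumptions made for illustration.

```python
import time

EIGHT_HOURS = 8 * 3600  # assumed lifetime; the real value is only a guess


class CachingLocator:
    """Toy stand-in for the load balancer: remembers the first answer per path."""

    def __init__(self, lookup, ttl=EIGHT_HOURS, clock=time.time):
        self._lookup = lookup   # callable(path) -> bool, asks the data servers
        self._ttl = ttl
        self._clock = clock     # injectable so the demo does not have to wait
        self._cache = {}        # path -> (answer, time the answer was cached)

    def exists(self, path):
        now = self._clock()
        hit = self._cache.get(path)
        if hit is not None and now - hit[1] < self._ttl:
            return hit[0]       # possibly stale, whether positive or negative
        answer = self._lookup(path)
        self._cache[path] = (answer, now)
        return answer


# Demo with a fake clock and a fake file store (hypothetical path).
files = set()
now = [0.0]
lb = CachingLocator(lambda p: p in files, clock=lambda: now[0])

before = lb.exists("/prod/skim_A.root")       # file not there yet
files.add("/prod/skim_A.root")                # file is created
now[0] = 3600.0
one_hour_later = lb.exists("/prod/skim_A.root")   # still the cached answer
now[0] = float(EIGHT_HOURS)
after_expiry = lb.exists("/prod/skim_A.root")     # cache entry has expired
```

In this sketch the answer only changes once the entry has expired, and
the same happens in the other direction for deleted files.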
At best this is confusing, but for the skim production it is also very
inconvenient. We need to merge several collections from the skim
production which were copied to the /prod area. A merge operation fails
from time to time and we need to redo it. As we create roughly 170
merges for different skims in parallel, we do not want to redo all of
them just because 1 merge failed. Thus, we ask the load balancer at the
beginning of the job if it already has the file (which is uniquely
named). If it does not have the file, we start the merge and create the
file, which is then copied to xrootd. However, due to the cache of the
load balancer, any further check on that file fails for the next 8
hours, as the load balancer claims the file does not exist. In the
worst case, the merge is restarted before the cache is cleared and, as
no merge output existed at all in the first round, the load balancer
will report that none of the files exist. The whole merge is redone and
the output files are reproduced, duplicating the files already in
xrootd.
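The check at the start of the merge job can be sketched roughly as
follows. This is a simplified illustration in Python, not our
production script: the function name, the file names and the way the
cached answers are modelled are hypothetical.

```python
def merges_to_run(output_names, file_exists):
    """Return the outputs that still need to be merged, in the given order."""
    return [name for name in output_names if not file_exists(name)]


# Hypothetical, uniquely named outputs of the roughly 170 parallel merges.
outputs = [f"/prod/merge_{i:03d}.root" for i in range(170)]

# What is really in xrootd after the first round: one merge failed.
in_xrootd = set(outputs) - {outputs[42]}

# What a stale cache answers: everything was asked for in the first
# round, when nothing existed yet, so every cached answer is "no".
cached_answers = {name: False for name in outputs}

with_fresh_answers = merges_to_run(outputs, lambda n: n in in_xrootd)
with_stale_answers = merges_to_run(outputs, lambda n: cached_answers[n])
```

With correct answers only the single failed merge is selected; with the
cached answers all 170 are selected again.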
Another issue is when the copy of files into xrootd fails. We copy the
file (with xrdcp) and then ask for the checksum of the copied file and
compare it to the local checksum. Assume that the copy corrupted the
file and the checksums do not match. We want to delete the file and
retransfer it. But the load balancer will keep reporting the cached
(wrong) checksum regardless of whether the retransfer copied the file
correctly or not.
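The copy procedure amounts to something like the sketch below. It is
written in Python with the copy, delete and checksum steps passed in as
plain callables, so it does not call xrdcp or any real xrootd
interface; all names, the retry limit and the simulated "bad"/"good"
checksums are assumptions for illustration.

```python
def copy_and_verify(copy, delete, remote_checksum, local_checksum, max_attempts=3):
    """Copy until the remote checksum matches the local one.

    Returns the number of attempts used, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        copy()
        if remote_checksum() == local_checksum:
            return attempt
        delete()  # remove the bad copy before trying again
    return None


def simulate(checksum_is_cached):
    """First transfer is corrupted, every later transfer is fine."""
    state = {"copies": 0, "stored": None, "cache": None}

    def copy():
        state["copies"] += 1
        state["stored"] = "bad" if state["copies"] == 1 else "good"

    def delete():
        state["stored"] = None

    def remote_checksum():
        if not checksum_is_cached:
            return state["stored"]
        if state["cache"] is None:
            state["cache"] = state["stored"]  # first answer is remembered
        return state["cache"]

    return copy_and_verify(copy, delete, remote_checksum, "good")


fresh_result = simulate(checksum_is_cached=False)
cached_result = simulate(checksum_is_cached=True)
```

Without caching the second transfer is accepted. With the cached
checksum every retransfer looks corrupted, even though the file stored
after the retry is correct.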
I hope it is now clear that we either have to switch off the caching
(if that is possible) or make the caching more clever. I guess the
latter will be hard as long as xrdcp does not work via the load
balancer.
Cheers,
Remi
---------------------------------------------------------------------
Computers are like air-conditioners, they stop working properly when
you open Windows. (Anonymous)
*********************************************************************
Remigius K. Mommsen e-mail: [log in to unmask]
University of California, Irvine URL: http://cern.ch/mommsen
c/o SLAC voice: ++1 (650) 926-3595
2575 Sand Hill Road #35 fax: ++1 (650) 926-3882
Menlo Park, CA 94025, US home: ++1 (650) 233-9041
*********************************************************************