Hoi,
just a few thoughts. Not prioritized.
1. We can run over all data and all MC within 4 hours under normal
network/server conditions with 20 parallel jobs. We don't need more
than 100 jobs for that. This is with anaQA, not anaRecoil: identical
output, but no superfluous histograms. There have been exchanges
about those histograms and they need not be repeated here.
2. A very serious problem (in addition to the raw number of jobs) is
that Daniele's chain does not make use of a 'new' input format
implemented weeks ago in both anaQA and anaRecoil. When chaining,
you can either specify just a filename or a filename and the number
of events. If you do the former, ROOT will open the file, extract
the number of events, close it. Go to the next file. Do that for
all files in the chain. Multiply by 100, if all of your jobs start
up at the same time in the long queue. (Or at least 125.) You
will completely saturate and kill any server. On the other hand,
if you do specify the number of events, ROOT opens the rootfiles
only for processing, and then we have some CPU cycles to spend and
the access quickly becomes asynchronous. It might be beneficial
to switch to the new chaining scheme. At least consider it.
3. The difference between the servers can be seen by going to
http://monitor/host.php
and entering sulky% into the search field.
sulky25:/AWG18   619788288  410758632  195965336  68%  /a/sulky25/AWG18
sulky26:/AWG23   619788288     168376  580893672   1%  /a/sulky26/AWG23
sulky13:/AWG7   1000652800  781913464  205071112  80%  /a/sulky13/AWG7
sulky09:/AWG8    500340736  492076992    8199392  99%  /a/sulky09/AWG8
AWG8 is on one of the older servers available at SLAC, a
Netra-t-1400/1405. It was beaten to death repeatedly. AWG7 was much
better, and AWG12 as well.
AWG18 is on a Sun Fire-280R. These servers should be able to serve
several TB (according to SCS). We replaced an aging Netra serving the
Group C disks in May with such a model, and the performance went up
significantly. You can still kill it, and you can even do that with
block I/O and HBK (Henning can tell the story).
4. I have not yet verified that the messages
Error in <TFile::TFile>: file /nfs/farm/babar/AWG18/ISL/sx-080702/data/2000/output/outputdir/AlleEvents_2000_on-1099.root does not exist
are actually bad when encountered while processing chains (in the
new scheme). The situation might be different when trying to chain.
Note that I do not say it does not matter.
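To put a number on point 2, here is a rough sketch in Python (not the
actual anaQA/anaRecoil code) of what the two chain formats cost at job
startup. It assumes a chain list with one entry per line, either
"filename" or "filename nevents". The file names, event counts and the
helper names parse_chain_line and opens_at_startup are made up for
illustration. As far as I know, in ROOT itself this corresponds to
passing the number of entries as the second argument of TChain::Add.

```python
def parse_chain_line(line):
    """Split one chain-list line into (filename, nevents or None)."""
    fields = line.split()
    if not fields:
        raise ValueError("empty chain line")
    filename = fields[0]
    # The event count is optional; None means ROOT would have to open
    # the file just to find out how many events it holds.
    nevents = int(fields[1]) if len(fields) > 1 else None
    return filename, nevents


def opens_at_startup(chain_lines, n_jobs=1):
    """Count file opens needed only to learn event counts, over n_jobs jobs."""
    entries = [parse_chain_line(line) for line in chain_lines if line.strip()]
    missing = sum(1 for _, nevents in entries if nevents is None)
    return missing * n_jobs


# Hypothetical file names and event counts, for illustration only.
old_style = ["file-1.root", "file-2.root", "file-3.root"]
new_style = ["file-1.root 12000", "file-2.root 11500", "file-3.root 9800"]

print(opens_at_startup(old_style, n_jobs=100))  # 300
print(opens_at_startup(new_style, n_jobs=100))  # 0
```

With the old format every job touches every file before processing a
single event; with the new format the startup cost on the server drops
to zero and files are opened only as each job reaches them.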
Cheers,
--U.