LISTSERV 16.5 - VUB-RECOIL Archives

Hi Yury,

 one example is

 ~daniele/scra/newchains_1030/data-2

 and the typical message is

 Error in <TFile::TFile>: file /nfs/farm/babar/AWG18/ISL/sx-080702/data/2000/output/outputdir/AlleEvents_2000_on-1095.root does not exist

 on AWG8 this pathology happened only a few times, when there were >~300 jobs
reading the same disk, if I remember correctly.

 Do you know what the difference is between AWG8 and AWG18?

 My proposal is to split things on different disks, if possible.

 Thanks a lot,

 Daniele
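
A minimal Python sketch of the proposal above, combined with the cap of about 10 I/O-limited jobs suggested in the quoted reply below: group input files by disk and limit how many jobs read from each disk at once. The names `disk_of` and `run_throttled`, the rule that the first four path components identify a disk, and the `job` callable are all assumptions made for illustration; they are not part of any BaBar or batch-system tool.

```python
import threading
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def disk_of(path, depth=4):
    """Assumed rule: the first `depth` path components name the disk."""
    parts = path.strip("/").split("/")
    return "/" + "/".join(parts[:depth])


def run_throttled(paths, job, max_per_disk=10, disk_key=disk_of):
    """Run job(path) for each path, with at most max_per_disk at once per disk."""
    paths = list(paths)
    limits = defaultdict(lambda: threading.Semaphore(max_per_disk))
    # Create every semaphore up front so worker threads only read the dict.
    for p in paths:
        limits[disk_key(p)]

    def guarded(path):
        # Block until a slot on this file's disk is free.
        with limits[disk_key(path)]:
            return job(path)

    workers = max(1, max_per_disk * len(limits))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Results come back in the same order as `paths`.
        return list(pool.map(guarded, paths))
```

Here `job` stands for whatever processes one file (for example, a function that submits a batch job and waits for it). Spreading the inputs over more disks raises the total number of jobs that can run without raising the load on any single server.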

On Thu, 31 Oct 2002, Yury G. Kolomensky wrote:

> 	Hi Daniele,
>
> do you have an example of a log file for these jobs? I do not know
> exactly what servers these disks have been installed on, but we
> noticed in E158, where most of the data were sitting on one
> (relatively slow) server, that jobs were limited by I/O throughput to
> about 2 MB/sec. This limit comes from the random access pattern that split
> ROOT trees provide. If your job is sufficiently fast, you can saturate
> the I/O limit quite quickly -- with 2-3 jobs. If you submit too many jobs
> (tens or even hundreds), the server will thrash to the point that the
> clients will receive NFS timeouts. ROOT usually does not like that --
> you may see error messages in the log file about files not found (when
> the files are actually on disk), or about problems uncompressing
> branches. These are usually more severe on Linux clients, where the
> NFS client implementation is not very robust.
>
> There are several ways to cope with this problem:
>
> 1) Submit fewer jobs at one time. I would not submit more than 10
>    I/O-limited jobs in parallel.
> 2) Place your data on different servers. Ideally, that means different
>    sulky servers. Even if you stay on the same sulky server but split
>    your data onto different partitions, you still get the benefit of
>    parallelizing disk access.
> 3) Re-write your jobs to first copy your data onto a local disk on the
>    batch worker (for instance, /tmp), then run on the local copy, then
>    delete the local copy. The benefit of that is that the cp command
>    will access the file in direct-access mode (with 10-20 MB/sec
>    throughput, depending on the network interface throughput).
> 4) Make your ntuples non-split (very highly recommended). This usually
>    increases the throughput by a factor of 10-20. If your typical job
>    reads most of the branches of the tree, splitting the tree makes no
>    sense. Non-split trees provide direct access to disk, which is much
>    more efficient.
>
> 							Yury
>
>
> On Thu, Oct 31, 2002 at 09:26:08AM -0800, Daniele del Re wrote:
> >
> > Hi all,
> >
> >  in the last two days I tried to run on data and MC on the new disk AWG18.
> > No way. I got problems in 80% of the jobs. Some crashed; most of
> > them failed to read a large number of ROOT files (which are actually there).
> >
> >  This problem seems to be worse than ever. Do we have to contact
> > computing people about this?
> >
> >  Daniele
> >
> >
>
>
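
Option 3 in Yury's list (copy to local disk, run on the copy, delete the copy) can be sketched in Python roughly as follows. This is only an illustration: `staged_copy` and its `scratch_root` argument are hypothetical names, and the scratch location defaults to whatever the system temporary directory is on the batch worker.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def staged_copy(remote_path, scratch_root=None):
    """Copy remote_path to local scratch, yield the local path, then clean up."""
    # A private directory avoids name clashes between jobs on the same worker.
    workdir = tempfile.mkdtemp(prefix="stage_", dir=scratch_root)
    local_path = os.path.join(workdir, os.path.basename(remote_path))
    try:
        # One sequential read of the whole file from the file server.
        shutil.copyfile(remote_path, local_path)
        yield local_path
    finally:
        # Always delete the local copy, even if the analysis step failed.
        shutil.rmtree(workdir, ignore_errors=True)
```

A job would then wrap its processing step as `with staged_copy(path) as local: ...` and open `local` instead of the NFS path, so the random-access reads hit the local disk rather than the shared server.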