VUB-RECOIL Archives (VUB-RECOIL@LISTSERV.SLAC.STANFORD.EDU), October 2002

Subject:      Re: problems on AWG18
From:         "Yury G. Kolomensky" <[log in to unmask]>
Reply-To:     [log in to unmask]
Date:         Thu, 31 Oct 2002 11:01:14 -0800
Content-Type: text/plain
Parts/Attachments: text/plain (180 lines)

ROOT should have no problem reading or writing files as long as they
are less than 2 GB. Making a few large files has many benefits --
fewer system calls, fewer parallel jobs --> better CPU/wallclock
ratio. Larger files can also be directly backed up in mstore -- which
you probably want to do at some point.
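
For instance, one chunk of the small files could be merged into a single
big file with a short ROOT macro along these lines (the tree name "ntp1"
and the wildcard pattern are placeholders; adjust them to your ntuples):

   // merge_chunk.C -- sketch: chain one chunk of small files, write one merged file
   #include "TChain.h"

   void merge_chunk()
   {
      TChain ch("ntp1");   // placeholder tree name
      // wildcard over one chunk of the input files (placeholder pattern)
      ch.Add("/nfs/farm/babar/AWG18/ISL/sx-080702/data/2000/output/outputdir/AlleEvents_2000_on-1*.root");
      ch.Merge("AlleEvents_2000_merged_1.root");   // keep each output under 2 GB
   }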

If you are going to run a skim job to produce large files -- I would
again encourage you to convert your trees to non-split mode. I see
that you now have 114 branches. This is very bad for I/O if your
typical job reads most of them for every event. 
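
Concretely, non-split mode just means passing splitlevel = 0 when the output
tree is booked. A minimal sketch, assuming a user event class (MyEvent is a
placeholder for whatever class your ntuple actually stores):

   // book_nonsplit.C -- sketch: store the whole event object in a single branch
   #include "TFile.h"
   #include "TTree.h"

   class MyEvent;   // placeholder: your ntuple's event class (needs a ROOT dictionary)

   void book_nonsplit(MyEvent *event)
   {
      TFile *out  = new TFile("skim_nonsplit.root", "RECREATE");
      TTree *tree = new TTree("ntp1", "non-split skim");   // placeholder names
      // last argument is the split level: 0 = non-split (the whole object in
      // one branch), 99 (the default) = one branch per data member
      tree->Branch("event", "MyEvent", &event, 64000, 0);
      // ... fill the tree in the event loop, then out->Write() and out->Close() ...
   }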

							Yury

On Thu, Oct 31, 2002 at 10:54:29AM -0800, Riccardo Faccini wrote:
> Hi Alessio,
> each of the skimmed root files you would produce would be 300 times larger
> than the original one. Is this something we can handle? I do not know the
> answer...
> 	thanks
> 	ric
> 
> On Thu, 31 Oct 2002, Alessio Sarti wrote:
> 
> > I'm proposing to have skimmed root files (just ~10) obtained by running
> > against a split chain.
> >
> > Let's say:
> > I start with a chain of 9000 root files.
> > I split it up into 10 pieces.
> > I run 10 skimming jobs on 900 root files each.
> > I'm producing just 10 output files.
> > And then you can safely just run against them!
> > What do you think about that?
> >
> > Alessio
> >
> > ______________________________________________________
> > Alessio Sarti     Universita' & I.N.F.N. Ferrara
> >  tel  +39-0532-781928  Ferrara
> > roma  +39-06-49914338
> > SLAC +001-650-926-2972
> >
> > "... e a un Dio 'fatti il culo' non credere mai..."
> > (F. De Andre')
> >
> > "He was turning over in his mind an intresting new concept in
> > Thau-dimensional physics which unified time, space, magnetism, gravity
> > and, for some reason, broccoli".  (T. Pratchett: "Pyramids")
> >
> > On Thu, 31 Oct 2002, Riccardo Faccini wrote:
> >
> > > Sorry Alessio,
> > > I am not sure I understand your proposal: you suggest producing the
> > > reduced root files and making chains over them? But these root files are only
> > > 3 times smaller than the default ones, so how can you claim we will need
> > > more than ten times fewer jobs?
> > > 	thanks
> > > 	ric
> > >
> > >
> > > On Thu, 31 Oct 2002, Alessio Sarti wrote:
> > >
> > > > Hi all,
> > > > I propose another workaround!
> > > >
> > > > I've produced the chains.
> > > > I can split them and run the skim job, which in less than 2 hours
> > > > produces 6-10 output files containing ALL the generic/data/cocktail info,
> > > > so that we no longer rely on hundreds of jobs running.....
> > > >
> > > > What do you think about that?
> > > > Alessio
> > > >
> > > > ______________________________________________________
> > > > Alessio Sarti     Universita' & I.N.F.N. Ferrara
> > > >  tel  +39-0532-781928  Ferrara
> > > > roma  +39-06-49914338
> > > > SLAC +001-650-926-2972
> > > >
> > > > "... e a un Dio 'fatti il culo' non credere mai..."
> > > > (F. De Andre')
> > > >
> > > > "He was turning over in his mind an intresting new concept in
> > > > Thau-dimensional physics which unified time, space, magnetism, gravity
> > > > and, for some reason, broccoli".  (T. Pratchett: "Pyramids")
> > > >
> > > > On Thu, 31 Oct 2002, Daniele del Re wrote:
> > > >
> > > > >
> > > > > Hi Yury,
> > > > >
> > > > >  one example is
> > > > >
> > > > >  ~daniele/scra/newchains_1030/data-2
> > > > >
> > > > >  and the typical message is
> > > > >
> > > > >  Error in <TFile::TFile>: file /nfs/farm/babar/AWG18/ISL/sx-080702/data/2000/output/outputdir/AlleEvents_2000_on-1095.root does not exist
> > > > >
> > > > >  On AWG8 this pathology happened just a few times, when there were >~300 jobs
> > > > > reading the same disk, if I remember correctly.
> > > > >
> > > > >  Do you know what the difference is between AWG8 and AWG18?
> > > > >
> > > > >  My proposal is to split things on different disks, if possible.
> > > > >
> > > > >  Thanks a lot,
> > > > >
> > > > >  Daniele
> > > > >
> > > > > On Thu, 31 Oct 2002, Yury G. Kolomensky wrote:
> > > > >
> > > > > > 	Hi Daniele,
> > > > > >
> > > > > > do you have an example of a log file for these jobs? I do not know
> > > > > > exactly what servers these disks have been installed on, but we
> > > > > > noticed in E158, where most of the data were sitting on one
> > > > > > (relatively slow) server, that jobs were limited by I/O throughput to
> > > > > > about 2 MB/sec. This limit comes from the random access pattern that
> > > > > > split ROOT trees produce. If your job is sufficiently fast, you can
> > > > > > saturate the I/O limit quite quickly -- with 2-3 jobs. If you submit
> > > > > > too many jobs (tens or even hundreds), the server will thrash to the
> > > > > > point that the clients receive NFS timeouts. ROOT usually does not
> > > > > > like that -- you may see error messages in the log file about files
> > > > > > not found (when the files are actually on disk), or about problems
> > > > > > uncompressing branches. These errors are usually more severe on Linux
> > > > > > clients, where the NFS client implementation is not very robust.
> > > > > >
> > > > > > There are several ways to cope with this problem:
> > > > > >
> > > > > > 1) Submit fewer jobs at one time. I would not submit more than 10
> > > > > >    I/O-limited jobs in parallel.
> > > > > > 2) Place your data on different servers -- ideally, different sulky
> > > > > >    servers. Even if you are on the same sulky server but split your
> > > > > >    data onto different partitions, you still get the benefit of
> > > > > >    parallelizing disk access.
> > > > > > 3) Re-write your jobs to first copy your data onto a local disk on the
> > > > > >    batch worker (for instance, /tmp), then run on the local copy, then
> > > > > >    delete the local copy (a sketch follows below). The benefit is that
> > > > > >    the cp command accesses the file in direct-access mode, with
> > > > > >    10-20 MB/sec throughput depending on the network interface.
> > > > > > 4) Make your ntuples non-split (very highly recommended). This usually
> > > > > >    increases the throughput by a factor of 10-20. If your typical job
> > > > > >    reads most of the branches of the tree, splitting the tree makes no
> > > > > >    sense. Non-split trees give direct access to disk, which is much
> > > > > >    more efficient.
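
A minimal sketch of option 3, the copy-run-delete pattern (the tree name and
paths are placeholders):

   // local_copy.C -- sketch: copy input to local scratch, run there, clean up
   #include "TSystem.h"
   #include "TString.h"
   #include "TFile.h"
   #include "TTree.h"

   void run_on_local_copy(const char *nfsfile)
   {
      TString local = TString("/tmp/") + gSystem->BaseName(nfsfile);
      gSystem->CopyFile(nfsfile, local, kTRUE);   // one sequential read over NFS

      TFile f(local);
      TTree *tree = (TTree*) f.Get("ntp1");       // placeholder tree name
      if (tree) {
         // ... loop over the tree entries here ...
      }
      f.Close();

      gSystem->Unlink(local);                     // free the local scratch space
   }
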
> > > > > >
> > > > > > 							Yury
> > > > > >
> > > > > >
> > > > > > On Thu, Oct 31, 2002 at 09:26:08AM -0800, Daniele del Re wrote:
> > > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > >  in the last two days I tried to run on data and MC on the new disk AWG18.
> > > > > > > No way. I got problems in 80% of the jobs. Some crashed, and most of
> > > > > > > them did not read a large number of root files (which are actually there).
> > > > > > >
> > > > > > >  This problem seems to be worse than ever. Do we have to contact
> > > > > > > computing people about this?
> > > > > > >
> > > > > > >  Daniele
> > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > >
> > >
> >
> >
> 

