Hi, Artem

As always, your emails are full of useful information. Thanks! :-)

> At our site and probably many others, we use 100 Mb/s from WN to rack
> switch, and 1 Gb/s from rack to core router. So, given that we have, say,
> 30 machines per rack, that makes 120 CPU cores per 1 Gb/s link. That would
> provide 1 MB/s per core.
>
> Then, what can I say in response to the 4 MB/s requirement? I don't
> foresee a massive switch to 1 Gb/s WN connections in the next two years.

The new systems we buy will for sure have at least one Gb Ethernet port
(most current rack-based servers come with Gb Ethernet nowadays anyway). I
guess that we will end up with 4 cores per WN (e.g. two dual-core Opterons
per node).
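
To spell out the arithmetic in Python (taking 125 MB/s per Gb/s, ignoring
protocol overhead, and using the 30 WNs per rack and 4 cores per WN from
above):

# Per-core bandwidth when a number of WNs share one uplink.
def mb_per_core(uplink_gbps, wns_sharing, cores_per_wn=4):
    return uplink_gbps * 125.0 / (wns_sharing * cores_per_wn)

# 1 Gb/s rack uplink shared by 30 WNs (your current setup):
print("shared rack uplink: %.1f MB/s per core" % mb_per_core(1.0, 30))
# Gb Ethernet on the WN itself, shared only by its own 4 cores:
print("Gb per WN:          %.1f MB/s per core" % mb_per_core(1.0, 1))

That reproduces the ~1 MB/s per core for the shared rack uplink; with Gb on
each WN the NIC alone would allow ~31 MB/s per core, and the rack uplink
becomes the limiting factor instead, so it is again a topology question.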

> One should also think about how many servers one has to have in order to
> serve clients at such a rate. Jean-Yves' measurements are: 30 Mb/s per
> server for random I/O. Are you buying something like one server per 4-core
> WN? :) Or are you attaching 6 TB to each WN and running PROOF instead of
> batch on your machines? :) (Not a bad idea for analysis :) )

We will buy file servers with attached RAID arrays. I expect that one such
server will have 6-8 TB. In this procurement cycle we will need about 80 TB,
so this would be 10-12 servers for ~130 CPU cores. The requirement is easy
to fulfill for this smaller cluster, but for ~1200 cores and 800 TB it
becomes more problematic if one has to assume stupid job placement and
stupid scheduling, i.e. if every worker node needs the required access to
every part of the storage under any combination of access patterns. That
calls for the right network topology.
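
For concreteness, here is that sizing arithmetic for both scales in Python,
assuming ~7 TB per file server (the middle of the 6-8 TB range) and the
4 MB/s per core target; whether each server can actually sustain its share
is exactly what the benchmark has to establish:

def sizing(label, cores, storage_tb, tb_per_server=7.0, mb_s_per_core=4.0):
    servers = storage_tb / tb_per_server
    aggregate = cores * mb_s_per_core            # MB/s demanded by all jobs
    print("%s: ~%d servers, %d MB/s aggregate, ~%d MB/s per server"
          % (label, round(servers), aggregate, aggregate / servers))

sizing("this procurement (130 cores, 80 TB)", 130, 80.0)
sizing("target system (1200 cores, 800 TB)", 1200, 800.0)

The per-server rate comes out at roughly 40-50 MB/s in both cases; what
changes is the almost 5 GB/s aggregate that has to cross the network in
arbitrary WN-to-server combinations, which is why the topology matters.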

If it were possible, I would prefer a solution ideally suited to PROOF (as
you mentioned). But since we have to cater to ATLAS/CMS/LHCb and to a range
of use cases, that cannot be done, and we have to assume a chaotic access
pattern.

>
> So, when you talk to your vendors, don't insult them by saying that their
> systems are "not good enough" for some experiments. :)
>
> By the way, in practically all CMS applications (ORCA) I've seen input
> rates of about 0.5 MB/s. Perhaps in the new EDM it will be better. This is
> also why the experiments' requirements are not taken too seriously :)

I collected numbers from people running applications in ATLAS/CMS/LHCb and
got ~2 MB/s as the peak from all of them. I could see the same peak values
in the ASAP monitoring from Julia and Juha. After calculating and playing
with a simple model, I concluded that 4 MB/s should be realistic for a
system of the targeted size (800 TB, about 1200 cores).

At a Tier1/2 meeting at FZK I spoke to J. van Wezel, a member of the HEPIX
storage task force, and he said that they used 2 MB/s.
(BTW, they provide a useful document:
http://grid.fzk.de/publication/HEPiX_GDB_STF_v1.0.0.0-1.pdf. As always, one
stumbles over such information more or less by accident :-) )

It would be useful to share more information about procurements between 
centers. We already decided to do that with FZK.


Thanks,
Derek



>
> Artem.
>
> On Mon, 3 Apr 2006, Derek Feichtinger wrote:
> > Hi, Fabrizio
> >
> > I need to ensure that _every_ CPU core (i.e. every job) can read at a
> > steady rate from the storage space made up by the file servers. 4 MB/s
> > per job is what we will require, but the CMS TDR has set higher values.
> > These, however, seem to be ignored by most centers as over the top. A
> > discussion with a member of the storage task force seems to indicate
> > that most centers go for a 2 MB/s per job rate.
> >
> > XrdMon will be nice for measuring the complete xrootd system's
> > performance, but I need to define an easily measurable and well-defined
> > procedure which can be used for the bidding companies. I could certainly
> > use xrootd to compose such a procedure, and I was initially thinking
> > about that. I just want to get an overview of what has been used by
> > others.
> >
> > If I were to use xrootd, I would try to get the clients to read at a
> > given rate and then look at the server statistics. I would gradually
> > increase the specified rate and compare it with the actually measured
> > rate.
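
[Illustration: a minimal rate-controlled reader along those lines, in
Python. It throttles plain sequential reads to a target rate and reports
what was actually achieved; it assumes the data is reachable through an
ordinary file path (e.g. a POSIX-style mount), whereas the real test would
use the xrootd client, but the ramp-and-compare logic is the same. One
instance would run per job slot, with the target raised until the achieved
rate falls behind.]

import sys, time

def throttled_read(path, target_mb_s, block_kb=1024):
    """Read 'path' sequentially at up to target_mb_s; return achieved MB/s."""
    block = block_kb * 1024
    interval = block / (target_mb_s * 1e6)    # seconds per block at the target
    total = 0
    start = time.time()
    f = open(path, "rb")
    while True:
        t0 = time.time()
        data = f.read(block)
        if not data:
            break
        total += len(data)
        spare = interval - (time.time() - t0)
        if spare > 0:
            time.sleep(spare)                 # throttle to the target rate
    f.close()
    elapsed = time.time() - start
    return total / 1e6 / elapsed

if __name__ == "__main__":
    path, target = sys.argv[1], float(sys.argv[2])
    print("target %.1f MB/s, achieved %.2f MB/s"
          % (target, throttled_read(path, target)))
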
> >
> > The nice thing about using xrootd, naturally, is that it is one of the
> > tools used in the real case.
> >
> > BTW: A good link about measurement techniques (but not containing the
> > definitive answer to my problem) is here:
> > http://dast.nlanr.net/NPMT/
> >
> > Thanks,
> > Derek
> >
> > On Monday 03 April 2006 17.59, Fabrizio Furano wrote:
> > > Hi Derek,
> > >
> > > Derek Feichtinger wrote:
> > > > Hi,
> > > >
> > > > This is slightly off-topic, but nonetheless important for the setup
> > > > of large direct-attached storage systems typically used with xrootd.
> > > > Maybe some of you have good suggestions or experiences.
> > >
> > >   Well, I don't know your requirements exactly, but wouldn't it be
> > > sufficient to look at the traffic by averaging the data seen by each
> > > client after the file close?
> > >
> > >   Another (better) way could be to set up XrdMon. Why not?
> > >
> > >
> > > Fabrizio
> > >
> > > > For the next upgrade of our Tier2 I need a benchmark with which I
> > > > can measure whether I can satisfy an I/O requirement per worker node
> > > > (WN, or CPU core). This has to be tested while all WNs are reading
> > > > in parallel from all file servers. I just want to assume that the
> > > > clients on the WNs are reading in a nicely distributed fashion from
> > > > the file servers, e.g. in the case of 10 file servers and 150 WNs, I
> > > > would assume that on average 15 WNs are reading at the same time
> > > > from any given file server. But any combination of 15 WNs must be
> > > > able to yield the desired bandwidth.
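
[Illustration: one way to exercise "any combination" in practice, in
Python: for each trial, shuffle the WNs and deal them out to the file
servers 15 apiece, so that repeated trials cover many different WN/server
combinations. The WN and server names are placeholders.]

import random

wns = ["wn%03d" % i for i in range(150)]      # 150 worker nodes
servers = ["fs%02d" % i for i in range(10)]   # 10 file servers

def random_assignment(wns, servers):
    """Shuffle the WNs and split them evenly across the servers."""
    pool = list(wns)
    random.shuffle(pool)
    share = len(pool) // len(servers)         # 15 WNs per server here
    return dict((s, pool[i * share:(i + 1) * share])
                for i, s in enumerate(servers))

for trial in range(5):                        # a handful of independent trials
    plan = random_assignment(wns, servers)
    # every WN in plan[s] would now read from server s at the target rate
    print("trial %d: %s serves %d WNs, e.g. %s"
          % (trial, servers[0], len(plan[servers[0]]), plan[servers[0]][0]))
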
> > > >
> > > > Naturally, this benchmark is targeted at mimicking a cluster running
> > > > analysis applications.
> > > >
> > > > A primitive test (but not exactly matching the use case) could be to
> > > > use netperf or iperf in UDP mode, e.g. the file servers would
> > > > receive packets from the required fraction of worker nodes (the
> > > > sending intervals and packet sizes can be set in netperf). One would
> > > > gradually increase the sending rate per worker node until UDP packet
> > > > loss is observed.
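
[Illustration: how such a UDP ramp could be driven from each WN with iperf,
in Python. It assumes iperf (version 2) is installed and an "iperf -s -u"
server is already running on each file server; the host name, rate steps
and the 1% loss cut-off are arbitrary placeholders.]

import subprocess

def udp_loss_percent(server, rate_mbps, seconds=10):
    """Offer UDP traffic at rate_mbps for a while; return reported loss in %."""
    out = subprocess.run(
        ["iperf", "-c", server, "-u", "-b", "%dM" % rate_mbps,
         "-t", str(seconds)],
        capture_output=True, text=True).stdout
    # The server report line ends with something like "   12/ 8505 (0.14%)".
    for line in out.splitlines():
        if "(" in line and "%" in line:
            return float(line.split("(")[-1].split("%")[0])
    return None

server = "fileserver01"                       # placeholder host name
for rate in (10, 20, 40, 60, 80, 100):        # offered Mb/s per WN
    loss = udp_loss_percent(server, rate)
    print("offered %3d Mb/s -> reported loss: %s%%" % (rate, loss))
    if loss is not None and loss > 1.0:       # stop once real loss shows up
        break
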
> > > >
> > > > I'd be glad for any suggestions.
> > > >
> > > > Cheers,
> > > > Derek
> >
> > --
> > Dr. Derek Feichtinger                   Tel:   +41 56 310 47 33
> > AIT Group                               email: [log in to unmask]
> > PSI                                     http://people.web.psi.ch/feichtinger
> > CH-5232 Villigen PSI

-- 
Dr. Derek Feichtinger                   Tel:   +41 56 310 47 33
AIT Group                               email: [log in to unmask]
PSI                                     http://people.web.psi.ch/feichtinger
CH-5232 Villigen PSI