Hi Florian,

There are several issues here.

1) More requests are coming to the redirector than can be dispatched. It 
appears that you average somewhere around 5/second. That means you likely 
have peaks substantially greater. If you run a single redirector, perhaps 
it's time to run two redirectors. The clients will automatically load 
balance themselves between the two. The simplest way to do this is to assign 
both IP addresses to the same DNS name. You should *not* use a load-balancing 
DNS to do this. The second IP address is simply added as another 
interface. Anyway, hitting the thread limit is not fatal. It just means that 
your load has increased to the point that the redirector can't efficiently 
handle the incoming traffic. You can get additional headroom by specifying:

xrd.timeout read 1

Please specify this *only* for the redirector as it will likely have 
negative effects for data servers. This will give you up to 80% improvement 
in thread turn-over. Mind you, we haven't really tried using this tuning 
option for this problem.
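
To see how far the peaks exceed the average, one could count the redirect entries per second in the redirector log. The following is only a rough sketch: the function name redirect_rates is made up for illustration, and it assumes each log entry starts with date, time, and thread id followed by "odc_send2Man:", as in the excerpt you sent. The average is taken only over seconds that contain at least one redirect.

```python
from collections import Counter


def redirect_rates(lines):
    """Return (average, peak) redirects per second.

    Assumes each entry looks like
    '080422 00:10:41 27010 odc_send2Man: ...' (date, time, thread id, tag).
    Only seconds with at least one redirect count toward the average.
    """
    per_second = Counter()
    for line in lines:
        parts = line.split()
        # Skip anything that is not a redirect entry (other messages, wrapped lines).
        if len(parts) >= 4 and parts[3].startswith("odc_send2Man"):
            per_second[(parts[0], parts[1])] += 1
    if not per_second:
        return 0.0, 0
    total = sum(per_second.values())
    return total / len(per_second), max(per_second.values())
```

Feeding it the open log file, e.g. redirect_rates(open(path)) with path pointing at your redirector log, gives both numbers in one pass.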

2) The more serious problem is indicated by the message:

080422 03:04:21 001 XrdAccept: Unable to perform accept.; too many open 
files in system

This is due to one of two reasons: a) the redirector's file descriptor limit 
is set too low (the actual number will be written to the log upon start-up), 
b) the overall system limit was exceeded. I doubt that (b) is the problem. 
So, assuming tcsh, a limit -h will show what the default "descriptors" hard 
limit is (the server tries to use the hard value). If it's below 16K you 
should ask your admins to increase it to at least 16K (32K will likely mean you 
won't have to worry about it for some time).
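
If you would rather check the limit from a script than from tcsh, something like the sketch below should work on a Unix host. It uses Python's standard resource module. The helper names and the 16K default threshold are only illustrative, and it reports the limits of the process running the script, which match the redirector's only if both are started from the same account and environment.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) file descriptor limits of this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def is_enough(hard, minimum=16 * 1024):
    """True if the hard limit is unlimited or at least `minimum` descriptors."""
    return hard == resource.RLIM_INFINITY or hard >= minimum


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print("soft limit:", soft)
    print("hard limit:", hard)
    if not is_enough(hard):
        print("hard limit is below 16K; ask the admins to raise it")
```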

3) Unless you really need to trace redirects, I normally recommend turning it 
off. It serializes the processing path and generally slows things down.

Let me know how it goes.

Andy


----- Original Message ----- 
From: "Florian Bernlochner" <[log in to unmask]>
To: <[log in to unmask]>
Cc: <[log in to unmask]>
Sent: Monday, April 28, 2008 2:23 PM
Subject: load problem with redirector at GridKa


> Hi there,
>
> since last week I observe some strange load related problem with the
> xrootd cluster at GridKa. During fairly 'normal' load [1] the
> redirector reaches its Thread limit [2]. I set it to maximum but the
> problems reoccurred several times [3]. It almost seems, that the
> scheduler does not flush the threads in time, but I guess you might have
> more insight on this issue. Well feel free to share any kind of advice
> or wisdom :-)
> The cluster has 12 servers (2 in staging mode) and all run on 
> 20071101-0808p1.
>
> Florian
>
>
>
> [1] # of redirs per 10 minute
> http://iktp.tu-dresden.de/~petzold/work/GridKa/xrootd/redirs.png
> # of logins per 10 minute
> http://iktp.tu-dresden.de/~petzold/work/GridKa/xrootd/logins.png
>
> [2] Log:
>
> 080422 00:10:41 27010 odc_send2Man:
> mcprod.2729:[log in to unmask] redirected to
> f01-010-105.gridka.de:1094 by l01-001-110
> path=/store/cfg/2008/03/CfgDB-20080328T123723.root
> 080422 00:10:42 27010 XrdScheduler: Thread limit has been reached!
> 080422 00:10:42 27010 odc_send2Man:
> mcprod.9859:[log in to unmask] redirected to
> f01-010-105.gridka.de:1094 by l01-001-110
> path=/store/cfg/2008/03/CfgDB-20080328T123723.root
> 080422 00:10:42 27010 odc_send2Man:
> mcprod.907:[log in to unmask] redirected to
> f01-010-105.gridka.de:1094 by l01-001-110
> path=/store/cfg/2008/03/CfgDB-20080328T123723.root
> 080422 03:04:21 001 XrdAccept: Unable to perform accept.; too many
> open files in system
> 080422 03:04:21 001 XrdAccept: Unable to perform accept.; too many
> open files in system
> 080422 03:04:21 001 XrdAccept: Unable to perform accept.; too many
> open files in system
> 080422 03:04:21 001 XrdAccept: Unable to perform accept.; too many
> open files in system
>
> [3] Redirector Config:
>
> olb.allow host babar*.gridka.de
> olb.allow host f01-*.gridka.de
> olb.path s /store
> olb.path w /prod
> olb.path w /gaffertape
> olb.port 3121
> olb.sched cpu 100
> olb.wait
>
> xrd.sched mint 8 maxt 4095 avlt 512 idle 780
>
> xrootd.export /prod
> xrootd.export /store
> xrootd.export /gaffertape
> xrootd.fslib /home/xrootd/software/current/lib/libXrdOfs.so
>
> odc.trace redirect
> #olb.trace all
> #oss.trace all
> #xrd.trace all
> #xrootd.trace all
>
>
>
> -- 
> ------------------------------------------------------------------
> Humboldt-Universität zu Berlin
> Department of Physics
> BaBar, Prof. H. Lacker and Prof. M. Kobel
> Newtonstr. 15, 12489 Berlin, Germany
> Web: slac.stanford.edu/babar
> ------------------------------------------------------------------
>
>