Hi, maybe I got lost with my emails, but I could not find any reply to this thread. I think there was some email exchange at some point, but I am still missing the whole picture. So, could somebody please explain how to deal with memory exhaustion on the data servers?

More specifically, I'd also be interested to hear about:

- Is there a rough formula to compute, for a machine with N GB of RAM and M GB of swap, the maximum number of connections that can be tolerated? Performance issues aside, is it only the sum of physical memory and swap space that matters? Could somebody post a working configuration used at some site (maybe SLAC) to serve as a starting point?

- Related to the point above: in order to avoid having xrootd end up in a funny state, would it be possible/advisable to limit the number of incoming connections? What happens once the limit is hit? Do the extra incoming connections just wait (sent back to the load-balancer, I guess) without complaining? Would this cause too much traffic around the load-balancer at some point?

At CNAF we are planning a massive sparsification of files across our data servers for next week, to reduce the chance of a single machine being hit too hard, yet it would still be nice to improve our xrootd configuration. At CNAF we have ~20 machines each serving between 1 and 1.5 TB of BaBar data, and these have never given us trouble. Where we do have trouble is with the 2 "disk servers" (now three) serving 30 TB of data (8-12 TB each). These disk servers have, as Enrica already noted, 2 GB RAM and 4 GB swap each.

Thanks a lot for your help!

Ciao ciao,
Fulvio

Enrica Antonioli wrote:
> Hi all,
>
> we found a problem that seems to be a cause of failures of jobs
> submitted to the CNAF farm, similar to the one reported by Gregory
> Schott, but this case is related to a data server and not to a
> redirector.
>
> We have a disk server with 20 TB (full of collections), 2 GB of
> memory, and 4 GB of swap.
> After submitting jobs that need access to collections stored on this
> disk server (~250-300 jobs running), xrootd started to show problems
> in accessing data, as you can see from the following piece of xrdlog:
>
> [...]
> 050113 18:45:19 1260 XrootdXeq: User logged in as kflood.27447:[log in to unmask]
> 050113 18:45:48 1260 XrootdXeq: User logged in as kflood.6771:[log in to unmask]
> 050113 18:46:44 1260 XrootdXeq: User logged in as kflood.26481:[log in to unmask]
> 050113 18:46:46 1260 XrdScheduler: Unable to create worker thread; cannot allocate memory
> 050113 18:46:53 1260 XrdScheduler: Unable to create worker thread; cannot allocate memory
> 050113 18:46:53 1260 XrootdXeq: User logged in as kflood.19414:[log in to unmask]
> 050113 18:47:25 1260 XrdScheduler: Unable to create worker thread; cannot allocate memory
> 050113 18:47:33 1260 XrdScheduler: Unable to create worker thread; cannot allocate memory
> [...]
>
> This message is repeated continuously and xrootd no longer answers;
> the only way to recover is to restart the olbd service.
>
> This is what top shows now that everything seems OK (at the moment I
> am not able to post the output of top from when the problem was
> present):
>
> CPU states:  cpu    user   nice  system    irq  softirq  iowait    idle
>     total    0.2%   0.0%    1.9%   0.2%     1.3%   25.0%   71.3%
>     cpu00    0.0%   0.0%    2.0%   1.0%     3.0%   15.4%   78.6%
>     cpu01    0.2%   0.0%    1.6%   0.0%     0.6%   35.0%   62.6%
>     cpu02    0.0%   0.0%    2.0%   0.0%     0.2%   17.6%   80.2%
>     cpu03    0.6%   0.0%    2.2%   0.0%     1.4%   32.0%   63.8%
> Mem:  2061104k av, 2043568k used,   17536k free,  0k shrd,  77072k buff
>       1549820k actv, 191940k in_d,  30736k in_c
> Swap: 4096532k av,       0k used, 4096532k free            1683772k cached
>
> We also read /var/log/messages, but we did not find anything related
> to the moment the problem appeared the first time.
>
> Do you have any suggestions or ideas?
>
> Cheers,
> Enrica
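A partial answer to the "rough formula" question, sketched below: if xrootd runs roughly one worker thread per active connection, then each connection costs at least one thread-stack reservation plus some heap for buffers, and the ceiling is whatever memory the process can actually obtain. Note that on a 32-bit, 2.4-era Linux box the ~3 GB per-process address space, not RAM + swap, is often the binding limit; that would be consistent with Enrica's log, where thread creation fails with "cannot allocate memory" while top still shows 0k of swap in use. The per-thread stack, per-connection heap, and OS-headroom figures below are illustrative assumptions, not values documented by xrootd:

```python
# Back-of-envelope estimate of how many client connections a data server
# can hold before thread creation starts failing with out-of-memory errors.
# Assumes roughly one worker thread per active connection; all per-connection
# figures are illustrative guesses, not xrootd-documented values.

def max_connections(ram_gb, swap_gb,
                    thread_stack_mb=10.0,   # assumed per-thread stack reservation
                    heap_per_conn_mb=1.0,   # assumed per-connection buffers/heap
                    reserved_gb=0.5):       # assumed OS + page-cache headroom
    """Rough ceiling on concurrent connections, one worker thread each."""
    usable_mb = (ram_gb + swap_gb - reserved_gb) * 1024
    per_conn_mb = thread_stack_mb + heap_per_conn_mb
    return int(usable_mb // per_conn_mb)

# The CNAF disk servers: 2 GB RAM + 4 GB swap
print(max_connections(2, 4))   # -> 512 with these assumed numbers
```

With ~250-300 jobs hitting one server, a few hundred connections is exactly the regime where such a machine runs out of room, so the numbers are at least plausible. If the xrootd scheduler can be told to cap its worker-thread count (check the configuration reference for your version), excess requests would queue instead of killing the server, which is likely the safer failure mode.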