Hi,
maybe I got lost among my emails, but I could not find any reply to
this thread. I think there was some email exchange at some point, but I
am still missing the whole picture.
So, could somebody please explain how to deal with memory exhaustion
on the data servers?
More specifically, I'd also be interested to hear about the following:
- is there a rough formula to compute, for a machine with N GB of RAM
and M GB of swap, the maximum number of connections that can be
tolerated? Performance issues aside, is it only the sum of physical
memory and swap space that matters? Could somebody post a working
configuration used at some site (maybe SLAC) to serve as a starting
point?
- related to the point above: in order to avoid having xrootd get into
a funny state, would it be possible/advisable to limit the number of
incoming connections? What would happen once the limit is hit? Would
the extra incoming connections just wait (sent back to the
load-balancer, I guess) without complaining? Would this cause too much
traffic around the load-balancer at some point?
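On the first question, here is the kind of back-of-envelope estimate I have in mind. The per-connection figures below are pure assumptions on my part (one worker-thread stack, which on Linux often defaults to around 8 MB, plus some buffer overhead), not measured xrootd numbers; corrections welcome:

```python
# Back-of-envelope estimate: how many connections fit before thread
# creation starts failing?  All per-connection costs are ASSUMPTIONS.

RAM_GB = 2.0       # physical memory (N)
SWAP_GB = 4.0      # swap space (M)
STACK_MB = 8.0     # assumed default pthread stack size per worker thread
BUFFERS_MB = 1.0   # assumed per-connection buffer overhead

total_mb = (RAM_GB + SWAP_GB) * 1024   # total backing store, in MB
per_conn_mb = STACK_MB + BUFFERS_MB    # assumed cost of one connection

max_connections = int(total_mb / per_conn_mb)
print(max_connections)  # -> 682 with these assumptions
```

With these numbers a 2 GB + 4 GB machine would top out well under a thousand connections, which seems roughly consistent with the failures we see around 250-300 jobs if each job opens several connections.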
At CNAF we are planning, for next week, a massive sparsification of
files across our data servers to reduce the chance of a single machine
being hit too hard, but it would still be nice to improve our xrootd
configuration as well.
At CNAF we have ~20 machines each serving between 1 and 1.5 TB of
BaBar data, and these have never given us trouble. Where we do have
trouble is with the 2 "diskservers" (now three) serving 30 TB of data
(8-12 TB each). These diskservers have, as Enrica already noted, 2 GB
of RAM and 4 GB of swap each.
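Incidentally, the "Unable to create worker thread; cannot allocate memory" errors in the log quoted below look like address space being exhausted by thread stacks. One thing we are considering trying (just a sketch, and "start-xrootd.sh" is a placeholder for however xrootd is launched at each site) is shrinking the default stack size before starting the daemon:

```shell
# Lower the per-thread stack size before launching the daemon, so that
# N worker threads consume N x 1 MB of address space instead of N x 8 MB.
# The 1 MB value is a guess; too small a stack can crash the server.
ulimit -s 1024    # set soft stack limit to 1 MB
ulimit -s         # verify: prints 1024
# ./start-xrootd.sh   <- placeholder for the site's actual start script
```

Has anybody tried something along these lines, or does xrootd set its own thread stack size internally?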
Thanks a lot for your help!
Ciao ciao
Fulvio
Enrica Antonioli wrote:
> Hi all,
>
> we found a problem that seems to be causing failures of jobs submitted
> to the CNAF farm, similar to the one reported by Gregory Schott, but in
> this case it involves a data server rather than a redirector.
>
> We have a disk server with 20 TB of data (full of collections), 2 GB of
> memory, and 4 GB of swap.
> After a submission of jobs needing access to collections stored on
> this disk server (~250-300 jobs running), xrootd started to show
> problems accessing data, as you can see from the following piece of the
> xrdlog:
>
> [...]
> 050113 18:45:19 1260 XrootdXeq: User logged in as
> kflood.27447:[log in to unmask]
> 050113 18:45:48 1260 XrootdXeq: User logged in as
> kflood.6771:[log in to unmask]
> 050113 18:46:44 1260 XrootdXeq: User logged in as
> kflood.26481:[log in to unmask]
> 050113 18:46:46 1260 XrdScheduler: Unable to create worker thread ;
> cannot allocate memory
> 050113 18:46:53 1260 XrdScheduler: Unable to create worker thread ;
> cannot allocate memory
> 050113 18:46:53 1260 XrootdXeq: User logged in as
> kflood.19414:[log in to unmask]
> 050113 18:47:25 1260 XrdScheduler: Unable to create worker thread ;
> cannot allocate memory
> 050113 18:47:33 1260 XrdScheduler: Unable to create worker thread ;
> cannot allocate memory
> [...]
>
> This message is repeated continuously and xrootd no longer answers;
> the only way to recover is to restart the olbd service.
>
> This is what top shows now that everything seems ok (at the moment I'm
> not able to post the result of top when the problem is present):
>
> CPU states: cpu user nice system irq softirq iowait idle
> total 0.2% 0.0% 1.9% 0.2% 1.3% 25.0% 71.3%
> cpu00 0.0% 0.0% 2.0% 1.0% 3.0% 15.4% 78.6%
> cpu01 0.2% 0.0% 1.6% 0.0% 0.6% 35.0% 62.6%
> cpu02 0.0% 0.0% 2.0% 0.0% 0.2% 17.6% 80.2%
> cpu03 0.6% 0.0% 2.2% 0.0% 1.4% 32.0% 63.8%
> Mem: 2061104k av, 2043568k used, 17536k free, 0k shrd, 77072k buff
> 1549820k actv, 191940k in_d, 30736k in_c
> Swap: 4096532k av, 0k used, 4096532k free 1683772k cached
>
> We also checked /var/log/messages, but we didn't find anything related
> to the time the problem first appeared.
>
> Do you have any suggestions or ideas?
>
> Cheers,
> Enrica
>
>