Hi, maybe I got lost with my emails, but I could not find any reply to this thread. I think there was some email exchange at some point, but I am still missing the whole picture. So, could somebody please explain how to deal with memory exhaustion on the data servers?

More specifically, I'd also be interested to hear about:

- Is there a rough formula to compute, for a machine with N GB of RAM and M GB of swap, the maximum number of connections that can be tolerated? Performance issues aside, is it only the sum of physical memory and swap space that matters? Could somebody post a working configuration used at some site (maybe SLAC) to serve as a starting point?

- Related to the point above: in order to avoid having xrootd end up in a funny state, would it be possible/advisable to limit the number of incoming connections? What happens once the limit is hit? Do the extra incoming connections just wait (sent back to the load-balancer, I guess) without complaining? Would this cause too much traffic around the load-balancer at some point?

At CNAF we are planning a massive sparsification of files across our data servers for next week, to reduce the chance of a single machine being hit too hard, yet it would still be nice to improve our xrootd configuration. At CNAF we have ~20 machines each serving between 1 and 1.5 TB of BaBar data, and these have never given us trouble. Where we do have trouble is with the 2 "disk servers" (now three) serving 30 TB of data (8-12 TB each). These disk servers have, as Enrica already noted, 2 GB RAM and 4 GB swap each.

Thanks a lot for your help!

Ciao ciao,
Fulvio

Enrica Antonioli wrote:
> Hi all,
>
> we found a problem that seems to be a cause of failures of jobs
> submitted to the CNAF farm, similar to the one reported by Gregory
> Schott, but this case is related to a data server and not to a
> redirector.
>
> We have a disk server with 20 TB (full of collections), 2 GB of
> memory, and 4 GB of swap.
> After submitting jobs that need access to collections stored on this
> disk server (~250-300 jobs running), xrootd started to show problems
> in accessing data, as you can see from the following piece of xrdlog:
>
> [...]
> 050113 18:45:19 1260 XrootdXeq: User logged in as kflood.27447:[log in to unmask]
> 050113 18:45:48 1260 XrootdXeq: User logged in as kflood.6771:[log in to unmask]
> 050113 18:46:44 1260 XrootdXeq: User logged in as kflood.26481:[log in to unmask]
> 050113 18:46:46 1260 XrdScheduler: Unable to create worker thread; cannot allocate memory
> 050113 18:46:53 1260 XrdScheduler: Unable to create worker thread; cannot allocate memory
> 050113 18:46:53 1260 XrootdXeq: User logged in as kflood.19414:[log in to unmask]
> 050113 18:47:25 1260 XrdScheduler: Unable to create worker thread; cannot allocate memory
> 050113 18:47:33 1260 XrdScheduler: Unable to create worker thread; cannot allocate memory
> [...]
>
> This message is repeated continuously and xrootd no longer answers;
> the only way to recover is to restart the olbd service.
>
> This is what top shows now that everything seems OK (at the moment I
> am not able to post the output of top from when the problem was
> present):
>
> CPU states:  cpu    user   nice  system    irq  softirq  iowait    idle
>     total    0.2%   0.0%    1.9%   0.2%     1.3%   25.0%   71.3%
>     cpu00    0.0%   0.0%    2.0%   1.0%     3.0%   15.4%   78.6%
>     cpu01    0.2%   0.0%    1.6%   0.0%     0.6%   35.0%   62.6%
>     cpu02    0.0%   0.0%    2.0%   0.0%     0.2%   17.6%   80.2%
>     cpu03    0.6%   0.0%    2.2%   0.0%     1.4%   32.0%   63.8%
> Mem:  2061104k av, 2043568k used,   17536k free,  0k shrd,  77072k buff
>       1549820k actv, 191940k in_d,  30736k in_c
> Swap: 4096532k av,       0k used, 4096532k free            1683772k cached
>
> We also read /var/log/messages, but we did not find anything related
> to the moment the problem appeared the first time.
>
> Do you have any suggestions or ideas?
>
> Cheers,
> Enrica
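A partial answer to the "rough formula" question, sketched below: if xrootd runs roughly one worker thread per active connection, then each connection costs at least one thread-stack reservation plus some heap for buffers, and the ceiling is whatever memory the process can actually obtain. Note that on a 32-bit, 2.4-era Linux box the ~3 GB per-process address space, not RAM + swap, is often the binding limit; that would be consistent with Enrica's log, where thread creation fails with "cannot allocate memory" while top still shows 0k of swap in use. The per-thread stack, per-connection heap, and OS-headroom figures below are illustrative assumptions, not values documented by xrootd:

```python
# Back-of-envelope estimate of how many client connections a data server
# can hold before thread creation starts failing with out-of-memory errors.
# Assumes roughly one worker thread per active connection; all per-connection
# figures are illustrative guesses, not xrootd-documented values.

def max_connections(ram_gb, swap_gb,
                    thread_stack_mb=10.0,   # assumed per-thread stack reservation
                    heap_per_conn_mb=1.0,   # assumed per-connection buffers/heap
                    reserved_gb=0.5):       # assumed OS + page-cache headroom
    """Rough ceiling on concurrent connections, one worker thread each."""
    usable_mb = (ram_gb + swap_gb - reserved_gb) * 1024
    per_conn_mb = thread_stack_mb + heap_per_conn_mb
    return int(usable_mb // per_conn_mb)

# The CNAF disk servers: 2 GB RAM + 4 GB swap
print(max_connections(2, 4))   # -> 512 with these assumed numbers
```

With ~250-300 jobs hitting one server, a few hundred connections is exactly the regime where such a machine runs out of room, so the numbers are at least plausible. If the xrootd scheduler can be told to cap its worker-thread count (check the configuration reference for your version), excess requests would queue instead of killing the server, which is likely the safer failure mode.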