Hi all,

we found a problem that seems to be a cause of the failures of jobs
submitted to the CNAF farm, similar to the one reported by Gregory
Schott, but this case is related to a data server rather than to a
redirector.

We have a disk server with 20 TB of disk (full of collections), 2 GB of
memory and 4 GB of swap.
After submitting jobs that need access to collections stored on this
disk server (~250-300 jobs running), xrootd started to show problems
accessing the data, as you can see from the following excerpt of the
xrdlog:

[...]
050113 18:45:19 1260 XrootdXeq: User logged in as 
kflood.27447:[log in to unmask]
050113 18:45:48 1260 XrootdXeq: User logged in as 
kflood.6771:[log in to unmask]
050113 18:46:44 1260 XrootdXeq: User logged in as 
kflood.26481:[log in to unmask]
050113 18:46:46 1260 XrdScheduler: Unable to create worker thread ; 
cannot allocate memory
050113 18:46:53 1260 XrdScheduler: Unable to create worker thread ; 
cannot allocate memory
050113 18:46:53 1260 XrootdXeq: User logged in as 
kflood.19414:[log in to unmask]
050113 18:47:25 1260 XrdScheduler: Unable to create worker thread ; 
cannot allocate memory
050113 18:47:33 1260 XrdScheduler: Unable to create worker thread ; 
cannot allocate memory
[...]

This message is repeated continuously and xrootd no longer answers; the
only way to recover is to restart the olbd service.
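Our guess (only an assumption on our side, not verified) is that xrootd
uses a worker thread per active client, that each thread reserves its
whole stack up front, and that with ~250-300 jobs the 2 GB of this
machine is not enough, so thread creation starts failing with exactly
the message above. Here is a minimal sketch of that failure mode,
completely outside xrootd (the 8 MB stack and the 300 threads are just
illustrative numbers, not values taken from our server):

/* Sketch (not xrootd code): shows how pthread_create starts failing
 * once the per-thread stacks have eaten the available memory or
 * address space.  Stack size and thread count are assumptions.     */
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void *worker(void *arg)
{
    (void)arg;
    pause();                 /* park the thread like an idle client handler */
    return NULL;
}

int main(void)
{
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    /* Hypothetical 8 MB stack per thread: 300 threads reserve about
     * 2.4 GB of virtual memory, which may be more than a box like
     * ours can hand out.                                            */
    pthread_attr_setstacksize(&attr, 8 * 1024 * 1024);

    for (int i = 0; i < 300; i++) {
        pthread_t tid;
        int rc = pthread_create(&tid, &attr, worker, NULL);
        if (rc != 0) {       /* same symptom as in the xrdlog */
            fprintf(stderr, "thread %d: cannot be created (%s)\n",
                    i, strerror(rc));
            break;
        }
    }
    return 0;
}

If this is really what happens, maybe lowering the stack limit
(ulimit -s) in the xrootd start script could be a workaround, since as
far as we know the pthread library takes the default thread stack size
from that limit, but we would like to hear from the experts before
touching anything.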

This is what top shows now that everything seems OK (at the moment I'm
not able to post the output of top from when the problem was present):

CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
            total    0.2%    0.0%    1.9%   0.2%     1.3%   25.0%   71.3%
            cpu00    0.0%    0.0%    2.0%   1.0%     3.0%   15.4%   78.6%
            cpu01    0.2%    0.0%    1.6%   0.0%     0.6%   35.0%   62.6%
            cpu02    0.0%    0.0%    2.0%   0.0%     0.2%   17.6%   80.2%
            cpu03    0.6%    0.0%    2.2%   0.0%     1.4%   32.0%   63.8%
Mem:  2061104k av, 2043568k used,   17536k free, 0k shrd, 77072k buff
                    1549820k actv,  191940k in_d, 30736k in_c
Swap: 4096532k av,       0k used, 4096532k free  1683772k cached
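Next time the problem shows up we will try to record the xrootd memory
numbers directly from /proc instead of top, with something as simple as
the following (our own little helper, nothing official; the choice of
the Vm* lines is just what we think would be useful to capture):

/* Dump the Vm* lines from /proc/<pid>/status for a given pid, so we
 * can record the xrootd memory footprint the moment the errors start. */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    char path[64], line[256];
    FILE *f;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid of xrootd>\n", argv[0]);
        return 1;
    }
    snprintf(path, sizeof(path), "/proc/%s/status", argv[1]);
    f = fopen(path, "r");
    if (!f) {
        perror(path);
        return 1;
    }
    while (fgets(line, sizeof(line), f)) {
        /* keep only the virtual-memory accounting lines */
        if (strncmp(line, "Vm", 2) == 0)
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}

We would run it every minute while the jobs are going through and keep
the output together with the xrdlog.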

We also checked /var/log/messages, but we didn't find anything related
to the time the problem first appeared.

Do you have any suggestions or ideas?

Cheers,
Enrica