Are you running a lot of checksums through that server? If so, there is a
known memory leak issue in that release.

On Wed, 17 Mar 2021, Kenyi Hurtado wrote:

> When using XRootD 4.12.6-1 (OSG version) on our XRootD data servers, we see OOM issues. This doesn't happen with 4.11, so we reverted to that version.
>
> ```
> (primeradiant06)
> [Mon Mar 8 13:16:45 2021][316012.240505] xrootd invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
>
> [Mon Mar 8 13:16:45 2021][316012.248957] xrootd cpuset=/ mems_allowed=0-1
>
> [Mon Mar 8 13:16:45 2021][316012.253811] CPU: 4 PID: 2890 Comm: xrootd Kdump: loaded Tainted: P OE ------------ 3.10.0-1160.15.2.el7.x86_64 #1
>
> [Mon Mar 8 13:16:45 2021][316012.266621] Hardware name: Dell Inc. PowerEdge R730xd/0599V5, BIOS 2.11.0 11/02/2019
> [Mon Mar 8 13:16:45 2021][316012.275358] Call Trace:
> [Mon Mar 8 13:16:45 2021][316012.278184] [<ffffffffac381fba>] dump_stack+0x19/0x1b
> [Mon Mar 8 13:16:45 2021][316012.284014] [<ffffffffac37c8da>] dump_header+0x90/0x229
> [Mon Mar 8 13:16:45 2021][316012.290042] [<ffffffffabd065a2>] ? ktime_get_ts64+0x52/0xf0
> [Mon Mar 8 13:16:45 2021][316012.296457] [<ffffffffabd5dd7f>] ? delayacct_end+0x8f/0xb0
> [Mon Mar 8 13:16:45 2021][316012.302773] [<ffffffffabdc227d>] oom_kill_process+0x2cd/0x490
> [Mon Mar 8 13:16:45 2021][316012.309380] [<ffffffffabdc1c6d>] ? oom_unkillable_task+0xcd/0x120
> [Mon Mar 8 13:16:45 2021][316012.316375] [<ffffffffabdc296a>] out_of_memory+0x31a/0x500
> [Mon Mar 8 13:16:45 2021][316012.322691] [<ffffffffac37d3f7>] __alloc_pages_slowpath+0x5db/0x729
> [Mon Mar 8 13:16:45 2021][316012.329878] [<ffffffffabdc8ee6>] __alloc_pages_nodemask+0x436/0x450
> [Mon Mar 8 13:16:45 2021][316012.337066] [<ffffffffabe18bb8>] alloc_pages_current+0x98/0x110
> [Mon Mar 8 13:16:45 2021][316012.343866] [<ffffffffabdbdd37>] __page_cache_alloc+0x97/0xb0
> [Mon Mar 8 13:16:45 2021][316012.350475] [<ffffffffabdc0cd0>] filemap_fault+0x270/0x420
> [Mon Mar 8 13:16:45 2021][316012.356799] [<ffffffffc05c491e>] __xfs_filemap_fault+0x7e/0x1d0 [xfs]
> [Mon Mar 8 13:16:45 2021][316012.364192] [<ffffffffc05c4b1c>] xfs_filemap_fault+0x2c/0x30 [xfs]
> [Mon Mar 8 13:16:45 2021][316012.371288] [<ffffffffabdedfba>] __do_fault.isra.61+0x8a/0x100
> [Mon Mar 8 13:16:45 2021][316012.377991] [<ffffffffabdee56c>] do_read_fault.isra.63+0x4c/0x1b0
> [Mon Mar 8 13:16:45 2021][316012.384985] [<ffffffffabdf5db0>] handle_mm_fault+0xa20/0xfb0
> [Mon Mar 8 13:16:45 2021][316012.391497] [<ffffffffabe4efde>] ? do_readv_writev+0x19e/0x260
> [Mon Mar 8 13:16:45 2021][316012.398191] [<ffffffffac38f653>] __do_page_fault+0x213/0x500
> [Mon Mar 8 13:16:45 2021][316012.404699] [<ffffffffac38f975>] do_page_fault+0x35/0x90
> [Mon Mar 8 13:16:45 2021][316012.410819] [<ffffffffac38b778>] page_fault+0x28/0x30
> [Mon Mar 8 13:16:45 2021][316012.416649] Mem-Info:
> [Mon Mar 8 13:16:45 2021][316012.419282] active_anon:15536411 inactive_anon:5562 isolated_anon:0
> [Mon Mar 8 13:16:45 2021][316012.419282] active_file:0 inactive_file:0 isolated_file:0
> [Mon Mar 8 13:16:45 2021][316012.419282] unevictable:0 dirty:0 writeback:0 unstable:0
> [Mon Mar 8 13:16:45 2021][316012.419282] slab_reclaimable:217834 slab_unreclaimable:33046
> [Mon Mar 8 13:16:45 2021][316012.419282] mapped:4831 shmem:6745 pagetables:34002 bounce:0
> [Mon Mar 8 13:16:45 2021][316012.419282] free:41893 free_pcp:0 free_cma:0
> ```
>
> The client side messages are the following:
>
> ```
> cmd: 08-Mar-2021 13:39:24 EST Initiating request to open file root://deepthought.crc.nd.edu//store/user/kmohrman/postLHE_step/FullR2Studies/ULChecks/ttXJet-tXq_testFullULWFonCRC_ULCheck_UL17/v5/gen_step_tHq4f_testUpdateGenproddim6TopMay20GST_run2/GEN-00000_242.root
>>> cmd: 08-Mar-2021 13:39:29 EST Successfully opened file root://deepthought.crc.nd.edu//store/user/kmohrman/postLHE_step/FullR2Studies/ULChecks/ttXJet-tXq_testFullULWFonCRC_ULCheck_UL17/v5/gen_step_tHq4f_testUpdateGenproddim6TopMay20GST_run2/GEN-00000_242.root
>>> cmd: Begin processing the 1st record. Run 1, Event 1, LumiSection 93 on stream 4 at 08-Mar-2021 13:41:25.583 EST
>>> cmd: Begin processing the 2nd record. Run 1, Event 2, LumiSection 93 on stream 3 at 08-Mar-2021 13:41:25.586 EST
>>> cmd: Begin processing the 3rd record. Run 1, Event 3, LumiSection 93 on stream 0 at 08-Mar-2021 13:41:25.589 EST
>>> cmd: Begin processing the 4th record. Run 1, Event 4, LumiSection 93 on stream 2 at 08-Mar-2021 13:41:25.592 EST
>>> cmd: Begin processing the 5th record. Run 1, Event 5, LumiSection 93 on stream 5 at 08-Mar-2021 13:41:25.596 EST
>>> cmd: Begin processing the 6th record. Run 1, Event 6, LumiSection 93 on stream 1 at 08-Mar-2021 13:41:25.599 EST
>>> cmd: Begin processing the 7th record. Run 1, Event 7, LumiSection 93 on stream 2 at 08-Mar-2021 13:42:14.144 EST
>>> cmd: Begin processing the 8th record. Run 1, Event 8, LumiSection 93 on stream 1 at 08-Mar-2021 13:42:18.072 EST
>>> cmd: Begin processing the 9th record. Run 1, Event 9, LumiSection 93 on stream 4 at 08-Mar-2021 13:42:26.414 EST
>>> cmd: Begin processing the 10th record. Run 1, Event 10, LumiSection 93 on stream 0 at 08-Mar-2021 13:42:48.198 EST
>>> cmd: Begin processing the 11th record. Run 1, Event 11, LumiSection 93 on stream 2 at 08-Mar-2021 13:43:03.165 EST
>>> cmd: Begin processing the 12th record. Run 1, Event 12, LumiSection 93 on stream 3 at 08-Mar-2021 13:43:05.578 EST
>>> cmd: Begin processing the 13th record. Run 1, Event 13, LumiSection 93 on stream 5 at 08-Mar-2021 13:43:07.977 EST
>>> cmd: Begin processing the 14th record. Run 1, Event 14, LumiSection 93 on stream 4 at 08-Mar-2021 13:43:16.565 EST
>>> cmd: Begin processing the 15th record. Run 1, Event 15, LumiSection 93 on stream 1 at 08-Mar-2021 13:43:19.416 EST
>>> cmd: Begin processing the 16th record. Run 1, Event 16, LumiSection 93 on stream 0 at 08-Mar-2021 13:43:43.391 EST
>>> cmd: Begin processing the 17th record. Run 1, Event 17, LumiSection 93 on stream 2 at 08-Mar-2021 13:43:44.128 EST
>>> cmd: Begin processing the 18th record. Run 1, Event 18, LumiSection 93 on stream 4 at 08-Mar-2021 13:43:53.116 EST
>>> cmd: Begin processing the 19th record. Run 1, Event 19, LumiSection 93 on stream 3 at 08-Mar-2021 13:44:02.181 EST
>>> cmd: Begin processing the 20th record. Run 1, Event 20, LumiSection 93 on stream 5 at 08-Mar-2021 13:44:20.865 EST
>>> cmd: Begin processing the 21st record. Run 1, Event 21, LumiSection 93 on stream 2 at 08-Mar-2021 13:44:30.397 EST
>>> cmd: [2021-03-08 13:47:41.409148 -0500][Error ][XRootD ] [primeradiant06.crc.nd.edu:1096] Unable to get the response to request kXR_readv (handle: 0x00000000, chunks: 39, total size: 18360925)
>>> cmd: [2021-03-08 13:47:41.410099 -0500][Error ][File ] ***@***.***://deepthought.crc.nd.edu:1094//store/user/kmohrman/postLHE_step/FullR2Studies/ULChecks/ttXJet-tXq_testFullULWFonCRC_ULCheck_UL17/v5/gen_step_tHq4f_testUpdateGenproddim6TopMay20GST_run2/GEN-00000_242.root] Fatal file state error. Message kXR_readv (handle: 0x00000000, chunks: 39, total size: 18360925) returned with [ERROR] Operation expired
>>> cmd: %MSG-w XrdAdaptorInternal: (NoModuleName) 08-Mar-2021 13:47:41 EST pre-events
>>> cmd: XrdRequestManager::handle(name='root://deepthought.crc.nd.edu//store/user/kmohrman/postLHE_step/FullR2Studies/ULChecks/ttXJet-tXq_testFullULWFonCRC_ULCheck_UL17/v5/gen_step_tHq4f_testUpdateGenproddim6TopMay20GST_run2/GEN-00000_242.root) failure when reading from primeradiant06.crc.nd.edu:1096 (site T3_US_NotreDame); failed with error '[ERROR] Operation expired' (errno=0, code=206).
>>> cmd: %MSG
> ```
>
> --
> You are receiving this because you are subscribed to this thread.
> Reply to this email directly or view it on GitHub:
> https://github.com/xrootd/xrootd/issues/1430
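
For reference, the Mem-Info counters in the quoted kernel log are page counts, not bytes. A minimal sketch of the conversion, assuming the 4 KiB page size that is the default on x86_64 (the helper name `mem_info_gib` is made up for illustration):

```python
import re

PAGE_SIZE = 4096  # assumption: 4 KiB pages, the x86_64 default


def mem_info_gib(line, page_size=PAGE_SIZE):
    """Return {counter: GiB} for the 'name:pages' pairs in one Mem-Info line."""
    # Counter names are lowercase with underscores, so timestamps such as
    # 13:16:45 in the log prefix are not picked up as counters.
    pairs = re.findall(r"([a-z_]+):(\d+)", line)
    return {name: int(pages) * page_size / 2**30 for name, pages in pairs}


# Counter values copied from the OOM report quoted above.
line = "active_anon:15536411 inactive_anon:5562 isolated_anon:0"
usage = mem_info_gib(line)
print(f"active_anon = {usage['active_anon']:.1f} GiB")
```

Under that assumption, active_anon works out to about 59.3 GiB of anonymous memory, while active_file and inactive_file are both 0 in the same report.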


