Hi again,

Actually it seems to only change the "change" time (st_ctime):

touch test
stat test
[...]
Access: 2023-02-16 11:25:11.962804882 +0100
Modify: 2023-02-16 11:25:11.962804882 +0100
Change: 2023-02-16 11:25:11.962804882 +0100
 Birth: 2023-02-16 11:25:11.962804882 +0100

chown xrootd test
stat test
[...]
Access: 2023-02-16 11:25:11.962804882 +0100
Modify: 2023-02-16 11:25:11.962804882 +0100
Change: 2023-02-16 11:25:20.322843125 +0100
 Birth: 2023-02-16 11:25:11.962804882 +0100

Does this play a role?

Cheers,
Nikolai
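For anyone who wants to reproduce the check above, a minimal Python sketch of the same test; it assumes an `xrootd` user exists and that the script runs with privileges to chown:

import os
import pwd

path = "test"
with open(path, "w"):
    pass  # create the file, as `touch test` does

before = os.stat(path)
uid = pwd.getpwnam("xrootd").pw_uid  # assumes an "xrootd" user exists
os.chown(path, uid, -1)              # change owner only, like `chown xrootd test`
after = os.stat(path)

for field in ("st_atime", "st_mtime", "st_ctime"):
    print(field, getattr(before, field), "->", getattr(after, field))
# st_atime and st_mtime stay unchanged; only st_ctime is bumped by chown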
On 2/16/23 11:18, Nikolai Hartmann wrote:
> Hi Matevz (including the xrootd list again, which I forgot in the last reply),
>
>> Well, if for some reason more new files are placed on a single disk,
>> those files will be "newer" and purge would preferentially wipe data
>> off other disks.
>
> Mhhhh - then I have an idea how I may have triggered this. As mentioned
> in my first email, the issue started after I updated my container image
> and had to change the xrootd user ids. This changes the access time of
> the files - if that is what xrootd uses to determine which files are
> newer, then it could just be that the chown process walked this disk
> last, so it will be purged last.
>
> When I then cleared the disk after it ran full, I made the problem even
> worse, since now all the files that end up there are recently accessed.
>
> So deleting the whole cache should fix it?
>
> Cheers,
> Nikolai
>
> On 2/16/23 10:50, Matevz Tadel wrote:
>> Hi Andy, Nikolai,
>>
>> On 2/15/23 23:51, Andrew Hanushevsky wrote:
>>> Hi Nikolai,
>>>
>>> Hmm, this sounds like an off-by-one problem in XCache.
>>
>> How? XCache does not do disks, it just uses the oss API to a pool.
>>
>>> The question is what is the "one". It does seem that it consistently
>>> does not purge files from a particular disk, but then again it doesn't
>>> know about disks. So, there is some systematic issue that resolves to
>>> ignoring a disk. Matevz?
>>
>> Well, if for some reason more new files are placed on a single disk,
>> those files will be "newer" and purge would preferentially wipe data
>> off other disks.
>>
>> That's why I asked in the first email how disks are selected for new
>> files and if we could inject some debug printouts there.
>>
>> Perhaps a coincidence, but the full disk is the one that is listed
>> first by df.
>>
>> The docs say the default for oss.alloc fuzz = 0 and that this "forces
>> oss to always use the partition with the largest amount of free space"
>> -- so the fuller one should never get selected for new files. And
>> xcache does pass the appropriate oss.asize opaque parameter to open.
>>
>> https://xrootd.slac.stanford.edu/doc/dev56/ofs_config.htm#_Toc116508676
>>
>> Matevz
>>
>>> Andy
>>>
>>> On Thu, 16 Feb 2023, Nikolai Hartmann wrote:
>>>
>>>> Hi Andy,
>>>>
>>>> The behavior seems to be that it purges all the disks except one.
>>>> After the other disks again surpassed the threshold of 95%, it
>>>> seemed to trigger the cleanup, and now I have this:
>>>>
>>>> Filesystem  Type   Size  Used  Avail  Use%  Mounted on
>>>> /dev/sdb    btrfs  5,5T  5,3T  215G   97%   /srv/xcache/b
>>>> /dev/sda    btrfs  5,5T  5,0T  560G   90%   /srv/xcache/a
>>>> /dev/sdh    btrfs  5,5T  4,9T  588G   90%   /srv/xcache/h
>>>> /dev/sdj    btrfs  5,5T  4,9T  584G   90%   /srv/xcache/j
>>>> /dev/sdf    btrfs  5,5T  4,9T  580G   90%   /srv/xcache/f
>>>> /dev/sdm    btrfs  5,5T  5,0T  535G   91%   /srv/xcache/m
>>>> /dev/sdc    btrfs  5,5T  5,0T  553G   91%   /srv/xcache/c
>>>> /dev/sdg    btrfs  5,5T  4,9T  612G   90%   /srv/xcache/g
>>>> /dev/sdi    btrfs  5,5T  4,9T  596G   90%   /srv/xcache/i
>>>> /dev/sdl    btrfs  5,5T  5,0T  518G   91%   /srv/xcache/l
>>>> /dev/sdn    btrfs  5,5T  4,9T  570G   90%   /srv/xcache/n
>>>> /dev/sde    btrfs  5,5T  4,9T  593G   90%   /srv/xcache/e
>>>> /dev/sdk    btrfs  5,5T  4,8T  677G   88%   /srv/xcache/k
>>>> /dev/sdd    btrfs  5,5T  4,9T  602G   90%   /srv/xcache/d
>>>>
>>>> Cheers,
>>>> Nikolai
>>>>
>>>> On 2/14/23 21:52, Andrew Hanushevsky wrote:
>>>>> Hi Matevz & Nikolai,
>>>>>
>>>>> The allocation should favor the disk with the most free space
>>>>> unless it's altered using the oss.alloc directive:
>>>>> https://xrootd.slac.stanford.edu/doc/dev54/ofs_config.htm#_Toc89982400
>>>>>
>>>>> I don't think Nikolai specifies that, and I don't think the pfc
>>>>> alters it in any way. So, I can't explain why we see that
>>>>> difference other than via an uneven purge.
>>>>>
>>>>> Andy
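To make that allocation rule concrete, a minimal Python sketch of the selection behaviour the docs describe for oss.alloc fuzz. This is an illustration, not the actual oss code: the percentage comparison and the random tie-break among partitions counted as "equal" are assumptions.

import random

def pick_partition(free_space, fuzz=0):
    """free_space: mount point -> free space; fuzz: percent, as in oss.alloc."""
    best = max(free_space.values())
    # with fuzz = 0 this keeps only the partition(s) with the most free
    # space; with fuzz > 0, anything within fuzz percent of it qualifies
    candidates = [mnt for mnt, free in free_space.items()
                  if (best - free) * 100 <= fuzz * best]
    return random.choice(candidates)  # tie-break rule is an assumption

free = {"/srv/xcache/b": 215, "/srv/xcache/a": 560, "/srv/xcache/k": 677}
print(pick_partition(free))           # always /srv/xcache/k with fuzz = 0
print(pick_partition(free, fuzz=25))  # /srv/xcache/a or /srv/xcache/k

Under this rule the 97%-full disk should indeed never be chosen for new files while emptier partitions exist, which is why an uneven purge remains the prime suspect.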
>>>>>
>>>>> On Tue, 14 Feb 2023, Matevz Tadel wrote:
>>>>>
>>>>>> Hi Nikolai, Andy,
>>>>>>
>>>>>> I saw this a long time back, 2++ years ago. The thing is that
>>>>>> xcache does oss df on the whole space and then deletes files
>>>>>> without any knowledge of the usage on the individual disks
>>>>>> themselves. Placement of new files should prefer the emptier
>>>>>> disks, though, iirc.
>>>>>>
>>>>>> I remember asking Andy how xcache could be made aware of
>>>>>> individual disks, and he prepared something for me, but it got
>>>>>> really complicated when I was trying to include this in the cache
>>>>>> purge algorithm, so I think I dropped it.
>>>>>>
>>>>>> Andy, could we sneak some debug printouts into the oss new-file
>>>>>> disk selection to see if something is going wrong there?
>>>>>>
>>>>>> Nikolai, how fast does this happen? Is it a matter of days, i.e.,
>>>>>> over many purge cycles? Is it always the same disk?
>>>>>>
>>>>>> Cheers,
>>>>>> Matevz
>>>>>>
>>>>>> On 2/13/23 23:21, Nikolai Hartmann wrote:
>>>>>>> Hi Andy,
>>>>>>>
>>>>>>> The config is the following:
>>>>>>>
>>>>>>> https://gitlab.physik.uni-muenchen.de/etp-computing/xcache-nspawn-lrz/-/blob/086e5ade5d27fc7d5ef59448c955523e453c091f/etc/xrootd/xcache.cfg
>>>>>>>
>>>>>>> The directories for `oss.localroot` and `oss.space meta` are on
>>>>>>> the system disk. The `/srv/xcache/[a-m]` directories are
>>>>>>> individually mounted devices.
>>>>>>>
>>>>>>> Best,
>>>>>>> Nikolai
>>>>>>>
>>>>>>> On 2/14/23 00:34, Andrew Hanushevsky wrote:
>>>>>>>> Hi Nikolai,
>>>>>>>>
>>>>>>>> Hmmm, no, it seems you are the first one. Then again, not many
>>>>>>>> people have a multi-disk setup. So, could you send a link to
>>>>>>>> your config file? It might be the case that all of the metadata
>>>>>>>> files wind up on the same disk and that is the source of the
>>>>>>>> issue here.
>>>>>>>>
>>>>>>>> Andy
>>>>>>>>
>>>>>>>> On Mon, 13 Feb 2023, Nikolai Hartmann wrote:
>>>>>>>>
>>>>>>>>> Dear xrootd-l,
>>>>>>>>>
>>>>>>>>> I'm seeing the issue that one of the disks on one of our
>>>>>>>>> xcache servers fills up disproportionately - it runs
>>>>>>>>> completely full until I get "no space left on device" errors
>>>>>>>>> without xcache running a cleanup, while the other disks still
>>>>>>>>> have plenty of space left. My current df output:
>>>>>>>>>
>>>>>>>>> /dev/sdb  btrfs  5,5T  5,2T  273G  96%  /srv/xcache/b
>>>>>>>>> /dev/sda  btrfs  5,5T  4,9T  584G  90%  /srv/xcache/a
>>>>>>>>> /dev/sdh  btrfs  5,5T  5,0T  562G  90%  /srv/xcache/h
>>>>>>>>> /dev/sdj  btrfs  5,5T  5,0T  551G  91%  /srv/xcache/j
>>>>>>>>> /dev/sdf  btrfs  5,5T  4,9T  579G  90%  /srv/xcache/f
>>>>>>>>> [...]
>>>>>>>>>
>>>>>>>>> If you look at the first line, you see that disk is 96% full
>>>>>>>>> while the others are around 90%. The issue occurred for the
>>>>>>>>> first time after I built a new container for running xrootd.
>>>>>>>>> That change involved switching the container from centos7 to
>>>>>>>>> almalinux8 and changing the xrootd user id (I ran chown and
>>>>>>>>> chgrp afterwards on the cache directories, which are bind
>>>>>>>>> mounted). The xrootd version stayed the same (5.4.2). The
>>>>>>>>> high/low watermark configuration is the following:
>>>>>>>>>
>>>>>>>>> pfc.diskusage 0.90 0.95
>>>>>>>>>
>>>>>>>>> I already tried clearing the misbehaving disk (after it ran
>>>>>>>>> full to 100%), but now the issue is reappearing. Has anyone
>>>>>>>>> seen similar issues, or does it ring any bells for you?
>>>>>>>>>
>>>>>>>>> One thing I checked is the total storage size that xrootd
>>>>>>>>> reports in the log, and that at least matches what I get when
>>>>>>>>> I sum the entries from `df`.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Nikolai
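For reference, a rough Python sketch of what a high/low watermark purge over the aggregate pool looks like with "pfc.diskusage 0.90 0.95". It illustrates the behaviour described in this thread (purge decisions based on oss df over the whole space, with no per-disk knowledge), not the pfc source; ordering by recorded access time is an assumption. One member disk can therefore run completely full while the aggregate usage never crosses the high watermark.

LOW, HIGH = 0.90, 0.95  # pfc.diskusage 0.90 0.95

def purge_if_needed(files, total_bytes):
    """files: list of (last_access_time, size_bytes) over the whole pool."""
    used = sum(size for _, size in files)
    if used < HIGH * total_bytes:
        return files                 # high watermark not reached: do nothing
    files.sort(key=lambda f: f[0])   # oldest access first (LRU assumption)
    target = LOW * total_bytes
    while files and used > target:
        _, size = files.pop(0)       # delete; which disk it lives on is never consulted
        used -= size
    return files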