Update: this maintenance was completed last night, and SDF is back up and running.
To address critical filesystem stability issues, we will perform emergency maintenance on SDF tonight (Wednesday 22nd July 2022). The vendor (DDN) believes we are hitting a known Lustre bug that has been fixed in a newer release, and we will begin upgrading to that release at 5pm PDT tonight. All Slurm jobs will be terminated at that time, and logins to SDF will be blocked while we perform the upgrade. The estimated outage duration is 8 hours. We apologize for the short notice, but this is the quickest path back to a usable system.
We have also observed that some filesystem issues correlate with users concurrently running thousands of I/O-bound jobs in the shared partition. To minimize the impact of these submissions on the larger SDF community, we will begin limiting each user to 50 concurrently running jobs in the shared partition. There will be no limit on the number of submitted/pending jobs. This limit applies only to the shared partition; no other Slurm partitions are affected. It will take effect once tonight's outage is over.
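For reference, a per-user running-job cap like this is commonly enforced in Slurm by attaching a QOS to the partition. The sketch below is illustrative only; the QOS name and partition settings are assumptions, not the actual SDF configuration:

```shell
# Hypothetical sketch: create a QOS that caps concurrently RUNNING jobs
# per user at 50 (pending/submitted jobs are not limited by this field)
sacctmgr add qos sharedlimit
sacctmgr modify qos sharedlimit set MaxJobsPerUser=50

# slurm.conf: attach the QOS to the shared partition (other partitions untouched)
PartitionName=shared QOS=sharedlimit State=UP
```

Users can check how many of their jobs are currently running in the shared partition with `squeue -u $USER -p shared -t RUNNING`.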
Thank you all for your patience. We are determined to deliver production-quality SDF services.
------
Yemi