Thanks for this update. Can you remind me how to monitor the HPS-specific resources?
I just want everyone to note that each user will now be limited to 50 concurrently running jobs in the shared partition. If all HPS SDF cores are being used, then members of the HPS group will only be able to run 50 jobs at a time until the HPS cores are freed up.
Hi Omar,
This already happened last night. SDF is already back up and running.
Cheers,
Cameron
In order to address critical filesystem stability issues, SDF will undergo emergency maintenance tonight (Wednesday 22nd July 2022). The vendor (DDN) believes we are hitting a known Lustre bug that has been addressed in a newer release, which we will start upgrading to at 5pm PDT tonight. We shall terminate all Slurm jobs at that time and also prevent logins to SDF whilst we perform this necessary upgrade. The estimated outage duration is 8 hours. We apologize for the lack of notice, but this is the quickest route to a usable system.
We have also observed that some filesystem issues correlate with users concurrently running thousands of IO-bound jobs via the shared partition. In an effort to minimize the impact of these submissions upon the larger SDF community, we will also start limiting each user to 50 concurrently running jobs in the shared partition. There will be no limit on the number of submitted or pending jobs. This limit applies only to the shared partition; there will be no changes to other Slurm partitions. It will take effect once tonight's outage is over.
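For anyone who wants to check how close they are to the new limit, here is a minimal sketch, not an official SDF tool. It assumes the partition is literally named "shared" and that the cap stays at 50; the function names are made up for illustration. It counts the lines returned by squeue for one user's running jobs and reports how many more jobs could start.

```python
import subprocess

# Assumption: per-user cap on running jobs in the shared partition,
# taken from the announcement above.
SHARED_LIMIT = 50


def count_running(squeue_output: str) -> int:
    """Count non-empty lines; squeue prints one line per running job."""
    return sum(1 for line in squeue_output.splitlines() if line.strip())


def running_in_shared(user: str, partition: str = "shared") -> int:
    """Ask Slurm how many of `user`'s jobs are running in `partition`."""
    # -h: no header, -u: user, -p: partition,
    # -t RUNNING: only running jobs, -o %i: print the job id only
    result = subprocess.run(
        ["squeue", "-h", "-u", user, "-p", partition,
         "-t", "RUNNING", "-o", "%i"],
        capture_output=True, text=True, check=True,
    )
    return count_running(result.stdout)


def slots_left(running: int, limit: int = SHARED_LIMIT) -> int:
    """Number of additional jobs that could start before hitting the cap."""
    return max(limit - running, 0)
```

On a login node, slots_left(running_in_shared("your_username")) would give the remaining headroom. Jobs beyond the cap simply stay pending, so this is only a convenience check.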
Thank you all for your patience. We are determined to deliver production-quality SDF services.
------
Yemi
Use REPLY-ALL to reply to list
To unsubscribe from the HPS-DEPARTMENT list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=HPS-DEPARTMENT&A=1