
Hi Omar,

Thanks for this update. Can you remind me how to monitor the HPS-specific resources?

Thanks,
Norman

From: [log in to unmask] <[log in to unmask]> on behalf of Moreno, Omar <[log in to unmask]>
Sent: Thursday, June 23, 2022 2:15 PM
To: Bravo, Cameron B. <[log in to unmask]>; Moreno, Omar <[log in to unmask]>; hps-department <[log in to unmask]>; hps-software <[log in to unmask]>
Subject: Re: URGENT - Emergency SDF outage at 5pm today 6/22
 
I just want everyone to note that jobs using the shared partition will now be limited to 50 running jobs per user. If all HPS SDF cores are in use, members of the HPS group will only be able to run 50 jobs at a time until the HPS cores are freed up.
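
If you want to check where you stand against that cap, something like the rough sketch below should do it. It assumes the standard Slurm squeue client is on your PATH and that the partition is literally named "shared" on SDF; adjust the names if yours differ.

#!/usr/bin/env python3
"""Sketch: count your currently running jobs in the shared partition."""
import getpass
import subprocess

PARTITION = "shared"   # assumed partition name on SDF
LIMIT = 50             # per-user running-job cap described above

def running_jobs(user, partition):
    # -h drops the header, -t RUNNING keeps only running jobs,
    # -o %i prints one job ID per line.
    out = subprocess.run(
        ["squeue", "-h", "-u", user, "-p", partition, "-t", "RUNNING", "-o", "%i"],
        check=True, capture_output=True, text=True,
    ).stdout
    return len(out.splitlines())

if __name__ == "__main__":
    user = getpass.getuser()
    n = running_jobs(user, PARTITION)
    print("%s: %d/%d running jobs in '%s'" % (user, n, LIMIT, PARTITION))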

From: [log in to unmask] <[log in to unmask]> on behalf of Bravo, Cameron B. <[log in to unmask]>
Sent: Thursday, June 23, 2022 2:13 PM
To: Moreno, Omar <[log in to unmask]>; hps-department <[log in to unmask]>; hps-software <[log in to unmask]>
Subject: Re: URGENT - Emergency SDF outage at 5pm today 6/22
 
Hi Omar,

This already happened last night. SDF is back up and running.

Cheers,
Cameron


From: [log in to unmask] <[log in to unmask]> on behalf of Moreno, Omar <[log in to unmask]>
Sent: Thursday, June 23, 2022 2:09 PM
To: hps-department <[log in to unmask]>; hps-software <[log in to unmask]>
Subject: Fw: URGENT - Emergency SDF outage at 5pm today 6/22
 


From: Adeyemi Adesanya <[log in to unmask]>
Sent: Wednesday, June 22, 2022 1:59 PM
To: Moreno, Omar <[log in to unmask]>
Subject: URGENT - Emergency SDF outage at 5pm today 6/22
 
In order to address critical issues of filesystem stability, we will undergo emergency maintenance of SDF tonight (Wednesday, June 22, 2022). The vendor (DDN) believes we are hitting a known Lustre bug that has been addressed in a newer release, which we will upgrade to starting at 5pm PDT tonight. We shall terminate all Slurm jobs at that time and also prevent logins into SDF whilst we perform this necessary upgrade. The estimated outage duration is 8 hours. We apologize for the lack of notice, but this is the quickest path to a usable system.

We have also observed that some filesystem issues correlate with users concurrently running thousands of IO-bound jobs via the shared partition. In an effort to minimize the impact of these submissions on the larger SDF community, we will also start limiting the number of jobs each user has running concurrently in the shared partition to 50. There will be no limit on the number of submitted/pending jobs. This applies only to the shared partition; there will be no changes to other partitions in Slurm. The limit will be in effect once tonight's outage is over.
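
For reference, a per-user running-job cap like this is commonly enforced through a QOS attached to the partition (MaxJobsPerUser). The rough sketch below shows one way to read such a limit back out of Slurm; it assumes that implementation and that scontrol and sacctmgr are on your PATH, and is only an illustration, not a description of the actual SDF configuration.

#!/usr/bin/env python3
"""Sketch: read a partition's per-user running-job cap, assuming it is set via a partition QOS."""
import re
import subprocess

def partition_qos(partition):
    # scontrol prints a "QoS=..." field for each partition ("N/A" if none attached).
    out = subprocess.run(
        ["scontrol", "show", "partition", partition],
        check=True, capture_output=True, text=True,
    ).stdout
    m = re.search(r"\bQoS=(\S+)", out)
    qos = m.group(1) if m else None
    return None if qos in (None, "N/A") else qos

def max_jobs_per_user(qos):
    # -n drops the header, -P gives parsable output; MaxJobsPU is the per-user running-job limit.
    out = subprocess.run(
        ["sacctmgr", "-n", "-P", "show", "qos", qos, "format=MaxJobsPU"],
        check=True, capture_output=True, text=True,
    ).stdout
    return out.strip() or "unlimited"

if __name__ == "__main__":
    qos = partition_qos("shared")
    if qos is None:
        print("shared: no QOS attached (limit enforced some other way, or not at all)")
    else:
        print("shared: QOS %s, MaxJobsPerUser = %s" % (qos, max_jobs_per_user(qos)))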

Thank you all for your patience. We are determined to deliver production-quality SDF services.

------
Yemi



Use REPLY-ALL to reply to list

To unsubscribe from the HPS-DEPARTMENT list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=HPS-DEPARTMENT&A=1



Use REPLY-ALL to reply to list

To unsubscribe from the HPS-SOFTWARE list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=HPS-SOFTWARE&A=1