Hi Omar,

This already happened last night. SDF is already back up and running.

Cheers,
Cameron

________________________________
From: [log in to unmask] <[log in to unmask]> on behalf of Moreno, Omar <[log in to unmask]>
Sent: Thursday, June 23, 2022 2:09 PM
To: hps-department <[log in to unmask]>; hps-software <[log in to unmask]>
Subject: Fw: URGENT - Emergency SDF outage at 5pm today 6/22


________________________________
From: Adeyemi Adesanya <[log in to unmask]>
Sent: Wednesday, June 22, 2022 1:59 PM
To: Moreno, Omar <[log in to unmask]>
Subject: URGENT - Emergency SDF outage at 5pm today 6/22

In order to address critical issues of filesystem stability, we will undergo emergency maintenance of SDF tonight (Wednesday, June 22, 2022). The vendor (DDN) believes we are hitting a known Lustre bug that has been fixed in a newer release, which we will upgrade to starting at 5pm PDT tonight. We shall terminate all Slurm jobs at that time and also prevent logins to SDF whilst we perform this necessary upgrade. The estimated outage duration is 8 hours. We apologize for the lack of notice, but this is the quickest path to a usable system.

We have also observed that some filesystem issues correlate with users concurrently running thousands of IO-bound jobs via the shared partition. To minimize the impact of these submissions on the larger SDF community, we will also start limiting the number of jobs each user can run concurrently in the shared partition to 50. There will be no limit on the number of submitted/pending jobs. This applies only to the shared partition - there will be no changes to other partitions in Slurm. The limit will take effect once tonight's outage is over.
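For reference, a per-user running-job cap like this is typically enforced in Slurm through a QOS attached to the partition. A minimal sketch of how such a limit might be set up (the QOS name `shared-limit` is an assumption for illustration, not the actual SDF configuration):

```shell
# Admin side (assumed setup, not the confirmed SDF config):
# create a QOS and cap each user at 50 concurrently RUNNING jobs;
# pending/queued jobs are not limited by MaxJobsPerUser.
sacctmgr add qos shared-limit
sacctmgr modify qos shared-limit set MaxJobsPerUser=50

# Then attach it to the partition in slurm.conf, e.g.:
#   PartitionName=shared ... QOS=shared-limit

# User side: count how many of your jobs are currently running
# in the shared partition.
squeue --partition=shared --user=$USER --states=RUNNING --noheader | wc -l
```

Jobs beyond the cap simply wait in the queue until one of the user's running jobs finishes, so batch submissions need no changes.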

Thank you all for your patience. We are determined to deliver production-quality SDF services.

------
Yemi


________________________________

Use REPLY-ALL to reply to list

To unsubscribe from the HPS-DEPARTMENT list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=HPS-DEPARTMENT&A=1

########################################################################
Use REPLY-ALL to reply to list

To unsubscribe from the HPS-SOFTWARE list, click the following link:
https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=HPS-SOFTWARE&A=1