AWS Parallel Computing Service (AWS PCS) now supports Slurm 25.11, bringing several new capabilities to HPC workloads on AWS. Key additions include expedited requeue, which automatically reschedules jobs when nodes fail, a Prometheus-compatible OpenMetrics endpoint for real-time monitoring of jobs and nodes, and expanded logging options. Slurm daemon logs (slurmdbd and slurmrestd) can now be sent to Amazon CloudWatch Logs, Amazon S3, or Amazon Data Firehose. Scheduler audit logs are now a dedicated log type, so their ingestion and storage costs can be controlled independently. These features are available in all AWS Regions where AWS PCS is supported.

2m read time. From aws.amazon.com