SLURM / PLGrid guide¶
The project runs on PLGrid Athena (partition plgrid-gpu-a100, account plggolemml26-gpu-a100). All sbatch scripts live under $SCRATCH/sbatch/.
Directory structure¶
sbatch/
├── train/
│   ├── train_classifier_new_ast.sbatch
│   └── train_classifier_new_vggish.sbatch
├── predict/
│   ├── ast.sbatch
│   ├── vggish.sbatch
│   └── spectttra.sbatch
└── attack/
    ├── ast.sbatch
    └── vggish.sbatch
Submitting jobs¶
cd $SCRATCH
# Training
sbatch sbatch/train/train_classifier_new_ast.sbatch
# Inference
sbatch sbatch/predict/ast.sbatch
# Attack
sbatch sbatch/attack/ast.sbatch
Monitoring¶
# List your jobs
squeue -u $USER
# Watch a running job's output
tail -f $SCRATCH/Audio-XAI/logs/<job_name>-<job_id>.out
# Cancel a job
scancel <job_id>
Log files are written to $SCRATCH/Audio-XAI/logs/ and follow the <job_name>-<job_id>.out pattern used in the tail command above.
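To avoid looking up the job ID by hand, a small helper can pick the newest log for a given job name. This is a minimal sketch that assumes the `<job_name>-<job_id>.out` naming used in the `tail` example; `latest_log` is a hypothetical helper, not part of the repository.

```shell
# latest_log JOB_NAME [LOG_DIR]
# Prints the most recently modified log file for JOB_NAME.
# Assumes logs are named <job_name>-<job_id>.out.
latest_log() {
    local job_name="$1"
    local log_dir="${2:-$SCRATCH/Audio-XAI/logs}"
    # ls -t sorts by modification time, newest first; errors (no match) are hidden.
    ls -t "$log_dir/$job_name"-*.out 2>/dev/null | head -n 1
}
```

It could then be combined with `tail`, for example `tail -f "$(latest_log <job_name>)"`.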
Resource allocation per script¶
| Script | Nodes | GPUs | RAM | Time |
|---|---|---|---|---|
| train/train_classifier_new_ast.sbatch | 2 | 2×2 A100 | 128 GB | 12 h |
| train/train_classifier_new_vggish.sbatch | 2 | 2×2 A100 | 128 GB | 12 h |
| predict/ast.sbatch | 1 | 1 A100 | 64 GB | 6 h |
| predict/vggish.sbatch | 1 | 1 A100 | 32 GB | 6 h |
| attack/ast.sbatch | 1 | 1 A100 | 64 GB | 12 h |
| attack/vggish.sbatch | 1 | 1 A100 | 32 GB | 6 h |
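For orientation, a table row maps onto `#SBATCH` directives roughly as follows. This is a sketch for the predict/ast.sbatch row (1 node, 1 A100, 64 GB, 6 h). It assumes GPUs are requested through `--gres` and that the RAM figure is per node; the actual scripts may use different directives.

```shell
#!/bin/bash
# Hypothetical header matching the predict/ast.sbatch row of the table.
#SBATCH --partition=plgrid-gpu-a100
#SBATCH --account=plggolemml26-gpu-a100
#SBATCH --nodes=1
# Assumption: one GPU requested as a generic resource.
#SBATCH --gres=gpu:1
# --mem is the memory limit per node.
#SBATCH --mem=64G
#SBATCH --time=06:00:00
```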
Environment variables¶
Override defaults without editing sbatch files:
# Use a different checkpoint
AST_CKPT=$SCRATCH/Audio-XAI/runs/ast/version_7/checkpoints/best.ckpt \
sbatch sbatch/attack/ast.sbatch
# Run on CPU (debugging)
DEVICE=cpu sbatch sbatch/predict/ast.sbatch
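These overrides work because `sbatch` passes the submitting shell's environment to the job by default (`--export=ALL`), so a script only needs a fallback for each variable. The fragment below is a sketch of that pattern using `${VAR:-default}` expansion; `resolve_config` and the default values are placeholders, not the project's real defaults.

```shell
# resolve_config
# Prints the effective settings, preferring values from the environment.
# The fallback values are illustrative placeholders.
resolve_config() {
    local device="${DEVICE:-cuda}"                        # used when DEVICE is unset or empty
    local ckpt="${AST_CKPT:-/path/to/default/best.ckpt}"  # placeholder path
    printf 'device=%s ckpt=%s\n' "$device" "$ckpt"
}
```

Called with `DEVICE=cpu resolve_config`, the function reports `device=cpu` and keeps the placeholder checkpoint.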
Batched attack jobs¶
The attack sbatch scripts split n_attack_samples across N_BATCHES sequential chunks within a single job:
# In sbatch/attack/ast.sbatch:
N_BATCHES=8
for BATCH_IDX in $(seq 0 $((N_BATCHES - 1))); do
    srun python3 scripts/attack.py \
        --n-batches "$N_BATCHES" --batch-index "$BATCH_IDX" \
        --run-name "$RUN_NAME" ...
done
All batches write to the same RUN_NAME directory. If a batch fails, re-submitting the job skips already-processed samples automatically (resume logic checks for audio/*/adversarial.wav).
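The chunking and the resume check can be illustrated with two small helpers. This is a sketch only: it assumes contiguous chunks with the remainder spread over the first batches, and that the `*` in audio/*/adversarial.wav is a per-sample directory. scripts/attack.py may partition samples differently, and `batch_range` and `is_done` are hypothetical names.

```shell
# batch_range TOTAL N_BATCHES BATCH_IDX
# Prints "START END" (END exclusive) for one chunk.
# Assumption: contiguous chunks, remainder spread over the first batches.
batch_range() {
    local total="$1" n_batches="$2" idx="$3"
    local base=$(( total / n_batches ))
    local rem=$(( total % n_batches ))
    local extra=$(( idx < rem ? idx : rem ))
    local start=$(( idx * base + extra ))
    local size=$(( base + (idx < rem ? 1 : 0) ))
    echo "$start $(( start + size ))"
}

# is_done RUN_DIR SAMPLE_ID
# Succeeds if the sample already has an adversarial output on disk.
# Assumption: outputs live in RUN_DIR/audio/<sample_id>/adversarial.wav.
is_done() {
    [ -f "$1/audio/$2/adversarial.wav" ]
}
```

With 100 samples and 8 batches, `batch_range 100 8 0` prints `0 13` and `batch_range 100 8 7` prints `88 100`, so every sample falls into exactly one chunk.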
Multi-node training¶
Training uses Lightning's SLURMEnvironment plugin, which handles:
- Node discovery via SLURM_NODELIST
- Auto-requeue on wall-time limit (--no-auto-requeue disables this)
- DDP process group setup
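Set --strategy ddp_find_unused_parameters_false for AST. This turns off DDP's per-step search for parameters that received no gradient, which is only safe when every parameter contributes to the loss. If some transformer parameters really are unused in forward, DDP typically fails with a reduction error under this setting and must run with find_unused_parameters=True instead.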
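Lightning expects one SLURM task per GPU, so the allocation has to agree with the trainer arguments. The helper below is a sketch of a sanity check that could run at the top of a training script; `check_ddp_layout` is a hypothetical name, and it relies on the standard `SLURM_NTASKS_PER_NODE` and `SLURM_JOB_NUM_NODES` variables that SLURM sets inside a job.

```shell
# check_ddp_layout DEVICES NUM_NODES
# Compares the trainer arguments with what SLURM actually allocated.
# SLURM_NTASKS_PER_NODE is only set when --ntasks-per-node was requested.
check_ddp_layout() {
    local devices="$1" num_nodes="$2"
    if [ "${SLURM_NTASKS_PER_NODE:-unset}" != "$devices" ]; then
        echo "ntasks-per-node (${SLURM_NTASKS_PER_NODE:-unset}) != devices ($devices)" >&2
        return 1
    fi
    if [ "${SLURM_JOB_NUM_NODES:-unset}" != "$num_nodes" ]; then
        echo "nodes (${SLURM_JOB_NUM_NODES:-unset}) != num_nodes ($num_nodes)" >&2
        return 1
    fi
}
```

For the two-node training rows in the table, a call such as `check_ddp_layout 2 2 || exit 1` before `srun` would stop a mismatched job early instead of letting it hang during process group setup.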
TensorBoard on the cluster¶
TensorBoard running on the cluster cannot be reached directly from your local browser. Use SSH port forwarding:
# On your local machine (single quotes so that $SCRATCH expands on the cluster, not locally):
ssh -L 6006:localhost:6006 plgpkitlo@athena.cyfronet.pl \
    'tensorboard --logdir $SCRATCH/Audio-XAI/runs --port 6006'
Then open http://localhost:6006 in your browser.
Common issues¶
Job killed immediately: Check the .err log; the cause is usually a missing module or the wrong conda env.
OOM on attack: Reduce --attack-micro-batch or --clip-seconds. The sbatch scripts set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, which helps reduce memory fragmentation.
NCCL errors on multi-node: Ensure --strategy ddp_find_unused_parameters_false is set and that --ntasks-per-node equals --devices.
HuggingFace download fails: Check that $SCRATCH/hf_token exists and contains a valid token. Verify HF_HOME points to scratch (not home directory).
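The HuggingFace checks can be scripted. The sketch below only verifies that the token file exists and is non-empty and that `HF_HOME` lies under `$SCRATCH`; it does not validate the token itself. `check_hf_setup` is a hypothetical helper.

```shell
# check_hf_setup
# Verifies the token file and the cache location.
# Does not contact HuggingFace, so an expired token still passes.
check_hf_setup() {
    local token_file="$SCRATCH/hf_token"
    if [ ! -s "$token_file" ]; then
        echo "missing or empty token file: $token_file" >&2
        return 1
    fi
    case "${HF_HOME:-}" in
        "$SCRATCH"/*) ;;  # cache lives under scratch
        *)
            echo "HF_HOME is not under \$SCRATCH: ${HF_HOME:-unset}" >&2
            return 1
            ;;
    esac
}
```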