Distributing runs on a multi-GPU machine

This is my approach using Guild queues on an 8-GPU machine.

First, I set up a queue for each GPU. This only needs to be done once.

for i in {0..7}
do
  guild run --background --yes queue gpus=$i
done

I can check that my queues are running with

guild runs -Fo queue

To submit jobs, I use, e.g.,

guild run --stage-trials --quiet train.py learning_rate='logspace[-3:0:4]' batch_size='[1,2,5,10,20]'
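This stages the full cross-product of the two flag lists: 4 learning rates times 5 batch sizes, so 20 trials. As a rough sketch (assuming Guild's `logspace[-3:0:4]` follows numpy.logspace semantics, i.e. 4 points from 1e-3 to 1e0; this is an illustration, not Guild's actual expansion code):

```python
# Sketch of how the flag grid expands into trials.
# Assumes logspace[-3:0:4] means 4 log-spaced values from 10^-3 to 10^0.
from itertools import product

def logspace(start_exp, stop_exp, count):
    # Log-spaced values between 10**start_exp and 10**stop_exp, inclusive.
    step = (stop_exp - start_exp) / (count - 1)
    return [10 ** (start_exp + i * step) for i in range(count)]

learning_rates = logspace(-3, 0, 4)   # ~[0.001, 0.01, 0.1, 1.0]
batch_sizes = [1, 2, 5, 10, 20]

trials = list(product(learning_rates, batch_sizes))
print(len(trials))  # 20 staged trials
```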

Each queue picks up staged trials automatically and runs them on its assigned GPU.
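Conceptually, each queue is a worker pinned to one GPU, pulling staged jobs from a shared pool until none remain. A toy thread-based analogy of that scheduling idea (hypothetical, not Guild's internals, which run each queue as a separate background process):

```python
# Toy analogy: 8 workers, each pinned to one GPU id, claiming
# staged trials from a shared queue until it is empty.
import queue
import threading

jobs = queue.Queue()
for trial_id in range(20):        # e.g. 20 staged trials
    jobs.put(trial_id)

assignments = {}                  # trial_id -> gpu id that claimed it
lock = threading.Lock()

def worker(gpu):
    while True:
        try:
            trial = jobs.get_nowait()
        except queue.Empty:
            return                # no more staged trials
        # A real worker would set CUDA_VISIBLE_DEVICES=<gpu> and launch the run.
        with lock:
            assignments[trial] = gpu
        jobs.task_done()

threads = [threading.Thread(target=worker, args=(g,)) for g in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(assignments))  # every trial claimed exactly once
```

The point of the analogy: jobs are never assigned to GPUs up front; each per-GPU worker grabs the next staged trial whenever it becomes free, so faster runs naturally free their GPU sooner.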
