This is my approach using queues on an 8-GPU machine.
First, I set up a queue for each GPU. This only needs to be done once.
for i in {0..7}
do
  guild run --background --yes queue gpus=$i
done
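If the GPU count varies between machines, the loop can derive it from nvidia-smi instead of hard-coding 0..7. A sketch of that variation (it assumes nvidia-smi is on the PATH):

n_gpus=$(nvidia-smi --list-gpus | wc -l)
for i in $(seq 0 $((n_gpus - 1)))
do
  # one queue pinned to each GPU index
  guild run --background --yes queue gpus=$i
done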
I can check that my queues are running with
guild runs -Fo queue
To submit jobs, I use, e.g.,
guild run --stage-trials --quiet train.py learning_rate='logspace[-3:0:4]' batch_size='[1,2,5,10,20]'
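As a sanity check on what that grid expands to: logspace[-3:0:4] should give four log-spaced learning rates (0.001, 0.01, 0.1, 1.0, assuming base 10 as in numpy.logspace), so crossed with the five batch sizes it stages 4 x 5 = 20 trials. If you prefer to spell the grid out, a nested loop with --stage is an equivalent sketch:

for lr in 0.001 0.01 0.1 1.0
do
  for bs in 1 2 5 10 20
  do
    # stage one trial; an idle queue will pick it up
    guild run --stage --yes train.py learning_rate=$lr batch_size=$bs
  done
done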
The queues automatically pick up the staged trials and run them, each on its assigned GPU.
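When the sweep is done and I want the GPUs back, I stop the queues. Something like the following should work, assuming guild stop accepts the same operation filter as guild runs (check guild stop --help for your version):

guild stop -Fo queue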