I have a Guild operation main that runs 3 steps which are other operations: impute, evaluate, and predict. The latter two depend on the impute operation (specifically a model checkpoint and some data output).
(guild.yml file for reference)
- When one of the steps fails (e.g. evaluate), the main op shows error and so does evaluate. If I fix the error in the code and restart the run with something like `for hash in $(guild select --operation evaluate --error --all); do guild run -y --background --restart $hash --force-sourcecode; done`, then the evaluate op changes to completed, but the main operation does not. It may not be possible to update it, but it is slightly unclean and annoying to keep track of what broke and what has been fixed. I end up with something like:
[71:ec03c916] evaluate 2023-02-20 14:43:57 completed dvae myexperiment
[72:957ecb30] evaluate 2023-02-20 14:43:56 completed dvae myexperiment
[73:19493e6b] evaluate 2023-02-20 14:43:56 completed dvae myexperiment
...
[127:fe72a7ff] predict 2023-02-18 20:58:56 completed dvae
[128:617bc8fd] impute 2023-02-18 20:26:16 completed dvae myexperiment
[129:2b155ff0] main 2023-02-18 20:26:14 error dvae
[130:39125144] predict 2023-02-18 20:21:08 completed dvae
[131:5c4ed46a] impute 2023-02-18 19:45:25 completed dvae myexperiment
[132:c542fcbe] main 2023-02-18 19:45:24 error dvae
It says error for main, but it has really been fixed since the evaluate op was fixed.
Another issue is which files are stored under each op, which leads me to the next point, where I’ll use run 132 as an example:
- If I look at what is stored under the main op I see:
me@machine:~/mambaforge/envs/ap/.guild/runs/c542fcbe24ab4a86b1ea0e33fabd839a$ ls
evaluate impute options.yml predict
If I drill into the directories I see:
me@machine:~/mambaforge/envs/ap/.guild/runs/c542fcbe24ab4a86b1ea0e33fabd839a$ cd evaluate
me@machine:~/mambaforge/envs/ap/.guild/runs/c542fcbe24ab4a86b1ea0e33fabd839a/evaluate$ ls
F.O. options.yml serialized_models
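To see whether entries like these are real directories or links into the step runs, a small script can list each entry together with its link target. This is only a minimal sketch: describe_entries is a hypothetical helper name, it uses nothing Guild-specific, and it makes no assumption about how Guild actually lays out step runs.

```python
from pathlib import Path


def describe_entries(run_dir):
    """Map each entry in run_dir to its resolved link target.

    Entries that are real files or directories (not symlinks) map to None.
    describe_entries is a hypothetical helper, not part of Guild.
    """
    result = {}
    for entry in sorted(Path(run_dir).iterdir()):
        if entry.is_symlink():
            # resolve() follows the link to the path it finally points at
            result[entry.name] = str(entry.resolve())
        else:
            result[entry.name] = None
    return result
```

Calling describe_entries on the main run directory shown above would show, per entry, either None (a real copy) or the path it links to.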
If evaluate fails and I rerun it, does that mean that the evaluate folder will be updated too (is it a symlink)? There seems to be some redundancy too, which leads me to:
- If I look at the output of the substeps impute and predict I see:
# impute op
me@machine:~/mambaforge/envs/ap/.guild/runs/5c4ed46a12e145158b2351621ee81345/serialized_models$ ls
AEDitto_STATIC.pt imputed_data.pkl STATIC_test_dataloader.pt
# predict op
me@machine:~/mambaforge/envs/ap/.guild/runs/391251446fe041048d93d80deda6ac8a/serialized_models$ ls
AEDitto_STATIC.pt imputed_data.pkl STATIC_test_dataloader.pt
I also see
# impute op
me@machine:~/mambaforge/envs/ap/.guild/runs/5c4ed46a12e145158b2351621ee81345$ ls F.O./0.33/MNAR\(G\)/dvae/lightning_logs/version_0/
events.out.tfevents.1676780435.lambda2.6521.0
events.out.tfevents.1676780445.lambda2.6521.1
events.out.tfevents.1676780453.lambda2.6521.2
hparams.yaml
# predict op
me@machine:~/mambaforge/envs/ap/.guild/runs/391251446fe041048d93d80deda6ac8a$ ls F.O./0.33/MNAR\(G\)/dvae/lightning_logs/version_0/
events.out.tfevents.1676780435.lambda2.6521.0
events.out.tfevents.1676780445.lambda2.6521.1
events.out.tfevents.1676780453.lambda2.6521.2
hparams.yaml
It looks like it copies everything from the impute op over to the parent (main) and to the dependent steps (predict and evaluate). This is a lot of redundancy, especially for expensive/large models and artifacts, and it is making me run out of space on my machine.
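To put a number on the redundancy, the runs directory can be scanned for files with identical content. The following is a rough sketch under stated assumptions: find_duplicates and wasted_bytes are hypothetical helper names, symlinks are skipped so that only real copies are counted, and hard links (which take no extra space) would still be reported as copies.

```python
import hashlib
import os
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large checkpoints fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(runs_root):
    """Group regular files under runs_root by (size, content hash).

    Symlinked files are skipped and symlinked directories are not entered.
    Only groups with more than one file are returned.
    """
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(runs_root, followlinks=False):
        for name in filenames:
            path = Path(dirpath) / name
            if path.is_symlink() or not path.is_file():
                continue
            groups[(path.stat().st_size, file_digest(path))].append(path)
    return {key: sorted(paths) for key, paths in groups.items() if len(paths) > 1}


def wasted_bytes(duplicates):
    """Bytes used by every copy beyond the first in each duplicate group."""
    return sum(size * (len(paths) - 1) for (size, _), paths in duplicates.items())
```

For the setup above, this would be called as find_duplicates(os.path.expanduser("~/mambaforge/envs/ap/.guild/runs")), and wasted_bytes on the result gives the space taken by the extra copies.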
My questions are:
a) How do I avoid redundancy in stored artifacts between parent and child steps, like main and its substeps?
b) How do I avoid redundancy among sibling runs where one depends on another? While evaluate relies on the artifacts from impute, I don’t want it to store all the artifacts all over again (including the model checkpoints, data, and the logging files); I just want evaluate to use the checkpointed data and model. I know there’s a select: option, but it seems to be regex-based, which makes it complicated to select the checkpointed model AND data. Also, even if that solves excluding the logged files, I don’t want the files it relies on to be copied into its final logged artifacts.