4 - Writing Machine Learning Scripts

In this section we will explore how to write a Machine Learning script that helps you make the best use of the cluster and follows good practices. You can find the full example at example.py.

4.1 - Parsing Arguments and Hyperparameters

When running Machine Learning experiments, it's important to be able to easily change the hyperparameters of the model and the training process. There are several ways to achieve this, but we recommend using the argparse library for general flags and parameters, and a YAML file (read with the yaml library) for model and training loop hyperparameters. The code below shows an example, whose full version you can find at example.py.

import argparse
import yaml
 
parser = argparse.ArgumentParser()
parser.add_argument("--config", type=str, default="config.yaml")
parser.add_argument("--debug", action="store_true")
parser.add_argument("--seed", type=int, default=42)
parser.add_argument("--num_workers", type=int, default=4)
parser.add_argument("--device", type=str, default="cuda:0")
 
args = parser.parse_args()
with open(args.config, "r") as f:
    config = yaml.safe_load(f)
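
As a rough illustration of how the two sources of settings can be combined, the sketch below uses a plain dictionary as a stand-in for the result of yaml.safe_load. The keys (model, training, hidden_size, lr, epochs) and the rule that --debug shortens training to one epoch are assumptions made for this example; they are not taken from example.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with a subset of the general flags shown above."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", type=str, default="config.yaml")
    parser.add_argument("--debug", action="store_true")
    parser.add_argument("--seed", type=int, default=42)
    return parser


# Stand-in for yaml.safe_load(f) on a hypothetical config.yaml such as:
#   model:
#     hidden_size: 128
#   training:
#     lr: 0.001
#     epochs: 20
config = {
    "model": {"hidden_size": 128},
    "training": {"lr": 0.001, "epochs": 20},
}

# Passing a list to parse_args simulates the command line
# `python example.py --debug --seed 7`.
args = build_parser().parse_args(["--debug", "--seed", "7"])

# Hypothetical convention: a debug run trains for a single epoch.
epochs = 1 if args.debug else config["training"]["epochs"]
lr = config["training"]["lr"]

print(f"seed={args.seed} epochs={epochs} lr={lr}")
```

Keeping hyperparameters in the YAML file and run-level switches on the command line means a single config file can be reused across debug runs, different seeds and different devices.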

4.2 - Logging

We recommend using logging tools like Weights & Biases (wandb), Comet, or TensorBoard to keep track of your experiments and results. They allow you to log metrics, hyperparameters, losses, tables, figures and much more, and they provide a web-based interface that helps you visualize and compare different runs.

The figure below shows an example of the web interface of Weights & Biases for the example.py script.

Weights and Biases screenshot

4.3 - Saving and Loading Checkpoints

It's a good practice to save checkpoints of your model during training. This way you can resume training from a checkpoint in case your training process is interrupted for some reason. You can also save multiple checkpoints during training, and select the best model according to some metric. You can save a checkpoint with the following code:

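import torch

model = torch.nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

# Index of the last completed epoch; in a real script this comes from the training loop.
epoch = 0

# Store everything needed to resume: weights, optimizer state and scheduler state.
checkpoint = {
    "epoch": epoch,
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "scheduler": scheduler.state_dict(),
}

torch.save(checkpoint, "checkpoint.pth")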

To load a checkpoint you can use the following code:

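# model, optimizer and scheduler must already be constructed, as in the saving example.
# map_location="cpu" lets the file load even on a machine without a GPU.
checkpoint = torch.load("checkpoint.pth", map_location="cpu")

model.load_state_dict(checkpoint["model"])
optimizer.load_state_dict(checkpoint["optimizer"])
scheduler.load_state_dict(checkpoint["scheduler"])

# Resume from the epoch after the last completed one.
start_epoch = checkpoint["epoch"] + 1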
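
Saving and loading can be combined into a training loop that resumes automatically after an interruption. The following is a minimal sketch under some assumptions: the train function, the random toy data and the choice to save after every epoch are illustrative and not necessarily how example.py is organised.

```python
import os
import tempfile

import torch


def train(num_epochs: int, ckpt_path: str) -> int:
    """Train a toy model, resuming from ckpt_path if it exists.

    Returns the epoch index at which this call started training.
    """
    model = torch.nn.Linear(3, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

    start_epoch = 0
    if os.path.exists(ckpt_path):
        # Restore all state and continue after the last completed epoch.
        checkpoint = torch.load(ckpt_path, map_location="cpu")
        model.load_state_dict(checkpoint["model"])
        optimizer.load_state_dict(checkpoint["optimizer"])
        scheduler.load_state_dict(checkpoint["scheduler"])
        start_epoch = checkpoint["epoch"] + 1

    for epoch in range(start_epoch, num_epochs):
        # Random toy batch: inputs of shape (8, 3), targets of shape (8, 1).
        inputs, targets = torch.randn(8, 3), torch.randn(8, 1)
        loss = torch.nn.functional.mse_loss(model(inputs), targets)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()

        # Save after every epoch so that at most one epoch of work is lost.
        torch.save(
            {
                "epoch": epoch,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "scheduler": scheduler.state_dict(),
            },
            ckpt_path,
        )
    return start_epoch


ckpt_path = os.path.join(tempfile.mkdtemp(), "checkpoint.pth")
first_start = train(num_epochs=2, ckpt_path=ckpt_path)   # fresh run
second_start = train(num_epochs=5, ckpt_path=ckpt_path)  # resumes from the file
```

On a cluster, where jobs may be preempted or hit a time limit, resubmitting the same command then continues the run instead of starting over.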

4.4 - Seeds

It's a good practice to set the seeds of the random number generators of the libraries you are using. This way you can reproduce the results of your experiments. You can set the seeds with the following code:

import random
import numpy as np
import torch
 
def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

However, it's important to note that some GPU operations (for example, certain cuDNN convolution algorithms) are non-deterministic by default, so results can differ between runs even when the seeds are fixed. To enforce deterministic behaviour in cuDNN, set its deterministic flag to True with the following code:

torch.backends.cudnn.deterministic = True
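
A quick way to check that seeding works is to seed twice and compare the numbers drawn each time. The sketch below does this on the CPU. The draw helper is hypothetical, and setting torch.backends.cudnn.benchmark to False is an extra step not mentioned above; it is commonly paired with the deterministic flag so that cuDNN does not auto-select algorithms, but treat it as an assumption rather than a requirement of this guide.

```python
import random

import numpy as np
import torch


def set_seed(seed: int) -> None:
    """Seed the Python, NumPy and PyTorch random number generators."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # safe to call without a GPU
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False  # assumption: disable algorithm auto-tuning


def draw() -> tuple:
    """Draw one sample from each library (hypothetical helper for this check)."""
    return random.random(), float(np.random.rand()), torch.rand(3)


set_seed(42)
first = draw()
set_seed(42)
second = draw()

# With identical seeds, all three libraries repeat the same draws.
print(first[0] == second[0], first[1] == second[1], torch.equal(first[2], second[2]))
```

Calling set_seed(args.seed) once at the start of the script, before the model and data loaders are created, keeps weight initialisation and shuffling tied to the seed passed on the command line.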

4.5 - Testing Code with Jupyter Notebooks

When building a Machine Learning model, it's important to test your code before running it on the cluster. Jupyter Notebooks are a great tool to do this. They allow you to test a piece of code and visualize the results. You can find an example of a Jupyter Notebook at example.ipynb that tests the MNISTDataset.

4.6 - Type Hinting and Docstrings

Type hinting and docstrings are a good practice when writing code. They help you and other people to understand the code and make it easier to debug. Additionally, in Machine Learning scripts, it can be useful to document variables with their shape and type.

Example of a function with type hinting and docstrings:

import torch

def add(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Adds two torch tensors element-wise.

    Args:
        a (torch.Tensor): First tensor.
        b (torch.Tensor): Second tensor, broadcastable to the shape of a.

    Returns:
        torch.Tensor: Sum of a and b.
    """
    return a + b

You can find an example of type hinting and docstrings at example.py.
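
To illustrate documenting shapes and types, here is a small sketch. The function batch_accuracy and its argument names are hypothetical and chosen for this example only; the convention of noting the shape in the docstring and in trailing comments is one option among several.

```python
import torch


def batch_accuracy(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Computes the classification accuracy of one batch.

    Args:
        logits (torch.Tensor): Float tensor of shape (batch_size, num_classes).
        labels (torch.Tensor): Int64 tensor of shape (batch_size,).

    Returns:
        float: Fraction of correct predictions, between 0 and 1.
    """
    predictions = logits.argmax(dim=1)  # (batch_size,), int64
    correct = predictions == labels     # (batch_size,), bool
    return correct.float().mean().item()


logits = torch.tensor([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]])  # (3, 2), float32
labels = torch.tensor([1, 0, 0])                             # (3,), int64
accuracy = batch_accuracy(logits, labels)
print(f"accuracy: {accuracy:.3f}")
```

Shape comments are cheap to write and make shape-mismatch bugs much easier to spot when reading the code later.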

4.7 - Git

You should always keep track of the different versions of your code using a version control system like Git, which will enable you to reproduce the results of your experiments and track changes that may have introduced bugs.
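
One way to connect Git with reproducibility is to record the current commit hash alongside each experiment, for example in the logged hyperparameters. The sketch below is a suggestion rather than something taken from example.py; the helper name get_git_commit is hypothetical, and it relies on the standard git rev-parse HEAD command.

```python
import subprocess
from typing import Optional


def get_git_commit() -> Optional[str]:
    """Returns the hash of the current Git commit, or None if unavailable.

    None is returned when the script is not run inside a Git repository
    or when the git executable is not installed.
    """
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (subprocess.CalledProcessError, FileNotFoundError):
        return None
    return result.stdout.strip()


commit = get_git_commit()
print(f"git commit: {commit}")
```

Storing this value with the results makes it possible to check out the exact code version that produced a given run, provided all changes were committed before launching it.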