1 - Introduction to the Cluster

The cluster, with domain name mlkd.tp.vps.inesc-id.pt, comprises 9 virtual machines. Namely:

7 dedicated to computation
- douro, lima, mondego, sado, tamega, tejo, zezere
1 to host the group's website and monitor GPU usage on other machines
- mlkd
1 for centralized authentication and file sharing within the cluster
- atlantico

Cluster Architecture

For simplicity, in this guide, the machines will be referred to only by their hostname. E.g., tejo.mlkd.tp.vps.inesc-id.pt will be referred to by tejo.

1.1 - Authentication and File Server

The cluster's centralized authentication service is provided by atlantico. Whenever you log into a machine, it's with atlantico they communicate to authenticate you. This centralization allows easier account creation and management. However, it also constitutes a Single Point of Failure (SPoF), since a failure by atlantico leads all machines to become unavailable.

Additionally, atlantico provides a Network File System (NFS), that allows seamless access to the same files in all machines, avoiding data replication and ensuring environment consistency. However, this solution makes atlantico a SPoF for data access, and reduces data access speed, since it becomes constrained by network speed rather than memory and bus speed.

1.1.1 - Home Directories

Home directories are stored in a single disk, attached to atlantico, which exports them to all other machines via NFS.

To prevent abusive usage, they are limited by a usage quota (currently 15GB). If you exceed this quota, you will get an error message finishing with Disk quota exceeded. If this happens:

Delete unnecessary files or move them to your local machine
If you are still exceeding the quota, contact the cluster administrator

1.1.2 - Data Storage

Datasets, especially large ones, are usually stored in separate disks. Additionally, some users might have disks to store personal data that does not fit in their home directory. These disks are attached to atlantico and exported via NFS to each machine's /media directory.

Currently, the following public datasets are available:

If you need to store a dataset that does not fit in your home directory, either public or private, contact the cluster administrator.

1.1.3 - IO Intensive Operations

The cluster's NFS solution has several advantages. However, it also leads to some issues. Since files are shared over the network, each transfer consumes network bandwidth. Therefore, IO intensive operations can lead to:

Network congestion
High latency
Slow data transfer rates
Decreased overall system performance

There is no single solution for this problem, but the following might help:

Load the full dataset from disk to RAM at the start of the program, instead of loading items on the fly, i.e., on the Dataset's __init__ method rather than on __getitem__
Store your data in formats optimized for compression, such as, HDF5 (opens in a new tab), TFRecord (opens in a new tab), LMDB (opens in a new tab), Parquet (opens in a new tab)
Ask the cluster administrator to mount your disk directly on the machine running your code

1.2 - Computation Machines

Currently, machines dedicated to compuation have 5 types of configuration:

96 cores CPU, 960GB RAM, 4x80GB NVIDIA A100 GPUs
- guadiana
40 cores CPU, 256GB RAM, 4x32GB NVIDIA V100 GPUs
- zezere
48 cores CPU, 240GB RAM, 2x48GB NVIDIA A40 GPUs
- cavado, vouga
8 cores CPU, 64GB RAM, 32GB NVIDIA V100 GPU
- douro, mondego, sado, tejo
44 cores CPU, 120GB RAM, no GPU
- lima, tamega

Except for sado and guadiana, you can use any machine, according to your needs. As a rule of thumb, you can consider the following:

You only need CPU
- lima, tamega
You need a single GPU
- douro, mondego, tejo
You need more than a single GPU or all other machines are busy
- cavado, vouga, zezere

When using machines with multiple GPUs, you should only split your model between multiple GPUs if you really need to. Otherwise, you should use a single GPU and should avoid using one that is already in use. Additionally, to maximize resource usage efficiency, when using a GPU, you should always aim to use as much GPU memory as possible.

Unlike other machines, cavado and vouga have 2.4TB ephemeral disks, which are much faster to access than other disks, and can be used to store input data for model training. However, keep in mind that, as the name suggests, these disks are ephemeral, and anyone can read from and write on them. Thus, they shouldn't be used to store private data or data for which you don't have other copies.

1.3 - GPU Monitor

To determine the machine where you can run your code without having to access them all, you can use the GPU Monitoring Service (opens in a new tab).

GPU Monitor

Introduction 2 - Using the Machines