Parallelization with Arrays
Introduction
When many jobs differ only in their parameters, submitting them one by one is tedious. SLURM job arrays let a single submission launch one task per parameter value or parameter combination.
Example Code
Suppose we want to run a script called process_data.py that takes one input parameter and prints its results to standard output.
An example command would be:
python3 process_data.py 3
And the code would be:
import pandas as pd
import sys
# Get the task ID from command-line arguments
try:
    task_id = int(sys.argv[1])
except IndexError:
    task_id = 0
# Create a sample DataFrame with users and balances
data = {
    'user': ['Alice', 'Bob', 'Charlie', 'David'],
    'balance': [100.50, 250.00, 50.75, 500.20]
}
df = pd.DataFrame(data)
# Simulate a calculation that depends on task ID.
# In this case, we multiply the balance by task_id + 1.
# This is only an example; in a real case we could run more complex computations.
df['balance'] = df['balance'] * (task_id + 1)
print(f"--- Results for Task ID: {task_id} ---")
print(df[['user', 'balance']])
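The script above falls back to task ID 0 when no argument is passed. SLURM also exports the array index to each task as the SLURM_ARRAY_TASK_ID environment variable, so one possible variation is to consult that variable before using the default. The following is a minimal sketch of that idea, not part of the original script; the helper name resolve_task_id is hypothetical.

```python
def resolve_task_id(argv, environ, default=0):
    """Pick the task ID from argv, then from the environment, then a default.

    resolve_task_id is a hypothetical helper, not part of process_data.py.
    """
    if len(argv) > 1:
        # An explicit command-line argument takes priority.
        return int(argv[1])
    # SLURM sets SLURM_ARRAY_TASK_ID for every task of a job array.
    value = environ.get("SLURM_ARRAY_TASK_ID")
    if value is not None:
        return int(value)
    # Neither source is available, for example in a local test run.
    return default
```

Inside process_data.py this could be called as resolve_task_id(sys.argv, os.environ), which keeps the function easy to test because nothing is read from global state.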
Next, we create a file called launch_python_array.sh that defines the array of task IDs to run. Here there is only one input parameter, but with more parameters each task ID could select one parameter combination. The script also activates a venv to provide an isolated Python environment, and uses SLURM directives to configure the job.
#!/bin/bash
#SBATCH --job-name=pandas_venv
#SBATCH --output=resultat_%A_%a.out
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=2G
#SBATCH --time=00:10:00
#SBATCH --array=0-2
# 1. Path to the venv (adjust this path to your setup)
VENV_PATH="$HOME/tests/arrays_example/venv/bin/activate"
# 2. Activate the virtual environment
if [ -f "$VENV_PATH" ]; then
    echo "Activating venv..."
    source "$VENV_PATH"
else
    echo "Error: venv not found at: $VENV_PATH"
    exit 1
fi
# 3. Run the Python script with task ID as argument
# In this example, $SLURM_ARRAY_TASK_ID will be 0, 1, or 2 depending on the running task.
echo "Running Python for task $SLURM_ARRAY_TASK_ID..."
python3 process_data.py "$SLURM_ARRAY_TASK_ID"
# Sleep for 60 seconds so that the task stays running long enough to be observed before the venv is deactivated
echo "Sleeping for a while:"
sleep 60
# 4. Deactivate the virtual environment
echo "Deactivating environment..."
deactivate
echo "Process completed."
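The launch script above passes a single parameter. When several parameters are involved, one common approach is to let each task ID select one combination from a fixed, ordered list. The sketch below illustrates this in Python; the parameter names SCALE_FACTORS and ROUNDING_DIGITS, their values, and the helper params_for_task are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical parameter values, used only for illustration.
SCALE_FACTORS = [1, 2, 3]
ROUNDING_DIGITS = [0, 2]

# Every combination, in a fixed order so that each task ID always
# maps to the same pair. The last list varies fastest.
COMBINATIONS = list(product(SCALE_FACTORS, ROUNDING_DIGITS))


def params_for_task(task_id):
    """Return the (scale_factor, rounding_digits) pair for one array task."""
    if not 0 <= task_id < len(COMBINATIONS):
        raise ValueError(
            f"task_id {task_id} is outside 0-{len(COMBINATIONS) - 1}"
        )
    return COMBINATIONS[task_id]
```

With these six combinations, the array directive would need to cover task IDs 0 to 5 (#SBATCH --array=0-5), and task 3 would receive the pair (2, 2).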
Execution
sbatch launch_python_array.sh
In this example, the command launches 3 independent tasks, one per task ID (0, 1, and 2). The Python script scales the balances by a different factor for each task ID, and each task writes to its own output file named resultat_%A_%a.out, where %A is the job ID of the array and %a is the task ID. Once execution is finished, the output files can be found in the directory from which the command was launched.
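Because every task writes its own file, it can be convenient to gather the results afterwards. The following is a minimal sketch, assuming the output files follow the resultat_%A_%a.out pattern from the launch script and all sit in one directory; the function name find_array_outputs is hypothetical.

```python
import re
from pathlib import Path

# Matches names such as resultat_123456_2.out (job ID, then task ID).
OUTPUT_PATTERN = re.compile(r"resultat_(\d+)_(\d+)\.out")


def find_array_outputs(directory="."):
    """Return a dict mapping (job_id, task_id) to the output file path."""
    outputs = {}
    for path in Path(directory).glob("resultat_*_*.out"):
        match = OUTPUT_PATTERN.fullmatch(path.name)
        if match:  # Skip files whose IDs are not numeric.
            job_id, task_id = int(match.group(1)), int(match.group(2))
            outputs[(job_id, task_id)] = path
    return outputs


def print_array_outputs(directory="."):
    """Print each output file, ordered by job ID and then task ID."""
    for (job_id, task_id), path in sorted(find_array_outputs(directory).items()):
        print(f"=== job {job_id}, task {task_id}: {path.name} ===")
        print(path.read_text())
```

Calling print_array_outputs() from the submission directory after the job has finished would print the three result files in task order.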