Unity Module

Overview

At a Glance

Purpose:

Run FUSION simulations on SLURM-managed HPC clusters

Location:

fusion/unity/

Key Files:

make_manifest.py, submit_manifest.py, fetch_results.py

Cluster:

Any SLURM cluster (named “Unity” after UMass Amherst’s cluster)

The unity module provides a complete workflow for running FUSION simulations on High-Performance Computing (HPC) clusters that use the SLURM workload manager.

Important

This module is for SLURM clusters only.

SLURM (Simple Linux Utility for Resource Management) is the job scheduler used by many HPC clusters including Unity at UMass, NERSC, and most university clusters. If your cluster uses a different scheduler (PBS, SGE, etc.), this module will not work directly.
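
If you are not sure which scheduler your cluster runs, a quick check is whether the SLURM client tools are available and respond:

# If these commands exist and report a version, the cluster runs SLURM
sinfo --version
sbatch --version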

What this module does:

  1. Generate job manifests - Convert parameter specifications into CSV manifests

  2. Submit to SLURM - Submit manifests as array jobs to the cluster

  3. Fetch results - Automatically find and download results to your local machine

Quick Start: End-to-End Example

This section walks you through the complete workflow from setting up your environment to downloading results.

Step 1: Create the Virtual Environment on the Cluster

First, SSH into your cluster and create a virtual environment for FUSION.

# SSH into your cluster
ssh username@unity.rc.umass.edu

# Navigate to your work directory
cd /work/username

# Clone FUSION (if not already done)
git clone https://github.com/your-org/FUSION.git
cd FUSION

# Create virtual environment using the provided script
./fusion/unity/scripts/make_unity_venv.sh /work/username/fusion_venv python3.11

# Activate the virtual environment
source /work/username/fusion_venv/venv/bin/activate

# Install FUSION and dependencies
pip install -e .
pip install -r requirements.txt

Note

The "unity" in the script name comes from our cluster, Unity at UMass Amherst, but it works on any SLURM cluster. Name the environment whatever makes sense for your setup.

Step 2: Create a Specification File

Create a YAML file that defines your experiment parameters. The module will automatically expand parameter combinations into individual jobs.

Create specs/my_experiment.yaml:

# Resource allocation for SLURM
resources:
  partition: gpu-long      # SLURM partition name
  time: "24:00:00"         # Wall clock time (HH:MM:SS)
  mem: "32G"               # Memory per job
  cpus: 8                  # CPU cores per job
  gpus: 1                  # GPUs per job (0 for CPU-only)
  nodes: 1                 # Nodes per job

# Parameter grid - all combinations will be generated
grid:
  # Common parameters (same for all jobs)
  common:
    network: "NSFNet"
    num_requests: 10000
    holding_time: 5000
    guard_slots: 1
    cores_per_link: 7
    allocation_method: "first_fit"

  # Parameters to sweep (creates Cartesian product)
  path_algorithm: ["k_shortest_path", "least_congested"]
  erlang_start: [100, 200, 300]
  k_paths: [3, 5]

This specification creates 12 jobs (2 algorithms x 3 traffic loads x 2 k values).

Step 3: Generate the Manifest

Run the manifest generator to create job files:

# From your FUSION directory
python -m fusion.unity.make_manifest my_experiment

# Or with full path
python -m fusion.unity.make_manifest specs/my_experiment.yaml

Output structure created:

experiments/
  0119/                          # Date (MMDD)
    1430/                         # Time (HHMM)
      NSFNet/                      # Network name
        manifest.csv              # Job parameters (one row per job)
        manifest_meta.json        # Metadata about the manifest

manifest.csv contents:

path_algorithm,erlang_start,erlang_stop,k_paths,network,num_requests,...
k_shortest_path,100,150,3,NSFNet,10000,...
k_shortest_path,100,150,5,NSFNet,10000,...
k_shortest_path,200,250,3,NSFNet,10000,...
...

Step 4: Submit Jobs to SLURM

Submit your manifest as a SLURM array job:

python -m fusion.unity.submit_manifest \
    experiments/0119/1430/NSFNet \
    run_sim.sh

What happens:

  1. Reads the manifest CSV

  2. Creates a jobs/ directory for SLURM output logs

  3. Submits an array job where each task processes one manifest row

  4. Returns the SLURM job ID

Example output:

Submitted batch job 12345678
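
Under the hood, the submission is conceptually an sbatch array call. The following is a sketch of its shape, assuming the 12-job example spec above (submit_manifest.py assembles the actual command for you):

# Conceptual sketch only; submit_manifest.py builds the real command.
# Array indices 0-11 correspond to the 12 manifest rows in the example spec.
sbatch --array=0-11 \
       --partition=gpu-long --time=24:00:00 --mem=32G \
       --cpus-per-task=8 --gres=gpu:1 --nodes=1 \
       run_sim.sh
# Each task also receives the environment variables listed under
# "SLURM Submission Details" below.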

SLURM logs are saved to:

experiments/0119/1430/NSFNet/jobs/
  slurm_12345678_0.out     # First job
  slurm_12345678_1.out     # Second job
  ...

Step 5: Monitor Your Jobs

Use standard SLURM commands to monitor progress:

# Check job status
squeue -u $USER

# Check detailed job info
sacct -j 12345678

# View job output in real-time
tail -f experiments/0119/1430/NSFNet/jobs/slurm_12345678_0.out

# Check cluster priority (using provided script)
./fusion/unity/scripts/priority.sh

Step 6: Fetch Results to Your Local Machine

Once jobs complete, download results to your local machine.

On your local machine, create configs/config.yml:

# Remote paths on the cluster
metadata_root: "username@unity.rc.umass.edu:/work/username/FUSION/experiments"
data_root: "username@unity.rc.umass.edu:/work/username/FUSION/data"
logs_root: "username@unity.rc.umass.edu:/work/username/FUSION/logs"

# Local destination
dest: "~/cluster_results"

# Which experiment to fetch
experiment: "0119/1430/NSFNet"

# Set to true to preview without downloading
dry_run: false

Run the fetch command:

python -m fusion.unity.fetch_results

What happens:

  1. Downloads the runs index file from the cluster

  2. Identifies all completed simulation outputs

  3. Uses rsync to download output data, input configs, and logs

  4. Organizes everything in your local destination directory

Local result structure:

~/cluster_results/
  data/
    NSFNet/
      0119/1430/
        output/
          s1/                  # Seed 1 results
          s2/                  # Seed 2 results
        input/
          sim_input_s1.json
          sim_input_s2.json
  logs/
    k_shortest_path/
      NSFNet/
        0119/1430/
          simulation.log

How It Works

Architecture Overview

LOCAL MACHINE                         CLUSTER (SLURM)
=============                         ===============

specs/experiment.yaml
       |
       v
+------------------+
| make_manifest.py |  ------>  experiments/MMDD/HHMM/network/
+------------------+              manifest.csv
                                  manifest_meta.json
                                         |
                                         v
                               +--------------------+
                               | submit_manifest.py |
                               +--------------------+
                                         |
                                         v
                                  SLURM Array Job
                                  (sbatch --array=0-N)
                                         |
                                  +------+------+
                                  |      |      |
                                  v      v      v
                                Job 0  Job 1  Job N
                                  |      |      |
                                  v      v      v
                               data/output/network/...
                                         |
+------------------+                     |
| fetch_results.py |  <------ rsync -----+
+------------------+
       |
       v
~/cluster_results/

Manifest Generation Details

The make_manifest.py module converts specifications into job manifests.

Input Modes:

  1. Grid Mode (grid or grids): Cartesian product of parameter lists

  2. Explicit Mode (jobs): Manually specified job list

Grid Expansion Example:

grid:
  path_algorithm: ["ppo", "dqn"]
  erlang_start: [100, 200]
  k_paths: [3]

Expands to 4 jobs (2 x 2 x 1):

ppo,  100, 3
ppo,  200, 3
dqn,  100, 3
dqn,  200, 3

Automatic Erlang Stop:

If erlang_stop is not specified, it’s automatically set to erlang_start + 50.

Type Casting:

All parameters are automatically cast to the correct types based on FUSION’s configuration schema. Booleans, lists, and dicts are properly encoded.

SLURM Submission Details

The submit_manifest.py module submits manifests as SLURM array jobs.

Environment Variables Passed to Jobs:

MANIFEST=/path/to/manifest.csv    # Full path to manifest
N_JOBS=11                          # Last array index (jobs are 0-indexed)
JOB_DIR=experiments/0119/1430/net  # Experiment directory
NETWORK=NSFNet                     # Network name
DATE=0119                          # Date portion
JOB_NAME=ppo_100_0119_1430_net     # SLURM job name
PARTITION=gpu-long                 # SLURM partition
TIME=24:00:00                      # Time limit
MEM=32G                            # Memory
CPUS=8                             # CPUs
GPUS=1                             # GPUs
NODES=1                            # Nodes

Your bash script (e.g., run_sim.sh) reads these variables and the SLURM_ARRAY_TASK_ID to determine which manifest row to process.
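
A minimal sketch of such a script, assuming the environment variables above are set and using a hypothetical Python entry point (replace the final line with your project's actual simulation command):

#!/bin/bash
# run_sim.sh sketch (illustrative only; adapt to your own entry point)
set -euo pipefail

# MANIFEST, N_JOBS, etc. are provided by submit_manifest.py;
# SLURM_ARRAY_TASK_ID is set by SLURM for each array task.
echo "Running task ${SLURM_ARRAY_TASK_ID} (indices 0-${N_JOBS}) from ${MANIFEST}"

# The manifest has a header line, so array task N corresponds to CSV line N + 2.
ROW=$(sed -n "$((SLURM_ARRAY_TASK_ID + 2))p" "$MANIFEST")
echo "Parameters: ${ROW}"

# Hypothetical CLI shown here; substitute the real FUSION invocation.
python run_simulation.py --manifest "$MANIFEST" --row "$SLURM_ARRAY_TASK_ID"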

Result Fetching Details

The fetch_results.py module uses rsync to download results.

What Gets Downloaded:

  1. Output data: Simulation results (data/output/...)

  2. Input configs: The configuration used for each run (data/input/...)

  3. Logs: Simulation logs organized by algorithm and topology

Path Conversion:

The module automatically converts output paths to input paths:

/work/data/output/NSFNet/exp1/s1  ->  /work/data/input/NSFNet/exp1
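
In shell terms, the conversion drops the trailing seed directory and swaps output for input; a quick sketch (not the module's actual code):

# Sketch of the output -> input path conversion
out_dir="/work/data/output/NSFNet/exp1/s1"
in_dir=$(dirname "$out_dir" | sed 's|/output/|/input/|')
echo "$in_dir"    # /work/data/input/NSFNet/exp1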

Rsync Options:

  • -a: Archive mode (preserves permissions, timestamps)

  • -v: Verbose output

  • -P: Show progress and allow resume

  • --compress: Compress during transfer

A 3-second delay is added between rsync commands to avoid overwhelming the cluster.
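
Each directory transfer is conceptually equivalent to a command like the following (remote and local paths are illustrative, following the quick-start example):

# Illustrative equivalent of a single transfer performed by fetch_results.py
rsync -avP --compress \
    username@unity.rc.umass.edu:/work/username/FUSION/data/output/NSFNet/0119/1430/ \
    ~/cluster_results/data/NSFNet/0119/1430/output/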

Input Format Reference

Specification File Structure

# REQUIRED: Resource allocation
resources:
  partition: "gpu"          # SLURM partition
  time: "24:00:00"          # Wall time (HH:MM:SS)
  mem: "32G"                # Memory
  cpus: 8                   # CPU cores
  gpus: 1                   # GPUs (use 0 for CPU-only)
  nodes: 1                  # Nodes

# OPTION A: Grid-based parameter sweep
grid:
  common:
    # Parameters that are the same for all jobs
    network: "NSFNet"
    num_requests: 10000

  # Parameters to sweep (lists create combinations)
  path_algorithm: ["ppo", "dqn"]
  erlang_start: [100, 200]

# OPTION B: Multiple grids
grids:
  - common:
      network: "NSFNet"
    path_algorithm: ["ppo"]
    erlang_start: [100, 200]
  - common:
      network: "COST239"
    path_algorithm: ["dqn"]
    erlang_start: [300]

# OPTION C: Explicit job list
jobs:
  - algorithm: "ppo"
    traffic: 100
    k_paths: 3
    network: "NSFNet"
  - algorithm: "dqn"
    traffic: 200
    k_paths: 5
    network: "COST239"

Required Grid Parameters

These parameters MUST be present in grid specifications:

  • path_algorithm: Routing/RL algorithm name

  • erlang_start: Traffic load start value

  • k_paths: Number of candidate paths

  • obs_space: Observation space (for RL algorithms)

  • network: Network topology name (determines output grouping)

Fetch Configuration

The configs/config.yml file for fetching results:

# Remote paths (user@host:path format)
metadata_root: "user@cluster:/path/to/experiments"
data_root: "user@cluster:/path/to/data"
logs_root: "user@cluster:/path/to/logs"

# Local destination
dest: "~/cluster_results"

# Experiment to fetch (relative path)
experiment: "0119/1430/NSFNet"

# Preview mode (true = don't actually download)
dry_run: false

Output Format Reference

Manifest CSV

One row per job, with all parameters as columns:

path_algorithm,erlang_start,erlang_stop,k_paths,network,num_requests,is_rl,...
ppo,100,150,3,NSFNet,10000,true,...
dqn,200,250,5,NSFNet,10000,true,...

Encoding Rules:

  • Booleans: true / false (lowercase strings)

  • Lists: JSON format [1,2,3] (no spaces)

  • Dicts: JSON format {"key":"value"} (no spaces)

  • Floats: No trailing zeros (3.14 not 3.140000)

Manifest Metadata JSON

{
  "generated": "2025-01-19T14:30:45",
  "source": "/path/to/specs/my_experiment.yaml",
  "network": "NSFNet",
  "num_rows": 12,
  "resources": {
    "partition": "gpu-long",
    "time": "24:00:00",
    "mem": "32G",
    "cpus": 8,
    "gpus": 1,
    "nodes": 1
  }
}

Directory Structure

On the cluster:

FUSION/
  experiments/
    MMDD/
      HHMM/
        network/
          manifest.csv
          manifest_meta.json
          jobs/
            slurm_12345678_0.out
            slurm_12345678_1.out
  data/
    output/
      network/
        experiment_path/
          s1/
          s2/
    input/
      network/
        experiment_path/
          sim_input_s1.json
  logs/
    algorithm/
      network/
        MMDD/HHMM/

Fetched locally:

~/cluster_results/
  data/
    network/
      experiment_path/
        output/
        input/
  logs/
    algorithm/
      network/

Components

make_manifest.py

Purpose:

Generate job manifests from specification files

Entry Point:

python -m fusion.unity.make_manifest <spec>

Key Functions:

  • make_manifest(spec_path): Main entry point

  • _expand_grid(grid, resources): Expand grid to job list

  • _write_csv(rows, output_dir): Write manifest CSV

  • _cast(key, value): Type cast parameters

submit_manifest.py

Purpose:

Submit manifest as SLURM array job

Entry Point:

python -m fusion.unity.submit_manifest <dir> <script>

Key Functions:

  • submit_manifest(experiment_dir, bash_script): Main entry point

  • build_environment_variables(): Create SLURM env vars

  • read_first_row(manifest_path): Parse manifest header

fetch_results.py

Purpose:

Download results from cluster via rsync

Entry Point:

python -m fusion.unity.fetch_results

Key Functions:

  • fetch_results(): Main entry point

  • synchronize_remote_directory(): rsync a directory

  • convert_output_to_input_path(): Path conversion

Helper Scripts

Located in fusion/unity/scripts/:

  • make_unity_venv.sh: Create virtual environment on cluster

  • priority.sh: Check SLURM job priorities and queue

  • group_jobs.sh: Analyze resource usage by group

Error Handling

The module defines a custom exception hierarchy:

UnityError (base)
+-- ManifestError
|   +-- ManifestNotFoundError
|   +-- ManifestValidationError
+-- SpecificationError
|   +-- SpecNotFoundError
|   +-- SpecValidationError
+-- JobSubmissionError
+-- SynchronizationError
|   +-- RemotePathError
+-- ConfigurationError

Common errors and solutions:

SpecNotFoundError: Specification file not found

  • Check that the file exists in the current directory or the specs/ subdirectory

  • Supported extensions: .yaml, .yml, .json

ManifestValidationError: Invalid manifest parameters

  • Ensure required fields are present (path_algorithm, erlang_start, etc.)

  • Check parameter names match FUSION config schema

JobSubmissionError: SLURM submission failed

  • Verify you’re on the cluster with SLURM access

  • Check that the partition name is valid for your cluster

  • Ensure the bash script exists in the bash_scripts/ directory

SynchronizationError: rsync failed

  • Verify SSH access to cluster works

  • Check paths in config.yml are correct

  • Try with dry_run: true first

Testing

Test Location:

fusion/unity/tests/

Run Tests:

# Run all unity tests
pytest fusion/unity/tests/ -v

# Run specific test file
pytest fusion/unity/tests/test_make_manifest.py -v