.. _unity-module:

============
Unity Module
============

Overview
========

.. admonition:: At a Glance
   :class: tip

   :Purpose: Run FUSION simulations on SLURM-managed HPC clusters
   :Location: ``fusion/unity/``
   :Key Files: ``make_manifest.py``, ``submit_manifest.py``, ``fetch_results.py``
   :Cluster: Any SLURM cluster (named "Unity" after UMass Amherst's cluster)

The ``unity`` module provides a complete workflow for running FUSION
simulations on High-Performance Computing (HPC) clusters that use the
**SLURM** workload manager.

.. important::

   **This module is for SLURM clusters only.** SLURM (Simple Linux Utility
   for Resource Management) is the job scheduler used by many HPC clusters,
   including Unity at UMass, NERSC, and most university clusters. If your
   cluster uses a different scheduler (PBS, SGE, etc.), this module will not
   work directly.

**What this module does:**

1. **Generate job manifests** - Convert parameter specifications into CSV manifests
2. **Submit to SLURM** - Submit manifests as array jobs to the cluster
3. **Fetch results** - Automatically find and download results to your local machine

Quick Start: End-to-End Example
===============================

This section walks you through the complete workflow, from setting up your
environment to downloading results.

Step 1: Create the Virtual Environment on the Cluster
------------------------------------------------------

First, SSH into your cluster and create a virtual environment for FUSION.

.. code-block:: bash

   # SSH into your cluster
   ssh username@unity.rc.umass.edu

   # Navigate to your work directory
   cd /work/username

   # Clone FUSION (if not already done)
   git clone https://github.com/your-org/FUSION.git
   cd FUSION

   # Create a virtual environment using the provided script
   ./fusion/unity/scripts/make_unity_venv.sh /work/username/fusion_venv python3.11

   # Activate the virtual environment
   source /work/username/fusion_venv/venv/bin/activate

   # Install FUSION and dependencies
   pip install -e .
   pip install -r requirements.txt

.. note::

   We call it "unity_venv" because our cluster is named Unity, but this works
   on any SLURM cluster. Name it whatever makes sense for your environment.

Step 2: Create a Specification File
-----------------------------------

Create a YAML file that defines your experiment parameters. The module
automatically expands parameter combinations into individual jobs.

Create ``specs/my_experiment.yaml``:

.. code-block:: yaml

   # Resource allocation for SLURM
   resources:
     partition: gpu-long    # SLURM partition name
     time: "24:00:00"       # Wall clock time (HH:MM:SS)
     mem: "32G"             # Memory per job
     cpus: 8                # CPU cores per job
     gpus: 1                # GPUs per job (0 for CPU-only)
     nodes: 1               # Nodes per job

   # Parameter grid - all combinations will be generated
   grid:
     # Common parameters (same for all jobs)
     common:
       network: "NSFNet"
       num_requests: 10000
       holding_time: 5000
       guard_slots: 1
       cores_per_link: 7
       allocation_method: "first_fit"

     # Parameters to sweep (creates the Cartesian product)
     path_algorithm: ["k_shortest_path", "least_congested"]
     erlang_start: [100, 200, 300]
     k_paths: [3, 5]

This specification creates **12 jobs** (2 algorithms x 3 traffic loads x 2 k values).
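
To see why, note that the sweep is simply a Cartesian product of the swept
lists. A minimal sketch of the expansion (illustrative only; this is not the
module's actual code):

.. code-block:: python

   # Illustrative only: grid expansion as a Cartesian product of swept lists.
   from itertools import product

   sweep = {
       "path_algorithm": ["k_shortest_path", "least_congested"],
       "erlang_start": [100, 200, 300],
       "k_paths": [3, 5],
   }

   # Each combination becomes one job (one manifest row).
   jobs = [dict(zip(sweep, combo)) for combo in product(*sweep.values())]
   print(len(jobs))  # 12 = 2 x 3 x 2
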

Step 3: Generate the Manifest
-----------------------------

Run the manifest generator to create job files:

.. code-block:: bash

   # From your FUSION directory
   python -m fusion.unity.make_manifest my_experiment

   # Or with a full path
   python -m fusion.unity.make_manifest specs/my_experiment.yaml

**Output structure created:**

.. code-block:: text

   experiments/
     0119/                    # Date (MMDD)
       1430/                  # Time (HHMM)
         NSFNet/              # Network name
           manifest.csv       # Job parameters (one row per job)
           manifest_meta.json # Metadata about the manifest

**manifest.csv contents:**

.. code-block:: text

   path_algorithm,erlang_start,erlang_stop,k_paths,network,num_requests,...
   k_shortest_path,100,150,3,NSFNet,10000,...
   k_shortest_path,100,150,5,NSFNet,10000,...
   k_shortest_path,200,250,3,NSFNet,10000,...
   ...

Step 4: Submit Jobs to SLURM
----------------------------

Submit your manifest as a SLURM array job:

.. code-block:: bash

   python -m fusion.unity.submit_manifest \
       experiments/0119/1430/NSFNet \
       run_sim.sh

**What happens:**

1. Reads the manifest CSV
2. Creates a ``jobs/`` directory for SLURM output logs
3. Submits an array job where each task processes one manifest row
4. Returns the SLURM job ID

**Example output:**

.. code-block:: text

   Submitted batch job 12345678

**SLURM logs are saved to:**

.. code-block:: text

   experiments/0119/1430/NSFNet/jobs/
     slurm_12345678_0.out   # First job
     slurm_12345678_1.out   # Second job
     ...

Step 5: Monitor Your Jobs
-------------------------

Use standard SLURM commands to monitor progress:

.. code-block:: bash

   # Check job status
   squeue -u $USER

   # Check detailed job info
   sacct -j 12345678

   # View job output in real time
   tail -f experiments/0119/1430/NSFNet/jobs/slurm_12345678_0.out

   # Check cluster priority (using the provided script)
   ./fusion/unity/scripts/priority.sh

Step 6: Fetch Results to Your Local Machine
-------------------------------------------

Once jobs complete, download the results to your local machine.

**On your local machine**, create ``configs/config.yml``:

.. code-block:: yaml

   # Remote paths on the cluster
   metadata_root: "username@unity.rc.umass.edu:/work/username/FUSION/experiments"
   data_root: "username@unity.rc.umass.edu:/work/username/FUSION/data"
   logs_root: "username@unity.rc.umass.edu:/work/username/FUSION/logs"

   # Local destination
   dest: "~/cluster_results"

   # Which experiment to fetch
   experiment: "0119/1430/NSFNet"

   # Set to true to preview without downloading
   dry_run: false

**Run the fetch command:**

.. code-block:: bash

   python -m fusion.unity.fetch_results

**What happens:**

1. Downloads the runs index file from the cluster
2. Identifies all completed simulation outputs
3. Uses rsync to download output data, input configs, and logs
4. Organizes everything in your local destination directory

**Local result structure:**

.. code-block:: text

   ~/cluster_results/
     data/
       NSFNet/
         0119/1430/
           output/
             s1/              # Seed 1 results
             s2/              # Seed 2 results
           input/
             sim_input_s1.json
             sim_input_s2.json
     logs/
       k_shortest_path/
         NSFNet/
           0119/1430/
             simulation.log

How It Works
============

Architecture Overview
---------------------

.. code-block:: text

   LOCAL MACHINE                  CLUSTER (SLURM)
   =============                  ===============

                                  specs/experiment.yaml
                                            |
                                            v
                                  +------------------+
                                  | make_manifest.py | ------> experiments/MMDD/HHMM/network/
                                  +------------------+             manifest.csv
                                            |                       manifest_meta.json
                                            v
                                  +--------------------+
                                  | submit_manifest.py |
                                  +--------------------+
                                            |
                                            v
                                  SLURM Array Job (sbatch --array=0-N)
                                            |
                                     +------+------+
                                     |      |      |
                                     v      v      v
                                   Job 0  Job 1  Job N
                                     |      |      |
                                     v      v      v
                                  data/output/network/...
                                            |
   +------------------+                     |
   | fetch_results.py | <------ rsync ------+
   +------------------+
             |
             v
   ~/cluster_results/

Manifest Generation Details
---------------------------

The ``make_manifest.py`` module converts specifications into job manifests.

**Input Modes:**

1. **Grid Mode** (``grid`` or ``grids``): Cartesian product of parameter lists
2. **Explicit Mode** (``jobs``): Manually specified job list

**Grid Expansion Example:**

.. code-block:: yaml

   grid:
     path_algorithm: ["ppo", "dqn"]
     erlang_start: [100, 200]
     k_paths: [3]

Expands to 4 jobs (2 x 2 x 1):

.. code-block:: text

   ppo, 100, 3
   ppo, 200, 3
   dqn, 100, 3
   dqn, 200, 3

**Automatic Erlang Stop:** If ``erlang_stop`` is not specified, it is
automatically set to ``erlang_start + 50``.

**Type Casting:** All parameters are automatically cast to the correct types
based on FUSION's configuration schema. Booleans, lists, and dicts are
properly encoded.

SLURM Submission Details
------------------------

The ``submit_manifest.py`` module submits manifests as SLURM array jobs.

**Environment Variables Passed to Jobs:**

.. code-block:: text

   MANIFEST=/path/to/manifest.csv      # Full path to manifest
   N_JOBS=11                           # Last array index (12 jobs, 0-indexed)
   JOB_DIR=experiments/0119/1430/net   # Experiment directory
   NETWORK=NSFNet                      # Network name
   DATE=0119                           # Date portion
   JOB_NAME=ppo_100_0119_1430_net      # SLURM job name
   PARTITION=gpu-long                  # SLURM partition
   TIME=24:00:00                       # Time limit
   MEM=32G                             # Memory
   CPUS=8                              # CPUs
   GPUS=1                              # GPUs
   NODES=1                             # Nodes

**Your bash script** (e.g., ``run_sim.sh``) reads these variables and the
``SLURM_ARRAY_TASK_ID`` to determine which manifest row to process.
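
For illustration, selecting the row amounts to indexing the manifest with the
array task ID. A minimal Python sketch of that logic (hypothetical; an actual
``run_sim.sh`` may implement the same thing in bash instead):

.. code-block:: python

   # Hypothetical sketch: pick this task's parameters from the manifest.
   # MANIFEST and SLURM_ARRAY_TASK_ID are set by submit_manifest.py / SLURM.
   import csv
   import os

   task_id = int(os.environ["SLURM_ARRAY_TASK_ID"])

   with open(os.environ["MANIFEST"], newline="") as f:
       params = list(csv.DictReader(f))[task_id]  # one row == one simulation

   print(f"Task {task_id}: {params['path_algorithm']} at Erlang {params['erlang_start']}")
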

Result Fetching Details
-----------------------

The ``fetch_results.py`` module uses rsync to download results.

**What Gets Downloaded:**

1. **Output data**: Simulation results (``data/output/...``)
2. **Input configs**: The configuration used for each run (``data/input/...``)
3. **Logs**: Simulation logs organized by algorithm and topology

**Path Conversion:** The module automatically converts output paths to input
paths (see the sketch at the end of this section):

.. code-block:: text

   /work/data/output/NSFNet/exp1/s1 -> /work/data/input/NSFNet/exp1

**Rsync Options:**

- ``-a``: Archive mode (preserves permissions, timestamps)
- ``-v``: Verbose output
- ``-P``: Show progress and keep partial files so transfers can resume
- ``--compress``: Compress during transfer

A 3-second delay is added between rsync commands to avoid overwhelming the
cluster.
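
As referenced above, the path conversion swaps ``output`` for ``input`` and
drops the trailing seed directory. A minimal sketch of this mapping
(illustrative only; the module's actual implementation may differ):

.. code-block:: python

   # Illustrative only: the output -> input path conversion described above.
   from pathlib import PurePosixPath

   def output_to_input(output_path: str) -> str:
       parts = list(PurePosixPath(output_path).parts)
       parts[parts.index("output")] = "input"  # swap the tree root
       return str(PurePosixPath(*parts[:-1]))  # drop the seed dir (s1, s2, ...)

   assert output_to_input("/work/data/output/NSFNet/exp1/s1") == "/work/data/input/NSFNet/exp1"
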

Input Format Reference
======================

Specification File Structure
----------------------------

.. code-block:: yaml

   # REQUIRED: Resource allocation
   resources:
     partition: "gpu"     # SLURM partition
     time: "24:00:00"     # Wall time (HH:MM:SS)
     mem: "32G"           # Memory
     cpus: 8              # CPU cores
     gpus: 1              # GPUs (use 0 for CPU-only)
     nodes: 1             # Nodes

   # OPTION A: Grid-based parameter sweep
   grid:
     common:              # Parameters that are the same for all jobs
       network: "NSFNet"
       num_requests: 10000
     # Parameters to sweep (lists create combinations)
     path_algorithm: ["ppo", "dqn"]
     erlang_start: [100, 200]

   # OPTION B: Multiple grids
   grids:
     - common:
         network: "NSFNet"
       path_algorithm: ["ppo"]
       erlang_start: [100, 200]
     - common:
         network: "COST239"
       path_algorithm: ["dqn"]
       erlang_start: [300]

   # OPTION C: Explicit job list
   jobs:
     - algorithm: "ppo"
       traffic: 100
       k_paths: 3
       network: "NSFNet"
     - algorithm: "dqn"
       traffic: 200
       k_paths: 5
       network: "COST239"

Required Grid Parameters
------------------------

These parameters MUST be present in grid specifications:

- ``path_algorithm``: Routing/RL algorithm name
- ``erlang_start``: Traffic load start value
- ``k_paths``: Number of candidate paths
- ``obs_space``: Observation space (for RL algorithms)
- ``network``: Network topology name (determines output grouping)

Fetch Configuration
-------------------

The ``configs/config.yml`` file for fetching results:

.. code-block:: yaml

   # Remote paths (user@host:path format)
   metadata_root: "user@cluster:/path/to/experiments"
   data_root: "user@cluster:/path/to/data"
   logs_root: "user@cluster:/path/to/logs"

   # Local destination
   dest: "~/cluster_results"

   # Experiment to fetch (relative path)
   experiment: "0119/1430/NSFNet"

   # Preview mode (true = don't actually download)
   dry_run: false

Output Format Reference
=======================

Manifest CSV
------------

One row per job, with all parameters as columns:

.. code-block:: text

   path_algorithm,erlang_start,erlang_stop,k_paths,network,num_requests,is_rl,...
   ppo,100,150,3,NSFNet,10000,true,...
   dqn,200,250,5,NSFNet,10000,true,...

**Encoding Rules:**

- Booleans: ``true`` / ``false`` (lowercase strings)
- Lists: JSON format ``[1,2,3]`` (no spaces)
- Dicts: JSON format ``{"key":"value"}`` (no spaces)
- Floats: No trailing zeros (``3.14``, not ``3.140000``)

Manifest Metadata JSON
----------------------

.. code-block:: json

   {
     "generated": "2025-01-19T14:30:45",
     "source": "/path/to/specs/my_experiment.yaml",
     "network": "NSFNet",
     "num_rows": 12,
     "resources": {
       "partition": "gpu-long",
       "time": "24:00:00",
       "mem": "32G",
       "cpus": 8,
       "gpus": 1,
       "nodes": 1
     }
   }

Directory Structure
-------------------

**On the cluster:**

.. code-block:: text

   FUSION/
     experiments/
       MMDD/
         HHMM/
           network/
             manifest.csv
             manifest_meta.json
             jobs/
               slurm_12345678_0.out
               slurm_12345678_1.out
     data/
       output/
         network/
           experiment_path/
             s1/
             s2/
       input/
         network/
           experiment_path/
             sim_input_s1.json
     logs/
       algorithm/
         network/
           MMDD/HHMM/

**Fetched locally:**

.. code-block:: text

   ~/cluster_results/
     data/
       network/
         experiment_path/
           output/
           input/
     logs/
       algorithm/
         network/

Components
==========

make_manifest.py
----------------

:Purpose: Generate job manifests from specification files
:Entry Point: ``python -m fusion.unity.make_manifest <spec>``

**Key Functions:**

- ``make_manifest(spec_path)``: Main entry point
- ``_expand_grid(grid, resources)``: Expand grid to job list
- ``_write_csv(rows, output_dir)``: Write manifest CSV
- ``_cast(key, value)``: Type cast parameters

submit_manifest.py
------------------

:Purpose: Submit manifest as SLURM array job
:Entry Point: ``python -m fusion.unity.submit_manifest <experiment_dir> <script>``
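
For scripted pipelines, the two CLI entry points documented above can also be
chained from Python. A minimal sketch (paths are illustrative; substitute the
timestamped experiment directory the generator actually created, as shown in
Steps 3 and 4):

.. code-block:: python

   # Illustrative only: chain the documented CLI entry points from Python.
   import subprocess

   # Generate the manifest from a spec file (see Step 3).
   subprocess.run(
       ["python", "-m", "fusion.unity.make_manifest", "specs/my_experiment.yaml"],
       check=True,
   )

   # Submit the generated manifest as a SLURM array job (see Step 4).
   # Replace the directory with the one make_manifest just created.
   subprocess.run(
       ["python", "-m", "fusion.unity.submit_manifest",
        "experiments/0119/1430/NSFNet", "run_sim.sh"],
       check=True,
   )
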