.. _rl-module:

=============================
Reinforcement Learning Module
=============================

.. note::

   **Status: Transitioning to UnifiedSimEnv**

   This module is transitioning from the legacy ``GeneralSimEnv`` to the new
   ``UnifiedSimEnv``. New work should use ``UnifiedSimEnv``.

   - **Pre v6.0**: ``GeneralSimEnv`` (deprecated, removal planned for v6.1+)
   - **Post v6.0**: ``UnifiedSimEnv`` (recommended for new work)

.. toctree::
   :maxdepth: 2
   :caption: Submodules

   adapter
   agents
   algorithms
   args
   environments
   feat_extrs
   gymnasium_envs
   policies
   sb3
   utils
   visualization

Overview
========

.. admonition:: At a Glance
   :class: tip

   :Purpose: Reinforcement learning for network resource allocation and optimization
   :Location: ``fusion/modules/rl/``
   :Key Entry Points: ``workflow_runner.py``, ``model_manager.py``
   :CLI Command: ``python -m fusion.cli.run_train --config_path <config.ini> --agent_type rl``
   :External Docs: `Stable-Baselines3 Documentation <https://stable-baselines3.readthedocs.io/>`_

The RL module enables intelligent network optimization through reinforcement learning.
It provides a complete framework for training agents to make routing decisions in
optical networks.

.. important::

   **Current Agent Support:**

   - **Path/Routing Agent**: Fully implemented and supported
   - **Core Agent**: Placeholder - development planned for future versions
   - **Spectrum Agent**: Placeholder - development planned for future versions

   Currently, only the **routing/path selection agent** is functional. Core assignment
   and spectrum allocation use heuristic methods (e.g., ``first_fit``).

.. warning::

   **Spectral Band Limitation:**

   RL environments currently only support **C-band** spectrum allocation. L-band and
   multi-band scenarios are not yet supported. Multi-band support is planned for a
   future v6.X release.

**What This Module Provides:**

- **In-house RL algorithms**: Q-learning, Epsilon-Greedy Bandits, UCB Bandits (actively expanded)
- **Deep RL via Stable-Baselines3**: PPO, A2C, DQN, QR-DQN wrappers
- **Custom SB3 callbacks**: Episodic reward tracking, dynamic learning rate/entropy decay (see :ref:`rl-utils`)
- **RLZoo3 integration**: Automatic hyperparameter optimization and experiment management (see :ref:`rl-sb3`)
- **Offline RL policies** *(beta)*: Behavioral Cloning (BC), Implicit Q-Learning (IQL)
- **GNN-based feature extractors** *(beta)*: GAT, SAGE, GraphConv, Graphormer (see :ref:`rl-feat-extrs`)
- **Hyperparameter optimization**: Optuna integration with configurable pruning
- **Action masking**: Safe RL deployment preventing invalid actions (see the sketch at the end of this overview)

.. tip::

   **RLZoo3 Integration**: FUSION environments can be registered with
   `RLZoo3 <https://github.com/DLR-RM/rl-baselines3-zoo>`_ for automated hyperparameter
   tuning, experiment tracking, and benchmarking. See :ref:`rl-sb3` for details.

   We are also developing FUSION-native training infrastructure that will provide
   tighter integration with our simulation stack, custom callbacks, and domain-specific
   optimizations for optical network environments.

.. note::

   **Multi-Processing Limitation:**

   RL training currently runs in **single-process mode**. Multi-environment
   parallelization (e.g., ``SubprocVecEnv``) is not yet supported due to the complexity
   of serializing simulation state across processes. This is planned for a future release.

.. tip::

   If you're new to reinforcement learning, we recommend familiarizing yourself with
   `Stable-Baselines3 <https://stable-baselines3.readthedocs.io/>`_ first, as the deep
   RL components build on top of it. However, you can use FUSION's in-house algorithms
   (Q-learning, bandits) without any SB3 knowledge.
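
As a concrete illustration of the action-masking idea listed above, the sketch below
trains a masked policy on a toy path-selection environment. It is a minimal sketch only:
it uses sb3-contrib's ``MaskablePPO`` and ``ActionMasker`` on a made-up Gymnasium
environment, and the environment, mask rule, and dimensions are illustrative
placeholders rather than FUSION's actual ``UnifiedSimEnv`` or masking logic.

.. code-block:: python

   import numpy as np
   import gymnasium as gym
   from gymnasium import spaces
   from sb3_contrib import MaskablePPO
   from sb3_contrib.common.wrappers import ActionMasker


   class ToyPathEnv(gym.Env):
       """Toy stand-in for a path-selection environment (illustrative only)."""

       def __init__(self, n_paths: int = 4):
           super().__init__()
           self.n_paths = n_paths
           self.action_space = spaces.Discrete(n_paths)  # choose one candidate path
           self.observation_space = spaces.Box(0.0, 1.0, shape=(n_paths,), dtype=np.float32)
           self._obs = np.zeros(n_paths, dtype=np.float32)

       def reset(self, *, seed=None, options=None):
           super().reset(seed=seed)
           self._obs = self.np_random.random(self.n_paths).astype(np.float32)
           return self._obs, {}

       def step(self, action):
           # Reward picking the "best" (lowest-utilization) candidate path.
           reward = 1.0 if action == int(np.argmin(self._obs)) else -1.0
           self._obs = self.np_random.random(self.n_paths).astype(np.float32)
           return self._obs, reward, False, False, {}

       def action_masks(self) -> np.ndarray:
           # Paths above a utilization threshold are invalid; always keep at least
           # one path selectable so the mask is never all-False.
           mask = self._obs < 0.9
           mask[int(np.argmin(self._obs))] = True
           return mask


   def mask_fn(env: ToyPathEnv) -> np.ndarray:
       return env.action_masks()


   env = ActionMasker(ToyPathEnv(), mask_fn)         # exposes the mask to the algorithm
   model = MaskablePPO("MlpPolicy", env, verbose=0)  # only unmasked actions are sampled
   model.learn(total_timesteps=2_000)

The same pattern carries over to real deployments: the environment computes which
actions are currently invalid, and the masked algorithm never proposes them.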

Capabilities Overview
=====================

.. code-block:: text

   +===========================================================================+
   |                     REINFORCEMENT LEARNING IN FUSION                      |
   +===========================================================================+
   |                                                                           |
   |  +---------------------------+      +---------------------------+         |
   |  |    IN-HOUSE ALGORITHMS    |      |     DEEP RL (via SB3)     |         |
   |  +---------------------------+      +---------------------------+         |
   |  | Q-Learning (tabular)      |      | PPO (policy gradient)     |         |
   |  | Epsilon-Greedy Bandit     |      | A2C (actor-critic)        |         |
   |  | UCB Bandit                |      | DQN (value-based)         |         |
   |  |                           |      | QR-DQN (distributional)   |         |
   |  | Status: Active expansion  |      |                           |         |
   |  +---------------------------+      +---------------------------+         |
   |                |                                  |                       |
   |                +-------------------+--------------+                       |
   |                                    |                                      |
   |                                    v                                      |
   |  +-------------------------------------------------------------------+   |
   |  |                       FUSION RL ENVIRONMENTS                       |   |
   |  +-------------------------------------------------------------------+   |
   |  |  GeneralSimEnv (legacy)  -->  UnifiedSimEnv (recommended)          |   |
   |  |                                                                    |   |
   |  |  Features:                                                         |   |
   |  |  - Gymnasium-compatible interface                                  |   |
   |  |  - Action masking for safe exploration                             |   |
   |  |  - GNN-based state representations                                 |   |
   |  |  - Configurable reward functions                                   |   |
   |  +-------------------------------------------------------------------+   |
   |                                    |                                      |
   |                                    v                                      |
   |  +-------------------------------------------------------------------+   |
   |  |                     OFFLINE RL POLICIES (BETA)                     |   |
   |  +-------------------------------------------------------------------+   |
   |  |  BCPolicy         - Behavioral Cloning from expert demonstrations  |   |
   |  |  IQLPolicy        - Implicit Q-Learning from offline data          |   |
   |  |  KSPFFPolicy      - K-Shortest Path First-Fit heuristic baseline   |   |
   |  |  OnePlusOnePolicy - 1+1 disjoint path protection                   |   |
   |  +-------------------------------------------------------------------+   |
   |                                                                           |
   +===========================================================================+

Algorithm Types
---------------

.. list-table::
   :header-rows: 1
   :widths: 20 20 30 30

   * - Algorithm
     - Type
     - Implementation
     - Use Case
   * - ``q_learning``
     - Tabular RL
     - In-house (``algorithms/q_learning.py``)
     - Small state spaces, interpretable policies
   * - ``epsilon_greedy``
     - Multi-armed Bandit
     - In-house (``algorithms/bandits.py``)
     - Path selection with exploration
   * - ``ucb``
     - Multi-armed Bandit
     - In-house (``algorithms/bandits.py``)
     - Optimistic exploration
   * - ``ppo``
     - Deep RL
     - Stable-Baselines3 wrapper
     - Large state spaces, continuous training
   * - ``a2c``
     - Deep RL
     - Stable-Baselines3 wrapper
     - Faster training, simpler architecture
   * - ``dqn``
     - Deep RL
     - Stable-Baselines3 wrapper
     - Discrete actions, replay buffer
   * - ``qr_dqn``
     - Deep RL
     - Stable-Baselines3 wrapper
     - Risk-sensitive decisions
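
To make the bandit entries in the table above concrete, here is a minimal,
self-contained sketch of epsilon-greedy value estimation over K candidate paths.
It is plain NumPy and does **not** use FUSION's ``algorithms/bandits.py`` classes;
the reward function and all names are illustrative placeholders.

.. code-block:: python

   import numpy as np

   rng = np.random.default_rng(0)

   K = 5                     # number of candidate paths (e.g., k-shortest paths)
   epsilon = 0.1             # exploration rate
   counts = np.zeros(K)      # how often each path has been selected
   values = np.zeros(K)      # running mean reward per path


   def simulated_reward(path: int) -> float:
       """Placeholder reward: +1 if the request is provisioned, -1 if blocked."""
       block_prob = 0.1 + 0.15 * path        # pretend longer paths block more often
       return 1.0 if rng.random() > block_prob else -1.0


   for t in range(10_000):
       if rng.random() < epsilon:
           path = int(rng.integers(K))       # explore: random candidate path
       else:
           path = int(np.argmax(values))     # exploit: best path so far

       reward = simulated_reward(path)
       counts[path] += 1
       values[path] += (reward - values[path]) / counts[path]  # incremental mean

   print("Estimated per-path values:", np.round(values, 3))

The UCB variant replaces the epsilon test with an optimism bonus based on how rarely
each path has been tried; the in-house implementations follow the same arm/value
structure but operate on FUSION's simulation state.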

Getting Started
===============

Prerequisites
-------------

1. FUSION installed with RL dependencies: ``pip install -e ".[rl]"``
2. A configuration INI file (see :ref:`rl-configuration`)
3. Basic understanding of RL concepts (recommended)

Quick Start: Training an Agent
------------------------------

**Step 1: Create a configuration file**

Copy the default template and customize the ``[rl_settings]`` section:

.. code-block:: bash

   cp fusion/configs/templates/default.ini my_rl_config.ini

**Step 2: Configure RL parameters**

Edit the ``[rl_settings]`` section in your INI file:

.. code-block:: ini

   [rl_settings]
   # Algorithm selection (only path_algorithm uses RL currently)
   path_algorithm = dqn            # Options: q_learning, dqn, ppo, a2c, etc.
   core_algorithm = first_fit      # Heuristic (RL agent not yet implemented)
   spectrum_algorithm = first_fit  # Heuristic (RL agent not yet implemented)

   # Training parameters
   is_training = True
   n_trials = 1                    # Number of training runs
   device = cpu                    # cpu, cuda, or mps

   # Hyperparameters
   gamma = 0.1                     # Discount factor
   epsilon_start = 0.01            # Initial exploration rate
   epsilon_end = 0.01              # Final exploration rate

   # Neural network (for DRL algorithms)
   feature_extractor = path_gnn
   gnn_type = graph_conv
   layers = 2
   emb_dim = 64

**Step 3: Run training**

.. code-block:: bash

   python -m fusion.cli.run_train --config_path my_rl_config.ini --agent_type rl

**Step 4: Check results**

RL-specific outputs (model, rewards, memory) are saved to:

.. code-block:: text

   logs/<...>/<...>/<...>/
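
For the SB3-based algorithms, a saved model can usually be reloaded outside the CLI for
inspection or quick evaluation. The snippet below is a minimal sketch assuming a DQN
agent and a hypothetical model filename; the actual filename and directory layout under
``logs/`` depend on your configuration and run.

.. code-block:: python

   from stable_baselines3 import DQN

   # Hypothetical path -- replace with the actual model file produced by your run.
   model = DQN.load("logs/<...>/<...>/<...>/dqn_model")

   # Inspect the learned policy and query a greedy action for a sample observation.
   print(model.policy)
   obs = model.observation_space.sample()            # stand-in observation
   action, _states = model.predict(obs, deterministic=True)
   print("Greedy action:", action)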