RL Algorithms Package

At a Glance

Purpose: RL algorithm implementations for network optimization

Location: fusion/modules/rl/algorithms/

Key Files: q_learning.py, bandits.py, base_drl.py

Prerequisites: Basic RL theory (Q-learning, bandits, policy gradients)

This package provides the actual RL algorithms used by agents. It includes traditional methods (Q-learning, bandits) and integration classes for deep RL via Stable-Baselines3.

Algorithm Categories

FUSION supports three categories of RL algorithms:

+------------------------------------------------------------------+
|                    ALGORITHM CATEGORIES                          |
+------------------------------------------------------------------+
|                                                                  |
|  Traditional RL          Multi-Armed Bandits      Deep RL        |
|  ----------------        ------------------       --------       |
|  - Q-Learning            - Epsilon-Greedy         - PPO          |
|    (tabular)             - UCB                    - A2C          |
|                                                   - DQN          |
|                                                   - QR-DQN       |
|                                                                  |
|  State: (src, dst)       State: (src, dst)        State: obs vec |
|  Action: path index      Action: arm index        Action: int    |
|  Updates: Q-table        Updates: value est.      Updates: SB3   |
|                                                                  |
+------------------------------------------------------------------+

Category      Algorithms              State Space             Best For
-----------   ---------------------   ---------------------   --------------------------------
Traditional   Q-Learning              Discrete (tabular)      Small networks, interpretable
Bandits       Epsilon-Greedy, UCB     Contextual (src, dst)   Fast learning, simple
Deep RL       PPO, A2C, DQN, QR-DQN   Continuous (vectors)    Large networks, complex features

Quick Start: Using Q-Learning

Q-Learning maintains a table of Q-values for each (source, destination, path, congestion_level) combination.

Step 1: Initialize Q-Learning

from fusion.modules.rl.algorithms import QLearning

# Q-learning needs rl_props with network info and engine_props with config
q_learner = QLearning(rl_props=rl_props, engine_props=engine_props)

# Q-tables are automatically initialized:
# - routes_matrix: Q-values for path selection
# - cores_matrix: Q-values for core selection

What happens at initialization:

  1. Creates QProps to hold Q-tables and statistics

  2. Initializes routes_matrix with shape (num_nodes, num_nodes, k_paths, path_levels)

  3. Initializes cores_matrix for multi-core scenarios

  4. Populates Q-tables with initial paths from shortest path computation
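
A rough sketch of how the path Q-table can be indexed, assuming a plain float array with the shape listed above (the real routes_matrix may use a structured dtype that also stores the candidate paths themselves):

import numpy as np

# Hypothetical dimensions -- in practice these come from rl_props / engine_props
num_nodes, k_paths, path_levels = 14, 3, 2

routes_q = np.zeros((num_nodes, num_nodes, k_paths, path_levels))

# Q-values of every candidate path for one (src, dst) pair at one congestion level
src, dst, cong_level = 0, 5, 1
q_values = routes_q[src, dst, :, cong_level]
best_path_index = int(np.argmax(q_values))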

Step 2: Get Best Action

# Get congestion levels for available paths
congestion_list = rl_helper.classify_paths(paths_list)

# Get action with highest Q-value
best_index, best_path = q_learner.get_max_curr_q(
    cong_list=congestion_list,
    matrix_flag="routes_matrix",  # or "cores_matrix"
)

# best_index: index of path with highest Q-value
# best_path: the actual path (list of nodes)

Step 3: Update Q-Values

After executing the action and receiving a reward:

q_learner.update_q_matrix(
    reward=1.0,                          # Reward from allocation
    level_index=congestion_level,        # Current congestion level
    network_spectrum_dict=spectrum_db,   # Current network state
    flag="path",                         # "path" or "core"
    trial=current_trial,
    iteration=current_iteration,
)

The Q-learning update rule:

# delta is the TD target (Bellman backup); td_error is the gap to the current estimate
delta = reward + gamma * max_future_q
td_error = current_q - delta
new_q = (1 - learn_rate) * current_q + learn_rate * delta
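
Wrapped as a small standalone helper (illustrative only, not part of the package API), the same rule reads:

def td_update(current_q, reward, max_future_q, learn_rate, gamma):
    """One tabular Q-learning step using the update rule above."""
    delta = reward + gamma * max_future_q              # TD target
    td_error = current_q - delta                       # gap to the current estimate
    new_q = (1 - learn_rate) * current_q + learn_rate * delta
    return new_q, td_error

# Example: td_update(current_q=0.5, reward=1.0, max_future_q=0.8, learn_rate=0.1, gamma=0.9)
# returns (0.622, -1.22)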

Step 4: Save the Model

Models are automatically saved at configured intervals:

# Automatic saving happens in update_q_matrix when:
# - iteration % save_step == 0, OR
# - iteration == max_iters - 1

# Models saved to: logs/q_learning/{network}/{date}/{time}/
# Files:
# - rewards_e{erlang}_routes_c{cores}_t{trial}_iter_{iter}.npy
# - state_vals_e{erlang}_routes_c{cores}_t{trial}.json
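
The directory itself is assembled from engine_props fields; a minimal sketch under the assumption that the fields listed in Common Issues below are present (the package may build the path differently):

import os

save_dir = os.path.join(
    "logs", "q_learning",
    engine_props["network"],     # e.g. "NSFNet"
    engine_props["date"],        # e.g. "2024-01-15"
    engine_props["sim_start"],   # e.g. "10-30-00"
)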

Quick Start: Using Bandits

Bandits are simpler than Q-learning: they don’t consider future rewards, only immediate value estimates.

Epsilon-Greedy Bandit

from fusion.modules.rl.algorithms import EpsilonGreedyBandit

# Create bandit for path selection
bandit = EpsilonGreedyBandit(
    rl_props=rl_props,
    engine_props=engine_props,
    is_path=True,  # True for path selection, False for core selection
)

# Set exploration rate
bandit.epsilon = 0.1  # 10% random exploration

# Select an arm (path)
action = bandit.select_path_arm(source=0, dest=5)

# After allocation, update with reward
bandit.update(
    arm=action,
    reward=1.0,
    iteration=current_iter,
    trial=current_trial,
)

How epsilon-greedy works:

if random() < epsilon:
    return random_arm()      # Explore
else:
    return best_value_arm()  # Exploit
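
A self-contained sketch of that rule (EpsilonGreedyBandit keeps per-(source, destination) value estimates internally; this only shows the selection logic):

import numpy as np

def epsilon_greedy(values: np.ndarray, epsilon: float, rng: np.random.Generator) -> int:
    if rng.random() < epsilon:
        return int(rng.integers(len(values)))   # explore: random arm
    return int(np.argmax(values))               # exploit: best estimate so far

rng = np.random.default_rng(0)
arm = epsilon_greedy(np.array([0.2, 0.7, 0.4]), epsilon=0.1, rng=rng)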

UCB Bandit

Upper Confidence Bound adds an exploration bonus based on uncertainty:

from fusion.modules.rl.algorithms import UCBBandit

# Create UCB bandit
ucb = UCBBandit(
    rl_props=rl_props,
    engine_props=engine_props,
    is_path=True,
)

# Select arm (automatically balances exploration/exploitation)
action = ucb.select_path_arm(source=0, dest=5)

# Update after allocation
ucb.update(arm=action, reward=1.0, iteration=current_iter, trial=current_trial)

How UCB works:

# UCB formula
ucb_value = estimated_value + sqrt(c * log(total_counts) / arm_counts)

# Arms with fewer selections get higher bonus (exploration)
# Arms with high values are preferred (exploitation)
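
The same formula as a runnable sketch (the constant c and the handling of unvisited arms are assumptions; the package’s UCBBandit may differ in detail):

import numpy as np

def ucb_select(values: np.ndarray, counts: np.ndarray, c: float = 2.0) -> int:
    if np.any(counts == 0):
        return int(np.argmin(counts))                 # try every arm at least once
    bonus = np.sqrt(c * np.log(counts.sum()) / counts)
    return int(np.argmax(values + bonus))

arm = ucb_select(values=np.array([0.5, 0.6, 0.4]), counts=np.array([10, 5, 8]))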

Quick Start: Deep RL Integration

Deep RL algorithms (PPO, A2C, DQN, QR-DQN) are thin wrappers that provide observation and action spaces. The actual training happens via Stable-Baselines3.

Note

These classes don’t implement the algorithms; they only configure the observation and action spaces for SB3. The heavy lifting is done by SB3’s implementations.

Creating a DRL Algorithm

from fusion.modules.rl.algorithms import PPO, DQN

# Create PPO configuration
ppo = PPO(rl_props=rl_props, engine_obj=engine_obj)

# Get spaces for SB3
obs_space = ppo.get_obs_space()    # gymnasium.spaces.Dict
action_space = ppo.get_action_space()  # gymnasium.spaces.Discrete

# These spaces are used by the environment
# SB3 handles the actual learning

Using with Stable-Baselines3

from stable_baselines3 import PPO as SB3_PPO
from fusion.modules.rl.gymnasium_envs import GeneralSimEnv

# Create environment (uses algorithm spaces internally)
env = GeneralSimEnv(sim_dict=config)

# Create SB3 model
model = SB3_PPO("MultiInputPolicy", env, verbose=1)

# Train
model.learn(total_timesteps=10000)

# The algorithm class configured the spaces
# SB3 does the actual training
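
After training, the usual SB3 calls apply for saving, reloading, and running the policy; the reset/step signatures below assume GeneralSimEnv follows the standard gymnasium API:

# Persist and reload the trained policy
model.save("ppo_fusion")
model = SB3_PPO.load("ppo_fusion", env=env)

# Greedy rollout of one episode
obs, info = env.reset()
terminated = truncated = False
while not (terminated or truncated):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)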

Understanding Properties Classes

The algorithms module includes several properties classes that hold state and configuration.

RLProps

State container for RL simulations (used by environments and agents):

from fusion.modules.rl.algorithms import RLProps

rl_props = RLProps()

# Network configuration
rl_props.k_paths = 3              # Candidate paths
rl_props.cores_per_link = 7       # Cores per fiber
rl_props.spectral_slots = 320     # Slots per core
rl_props.num_nodes = 14           # Network nodes

# Current request state
rl_props.source = 0               # Source node
rl_props.destination = 5          # Destination node
rl_props.paths_list = [...]       # Available paths

# Selection state (set by agent)
rl_props.chosen_path_index = 0
rl_props.chosen_path_list = [0, 1, 5]

QProps

Q-learning specific properties:

from fusion.modules.rl.algorithms import QProps

q_props = QProps()

# Epsilon (exploration rate)
q_props.epsilon = 0.1
q_props.epsilon_start = 1.0
q_props.epsilon_end = 0.01
q_props.epsilon_list = []  # Track over time

# Q-tables
q_props.routes_matrix = np.array(...)  # Path Q-values
q_props.cores_matrix = np.array(...)   # Core Q-values

# Statistics tracking
q_props.rewards_dict = {"routes_dict": {...}, "cores_dict": {...}}
q_props.errors_dict = {"routes_dict": {...}, "cores_dict": {...}}
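
How epsilon moves from epsilon_start to epsilon_end depends on the configured schedule; a minimal linear-decay sketch (the package’s actual decay may differ):

def linear_epsilon(iteration: int, max_iters: int, start: float = 1.0, end: float = 0.01) -> float:
    """Linear decay from start to end over max_iters (illustrative only)."""
    frac = min(iteration / max(max_iters - 1, 1), 1.0)
    return start + frac * (end - start)

q_props.epsilon = linear_epsilon(iteration=50, max_iters=100)
q_props.epsilon_list.append(q_props.epsilon)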

BanditProps

Bandit-specific properties:

from fusion.modules.rl.algorithms import BanditProps

bandit_props = BanditProps()

# Rewards for each episode
bandit_props.rewards_matrix = []  # [[r1, r2, ...], [r1, r2, ...], ...]

# Action counts and values
bandit_props.counts_list = []
bandit_props.state_values_list = []
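
Because rewards_matrix holds one list of per-request rewards per iteration, post-processing is straightforward; an illustrative example:

import numpy as np

bandit_props.rewards_matrix = [[1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
avg_reward_per_iter = [float(np.mean(r)) for r in bandit_props.rewards_matrix]
# -> [0.666..., 1.0]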

Model Persistence

The module provides classes for saving and loading trained models.

Saving Q-Learning Models

from fusion.modules.rl.algorithms import QLearningModelPersistence

# Save is typically called automatically by QLearning.save_model()
# But can be called directly:
QLearningModelPersistence.save_model(
    q_dict=q_values_dict,        # Q-values as dict
    rewards_avg=rewards_array,   # Average rewards
    erlang=100.0,                # Traffic load
    cores_per_link=7,
    base_str="routes",           # or "cores"
    trial=0,
    iteration=1000,
    save_dir="logs/q_learning/NSFNet/2024-01-15/10-30-00",
)

# Saved files:
# - rewards_e100.0_routes_c7_t1_iter_1000.npy
# - state_vals_e100.0_routes_c7_t1.json

Saving Bandit Models

from fusion.modules.rl.algorithms import BanditModelPersistence

BanditModelPersistence.save_model(
    state_values_dict=bandit.values,
    erlang=100.0,
    cores_per_link=7,
    save_dir="logs/epsilon_greedy_bandit/...",
    is_path=True,
    trial=0,
)

# Saved file:
# - state_vals_e100.0_routes_c7_t1.json

Loading Models

# Load bandit model
state_values = BanditModelPersistence.load_model(
    train_fp="epsilon_greedy_bandit/NSFNet/.../state_vals_e100.0_routes_c7_t1.json"
)

# Load Q-learning model (usually via agent.load_model())
# The Q-tables are loaded into the algorithm object

Extending the Algorithms Module

Tutorial: Adding a New Bandit Algorithm

Let’s add a Thompson Sampling bandit.

Step 1: Create the class in bandits.py

class ThompsonSamplingBandit:
    """
    Thompson Sampling bandit algorithm.

    Uses Beta distribution to model uncertainty about arm values.
    """

    def __init__(
        self,
        rl_props: object,
        engine_props: dict,
        is_path: bool,
    ) -> None:
        self.props = BanditProps()
        self.engine_props = engine_props
        self.rl_props = rl_props
        self.is_path = is_path
        self.iteration = 0

        self.source: int | None = None
        self.dest: int | None = None

        if is_path:
            self.n_arms = engine_props["k_paths"]
        else:
            self.n_arms = engine_props["cores_per_link"]

        self.num_nodes = rl_props.num_nodes

        # Beta distribution parameters (successes, failures)
        self.alpha, self.beta = self._init_beta_params()

    def _init_beta_params(self) -> tuple[dict, dict]:
        """Initialize Beta distribution parameters."""
        alpha = {}
        beta = {}
        for src in range(self.num_nodes):
            for dst in range(self.num_nodes):
                if src == dst:
                    continue
                key = (src, dst)
                alpha[key] = np.ones(self.n_arms)
                beta[key] = np.ones(self.n_arms)
        return alpha, beta

Step 2: Add action selection

def select_path_arm(self, source: int, dest: int) -> int:
    """Select arm using Thompson Sampling."""
    self.source = source
    self.dest = dest
    key = (source, dest)

    # Sample from Beta distribution for each arm
    samples = np.random.beta(self.alpha[key], self.beta[key])

    return int(np.argmax(samples))

Step 3: Add update method

def update(
    self,
    arm: int,
    reward: float,
    iteration: int,
    trial: int,
) -> None:
    """Update Beta parameters based on reward."""
    key = (self.source, self.dest)

    # Bernoulli reward: success (1) or failure (0)
    if reward > 0:
        self.alpha[key][arm] += 1
    else:
        self.beta[key][arm] += 1

    self.iteration = iteration

    # Track rewards
    if self.iteration >= len(self.props.rewards_matrix):
        self.props.rewards_matrix.append([])
    self.props.rewards_matrix[self.iteration].append(reward)

    # Save model periodically
    save_model(
        iteration=iteration,
        algorithm="thompson_sampling_bandit",
        self=self,
        trial=trial,
    )
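
Once the class exists, it can be exercised the same way as the other bandits (a hypothetical usage sketch; current_iter and current_trial come from the surrounding training loop):

ts_bandit = ThompsonSamplingBandit(
    rl_props=rl_props,
    engine_props=engine_props,
    is_path=True,
)

arm = ts_bandit.select_path_arm(source=0, dest=5)
# ... attempt the allocation, derive a binary (0/1) reward ...
ts_bandit.update(arm=arm, reward=1.0, iteration=current_iter, trial=current_trial)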

Step 4: Export in __init__.py

from .bandits import EpsilonGreedyBandit, UCBBandit, ThompsonSamplingBandit

__all__ = [
    # ...
    "ThompsonSamplingBandit",
]

Step 5: Add to agents

In base_agent.py:

elif self.algorithm == "thompson_sampling_bandit":
    from fusion.modules.rl.algorithms.bandits import ThompsonSamplingBandit
    self.algorithm_obj = ThompsonSamplingBandit(
        rl_props=self.rl_props,
        engine_props=self.engine_props,
        is_path=is_path,
    )

Tutorial: Adding a New DRL Algorithm

DRL algorithms are wrappers that configure spaces for SB3.

Step 1: Create the class

# sac.py
"""Soft Actor-Critic (SAC) algorithm integration."""

from fusion.modules.rl.algorithms.base_drl import BaseDRLAlgorithm


class SAC(BaseDRLAlgorithm):
    """
    Soft Actor-Critic for reinforcement learning.

    Inherits observation and action space handling from BaseDRLAlgorithm.
    """

    def get_action_space(self):
        """SAC typically uses continuous actions, but we use discrete."""
        # Override if SAC needs different action space
        return super().get_action_space()

Step 2: Export and integrate

# In __init__.py
from .sac import SAC

__all__ = [..., "SAC"]

# In base_agent.py setup_env()
elif self.algorithm == "sac":
    self.algorithm_obj = SAC(rl_props=self.rl_props, engine_obj=self.engine_props)

Testing

Running Tests

# Run algorithm tests
pytest fusion/modules/tests/rl/test_algorithm_props.py -v
pytest fusion/modules/tests/rl/test_q_learning.py -v
pytest fusion/modules/tests/rl/test_bandits.py -v

# Run with coverage
pytest fusion/modules/tests/rl/ -v --cov=fusion.modules.rl.algorithms

Writing Algorithm Tests

import pytest
import numpy as np
from unittest.mock import MagicMock

from fusion.modules.rl.algorithms import EpsilonGreedyBandit


@pytest.fixture
def mock_rl_props():
    props = MagicMock()
    props.num_nodes = 5
    return props


@pytest.fixture
def bandit(mock_rl_props):
    engine_props = {
        "k_paths": 3,
        "cores_per_link": 7,
        "max_iters": 100,
        "save_step": 50,
        "num_requests": 10,
    }
    return EpsilonGreedyBandit(mock_rl_props, engine_props, is_path=True)


def test_select_path_arm_returns_valid_action(bandit):
    """select_path_arm should return action in valid range."""
    bandit.epsilon = 0.0  # Greedy selection

    action = bandit.select_path_arm(source=0, dest=1)

    assert 0 <= action < bandit.n_arms


def test_epsilon_greedy_explores_with_high_epsilon(bandit):
    """High epsilon should lead to diverse actions."""
    bandit.epsilon = 1.0  # Always explore

    actions = [bandit.select_path_arm(0, 1) for _ in range(100)]

    # Should see multiple different actions
    assert len(set(actions)) > 1

Common Issues

“rl_props must have num_nodes”

# RLProps needs to be properly initialized
rl_props = RLProps()
rl_props.num_nodes = 14  # Set this before creating algorithms

Q-table shape mismatch

The Q-table shape depends on network configuration:

# routes_matrix shape: (num_nodes, num_nodes, k_paths, path_levels)
# Make sure engine_props matches rl_props:
assert rl_props.num_nodes == expected_nodes
assert rl_props.k_paths == engine_props["k_paths"]

Model saving path errors

Models save to logs/{algorithm}/{network}/{date}/{time}/:

# Ensure engine_props has required fields:
engine_props = {
    "network": "NSFNet",
    "date": "2024-01-15",
    "sim_start": "10-30-00",
    "erlang": 100.0,
    "cores_per_link": 7,
    # ...
}

File Reference

fusion/modules/rl/algorithms/
|-- __init__.py          # Public exports
|-- README.md            # Module documentation
|-- algorithm_props.py   # RLProps, QProps, BanditProps, PPOProps
|-- persistence.py       # BanditModelPersistence, QLearningModelPersistence
|-- base_drl.py          # BaseDRLAlgorithm (DRL base class)
|-- q_learning.py        # QLearning
|-- bandits.py           # EpsilonGreedyBandit, UCBBandit
|-- ppo.py               # PPO (SB3 wrapper)
|-- a2c.py               # A2C (SB3 wrapper)
|-- dqn.py               # DQN (SB3 wrapper)
`-- qr_dqn.py            # QrDQN (SB3 wrapper)

What to import:

# Algorithms
from fusion.modules.rl.algorithms import (
    QLearning,
    EpsilonGreedyBandit,
    UCBBandit,
    PPO,
    A2C,
    DQN,
    QrDQN,
)

# Properties
from fusion.modules.rl.algorithms import (
    RLProps,
    QProps,
    BanditProps,
)

# Persistence
from fusion.modules.rl.algorithms import (
    BanditModelPersistence,
    QLearningModelPersistence,
)

# Base class (for extending)
from fusion.modules.rl.algorithms import BaseDRLAlgorithm