RL Algorithms Package
At a Glance
- Purpose: RL algorithm implementations for network optimization
- Location: fusion/modules/rl/algorithms/
- Key Files: q_learning.py, bandits.py, base_drl.py
- Prerequisites: Basic RL theory (Q-learning, bandits, policy gradients)
This package provides the actual RL algorithms used by agents. It includes traditional methods (Q-learning, bandits) and integration classes for deep RL via Stable-Baselines3.
Algorithm Categories
FUSION supports three categories of RL algorithms:
+------------------------------------------------------------------+
|                       ALGORITHM CATEGORIES                        |
+------------------------------------------------------------------+
|                                                                    |
|  Traditional RL        Multi-Armed Bandits    Deep RL              |
|  ----------------      ------------------     --------             |
|  - Q-Learning          - Epsilon-Greedy       - PPO                |
|    (tabular)           - UCB                  - A2C                |
|                                               - DQN                |
|                                               - QR-DQN             |
|                                                                    |
|  State: (src, dst)     State: (src, dst)      State: obs vec       |
|  Action: path index    Action: arm index      Action: int          |
|  Updates: Q-table      Updates: value est.    Updates: SB3         |
|                                                                    |
+------------------------------------------------------------------+
| Category | Algorithms | State Space | Best For |
|---|---|---|---|
| Traditional | Q-Learning | Discrete (tabular) | Small networks, interpretable |
| Bandits | Epsilon-Greedy, UCB | Contextual (src, dst) | Fast learning, simple |
| Deep RL | PPO, A2C, DQN, QR-DQN | Continuous (vectors) | Large networks, complex features |
Quick Start: Using Q-Learning
Q-Learning maintains a table of Q-values for each (source, destination, path, congestion_level) combination.
Step 1: Initialize Q-Learning
from fusion.modules.rl.algorithms import QLearning
# Q-learning needs rl_props with network info and engine_props with config
q_learner = QLearning(rl_props=rl_props, engine_props=engine_props)
# Q-tables are automatically initialized:
# - routes_matrix: Q-values for path selection
# - cores_matrix: Q-values for core selection
What happens at initialization:
- Creates QProps to hold Q-tables and statistics
- Initializes routes_matrix with shape (num_nodes, num_nodes, k_paths, path_levels)
- Initializes cores_matrix for multi-core scenarios
- Populates Q-tables with initial paths from shortest path computation
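For a concrete picture of the tabular layout, the sketch below builds an array with the same shape using plain NumPy. The numbers and the zero initialization are purely illustrative; the real routes_matrix is created by QLearning itself and also carries the candidate paths populated from the shortest-path computation.

import numpy as np

num_nodes = 14     # e.g. a 14-node topology
k_paths = 3        # candidate paths per (source, destination) pair
path_levels = 2    # illustrative number of congestion levels

# One Q-value per (source, destination, candidate path, congestion level)
routes_shape = (num_nodes, num_nodes, k_paths, path_levels)
q_values = np.zeros(routes_shape)

# Q-value of candidate path 1 between nodes 0 and 5 at congestion level 0
print(q_values[0, 5, 1, 0])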
Step 2: Get Best Action
# Get congestion levels for available paths
congestion_list = rl_helper.classify_paths(paths_list)
# Get action with highest Q-value
best_index, best_path = q_learner.get_max_curr_q(
    cong_list=congestion_list,
    matrix_flag="routes_matrix",  # or "cores_matrix"
)
# best_index: index of path with highest Q-value
# best_path: the actual path (list of nodes)
Step 3: Update Q-Values
After executing the action and receiving a reward:
q_learner.update_q_matrix(
    reward=1.0,                         # Reward from allocation
    level_index=congestion_level,       # Current congestion level
    network_spectrum_dict=spectrum_db,  # Current network state
    flag="path",                        # "path" or "core"
    trial=current_trial,
    iteration=current_iteration,
)
The Q-learning update rule:
# Bellman equation
delta = reward + gamma * max_future_q
td_error = current_q - delta
new_q = (1 - learn_rate) * current_q + learn_rate * delta
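For intuition, here is the same rule applied to made-up numbers (gamma and learn_rate come from the simulation configuration; the values below are purely illustrative):

gamma = 0.9        # Discount factor
learn_rate = 0.1   # Learning rate

current_q = 0.5    # Q-value of the chosen (path, congestion level) entry
reward = 1.0       # Successful allocation
max_future_q = 0.7 # Best Q-value reachable from the resulting state

delta = reward + gamma * max_future_q                      # 1.63
td_error = current_q - delta                               # -1.13
new_q = (1 - learn_rate) * current_q + learn_rate * delta  # 0.45 + 0.163 = 0.613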
Step 4: Save the Model
Models are automatically saved at configured intervals:
# Automatic saving happens in update_q_matrix when:
# - iteration % save_step == 0, OR
# - iteration == max_iters - 1
# Models saved to: logs/q_learning/{network}/{date}/{time}/
# Files:
# - rewards_e{erlang}_routes_c{cores}_t{trial}_iter_{iter}.npy
# - state_vals_e{erlang}_routes_c{cores}_t{trial}.json
Quick Start: Using Bandits
Bandits are simpler than Q-learning - they don’t consider future rewards, just immediate value estimates.
Epsilon-Greedy Bandit
from fusion.modules.rl.algorithms import EpsilonGreedyBandit
# Create bandit for path selection
bandit = EpsilonGreedyBandit(
    rl_props=rl_props,
    engine_props=engine_props,
    is_path=True,  # True for path selection, False for core selection
)
# Set exploration rate
bandit.epsilon = 0.1 # 10% random exploration
# Select an arm (path)
action = bandit.select_path_arm(source=0, dest=5)
# After allocation, update with reward
bandit.update(
    arm=action,
    reward=1.0,
    iteration=current_iter,
    trial=current_trial,
)
How epsilon-greedy works:
if random() < epsilon:
    return random_arm()       # Explore
else:
    return best_value_arm()   # Exploit
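The same logic as a small, self-contained sketch (illustrative only, not the FUSION implementation):

import random

def epsilon_greedy(values, epsilon):
    """Pick an arm index given a list of value estimates."""
    if random.random() < epsilon:
        return random.randrange(len(values))                    # Explore: any arm
    return max(range(len(values)), key=values.__getitem__)      # Exploit: best arm

values = [0.2, 0.8, 0.5]
print(epsilon_greedy(values, epsilon=0.1))  # Usually 1, occasionally a random arm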
UCB Bandit
Upper Confidence Bound adds an exploration bonus based on uncertainty:
from fusion.modules.rl.algorithms import UCBBandit
# Create UCB bandit
ucb = UCBBandit(
    rl_props=rl_props,
    engine_props=engine_props,
    is_path=True,
)
# Select arm (automatically balances exploration/exploitation)
action = ucb.select_path_arm(source=0, dest=5)
# Update after allocation
ucb.update(arm=action, reward=1.0, iteration=current_iter, trial=current_trial)
How UCB works:
# UCB formula
ucb_value = estimated_value + sqrt(c * log(total_counts) / arm_counts)
# Arms with fewer selections get higher bonus (exploration)
# Arms with high values are preferred (exploitation)
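As a numeric illustration of that balance, the sketch below scores each arm and picks the best one, mirroring the formula above (with c inside the square root, as written); it is a stand-alone example, not FUSION's implementation.

import numpy as np

def ucb_scores(values, counts, c=2.0):
    """Compute a UCB score per arm: value estimate plus exploration bonus."""
    total = counts.sum()
    # Arms with few plays get a large bonus; the epsilon guards against division by zero
    bonus = np.sqrt(c * np.log(total) / np.maximum(counts, 1e-9))
    return values + bonus

values = np.array([0.4, 0.6, 0.5])
counts = np.array([10, 2, 5])
print(int(np.argmax(ucb_scores(values, counts))))  # Arm 1: decent value, rarely tried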
Quick Start: Deep RL Integration
Deep RL algorithms (PPO, A2C, DQN, QR-DQN) are thin wrappers that provide observation and action spaces. The actual training happens via Stable-Baselines3.
Note
These classes don’t implement the algorithms - they configure the spaces for SB3. The heavy lifting is done by SB3’s implementations.
Creating a DRL Algorithm
from fusion.modules.rl.algorithms import PPO, DQN
# Create PPO configuration
ppo = PPO(rl_props=rl_props, engine_obj=engine_obj)
# Get spaces for SB3
obs_space = ppo.get_obs_space() # gymnasium.spaces.Dict
action_space = ppo.get_action_space() # gymnasium.spaces.Discrete
# These spaces are used by the environment
# SB3 handles the actual learning
Using with Stable-Baselines3
from stable_baselines3 import PPO as SB3_PPO
from fusion.modules.rl.gymnasium_envs import GeneralSimEnv
# Create environment (uses algorithm spaces internally)
env = GeneralSimEnv(sim_dict=config)
# Create SB3 model
model = SB3_PPO("MultiInputPolicy", env, verbose=1)
# Train
model.learn(total_timesteps=10000)
# The algorithm class configured the spaces
# SB3 does the actual training
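Continuing the example above, the trained SB3 model can be saved, reloaded, and queried; this is standard Stable-Baselines3 usage (the save path is arbitrary) and is independent of the FUSION wrapper classes.

# Save and reload the trained SB3 model
model.save("logs/ppo/ppo_model")
model = SB3_PPO.load("logs/ppo/ppo_model", env=env)

# Query the policy for a single observation
obs, _ = env.reset()
action, _states = model.predict(obs, deterministic=True)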
Understanding Properties Classes
The algorithms module includes several properties classes that hold state and configuration.
RLProps
State container for RL simulations (used by environments and agents):
from fusion.modules.rl.algorithms import RLProps
rl_props = RLProps()
# Network configuration
rl_props.k_paths = 3 # Candidate paths
rl_props.cores_per_link = 7 # Cores per fiber
rl_props.spectral_slots = 320 # Slots per core
rl_props.num_nodes = 14 # Network nodes
# Current request state
rl_props.source = 0 # Source node
rl_props.destination = 5 # Destination node
rl_props.paths_list = [...] # Available paths
# Selection state (set by agent)
rl_props.chosen_path_index = 0
rl_props.chosen_path_list = [0, 1, 5]
QProps
Q-learning specific properties:
from fusion.modules.rl.algorithms import QProps
q_props = QProps()
# Epsilon (exploration rate)
q_props.epsilon = 0.1
q_props.epsilon_start = 1.0
q_props.epsilon_end = 0.01
q_props.epsilon_list = [] # Track over time
# Q-tables
q_props.routes_matrix = np.array(...) # Path Q-values
q_props.cores_matrix = np.array(...) # Core Q-values
# Statistics tracking
q_props.rewards_dict = {"routes_dict": {...}, "cores_dict": {...}}
q_props.errors_dict = {"routes_dict": {...}, "cores_dict": {...}}
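A common way to use epsilon_start, epsilon_end, and epsilon_list is to decay epsilon across iterations and record the schedule; the linear decay below is an illustrative sketch, not necessarily the schedule FUSION's QLearning applies internally.

from fusion.modules.rl.algorithms import QProps

q_props = QProps()
q_props.epsilon_start = 1.0
q_props.epsilon_end = 0.01
q_props.epsilon_list = []

max_iters = 100
for iteration in range(max_iters):
    # Linear interpolation from epsilon_start down to epsilon_end
    frac = iteration / (max_iters - 1)
    q_props.epsilon = (1 - frac) * q_props.epsilon_start + frac * q_props.epsilon_end
    q_props.epsilon_list.append(q_props.epsilon)  # Track the schedule over time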
BanditProps
Bandit-specific properties:
from fusion.modules.rl.algorithms import BanditProps
bandit_props = BanditProps()
# Rewards for each episode
bandit_props.rewards_matrix = [] # [[r1, r2, ...], [r1, r2, ...], ...]
# Action counts and values
bandit_props.counts_list = []
bandit_props.state_values_list = []
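Because rewards_matrix is a list of per-iteration reward lists, summarizing it needs only plain Python and NumPy; the contents below are hypothetical, continuing the snippet above.

import numpy as np

# Hypothetical contents: one inner list of per-request rewards per iteration
bandit_props.rewards_matrix = [[1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]

avg_per_iteration = [float(np.mean(rewards)) for rewards in bandit_props.rewards_matrix]
print(avg_per_iteration)  # [0.67, 0.67] (rounded)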
Model Persistence
The module provides classes for saving and loading trained models.
Saving Q-Learning Models
from fusion.modules.rl.algorithms import QLearningModelPersistence
# Save is typically called automatically by QLearning.save_model()
# But can be called directly:
QLearningModelPersistence.save_model(
    q_dict=q_values_dict,        # Q-values as dict
    rewards_avg=rewards_array,   # Average rewards
    erlang=100.0,                # Traffic load
    cores_per_link=7,
    base_str="routes",           # or "cores"
    trial=0,
    iteration=1000,
    save_dir="logs/q_learning/NSFNet/2024-01-15/10-30-00",
)
# Saved files:
# - rewards_e100.0_routes_c7_t1_iter_1000.npy
# - state_vals_e100.0_routes_c7_t1.json
Saving Bandit Models
from fusion.modules.rl.algorithms import BanditModelPersistence
BanditModelPersistence.save_model(
    state_values_dict=bandit.values,
    erlang=100.0,
    cores_per_link=7,
    save_dir="logs/epsilon_greedy_bandit/...",
    is_path=True,
    trial=0,
)
# Saved file:
# - state_vals_e100.0_routes_c7_t1.json
Loading Models
# Load bandit model
state_values = BanditModelPersistence.load_model(
    train_fp="epsilon_greedy_bandit/NSFNet/.../state_vals_e100.0_routes_c7_t1.json"
)
# Load Q-learning model (usually via agent.load_model())
# The Q-tables are loaded into the algorithm object
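Since the saved artifacts are plain .npy and .json files, they can also be inspected offline with NumPy and the standard library; the directory and file names below are illustrative, matching the naming pattern shown earlier.

import json

import numpy as np

save_dir = "logs/q_learning/NSFNet/2024-01-15/10-30-00"

# Load the averaged rewards array and the saved state values
rewards = np.load(f"{save_dir}/rewards_e100.0_routes_c7_t1_iter_1000.npy")
with open(f"{save_dir}/state_vals_e100.0_routes_c7_t1.json", encoding="utf-8") as fp:
    state_vals = json.load(fp)

print(rewards.shape, len(state_vals))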
Extending the Algorithms Module
Tutorial: Adding a New Bandit Algorithm
Let's add a Thompson Sampling bandit.
Step 1: Create the class in bandits.py
class ThompsonSamplingBandit:
    """
    Thompson Sampling bandit algorithm.

    Uses Beta distribution to model uncertainty about arm values.
    """

    def __init__(
        self,
        rl_props: object,
        engine_props: dict,
        is_path: bool,
    ) -> None:
        self.props = BanditProps()
        self.engine_props = engine_props
        self.rl_props = rl_props
        self.is_path = is_path
        self.iteration = 0
        self.source: int | None = None
        self.dest: int | None = None

        if is_path:
            self.n_arms = engine_props["k_paths"]
        else:
            self.n_arms = engine_props["cores_per_link"]

        self.num_nodes = rl_props.num_nodes

        # Beta distribution parameters (successes, failures)
        self.alpha, self.beta = self._init_beta_params()

    def _init_beta_params(self) -> tuple[dict, dict]:
        """Initialize Beta distribution parameters."""
        alpha = {}
        beta = {}
        for src in range(self.num_nodes):
            for dst in range(self.num_nodes):
                if src == dst:
                    continue
                key = (src, dst)
                alpha[key] = np.ones(self.n_arms)
                beta[key] = np.ones(self.n_arms)
        return alpha, beta
Step 2: Add action selection
    def select_path_arm(self, source: int, dest: int) -> int:
        """Select arm using Thompson Sampling."""
        self.source = source
        self.dest = dest
        key = (source, dest)

        # Sample from the Beta distribution for each arm
        samples = np.random.beta(self.alpha[key], self.beta[key])
        return int(np.argmax(samples))
Step 3: Add update method
    def update(
        self,
        arm: int,
        reward: float,
        iteration: int,
        trial: int,
    ) -> None:
        """Update Beta parameters based on reward."""
        key = (self.source, self.dest)

        # Bernoulli reward: success (1) or failure (0)
        if reward > 0:
            self.alpha[key][arm] += 1
        else:
            self.beta[key][arm] += 1

        self.iteration = iteration

        # Track rewards
        if self.iteration >= len(self.props.rewards_matrix):
            self.props.rewards_matrix.append([])
        self.props.rewards_matrix[self.iteration].append(reward)

        # Save model periodically
        save_model(
            iteration=iteration,
            algorithm="thompson_sampling_bandit",
            self=self,
            trial=trial,
        )
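To sanity-check the new bandit's logic outside FUSION, the same Beta-sampling update can be exercised on a toy three-arm problem; the sketch below is pure NumPy and independent of the class above.

import numpy as np

rng = np.random.default_rng(0)
true_success = [0.2, 0.5, 0.8]   # Hidden success probability of each arm
alpha = np.ones(3)               # Beta "successes" per arm
beta = np.ones(3)                # Beta "failures" per arm

for _ in range(1000):
    arm = int(np.argmax(rng.beta(alpha, beta)))       # Sample each arm, pick the best draw
    reward = 1.0 if rng.random() < true_success[arm] else 0.0
    if reward > 0:
        alpha[arm] += 1
    else:
        beta[arm] += 1

print(int(np.argmax(alpha / (alpha + beta))))  # Usually prints 2, the best arm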
Step 4: Export in __init__.py
from .bandits import EpsilonGreedyBandit, UCBBandit, ThompsonSamplingBandit

__all__ = [
    # ...
    "ThompsonSamplingBandit",
]
Step 5: Add to agents
In base_agent.py:
elif self.algorithm == "thompson_sampling_bandit":
    from fusion.modules.rl.algorithms.bandits import ThompsonSamplingBandit

    self.algorithm_obj = ThompsonSamplingBandit(
        rl_props=self.rl_props,
        engine_props=self.engine_props,
        is_path=is_path,
    )
Tutorial: Adding a New DRL Algorithm
DRL algorithms are wrappers that configure spaces for SB3.
Step 1: Create the class
# sac.py
"""Soft Actor-Critic (SAC) algorithm integration."""
from fusion.modules.rl.algorithms.base_drl import BaseDRLAlgorithm
class SAC(BaseDRLAlgorithm):
    """
    Soft Actor-Critic for reinforcement learning.

    Inherits observation and action space handling from BaseDRLAlgorithm.
    """

    def get_action_space(self):
        """SAC typically uses continuous actions, but we use discrete."""
        # Override if SAC needs a different action space
        return super().get_action_space()
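Note that Stable-Baselines3's SAC only accepts Box (continuous) action spaces, so keeping the discrete default above means training with a discrete-action algorithm (DQN, PPO, A2C) or an sb3-contrib variant. If a continuous interface were genuinely wanted, the override could look like the hypothetical sketch below; ContinuousSAC and its [0, 1] action encoding are illustrative, not part of FUSION.

import numpy as np
from gymnasium import spaces

from fusion.modules.rl.algorithms.base_drl import BaseDRLAlgorithm


class ContinuousSAC(BaseDRLAlgorithm):
    """Illustrative variant: expose a continuous Box action space for SAC."""

    def get_action_space(self):
        # A single float in [0, 1] that the environment would map to a path index
        return spaces.Box(low=0.0, high=1.0, shape=(1,), dtype=np.float32)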
Step 2: Export and integrate
# In __init__.py
from .sac import SAC
__all__ = [..., "SAC"]
# In base_agent.py setup_env()
elif self.algorithm == "sac":
    self.algorithm_obj = SAC(rl_props=self.rl_props, engine_obj=self.engine_props)
Testing
Running Tests
# Run algorithm tests
pytest fusion/modules/tests/rl/test_algorithm_props.py -v
pytest fusion/modules/tests/rl/test_q_learning.py -v
pytest fusion/modules/tests/rl/test_bandits.py -v
# Run with coverage
pytest fusion/modules/tests/rl/ -v --cov=fusion.modules.rl.algorithms
Writing Algorithm Tests
import pytest
import numpy as np
from unittest.mock import MagicMock

from fusion.modules.rl.algorithms import EpsilonGreedyBandit


@pytest.fixture
def mock_rl_props():
    props = MagicMock()
    props.num_nodes = 5
    return props


@pytest.fixture
def bandit(mock_rl_props):
    engine_props = {
        "k_paths": 3,
        "cores_per_link": 7,
        "max_iters": 100,
        "save_step": 50,
        "num_requests": 10,
    }
    return EpsilonGreedyBandit(mock_rl_props, engine_props, is_path=True)


def test_select_path_arm_returns_valid_action(bandit):
    """select_path_arm should return action in valid range."""
    bandit.epsilon = 0.0  # Greedy selection
    action = bandit.select_path_arm(source=0, dest=1)
    assert 0 <= action < bandit.n_arms


def test_epsilon_greedy_explores_with_high_epsilon(bandit):
    """High epsilon should lead to diverse actions."""
    bandit.epsilon = 1.0  # Always explore
    actions = [bandit.select_path_arm(0, 1) for _ in range(100)]
    # Should see multiple different actions
    assert len(set(actions)) > 1
Common Issues
“rl_props must have num_nodes”
# RLProps needs to be properly initialized
rl_props = RLProps()
rl_props.num_nodes = 14 # Set this before creating algorithms
Q-table shape mismatch
The Q-table shape depends on network configuration:
# routes_matrix shape: (num_nodes, num_nodes, k_paths, path_levels)
# Make sure engine_props matches rl_props:
assert rl_props.num_nodes == expected_nodes
assert rl_props.k_paths == engine_props["k_paths"]
Model saving path errors
Models save to logs/{algorithm}/{network}/{date}/{time}/:
# Ensure engine_props has required fields:
engine_props = {
    "network": "NSFNet",
    "date": "2024-01-15",
    "sim_start": "10-30-00",
    "erlang": 100.0,
    "cores_per_link": 7,
    # ...
}
File Reference
fusion/modules/rl/algorithms/
|-- __init__.py # Public exports
|-- README.md # Module documentation
|-- algorithm_props.py # RLProps, QProps, BanditProps, PPOProps
|-- persistence.py # BanditModelPersistence, QLearningModelPersistence
|-- base_drl.py # BaseDRLAlgorithm (DRL base class)
|-- q_learning.py # QLearning
|-- bandits.py # EpsilonGreedyBandit, UCBBandit
|-- ppo.py # PPO (SB3 wrapper)
|-- a2c.py # A2C (SB3 wrapper)
|-- dqn.py # DQN (SB3 wrapper)
`-- qr_dqn.py # QrDQN (SB3 wrapper)
What to import:
# Algorithms
from fusion.modules.rl.algorithms import (
    QLearning,
    EpsilonGreedyBandit,
    UCBBandit,
    PPO,
    A2C,
    DQN,
    QrDQN,
)

# Properties
from fusion.modules.rl.algorithms import (
    RLProps,
    QProps,
    BanditProps,
)

# Persistence
from fusion.modules.rl.algorithms import (
    BanditModelPersistence,
    QLearningModelPersistence,
)
# Base class (for extending)
from fusion.modules.rl.algorithms import BaseDRLAlgorithm