Failures Module
Warning
Beta Status: This module is in beta. It has not been fully validated
in production simulations. See TODO.md for the development roadmap.
Warning
Integration is Incomplete: The orchestrator path has partial failure support. Read the “Current Integration Status” section carefully before using failures in your experiments.
Quick Summary: What Works and What Doesn’t
| Capability | Legacy Path | Orchestrator Path |
|---|---|---|
| Inject failures (link/node/SRLG/geo) | YES | YES |
| Activate failures at scheduled time | YES | YES |
| Repair failures at scheduled time | YES | YES |
| Handle impact on allocated requests | YES | YES |
| Avoid failed paths during NEW routing | YES | NO (gap!) |
| RL policies receive failure info | N/A | YES (via DisasterState) |
Bottom line: If you need new allocations to avoid failed links, use the legacy
path (use_orchestrator=False) until orchestrator integration is complete.
Overview
At a Glance
- Purpose: Network failure injection for survivability testing
- Location: fusion/modules/failures/
- Key Files: failure_manager.py, failure_types.py
- Status: Beta - Partial orchestrator integration
The failures module injects network failures (failed links) into simulations to test survivability and recovery mechanisms.
What This Module Does
- Injects failures: Schedule when links fail and when they’re repaired
- Tracks active failures: Know which links are currently down
- Checks path feasibility: Determine if a path avoids failed links
- Handles failure impact: When failures hit allocated requests, switch to backup or drop
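To make the path feasibility check above concrete, here is a minimal sketch. It is not the module's implementation: `hops` and `path_is_feasible` are hypothetical helper names, and the sketch assumes links are undirected, so (0, 1) and (1, 0) name the same link.

```python
from typing import Hashable, Iterable, Sequence


def hops(path: Sequence[Hashable]) -> list[frozenset]:
    """Return the links a node path traverses, ignoring direction."""
    return [frozenset(pair) for pair in zip(path, path[1:])]


def path_is_feasible(path: Sequence[Hashable], failed_links: Iterable[tuple]) -> bool:
    """True when none of the path's links is in the failed set."""
    failed = {frozenset(link) for link in failed_links}
    return failed.isdisjoint(hops(path))


failed_links = {(0, 1)}
print(path_is_feasible([0, 1, 2], failed_links))  # False: uses link (0, 1)
print(path_is_feasible([2, 1, 0], failed_links))  # False: same link, reverse direction
print(path_is_feasible([0, 3, 2], failed_links))  # True: avoids the failed link
```

Storing each link as a frozenset keeps the lookup independent of traversal direction. If the real topology is directed, plain tuples would be the better key.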
Failure Types Supported
- F1: Link Failure - Single link fails
- F2: Node Failure - Node + all adjacent links fail
- F3: SRLG Failure - Multiple links sharing risk fail together
- F4: Geographic - All links within hop-radius of center fail
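The sketch below shows one way the failed link sets for F2 and F4 could be derived from a topology. It is an illustration under stated assumptions, not the code in failure_types.py: the topology is a plain adjacency dict, the helper names are hypothetical, and a link counts as "within" the hop radius only when both of its endpoints are.

```python
from collections import deque


def links_of_node(adjacency, node):
    """F2-style set: every link adjacent to the failed node."""
    return {tuple(sorted((node, nbr))) for nbr in adjacency[node]}


def nodes_within_hops(adjacency, center, hop_radius):
    """Breadth-first search that stops expanding at hop_radius."""
    dist = {center: 0}
    queue = deque([center])
    while queue:
        current = queue.popleft()
        if dist[current] == hop_radius:
            continue  # do not expand beyond the radius
        for nbr in adjacency[current]:
            if nbr not in dist:
                dist[nbr] = dist[current] + 1
                queue.append(nbr)
    return set(dist)


def links_in_region(adjacency, center, hop_radius):
    """F4-style set: links whose endpoints both lie inside the region."""
    region = nodes_within_hops(adjacency, center, hop_radius)
    return {
        tuple(sorted((u, v)))
        for u in region
        for v in adjacency[u]
        if v in region
    }


# Line topology 0 - 1 - 2 - 3 - 4
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(sorted(links_of_node(adjacency, 2)))       # [(1, 2), (2, 3)]
print(sorted(links_in_region(adjacency, 0, 2)))  # [(0, 1), (1, 2)]
```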
Current Integration Status
This is the confusing part. Here’s exactly what happens in each path:
Visual: How Failures Flow Through the System
+===========================================================================+
|                             SIMULATION ENGINE                             |
|                        (fusion/core/simulation.py)                        |
|                                                                           |
|                Owns FailureManager - works for BOTH paths                 |
+===========================================================================+
         |
         |  At simulation start:
         |  _initialize_failure_manager() creates FailureManager
         |
         v
+------------------+
| FailureManager   |  <-- Owned by SimulationEngine, shared reference
| - scheduled      |      to SDNController (legacy) but NOT to
|   failures       |      SDNOrchestrator
| - active failures|
| - repair schedule|
+--------+---------+
         |
         |
=========|=====================================================================
         |  DURING SIMULATION (main event loop)
=========|=====================================================================
         |
         |  For EACH time step, SimulationEngine calls:
         |
         v
+------------------+     +------------------+     +------------------+
| activate_        |     | _handle_failure_ |     | repair_          |
| failures(time)   |---->| impact()         |---->| failures(time)   |
|                  |     |                  |     |                  |
| Moves scheduled  |     | For allocated    |     | Removes links    |
| failures to      |     | requests hit by  |     | from active      |
| active set       |     | new failures:    |     | failures         |
+------------------+     | - Switch to      |     +------------------+
                         |   backup path    |
     WORKS FOR           | - Or drop request|          WORKS FOR
     BOTH PATHS          +------------------+          BOTH PATHS
                                  |
                              WORKS FOR
                              BOTH PATHS
=========|=====================================================================
         |  DURING ROUTING (when new request arrives)
=========|=====================================================================
         |
         +------------------+------------------+
         |                                     |
         v                                     v
+------------------+                  +------------------+
| LEGACY PATH      |                  | ORCHESTRATOR     |
| use_orchestrator |                  | use_orchestrator |
| = False          |                  | = True           |
+------------------+                  +------------------+
         |                                     |
         v                                     v
+------------------+                  +------------------+
| SDNController    |                  | SDNOrchestrator  |
| HAS reference to |                  | NO reference to  |
| failure_manager  |                  | failure_manager  |
+--------+---------+                  +--------+---------+
         |                                     |
         v                                     v
+------------------+                  +------------------+
| Before routing:  |                  | Routing happens  |
| checks           |                  | WITHOUT checking |
| is_path_feasible |                  | path feasibility |
|                  |                  |                  |
| AVOIDS FAILED    |                  | MAY ALLOCATE     |
| PATHS            |                  | THROUGH FAILED   |
+------------------+                  | LINKS!           |
         |                            +------------------+
         |                                     |
         v                                     v
        SAFE                            GAP - NOT SAFE
                                     (will be fixed in
                                     _handle_failure_impact
                                     but allocation already
                                     happened)
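The per-time-step ordering in the diagram (activate, handle impact, repair) can be sketched with a toy schedule. `ToyFailureSchedule` and `step` are hypothetical stand-ins written for this example. They are not FailureManager or SimulationEngine, and the impact step only logs the newly failed links.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFailureSchedule:
    scheduled: list = field(default_factory=list)  # (t_fail, t_repair, link)
    active: dict = field(default_factory=dict)     # link -> t_repair

    def inject(self, link, t_fail, t_repair):
        self.scheduled.append((t_fail, t_repair, link))

    def activate(self, now):
        """Move every failure that is due into the active set."""
        due = [event for event in self.scheduled if event[0] <= now]
        self.scheduled = [event for event in self.scheduled if event[0] > now]
        for _, t_repair, link in due:
            self.active[link] = t_repair
        return [link for _, _, link in due]

    def repair(self, now):
        """Remove every active failure whose repair time has passed."""
        done = [link for link, t_repair in self.active.items() if t_repair <= now]
        for link in done:
            del self.active[link]
        return done


def step(schedule, now, log):
    """One time step: activate, then handle impact, then repair."""
    newly_failed = schedule.activate(now)
    if newly_failed:
        log.append((now, "newly failed", newly_failed))  # impact handling goes here
    repaired = schedule.repair(now)
    if repaired:
        log.append((now, "repaired", repaired))


schedule = ToyFailureSchedule()
schedule.inject((0, 1), t_fail=100.0, t_repair=200.0)
log = []
for now in (99.0, 100.0, 150.0, 200.0):
    step(schedule, now, log)
print(log)
```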
What This Means For Your Experiments
If using legacy path (use_orchestrator=False):
- Failures work as expected
- New requests avoid failed paths
- Protected requests switch to backup when primary fails
If using orchestrator path (use_orchestrator=True):
- Failures are injected and tracked (this works)
- Already-allocated requests are handled when failures hit (this works)
- BUT: New requests may be allocated through failed links (BUG/GAP)
- The allocation will then immediately be impacted by _handle_failure_impact()
- This is inefficient and may cause unexpected behavior
If using RL with orchestrator:
- RL policies CAN receive failure information via DisasterState
- The RL adapter computes failure_mask features
- RL policies trained on survivability CAN make failure-aware decisions
- This is a workaround, not a fix for the core gap
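As a rough illustration of how a failure mask over candidate paths might be derived from a set of failed links, consider the sketch below. The actual feature layout of the RL adapter is not described here, so the function name, the per-path shape, and the convention that 1.0 means "usable" are all assumptions.

```python
def path_failure_mask(candidate_paths, failed_links):
    """Return one float per candidate path: 1.0 if usable, 0.0 if it crosses a failed link."""
    failed = {frozenset(link) for link in failed_links}
    mask = []
    for path in candidate_paths:
        hit = any(frozenset(hop) in failed for hop in zip(path, path[1:]))
        mask.append(0.0 if hit else 1.0)
    return mask


candidates = [[0, 1, 2], [0, 3, 2], [0, 4, 1, 2]]
print(path_failure_mask(candidates, failed_links=frozenset([(0, 1), (2, 3)])))
# [0.0, 0.0, 1.0]
```

A policy that masks its action logits with such a vector never selects a path through a failed link, which is why this acts as a workaround on the orchestrator path.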
Future Intent
The intended architecture (not yet implemented):
FUTURE STATE (v6.x):
+------------------+
| SDNOrchestrator  |
|                  |
| routing ---------|---> RoutingPipeline checks FailureManager
| spectrum         |     before returning paths
| protection       |
| failure_manager -|---> Reference to FailureManager (NEW)
+------------------+
Options being considered:
1. Pass FailureManager to orchestrator
   - Orchestrator checks is_path_feasible() during routing
   - Similar to how SDNController works
2. Add failure info to NetworkState
   - NetworkState.failed_links property
   - Routing pipelines read from NetworkState
3. Create FailuresPipeline
   - New pipeline stage that filters infeasible paths
   - Fits the pipeline architecture pattern
No decision has been made yet. This is tracked in the module’s TODO.md.
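To give a feel for option 3, the sketch below shows a filter stage that drops infeasible candidate paths before allocation. It is purely hypothetical: `FailureFilterStage` is not an existing class, the pipeline interface is assumed to be a callable over a list of paths, and the feasibility check is injected so that a method such as is_path_feasible() could be passed in.

```python
from typing import Callable, Sequence

Path = Sequence[int]


class FailureFilterStage:
    """Hypothetical pipeline stage: keep only paths the feasibility check accepts."""

    def __init__(self, is_feasible: Callable[[Path], bool]) -> None:
        self._is_feasible = is_feasible

    def __call__(self, candidates: list[Path]) -> list[Path]:
        return [path for path in candidates if self._is_feasible(path)]


# Stand-in feasibility check over a fixed set of failed links
failed = {frozenset((0, 1))}


def avoids_failed_links(path: Path) -> bool:
    return all(frozenset(hop) not in failed for hop in zip(path, path[1:]))


stage = FailureFilterStage(avoids_failed_links)
print(stage([[0, 1, 2], [0, 3, 2]]))  # [[0, 3, 2]]
```

Injecting the check keeps the stage independent of whether failure data ends up in FailureManager (option 1) or NetworkState (option 2).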
Module Components
failure_manager.py
The main class for managing failures:
from fusion.modules.failures import FailureManager

manager = FailureManager(engine_props, topology)

# Schedule a failure
event = manager.inject_failure(
    'link',
    t_fail=100.0,    # Fails at t=100
    t_repair=200.0,  # Repaired at t=200
    link_id=(0, 1)
)

# Later, at t=100:
activated = manager.activate_failures(100.0)  # Returns [(0, 1)]

# Check whether a path avoids failures before allocating it:
path = [0, 1, 2]
if manager.is_path_feasible(path):  # False here - path uses failed link (0, 1)
    allocate(path)  # allocate() stands in for your own allocation call

# At t=200:
repaired = manager.repair_failures(200.0)  # Returns [(0, 1)]
failure_types.py
Implementation of F1-F4 failure types:
from fusion.modules.failures import fail_link, fail_node, fail_srlg, fail_geo
# F1: Single link
event = fail_link(topology, link_id=(0, 1), t_fail=10, t_repair=20)
# F2: Node (all adjacent links)
event = fail_node(topology, node_id=5, t_fail=10, t_repair=20)
# F3: SRLG (multiple links)
event = fail_srlg(topology, srlg_links=[(0,1), (2,3)], t_fail=10, t_repair=20)
# F4: Geographic (hop radius)
event = fail_geo(topology, center_node=5, hop_radius=2, t_fail=10, t_repair=20)
registry.py
Registry pattern for extensibility:
from fusion.modules.failures import register_failure_type, get_failure_handler
# Get built-in handler
handler = get_failure_handler('link')
# Register custom handler
def my_failure(topology, t_fail, t_repair, **kwargs):
    return {
        "failure_type": "custom",
        "failed_links": [...],
        # ... remaining event fields
    }

register_failure_type('custom', my_failure)
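For readers unfamiliar with the registry pattern, a minimal stand-alone version looks roughly like the sketch below. It is not the code in registry.py: the function names, the error behaviour, and the t_fail / t_repair keys in the event dict are assumptions made for illustration.

```python
_HANDLERS = {}


def register_handler(name, handler):
    """Map a failure type name to the function that builds its event."""
    if name in _HANDLERS:
        raise ValueError(f"failure type {name!r} is already registered")
    _HANDLERS[name] = handler


def get_handler(name):
    """Look up a handler, listing the known types when the name is unknown."""
    try:
        return _HANDLERS[name]
    except KeyError:
        known = ", ".join(sorted(_HANDLERS))
        raise KeyError(f"unknown failure type {name!r}; known: {known}") from None


def ring_failure(topology, t_fail, t_repair, **kwargs):
    """Example custom handler: fail every link listed under 'ring_links'."""
    return {
        "failure_type": "ring",
        "failed_links": list(kwargs["ring_links"]),
        "t_fail": t_fail,
        "t_repair": t_repair,
    }


register_handler("ring", ring_failure)
event = get_handler("ring")(None, t_fail=10, t_repair=20, ring_links=[(0, 1), (1, 2)])
print(event["failed_links"])  # [(0, 1), (1, 2)]
```

Callers select a failure type by name, so adding a new type does not require touching the code that schedules failures.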
Development Guide
Adding a New Failure Type
1. Add function to failure_types.py
2. Register in registry.py
3. Export in __init__.py
4. Add tests
Running Survivability Experiments
For now, use legacy path:
engine_props = {
    "use_orchestrator": False,  # Use legacy for reliable failure handling
    "failure_type": "geo",
    "failure_center_node": 5,
    "failure_hop_radius": 2,
    # ... other props
}
If you must use orchestrator with RL:
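from fusion.modules.rl.adapter import DisasterState, RLSimulationAdapter

# Create disaster state for RL
# (x, y) are placeholders for the disaster centroid coordinates
disaster_state = DisasterState(
    active=True,
    centroid=(x, y),
    radius=100.0,
    failed_links=frozenset([(0, 1), (2, 3)]),
)

# Pass to RL adapter for feature computation
# (adapter is assumed to be an existing RLSimulationAdapter instance;
# request and network_state come from the running simulation)
state = adapter.compute_state(request, network_state, disaster_state)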
Testing
# Run failure module tests
pytest fusion/modules/failures/tests/ -v
# Run with coverage
pytest fusion/modules/failures/tests/ -v --cov=fusion.modules.failures
Troubleshooting
“My orchestrator simulation allocates through failed links”
This is the known gap. Use use_orchestrator=False for now.
“Failures aren’t being activated”
The SimulationEngine calls activate_failures(time) automatically in the main loop.
If you drive FailureManager yourself, check that you call activate_failures(time) at the failure time.
“Protected requests aren’t switching to backup”
Check _handle_failure_impact() in simulation.py. This handles
switchover but requires protection to have been set up via ProtectionPipeline.
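The switch-or-drop decision can be sketched as follows. This is not the body of _handle_failure_impact(): the request layout (a dict with "active_path" and "backup_path") and the helper names are assumptions. The point to notice is that a request without a backup path can only be dropped, which is why protection must be set up beforehand.

```python
def crosses_failure(path, failed):
    """True when any hop of the path is a failed link (direction ignored)."""
    return any(frozenset(hop) in failed for hop in zip(path, path[1:]))


def handle_impact(requests, failed_links):
    """Switch impacted requests to a feasible backup path, otherwise drop them."""
    failed = {frozenset(link) for link in failed_links}
    outcome = {}
    for req_id, req in requests.items():
        if not crosses_failure(req["active_path"], failed):
            outcome[req_id] = "unaffected"
            continue
        backup = req.get("backup_path")
        if backup is not None and not crosses_failure(backup, failed):
            req["active_path"] = backup
            outcome[req_id] = "switched"
        else:
            outcome[req_id] = "dropped"
    return outcome


requests = {
    "protected": {"active_path": [0, 1, 2], "backup_path": [0, 3, 2]},
    "unprotected": {"active_path": [0, 1, 2], "backup_path": None},
    "elsewhere": {"active_path": [4, 5], "backup_path": None},
}
print(handle_impact(requests, failed_links=[(0, 1)]))
# {'protected': 'switched', 'unprotected': 'dropped', 'elsewhere': 'unaffected'}
```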