.. _failures-module:

===============
Failures Module
===============

.. warning::

   **Beta Status**: This module is in beta. It has not been fully validated in
   production simulations. See ``TODO.md`` for the development roadmap.

.. warning::

   **Integration is Incomplete**: The orchestrator path has partial failure
   support. Read the "Current Integration Status" section carefully before
   using failures in your experiments.

Quick Summary: What Works and What Doesn't
==========================================

.. list-table::
   :header-rows: 1
   :widths: 40 30 30

   * - Capability
     - Legacy Path
     - Orchestrator Path
   * - Inject failures (link/node/SRLG/geo)
     - YES
     - YES
   * - Activate failures at scheduled time
     - YES
     - YES
   * - Repair failures at scheduled time
     - YES
     - YES
   * - Handle impact on allocated requests
     - YES
     - YES
   * - **Avoid failed paths during NEW routing**
     - **YES**
     - **NO** (gap!)
   * - RL policies receive failure info
     - N/A
     - YES (via DisasterState)

**Bottom line**: If you need new allocations to avoid failed links, use the
legacy path (``use_orchestrator=False``) until orchestrator integration is
complete.

Overview
========

.. admonition:: At a Glance
   :class: tip

   :Purpose: Network failure injection for survivability testing
   :Location: ``fusion/modules/failures/``
   :Key Files: ``failure_manager.py``, ``failure_types.py``
   :Status: **Beta** - Partial orchestrator integration

The failures module injects network failures (failed links) into simulations
to test survivability and recovery mechanisms.

What This Module Does
---------------------

1. **Injects failures**: Schedule when links fail and when they're repaired
2. **Tracks active failures**: Know which links are currently down
3. **Checks path feasibility**: Determine if a path avoids failed links
4. **Handles failure impact**: When failures hit allocated requests, switch to
   backup or drop

Failure Types Supported
-----------------------

.. code-block:: text

   F1: Link Failure   - Single link fails
   F2: Node Failure   - Node + all adjacent links fail
   F3: SRLG Failure   - Multiple links sharing risk fail together
   F4: Geographic     - All links within hop-radius of center fail

Current Integration Status
==========================

This is the confusing part. Here's exactly what happens in each path:

Visual: How Failures Flow Through the System
--------------------------------------------

.. code-block:: text

   +===========================================================================+
   |                             SIMULATION ENGINE                             |
   |                        (fusion/core/simulation.py)                        |
   |                                                                           |
   |                Owns FailureManager - works for BOTH paths                 |
   +===========================================================================+
            |
            | At simulation start:
            | _initialize_failure_manager() creates FailureManager
            v
   +------------------+
   | FailureManager   |  <-- Owned by SimulationEngine, shared reference
   | - scheduled      |      to SDNController (legacy) but NOT to
   |   failures       |      SDNOrchestrator
   | - active failures|
   | - repair schedule|
   +--------+---------+
            |
            |
   =========|=====================================================================
            |  DURING SIMULATION (main event loop)
   =========|=====================================================================
            |
            | For EACH time step, SimulationEngine calls:
            |
            v
   +------------------+     +------------------+     +------------------+
   | activate_        |     | _handle_failure_ |     | repair_          |
   | failures(time)   |---->| impact()         |---->| failures(time)   |
   |                  |     |                  |     |                  |
   | Moves scheduled  |     | For allocated    |     | Removes links    |
   | failures to      |     | requests hit by  |     | from active      |
   | active set       |     | new failures:    |     | failures         |
   +------------------+     | - Switch to      |     +------------------+
                            |   backup path    |
        WORKS FOR           | - Or drop request|          WORKS FOR
        BOTH PATHS          +------------------+          BOTH PATHS
                                     |
                            WORKS FOR BOTH PATHS

   =========|=====================================================================
            |  DURING ROUTING (when new request arrives)
   =========|=====================================================================
            |
            +------------------+------------------+
            |                                     |
            v                                     v
   +------------------+                  +------------------+
   | LEGACY PATH      |                  | ORCHESTRATOR     |
   | use_orchestrator |                  | use_orchestrator |
   | = False          |                  | = True           |
   +------------------+                  +------------------+
            |                                     |
            v                                     v
   +------------------+                  +------------------+
   | SDNController    |                  | SDNOrchestrator  |
   | HAS reference to |                  | NO reference to  |
   | failure_manager  |                  | failure_manager  |
   +--------+---------+                  +--------+---------+
            |                                     |
            v                                     v
   +------------------+                  +------------------+
   | Before routing:  |                  | Routing happens  |
   | checks           |                  | WITHOUT checking |
   | is_path_feasible |                  | path feasibility |
   |                  |                  |                  |
   | AVOIDS FAILED    |                  | MAY ALLOCATE     |
   | PATHS            |                  | THROUGH FAILED   |
   +------------------+                  | LINKS!           |
            |                            +------------------+
            |                                     |
            v                                     v
          SAFE                             GAP - NOT SAFE
                                         (handled afterwards by
                                         _handle_failure_impact,
                                         but the allocation has
                                         already happened)

What This Means For Your Experiments
------------------------------------

**If using legacy path** (``use_orchestrator=False``):

- Failures work as expected
- New requests avoid failed paths
- Protected requests switch to backup when primary fails

**If using orchestrator path** (``use_orchestrator=True``):

- Failures are injected and tracked (this works)
- Already-allocated requests are handled when failures hit (this works)
- **BUT**: New requests may be allocated through failed links (BUG/GAP)
- Such an allocation is then immediately impacted by
  ``_handle_failure_impact()``
- This is inefficient and may cause unexpected behavior

**If using RL with orchestrator**:

- RL policies CAN receive failure information via ``DisasterState``
- The RL adapter computes ``failure_mask`` features
- RL policies trained on survivability CAN make failure-aware decisions
- This is a workaround, not a fix for the core gap

The Two Failure-Related Concepts
================================

Two components sound similar but do different jobs:

.. code-block:: text

   +----------------------------------+----------------------------------+
   | FailureManager                   | ProtectionPipeline               |
   | (fusion/modules/failures/)       | (fusion/pipelines/)              |
   +----------------------------------+----------------------------------+
   |                                  |                                  |
   | SIMULATES failures happening     | PREPARES for failures            |
   |                                  |                                  |
   | "At t=100, link (0,1) fails"     | "Allocate backup path now        |
   | "At t=200, it's repaired"        |  in case primary fails later"    |
   |                                  |                                  |
   | Answers: "Is this path           | Answers: "What disjoint backup   |
   | currently blocked?"              | path should we provision?"       |
   |                                  |                                  |
   | Used by: SimulationEngine,       | Used by: SDNOrchestrator         |
   |          SDNController           | (orchestrator path only)         |
   |                                  |                                  |
   +----------------------------------+----------------------------------+
   |                                  |                                  |
   | REACTIVE                         | PROACTIVE                        |
   | (respond to failures)            | (prepare before failures)        |
   |                                  |                                  |
   +----------------------------------+----------------------------------+

**They should work together** but currently don't fully integrate:

- ProtectionPipeline provisions backup paths
- FailureManager should trigger switchover when failures hit
- The ``_handle_failure_impact()`` method does this, but only AFTER allocation

Future Intent
=============

The intended architecture (not yet implemented):

.. code-block:: text

   FUTURE STATE (v6.x):

   +------------------+
   | SDNOrchestrator  |
   |                  |
   | routing ---------|---> RoutingPipeline checks FailureManager
   | spectrum         |     before returning paths
   | protection       |
   | failure_manager -|---> Reference to FailureManager (NEW)
   +------------------+

   Options being considered:

   1. Pass FailureManager to orchestrator
      - Orchestrator checks is_path_feasible() during routing
      - Similar to how SDNController works

   2. Add failure info to NetworkState
      - NetworkState.failed_links property
      - Routing pipelines read from NetworkState

   3. Create FailuresPipeline
      - New pipeline stage that filters infeasible paths
      - Fits the pipeline architecture pattern

**No decision has been made yet.** This is tracked in the module's ``TODO.md``.
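
To make the third option concrete, the filtering step could look roughly like
the sketch below. It is only an illustration, not a design decision and not
FUSION code: ``filter_feasible_paths`` and ``path_links`` are hypothetical
names, paths are assumed to be lists of node IDs, and links are assumed to be
undirected, so ``(0, 1)`` and ``(1, 0)`` are treated as the same link.

.. code-block:: python

   from typing import FrozenSet, Iterable, List, Tuple

   Link = Tuple[int, int]


   def path_links(path: List[int]) -> List[Link]:
       """Return the consecutive node pairs (links) along a path."""
       return list(zip(path, path[1:]))


   def filter_feasible_paths(
       candidate_paths: Iterable[List[int]],
       failed_links: FrozenSet[Link],
   ) -> List[List[int]]:
       """Keep only candidate paths that use no failed link.

       Hypothetical helper: links are compared as unordered pairs, which
       assumes an undirected topology.
       """
       failed = {frozenset(link) for link in failed_links}
       return [
           path
           for path in candidate_paths
           if not any(frozenset(link) in failed for link in path_links(path))
       ]


   candidates = [[0, 1, 2], [0, 3, 2]]
   feasible = filter_feasible_paths(candidates, frozenset({(0, 1)}))
   print(feasible)  # [[0, 3, 2]]

The same predicate would also serve the second option, with the set of failed
links read from network state instead of being passed in explicitly.
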

Module Components
=================

failure_manager.py
------------------

The main class for managing failures:

.. code-block:: python

   from fusion.modules.failures import FailureManager

   # engine_props and topology come from your simulation setup
   manager = FailureManager(engine_props, topology)

   # Schedule a failure
   event = manager.inject_failure(
       'link',
       t_fail=100.0,    # Fails at t=100
       t_repair=200.0,  # Repaired at t=200
       link_id=(0, 1)
   )

   # Later, at t=100:
   activated = manager.activate_failures(100.0)  # Returns [(0, 1)]

   # Check if path avoids failures:
   path = [0, 1, 2]
   if manager.is_path_feasible(path):  # False here - path uses failed link (0, 1)
       allocate(path)  # placeholder for your own allocation logic

   # At t=200:
   repaired = manager.repair_failures(200.0)  # Returns [(0, 1)]

failure_types.py
----------------

Implementation of F1-F4 failure types:

.. code-block:: python

   from fusion.modules.failures import fail_link, fail_node, fail_srlg, fail_geo

   # F1: Single link
   event = fail_link(topology, link_id=(0, 1), t_fail=10, t_repair=20)

   # F2: Node (all adjacent links)
   event = fail_node(topology, node_id=5, t_fail=10, t_repair=20)

   # F3: SRLG (multiple links)
   event = fail_srlg(topology, srlg_links=[(0, 1), (2, 3)], t_fail=10, t_repair=20)

   # F4: Geographic (hop radius)
   event = fail_geo(topology, center_node=5, hop_radius=2, t_fail=10, t_repair=20)

registry.py
-----------

Registry pattern for extensibility:

.. code-block:: python

   from fusion.modules.failures import register_failure_type, get_failure_handler

   # Get built-in handler
   handler = get_failure_handler('link')

   # Register custom handler
   def my_failure(topology, t_fail, t_repair, **kwargs):
       return {"failure_type": "custom", "failed_links": [...], ...}

   register_failure_type('custom', my_failure)

Development Guide
=================

Adding a New Failure Type
-------------------------

1. Add function to ``failure_types.py``
2. Register in ``registry.py``
3. Export in ``__init__.py``
4. Add tests

Running Survivability Experiments
---------------------------------

**For now, use legacy path:**

.. code-block:: python

   engine_props = {
       "use_orchestrator": False,  # Use legacy for reliable failure handling
       "failure_type": "geo",
       "failure_center_node": 5,
       "failure_hop_radius": 2,
       # ... other props
   }

**If you must use orchestrator with RL:**

.. code-block:: python

   from fusion.modules.rl.adapter import DisasterState, RLSimulationAdapter

   # x, y, adapter, request and network_state come from your own setup

   # Create disaster state for RL
   disaster_state = DisasterState(
       active=True,
       centroid=(x, y),
       radius=100.0,
       failed_links=frozenset([(0, 1), (2, 3)]),
   )

   # Pass to RL adapter for feature computation
   state = adapter.compute_state(request, network_state, disaster_state)

Testing
=======

.. code-block:: bash

   # Run failure module tests
   pytest fusion/modules/failures/tests/ -v

   # Run with coverage
   pytest fusion/modules/failures/tests/ -v --cov=fusion.modules.failures

Related Documentation
=====================

- :ref:`modules-directory` - Overview of all modules
- :ref:`core-module` - SimulationEngine and SDNController
- ``fusion/pipelines/protection_pipeline.py`` - 1+1 protection (distinct from
  failure injection)
- ``fusion/modules/rl/adapter/rl_adapter.py`` - DisasterState for RL

Troubleshooting
===============

**"My orchestrator simulation allocates through failed links"**

This is the known gap. Use ``use_orchestrator=False`` for now.

**"Failures aren't being activated"**

Check that ``activate_failures(time)`` is called at the failure time. The
SimulationEngine does this automatically in the main loop, so this mainly
matters if you drive ``FailureManager`` yourself.

**"Protected requests aren't switching to backup"**

Check ``_handle_failure_impact()`` in ``simulation.py``. This handles
switchover, but it requires protection to have been set up via
ProtectionPipeline.
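
For the activation question above, it can help to reason about the schedule in
isolation. The following is a toy model only, not the real ``FailureManager``:
the class ``ToyFailureSchedule``, its ``inject`` method and its internals are
hypothetical, and only the method names ``activate_failures`` and
``repair_failures`` mirror the calls described in this document. It assumes a
failure becomes active once the simulation time reaches ``t_fail`` and is
cleared once the time reaches ``t_repair``.

.. code-block:: python

   from typing import Dict, List, Set, Tuple

   Link = Tuple[int, int]


   class ToyFailureSchedule:
       """Hypothetical stand-in used only to illustrate activation timing."""

       def __init__(self) -> None:
           self.scheduled: Dict[Link, Tuple[float, float]] = {}
           self.active: Set[Link] = set()

       def inject(self, link: Link, t_fail: float, t_repair: float) -> None:
           """Record when a link should fail and when it should be repaired."""
           self.scheduled[link] = (t_fail, t_repair)

       def activate_failures(self, time: float) -> List[Link]:
           """Move failures whose t_fail has been reached into the active set."""
           newly_active = [
               link
               for link, (t_fail, t_repair) in self.scheduled.items()
               if t_fail <= time < t_repair and link not in self.active
           ]
           self.active.update(newly_active)
           return newly_active

       def repair_failures(self, time: float) -> List[Link]:
           """Clear active failures whose t_repair has been reached."""
           repaired = [
               link for link in self.active if self.scheduled[link][1] <= time
           ]
           self.active.difference_update(repaired)
           return repaired


   schedule = ToyFailureSchedule()
   schedule.inject((0, 1), t_fail=100.0, t_repair=200.0)

   print(schedule.activate_failures(50.0))   # [] - too early, nothing activates
   print(schedule.activate_failures(100.0))  # [(0, 1)]
   print(schedule.repair_failures(200.0))    # [(0, 1)]

If a manual loop never calls the activation step at or after ``t_fail``, the
active set stays empty, which matches the symptom described above.
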