How a10y works in practice — from fault detection to autonomous recovery.
A physical fiber cut triggers cascading alarms across transport, IP, and mobile core layers. a10y correlates them to a single root cause in seconds.
Transport layer: optical power loss alarm on interface ge-0/0/1
IP layer: BGP neighbor down, OSPF adjacency lost on 3 links
Mobile core (free5GC): AMF registration failures spike, UPF path unreachable
47 alerts in 90 seconds across 12 devices
Keep deduplicates 47 alerts → 8 unique alarm types
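The deduplication step can be sketched in a few lines. This is a conceptual illustration, not Keep's actual schema or API: raw alerts are grouped by a fingerprint of their alarm type, so dozens of raw events collapse into a handful of unique types, each carrying the set of affected devices.

```python
from collections import defaultdict

def deduplicate(alerts):
    """Group raw alerts by alarm type; return one entry per unique type
    with the set of affected devices and an occurrence count."""
    groups = defaultdict(lambda: {"devices": set(), "count": 0})
    for alert in alerts:
        group = groups[alert["type"]]
        group["devices"].add(alert["device"])
        group["count"] += 1
    return dict(groups)

# Illustrative alert storm: many raw events, few unique types.
storm = (
    [{"device": f"router-{i}", "type": "bgp_neighbor_down"} for i in range(12)]
    + [{"device": "olt-1", "type": "optical_power_loss"}] * 3
    + [{"device": "amf-1", "type": "amf_registration_failure"}] * 5
)

unique = deduplicate(storm)
print(len(storm), "raw alerts ->", len(unique), "unique alarm types")
# 20 raw alerts -> 3 unique alarm types
```

The same idea scales from this toy storm to the 47-alert storm above: the number of unique alarm types, not the number of raw events, is what the correlation stage has to reason about.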
Qdrant finds similar incident from 3 months ago (fiber cut on same span)
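Conceptually, the Qdrant lookup is a nearest-neighbor search over embeddings of past incidents. The sketch below shows the underlying similarity computation with invented three-dimensional embeddings and incident names; a real deployment would use an embedding model and the Qdrant client rather than hand-rolled vectors.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical incident embeddings (real ones would come from an
# embedding model applied to each incident's alarm signature).
past_incidents = {
    "INC-1021 fiber cut, span X":      [0.9, 0.1, 0.8],
    "INC-0877 BGP flap, peering edge": [0.1, 0.9, 0.2],
    "INC-0534 amplifier degradation":  [0.7, 0.2, 0.3],
}
current = [0.88, 0.12, 0.79]  # embedding of today's alert storm

best = max(past_incidents, key=lambda k: cosine(current, past_incidents[k]))
print("most similar past incident:", best)
# most similar past incident: INC-1021 fiber cut, span X
```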
Correlation: all affected devices share a common fiber path
Queries active-inventory: "what is the physical topology between these devices?"
Identifies fiber span X as single point of failure
RCA: fiber cut on span X → transport down → IP reroute failed (no alternate path) → mobile core unreachable
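The correlation step above reduces to a set intersection: map each affected device to the fiber spans its traffic traverses, then intersect. The topology data here is invented for illustration; in a10y it would come from the active-inventory query.

```python
# Hypothetical device-to-fiber-span mapping (in a10y this would be
# answered by active-inventory's physical topology query).
device_spans = {
    "router-a": {"span-X", "span-Y"},
    "router-b": {"span-X"},
    "upf-1":    {"span-X", "span-Z"},
}

affected = ["router-a", "router-b", "upf-1"]
shared = set.intersection(*(device_spans[d] for d in affected))
print("common fiber span(s):", shared)
# common fiber span(s): {'span-X'}
```

A single shared span across every affected device is exactly the single-point-of-failure signature that points the RCA at a physical cut rather than independent faults.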
Action plan: reroute traffic via backup path, notify NOC for physical repair
Activates backup MPLS path via NETCONF
Creates incident ticket with RCA summary
Notifies NOC: "Fiber cut on span X, backup path activated, physical repair needed"
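The backup-path activation is a standard NETCONF `<edit-config>` RPC (RFC 6241). The envelope below is the real NETCONF structure; the MPLS payload inside `<config>` is an invented, vendor-neutral stand-in, since the actual YANG model depends on the device.

```xml
<rpc message-id="101" xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">
  <edit-config>
    <target><running/></target>
    <config>
      <!-- Vendor-specific payload; this MPLS LSP stanza is
           illustrative only, not a real YANG model. -->
      <mpls xmlns="http://example.com/yang/mpls">
        <lsp>
          <name>backup-path-spanX</name>
          <admin-state>up</admin-state>
        </lsp>
      </mpls>
    </config>
  </edit-config>
</rpc>
```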
BGP sessions re-established on backup path
AMF registration success rate returns to 99.9%
Alert storm resolved — 47 alerts → 0 active
Time to RCA: 12 seconds. Time to recovery: 3 minutes.
Gradual latency increase detected by perfSONAR between research sites. Statistical AI identifies the anomaly before any threshold alert fires.
perfSONAR OWAMP tests: one-way delay increasing 2ms/hour on path A→B
No threshold alarm has fired yet (one-way delay is still well below the 50 ms SLA limit)

Statistical anomaly detection flags the trend as abnormal
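What "flags the trend as abnormal" means in practice: fit a slope to the recent delay samples and alert on sustained drift, even while every individual reading is comfortably inside the SLA. The readings and the drift sensitivity below are illustrative, not perfSONAR output.

```python
def slope_ms_per_hour(samples):
    """Least-squares slope of (hour, one-way delay in ms) samples."""
    n = len(samples)
    mean_x = sum(t for t, _ in samples) / n
    mean_y = sum(d for _, d in samples) / n
    num = sum((t - mean_x) * (d - mean_y) for t, d in samples)
    den = sum((t - mean_x) ** 2 for t, _ in samples)
    return num / den

# Hypothetical hourly OWAMP one-way delay readings (ms): drifting upward
# even though every value is far below the 50 ms SLA threshold.
readings = [(0, 14.1), (1, 16.0), (2, 17.9), (3, 20.2), (4, 22.1)]

drift = slope_ms_per_hour(readings)
print(f"delay trend: {drift:.1f} ms/hour")
if drift > 0.5:  # illustrative sensitivity, far below any static threshold
    print("statistical anomaly: sustained upward drift")
```

This is why the anomaly is caught hours before a static threshold would fire: the slope is abnormal long before any single sample is.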
Topology query: path A→B traverses 4 hops, optical amplifiers on span 2
Qdrant: similar pattern seen last year — optical amplifier degradation
Correlation with SNMP: optical power on span 2 decreasing (still in spec)
Prediction: at current rate, SLA breach in ~18 hours
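The ~18-hour prediction is simple linear extrapolation: remaining headroom to the SLA limit divided by the observed drift rate. The current reading below is an assumed value chosen to match the scenario's numbers.

```python
SLA_MS = 50.0             # one-way delay SLA limit from the scenario
current_delay_ms = 14.0   # assumed latest OWAMP reading (illustrative)
drift_ms_per_hour = 2.0   # observed trend: ~2 ms/hour

hours_to_breach = (SLA_MS - current_delay_ms) / drift_ms_per_hour
print(f"predicted SLA breach in ~{hours_to_breach:.0f} hours")
# predicted SLA breach in ~18 hours
```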
Root cause: optical amplifier aging on span 2 (pre-failure state)
Recommendation: proactive maintenance window, switch to protection path
Creates maintenance ticket with predicted failure window
Schedules protection path switchover for next maintenance window
Alerts optical team: "Amplifier on span 2 showing degradation, replace within 18h"
After switchover: latency on path A→B returns to baseline
perfSONAR tests confirm SLA compliance
Issue resolved before any user impact. Zero SLA violations.
An operator's typical workflow — from morning overview to incident investigation, all from the aether-ide portal.
Topology view shows all network nodes. Green = healthy, yellow = warning, red = critical. Two nodes are yellow.
Drills into one of the yellow nodes: OpenObserve, filtered to that device's logs and metrics, shows an elevated error rate on one interface.
Opens Keep dashboard. 3 correlated alerts for this device — CRC errors, input drops, and optical power warning. Keep has already grouped them.
"Have we seen this pattern before?" Qdrant dashboard shows 2 similar past incidents — both resolved by cleaning the fiber connector.
correlation-engine suggests dispatching a field tech with the RCA summary. Operator approves → Keep workflow creates the dispatch ticket.
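The approval step is a human-in-the-loop gate: the suggested action is held until the operator confirms it, and only then does the workflow execute. The sketch below shows the pattern generically; it is not Keep's actual workflow API.

```python
from dataclasses import dataclass

@dataclass
class SuggestedAction:
    """A remediation the correlation engine proposes but does not execute."""
    summary: str
    rca: str
    approved: bool = False

def execute_if_approved(action, dispatch):
    """Run the dispatch callback only after an operator has approved."""
    if not action.approved:
        return "pending operator approval"
    return dispatch(action)

action = SuggestedAction(
    summary="Dispatch field tech to clean fiber connector",
    rca="CRC errors + input drops + optical power warning on one interface",
)

create_ticket = lambda a: f"ticket created: {a.summary}"
print(execute_if_approved(action, create_ticket))  # held: not yet approved
action.approved = True  # operator clicks approve in the portal
print(execute_if_approved(action, create_ticket))  # now the ticket is cut
```

Keeping the execution behind an explicit approval is what makes the workflow safe to automate: the engine does the correlation and drafting, the operator keeps the final decision.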