FRA2 - Internet Connectivity Loss
Incident Report for Teraswitch
Postmortem

In preparation for fiber vendor maintenance, Teraswitch drained backbone traffic from FRA2’s edge network at approximately 09:35 UTC on Nov 27, 2024.

At approximately 09:45 UTC, connectivity in FRA2 was impacted, and the fiber vendor maintenance began shortly afterward. For approximately 12 minutes, the FRA2 core network was unable to reach the Internet by any path. A fix was implemented to force traffic to the next exit point towards the Internet regardless of the status of the connection, and site connectivity was restored around 09:57 UTC.

Teraswitch engineering determined the root cause to be a combination of two factors: the unfortunate timing of a provider issue, and a logic error in the BGP tooling used to divert traffic away from the links about to undergo maintenance. Together, these caused routes to be inadvertently withdrawn from the FRA2 core data center network. The issue is now well understood, and our tooling has been adjusted to prevent a recurrence.

Our analysis of the production events, verified by lab simulations, established the following sequence:

  • In preparation for maintenance, our tooling had set a “no-export” flag (BGP community) on all routes learned from the AS20326/Teraswitch backbone network. This configured FRA2’s edge network and internet routers to not export any backbone-learned routes to any other ASN.
  • Inadvertently, the flag was also applied to the backbone-learned “default route” (0.0.0.0/0). This route is often exported towards customers or our data center network, where it effectively signals “this edge router is ready to pass traffic, you may send internet traffic via this route.”
  • Initially this caused no issue with network operation, as FRA2’s edge routers had selected a default route locally (from a local FRA2 internet transit provider) to use as the signal that they were capable of passing traffic.
  • One or more of our local FRA2 transit providers, possibly affected by the same fiber vendor maintenance we were expecting, “regenerated” their default route, which caused our FRA2 edge routers to elect a new default route.
  • FRA2’s edge network picked a default route generated at a nearby location (FRA1), but this route had the no-export flag still enabled as it was learned from the backbone. This caused the FRA2 edge routers to stop sending a default route (0.0.0.0/0) towards the FRA2 data center network (routers within a different ASN than AS20326), leading to the data center believing there was no available edge router to process internet traffic. This caused a total FRA2 outage until Teraswitch network engineering forced the removal of the no-export flag.
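The sequence above can be illustrated with a small model. This is a minimal sketch, not Teraswitch's tooling: the `Route` class, the source labels, and the preference values are hypothetical, a single `local_pref` number stands in for full BGP best-path selection, and the transit provider "regenerating" its default route is simplified to that route briefly disappearing.

```python
from dataclasses import dataclass, field

NO_EXPORT = "no-export"  # the well-known BGP community the tooling applied
DEFAULT = "0.0.0.0/0"


@dataclass(frozen=True)
class Route:
    prefix: str
    source: str       # hypothetical label for where the route was learned
    local_pref: int   # higher wins; a stand-in for full best-path selection
    communities: frozenset = field(default_factory=frozenset)


def tag_backbone_routes(routes):
    """Pre-fix behaviour: add no-export to every backbone-learned route."""
    return [
        Route(r.prefix, r.source, r.local_pref, r.communities | {NO_EXPORT})
        if r.source.startswith("backbone") else r
        for r in routes
    ]


def best_route(routes, prefix):
    """Return the preferred route for a prefix, or None if there is none."""
    candidates = [r for r in routes if r.prefix == prefix]
    return max(candidates, key=lambda r: r.local_pref, default=None)


def exported_to_datacenter(routes, prefix):
    """Only the best route is advertised to the other ASN, and only if it
    does not carry no-export."""
    best = best_route(routes, prefix)
    if best is None or NO_EXPORT in best.communities:
        return None
    return best


# Edge router state during the drain: a local transit default and a
# backbone-learned default from FRA1 (assumed values).
rib = tag_backbone_routes([
    Route(DEFAULT, "transit-fra2", 200),
    Route(DEFAULT, "backbone-fra1", 100),
])
before = exported_to_datacenter(rib, DEFAULT)

# The local transit default goes away; the tagged backbone default wins.
rib_after = [r for r in rib if r.source != "transit-fra2"]
after = exported_to_datacenter(rib_after, DEFAULT)

print(before.source if before else None)  # transit-fra2
print(after)                              # None: no default reaches the data center
```

In this model the edge router still holds a usable default route after the change, but it can no longer advertise one to the data center, which matches the behaviour described in the last step.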

To prevent this from happening in the future, we have made two changes. First, the no-export flag is no longer applied to default routes at all. Tagging the default route has no traffic-steering effect, because Internet destinations will almost always have a more-specific route within the Teraswitch network. Second, we have added a locally generated default route that will activate if the data center network believes it is isolated from all edge routers.
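As a rough illustration of the two changes, the sketch below shows tagging logic that skips the default route and a fallback that applies only when no edge router advertises a default. The function names, the `draining` parameter, and the `"local-fallback"` next hop are assumptions made for this example; the report does not describe how the real tooling or the fallback route is implemented.

```python
NO_EXPORT = "no-export"
DEFAULT = "0.0.0.0/0"


def communities_for(prefix, learned_from_backbone, draining):
    """Communities the adjusted tooling would attach (hypothetical helper).

    Backbone-learned routes are still tagged during a drain, except the
    default route, which is now always left untagged.
    """
    if draining and learned_from_backbone and prefix != DEFAULT:
        return {NO_EXPORT}
    return set()


def datacenter_default(edge_defaults, fallback_next_hop="local-fallback"):
    """Choose the data center's default next hop.

    edge_defaults is the list of edge routers currently advertising a
    default route. If it is empty, the data center treats itself as
    isolated and the locally generated default takes over.
    """
    if edge_defaults:
        return edge_defaults[0]
    return fallback_next_hop


print(communities_for(DEFAULT, True, True))
print(communities_for("203.0.113.0/24", True, True))
print(datacenter_default([]))
```

Either change alone would have avoided the outage in the model: the first keeps the backbone default exportable, and the second gives the data center a path even when every edge default is withdrawn.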

We apologize for any inconvenience caused by this incident.  If you have any questions or concerns, please reach out to Teraswitch Support at support@teraswitch.com.

Posted Nov 29, 2024 - 06:35 UTC

Resolved
This incident is resolved. A root cause analysis will be published shortly with more details.
Posted Nov 28, 2024 - 05:18 UTC
Monitoring
The issue has been identified and a fix put in place. Customer systems should be reachable again - we are monitoring to confirm resolution of the issue. Root cause investigation is also underway.
Posted Nov 27, 2024 - 22:05 UTC
Investigating
Teraswitch is investigating reports of Internet connectivity loss at our FRA2 site. We will provide an update shortly as more information becomes available.
Posted Nov 27, 2024 - 21:58 UTC
This incident affected: FRA2 - Frankfurt, Germany.