Source of Recent Routing Instability

As discussed here earlier this week, there was quite a bit of BGP routing instability that seemed to be triggered by mishandling of extremely long AS paths.  There were a couple of reasons for this, mostly related to implementations that weren’t expecting to handle those crazy long AS paths.

Two issues of particular note related to this incident:

  1. Buffer allocation issues as pointed to in the IOS bug (CSCdr54230) referenced in the previous post seemed to cause heartburn for folks running Cisco routers on ancient software – patch’m if you got’m, and be ashamed you were bitten by a bug that was patched such a long time ago.
  2. A whole slew of routing code (well beyond IOS) was tearing down sessions, albeit as gracefully as can be, with NOTIFICATION messages indicating malformed AS paths. That happened either because a peer packed the segments of the AS_PATH attribute wrong and it was malformed on receipt, because of issues with AS prepending on advertisement to an external BGP peer, or because the attribute was being unpacked wrong on receipt.  Our earlier suspicion that this likely stems from software mishandling the space in AS_SEQUENCE segments for additional ASes (beyond 255, and prepended when propagating an update to external BGP peers) seems to have been right on the mark.
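
To make the 255 boundary concrete, here is a minimal sketch of the segment handling in Python. It is not any vendor's code. It assumes 2-octet AS numbers and uses hypothetical helper names (prepend_asn, encode_as_path, decode_as_path); the one-octet segment length and the "start a new AS_SEQUENCE segment on overflow" rule are taken from RFC 4271 (Sections 4.3 and 5.1.2).

```python
import struct

AS_SET = 1
AS_SEQUENCE = 2
MAX_SEGMENT_ASNS = 255  # the segment length field is a single octet


def prepend_asn(segments, asn, count=1):
    """Return a new path with `asn` prepended `count` times.

    `segments` is a list of (segment_type, [asn, ...]) tuples (hypothetical layout).
    """
    segments = [(seg_type, list(asns)) for seg_type, asns in segments]
    for _ in range(count):
        head = segments[0] if segments else None
        if head and head[0] == AS_SEQUENCE and len(head[1]) < MAX_SEGMENT_ASNS:
            # Room left in the leading AS_SEQUENCE: put our AS leftmost.
            head[1].insert(0, asn)
        else:
            # Empty path, leading AS_SET, or a full segment: start a new
            # AS_SEQUENCE instead of overflowing the one-octet length.
            segments.insert(0, (AS_SEQUENCE, [asn]))
    return segments


def encode_as_path(segments):
    """Pack segments into the AS_PATH attribute value (2-octet ASNs assumed)."""
    out = bytearray()
    for seg_type, asns in segments:
        if not 1 <= len(asns) <= MAX_SEGMENT_ASNS:
            raise ValueError("segment must carry 1..255 ASes")
        out += struct.pack("!BB", seg_type, len(asns))
        out += struct.pack("!%dH" % len(asns), *asns)
    return bytes(out)


def decode_as_path(data):
    """Unpack an AS_PATH attribute value, rejecting malformed input."""
    segments, offset = [], 0
    while offset < len(data):
        if len(data) - offset < 2:
            raise ValueError("truncated segment header")
        seg_type, n = struct.unpack_from("!BB", data, offset)
        offset += 2
        if seg_type not in (AS_SET, AS_SEQUENCE) or n == 0:
            raise ValueError("malformed segment")
        if offset + 2 * n > len(data):
            raise ValueError("segment runs past end of attribute")
        segments.append((seg_type, list(struct.unpack_from("!%dH" % n, data, offset))))
        offset += 2 * n
    return segments
```

Prepending to a path whose leading AS_SEQUENCE already holds 255 ASes yields two segments rather than one over-long one, and the decoder rejects anything that doesn't add up instead of trusting the length octet.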

The new Cisco bug for item 2 is CSCsx73770: Invalid BGP formatted update causes peer reset with AS prepending, and another adding “full” IOS support for Section 5.1.2 of RFC 4271 (BGP) is being tracked under CSCsx75937.

There is a proposed workaround that mitigates some of the risk: suppressing propagation of routes that have AS_PATHs longer than n (e.g., 75). For reasons previously outlined, I’m not a huge fan of this, and it still leaves some risk on the table.
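
As a rough illustration of what such a limit amounts to, here is a small Python sketch. The function names and the path layout, a list of (segment_type, [asn, ...]) tuples, are hypothetical. Counting an AS_SET as a single hop follows the route selection rule in RFC 4271; whether any given vendor's length limit counts the same way is an assumption.

```python
AS_SET = 1
AS_SEQUENCE = 2


def as_path_length(segments):
    """Count path length: each AS in an AS_SEQUENCE counts, an AS_SET counts once."""
    return sum(1 if seg_type == AS_SET else len(asns) for seg_type, asns in segments)


def should_propagate(segments, max_len=75):
    """Return False for routes whose AS_PATH is longer than max_len."""
    return as_path_length(segments) <= max_len


# A route prepended 80 times is suppressed at the example limit of 75.
long_path = [(AS_SEQUENCE, [65000] * 80)]
normal_path = [(AS_SEQUENCE, [65000, 65001]), (AS_SET, [65002, 65003])]
print(should_propagate(long_path), should_propagate(normal_path))
```

Note that a check like this only runs after the update has been parsed, so it does nothing for code that chokes while unpacking the attribute in the first place.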

As noted, Cisco wasn’t the only vendor bitten by this (e.g., OpenBSD’s bgpd); fuzzing and related testing haven’t traditionally been done on routing protocols, BGP in particular. It’s clearly wide open for the taking, and things like transitive BGP attributes make for interesting opportunities for vulnerability-focused types, and for attackers and targeted attacks.

Vendors and operators alike had better be investing in testing in these areas, else we’ll continue to see plenty more of these problems.

Thanks to Rodney Dunn of Cisco for chasing this on their end.
