______________________________________________________________________

DRAFT TRANSCRIPT

SIG: Routing
Date: Wednesday 1 March 2006
Time: 2.00pm
Presentation: BGP convergence: Better handling of silent peer failures
Presenter: David Hughes
______________________________________________________________________

PHILIP SMITH: Good afternoon, everyone. Welcome to the Routing SIG. This is the first of two sessions of the Routing Special Interest Group. Unfortunately, we're competing a little bit with the routing and operations track but such is the scheduling when we have a conference the size of APRICOT. Randy Bush and I are the chairs of this special interest group. Randy was somewhere here. Where's he gone? Oh, he's right there, hiding at the back. Randy and I are co-chairs of this special interest group. We hope we've got together an interesting enough program for you all. I would like to start off really by thanking all the speakers who have volunteered their content and their time to participate in this session.

The mailing list, if you're interested in subscribing and if you're not already a member, is as on the screen. It's sig-routing@lists.apnic.net and you can subscribe to that through the APNIC mailman system.

So we don't actually have any action items on the Routing SIG, which is fine, so we don't have to look through any of those. All we really need to do now is start with the actual presentations and first up we have David Hughes, who will be talking about BGP convergence.

DAVID HUGHES: Thanks, Phil. Right, as Phil mentioned, the general subject matter is BGP convergence. More specifically, I'll cover better handling of what we've been calling silent peer failures. It's something that is growing more and more important as more and more people are taking up services delivered over some of the newer technologies offered by the telcos.

One small apology I make before I get started is that the networks that I'm involved with running are predominantly Cisco-centric.
Where possible, I've tried to determine through public documentation the availability of certain features for Junipers but, obviously, without exposure to the equipment and software, some of it may not be accurate. But every effort has been made to provide as much detail as possible.

So the basic overview of the presentation. Just a quick overview of what we term a silent peer failure and then a look at some of the technologies that have been worked on by various groups that, in the future, will hopefully make this a problem of the past. Unfortunately, as I mentioned, they are technologies of the future, so we'll have a quick look at what can be done to help with the problem now and then we'll present some operational experience from what we've done with those tweaks.

But a quick overview of what we term a silent peer failure. Normal network situation - a couple of routers. Unfortunately, these days more often than not, you will not have a direct layer 2 connection between the two devices. With the prevalence of technologies such as metro-ethernet, we're seeing that the chances of you having layer 2 connectivity with DC connectivity between the two devices is getting lower and lower.

So, in this situation we have router A and B connected via a switch and BGP between them, and in this situation B has a default route from A, A being the next hop. At some point in time, the link between A and the switch disappears. Of course, router B still has an up interface and will quite happily then continue forwarding traffic to what it thinks is still a very valid next hop. Obviously, router A not being there, that traffic is just going to be black-holed for quite some time, unfortunately. The concern here is the time it takes BGP to actually determine that that is no longer a valid path. So that's what we're terming a silent peer failure, when the sending router has no idea that the peer has gone away.
So the main problem with this is that the time to detect the loss of the peer is ridiculously long. Looking through the BGP spec, we've got two timers that are of interest. We've got the hold timer and the keep-alive timer. The hold timer is the maximum number of seconds that a router will wait between receiving keep-alive messages from its peer. The hold time is a negotiated value. When the session is established, the two boxes will specify what they'd like to use for the hold timer and the lower of those two values will be used, within reason. The keep-alive timer is historically set at one-third of the hold timer and if a device has not sent a message in that time frame, it should send a keep-alive message to keep the session active.

The problem is that standard implementations on devices are using ridiculously large values for these timers. On Cisco in particular, the default keep-alive timer is set at 60 seconds, giving you a hold time of 180 seconds. If that peer fails silently, you can and will blackhole traffic for up to 179 seconds. Juniper's much better in the fact that they run 30-second timers with a 90-second hold timer and that is actually compliant with the RFC. The RFC does specify - it doesn't specify, but it recommends 30-second keep-alives. Even in the best situation, on a default config on a Juniper, you're still going to blackhole traffic for 89 seconds.

If you're trying to offer, you know, your classic five nines of reliability that everyone is trying to achieve, however unrealistic that may be, giving you 316 seconds per year of downtime, you can't possibly have the situation that you might blackhole traffic for up to 179 seconds running standard Cisco timers. If you offer SLAs like that, you cannot run standard timers.

How can we get around these problems? There are a couple of things coming out that do help. An obvious one that is being made more and more available is next-hop tracking for BGP.
The whole premise being that it's event driven from your IGP and, if the next hop of your prefixes disappears from the IGP, those prefixes will be dropped. There's no waiting for the timers to expire and for your internal network it's a very good solution. It's basically a fundamental premise of routing that your next hop should be reachable and it's quite astounding that it's taken this long for something as simple as keeping an eye on the next hop to reach what are mainstream routing platforms.

The downside, however, is that it's obviously not so good for your eBGP sessions unless you're running an IGP with your upstream, which is obviously not going to be happening. From my reading of the documentation, Juniper has no such mechanism at this point in time. I would stand to be corrected but from my readings it's not the case. To make it worse, it's limited on IOS at this point in time, shown there as the list of trains where you'll find it. Unfortunately for people who do a lot of aggregation using 6500s, there's no sign of it there. It's a good feature but it's not widely available at this point in time.

Another emerging technology that helps considerably is bidirectional forwarding detection. BFD, for those who haven't come across it before, is basically a very aggressive hello protocol. It's designed specifically to run, where possible, between the forwarding planes of the devices, leaving the control planes out of the equation. The premise there being that, if the control plane gets busy for whatever reason, it doesn't have an impact on the forwarding planes' ability to talk to each other to maintain the consistency of that circuit. The way it's been implemented, it works over basically anything you'd think of - directly attached connections, tunnels, you name it, it works over it. And reading through the RFCs, it's designed to be very aggressive. It can be specified in microseconds.
The nice thing is that the two main guys behind the IETF draft are one from Cisco and one from Juniper, and everyone is pushing it as something everyone should have, so hopefully we'll get widespread deployment and it's a good solution for eBGP sessions. Unfortunately, it's only good for eBGP sessions if your upstream supports it and, particularly in this country, I'm not sure how long it will take to see a monopoly telco provide something like BFD on its sessions.

Another slight negative for BFD is that it does require some interaction back up, obviously, to the routing protocols, so there are different standards being written, once again by the IETF working groups, on the interaction between BFD and BGP, OSPF and IS-IS. As far as trying to get any support for BGP, it's incredibly limited. It shows you up there the platforms where you might be able to find it. Mostly we're seeing that it's being deployed for IGPs only at this stage, which doesn't help us a lot. If we had next-hop tracking being able to yank prefixes as a group from an IGP failure, that would be great. If you want to, you can jump in at 12.4T and you get it for eBGP in there as well. If you have a CRS-1 lying around not doing anything, you can have it there too. I didn't have one at the time of testing.

A general solution for today - I mean, the underlying fundamental problem here is that the values of the timers that we inherit from the default configurations of our routing devices are insanely long. So the obvious solution is to just decrease the timers. We had a bit of debate on one of the public Cisco-orientated mailing lists about my proposition that this was a good idea, and one participant in particular was pushing forward the argument that using a high-level mechanism like BGP timers to try to detect an underlying link failure was fundamentally flawed. It is but, if there's no other mechanism, I don't see what choice we have.
IOS supports a configuration within the BGP standard to tweak the timers to whatever you wish. You can drop them to whatever you want to run internally or externally to try to get your way around this problem. The only problem is that your upstream, if you're doing this externally, has every right to reject your session if it doesn't like the values you've presented. The RFC specifies quite clearly that 2 or below isn't an option and it will definitely reject it there. However, up until very recently, there has been no way for the upstream to actually reject it based on the timer. There's a very recent 12.0S release for IOS that has a new knob that's been added to it so you can configure what you believe would be a reasonable minimum value for the hold timer but, as I said, that's only recently been added. It's only in 12.0S, nothing newer, and hopefully that won't get too far out there in the wilderness. Juniper doesn't have any such feature. So hopefully that means we should be able to tweak our timers to something more reasonable without our upstreams getting angry.

Alternatively, if you have large aggregation routers, it might increase the load and it might be something to keep an eye on. If you're running incredibly aggressive timers, you might see some issues if the router gets busy. We haven't had any problems like that. We've been running this in production for some time now. We did some trials and it worked fine under the stresses we could think of. We didn't see instability of the peer sessions. We rolled it out around our upstreams. Webcentral has connectivity to basically all the Tier-1 upstreams. Don't get link to upstreams all over the place - we don't get that problem. We're running 5-second keep-alives. We're getting maximums of 14 seconds of blackholing if we experience a silent failure and we haven't seen a single issue so far.
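The timer arithmetic described in the talk can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not any vendor's implementation: the function names are hypothetical, the worst-case blackhole figure follows the speaker's convention of "hold time minus one second" (179, 89 and 14 seconds), and the downtime budget assumes a 365-day year, which gives roughly 315 seconds for five nines, close to the figure quoted above.

```python
from typing import Optional

SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year


def negotiate_hold_time(local: int, remote: int) -> Optional[int]:
    """Return the hold time the session would use, or None if it is refused.

    Hypothetical helper: the lower of the two proposed values wins, and a
    value of 1 or 2 seconds is rejected. Zero disables keep-alives.
    """
    agreed = min(local, remote)
    if agreed in (1, 2):
        return None
    return agreed


def keepalive_interval(hold_time: int) -> float:
    """Keep-alive interval, conventionally one-third of the hold time."""
    return hold_time / 3


def worst_case_blackhole(hold_time: int) -> int:
    """Worst-case blackhole in seconds, using the talk's 'hold time - 1'."""
    return hold_time - 1


def downtime_budget(availability: float) -> float:
    """Seconds of downtime per year allowed by an availability target."""
    return SECONDS_PER_YEAR * (1 - availability)


if __name__ == "__main__":
    # Cisco default, Juniper default, and the 3x5-second setting from the talk
    for hold in (180, 90, 15):
        print(hold, keepalive_interval(hold), worst_case_blackhole(hold))
    print(round(downtime_budget(0.99999), 2))
```

With the Cisco default hold time of 180 seconds, a single silent failure would consume more than half of a five-nines yearly budget, whereas the 15-second hold time consumes only a small fraction of it.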
Because we have no next-hop tracking internally, we're looking at rolling this out internally on one-second keep-alives in the network itself. It's something we're going to be trialling in the near future. We think that will give us something a little bit closer to what we need until the other features become available.

So in summary, we do have a light at the end of the tunnel. This is a problem that hopefully will be relegated to history once people get on board with the likes of BFD and start providing us IGP-based next-hop tracking in more commonly used releases. Decreasing the timers is a very good interim solution and, as I said, the 3x5-second keep-alives have been working very well. If you're happy to run Cisco default values then you obviously don't care about reliable passing of your traffic.

Any questions?

PHILIP SMITH: Are there any questions for David? In my haste to get started, I forgot the housekeeping list. If there are any questions, can you please come up to the microphone and state your name and affiliation before you ask your question. Thank you very much, everyone.

APPLAUSE