______________________________________________________________________

DRAFT TRANSCRIPT

SIG:           Routing
Date:          Wednesday 1 March 2006
Time:          2.00pm

Presentation:  BGP convergence : Better handling of silent peer 
               failures
Presenter:     David Hughes
			       
______________________________________________________________________


PHILIP SMITH:

Good afternoon, everyone. Welcome to the Routing SIG. This is the
first of two sessions of the Routing Special Interest Group.
Unfortunately, we're competing a little bit with the routing and
operations track but such is the scheduling when we have a conference
the size of APRICOT.

Randy Bush and I are the chairs of this special interest group.
Randy was somewhere here. Where's he gone? Oh, he's right there,
hiding at the back. Randy and I are co-chairs of this special
interest group. We hope we've got together an interesting enough
program for you all. I would like to start off really by thanking
all the speakers who have volunteered their content and their time
to participate in this session.

The mailing list, if you're interested in subscribing and you're
not already a member, is as on the screen. It's sig-routing@lists.apnic.net
and you can subscribe to that through the APNIC mailman system.

So we don't actually have any action items on the Routing SIG, which
is fine, so we don't have to look through any of those so all we
really need to do now is start with the actual presentations and
first up we have David Hughes, who will be talking about BGP convergence.

DAVID HUGHES:

Thanks, Phil.

Right, as Phil mentioned, the general subject matter is BGP
convergence. More specifically, I'll cover better handling of what
we've been calling silent peer failures. It's something that is
growing more and more important as more and more people are taking
up services delivered over some of the newer technologies offered
by the telcos. One small apology I make before I get started is
that the networks that I'm involved with running are predominantly
Cisco-centric. Where possible, I've tried to determine through
public documentation the availability of certain features for
Junipers but, obviously, without exposure to the equipment and
software, some of it may not be accurate. But every effort has been
made to provide as much detail as possible.

So the basic overview of the presentation. Just a quick overview
of what we term as a silent peer failure and then a look at some
of the technologies that have been worked on by various groups that,
in the future, will hopefully make this a problem of the past.
Unfortunately, as I mentioned, they are technologies of the future,
so we'll have a quick look at what can be done to help with the
problem now and then we'll present some operational experience from
what we've done with those tweaks.

But a quick overview of what we term a silent peer failure. Normal
network situation - a couple of routers. Unfortunately, these days
more often than not, you will not have a direct layer 2 connection
between the two devices. With the prevalence of technologies such
as metro-ethernet, we're seeing that the chances of you having direct
layer 2 connectivity between the two devices are getting lower and
lower.

So, in this situation we have router A and B connected via a switch
and BGP between them and in this situation B has a default route
from A, A being the next hop. At some point in time, the link between
A and the switch disappears. Of course, router B still has an up
interface and will quite happily then continue forwarding traffic
to what it thinks is still a very valid next hop. Obviously, router
A not being there, that traffic is just going to be black-holed for
quite some time unfortunately.

The concern here is the time it takes BGP to actually determine
that that is no longer a valid path. So that's what we're terming
a silent peer failure, when the sending router has no idea that the
peer has gone away.

So the main problem with this is that the time to detect the lack
of the peer is ridiculously long. Looking through the BGP spec,
we've got two timers that are of interest. We've got the hold timer
and the keep-alive timer. The hold timer is the maximum number
of seconds that a router will wait between receiving keep-alive
messages from its peer. The hold time is a negotiated value. When
the session is established, the two boxes will specify what they'd
like to use for the hold timer and the lower of those two values
will be used, within reason. The keep-alive timer is historically
set at one-third of the hold timer and, if a device has not sent a
message in that time frame, it should send a keep-alive message to
keep the session active.
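
The negotiation described above can be sketched in a few lines of
Python. This is only an illustration, not any vendor's implementation:
the function name negotiate_timers is made up for the example, and it
assumes the keep-alive interval is simply one-third of the agreed hold
time, rounded down.

```python
from typing import Tuple


def negotiate_timers(local_hold: int, remote_hold: int) -> Tuple[int, int]:
    """Return (hold_time, keepalive_interval) in seconds.

    Hypothetical helper: each side proposes a hold time when the
    session is established and the lower of the two proposals is used.
    """
    hold = min(local_hold, remote_hold)
    # Assumption for this sketch: keep-alives go out at one third of
    # the agreed hold time (integer seconds, rounded down).
    keepalive = hold // 3
    return hold, keepalive


if __name__ == "__main__":
    # One side proposes 180 seconds, the other 90 seconds.
    print(negotiate_timers(180, 90))
```

The point of the sketch is that either side can pull the hold time
down simply by proposing a smaller value.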

The problem is that standard implementations on devices are using
ridiculously large values for these timers. On Cisco in particular,
the default keep-alive timer is set at 60 seconds, giving you a
hold time of 180 seconds. If that peer fails silently, you can and
will blackhole traffic for up to 179 seconds. Juniper's much better
in that they run 30-second keep-alives with a 90-second hold
timer and that is actually compliant with the RFC. The RFC does
specify - it doesn't specify, but it recommends 30-second keep-alives.
Even in the best situation on a default config on a Juniper, you're
still going to blackhole traffic for 89 seconds. If you're trying
to offer, you know, your classic five nines of reliability that
everyone is trying to achieve, however unrealistic that may be,
giving you 316 seconds per year of downtime, you can't possibly
have the situation that you might blackhole traffic for up to 179
seconds running standard Cisco timers. If you offer SLAs like that,
you cannot run standard timers.
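
The arithmetic behind that argument can be checked with a short Python
sketch. The function names are invented for the example, and it assumes
a 365-day year, which gives a five-nines budget of roughly 315 seconds,
close to the figure quoted above. The worst case is taken as one second
less than the hold time, as in the talk.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: 365-day year


def downtime_budget_seconds(availability: float) -> float:
    """Seconds of downtime per year allowed at a given availability."""
    return SECONDS_PER_YEAR * (1.0 - availability)


def worst_case_blackhole_seconds(hold_time: int) -> int:
    """Worst-case blackholing after a silent failure, per the talk."""
    return max(hold_time - 1, 0)


def share_of_budget(hold_time: int, availability: float = 0.99999) -> float:
    """Fraction of the yearly budget used up by one silent failure."""
    budget = downtime_budget_seconds(availability)
    return worst_case_blackhole_seconds(hold_time) / budget


if __name__ == "__main__":
    for label, hold in (("180-second hold", 180), ("90-second hold", 90)):
        worst = worst_case_blackhole_seconds(hold)
        share = share_of_budget(hold)
        print(f"{label}: up to {worst}s, {share:.0%} of a five-nines year")
```

On these assumptions a single silent failure on a 180-second hold
timer uses up more than half of the yearly allowance.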

How can we get around these problems? There are a couple of things
coming out that do help. An obvious one that is being made more and
more available is next hop tracking for BGP. The whole premise being
that it's event driven from your IGP and, if the next hop of prefixes
that are in your routing table disappears from the IGP, they'll be
dropped. There's no waiting for the timers to expire and for your
internal network it's a very good solution. It's basically a
fundamental premise of routing that your next hop should be reachable
and it's quite astounding that it's taken this long for something as
simple as keeping an eye on the next hop to reach mainstream routing
platforms.
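
The event-driven idea can be shown with a toy Python model. It is a
sketch only and is not how any router implements the feature: the
class NextHopTracker and its methods are hypothetical, and the
addresses are documentation prefixes.

```python
from typing import Dict, List


class NextHopTracker:
    """Toy model: BGP prefixes are dropped when the IGP loses a next hop."""

    def __init__(self) -> None:
        # prefix -> next hop address, as learned from BGP
        self.routes: Dict[str, str] = {}

    def learn_route(self, prefix: str, next_hop: str) -> None:
        self.routes[prefix] = next_hop

    def next_hop_lost(self, next_hop: str) -> List[str]:
        """Called when the IGP reports that a next hop has gone away.

        Every prefix using that next hop is removed immediately,
        without waiting for the BGP hold timer to expire.
        """
        dropped = sorted(p for p, nh in self.routes.items() if nh == next_hop)
        for prefix in dropped:
            del self.routes[prefix]
        return dropped


if __name__ == "__main__":
    tracker = NextHopTracker()
    tracker.learn_route("0.0.0.0/0", "192.0.2.1")
    tracker.learn_route("198.51.100.0/24", "192.0.2.2")
    print(tracker.next_hop_lost("192.0.2.1"))
```

The withdrawal is triggered by the IGP event itself, which is why the
BGP timers never come into it.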

The downside, however, is that it's obviously not so good for your
eBGP sessions unless you're running an IGP with your upstream, which
is obviously not going to be happening. From my reading of the
documentation, Juniper has no such mechanism at this point in time.
I would stand to be corrected but from my readings it's not the case.
To make it worse, it's limited on IOS at this point in time; shown
there is the list of trains where you'll find it.

Unfortunately for people who do a lot of aggregation using 6500s
there's no sign of it there. It's a good feature but it's
not widely available at this point in time.

Another emerging technology that helps considerably is bidirectional
forwarding detection. BFD, for those who haven't come across it
before, is basically a very aggressive hello protocol. It's designed
specifically to run where possible between the forwarding planes
of the devices, leaving the control planes out of the equation. The
premise there being that, if the control plane gets busy for whatever
reason, it doesn't have an impact on the forwarding planes' ability
to talk to each other to maintain the consistency of that circuit.
The way it's been implemented, it works over basically anything
you'd think of, directly attached connections, tunnels, you name
it, it works over it. And reading through the RFCs, it's designed
to be very aggressive. Its timers can be specified in microseconds.
The next thing is that the two main guys behind the IETF draft are
one from Cisco and one from Juniper, and everyone is pushing it as
something everyone should have, so hopefully we'll get widespread
deployment and it's a good solution for eBGP sessions.
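
To give a feel for how aggressive this is, here is a small Python
sketch of the detection-time calculation for BFD's asynchronous mode:
the agreed interval is the larger of what the local end is willing to
receive and what the remote end wants to send, multiplied by the
remote end's detect multiplier. The function and parameter names, and
the 50-millisecond and multiplier-of-3 figures, are illustrative
assumptions rather than values from the talk.

```python
def bfd_detection_time_us(
    local_required_min_rx_us: int,
    remote_desired_min_tx_us: int,
    remote_detect_mult: int,
) -> int:
    """Detection time in microseconds, as seen by the local system."""
    # The remote end may not send faster than the local end will accept.
    agreed_interval_us = max(local_required_min_rx_us, remote_desired_min_tx_us)
    # The session is declared down after this many intervals with no packet.
    return agreed_interval_us * remote_detect_mult


if __name__ == "__main__":
    # Illustrative values only: 50 ms intervals, multiplier of 3.
    detect_us = bfd_detection_time_us(50_000, 50_000, 3)
    print(f"failure detected after about {detect_us / 1000:.0f} ms")
```

With values like these a failure is noticed in a fraction of a second,
against the minutes that a default BGP hold timer can take.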

Unfortunately, it's only good for eBGP sessions if your upstream
supports it and, particularly in this country, I'm not sure how
long it will take to see a monopoly telco provide something like
BFD on its sessions.

Another slight negative for BFD is that it does require some
interaction back up, obviously, to the routing protocols so there are
different standards being written, once again by the IETF working
groups, on the interaction between BFD and BGP, OSPF and IS-IS. As
far as trying to get any support for BGP, it's incredibly limited.
It shows you up there the platforms where you might be able to find it.
Mostly we're seeing that it's being deployed for IGPs only at this
stage which doesn't help us a lot. If we had next-hop tracking being
able to yank prefixes as a group from an iBGP failure, that would
be great. If you want to, you can jump in at 12.4T and you get it
for eBGP in there as well. If you have a CRS1 lying around not doing
anything, you can have it there too. I didn't have one at the time
of testing.

A general solution for today - I mean, the underlying fundamental
problem here is that the values of the timers that we inherit from
the default configurations of our routing devices are insanely long.
So the obvious solution is to just decrease the timers. We had a
bit of debate on one of the public Cisco-orientated mailing lists
about my proposition that this was a good idea and one participant
in particular was pushing forward the argument that using a high-level
mechanism like BGP timers to try to detect an underlying link failure
was fundamentally flawed. It is but, if there's no other mechanism,
I don't see what choice we have.

IOS supports a configuration within the BGP standard to tweak the
timers to whatever you wish. You can drop them to whatever you want
to run internally or externally to try to get your way around this
problem.

The only problem is that your upstream, if you're doing this
externally, has every right to reject your session if it doesn't
like the values you've presented. The RFC specifies quite clearly
that 2 or below isn't an option and it will definitely reject it
there. However, up until very recently, there has been
no way for the upstream to actually reject it based on the time.
There's a very recent 12.0S release for IOS that has a new knob
that's been attached to it so you can configure what you believe
would be a reasonable minimum value for the hold timer but, as I
said, that's only recently been added. It's only in 12.0S, nothing
newer, and hopefully that won't get too far out there in the wilderness.
Juniper doesn't have any such feature.
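
The acceptance decision an upstream could make can be sketched as
follows in Python. This is a hypothetical policy check, not the
behaviour of any particular release: hold_time_acceptable and its
minimum_hold parameter are names made up for the example. It assumes
the protocol rule that a proposed hold time must be either zero or at
least three seconds, and that a locally configured minimum only
applies to non-zero proposals.

```python
def hold_time_acceptable(proposed_hold: int, minimum_hold: int = 3) -> bool:
    """Decide whether a peer's proposed hold time would be accepted.

    minimum_hold is a hypothetical local policy knob; it cannot go
    below the protocol floor of three seconds.
    """
    if proposed_hold == 0:
        # Zero means no keep-alives at all; allowed by the protocol rule.
        return True
    if proposed_hold < 3:
        # One or two seconds is never a valid hold time.
        return False
    # Local policy: refuse anything more aggressive than we will carry.
    return proposed_hold >= max(minimum_hold, 3)


if __name__ == "__main__":
    for proposed in (2, 15, 90):
        verdict = hold_time_acceptable(proposed, minimum_hold=30)
        print(f"hold time {proposed}s accepted: {verdict}")
```

Without such a minimum configured, anything from three seconds upwards
would get through, which is the situation described above.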

So hopefully that means we should be able to tweak our timers to
something more reasonable without our upstreams getting
angry. Alternatively, if you have large aggregation routers, it
might increase the load and it might be something to keep an eye
on. If you're running incredibly aggressive timers, you might see
some issues if the router gets busy.

We haven't had any problems like that. We've been running this in
production for some time now. We did some trials and it worked fine
under the stresses we could think of. We didn't see instability of
the peer sessions. We rolled it out around our upstreams. Webcentral
has connectivity to basically all the Tier-1 upstreams. Don't get
link to upstreams all over the place - we don't get that problem.
We're running 5-second keep-alives. We're getting maximums of 14
seconds of blackholing if we experience a silent failure and we
haven't seen a single issue so far. Because we have no next-hop
tracking internally, we're looking at rolling this out internally
on one-second keep-alives in the network itself. It's something
we're going to be trialling in the near future. We think that will
give us something a little bit closer to what we need until the
other features become available.

So in summary, we do have a light at the end of the tunnel. This
is a problem that hopefully will be relegated to history once people
get on board with the likes of BFD and provide us with IGP-based
next-hop tracking in more commonly used releases. Until then, reducing
the timers is a very good interim solution and, as I said, the
3x5-second keep-alives have been working
very well and if you're happy to run Cisco default values then you
obviously don't care about reliable passing of your traffic. Any
questions?

PHILIP SMITH:

Are there any questions for David?

In my haste to get started, I forgot the housekeeping list. If there
are any questions, can you please come up to the microphone and state
your name and affiliation before you ask your question.

Thank you very much, everyone.

APPLAUSE