______________________________________________________________________

DRAFT TRANSCRIPT

SIG: Routing
Date: Wednesday 1 March 2006
Time: 4.00pm
Presentation: Real-world use of route analytics technology
Presenter: Greg Hooten

______________________________________________________________________

PHILIP SMITH: OK, next up we have Greg Hooten talking about real-world use of route analytics technology.

GREG HOOTEN: Hi. My name is Greg Hooten and I'm from Packet Design. We put out a product called Route Explorer, and I want to talk to you today about route analytics - how it's used and where it's used. A little history about the company - I'm not going to go through it. You'll see it in the slide set if you want to pull it off the Web.

So why route analytics? I've worked for a lot of large ISPs. Some of them went belly up. Most of them couldn't manage their networks, and a lot of the reason was that everything was focused on Layer 2. How do you manage a Layer 3 network? It wasn't really known, and HP OpenView built up this big process at Layer 2 to figure out how the Layer 3 worked, and that didn't work either. Most Layer 3 problems are caused by misconfigurations in the routers and by hardware failures, and those cause the majority of the problems in networks. What we saw at @Home Corporation, where I worked before this, was that when we froze the network - for example, for a holiday - our outages went down by over 90%. Nobody was allowed to make changes on the network, and the outages went away. That didn't catch on with a lot of people, but we tried to minimise our outages by being smarter about how we did them. Still, there was no way to test, except by brute force, whether the Layer 3 was working or not.

Route analytics leverages the strength of routing by listening to the routing protocols. If I'm listening to OSPF, I listen to the LSA information, record what's happening in the network, and keep it historically, and that allows me to diagnose the problems happening in the network, either currently or historically, over time. If a problem happens once, it may happen again. If it's happened 15 times, it probably won't show up in a Layer 2 management system, but it will show up in the routed infrastructure. Layer 3 is designed to survive through redundancy and through rerouting. That's good and bad. On one hand it says, "I'm going to try and get around this problem," but it also disguises the fact that there is a problem, so we're trying to get visibility into that Layer 3 topology.

Simple topology (refers to slide). Here are the routers, here's the path across the network, some BGP, some colours - it looks like a diagram. It can be changed any way you want; we tried to keep it as simple as possible. Through the reports, and through a very lightweight touch on the network - peering with OSPF - you get the data about what the routers are saying in the network: what routes are up, what the metrics are, what paths are down, what links are down. Then you build up a timeline of what has happened and do analysis on that type of data.
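As a rough illustration of that record-and-replay idea, here is a minimal sketch - not Packet Design's implementation; the event fields and the feed that produces them are assumptions for illustration only:

```python
# Sketch: keep a timestamped, append-only log of OSPF LSA events so
# the Layer 3 topology can be replayed for any point in the past.
# Illustrative only -- assumes some passive OSPF adjacency delivers
# parsed events; the LsaEvent fields are hypothetical.

from dataclasses import dataclass

@dataclass
class LsaEvent:
    timestamp: float   # when the LSA was heard on the wire
    router_id: str     # originating router
    link: str          # affected adjacency or prefix
    state: str         # "up" or "down"

class RouteHistory:
    """Append-only timeline of routing events, queryable by time."""

    def __init__(self):
        self.events: list[LsaEvent] = []

    def record(self, ev: LsaEvent) -> None:
        self.events.append(ev)   # events arrive in time order

    def state_at(self, t: float) -> dict:
        """Replay the log up to time t to rebuild link state --
        diagnosing a problem after the fact, not just live."""
        links = {}
        for ev in self.events:
            if ev.timestamp > t:
                break
            links[(ev.router_id, ev.link)] = ev.state
        return links

    def flap_count(self, router_id: str, link: str) -> int:
        """A link that has changed state 15 times shows up here even
        if the Layer 2 management system never noticed."""
        return sum(1 for ev in self.events
                   if (ev.router_id, ev.link) == (router_id, link))
```

The history is cheap to keep because only changes are recorded, which is what makes the historical diagnosis described above practical.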
These are some of the companies that are interested in this type of stuff (refers to slide). MPLS seems to be our big focus recently. I don't really know why - it's a really good way of selling very large routers to the edge of your network instead of smaller routers to the edge of your network - but MPLS VPNs are catching on in a lot of Tier-1 ISPs.

One of the big problems that they're running into is change validation: saying, "I'm adding a new customer to my network. How do I know it works?" What they're doing right now is as much testing as they can, then calling up the customer and saying, "All right. Try it." When you go out and add new routes to the network, how do they know that works? So they're running into per-customer reachability issues, privacy issues - are the customer routes really distinguished one from another? - and policy issues - am I getting the hub-and-spoke that I really want, or am I getting the mesh that I want? These are the problems that they're running into: customers leaking routes between each other because of misconfigurations, providers trying to monitor the policy as they change it, and provider misconfigurations from various manual processes.

This is a summary page of the changes that are happening in a network. The detail is based on customers - you can name them pretty much anything you want - but what we're looking at is trying to baseline what's happening in a network. So I've got three active PEs and 10 active routes for this customer, against a baseline of three routes, so I've got seven new routes. The question is: why? The process is really more like this: if I'm making a change in the network, someone's responsible for that. Someone's responsible for saying, "I'm going to add a new PE. I'm going to add a new set of routes to my network." There's also an acceptance process: all right, once those routes are in, once the new routers are in, how does that get accepted into my network as an operational piece of equipment? What we're trying to do is give that data to those groups. So when I make a change to this network - customer 5, they've got 10 active PEs, two withdrawn routes, 100 new routes - why do they have 100 new routes? Being able to look at these routes and say, "These routes were assigned to this PE at this time, and these are the active routes, with this route distinguisher," gives you the ability to view it in more detail, rather than just saying, "I planned a maintenance, I executed that maintenance, and this is the result." So we've got those 100 new routes that were put in for customer 5, and 13 are coming from brand-new PEs for that customer. They're not part of the baseline. Is that a mistake? If they look at the work order, they should be able to tell: these are the new sets of PEs that were supposed to be added. If they weren't, did PepsiCo get crossed with Coca-Cola? Is this something we want to do? Then you can validate your maintenance before you contact the customer and say this is working or this is not working.
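A minimal sketch of that validation step - assuming routes are keyed by route distinguisher plus prefix; the customer data below is made up:

```python
# Sketch of MPLS VPN change validation: baseline each customer's
# routes, then diff the active routes against the baseline after a
# maintenance, so "why 100 new routes?" is answerable from data.
# The route keys here ("RD:prefix") and values are hypothetical.

def diff_against_baseline(baseline: set[str], active: set[str]):
    """Return (new, withdrawn) route sets for one customer."""
    return active - baseline, baseline - active

# After a planned maintenance for customer 5:
baseline = {"65000:5:10.1.0.0/16", "65000:5:10.2.0.0/16",
            "65000:5:10.3.0.0/16"}
active   = {"65000:5:10.1.0.0/16", "65000:5:10.2.0.0/16",
            "65000:5:10.3.0.0/16", "65000:5:192.168.0.0/24"}

new_routes, withdrawn = diff_against_baseline(baseline, active)
print(new_routes)   # {'65000:5:192.168.0.0/24'}
# Any new route that isn't on the work order -- say, from a PE that
# was never supposed to carry this customer -- is a candidate
# misconfiguration: the PepsiCo/Coca-Cola cross described above.
```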
If there's instability in the network, we can display that instability, and you want to do it with as much detail as you can. Here is just a graph of it over time, but we could also give the details about which routes were being affected at that time.

We had a large WAN RFP that required a routing convergence SLA. So when does a network converge? Right now, it's difficult to measure. What parts do you measure? Do you measure when OSPF starts to go away, or once the timers have expired and the changes are occurring in the network? There's no standard for it, so what we're doing is monitoring the convergence based on the propagation delay across the network. Is that the ideal solution? No. But it worked well enough for this company to justify to their customers that they were within the specifications that were required. Do we want better solutions? Of course. Right now, it's being able to say, "I have a monitoring system on one side and a monitoring system on the other," generating route changes into the network - injecting and withdrawing a /32 loopback address - and measuring that propagation delay across the network, so that you can see: this is the change, and this is your graph of the change over a long period of time. Does that meet the SLA that you're interested in? (See the sketch at the end of this section.)

By looking at BGP over long periods of time, you can see changes like huge deluges of routes coming into the network, whether it's a redistribution from OSPF into BGP, new changes, or a loss of peering. How does a loss of routing in AS1237, seven hops away from my network, affect the way that I send data out of my network? Does it have an effect? If it does, do I want to find out why, and do I want to change that process? We've seen this at Tier-1 ISPs all over the place. UC Berkeley - new route leakage into the network. Why did those routes come about? Where are they coming from? Being able to determine which customer was advertising that new set of prefixes, and where they shifted from at the entry to my network. In this case, instead of looking at 108,000 prefixes that shifted, or 330,000 events, we're going to create a pretty picture about it - and it's actually a moveable picture, so we can play back over time what's changing in the network. We see a large loss of prefixes across one edge. In this case, CalREN-2 is advertising prefixes across this edge and losing prefixes across that edge, so it's fairly simple to see what's changing in the network. It's easy for a customer on this side to contact CalREN-2 and say, "Look, you're losing peering here and picking this up. It's not affecting us right now, because we have duplicate address advertisements, but it needs to be fixed." In this case, it was fairly short. The more complex problems are where SprintLink may be losing connectivity or, in this case, where Qwest lost peering with CalREN, or withdrew a bunch of routes to CalREN, and instead of a simple back-up, the back-up path was six hops. Really difficult to find from route advertisements; really simple to see from a graphic. Being able to categorise route changes before and after with a delta - here's a 96,000-route change, categorised by next hop - and being able to visualise those changes, either through a table or through a graphic, makes it easier to diagnose why that was happening in your network.
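A minimal sketch of that before-and-after categorisation, assuming you have table snapshots keyed by prefix (all addresses below are illustrative):

```python
# Sketch: group a large routing delta by (old next hop, new next hop)
# so one line summarises thousands of per-prefix events, as with the
# 96,000-route change mentioned above. The snapshot format is assumed.

from collections import Counter

def next_hop_delta(before: dict[str, str], after: dict[str, str]) -> Counter:
    """before/after map prefix -> next hop; absence means withdrawn.
    Returns counts of (old_hop, new_hop) transitions, including
    withdrawals (new hop None) and brand-new routes (old hop None)."""
    delta = Counter()
    for prefix in before.keys() | after.keys():
        old, new = before.get(prefix), after.get(prefix)
        if old != new:
            delta[(old, new)] += 1
    return delta

before = {"10.0.0.0/8": "192.0.2.1", "172.16.0.0/12": "192.0.2.1",
          "198.51.100.0/24": "192.0.2.5"}
after  = {"10.0.0.0/8": "192.0.2.9", "172.16.0.0/12": "192.0.2.1"}
print(next_hop_delta(before, after))
# Counter({('192.0.2.1', '192.0.2.9'): 1, ('192.0.2.5', None): 1})
# At scale, the biggest buckets point straight at the peering or
# link that moved.
```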
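And for the convergence SLA measurement described earlier in this section, a sketch of the propagation-delay approach - the injection timestamp and monitor names are assumed inputs, not any particular product's API:

```python
# Sketch: inject or withdraw a /32 loopback at one point, timestamp
# when each monitoring point observes the corresponding routing
# update, and report the worst-case spread as the convergence time.

def convergence_time(inject_ts: float, seen_ts: dict[str, float]) -> float:
    """seen_ts maps monitor name -> time the /32 change was seen.
    Returns the worst-case propagation delay in seconds."""
    return max(seen_ts.values()) - inject_ts

# A withdrawal injected at t=100.000s, observed at three monitors:
observed = {"monitor-east": 100.180,
            "monitor-west": 100.420,
            "monitor-core": 100.095}
print(f"convergence: {convergence_time(100.000, observed) * 1000:.0f} ms")
# -> convergence: 420 ms. Graphed per test over weeks, this is the
# evidence that the SLA in the RFP is (or is not) being met.
```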
Probably most of you have heard of MED oscillations; probably not too many people have seen the oscillations in a network. This is a very small cut of the data we collected at a Tier-1 ISP about a year and a half ago. The interesting thing about it was that it went on for over two weeks. It consumed the full processing power of three GSRs, and nobody really knew what was going on, except that there were three GSRs that were saturated. The changes were happening so quickly - essentially at the speed of the three GSRs' processors - that by the time they typed a command to see what routes were changing, they were already 100 route changes behind. The changes were happening so rapidly, and they caused so much churn in the network, that essentially the GSRs were useless. They started having to route around thousands of routers to try and get data connectivity back into major parts of their network. Eventually they shut down two of the routers and were ready to replace them with larger GSRs, which would have just exacerbated the problem. But we were able to record some of the data from that time and show them what was happening. They implemented always-compare-med, and that essentially eliminated the problem.

So, over time, when you're capturing large amounts of data, you can also do analysis over long periods and look at simple changes, even for a single prefix on the network. A customer /24 was flapping continuously, on a one-minute period, for approximately a week. They always got back-up through a NAP, but the new service that they bought from the Tier-1 - they had spent a lot of money on that primary access - was essentially doing them no good. We were able to pull out the data, look at it, and show them where the problem was and how bad it was.

This is a fusion, a new type of fusion, that we're working on. A gentleman named Pabo Yu (?) started this; Steve Casner and Van Jacobson decided that, if we understand how the routing works in the network, there must be a way to take NetFlow data, figure out where that data is coming from and where it's going to in the network, and then, instead of listening to it throughout the network, flow it across the network. The idea here is that instead of collecting NetFlow data everywhere in my network, what I really want to do is collect it at the key points - entrance points, key data centres. Given that I understand how my routing topology works, I want to be able to take that data and flow it out across that topology. What that gives me is the ability to ask: if something changes in my network, based on either routing or data flow, how will that affect the other circuits? Which circuits are now being overloaded? Do I buy more bandwidth? What happens if I lose a router? Where will those packets go? Will they flow the way I expect them to? Will they saturate the circuits that I have as back-ups? In a complex network - even when the complexity comes from redundancy rather than from things like MPLS - the question is: where will my data go? It's designed to be resilient; I want to make it as predictable as possible. Being able to fail pieces of equipment proactively in a test - seeing where yesterday's data would have flowed if this router went away - gives me the ability to predict what will happen if I really do lose pieces of equipment in the future. So being able to look at the routing topology, look at NetFlow data, and fuse the two together gives me another tool to better understand how my Layer 3 topology is working, how the data will flow in the network, what capacities I need for the future, and what peering - or private peering - I need, rather than buying transit from service providers. It's another way of looking at the same types of data we've been collecting for a long time, and hopefully it will add a little more clarity to the Layer 3 network, rather than trying to build that clarity up from Layer 2 and then interpret from Layer 2 what the Layer 3 topology will do.

Are there any questions? I was either very successful or you're very asleep. OK.

PHILIP SMITH: No questions for Greg? No.

GREG HOOTEN: Great. Thanks a lot.

PHILIP SMITH: OK then. Thank you very much.

APPLAUSE