______________________________________________________________________

DRAFT TRANSCRIPT

SIG: Routing
Date: Wednesday 1 March 2006
Time: 4.00pm
Presentation: Real-world use of route analytics technology
Presenter: Greg Hooten

______________________________________________________________________

PHILIP SMITH: OK, next up we have Greg Hooten talking about real-world use of route analytics technology.

GREG HOOTEN: Hi. My name is Greg Hooten and I'm from Packet Design. We put out a product called Route Explorer, and I want to talk to you today about route analytics - how it's used and where it's used. A little history about the company - I'm not going to go through it. You'll see it in the slide set if you want to pull it off the Web.

So why route analytics? I've worked for a lot of large ISPs. Some of them went belly up. Most of them couldn't manage their networks, and a lot of the reason was that everything was focused on Layer 2. How do you manage a Layer 3 network? It wasn't really known, and HP OpenView built up this big process at Layer 2 to figure out how the Layer 3 worked, and that didn't work either. Most Layer 3 problems are caused by misconfigurations in the routers and by hardware failures, and those cause the majority of the problems in networks. What we saw at @Home Corporation, where I worked before this, was that when we froze the network - for example, for a holiday - our outages went down by over 90%. Nobody was allowed to make changes on the network, and the outages went away. That didn't catch on with a lot of people, but we tried to minimise our outages by being smarter about how we did them. Still, there was no way to test, except by brute force, whether the Layer 3 was working or not.

Route analytics leverages the strength of routing by listening to the routing protocols. If I'm listening to OSPF, I listen to the LSA information, record what's happening in the network, and keep it historically, and that allows me to diagnose the problems happening in the network, either currently or historically, over time. If a problem happens once, it may happen again. If it's happened 15 times, it probably won't show up in a Layer 2 management system, but it will show up in the routed infrastructure. Layer 3 is designed to survive through redundancy and through rerouting. That's good and bad. On one hand it says, "I'm going to try and get around this problem," but it also disguises the fact that there is a problem, so we're trying to get visibility into that Layer 3 topology.

Simple topology (refers to slide). Here are the routers, here's the path across the network, some BGP, some colours - it looks like a diagram. It can be changed any way you want; we tried to keep it as simple as possible. Through the reports, and through a very lightweight touch on the network - peering with OSPF - you get the data about what the routers are saying in the network: what routes are up, what the metrics are, what paths are down, what links are down. Then you build up a timeline of what has happened and do analysis on that type of data.
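As a rough illustration of that record-and-replay idea, here is a minimal sketch - not Packet Design's implementation; the event fields and the feed that produces them are assumptions for illustration only:

```python
# Sketch: keep a timestamped, append-only log of OSPF LSA events so
# the Layer 3 topology can be replayed for any point in the past.
# Illustrative only -- assumes some passive OSPF adjacency delivers
# parsed events; the LsaEvent fields are hypothetical.

from dataclasses import dataclass

@dataclass
class LsaEvent:
    timestamp: float   # when the LSA was heard on the wire
    router_id: str     # originating router
    link: str          # affected adjacency or prefix
    state: str         # "up" or "down"

class RouteHistory:
    """Append-only timeline of routing events, queryable by time."""

    def __init__(self):
        self.events: list[LsaEvent] = []

    def record(self, ev: LsaEvent) -> None:
        self.events.append(ev)   # events arrive in time order

    def state_at(self, t: float) -> dict:
        """Replay the log up to time t to rebuild link state --
        diagnosing a problem after the fact, not just live."""
        links = {}
        for ev in self.events:
            if ev.timestamp > t:
                break
            links[(ev.router_id, ev.link)] = ev.state
        return links

    def flap_count(self, router_id: str, link: str) -> int:
        """A link that has changed state 15 times shows up here even
        if the Layer 2 management system never noticed."""
        return sum(1 for ev in self.events
                   if (ev.router_id, ev.link) == (router_id, link))
```

The history is cheap to keep because only changes are recorded, which is what makes the historical diagnosis described above practical.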
These are some of the companies that are interested in this type of stuff (refers to slide). MPLS seems to be our big focus recently. I don't really know why - it's a really good way of selling very large routers to the edge of your network instead of smaller routers to the edge of your network - but MPLS VPNs are catching on in a lot of Tier-1 ISPs.

One of the big problems that they're running into is change validation: saying, "I'm adding a new customer to my network. How do I know it works?" What they're doing right now is as much testing as they can, then calling up the customer and saying, "All right. Try it." When you go out and add new routes to the network, how do they know that works? So they're running into per-customer reachability issues, privacy issues - are the customer routes really distinguished one from another? - and policy issues - am I getting the hub-and-spoke that I really want, or am I getting the mesh that I want? These are the problems that they're running into: customers leaking routes between each other because of misconfigurations, providers trying to monitor the policy as they change it, and provider misconfigurations from various manual processes.

This is a summary page of the changes that are happening in a network. The detail is based on customers - you can name them pretty much anything you want - but what we're looking at is trying to baseline what's happening in a network. So I've got three active PEs and 10 active routes for this customer, against a baseline of three routes, so I've got seven new routes. The question is: why? The process is really more like this: if I'm making a change in the network, someone's responsible for that. Someone's responsible for saying, "I'm going to add a new PE. I'm going to add a new set of routes to my network." There's also an acceptance process: all right, once those routes are in, once the new routers are in, how does that get accepted into my network as an operational piece of equipment? What we're trying to do is give that data to those groups. So when I make a change to this network - customer 5, they've got 10 active PEs, two withdrawn routes, 100 new routes - why do they have 100 new routes? Being able to look at these routes and say, "These routes were assigned to this PE at this time, and these are the active routes, with this route distinguisher," gives you the ability to view it in more detail, rather than just saying, "I planned a maintenance, I executed that maintenance, and this is the result." So we've got those 100 new routes that were put in for customer 5, and 13 are coming from brand-new PEs for that customer. They're not part of the baseline. Is that a mistake? If they look at the work order, they should be able to tell: these are the new sets of PEs that were supposed to be added. If they weren't, did PepsiCo get crossed with Coca-Cola? Is this something we want to do? Then you can validate your maintenance before you contact the customer and say this is working or this is not working.
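A minimal sketch of that validation step - assuming routes are keyed by route distinguisher plus prefix; the customer data below is made up:

```python
# Sketch of MPLS VPN change validation: baseline each customer's
# routes, then diff the active routes against the baseline after a
# maintenance, so "why 100 new routes?" is answerable from data.
# The route keys here ("RD:prefix") and values are hypothetical.

def diff_against_baseline(baseline: set[str], active: set[str]):
    """Return (new, withdrawn) route sets for one customer."""
    return active - baseline, baseline - active

# After a planned maintenance for customer 5:
baseline = {"65000:5:10.1.0.0/16", "65000:5:10.2.0.0/16",
            "65000:5:10.3.0.0/16"}
active   = {"65000:5:10.1.0.0/16", "65000:5:10.2.0.0/16",
            "65000:5:10.3.0.0/16", "65000:5:192.168.0.0/24"}

new_routes, withdrawn = diff_against_baseline(baseline, active)
print(new_routes)   # {'65000:5:192.168.0.0/24'}
# Any new route that isn't on the work order -- say, from a PE that
# was never supposed to carry this customer -- is a candidate
# misconfiguration: the PepsiCo/Coca-Cola cross described above.
```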
If there's instability in the network, we can display that instability, and you want to do it with as much detail as you can. Here is just a graph of it over time, but we could also give the details about which routes were being affected at that time.

We had a large WAN RFP that required a routing convergence SLA. So when does a network converge? Right now, it's difficult to measure. What parts do you measure? Do you measure when OSPF starts to go away, or once the timers have expired and the changes are occurring in the network? There's no standard for it, so what we're doing is monitoring the convergence based on the propagation delay across the network. Is that the ideal solution? No. But it worked well enough for this company to justify to their customers that they were within the specifications that were required. Do we want better solutions? Of course. Right now, it's being able to say, "I have a monitoring system on one side and a monitoring system on the other," generating route changes into the network - injecting and withdrawing a /32 loopback address - and measuring that propagation delay across the network, so that you can see: this is the change, and this is your graph of the change over a long period of time. Does that meet the SLA that you're interested in? (See the sketch at the end of this section.)

By looking at BGP over long periods of time, you can see changes like huge deluges of routes coming into the network, whether it's a redistribution from OSPF into BGP, new changes, or a loss of peering. How does a loss of routing in AS1237, seven hops away from my network, affect the way that I send data out of my network? Does it have an effect? If it does, do I want to find out why, and do I want to change that process? We've seen this at Tier-1 ISPs all over the place. UC Berkeley - new route leakage into the network. Why did those routes come about? Where are they coming from? Being able to determine which customer was advertising that new set of prefixes, and where they shifted from at the entry to my network. In this case, instead of looking at 108,000 prefixes that shifted, or 330,000 events, we're going to create a pretty picture about it - and it's actually a moveable picture, so we can play back over time what's changing in the network. We see a large loss of prefixes across one edge. In this case, CalREN-2 is advertising prefixes across this edge and losing prefixes across that edge, so it's fairly simple to see what's changing in the network. It's easy for a customer on this side to contact CalREN-2 and say, "Look, you're losing peering here and picking this up. It's not affecting us right now, because we have duplicate address advertisements, but it needs to be fixed." In this case, it was fairly short. The more complex problems are where SprintLink may be losing connectivity or, in this case, where Qwest lost peering with CalREN, or withdrew a bunch of routes to CalREN, and instead of a simple back-up, the back-up path was six hops. Really difficult to find from route advertisements; really simple to see from a graphic. Being able to categorise route changes before and after with a delta - here's a 96,000-route change, categorised by next hop - and being able to visualise those changes, either through a table or through a graphic, makes it easier to diagnose why that was happening in your network.
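A minimal sketch of that before-and-after categorisation, assuming you have table snapshots keyed by prefix (all addresses below are illustrative):

```python
# Sketch: group a large routing delta by (old next hop, new next hop)
# so one line summarises thousands of per-prefix events, as with the
# 96,000-route change mentioned above. The snapshot format is assumed.

from collections import Counter

def next_hop_delta(before: dict[str, str], after: dict[str, str]) -> Counter:
    """before/after map prefix -> next hop; absence means withdrawn.
    Returns counts of (old_hop, new_hop) transitions, including
    withdrawals (new hop None) and brand-new routes (old hop None)."""
    delta = Counter()
    for prefix in before.keys() | after.keys():
        old, new = before.get(prefix), after.get(prefix)
        if old != new:
            delta[(old, new)] += 1
    return delta

before = {"10.0.0.0/8": "192.0.2.1", "172.16.0.0/12": "192.0.2.1",
          "198.51.100.0/24": "192.0.2.5"}
after  = {"10.0.0.0/8": "192.0.2.9", "172.16.0.0/12": "192.0.2.1"}
print(next_hop_delta(before, after))
# Counter({('192.0.2.1', '192.0.2.9'): 1, ('192.0.2.5', None): 1})
# At scale, the biggest buckets point straight at the peering or
# link that moved.
```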
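And for the convergence SLA measurement described earlier in this section, a sketch of the propagation-delay approach - the injection timestamp and monitor names are assumed inputs, not any particular product's API:

```python
# Sketch: inject or withdraw a /32 loopback at one point, timestamp
# when each monitoring point observes the corresponding routing
# update, and report the worst-case spread as the convergence time.

def convergence_time(inject_ts: float, seen_ts: dict[str, float]) -> float:
    """seen_ts maps monitor name -> time the /32 change was seen.
    Returns the worst-case propagation delay in seconds."""
    return max(seen_ts.values()) - inject_ts

# A withdrawal injected at t=100.000s, observed at three monitors:
observed = {"monitor-east": 100.180,
            "monitor-west": 100.420,
            "monitor-core": 100.095}
print(f"convergence: {convergence_time(100.000, observed) * 1000:.0f} ms")
# -> convergence: 420 ms. Graphed per test over weeks, this is the
# evidence that the SLA in the RFP is (or is not) being met.
```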
Probably most of you have heard of MED oscillations; probably not too many people have seen the oscillations in a network. This is a very small cut of the data we collected at a Tier-1 ISP about a year and a half ago. The interesting thing about it was that it went on for over two weeks. It consumed the full processing power of three GSRs, and nobody really knew what was going on, except that there were three GSRs that were saturated. The changes were happening so quickly - essentially at the speed of the three GSRs' processors - that by the time they typed a command to see what routes were changing, they were already 100 route changes behind. The changes were happening so rapidly, and they caused so much churn in the network, that essentially the GSRs were useless. They started having to route around thousands of routers to try and get data connectivity back into major parts of their network. Eventually they shut down two of the routers and were ready to replace them with larger GSRs, which would have just exacerbated the problem. But we were able to record some of the data from that time and show them what was happening. They implemented always-compare-med, and that essentially eliminated the problem.

So, over time, when you're capturing large amounts of data, you can also do analysis over long periods and look at simple changes, even for a single prefix on the network. A customer /24 was flapping continuously, on a one-minute period, for approximately a week. They always got back-up through a NAP, but the new service that they bought from the Tier-1 - they had spent a lot of money on that primary access - was essentially doing them no good. We were able to pull out the data, look at it, and show them where the problem was and how bad it was.

This is a fusion, a new type of fusion, that we're working on. A gentleman named Pabo Yu (?) started this; Steve Casner and Van Jacobson decided that, if we understand how the routing works in the network, there must be a way to take NetFlow data, figure out where that data is coming from and where it's going to in the network, and then, instead of listening to it throughout the network, flow it across the network. The idea here is that instead of collecting NetFlow data everywhere in my network, what I really want to do is collect it at the key points - entrance points, key data centres. Given that I understand how my routing topology works, I want to be able to take that data and flow it out across that topology. What that gives me is the ability to ask: if something changes in my network, based on either routing or data flow, how will that affect the other circuits? Which circuits are now being overloaded? Do I buy more bandwidth? What happens if I lose a router? Where will those packets go? Will they flow the way I expect them to? Will they saturate the circuits that I have as back-ups? In a complex network - even when the complexity comes from redundancy rather than from things like MPLS - the question is: where will my data go? It's designed to be resilient; I want to make it as predictable as possible. Being able to fail pieces of equipment proactively in a test - seeing where yesterday's data would have flowed if this router went away - gives me the ability to predict what will happen if I really do lose pieces of equipment in the future. So being able to look at the routing topology, look at NetFlow data, and fuse the two together gives me another tool to better understand how my Layer 3 topology is working, how the data will flow in the network, what capacities I need for the future, and what peering - or private peering - I need, rather than buying transit from service providers. It's another way of looking at the same types of data we've been collecting for a long time, and hopefully it will add a little more clarity to the Layer 3 network, rather than trying to build that clarity up from Layer 2 and then interpret from Layer 2 what the Layer 3 topology will do.

Are there any questions? I was either very successful or you're very asleep. OK.

PHILIP SMITH: No questions for Greg? No.

GREG HOOTEN: Great. Thanks a lot.

PHILIP SMITH: OK then. Thank you very much.

APPLAUSE