______________________________________________________________________

DRAFT TRANSCRIPT
Routing SIG
Wednesday 1 March 2006
2.00pm
______________________________________________________________________

PHILIP SMITH: Good afternoon, everyone. Welcome to the Routing SIG. This is the first of two sessions of the Routing Special Interest Group. Unfortunately, we're competing a little bit with the routing and operations track, but such is the scheduling when we have a conference the size of APRICOT. Randy Bush and I are the chairs of this special interest group. Randy was somewhere here. Where's he gone? Oh, he's right there, hiding at the back. Randy and I are co-chairs of this special interest group. We hope we've put together an interesting enough program for you all. I would like to start off by thanking all the speakers who have volunteered their content and their time to participate in this session. The mailing list, if you're interested in subscribing and you're not already a member, is as on the screen. It's sig-routing@lists.apnic.net and you can subscribe to that through the APNIC mailman system. We don't actually have any action items for the Routing SIG, which is fine, so we don't have to look through any of those. All we really need to do now is start with the actual presentations, and first up we have David Hughes, who will be talking about BGP convergence.

DAVID HUGHES: Thanks, Phil. Right, as Phil mentioned, the general subject matter is BGP convergence. More specifically, I cover better handling of what we've been calling silent peer failures. It's something that is growing more and more important as more and more people take up services delivered over some of the newer technologies offered by the telcos. One small apology I make before I get started is that the networks I'm involved with running are predominantly Cisco-centric.
Where possible, I've tried to determine through public documentation the availability of certain features for Junipers but, obviously, without exposure to the equipment and software, some of it may not be accurate. But every effort has been made to provide as much detail as possible. So, the basic overview of the presentation: a quick look at what we term a silent peer failure, and then a look at some of the technologies being worked on by various groups that, in the future, will hopefully make this a problem of the past. Unfortunately, as I mentioned, they are technologies of the future, so we'll have a quick look at what can be done to help with the problem now, and then we'll present some operational experience from what we've done with those tweaks. So, a quick overview of what we term a silent peer failure. A normal network situation - a couple of routers. Unfortunately, these days, more often than not, you will not have a direct layer 2 connection between the two devices. With the prevalence of technologies such as metro ethernet, the chances of you having direct layer 2 connectivity between the two devices are getting lower and lower. So, in this situation, we have routers A and B connected via a switch, with BGP between them, and B has a default route from A, A being the next hop. At some point in time, the link between A and the switch disappears. Of course, router B's interface is still up and it will quite happily continue forwarding traffic to what it thinks is still a very valid next hop. Obviously, router A not being there, that traffic is just going to be black-holed for quite some time, unfortunately. The concern here is the time it takes BGP to actually determine that that is no longer a valid path. So that's what we're terming a silent peer failure - when the sending router has no idea that the peer has gone away.
So the main problem with this is that the time to detect the loss of the peer is ridiculously long. Looking through the BGP spec, we've got two timers that are of interest: the hold timer and the keep-alive timer. The hold timer is the maximum number of seconds that a router will wait between keep-alive messages from its peer. The hold time is a negotiated value - when the session is established, the two boxes will specify what they'd like to use for the hold timer and the lower of those two values will be used, within reason. The keep-alive timer is historically set at one-third of the hold timer and, if a device has not sent a message in that time frame, it should send a keep-alive to keep the session alive. The problem is that standard implementations on devices are using ridiculously large values for these timers. On Cisco in particular, the default keep-alive timer is set at 60 seconds, giving you a hold time of 180 seconds. If that peer fails silently, you can and will blackhole traffic for up to 179 seconds. Juniper's much better in that they run 30-second timers with a 90-second hold timer, and that is actually compliant with the RFC. The RFC does specify - it doesn't specify, but it recommends 30-second keep-alives. Even in the best situation, on a default config on a Juniper, you're still going to blackhole traffic for up to 89 seconds. If you're trying to offer, you know, your classic five nines of reliability that everyone is trying to achieve, however unrealistic that may be - which gives you 316 seconds per year of downtime - you can't possibly have the situation where you might blackhole traffic for up to 179 seconds running standard Cisco timers. If you offer SLAs like that, you cannot run standard timers. How can we get around these problems? There are a couple of things coming out that do help. An obvious one that is being made more and more available is next-hop tracking for BGP.
The whole premise is that it's event driven from your IGP: if the next hops of prefixes in your routing table disappear from the IGP, those prefixes will be dropped. There's no waiting for the timers to expire, and for your internal network it's a very good solution. It's a fundamental premise of routing that your next hop should be reachable, and it's quite astounding that it's taken this long for something as simple as keeping an eye on the next hop to reach mainstream routing platforms. The downside, however, is that it's obviously not so good for your eBGP sessions unless you're running iBGP with your upstream, which is obviously not going to be happening. From my reading of the documentation, Juniper has no such mechanism at this point in time. I would stand to be corrected but, from my reading, that's not the case. To make it worse, it's limited on IOS at this point in time - shown there is the list of trains where you'll find it. Unfortunately, for people who do a lot of aggregation using 6500s, there's no sign of it there. It's a good feature but it's not widely available at this point in time. Another emerging technology that helps considerably is bidirectional forwarding detection. BFD, for those who haven't come across it before, is basically a very aggressive hello protocol. It's designed specifically to run, where possible, between the forwarding planes of the devices, leaving the control planes out of the equation. The premise there is that, if the control plane gets busy for whatever reason, it doesn't have an impact on the forwarding planes' ability to talk to each other to maintain the consistency of that circuit. The way it's been implemented, it works over basically anything you'd think of - directly attached connections, tunnels, you name it, it works over it. And, reading through the RFCs, it's designed to be very aggressive. Timers can be specified in microseconds.
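As a rough illustration of why BFD detects failures so much faster than BGP keep-alives, here is a minimal sketch of the BFD detection-time calculation, following the semantics of the BFD drafts (later RFC 5880): the effective interval is negotiated between the peers, and failure is declared after a configured number of missed packets. The function name and figures are illustrative, not from any vendor's implementation.

```python
# BFD detection time model: the effective interval is the slower of
# what we are willing to receive and what the peer is willing to send;
# failure is declared after detect_mult consecutive missed packets.
# Intervals are carried in microseconds on the wire.

def bfd_detection_time_us(local_rx_min_us: int,
                          remote_tx_min_us: int,
                          detect_mult: int) -> int:
    """Worst-case time to declare the session down, in microseconds."""
    interval = max(local_rx_min_us, remote_tx_min_us)
    return interval * detect_mult

# e.g. 50 ms intervals with a multiplier of 3 -> 150 ms detection,
# versus 89-179 seconds for default BGP hold timers.
assert bfd_detection_time_us(50_000, 50_000, 3) == 150_000
```

Because the negotiation takes the slower of the two sides, an aggressive router cannot force a conservative peer to receive packets faster than it has said it can handle.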
The next thing is that the two main guys behind the IETF draft are one from Cisco and one from Juniper, and everyone is pushing it as something everyone should have, so hopefully we'll get widespread deployment, and it's a good solution for eBGP sessions. Unfortunately, it's only good for eBGP sessions if your upstream supports it and, particularly in this country, I'm not sure how long it will take to see a monopoly telco provide something like BFD on its sessions. Another slight negative for BFD is that it does require some interaction back up to the routing protocols, so there are different standards being written, once again in the IETF working groups, on the interaction between BFD and BGP, OSPF and IS-IS. As far as trying to get any support for BGP goes, it's incredibly limited. Shown up there are the platforms where you might be able to find it. Mostly we're seeing that it's being deployed for IGPs only at this stage, which doesn't help us a lot. If we had next-hop tracking being able to yank prefixes as a group on an iBGP failure, that would be great. If you want to, you can jump in at 12.4T and you get it for eBGP in there as well. If you have a CRS-1 lying around not doing anything, you can have it there too. I didn't have one at the time of testing. A general solution for today - I mean, the underlying fundamental problem here is that the timer values we inherit from the default configurations of our routing devices are insanely long. So the obvious solution is to just decrease the timers. We had a bit of debate on one of the public Cisco-orientated mailing lists about my proposition that this was a good idea, and one participant in particular was pushing the argument that using a high-level mechanism like BGP timers to try to detect an underlying link failure was fundamentally flawed. It is but, if there's no other mechanism, I don't see what choice we have.
IOS supports configuration, within the BGP standard, to tweak the timers to whatever you wish. You can drop them to whatever you want to run internally or externally to try to get around this problem. The only problem is that your upstream, if you're doing this externally, has every right to reject your session if it doesn't like the values you've presented. The RFC specifies quite clearly that a hold time of 2 seconds or below isn't an option and it will definitely be rejected there. However, up until very recently, there has been no way for the upstream to actually reject the session based on the timers. There's a very recent 12.0S release of IOS that has a new knob attached to it so you can configure what you believe would be a reasonable minimum value for the hold timer but, as I said, that's only recently been added. It's only in 12.0S, nothing newer, and hopefully it won't take too long to get further out there in the wilderness. Juniper doesn't have any such feature. So hopefully that means we should be able to tweak our timers to something more reasonable without our upstreams getting angry. On the other hand, if you have large aggregation routers, it might increase the load, and that might be something to keep an eye on. If you're running incredibly aggressive timers, you might see some issues if the router gets busy. We haven't had any problems like that. We've been running this in production for some time now. We did some trials and it worked fine under the stresses we could think of. We didn't see instability of the peer sessions. We rolled it out across our upstreams. Webcentral has connectivity to basically all the Tier-1 upstreams. We don't have links to upstreams all over the place, so we don't get that problem. We're running 5-second keep-alives. We're getting maximums of 14 seconds of blackholing if we experience a silent failure, and we haven't seen a single issue so far.
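The timer arithmetic running through the talk can be checked with a small sketch - a minimal model using the quoted figures and the BGP spec's negotiation rules, not any router's implementation:

```python
# Worst-case blackhole window after a silent peer failure, for the
# timer values quoted in the talk. The negotiation rules (the lower
# of the two proposed hold times wins, and a hold time of 1 or 2
# seconds must be rejected) follow the BGP spec, RFC 4271.

def negotiated_hold_time(ours: int, theirs: int) -> int:
    """BGP uses the smaller of the two proposed hold times."""
    return min(ours, theirs)

def acceptable_hold_time(proposed: int) -> bool:
    """Hold time must be zero (timers disabled) or at least 3 s."""
    return proposed == 0 or proposed >= 3

def worst_case_blackhole(hold_time: int) -> int:
    """Traffic can be black-holed for up to hold_time - 1 seconds
    before the hold timer expires and the session is torn down."""
    return hold_time - 1

assert worst_case_blackhole(180) == 179   # Cisco defaults: 60/180 s
assert worst_case_blackhole(90) == 89     # Juniper defaults: 30/90 s
assert worst_case_blackhole(3 * 5) == 14  # the talk's 5 s keep-alives

# Five nines of availability allows roughly 316 s of downtime a year:
downtime_budget = round((1 - 0.99999) * 365.25 * 24 * 3600)
assert downtime_budget == 316
```

The "hold time is three times the keep-alive" convention is why dropping the keep-alive to 5 seconds gives the 15-second hold time, and hence the 14-second worst case, quoted above.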
Because we have no next-hop tracking internally, we're looking at rolling this out internally with one-second keep-alives in the network itself. It's something we're going to be trialling in the near future. We think that will give us something a little bit closer to what we need until the other features become available. So, in summary, we do have a light at the end of the tunnel. This is a problem that hopefully will be relegated to history once people get on board with the likes of BFD and provide us IGP-based next-hop tracking in more commonly used releases. This is a very good interim solution and, as I said, the 3 x 5-second keep-alives have been working very well - and if you're happy to run Cisco default values, then you obviously don't care about reliable passing of your traffic. Any questions?

PHILIP SMITH: Are there any questions for David? In my haste to get started, I forgot the housekeeping list. If there are any questions, can you please come up to the microphone and state your name and affiliation before you ask your question. Thank you very much, everyone.

APPLAUSE

PHILIP SMITH: OK, while Geoff is setting up for his routing update presentation, I will go through the housekeeping list that I omitted to do at the start. As I was saying, if people want to ask questions at the end of the presentation, please come up to the microphone. I think there's also a travelling microphone wandering around as well if you're not mobile. State your name and affiliation, again, for the benefit of the microphone, and I should also point out that this session is being broadcast and audio-cast, so it makes it more important that people use the microphone so that those who have joined us from elsewhere can hear what's being talked about. Afternoon tea is in the level 2 foyer area, so it's basically that way - to my left, to your right. The social event is this evening.
If you've got your ticket, please bring it - the details of where the event is are listed on the back of your ticket. APNIC staff will collect the ticket from you when you board the bus, which will leave from the level one plaza deck outside. The last bus will be at 7:10, so please don't miss it. Next item - MyAPNIC and the policy flash demo are running all day at the APNIC help desk. The help desk is available during break times - morning, lunchtime and afternoon breaks. Onsite notice board - again, you're advised to have a look at the onsite notice board on the APNIC website for any last-minute updates and so forth. There's a special session at 4pm today in meeting room 3 discussing the APNIC fee structure. That's an open session. Anyone with an interest in that topic is welcome to attend and is invited to participate in discussions, although I'd much prefer you to come to the Routing SIG - but that's up to you. That's the housekeeping. So the next presenter is Geoff, who will be giving us a routing update.

GEOFF HUSTON: I will. Thank you. Good afternoon. I seem to do these at every Routing SIG, giving you an idea of what's happening inside the BGP routing table. I've got three parts to this presentation today - one is a status report, then I'm looking at work based on a question that Vince Fuller asked me a few months ago that I found interesting to answer, and some further observations after that. Normally I use hourly snapshots I pull from Routeviews, but this time I used a complete dump of the data and I must thank Stephan Millet of Telstra for assisting with some of the data used in the presentation. My disk is now full. Thank you. The usual picture. This is January 2005 through to February this year - the BGP prefixes. It might look like November and December were tailing off, but be assured that in January and February of this year you have come back again and routing growth is back on once more. What does it look like?
You put a line across it and go, "The number of prefixes in the default-free zone across last year rose from 150,000 to 175,000 prefixes in 12 months." So life is still increasing the way it always was. The amount of address space is kind of interesting - there are 4.4 billion addresses in IPv4 if you try to use them all. It started at about 1.36 billion at the start of the year and finished at 1.5 billion-ish. Those big jumps there - there are still two /8s that appear and disappear like lighthouses. It amazes me that there are /8s that flap, but there are, and there they are. You can draw a line across the top of this - I've eliminated the flapping /8s and smoothed it out - and then you see pretty cleanly that the amount of address space rose from about 1.36 billion to about 1.5 billion addresses. Some seasonal variation: some of you took holidays over the Northern Hemisphere summer and were slightly below average, but then got back to work in October and decided to add more addresses into the network. We appreciate that and thank you.

LAUGHTER

The total number of AS numbers - it's the same kind of curve. Very consistent. Unlike address space, AS number appearances on the Net keep romping through, so the trend line is spot on, from 17,500 up to 21,000. Somehow ASes are remarkably consistent, unlike addresses or routing table entries. The AS numbers that keep on appearing appear almost like clockwork. It's strange. So, the vital statistics: prefixes up by 18%. Roots and more specifics - are we getting better or worse at this business of only advertising the aggregate? And the answer is no - no better and no worse. The number of basic root advertisements that, if you will, encompass new space rose by 17% to 85,500, but the number of more specifics - predominantly /24s, with a few others in the mix - also rose by about the same amount. So around 50% of the network is still more specifics and around 50% is basic root prefixes.
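The arithmetic behind two of the quoted figures can be checked as a back-of-the-envelope sketch, using the rounded numbers from the talk rather than the raw data:

```python
# Back-of-the-envelope check on two figures quoted in the talk.

# Roots versus more specifics: 85,500 root prefixes out of ~175,000
# total table entries.
roots = 85_500
total = 175_000
more_specifics = total - roots            # 89,500
share = more_specifics / total            # ~0.51: "around 50%"
assert round(share, 2) == 0.51

# Advertised address span: ~1.36 billion to ~1.5 billion addresses.
addr_growth = (1.50e9 - 1.36e9) / 1.36e9  # ~10% growth in the year
assert round(addr_growth * 100) == 10
```

Both results are consistent with the talk's "around 50% more specifics" and "addresses rose by 10%" figures.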
The amount of addresses rose by 10%, so the granularity of advertisements is getting smaller, not larger - again, lots of little advertisements, because the address space is not growing as quickly. The number of AS numbers is up by 14%. What can I say? The average advertisement size is smaller. Address origination per AS is getting smaller. ASes are moving down the food chain. Interestingly, the average AS path length is remarkably steady at 3.5 ASes. 14% more ASes, but the diameter of the network is constant, so the density of interconnection is increasing. It's an interesting question whether this is a uniform density increase or whether there are particular points where the density is increasing. In other words, are exchange points and similar of their ilk actually gaining strength, or is the bombing-out of the long-distance transit fibre market causing interconnection to appear across longer spans? Some work is happening, I think in Adelaide, right now around the topology of the network and I'd be keen to see what they find. It's interesting because the denser the mesh, the more badly BGP behaves when it tries to converge. So trying to understand whether the explosions are global or local is probably an interesting thing to understand. However, on a macro level, the network is getting denser. The advertisement granularity is getting smaller. More interconnections, more specifics. By contrast, this is v6. Similar growth, except you've got to look at the numbers. The number of prefixes rose from a phenomenal 700 to a phenomenal 868, I believe. More patchy - you can obviously see that some people decided it was a good time in August, mucked around with v6, and then got bored. After that, they all went home again. Noisier, too, is the advertised address span. This is weird. Two big spikes and everything going down - in other words, the blue line is actually decreasing each time. Why is that? I took away the 6Bone.
And now what you actually see - and this is the issue that a /20 is much, much, much, much bigger than a /32. So here are all the /32s - blip, blip, blip - a /20. That's a /21 and that's a /20. It's hard to show you the growth in v6 address space apart from saying that two whacking great big allocations happened that year and got advertised, plus a few little ones, but you can't see them. There's the 6Bone. You've now got precisely four months to quit and, so far, each of you is quitting. Each jump down is a /24. The 6Bone is slowly being turned off. Here's the combined view now. I don't know if it's any clearer. That's the big picture, that's the bit left after the 6Bone, and that's the 6Bone slowly flying off; and without those two big allocations being advertised, that's what we actually have. You just can't see it. Someone's given me a laser pointer. The bit without the big allocations. Advertised AS numbers - noisier. Probably because it's 500 to 600 - so there are 600 of you playing in the v6 game as far as I can see, some in the 6Bone, some not. So what can I tell you about this? Prefixes up by 21%, roots by 15%, more specifics by 21%. Naughty, naughty, naughty. You're not meant to disaggregate in v6. Stop it!

LAUGHTER

The amount of address space went up by a phenomenal 50% because two big allocations happened to be advertised. ASes are up by 20%. The average advertisement size, as a result of those two massive allocations, is getting enormously large, and origination per AS is getting large, but only because of those two factors. The path length I can't give you much of a view on - it's such a small network that you can't see what the average AS path is, and the interconnection degree I can't tell. This is a network that continues to go large, with little overlays at the edges, and the trends really aren't there yet. Part one. Part two - more interesting.
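The point that a /20 dwarfs a /32 on an address-span plot is just prefix-length arithmetic - each extra bit of prefix length halves the span. A one-line sketch:

```python
# Address span covered by an IPv6 prefix: each extra prefix bit
# halves the span, so a /20 covers 2^(32-20) = 4,096 times the
# space of a /32.

def v6_span(prefix_len: int) -> int:
    """Number of /128 addresses covered by a v6 prefix of this length."""
    return 2 ** (128 - prefix_len)

assert v6_span(20) // v6_span(32) == 4096
```

Which is why the handful of /32 allocations are invisible blips once a single /20 lands on the same graph.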
If you were buying a router and you'd like it to live for three years inside your network, doing what it's doing, running in a default-free zone, what spec would you tell the vendor to build to? You know - how many prefixes, how many prefix updates per second? Vince wanted me to answer two questions - v4 and v6. I took the easy one and did v4. It's hard to predict v6 because it's such a small network; I can't tell. I'd like to have a shot at this answer in a v4 context. And for this I'm taking a macro view, so I've taken the entire set of updates and withdrawals from AS1221 for 2005. Because it's inside a relatively busy network, there's a whole bunch of local updates also happening from inside the network, so I've basically tried to filter out everything that I don't think came from the default-free zone - smacked out a whole bunch of updates from there. What I'm trying to do is see if I can relate the number of updates and withdrawals to the number of entries in the RIB. At the same time, I'm also looking at the CPU load records from the router that was supplying all the updates, and again trying to see if there's a relationship between the amount of CPU being used in that router, at some kind of granularity, and the table size. Now, if that's the case - if you can do a table-size predictive model and you know the relative number of updates, withdrawals and CPU - you have a vague idea of how big the thing should possibly be somewhere in three to five years' time. So that's the methodology for trying to answer that question. Updates per day for the year, in millions, in the default-free zone. So I've filtered out a fair few. At the start of the year, somewhere around 300,000 BGP update messages per day were being caught at the router. By the end of the year, it was slightly under 600,000. That's an enormous number of updates. Notice also that this is not uniform.
Now, the law of very, very big numbers says that, if each of you contributes a little, and there are 21,000 ASes, that line should be smooth. So the law of large numbers isn't working. The variation in that number is so big - up to 50% - that I suspect each of you isn't contributing just a little. And that's what I want to talk about a bit later. Notice that astonishing variation. That is not nice. The number of prefixes per update message: how efficient are we at packing prefixes into single updates? Getting worse. In other words, the update is getting closer to describing one prefix, not a bulk of them. The network, in terms of routing policy granularity per update, is changing - the granularity is getting smaller - and the number of prefixes varies a lot. Some folk are doing very strange things. I'm trying to understand that. The number of update messages per day - unfortunately, it's close to double; as I said, it's highly variable, and it shouldn't be. The number of prefixes per update message is falling and I'm just wondering if this is actually due to the increasing use of ASes to do multihoming at the edge - that what used to be an ISP with 50 prefixes is now an AS with 20 and a whole bunch of ASes underneath it that are starting to multihome. But, for some reason, the number of prefixes per update is falling. And I'm now spending time trying to understand this - it seems that the update rate is increasing faster than the table. That update rate almost doubled. In other words, the number of updates happening in the network is increasing faster than the table size. Is there some kind of multiplicative factor going on? Or is something else happening - something growing faster than the routing table itself? This is not good news. Now, maybe I should stop looking at update messages and start looking at individual prefixes. How many prefixes change per day and what are the trends?
What I've done now is look at prefixes - these are updates, those are withdrawals, in millions. So, the withdrawals: 200,000 going up to 300,000 in a day. A huge number of withdrawals. The updates of prefixes are even noisier. This is weird. I can put a trend line on the prefix update rates and, you know, yes, it has increased - around 800,000 prefix updates, the prefixes that actually got updated each day, by the end of the year. The withdrawal rates - you're going to actually start to see that that is an exponential line and, even though it's noisy, there's clearly an exponential growth factor in the withdrawal rates. So, high variability and approximately exponential, but at different rates - the updates are growing faster than the withdrawals. So now, can I relate that to the size of the network - the actual number of entries in the table itself? So that's the default-free zone across the year - that's 100,000, that's 170,000 - and you can see pretty clearly that that's not linear now. There really is a bend going on there. I've smoothed it out and done a first-order differential. The default-free zone, in terms of the number of entries in the RIB, is growing faster than linear - it's actually growing at around order squared. If I look at it as an order-two polynomial - how many RIB entries in three to five years, Vince? Somewhere between 275,000 in three years and 375,000 in five years would be a prediction. My guess at the confidence interval was about 20%, so it's not that confident, but that fit isn't bad. If you're looking at three to five years, that appears to be the metric you're looking at. Now, I've done the next thing, which is, for each RIB entry - 100,000, whatever - how many updates per RIB entry over the year? If that was linear, that would be growing at the same rate as the table itself.
If it's more than linear, then the number of prefixes being altered each day is growing faster than the table size. And this is the number of withdrawals per RIB entry. Is that weird? One withdrawal per RIB entry - on average, every day, the entire table is withdrawn, or there is a very small number of prefixes that are doing an awful lot of work: withdraw, update, withdraw, update. But that's one withdrawal. Every single entry is updated three times every day, or there is a small number of prefixes that are, you know, pushing an awful lot of iron very quickly, and that's growing very fast and, again, very noisy. So I can answer you, at least at a gross level, Vince, on what I think will happen with withdrawals and updates: inside three years, you're going to have to cope with around 1.7 million prefixes being changed every day and, by five years, that will grow to, you know, around 3 million, which is, you know, an insane amount. And the withdrawal rate, similarly, should be around 1.5 million withdrawals per day. Which should keep your routers busy. How fast would your router have to spin? Again, same kind of technique. I got from Stephan the actual 5-minute and 1-minute CPU loads and, across the year in question, the router had a brain transplant twice, ending up as a PRP-2. Isn't it cool? So what I tried to do was normalise everything to a PRP-2. So what I actually got was this: this is what happened across the year - the one-minute load rate on the PRP-2 increased at that rate - and I've just plotted it per RIB entry. This is growing faster than the RIB. It appears that, when I push that forward, if I'm using a unit of one by the end of this year, by the end of five years I'll need four times that amount of processing power to cope with the load. That appears to be the projection. Today, at 176,000 prefixes, update rates of 700,000 per day and withdrawals of 400,000, you'd need around 250 Mbytes of memory and 30% of a one-gig processor.
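These projections can be reproduced with a little arithmetic. As a sketch, the quadratic below is simply interpolated through the three quoted anchor points (roughly 176,000 RIB entries today, 275,000 in three years, 375,000 in five); the coefficients come from those rounded figures, not from the underlying data, and the processor scaling just applies the quoted four-times factor:

```python
# Quadratic ("order two polynomial") extrapolation of RIB size through
# the three figures quoted in the talk, plus the processor scaling.

def quadratic_through(p1, p2, y0):
    """Coefficients (a, b, c) of a*t^2 + b*t + c passing through
    (0, y0), p1 and p2, by direct elimination."""
    (t1, y1), (t2, y2) = p1, p2
    c = y0
    a = ((y2 - c) / t2 - (y1 - c) / t1) / (t2 - t1)
    b = (y1 - c) / t1 - a * t1
    return a, b, c

a, b, c = quadratic_through((3, 275_000), (5, 375_000), 176_000)

def rib(t: float) -> float:
    """Projected RIB entries t years out."""
    return a * t * t + b * t + c

assert round(rib(3)) == 275_000
assert round(rib(5)) == 375_000

# Processor: ~30% of a one-gig processor today, with the load
# projected to need four times the processing power in five years.
today_cpu = 0.30
in_five_years = today_cpu * 4
assert round(in_five_years, 1) == 1.2   # ~120%: you need a new processor
```

The 120% figure is why the five-year answer is "a new processor, absolutely", even before allowing for peak rates or any future security machinery in BGP.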
Three years' time: 275,000 prefixes, just under 1.7 million updates, almost a million withdrawals, double the memory and 75% of that processor. And in five years you'll absolutely need a new processor - about 120% of it. It seems awfully low. I think it is low. You really are talking about trying to cope with peaks; convergence is about speed, rather than reliability - you've got to get there faster. It's per-second peak rates, not average loads, that are the problem, and it assumes that BGP isn't going to change. And there have been more than enough words over the last year or two to say that we need to do something about securing BGP. If you think you need to be able to do that inside five years' time, you may want to think about what exactly the factors are for the router you're going to buy in terms of security-related protocols - are we doing IPSEC on our peerings or something similar, the incremental workload and so on. So I would actually say that, if I was going to spec one out, I'd want at least 500,000 entries in the RIB, no sweat. I'm going to need an awful lot of adjacency RIB space. If I'm going to be conservative and say, "If I do this, I think I'm OK", then it's about 6 million prefix updates per day. I think I need at least two gig of route processor memory and probably a 5GHz processor for route processing. What was a Cray-1? I think it was less than that, wasn't it? How good is that number? What's going on here? I've got a couple more seconds. Is this uniform? I don't think so. I don't think all of you are behaving that well. Is this skewed? If so, how skewed? 289,000 different prefixes were actually announced during the year. Now, the table is only 179,000 prefixes, so there are actually 127,000 prefixes that appeared for some period and aren't there any more. So people are leaking. And then they pull it back, but there's been an awful lot of leaking. The prefixes that had no updates at all through the year - congratulations - numbered 12,640. Well done - gold star, tick, elephant stamp.
Everyone else did some kind of update. This is a cumulative histogram of what's going on. 50% of the prefixes contributed less than 10% of the updates. 60% of the prefixes contributed less than 20%. 80% contributed just on 20%. So the top 20% of the prefixes were pretty bad, and the top 1% contributed 15% of the update load. OK. Let's name them.

LAUGHTER

So, if you see yourself on this list, you're on this list for one of two reasons - one, you're multihoming and you don't know how, because, somehow, you managed 158,000 updates for the year, of which 20,000 were actually just flips in the first-hop AS, the next hop, and, of course, you flap like crazy. All of these folk flapped. Some of them also re-homed as well. So have a look there. You may be there. You may wonder why you're there. If you're there and wonder why you're there, go to the tutorials. Let's have a look at these people, because some of them are systematic and some of them are night-time stuff-ups. Systematic - Hong Kong Supernet. This was a prefix that was active: withdraw, update. Green is the flap and the red is an attribute change, so, for a period of at least four months, this one prefix managed to generate around 1,500 update messages per day - for one prefix - most of which were withdraw, announce, withdraw, announce. Precisely what information that added to or subtracted from the routing table beats me with a stick. Somehow, they got a clue in September. Well done. Here's another one. This is from ICARE in Hong Kong. This is straight traffic engineering. This is Hong Kong Supernet again. They have four upstreams and they're moving prefixes around. This is one in Turkey. Someone out there in vendorland sold them something in June that they shouldn't have bought.

LAUGHTER

Because this is systematic, sustained - 500 updates per day - moving across multiple upstreams. This one is an interesting case. Here's another one from Turkey.
They went and bought even more of it at the end of September and did even better with it. Why is this? Again, here's another one, Amphibian Media - I think they're related to the folk who do inbound route traffic engineering - and here is another one - Merit. Surely they should know better. I looked for this network a couple of days ago and it's gone. They must have seen it flapping and sent it off into routing hell because I couldn't see it a week or so ago. Phenomenal amount of updates from folk like Merit. Here's the last one again and, oh, it's our friends from Turkey. They really did buy something in June. I think they tested it in April and really turned it on in June. Shouldn't have happened any other way. And Number 10 - again, a US one. I think they're InterNAP-related. Systematic, absolutely systematic, so this is no accident. Is it prefixes or Autonomous Systems? Look at that curve. This is the autonomous system one. Go back by 10, look at this one - so, while there might be a small number of prefixes creating all these updates, there's a tiny number of autonomous system numbers generating all these updates. The top 1%, the top 2%, the top 3% - 3% of the autonomous system numbers generated half of the updates. Thank you very much. Well done. LAUGHTER Here's another way of looking at it. Red is the actual number of updates. The green is just the top 50 ASes. So the top 50 ASes do half the updates. 50 people cause your router to have a problem. 50 people cause BGP to have a problem. Let's name them. Here's the first one - we met them before, the folk who bought this wonderful thing in June, AS 9121 - 206,000 rehomes since June. If they'd run it for a year, they'd be effectively off the planet and so on. You could see a lot of multihoming in 17557 and 721. 721 probably should know better and an awful lot of flaps on all of them. 
Here's the signature for our friends in Turkey of the total number of prefixes they originate and, yes, they actually tried it in February, liked what they saw. I think they bought it in April, turned it on hard in June, took a holiday over August, came back to work and then something blew up in December. This is bad stuff. I actually had a look at them. They seem to have six upstreams and my guess is that they're using a tool like OER and doing more-specific juggling to try and get their incoming traffic evenly balanced. And my suspicion is that they could stop all this if they bought more bandwidth. Because, as far as I can see - and you can look them up in BGPlay and have fun with these guys - you will find that the routes flip across all six at 4:00 in the morning. I know they're hardworking ISPs but I don't think anyone is up doing massive config changes as fast as they can type at 4:00 in the morning. Something is happening, doing bad things globally. More tutorials and a word of caution to the vendor - you shouldn't have sold it to them. The next one, Korea Internet Exchange, number two. This is not traffic engineering. This is some pathological condition that lasted for a while; they caught it and fixed it. MCI Europe - traffic engineering. Pakistan Telecom. A combination of the two? I can't tell. These enormous spikes are weird but there's a background of activity. So that's probably harder to put a signature on. Ah, Hong Kong Supernet, our friends. Whatever it was stopped in September and we're all very grateful. TPG here in Australia - again, you know, they're number 6 on the list, really high spikes intermittently. Can't figure it out. Yet. 9121, our friends in the US military. Something must have happened in June somewhere that required a huge amount of updates in BGP. Hong Kong Supernet has two ASes and here's the second one. They just keep on having more fun than you can imagine. 
DACOM in Korea and the Korea National Computerisation Agency - those are individual problems that happened over a few days. We're not seeing a level of uniformity in update rates such that my answer to Vince was any good at all. I'm only looking at 100 ASes, I believe. And I'm looking at 100 ASes whose behaviour pattern is pathologically bad and they contribute the overwhelming bulk of updates now and have done for at least a year, and I suspect there's two reasons. One - there's a whole bunch of automated inbound traffic engineering software that does route prefix juggling that is stuffing up the entire network with millions of updates and withdrawals. Literally millions, in fact hundreds of millions. That sustained, consistent update rate is killing us. If you want to do inbound traffic engineering, consider buying more bandwidth instead. The rest of us will thank you. I don't think the routing system can withstand that kind of abuse. There's something more than MED oscillation going on - with MED oscillation you shouldn't see withdrawals; it just simply flips. What I'm seeing in these unstable ones is a massive number of withdrawals with the prefix moving around. We have isolated incidents of unstable configurations that are causing massive load rates. So, as I said, the overwhelming number of updates are generated by an underwhelming number of sources. The uncertainty in the trend models I gave you is extremely high. That means that, if you really want me to give you an answer, Vince, on what you need to buy in three to five years, my answer is I don't have a clue. Thank you. APPLAUSE RANDY BUSH: Geoff, two things. Why did you throw out the internal route fluctuations when the router is going to have to handle them? GEOFF HUSTON: This was the question as distinct from the investigation. 
The question was actually asked, "If I was buying a default-free zone router..." and, when I start factoring in AS 1221, how do I sort of know that that's a typical kind of pattern? So I thought the way to do this - and AS 1221 doesn't have an awful lot of upstreams so, if you look at its topology, it's actually got a couple of domestic peers and you see this one upstream - I'm kind of looking truly at a default-free zone single path and that seemed a good baseline to take for these measurements. RANDY BUSH: This is what I need with a single router in the default-free zone that doesn't connect to anything else. GEOFF HUSTON: I'm talking about the floor, not the ceiling. RANDY BUSH: And you don't want to know what router to buy. It's what router to sell, since Vince is now a vendor. The other thing is those /8s you saw flapping which you don't understand - I gave a lightning talk at the end of the last NANOG, where we discovered that there are a number of /8s being used by spammers. They announce the /8, they hit you with spam from the dark space in it, from the unused space in it, and withdraw the /8. And that's those flapping /8s. GEOFF HUSTON: Thank you. We should name them. I'm not sure it will do any good because they're stolen. RANDY BUSH: We did. GEOFF HUSTON: You named them. But they're stolen. RANDY BUSH: Right. They cover some sparse allocations and what they're doing is using it for agile spam generation. They announce the /8 and go, "ping, ping, ping, ping," and turn it off. GEOFF HUSTON: Interesting. Thank you. PHILIP SMITH: Any other questions for Geoff? If not, we should move on. We're a little bit behind in time. Next up we have Randy. RANDY BUSH: OK. Credit also to Steve Bellovin - a lot of his work is represented here. What is routing security? It is not router security; it is not defending your router against attacks that are similar to the attacks that will happen to your Windows or real computer or whatever. 
The unique threat is attacks using the routing protocols. It's routing security, not router security. What are they doing it for? To divert traffic and to alter traffic. We have some ability to lessen the danger but not enough. Steve Bellovin published stuff in '89. Work accelerated in 1996. Kent et alia did good stuff in 2000. Why so little progress? The problems are technically difficult. OK. Simple routing, as Geoff just showed you, is not simple. Complex problems in routing are exceedingly complex. When some of the best computer scientists in the world are studying your stuff as a behavioural phenomenon, you should know you're in trouble. So it's not traditional communication security. All the boys and girls in your big stack telcos know that it doesn't apply very well here. So it's new ground. The installed base is big and the transition problem is big. And vendors aren't seeing people with big bucks saying, "Do it." Whether those people don't exist, or the vendors' vision is poor, is not clear. Please go to the tutorials for normal operational security. What do we want to ensure? That an ISP who announces something owns it - origin of prefix. If a router announces a path to X, that it can actually deliver the packets there. And if it tells me it can get me to some place, did some place authorise it to take me there? If I'm told I can get taken to Geoff, did Geoff want his traffic to go there? We won't get into that, Geoff. OK. What's different? The well-studied communication and host security issues are all about buggy code or bad protocol design. In this case, the protocol design is as good as we can do, and the code is - well, we won't go there. But we're still vulnerable because of a dishonest member of the game. Somebody with whom we are connected, or somebody with whom they are connected, is lying. OK. 
So hop-by-hop authentication is not sufficient because just because I have a secure connection to Geoff doesn't mean Geoff's a good guy. OK. What does the attacker want? The normal situation is site A sends along this path a nice financial transaction, from A to X to Y to B. Z, the attacker, wants to divert the traffic, strip off and keep the dollars and let the traffic go on, maybe or maybe not, so that everybody thinks everything's fine and they kept the dollars. Of course, it may not be dollars; it could be critical information, it could be - I won't go there. All sorts of stuff. How does the attacker do it? This is going to be a simplified model. I think the original title said it was an oversimplification, so excuse me. On a hop-by-hop basis, the attacker owns some router and lies about the cost, and we must assume that random routers on the Internet can be owned. The current price, the going price in the black hat market for the enable password to an ISP router, is five credit card numbers. That's pennies. So they own some of your routers today - or mine too. OK. How does Z do it? OK, there are some costs, so the path here is 5+5+5 is 15; I'm not going to go the other way because it's 30. Y told X and Z that its cost to B was 5. X told A and Z that its costs are, "I can get you to Y in 5 and to B in 10." And A is told these are the costs. Z lies and says, "I can get you there cheaper." So the traffic gets handed to Z. And Z just got lucky. Why is it a hard problem? X does not really know what Z's links are. It has to believe what Z's saying. X does not really know what Y's links are either, so they all trust each other regarding cost. OK. Validating prefix ownership does not help, as nobody lied that B owned the target. OK. Using a routing registry like a peering map does not help because nobody lied about who connected to whom. OK. 
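The cost-lying attack reduces to a toy best-path selection. A sketch - the router names follow the talk's example, but the cost values are illustrative:

```python
# Toy model of the attack: A simply believes advertised costs to B and
# picks the cheapest-looking neighbour. Costs here are illustrative.
def best_next_hop(advertised):
    # A picks the neighbour advertising the lowest cost to B
    return min(advertised, key=advertised.get)

honest = {"X": 10, "Z": 30}   # Z's true cost is worse, so A uses X
print(best_next_hop(honest))  # X

lying = {"X": 10, "Z": 5}     # Z lies: "I can get you there cheaper"
print(best_next_hop(lying))   # Z -- the traffic is handed to Z
```

Nothing in the protocol model lets A distinguish the two cases, which is the whole problem.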
One approach - and Steve gets credit for the underlying model - is that B cryptographically signs when it tells Y, "Hello, Y, my cost for B is 5." So it is signing forward when it announces. Y signs messages to X and Z, encapsulating B's message. It is signed by B saying this cost, and Y adds in front of it, "I can get you to Y for 10, signed by Y." So Y is committing to that and, by the way, here is B's message, which you can verify - and I'm not making it up, because B created this signature and the only way B could have created it is if B can get you there for 5. Y can get you to B for 5; I can get you to Y for 10 and Y can get you to B for 5. Z can only sign now that it can get to Y for 10 and Y can get to B for 5. It can't tell you the lie that it did before. You can verify this cryptographically. But this costs. So... use caching and pre- or delayed validation. OK. The fact is, for those who had the misfortune to be in the database session and hear about my version of the PKI model a couple of hours ago, you can - or even in Steve's model - you can validate who owns what when you receive the database; you don't have to wait till you receive the routing announcement. And if you can't get it right now, be optimistic, say, "OK, I'll believe it for the moment," and go check it slowly - delayed validation. Hopefully computers will get a little faster and, as Geoff has just pointed out, most routing announcements are boring, because you've already heard them recently. So if the router had a cache of what it already had validated, even though that was withdrawn, those people in Hong Kong are going to tell it to you again. GEOFF HUSTON: In the next second? RANDY BUSH: And you won't even have to revalidate it. Just when we need revalidation the most, it is done. Geoff pointed out that probably that stuff is boring. OK. Trust issues - how does X know the public key of ISP Y? How does anyone know prefix ownership? Those are the two core questions. 
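The nested "sign forward" idea can be sketched in a few lines. To keep the example short, HMAC with shared per-router keys stands in for the public-key signatures the real proposal (sBGP) uses - the keys and message formats here are entirely hypothetical:

```python
import hmac
import hashlib

# Hypothetical per-router keys. HMAC is a stand-in for asymmetric
# signatures purely to keep this sketch self-contained.
KEYS = {"B": b"b-secret", "Y": b"y-secret"}

def sign(router, message):
    # Append "router:tag" so a verifier can recompute the tag over message.
    tag = hmac.new(KEYS[router], message, hashlib.sha256).hexdigest()
    return message + b"|" + router.encode() + b":" + tag.encode()

def verify(signed):
    message, _, trailer = signed.rpartition(b"|")
    name, _, tag = trailer.partition(b":")
    expected = hmac.new(KEYS[name.decode()], message, hashlib.sha256).hexdigest()
    return hmac.compare_digest(tag, expected.encode())

# B signs its announcement; Y encapsulates B's signed message and signs that.
b_msg = sign("B", b"B: my cost to B is 5")
y_msg = sign("Y", b"Y: I can get you to Y for 10 / " + b_msg)

print(verify(y_msg))                               # True: Y's outer signature holds
inner = y_msg.rpartition(b"|")[0].split(b" / ", 1)[1]
print(verify(inner))                               # True: B's inner signature holds
```

Z can re-announce Y's signed message but cannot forge a cheaper inner claim, because it cannot produce B's or Y's tag over a different message.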
Who owns the address space and who is announcing it? OK. Address space ownership and ASN ownership, by the way, both luckily fall into a natural hierarchy with IANA at the root. So the allocations go to RIRs, and RIRs can sign allocations to ISPs, LIRs, NIRs and xIRs. Who issues the certificate? Who certifies an RIR, or IANA? Who certifies an ISP's identity, or an RIR's? Is it a web of trust? Issuing the certificate of who owns that address space can be separated from issuing the certificate for the address space prefix. In other words, the ISP's identity is separable from the space that is being allocated to them. I'm going to do something horrible. I'm going to go to another presentation, if I can see here. And I don't have my glasses. OK. This is an allocation by APNIC. It is an allocation to me as the ISP. There's my identity and my public key. And it allocates that address space with that expiration, and APNIC signs the delegation. Sorry, Steve, you'll survive the experience. We'll have some opportunities to discuss this. OK. And similarly for an AS number it gives to me, and when I make the announcement, it's from one of the ASes I own, of address space I own, and you can verify it all. OK, how are the certs distributed? Is it administratively, by ftp, etc? If that doesn't do a good job of it, then somebody should come up with that in v4, please. An out-of-band protocol - a new cert distribution protocol. OK. Is there an in-band protocol? There are some secure BGP suggestions that propose that as yet another extension to BGP, and some will work out how to do it with the DNS. So, that was it. I made it. Steve, by the way, did a lot of this work and helped me, and IIJ pays my salary. Go for it, Steve. STEVE KENT: I liked most of it, until we got to the, "We take these blobs and sign them." 
The way you describe them doesn't correspond to a standard way of issuing certificates from a format perspective, because you tied them together - if you go back to the slide, whichever presentation that was. There. You either issue an identity cert or you issue an attribute cert, but those are issued in separate ways. Someone can issue an identity cert. RANDY BUSH: They can do it from anywhere. STEPHEN KENT: And someone else has the responsibility of figuring out whether or not that identity cert represents the people they're dealing with and then issues the attribute cert. The question that comes to mind is why would I add the extra level of that? RANDY BUSH: They already do that. When I contract - where is somebody from an RIR? Sam, you're the victim. When I sign a contract with APNIC, the paper part of the contract binds her identity. Real space, my physical address, my signature - that is a legally binding contract. I should hand them the certificate when I do that, the one I will use to transact this stuff. That is my identity. I assert that in the binding legal contract. I might have gotten that cert anywhere. In 95% of the cases, no-one wants to do their own cert management and they'll pay APNIC $10 and APNIC will give them their cert. OK. But Telstra - Steve, where are you? - I believe paid tens of thousands of dollars to verify it with someone to get a Telstra CA. And I am not going to tell Telstra how to do their internal security management; they are a 500-kilogram gorilla and they'll tell me how they want to do their security management. What's interesting here is that the identity certs of the ISPs or end sites need no attestation because they're only used in two ways. They're only used either as part of a business transaction - I sign a financial email that I send to Sam, or I sign a DNS request that I send to Sam, or I sign an allocation request I send to Sam. 
And, Sam, I signed it with the private key that matches the public key in the cert I gave her when I paid her money and signed a contract. STEPHEN KENT: If I, as an RIR, go to the trouble of issuing an attribute cert to someone, the work is exactly the same as if I issued a public key cert? RANDY BUSH: It doesn't have a key - it doesn't have keys on it - and therefore I cannot use it to sign requests to them. STEPHEN KENT: But your slides have attribute certs in them, issued to the holder of two identities. You've added this additional thing that a registry in this case has to do, and I'm not sure why one would add it, given that the work you would have to do to do that is just as much as if you just issued them a cert with a public key in it, binding those resources in the first place. RANDY BUSH: Because what we really do in our operational world - this just is. I like sBGP because it matches the BGP I know and is very congruent with it. I like this because it matches our transactions. For instance, my identity will be the same and the public key I will announce will be the same for my transactions with APNIC and with - where is ARIN? Is there an ARIN? STEPHEN KENT: You always get to choose the public key anyway so you can have the same... RANDY BUSH: Not if it's in the resource cert. STEPHEN KENT: You can. You get to choose what your public key is; if you choose to give the same public key in two resource certs, which are public key certs with extensions to them, you can choose to do that. And that won't break anything. The second concern now is that you said the only way you use the cert, the identity cert, is in local transactions with the registry, for instance? RANDY BUSH: Bound in the attestations. STEPHEN KENT: You can only validate a public key cert by walking some chain from an anchor point. Everyone has to be a trust anchor for everyone else, which is a very significant problem. RANDY BUSH: No, no. I get the BGP announcement from Geoff. 
With the two of those objects - one being the ASN, and one being the IP space - I go to the IP space and I chase the IP space and I can verify the signatures all the way up to the IANA that encapsulated the attestations of who owned the space. STEPHEN KENT: If the space is only allocated there, you can't chain. RANDY BUSH: I'm going up a chain of the signed attestations. And there is a question of what name space these are in. I agree. But we have a natural name space, which is the IP identifier and the ASN. Here you go, Geoff. At least Geoff and I are used to storing these madly to run them for research. And I somehow think they make a very natural name space which will be run by other means. I think Geoff wants to speak. STEPHEN KENT: I watched you at the point where you talked about binding together attribute certificates and an identity certificate to which they are bound. A resulting thing like that is not something that PKIs deal with from a chain perspective. It can be a digitally signed blob, but we only chain certificates - and we only chain public key certificates, not something else. So it would be a new thingy of some sort and I'm having a hard time relating to that. PHILIP SMITH: Just a quick comment. GEOFF HUSTON: What Randy is saying is that there is an implicit chaining inside the series of attribute certificates up the food chain by nature of aggregates and more specifics. If I see, down at an ISP level, a particular prefix, we can search through for an attribute certificate that expresses an aggregate of that. My question to Randy was: while I can see that, how precisely does it work in the AS number space? Or are you assuming AS number block arrangements? RANDY BUSH: AS numbers are allocated - I'm sorry. AS numbers are very like the IP space - they don't have dots in the middle of them. But IANA does hand out blocks of them to the RIRs. So there are ranges of these. GEOFF HUSTON: You would say it's the range cert and would use the same thing. OK. 
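Geoff's "implicit chaining by aggregates and more specifics" amounts to a containment search. A sketch with Python's standard `ipaddress` module - the prefixes and attestation labels below are hypothetical:

```python
import ipaddress

# Sketch of implicit chaining: find every attested prefix that covers a
# query prefix, from the broad IANA/RIR aggregate down to the ISP's
# more specific. Prefixes and labels are hypothetical.
attested = {
    "203.0.0.0/8":    "IANA -> RIR aggregate",
    "203.0.112.0/20": "RIR -> ISP allocation",
}

query = ipaddress.ip_network("203.0.112.0/24")
covers = [p for p in attested
          if query.subnet_of(ipaddress.ip_network(p))]

# Sorting by prefix length reproduces the chain, broadest first.
print(sorted(covers, key=lambda p: int(p.split("/")[1])))
```

For AS numbers the same idea works with numeric ranges instead of subnet containment, which is the point of the "range cert" remark.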
RANDY BUSH: I will fess up that you're at the edge of how far I've thought this out - and so has Steve - and, luckily, I think we have a couple more meetings in the next five days to beat on each other further on it. But this maps the business and social relationships that we actually have. It's intuitive to us. But, again, we're harping on this stuff when this presentation was supposed to be about the routing security problem. George, bail me out. PHILIP SMITH: Thank you very much, Randy. Next up we have George Michaelson. He will give us an update. GEORGE MICHAELSON: A rather horrible slide pack this is, especially for a workshop, but it's a fairly good reflection of the work we've been doing. I'm going to cover what our current goals are here. I've got quite a lot of stuff. Some examples of the activity and where we think we're going to go next. The immediate short-term goals we were looking at were to try and get something that was a demonstrator, something that would get us off first base, with a prime focus on using free and open source. We badly wanted to avoid re-implementing anything. We have a body of code that we do in-house and it's built around Perl and the mod_perl extension to Apache 2. It's quite an investment we've made in the last 2.5 years and it's worked very well for us. We were looking to make use of that same mechanism. That means it had to fit in with this thing that is a particular way that we're developing code at APNIC. We're interested in trends in the wider community. We're looking at things which are doing REST, which is a way of doing things. We were looking at using XML encoding because there is so much code out there that can use XML. If you can pump the load on to the client side, make things happen on people's own iron rather than yours, that lets you get rid of quite a few of the problems. We've been doing the bootstrap work to support the basic infrastructure. 
I've got to say, the infrastructure work - this is really primitive stuff. How do we handle a cert, where do we store it and how do we make statements out of it? It's not up at the service level where we say this is a valid cert. It's much lower than that. But we've been able to learn from experience in the bootstrap phase. We actually found there's quite good code out there. There's a library called Convert::ASN1. It maps the ASN.1 into a Perl hash structure and it had a module for parsing certificates. That was just a test demonstrator but it's quite good. Although it's very badly documented, the code worked very well and we were able to use this to understand how to construct ASN.1 sequences and interact with them. We targeted OpenSSL deliberately because we thought it was the most successful. We thought hard about going into the OpenSSL code. But it is frighteningly complicated. It's bizarre what Tim and Eric do to make SSL happen and their interface is very confusing. It has abstractions to make your tea, cut your hair - it does everything. It is poorly documented. We steered clear of that. Instead, we're looking at using its command line interface. We have basic functions. We can issue a CRL and do basic things very straightforwardly. Verification - that's built into this tool. If you present a well-formed certificate and CRLs, you can say, is this cert OK? The problem is they've written the verification in a way that doesn't understand extensions. If you think about it, how can you write a generic tool that will understand a totally arbitrary extension? You can't. And what we've basically had to recognise is that this is, for us, at this stage, an inherently two-phased process. We're going to verify the crypto stuff; the extensions we'll have to do out of band by looking at them ourselves. The other thing we're getting is tools to convert from the PEM ASCII-encoded form. You get opportunities to look at things, and the Perl is a big advantage to us. 
But there is that problem that it's undocumented and pretty complicated. A quick overview - the way it works is it has an MS-DOS-style config file. You give a block a name and do variable-value assignments. We found there's an option to add an extra config file. We decided we'll add it as an external config and pass it in. There's a weird hack where you put in any flags about its criticality or importance or what colour of milk you drink, and say the rest is a DER-encoded sequence, and embed it as hex. So, if you've got ASN.1 and put it into a sequence, you can bang any extension into the framework and use it to construct the signing. An example would be an arbitrary member - they have one ASN, 17814. They've got a /20 of space as well. Using this mechanism, this is what it would look like in the encoding in a config file to run through OpenSSL. We've defined an arbitrary extension name in the configuration file and we have the mandatory components the RFC says we have to have. We must pass down the CA bit because you have to use the certificates. The subject key identifier has to be put in as a hash, the authority key identifier has to be present, and a key usage string. There's a bunch of functions you can do. The "critical" here is the interesting one. This has the behaviour of making people not be able to do things with your cert. It puts restrictions on. I will say I have a suspicion there are hacks out there that ignore things and do things with or without them. In a community of well-behaved people, I think the flags are quite good. Sorry, Steve, is that possibly me being naughty or fair? They're only mandatory if you play by the rules. STEVE KENT: I suspect that, in most widely used applications, the handling of a critical extension is not what it should be. GEORGE MICHAELSON: I noticed that it said you couldn't use X, Y, Z after you've used it. If you're happy, that's good for you. People think there are things you can do. The important stuff is the second set. 
These are the mandatory extensions that aren't in the standard set of attributes. There's an OID tag, and there are two which are defined by the RFC - the IPv4 address and IPv6 address components, both of which are critical. If your ASN.1 is as good as mine was 20 years ago - there is something like a number field, and it encodes a representation of that number or prefix. To give you an example of what we get from the Perl, in the cert-running program we do a configuration print against the OID. And the PDU is the string equivalent of the stuff we were seeing before. Does my voice not reach this microphone from over here? You get a hash array which consists of a high-level identifier that's in the ASN.1 and has the decoded value. That's the payload and this is the instance of data we were looking at. The instance is the policy identifier and this is its payload value. That's an assigned number that's managed by IANA. I think Steve may run the registry that assigns those? Yes or no. Russ does it. Russ assigned that number. So creation is just a command line. You're doing a call of a CA function, passing in your own config, passing in your extension file and saying, use what's inside that. And it's just absolutely normal command line arguments for certificate processing. You could look on 100 web pages - "How do I make my own certificate?" - and see this sequence with very small variations. We basically copped out in the short term. We weren't comfortable with the ASN.1 part of encoding these objects. It's not that we don't think ASN.1 would work, but we are one step back. We modelled the resource signing phase as constructing an SHA-1 signature across a body of text. We're doing detached signatures. We've bypassed the issue of what it looks like and how to manipulate them - that's still coming up. We can sign anything. We have tested signing an RPSL object. There is a problem in the toolkit: if you want to apply the verify function, you must have the public key in the ASN.1 format. 
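For the curious, the DER blob that goes into the config can be built by hand. This is a sketch per my reading of the RFC 3779 ASIdentifiers structure, using the single AS 17814 from the example above; the encoding has not been checked against OpenSSL's output:

```python
# Hand-encode (DER) an RFC 3779 ASIdentifiers payload for a single AS,
# AS 17814 from the example above. Sketch only -- structure per my
# reading of the RFC, not verified against OpenSSL.
def der(tag, content):
    assert len(content) < 128        # short-form length only; fine here
    return bytes([tag, len(content)]) + content

as_id   = der(0x02, (17814).to_bytes(2, "big"))  # INTEGER 17814 (0x4596)
choice  = der(0x30, as_id)                       # SEQUENCE OF ASIdOrRange
asnum   = der(0xA0, choice)                      # [0] asIdsOrRanges
payload = der(0x30, asnum)                       # ASIdentifiers ::= SEQUENCE
print(payload.hex())                             # 3008a006300402024596
```

A hex string like that, prefixed with the extension's OID and criticality flag, is exactly the kind of thing that gets embedded in the OpenSSL config block George describes.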
I don't know if you've ever seen how OpenSSL does it. It's made available as a text dump of the elements followed by a body. You have to do the conversion into ASN.1 if you want to use its command tool to do a verify. It doesn't tell you that anywhere in its doco, which was painful. We're expecting to have to put some tools into our own facilities to provide this to people, to help you. If you want to see one of these babies, this is what an example certificate looks like. Can you all see that and read that? Great. We're deliberately using a name which is quite clearly not valid - a "do not use this in the wild" name. We have short-life validity. We wanted to get into a cycle of having aged certificates that people could test against. We have a whole bunch of mandatory components. And down here, you have a typo, because the people that submitted the code into OpenSSL got one finger wrong. So instead of SBP you've got SQP. Almost every OpenSSL out there will see that string until one minor upgrade goes through. You cannot actually see the elements. They present as arbitrary data. We expect we'll have to write code and submit it to the OpenSSL community to present this in a structured manner and show people what the elements are, so you can do extraction and manipulate it yourself. Then you get the certificate as a bunch of text at the end. The current status - what we did is we took all of the top-level resources that we've given out to our membership and generated certificate instances for all of them. Our own file is about 8 to 10 K of text and that is a single certificate that covers the entire space we have responsibility for. We can do somewhere around 1,000 signings in 30 minutes - not that I would suggest in your wildest dreams you should re-sign the entire state of the world in one hit. That's silly. If you have to, it's not expensive. On a Dell 1750, which is a 2-gig CPU, you could do it. Most of your time is spent doing I/O. 
The crypto component of it is very small. I don't know why I keep seeing the same numbers come up and I'm wondering if every single certificate in the world has the same prime number. Isn't that bad? SPEAKER FROM THE FLOOR: It's the exponent. GEORGE MICHAELSON: It surprised me - we're told the way this stuff works is that you have huge numbers that are relatively prime. SPEAKER FROM THE FLOOR: But the public exponent that's used, e, can be any one of a variety of things. Two to the 16th plus 1 is a good choice. GEORGE MICHAELSON: Basically, at this point, if you say, "Trust me," it's OK. SPEAKER FROM THE FLOOR: It's a good number. GEORGE MICHAELSON: I dropped $200 worth of glassware one day and they were cross with me, so I didn't do too well. I went to the beach 100 times in one year, so. The other thing about this is we've deliberately made the certificate names a blind. We haven't given people institutional names. I don't know what your telco is; we've created this space using names based on this arbitrary number field. We chose a completely random prefix, FC00, and thought it was a nice number space to use. Every one of these certificates has a 40-bit centrally assigned unique local identifier magically associated with it as a value. You can work out who it is because you can plug the AS or the IP into whois to find out who has got it. But we're not making a data dump. These are flattened and we think there is potential in deliberately having certificates with anonymous names. There are many reasons to think about the names that get attached to these things. This is important. We're playing with an effectively artificial flat name space. We thought it was useful. We have made the configuration files and the private keys available. Never, ever give people your private keys. However, we are. If you want to play with this stuff, you need the private key in order to do the signings. 
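The exchange above is about the standard RSA public exponent, the Fermat prime F4. A quick check - this is general RSA background rather than anything from the slides:

```python
# The public exponent seen in nearly every certificate is the Fermat
# prime F4 = 2**16 + 1 = 65537: it is prime, and with only two 1-bits
# the public-key operation needs very few multiplications.
e = 2**16 + 1
print(e)                  # 65537
print(bin(e).count("1"))  # 2 -- low Hamming weight, cheap to exponentiate
```

So seeing the same value in every certificate is expected: the modulus differs per key, while the public exponent is conventionally shared.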
Our database - I've put the URL up there - has, for every certificate, the private key. So you could download these and pretend to be that person and use that certificate to do signings and tests. When we go live, we won't do that. I think that's about it. We're just starting to come up to speed on our own framework to manipulate these things. We're about three months behind with some other stuff, but we'll have our facilities able to manipulate these. And the next steps - we've got test certs from BBN. Charles Gardener has been amazing. He's been checking off stuff. We have a nice body of code. We want to put up this little demonstrator to give people a chance to pick some numbers, be a certificate, do validations, even make some bogus certs and verify that you can show they're not valid, and put up a time line to get this stuff out the door. A lot of this is subject to other work. That's it, folks. APPLAUSE RANDY BUSH: Randy Bush. GEORGE MICHAELSON: How are you? RANDY BUSH: I'm good. The RIRs, and particularly APNIC, since this is my home region, are getting this done because this is long lead-time stuff, and there is going to be some Monday morning when we wake up to serious routing attacks and this stuff has to have all been done beforehand. The vendors aren't solving the problem for us; you're doing the substrate that has to be done first, and thank you. GEORGE MICHAELSON: As crash dummy here, I'd say that it's really comforting that the open source tools are going towards our goal. Some of the behaviours are a little worrying, but it's not a big step to give them back some feedback that improves the quality of what they're doing. I'd be fairly confident that what we get out the door may not be the fastest or the best, but it will fly. I don't think the community is going to have a problem here. I think we can fulfil the role that's needed. PHILIP SMITH: Thank you very much, George. 
So this brings us to the end of the first session of the Routing SIG. After the break, we have three presentations to entice you back. And I hope you'll come back for the second half in about 20 minutes' time. Thank you. PHILIP SMITH: Welcome to the second half of the Routing SIG. Before I actually introduce Randy, our first speaker - as you probably noticed, the network has gone down and, yep, it is being fixed as we speak. So it just means you'll have to pay attention. You really will have to pay attention rather than just pretending to pay attention. Anyway, second session, we have three speakers. First up is Randy Bush, then Greg Hooten and then finally Henk Uijterwaal. Henk is just fresh off a flight from Amsterdam, so I've put him last on the agenda so he can recover from that experience. First up is Randy. RANDY BUSH: You picked the worst time to lose the Net, huh? Then, people would have something to do instead of having to listen to me. Still, you win some, you lose some. OK, this talk is really a few years old, but Philip asked me to give it because I think he thinks it's still relevant. It's about complexity, and really what it's about is simplicity. The Internet actually does work. Right? I know this will come as a shock to most people here but, you know - VINCENT FULLER: It's not working in this room right now. NARELLE CLARKE: Trust me, it's still out there. RANDY BUSH: It is still out there. IP forwarding really works. MPLS switching is a label look-up; IP forwarding is an IP look-up. It's all done with TCAMs; it's all the same story. Actual measurements show the quality of service is just fine. If you remember Steve Casner's measurements of transcontinental-US VoIP - jitter, etc, etc - just connected up to the commercial network, it works. OK. Anyway, QoS is a decision of which packets to drop. 
I don't know about you, but I get paid not to drop packets. So there are reasons that the Internet has taken over the data world and has taken over the communications world, and so trying to turn it back into the other seems to be swimming upstream. Reliability and resiliency are the core strengths of the Internet. The Internet was designed to provide reliable service over unreliable infrastructure. Somebody was talking about the reliability - Geoff was talking about the reliability issue. The idea is components are going to be unreliable. They will be almost as unreliable as humans. OK? But the Internet handles routing around problems. Right? Our weakness is security, as it was once the telcos' - by the way, if you remember, they used to mix control and data - 2600 and Captain Crunch and all that stuff, OK? IP routing yields as good a service as MPLS switching, and better in cases of multiple failures. Routing will find a way around. MPLS - you'd better have configured it. To quote Mike O'Dell, the hero of many of us, the real problem is scaling. All other problems come from that. If you can make it scale, the game's over, OK? Complexity is the arch-enemy of scaling, and this is key. Because, if you do something complex, your costs are non-linear as you scale. The telco culture started to glorify complexity as a competitive tactic in the 1970s and into the '80s. They wanted to compete with each other, so the big 500kg gorillas added feature, feature, feature and hung boxes on the sides of switches and boxes on the sides of boxes in order to provide perceived features to compete with each other. But look what it did to them - Geoff showed you this morning the wonderful chart of how those people are dying on the profit and loss statements, and they're dying on the earnings per share, and they're dying on the capital market. OK? And we're all in a commodity market. We all buy from the same vendors as the competition. Right? 
Making things complex will only raise your operational costs and raise your capital costs. I do have to remind you of RFC 1925 section 2.3 - "With sufficient thrust, pigs fly just fine." The question is, do you want to pay for the fuel? Out of your income statement? I don't. And who cares about flying pigs anyway? 'The Hitchhiker's Guide' has a wonderful saying about the Sirius Cybernetics Corporation and their products - "It is very easy to be blinded to the essential uselessness of them by the sense of achievement that you get from getting them to work at all." LAUGHTER How many of us are working with networks where we're amazed when we get them to work? Well, maybe we've put junk in there we shouldn't have. "In other words - and this is the rock solid principle on which the whole of the corporation's galaxy-wide success is founded - their fundamental design flaws are completely hidden by their superficial design flaws." OK? Stop building artificial make-believe circuits on top of switching on top of circuits. OK? I have worked for a number of - I have worked for the world's largest telco and I've worked for the world's ex-fourth-largest telco, which no longer exists because it pursued this path. And so I'm now anonymously going to tell you which place I learned this. But the optics people in the telco, the people who are responsible for fibre, said, "We can give them all the real circuits they want." Building circuits on top of layer 2 is costly to the company and damaging to the company. The problem is two things - one is the internal cost model, and that company had an internal cost model where, if you were the first user of a fibre and you just wanted one line of that fibre, you had to pay for the whole thing, and this is very common in the telcos. So instead of buying another fibre, they buy more routers, switches, or whatever you call these monsters these days, and build MPLS on top of them. 
And the second one is what Geoff referred to this morning, the convergence game, which really isn't convergence. It's one department, which has been at political war with the other departments for the last 100 or 150 years, saying, "We can provide this converged network and therefore we will subsume the people - the ATM people, the Voice people, the IP people, etc - and we will give you one network and, oh, we'll manage it all." Now, what's interesting is what they did was they took a profitable frame relay business and - hard to believe as it still was, late in the '90s and into the 2000s - even a profitable ATM business, and turned them into an unprofitable MPLS business. Where the smarts are is the big difference. Traditional Voice had stupid edge devices - the telephone instrument we all know and love with that dialler button on it - and a very smart core: these monstrous switches that are very sophisticated. The Internet has smart edges - this computer, undoubtedly it's smarter than I am, but that's easy - with sophisticated operating systems, applications, etc, etc, and a very simple, stupid core which does packet forwarding. And a key thing here, which Geoff was pointing out this morning, underpinning innovation, which is critical, is that adding an entirely new Internet service, such as Skype, such as HTTP, etc, etc, is just a matter of distributing an application to a few consenting desktops - let's forget NATs. And you've fielded it. You do not have to change the core. Think about what it takes if you want to add a service to the telco Voice networks - massive time, massive money, and you have to change the whole core of the network. Where is the reliability? The Voice network has very smart central offices which are heavily armoured and have rooms full of battery backup, etc, etc. The Internet assumes component failure and achieves reliability through the redundancy in the protocol designs. 
For instance, the root servers can be seriously attacked without anyone noticing, and people have to actually show measurements of which ones weren't reachable when, because none of the users knew. Right? The protocols find a working one and remember it until it fails. Carrier class reliability - we've got five nines, we can give it to you. The famous 5ESS switch regularly has five nines in operation and has even hit six nines in the field. We think we want that in routers and other Internet boxes. Can we achieve this? Let me tell you a secret about the 5ESS. Somebody designed the 5ESS with a poor (pause) - there goes another noun - relational breakdown of its data structures. So the data is redundant all round inside the switch. So there has to be a supervisory function which continuously runs and cleans up the internal inconsistencies in those data structures. It is the majority of the code. And, if it's removed, the switch crashes in a few hours. And that's your five-nines reliability. Can you imagine this approach scaling to Internet routing? You can't distribute that. OK? It does not play here. Spread it across the layers. Again, RFC 1925 - "It is always possible to agglutinate multiple separate problems into a single complex interdependent solution. In most cases, this is a bad idea." Don't do it. This is why ATM-1 failed in the Internet. It tried to solve QoS, traffic engineering, circuit simulation, all that stuff. RFC 1925 again says, "Every old idea will be proposed again with a different name and a different presentation, regardless of whether it works." And we are now facing ATM-2. Trade-offs across the layers, or how to get power and simplicity. L2/L3 technologies such as Frame, IP and MPLS have costs proportional to software costs. They drop very slowly. Fibre bandwidth and pricing seem to follow Moore's law - much like hardware cost. Which do you want to bet on? OK? 
So, instead of increasing the L2/L3 cost with pseudo-muxing, DWDM is your friend. Every year, they get twice as many bits out of the same piece of glass. Bet on it. The cost of bandwidth is falling faster than 32 feet per second squared. Routers aren't costing less. They're costing more. My OPEX is going up because of the complexity. Get a clue. What do you do? Bet on simple and cheap. Layer-1 costs are driven by hardware. Layer-2 is driven by software. Provision the bandwidth you need. What happens when fibre keeps falling and Google, Yahoo, etc, provide cheap transport and the last monopoly is broken and peer-to-peer dominates? And VoIP keeps exploding, even though it doesn't take much bandwidth? There's only so long the government and lawyers can save the telcos. The second question I wanted to ask Geoff this morning - and I'll ask it now - is I think I see a game being played. It's especially visible in the States, but it's leaked here and it's leaking to Asia, and that is that the trademark and copyright lawyers on the right hand are trying to label content as property so that, on the left hand, the entrenched carriers who are being protected can sell the transport of a commodity product instead of a commodity service. So that's what's being called - what was it? Oh, God - he asked the question... um, the whole thing where they're doing, "We'll charge Google for carrying their bits but we'll carry our bits better." The net neutrality issue, etc. So what's happening is, on the right, they're productising content - the motion pictures association, the record association, etc - and, on the left, they're nailing it as, OK, "Now, we will give you differentiated carriage of that." GEOFF HUSTON: I'll have a quick response to that. Yes, that is plan A from the media industry, and plan A isn't working. Expect movies to have more insidious product placement, because plan B is to distribute the movie more but pack the ads inside so you actually can't filter them out. 
My suspicion is that movies will be sponsored by various media outlets - Coca-Cola, etc. And that's their plan B. This whole issue of the telcos defending their space - because they employ a lot of people and otherwise the Department of Social Security would have a huge problem on their hands - is now this last desperate card they're playing. The media stuff - Google, I think, has proved that the media industry is strippable, and they are working through it. BitTorrent is proving that the traditional distribution systems are inefficient, but that doesn't mean there isn't still a strong industry there. There is. You just place the ads in different spots. RANDY BUSH: That's what's happening with the whole media thing. The newspapers are losing the ad revenues. The newspapers are going online and now Google is ahead of them and they're going to be in deep yoghurt. So the lawyers aren't going to save it, but they're working hard to muddle it in the meantime. Telcos have to save themselves. They're going to try and climb up the stack, but what they need to do is get in front of the technology. If VoIP is so cheap, then provide it already. Provide innovative services - not video on demand but mediated peer-to-peer, right? And do it as a commodity service with simplicity, not complexity. Because, if you complicate your network, you're just going to take any money you might have made and throw it right down the drain. OK? Going back to the cannibalisation of the frame relay business by MPLS - what happened to the profit side, the margin of the frame relay business, was it got turned into capital expenditure to put in more and more MPLS routers and into the OPEX to manage a very difficult technology. And so your margin went down the tubes through complexity. OK? So - we strongly suspect that, with enough complexity, we can operate an approximate Internet in polynomial time and dollars. That's a researcher's joke. Sorry. 
We are working on a proof that operating the Internet can be made to be NP-hard, and then we'll just wonder where the profits went. Just like the voice network. We never learn. The United States didn't learn from Vietnam and we didn't learn from the telcos. I think that's the show. ED LEWIS: I agree with what you're saying about complexity. But there are some things that the telco companies have, some services like emergency phone calls - we have 911 in the States, I don't know the number here. Looking at trying to put that stuff into the Internet, you start seeing a lot of really complex solutions that are above the telco line, in the software now. It looks like we're just pushing the complexity around sometimes to achieve some of the services that we've had for years in the telephone system. RANDY BUSH: I think Geoff is going to do a better job of this one than I. GEOFF HUSTON: When you start arguing with desperate people who see the problem as whether they're going to be in business next year, they start bringing up a whole bunch of reasons why they're socially useful and why you should fund their continued existence, and most of the stuff about 911 is actually nonsense. Indeed, realistically, there's the issue of where and why there is a rollout there. The telcos actually operate a damn fine SDH switch with Voice at the moment, and they'll continue to operate for some time yet. This is not a here-and-now problem. What they're truly trying to argue is that deregulation is hurting a lot, because other people are taking niche points and taking money away from them. You're seeing desperate people clutch at straws and argue why their role is still necessary and important and why the money should come in this way. I'm not sure it's believable, but that's the case they're making. RANDY BUSH: Another way of looking at it is, why isn't the demand being made of my television that I should be able to make an emergency call over it? Why isn't the demand being made of my car? 
Why is it being made of the Internet service? All you're trying to do is stack stuff on top of it because, oh, my God, I can make it look like a circuit. But, if you don't do that, your head won't hurt so much. ED LEWIS: This is an essential service we've provided for years and now the Internet has to do it. I'm looking at it as someone using the Internet and watching us trying to replicate the same service. RANDY BUSH: Don't do it. If it hurts, stop. You're trying to solve the wrong problem, or else solving it in a disastrously wrong way. ED LEWIS: The comment that led me to this was the comment about battery back-up with telephones. People say, when we have power outages, you can pick up the old POTS line and call someone. RANDY BUSH: Go pick it up. Just because I bought a car doesn't mean I'm going to let go of the POTS line, so, if you want that service, get that service. Don't try to impose it on automobiles or televisions. ED LEWIS: Sometimes we take having no complexity too far, making it too simple. RANDY BUSH: I wouldn't do something like that. PHILIP SMITH: Any other questions for Randy? NARELLE CLARKE: Item number 8 in RFC 1925 says, "It is more complicated than you think." RANDY BUSH: 1925, yeah. PHILIP SMITH: Thanks very much, Randy. OK, next up we have Greg Hooten talking about real-world use of route analytics technology. GREG HOOTEN: Hi. My name is Greg Hooten and I'm from Packet Design. We put out a product called Route Explorer and I want to talk to you today about route analytics and how it's used and where it's used. A little history about the company - I'm not going to go through it. You'll see it in the slide set if you want to pull it off the Web. So why route analytics? I've worked for a lot of large ISPs. Some of them went belly up. Most of them couldn't manage their networks. And a lot of the reason was because everything was focused on Layer 2. How do you manage a Layer 3 network? 
It wasn't really known, and HP OpenView really built up this big process at Layer 2 to figure out how the Layer 3 worked, and that didn't work either. Most Layer 3 problems are caused by misconfigurations in the routers and by hardware failures, and those cause the majority of the problems in the networks. What we saw at @Home Corporation, where I worked before this, was that, when we froze the network - for example, for a holiday - essentially our outages went down by over 90%. So nobody is allowed to make changes on the network - outages went away. That didn't catch on with a lot of people, but we tried to minimise our outages by being smarter about how we did them. But there was no way to really test, except by brute force, whether the Layer 3 was working or not. Route analytics leverages the strength of routing by listening to the routing protocols. So, if I'm listening to OSPF, I listen to the LSA information, record what's happening in the network, keep it historically, and that allows me to diagnose the problems that are happening in the network, either currently, or historically, or over time. So, if a problem happens once, it may happen again. If it's happened 15 times, it probably won't show up in a Layer 2 management system but it will show up in the routed infrastructure. Layer 3 is designed to survive through redundancy and through rerouting. That's good and bad. In one way it says, you know, "I'm going to try and get around this problem," but it also disguises that there is a problem, so we're trying to get visibility into that Layer 3 topology. Simple topology (refers to slide). Here are the routers, here's the path across the network, some BGP, some colours - it looks like a diagram. It can be changed any way you want. We tried to keep it as simple as possible. 
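The idea of keeping routing events historically, rather than polling the current state, can be sketched in a few lines. This is an illustrative toy only (the event format and link name are invented, not Route Explorer's): an intermittent problem that a point-in-time poll would miss shows up as repeated transitions for the same link.

```python
from collections import defaultdict

# Toy event log for routing changes, in the spirit of recording what
# the IGP announces and querying the history over a time window.

class RouteHistory:
    def __init__(self):
        self.events = defaultdict(list)   # link -> [(time, state), ...]

    def record(self, t, link, state):
        self.events[link].append((t, state))

    def transitions(self, link, start, end):
        """Count state changes for a link inside a time window."""
        states = [s for (t, s) in sorted(self.events[link]) if start <= t <= end]
        return sum(1 for a, b in zip(states, states[1:]) if a != b)

h = RouteHistory()
for t, state in [(0, "up"), (60, "down"), (65, "up"), (120, "down"), (125, "up")]:
    h.record(t, "rtr1-rtr2", state)     # hypothetical link name

# A one-off poll at t=130 would see the link "up"; the history shows
# it actually changed state four times.
print(h.transitions("rtr1-rtr2", 0, 130))   # -> 4
```

A real system would of course decode OSPF LSAs from a passive adjacency rather than take hand-fed tuples, but the query-over-history pattern is the same.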
Through the reports and through a very lightweight touch on the network - peering with OSPF - you get the data about what the routers are saying in the network: what routes are up, what the metrics are, what paths are down, what links are down. And then you build up over history a timeline of what has happened and do analysis on that type of data. These are some of the companies that are interested in this type of stuff (refers to slide). So MPLS seems to be our big focus recently. I don't really know why. It's a really good way of selling very large routers to the edge of your network instead of smaller routers to the edge of your network. But MPLS VPNs are catching on in a lot of Tier-1 ISPs. One of the big problems that they're running into is change validation - saying, "I'm adding a new customer to my network. How do I know it works?" What they're doing right now is they're doing as much testing as they can, calling up the network and saying, "Alright. Try it." When you go out and add new routes to the network, how do they know that works? So they're running into per-customer reachability issues, privacy issues - are the customer routes really distinguished one from another? - and policy issues - am I getting the hub and spoke that I really want, or am I getting the mesh that I want? And these are the problems that they're running into - customers leaking routes between each other because of misconfigurations, providers trying to monitor the policy as they change it, provider misconfigurations from various manual processes. This is a summary page of the changes that are happening in a network. Detail based on customers - you can name them pretty much anything you want - but what we're looking at is trying to baseline what's happening in a network. So I've got three active PEs, 10 active routes for this customer, a baseline of three routes, so I've got seven new routes. The question is why? 
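The baseline comparison just described reduces to set arithmetic over observed routes. A hedged sketch (the customer name and prefixes are invented for illustration, matching the "10 active, baseline of 3, 7 new" example above):

```python
# Toy sketch of change validation against a per-customer route baseline.
# Customer names and prefixes are made up for the example.

baseline = {"customer5": {"10.1.0.0/24", "10.1.1.0/24", "10.1.2.0/24"}}

observed = {"customer5": {"10.1.0.0/24", "10.1.1.0/24", "10.1.2.0/24",
                          "10.2.0.0/24", "10.2.1.0/24", "10.2.2.0/24",
                          "10.2.3.0/24", "10.2.4.0/24", "10.2.5.0/24",
                          "10.2.6.0/24"}}

for customer, routes in observed.items():
    base = baseline.get(customer, set())
    new = routes - base          # routes seen now but not in the baseline
    withdrawn = base - routes    # baseline routes no longer seen
    print(customer, "active:", len(routes),
          "new:", len(new), "withdrawn:", len(withdrawn))
# -> customer5 active: 10 new: 7 withdrawn: 0
```

The "why?" question is then answered by comparing the `new` set against the work order for the planned maintenance.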
So the process is really more like, if I'm making a change in the network, someone's responsible for that. Someone's responsible for saying, "I'm going to add a new PE. I'm going to add a new set of routes to my network." There's also an acceptance process of, alright, once those routes are in, once the new routers are in, how does that get accepted into my network as an operational piece of equipment? And so what we're trying to do is give that data to those groups, so when I make a change to this network - customer 5, they've got 10 active PEs, two withdrawn routes, 100 new routes - why do they have 100 new routes? So being able to look at these routes and say, "These routes were assigned to this PE at this time and these are the active routes, with this route distinguisher," gives you the ability to view it in more detail, rather than saying, "I planned a maintenance and executed that maintenance and this is the result of that maintenance." So we've got those 100 new routes that were put in for customer 5. 13 are coming from brand new PEs for that customer. They're not part of the baseline. Is that a mistake? If they look at the work order, they should be able to tell. These are the new sets of PEs that were supposed to be added. If they weren't, did PepsiCo get crossed with Coca-Cola? Is this something we want to do? Then you can do justification on your maintenance before you contact the customer and say this is working or this is not working. If there's instability in the network, then we can display that instability, and you want to do it with as much detail as you can. So here is just a graph of it over time, but we could also give the details about what routes were being affected at that time. So we had a large WAN RFP that required a routing convergence SLA. So when does a network converge? Right now, it's difficult to measure. What parts do you measure? 
Do you measure when OSPF starts to go away, or once the timers have expired and the changes are occurring in the network? There's no standard for it, so what we're doing is monitoring the convergence based on the propagation delay across the network. Is that the ideal solution? No. But it worked well enough for this company to justify to their customers that they were within the specifications that were required. Do we want better solutions? Of course. Right now, being able to say, "I have a monitoring system on one side, a monitoring system on the other," and being able to generate route changes into the network - inject and withdraw a loopback address, a /32 - measuring that propagation delay across the network, you can see this is the change and this is your graph of the change over a long period of time. Does that meet the SLA that you're interested in? By being able to look at BGP over long periods of time, you can see changes like huge deluges of routes coming into the network, whether it's a redistribution from OSPF into BGP or new changes, loss of peering - how does a loss of routing in AS1237, seven hops away from my network, affect the way that I send data out of my network? Does it have an effect? If it does have an effect, do I want to find out why, and do I want to change that process? So we've seen this at Tier-1 ISPs all over the place. UC Berkeley, new route leakage into the network - why did it come about? Where are the routes coming from? Being able to determine which customer was advertising that new set of prefixes, where did they shift from at the entry to my network? So, in this case, instead of looking at 108,000 prefixes that shifted, or 330,000 events, we're going to create a pretty picture about it - and it's actually a movable picture, so we can play over time what's changing in the network. We see a large loss of prefixes across one edge. 
So, in this case, CalREN-2 is advertising prefixes across this edge and losing prefixes across this edge, and so it's fairly simple to see what's changing in the network. It's easy for a customer on this side to contact CalREN-2 and say, "Look, you're losing peering here, you're picking this up. It's not affecting us right now because we have duplicate address advertisements, but it needs to be fixed." And, in this case, it was fairly short. The more complex problems are where Sprintlink may be losing connectivity or, in this case, where Qwest lost peering with CalREN, or withdrew a bunch of routes to CalREN, and instead of having a simple back-up, the back-up path was six hops. Really difficult to find from route advertisements. Really simple to see from a graphic. Being able to categorise route changes before and after with a delta - here's a 96,000-route change based on next hops - and being able to visualise those either through a table or through a graphic makes it easier to diagnose why that was happening in your network. Probably most of you have heard of MED oscillations. Probably not too many people have seen those oscillations in the network. This is a very small cut of the data we collected at a Tier-1 ISP about a year-and-a-half ago. The interesting thing about it was it went on for over two weeks. It consumed the full processing power of three GSRs and nobody really knew what was going on, except that there were three GSRs that were saturated. The changes were happening so quickly - essentially at the speed of the three processors of the GSRs - that, by the time they typed a command to see what routes were changing, they were 100 route changes behind. The changes were happening so rapidly, and they caused so much change in the network, that essentially the GSRs were useless. They started having to route around thousands of routers to try and get data connectivity back into major parts of their network. 
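The root cause of MED oscillation is that MED is only compared between routes learned from the same neighbouring AS, which can make pairwise best-path preference intransitive, so the chosen best path depends on comparison order and can cycle forever. A minimal sketch with invented routes and a deliberately simplified decision process (real BGP has many more tie-break steps):

```python
# Routes are (name, neighbour_AS, MED, IGP cost) - all values invented.

def prefer(a, b):
    """Simplified standard rule: MED compared only within the same
    neighbour AS; otherwise fall through to IGP cost."""
    if a[1] == b[1]:                    # same neighbour AS: lower MED wins
        return a if a[2] < b[2] else b
    return a if a[3] < b[3] else b      # different AS: lower IGP cost wins

r1 = ("r1", "AS1", 0, 10)
r2 = ("r2", "AS2", 0, 5)
r3 = ("r3", "AS1", 1, 3)

# Pairwise preference forms a cycle: r1 beats r3 (same AS, lower MED),
# r3 beats r2 (lower IGP), r2 beats r1 (lower IGP) - hence oscillation.
assert prefer(r1, r3) == r1
assert prefer(r3, r2) == r3
assert prefer(r2, r1) == r2

def prefer_always(a, b):
    """always-compare-med: MED compared across ASes first, then IGP."""
    if a[2] != b[2]:
        return a if a[2] < b[2] else b
    return a if a[3] < b[3] else b

# With always-compare-med the relation is transitive: r2 wins regardless
# of the order the comparisons are made in.
assert prefer_always(prefer_always(r1, r2), r3) == r2
assert prefer_always(prefer_always(r3, r1), r2) == r2
```

This is why the fix described in the next part of the talk works: making MED comparable across all neighbour ASes restores a consistent total ordering.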
Eventually they shut down two of the routers and were ready to replace them with larger GSRs, which would have just exacerbated the problem. But we were able to record some of the data that was happening at that time and show them what was happening. They implemented always-compare-med and that essentially eliminated the problem. So, over time, when you're capturing large amounts of data, you can also do analysis over long periods of time and look at simple changes, even for a single prefix on the network. A customer /24 was flapping continuously, on a one-minute cycle, for approximately a week. They always got back-up through a NAP, but this new service that they bought from the Tier-1 - they spent a lot of money on the primary access - essentially was doing them no good. We were able to pull out the data and look at that and show them where the problem was and how bad the problem was. So this is a fusion, a new type of fusion, that we're working on. A gentleman named Pabo Yu (?) started this; Steve Casner and Van Jacobson decided that, if we understand how the routing works in the network, then there must be a way to take NetFlow data, figure out where the NetFlow data is coming from and where it's going to in the network and then, instead of listening to it throughout the network, flow it across the network. So the idea here is, instead of collecting NetFlow data everywhere in my network, what I really want to do is collect it at the key points in my network - entrance points, key data centres. Given that I understand how my routing topology works, I want to be able to take that data and then flow it out across that topology. What that gives me is the ability to say, "If something changes in my network, based on either routing or on data flow, how will that affect the other circuits that are now being overloaded? Do I buy more bandwidth? What happens if I lose a router? Where will those packets go? Will they flow the way I expect them to? 
Will they saturate the circuits that I have as back-ups?" In a complex network, even with complexity coming from redundancy rather than from things like MPLS, the question is, where will my data go? It's designed to be resilient. I want to make it as predictable as possible, so being able to fail pieces of equipment, proactively testing them, seeing where yesterday's data would have flowed if this router went away, gives me the ability to predict what will happen in the future if I really do lose pieces of equipment. So being able to look at the routing topology, being able to look at NetFlow data and being able to fuse those two together gives me another tool that will allow me to better understand how my Layer 3 topology is working, how the data will flow in the network, what capacities I need for the future, what peering I need or what private peering I need rather than buying transit from service providers. So it's another way of looking at the same types of data we've been collecting for a long period of time, and hopefully it will add a little bit more clarity to the Layer 3 network, rather than trying to build that clarity up from a Layer 2 network and then trying to interpret from Layer 2 what the Layer 3 topology will do. Are there any questions? I was either very successful or you're very asleep. OK. PHILIP SMITH: No questions for Greg? No. GREG HOOTEN: Great. Thanks a lot. PHILIP SMITH: OK then. Thank you very much. APPLAUSE PHILIP SMITH: OK, next up and last speaker is Henk Uijterwaal. He's talking about ASNs missing in action. HENK UIJTERWAAL: This is work I did together with the RIPE NCC on ASNs missing in action. OK. So, first I'll talk about why we started doing this. We started to look at ASNs. I'm not going to explain what an ASN is, because I assume you all know that. What you do need to know about ASNs is that each AS needs its own unique identifier. They are assigned in a hierarchical way - IANA, RIRs, LIRs and users - and that guarantees uniqueness. 
You have to be able to identify each AS uniquely. Another observation is that AS numbers are a limited resource. At the moment, an AS number is 16 bits, which means 65,536 possible values. A couple of ranges are set aside for private use, and you can't use all of the remaining numbers either, so you only have 64,510 of them available on the net. So, who gets an AS? It's quite simple: if you want an AS, you go to your local RIR and you ask for one. They all have policies - the five regions have five different policies, all based on a single document with some regional variations added. If you're in a region and you think you need an AS, you just go to your RIR and ask for one. I won't explain the policies in detail; the important part is at the bottom of the slide, and it says you get the AS for as long as you need it. If you don't need it anymore, you're supposed to return it. I then started to look at whether the ASes that are handed out are actually used. Something seems fairly obvious: the Internet is a network of ASes, so if you look at the RIB in your router, you should find all the ASes that are in use at a particular time. And ASes, as I said on the previous slide, are assigned on demonstrated need: somebody shows that they need an AS, connects it to the net, and it should show up in your router. So you would expect all the assigned ASes to be in the RIB. Not quite. I looked at this, three years ago by now, and I noticed a couple of things. In early 2003, the RIRs had assigned 20,000 ASNs and were assigning 300 new ones per month. If I looked at the RIBs on a couple of routers, only 14,000 ASNs were visible, with only 200 new ones showing up every month. So, of the 20,000 assigned, some 6,000 were missing, and about 100 more were being handed out every month than were actually showing up on the net. So, yeah, my question was: what's happening here? This work is the result of a study trying to find out what was happening here, or what is happening here. To study this problem, you need some data. Fortunately, there are quite a few data sources around. 
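The 16-bit pool arithmetic above is easy to check. This is a minimal sketch using the standard private-use range (64512-65534); note the talk quotes 64,510 usable ASNs, which implies a couple more values are held back than this simple calculation accounts for.

```python
# Back-of-the-envelope check of the 16-bit ASN pool figures.
TOTAL = 2 ** 16                 # 65,536 possible 16-bit AS numbers
PRIVATE = 65534 - 64512 + 1     # 64512-65534 are for private use: 1,023 ASNs
RESERVED = 1                    # AS 65535 is reserved
usable = TOTAL - PRIVATE - RESERVED

print(TOTAL, PRIVATE, usable)   # 65536 1023 64512
# The talk quotes 64,510 usable ASNs, i.e. a couple more values were
# treated as unavailable; the order of magnitude is the point here.
```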
The first data source is the RIRs' Stats Files, which list all the ASes they have assigned and the dates of the assignments. These are daily reports now; they had weekly and monthly reports before. Of course, you can always work your way back: take today's file, remove everything that was assigned in the last day, and you know what was assigned as of yesterday, and so on backwards. If you do that, you find some small differences, as sometimes these files change. So we worked through all these files, corrected the mistakes and double counting we found, and ended up with a list of all the ASes that had been assigned at any particular time. The second thing you can look at is what's happening on the net - the RIBs. There are a couple of projects around that collect RIBs from around the place. One is the RIPE NCC's RIS, a project that collects RIBs and all BGP updates from about 450 peers, IPv4 and IPv6. We looked at the data from 18 August 2000 to 1 August 2005. We took the AS paths and broke them down into their component ASes, and as soon as an AS showed up, we counted it as used on that particular day. That gives a long list of ASes and the days on which they were used. You sometimes see private AS numbers - we removed them - and you also find that people make typos and things like that, so we removed all ASes seen for less than a week. And then there's a third data source: the CIDR Report, a weekly report on the Internet as seen from AS4637, available since 1994, which includes all the ASNs seen in the RIB. What do we have after this? We have two lists. One is the ASNs assigned - the RIR Stats Files; in theory, those should all be out on the net. The second is the ASNs in use - RIS and the CIDR Report; that's practice. The normal thing you would expect is an ASN that appears in both lists: it's assigned and somebody is using it. But we found differences - ASNs that appear in only one of the lists - and there can be two reasons for that. 
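The processing just described - breaking AS paths into component ASNs, dropping private AS numbers and short-lived "typo" ASNs, then comparing the result against the assigned list - can be sketched in a few lines. This is a toy illustration, not the actual RIS tooling; the AS numbers and dates are purely illustrative.

```python
from datetime import date, timedelta

def asns_in_use(observations, min_days=7):
    """observations: iterable of (as_path_string, date) pairs, e.g. from
    RIB dumps. Returns the set of ASNs seen on at least min_days distinct
    days, with private AS numbers (64512-65534) removed."""
    seen = {}  # asn -> set of dates on which it appeared
    for path, day in observations:
        for token in path.split():
            asn = int(token)
            if 64512 <= asn <= 65534:   # private use: ignore
                continue
            seen.setdefault(asn, set()).add(day)
    return {asn for asn, days in seen.items() if len(days) >= min_days}

# Toy data: AS 65000 is private, AS 99 appears only one day (a "typo").
obs = []
d0 = date(2005, 8, 1)
for i in range(10):
    obs.append(("3333 1103 65000", d0 + timedelta(days=i)))
obs.append(("99 1103", d0))

used = asns_in_use(obs)
assigned = {3333, 1103, 42}          # hypothetical Stats File contents
print(sorted(used))                  # [1103, 3333]
print(sorted(assigned - used))       # [42]  - assigned but never seen
print(sorted(used - assigned))       # []    - seen but not registered
```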
First, ASNs in use but not assigned. Some of that is inappropriate use, and sometimes there is a problem with the registration mechanism - we'll get to that later. Then there are the ASNs missing in action: ASNs in use but not registered. Over the course of the five years we found 436 ASNs used but not registered. Some of them were used only for a short while; 255 were still visible on 1 August of last year. If you look closer at them, 215 are in the RIPE NCC's ranges. We went through the basement of the RIPE NCC and looked at files other than the statistics files - old registrations, 10 years old or more. Digging through those, we found data for 214 of them. These ASes are probably registered, as they are in those files, but they don't show up in the publicly available files. That's something that can be corrected. For the one remaining, we still have no idea who owns it, who is using it, or if, how and when it was registered. The remainder were reported to ARIN and APNIC. The interesting thing is that a lot of them are in the ARIN ranges but are not found in ARIN's files. This is presumably a problem with the transfer of data: over the years, records were moved from one registry to another, and seven of them fell through the cracks. People often ask how accurate the files are. We have 33,000 ASNs assigned and about 41 without data, and it's probably a lot less. Given this is a mechanism that has been running for 15 years or so, 0.12% with no records is not bad. Next, here are a couple of curves. The blue one is the data available from 2002 onwards, and the purple one is working your way back - simply removing things from the first Stats Files and working backwards. Then the red one is what we see in RIS, and the green is what you see in the CIDR Report. You can see a couple of things. First of all, the assigned curve is a fairly straight line. Here, around 1999, is the start of the Internet bubble. 
Now, the Internet bubble lasted a couple of years, and in late 2001, early 2002, you can see its effect on what is visible on the net. You don't see that effect in the assignments - people were still making plans, still getting ASNs assigned that never appeared on the net. The other thing is that, down here, the assigned and seen curves used to be pretty much parallel lines: yes, a number of ASNs were missing, but the gap was stable. However, if you look at the last couple of years, the difference - the number of ASNs missing - is growing. For the modelling later on: if you look at this graph, the behaviour looks linear. We looked at it, and we think that the growth of the number of ASNs assigned is indeed linear. We did a couple of tests, fitting linear and exponential curves, and it still seems to be linear. Just to show that, this is the last couple of years: the solid black line is the linear fit, and the dotted line is the exponential; the exponential starts to deviate at the end. Growth rates - so how many ASNs are appearing every month? Three lines here. The most important one is the red one: new allocations every month. The blue line is what's disappearing. If you look at the red line, since 2002 it's been pretty much flat at about 284 a month. The blue line varies, but it's pretty flat there too. And one thing you should note is this bit at the end - I'll explain it on the next slide. So these are the growth rates for all five registries, and I then split it up into the various regions. The first one is ARIN. Here are three curves: the red - new assignments; the green - reassignments; and the blue - what is disappearing. And you see a couple of things. The first is that from 2002 to 2004 there was no recovery, and then ARIN started to recover ASNs - that's the blue area there. The second thing you see is that the green and red lines - new assignments from the never-used pool versus reassignments of ASNs used before - deviate: what is being recovered is being reassigned up here. 
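The linear-versus-exponential test mentioned a moment ago can be sketched with ordinary least squares: fit a straight line to the monthly totals directly, fit the exponential by fitting a line in log space, and compare the residuals. The monthly totals below are hypothetical (linear growth plus noise), not the real Stats File counts.

```python
import math

def lsq(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def rss(xs, ys, f):
    """Residual sum of squares of model f on the data."""
    return sum((y - f(x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical monthly totals of assigned ASNs: linear growth plus noise.
months = list(range(48))
totals = [20000 + 284 * m + ((-1) ** m) * 30 for m in months]

a, b = lsq(months, totals)                          # linear model
c, d = lsq(months, [math.log(y) for y in totals])   # exponential, in log space

lin_err = rss(months, totals, lambda m: a + b * m)
exp_err = rss(months, totals, lambda m: math.exp(c + d * m))
print(lin_err < exp_err)   # True: the linear model fits this data better
```

The same comparison run on data that really grows exponentially would come out the other way, which is the point of the test.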
Look at the RIPE NCC and you see a couple of things. Very little recovery. And the other thing is that the curve seems to go up - it is clearly going up. So far, this is compensated by ARIN's recovery efforts, so the total is still growing as a linear curve. If ARIN ever stops doing this, or has recovered everything it possibly can, the total might cross from linear back to exponential. You can see the recovery effort from early 2003 here. The next slide is the fraction of ASNs seen: I took what shows up on the net and divided it by everything assigned over time. In 1998 this was only about 40%, and it grew quite nicely, but for the last couple of years it has been pretty flat at about 63% visible on the net. And there are some numbers here: 33,000 were assigned, 20,000 are out there - 60%. Another 5,000 were used for a while but have been retired. The next interesting observation is the age of retired ASNs. We plotted how long an ASN was seen before it disappeared from the net. People use an ASN for something like 50 or 60 months and then stop using it. You can also plot this over time: this axis is the time when the ASN was assigned, and this is the fraction still used - this is 2005 and this is 2004. Of the ASNs assigned in 2004, about 80% were still in use, and then it goes down pretty rapidly: of the ones assigned 10 years ago, only 40% are still active. Next thing I'll look at - wait a second - why does this drop? There are two effects that cause these drops. The first is that sites go out of business, and when a site goes out of business, the need for its ASN disappears. But the people there are scrambling to get their CVs ready and find new jobs; the last thing on their minds is to send an email to the registry saying, "We got an ASN from you and now you can have it back." And there is very little recovery effort from the registries' side as well. 
The second thing that happens is that networks merge. When there is a merger, one ASN usually disappears: the networks behind ASN1 and ASN2 are combined and called ASN1, and ASN2 gets lost. But there is no incentive to return it to the unused pool, because if you ever needed one again you would have to go through the registration process and apply for a new one. That causes the drop there. Then, the activation delay: how long does it take for an ASN that you apply for to appear on the net? This is data for the APNIC region. On the bottom is the difference in days between assignment and appearance, and the three curves - pink, blue and purple - are the various years. You can see a couple of things. The first is that if you wait two months, 60 days, about 40% has appeared. If you wait 200 days, a little over half a year, about 2 out of 3 have appeared. If you wait a really long time - a year and a half - the curve essentially flattens off: only 80% has appeared even after a year and a half, and it's fairly constant after that. The other way to look at this: 20% of the ASNs that have been assigned never appear at all. So much for the observations. I then looked at the policies. All the regions have policies on what you have to do with an ASN. ARIN's policy says there must be plans to use the ASN within 30 days of assignment. In the RIPE NCC region, there's no formal policy - there was a discussion on the mailing list three years ago. And here in the APNIC region, the policy is that you must meet the requirements reasonably soon after receiving the ASN. So, comparing the policies with reality: they all say a month, three months, soon - but in practice the time between assignment and appearance on the net is a lot longer. And the second problem is that 20% of the ASNs that are assigned never make it onto the net, even though there was demonstrated need. Those are the raw numbers. Now some modelling. 
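The activation-delay curve just described can be computed directly from pairs of assignment dates and first-seen dates. This is a minimal sketch; the records below are hypothetical and only illustrate the shape of the curve, including the tail of ASNs that never appear.

```python
from datetime import date

def appeared_within(records, days):
    """records: (assigned, first_seen) date pairs, with first_seen=None
    if the ASN never showed up on the net. Returns the fraction of ASNs
    visible within `days` days of assignment."""
    hits = sum(1 for a, s in records
               if s is not None and (s - a).days <= days)
    return hits / len(records)

# Hypothetical assignment records (not APNIC's actual data).
recs = [
    (date(2004, 1, 1), date(2004, 1, 20)),   # up in under a month
    (date(2004, 1, 1), date(2004, 2, 10)),   # ~6 weeks
    (date(2004, 1, 1), date(2004, 3, 15)),   # ~2.5 months
    (date(2004, 1, 1), date(2004, 11, 1)),   # ~10 months
    (date(2004, 1, 1), None),                # never appeared
]
print(appeared_within(recs, 60))    # 0.4 - visible within two months
print(appeared_within(recs, 600))   # 0.8 - the never-appears tail remains
```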
The first question people ask is: when will the Internet run out of 16-bit AS numbers? 33,681 were assigned as of last year, so about 30,000 are still available, and we have 284 assignments per month - which means you run out around 2016. That is sort of worrisome. If we cannot get ASNs then, what do we have to do now? The first thing you can do, instead of solving the problem, is postpone it. First, we can reclaim what disappears: 284 minus 105 gives 179 net assignments per month. You can be a bit more aggressive and also reclaim what is never used; that brings you down to about 60 net assignments per month, and the pool then lasts something like 33 years. If that's not long enough, the next thing is to make ASNs a bit longer. There is a proposal for that - I'm not going to make any more detailed predictions here - a draft that has been around in the IETF for a while, extending AS numbers. Based on this work, and several studies by Geoff Huston, there is now a policy proposal in all five regions which says we'll start handing them out. Handing them out is one thing; they also have to be implemented. You have to update your routers and get them deployed, and you'll need a couple of years for that. So whenever your next router purchase comes up, make sure that this draft is implemented - that's something that requires a push from your side. There are other ways to make things last longer. If you don't want to upgrade your kit, you might think about changing the policies. The current policies are basically demonstrated need, yet 20% never makes it to the net - so it's probably too easy to demonstrate need, and the policies could be revisited. That's something for the various RIRs, not for this forum. A couple of final things. The essential property in this game is uniqueness: you want an ASN to be unique, and you don't want it to be reassigned while somebody is still using it. That's the first obstacle to reusing them: there is no good mechanism for recovery. 
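The depletion arithmetic above boils down to pool size divided by net assignment rate. A sketch, using the round numbers from the talk; note this simple constant-rate model doesn't exactly reproduce the talk's 2016 and 33-year figures, which presumably come from a more detailed projection.

```python
def months_until_empty(pool, assigned_per_month, returned_per_month=0):
    """Months until the free pool is exhausted at a constant net rate.
    Returns None if the net rate is not positive (pool never empties)."""
    net = assigned_per_month - returned_per_month
    if net <= 0:
        return None
    return pool / net

POOL = 30000  # roughly what remained of the 16-bit space at the time

no_reclaim = months_until_empty(POOL, 284)       # ~106 months, ~9 years
reclaim    = months_until_empty(POOL, 284, 105)  # ~168 months, ~14 years
aggressive = months_until_empty(POOL, 60)        # 500 months, ~42 years
print(round(no_reclaim / 12), round(reclaim / 12), round(aggressive / 12))
```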
A solution to the recovery problem might be the certification efforts that are going on in the APNIC and RIPE regions. Certification is very simple: a certificate shows that an ASN is assigned to somebody, and it is valid for a one-year period. You can always renew it if you still need the ASN, so you never have to renumber. You get an ASN, you start to use it, and after a year, if you still need it, you renew the certificate. If you don't need it, the certificate expires and, as time goes on, the ASN can be reused. It has one requirement: people need to check these certificates when setting up peerings. However, that's probably something that will become standard practice anyway as work on securing the routing system is deployed. To summarise: 284 ASNs are assigned per month from the unused pool, but the actual growth is only about 160. The pool will be empty some time between 2013 and 2016. There are ways to make it last longer - reclamation and certification. If people want to do this, it will last longer; if people don't, then one should start thinking about deploying 4-byte ASNs. There's a full report on the RIPE website. And with that, I have to acknowledge a couple of people who helped with the data, and a few transport companies as well. (refers to projector slide) What else - any questions? PHILIP SMITH: Any questions at all? DAVID CONRAD: Do you think there should be global consistency with regard to the ASN reclamation policies? HENK UIJTERWAAL: Well, my personal opinion would be that it would be good if this was done across all RIRs - there should be no advantage in going to one registry over another because of policy differences. But I don't really have a strong opinion on this. CHRIS CHAUNDRY: The one you said was totally unknown, early on in the talk - surely you have the path information from the RIB, and surely the peers the ASN is connected to know who the number belongs to? HENK UIJTERWAAL: You can try to find that out. Inside the RIPE NCC building, we couldn't find any data on it. 
PHILIP SMITH: No other questions or comments? OK, thank you very much. So that brings us to the end of the Routing SIG, the two sessions. I would like to thank all seven speakers we've had this afternoon for their presentations. We've finished a few minutes early, so you've got about 40 minutes now between this session and APOPS, which is the next session in this room. The APNIC social event is tonight. Those who are going, please remember the last bus leaves at 7:10 - if you're late, you'll miss out, so please be there on time. Directions for the transport are on the back of the ticket, in case you're wondering what to do. Otherwise, thank you all for coming. Thanks very much to the stenographers for their work. See you at the next Routing SIG in six months' time. Thank you. APPLAUSE