______________________________________________________________________

DRAFT TRANSCRIPT

SIG: Routing
Date: Wednesday 1 March 2006
Time: 2.00pm
Presentation: Routing update
Presenter: Geoff Huston
______________________________________________________________________

PHILIP SMITH: OK, while Geoff is setting up for his routing update presentation, I will go through the housekeeping list that I omitted to do at the start. As I was saying, if people want to ask questions at the end of the presentation, please come up to the microphone. I think there's also a travelling microphone wandering around as well if you're not mobile. State your name and affiliation, again, for the benefit of the microphone, and also I should point out that this session is being broadcast and audio-casted so it makes it more important that people use the microphone so that the people who have joined us from elsewhere can hear what's being talked about.

Afternoon tea is in the level 2 foyer area, so it's basically that way, to my left, to your right. The social event is this evening. If you've got your ticket, please bring your ticket and the details of where it is are listed on the back of your ticket. APNIC staff will collect the ticket from you when you board the bus and that will actually leave - the bus that is - will leave from level one plaza deck outside. Last bus will be 7:10, so please don't miss it.

Next item - MyAPNIC and the policy flash demo are running all day at the APNIC help desk. The help desk is available during break times - morning, lunchtime and afternoon breaks. Onsite notice board - again, it's advised to have a look at the onsite notice board on the APNIC website for any last-minute updates and so forth. There's a special session at 4pm today in meeting room 3 discussing the APNIC fee structure. That's an open session.
Anyone with an interest in that topic is welcome to attend and is invited to participate in discussions, although I'd much prefer you to come to the Routing SIG but that's up to you. That's the housekeeping. So the next presenter is Geoff, who will be giving us a routing update.

GEOFF HUSTON: I will. Thank you. Good afternoon. I seem to do these at every routing SIG, giving you an idea of what's happening inside the BGP routing table. I've got three parts to this presentation today - one is a status report, and then I'm actually looking at work based on a question that Vince Fuller asked me a few months ago that I found interesting to answer, and further observations after that. Normally I use hourly snapshots I pull from routeviews but this time I used a complete dump of the data and I must thank Stephan Millet of Telstra for assisting with some of the data used in the presentation. My disk is now full. Thank you.

Usual picture. This is 2005 January through February, the BGP prefixes. It might look like November and December were tailing off but be assured that January and February of this year, you have come back again and routing is back on once more. What does it look like? You put a line across and go, "The number of prefixes in the default-free zone across last year rose from 150,000 to 175,000 prefixes in 12 months." So life is still increasing the way it always was.

The amount of address space is kind of interesting - 4.3 billion addresses in IPv4 if you try and use them all. It started at 1.36 billion at the start of the year and finished at 1.5 billionish. Those big jumps there - there are still two /8s that appear and disappear like lighthouses. It amazes me that there are /8s that flap but there are and there they are. You can draw a line across the top of this and I've eliminated the /8s, smoothed it out and then you see pretty cleanly that the amount of address space rose from 1.36 billion to about 1.5 billion addresses.
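
[Editorial sketch, not part of the talk: the address-space figures quoted above can be checked with a few lines of arithmetic. The start and end values are the rounded numbers from the transcript; the function name is only illustrative.]

```python
# Address-space arithmetic for the figures quoted in the talk.
IPV4_TOTAL = 2 ** 32        # every possible IPv4 address
SLASH_8 = 2 ** (32 - 8)     # addresses covered by a single /8


def growth_percent(start, end):
    """Percentage growth from start to end."""
    return (end - start) / start * 100.0


# Rounded figures from the transcript: advertised space at the start
# and end of 2005.
start_2005 = 1.36e9
end_2005 = 1.5e9

print("IPv4 total:      ", IPV4_TOTAL)
print("One /8:          ", SLASH_8)
print("Growth in 2005 %:", round(growth_percent(start_2005, end_2005), 1))
print("Share advertised:", round(end_2005 / IPV4_TOTAL * 100, 1), "%")
```

A single /8 is about 16.8 million addresses, which is why one appearing or disappearing shows up as a visible step on a plot of total advertised space.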
Some seasonal variation. Some of you took holidays over the Northern Hemisphere summer and you're slightly below average but then got back to work in October and decided to add more addresses into the network. We appreciate that and thank you.

LAUGHTER

The total number of AS numbers - it's the same kind of curve. Very consistent. Unlike address space, AS number appearances on the Net keep romping through so the trend line is spot on from 17,500 up to 21,000. Somehow ASes are remarkably consistent unlike addresses or routing table entries. The ASes that keep on appearing appear almost like clockwork. It's strange.

So the vital statistics - prefixes up by 18%. Roots and more specifics - are we getting better or worse at this business of only advertising the aggregate? And the answer is no - no better and no worse. Because the number of basic root advertisements that, if you will, encompass new space, rose by 17% to 85,500 but the number of more specifics, predominantly /24s but a few others in the mix, also rose by about the same amount. So around 50% of the network is still more specifics and around 50% are basic root prefixes. The amount of addresses rose by 10% so the granularity of advertisements is getting smaller, not larger. Again, yes, you know, lots of little advertisements because the address space is not growing as quickly. The number of AS numbers up by 14%. What can I say? The average advertisement size is smaller. Address origination per AS is getting smaller. ASes are moving down the food chain.

Interestingly, the average AS path length is remarkably steady at 3.5 ASes. 14% more ASes but the diameter of the network is constant, so the density of interconnection is increasing. It's an interesting question - whether this is uniform density increase or whether there are black points where the amount of density is increasing.
In other words, are exchange points and others of their ilk actually gaining strength or is the bombing-out of the long-distance transit fibre market causing interconnection to appear across longer spans? Some work is happening, I think in Adelaide, right now around topology of the network and I'd be keen to see if they're working on that. It's interesting because the denser the mesh, the more badly BGP behaves when it tries to converge. So trying to understand if the explosions are global or local is probably an interesting thing to understand. However, on a macro level, the network is getting denser. The advertisement granularity is getting smaller. More interconnections, more specifics.

By contrast, this was v6. Similar growth except you've got to look at the numbers. The number of prefixes rose from a phenomenal 700 to a phenomenal 868, I believe. More patchy - you can obviously see that some people decided it was a bad time. In August, they mucked around with v6, and then got bored. After that, they all went home again.

Noisier still - the advertised address span. This is weird. Two big spikes and everything going down. In other words, the blue line is actually decreasing each time. Why is that? I took away the 6Bone. And now what you actually see - and this is this issue that a /20 is much, much, much, much bigger than a /32. So here are all the /32s, blip, blip, blip, /20. That's a /21 and that's a /20. It's hard to show you the growth in v6 address space apart from saying two whacking great big allocations happened that year and got advertised and a few little ones but you can't see them. There's the 6Bone. You've now got precisely four months to quit and so far each of you is quitting. Each jump down is a /24. The 6Bone is slowly being turned off. Here's that combined view now. I don't know if it's any clearer.
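
[Editorial sketch, not part of the talk: the reason a /20 or /21 swamps the /32s on an address-span plot is that each bit of prefix length halves the span. The helper name below is hypothetical.]

```python
def times_larger(short_len, long_len):
    """How many prefixes of length long_len fit inside one prefix of
    length short_len (both in the same address family)."""
    if long_len < short_len:
        raise ValueError("long_len must be at least short_len")
    return 2 ** (long_len - short_len)


# One /20 spans as much IPv6 address space as this many /32s:
print(times_larger(20, 32))   # 4096
# One /21 is half of that:
print(times_larger(21, 32))   # 2048
```

On a linear axis, a few hundred /32 allocations are therefore invisible next to a single /20.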
That's the big picture, that's the bit left after 6Bone and that's the 6Bone slowly flying off and without those two big allocations being advertised, that's what we actually have. You just can't see it. Someone's given me a laser pointer. The bit without the big allocations.

Advertised AS numbers - noisier. Probably because 500 to 600 - so there are 600 of you playing in the v6 game as far as I can see, some in 6Bone, some not. So what can I tell you about this? Prefixes up by 21%, roots by 15, more specifics by 21. Naughty, naughty, naughty. You're not meant to disaggregate in v6. Stop it!

LAUGHTER

The amount of address space went up by a phenomenal 50% because two allocations happened to be advertised and it went straight up. ASes up by 20%. The average advertisement size as a result of those two massive ones is getting enormously large and origination per AS is getting large but only because of those two factors. The path length I can't give you much of a view on. Such a small network that you can't see what the average AS path is and the interconnection degrees, I can't tell. This is a network that continues to go large with little overlays at the edges and trends really aren't there yet.

Part one. Part two - more interesting. If you were buying a router and you'd like it to live for three years inside your network doing what it's doing and it's trying to run in a default-free zone, what spec would you tell the vendor to build to? You know, how many prefixes, how many prefix updates per second? Vince wanted me to answer two questions - v4 and v6. I took the easy one and did v4. It's hard to predict v6 because it's such a small network. I can't tell. I'd like to try and have a shot at this answer in a v4 context. And for this I'm actually taking a macro view so I've taken the entire set of updates and withdrawals from AS1221 for 2005.
Because you're inside a relatively busy network, there's a whole bunch of local updates also happening from inside the network so I've basically tried to filter out everything that I don't think came from the default-free zone, so I smacked out a whole bunch of updates from there. What I'm trying to do is to see if I can relate the number of updates and withdrawals against the number of entries in the RIB. And at the same time, I'm also looking at the CPU load records from that router that was supplying all the updates and again trying to see if there's a relationship between the amount of CPU that's being used in that router on some kind of granularity against the table size. Now, if that's the case, if you can do a table size predictive model and you know the relative number of updates and withdrawals and CPU, you have a vague idea of how big the thing should possibly be somewhere in three to five years' time. So that's the methodology of trying to answer that question.

Updates per day for the year in millions - default-free zone. So I've filtered out a fair few. At the start of the year, somewhere around 300,000 updates per day, BGP updates, messages, were being caught at the router. By the end of the year, it was slightly under 600,000. That's big, an enormous amount of updates. Notice also that this is not uniform. Now the law of very, very big numbers says that, if each of you contribute a little, and there are 21,000 ASes, that line should be smooth. So the law of large numbers isn't working. The variation on that number is so big, up to 50%, that I suspect that each of you aren't contributing a little. And that's what I want to talk about a bit later. Notice that astonishing variation. That is not nice.

The number of prefixes per update message. How efficient are we at packing it into single updates? Getting worse. In other words, the update is getting closer to describing one prefix, not a bulk of them.
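
[Editorial sketch, not part of the talk: a toy simulation of the law-of-large-numbers argument. All the distributions and parameters are invented for illustration (21,000 sources each sending 0 to 40 updates a day, optionally joined by 50 "heavy" sources that burst 10,000 updates on about 30% of days); only the count of 21,000 ASes comes from the transcript.]

```python
import random
import statistics


def daily_totals(rng, days, n_small, n_heavy):
    """Simulate total updates per day from many small independent
    sources plus a few bursty heavy sources (all values assumed)."""
    totals = []
    for _ in range(days):
        # Each small source contributes a little, independently.
        total = sum(rng.randint(0, 40) for _ in range(n_small))
        # Each heavy source is either quiet or emits a large burst.
        for _ in range(n_heavy):
            if rng.random() < 0.3:
                total += 10_000
        totals.append(total)
    return totals


def coefficient_of_variation(values):
    """Standard deviation relative to the mean."""
    return statistics.pstdev(values) / statistics.mean(values)


rng = random.Random(2005)  # fixed seed so the run is repeatable
cv_uniform = coefficient_of_variation(daily_totals(rng, 30, 21_000, 0))
cv_skewed = coefficient_of_variation(daily_totals(rng, 30, 21_000, 50))

print("relative variation, everyone contributes a little:", cv_uniform)
print("relative variation, plus 50 bursty sources:       ", cv_skewed)
```

Under these assumptions the day-to-day variation with only small contributors is a fraction of a percent, so swings of tens of percent point to a small number of sources dominating the total.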
The network, in terms of routing policy and update granularity, is getting finer, and the number of prefixes varies a lot. Some folk are doing very strange things. I'm trying to understand that. Number of update messages per day - unfortunately, it's close to double, as I said, it's highly variable and it shouldn't be. The number of prefixes per update message is falling and I'm just wondering if this is actually due to this increasing use of ASes to do multihoming at the edge, that I'm actually seeing what used to be an ISP with 50 prefixes, now an AS with 20, and a whole bunch of ASes underneath it that are starting to multihome. But for some reason, the number of prefixes is falling.

And now I'm trying to understand - it seems to be that the update rate is increasing faster than the table rate. That update rate was almost double. In other words, the number of updates happening in the network is increasing faster than table size. Is there some kind of multiplicative factor going on? Or is something else happening? Some kind of thing getting larger than the routing table itself? This is not good news.

Now, maybe if I stop looking at update messages and start looking at individual prefixes. How many prefixes change per day and what are the trends? What I've done now - I'm actually looking at prefixes, these are updates, those are withdrawals, in millions. So the withdrawals, 200,000 going up to 300,000 in a day. Huge number of withdrawals. The updates of prefixes is even noisier. This is weird. I can put a trend line on the prefix update rates and, you know, yes, it has increased. Around 800,000 prefix updates. You know, the prefixes that actually got updated each day by the end of the year. Withdrawal rates, you're going to actually start to see - that is an exponential line, and even though it's noisy, there's clearly an exponential growth factor in the withdrawal rates.
So high variability and approximately exponential but at different rates. The updates are growing faster than withdrawals.

So now, can I relate that to the size of the network? The actual number of entries in the network itself? So that's the default-free zone across the year, that's 100,000, that's 170,000 and you can see pretty clearly there that that's not linear now. There really is a bend going on there. I've smoothed it out and done a first-order differential. The default-free zone in terms of number of entries in the RIB is growing faster than linear and is actually growing order n squared. If I look at it as an order two polynomial - how many RIB entries in three to five years, Vince? Somewhere between 275,000 in three years and 375,000 in five years would be a prediction. My guess at the confidence interval was about 20% so it's not that confident, but that fit isn't bad. If you're looking at three to five years, that appears to be the metric you're looking at.

Now, I've done the next thing, which is, for each RIB entry - 100,000, whatever - how many updates per RIB entry over the year? If that was linear, that would be growing at the same rate as the table itself. If it's more than linear, then the number of prefixes being altered each day is growing faster than the table size and this is the number of withdrawals per RIB entry. Is that weird? One withdrawal per RIB entry, on average, every day: either the entire table is withdrawn or there are a very small number of prefixes that are doing an awful lot of work - withdraw, update, withdraw, update. But that's one. Every single entry is updated three times every day or there is a small number of prefixes that are, you know, pushing up an awful lot of iron very quickly and that's growing very fast and, again, very noisy.
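
[Editorial sketch, not part of the talk: this is not the speaker's fit to the 2005 data, which is not reproduced in the transcript. It simply passes an order-two polynomial through three figures quoted in the talk (roughly 176,000 entries now, 275,000 in three years, 375,000 in five) to show the shape of curve those numbers imply. The function name is hypothetical.]

```python
def quadratic_through(points):
    """Return the unique quadratic passing through three (x, y)
    points, using Lagrange interpolation."""
    (x0, y0), (x1, y1), (x2, y2) = points

    def f(x):
        return (
            y0 * (x - x1) * (x - x2) / ((x0 - x1) * (x0 - x2))
            + y1 * (x - x0) * (x - x2) / ((x1 - x0) * (x1 - x2))
            + y2 * (x - x0) * (x - x1) / ((x2 - x0) * (x2 - x1))
        )

    return f


# (years from now, RIB entries): figures quoted in the transcript.
rib_entries = quadratic_through([(0, 176_000), (3, 275_000), (5, 375_000)])

for year in range(6):
    print(year, round(rib_entries(year)))
```

With these three points the curve works out to about 3,400·t² + 22,800·t + 176,000, so the yearly increment itself grows by several thousand entries each year, which is what "faster than linear" means here.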
So I can answer you at least at a gross level, Vince, on what I think will happen on withdrawal and update. Inside three years, you're going to have to cope with around 1.7 million prefixes being changed every day and, by five years, that will grow up to, you know, around 3 million, which is, you know, an enormous amount. And the withdrawal rate, similarly, should be around 1.5 million withdrawals per day. Which should keep your routers busy.

How fast would your router have to spin? Again, same kind of technique. I got from Steph the actual 5-minute and 1-minute CPU loads and across the year in question, the router had a brain transplant twice and this is a PRP2. Isn't it cool? So what I tried to do was normalise that and make everything a PRP2. So what I actually got here was this - this is what happened across the year, the one-minute load rate on the PRP2 increased by that rate and just pushed it up per RIB entry. This is growing faster than the RIB. It appears that, when I push that forward, if I'm doing a unit of one by the end of this year, by the end of five years, I'll need four times that amount of processing power to cope with that load. That appears to be the projection.

Today at 176,000 prefixes, 700,000 updates per day, withdrawals of 400,000, you'd need around 250 Mbytes of memory and need 30% of a one-gig processor. Three years' time - 275,000 prefixes, just under 1.7 million updates. Almost a million withdrawals, double the memory and 75% of that processor and in five years, you'll need a new processor absolutely. About 120%. It seems awfully low. I think it is low. You really are talking about trying to cope with peaks; convergence is about speed, rather than reliability. Got to get there faster. It's per-second peak rates, not loads, that is the problem and it assumes that BGP isn't going to change. And there's been more than enough words over the last year or two that we need to do something around securing BGP.
If you think you need to be able to do that inside five years' time, you may want to think about what exactly are the factors for the router you're going to buy in terms of security-related protocols - are we doing IPSEC in our peering or something similar, incremental workload and so on. So I would actually say that, if I was going to spec one out, I'd at least want 500,000 entries in the RIB, no sweat. I'm going to need an awful lot of adjacency RIB space. If I'm going to be conservative and say, if I do this, I think I'm OK, then it's about 6 million prefix updates per day. I think I need at least a two-gig route processor memory and probably a 5GHz processor for route processing. What was a Cray1? I think it was less than that, wasn't it?

How good is that number? What's going on here? I've got a couple more seconds. Is this uniform? I don't think so. I don't think all of you are behaving so well as that. Is this skewed? If so, how skewed? 289,000 prefixes were actually announced that were different. Now, the table is only 179,000 prefixes. So there are actually 127,000 prefixes that appeared for some period and aren't there any more. So people are leaking. And then they pull it back but there's been an awful lot of leaking. The number of prefixes that had no updates at all through the year, congratulations, were 12,640. Well done, gold star, tick, elephant stamp. Everyone else doesn't, they did some kind of update.

This is a cumulative histogram of what's going on. 50% of the prefixes contributed less than 10% of the updates. 60% of the prefixes contributed less than 20%. 80% contributed just on 20%. So the top 20% of the prefixes were pretty bad and the top 1% contributed 15% of the update load. OK. Let's name them.
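
[Editorial sketch, not part of the talk: how the shares read off a cumulative histogram like this one can be computed from per-prefix update counts. The counts below are invented toy data, not the 2005 measurements, and the function names are hypothetical.]

```python
def share_of_quietest(counts, fraction):
    """Fraction of all updates contributed by the quietest
    `fraction` of prefixes (0.5 means the bottom half)."""
    ordered = sorted(counts)
    k = int(len(ordered) * fraction)
    return sum(ordered[:k]) / sum(ordered)


def share_of_busiest(counts, fraction):
    """Fraction of all updates contributed by the busiest
    `fraction` of prefixes (0.01 means the top 1%)."""
    ordered = sorted(counts, reverse=True)
    k = max(1, int(len(ordered) * fraction))
    return sum(ordered[:k]) / sum(ordered)


# Toy data: 100 prefixes, most nearly silent, one flapping heavily.
update_counts = [1] * 50 + [3] * 30 + [10] * 19 + [400]

print("quietest 50%:", share_of_quietest(update_counts, 0.5))
print("quietest 80%:", share_of_quietest(update_counts, 0.8))
print("busiest 1%:  ", share_of_busiest(update_counts, 0.01))
```

The same two functions apply unchanged to counts grouped by origin AS rather than by prefix.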
LAUGHTER

So, if you see yourself on this list, you're on this list for one of two reasons - one, you're multihoming and you don't know how because, somehow, you managed 158,000 prefixes for the year, of which 20,000 were actually just flips in the first-hop AS, the next hop and, of course, you flap like crazy. All of these folk flapped. Some of them also re-homed as well. So have a look there. You may be there. You may wonder why you're there. If you're there and wonder why you're there, go to the tutorials.

Let's have a look at these people because some of them are systematic and some of them are night-time stuff-ups. Systematic - Hong Kong Supernet, this was a prefix that was active, withdraw, update. Green is the flap and the red is an attribute change so for a period of at least four months, this one prefix managed to generate around 1500 update messages per day for one prefix, most of which were withdraw, announce, withdraw, announce. Precisely what information that added or subtracted from the routing table beats me with a stick. Somehow, they got a clue in September. Well done.

Here's another one. This is from ICARE in Hong Kong. This is straight traffic engineering. This is Hong Kong Supernet again. They have four upstreams and they're moving prefixes around. This is one in Turkey. Someone out there in vendorland sold them something in June that they shouldn't have bought.

LAUGHTER

Because this is systematic sustained 500 updates per day and moving across multiple upstreams. This one is an interesting case. Here's another one from Turkey. They went and bought even more of it at the end of September and did even better with it. Why is this? Again, here's another one, Amphibian Media, I think they're related to the folk who do inbound route traffic engineering and here is another one - Merit. Surely they should know better. I looked for this network a couple of days ago and it's gone.
They must have seen it flapping and sent it off into routing hell because I couldn't see it a week or so ago. Phenomenal amount of updates from folk like Merit. Here's the last one again and, oh, it's our friends from Turkey. They really did buy something in June. I think they tested it in April and really turned it on in June. Shouldn't have happened any other way. And Number 10 - again, a US one, I think they're InterNAP-related. Systematic, absolutely systematic, so this is no accident.

Is it prefixes or Autonomous Systems? Look at that curve. This is the autonomous system one. Go back by 10, look at this one, so, while there might be a small number of prefixes creating all these updates, there's a tiny number of autonomous system numbers generating all these updates. The top 1%, the top 2%, the top 3%, 3% of the autonomous system numbers generated half of the updates. Thank you very much. Well done.

LAUGHTER

Here's another way of looking at it. Red is the actual number of updates. The green is just the top 50 ASes. So the top 50 ASes do half the updates. 50 people cause your router to have a problem. 50 people cause BGP to have a problem. Let's name them. Here's the first one, we met them before, the folk who bought this wonderful thing in June, AS 9121, 206,000 rehomes since June. If they'd run it for a year, they'd be effectively off the planet and so on. You could see a lot of multihoming in 17557 and 721. 721 probably should know better and an awful lot of flaps on all of them.

Here's the signature for our friends in Turkey of the total number of prefixes they originate and, yes, they actually tried it in February, liked what they saw. I think they bought it in April, turned it on hard in June, took a holiday over August, came back to work and then something blew up in December. This is bad stuff. I actually had a look at them.
They seem to have six upstreams and my guess is that they're using a tool like OER and doing more specific juggling to try and get their incoming traffic evenly balanced. And my suspicion is that they could stop all this if they bought more bandwidth. Because what's going on as far as I can see - and you can look them up in BGPlay and have fun with these guys but you will find that the routes flip across all six at 4:00 in the morning. I know they're hardworking ISPs but I don't think anyone is up doing massive config changes as fast as they can type at 4:00 in the morning. Something is happening doing bad things globally. More tutorials and a word of caution to the vendor - you shouldn't have sold it to them.

The next one, Korea Internet Exchange, number two. This is not traffic engineering. This is some pathological condition that lasted for a while; they caught it and fixed it. MCI Europe - traffic engineering. Pakistan Telecom. A combination of the two? I can't tell. These enormous spikes are weird but there's a background of activity. So that's probably harder to put a signature on. Ah, Hong Kong Supernet, our friends. Whatever it was stopped in September and we're all very grateful. TPG here in Australia, again, you know, they're number 6 on the list, really high spikes intermittently. Can't figure it out. Yet. 721, our friends in the US military. Something must have happened in June somewhere that required a huge amount of updates in BGP. Hong Kong Supernet has two ASes and here's the second one. They just keep on having more fun than you can imagine. DACOM in Korea and the Korea National Computerisation Agency - those are individual problems that happened over a few days.

We're not seeing the level of uniformity in update rates that would make my answer to Vince any good at all. I'm only looking at 100 ASes, I believe.
And I'm looking at 100 ASes whose behaviour pattern is pathologically bad and they contribute the overwhelming bulk of updates now and have done for at least a year and I suspect there are two reasons. One - there's a whole bunch of automated inbound traffic engineering software that does route prefix juggling that is stuffing up the entire network with millions of updates and withdrawals. Literally millions, in fact hundreds of millions. That sustained consistent update rate is killing us. If you want to do inbound traffic engineering, consider buying more bandwidth instead. The rest of us will thank you. I don't think the routing system can withstand that kind of abuse. There's something more than MED oscillation. You shouldn't see withdrawal. It just simply flips. What I'm seeing in these unstable ones is a massive number of withdrawals with moving around in the prefix. We have isolated incidents of unstable configurations that are causing massive load rates.

So, as I said, the overwhelming number of updates are generated by an underwhelming number of sources. The uncertainty in the trend models I gave you is extremely high. That means that, if you really want me to give you an answer, Vince, on what you need to buy in three to five years, my answer is I don't have a clue. Thank you.

APPLAUSE

RANDY BUSH: Geoff, two things. Why did you throw out the internal route fluctuations when the router is going to have to handle them?

GEOFF HUSTON: This was the question as distinct from the investigation.
The question was actually asked, "If I was buying a default-free zone router," and when I start factoring in AS 1221, how do I sort of know that that's a typical kind of pattern? So I thought the way to do this - and AS 1221 doesn't have an awful lot of upstreams so, if you look at its topology, it's actually got a couple of domestic peers and you see this one upstream - I'm kind of looking truly at a default-free zone single path and that seemed a good baseline to take for these measurements.

RANDY BUSH: This is what I need with a single router in the default-free zone that doesn't connect to anything else.

GEOFF HUSTON: I'm talking about the floor, not the ceiling.

RANDY BUSH: And you don't want to know what router to buy. It's what router to sell, since Vince is not a vendor. The other thing is those /8s you saw flapping which you don't understand - I gave a lightning talk at the end of the last NANOG, where we discovered that there are a number of /8s being used by spammers. They announce the /8, they hit you with spam from the dark space in it, from the unused space in it, and withdraw the /8. And that's those flapping /8s.

GEOFF HUSTON: Thank you. We should name them. I'm not sure it will do any good because they're stolen.

RANDY BUSH: We did.

GEOFF HUSTON: You named them. But they're stolen.

RANDY BUSH: Right. They cover some sparse allocations and what they're doing is using it for agile spam generation. They announce the /8 and go, "ping, ping, ping, ping," and turn it off.

GEOFF HUSTON: Interesting. Thank you.

PHILIP SMITH: Any other questions for Geoff? If not, we should move on. We're a little bit behind in time. Next up we have Randy.