Routing SIG, Thursday Aug 21, 4:05-5:30 pm PHILIP SMITH: I think we will make a start to this. It's just after four o'clock. Welcome to the routing SIG. I've put the agenda up on the screen there so you can see what we're going to talk about. But, before I actually make a start to this, I have some announcements I would like to make. First one is to thank Korea Telecom for being platinum sponsor for this meeting and, secondly, the MyAPNIC demo is on all day outside the meeting room. I'm sure you've seen it sitting out there just behind the door at the rear of the room. Finally, also, check the onsite notice board for announcements. So that's on the APNIC website if you have a look there for current announcements. The URL is there, if you'd like to check the status of the onsite notice board. My usual virus or worm status. The system with this IP address - 221.143.6.120. That user is being devious. Because I put up the address at lunchtime, he got a different address. You can't escape. I am watching all the time. You will get your Internet access back when you get your laptop updated at the door. You can change your address if you like. I will block outgoing access. It's irresponsible to come in here with infected PCs. You shouldn't connect to a public Internet if you can't set up your PC properly. That's my personal opinion. The address is 221.143.6.120. OK. Outstanding business that we had to close off from last time around. The first one was the routing SIG charter. We discussed the routing SIG charter at the APNIC meeting during APRICOT in Taipei. This was basically the charter that the routing SIG examined - important issues of Internet routing and policy in the Asia Pacific region and globally. These issues would include Internet routing table growth, de-aggregation of provider blocks, routing stability and flap damping, routing security and the Internet Routing Registry. That's what the meeting agreed. 
After the meeting, I put this on to the routing SIG mailing list, received no comments whatsoever, so this is now formally adopted as the charter of the routing SIG. Also for all participants, this is more my personal request than anything I have to do but I would like you all to speak slowly and clearly as English is not the native language for the majority of meeting participants. Also, if you're going to ask questions or make comments, please use the microphone. We have microphones running around the room for you. Also try and use simple and clear sentence structure rather than getting too complex. The other open action item I forgot to put on the slide is that, at the last SIG meeting, it was proposed that Randy Bush be my co-chair for the routing SIG. There were no objections to this at the meeting. Again, I posted this to the SIG mailing list, no objections there either so Randy is now formally my co-chair. Welcome. OK. So that's the preliminaries. Now let's get on with the agenda we've got here. We've got five presentations so we don't have a great deal of time to go through those. Speakers have a maximum of 20 minutes each. Hopefully we can fit all this in. The first one is from Geoff Huston which is about hunting the bogons. GEOFF HUSTON: Thanks, Phil. OK. This is about hunting the bogons. Let me explain a little bit about this. Firstly, it's a research activity that's being supported by APNIC and what I'm trying to do is to understand to what extent we see advertisements in the routing table of address blocks or autonomous system numbers that aren't registered anywhere as being allocated. Those are what I would call bogons. The most common example that one finds is leakage of RFC 1918 private address space. Most commonly, what you'd see is instances where network 10 appears from time to time in the routing table.
But, if you're really trying to understand what is allocated and what is not, you've actually got to wade through a reasonable amount of data. And this is the list of sort of primary sources here. The first primary source is actually the IANA registry report that indicates what blocks, /8s, have actually been allocated to the RIRs for use in further assignments and what blocks are reserved and are not to be used. So, if a block is marked by IANA as reserved and you see it in a routing table, that to my mind is a bogon. The next level down is to actually have a look at assignments that have been undertaken by the RIRs, by looking at the stats files. A stats file is actually a summary of all of the allocations that the RIR has performed. Those files are now updated periodically. This slide pack is slightly old and, as of the last couple of weeks, all four RIRs update that stats file on a daily basis so now you have an accurate view of what address blocks have been allocated. If it's not in a stats file, it's getting to be pretty dubious but the stats files unfortunately are not complete so what I've had to do to get a reasonable set is to include Whois data if I can dig it up to get a second set of network blocks that are listed in Whois and that appear to be allocated or assigned in some form or fashion. I've been asked to swallow the microphone! Firstly, having a look at the IANA registries, there are two primary sources for AS numbers and address space. And, actually, having a look at them, there are some problems in trying to analyse that data. It's not actually a very clean registry ... path as you'll find that some blocks are sort of assigned as reserved but then have a user listed beside, such as 36/8. Is it really still Stanford's or is it reserved? I don't know. And blocks 49 and 50 are marked as returned to IANA but still listed as a Joint Technical Command. I'm not sure if they're reserved or not.
I can't understand why the top-level blocks, 240 and upwards, are marked the same as the lower-level blocks. It's actually quite difficult to make a clean judgment from the IANA data as to what is unicast, assigned and, you know, available for use and what isn't. I would certainly appreciate a slightly more consistent IANA registry file. There are also some listings in RFC 3330 that, actually, I don't understand. The one that I find rather strange is actually 223.255.255.0/24. It's marked as reserved but subject to allocation. I actually thought that most of the reserved blocks are subject to allocation and I'm not sure if that's a bogon or not. There's a URL at the bottom, www.potaroo.net/IPAddrs. You can make some sense of it there. The next source is the stats files which are all of the allocations that the RIRs have performed. They're updated every day. The problem is that it's not quite all the allocations. The early ones, which are part now of the Early Registration Transfer process, ERX, are only listed in the ARIN area and actually aren't in the stats files cleanly. The Whois data actually contains a lot of additional records, around about 6,000 or 7,000 at the moment. Some of them are clearly nonsense. If you look up RIPE's Whois database, you'll find an entry for a /24 drawn out of network two, obviously nonsense. When you start parsing the Whois data, it's not clear that you're seeing real addresses all the time. Also in RIPE, they produce issue files and it would be good if the issue files were the same as the stats file. They're not. So I actually go to Whois and have a look as well. There are some references there to what I use. Again, OK, but here's these blocks that I'm finding in RIPE. If someone is here from RIPE, have a look at Whois 2.6.190.56/29 and you will get an answer. If the Whois is meant to be an authoritative listing of space, then there are some anomalies here.
I'm also seeing the same from the other registries to some extent or another but I am assured the RIRs are working on this. I'm also seeing things in the BGP table that actually have no visible allocation whatsoever and, if you wander over to the US military Whois server, you find a whole bunch more of both addresses and AS numbers. For example, the block of autonomous system numbers from 1451 to 1533 appear to have been allocated over to the US military but there is absolutely no record anywhere that I can find that actually says this is what happened. So I'm making the assumption that the military aren't lying. So on that assumption and on a whole bunch of other assumptions, what's a bogon? If it's not listed in the IANA - sorry if it is listed in the IANA registry as reserved, it's a bogon. If it's not listed in a collection of RIR stats files, then it's a bogon and if it's being advertised, then ... really there. What I'm seeing could be a bogon. What I'm also seeing could be an inconsistency in the data. What I'm not seeing, what I'm not reporting on, are folk who intentionally steal address space, invalid allocations of address space. So, if someone is hijacking address space, I can't see it. I don't do any checks based on origin AS or anything else. Hijacking is not part of this report and I don't report on it. Other folk may be working on that. I'll go live in a second. You'll actually find the bogon listing in www.cidr-report.org. It's updated every hour and includes a list of possible bogons. I first started this in May and found 54 autonomous system numbers that looked dubious. There has been some very good work. It's come down to 17 at the start of July, so there has been some work in cleaning up autonomous system numbers. This is what I'm finding right now. The US military, the OD network, some stuff out of Harnet as announced by 3362. I can't find any record for AS 3363. It doesn't exist. 
I can find nothing for AS 4665 but it does appear in the BGP routing table. The other thing I'm looking at is IP addresses. When I started this in May, I found 264. It's down to 173 so it's getting a little bit better. So what? Why am I doing this? Addressing is important. It's probably why you're here and, in terms of trying to make the network work properly, the integrity of the network works on uniqueness of addresses, but uniqueness actually depends on matching records to reality. So that what we want, what all of us need if we're going to rely on the Internet as being a reliable system, is that what is routed is real and, if it's not real, it shouldn't be in the routing table. So, on the list that I'm about to splash up, if you find yourself there, what should you do? If you're listed, then figure out why and the first thing to do is to check your own records to check that you really do have that autonomous system number or ISP block being assigned to you and continuing to be assigned to you. Check your own records. And then go back to your own RIR and consult with them as to their records. Because, ultimately, there are only two ways to get off this report and sending mail to me isn't one of them. Either stop advertising what you don't have. It's a very clean way of getting off. Or work with the RIR to update their data to reflect reality. IANA and the RIRs are certainly working on resolving the inconsistencies in all this and, as a more general mechanism, a web page is fine, but it would be good to have tools to allow this to happen interactively. So let's have a look at the CIDR report and see what we can find in the current listing of bogon addresses. I may have to up the size of that font. So, here are some of the addresses that I'm seeing that I can't find any valid allocation record for. 
The first one is being routed by AS 11492 and they're described in the database as CABLEONE and here's the block showing that advertisement. I can't find any records for that. As you see, in terms of address space, there is a reasonable amount of it as we keep on scrolling down and down and down. So go to the web page and have a look yourself to see if you're listed. There's no shortage of these things. Secondly, the list of autonomous system numbers that I can't find. For example, I can find no record for autonomous system 1495 actually being allocated. Yet AS 668, which is this chap here, is announcing that address. So there is the list of announcers and these are the bogons that I'm seeing. This one here is interesting. It's actually multihomed, announced by both of these ASs but I can find no record that that AS is actually validly in the system. So, yes, please have a look. If you find yourself listed there in that resource, please check your own records to make sure that you actually do have the autonomous system or address space that you're advertising and then make sure, if you really do have it, that your RIR knows that you have it and that you can get yourself listed back into the stats files. And that is the end of the report. Any questions or comments on that? BILL WOODCOCK: As somebody who used to own one of the ISPs, one of the ways that people get on there is by customers or people representing themselves to be customers, returning blocks to RIRs, perhaps without authorisation and it's really unfortunate that we've had all the hijacking problems lately because there were a lot of very good database cleanup efforts which were under way. That's pretty well stalled all the cleanup because a lot of the cleanup was sort of case-by-case decision-making, RIR staff having to look closely and make a judgment call and, of course, all of their freedom of decision-making has gone out the window with hijacking.
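The bogon test from the report (a prefix is a bogon if IANA lists it as reserved, or if no RIR stats file covers it) can be sketched in a few lines of Python. This is a minimal illustration, not the tool behind the CIDR report: the two sample lists and the function names are assumptions made for the example, and a real check would parse the IANA registry, the RIR stats files and the Whois data.

```python
import ipaddress

# Illustrative sample data only (assumption): real inputs would be parsed
# from the IANA registry and from the daily RIR stats files.
IANA_RESERVED = [ipaddress.ip_network(p) for p in ("10.0.0.0/8", "240.0.0.0/4")]
RIR_ALLOCATED = [ipaddress.ip_network(p) for p in ("202.12.28.0/22",)]


def covered(prefix, blocks):
    """True if some block of the same IP version contains the prefix."""
    return any(prefix.version == b.version and prefix.subnet_of(b) for b in blocks)


def classify(prefix_text):
    """Apply the two tests from the talk, in order, to one advertised prefix."""
    prefix = ipaddress.ip_network(prefix_text)
    if covered(prefix, IANA_RESERVED):
        return "bogon: IANA reserved"
    if not covered(prefix, RIR_ALLOCATED):
        return "possible bogon: no allocation record"
    return "ok"


if __name__ == "__main__":
    for advertised in ("10.1.0.0/16", "202.12.29.0/24", "2.6.190.56/29"):
        print(advertised, "->", classify(advertised))
```

As in the report, the sketch looks only at whether an allocation record exists. It does not check the origin AS, so it cannot detect hijacked address space.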
GEOFF HUSTON: So what you're saying is that, in the same way that ISPs have an issue, when a customer comes along and says, "I have some address space, please route it," how can you tell? It's quite difficult. You're suggesting that sometimes address space gets returned to a registry from folk who don't own it and the registry conducts the return and pulls it back into its allocated pool. BILL WOODCOCK: Yes. GEOFF HUSTON: I can only agree but the issue is how do we find that out? This looks in the routing table and says, "I can see an advertisement and, no matter how I search, there is no allocation information that encompasses that." BILL WOODCOCK: I'm not disagreeing with your tool. I'm just pointing out one of the reasons for the problem. PHILIP SMITH: Any other questions at all? No? OK, if not, thank you Geoff and we'll move on to your next one - IPv6 CIDR report. GEOFF HUSTON: This is actually a very quick report and it's more to let you know it's there if you want to use it. This is the URL of the CIDR report structured for IPv6 and what it's trying to do is much the same as the CIDR report does for IPv4. It's trying to track the aggregation behaviour inside the routing table and provides an analysis of the v6 space in terms of the combined RIR address space and the current 6BONE address space. The way it works is that it takes a full snapshot of the IPv6 routing table as seen from AS 1221 every hour and then analyses that table to have a look at address ranges and aggregation possibilities. The overall picture with v6 is certainly the case that there's not an awful lot of aggregation that could be performed on the table. Currently, the v6 routing table as I see it from AS 1221 has around 500 entries and, if I do various forms of aggregation, and there are various forms of aggregation that can be done, you can take the full prepended AS path and only aggregate when the paths of both of the candidate blocks of addresses match precisely. 
Or, a slightly looser condition, you can remove all the prepending and just look at AS paths compressed and see if you can combine. Or, finally, the most vicious form of potential aggregation, if two adjacent IPv6 address blocks share a common origin AS, then you can suggest they be aggregated. So, what you find here are four colours - and it doesn't look quite obvious because the second and third are at almost the same data point. The red is the routing table. The bluish colour - let me point to it - is actually using AS path compression and the bottom one here is using just AS origin itself. So overall, the picture is right now there's very little fragmentation of the report itself. I'll quickly go through the report. The number that you get is actually quite high. The largest number of prefixes announced by a single AS. Because the view is internal to AS 1221, you end up seeing all of AS 1221's detail. I'm looking for the largest address span announced by any AS, which currently is VERIO, which has announced a /23 in v6. You can look at some plots of that data, if that's what you want to look at. The next part of the report may be hard to read - it's a very small font. It tells you the ASs that actually could do the most good in aggregating. Telstra could get down from 23 to 3. Because it's an internal view, that's not possible. From outside AS 1221, I would hope that you only actually see three prefixes. I'm also getting an internal feed from PCH so, again, even though PCH is listed second here, I think that's actually not a very good data point. What it's showing is that there's very little aggregation that could be ... format on the v6 address space at the moment. The other part that may be interesting for people who are actually tracking the adoption of IPv6 is actually listing who is adding prefixes into the routing table each week.
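The loosest of the aggregation rules described for the IPv6 CIDR report, combining address blocks that share a common origin AS, can be sketched as follows. This is a hedged illustration rather than the report's own code: the function name and the sample routes (documentation-style prefixes and private-use AS numbers) are assumptions, and the stricter variants that compare full or compressed AS paths are not shown.

```python
import ipaddress
from collections import defaultdict


def aggregate_by_origin(routes):
    """Group prefixes by origin AS, then merge adjacent or covered blocks.

    routes: iterable of (prefix_text, origin_as) pairs.
    Returns a sorted list of (prefix_text, origin_as) after aggregation.
    """
    by_origin = defaultdict(list)
    for prefix_text, origin_as in routes:
        by_origin[origin_as].append(ipaddress.ip_network(prefix_text))

    aggregated = []
    for origin_as, networks in by_origin.items():
        # collapse_addresses merges sibling blocks and drops more-specifics
        # that are already covered by a larger block in the same group.
        for network in ipaddress.collapse_addresses(networks):
            aggregated.append((str(network), origin_as))
    return sorted(aggregated)


if __name__ == "__main__":
    table = [
        ("2001:db8::/33", 64500),
        ("2001:db8:8000::/33", 64500),
        ("2001:db8:1::/48", 64500),
        ("2001:dc0::/32", 64501),
    ]
    before, after = len(table), len(aggregate_by_origin(table))
    print(f"{before} entries could be announced as {after}")
```

Comparing the table size before and after gives the kind of "could get down from 23 to 3" figure quoted for a single AS.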
So there is someone from Portugal, someone from Demon in the Netherlands, someone from SpeedKom in Germany that have added prefixes in the last seven days. Someone has changed from two prefixes to four in Belgrade, Yugoslavia and Mojo Networks has withdrawn the only prefix that it was announcing. So, on that scale, you can see from day to day who is coming and who is going. You can see in aggregate form what is going on. In terms of total activity, most of the activity is around the /32, /48 level. Yes, there are /48s in the global network and there are /64s in the global network, or there were. It's now gone back down to zero. Some of them slip out globally. I'm only seeing one bogon in this table and it happens to be an autonomous system number. It's a private one coming out of PCH. It's not real simply because the period I'm having ... the PCH exposes internal detail that you probably wouldn't see. PHILIP SMITH: Thanks very much Geoff. OK, next up, we have two presentations on the IRR basically. The first one is a survey of utilisation of IRR objects by Kengo Nagahashi and after that we have improving the reliability of the IRR database from Masasi Eto. KENGO NAGAHASHI: I'm from the University of Tokyo. This is the agenda for this presentation. First, we talk about the background of this research. Second, we talk about the goal to achieve. Third, we point out related work about this research. Finally, we describe our approaches. First, background: what benefits does the IRR offer? There are many benefits to using ... getting contact information. Second, router configuration, such as some applications. There are many benefits in IRRs. Firstly, how many IRR objects are registered? Is all routing information registered in IRR? So, our goal is to understand the divergence between IRRs and BGP prefixes. So our goal consists of three parts. One is how many prefixes are registered in IRRs? Second - what are the differences between IRRs, such as RADB, RIPE, APNIC or other IRRs?
So there's many considerations such as region, history, operation. Third - is IRR very practical for router configuration? So these are the three items of our goal to achieve. And there is a couple of related goals in here. We point out the RIPE RRCC. The point of similarity with us is comparing with BGP routing table and IRR database. So we are also using BGP routing table, IRR database. The difference point is we analyse RADB, RIPE, APNIC, JPIRR database and also unified database, which I will describe later. So our approach is how to match IRR and BGP routing table? There are two ways of matching method. One is exact match. Exact match is very straightforward method. That just means the IRR prefix and the BGP prefix is identical. The second point is the best match method. The best match is the collation of networks. If IRR registers this prefix and BGP announces this prefix, then it is correction network and is a match. So this we apply two matching methods to our BGP routing table. Also there is eBGP multihop from two ISPs. So this is a summary of IRR database. IRR database number of route objects as of 2003/08/11. RADB is here, RIPE is here, APNIC is here and JPIRR is here. So, using the unified database. So what is unified database? Here is a definition of unified database. What is unified database? Unified database is a combination of RADB, RIPE, APNIC and JPIRR. But there are a couple of duplicated objects. So we removed duplicated records. What part of duplicated I will describe later. So why unified database is needed? So, as you know, routing information is worldwide spread but, in IRR route objects, this is regional spread. So ideal database that covers all regions. So that is unified database. So we make unified database for this research. So unified database is … number of duplicated objects. That is the exact match method. So RADB and RIPE and APNIC and JPIRR. 
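The two matching methods can be illustrated with a short Python sketch. The talk does not spell out the best-match rule precisely, so the sketch assumes that "best match" means an IRR route object covers the BGP prefix (identical or less specific); the function names and sample prefixes are likewise assumptions.

```python
import ipaddress


def match_type(bgp_prefix, irr_prefixes):
    """Return 'exact', 'best' or 'none' for one BGP prefix.

    'exact': an IRR route object has an identical prefix.
    'best' : (assumed meaning) an IRR route object covers the BGP prefix.
    """
    bgp = ipaddress.ip_network(bgp_prefix)
    irr = [ipaddress.ip_network(p) for p in irr_prefixes]
    if bgp in irr:
        return "exact"
    if any(bgp.version == r.version and bgp.subnet_of(r) for r in irr):
        return "best"
    return "none"


def match_ratios(bgp_table, irr_prefixes):
    """Exact-match ratio, and best-match ratio counting exact matches too."""
    results = [match_type(p, irr_prefixes) for p in bgp_table]
    exact = results.count("exact")
    best = exact + results.count("best")
    return exact / len(results), best / len(results)


if __name__ == "__main__":
    irr = ["192.0.2.0/24", "198.51.100.0/24"]
    bgp = ["192.0.2.0/24", "198.51.100.128/25", "203.0.113.0/24"]
    print(match_ratios(bgp, irr))
```

Under this reading every exact match is also a best match, so the best-match ratio can never be lower than the exact-match ratio.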
So the total number of route objects in the unified database is the total number of route objects minus the duplicates, which is 92,500 records. So we apply the unified database to the BGP routing table. So here is a snapshot as of 2003/08/11. The exact match ratio is 47%; that is, 26,149 of 54,663 matched. And about best match, the ratio is 94%. That means, this ratio. Also, this is a summary - total summary of unified database, RADB, RIPE, APNIC and JPIRR. So this is also a snapshot of unified database. The difference with our picture is the Y axis is the number of prefixes. This is exact match and this is best match. This is about RADB unified database. The distribution of RADB is here. So, as we talk about best match - utilisation of best match is very high, 94%, about /24. Exact match is low - 47%. So best match is indeed very high, so are we happy using best match for routing configuration? Is best match indeed still valid? So, we need to investigate validation of Origin-AS. For validation of Origin-AS, we are focusing on only RADB and RIPE. The total number of best match prefixes in RADB is here, correct origin is here and incorrect origin is here. So the incorrect origin ratio in RADB is 65%. In RIPE, total number of best match prefixes is here, correct origin is here, incorrect origin is here and the incorrect origin ratio in RIPE is 34%. So why is there a high ratio of incorrect origin data in RADB? So one cause is obsolete data, for example, invalid origin at RADB is this one. So this is conclusion. So first of all is how many prefixes are registered in IRR? The unified database on average in best match is very high - 92.4%. Second goal is what difference in IRRs such as RADB, RIPE, APNIC? So, as I already said, RADB stores many unmaintained objects; in contrast with this, RIPE stores more maintained objects than RADB. We can't understand why incorrect origin ratio about this conclusion.
For example, RADB incorrect origin ratio is 65% and in RIPE incorrect origin ratio is 34%. So, final goal is - is IRR practical for router configuration? Current accuracy of RIPE IRR is high. Therefore, it is relatively practical to make router configuration using RIPE authoritative IRR objects. Future - future investigation is still needed to clarify these differences between RIPE and other databases. You can see some statistics on this site. This is the end of my presentation. PHILIP SMITH: Are there any questions at all? No? Everybody is very quiet this afternoon. OK, thank you very much. Next presentation is Masasi Eto. MASASI ETO: Improving reliability of IRR database. Prefix validation using IRR database. Improvement of consistency among AS policies on IRR database. Our goal is improving reliability of IRR database. So more widespread use of IRR. Prefix validation using IRR database. So, one of the severe problems in interdomain routing is hijacking. Why? The reason for this problem is one AS propagates an invalid origin prefix. For example, in this figure, but AS 4133. In this AS 1 and AS 2. So, countermeasure approach. One of the approaches is to authenticate the prefix in the BGP update. BGP routers exchange certificates. There are several candidates, such as S-BGP, soBGP. BGP holds over 120,000 prefixes. So it takes a long time to deploy. So the motivation is to check a correct prefix in a lightweight and simple way. What do we need to check? To identify invalid origin prefix. To use certificates is too heavy, same as S-BGP, soBGP. How to verify it - we are using database. Our approach is first - router sends download request to database. Our second is response of prefix/origin-as pairs. So example one and example two: Just checking. So, we need a simple protocol. Download: router requests download from DB. Frequency is once a day. The second is response. Database responds to router. Response is prefix/origin-as pairs in database. Problems to be solved?
Future work - to hold 120,000 prefix/origin-as pairs is overhead. Utilisation of IRR - all entries are registered in IRR database. Duration of update - is once a day too long? GEOFF HUSTON: Do we want to take questions now? PHILIP SMITH: Should we take questions now? OK. Let him finish the presentation first and then we'll take questions. MASASI ETO: Generate router configuration from routing policy registered in IRR with RIconfigure. However, there are many inconsistencies in the database. As a result, when we generate the router configurations from IRR database, the connectivity between them will be lost: IRR inspects only policy's syntax. To find out inconsistencies systematically, we need to check inconsistency of the information from other peers. This is an example of inconsistency of importing. In this figure, S 1 and S 2 are there. S 3, 4 and 5. This policy, which is configured to S 2, 3, 4 and 5. Then on other hand... in this policy, this is missing by accident: S 1 couldn't establish connectivity with S5. Next, this is an example of inconsistency of exporting. AS 1 and AS 2 are under the same. In this case, AS 2 according to the contract. However, in the policy of AS 1, to import is missing by accident. As a result, AS 1 couldn't establish the connectivity with AS 5. Classification of inconsistencies. In this research, we found out all the inconsistencies systematically. To examine how many inconsistencies there are in our database. And to prevent increasing inconsistency with checks, which consists of stuff, and it prevents increasing inconsistencies. Database checker inspects how many inconsistencies exist on unified IRR database. We're inspecting the inconsistency. There is an example of query to examine the policy. Users will see this in here. If policy checks any inconsistencies, it takes 30 seconds for inspection. This figure shows how many inconsistencies are on the policy of each AS. We have found that 55.8% of AS has at least one inconsistency. 
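The kind of cross-check described here, comparing one AS's registered policy against the policies of its peers, can be sketched as below. The talk's own classification of inconsistency types is not in the transcript, so this is a simplified illustration under assumed names: each policy is reduced to the set of peers an AS imports from and exports to, and only two mismatches are reported.

```python
def find_inconsistencies(policies):
    """Report import/export mismatches between registered AS policies.

    policies: {asn: {"import": set_of_peer_asns, "export": set_of_peer_asns}}
    Returns a list of (asn, peer, problem) tuples.
    """
    problems = []
    for asn, policy in sorted(policies.items()):
        for peer in sorted(policy["export"]):
            peer_policy = policies.get(peer)
            if peer_policy is None:
                problems.append((asn, peer, "peer has no policy registered"))
            elif asn not in peer_policy["import"]:
                problems.append((asn, peer, "peer does not import from this AS"))
        for peer in sorted(policy["import"]):
            peer_policy = policies.get(peer)
            if peer_policy is not None and asn not in peer_policy["export"]:
                problems.append((asn, peer, "peer does not export to this AS"))
    return problems


if __name__ == "__main__":
    # AS 5 forgot to register an import from AS 1 (hypothetical data).
    registered = {
        1: {"import": {5}, "export": {5}},
        5: {"import": set(), "export": {1}},
    }
    for asn, peer, problem in find_inconsistencies(registered):
        print(f"AS{asn} -> AS{peer}: {problem}")
```

A syntax-only check of each object in isolation would accept both policies in the example; the mismatch only shows up when the two are compared.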
The detail of the inconsistency is shown in this table. It shows the level of each inconsistency. In this table, inconsistency number 3 and number 4 have 36%. Here, AS existed in IRR database. Future work - we are going to deploy a policy checker on JPIRR and collect practical data. In the future, we will start a service to notify result of investigation to JPIRR users periodically. Thank you for your listening. PHILIP SMITH: Any questions? GEOFF HUSTON: Was it proposed to load the entire IRR database on every router or was it merely the prefix/origin-as pairs? We go up a slide or two. More. More. Right. So, as far as I can see, neither of these sort of approaches seem practical to me. Having every router that's doing external peering requesting a download from the database all the time doesn't strike me as a reasonable approach here, even only once a day. Nor the router issuing all these requests to the database. I'm not sure that I understand that just doing prefix-origin AS matching really gains you an awful lot in securing integrity of the routing system. I was kind of wondering are there other approaches that might be more scalable. Because what you've suggested here is simple but I'm not sure that it scales? KENGO NAGAHASHI: What is better way to improve this kind of quality? Do you have any idea? GEOFF HUSTON: To my mind, I prefer the approach of DB. Have information injected that you can trust. And then have its transportation within the routing system undertaken in such a way that it is not corrupted. The approach that you're suggesting here is kind of like half locking the door. It's still unlocked. And security doesn't really get answered that way. So while I understand the intent of what you're trying to do, in this particular space, I suppose I'm offering the perspective that realistically we do need to go down a path of S-BGP if we want a routing system that actually carries authentically and validly, carries.
KENGO NAGAHASHI: This intention is very simple. So this is what I try. GEOFF HUSTON: The initial observation that you make is contents of most of these routing databases are extremely badly maintained. And part of the reason why they're badly maintained is that there's no motivation for operators to make it accurate. Why should an operator spend money, time and money, maintaining a database that isn't used? So, it's sort of a chicken and egg situation, that if it's not used, it's not going to get maintained. If what you're suggesting in terms of using it is only a small part of what needs to be addressed in a secure BGP environment, then you haven't got over this barrier of making it worth my time to maintain my data. So I suppose I'm suggesting here that we're yet to see in the operator community a strong motivation why you would maintain the integrity of IRR data and the only reason why I can see it gets maintained is that if an RIR looks at that data in conjunction with an allocation request, then I'll update the data about a second before I launch the request and I'll leave the data go rotten until I next need to talk to the RIR and that means most of the data is not maintained. KENGO NAGAHASHI: Any other questions. If not, thank you very much for your presentations. RANDY BUSH: What you've seen so far are some measurements of the static data in the IRR and Geoff comparing the IRR to what's actually announced on the network. We're doing some routing research and trying to look at BGP as it performs in the network and comparing what we think BGP is to what we're getting. We believe BGP works by an announcement and a withdraw coming into some router and then the BGP mesh - this router announces and it withdraws and says, "I'm making an announcement and withdrawing." What we see in the world is that announcement and withdraw come in and this router says all sorts of things - a lot of noise, OK? And we're trying to understand how much, why this happens? 
Is it due to problems in the BGP design? Is it due to problems in router implementations of BGP? Is it due to configuration itself? One of the ways we're looking at this is we have these things we call BGP beacons, which is a router that announces a prefix into the global Internet and then we can watch that from other places. And we know the specific prefix it's announcing. This is an example of one beacon. And we know when it is announcing and withdrawing that prefix. It's doing it at well-known times. So that beacon is going up for two hours, the announcement, dropping it for two hours, announcing, withdrawing, on a fixed schedule of an NTP timer, etc. This is a single-homed beacon. We have a multihomed beacon that has a more complex schedule. It announces to two ASs, at midnight it announces to one. At 2 o'clock it goes to announcing two, at 4 in the morning, it announces one, at 6 in the morning, it announces 0. It's connected to two ISPs and it's simulating a circuit to one of them going down, both of them going down, etc. We have instrumented one large ISP, a global ISP - I can't tell you who it is - this is over multiple months and we have measurements coming from all their routers, all routers that peer with other ISPs and we're injecting a beacon in Seattle. That happens to be in my rack so it's in Seattle and we're watching it at all the edges of that ISP. We've actually relayed more than one ISP but we're only going to look at this one today, OK? And what we actually were hoping for - we were looking for some other measurements about other things. But there's an old saying that the sound of discovery is not "eureka!", it's "oh my God". This is the fourth "oh my God" we've hit in this. Watch for the notation - 2003, July 1, at 2000, the router made an announcement and the Seattle router where we're measuring saw that one announcement from AS A, in other words, ISP 1, it saw the announcement this is the AS the beacon is in - 3130 is the beacon.
If the beacon is multihomed, this announcement says it went from no announcement to an announcement to ISP A. This says, "We switched from A to B and we saw the withdraw of B and we see the announcement of B." Pretty simple, pretty quiet. This is right next to where the beacon is. Remember the beacon was in Seattle and we're measuring it, we're looking at this router to see this announcement. We go to Chicago for a simple announcement of just raising the beacon to one ISP. We see four announcements and what's happening is an oscillation between four different nodes. We want to know why it's doing the oscillation, why is it doing this? Here we see much more complexity. Here we're going from no announcement to announcing to ISPs. We see 41 events - 39 announcements and two withdraws in the middle of them. This happens in 26 seconds from the announcement. The announcement starts at 1300. At 1300:26 we see the last announcement in the sequence. They don't charge extra for the announcements but the vendors tell me that BGP is rock-solid stable. If this is rock-solid stable, I wouldn't want to be standing on that rock. But, in fact, we are all standing on this rock. OK? Why is this happening? OK. Really, BGP is a path vector protocol. Remember RIP? OK? And it is a distributed computation in time and delay and this delay is made work by the timing of when the announcement is made, the final router, and I connect. If I am a router and I connect to four other routers, there's actually something in the BGP specification called the MinRouteAdvertTimer that says that I delay before propagating that route. I give it a chance to stabilise. Well, in fact, 30 seconds is advised and implementations vary. Some don't do it at all. OK?
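Illustration (not part of the talk): a minimal sketch of the idea behind the MinRouteAdvertTimer, assuming a single router, a single peer and a fixed interval. The function name and the event format are hypothetical. Real implementations add jitter, keep a timer per peer and may treat withdraws differently, none of which is modelled here.

```python
def advertised_updates(changes, mrai):
    """Return the (send_time, route) updates one peer would receive.

    changes: list of (time_in_seconds, route) best-path changes at one
             router, sorted by time.
    mrai:    minimum seconds between two advertisements to the peer.

    Simplification: while the timer is running only the most recent
    change is remembered, and it is sent when the timer expires.
    """
    sent = []
    next_allowed = 0.0   # earliest time the next advertisement may go out
    pending = None       # latest change held back by the timer

    for t, route in changes:
        # The timer expired before this change arrived: flush what was held.
        if pending is not None and next_allowed <= t:
            sent.append((next_allowed, pending))
            next_allowed += mrai
            pending = None
        if t >= next_allowed:
            sent.append((t, route))      # timer not running, send at once
            next_allowed = t + mrai
        else:
            pending = route              # timer running, hold the change

    if pending is not None:
        sent.append((next_allowed, pending))
    return sent

if __name__ == "__main__":
    changes = [(0, "A"), (5, "B"), (12, "C"), (26, "D")]
    print(advertised_updates(changes, 0))
    print(advertised_updates(changes, 30))
```

With an interval of zero every intermediate change is passed on. With the advised 30 seconds only the first and the last state in this example reach the peer, at the cost of the final state arriving later.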
But - so - but, no matter what, the difference here you're going to see, the fact that some vary and some don't do it at all just exaggerates, just makes this much worse than it might be. Even if they were all identical, the same problem would occur - it's just a statistical phenomenon. Of course, being from Seattle, I think Seattle looks better just because it's much nicer than Chicago. We also have much better latte. Here you are in Lotte World and you can't even get a latte. The messages in transit or queued up are not shown, MEDs and IGPs are not always shown. One sequence is explored - I want to know if I can understand why I'm seeing that oscillation, why I'm seeing that withdraw? I just want an example. Now, we're going to look at the actual topology. 3130 is connected to ISP A. These routers have long MinRouteAdvertTimers. These are the MEDs and we just know that X is less than Y and ISP A happens to have two ASs and they're actually using BGP configuration. So these are eBGP and these are iBGP and here's my monitor and here is - so, in reality, this is very common. The customer is connected to one of the aggregation routers, there are multiple aggregation routers, they are connected to two backbone routers and the backbone routers have links from Seattle to Chicago. OK? Very simple, normal ISP configuration for a large ISP. So, in state one, the announcement comes out. So R says, "My path goes to here." He propagates to here first and says the path goes to here, OK? Propagates here, path goes to here, we're now going to get the announcement here, the path goes here, so this is the AS path we get. C says "35 is the MED and 3130 is the origin". OK? We then get the next announcement in that this one comes through and this one comes through. This one finally decided to propagate this way. First, he propagated this way. Now, he's propagating this way. Oops, 34 is less than 35 - new announcement. OK?
S2, notice, hasn't heard from R yet, because R has a long MinRouteAdvertTimer. R tells S2, OK, so S2 now goes this way. A tells B, because A tells B this route became invalid, so B sends a withdraw. So now we have the withdraw. Then, we have the new announcement because I go over here - MED is 35. And then, oh, I can go this way, MED 34. And then, finally, R announces to S2. Oops, my arrow is wrong there because R - oh, yeah, it connects to A, goes there, path 35 again. It comes over here... and then we settle again, path and MED 34 and it's a simple topology. We saw this: simple announcements can be very noisy when singly homed. In reality, when it's multihomed, it has two routers. One goes to a different ISP. That ISP peers with the original ISP. That ISP also uses a slow-announcing router. First, this guy announces quicker because it's a fast announcer. We come this way, he gets this path, so he says, "Oops, AS A AS B gets the short MED, but let's not worry about the MED." Here, the slow path starts to learn, OK? OK, so it comes here and he's starting to learn it but he hasn't made it down there yet. So I'm still converged through AS B. And I have this path too. Now, I have this path, which I will take, because this is a customer of here, I will prefer it to a peer route. Remember, we always prefer customers to peers, otherwise you get inconsistent routing. So, now, AS A is the path. Then, that same switch we had last time, so we get a withdraw. Then, we settle on this path, so we have A:B. Then, we settle for this path, prefer it because it's a customer, we have A. So here are all the things we have. I believe that with multiple S nodes and multiple X nodes - OK, multiple of these and multiple of these connecting over here - we can have multiple withdraws in the sequence which we were seeing in the original.
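Illustration (not part of the talk): a toy replay of how one router's best route can change several times as candidate routes arrive in sequence, producing one downstream message per change. Only the two rules mentioned in the talk are modelled, prefer customer routes over peer routes and then prefer the lower MED. The class, the function names and the AS paths are hypothetical, and the real BGP decision process has more steps (local preference, AS path length and so on) and compares MEDs under stricter conditions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    as_path: tuple       # illustrative AS path, e.g. ("B", "3130")
    learned_from: str    # "customer" or "peer"
    med: int

def preference_key(route):
    """Lower key wins: customer before peer, then lower MED."""
    rank = 0 if route.learned_from == "customer" else 1
    return (rank, route.med)

def replay(events):
    """Replay ("learn" | "lose", Route) events at one router.

    Returns the messages sent downstream: one announcement each time the
    best route changes, and a withdraw when no route is left.
    """
    known = []
    best = None
    messages = []
    for kind, route in events:
        if kind == "learn":
            known.append(route)
        else:
            known.remove(route)
        new_best = min(known, key=preference_key) if known else None
        if new_best != best:
            if new_best is None:
                messages.append(("withdraw", None))
            else:
                messages.append(("announce", new_best))
            best = new_best
    return messages

if __name__ == "__main__":
    via_peer_35 = Route(("B", "3130"), "peer", 35)
    via_peer_34 = Route(("B", "3130"), "peer", 34)
    via_customer = Route(("3130",), "customer", 35)
    events = [("learn", via_peer_35), ("learn", via_peer_34),
              ("learn", via_customer)]
    for message in replay(events):
        print(message)
```

In this example a single beacon event produces three announcements downstream, because the router learns the peer path first, then a lower MED, and only later the customer path it finally prefers.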
And it's been shown in the lab - in other words, a place with real routers and racks of them - that there are reasonable configurations that NEVER settle, that you raise the multihomed announcement and it will keep going forever. You also might want to see Tim's paper on iBGP configuration issues that will give you some suggestions for configuring your system so they won't have these issues. But my friend Curtis tells me that, if route withdraws are treated immediately and changes propagated more slowly - that's MinRouteAdvert - then the withdraw is order one, the route addition is order one and the addition of a better route is order one and a route change where the better route is removed is order one. That's idealistic. It has nothing to do with what's actually happening on the net. Questions? GEOFF HUSTON: It appears from what you're saying here that the MinRouteAdvertTimer doesn't time all the interfaces at the same time. RANDY BUSH: Specified not to. GEOFF HUSTON: It's making a relatively coherent wave front of change into fragmented change that induces oscillation. RANDY BUSH: It's not by interface, it's by peers. For instance, if you have a multimedia interface... Secondly, there's a trade-off. If you announce everything immediately, you will have - this is shown by simulation. We unfortunately don't have reality measurements because we weren't measuring in 1994 before MinRouteAdvert was introduced. Simulation shows that, with no delay, the network converges more quickly but with a lot of noise. As you increase the delay, the convergence time becomes longer - it takes longer for the information to propagate - but there are far fewer announcements. Now, as you continue increasing MinRouteAdvert, convergence time keeps getting worse, but it does not get much quieter.
And, in fact, simulation shows that the point in the curve where increasing MinRouteAdvert doesn't reduce noise much more but only delays propagation - in other words, what is the maximum useable MinRouteAdvert - is dependent on the complexity of the topology and its .... Does that help? GEOFF HUSTON: A little bit and I'm comparing it to stuff that I only dimly remember about hold-down timers in RIP because I'm going back 10 years. In the RIP hold-down timer model, the hold-down timer was expressed for all of my RIP neighbours and, when the hold-down timer expired and I released, I released the same information at the same time on all interfaces. RANDY BUSH: But nothing really happens at the same time and remember, as you were looking at this stuff, that none of these routers implemented MinRouteAdvert. Even though they're on zero, the announcement doesn't come out at exactly the same time on both. GEOFF HUSTON: What I'm wondering - it's almost a simulation question - is that if MinRouteAdvert wasn't per peer and it was a local timer and you released simultaneously when the timer expired, have you looked at that in simulations? RANDY BUSH: It helps but my point is that this essentially is that and you still get switching. It's the simultaneous of zero but the reason that MinRouteAdvert is there is that if there were more announcements - and there were no announcements. What particular value MinRouteAdvert has here is not interesting. GEOFF HUSTON: I was wondering if you have a non-zero hold-down happening on S1, S2, A and B, but it was a true RIP-style hold-down, would it do what the router vendors think MinRouteAdvert actually does? RANDY BUSH: My point was that it will change the statistical probability of the worst events. It will not remove them. GEOFF HUSTON: I'd agree with that. It feels intuitively right. RANDY BUSH: It just changes the curve. And this is inherent in the protocol design.
There are things you can do in your topology to reduce the problem and that was Tim Griffin's paper to which I referred. GEOFF HUSTON: Which gets down to almost an invariant that routing complexity has a lot to do with mapping policy over topology and scaling factors have less to do with the actual amount of prefix load you're actually carrying. Right. It's great work, thank you. It's a wonderful presentation, very illuminating. PHILIP SMITH: Thank you, Geoff. Any other questions? OK. Well, thank you very much, thank you Randy. As Geoff said, a very interesting presentation. Well, that brings us to the end of the routing SIG. Has anybody got anything else they want to say or talk about in the three minutes that we have left? If not, I'm going to do my usual virus announcement. We have two more that appeared within the last 20 minutes so, if you're in the room, hang on. I'll get my wire. You are up here - 221.143.6.155 and 221.143.6.156. You got it about 20 minutes ago because that's when you started blasting the network. I suggest you disconnect from the network, go to the APNIC help desk and find Darrin who will put it right for you. OK, apart from that, I have nothing else to say. Just a quick announcement about the APOPs BOF, starting at six o'clock: it's down on your program as being in the Emerald Room. It will be held here because there is no other BOF and we just want to make use of this facility, given that it's all set up and ready. APOPs BOF will be at six o'clock. You have a half-hour comfort break and, if you'd like to come back for the Operations BOF at 6 pm, I look forward to seeing you then. Otherwise, thank you for attending and we'll see you again for APRICOT 2004 in Kuala Lumpur at the end of February next year. Thank you and thanks to the speakers. APPLAUSE