Re: [sig-dns]progressing the APNIC Lame DNS sweep proposal

  • To: Joe Abley <jabley at isc dot org>
  • Subject: Re: [sig-dns]progressing the APNIC Lame DNS sweep proposal
  • From: Edward Lewis <edlewis at arin dot net>
  • Date: Tue, 29 Apr 2003 09:25:38 -0400
  • Cc: sig-dns at lists dot apnic dot net
  • In-reply-to: <840AE4BF-79C3-11D7-A0BA-00039312C852 at isc dot org>
  • References: <840AE4BF-79C3-11D7-A0BA-00039312C852@isc.org>
  • Sender: sig-dns-admin@lists.apnic.net
    • The first question I ask: is the effort to reduce lameness aimed at reducing the load on the infrastructure servers (root, TLDs, etc.) or at making the overall DNS function better?

      If the purpose is to reduce the load, then the aim is to make sure every zone that is supposed to exist answers (eventually). Resolvers already should cycle through the NS RR's and cache the referrals, so that as long as resolvers are not induced to retry the query 'from the top' in response to a lame referral, we can accomplish the goal with 'eventually get some answer.'

      If the purpose is to clean the DNS and thereby make the Internet more responsive, that is, allowing web browser resolvers to spend less time thumbing through NS RR's and more time getting to the answer, then we need to do more than 'eventually' get to an answer.

      At 17:51 -0400 4/28/03, Joe Abley wrote:
      1. When exactly should a delegation be considered lame?

      - when all/some of the delegated nameservers give no response to queries?
      - when all/some of the delegated nameservers give non-authoritative or otherwise unreasonable responses to queries?
      - some other criteria?
      There is more to this. An NS RR can lead to (at least) the following:

      1) No response to the query for the name server's address
      2) A negative answer to the query for the name server's address
      3) An answer you categorically can't use - e.g., AAAA with only a v4 stack
      4) No response from (one of) the name server's address(es)
      5) A negative answer from the same
      6) A positive answer
      7) Some combination of the other 6, unbounded in size, as an NS might have multiple address records
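A minimal sketch of how a probe might record the outcomes listed above. The names `NsOutcome` and `summarize` are hypothetical, chosen for illustration; category 7 is modelled simply as "more than one distinct outcome seen for the same NS RR".

```python
from enum import Enum


class NsOutcome(Enum):
    """Outcomes 1-6 from the list above (names are illustrative)."""
    ADDR_NO_RESPONSE = 1    # no response to the address query for the server
    ADDR_NEGATIVE = 2       # negative answer to the address query
    ADDR_UNUSABLE = 3       # e.g. only AAAA while the prober has a v4-only stack
    SERVER_NO_RESPONSE = 4  # no response from one of the server's addresses
    SERVER_NEGATIVE = 5     # negative answer from the server
    SERVER_POSITIVE = 6     # positive answer


def summarize(outcomes):
    """Collapse the per-address outcomes for one NS RR into a single label."""
    kinds = set(outcomes)
    if not kinds:
        raise ValueError("need at least one outcome")
    if len(kinds) == 1:
        return next(iter(kinds)).name
    return "MIXED"  # category 7: some combination of the other six
```

For example, an NS with two addresses, one silent and one answering, would be summarized as `MIXED`.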

      By 'a delegation' do you mean that which is represented by the NS RR set for a child zone registered at the parent? (As opposed to being an individual NS RR, an address off an NS RR, or even the authoritative set of servers at the child.)

      At ARIN we are reporting errors that we see falling into category #5. We haven't taken action on the others yet. The reasons: those errors are easy to detect automatically (no retries needed) and easy to discuss with registrants over the phone, partly because they are reproducible.

      As an ARIN staff member, I'd refrain from making a statement of what ARIN should do with regard to the ultimate goal, as our members haven't made such a statement. Perhaps that is why my answer avoids a definite opinion. As an IETF engineer with loads of experience in DNS, I'd say that what should be considered lame really depends on what you want to accomplish, as in the first question I raised above; see the first half of this paragraph for why I am once again hedging my answer.

      The DNS is an incredibly resilient beast. It can suffer lameness easily enough; as we can see, the lameness out there isn't bringing the DNS to its knees - it's just annoying to those of us who need to use the network. ;) DNS suffers lameness well because of its loosely coupled management structure, which is also why DNS scales and why the DNS is a success. It is important to keep in mind that in any effort to make DNS "better," we don't compromise the loose coupling of the management.

      2. When should a lame delegation be considered "lame enough" for some action to be taken?

      - how many measurements?
      - exactly what query or queries?
      - measured from how many places?
      Some random comments...

      Last summer I ran a probe that measured lameness - a quantitative effort. Within the run, I repeated queries on a timeout basis. Besides a problem with traffic shaping (1.5M link upstream of me), I did notice it was not rare for a name server to answer only the second or even the third query, sent seconds apart. So, persistence in the small had a benefit.
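A minimal sketch of "persistence in the small": retry a query a bounded number of times and note which attempt was answered. The function name `ask_with_retries` and the `ask` callable are assumptions for illustration; `ask` stands in for whatever sends one query and either returns the reply or raises on timeout.

```python
def ask_with_retries(ask, attempts=3):
    """Call ask() up to `attempts` times.

    Returns (reply, attempt_number) on the first success,
    or (None, attempts) if every attempt timed out or failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            return ask(), attempt
        except OSError:
            # Socket timeouts are OSError subclasses; treat them
            # as "no answer this time" and try again.
            continue
    return None, attempts
```

Keeping `attempts` small matters: it catches servers that answer on the second or third try without hammering one that is genuinely down.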

      Overdoing persistence is a problem. I feared that I might have been triggering some IDSes - particularly on some servers that had 6000+ delegations. After a while, these large ones seemed to shut me down.

      Besides those data points, if I were to test once a day, how many days in a row should I wait until I declare a server unreachable? One factor - the mean time to repair of single points of failure (or maybe the mean plus 'x' standard deviations).
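One way to turn that idea into a number, as a sketch only: take observed repair times (in days), compute the mean plus `x` standard deviations, and round up to whole days. The function names and the sample figures in the usage note are hypothetical; no repair-time data appears in this message.

```python
import math
import statistics


def days_before_unreachable(repair_days, x=2.0):
    """Consecutive failing days to wait: ceil(mean + x * stdev), at least 1."""
    mean = statistics.mean(repair_days)
    # stdev needs at least two samples; fall back to no spread otherwise.
    spread = statistics.stdev(repair_days) if len(repair_days) > 1 else 0.0
    return max(1, math.ceil(mean + x * spread))


def is_unreachable(daily_ok, threshold):
    """True if the most recent `threshold` daily probes all failed."""
    recent = daily_ok[-threshold:]
    return len(recent) == threshold and not any(recent)
```

With made-up repair times of 1 and 3 days and `x=1`, the threshold comes out at 4 days; a server would then be declared unreachable only after four failed daily probes in a row.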

      How many places is an interesting problem. The purist in me says that every server listed in DNS ought to be reachable from anywhere. Practically, that ain't going to happen.

      I once entertained the fantasy of having four probes reporting to a central site every time I ran the test. I think that's overkill. Just rotating where the one probe runs among four sites (with four being any number you are comfortable with) on successive days is sufficient to get around routing 'issues' and link outages.
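The rotation can be as simple as indexing the site list by the day number. A sketch, with placeholder site names that are purely hypothetical:

```python
from datetime import date, timedelta

# Hypothetical probe locations; any number of sites works.
PROBE_SITES = ["site-a", "site-b", "site-c", "site-d"]


def site_for(day, sites=PROBE_SITES):
    """Pick the probe site for a given date, advancing one site per day."""
    return sites[day.toordinal() % len(sites)]
```

Every site gets a turn within `len(sites)` consecutive days, so a routing problem local to one site affects at most one run in each cycle.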

      I recommend querying for the SOA record. It's the only record that must be in a zone that is uniquely answered from the zone. (The NS set is hinted at by the parent.) The queries ought to also have RD turned off, even though this does not guarantee that the server won't answer from what it has already cached. (Authoritative servers ought not recurse, but, but, you teach the kids, you lecture them, and do they listen?)
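A sketch of what such a query looks like on the wire, built with only the Python standard library and following the RFC 1035 message layout. The function name `build_soa_query` is an assumption for illustration; the sketch does not validate label lengths or handle non-ASCII names.

```python
import struct

QTYPE_SOA = 6   # RFC 1035 type code for SOA
QCLASS_IN = 1   # Internet class


def build_soa_query(zone, query_id=0):
    """Build a DNS query for the zone's SOA record with RD (0x0100) cleared."""
    flags = 0x0000  # QR=0 (query), opcode=0 (standard query), RD=0
    # Header: id, flags, QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, flags, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte
    labels = zone.rstrip(".").encode("ascii").split(b".")
    qname = b"".join(bytes([len(l)]) + l for l in labels if l) + b"\x00"
    return header + qname + struct.pack("!HH", QTYPE_SOA, QCLASS_IN)
```

The resulting bytes can be sent over UDP to port 53 of each listed server; as noted above, clearing RD does not stop a server from answering out of its cache.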

      In the answer, I (want to) look at the return code, the answer count, the flags, and what's in the answer section. In particular, the AA and RA bits of the flags. No AA bit is an error; RA is something I don't use now, but it is helpful in diagnosing the problem (as in a server answering but not authoritatively). I also look at the answer section - we have one out-of-zone CNAME in our tests (1 of about 200K zones).
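A sketch of pulling those fields out of a reply header, using the RFC 1035 bit layout. The function names are hypothetical. Treating a missing AA bit as an error comes from the text above; also flagging a non-zero return code or an empty answer section is an assumption made for this example, not a stated policy.

```python
import struct


def parse_header(reply):
    """Extract the fields of interest from the 12-byte DNS header."""
    if len(reply) < 12:
        raise ValueError("reply shorter than a DNS header")
    ident, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", reply[:12])
    return {
        "id": ident,
        "rcode": flags & 0x000F,      # return code, low four bits
        "aa": bool(flags & 0x0400),   # authoritative answer
        "ra": bool(flags & 0x0080),   # recursion available (diagnostic only)
        "ancount": ancount,           # number of records in the answer section
    }


def looks_lame(reply):
    """Assumed rule: error rcode, no AA bit, or empty answer section."""
    h = parse_header(reply)
    return h["rcode"] != 0 or not h["aa"] or h["ancount"] == 0
```

Spotting something like an out-of-zone CNAME would additionally require decoding the answer section itself, which this sketch leaves out.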

      Opinions from the list on these two questions would be very good to hear.
      
      Remember, Joe, "vengeful." ;)
      --
      -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
      Edward Lewis                                            +1-703-227-9854
      ARIN Research Engineer
      
      A compiler-directive person living in an HTML-tagged world.