Back to index
End-to-End Routing Behavior in the Internet
Vern Paxson, LBL
One-line summary:
Wide-area routing behavior in the Internet has become less predictable
with respect to short- and long-scale route changes, routing asymmetry,
and routing anomalies (loops, bad routes, etc).
Overview/Main Points
  -  Two sets of end-to-end traceroute measurements taken between
       pairs of 37 hosts, analyzed for:
       
         -  routing pathologies
         
 -  routing changes ("fluttering")
         
 -  routing asymmetry
       
 
   -  D1 measurements: traceroute between two sites with mean interval
       1-2 days.  D2 measurements: 60% of measurements with 2 hour mean
       inter-measurement time, 40% with 2.75 days mean inter-measurement
       time.
  
 -  Arguments supporting the methodology and goals:
       
         -  even tho few sites, routes are representative because
              they include non-negligible fraction of AS's (autonomous systems)
              comprising the Internet.
         
 -  Stable inter-AS routing doesn't imply stable e2e routing,
              since AS's are large entities and can have internal
              inconsistencies. 
       
 
 -  Properties of the measurements:
       
         -  Additive random sampling
         
 -  Poisson process; therefore PASTA holds (Poisson Arrivals
              See Time Averages), so percentage of measurements that
              observe a given state is asymptotically proportional to
              the percentage of time the Internet spends in that state,
              even if the process is non-Markovian and the Poisson
              arrivals are non-homogeneous (Wolff 82).
              This property is used extensively in the analysis.
         
 -  Caveat: in the event of (temporary) lost connectivity,
              network can anticipate that no measurement will
              occur during the next traceroute; but PASTA requires that
              the observed process not be able to anticipate observation
              arrivals.  Effect of this is tendency to
              underestimate prevalence of connectivity problems.
         
 -  Caveat: end-to-end measurement can agglomerate many
              intermediate effects into a single observation, with no
              information about where/why the problem occurred.
              Sometimes author called AS administrators to track down.
       
 
   -  All measurements reported to 95% confidence interval, including
       measurements for which it was necessary to establish that
       differences between two sets of data are not due to chance.
  
 -  Routing loops and anomalies.  Apt to form when network
       connectivity 
       changes are not immediately propagated to routers.
       
         -  "Persistent loop" == unresolved by the end of traceroute.
              More prevalent in Wash. DC area, where many long-haul
              providers interchange packets.
         
 -  "Temporary  loop" == resolved during traceroute; usually
              transient effect of a single change rippling through the
              network.
         
 -  Observation: All routing loops were confined to a
              single AS, so the Border Gateway Protocol's
              loop-suppression mechanism is working well in practice.
         
 -  Erroneous routing: one instance seen.  Security
              implication: can't assume where your packets are going,
              but that's not news.
         
 -  Unreachable due to low TTL: operational diameter of the
              Internet is now >30 hops, but variance in hop count is
              high and doesn't seem to correlate with geographical
              distance.
         
 -  Observation: Time of day effects: time of day
              computed as mean 
              time-of-day between measured sites.  Clear correlation
              between better connectivity and less loaded hours.
       
 
   -  Routing changes ("fluttering").
       
         -  Most egregious: wustl load balancing shunts
              alternate packets to either coast.
         
 -  Inter-AS fluttering creates stability and symmetry
              problems.  Intra-AS load balancing does not.
         
 -  Observation: "Deflection routing" schemes, which
              shunt load rather than drop packets, have virtually the
              same properties as inter-AS fluttering.  Don't use them.
       
 
   -  Argument: pathology measurements are representative; top 3 AS"s
       accounted for nearly half of them.
  
 -  Observation: None of the pathologies improved between D1
       and D2 (several months apart), and some got worse; so wide-area
       Internet predictability is getting worse.
  
 -  Route prevalence and persistence.  Three granularities:
       hostname, city, sequence-of-AS's.
       
         -  Prevalence: how likely is this same route to be observed
              in the future?  vs. Persistence: how long will this route
              stick around before it's replaced by a different route?
         
 -  Analysis focuses on dominant route, the one
              observed most often between site pair.
         
 -  Assumption: routing changes are semi-Markov, so
              that steady-state probability of observing a state is
              asymptotically equal to amount of time spent in that
              state.  Because of PASTA, measurements provide this time
              average.
         
 -  Observation: Internet paths dominated by a single
              route, but a fairly wide spread is observed.  ("heavy
              headed and heavy tailed" :-))
         
 -  To measure route persistence, must rule out
              short-time-scale route fluttering.  2-minute route changes
              are negligible; 10-minute route changes are rare but
              non-negligible.  On larger scales, route changes seem to
              happen at 12-48 hour granularity.
         
 -  Observation: Routing persistence at least bimodal;
              routes that do not change on short, medium or long scales
              as defined above, tend to be stable (90% chance) for at
              least a week.
         
 -  Argument: routing changes on short time scales (< days)
              happen inside the network, not at stub networks.
              Therefore our measurements are likely to be similar to
              those observed by other (unmeasured) sites.
       
 
   -  Route symmetry.  Important since some protocols estimate
       latency as 1/2 of RTT (eg NTP), or infer network conditions from
       packet interarrivals.
       
         -  In D2, measurements were paired, so asymmetries could be
              unambiguously determined.  49% measurements observed
              asymmetry at the level of cities.
         
 -  D1 analyzed to conservatively (ie under-)estimate
              asymmetry.  Observation: things have gotten worse
              (30% to 49%).
       
 
   -  Overall observations.
       
         -  Chance of seeing some routing pathology doubled
              from  1.5% to 3.4% from end 1994 to end 1995.
         
 -  Most paths dominated by single route, but time periods
              over which routes persist show wide variation.
         
 -  Getting harder to provide consistent topological view of
              the wide-area Internet.  There is no "typical" Internet
              path. 
       
 
 
Relevance
  -  A tour de force of data reduction and analysis; a surprising
       amount was deduced using only sparse end-to-end measurements.
  
 -  The routing asymmetries and route change behavior can affect the
       performance of various network time and anti-congestion algorithms.
 
Flaws
  -  Needed to make a few assumptions, e.g. route changes are
       semi-Markovian and the applicability of the PASTA principle
       (Poisson Arrivals See Time Averages).
 
Back to index