Back to index
End-to-End Routing Behavior in the Internet
Vern Paxson, LBL
One-line summary:
Wide-area routing behavior in the Internet has become less predictable
with respect to short- and long-scale route changes, routing asymmetry,
and routing anomalies (loops, bad routes, etc).
Overview/Main Points
- Two sets of end-to-end traceroute measurements taken between
pairs of 37 hosts, analyzed for:
- routing pathologies
- routing changes ("fluttering")
- routing asymmetry
- D1 measurements: traceroute between two sites with mean interval
1-2 days. D2 measurements: 60% of measurements with 2 hour mean
inter-measurement time, 40% with 2.75 days mean inter-measurement
time.
- Arguments supporting the methodology and goals:
- even tho few sites, routes are representative because
they include non-negligible fraction of AS's (autonomous systems)
comprising the Internet.
- Stable inter-AS routing doesn't imply stable e2e routing,
since AS's are large entities and can have internal
inconsistencies.
- Properties of the measurements:
- Additive random sampling
- Poisson process; therefore PASTA holds (Poisson Arrivals
See Time Averages), so percentage of measurements that
observe a given state is asymptotically proportional to
the percentage of time the Internet spends in that state,
even if the process is non-Markovian and the Poisson
arrivals are non-homogeneous (Wolff 82).
This property is used extensively in the analysis.
- Caveat: in the event of (temporary) lost connectivity,
network can anticipate that no measurement will
occur during the next traceroute; but PASTA requires that
the observed process not be able to anticipate observation
arrivals. Effect of this is tendency to
underestimate prevalence of connectivity problems.
- Caveat: end-to-end measurement can agglomerate many
intermediate effects into a single observation, with no
information about where/why the problem occurred.
Sometimes author called AS administrators to track down.
- All measurements reported to 95% confidence interval, including
measurements for which it was necessary to establish that
differences between two sets of data are not due to chance.
- Routing loops and anomalies. Apt to form when network
connectivity
changes are not immediately propagated to routers.
- "Persistent loop" == unresolved by the end of traceroute.
More prevalent in Wash. DC area, where many long-haul
providers interchange packets.
- "Temporary loop" == resolved during traceroute; usually
transient effect of a single change rippling through the
network.
- Observation: All routing loops were confined to a
single AS, so the Border Gateway Protocol's
loop-suppression mechanism is working well in practice.
- Erroneous routing: one instance seen. Security
implication: can't assume where your packets are going,
but that's not news.
- Unreachable due to low TTL: operational diameter of the
Internet is now >30 hops, but variance in hop count is
high and doesn't seem to correlate with geographical
distance.
- Observation: Time of day effects: time of day
computed as mean
time-of-day between measured sites. Clear correlation
between better connectivity and less loaded hours.
- Routing changes ("fluttering").
- Most egregious: wustl load balancing shunts
alternate packets to either coast.
- Inter-AS fluttering creates stability and symmetry
problems. Intra-AS load balancing does not.
- Observation: "Deflection routing" schemes, which
shunt load rather than drop packets, have virtually the
same properties as inter-AS fluttering. Don't use them.
- Argument: pathology measurements are representative; top 3 AS"s
accounted for nearly half of them.
- Observation: None of the pathologies improved between D1
and D2 (several months apart), and some got worse; so wide-area
Internet predictability is getting worse.
- Route prevalence and persistence. Three granularities:
hostname, city, sequence-of-AS's.
- Prevalence: how likely is this same route to be observed
in the future? vs. Persistence: how long will this route
stick around before it's replaced by a different route?
- Analysis focuses on dominant route, the one
observed most often between site pair.
- Assumption: routing changes are semi-Markov, so
that steady-state probability of observing a state is
asymptotically equal to amount of time spent in that
state. Because of PASTA, measurements provide this time
average.
- Observation: Internet paths dominated by a single
route, but a fairly wide spread is observed. ("heavy
headed and heavy tailed" :-))
- To measure route persistence, must rule out
short-time-scale route fluttering. 2-minute route changes
are negligible; 10-minute route changes are rare but
non-negligible. On larger scales, route changes seem to
happen at 12-48 hour granularity.
- Observation: Routing persistence at least bimodal;
routes that do not change on short, medium or long scales
as defined above, tend to be stable (90% chance) for at
least a week.
- Argument: routing changes on short time scales (< days)
happen inside the network, not at stub networks.
Therefore our measurements are likely to be similar to
those observed by other (unmeasured) sites.
- Route symmetry. Important since some protocols estimate
latency as 1/2 of RTT (eg NTP), or infer network conditions from
packet interarrivals.
- In D2, measurements were paired, so asymmetries could be
unambiguously determined. 49% measurements observed
asymmetry at the level of cities.
- D1 analyzed to conservatively (ie under-)estimate
asymmetry. Observation: things have gotten worse
(30% to 49%).
- Overall observations.
- Chance of seeing some routing pathology doubled
from 1.5% to 3.4% from end 1994 to end 1995.
- Most paths dominated by single route, but time periods
over which routes persist show wide variation.
- Getting harder to provide consistent topological view of
the wide-area Internet. There is no "typical" Internet
path.
Relevance
- A tour de force of data reduction and analysis; a surprising
amount was deduced using only sparse end-to-end measurements.
- The routing asymmetries and route change behavior can affect the
performance of various network time and anti-congestion algorithms.
Flaws
- Needed to make a few assumptions, e.g. route changes are
semi-Markovian and the applicability of the PASTA principle
(Poisson Arrivals See Time Averages).
Back to index