Virtual Network Transport Protocols for Myrinet
Brent N. Chun, Alan M. Mainwaring, and David E. Culler
One-line summary:
AM-II is presented: it provides virtual networks, protected direct network
access, reliable message delivery (timeouts and retransmissions), and
automagic network mapping and route discovery/installation.
Overview/Main Points
- Assumptions:
- fast, low-latency interconnect (link bandwidth > 1 Gb/s,
link latency < 1 usec)
- homogeneous network interfaces
- single cluster network (order hundreds or
small number of thousands of machines)
- steve says: these assumptions eliminate all of the
complexities of transport protocols - which is good,
but it does limit the scope of applicability
- Architecture:
- AM-II API
- short messages (4-8 word payload), medium
messages (~256 bytes), and bulk messages
with DMA memory-to-memory transfers are
supported.
- endpoints are network delivery abstractions
(like sockets). Message tags allow
multiplexing of application traffic over a
single endpoint.
- return-to-sender error model
- virtual networks
- network access is via endpoints; a collection of
endpoints forms a virtual interconnect.
- endpoints correspond to allocated
buffers/queues in the NIC itself. Since more
endpoints can be created than the NIC can
handle, NIC memory is used as a cache of
endpoint state, and faulting is done when
necessary. (Assumption: bursty traffic.)
- NIC firmware
- firmware implements reliability -
retransmission, duplicate elimination.
Reordering not necessary because of
link-level properties (such as
backpressure). (Is this true?)
- Programmed I/O for small transfers, and
medium/large AM transfer. DMA for
medium/large data transfer.
- Hardware
- 100+ 167-MHz Sun UltraSPARCs connected by
160 MB/s full-duplex Myrinet links using 40
8-port crossbar switches in a fat-tree-like
topology. (The top-level links are not
fatter, though, which limits bisection bandwidth.)
- NIC is LANai 4.1 card on SBUS, with 37.5 MHz
embedded processor, 256 KB of SRAM, and
single host SBUS DMA engine.
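The "NIC memory as a cache of endpoint state" idea from the virtual networks item above can be sketched as a small simulation. This is a toy model, not the paper's firmware: the class name, the slot count, and in particular the LRU replacement policy are assumptions made for illustration (the summary only says that endpoint state is cached and faulted in when necessary).

```python
from collections import OrderedDict


class EndpointCache:
    """Toy model of NIC memory acting as a cache of endpoint state."""

    def __init__(self, nic_slots):
        self.nic_slots = nic_slots
        self.resident = OrderedDict()  # endpoint id -> state held in NIC memory
        self.host_backing = {}         # endpoint id -> state evicted to host memory
        self.faults = 0

    def access(self, ep_id):
        """Return the endpoint's state, faulting it into NIC memory if needed."""
        if ep_id in self.resident:
            self.resident.move_to_end(ep_id)  # mark as most recently used
            return self.resident[ep_id]
        self.faults += 1
        if len(self.resident) >= self.nic_slots:
            # Assumed policy: evict the least recently used endpoint to the host.
            victim, state = self.resident.popitem(last=False)
            self.host_backing[victim] = state
        state = self.host_backing.pop(ep_id, {"queue": []})
        self.resident[ep_id] = state
        return state


# Bursty access pattern: each endpoint is touched several times in a row.
cache = EndpointCache(nic_slots=2)
for ep in [1, 1, 1, 2, 2, 3, 3, 1]:
    cache.access(ep)
print("faults:", cache.faults)
```

With bursty traffic most accesses hit an endpoint that is already resident, which is why the caching approach is plausible under the stated assumption.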
- NIC protocols
- endpoint scheduling: how long to service an endpoint
before moving on to the next. Weighted round-robin is used:
empty endpoints are skipped; for an endpoint with pending
messages, the NIC makes 2^k attempts to send and loiters
once the endpoint is empty in case the host sends more
stuff. (Again, bursty assumption.) They used k=8!!!!!
- flow control: bandwidth-delay product is on the order
of 2 messages. 3 levels of flow control: active
message credits (which assume request-reply protocol to
replenish credits), NIC stop-and-wait flow control over
logical channels (4 outstanding messages per channel
allowed), and network link-level backpressure. Note
that they rate-limit senders, so any given receiver
can still be hammered by multiple senders, but
backpressure helps here.
- simple look-up-tables in NIC for channel management,
routing, and message information. Table space on the
order of #channels x #NICs is needed, and a linear search
for retransmission timeouts is performed. This ultimately
limits the scalability of the cluster (NIC RAM is the
largest issue, not the cost of the search).
- Error handling: the system attempts 255 retransmissions
(why? why? Myrinet is ultra-reliable, so a dropped
packet almost certainly means congestion. 255
retransmissions is a bad idea. Plus, the paper doesn't
mention whether they do exponential backoff or not.)
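The endpoint scheduling policy described above (skip empty endpoints, spend up to 2^k send attempts on a busy one, then move on) can be sketched as follows. This is a simplified illustration, not the NIC firmware: the function name and data layout are hypothetical, every send attempt is assumed to succeed, and the loitering step is only noted in a comment.

```python
from collections import deque


def schedule(endpoints, k, rounds):
    """Return the (endpoint, message) send order under weighted round-robin.

    endpoints: dict mapping endpoint name -> deque of pending messages.
    k: each busy endpoint gets up to 2**k send attempts per visit.
    rounds: number of passes over all endpoints.
    """
    budget = 2 ** k
    sent = []
    for _ in range(rounds):
        for name, queue in endpoints.items():
            if not queue:
                continue  # empty endpoints are skipped
            attempts = 0
            while queue and attempts < budget:
                # Assumption: every attempt succeeds and sends one message.
                sent.append((name, queue.popleft()))
                attempts += 1
            # The real NIC would loiter here briefly if the queue drained,
            # in case the host posts more messages; omitted in this sketch.
    return sent


eps = {
    "a": deque(["a0", "a1", "a2"]),
    "b": deque(),
    "c": deque(["c0", "c1", "c2", "c3", "c4"]),
}
order = schedule(eps, k=1, rounds=2)
print([msg for _, msg in order])
```

A small k is used here only to keep the trace short; with k=8, as in the paper, a single busy endpoint can be given up to 256 attempts before the NIC moves on.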
- Performance
- It's fast, but twice as slow as GAM (from a RTT
perspective), mostly because of
the virtual network overhead (message descriptor
munging in endpoints).
- Throughput approaches 31 MB/s (links support 160
MB/s - where's the rest gone?) under optimal
conditions, i.e. streaming DMA transfer between two
peers when there's no network contention/congestion.
- Get expected 1/x degradation as multiple people contend
for one destination.
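The two throughput observations above reduce to simple arithmetic, sketched here. The 31 MB/s and 160 MB/s figures come from the notes; the 1/x model is an idealization that assumes the receiver's peak bandwidth is split evenly among x senders, which is an assumption of this sketch rather than a measured result.

```python
LINK_MB_S = 160.0      # link bandwidth, from the notes
MEASURED_MB_S = 31.0   # peak measured throughput, from the notes

# Fraction of the link bandwidth that the measured throughput represents.
utilization = MEASURED_MB_S / LINK_MB_S


def per_sender_share(total_mb_s, senders):
    """Idealized 1/x model: x senders split one receiver's bandwidth evenly."""
    return total_mb_s / senders


print(f"link utilization: {utilization:.1%}")
for x in (1, 2, 4, 8):
    print(f"{x} sender(s): {per_sender_share(MEASURED_MB_S, x):.2f} MB/s each")
```

Under this model roughly four fifths of the link bandwidth is unaccounted for, which is the gap the notes ask about.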
Flaws
- see the inline comments above - there are some pretty
remarkable and unjustified constants hard-coded in their
protocols, and on inspection there is a lot of unrealized
bandwidth that hasn't been explained.
- How about some macro-benchmarks? Have any good demonstrable
applications been written on top of AM-II?