Notes on Contention Resolution

Leslie Ann Goldberg

2 Oct 2000

These notes contain a survey of some work on contention resolution in multiple-access channels. I have written the notes in connection with EPSRC Research Grant GR/L60982, Design and Analysis of Contention-Resolution Protocols. The notes are not currently up to date (they were last updated in October 2002) but I may update them in the future.

A multiple-access channel is a broadcast channel that allows multiple users to communicate with each other by sending messages onto the channel. If two or more users simultaneously send messages, then the messages interfere with each other (collide), and the messages are not transmitted successfully. The channel is not centrally controlled. Instead, the users use a contention-resolution protocol to resolve collisions. Thus, after a collision, each user involved in the collision waits a random amount of time (which is determined by the protocol) before resending.
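The collision/retransmission dynamic can be illustrated with a toy simulation (entirely our own sketch, not any particular protocol from the literature): each backlogged user retransmits in each slot with a fixed probability, and a slot succeeds only when exactly one user sends.

```python
import random

def simulate_slotted_channel(send_prob, n_users, max_slots, seed=0):
    """Toy slotted multiple-access channel: each of n_users holds one
    message and retransmits it in every slot with probability send_prob.
    A slot succeeds exactly when one user transmits; zero senders waste
    the slot and two or more collide.  Returns the number of slots taken
    to clear everyone, or None if max_slots was not enough."""
    rng = random.Random(seed)
    waiting = n_users
    for slot in range(1, max_slots + 1):
        senders = sum(1 for _ in range(waiting) if rng.random() < send_prob)
        if senders == 1:          # a unique sender succeeds and leaves
            waiting -= 1
            if waiting == 0:
                return slot
        # senders == 0 or senders >= 2: the slot is wasted
    return None
```

With a fixed retransmission probability, throughput degrades as the backlog grows; the contention-resolution protocols surveyed below instead adapt the retransmission probability to the (unknown) backlog.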

Although the most familiar multiple-access channels are local-area networks (such as the Ethernet network) which are implemented using wire, multiple-access channels are now being implemented using a variety of technologies including packet-radio, fibre-optics, free-space optics and satellite transmission (see [GFL87]). These multiple-access channels are used for communication in many distributed systems such as distributed computer networks.

Research in contention resolution typically relies on the slotted time assumption [1] to simplify the analysis. The two models [2] considered are the following:

There is no single definition which is used to determine whether or not a contention-resolution protocol is ``good'' (or ``stable'' [4]), but the basic idea is that a protocol should allow sufficiently frequent retransmissions (so that channel time is not wasted while users wait) but should not allow too-frequent retransmissions (which waste channel time with collisions). Typically, one models the execution of the protocol as a Markov chain. If the protocol is good (for a given arrival rate lambda), the corresponding Markov chain is recurrent (with probability 1, it eventually returns to the empty state in which no messages are waiting). Otherwise, the chain is said to be transient (and we also say that the protocol is transient). Note that transience is a very strong form of instability: informally, if a protocol is transient then with probability 1 it spends only a finite number of steps in "small" states (states with few backlogged messages). Another way to measure the success of a protocol is to measure its capacity. A protocol is said to achieve throughput lambda if, when it is run with input rate lambda, the average success rate is lambda. The capacity of the protocol [EH98] is the maximum throughput that it achieves.

Most research on collision-resolution protocols focuses on simple protocols because

Several natural classes of protocols have been defined.

Chlebus has written a nice survey of this topic and also of communication in radio networks. See [Chl00].

Some Results

Age-based protocols: Kelly [Kel85], Kelly and MacPhee [KM87], and MacPhee [MacP89] considered age-based protocols in the queue-free model. They showed, for any given age-based protocol, how to calculate a critical arrival rate such that the expected number of successful transmissions is infinite if and only if the arrival rate is at most the critical arrival rate. They showed that the critical arrival rate is 0 if and only if p0+ ... + pt is omega(log(t)). Proof Method. Ingenoso [Ing93] also considered age-based protocols in the queue-free model. He showed that if p0, p1,... is monotonically decreasing then the protocol is unstable in the sense that the corresponding Markov chain is transient [5]. Proof Method.
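To make the queue-free model concrete, here is a toy simulation (our own sketch; the naming is ours) of an age-based protocol: a message that is t steps old attempts to send with probability p(t), independently of its collision history.

```python
import math
import random

def poisson(rng, lam):
    """Sample a Poisson(lam) variate (Knuth's method; fine for small lam)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def simulate_age_based(p, lam, n_slots, seed=0):
    """Queue-free model: Poisson(lam) new messages arrive in each slot, and
    a message of age t sends with probability p(t).  A slot succeeds iff
    exactly one message sends.  Returns (successes, final backlog)."""
    rng = random.Random(seed)
    ages, successes = [], 0
    for _ in range(n_slots):
        ages.extend([0] * poisson(rng, lam))      # new arrivals have age 0
        senders = [i for i, a in enumerate(ages) if rng.random() < p(a)]
        if len(senders) == 1:                     # unique sender succeeds
            ages.pop(senders[0])
            successes += 1
        ages = [a + 1 for a in ages]              # everyone else ages
    return successes, len(ages)
```

For example, p(t) = 1/(t+1) gives p0 + ... + pt growing like log(t), which is exactly the boundary case in the Kelly-MacPhee criterion above.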

Backoff protocols:

Acknowledgement-based protocols: Goldberg, Jerrum, Kannan and Paterson [GJKP00] showed that no acknowledgement-based protocol is recurrent for lambda >= .530045 and that the capacity of every acknowledgement-based protocol is at most .530045. Proof Method.

Almost Acknowledgement-based protocols: Raghavan and Upfal's protocol [RU95] applies to the queueing model. It is almost acknowledgement-based in the sense that the only additional information about the multiple-access channel that the N users are given is a reasonably good estimate of the logarithm of N (one that is correct up to a constant factor). If the arrival rate is less than 1/10, then the expected waiting time of each message is O(log N). The idea motivating the protocol is that backoff protocols have high message-waiting times because they do not handle catastrophic events well. In order to handle these unlikely but harmful events (such as the simultaneous arrival of many messages), users of Raghavan and Upfal's protocol observe the history of their successes and, when it seems that the channel is excessively clogged, a user enters a reset state for a while in which it rarely tries to send (so the channel can clear). Description of protocol. Proof Method. Raghavan and Upfal's protocol can be used in systems of more than one multiple-access channel. Goldberg, MacKenzie, Paterson and Srinivasan's paper [GMPS98] gives a queue-free-model protocol and a queueing-model protocol. While the queue-free-model protocol is not acknowledgement-based (the protocol requires the users to share a clock), the queueing-model protocol which simulates it is almost acknowledgement-based in the sense that the only additional information about the multiple-access channel that the N users are given is some (any) upper bound on N. When the arrival rate is less than 1/e, the expected average message waiting-time is O(1). The analysis in the queueing case allows for the possibility that the N users might start and stop running the protocol over time, but it requires that each time a user starts running the protocol, it continues to run it for a while. Description of Protocols. Proof Method.

Full-Sensing Protocols: Mosely and Humblet [MH85] discovered a full-sensing protocol with capacity .48776. This protocol is a "tree protocol" in the sense of Capetanakis [Cap79] and Tsybakov and Mikhailov [TM78]. For a simple analysis of the protocol, see [V86]. Vvedenskaya and Pinsker have shown how to modify Mosely and Humblet's protocol to achieve an improvement in the capacity (in the seventh decimal place) [VP83]. Much work has gone into determining upper bounds on the capacity that can be achieved by a full-sensing protocol. The current best result is due to Tsybakov and Likhanov [TL87], who have shown that no protocol can achieve capacity higher than .568. (For more information, see [EH98] [GFL87] [MP93] [RS90].) In the full-sensing model, one typically assumes that messages are born at real "times" which are chosen uniformly from the unit interval. Recently, Loher [L98a] [L98b] has shown that if a protocol is required to respect these birth times, in the sense that packets must be successfully delivered in their birth order, then no protocol can achieve capacity higher than .4906. Intuitively, the "first-come-first-served" restriction seems very strong, so it is somewhat surprising that the best known algorithm without the restriction (that of Vvedenskaya and Pinsker) does not beat this upper bound. The algorithm of Mosely and Humblet satisfies the first-come-first-served restriction.

Some Protocols

Goldberg and MacKenzie's Multi-Channel Protocol: The variant of polynomial backoff studied by Goldberg and MacKenzie is actually the non-geometric variant discussed in footnote [6]. This variant is better for practical implementation because it does not require as many random numbers to be generated as the geometric variant.

The N-user K-channel protocol studied by Goldberg and MacKenzie is as follows: Each user keeps track of the number of times it has failed on each channel. On each step it checks which channels it would have sent on, had it been running K independent copies of polynomial backoff. If it would have sent on exactly one channel, then it does so. Otherwise, it assumes that all of the messages that it would have sent collided (and it doesn't send any messages).
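One step of this rule for a single user might be sketched as follows (variable and parameter names are ours; the real protocol's bookkeeping differs in its details):

```python
import random

def multichannel_step(fail_counts, alpha, rng):
    """One step for one user in the N-user K-channel rule described above.
    fail_counts[j] tracks the user's failures on channel j; copy j of
    polynomial backoff would send with probability 1/(fail_counts[j]+1)**alpha.
    The user transmits only if exactly one copy would send; otherwise it
    assumes all those hypothetical sends collided and records failures.
    Returns the channel sent on, or None."""
    would_send = [j for j, f in enumerate(fail_counts)
                  if rng.random() < 1.0 / (f + 1) ** alpha]
    if len(would_send) == 1:
        return would_send[0]
    for j in would_send:          # treat the simulated sends as collisions
        fail_counts[j] += 1
    return None
```

Note that when two or more simulated copies would send, the user backs off on all of them even though nothing was actually transmitted, which is exactly the pessimistic assumption described above.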

Goldberg, MacKenzie, Paterson and Srinivasan's Protocols:

Queue-Free-Model Protocol: The model considered is the same as the queue-free model described in these notes except that the users are assumed to share a clock. That is, they all agree about which time-step is being executed. The synchronisation allows the users to use different time steps for different purposes.

The protocol itself is simple. We construct an explicit, easily computable collection {Si,t : i,t = 0,1,2,...} of finite sets of nonnegative integers Si,t where, for all i and t, every element of Si,t is smaller than every element of Si+1,t. A message born at time t which has made i (unsuccessful) attempts to send to the channel so far picks a time w uniformly at random from Si,t and tries to use the channel at time w. If it succeeds, it leaves the system. Otherwise, it increments i and repeats this process.

The sets Si,t are constructed as follows: Imagine an infinite tree with a countably infinite number of leaves ordered left-to-right at level 0. The parents of the leaves are at level 1, and their parents are at level 2, and so on. Each internal node has 4 Delta^2 children (for an appropriately chosen constant Delta, which depends upon lambda). Each node of the tree will be given a "trial set" of time-steps. (If the node is at level i, then its trial set has size b (4 Delta)^i, for a constant b, which is chosen to be sufficiently large.) During the lifetime of the protocol, a message is regarded as residing at some node of the infinite tree. Initially, a message born at time-step t resides at the left-most leaf whose trial set contains only time-steps greater than t. The message picks a time-step w uniformly at random from the trial set of the leaf where it resides and tries to use the channel at time w. If the message fails, then it moves to the parent of its current home. It then picks a time-step uniformly at random from the trial set of its new home and repeats. The trial sets at level i of the tree consist of those time-steps which, when written in base Delta, have zeros in their i least significant digits and a non-zero in their next least significant digit. The trial sets are assigned to the nodes in order. That is, the leftmost leaf of the tree is given the smallest b "level 0" time-steps, the next leaf is given the next smallest b "level 0" time-steps, and so on. The trial set of a node v at level i consists of the b (4 Delta)^i smallest level-i time-steps which are greater than all of the time-steps in the trial sets of v's children. It turns out that all of the time-steps in v's trial set are larger than the time-steps in the trial sets of v's left siblings.
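The base-Delta structure of the trial sets is easy to compute directly: a level-i time-step is one with exactly i trailing zero digits in base Delta. A small sketch (our code):

```python
def level_of(step, delta):
    """Number of trailing zeros in the base-delta representation of step,
    i.e. the tree level whose trial sets this time-step belongs to."""
    assert step >= 1
    i = 0
    while step % delta == 0:
        step //= delta
        i += 1
    return i

def level_steps(i, delta, limit):
    """The level-i time-steps below limit, in increasing order.  Trial
    sets at level i are formed by handing these out, in order, in groups
    whose size grows like b*(4*delta)**i."""
    return [t for t in range(1, limit) if level_of(t, delta) == i]
```

For instance, with delta = 10 the level-1 time-steps below 100 are 10, 20, ..., 90: one trailing zero digit, then a non-zero digit.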

Queueing Model Protocol: The basic idea is to simulate the queue-free-model protocol described above with N users. If the N users are synchronised (that is, they share a clock) the simulation is straightforward, but it involves a few data-structure issues (because each user is responsible for a potentially unbounded number of messages, yet it must manage them in constant time at each time step). The basic solution is for each user to maintain, for each level of the tree, a linked list of its messages at that level. At every "level i" time-step, the user calculates the probability that at least one of these messages tries to send, and sends a message if one tries to send. Technically, if it finds that more than one of the messages under its control tries to send, then it should send a dummy message to the channel (to ensure that the simulation is faithful to the original protocol). It is not necessary to send dummy messages, however. This follows from the fact that the queue-free-model protocol is deletion resilient, which essentially means that the evolution of the protocol from any given state s stochastically dominates the evolution of the protocol from any state s' which is the same as s except that some messages have been removed. Let P be a protocol running on N completely synchronised users which simulates the synchronised queue-free protocol for N^2 - 1 steps, then skips a step, and then continues. (Inputs might arrive during the skipped step.) We will refer to P below.
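The constant-time-per-step simulation hinges on computing, in one shot, the chance that at least one of the user's messages at the current level picks this time-step. A sketch (our names):

```python
import random

def user_sends(n_messages, trial_set_size, rng):
    """Each of the user's n_messages at this level would pick the current
    time-step independently with probability 1/trial_set_size.  Rather
    than flipping n_messages coins, flip one coin with the probability
    that at least one of them picks this step.  Deletion resilience means
    no dummy message is needed when two or more would have collided."""
    p_none = (1.0 - 1.0 / trial_set_size) ** n_messages
    return rng.random() < 1.0 - p_none
```

This is why the linked list per level suffices: the user only needs the count of messages at the current level, not a per-message coin flip.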

Since the users in the queueing model are not synchronised (they do not share a clock), it is necessary to build the synchronisation into the protocol itself. Thus, the structure of the protocol is as follows. Most of the time, the users are synchronised and are simulating the queue-free-model protocol (in fact, they are simulating P). The users occasionally enter a synchronising phase to make sure that the clocks are synchronised (or to resynchronise after a user enters the system). The synchronisation phase has some probability of (undetectably) failing, and thus it must be repeated periodically to guarantee constant expected delay. The synchronising phase of the protocol is somewhat complicated, because it must synchronise the users even though communication between users can only be performed through acknowledgements (or lack thereof) from the multiple-access channel. A further complicating factor is that the synchronising phases have to be robust, in the sense that the model allows users to start and stop, so some users may miss synchronisation phases, others may try to start synchronising in the middle of an ongoing synchronisation phase and so on.

For the purposes of description, we assume that all of the users know N, but it suffices for N to be an upper bound on the number of users. W is taken to be 12 N^4. In the normal state a user maintains a buffer B of size N^7 and a queue Q of unbounded length. B and Q contain messages to be sent, and when a message is generated it is put into B. For each message m in B the user maintains a variable trial(m) which contains the next step on which the user will attempt to send m. The step trial(m) will be chosen using protocol P. When P is ``skipping a step'' the protocol will take the opportunity to try to send some messages from Q: at such steps, with probability 1/(3N), the user attempts to send the first message in Q. Each user also maintains a list L which keeps track of the results (either ``failure'' or ``success'') of the most recent message sending attempts from Q, up to N^2 of them. A user goes into a synchronising state if a message has remained in the buffer for N^7 steps or if L is full (contains N^2 results) and only contains failures. It also goes into a synchronising state from time to time even when these events do not occur. (It synchronises if it has been simulating P for at least N^{40} steps, and it synchronises with probability N^{-30} on any given step.) If the user does go into a synchronising state, it transfers all messages from B to the end of Q. In the synchronising state, a user could be in one of many possible stages, and its actions depend on the stage that it is in. It will always put any generated messages into the queue. Also, it only sends dummy messages in the synchronising state. (The dummy messages are used for synchronising. Real messages that arrive during the synchronisation phase must wait until the next normal phase to be sent.) The various synchronisation stages are as follows (a user goes through these stages in order).

The length of each of these stages is very important, both in terms of achieving a high probability of synchronisation and a high level of robustness. The high probability of synchronisation is achieved by making the ``preliminary'' stages (i.e., JAMMING, FINDING LEADER, and ESTABLISHING LEADER) of length Theta(W) (this is long enough to guarantee all users in a normal state will detect a synchronisation), and the ``synchronising'' stages (i.e. SETTING CLOCK, COPYING CLOCK, and WAITING) of length Theta(WN^2) (this gives users enough time to determine the leader's clock modulo 4W, with high probability). The high level of robustness is achieved by the following properties: The differing lengths of time for the ``preliminary'', ``synchronising'' and POLLING stages, and the fact that only the POLLING stage could cause another synchronisation to occur, guarantee that bad events such as separate groups of users continually interrupting each other's synchronisation cannot occur, even when up to N users are starting at different times (and stopping periodically). Whenever a user joins the multiple-access channel, it starts the protocol in the JAMMING stage with L empty.
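The triggers for a normal-state user to leave the simulation and enter a synchronising state (listed above) can be sketched as follows; the variable names are ours.

```python
import random

def enters_sync_state(oldest_buffer_age, L, steps_simulating_P, N, rng):
    """Whether a user in the normal state enters a synchronising state.
    L holds the results ('failure'/'success') of the most recent
    queue-send attempts, capped at N**2 entries."""
    if oldest_buffer_age >= N ** 7:
        return True                        # a message sat in B too long
    if len(L) == N ** 2 and all(r == "failure" for r in L):
        return True                        # L is full and all failures
    if steps_simulating_P >= N ** 40:
        return True                        # periodic forced resync
    return rng.random() < N ** -30         # random resync on any step
```

The first two triggers detect that something has gone wrong; the last two guard against a synchronisation having undetectably failed.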

Raghavan and Upfal's Protocol: Each user has a message buffer of size O(log(N)) and a queue. During the protocol the user is either in the normal state or in the reset state. During each step in which the user is in the normal state, it places any new messages in the buffer. Then it chooses a random slot of the buffer. If the slot contains a message, then the user tries to send that message. Otherwise, with probability 1/N^2, it tries to send the message at the head of the queue. If

then the user moves all messages in the buffer to the end of the queue and switches to the reset state for a while. During each step in which the user is in the reset state, it places any new messages in the queue. With probability 1/N^2, it tries to send the message at the head of the queue.
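A sketch of one normal-state step (our variable names; the description above does not detail what happens when the buffer is full, so here overflowing arrivals simply join the queue — an assumption of ours):

```python
import random

def normal_state_step(buffer, queue, arrivals, N, rng):
    """One normal-state step: place arrivals in the buffer, pick a random
    buffer slot, and try to send its message if there is one; otherwise,
    with probability 1/N**2, try to send the head of the queue.  buffer
    is a fixed-size list with None marking empty slots.  Returns the
    message the user attempts to send, or None."""
    for m in arrivals:
        if None in buffer:
            buffer[buffer.index(None)] = m
        else:
            queue.append(m)          # overflow handling: our assumption
    slot = rng.randrange(len(buffer))
    if buffer[slot] is not None:
        return buffer[slot]          # try the randomly chosen message
    if queue and rng.random() < 1.0 / N ** 2:
        return queue[0]              # rarely, try the head of the queue
    return None
```

The random-slot rule means a user sends with probability proportional to how full its buffer is, which is the self-regulating behaviour the analysis exploits.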

Some Proof Methods

Aldous's Method: Aldous's instability proof for the binary exponential backoff protocol is based on the following idea. Suppose that at a given time slot it is the case that for every ``small'' positive integer b, there are ``many'' messages which have been sent unsuccessfully b times. (This event occurs with some small, but positive probability.) Aldous shows that when this event occurs, there is a positive probability that the system gets into a worse and worse state with more and more messages building up. His argument is essentially a potential function (Lyapounov function/supermartingale) argument in which the potential of the system at a given time-slot is taken to be the expected number of sends during the time-slot. The proof shows that the potential of the system tends to infinity as time goes on.
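In the version of binary exponential backoff analysed here, a message whose backoff counter is b sends with probability 2^(-b) in each slot, so the potential of a state is simple to write down (a sketch, with our naming):

```python
def beb_potential(counts):
    """Aldous's potential: the expected number of send attempts in the
    current slot, where counts[b] is the number of messages with backoff
    counter b and each such message sends with probability 2**-b."""
    return sum(n * 2.0 ** -b for b, n in enumerate(counts))
```

For example, one message at counter 0, two at counter 1 and four at counter 2 give potential 1 + 1 + 1 = 3 expected sends.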

More details: Define a function L(t) which is Theta(log(t)). The potential at t is said to be ``small'' if it is at most (lambda/2) L(t) and ``large'' otherwise. A state is said to be ``good'' if, for all small enough b, at least lambda 2^b messages have backoff counter b (note that this is the number of messages with backoff counter b in the equilibrium state of an externally blocked channel).
Theorem: Given a good state at time t0, with positive probability, all following states have large potential.
Proof: Pick any time t >= t0 and show that the probability that t is the first step with small potential is very small (small enough to sum over t). In particular,

Note that the particular potential function that Aldous uses can only be used to show instability for protocols for which the proportion of time-slots in which successful transmissions occur is vanishingly small.

Goodman, Greenberg, Madras and March's Method: The proof uses the following theorem [11]:
Theorem: Suppose that epsilon is a positive constant and y(t) is a discrete time stochastic process on a countable state space. Let T be a stopping time [12] with respect to y(t). Suppose f(t) is non-negative and

E(f(t+1) - f(t) | y(0),...,y(t), t < T) <= -epsilon.
Then E(T) <= f(0)/epsilon.
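The theorem is a Foster/Lyapunov-style drift bound. As a sanity check (entirely our toy example, not from the paper), take a walk on the nonnegative integers that steps down with probability q > 1/2 and up otherwise, and let f be the identity: the one-step drift is -(2q-1), so the theorem bounds the expected hitting time of 0 by y(0)/(2q-1).

```python
import random

def hitting_time_of_zero(start, q, rng):
    """First time a walk started at `start`, stepping -1 with probability
    q and +1 otherwise, hits 0.  With f(y) = y the drift theorem bounds
    its expectation by start/(2q-1)."""
    y, t = start, 0
    while y > 0:
        y += -1 if rng.random() < q else 1
        t += 1
    return t

rng = random.Random(42)
runs = [hitting_time_of_zero(5, 0.75, rng) for _ in range(2000)]
mean = sum(runs) / len(runs)
# drift bound: 5 / (2*0.75 - 1) = 10; for this walk the bound is tight
```

Here the empirical mean hovers around the bound 10, illustrating that the theorem can be tight when the drift is constant.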

The Lyapounov function f is defined to be the sum of

The stopping time T is chosen to be the first time that the system returns to the start state. Goodman et al. then show that for every state y other than the start state,
(*) E(f(t+1) - f(t) | y(t)=y) <= -epsilon.
Using the theorem, this implies that E(T) <= E(f(1) | y(0) = start state)/epsilon, which is finite.

The proof of (*) proceeds in two parts:

Goldberg, Jerrum, Kannan and Paterson's Method:

Goldberg and MacKenzie's Method: In order to simplify the analysis, Goldberg and MacKenzie introduce a "step counter" si,j for each user i and channel j. If the backoff counter bi,j for user i and channel j is incremented then si,j is set to the ceiling of (bi,j +1)^alpha. On every step, user i decides to send on channel j with probability equal to the inverse of si,j, and then si,j is decreased by 1.

The general approach is that of Hastad et al (see below) with potential function

f(s) = Sum_(1<=i<=N)Sum_(1<=j<=K) ( qi,j + (bi,j+1)^(alpha + (1/2)) - si,j^(1-1/(4alpha)) ).
Thus the goal is to prove that the corresponding Markov chain is positive recurrent and that
(*) E(lim_(t->infinity)(1/t) Sum_(1<=s<=t) f(s)) is finite.

Once it is proved that the Markov chain is ergodic (positive recurrent) and strictly stationary (the probability distribution remains the same after any given number of steps), the strong ergodic theorem applies and says that (*) boils down to showing that the expectation of f(t) is finite in the stationary distribution. As in the work of Hastad et al., this reduces to showing that for any state s with f(s) <= V, there is a stopping time Ts <= V-1 such that

E( f(y(Ts))^2 - f(y(0))^2 | y(0) = s) <= - f(s).

Thus, the hard part of the proof is once again a detailed analysis of the drift for an appropriately defined stopping time. The main difficulty in the analysis is the dependence between different users, many of whom must simultaneously be successful to decrease the potential. Actually, the analysis is so complicated in the multi-channel case that it would be very nice to have a better proof method.

Goldberg, MacKenzie, Paterson and Srinivasan's Method:

Queue-Free-Model Protocol: The partitioning of time-steps for different purposes makes it possible to prove that as long as lambda is at most 1/e, the expected delay of any message is O(1). Intuitively, the reason that the protocol works is that it handles bursts of arrivals well: When a large number of messages arrives in a short time, these messages are handled independently since the trial sets of the relevant nodes are pairwise disjoint. Over time, the problem messages get spread out from each other (so they succeed) but they also are kept out of the way of new arrivals (because they eventually reside at high levels of the tree).

The analysis of the protocol is based on the following idea, where c denotes e(1+delta) for some constant delta>0: suppose we have ell time slots and at most ell/c messages. A reasonable strategy is as follows:

Note that in all at most ell time steps are used. However, for any positive constant epsilon, if we run the above for a sufficient number (m(epsilon)) of steps, then the probability that any given message survives is at most epsilon and the probability of having ell mu messages remaining is exponentially small in ell.
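One round of this spreading strategy can be sketched directly (our code): each message picks one of the ell slots uniformly at random, and a message succeeds exactly when it is alone in its slot.

```python
import random
from collections import Counter

def one_spreading_round(n_messages, n_slots, rng):
    """Scatter n_messages uniformly over n_slots and return how many
    survive, i.e. land in a slot shared with at least one other message."""
    picks = [rng.randrange(n_slots) for _ in range(n_messages)]
    load = Counter(picks)
    return sum(1 for s in picks if load[s] > 1)
```

With at most ell/c messages for c = e(1+delta), a standard balls-in-bins calculation bounds each message's survival probability strictly below 1, which is what drives the sharp concentration claimed above.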

The protocol is designed to essentially simulate many (an infinite number of!) parallel copies of the above idea. As a message moves up the tree it chooses a random trial time from a set that is 4 Delta times larger. However, each internal node has 4 Delta^2 children, so effectively the size of the trial set is going down by a factor of Delta with each level.

The proof method is to show that for every level i, the probability that at least

b (4 Delta)^i (2 Delta)^{t-1}/(e(1+delta))
messages enter it is doubly exponentially small in i and t. The theorem then follows from the above analysis by looking at the case t=1.

Queueing-Model Protocol: The proof uses the approach of Raghavan and Upfal in showing that, conditioned on any given system state at a given time, the expected system state is good T steps later. The key steps of the proof are

The proof then follows because of the requirement that once a user starts it has to continue running "for a while". This ensures that most steps do not have users starting or stopping, so in fact the expected average waiting time of messages is O(1) (and not O(N^(37))!).

Hastad, Leighton and Rogoff's Method: In order to prove that polynomial backoff is stable, Hastad et al. study the potential function

f(s) = Sum_(1<=i<=N) qi + Sum_(1<=i<=N) (bi+1)^(alpha + (1/2))-N,
where qi is the length of the i'th queue in state s and bi is the size of the i'th backoff counter in s. They infer stability from the following conditions, where delta, d and V are taken to be constants: Condition (1) is used to show that for some constant c (which depends upon delta, d and V),
(*) E(Sum_(0 <= t <= T*) L(y(t)) | y(0)=s) <= c f(s)^2,
where T* is the stopping time giving the time of the first visit to a state with potential at most V and L(y) denotes the number of messages in the system in state y.

By Theorem 14.0.1 of Meyn and Tweedie [MT93], (*) implies that the chain is positive recurrent with a stationary distribution such that the expected value of L (the number of messages in the system) is finite in the stationary distribution.

The proof of (*) is by induction on time, by considering a new stopping time T*j which is the first time that either

Then it is shown by induction on j that
E(Sum_(0 <= t <= T*j) L(y(t)) | y(0)=s) <= c f(s)^2.

The main part of the proof is to establish condition (1), and this is done by detailed (and clever) case analysis (with a careful choice of the stopping time). The analysis is complicated by dependence between the events that are considered.
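For reference, the potential f defined above is cheap to compute from a state (a sketch with our naming):

```python
def hlr_potential(queues, backoffs, alpha):
    """Hastad-Leighton-Rogoff potential of a state: the sum of the N
    queue lengths, plus the sum over users of (b_i+1)**(alpha+1/2),
    minus N."""
    N = len(queues)
    return (sum(queues)
            + sum((b + 1) ** (alpha + 0.5) for b in backoffs)
            - N)
```

The exponent alpha + 1/2 (rather than alpha) is what lets a success, which resets a backoff counter, pay for the queue growth accumulated while that counter was large.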

To prove the instability of binary exponential backoff, Hastad et al. find a potential function for which, for any state in the system, the potential is expected to increase by at least a fixed amount during the subsequent transition.

Given the nature of the potential function, it is easy to see that the expected average waiting time of messages (and, therefore, the expected average number of waiting messages in the system) grows without bound. One can then prove that the system is not positive recurrent by using a theorem such as the following (Theorem 2.2.6 of [FMM95]).
Theorem: For an irreducible Markov chain to be non-positive-recurrent, it is sufficient that there exist a positive function f on the state space and constants c and d such that

Ingenoso's Method: We can assume that the sequence {pi} tends to zero as i tends to infinity. (Otherwise, the result follows from Kelly and MacPhee's result.) Furthermore, we can assume that for any fixed age T, the expected time that elapses before an age-T message next attempts to send is finite. Thus, the sum (from t = T to infinity) of the product (from i = T to t) of (1 - pi) is finite. The finiteness of this sum prevents the {pi}s from tailing off too quickly and guarantees that, with positive probability, the backlog of messages grows without bound. In particular, Ingenoso defines a sequence t1,t2,... of ``observation times'' and shows

Kelly and MacPhee's Method: Consider a system in which the protocol runs in an externally jammed channel, with Poisson arrivals with mean lambda. Let qt denote the sum p0+ ... + pt. Let F(lambda) be a random variable denoting the set of steps in which at most one message tries to transmit. The number of attempted transmissions at step t is Poisson with mean lambda qt, so the expected size of F(lambda) is the sum from t=1 to infinity of

(1 + lambda qt) exp(- lambda qt).
Note that the expected size of F(lambda) is non-increasing in lambda. The critical arrival rate for the protocol is defined to be
lambdac = inf { lambda | E(|F(lambda)|) < infinity }.
If lambda > lambdac then, with probability 1, | F(lambda) | is finite.
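The defining sum is easy to evaluate numerically. As an illustration (our code; the infinite sum is truncated at a finite horizon), take the age-based protocol with p_t = 1/(t+1): then q_t grows like log(t), the t-th term behaves roughly like t^(-lambda), and the truncated sum stays bounded for lambda above 1 but grows with the horizon for lambda below 1.

```python
import math

def expected_F_size(p, lam, horizon):
    """Truncated value of E|F(lambda)|: in an externally jammed channel
    the number of attempts at step t is Poisson with mean lam*q_t, where
    q_t = p_0 + ... + p_t, so step t lies in F(lambda) with probability
    (1 + lam*q_t) * exp(-lam*q_t)."""
    q, total = 0.0, 0.0
    for t in range(horizon):
        q += p(t)                 # running value of q_t
        m = lam * q
        total += (1.0 + m) * math.exp(-m)
    return total
```

For p(t) = 1/(t+1) this numerically suggests a critical arrival rate of 1, consistent with the log(t) boundary discussed below.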

Now suppose that lambda > lambdac and consider a system in which the protocol runs with arrival rate lambda without an externally jammed channel. Consider adding a new stream of Poisson arrivals with mean epsilon for some epsilon > 0. If F(lambda) is finite, the probability that every step in F(lambda) is blocked by one of the new arrivals is at least some constant p > 0. Thus, the probability that there are no successes is at least p. This implies that for any step t, the probability that there are no successes from step t onwards is at least p. This holds even if we condition on the system being in any given state at step t. Thus, if Ti denotes the step in which the i'th success takes place, then the probability that there are no successes after step Ti , conditioned on the fact that Ti is finite, is at least p. Finally, the probability that there are at least n successes is at most (1-p)^n, so the expected number of successes is finite. We conclude that for any arrival rate lambda > lambdac, the expected number of successful transmissions is finite.

We now wish to show that if lambda is less than lambdac then the expected number of successful transmissions is infinite. To do this, we note first that the set of steps in which there is at most one retransmission is a superset of F(lambda). (This is also true when we run without the blocked channel.) On any step in which there is exactly one retransmission, the probability of success is at least e^(-lambda) and on any step in which there is no retransmission, the probability of success is at least p0 lambda e^(-lambda). The result follows because Kelly and MacPhee assume that p0 = 1. (Note: A similar result holds even if p0 = 0. Let T be the smallest integer such that pT > 0. Then consider the set of steps in which there is at most one transmission by messages older than T. Then calculate the (positive) probability of having exactly one send on such a step...)

Finally, we show that the critical arrival rate is zero if and only if qt = omega(log(t)). To see this, note that the expected size of F(lambda) is at most a constant times the sum from t = 1 to infinity of

qt exp(- lambda qt).
Rewriting this, we see that the expected size of F(lambda) is at most a constant times the sum from t=1 to infinity of
qt t^(- lambda (qt/log(t))).
Finally, since qt <= t, the expected size of F(lambda) is at most a constant times the sum from t=1 to infinity of
t^(1- lambda (qt/log(t))).
It is apparent from the above that E( | F(lambda) | ) is finite if qt >= ((2 + epsilon)/lambda)log(t).

Raghavan and Upfal's Method: A basic tool used by Raghavan and Upfal is a lemma which shows that, for any x, if the system has x messages in the buffer at step t then, with high probability, the buffer is at most 1/4 full at step t + T(N), for an appropriately chosen function T(N). The Markov chain describing the messages in the buffers has a stationary distribution, and in that stationary distribution the probability that any user switches to the reset state is at most N^(-8).

These two facts can be used to show that the expected amount of time that a message spends in a buffer is O(log N). Similarly, they can be used to show that, with high probability (with probability at least 1 - N^(-5)), the message is delivered from the buffer and never enters the queue.

The final part of the proof is to show that if the message is put in the queue, then the expected time that it spends there is O(N^4 log(N)). This is established by showing that the probability that the message is put in a queue of any given length is (relatively) low, and that the expected time needed to send the message, given its position in the queue, is not too long. The analysis is done by breaking the period of time after the message is put in the queue into intervals and then proving (using the tool described earlier) that there is a reasonably high probability of successful deliveries from the queue during each interval.

Some References

There has been a huge amount of work on the contention-resolution problem. This bibliography is not intended to contain all publications on the subject. Instead, we are most interested in publications which prove theoretical results about contention on multiple-access channels, without using unproven assumptions, and in publications which are useful for proving such results. I'm sure that I have omitted many important publications. Feel free to tell me about them. (I haven't read all of the publications on this list!)

[IEEE85] Special issue of IEEE Trans. on Information Theory, IT-31 1985.

[Abr73] N. Abramson, The Aloha system, in Computer-Communication Networks, N. Abramson and F. Kuo, editors, Prentice-Hall (1973).

[Ald87] D. J. Aldous, Ultimate Instability of Exponential Back-Off Protocol for Acknowledgement-Based Transmission Control of Random Access Communication Channels, IEEE Trans. on Information Theory, IT-33(2) (March, 1987) 219-223.

[AGM00] H. Al-Ammal, L.A. Goldberg and P. MacKenzie, Binary Exponential Backoff is stable for high arrival rates, Proc. Int'l. Symp. on Theoretical Aspects of Computer Science 17, Lecture Notes in Computer Science 1770, H. Reichel and S. Tison (Eds.) (Springer-Verlag, 2000) 169-180.

[Cap79] J.I. Capetanakis, Tree Algorithms for Packet Broadcast Channels, IEEE Trans. on Information Theory IT-25(5) (1979) 505-515.

[Chl00] B.S. Chlebus, Randomized Communication in Radio Networks, to appear in Handbook on Randomized Computing, Kluwer Academic Publishers, P.M. Pardalos, S. Rajasekaran, J. Reif and J.D.P. Rolim, Eds. (2000).

[CFM83] M. Cottrell, J-C. Fort and G. Malgouyres, Large deviations and rare events in the study of stochastic algorithms, IEEE Trans. Automat. Contr. AC-28 (1983) 907-920.

[EH98] A. Ephremides and B. Hajek, Information theory and communication networks: an unconsummated union, IEEE Trans. Inf. Theory 44(6) (1998) 2416-2432.

[FMM95] G. Fayolle, V.A. Malyshev and M.V. Menshikov, Topics in the Constructive Theory of Countable Markov Chains, Cambridge University Press, 1995.

[FFH86] G. Fayolle, P. Flajolet and M. Hofri, On a functional equation arising in the analysis of a protocol for a multi-access broadcast channel, Adv. Appl. Prob. 18 (1986) 441-472.

[FFHJ85] G. Fayolle, P. Flajolet, M. Hofri and P. Jacquet, Analysis of a stack algorithm for random multiple-access communication, IEEE Trans. on Information Theory IT-31(2) (1985) 244-254.

[FJ87] P. Flajolet and P. Jacquet, Analytic models for tree communication protocols, in Flow control of congested networks, NATO Advance Science Institute Series, Series F: Computer and systems sciences 38 (A.R. Odoni, L. Bianco and G. Szego, Eds.) (Springer-Verlag 1987) 223-234.

[GJKP00] L.A. Goldberg, M. Jerrum, S. Kannan and M. Paterson, A bound on the capacity of backoff and acknowledgement-based protocols, Research Report 365, Department of Computer Science, University of Warwick, Coventry CV4 7AL, UK (Jan 2000).

[GM99] L. A. Goldberg and P. MacKenzie, Analysis of Practical Backoff Protocols for Contention Resolution with Multiple Servers. Journal of Computer and System Sciences 58 (1999) 232--258.

[GMPS98] L. A. Goldberg, P. MacKenzie, M. Paterson and A. Srinivasan, Contention Resolution with Constant Expected Delay, Submitted for journal publication. (Preliminary versions of this work appeared in a paper written by the third and fourth authors (FOCS 95, 104-113) and in a paper written by the first and second authors (FOCS 97, 213-222).)

[GGMM88] J. Goodman, A.G. Greenberg, N. Madras and P. March, Stability of binary exponential backoff, J. of the ACM 35(3) (1988) 579-602.

[GFL87] A. G. Greenberg, P. Flajolet and R. E. Ladner, Estimating the multiplicities of conflicts to speed their resolution in multiple access channels, J. of the ACM 34(2) (1987) 289-325.

[H82] B. Hajek, Acknowledgement-based random access transmission control: An equilibrium analysis, Proc. Int. Conf. Communications New York, IEEE Press, 1982.

[HLR96] Johan Hastad, Tom Leighton and Brian Rogoff, Analysis of Backoff Protocols for Multiple Access Channels, SIAM J. Comput. 25(4) (1996) 740-774.

[Ing93] M.J. Ingenoso, Stability Analysis for Certain Queueing Systems and Multi-Access Communication Channels, PhD Thesis, University of Wisconsin-Madison, 1993.

[Kel85] F. P. Kelly, Stochastic Models of Computer Communication Systems, J. R. Statist. Soc. B 47(3) (1985) 379-395.

[KM87] F.P. Kelly and I.M. MacPhee, The Number of Packets Transmitted by Collision Detect Random Access Schemes, The Annals of Probability 15(4) (1987) 1557-1568.

[L98a] U. Loher, Efficiency of first-come first-served algorithms, Proc. ISIT (1998) p108.

[L98b] U. Loher, Information-theoretic and genie-aided analyses of random-access algorithms, PhD Thesis, Swiss Federal Institute of Technology, DISS ETH No. 12627, Zurich (1998).

[MacP89] I.M. MacPhee, On Optimal Strategies in Stochastic Decision Processes, D. Phil Thesis, University of Cambridge, (1987).

[MF85] P. Mathys and P. Flajolet, Q-ary Collision Resolution Algorithms in Random-Access Systems with Free or Blocked Channel Access, IEEE Trans on Information Theory IT-31(2) (1985) 217-243.

[MB76] R. Metcalfe and D. Boggs, Distributed packet switching for local computer networks, Comm. ACM 19 (1976) 395-404.

[MT93] S.P. Meyn and R.L. Tweedie, Markov Chains and Stochastic Stability, Springer-Verlag, 1993.

[MT81] V.A. Mikhailov and T.S. Tsybakov, Upper bound for the capacity of a random multiple access system, Problemy Peredachi Informatsii, 17(1) (1981) 90-95.

[MP93] M. Molle and G.C. Polyzos, Conflict Resolution Algorithms and their Performance Analysis, Technical Report Number CS93-300, Computer Systems Research Institute, University of Toronto, 1993.

[MH85] J. Mosely and P.A. Humblet, A class of efficient contention resolution algorithms for multiple access channels, IEEE Trans. on Communications, COM-33(2) (1985) 145-151.

[Pip81] N. Pippenger, Bounds on the performance of protocols for a multiple access broadcast channel, IEEE Trans. on Information Theory, IT-27 (1981) 145-151.

[RS90] R. Rom and M. Sidi, Multiple access protocols: performance and analysis, (Springer-Verlag, 1990).

[Ros84] W.A. Rosenkrantz, Some theorems on the instability of the exponential back-off protocol, Performance '84 (E. Gelenbe, Ed.) (North Holland) (1984) 199-205.

[RU95] P. Raghavan and E. Upfal, Contention resolution with bounded delay, Proc. ACM Symp. on Theory of Computing 24 (1995) 229-237.

[Sti74] S. Stidham, The last word on L = lambda W, Operations Research 22 (1974) 417-421.

[SR86] W. Szpankowski and V. Rego, Instability conditions arising in analysis of some multiaccess protocols, technical report CSD-TR-577, Purdue University, Lafayette, Indiana, 1986.

[TL87] B.S. Tsybakov and N. B. Likhanov, Upper Bound on the Capacity of a Random Multiple-Access System, Problemy Peredachi Informatsii, 23(3) (1987) 64-78.

[TM78] B.S. Tsybakov and V.A. Mikhailov, Free synchronous packet access in a broadcast channel with feedback, Probl. Information Transmission, 14(4) (1978) 259-280.

[V86] S. Verdu, Computation of the efficiency of the Mosely-Humblet contention resolution algorithm: a simple method, Proc. of the IEEE 74(4) (1986) 613-614.

[VP83] N. D. Vvedenskaya and M.S. Pinsker, Non-optimality of the part-and-try algorithm, In Abstracts of the International Workshop on Convolutional Codes, Multiuser Communication, Sochi, USSR, (1983) 141-144.


[1] It is standard to work in a slotted model in which time is divided into discrete time slots. During any time slot, each user can send at most one message. The user can determine (by listening to the channel) whether the message was successfully transmitted. While real implementations of multiple-access channels do not fit precisely within the slotted model, it can be shown (for example, see [HLR96] and [KM87]) that results obtained in the slotted model do apply to realistic multiple-access channels.


[3] That is, the probability that k messages arrive on a given time step is (lambda^k / k!) e^(-lambda).
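As a concrete check on this formula, the arrival distribution can be computed and sampled using only the standard library; the function names below are mine, purely for illustration:

```python
import math
import random

def poisson_pmf(k, lam):
    """Probability that exactly k messages arrive in a given slot when
    arrivals are Poisson with rate lam: (lam^k / k!) e^(-lam)."""
    return (lam ** k / math.factorial(k)) * math.exp(-lam)

def sample_arrivals(lam, slots, seed=0):
    """Draw the number of arrivals in each of `slots` time steps by
    inversion of the cumulative distribution function."""
    rng = random.Random(seed)
    counts = []
    for _ in range(slots):
        u, k, cum = rng.random(), 0, poisson_pmf(0, lam)
        while u > cum:
            k += 1
            cum += poisson_pmf(k, lam)
        counts.append(k)
    return counts
```

Inversion sampling converges after a few terms here because the arrival rates of interest are below 1.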

[4] For a good discussion of different notions of stability, see [FMM95], [MT93] and [HLR96].

[4A] In practice, it is possible to implement the full-sensing model when there is a single channel, but this becomes increasingly difficult in situations where there are multiple shared channels, such as optical networks. Thus, acknowledgement-based protocols are sometimes preferable to full-sensing protocols. For work on contention-resolution in the multiple-channel setting, see [GM99].

[5] An irreducible aperiodic Markov chain is said to be transient if the probability of returning to the start state (in this case, the state in which no messages are waiting) is less than 1. Note: An irreducible Markov chain is one in which every state can be reached from every other state. It is said to be aperiodic if, for every state s and every sufficiently large t, there is positive probability of moving from state s back to state s in t steps.
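To make the definition concrete, here is a small Monte-Carlo sketch (the model and function name are mine, not taken from any of the cited papers) using a biased random walk as a stand-in for the backlog chain; with upward drift the estimated return probability is visibly below 1:

```python
import random

def estimate_return_prob(p_up, trials=20000, horizon=2000, seed=0):
    """Estimate the probability that a walk on the non-negative
    integers, started at state 1 (one waiting message), ever returns
    to the empty state 0.  From any state s >= 1 the walk moves to
    s+1 with probability p_up and to s-1 otherwise.  For p_up > 1/2
    the chain is transient: the true return probability is
    (1 - p_up)/p_up < 1."""
    rng = random.Random(seed)
    returns = 0
    for _ in range(trials):
        s = 1
        for _ in range(horizon):
            s += 1 if rng.random() < p_up else -1
            if s == 0:
                returns += 1
                break
    return returns / trials
```

With p_up = 0.7 the estimate comes out close to 3/7, while with p_up = 0.3 (downward drift, a recurrent chain) essentially every trial returns.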

[6] Perhaps the easiest way to see this (which is from Kelly and MacPhee's paper) is to consider the following variant of a backoff protocol: suppose that when a message is sent unsuccessfully for the i'th time, it chooses an integer w in [1,...,(i+1)^k] (for some constant k) and then waits w steps before trying again. This variant is similar to the backoff protocol in which p_i = (i+1)^(-k), but it is easier to understand in this context. Now let R(t) be a random variable representing the number of times that a message tries (in the variant) before it is age t. Clearly, the sum from i=1 to R(t) of (i+1)^k is at least t. Since each of the R(t) terms in this sum is at most (R(t)+1)^k, it follows that (R(t)+1)^(k+1) is at least t, so R(t) is at least t^(1/(k+1)) - 1 and, since k is constant, R(t) is omega(log t). The result for the variant of backoff now follows from Kelly and MacPhee's result. A similar result holds for the geometric version of backoff protocols described in these notes.
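The counting argument in this note can also be checked by simulation. The sketch below is my own illustration: it assumes w is chosen uniformly, follows a single message's clock while ignoring the channel itself, and verifies the deterministic bound R(t) >= t^(1/(k+1)) - 1:

```python
import random

def attempts_by_age(t, k, seed=0):
    """Simulate the polynomial-backoff variant: after its i'th
    unsuccessful try, the message waits w steps, with w drawn
    (assumed uniformly) from 1..(i+1)**k.  Returns R(t), the number
    of tries the message makes before it reaches age t."""
    rng = random.Random(seed)
    age, tries = 0, 0
    while age < t:
        tries += 1                                 # a try at the current age
        age += rng.randint(1, (tries + 1) ** k)    # wait before the next try
    return tries

# R(t) always exceeds t^(1/(k+1)) - 1, yet is far smaller than t itself.
t, k = 10 ** 6, 2
R = attempts_by_age(t, k, seed=3)
assert R >= t ** (1 / (k + 1)) - 1
```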

[7] Aldous notes ``simulations and heuristic arguments ([CFM83], [GGMM88], and [H82]) suggest that, for fairly general [backoff protocols], if lambda is rather smaller than 1/e, then the process quickly reaches a quasistationary distribution which persists for a long, but finite time T_lambda before instability sets in. Formalising this and estimating E[T_lambda] is [a] challenging problem.''

[8] An irreducible aperiodic Markov chain is said to be positive recurrent if the expected time to return to the start state is finite. This implies that the chain has a unique stationary distribution.

[9] More formally, let L_i be the number of messages in the system after step i and let L_avg be the limit as t goes to infinity of (1/t) Sum_(1 <= i <= t) L_i. Their requirement is that the expectation of L_avg be finite (bounded from above by a function of N). This requirement is stronger than just requiring L_avg to be finite with probability 1: for example, consider the case in which L_avg is equal to 2^i with probability 2^(-i) for each positive integer i; then L_avg is finite with probability 1 but its expectation is infinite. Work by Jewell and Stidham (see [Sti74]) shows that, with probability 1, L_avg = lambda W_avg, where W_i is the waiting time of the i'th message and W_avg is the limit as m goes to infinity of (1/m) Sum_(1 <= i <= m) W_i.
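The identity L_avg = lambda W_avg is easy to watch in a toy slotted queue. The model below (Bernoulli arrivals, geometric service, and the function name) is my own illustration, not taken from [Sti74]:

```python
import random

def littles_law_check(lam=0.3, mu=0.6, slots=200000, seed=42):
    """Toy slotted queue illustrating L_avg = lambda * W_avg.
    Each slot: a message arrives with probability lam; then, if the
    queue is nonempty, the head departs with probability mu.
    Returns (L_avg, lambda_hat * W_avg), which should nearly agree."""
    rng = random.Random(seed)
    queue = []      # arrival slots of messages currently in the system
    area = 0        # running sum over slots of the number in the system
    waits = []
    for t in range(slots):
        if rng.random() < lam:
            queue.append(t)
        area += len(queue)                           # occupants during slot t
        if queue and rng.random() < mu:
            waits.append(t + 1 - queue.pop(0))       # W = departure - arrival
    L_avg = area / slots
    W_avg = sum(waits) / len(waits)
    lam_hat = len(waits) / slots                     # observed arrival rate
    return L_avg, lam_hat * W_avg
```

The two returned quantities agree up to simulation noise and end effects from messages still in the system when the run stops.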

[10] Suppose that each user has a sequence of probabilities p_0, p_1, .... Every time the user receives a message when its queue is empty, it acts like an age-based protocol with the above sequence until either it sends the message or another message arrives (after this, the protocol might possibly do something else). Raghavan and Upfal show that if there are n/2 users with empty queues at time t, then the probability that at least two of them send at step t + n/3 is at least a constant q. Thus, if lambda is larger than 1 - q, then either there are many steps with fewer than n/2 users with empty queues or there are many steps with collisions, and an ever-increasing load.

[11] Goodman, Greenberg, Madras and March note that the stated theorem is

[12] A stopping time T with respect to y(t) is a random variable taking non-negative integers as values for which y(0),...,y(t) determine whether or not T = t.