Process Sleep and Wakeup on a Shared-memory Multiprocessor
The problem of enabling a ‘sleeping’ process on a shared-memory multiprocessor is a difficult one, especially if the process is to be awakened by an interrupt-time event. We present here the code for sleep and wakeup primitives that we use in our multiprocessor system. The code has been exercised by years of active use and by a verification system.
Our problem is to synchronise processes on a symmetric shared-memory multiprocessor. Processes suspend execution, or sleep, while awaiting an enabling event such as an I/O interrupt. When the event occurs, the process is issued a wakeup to resume its execution. During these events, other processes may be running and other interrupts occurring on other processors.
More specifically, we wish to implement subroutines called sleep, callable by a process to relinquish control of its current processor, and wakeup, callable by another process or an interrupt to resume the execution of a suspended process. The calling conventions of these subroutines will remain unspecified for the moment.
We assume the processors have an atomic test-and-set or equivalent operation but no other synchronisation method. Also, we assume interrupts can occur on any processor at any time, except on a processor that has locally inhibited them.
The problem is the generalisation to a multiprocessor of a familiar and well-understood uniprocessor problem. It may be reduced to a uniprocessor problem by using a global test-and-set to serialise the sleeps and wakeups, which is equivalent to synchronising through a monitor. For performance and cleanliness, however, we prefer to allow the interrupt handling and process control to be multiprocessed.
Our attempts to solve the sleep/wakeup problem in Plan 9 [Pik90] prompted this paper. We implemented solutions several times over several months and each time convinced ourselves — wrongly — they were correct. Multiprocessor algorithms can be difficult to prove correct by inspection and formal reasoning about them is impractical. We finally developed an algorithm we trust by verifying our code using an empirical testing tool. We present that code here, along with some comments about the process by which it was designed.
Since processes in Plan 9 and the UNIX system have similar structure and properties, one might ask if UNIX sleep and wakeup [Bac86] could not easily be adapted from their standard uniprocessor implementation to our multiprocessor needs. The short answer is, no.
The UNIX routines take as argument a single global address that serves as a unique identifier to connect the wakeup with the appropriate process or processes. This has several inherent disadvantages. From the point of view of sleep and wakeup, it is difficult to associate a data structure with an arbitrary address; the routines are unable to maintain a state variable recording the status of the event and processes. (The reverse is of course easy — we could require the address to point to a special data structure — but we are investigating UNIX sleep and wakeup, not the code that calls them.) Also, multiple processes sleep ‘on’ a given address, so wakeup must enable them all, and let process scheduling determine which process actually benefits from the event. This is inefficient; a queueing mechanism would be preferable but, again, it is difficult to associate a queue with a general address. Moreover, the lack of state means that sleep and wakeup cannot know what the corresponding process (or interrupt) is doing; sleep and wakeup must be executed atomically. On a uniprocessor it suffices to disable interrupts during their execution. On a multiprocessor, however, most processors can inhibit interrupts only on the current processor, so while a process is executing sleep the desired interrupt can come and go on another processor. If the wakeup is to be issued by another process, the problem is even harder. Some inter-process mutual exclusion mechanism must be used, which, yet again, is difficult to do without a way to communicate state.
In summary, to be useful on a multiprocessor, UNIX sleep and wakeup must either be made to run atomically on a single processor (such as by using a monitor) or they need a richer model for their communication.
Consider the case of an interrupt waking up a sleeping process. (The other case, a process awakening a second process, is easier because atomicity can be achieved using an interlock.) The sleeping process is waiting for some event to occur, which may be modeled by a condition coming true. The condition could be just that the event has happened, or something more subtle such as a queue draining below some low-water mark. We represent the condition by a function of one argument of type void*; the code supporting the device generating the interrupts provides such a function to be used by sleep and wakeup to synchronise. The function returns false if the event has not occurred, and true some time after the event has occurred. The sleep and wakeup routines must, of course, work correctly if the event occurs while the process is executing sleep.
We assume that a particular call to sleep corresponds to a particular call to wakeup, that is, at most one process is asleep waiting for a particular event. This can be guaranteed in the code that calls sleep and wakeup by appropriate interlocks. We also assume for the moment that there will be only one interrupt and that it may occur at any time, even before sleep has been called.
For performance, we desire that multiple instances of sleep and wakeup may be running simultaneously on our multiprocessor. For example, a process calling sleep to await a character from an input channel need not wait for another process to finish executing sleep to await a disk block. At a finer level, we would like a process reading from one input channel to be able to execute sleep in parallel with a process reading from another input channel. A standard approach to synchronisation is to interlock the channel ‘driver’ so that only one process may be executing in the channel code at once. This method is clearly inadequate for our purposes; we need fine-grained synchronisation, and in particular to apply interlocks at the level of individual channels rather than at the level of the channel driver.
Our approach is to use an object called a rendezvous, which is a data structure through which sleep and wakeup synchronise. (The similarly named construct in Ada is a control structure; ours is an unrelated data structure.) A rendezvous is allocated for each active source of events: one for each I/O channel, one for each end of a pipe, and so on. The rendezvous serves as an interlockable structure in which to record the state of the sleeping process, so that sleep and wakeup can communicate if the event happens before or while sleep is executing.
Our design for sleep is therefore a function
void sleep(Rendezvous *r, int (*condition)(void*), void *arg)
called by the sleeping process. The argument r connects the call to sleep with the call to wakeup, and is part of the data structure for the (say) device. The function condition is described above; called with argument arg, it is used by sleep to decide whether the event has occurred. Wakeup has a simpler specification:
void wakeup(Rendezvous *r).
Wakeup must be called after the condition has become true.
The Rendezvous data type is defined as
Our Locks are test-and-set spin locks. The routine lock(Lock\f1*l) returns when the current process holds that lock; unlock(Lock\f5*l) releases the lock.
Here is our implementation of sleep. Its details are discussed below. Thisp is a pointer to the current process on the current processor. (Its value differs on each processor.)
sleep(Rendezvous *r, int (*condition)(void*), void *arg)
s = inhibit(); /* interrupts */