Next: Implementation Issues Up: A Survey of Process Previous: Precursors to Process Migration

Motivations for Process Migration

Deciding which node a process will run on at the time it is created is clearly much easier than disembodying an actively running process from one node and resuming its execution on another. As we will see in later sections, there are many subtle, nontrivial implementation issues to be considered. This leads us naturally to the question: is the benefit of process migration worth the cost? As with most engineering questions, there is no single, clear-cut answer. Different environments have different needs; in some, process migration can be a big benefit, and in others it has little positive impact.

For example, consider a loosely coupled parallel supercomputer such as the IBM SP/2 [10]. Such a machine is often used for so-called ``Grand Challenge'' problems such as fluid dynamics and molecular kinetics. These types of problems are ``embarrassingly parallel,'' meaning they can scale evenly to virtually any number of processing elements. In addition, the processing elements are not shared among multiple users during computation; a node is assigned exclusively to a particular user for the duration of the job. In this scenario, the optimal distribution of processes to processors can be determined before the processes begin running. The load will always be evenly balanced, and there will never be short-term contention for a node's resources (i.e., contention takes place at coarse granularity, when nodes are assigned to users for hours at a time). Process migration will be of virtually no use in this case.

On the other hand, consider an environment where there are a large number of processes concurrently running that belong to different users. It is impossible to predict how many users will be contending for resources or how long any of their jobs will be. This is a typical usage pattern of a central university computing facility. In this case, there are two typical solutions:

1. Buy a very large, very expensive multi-processor SMP machine such as an SGI Challenge or Digital AlphaServer.
2. Buy numerous smaller, lower-powered workstations--configured with transparency, such as in MIT's Athena or CMU's Andrew system--and coarsely balance the load across them by distributing the users. (This can be done by asking users to ``log in to a random machine'', or by using software that distributes user logins, such as a load-balancing DNS [11] server.¹)

The SMP solution has effective load-balancing across processors due to the tightly coupled nature of the system. However, such a solution can be quite expensive and also suffers from a lack of redundancy--a failure of the single, central machine denies service to all users. It is also not very scalable; most SMP machines cannot support more than 16 processors. The second solution, while more scalable and resistant to failures, will often suffer from severely unbalanced loads. This is because a ``user'' is not a constant unit of load; different users generate different loads, and a particular user's generated load can change from minute to minute.

In this case, an operating system that supports dynamic process migration can be very beneficial. With a pool of processing nodes (e.g., workstations or PCs) dedicated to servicing the user load, an efficient process migration scheme can balance the user load almost as effectively as an SMP machine. In addition, the pool of processors is far more redundant than a single SMP machine. Plus, it is much more scalable: if the control of the pool is appropriately designed (i.e., distributed), as many nodes as desired can be added, incrementally, as the expected nominal load on the system increases. These are the goals that motivated the development of MOSIX [12].
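The difference between distributing users and balancing actual load can be illustrated with a small sketch. It is a toy model, not a description of how MOSIX or any other system works: the function names, the load figures, and the greedy "largest first onto the least-loaded node" rule are all assumptions chosen for illustration, and each user is assumed to generate a single process.

```python
def static_assignment(user_loads, n_nodes):
    """Spread users over nodes round-robin, ignoring how much load each one generates.

    This mimics ``log in to a random machine'': the unit being balanced is the
    user, not the load.
    """
    nodes = [0.0] * n_nodes
    for i, load in enumerate(user_loads):
        nodes[i % n_nodes] += load
    return nodes


def migrate_to_balance(process_loads, n_nodes):
    """Greedy stand-in for migration: place each process, largest first, on the
    currently least-loaded node. (Hypothetical policy, chosen for simplicity.)"""
    nodes = [0.0] * n_nodes
    for load in sorted(process_loads, reverse=True):
        least_loaded = nodes.index(min(nodes))
        nodes[least_loaded] += load
    return nodes


def imbalance(nodes):
    """Gap between the busiest and the idlest node."""
    return max(nodes) - min(nodes)


# Made-up per-user loads: a few heavy users, several light ones.
user_loads = [9.0, 1.0, 8.0, 1.0, 7.0, 2.0, 1.0, 1.0]

static = static_assignment(user_loads, 4)
balanced = migrate_to_balance(user_loads, 4)

print("static:  ", static, "imbalance", imbalance(static))
print("balanced:", balanced, "imbalance", imbalance(balanced))
```

Both placements carry the same total load; only its distribution differs. A real system would also have to weigh the cost of each migration against the benefit, which this sketch ignores.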

Other environments fall somewhere in between these two extremes. For example, a popular trend these days is to harness the power of idle personal workstations. In this case, there is no investment made in centralized computing resources, as in the previous two examples. Rather, the model is that a typical user--say, Alice--has a workstation on her desk. Alice is willing to let her workstation be put to good use when it's idle, as long as she can have complete control of the machine when she is sitting in front of it.

This is an example where static scheduling goes a long way, but does not solve the problem completely. Static-scheduling systems such as Linda [9] can automatically monitor the load on all workstations, and schedule processes to idle processors when a parallel job begins. The trouble comes if Alice returns to her workstation while Bob is in the middle of running a large simulation on it. If we are using static scheduling, our options are limited:

1. Allow Bob's job to run to completion. This makes Alice unhappy because her workstation will be slow until Bob's job is done.
2. Immediately evict Bob's job from Alice's workstation by terminating Bob's process. This makes Bob unhappy because he loses work.

It was the desire to find a solution that satisfies both Alice and Bob that motivated the development of process migration in systems such as Sprite [13] and Condor [14]. The designers of these systems were less interested in dynamic load balancing across a pool of dedicated processors than in finding a graceful way to evict processes from foreign machines when the machines' owners returned.
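The trade-off among the two static options and eviction by migration can be written down as a rough cost model. This is only a sketch under stated assumptions: the policy names, the `Outcome` fields, and the single `migration_cost` parameter are hypothetical and do not correspond to the interfaces of Sprite or Condor.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    alice_wait: float  # seconds Alice's workstation stays loaded after she returns
    bob_lost: float    # seconds of Bob's completed work that is thrown away


def on_owner_return(policy, done, remaining, migration_cost=0.0):
    """Cost to each party when the owner returns mid-job.

    done:           seconds of work Bob's job has already completed
    remaining:      seconds of work Bob's job still needs
    migration_cost: seconds needed to move the job elsewhere (assumed constant)
    """
    if policy == "run_to_completion":
        return Outcome(alice_wait=remaining, bob_lost=0.0)
    if policy == "kill":
        return Outcome(alice_wait=0.0, bob_lost=done)
    if policy == "migrate":
        return Outcome(alice_wait=migration_cost, bob_lost=0.0)
    raise ValueError(f"unknown policy: {policy}")


# Made-up figures: Bob's job is one hour in, with two hours to go.
for policy in ("run_to_completion", "kill", "migrate"):
    print(policy, on_owner_return(policy, done=3600.0, remaining=7200.0,
                                  migration_cost=5.0))
```

With these made-up figures, migration costs Alice a few seconds and costs Bob nothing, whereas each static option imposes an hour or more of cost on one of them. How small the migration cost actually is depends on the implementation.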

Jeremy Elson 2000-04-05