In order to use a multiprocessor or a multicomputer systems that have a set of processors, the first step of parallel programming is to partition a large process into a set of smaller processes according to some principles. The second step is to schedule or map the set of processes onto a set of processors.
Assume that a process is partitioned into a set of processes p_{1}, p_{2}, ..., p_{n}. These processes must form a determinate process system. That is, the results of executing the set of processes should be unique despite variations in process execution rates and the order of process executions \cite{[CoDe73]}. A process is a set of instructions which performs computations on its input ($I$) and produce its output ($O$). Physically, the input and output are a set of memory cells. For ensuring the uniqueness of results, $p_{i}$ and $p_{j}$, $i \neq j$ must satisfy either of the following conditions:
\begin{figure} \input{30i1_max.tex} \caption{Discover a maximally parallel partition}\label{m:para} \end{figure}
Based on the necessary conditions and~\ref{caps}> listed above, we can use the following algorithm to discover a maximally parallel partition \cite{[CoDe73]}.
Sequential programs consist of four constructs: sequence, loop, decision, and subprogram. These four constructs can express any sequential programs. What constructs are available for parallel programs? Based on the essential principles described above, we can design and analyze a set of parallel constructs to be modules of a parallel partition \cite{[Lewi93]}.
There are parallel programming paradigms: instruction-parallel or data-parallel.
The instruction-parallel paradigm is the most general, but does not yield a high degree of parallelism. Its partition is based on instructions not data. Therefore, the execution time is only a function of the number processors, $p$, used. The so-called Amdahl's law reflects this kind of parallelism. The law assumes that t(1) = 1, thus, its speedup is
\begin{eqnarray*} speedup & = & \frac{1}{\beta + \frac{(1 - \beta)}{p}} \\ & = & \frac{p}{\beta p + (1 - \beta)} \end{eqnarray*}
where $\beta$ is the fraction of the instructions inherently sequential. If $\beta = 0.5$, the equation shows that as $p$ increases, the speedup goes asymptotically to $2$.
Data-parallelism is more restricted, but generally yields very high levels of parallelism. It considers the partitions of data so that time is a function of both the number of processors $p$ and the data size of $n$ and $t(1,n) = t(n)$; $t(p, n) = t(\frac{n}{p})$. Intuitively, its speedup is positively proportional to the number of processors $p$. The so-called Gustafson-Barsis law reflects this parallelism. The law assumes that $t(p,n) = 1$, then, its speedup is \begin{eqnarray*} speedup & = & \frac{\beta + (1 - \beta) p}{1} \\ & = & p - (p - 1) \beta \end{eqnarray*}
If $\beta = 0.5$, its speedup can reach $(p+1)/2$. This kind of parallel algorithm is known as a scalable algorithm.
We will start our discussion from data-parallel programming. We know that a loop in a sequential program represents a locality of the program. There must be a set of data available for the statements inside the loop. Data-parallelism is, thus, mainly referring to a loop of a sequential program. When precedence relations present, instructions cannot be independently parallelized. Then, instruction-parallelism should be involved.
We need a tool to describe parallelism. Data availability is the necessary requirement for data-parallelism. We used an ID_net to graphically illustrate the data and the associated statements. An ID_net is a modified Petri net. The ID_net is same as a Petri net and other modified nets such as P_net, D_net \cite{[King88]}, program graphs \cite{[Lewi93]}, etc. in the sense that it consists of transitions, places, arcs, and tokens, where a transition is a representation of an instruction; an arc indicates a control path from a {\tt place} to a transition or from a transition to a place. The ID_net differs from these nets in that it uses data as the token in places.
An instruction accepts a set of available input data and produces another set of data as the output for the next instruction. It is just like a token is available for a place so that the corresponding transition can be fired and a new token is produced for the next place. Thus, an ID_net can be used to describe a process which is a thread of control consisting of instructions and data. A parallel program consists of a set of concurrent processes. The concurrency depends on the sharing of data. Thus, an ID_net can naturally indicate the synchronization points of a parallel algorithm. A {\tt transition} represents an instruction and associates with a time function which expresses the computation complexity of the instruction so that an ID_net can be used for the performance analysis. ID_nets will be used in the following sections to describe and analyze the performance of parallel algorithms \cite{[Xuch95]}.
We can recognize the compound statement in a static loop (a for loop which statically indicates the number of iterations) as a process. Thus, the iterations from $1$ to $n$ represent a set of processes {p_{1}, p_{2}, \ldots, p_{n}}. If the set of processes satisfy the conditions above, then we can apply the data parallel fan to them.
Some examples of this category are vector addition, matrix multiplication, Gaussian elimination, etc.
When the set of processes associated with a loop satisfies the following conditions:
For those sequential programs which contain
the way to parallelize turns to instruction-parallelism That is, the statements are grouped into a set of processes, which may contain different statements, according to the parallel conditions:
Then the set of processes can be assigned onto a set of processors and all of the processes are synchronized for every iteration. This is the construct Par. It refers to mimd parallelization.
Let s_{i} denote ith statement. We will name
We can apply different ways to eliminate these data dependencies. Some examples are Gaussian elimination, fast Fourier transformation, etc.
The worst case is that all of statements inside a loop have
In other words, we can neither parallelize all of the $n$ iterations nor the statements inside an iteration. We can only use another parallel construct known as pipe.
Some examples are Vandermonde linear system, etc.
If a process has been partitioned into a set of processes P = {p_{1}, p_{2}, ..., p_{n}}, the scheduling refers to which process, $p_{i}$, is executed where---the processor that is assigned and when---the starting time. Evidently, it depends on
The purpose of parallel programming is to speed up the execution of a process. The partitioning provides the possibility and the scheduling realizes the speedup.
The definition of speedup is
\begin{eqnarray*} speedup = \frac{t(1, n)}{t(p, n)} \end{eqnarray*}
where $t(1,n)$ is the time for the execution of a process with data size of $n$ in $1$ processor (a sequential program); $t(p,n)$ is the time for the execution of a process with data size of $n$ in $p$ processors (a parallel program).
The speedup of a parallel algorithm is affected by many factors (Figure~\ref{s:aff}).
\begin{figure} \input{30i2_aff.tex} \caption{Factors that affect a speedup} \label{s:aff} \end{figure}
First of all, the speedup of a parallel algorithm depends on the parallel programming paradigms: instruction-parallel or data-parallel as discussed in Section~\ref{para:digm}.
The instruction-parallel paradigm is the most general, but does not yield a high degree of parallelism. Its partition is based on instructions not data. Therefore, the execution time is only a function of the number processors, $p$, used.
Data-parallelism is more restricted, but generally yields very high levels of parallelism. It considers the partitions of data so that time is a function of both the number of processors $p$ and the data size of $n$ and $t(1,n) = t(n)$; $t(p, n) = t(\frac{n}{p})$. Intuitively, its speedup is positively proportional to the number of processors $p$.
The speedup is related to the consideration of efficiency. The definition of efficiency is
\begin{eqnarray*} efficiency = \frac{speedup}{p} = \frac{t(1,n)}{t(p,n)~p} \end{eqnarray*}
It measures how busy are all of the processors. We may select the number of processors $p$ closer to $\frac{t(1,n)}{t(p,n)}$ in order to keep a high efficiency. But, it may sacrifice some degree of speedup due to the fact that using more processors may yield a higher speedup as shown in the previous section.
\begin{figure} \vspace{3in} \caption{Shared memory and message passing environments} \label{sha:mes} \end{figure}
The execution time of a parallel algorithm is not only a function of the number of processors, but also a function of communication delay between processors. Generally speaking, a tightly coupled system provides shared memory environment with a shorter communication delay compared with a loosely coupled system which supports message passing communication. They are shown in Figure~\ref{sha:mes} \cite{[BeTs89]}.
For a scalable algorithm, increasing the number of processors will increase the speedup, but at the same time it may increase the amount of communication delay which in turn decreases the speedup. Thus, the real speedup depends on the combination of the effects of the communication delay and the number of processors used.
Parallel programming distributes a set of processes onto a set of processors. Each of them may have different execution time, this is referred to as the computation load. The speedup of a parallel algorithm depends on the critical path, the longest execution time experienced among the set of processors. Obviously, the highest speedup corresponds to the situation when the loads are evenly distributed among all of processors such that all processors have the same execution time.
Finally, the speedup of a parallel algorithm is affected by the data arrangement in the physical memory since processes need synchronization when either precedence relations or data dependencies (sharing) encountered. The synchronization requires faster processes waiting for slower processes or the serial execution of processes. A better data arrangement may reduce the number of synchronizations or increase the degree of parallelism.
All of these factors will be discussed in examples shown in Chapter~\ref{examp}. The algorithm that considering all of these factors and mapping a set of processes onto a set of processors is known as scheduling algorithm. A detailed discussion is stated in \cite{[ElLA94]}.
As an application users, we only have a limited ability to control the scheduling. We may select the architecture of parallel machines to be used, but actually, we have no choice since we only have what we have. We may select the programming paradigm, but actually, it mainly depends on the behavior of the application program. We may select the number of processors, but actually, it is a parameter for the performance measurement. Therefore, the data arrangement and load balancing are two major topics that can be considered in our algorithm design.
Consequently, instead of designing a scheduling algorithm, we will discuss the relationship between the performance and scheduling when we discuss the parallel programming examples.