Parallel programming

In order to use a multiprocessor or a multicomputer systems that have a set of processors, the first step of parallel programming is to partition a large process into a set of smaller processes according to some principles. The second step is to schedule or map the set of processes onto a set of processors.

Partitioning

Assume that a process is partitioned into a set of processes p_{1}, p_{2}, ..., p_{n}. These processes must form a determinate process system. That is, the results of executing the set of processes should be unique despite variations in process execution rates and the order of process executions \cite{[CoDe73]}. A process is a set of instructions which performs computations on its input ($I$) and produce its output ($O$). Physically, the input and output are a set of memory cells. For ensuring the uniqueness of results, $p_{i}$ and $p_{j}$, $i \neq j$ must satisfy either of the following conditions:

  1. $p_{i} \prec p_{j}$ or $p_{i} \succ p_{j}$, or
  2. $O_{p_{i}} \cap O_{p_{j}} = I_{p_{i}} \cap O_{p_{j}} = O_{p_{i}} \cap I_{p_{j}} = 0$, for $i \leq j$.
Thus, the necessary conditions of parallelism are determined by the precedence relations and the data dependencies. The degree of satisfaction is the degree of inherent parallelism of the problem.

An algorithm for a maximally parallel partition

\begin{figure} \input{30i1_max.tex} \caption{Discover a maximally parallel partition}\label{m:para} \end{figure}

Based on the necessary conditions and~\ref{caps}> listed above, we can use the following algorithm to discover a maximally parallel partition \cite{[CoDe73]}.

input

A sequential algorithm that represents a solution of a problem and a set of memory cells, M = m_{1}, m_{2}, ..., m_{m}, accessed by this algorithm (Figure~\ref{m:para}--(a)).

algorithm

  1. Partition the sequential process into a set of sub-processes P = {p_{1}, p_{2}, ..., p_{n}} and keep the subscripts in a precedence order (Figure~\ref{m:para}--(b));
  2. Construct a table which indicates that each memory cell m_{j} is in the input $I$ and/or output $O$ of each process p_{i} (Figure~\ref{m:para}--(c));
  3. Form the transitive closure of the relation R = {(p_{i}, p_{j}) | (p_{i} \prec p_{j})~ and~ ((O_{p_{i}} \cap O_{p_{j}}) \cup (I_{p_{i}} \cap O_{p_{j}}) \cup (O_{p_{i}} \cap I_{p_{j}}) \neq 0)\}$ (Figure~\ref{m:para}--(d));
  4. Drawing a maximally parallel graph (Figure~\ref{m:para}--(e)).

Parallel constructs

Sequential programs consist of four constructs: sequence, loop, decision, and subprogram. These four constructs can express any sequential programs. What constructs are available for parallel programs? Based on the essential principles described above, we can design and analyze a set of parallel constructs to be modules of a parallel partition \cite{[Lewi93]}.

Parallel programming paradigms

There are parallel programming paradigms: instruction-parallel or data-parallel.

Instruction-parallel

The instruction-parallel paradigm is the most general, but does not yield a high degree of parallelism. Its partition is based on instructions not data. Therefore, the execution time is only a function of the number processors, $p$, used. The so-called Amdahl's law reflects this kind of parallelism. The law assumes that t(1) = 1, thus, its speedup is

\begin{eqnarray*} speedup & = & \frac{1}{\beta + \frac{(1 - \beta)}{p}} \\ & = & \frac{p}{\beta p + (1 - \beta)} \end{eqnarray*}

where $\beta$ is the fraction of the instructions inherently sequential. If $\beta = 0.5$, the equation shows that as $p$ increases, the speedup goes asymptotically to $2$.

Data-parallel

Data-parallelism is more restricted, but generally yields very high levels of parallelism. It considers the partitions of data so that time is a function of both the number of processors $p$ and the data size of $n$ and $t(1,n) = t(n)$; $t(p, n) = t(\frac{n}{p})$. Intuitively, its speedup is positively proportional to the number of processors $p$. The so-called Gustafson-Barsis law reflects this parallelism. The law assumes that $t(p,n) = 1$, then, its speedup is \begin{eqnarray*} speedup & = & \frac{\beta + (1 - \beta) p}{1} \\ & = & p - (p - 1) \beta \end{eqnarray*}

If $\beta = 0.5$, its speedup can reach $(p+1)/2$. This kind of parallel algorithm is known as a scalable algorithm.

We will start our discussion from data-parallel programming. We know that a loop in a sequential program represents a locality of the program. There must be a set of data available for the statements inside the loop. Data-parallelism is, thus, mainly referring to a loop of a sequential program. When precedence relations present, instructions cannot be independently parallelized. Then, instruction-parallelism should be involved.

ID_net

We need a tool to describe parallelism. Data availability is the necessary requirement for data-parallelism. We used an ID_net to graphically illustrate the data and the associated statements. An ID_net is a modified Petri net. The ID_net is same as a Petri net and other modified nets such as P_net, D_net \cite{[King88]}, program graphs \cite{[Lewi93]}, etc. in the sense that it consists of transitions, places, arcs, and tokens, where a transition is a representation of an instruction; an arc indicates a control path from a {\tt place} to a transition or from a transition to a place. The ID_net differs from these nets in that it uses data as the token in places.

An instruction accepts a set of available input data and produces another set of data as the output for the next instruction. It is just like a token is available for a place so that the corresponding transition can be fired and a new token is produced for the next place. Thus, an ID_net can be used to describe a process which is a thread of control consisting of instructions and data. A parallel program consists of a set of concurrent processes. The concurrency depends on the sharing of data. Thus, an ID_net can naturally indicate the synchronization points of a parallel algorithm. A {\tt transition} represents an instruction and associates with a time function which expresses the computation complexity of the instruction so that an ID_net can be used for the performance analysis. ID_nets will be used in the following sections to describe and analyze the performance of parallel algorithms \cite{[Xuch95]}.

Data parallel fan

If a set processes satisfy the following conditions:
  1. No precedence restrictions; and
  2. $O_{p_{i}} \cap O_{p_{j}} = I_{p_{i}} \cap O_{p_{j}} = O_{p_{i}} \cap I_{p_{j}} = 0$, for all i and j.
then according to the parallelization principles described in Chapter 1, we can ensure that all these processes can be executed in parallel. A host processor can broadcast these processes to a set of processors, each processor performs one processes, and then sends the computation results back to the host processor. This kind of parallel construct is a data parallel fan.

We can recognize the compound statement in a static loop (a for loop which statically indicates the number of iterations) as a process. Thus, the iterations from $1$ to $n$ represent a set of processes {p_{1}, p_{2}, \ldots, p_{n}}. If the set of processes satisfy the conditions above, then we can apply the data parallel fan to them.

Some examples of this category are vector addition, matrix multiplication, Gaussian elimination, etc.

Reduction tree

When the set of processes associated with a loop satisfies the following conditions:

  1. No precedence restrictions; and
  2. $O_{p_{i}} \cap O_{p_{j}} = I_{p_{i}} \cap O_{p_{j}} = O_{p_{i}} \cap I_{p_{j}} = {m} $, for all of i and j.
where, the $m$ is a global variable shared by every process. Their objective is to reduce a given set of data into one value, for example, summation of a vector, search for the maximum or minimum value among a set of data, inner product, etc. They can be parallelized by using a reduction tree. Every level of the tree will need a synchronization and the number of data will be reduced by half.

Independent-loop Par

For those sequential programs which contain

  1. No loops, or
  2. Dynamic loops ({\tt while} or {\tt repeat-until} loops which need to check a terminate condition for every iteration), or
  3. Static loops ({\tt for} loops) with loop-carried data dependency $x[i+1] = f(x[i])$

the way to parallelize turns to instruction-parallelism That is, the statements are grouped into a set of processes, which may contain different statements, according to the parallel conditions:

  1. Processes have no precedence restriction; and
  2. $O_{p_{i}} \cap O_{p_{j}} = I_{p_{i}} \cap O_{p_{j}} = O_{p_{i}} \cap I_{p_{j}} = 0$, for all of i and j.

Then the set of processes can be assigned onto a set of processors and all of the processes are synchronized for every iteration. This is the construct Par. It refers to mimd parallelization.

Let s_{i} denote ith statement. We will name

  1. $O_{s_{i}} \cap O_{s_{j}} \neq 0$, for $i < j$ as the output-dependency.
  2. $I_{s_{i}} \cap O_{s_{j}} \neq 0$, for $i < j$ as the anti-dependency.
  3. $O_{s_{i}} \cap O_{s_{j}} \neq 0$, for $i < j$ as the flow-dependency.

We can apply different ways to eliminate these data dependencies. Some examples are Gaussian elimination, fast Fourier transformation, etc.

Dependent-loop pipe

The worst case is that all of statements inside a loop have

  1. Precedence relation; or
  2. $O_{s_{i}} \cap O_{s_{j}} \neq 0$ and/or $I_{s_{i}} \cap O_{s_{j}} \neq 0$ and/or $O_{s_{i}} \cap I_{s_{j}} \neq 0$, for all of i and j.

In other words, we can neither parallelize all of the $n$ iterations nor the statements inside an iteration. We can only use another parallel construct known as pipe.

Some examples are Vandermonde linear system, etc.

Scheduling

If a process has been partitioned into a set of processes P = {p_{1}, p_{2}, ..., p_{n}}, the scheduling refers to which process, $p_{i}$, is executed where---the processor that is assigned and when---the starting time. Evidently, it depends on

The purpose of parallel programming is to speed up the execution of a process. The partitioning provides the possibility and the scheduling realizes the speedup.

Speedup

The definition of speedup is

\begin{eqnarray*} speedup = \frac{t(1, n)}{t(p, n)} \end{eqnarray*}

where $t(1,n)$ is the time for the execution of a process with data size of $n$ in $1$ processor (a sequential program); $t(p,n)$ is the time for the execution of a process with data size of $n$ in $p$ processors (a parallel program).

The speedup of a parallel algorithm is affected by many factors (Figure~\ref{s:aff}).

\begin{figure} \input{30i2_aff.tex} \caption{Factors that affect a speedup} \label{s:aff} \end{figure}

Parallel programming paradigms

First of all, the speedup of a parallel algorithm depends on the parallel programming paradigms: instruction-parallel or data-parallel as discussed in Section~\ref{para:digm}.

The instruction-parallel paradigm is the most general, but does not yield a high degree of parallelism. Its partition is based on instructions not data. Therefore, the execution time is only a function of the number processors, $p$, used.

Data-parallelism is more restricted, but generally yields very high levels of parallelism. It considers the partitions of data so that time is a function of both the number of processors $p$ and the data size of $n$ and $t(1,n) = t(n)$; $t(p, n) = t(\frac{n}{p})$. Intuitively, its speedup is positively proportional to the number of processors $p$.

Efficiency

The speedup is related to the consideration of efficiency. The definition of efficiency is

\begin{eqnarray*} efficiency = \frac{speedup}{p} = \frac{t(1,n)}{t(p,n)~p} \end{eqnarray*}

It measures how busy are all of the processors. We may select the number of processors $p$ closer to $\frac{t(1,n)}{t(p,n)}$ in order to keep a high efficiency. But, it may sacrifice some degree of speedup due to the fact that using more processors may yield a higher speedup as shown in the previous section.

Shared memory versus message passing

\begin{figure} \vspace{3in} \caption{Shared memory and message passing environments} \label{sha:mes} \end{figure}

The execution time of a parallel algorithm is not only a function of the number of processors, but also a function of communication delay between processors. Generally speaking, a tightly coupled system provides shared memory environment with a shorter communication delay compared with a loosely coupled system which supports message passing communication. They are shown in Figure~\ref{sha:mes} \cite{[BeTs89]}.

For a scalable algorithm, increasing the number of processors will increase the speedup, but at the same time it may increase the amount of communication delay which in turn decreases the speedup. Thus, the real speedup depends on the combination of the effects of the communication delay and the number of processors used.

Load balancing

Parallel programming distributes a set of processes onto a set of processors. Each of them may have different execution time, this is referred to as the computation load. The speedup of a parallel algorithm depends on the critical path, the longest execution time experienced among the set of processors. Obviously, the highest speedup corresponds to the situation when the loads are evenly distributed among all of processors such that all processors have the same execution time.

Memory contention

Finally, the speedup of a parallel algorithm is affected by the data arrangement in the physical memory since processes need synchronization when either precedence relations or data dependencies (sharing) encountered. The synchronization requires faster processes waiting for slower processes or the serial execution of processes. A better data arrangement may reduce the number of synchronizations or increase the degree of parallelism.

All of these factors will be discussed in examples shown in Chapter~\ref{examp}. The algorithm that considering all of these factors and mapping a set of processes onto a set of processors is known as scheduling algorithm. A detailed discussion is stated in \cite{[ElLA94]}.

Ability of application users

As an application users, we only have a limited ability to control the scheduling. We may select the architecture of parallel machines to be used, but actually, we have no choice since we only have what we have. We may select the programming paradigm, but actually, it mainly depends on the behavior of the application program. We may select the number of processors, but actually, it is a parameter for the performance measurement. Therefore, the data arrangement and load balancing are two major topics that can be considered in our algorithm design.

Consequently, instead of designing a scheduling algorithm, we will discuss the relationship between the performance and scheduling when we discuss the parallel programming examples.