Evolution of computer architectures

One of reasons we are interested in parallel computing is the hardware limitations. Many efforts have been put into improving the computation speed of computers. The architecture of computers have evolved from traditional von Neumann's architecture to multiprocessors or multicomputers.

Speed of computers

At the heart of computer is a complex web of circuitry linking millions of switches that combine to manipulate electronically stored information. The organization of these interconnections, called the computer's architecture, is a major factor in determining the machine's operating speed \cite{[Time87]}.

Von Neumann's architecture

Traditional von Neumann architecture that is outlined by John von Neumann in 1945 is characterized by the major elements:

A sequential model of operations

This architecture limits the speed of computation by imposing a sequential model of operation.

\begin{figure} \caption{A serial computer's work pattern} \label{ser:pattern} \end{figure}

A simple operation, such as adding A to B and storing the result at C, involves a dozen separate steps. Each cycle consists of 4 steps as shown in Figure~\ref{ser:pattern}:

  1. Fetch instruction: the control unit reads the information stored at a particular address in memory and brings the program instruction located there into the CPU.
  2. Decode: the control unit translates the instruction; it separates the opcode (LOAD, ADD, or STORE), which identifies the operation to be performed, from the operand (A, B, or C), which names the memory address of the piece of data that is to be operated upon.
  3. Data transfer: carrying out the action required by the LOAD opcode, the control unit directs the memory to transmit the data located at the address identified as A. This value is put in a small high-speed storage unit within the CPU called a register.
  4. Process: in the final phase of the cycle for those instructions such as LOAD, STORE, the processor remains idle because all the actions triggered by the opcode have been performed.
\begin{figure} \caption{A gallery of speed thieves} \label{idle:time} \end{figure}

Idle of components

This sequential operation will cause the CPU idle in many time periods. (see Figure~\ref{idle:time}

Concurrency or parallelism

Concurrency or parallelism can be applied to reduce the CPU idle time since they can make different parts of the computer labor simultaneously on separate segments of the problem.

Concurrency can be implemented on many levels. For example, the CPU can be subdivided and set to work on different stages of successive instructions, or the slowness of one element may be offset by adding identical elements to share the work. Parallel operation, which can extend even to multiple processors, makes a fundamental shift from von Neumann architecture. Its development forms several stages in the history.

Striking a balance among the components

To overcome some of the delays inherent in one-step-at-a-time von Neumann machines, designers have investigated a number of ways to get the computer to perform several operations simultaneously.

Adding a processor for input and output

\begin{figure} \vspace{3in} \caption{Adding a I/O processor} \label{add:io} \end{figure}

Add a separate computer dedicated entirely to communication with the outside world as shown in Figure~\ref{add:io}. The I/O processor relieves the CPU of overseeing the movement of information to and from the slower mechanical devices such as disk drives, printers or video screens, which may be as much as a thousand times slower than the all-electronic central processor.

Rather than interrupt its processing work for long periods in order to supervise input and output, the CPU can send a brief message to the I/O processor, which then begins moving the desired information into or out of memory which the CPU resumes its computations.

More access to memory

\begin{figure} \vspace{3in} \caption{Adding more access channel to memory} \label{add:acce} \end{figure}

Other design changes aim at reducing the speed imbalance between processor and memory that often keeps the processor waiting for data. Partitioning the memory into several smaller units, or banks, allows the banks to function concurrently but out of phase with one another and thus keep data streaming to the processor as shown in Figure~\ref{add:acce}. Multiple pathways between memory and the CPU allow more information to be transferred in a given amount of time. On one beat of the clock, the first bank sends a packet; on the next beat, as that packet is in transit, the next bank sends another packet, and so on. With a four-path memory bus, four times as much information can travel between the CPU and memory in one clock cycle as is possible with only one path.

A fast intermediate memory

\begin{figure} \vspace{3in} \caption{Adding cache memory} \label{add:cache} \end{figure}

A small, high-speed memory between the main memory and the CPU can be used to hold data and instructions that are needed repeatedly in the course of a particular program. Because this addition, called cache memory as shown in Figure~\ref{add:cache}, is significantly smaller than main memory --- 16,000 bytes compared with 16 million bytes --- it takes less time to locate a given piece of data and transfer it to the CPU.

Special assistants to the processor

To make the CPU faster for number-crunching, the simplest way is dividing the CPU in two, completely separating the control unit from the processor, and then adding specialized circuits to the processor to help with the mathematical work. These helper circuits are designed to do only a few tasks, but to do them very quickly.

\begin{figure} \vspace{3in} \caption{Breaking the job down} \label{brea:job} \end{figure}

In one alternative architecture, the processor is partitioned into elements called functional units, each capable of a particular kind of calculation: addition, multiplication, etc. as shown in Figure~\ref{brea:job}. The control unit directs operations in memory with signals carried by one channel and receives instructions through another channel. As each instructions is decoded, it is routed to the proper functional unit for execution; data for the operation comes directly from memory to the processor through the data channel. These functional units can work concurrently. A drawback to this division-of-labor scheme is that some units may be able to perform a single operation faster than others: one addition takes less time than one multiplication. To avoid errors caused by intermediate results being used out of sequence, the program must be designed to coordinate the actions of the functional units.

\begin{figure} \vspace{3in} \caption{Splitting off the hard parts} \label{spli:part} \end{figure}

Another way to specialize the processing is to link a mathematical coprocessor which handles a narrow range of tasks, for example, floating-point arithmetic, to the main processor as shown in Figure~\ref{spli:part}. A coprocessor is subordinate to the main processor, which simply handles off data and instructions. The coprocessor receives data and instructions from the main processing unit and passes its results back to the main processor, where they may be used in further computations or sent to memory.

Making use of idle time

\begin{figure} \vspace{3in} \caption{An assembly-line control unit} \label{asse:unit} \end{figure}

Each phase of one fetch-execution cycle is handled by a different element in the computer, this means that at any given point in the cycle, parts of the computer are idle. A design concept called pipelining speeds up the computer by putting these idle elements to work and allowing the CPU to operate on more than one instruction at a time as shown in Figure~\ref{perf:pipe}.

\begin{figure} \vspace{3in} \caption{The efficiency of pipelining} \label{perf:pipe} \end{figure}

The chart compares the performance of an ordinary central processing unit with that of one that has been pipelined. The ordinary CPU must complete an entire cycle of four clock ticks before it can begin the next instruction; three instructions thus require 12 clock ticks to complete. In a pipelined CPU which starts on a new instruction with each tick, the same task is completed in half the time.

The principles of pipelining can also be applied to the processor that actually executes each instruction. For example, in a processor with separate functional units, each unit may be pipelined; some complex operations the processor can be pipelined for different suboperations.

The vector approach

\begin{figure} \vspace{3in} \caption{The vector approach} \label{vec:comp} \end{figure}

One of the most successful architectural approaches to speed up a von Neumann computer is the addition of a processor, called vector processor, that is designed to manipulate specially created lists of numbers called vectors. The processor treats each vector as a single entity; it executes an instruction by performing the same operation on every number in the vector simultaneously. Thus, adding two vectors together takes only as long as it takes to add two numbers. The vector processor receives instructions directly from the control unit. It works only on vectors. A vector computer retains an ordinary {\em scalar processor} to perform calculations on numbers that are not part of a vector; each processor transfers information to and from memory along its own pathway.

An array of workers under one boss

\begin{figure} \vspace{6in} \caption{A processor array} \label{proc:ary} \end{figure}

To gain speed for large-scale number crunching, one alternative is a processor array as shown in Figure~\ref{proc:ary}, which places several processor under the direction of a single control unit. Each processor has its own small memory for storing initial data and intermediate results, and the processors can share information with one another through a connecting system called a {\em routing network}. At the start of a program, each processor is given data for a small part of the job. The processors work in lockstep, performing identical operations on their separate data. At the completion of the job, the partial results are combined into the final product.

When one processor needs to communicate with another, the routing network provides paths. The speed of the routing network affects the computer's overall performance. The fastest network, in theory, is one that allows every processor to communicate directly with every other one. But a machine with hundreds of processors would require costly circuitry and complicated programming.

Two kinds of multicomputing

\begin{figure} \vspace{6in} \caption{A multiprocessor} \label{mult:proc} \end{figure}

A processor array speeds program execution by applying more than one processor to the task, it is relatively inflexible because it has only one control unit: at any given moment, all of the processors must execute the same instruction.

This limitation can be overcome by supplying each processor with its own control unit, thus enabling the computer to perform different operations simultaneously. Such a machine can run dissimilar parts of one program concurrently; it can even run more than one program at once, devoting a separate processing element (control unit plus processor) to each.

This kind of computer is known as {\em a multiprocessor} as shown in Figure~\ref{mult:proc}. Each of the linked elements in a multiprocessor is an entire CPU, with a processor and a control unit. All share a common memory, which is partitioned into separate modules so that different processing elements can have access to memory simultaneously. A routing network routes input and output, and provides each processing element with direct access to memory and to the other units. The multiprocessor relies on a single shared memory, it needs features that prevent more than one processor at a time from working with data in a particular part of memory. To avoid delays arising from this kind of conflict, the programmer must divide the program appropriately.

\begin{figure} \vspace{6in} \caption{A multicomputer} \label{mult:comp} \end{figure}

Another architecture is known as a multicomputer. Each processing element has its own memory and is effectively a self-sufficient computer. Each computer can work on its part of a problem independently, pausing to exchange data with others as necessary. A routing network connects the computing units with one another, with memory and with the input/output channels. In a multicomputer, the problem of contention for data in memory is largely overcome, since the data that each processor needs is stored in its own individual memory. Delays do occur, however, when one processor requires information that is in midcomputation elsewhere. Programmers try to avoid such delays by minimizing the requirement for interprocessor communications. A routing network connects the computing units with one another, with memory and with the input/output channels.