Parallelism

With the era of increasing processor speeds slowly coming to and end, computer architects are exploring news ways of increasing throughput. One of the most promising is to look for and exploit different types of parallelism in code.

Types of Parallelism

Instruction Level Parallelism

Instruction level parallelism (ILP) takes advantage of sequences of instructions that require different functional units (such as the load unit, ALU, FP multiplier, etc). Different architectures approach this in different ways, but the idea is to have these non-dependent instructions executing simultaneously to keep the functional units busy as often as possible.

Data Level Parallelsim

Data level parallelism (DLP) is more of a special case than instruction level parallelism. DLP to the act of performing the same operation on multiple datum simultaneously. A classic example of DLP is performing an operation on an image in which processing each pixel is independent from the ones around it (such as brightening). This type of image processing lends itself well to having multiple pixels modified simultaneously using the same modification function. Other types of operations that allow the exploitation of DLP are matrix, array, and vector processing.

Thread Level Parallelism

Thread level parallelism (TLP) is the act of running multiple flows of executionof a single process simultaneously (see Threads). TLP is most often found in applications that need to run independent, unrelated tasks (such as computing, memory accesses, and IO) simultaneously. These types of applications are often found on machines that have a high workload, such as web servers. TLP is a popular ground for current research due to the rising popularity of multi-core and multi-processor systems, which allow for different threads to truly execute in parallel.

Existing Architectural Techniques

One of the main goals of computer architects is to try and produce architectures in which the functional units have maximum utilization. Until fairly recently, processor speeds have been increasing so fast that this task was typically focused on taking advantage of instruction level parallelism on a uniprocessor system. Two approaches were introduced that, between them, are found in almost every processor produced today--superscalar and VLIW architectures. However, with decreasing gains on higher clocking, refining these architectures has become less important, and approaches such as multithreading, multiprocessing, and multicomputing have shifted to be the forefront of research interests.

The Superscalar Approach

Superscalar refers to the process of issuing multiple instructions at the same time. The goal is to allow instructions to be executed out of order and in parallel if functional units are available and no true data dependencies are violated (i.e. exploit instruction-level parallelism). Superscalar relies on sophisticated hardware to dynamically schedule and manage the issuing, execution, and completion of multiple instructions simultaneously. In doing so, it has increased on chip complexity, power dissipation, and heat production. However, unlike VLIW processors, it does not require any extra software (compiler) support, which is likely a key reason that it has been a tremendous success. Most desktop and workstation architectures are superscalar.

Very Long Instruction Words

Another approach to the parallelism problem is to exploit instruction level parallelism by having the compiler create bundles of instructions that take advantage of the chip's known functional units. For instance, if the processor is capable of executing 2 ALU operations, 1 load/store operation, and one multiply operation simultaneously, the compiler can do its best to arrange the instructions in such a way that groups consisting of all these elements will be formed. Together, the group will be issued as a very long instruction.

DIAGRAM

This technique is not as popular as superscalar because of the high dependency on compiler support, and the initial lack thereof. VLIW avoids the chip complexity issues that are present in superscalar, but it is hindered by the fact that if there is no compiler capable of efficiently created very long instructions, the architecture is basically useless. The VLIW technique is probably most useful in certain implementations of high-performance computers where the types of programs that will be executed are known in advance and that extensive compiler support is not needed.

Multi-Threading

Multithreading is the process of executing multiple threads concurrently on a processor. It takes the idea of processes sharing the CPU to a lower level, and allows threads to be switched off and on the processor without any latency. There are a number of ways that multithreading can be implemented, including: fine-grained multithreading, coarse-grained multithreading, and simultaneous multithreading.

Fine-Grained Multithreading

Fine-grained multithreading involves instructions from threads issuing in a round-robin fashion--one instruction from process A, one instruction from process B, another from A, and so on (note that there can be more than two threads). This type of multithreading applies to situations where multiple threads share a single pipeline or are executing on a single-issue CPU.

Coarse-Grained Multithreading

The next type of multithreading is coarse-grained multithreading. Coarse-grained multithreading allows one thread to run until it executes an instruction that causes a latency (cache miss), and then the CPU swaps another thread in while the memory access completes. If a thread doesn't require a memory access, it will continue to run until its time limit is up. As with fine-grained multithreading, this applies multiple threads sharing a single pipeline or executing on a single-issue CPU.

Simultaneous Multithreading (SMT)

Simultaneous multithreading is a refinement on coarse-grained multithreading. The scheduling algorithm allows the active thread to issue as many instructions as it can (up to the issue-width) to keep the functional units busy. If a thread does not have sufficient ILP to do this, other threads may issue instructions to fill the empty slots. SMT only applies to superscalar architectures which are characterized by multiple-issue CPUs.