Parallel Programming

2008-07-29

During the last few years, we have seen a trend change in CPU design. Until about 2003, CPUs became more powerful through frequency scaling. The number of operations per second increased exponentially over a period of 20 years. Clock speeds went from 4.77 MHz in the first IBM PC (1981) to 3.3 GHz in a high-end PC of the year 2003. Thus, Moore’s law was supported chiefly by increasing clock speed. Then something “strange” happened: clock speeds plateaued. For a brief moment, it appeared that Moore’s law was nearing its end. Not because of the physical limits of miniaturisation, or because higher clock speeds were impossible, but because of thermal problems associated with such high clock frequencies.

Water cooled systems were already introduced at the top-end of the market, but obviously the energy efficiency of these systems presents a serious economic problem. Then in 2004, the first PC dual core processor was introduced. Dual cores became widely available in 2005 and by now we have become used to quadcore processors, while eight-core systems are around the corner. The trend is obvious: in the forseeable future we will move from multi-core to many-core CPUs with dozens and perhaps even more cores.

Increasing the clock speed of a CPU has the effect, that program execution speed increases proportionally with clock speed. The increase is independent of software and hardware architectures. A sequential algorithm simply runs twice as fast on a machine where CPU operations take half the time. Unfortunately, this is not the case for a machine with twice as many CPUs. A sequential algorithm runs just as fast on a 3 GHz single core CPU than on a 3 GHz dual core CPU, because it uses only one of the CPU cores. Increased overall execution speed is therefore only achieved when multiple tasks run concurrently, because they can be executed simultaneously on different CPUs.

This problem of parallelism is not new in computer science. Unfortunately though, today’s programming languages aren’t very well equipped to deal with parallel scaling. The computer language idiom that typifies multi-core concurrency is the thread model. Threads are lightweight processes that are typically used for self-contained subtasks. Several languages, notably Java, offer APIs and native support for their implementation. Threads are well suited to asynchronous processing, for example in communication and networking. They can also be used for simple parallelisation, where a computing problem is parallel by nature (for example serving multiple web documents at the same time).

However, threads are relatively cumbersome and thus not really suitable for fine-grained parallel programming, such as traversing data structures or executing nested loops in parallel. Edward A. Lee describes the problems with threads in this excellent article.

The fundamental problem that software engineers face is described by Amdahl’s law. Amdahl’s formula expresses the speed gain of an algorithm achieved by parallelisation: speedup = N / ( (B*N) + (1-B) ), where B is the non-parallelizable (sequential) percentage of the problem and N is the number of worker threads, or processor cores. There are two notable things about Amdahl’s law: (1) the speedup is highly dependent on B, and (2) the curve flattens logarithmically with increasing N. It’s also important to know that Amdahl’s law assumes a constant problem size, which is unrealistic given that parallelisation requires a certain amount of control overhead.

Nevertheless, we can draw several conclusions from Amdahl’s law. First, the performance gain from parallel scaling is lower than that of frequency scaling. Second, it is not proportional to the number of cores. Third, it is highly dependent on software architecture, since the latter chiefly determines the size of B. From the perspective of a software engineer, the last conclusion is probably the most interesting one. It leads to the question what can be done to maximise parallelisation at the level of the algorithm exploiting data and task parallelisms. The present thread model is too coarse-grained to provide solutions at the algorithm-level. Hence, what is called for are new programming idioms that make this task easier.

Ideally, parallelisation is fully automatic, implemented by the compiler and the underlying hardware architecture. In this case, the application programmer could simply continue to formulate sequential algorithms without having to worry about parallelism. Unfortunately, automatic parallelisation has turned out to be extremely complex and difficult to realise. Despite many years of research in compiler architecture, there are no satisfactory results. Parallelisation at the algorithm level involves several abstraction steps, such as decomposition, mapping, scheduling, synchronisation, and result merging. The hardest among these tasks is very likely decomposition, which means breaking down a sequential task into parts that can be solved in parallel.

The preferred method for doing this is divide and conquer. A problem is divided into identical smaller pieces, whereas the division can be expressed recursively or iteratively. The problem chunks can then be solved in parallel and the subtask results are joined into one final result. Prime examples for this strategy are merge sort and sequential search algorithms. These are easily parallelisable. Likewise many operations on arrays and collections can be parallelised fairly easily. But certain tasks are not obviously parallelisable, for example the computation of the Fibonacci series: f (n) = f (n-1) + f (n-2). Since every step in the computation depends on the previous step, the Fibonacci algorithm is sequential by nature. In theoretical informatics, the question whether algorithms of the complexity classes P and NP are in principle parallelisable is still unsolved. For practical purposes, some problems are simply non-parallelisable.

Given that automatic parallelisation is currently out of reach, the need for the facilitation of algorithm parallelisation in computer languages boils down to (1) the need for expressive constructs that allow programmers to express parallelisms intuitively, and (2) the need for freeing programmers from writing boilerplate code for the drudge work of parallel execution, such as scheduling, controlling, and synchronisation. There are already a number of computer languages that provide built-in idioms for parallel programming, such as Erlang, Parallel Haskell, MPD, Fortress, Linda, Charm++ and others. However, these are fringe languages with a small number of users. It is questionable whether the parallel scaling trend in hardware will lead to a wide adoption of any of these languages.

Perhaps mainstream languages will evolve to acquire new APIs, libraries, and idioms to support parallel programming. For example, there are Compositional C++, MPI, Unified Parallel C, Co-Array Fortran and other existing extensions that make widely used languages more suitable for parallel programming, although there aren’t any established standards yet. It also remains to be seen whether the functional programming paradigm will catch on in view of parallel programming. Java is promising, because the JDK 7 (Dolphin) will contain a new API for fine-grained parallel processing.

In this very informative article by Brian Goetz, the author of Java Concurrency in Practice, introduces the new java.util.concurrency API features of Java SE 7. It will include a fork-join framework that simplifies the expression of parallel processing, for example in loops or in recursive method invocation. It will also provide new data structures, such as ParallelArray, that make parallel iteration easier to express. To learn more about parallel computing, read the free Introduction To Parallel Computing by Blaise Barney or search the free parallel programming books at Google Books.