Our third case study, like the first, is from computational science. It is an example of an application that accesses a distributed data structure in an asynchronous fashion and that is amenable to a functional decomposition.
Computational techniques are being used increasingly as an alternative to experiment in chemistry. In what is called ab initio quantum chemistry, computer programs are used to compute fundamental properties of atoms and molecules, such as bond strengths and reaction energies, from first principles, by solving various approximations to the Schrödinger equation that describes their basic structures. This approach allows the chemist to explore reaction pathways that would be hazardous or expensive to explore experimentally. One application for these techniques is in the investigation of biological processes. For example, Plate 6
shows a molecular model for the active site region in the enzyme malate dehydrogenase, a key enzyme in the conversion of glucose to the high-energy molecule ATP. This image is taken from a simulation of the transfer of a hydride anion from the substrate, malate, to a cofactor, nicotinamide adenine dinucleotide. The two isosurfaces colored blue and brown represent lower and higher electron densities, respectively, calculated by using a combined quantum and classical mechanics methodology. The green, red, blue, and white balls are carbon, oxygen, nitrogen, and hydrogen atoms, respectively.
Fundamental to several methods used in quantum chemistry is the need to compute what is called the Fock matrix, a two-dimensional array representing the electronic structure of an atom or molecule. This matrix, which is represented here as F, has size N × N and is formed by evaluating the following summation for each element:

   F_{ij} = \sum_{kl} D_{kl} ( 2 I_{ijkl} - I_{ikjl} )        (2.3)

where D is a two-dimensional array of size N × N that is only read, not written, by this computation, and the I_{ijkl} represent integrals that are computed using elements i, j, k, and l of a read-only, one-dimensional array A with O(N) elements. An integral can be thought of as an approximation to the repulsive force between two electrons.
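To make the structure of Equation 2.3 concrete, here is a minimal Python sketch that evaluates the summation directly; the integral function is a stand-in for the real two-electron integral computed from A, and the use of NumPy arrays is an illustrative choice rather than part of the original program.

```python
import numpy as np

def integral(A, i, j, k, l):
    # Stand-in for the two-electron integral I_ijkl; a real code computes an
    # expensive function of elements i, j, k, l of the array A.
    return A[i] * A[j] * A[k] * A[l]            # illustrative placeholder only

def fock_build_naive(D, A):
    """Evaluate Equation 2.3 directly: two integrals per (k, l) term,
    hence 2*N**2 integrals per element of F and 2*N**4 in total."""
    N = D.shape[0]
    F = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            for k in range(N):
                for l in range(N):
                    F[i, j] += D[k, l] * (2 * integral(A, i, j, k, l)
                                          - integral(A, i, k, j, l))
    return F
```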
Because Equation 2.3 includes a double summation over k and l, apparently 2N^2 integrals must be computed for each element of F, for a total of 2N^4 integrals. However, in practice it is possible to exploit redundancy in the integrals and symmetry in F and reduce this number to a total of approximately N^4/8. When this is done, the algorithm can be reduced to the rather strange logic given as Algorithm 2.3. In principle, the calculation of each element of F requires access to all elements of D and A; furthermore, access patterns appear highly irregular. In this respect, the Fock matrix construction problem is representative of many numeric problems with irregular and nonlocal communication patterns.
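The factor-of-eight reduction comes from redundancy among the integrals: assuming the usual permutational symmetry of two-electron integrals (I_ijkl is unchanged when i and j are swapped, when k and l are swapped, or when the pair (i,j) is exchanged with (k,l)), the N^4 index quadruples collapse into roughly N^4/8 equivalence classes. The short sketch below, a purely illustrative check, counts those classes for a small N.

```python
from itertools import product

def equivalent_quadruples(i, j, k, l):
    """Index quadruples that share one integral value, assuming the usual
    permutational symmetry of two-electron integrals (an assumption here,
    not stated explicitly in the text)."""
    return {
        (i, j, k, l), (j, i, k, l), (i, j, l, k), (j, i, l, k),
        (k, l, i, j), (k, l, j, i), (l, k, i, j), (l, k, j, i),
    }

# Counting the distinct classes for a small N shows the ~N**4/8 behaviour.
N = 12
classes = {min(equivalent_quadruples(i, j, k, l))
           for i, j, k, l in product(range(N), repeat=4)}
print(len(classes), N**4 // 8)   # 3081 vs 2592; the ratio tends to 1 as N grows
```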
For the molecular systems of interest to chemists, the problem size N can range from a few hundred to several thousand. Because the evaluation of each integral is a fairly expensive operation, and roughly N^4/8 integrals are needed, the construction of the Fock matrix requires a very large number of operations. In addition, most methods require that a series of Fock matrices be constructed, each representing a more accurate approximation to a molecule's electronic structure. These considerations have motivated a considerable amount of work on both efficient parallel algorithms for Fock matrix construction and improved methods that require the computation of fewer than O(N^4) integrals.
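As a rough, purely illustrative calculation (the particular values of N below are not taken from the text), the N^4/8 count translates into the following integral totals per Fock build.

```python
# Unique integrals per Fock build scale as N**4/8 (values of N illustrative).
for N in (100, 400, 1000):
    print(f"N = {N:5d}: about {N**4 / 8:.2e} integrals")
# N =   100: about 1.25e+07 integrals
# N =   400: about 3.20e+09 integrals
# N =  1000: about 1.25e+11 integrals
```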
Because the Fock matrix problem is concerned primarily with the symmetric
two-dimensional matrices F and D, an obvious partitioning
strategy is to apply domain decomposition techniques to these matrices to create
N(N+1)/2 tasks, each containing a single element from each matrix (F_ij, D_ij) and responsible for the operations required to compute its F_ij. This yields N(N+1)/2 tasks, each with O(1) data and each responsible for computing 2N^2 integrals, as specified in Equation 2.3.
This domain decomposition strategy is simple but suffers from a significant disadvantage: it cannot easily exploit redundancy and symmetry and, hence, performs eight times too many integral computations. Because an alternative algorithm based on functional decomposition techniques is significantly more efficient (it does not perform redundant computation and does not incur high communication costs), the domain decomposition algorithm is not considered further.
Figure 2.31: Functional decomposition of Fock matrix
problem. This yields about N^2 data tasks, shown in the upper part of the figure, and about N^4/8 computation tasks, shown in the lower part of the figure. Computation tasks send read and
write requests to data tasks.
Quite a different parallel algorithm can be developed by focusing on the
computation to be performed rather than on the data structures manipulated, in
other words, by using a functional decomposition. When redundancy is considered,
one naturally thinks of the computation as comprising a set of approximately N^4/8 integrals (the integral procedure of Algorithm 2.3), each requiring six D elements and contributing to six F elements. Focusing on these computations, we define approximately N^4/8 ``computation'' tasks, each responsible for one integral.
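The data requirements of a single integral task can be made explicit. Assuming the standard scatter of one unique integral into the Coulomb-like and exchange-like terms of Equation 2.3, the task reads six D elements, updates six F elements, and reads four A elements; the sketch below records only this access pattern, deliberately omitting the numerical scale factors and the special cases that arise when indices coincide.

```python
def integral_access_pattern(i, j, k, l):
    """Data touched when one unique integral I_ijkl is folded into F.
    Only the access pattern is shown; the scale factors (and the special
    cases for repeated indices) of a real Fock build are omitted."""
    d_reads  = [(k, l), (i, j), (j, l), (j, k), (i, l), (i, k)]   # six D elements
    f_writes = [(i, j), (k, l), (i, k), (i, l), (j, k), (j, l)]   # six F elements
    a_reads  = [i, j, k, l]                                       # four A elements
    return d_reads, f_writes, a_reads

d, f, a = integral_access_pattern(3, 1, 7, 5)
print(len(d), len(f), len(a))   # 6 6 4 -- the sixteen communications noted below
```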
Having defined a functional decomposition, we next need to distribute data
structures over tasks. However, we see no obvious criteria by which data
elements might be associated with one computation task rather than another: each
data element is accessed by many tasks. In effect, the F, D,
and A arrays constitute large data structures that the computation
tasks need to access in a distributed and asynchronous fashion. This situation
suggests that the techniques described in Section 2.3.4 for
asynchronous communication may be useful. Hence, for now we simply define two
sets of ``data'' tasks that are responsible only for responding to requests to
read and write data values. These tasks encapsulate elements of the
two-dimensional arrays D and F (D_ij, F_ij, 1 ≤ i, j ≤ N) and of the one-dimensional array A (A_i, 1 ≤ i ≤ N), respectively.
In all, our partition yields a total of approximately N^4/8 computation tasks and N^2 data tasks (Figure 2.31).
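A data task in this scheme does nothing but service read and accumulate requests for the values it owns, in the asynchronous style of Section 2.3.4. The following sketch is an illustrative assumption about how such a task might look, with Python queues standing in for channels and invented message formats.

```python
from queue import Queue

def data_task(d_part, f_part, requests, replies):
    """Serve asynchronous requests for the D and F elements this task owns.
    Message formats ("read_D", "acc_F", "done") are invented for illustration."""
    while True:
        msg = requests.get()                  # block until a request arrives
        if msg[0] == "read_D":                # ("read_D", index) -> reply with value
            replies.put(d_part[msg[1]])
        elif msg[0] == "acc_F":               # ("acc_F", index, contribution)
            f_part[msg[1]] += msg[2]
        elif msg[0] == "done":                # shutdown signal
            return

# Minimal usage with in-process queues standing in for channels:
req, rep = Queue(), Queue()
req.put(("acc_F", 2, 0.5)); req.put(("read_D", 0)); req.put(("done",))
data_task({0: 1.0, 2: 3.0}, {2: 0.0}, req, rep)
print(rep.get())   # 1.0
```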
We have now defined approximately N^4/8 computation tasks and N^2 data tasks. Each computation task must perform sixteen
communications: six to obtain D matrix elements, four to obtain
A matrix elements, and six to store F matrix elements. As the
computational costs of different integrals can vary significantly, there does
not appear to be any opportunity for organizing these communication operations
into a regular structure, as is advocated in Section 2.3.2.
On many parallel computers, the cost of an integral will be comparable to the cost of a communication. Hence, communication requirements must be reduced by agglomeration. We describe two alternative strategies that can be used to achieve this goal. Their data requirements are illustrated in Figure 2.32.
Figure 2.32: Agglomeration strategies for Fock matrix
construction with N=P=5, for (a) the total replication algorithm and
(b) the partial replication algorithm. In each case, the five tasks are shown
with shading used to represent the portion of the symmetric D and
F matrices allocated to each task. In (a), each matrix is replicated in
each task. In (b), each task is given a single row and column; this corresponds
to a factor of two replication.
1. Total replication. Communication costs can be cut dramatically by replicating the F and D matrices in each of P tasks, one per processor of a parallel computer. Each task is given responsibility for 1/P of the integrals. Computation can then proceed in each task without any communication. The only coordination required is a final summation to accumulate partial F matrices. This can be achieved using a parallel vector reduction algorithm described in Section 11.2.
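A minimal sketch of the total replication strategy, written here with mpi4py purely for convenience (the library, the helper names my_quadruples and compute_contribution, and the data layout are assumptions, not the book's notation): each process accumulates its share of the integrals into a private copy of F, and one reduction sums the partial matrices at the end.

```python
import numpy as np
from mpi4py import MPI    # assumed available; any global-sum mechanism would do

def fock_build_replicated(D, A, my_quadruples, compute_contribution):
    """Total replication: every process holds full copies of D and F.

    my_quadruples        -- this process's 1/P share of the unique (i,j,k,l)
                            quadruples (hypothetical helper input)
    compute_contribution -- scatters one integral's contributions into F
                            (hypothetical helper function)
    """
    comm = MPI.COMM_WORLD
    F_partial = np.zeros_like(D)
    for (i, j, k, l) in my_quadruples:           # no communication in this loop
        compute_contribution(F_partial, D, A, i, j, k, l)
    F = np.empty_like(F_partial)
    comm.Allreduce(F_partial, F, op=MPI.SUM)     # one final parallel summation
    return F
```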
The technique of replicating data structures on each processor of a parallel computer is commonly used in parallel computing to reduce software engineering costs. It allows an existing sequential code to be adapted quickly for parallel execution, since there is no need to modify data structures. The principal disadvantage of the technique is that it is nonscalable. Because total memory requirements scale with the number of tasks created, the largest problem that can be solved is determined not by the total amount of memory in a parallel computer, but by the amount available in a single processor. For example, on a 512-processor computer with 16 MB of memory per processor, an implementation of the quantum chemistry code DISCO that uses this strategy cannot solve problems with N>400 . In principle, it would be interesting to solve problems where N is 10 times larger.
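A back-of-the-envelope calculation shows why replication is nonscalable. Counting only the two replicated N × N double-precision matrices (a deliberate simplification; a real code such as DISCO stores considerably more data, which is presumably why the practical limit on a 16 MB node is nearer N=400), per-process memory grows with N^2 no matter how many processors are added.

```python
# Per-process memory for the replicated D and F matrices (double precision).
BYTES_PER_WORD = 8
for N in (400, 4000):
    mbytes = 2 * N * N * BYTES_PER_WORD / 2**20
    print(f"N = {N}: {mbytes:7.1f} MB per process, regardless of P")
# N = 400:     2.4 MB per process, regardless of P
# N = 4000:  244.1 MB per process, regardless of P
```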
Figure 2.33: Data requirements for integral clusters.
Each task accesses three rows (and sometimes columns) of the D and
F matrices.
2. Partial replication. An alternative approach is as follows. First, we agglomerate computation in what seems an obvious
way, namely, by making the inner loop of the procedure fock_build in
Algorithm 2.3 into a
task. This yields approximately N^3/6 computation tasks, each responsible for O(N) integrals. Next,
we examine the communication requirements of each such task. We find that there
is considerable locality in the data required by these clusters of integrals:
each cluster accesses the ith, jth, and kth row
(and sometimes column) of D and F (Figure 2.33). To
exploit this locality, we agglomerate data to create N data tasks, each
containing a row/column pair of the two-dimensional arrays D and
F and all of the one-dimensional array A. In this scheme, each
element of D and F is replicated once, and A is
replicated N times, so total storage requirements are increased from an
average of N to 3N per task. Because of this replication, each
computation task now requires data from just three data tasks. Hence, the number
of messages is reduced from about 2N^4 to about N^3/2. The total volume communicated remains O(N^4). Because the cost
of communicating a word is typically much less than the cost of computing an
integral, this is an efficient parallel algorithm.
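A small sketch makes the communication saving of partial replication concrete. The function names and the rough counting below are illustrative assumptions; the point is only that each (i,j,k) cluster talks to three data tasks, so the message count scales as N^3 rather than N^4.

```python
def data_tasks_needed(i, j, k):
    """Computation task (i, j, k) reads rows/columns i, j, and k of D and
    writes the corresponding parts of F, so it talks to three data tasks."""
    return {i, j, k}

def approx_message_count(N):
    """~3 request messages per (i, j, k) cluster, with ~N**3/6 clusters,
    versus 16 messages per integral (~2*N**4 in all) before agglomeration."""
    return 3 * (N**3 / 6)

N = 100
print(approx_message_count(N), 2 * N**4)   # 500000.0 versus 200000000
```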
The ``partial replication'' Fock matrix construction
algorithm creates N data tasks and approximately N^3/6 computation tasks. We use the notation (ijk) to identify the computation task responsible for computing the integrals I_ijkl (for all l); this task requires data from data tasks i, j, and k. To
complete the parallel algorithm, we must define a mapping of data and
computation tasks to processors.
We assume N processors. Since each data task will receive roughly the same number of requests, we allocate one data task to each processor. This leaves the problem of mapping the computation tasks, for which a variety of approaches can be imagined.
We have developed two alternative parallel algorithms for the Fock matrix construction problem.
This case study illustrates some of the tradeoffs that can arise in the design process. The first algorithm minimizes communication and software engineering costs; however, it is not scalable. In contrast, the second algorithm has higher communication costs but is highly scalable: its memory requirements increase only with problem size, not with the number of processors. To choose between the two algorithms, we need to quantify their parallel performance and then determine the importance of scalability by assessing application requirements and the characteristics of the target parallel computer.