Laurentiu Cucos and Elise de Doncker
Department of Computer Science
Western Michigan University
Kalamazoo, MI 49008 USA
{lcucos,elise}@cs.wmich.edu
Keywords: Quasi-Monte Carlo, synchronous/asynchronous distributed integration, lattice rules, computational finance, multivariate t-distribution.
Abstract
We compare alternative strategies of a distributed algorithm to integrate a multidimensional function using Quasi-Monte Carlo techniques. We give a parallel performance analysis and experimental results, and introduce a new algorithm strategy that leads to near ideal speedup results, using up to 30 processors of a Beowulf type cluster. The results are obtained for a computational finance problem in 360 dimensions and for a 5-dimensional t-distribution arising in multiple comparison problems.
1. INTRODUCTION
We apply a Quasi-Monte Carlo (QMC) method which computes a sequence of randomized lattice rules to approximate integrals over the $d$-dimensional unit hypercube [1,4,3,2]. Let

$$ If = \int_{[0,1]^d} f(x)\,dx. \qquad (1) $$

A lattice rule with points $x_0,\ldots,x_{n-1}$ is applied in $q$ randomly shifted copies; copy $j$ produces the estimate $Q_j f = \frac{1}{n}\sum_{i=0}^{n-1} f(\{x_i + \beta_j\})$, where $\beta_j$ is a uniform random shift and $\{\cdot\}$ denotes taking fractional parts componentwise. The sample mean

$$ \bar{Q} f = \frac{1}{q}\sum_{j=0}^{q-1} Q_j f \qquad (2) $$

allows for a standard error estimation by

$$ E_s = \sqrt{\frac{1}{q(q-1)}\sum_{j=0}^{q-1}\left(Q_j f - \bar{Q} f\right)^2}. \qquad (3) $$

The algorithm calculates the values of (2) for successively larger numbers of integration points until either an answer is found to the user-specified accuracy or the function count limit is reached.
We denote by $c_{ij}$ a cell of work, where $0 \le j < q$ and $q$ is the length of a row. The $i$th row of cells is $r_i = \{c_{i0}, \ldots, c_{i,q-1}\}$; each cell of row $i$ evaluates one randomly shifted copy of the rule of $n_i$ points. Currently we use rules from [4], with $n_0 = 31$ and rule sizes $n_i$ that increase from row to row.
The error generally decreases in a stepwise manner from one row to the next. The error estimate is expected to improve along a row.
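The row computation of (2) and (3) can be sketched as follows. This is a minimal Python sketch of a randomly shifted (Cranley-Patterson style) rank-1 lattice rule; the generator vector `z` and the function `f` are illustrative assumptions, not the actual rules of [4].

```python
import numpy as np

def randomized_lattice_rule(f, d, n, z, q, rng):
    """Estimate the integral of f over [0,1]^d with a rank-1 lattice
    rule of n points, evaluated in q randomly shifted copies (one per
    cell of a row).  Returns the row mean and its standard error."""
    i = np.arange(n).reshape(-1, 1)
    pts = (i * np.asarray(z)) % n / n             # the n lattice points
    estimates = np.empty(q)
    for j in range(q):                            # one "cell" per shift
        shift = rng.random(d)
        shifted = (pts + shift) % 1.0             # shifted copy of the rule
        estimates[j] = np.mean([f(x) for x in shifted])
    mean = estimates.mean()                                    # cf. (2)
    stderr = np.sqrt(np.sum((estimates - mean) ** 2)
                     / (q * (q - 1)))                          # cf. (3)
    return mean, stderr

# hypothetical usage: f(x) = (1 + (x_1 - 0.5))(1 + (x_2 - 0.5)),
# which integrates to 1 over the unit square
rng = np.random.default_rng(0)
f = lambda x: np.prod(1.0 + (x - 0.5))
mean, err = randomized_lattice_rule(f, d=2, n=31, z=[1, 12], q=8, rng=rng)
```

The spread of the $q$ independent copies drives the standard error estimate, which is what makes the row (rather than the single cell) the natural unit for error control.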
In Section 2 we analyze the algorithm of [2] and show its limitations theoretically and experimentally. Based on these results, Section 3 introduces a new strategy which delivers ideal speedup on a homogeneous system. Section 4 gives experimental results, and Section 5 examines possible adaptations for heterogeneous environments.
2. SINGLE CELL ALGORITHM
Consider using $p$ processors for the computation, one of which is assigned a controlling role, while the remaining ones are workers. In the single cell version of the algorithm, currently used in PARINT1.1, each of the workers is assigned one cell of work at a time. The controller updates the result and sends a new cell if necessary [2]. Figure 1 illustrates a case with fewer workers than cells in a row.
Since the computations are asynchronous, different processors may be working on different rows. At termination it is not necessary that a row is completed in order to use its result (provided the row has reached a set minimum length). Results from this algorithm tend to show a degradation in performance when the number of workers is relatively large compared to the number of cells $q$ in a row. We found that this behavior can usually be attributed to breaking loss, i.e., loss in parallelism caused by processors doing useless work towards the end of the computation. This is replaced by starvation loss, where processors sit idle waiting for work, if the computation proceeds to the last possible row (the row of maximum index allowed by the function count limit).
Let us consider the special case where the execution in fact behaves in a synchronous manner and the number of processors is a multiple of the number of cells in a row, that is, $p = kq$ with $k > 1$.
Assume the problem requires $N$ rows, $r_0, \ldots, r_{N-1}$, to terminate. Initially, the first $q$ processors are working on row $r_0$, the next $q$ processors work on $r_1$, and so on. Assuming that the processors finish their cells and receive new tasks at the same time, then once $r_0$ is done, the first set of $q$ processors moves on to $r_k$; once $r_1$ is done the second set starts row $r_{k+1}$, and so on. When the last row $r_{N-1}$ required for this problem is assigned, all rows of lower index are either completed or in the process of being completed. Since the work required for lower index rows is much smaller than the work required for $r_{N-1}$, and there are only $q$ processors assigned to $r_{N-1}$, all other workers will be reassigned to rows $r_i$, $i \ge N$, before $r_{N-1}$ is completed. However, this work is useless, since the controller stops the computation when $r_{N-1}$ is completed; thus breaking loss is incurred.
In order to evaluate the parallel performance, let $\tau$ be the time required to complete a function evaluation; then the sequential computing time can be estimated by $T_s = \tau q \sum_{i=0}^{N-1} n_i$. In the case under consideration, the total parallel execution time equals the time of the $q$ processors that eventually work on the last row $r_{N-1}$. That is,

$$ T_p = \tau \sum_{\substack{0 \le i \le N-1 \\ i \equiv N-1 \ (\mathrm{mod}\ k)}} n_i, \qquad (4) $$

since that set of $q$ processors handles every $k$th row, and a row of $q$ cells takes time $\tau n_i$ on $q$ processors.
Now let us consider $p = q$. All the processors work on one row at a time. Once a row is finished all workers move to the next one. When the last row is completed there are no processors working on cells beyond the last row, resulting in near ideal speedup. The same can be shown if $p < q$ and $q$ is a multiple of $p$ (however, breaking or starvation loss may be incurred when $q$ is not a multiple of $p$).
Note furthermore that this method imposes a limitation on the number of processors, since the maximum number of workers that can participate in the computation is $q$ (corresponding to the case where each worker is assigned only one cell). Since $q$ is usually small (the runs in Section 4 use 8 and 6 cells per row), this limitation already sets in for small numbers of processors.
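The breaking-loss analysis above can be checked with a toy scheduler simulation. The doubling rule sizes $n_i = 31 \cdot 2^i$ used below are an assumption for illustration (not necessarily the sizes of the rules of [4]), as is the perfectly synchronous start:

```python
import heapq

def simulate(p, q, N, sizes):
    """Toy model of the single-cell algorithm: p workers pull cells one
    at a time, in row order; row i has q cells, each costing sizes[i]
    time units.  The controller stops once row N-1 is complete; cells
    handed out on rows >= N before that point are breaking loss."""
    heap = [(0.0, w, -1) for w in range(p)]  # (free time, worker, row just done)
    heapq.heapify(heap)
    left = [q] * (2 * N + 2 * p)             # unfinished cells per row (padded)
    nxt = 0                                  # global index of next cell
    wasted = 0.0
    while True:
        t, w, done = heapq.heappop(heap)
        if done >= 0:
            left[done] -= 1
            if done == N - 1 and left[done] == 0:
                return t, wasted             # finish time, useless work assigned
        row = nxt // q
        nxt += 1
        cost = sizes[min(row, N - 1)]        # rows past N-1 modeled like row N-1
        if row >= N:
            wasted += cost                   # breaking loss: useless assignment
        heapq.heappush(heap, (t + cost, w, row))

sizes = [31 * 2 ** i for i in range(8)]      # assumed doubling rule sizes
finish_eq, _ = simulate(4, 4, 8, sizes)      # p = q: rows processed in lockstep
finish_2q, _ = simulate(8, 4, 8, sizes)      # p = 2q: breaking loss appears
```

With $p = q$ the finish time equals the sequential per-row total and the speedup is ideal; with $p = 2q$ the speedup stays well below $2q$, matching the argument above.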
The above observations lead to the idea of distributing every row among all the workers, so that (in a synchronous setting) all workers process the same row at any given time. This leads to the algorithm of Section 3 below.
3. ROW DISTRIBUTED METHOD
This strategy splits every cell into $w$ pieces of equal size, where $w$ is the number of workers and every individual piece has $n_i/w$ points to compute (see Figure 2).
Every worker is assigned a work unit consisting of its piece of every cell in the row. After the controller has received the row results from all workers, it assembles the cell results, and generates the row mean and standard error estimate according to (2) and (3).
This new strategy eliminates breaking loss almost entirely. Even though the total number of messages is now bounded by $Nw$ (one result message per worker per row) and the size of a result message has increased by a factor of $q$, the net time increase is not pronounced, since communication at one worker is generally overlapped by computation performed at other workers. The apparent limitation of the number of workers to the size of the first rule (31 points), where each worker would receive only a single function evaluation per cell, can be alleviated by skipping the first row (or rows, as needed). An issue that remains is that the performance is easily degraded if some processors are slower than others, making this strategy unsuitable for heterogeneous systems.
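The row-distributed split can be sketched as follows. Worker and controller run as functions in one process here for illustration; the round-robin slicing, the lattice generator, and the test integrand are assumptions, and slice sizes differ by at most one point when $w$ does not divide $n$.

```python
import numpy as np

def worker_share(f, pts, shifts, m, w):
    """Worker m's share of one row: the same 1/w slice of every cell.
    Returns one partial sum per cell (randomized copy) of the row."""
    my_pts = pts[m::w]                    # round-robin 1/w slice of the rule
    return np.array([sum(f((x + s) % 1.0) for x in my_pts) for s in shifts])

def row_result(f, pts, shifts, w):
    """Controller side: assemble the w partial-sum messages into the q
    cell results, then form the row mean and standard error (cf. (2), (3))."""
    n, q = len(pts), len(shifts)
    partial = [worker_share(f, pts, shifts, m, w) for m in range(w)]  # "messages"
    cells = sum(partial) / n              # assembled cell results
    mean = cells.mean()
    stderr = np.sqrt(np.sum((cells - mean) ** 2) / (q * (q - 1)))
    return mean, stderr

# hypothetical usage: 31-point rank-1 lattice in 2 dimensions,
# 8 shifted copies (cells), 5 workers
rng = np.random.default_rng(1)
n, d, q, w = 31, 2, 8, 5
pts = (np.arange(n).reshape(-1, 1) * np.array([1, 12])) % n / n
shifts = rng.random((q, d))
f = lambda x: np.prod(1.0 + (x - 0.5))    # integrates to 1 over [0,1]^2
mean, err = row_result(f, pts, shifts, w)
```

Note that each result message carries $q$ partial sums (one per cell), which is the factor-of-$q$ message size increase mentioned above.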
4. EXPERIMENTAL RESULTS
In this section we present results from parallel runs on our Beowulf cluster, composed of 32 (1.2 GHz) Athlon processors connected through a Fast Ethernet switch. This is our largest homogeneous subcluster.
Using the mortgage-backed security problem from [6,7], we tested the new algorithm of Section 3 with up to 30 processors. Figure 3 shows the speedup obtained when calculating the (360-dimensional) integral. The function count limit was set to 17,552, with 8 cells per row. A function evaluation requires 0.3 sec. The result obtained is 286045.51, with an estimated error of 7.391.
We also include results for a multivariate Student-t distribution problem arising in multiple comparisons [5,2]. Figure 4 shows the speedup obtained for a 5-dimensional integral. The function count limit was set to 21,666,096, corresponding to 25 rows with 6 cells per row. The result obtained is 0.899999 with an estimated error of 9.46e-6. A function evaluation for this problem requires far less time than for the previous one.
The new algorithm exhibits near ideal speedup for these two problems of widely different computational granularity, up to the maximum number of processors used.
5. ADAPTATIONS FOR HETEROGENEOUS SYSTEMS
Since, for each row, every processor performs the same amount of work in the strategy of Section 3, the total run time for the completion of $N$ rows is determined by the slowest machines. To overcome this limitation we consider three alternative strategies.
5.1. Proportional row splitting
For each row, a worker gets a slice of work proportional to its processor power. This information can be acquired statically, dynamically, or using a combination of both. Statically, the controller can retrieve the physical properties of each machine (CPU, memory, bus speed, etc.). For a cluster, these numbers almost never change, and can be stored in and retrieved from a file. Dynamically, each worker reports its average time per function evaluation. As a combination, the controller first splits the work according to the static information, and later adjusts the distribution based on reported function evaluation times and worker response times as the computation proceeds.
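The proportional split and its dynamic refinement can be sketched as follows; the smoothing factor `alpha` and the speeds-in-points-per-second convention are illustrative assumptions.

```python
def proportional_split(n, speeds):
    """Divide the n points of a row among workers in proportion to their
    estimated speeds (points per second); leftover points from integer
    rounding go to the fastest worker."""
    total = sum(speeds)
    shares = [int(n * s / total) for s in speeds]
    shares[speeds.index(max(speeds))] += n - sum(shares)
    return shares

def update_speed(old, points, elapsed, alpha=0.5):
    """Dynamic refinement: blend the previous speed estimate with the
    latest measured rate, smoothing out transient load on the machine."""
    return alpha * (points / elapsed) + (1 - alpha) * old

# hypothetical usage: three workers, the middle one twice as fast
shares = proportional_split(1000, [1.0, 2.0, 1.0])
new_speed = update_speed(100.0, 300, 2.0)
```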
This strategy has a number of disadvantages. The static approach gives only a rough estimate: the processing time is affected by factors beyond those mentioned above, including memory latency and caching behavior, which can change the computation time significantly from problem to problem. The dynamic approach works well if the workers are dedicated to the computation; otherwise, other processes sharing the same machine distort the estimate of a worker's power.
5.2. Fixed work splitting
Each work unit can have the same size, so that earlier rows are split into fewer units than later ones. This gives faster processors a chance to request work more often than slower ones. This strategy has a few drawbacks. On the one hand, the total number of messages increases from row to row. Furthermore, choosing the optimal work unit size is a delicate task, mainly because it depends on the integrand function. A big work unit can introduce significant starvation or breaking loss (in case the last units of the last row are assigned to the slower workers), while a small work unit can introduce significant communication time.
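The growth in message count is easy to see from the unit count per row; the rule sizes below are hypothetical, for illustration only.

```python
from math import ceil

def units_per_row(rule_sizes, q, unit_size):
    """Fixed work units: row i holds q cells of rule_sizes[i] points, so
    it is cut into ceil(q * rule_sizes[i] / unit_size) units; later
    (larger) rows therefore generate more units and more messages."""
    return [ceil(q * n / unit_size) for n in rule_sizes]

# hypothetical rows of 31, 62, 124 points, 8 cells each, 100-point units
rows = units_per_row([31, 62, 124], 8, 100)
```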
5.3. Constant row splitting
This strategy splits each row into a constant number of work units, $K$; the size of each work unit is proportional to the row size. If some processors are slower than others, they will process fewer units from each row. This strategy has the advantage that the total number of messages per row is constant (independent of the total amount of work). As in the previous case, starvation or breaking loss can occur in situations where the last work units are assigned to the slower processors.
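The self-balancing effect of pulling $K$ units on demand can be sketched as follows; the worker speeds and the row size are hypothetical.

```python
import heapq

def constant_split_row(n_points, K, speeds):
    """Split one row into K equal units (n_points/K points each) and let
    workers pull units as they finish: faster workers naturally process
    more units, without the controller estimating their speeds."""
    unit = n_points / K
    free = [(0.0, w) for w in range(len(speeds))]  # (free time, worker)
    heapq.heapify(free)
    counts = [0] * len(speeds)                     # units done per worker
    for _ in range(K):
        t, w = heapq.heappop(free)                 # earliest idle worker pulls
        counts[w] += 1
        heapq.heappush(free, (t + unit / speeds[w], w))
    finish = max(t for t, _ in free)               # row completion time
    return counts, finish

# hypothetical usage: a 1200-point row, K = 12 units, one worker twice as fast
counts, finish = constant_split_row(1200, 12, [2.0, 1.0])
```

The unit counts end up proportional to the worker speeds, which is the desired load balance.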
6. CONCLUSIONS
We presented a highly scalable distributed algorithm to integrate multidimensional functions using QMC techniques on homogeneous systems. Its superior performance is justified theoretically and verified experimentally.
For future work we consider an adaptation of the algorithm to heterogeneous environments, by modifying its row splitting strategy according to the machine power.