
Distributed QMC Algorithms: New Strategies
and Performance Evaluation

Laurentiu Cucos and Elise de Doncker
Department of Computer Science
Western Michigan University
Kalamazoo, MI 49008 USA
{lcucos,elise}@cs.wmich.edu

Keywords: Quasi-Monte Carlo, synchronous/asynchronous distributed integration, lattice rules, computational finance, multivariate t-distribution.

  

Abstract

We compare alternative strategies of a distributed algorithm to integrate a multidimensional function using Quasi-Monte Carlo techniques. We give a parallel performance analysis and experimental results, and introduce a new algorithm strategy that leads to near ideal speedup results, using up to 30 processors of a Beowulf type cluster. The results are obtained for a computational finance problem in 360 dimensions and for a 5-dimensional t-distribution arising in multiple comparison problems.

  
  
1. INTRODUCTION

We apply a Quasi-Monte Carlo (QMC) method which computes a sequence of randomized lattice rules to approximate integrals over a $ d$-dimensional hypercube region [1,4,3,2]. Let

$\displaystyle K_N(\boldsymbol{\beta}) = \frac{1}{N} \sum_{i=1}^{N} f\left(\left\{\frac{i}{N}{\bf v}+ \boldsymbol{\beta}\right\}\right),$ (1)

where $ {\bf v}$ is the (Korobov) lattice generator vector, randomized by a uniformly distributed random vector $ \boldsymbol{\beta}$ [1]. Then with a sample set (of size $ q$) of random shifts $ \boldsymbol{\beta}_j$, the mean

$\displaystyle \bar{K}_N = \frac{1}{q} \sum_{j=1}^{q} K_N(\boldsymbol{\beta}_j)$ (2)

allows for a standard error estimation by

$\displaystyle E_N^2 = \frac{1}{q(q-1)} \sum_{j=1}^{q} (K_N(\boldsymbol{\beta}_j)-\bar{K}_N)^2.$ (3)

The algorithm calculates the $ \bar{K}_N$ values of (2) for successively larger numbers $ N$ of integration points until either an answer is found to the user-specified accuracy or the function count limit is reached.
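As an illustration, the randomized rule (1) and the mean and error estimate (2)-(3) can be sketched as follows. This is a minimal sequential sketch, not the PARINT implementation; the generator vectors from [4] are not listed in the paper, so the usage example uses a small hypothetical 2-dimensional vector.

```python
import random

def K_N(f, N, v, beta):
    """Randomized Korobov lattice rule (1): average f over the N shifted
    lattice points {(i/N) v + beta}; the braces denote the fractional part."""
    return sum(f([((i / N) * vk + bk) % 1.0 for vk, bk in zip(v, beta)])
               for i in range(1, N + 1)) / N

def randomized_estimate(f, N, v, q):
    """Mean (2) and standard error via (3) over q uniform random shifts beta_j."""
    d = len(v)
    Ks = [K_N(f, N, v, [random.random() for _ in range(d)]) for _ in range(q)]
    Kbar = sum(Ks) / q
    E2 = sum((K - Kbar) ** 2 for K in Ks) / (q * (q - 1))
    return Kbar, E2 ** 0.5

# Hypothetical example: integrate f(x) = x_0 + x_1 over the unit square
# (exact value 1) with N = 31 points and generator v = (1, 14).
Kbar, E = randomized_estimate(lambda x: x[0] + x[1], 31, [1, 14], q=6)
```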

We denote by $ k_{ij}$ a cell of work $ K_{N_i}(\boldsymbol{\beta}_j)$, where $ i=1,\ldots,r_{\mbox{max}}; j=1,\ldots,q$, and $ q$ is the length of a row. The $ i$th row of cells is $ R_i=\{k_{ij}\mid j=1,\ldots,q \}$. Currently we use $ r_{\mbox{max}}=25$ rules from [4], with $ N_1=31,\ldots,N_{25}=601942$, and generally $ N_i \gg N_j$ for $ i>j$.

The error generally decreases in a stepwise manner from one row to the next. The error estimate is expected to improve along a row.

In Section 2 we analyze the algorithm of  [2] and show its limitations theoretically and experimentally. Based on these results, Section 3 introduces a new strategy which delivers ideal speedup on a homogeneous system. Section 4 gives experimental results, and Section 5 examines possible adaptations for heterogeneous environments.

  
  
2. SINGLE CELL ALGORITHM

Consider using $ p$ processors for the computation, one of which is assigned a controlling role while the remaining ones are workers. In the single cell version of the algorithm, currently used in PARINT1.1, each worker is assigned one cell of work $ k_{ij}$ at a time. The controller updates the result and sends a new cell if necessary [2]. Figure 1 illustrates a case with fewer workers than cells in a row.

Figure: Single Cell work assignment for $ p=4, q=6$

Since the computations are asynchronous, different processors may be working on different rows. At termination, a row's result can be used even if the row is incomplete, provided it has reached a set minimum length.

Results from this algorithm tend to show degraded performance when the number of workers is large relative to $ q$. We found that this behavior can usually be attributed to breaking loss, i.e., loss in parallelism caused by processors doing useless work towards the end of the computation. If the computation proceeds to the last possible row (of index $ r_{\mbox{max}}$), breaking loss is replaced by starvation loss, where processors sit idle waiting for work.

Let us consider the special case where the execution in fact behaves in a synchronous manner and the number of processors $ p$ is a multiple of the number of cells in a row, that is, $ p=tq$ with integer $ t>0$.

Assume the problem requires $ n$ rows to terminate. Initially, the first $ q$ processors work on row $ R_1$, the next $ q$ processors on $ R_2$, and so on. Assuming the processors finish their cells and receive new tasks at the same time, then once $ R_1$ is done, the first set of $ q$ processors moves on to $ R_{t+1}$; once $ R_2$ is done, the second set starts row $ R_{t+2}$, and so on. When the last row required for this problem is assigned, all rows of lower index are either completed or in the process of being completed. Since the work required for lower index rows is much smaller than that required for $ R_n$, and only $ q$ processors are assigned to $ R_n$, all other workers will be reassigned to rows $ R_\ell$, $ \ell>n$, before $ R_n$ is completed. However, this work is useless, since the controller stops the computation when $ R_n$ is completed; thus breaking loss is incurred.

In order to evaluate the parallel performance, let $ \tau$ be the time required to complete a function evaluation; then the sequential computing time can be estimated by $ T_1=q\sum_{i=1}^{n} N_i \tau$. In the case under consideration, the total parallel execution time equals the time of the processors that eventually work on the last row $ R_n$. That is,

$\displaystyle T_p=\sum_{s=1}^{r} N_{k_s} \tau,$ (4)

where $ r=\lceil{\frac{n q}{p}}\rceil=\lceil{\frac{n}{t}}\rceil$ is the number of rows computed by these processors, and $ k_s = k_1+t(s-1)$ is the index of the $ s$th such row, with $ k_1=n-(r-1)t$ and $ k_r = n.$ In this case the theoretical speedup is

$\displaystyle S=\frac{T_1}{T_p} = q \frac{\sum_{i=1}^{n} N_i}{\sum_{s=1}^{\lceil{n/t}\rceil} N_{k_s}}.$ (5)

For example, with $ q=6, p=30, n=25,$ (5) gives $ S=\frac{6\sum_{i=1}^{25} N_i}{N_5+N_{10}+N_{15}+N_{20}+N_{25}}$ = 15.628. Measuring the runtime experimentally (on a homogeneous cluster), we obtain a speedup of 15.495. Table 1 lists the experimental vs. theoretical speedup, obtained for numbers of workers between 1 and 30.
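For reference, (5) is easy to evaluate numerically; the sketch below does so under the synchronous assumption $ p=tq$. Since the 25 rule sizes from [4] are not listed in full here, the usage example substitutes a hypothetical geometric sequence for the $ N_i$.

```python
from math import ceil

def theoretical_speedup(q, p, N):
    """Theoretical speedup (5) of the Single Cell strategy, assuming
    synchronous behavior and p = t*q workers for integer t > 0.
    N is the list of rule sizes N_1..N_n for the n rows required."""
    n = len(N)
    t = p // q
    assert p == t * q and t > 0, "p must be a positive multiple of q"
    r = ceil(n / t)                                  # rows done by the last group
    rows = [n - (r - 1 - s) * t for s in range(r)]   # k_1, ..., k_r = n
    return q * sum(N) / sum(N[k - 1] for k in rows)

# Hypothetical rule sizes standing in for N_1 = 31, ..., N_25 = 601942:
N = [int(31 * 1.5 ** i) for i in range(25)]
print(theoretical_speedup(6, 30, N))   # rows 5, 10, 15, 20, 25 in the denominator
```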


Table: Theoretical vs. Experimental Speedup for MVT problem with $ q$=6
Nb. Workers   Theoretical   Experimental
         1        1.000         1.000
         2        2.000         1.998
         4        3.749         3.746
         6        6.000         5.990
         8        6.841         6.828
        10        7.961         7.950
        12        9.999         9.974
        14       10.320        10.296
        16       10.974        10.933
        20       12.778        12.722
        24       14.443        14.389
        28       14.659        14.594
        30       15.628        15.495


Note that the speedup in (5) is a function of $ q$, and does not depend on the integrand evaluation time or granularity. The total communication time is small as only one pair of messages per cell is required, amounting to $ {\cal O}(nq).$

Now let us consider $ p=q$ ($ t = 1$). All processors work on one row at a time. Once a row is finished, all workers move to the next one. When the last row is completed, no processors are working on cells beyond it, resulting in near ideal speedup. The same can be shown if $ p<q$ and $ q$ is a multiple of $ p$ (however, breaking or starvation loss may be incurred when $ q$ is not a multiple of $ p$).

Note furthermore that this method imposes a limitation on the number of processors, since at most $ n q$ workers can participate in the computation (corresponding to the case where each worker is assigned only one cell). Since $ q$ is usually in the range $ 4,\ldots,12$, this limit may be reached already for modest numbers of processors.

The above observations lead to the idea of distributing every row among all the workers, so that (in a synchronous setting) all workers process the same row at any given time. This leads to the algorithm of Section 3 below.

  
  
3. ROW DISTRIBUTED METHOD

This strategy splits every cell $ k_{ij}$ into $ p$ pieces of equal size $ k^m_{ij}, m=1,\ldots,p$, where every individual piece has $ N_i/p$ points to compute (see Figure  2).

Figure: Row Distributed work split for $ p=4, q=6$

Every worker $ w$ is assigned a work unit $ u^w_i = \{ k^w_{ij}\mid j=1,\ldots,q \}.$ After the controller has received the row results from all workers, it assembles the cell results $ k_{ij} = \sum_w k^w_{ij}$, and generates the row mean and standard error estimate according to (2) and (3).
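A minimal sketch of the worker and controller sides of this assembly step follows. One simplifying assumption: instead of $ p$ contiguous pieces of exactly $ N_i/p$ points, each worker takes a stride-$ p$ slice of the point indices, which covers all $ N_i$ points even when $ p$ does not divide $ N_i$; the function names are hypothetical.

```python
import random

def worker_piece(f, N, v, beta, w, p):
    """Worker w's piece k^w_ij of cell K_N(beta): the lattice points with
    indices i = w+1, w+1+p, ..., returned as a partial sum (assembled later)."""
    s = 0.0
    for i in range(w + 1, N + 1, p):     # stride-p slice of i = 1..N
        s += f([((i / N) * vk + bk) % 1.0 for vk, bk in zip(v, beta)])
    return s

def assemble_row(partials, N, q):
    """Controller side: partials[w][j] is worker w's partial sum for cell
    k_ij.  Assemble k_ij = sum_w k^w_ij, then apply (2) and (3)."""
    K = [sum(row[j] for row in partials) / N for j in range(q)]
    Kbar = sum(K) / q
    E2 = sum((Kj - Kbar) ** 2 for Kj in K) / (q * (q - 1))
    return Kbar, E2

# Usage with the constant integrand f = 1 (so every cell result is exactly 1):
v, N, q, p = [1, 14], 31, 6, 4
betas = [[random.random(), random.random()] for _ in range(q)]
partials = [[worker_piece(lambda x: 1.0, N, v, betas[j], w, p)
             for j in range(q)] for w in range(p)]
Kbar, E2 = assemble_row(partials, N, q)
```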

This new strategy largely eliminates breaking loss. Even though the number of messages is now bounded by $ {\cal O}(np)$ and the size of a result message has increased by a factor $ q$, the net time increase is not pronounced, since communication at one worker is generally overlapped by computation at other workers. The apparent limitation of the number of workers to the size of the first rule (31), where each worker would get only $ q$ function evaluations, can be alleviated by skipping the first row (or rows, as needed). A remaining issue is that performance degrades easily if some processors are slower than others, making this strategy unsuitable for heterogeneous systems.

  
  
4. EXPERIMENTAL RESULTS

In this section we present results from parallel runs on our Beowulf cluster, composed of 32 Athlon (1.2GHz) processors connected through a Fast Ethernet switch. This is our largest homogeneous subcluster.

Using the mortgage-backed security problem from [6,7], we tested the new algorithm of Section 3 for up to 30 processors. Figure 3 shows the speedup obtained when calculating the (360-dimensional) integral. The function count limit was set to 17,552; that corresponds to $ n = 7$, with 8 cells per row. A function evaluation requires 0.3 sec. The result obtained is 286045.51, with an estimated error of 7.391.

We also include results for a multivariate Student-t distribution problem arising in multiple comparisons [5,2]. Figure 4 shows the speedup obtained for a 5-dimensional integral. The function limit was set to 21,666,096, corresponding to 25 rows, with 6 cells per row. The result obtained is 0.899999 with an estimated error of 9.46e-6. A function evaluation requires $ 13\mu$sec.

The new algorithm exhibits near ideal speedup for these two problems, which span a wide range of computational granularity, up to the maximum number of processors used.

Figure 3: Speedup for Finance Problem

Figure 4: Speedup for MVT Problem

  
  
5. ADAPTATIONS FOR HETEROGENEOUS SYSTEMS

Since, in the strategy of Section 3, every processor performs the same amount of work for each row, the total run time for the completion of $ n$ rows is determined by the slowest machines. To overcome this limitation we consider three alternative strategies.

  
5.1. Proportional row splitting

For each row, a worker gets a slice of work proportional to its processor power. This information can be acquired statically, dynamically, or using a combination of both. Statically, the controller can retrieve the physical properties of each machine (CPU, memory, bus speed, etc.); for a cluster, these numbers almost never change, and can be stored in and retrieved from a file. Dynamically, each worker reports its average time per function evaluation. As a combination, the controller first splits the work according to the static information, and later adjusts the distribution based on reported function evaluation times and worker response times as the computation proceeds. This strategy has a number of disadvantages. The static approach gives a very rough estimate: the processing time is affected by factors beyond those mentioned above, including memory latency and caching, which can change the computation time significantly from problem to problem. The dynamic approach works well if the workers are dedicated to the computation; otherwise, processes sharing the same machine distort the estimate of a worker's power.
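The dynamic variant of this splitting can be sketched as follows: given measured evaluation rates, divide a row's $ N_i$ points proportionally and round so the shares sum exactly to $ N_i$. The function name and the largest-remainder rounding are illustrative choices, not taken from the paper.

```python
def proportional_split(N_i, rates):
    """Split N_i points among workers in proportion to their measured
    evaluation rates (points per second), rounding with the
    largest-remainder method so the integer shares sum exactly to N_i."""
    total = sum(rates)
    raw = [N_i * r / total for r in rates]
    shares = [int(x) for x in raw]
    leftover = N_i - sum(shares)
    # hand the leftover points to the largest fractional remainders
    for w in sorted(range(len(rates)), key=lambda w: raw[w] - shares[w],
                    reverse=True)[:leftover]:
        shares[w] += 1
    return shares
```

For example, a worker measured at twice the rate of its peers receives roughly twice their share of the row.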

5.2. Fixed work splitting

Each work unit can have the same size, so that earlier rows are split into fewer units than later ones. This gives faster processors a chance to request work more often than slower ones. This strategy has a few drawbacks. First, the total number of messages increases from row to row. Second, choosing the optimum work unit size is a delicate task, mainly because it depends on the integrand function: a large work unit can introduce significant starvation or breaking loss (in case the last units from the last row are assigned to the slower workers), while a small work unit can introduce significant communication time.

5.3. Constant row splitting

This strategy splits each row into a constant number of work units, $ C=Kp$, where $ K$ is a constant. The size of each work unit is proportional to the row size. If some processors are slower than others, they simply process fewer units from each row. This strategy has the advantage that the total number of messages is constant across rows (independent of the total amount of work). As in the previous case, starvation or breaking loss can occur when the last work units are assigned to the slower processors.
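A sketch of this unit sizing, with the function name and the default $ K$ chosen for illustration only: row $ R_i$ (its $ q$ cells totaling $ N_i q$ points) is cut into $ C=Kp$ nearly equal units, so the per-row message count stays fixed while unit sizes grow with $ N_i$.

```python
def constant_row_units(N_i, q, p, K=2):
    """Split row R_i (q cells of N_i points each) into C = K*p work units
    of nearly equal size; unit sizes scale with the row size N_i, so the
    number of messages per row stays constant at C."""
    C = K * p
    total = N_i * q
    base, extra = divmod(total, C)
    return [base + (1 if u < extra else 0) for u in range(C)]
```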

  
  
6. CONCLUSIONS

We presented a highly scalable distributed algorithm to integrate multidimensional functions using QMC techniques on homogeneous systems. Its superior performance is justified theoretically and verified experimentally.

For future work we consider an adaptation of the algorithm for heterogeneous environments, by modifying the row splitting strategy of the new algorithm according to the machine power.




Rodger Zanny 2002-02-20