# Memory Requirements for Balanced Computer Architectures\*.†

H. T. KUNG

Department of Computer Science, Carnegie-Mellon University, Pittsburgh, Pennsylvania 15213

In this paper, a processing element (PE) is characterized by its computation bandwidth, I/O bandwidth, and the size of its local memory. In carrying out a computation, a PE is said to be balanced if the computing time equals the I/O time. Consider a balanced PE for some computation. Suppose that the computation bandwidth of the PE is increased by a factor of  $\alpha$  relative to its I/O bandwidth. Then when carrying out the same computation the PE will be imbalanced; i.e., it will have to wait for I/O. A standard method of avoiding this I/O bottleneck is to reduce the overall I/O requirement of the PE by increasing the size of its local memory. This paper addresses the question of by how much the PE's local memory must be enlarged in order to restore balance.

The following results are shown: For matrix computations such as matrix multiplication and Gaussian elimination, the size of the local memory must be increased by a factor of  $\alpha^2$ . For computations such as relaxation on a k-dimensional grid, the local memory must be enlarged by a factor of  $\alpha^k$ . For some other computations such as the FFT and sorting, the increase is exponential; i.e., the size of the new memory must be the size of the original memory to the  $\alpha$ th power. All these results indicate that to design a balanced PE, the size of its local memory must be increased much more rapidly than its computation bandwidth. This phenomenon seems to be common for many computations where an output may depend on a large subset of the inputs.

Implications of these results for some parallel computer architectures are also discussed. One particular result is that to balance an array of p linearly connected PEs for performing matrix computations such as matrix multiplication and matrix triangularization, the size of each PE's local memory must grow linearly with p. Thus, the larger the array is, the larger each PE's local memory must be. © 1985 Academic Press, Inc.

<sup>\*</sup> Presented at the Symposium on Complexity of Approximately Solved Problems, April 17, 1985.

<sup>†</sup>The research was supported in part by Defense Advanced Research Projects Agency (DOD), monitored by the Air Force Avionics Laboratory under Contract F33615-81-K-1539, and Naval Electronic Systems Command under Contract N00039-85-C-0134, and in part by the Office of Naval Research under Contract N00014-80-C-0236, NR 048-659.

#### 1. Introduction

With today's technology, the challenge in designing a high-performance computer is usually not in providing processing elements with the required high-computation bandwidths, but in making sure that information can flow to and from these elements with sufficient speed. For example, very fast processing elements can be built using off-the-shelf 16-MHz, 32-bit microprocessors (Gupta and Toong, 1983) and/or floating-point chips capable of delivering 10 million operations per second (Frandrianto and Woo, 1985). The computation bandwidth of such a processing element can be further increased by incorporating multiple copies of these chips and operating them in parallel. However, the I/O bandwidth with the rest of the system (e.g., system memory and interconnections) cannot be increased as easily, and as a result it often becomes a bottleneck for the performance of the entire system.

A standard approach to alleviating this I/O problem is to provide a local memory at a processing element. This local memory can "cache" frequently used data and instructions, so that the required I/O bandwidth with the outside world is reduced. It is well known that the size of the local memory must be large if the computation bandwidth of the processing element is large, as represented by the "Amdahl's rule" (Siewiorek et al., 1982). But exactly how large should this local memory be? This paper answers the question for several important computational tasks.

To help study the problem formally, an information model is introduced in Section 2 to characterize a processing element. Section 3 derives results on how the local memory of a processing element must be increased as the computation bandwidth increases. Section 4 discusses implications of these results for some parallel computer architectures. Concluding remarks are provided in Section 5.

#### 2. THE INFORMATION MODEL

As illustrated in Fig. 1, we characterize a processing element (PE) by:

- 1. C: the computation bandwidth, which is the number of operations that the PE can deliver per second;
- 2. IO: the I/O bandwidth, which is the number of words that the PE can communicate with the outside world per second; and
  - 3. M: the size of the PE's local memory, in terms of number of words.

In carrying out a computation such as the fast Fourier transform (FFT) or matrix multiplication, a PE is said to be *balanced* if the I/O time equals the computing time. When a PE is balanced for a given computation, we know that its computation, I/O, and memory subsystems are not over- or under-



Fig. 1. Processing element characterized by its computation bandwidth (C), I/O bandwidth (IO), and size of local memory (M).

designed for that computation. A challenge for computer architects is to keep a PE balanced, while taking advantage of technological opportunities such as large increases in computation bandwidth. Since it is usually difficult or expensive to increase the I/O bandwidth, we ask the following question:

Assume that a PE is balanced for a given computation. Now C/IO is increased by a factor of  $\alpha$ . To rebalance the PE for the same computation (without increasing IO), by how much must M be increased?

The following symbols and equalities are useful in deriving answers to the question. For carrying out a given computation on a PE, let  $C_{\rm comp}$  ("cost for computation") and  $C_{\rm io}$  ("cost for I/O") denote the total number of operations needed for the computation and for the I/O, respectively. We assume that one I/O operation can transfer a word to or from the PE. Then the computing and I/O times are  $C_{\rm comp}/C$  and  $C_{\rm io}/IO$ , respectively. Therefore, the PE is balanced if and only if

$$\frac{C_{\rm comp}}{C} = \frac{C_{\rm io}}{IO},$$

or

$$\frac{C}{IO} = \frac{C_{\text{comp}}}{C_{\text{io}}}.$$
 (1)

Now suppose that C/IO is increased by a factor of  $\alpha$ . Then by (1) the PE is rebalanced if and only if the ratio  $C_{\text{comp}}/C_{\text{io}}$  is increased by a factor of  $\alpha$ . This provides a method that we can use in rebalancing a PE. For many computations, this can be accomplished by increasing the size of the PE's local memory.

To be precise, let  $M_{\rm old}$  be the size of the original local memory, and  $M_{\rm new}$  the *minimum* size of the new memory necessary to rebalance the PE. In the rest of the paper, we study by how much (expressed in terms of  $\alpha$ )  $M_{\rm new}$  must be larger than  $M_{\rm old}$ .

## 3. RESULTS FOR SOME COMPUTATIONS

Consider a PE that is balanced for a given computation. Now suppose that C/IO is increased by a factor of  $\alpha$ . This section derives answers to the question proposed in the preceding section for several computations. The following is a summary of the results:

- Matrix computation such as matrix multiplication and triangularization:  $M_{\text{new}} = \alpha^2 M_{\text{old}}$ ;
  - Grid computation:
    - two-dimensional:  $M_{\text{new}} = \alpha^2 M_{\text{old}}$ ;
    - $\circ d$ -dimensional:  $M_{\text{new}} = \alpha^d M_{\text{old}}$ ;
  - FFT:  $M_{\text{new}} = (M_{\text{old}})^{\alpha}$ ;
  - Sorting:  $M_{\text{new}} = (M_{\text{old}})^{\alpha}$ ;
- I/O bounded computations such as matrix-vector multiplication and solution of triangular linear systems: Impossible; i.e., PE cannot be rebalanced by merely enlarging its local memory, without increasing its I/O bandwidth.

Throughout this section, we will assume that for all the computations the problem size N is arbitrarily large, and that N is much larger than the size of the PE's local memory M.

# 3.1. Matrix Multiplication

Consider the problem of multiplying two  $N \times N$  matrices, assuming a local memory of size M. In the following, we use a decomposition scheme that minimizes the I/O requirement of the PE.

The product matrix is computed in  $(N/\sqrt{M})^2$  steps, each being the computation of a  $\sqrt{M} \times \sqrt{M}$  submatrix of the product matrix. Every step is a multiplication of a  $\sqrt{M} \times N$  submatrix of the first input matrix with an  $N \times \sqrt{M}$  submatrix of the second. This can be carried out in  $C_{\text{comp}} = \Theta(N \cdot M)$  arithmetic operations, and  $C_{\text{io}} = \Theta(N \cdot \sqrt{M})$  I/O operations. Thus.

$$\frac{C_{\text{comp}}}{C_{\text{io}}} = \Theta(\sqrt{M}). \tag{2}$$

Assume that for this computation, the PE is balanced. Now suppose that the computation bandwidth is increased by a factor of  $\alpha$  relative to the I/O bandwidth. Then by (1), for rebalancing the PE, we must increase  $C_{\text{comp}}/C_{\text{io}}$  by a factor of  $\alpha$ . From (2), we see that this can be done only if M is increased by a factor of  $\alpha^2$ . That is, for this matrix multiplication computation, we have

$$M_{\text{new}} = \alpha^2 M_{\text{old}}.$$
 (3)

 $<sup>{}^{1}</sup>f(N) = \Theta(g(N))$  means  $f(N) = c \cdot g(N) + \text{lower-order terms in } N$ , where c is some positive constant.

The decomposition scheme we use here for matrix multiplication is just one of many possible ones. It has been shown (Hong and Kung, 1981) that for matrix multiplication, any decomposition scheme yields

$$\frac{C_{\text{comp}}}{C_{\text{io}}} = h(M),$$

where the function h(M) cannot exceed  $\sqrt{M}$  in order of magnitude. This implies that the result of (3) is the best possible among all decomposition schemes, as far as minimizing  $M_{\text{new}}$  is concerned.

# 3.2. Matrix Triangularization

Given an  $N \times N$  matrix A, the triangularization problem is to determine an  $N \times N$  "multiplier matrix" Q and an upper triangular matrix U such that

$$QA = U$$
.

By triangularization, many problems in matrix computation can be reduced to that of solving triangular linear systems. For example, this is the major step in all direct methods for solving linear systems. When Q is restricted to be an orthogonal matrix, it is also the key step in computing least squares solutions and in the QR algorithm for computing eigenvalues. Gaussian elimination and Givens rotation are standard algorithms for triangularization.

The triangularization problem can be solved in  $N/\sqrt{M}$  steps, where each step annihilates portions of  $\sqrt{M}$  consecutive columns which are in the lower triangular part, and updates the rest of the matrix to prepare it for the next step. It is easy to check that the first step can be carried out in  $C_{\text{comp}} = \Theta(N^2 \cdot \sqrt{M})$  arithmetic operations, and  $C_{\text{io}} = \Theta(N^2)$  I/O operations, assuming a local memory of size M. Thus,

$$\frac{C_{\rm comp}}{C_{\rm in}} = \Theta(\sqrt{M}).$$

The same ratio is maintained for all the steps. Therefore, as in the case of matrix multiplication, we have

$$M_{\text{new}} = \alpha^2 M_{\text{old}}.$$

# 3.3. Grid Computation

Consider a two-dimensional grid computation. Given an  $N \times N$  grid, the task is to perform a large number of iterations on the grid, where each iteration involves updating every grid point by some weighted average of points in a surrounding window of fixed size. For some applications, on the

order of N iterations may be performed. In scientific computation and image processing, this computation is usually called relaxation.

Assume that the computation is performed by an array of PEs. Each PE is responsible for the storing and updating of all the grid points in a  $\sqrt{M} \times \sqrt{M}$  subgrid. For every iteration, each PE performs  $C_{\text{comp}} = \Theta(\sqrt{M} \times \sqrt{M})$  arithmetic operations, and  $C_{\text{io}} = \Theta(\sqrt{M})$  I/O operations. Thus, for the two-dimensional grid computation, we have

$$M_{\text{new}} = \alpha^2 M_{\text{old}}$$

It is straightforward to show that for a d-dimensional grid computation, we have

$$M_{\text{new}} = \alpha^d M_{\text{old}}$$
.

# 3.4. Fast Fourier Transform

Consider the problem of computing an N-point discrete Fourier transform by the fast Fourier transform (FFT) algorithm, assuming a local memory of size M.

Decomposition for the FFT is not as straightforward as that for matrix multiplication and other computations considered above. Figure 2 depicts an N-point FFT computation and a decomposition scheme for N=16 and M=4. Results of subcomputation blocks are shuffled before they are used as inputs of other subcomputation blocks. Note that each subcomputation block is sufficiently small that it can be entirely carried out inside a PE with M words of local memory. Each subcomputation performs  $C_{\text{comp}} = \Theta(M \cdot \log_2 M)$  arithmetic operations, and  $C_{\text{io}} = \Theta(M)$  I/O operations. Thus

$$\frac{C_{\text{comp}}}{C_{\text{in}}} = \Theta(\log_2 M).$$



Fig. 2. (a) sixteen-point FFT; (b) decomposing the FFT.

This implies that to increase the ratio  $C_{\text{comp}}/C_{\text{io}}$  by a factor of  $\alpha$ , we must increase M to  $M^{\alpha}$ . Therefore for the FFT, we have

$$M_{\text{new}} = (M_{\text{old}})^{\alpha}$$
.

It has been shown (Hong and Kung, 1981) that for the FFT, any decomposition scheme yields

$$\frac{C_{\text{comp}}}{C_{io}} = k(M) \tag{4}$$

where the function k(M) cannot exceed  $\log_2 M$  in order of magnitude. This implies that the result of (4) is the best possible among all decomposition schemes, as far as minimizing  $M_{\text{new}}$  is concerned.

## 3.5. Sorting

Consider the problem of sorting N keys by comparisons only. We will perform the sorting in two phases. Phase 1 sorts the N/M subsets of M keys each to produce N/M sorted lists. Phase 2 merges the sorted lists using an M-way merge algorithm. In phase 1, for each subset we perform  $C_{\text{comp}} = \Theta(M \cdot \log_2 M)$  comparisons, and  $C_{\text{io}} = \Theta(M)$  I/O operations, and this can be carried out in a local memory of size M. In phase 2, for each M-way merge we maintain a heap of M elements which are the first elements of the current M sorted lists. The heap can be implemented in a memory of size M, and for each I/O operation to the heap there are  $\Theta(\log_2 M)$  comparisons to be performed. Therefore for both phases, we have

$$\frac{C_{\text{comp}}}{C_{\text{io}}} = \Theta(\log_2 M).$$

Like the FFT case, this implies that for sorting,

$$M_{\text{new}} = (M_{\text{old}})^{\alpha}. \tag{5}$$

Using an information-theoretic argument, it is quite easy to show (Song, 1981) that the result of (5) is the best possible among all sorting methods, as far as minimizing  $M_{\text{new}}$  is concerned.

## 3.6. I/O Bounded Computations

All the computations considered so far have been computation bounded, in the sense that computation takes more operations than I/O in order of magnitude. Computations that are not computation bounded are called I/O bounded. Matrix—vector multiplication and solution of triangular linear systems are examples of I/O bounded computations. For I/O bounded com-

putations, after an increase of C/IO for a PE, there is no way to rebalance the PE by merely enlarging its local memory without increasing IO. The reason is that for these computations, inputs and intermediate results are not used more than a constant number of times on the average, so having a local memory to buffer data will not reduce the overall I/O requirement of the PE after the size of the memory exceeds certain constant.

## 4. IMPLICATIONS FOR SOME PARALLEL COMPUTER ARCHITECTURES

The summary of results in the beginning of Section 3 suggests a classification of computations in terms of their memory requirements in achieving balanced architectures. Consider, for instance, scientific computations. They involve matrix triangularization, matrix multiplication, grid computations of various dimensionalities, and also sparse matrix operations that have relatively high I/O requirements. Therefore in view of the results of Section 3, for scientific computations it is reasonable to assume the following:

$$M_{\text{new}} \ge \alpha^2 M_{\text{old}}.$$
 (6)

That is, if the computation bandwidth of a PE is increased by a factor of  $\alpha$  relative to its I/O bandwidth, then the size of the PE's local memory must be increased by a factor of at least  $\alpha^2$ .

For the rest of this section, we consider designing mesh-connected parallel computers for computations for which (6) holds.

On a parallel computer, a computation that is usually performed by one PE in a conventional serial machine is carried out by a collection of, say, p PEs. We can view this collection of p PEs as a new processing element that has p times as much computation bandwidth as the old PE. With this viewpoint, parallel processing is just a particular method of increasing the computation bandwidth of a PE. Therefore our methodology for rebalancing a PE by increasing the size of its local memory applies directly to parallel architectures. This is shown in the following subsections.

# 4.1. One-Dimensional Processor Array

We want to use p linearly connected PEs to perform computations that were formerly done by a single PE, as illustrated in Fig. 3. The collection of p PEs



Fig. 3. Using p PEs to perform computation formerly done by one PE.

can be viewed as a "new processing element" that has p times as much bandwidth as the original PE. The I/O bandwidth of this "new processing element" is the same as that of the original PE, as only the two boundary PEs in the PE collection can communicate with the outside world. Therefore with respect to the "new processing element," the C/IO is increased by a factor of  $\alpha = p$ . This implies from (6) that the "new processing element" should have a total of at least  $p^2$  times as much local memory as the original PE. That is, in the parallel arrangement, each PE should have at least p times as much local memory as the original PE. This translates to the following result:

When using an array of linearly connected PEs for computations for which (6) holds, the size of each PE's local memory should grow at least linearly with the number of PEs in the array, to keep the array balanced.

# 4.2. Two-Dimensional Processor Array

We want to use  $p \times p$  two-dimensional connected PEs to perform computations that were formerly done by a single PE, as illustrated in Fig. 4. By arguments similar to those used for the case of the one-dimensional processor array above, the computation and I/O bandwidths of this two-dimensional array of PEs are  $p^2$  and p times larger than those of the original PE, respectively. Therefore, C/IO is increased by a factor of  $\alpha = p$ . For computations such as matrix multiplication where (6) holds with equality, the parallel arrangement should have a total of  $p^2$  times as much local memory as the original PE. This is automatically satisfied, since there are  $p^2$  PEs in the parallel setup. Therefore, we have the following result:

When using a square array of mesh-connected PEs for computations for which (6) holds, it is possible to make the size of each PE's local memory to



Before: 1 PE Now: p x p PEs

Fig. 4. Using  $p \times p$  PEs to perform computation formerly done by one PE.

be independent of the number of PEs in the array, while keeping the array balanced. That is, the processor array is automatically balanced as more PEs with local memories of the same size are added to the array.

The possibility referred to above depends on whether or not the computation can actually be decomposed for the parallel execution on the processor array. This is possible, for example, for matrix multiplication and triangularization, as demonstrated by various two-dimensional systolic arrays for these computations (Gentleman and Kung, 1981; Kung and Leiserson, 1978).

However, for computations (such as the d-dimensional grid computation with d > 2) where (6) holds with a strict inequality, an automatically rebalanced, square processor array is never possible. For these computations, the size of each PE's local memory must be increased as the size of the array increases.

# 5. CONCLUDING REMARKS

For most of the computations considered in this paper, to rebalance a PE, the size of its local memory must be increased much more rapidly than its computation bandwidth, if the I/O bandwidth is kept constant. For some computations such as the FFT and sorting, the local memory size must be increased exponentially as computation bandwidth increases. In this case, the size of the local memory may become unrealistically large, and the size of the application may also have to become unrealistically large in order to utilize all the memory. Therefore, for these computations one should not expect any substantial speedup without a significant increase in the PE's I/O bandwidth. Since increasing I/O bandwidth is difficult in practice, this partially explains why the performance of computer systems in general has not kept up with the rapid improvement in the computation bandwidth of processing elements.

For parallel architectures; we have shown configurations where each PE's memory should grow at least linearly with the number of PEs in the parallel system.

The CMU Warp machine (Arnould et al., 1985; Gross et al., 1985) consists of a one-dimensional systolic array, which is an array of linearly connected, programmable PEs. With a local memory of up to 64K 32-bit words, each PE can perform 10 million 32-bit floating-point operations per second, and transfer 20 million words per second to and from its neighboring PEs. Having a rather large I/O bandwidth and a relatively large local memory for each PE of the Warp machine reflects the results of this paper.

The methodology and analysis techniques of this paper can be used for many other computations and architectures in addition to those considered here. Further work in characterizing other computations, in terms of their memory requirements for achieving balanced architectures, and in analyzing the impact of these results to various architectures, will certainly provide additional insights to the design of high-performance computers.

## **ACKNOWLEDGMENTS**

Comments from Duane Adams, Allan Fisher, Monica Lam, Onat Menzilcioglu, and Alan Sussman of CMU are appreciated.

### REFERENCES

- ARNOULD, F., KUNG, H. T., MENZILCIOGLU, O., AND SAROCKY, K. (1985), A systolic array computer, in "Proceedings, IEEE International Conference on Acoustics, Speech and Signal Processing, March 1985," pp. 232–235.
- Frandrianto, J., and Woo, B. Y. (1985), VLSI floating-point processors, in "Proceedings, 7th Symposium on Computer Arithmetic, June 1985," pp. 93-100, IEEE Computer Society, New York.
- GENTLEMAN, W. M., AND KUNG, H. T. (1981), Matrix triangularization by systolic arrays, in "Proceedings of SPIE Symposium, Vol. 298, Real-Time Signal Processing IV," pp. 19–26, Society of Photo-Optical Instrumentation Engineers.
- GROSS, T., KUNG, H. T., LAM, M., AND WEBB, J. (1985), Warp as a machine for low-level vision, in "Proceedings, IEEE International Conference on Robotics and Automation, March 1985," pp. 790-800.
- GUPTA, A., AND TOONG, H. D. (1983), An architectural comparison of 32-bit microprocessors, *IEEE Micro* 3, No. 1, 9-22.
- HONG, J.-W., AND KUNG, H. T. (1981), I/O complexity: The red-blue pebble game, in "Proceedings, Thirteenth Annual ACM Symposium on Theory of Computing, May 1981," pp. 326-333, ACM SIGACT, New York.
- KUNG, H. T., AND LEISERSON, C. E. (1979), Systolic arrays (for VLSI), in Duff, I. S., and Stewart, G. W. (Eds.), "Sparse Matrix Proceedings 1978," pp. 256-282, Society for Industrial and Applied Mathematics.
- SIEWIOREK, D. P., BELL, C. G., AND NEWELL, A. (1982), "Computer Structures: Principles and Examples," McGraw-Hill, New York.
- Song, S. W. (1981), "On a High-Performance VLSI Solution to Database Problems," Ph.D. thesis, Computer Science Department, Carnegie-Mellon University; also available as a CMU Computer Science Department technical report.