Simulation of Large Communication Networks on the Connection Machine

Honsing Cheng and Lamine Diatta

May 1995


1. Introduction

Suppose we have a stochastic model for a real-life discrete event dynamic system (DEDS) whose performance we can measure as a function of a parameter w. That is, we want to find J(w), where J is the performance function of the system. In particular, the area of stochastic optimization deals with how to find a value w* that yields the "best" performance, both analytically and experimentally through simulation. In cases where solving for w* analytically is hard or impossible (due to the complexity of the model or the limitations of mathematical tools), simulation must be used to obtain performance measurements J(w) for a finite set of w's; w* is then picked to be the w for which J(w*) >= J(w) for all w's tested.

Many systems require such optimization techniques. It is therefore very important to be able to simulate large, complicated systems in as little time as possible, so as to speed up the evaluation of J(w) for each w in the candidate set. This project attempts to provide some guidelines on how to use massively parallel computers for efficient simulation of DEDS, which is a crucial step in experimental stochastic optimization.

We first present a high-level picture of how parallel computers fit into the overall framework of optimization. Two models will be presented, corresponding to two different levels at which parallelism is applied. We will then argue that parallel simulation can improve the confidence of optimality of J(w*), which is the probability that J(w*) belongs to the top p-th percentile of the entire population (from which only a finite set is drawn for simulation). We then present an example that applies the technique to the capacity assignment problem of a communication network modeled as a DEDS.
We will also discuss how parallelism allows efficient evaluation of aggregation techniques, which are frequently used to find simpler representations of large queueing networks.

Throughout this paper, when we say "a design" of a system, we mean "a value for a parameter" on which the performance function depends. When we say "a set of designs," we mean "a set of values for some parameter"; the specific parameter is not of concern.


2. Use of Parallelism in Performance Evaluation

Efficient evaluation of designs is critical in experimental optimization, both in terms of accuracy and solution time (the time it takes to find the solution). There are two levels at which parallelism can be applied for improvement in both respects. Consider a system as made up of different subsystems. If two or more processors deal with a _common_ subsystem, we say they operate at the _local_ level. On the other hand, if two or more processors operate on _different_ subsystems, they operate at the _global_ level. In Figure 1a, P1 and P2 are engaged in a global level optimization, while they work at the local level in Figure 1b, where the local subsystem is the entire system itself.

    |-------------|            |--------|
    | subsystem 1 |<---P1      |        |<---P1
    |-------------|            | System |
    | subsystem 2 |<---P2      |        |<---P2
    |-------------|            |--------|

       Figure 1a                Figure 1b
2.1. Local Optimization

We have a subsystem for which we want to find the best design for one of its parameters. Suppose we have designs w1, w2, ..., wN. Each of them affects the subsystem in some way, which in turn affects the performance of the entire system. The problem is to find the w* which yields the best performance J(w*). That is:

    J(w*) >= J(w1), J(w2), ..., J(wN)

If we have N processors, we can have each of them simulate the system under one of the w's and therefore produce one of the J(w)'s in parallel. This means we can find w* in a fixed amount of time, regardless of N.

What is the implication of this? Consider a system that allows no practical analytical approach to its optimization, and for which there is no useful heuristic to guide the search for the optimum design by simulation. Suppose we find, through simulation, that w' is the best design in a set of N designs randomly chosen from the design space. Then we would expect w' to get closer to the true optimum as N gets larger; this is simply saying that the sample optimum gets closer to the true optimum as the sample size gets larger. The fact that, by applying parallelism, we can obtain a sample of performance values in a time independent of the sample size implies that we can obtain a better sample optimum without increasing the solution time. To obtain J(w1), J(w2), ..., J(wN) on a sequential machine and then determine the sample optimum, we need time N*T, where T is the time of a single simulation. With parallelism, we need only time T for any sample size.

The saving in solution time can also be used in another way to improve the approximation by simulation. Put differently, when a simulation runs for a finite length of time, the resulting performance measure is only an approximation of the true system performance for that parameter. That is, we observe:

    J'(w) = J(w) + X

where J(w) is the true performance and X is a noise term resulting from the finiteness of the simulation.
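The scheme above can be sketched in a few lines. This is a toy illustration, not the project's CM-5 code: the "true" performance J(w) = -(w - 3)^2 and the noise model (Gaussian, shrinking with simulation length) are hypothetical stand-ins, and Python worker threads play the role of processors.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def simulate_performance(w, sim_length=10_000):
    """One simulation run, returning J'(w) = J(w) + X.

    Hypothetical stand-in: the 'true' performance is -(w - 3)^2, and
    the noise term X shrinks as sim_length grows, mimicking the
    finite-run approximation error described in the text."""
    rng = random.Random(hash(w))           # reproducible per-design noise
    true_J = -(w - 3.0) ** 2
    noise = rng.gauss(0.0, 1.0) / sim_length ** 0.5
    return true_J + noise

def parallel_local_optimum(designs, sim_length=10_000):
    """Evaluate every design concurrently -- one worker per design,
    playing the role of one processor -- and return the sample
    optimum w* together with its observed performance J'(w*)."""
    with ThreadPoolExecutor(max_workers=len(designs)) as pool:
        scores = list(pool.map(
            lambda w: (simulate_performance(w, sim_length), w), designs))
    best_score, best_w = max(scores)       # sample optimum
    return best_w, best_score

designs = [0.5 * k for k in range(10)]    # N = 10 candidate designs
w_star, J_star = parallel_local_optimum(designs)
```

Note that the wall-clock time is governed by one call to simulate_performance, not by N, and that raising sim_length (using the saved N*T - T) shrinks the noise on every J'(w) simultaneously.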
To get rid of the noise term completely, a simulation would have to run for infinite time. The saving in time of (N*T - T) can instead be used to diminish the noise: we simply let each processor run longer, and the resulting J(w1), J(w2), ..., J(wN) will be much better approximations.

2.2. Global Optimization

A divide-and-conquer strategy is an efficient way to use parallelism to solve the global optimization problem. Each subsystem is to be optimized by itself, and each is assigned to one processor. The set of N processors proceeds in parallel, and in a time T independent of the number of subsystems N, the optimum parameter for each and every subsystem is found. The solution time is therefore independent of the number of subsystems, which in some sense reflects the complexity of the system being studied. More specifically, if the entire system consists of subsystems S1, S2, ..., SN and T(S) is the time to solve subsystem S, then:

    T = max{ T(S1), T(S2), ..., T(SN) }

is the time at which the global optimization finishes. (Some additional time is also needed for integrating the results and checking feasibility.)

The above approach works only when there are limited interactions among the subsystems, so that "locally" optimized subsystems, when put together, yield a globally optimized system. Even then, the combined system might not be feasible. For instance:

    A --->O---|
              |--->O--- X
    B --->O---|

where each '>O' represents a server in a queueing network. Choosing a buffer of size x for server A and size y for server B separately might each yield good performance; however, (x+y) might exceed some resource constraint, and the naive combination would then be infeasible.

When these conditions hold, parallelism gains a factor of N, the number of subsystems comprising the whole system. As the system gets more complicated and N gets larger, the gain from parallelism increases as well.
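The divide-and-conquer scheme, including the feasibility check on the combined result, can be sketched as follows. The per-subsystem cost and performance models and the buffer-budget constraint are hypothetical, chosen only to reproduce the A/B example above.

```python
from concurrent.futures import ThreadPoolExecutor

def optimize_subsystem(candidates, cost, performance):
    """Optimize one subsystem in isolation; `cost` and `performance`
    are hypothetical per-subsystem models."""
    best = max(candidates, key=performance)
    return best, cost(best)

def global_optimize(subsystems, budget):
    """Divide-and-conquer: one worker per subsystem runs independently,
    so the wall-clock time is max(T(S1), ..., T(SN)) rather than the
    sum.  The combined result is then checked against a shared resource
    constraint, since locally optimal choices need not be jointly
    feasible."""
    with ThreadPoolExecutor(max_workers=len(subsystems)) as pool:
        results = list(pool.map(lambda s: optimize_subsystem(*s),
                                subsystems))
    total_cost = sum(c for _, c in results)
    return [w for w, _ in results], total_cost <= budget

# Two identical servers choosing a buffer size from 1..8.  Each alone
# prefers the largest buffer, but 8 + 8 exceeds the shared budget of
# 12, so the naive combination is infeasible -- the A/B case above.
server = (list(range(1, 9)),     # candidate buffer sizes
          lambda x: x,           # cost: buffer slots consumed
          lambda x: -1.0 / x)    # toy performance: larger buffer is better
designs, feasible = global_optimize([server, server], budget=12)
```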
3. Gain of Optimality Confidence

Suppose we use some method to solve an optimization problem and find that w* gives the best J(w) among a finite set of w's we experimented with. As explained above, the performance of w* we obtain is an approximation. That is:

    J'(w*) = J(w*) + X,    X is the noise term

We define the confidence of optimality of an approximate optimum parameter w* at the q-th percentile to be:

    Prob[ J(w*) belongs to the top q-th percentile of the entire
          population of w | J'(w*) is in the top q-th percentile of
          J'(w) over all w's experimented with ]

Confidence of optimality is important because, if it is too small, we cannot expect w*, when used in the true system (as opposed to the simulated one), to give a performance really in the top q-th percentile, even though this is true for the simulated system. If the optimality confidence is high, we can expect that the best we obtain from simulation is indeed among the best in the real system.

Suppose we try out N w's in each simulation and find the best, then collect the bests of M such simulations into a set S, and call the best in S w*. On the other hand, run a single simulation that tests M*N w's, and call its best w'. Then it can be shown that:

    Prob[ J(w*) in top q% ] >= Prob[ J(w') in top q% ]

The implication is this: in solving optimization problems, taking the best of M best values, each obtained from a simulation of size N, gives a confidence of optimality at least as high as the best obtained from a single simulation of size M*N. We can therefore use parallel computers to carry out the M independent simulations in parallel and find the best of the M best results. The confidence of optimality achieved is potentially higher, and the solution time can be much reduced.

4. An Example of Local Optimization using the CM-5

In this section, we present an example of applying the parallel local optimization technique to the capacity assignment problem in communication networks modeled as a DEDS.
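Before turning to the example, the M-best-of-N scheme of Section 3 can be illustrated with a small Monte Carlo sketch. The distributions here are purely hypothetical modelling choices: true performances drawn from Uniform(0, 1) and Gaussian observation noise.

```python
import random

def confidence_of_optimality(M, N, q, noise_std, trials=2000, seed=1):
    """Monte Carlo estimate of the probability that the 'best of M
    bests' w* is truly in the top q-fraction of the population.
    Population and noise models are hypothetical illustrations."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        group_bests = []
        for _ in range(M):
            # one independent simulation of sample size N:
            # draw N true performances, observe each with noise
            obs = []
            for _ in range(N):
                true_J = rng.random()
                obs.append((true_J + rng.gauss(0.0, noise_std), true_J))
            group_bests.append(max(obs))       # noisy sample optimum
        _, true_J_star = max(group_bests)      # best of the M bests
        hits += true_J_star >= 1.0 - q         # truly in top q-fraction?
    return hits / trials

noisy = confidence_of_optimality(M=8, N=16, q=0.05, noise_std=0.2)
exact = confidence_of_optimality(M=8, N=16, q=0.05, noise_std=0.0)
```

With zero noise the selected w* is the true sample maximum, so the confidence is essentially the chance that the best of M*N uniform draws lands in the top q-fraction; observation noise degrades the selection, which the estimate makes visible.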
The capacity of a communication channel is defined to be the maximum amount of data that can be transmitted through the channel in unit time. The problem is to find the optimal capacity assignment for all channels in the network, under some constraint. The measurement criterion is average message delay. For the specifics of the model, please read Appendix A. The following are some important points:

1) Channels are modeled as queues with infinite buffer size. The only delays a message encounters are the time spent waiting to be served by a channel and the actual service time at the channel.

2) Traffic from one node to another follows a Poisson process, so the time between the creation of successive messages going between any pair of nodes is exponentially distributed. Service time at a particular channel depends only on the length of the message.

3) The routing procedure is fixed. That is, the route between any pair of nodes is both unique and fixed. The input file to the simulation provides this information.

The network being simulated is derived from the ARPANET in 1975. The data, such as traffic rates between nodes and the sum of channel capacities, are hypothetical. The topology is shown in Figure 2.
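Under assumptions 1) and 2), a single channel behaves as a FIFO queue with Poisson arrivals and length-dependent service. The following minimal sketch simulates one such channel via the Lindley recursion and checks it against the closed-form M/M/1 mean delay; the traffic rate, message length, and capacity values are hypothetical, not taken from the ARPANET data.

```python
import random

def simulate_channel_delay(arrival_rate, avg_msg_len, capacity,
                           n_messages=200_000, seed=7):
    """Average message delay on one channel, modelled (as in the text)
    as an infinite-buffer FIFO queue: Poisson message arrivals, service
    time = message length / capacity, exponentially distributed
    lengths.  Uses the Lindley waiting-time recursion
    W[k+1] = max(0, W[k] + S[k] - A[k+1])."""
    rng = random.Random(seed)
    wait, total_delay = 0.0, 0.0
    for _ in range(n_messages):
        service = rng.expovariate(1.0 / avg_msg_len) / capacity
        total_delay += wait + service          # this message's delay
        wait = max(0.0, wait + service - rng.expovariate(arrival_rate))
    return total_delay / n_messages

# With exponential lengths this is an M/M/1 queue, whose mean delay
# is 1 / (C/L - lambda); the parameters below are hypothetical.
lam, L, C = 4.0, 1.0, 8.0              # msgs/sec, bits/msg, bits/sec
sim = simulate_channel_delay(lam, L, C)
theory = 1.0 / (C / L - lam)           # mean delay of 0.25 sec
```

In the full network simulation many such channels run at once and a message's end-to-end delay is the sum along its fixed route; the per-channel agreement above is only a sanity check of the queueing model.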
Figure 2: Topology of the simulated ARPANET

The local optimization technique is applied as follows. Each of the M nodes of the CM-5 chooses N random sets of capacity assignments under the constraint and determines the best one through simulation. The solution is then determined to be the best among these M best assignments. The implementation details and output are shown in Appendix B.

5. Efficient Evaluation of Aggregation Techniques

Aggregating a queueing network means finding a simpler queueing network that captures the performance characteristics of the original system; it is therefore a form of approximation. The aggregated system (the simpler one) can then be used either for analytical optimization or for simulation. Certainly, not all characteristics of the original system can be preserved in the aggregated system; some of them will be "aggregated away," i.e., no longer preserved. Thus, if we want to optimize average time delay in a communication network, we are only concerned with preserving the average-delay characteristics of the original system when finding an aggregate for it. How good an aggregation technique is depends on how well it preserves the characteristics of concern.

If we have only a finite set of w's to experiment with, and we only want to know which w gives the best J(w), then the absolute values of the J(w)'s might not matter, as long as the best w in the original system also gives the best performance in the aggregated system. We can then define:

    Prob[ w* = w' ], where w* gives the best performance in the
    original system and w' gives the best performance in the
    aggregated system,

to be the alignment probability of the technique, which can then be used to evaluate aggregation techniques. The most frequently encountered technique, namely the equivalent server technique, is described in some detail in Appendix C. Parallel computers can also help speed up the evaluation process.
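An alignment-probability estimate might be sketched as follows. The two performance models standing in for the original and aggregated systems are hypothetical; on a parallel machine, each of the 2N evaluations inside a trial would run on its own processor.

```python
import random

def alignment_probability(J_original, J_aggregate, design_space,
                          N=10, trials=1000, seed=3):
    """Estimate Prob[the design that is best in the original system is
    also best in the aggregated system] over random size-N design
    sets.  J_original / J_aggregate are hypothetical performance
    models standing in for full simulations."""
    rng = random.Random(seed)
    aligned = 0
    for _ in range(trials):
        ws = rng.sample(design_space, N)       # one experiment's designs
        best_P = max(ws, key=J_original)       # best in original system
        best_Q = max(ws, key=J_aggregate)      # best in aggregated system
        aligned += best_P == best_Q
    return aligned / trials

space = [i / 100 for i in range(100)]
# An aggregate that distorts absolute values but preserves the
# ordering aligns perfectly; one whose optimum sits elsewhere does not.
p_good = alignment_probability(lambda w: -(w - 0.6) ** 2,
                               lambda w: -3 * abs(w - 0.6) + 1, space)
p_bad = alignment_probability(lambda w: -(w - 0.6) ** 2,
                              lambda w: -(w - 0.3) ** 2, space)
```

The contrast between p_good and p_bad shows why alignment probability, rather than agreement in absolute J(w) values, is the right yardstick for an aggregation technique used inside an optimization loop.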
Suppose we have an original system P and an aggregated system Q, and N different w's are to be experimented with. We obtain the performance of the N w's via N simulations of P in parallel, and then N simulations of Q, also in parallel. To estimate the alignment probability, we need to repeat the above procedure many times. Fortunately, with parallel simulations, the time required depends only on how accurate we want the obtained alignment probability to be, not on how many w's we are comparing. Note that this evaluation task can be prohibitively expensive on a uniprocessor machine.

6. Conclusion

In this paper, we have emphasized how parallel computers can be used to improve various aspects of stochastic optimization, including solution time and confidence of optimality. Since aggregation techniques are commonly used in optimization, we have also given a general outline of how parallel computers can speed up the evaluation of aggregation techniques. Although some of the material is quite straightforward, we hope that the clear statements of the various algorithms and the preliminary analysis provided in this paper can be helpful in forming a more solid picture of how parallelism fits into stochastic optimization.