CONCURRENCY: PRACTICE AND EXPERIENCE Concurrency: Pract. Exper., Vol. 11(9), 461–477 (1999)

Predicting the execution time of message passing models

J. L. RODA∗, C. RODRÍGUEZ, D. G. MORALES AND F. ALMEIDA
Dpto. Estadística, Investigación Operativa y Computación, Universidad de La Laguna, 38271, La Laguna, Tenerife, Spain (e-mail: jlroda@ull.es)

SUMMARY
Recent publications show that runtime systems oriented to the Bulk Synchronous Parallel model usually achieve remarkable accuracy in their predictions. This accuracy stems from the capacity of such software to pack the messages generated during a superstep and to rearrange the order in which those messages are sent at the end of the superstep. Unfortunately, barrier synchronisation limits both the range of available algorithms and their performance. The asynchronous nature of many MPI/PVM programs makes their expression difficult or infeasible using a BSP-oriented library. Through a generalisation of the concept of superstep we propose two extensions of the BSP model: the BSP Without Barriers (BSPWB) and the Message Passing Machine (MPM) models. These new models are oriented to MPI/PVM parallel programming. The parameters of the models and their quality are evaluated on four standard parallel platforms. The use of these BSP extensions is illustrated using the Fast Fourier Transform and the Parallel Sorting by Regular Sampling algorithms. Copyright 1999 John Wiley & Sons, Ltd.

1. INTRODUCTION

A computing model defines the behaviour of a theoretical machine. The objective of a model is to allow the design and analysis of parallel algorithms that can be executed efficiently on a variety of architectures. This definition implies methodologies to design algorithms and to compute the time spent on their execution. These methodologies constrain the languages and lead to the implementation of their compilers for the architectures conforming to the model.
A fundamental obstacle to the widespread use of parallel machines for general purpose computing has been the lack of a widely accepted standard model of parallel computation. The development of a reasonable abstraction of parallel machines is a formidable challenge. A simple and precise model of parallel computation is necessary to guide the design and analysis of parallel algorithms. Among the plethora of proposed solutions, the PRAM, Network, LogP and BSP models are the most popular. The PRAM model[1] has been widely used to represent the complexity of parallel algorithms. The model is simple and useful for a gross classification of parallel algorithms, but it is not accurate enough because all processors work synchronously and interprocessor communication is free. The PRAM model assumes that interprocessor communication has infinite bandwidth, zero latency and zero overhead. To make it more practical, different variants have been proposed[2]. In the Network model[3], communications are only allowed between directly connected processors; other communications are explicitly forwarded through intermediate nodes. Many algorithms have been created that are perfectly matched to the structure of a particular network. However, these elegant algorithms usually do not map with equal efficiency onto interconnection structures different from those for which they were designed. Other models take as their starting point the fact that many current parallel computers consist of a collection of complete computers connected through a network interface to a multistage interconnection network.

∗ Correspondence to: J. L. Roda, Dpto. Estadística, Investigación Operativa y Computación, Universidad de La Laguna, 38271, La Laguna, Tenerife, Spain. (Received 28 April 1998; revised 12 January 1999)
Many commercial massively parallel computers follow this hardware organisation, and this seems to be a clear trend for the near future. In the LogP model[4], a parallel hardware/software platform is characterised by four parameters: the latency (L), the communication overhead (o), the gap (g) and the number of processors (P). The model also assumes that if a processor attempts to transmit more than ⌈L/g⌉ unconsumed messages, it will stall until the message can be sent without exceeding that limit. Although the model encourages the careful scheduling of communication and the overlapping of communication with computation, a complete LogP analysis of a non-trivial algorithm is, in many cases, almost infeasible. A common simplification is to disregard the distinction between the overhead o and the gap g, reducing the model to the simple linear function T(n) = L + gn, as in [5–7].

The computational BSP model[8] proposes a parallel machine made of a set of p processor-memory pairs, a global communication network and a mechanism for synchronising the processors. A BSP computation consists of a sequence of supersteps. The cost of a BSP program can be calculated simply by summing the costs of the separate supersteps executed by the program. For each superstep the cost can be divided into: (i) local computation; (ii) global exchange of data; and (iii) barrier synchronisation. The communication pattern performed in a given superstep is called an h-relation if the maximum number of packets a processor communicates in the superstep is h:

h = max{in_i @ out_i / i ∈ {0, . . . , p − 1}}   (1)

where @ is a binary operator, usually the maximum or the sum, and in_i and out_i respectively denote the number of packets entering and leaving processor i. Both the particular operation @ and the size of the packet depend on the particular architecture.
Operation @ is close to the maximum for machines allowing fully parallel processing of incoming and outgoing messages. The correct interpretation for each case depends on the number of input/output ports of the network interface. The results presented in this paper take the operator @ as the sum. The two basic BSP parameters that model a parallel machine are the gap g, which reflects per-processor network bandwidth, and the minimum duration of a superstep L, which reflects the latency to send a packet through the network as well as the overhead of a global synchronisation. The fundamental feature of the BSP model lies in the h-relation hypothesis introduced by Valiant. It states that the communication time spent on an h-relation is given by

Communication Time = g·h   (2)

Let W denote the maximum time spent in local computation by any processor during the superstep. The BSP model estimates that the running time of a superstep is bounded by the formula:

Time Superstep = W + g·h + L   (3)

In consequence, the design of algorithms under the BSP model tries to minimise the number of supersteps, the maximum number of operations W performed by any processor and the maximum number h of packets communicated. Complexity orders under the max interpretation of @ differ from those under the sum by at most a factor of two. Although BSP programs can be expressed using existing communication libraries such as PVM and MPI, the converse is not always true. The asynchronous nature of many MPI/PVM programs does not easily fit inside the BSP model. In the next section we propose two extensions of the BSP model for MPI/PVM parallel programs: the BSP Without Barriers (BSPWB) and the Message Passing Machine (MPM) models.
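As an illustration of formulas (1)–(3), the following sketch (ours, not taken from any BSP library) computes the h-relation under both interpretations of @ and the resulting superstep bound. The values of g, L and the per-processor packet counts are made-up examples.

```python
# Illustrative sketch: the h-relation of formula (1) and the BSP superstep
# bound of formula (3). All numeric values are invented examples.

def h_relation(in_counts, out_counts, op=sum):
    """h = max over processors of in_i @ out_i, with @ = sum or @ = max."""
    return max(op((i, o)) for i, o in zip(in_counts, out_counts))

def superstep_time(W, h, g, L):
    """BSP bound of formula (3): Time = W + g*h + L."""
    return W + g * h + L

# Example: four processors with the following packet traffic.
ins  = [3, 1, 0, 2]
outs = [1, 3, 2, 0]

h_sum = h_relation(ins, outs, op=sum)   # @ = + (used in this paper): 4
h_max = h_relation(ins, outs, op=max)   # @ = max: 3

print(h_sum, h_max)
print(superstep_time(W=10.0, h=h_sum, g=0.5, L=2.0))  # 10 + 0.5*4 + 2 = 14.0
```

Note that h under @ = + is never more than twice h under @ = max, which is the factor-of-two remark above.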
It has been argued[9] that the BSP values of g and L depend not only on the hardware performance of the target architecture but also on the software, and that systems not designed with BSP in mind therefore do not deliver good values of g and L. Runtime systems oriented to the BSP model try to bring actual machines closer to the ideal BSP machine by packing the individual messages generated during a superstep and by optimising communication time through rearranging the order in which messages are sent at the end of the superstep[10]. These design decisions contribute strongly to the remarkable accuracy of BSP runtime systems such as the Oxford BSP library. Beyond these statements, there are no rigorous studies measuring the validity of the BSP model on current standard parallel platforms. Section 3 studies the influence of the number of processors and of the communication pattern on the h-relation hypothesis, and estimates the gaps and start-up latencies of the BSP Without Barriers and Message Passing Machine models on six representative platforms. Four of them use PVM: an IBM SP2, a Silicon Origin 2000, a 10 Mbit coaxial Ethernet local area network, and a UTP Ethernet LAN. The other two use MPI: a Cray T3E and a Silicon Origin 2000.

The IBM Scalable POWERparallel SP2 used in the experiments is a distributed-memory parallel computer. Processors, or nodes, are interconnected through a High Performance Switch (HPS), a bi-directional multistage interconnection network. The computing nodes are Thin2 nodes, each powered by a 66 MHz Power2 RISC System/6000 processor. The tests were implemented in PVMe[11], the improved version of the message-passing software PVM[12]. The Origin 2000 nodes are R10000 processors (196 MHz). Origin systems use distributed shared memory (DSM); to a processor, main memory appears as a single addressable space.
The HUB ASIC, a four-port crossbar switch, interconnects the four interfaces of the node: the processor, the main memory and directory memory, the router board, and the I/O subsystem. The router board interfaces the node with the CrayLink Interconnect fabric. Both the 10 Mbit/s coaxial LAN and the UTP LAN are made of Sun Sparc workstations at 70 MHz running Solaris 2.5. The UTP LAN uses a FORE Systems ES-3810 Ethernet Workgroup switch that interconnects all the computers in the LAN. According to the specifications, a theoretical point-to-point bandwidth of 10 Mbit/s is guaranteed. Version 3.3.11 of PVM was used for these last three platforms. The Cray T3E is a distributed-memory multiprocessor system with 32 processors at 300 MHz connected in a 3D torus network.

We have selected for our study the set Π = {E, PP, OA, AO, AA} of the five communication patterns that most commonly appear in parallel algorithms: Exchange (E), PingPong (PP), OneToAll (OA), AllToOne (AO) and AllToAll (AA). In an Exchange pattern, E, a certain number p of processor pairs simultaneously send messages of length m_E and immediately proceed to receive the messages. In the more asymmetric PingPong pattern, PP, one of the processors in the pair sends a message of size m_PP and the other receives the message. In the Personalised OneToAll, OA, a processor sends a different message of size m_OA to each of the other processors. Reciprocally, in the AllToOne pattern, AO, a processor receives p − 1 messages of size m_AO sent by the other processors. Under the AllToAll pattern, AA, each of the p processors sends its message of size m_AA to all the other p − 1 processors.
Under the @ = + interpretation, all five patterns give rise to the same h-relation if the following sequence of equalities holds:

h = m_PP = 2·m_E = (p − 1)·m_OA = (p − 1)·m_AO = 2·(p − 1)·m_AA   (4)

According to the h-relation hypothesis, the communication times of these patterns for these message sizes have to be the same. The results in Section 3 show how actual systems deviate from this prediction. The use of the BSPWB and MPM models proposed in the next section will be illustrated in Section 4 using the Fast Fourier Transform and Parallel Sorting by Regular Sampling. Conclusions are presented in Section 5.

2. BSP WITHOUT BARRIERS: THE BSPWB AND MPM MODELS

The barrier synchronisation after each superstep imposed by BSP does not completely agree with the way many PVM and MPI programs are written. The execution of a PVM or MPI program on any processor consists of phases of computation followed by the communications necessary to provide and obtain the data for the next phase. Communication in this context means a continuous stream of messages. We propose a generalisation of the BSP concept of superstep to MPI/PVM programming that we call Message steps or 'M-steps'. In each M-step a processor performs some local computation, sends the data needed by the other processors and receives the data it needs for the next M-step. Processors may be in different M-steps at a given time, since no global barrier synchronisation is used. The absence of barriers is the main difference from the BSP programming style. However, as in pure BSP, we assume that the total number of M-steps, R, performed by all the p processors is the same, and that communications always occur between processors in adjacent steps k − 1 and k. Computation on any processor can be arbitrarily divided by the designer to achieve this goal.
The time t_{s,i} at which processor i finishes its step s is bounded by what we call the BSP Without Barriers (BSPWB) time T_s, given by:

T_1 = max{w_{1,i}} + max{g·h_{1,i} + L},   where i = 0, 1, . . . , p − 1
T_s = T_{s−1} + max{w_{s,i}} + max{g·h_{s,i} + L},   where s = 2, . . . , R and i = 0, 1, . . . , p − 1   (5)

where w_{s,i} and h_{s,i} respectively denote the time spent computing and the h-relation established by processor i in step s, defined as:

h_{s,i} = in_{s,i} @ out_{s,i}   for i = 0, 1, . . . , p − 1 and s = 1, . . . , R   (6)

Here in_{s,i} and out_{s,i} respectively denote the number of packets incoming to and outgoing from processor i in M-step s, and the @ operation is defined depending on the architecture. The gap and latency values g and L can be computed as proposed in the next section. Instead of being associated with barrier synchronisation, the parameter L now describes the start-up time spent in the h-relation.

A closer bound on the actual MPI/PVM time t_{s,i} at which processor i finishes its sth M-step is the value Φ_{s,i} given by the Message Passing Machine (MPM) model we propose here. We define the set Ω_{s,i} for a given processor i and M-step s as:

Ω_{s,i} = { j / processor j sends a message to processor i in step s } ∪ {i}   (7)

The Message Passing Machine (MPM) time of an MPI/PVM program is defined by the formulas:

Φ_{1,i} = max{w_{1,j} / j ∈ Ω_{1,i}} + (g·h_{1,i} + L),   where i = 0, 1, . . . , p − 1
Φ_{s,i} = max{Φ_{s−1,j} + w_{s,j} / j ∈ Ω_{s,i}} + (g·h_{s,i} + L),   where s = 2, . . . , R and i = 0, 1, . . . , p − 1   (8)

and

h_{s,i} = max{in_{s,j} @ out_{s,j} / j ∈ Ω_{s,i}},   where s = 1, . . . , R and i = 0, 1, . . . , p − 1   (9)

The processors in the set Ω_{s,i} are called 'the incoming partners of processor i in step s'. The total time of an MPI/PVM program in the MPM model is given by

Ψ = max{Φ_{R,j} / j ∈ {0, . . . , p − 1}}   (10)

where R is the total number of M-steps.
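The recurrences above can be evaluated mechanically. The following sketch (ours, not the paper's software) computes the BSPWB bound of formula (5) and the MPM total time of formulas (8)–(10) on a toy two-step trace; all work times, h-relations and partner sets are invented example data.

```python
# Illustrative sketch: BSPWB bound (5) and MPM times (8)-(10) on a toy trace.

def bspwb_times(w, h, g, L):
    """Formula (5): T_s = T_{s-1} + max_i w[s][i] + max_i (g*h[s][i] + L)."""
    T, times = 0.0, []
    for ws, hs in zip(w, h):
        T += max(ws) + max(g * x + L for x in hs)
        times.append(T)
    return times

def mpm_total_time(w, h, omega, g, L):
    """Formulas (8)-(10): Phi[s][i] over the incoming partners omega[s][i];
    returns Psi = max_i Phi[R][i]."""
    p = len(w[0])
    phi = [0.0] * p
    for ws, hs, om in zip(w, h, omega):
        phi = [max(phi[j] + ws[j] for j in om[i]) + (g * hs[i] + L)
               for i in range(p)]
    return max(phi)

# Two processors, two M-steps, g = 1, L = 0. In step 1 the processors do not
# communicate; in step 2 they exchange one packet each.
w = [[4.0, 1.0], [1.0, 4.0]]            # w[s][i]: computation times
h = [[0, 0], [1, 1]]                    # h[s][i]: h-relations
omega = [[{0}, {1}], [{0, 1}, {0, 1}]]  # omega[s][i]: incoming partners

print(bspwb_times(w, h, 1.0, 0.0))          # [4.0, 9.0]
print(mpm_total_time(w, h, omega, 1.0, 0.0))  # 6.0, a tighter bound
```

When every partner set Ω_{s,i} is the whole set of processors, mpm_total_time returns the last BSPWB value, in agreement with the remark on formula (8) above.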
Instead of bounding the time as if there were a hypothetical global barrier synchronisation at the end of each M-step, the MPM model assumes that a processor synchronises only with its incoming partners. Formula (8) reduces to the BSPWB time of formula (5) when, for every processor i and every step s, the set Ω_{s,i} is the whole set of processors {0, . . . , p − 1}. This is the case for the AllToAll pattern. As the examples in Section 4 show, in a large number of parallel algorithms there is a heavily loaded processor that reaches the maximum time in each M-step. In such cases the BSPWB and MPM models agree in their predictions.

Figure 1 illustrates the differences between the two models through a bar-diagram example. The diagram corresponds to an application running on a 4-processor machine in two steps. White areas correspond to computation, while black areas stand for communication. Bars labelled MPM_i show the Message Passing Machine schedule of processor i. Arrows indicate the sources and targets of messages. The chunk of bar inside the brackets pointing to label Φ_{s,i} corresponds to the MPM time of processor i during step s. During the first step, processors 0 and 1 perform a heavier task than processors 2 and 3. After an exchange operation the situation is inverted, and processors 0 and 1 do the lighter part, compensating for the earlier imbalance. Finally, there is another exchange between processors 0 and 2 and between processors 1 and 3. The time predicted by the MPM model (10 s) is more accurate than the time predicted by the BSPWB model (12 s).

Figure 1. MPM and BSPWB times
3. ESTIMATION OF THE BSPWB AND MPM PARAMETERS g AND L IN FOUR DIFFERENT ARCHITECTURES

The values of g and L have been computed using a linear fit on the average times T_average(h) of the times Time_{ρ,i}(h) obtained for the five patterns ρ ∈ Π = {E, PP, OA, AO, AA} and for the different numbers of processors i ∈ H_ρ:

T_average(h) = Σ_{ρ∈Π} ( Σ_{i∈H_ρ} Time_{ρ,i}(h) / |H_ρ| ) / |Π|   (11)

where H_ρ = {2, 4, 6, 8} for the Exchange and PingPong patterns, and H_ρ = {4, 6, 8} for the OneToAll, AllToOne and AllToAll patterns. To produce the same h-relation size, the message sizes m_ρ were chosen according to formula (4).

3.1. PVM values of g and L

Figure 2 shows the variation of g_{ρ,i}, expressed in seconds per byte, for the IBM SP2 and the Origin 2000 with the number i of processors. The group of curves between 3 × 10^−8 and 4.5 × 10^−8 s/byte corresponds to the IBM SP2. Observe that all the Origin 2000 curves lie under the IBM SP2 curves, and that the Origin 2000 is between two and three times faster than the IBM SP2. For both machines, the invariance with respect to the number of processors is better than the invariance with respect to the communication pattern. The slowest pattern on both machines corresponds to the most sequential one, PingPong, where all the h bytes are sent through the output port. The other patterns benefit from the existence of different ports for input and output. Compare the BSPWB PVM g values for the IBM SP2 with the corresponding Oxford BSP library values[13] for the same machine at C4[14]: g′ = 35 × 10^−8 s/word = 8.75 × 10^−8 s/byte. The Oxford BSP value is larger since its definition of the h-relation is based on the operator @ = max and its estimation of g is based on the AllToAll pattern[10].

Figure 2. g values for the IBM SP2 and Origin 2000 in PVM
Figure 3. g values for the UTP LAN in PVM

Figures 3 and 4 present the corresponding variation of g for the UTP LAN and the coaxial LAN, respectively. The value of g for the coaxial LAN decreases from 2 × 10^−5 for the Exchange pattern to 5 × 10^−6 for the OneToAll. The improvement of the value of g with the number of processors in the OneToAll pattern, not only on the coaxial LAN but also on the IBM SP2 and the Origin, is remarkable. We have studied this phenomenon in previous work[15]. The poor performance of the Exchange and AllToAll patterns in the single-bus network can be explained by the increase in the number of collisions.

Figure 4. g values for the COA LAN in PVM

Table 1 presents the BSPWB and MPM values of g and L for the four architectures considered using PVM.

Table 1. Values of g and L using PVM

              L              g
  ORIGIN      −1.53 × 10^−5  4.88 × 10^−8
  IBM SP2      8.08 × 10^−5  1.38 × 10^−7
  UTP LAN      3.82 × 10^−3  6.04 × 10^−6
  COA LAN      1.26 × 10^−2  1.01 × 10^−5

The negative value of the latency for the Origin 2000 is due to the unsuitability of taking one byte as the unit packet size; the appropriate unit packet size depends on the specific architecture. For the Origin 2000, a 512-byte packet size delivers a positive value. In order to compare the four architectures we have preferred to express h-relation sizes in words.

3.2. MPI values of g and L

We have also computed the values for MPI and for a larger number of processors. In this case, the Origin 2000 and the Cray T3E were used. Figures 5 and 6 show the variation of g, expressed in seconds per word, for the Origin 2000 and the Cray T3E architectures. A perfect network holding the h-relation hypothesis would be one for which all six g_{ρ,i} curves overlap in a single horizontal line. The independence of the h-relation time from the number of processors is better than its independence from the communication pattern.
There are two differentiated clusters of curves. The permutation patterns {E, PP} are slower than the collective patterns {OA, POA, AO, AA}. While the time of the permutation patterns grows with the number of processors, the collective patterns take advantage of the additional processors to parallelise the movement of the given h-relation. The slowest pattern on the two machines is PingPong, since all the h words are sequentially sent through the output port. The other patterns benefit from the existence of different ports for input and output.

Figure 5. g values for the Origin 2000 in MPI

Figure 6. g values for the Cray T3E in MPI

The improvement of the OneToAll time with respect to the Personalised OneToAll time is remarkable. The two OA and POA curves run in parallel, and the improvement of OA over POA gives a measure of the optimisation reached in the implementation of the software/hardware platform. Although the OneToAll has been considered a (P − 1) × m h-relation in formula (4), the actual number of packets injected by the root processor is surely smaller and depends on the particular MPI_Bcast() implementation: the OneToAll curve cannot be considered a proper h-relation as specified in formula (4). Thus, the variation ranges of Figures 5 and 6 narrow to 3 × 10^−8 for the Cray T3E and 6 × 10^−8 for the Origin 2000.

Table 2. Values of g and L using MPI (seconds/word)

              L              g
  Cray T3E   −2.11 × 10^−5   2.92 × 10^−8
  Origin     −1.75 × 10^−3   5.53 × 10^−8
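The least-squares fit used to obtain the g and L values of Tables 1 and 2 can be sketched as follows. The (h, time) pairs below are synthetic, not the paper's measurements; the sketch only illustrates how a slope g and intercept L are recovered from averaged times.

```python
# Illustrative sketch: ordinary least-squares fit of time = g*h + L,
# as used to build Tables 1 and 2. The timing data are synthetic.

def fit_g_L(hs, times):
    """Return (g, L) minimising the squared error of time = g*h + L."""
    n = len(hs)
    mh = sum(hs) / n
    mt = sum(times) / n
    g = sum((h - mh) * (t - mt) for h, t in zip(hs, times)) / \
        sum((h - mh) ** 2 for h in hs)
    L = mt - g * mh
    return g, L

# Synthetic averaged times generated with g = 3e-8 s/word and L = 2e-5 s.
hs = [1e4, 5e4, 1e5, 5e5, 1e6]          # h-relation sizes in words
times = [3e-8 * h + 2e-5 for h in hs]

g, L = fit_g_L(hs, times)
print(g, L)   # recovers the generating g and L
```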
The BSP Without Barriers values of g and L in Table 2 have been computed using a linear fit on the average times T_average(h) of the times Time_{ρ,i}(h) obtained for ρ ∈ Π = {E, PP, OA, POA, AO, AA} and for the different numbers of processors i ∈ H_ρ:

T_average(h) = Σ_{ρ∈Π} ( Σ_{i∈H_ρ} Time_{ρ,i}(h) / |H_ρ| ) / |Π|   (12)

where H_ρ = {2, 4, 6, 8, 16, 24} for the Exchange and PingPong patterns, and H_ρ = {4, 6, 8, 16, 24} for the OneToAll, Personalised OneToAll, AllToOne and AllToAll patterns. To produce the same h-relation size, the message sizes m_ρ were chosen according to formula (4). The h-relation sizes were in the range of 4330 to 4,945,920 32-bit words.

4. PREDICTING THE TIME OF MPI/PVM PROGRAMS WITH THE BSPWB AND MPM MODELS

To simplify the expression of the BSPWB and MPM M-step partition associated with an MPI program, we superimpose commented code on the example codes in Figures 7 and 8. When uncommented, the variable M_step contains at any time the value of the current BSPWB and MPM M-step. This M_step variable may be used as an additional tag when communicating messages, to check the validity of BSPWB and MPM programs. The use of this tag also solves the well-known library-user tag conflict appearing in old versions of PVM, which gave rise to the concept of communicator.

4.1. The fast Fourier transform example

We will illustrate the BSPWB and MPM models using a parallel algorithm to compute the fast Fourier transform. Consider a sequence A = (A[0], . . . , A[N − 1]) of length N. The DFT of the sequence A is the sequence C = (C[0], . . . , C[N − 1]) given by

C[i] = Σ_{k=0...N−1} A[k]·w^{ki}   (13)

where w = e^{2π√−1/N} is the primitive Nth root of unity in the complex plane. The following division can be deduced from the definition:

C[i] = Σ_{k=0...N/2−1} A[2k]·(w²)^{ki} + w^i · Σ_{k=0...N/2−1} A[2k + 1]·(w²)^{ki}   (14)

From this formula it follows that the DFT C of A can be obtained by combining B, the DFT of the even components of A, and D, the DFT of the odd components of A.
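The even–odd splitting of formula (14) can be checked directly against the definition (13). The following minimal recursive sketch (ours, using the paper's sign convention w = e^{2π√−1/N}) compares the two on a small input:

```python
# Illustrative check of formula (14) against definition (13).
# N must be a power of two for the recursive version.
import cmath

def dft_direct(A):
    """Formula (13): C[i] = sum_k A[k] * w**(k*i), with w = e^(2*pi*j/N)."""
    N = len(A)
    w = cmath.exp(2j * cmath.pi / N)
    return [sum(A[k] * w ** (k * i) for k in range(N)) for i in range(N)]

def fft(A):
    """Formula (14): combine the DFTs of the even and odd components."""
    N = len(A)
    if N == 1:
        return A[:]
    B = fft(A[0::2])                      # DFT of the even components
    D = fft(A[1::2])                      # DFT of the odd components
    w = cmath.exp(2j * cmath.pi / N)
    # B and D have period N/2, hence the index i % (N//2).
    return [B[i % (N // 2)] + w ** i * D[i % (N // 2)] for i in range(N)]

A = [1, 2, 3, 4]
assert all(abs(x - y) < 1e-9 for x, y in zip(fft(A), dft_direct(A)))
```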
Figure 7 shows the pseudo-code executed by processor NAME. At the beginning of the computation, all the processors hold a copy of vector A.

Figure 7. FFT pseudo-code

In lines 1 to 4, each processor computes the N/p subvector of A resulting from the odd–even pattern given by the bits of its NAME in reverse order. At this point, it proceeds to compute the sequential DFT (line 5). No communications are required up to this point. Lines 12 to 17 contain the combination phase. Processors with the same log(p) − 1 − i bits share the same i ancestors in the division tree. Thus, an active processor with its bit i equal to 1 sends its odd subvector to the processor that has computed the matching even part, giving rise to an h = N × 2^i/p-relation (lines 7–11). The time spent by the algorithm in the first M-step breaks down into the division time (lines 1–4), of order O(N/p), the corresponding DFT time, of order O((N/p) log(N/p)), and the first communication (lines 6–11):

T_1 = Φ_{1,NAME} = t_{1,NAME} = D·(N/p) + F·(N/p)·log(N/p) + L + g·(N/p)
Ω_{1,NAME} = {NAME, NAME XOR 1}   (15)

For s = 2, . . . , log(p), the time of M-step s is composed of the combination of lines 12–17 with loop index i = s − 2 plus the time spent in the communication of lines 7–11 with loop index i = s − 1:

T_s = Φ_{s,NAME} = t_{s,NAME} = T_{s−1} + R·N·2^{s−2}/p + (L + g·(N·2^{s−1}/p))   (16)

and Ω_{s,NAME} = {NAME, NAME XOR 2^{s−1}} if processor NAME is active; otherwise we assume that processor NAME makes an empty (ghost) step. The last M-step, s = log(p) + 1, performed by processor NAME = 0, only includes a combination:

T_{log(p)+1} = Φ_{log(p)+1,0} = t_{log(p)+1,0} = T_{log(p)} + R·N/2   (17)

Table 3 contains the values of the three computational constants D, F and R. The values for the coaxial and UTP platforms are the same, since the processors are the same.
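The recurrences (15)–(17) can be evaluated directly. The following sketch (ours) computes the predicted FFT time; the constants D, F, R, g and L below are placeholder values, not the measured values of Tables 1–3.

```python
# Illustrative sketch: evaluating the FFT prediction of formulas (15)-(17).
# All constants are placeholders chosen only to exercise the recurrence.
from math import log2

def fft_predicted_time(N, p, D, F, R, g, L):
    # First M-step: division, sequential DFT and first exchange (15).
    T = D * (N / p) + F * (N / p) * log2(N / p) + L + g * (N / p)
    # Intermediate M-steps s = 2..log(p): combination plus exchange (16).
    for s in range(2, int(log2(p)) + 1):
        T += R * N * 2 ** (s - 2) / p + (L + g * N * 2 ** (s - 1) / p)
    # Last M-step: final combination on processor 0 (17).
    T += R * N / 2
    return T

print(fft_predicted_time(N=2**19, p=8, D=2.4e-7, F=5.5e-7,
                         R=2.8e-7, g=5e-8, L=1e-5))
```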
Figure 8. MPI code for the PSRS

Table 3. Values of the computational constants

  Constants   Origin 2000    IBM SP2        LANs
  D           2.39 × 10^−7   5.86 × 10^−7   2.65 × 10^−6
  F           5.52 × 10^−7   4.81 × 10^−7   6.04 × 10^−7
  R           2.84 × 10^−7   8.69 × 10^−7   1.44 × 10^−6

The accuracy of the models for a 524,288-element complex instance is shown in Figures 9 and 10. The error percentage displayed in Figure 11 has been computed using the formula

Error = 100 · (Actual Time − Model Time)/Actual Time   (18)

The model and real time curves corresponding to the IBM SP2 appear overlapped. The error grows with the number of processors. The curious exception is the coaxial LAN. The PingPong pattern occurs in the FFT algorithm with 1, 2 and 4 pairs of partners. The time predicted on the coaxial LAN for the first two PingPong communications (i = 0 and i = 1) is over the actual time, while it is under the actual time for the third communication (i = 2). This produces a compensation that leads to a false accuracy. In fact, this is the only architecture where the models fail to predict the decreasing behaviour of the actual time. It could be considered a surprise that the UTP times are worse than the coaxial times.

Figure 9. Model and actual times for the LANs

Figure 10. Model and actual times for the IBM SP2 and Origin 2000

This is a consequence of the fact that the coaxial LAN is beaten by the UTP LAN only in the first PingPong communication (i = 0), and that the quantity of data in this first movement is small compared with the two later communications (i = 1 and i = 2). The UTP LAN and the Origin 2000 reach error percentages of 34.30% and 34.10% for eight processors.

4.2. A sorting example

The PSRS algorithm proposed by Li et al.[16] is a straightforward example of a BSP-like algorithm where only collective communication operations are used (Figure 8).
The algorithm starts with the personalised broadcast (line 2) by processor 0 of the different segments of the array A to be sorted to the other P − 1 processors. The BSPWB time T_1 of the first M-step is:

T_1 = g·(P − 1)·(N/P) + L   (19)

In the second M-step, each of the P processors proceeds to sort its segment sequentially (line 4). A sample of size P² is selected in lines 5 and 6. The M-step ends with the gathering of the different samples by processor 0 (line 7), which gives rise to an h-relation with h = P(P − 1):

T_2 = B·(N/P)·log(N/P) + C·P + g·P·(P − 1) + L + T_1   (20)

In the third M-step the P samples are merged (line 10), the P − 1 pivots are selected (line 11), and they are sent to the other P − 1 processors (line 13):

T_3 = D·P² + E·(P − 1) + g·(P − 1)·(P − 1) + L + T_2   (21)

Using a binary search for each of the P − 1 pivots in the array A, the procedure ComputeFragments computes the array MySizes containing the sizes of the segments of A between consecutive pivots (line 15). The sizes of the fragments are exchanged through the personalised all-to-all communication at line 16. The P elements of the array Ones are initialised to 1 and the ith element of the array Linear is initialised to i. The h-relation is h = 2(P − 1), since each processor sends its P − 1 sizes and receives P − 1 sizes from the others:

T_4 = F·(P − 1)·log(N/P) + g·2(P − 1) + L + T_3   (22)

Once the processors have got all their corresponding sizes, they compute the offsets OffsetIn, where the incoming segments will be stored, the offsets OffsetOut, where the segments to be sent start, and the final number of items, MySize (line 18). The theoretical and experimental results given in [17] prove that an average size of N/P² can be expected for each segment.
At the end of the fifth M-step, each processor sends P − 1 segments of size N/P² and receives its P − 1 segments of size N/P² from the others (line 19):

T_5 = H·(P − 1) + g·2(P − 1)·(N/P²) + L + T_4   (23)

The P sorted segments are merged in line 21. In order to prepare the gathering of the segments by the root processor, the P − 1 processors send the sizes MySize of their arrays to processor 0 (line 22):

T_6 = I·(N/P) + g·(P − 1) + L + T_5   (24)

The root processor computes the displacements OffsetIn for the sorted segments at line 24. The P − 1 intervals are gathered and directly placed in the array A in their final positions at line 25:

T_7 = J·(P − 1) + g·(P − 1)·(N/P) + L + T_6   (25)

To avoid the influence of secondary factors, such as cache size, that may obscure the accuracy of the model, the computational constants have been adjusted to the number of processors. Figure 12 presents the BSPWB predicted time and the actual time for the two machines. The accuracy is such that the model and real time curves for the Cray T3E appear overlapped. The same occurs for the Origin 2000 up to 16 processors. The inferior scalability of the Origin for 24 processors increases the distance between the model and the actual Origin 2000 time.

Figure 12. PSRS predicted vs. actual times: 1,048,576 integers

5. CONCLUSIONS

The current standard message passing programming models are PVM and MPI. Although BSP and LogP consider current parallel architectures, neither of them completely adapts to these standard message passing libraries. To obtain an accurate prediction of the execution time, it is necessary to use specific libraries such as BSPLib for BSP or Active Messages for LogP. We have presented two new parallel computational models that work with standard message passing libraries on current standard architectures.
The first one, the BSP Without Barriers model, works when only collective communications are used. The second, the Message Passing Machine model, is more general and can be used to design and analyse a wide range of PVM/MPI algorithms. Based on the most commonly used communication patterns, we have proposed a new methodology to obtain the parameters of the models. The quality and accuracy of the parameters have been presented. We have also studied the validity of the h-relation hypothesis and evaluated, using PVM and MPI, the parameters g and L on five different representative platforms: a 10 Mbit coaxial Ethernet local area network, a UTP Ethernet LAN, an IBM SP2, a Cray T3E and a Silicon Graphics Origin 2000. The use and the prediction capacity of the models are illustrated through two examples, a fast Fourier transform algorithm and parallel sorting by regular sampling.

ACKNOWLEDGEMENTS

Part of this research has been carried out at the Centre de Computació i Comunicacions de Catalunya (C4), the Centro de Investigaciones Energéticas, Medioambientales y Tecnológicas (CIEMAT), and the Instituto de Astrofísica de Canarias (IAC).

REFERENCES

1. S. Fortune and J. Wyllie, 'Parallelism in random access machines', Proceedings of STOC, 1978, pp. 114–118.
2. P. B. Gibbons, 'A more practical PRAM model', Proceedings of the ACM Symposium on Parallel Algorithms and Architectures, ACM, 1989, pp. 158–168.
3. T. Leighton, Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes, Morgan Kaufmann, San Mateo, CA, 1992.
4. D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian and T. von Eicken, 'LogP: Towards a realistic model of parallel computation', Proceedings of the 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, May 1993.
5. J. Dongarra and T. Dunigan, Message-Passing Performance of Various Computers, http://www.netlib.org/utk/papers/commperf.ps, 1995.
6. J. Roda, J. Herrera, J. C. González, C. Rodríguez, F. Almeida and D. González, 'Practical experiments to improve PVM algorithms', 3rd EuroPVM Users' Group Meeting, Munich, Oct. 1996.
7. B. Schmidt and V. Sunderam, 'Empirical analysis of overheads in cluster environments', Concurrency: Pract. Exper., 6(1), 1–32 (1994).
8. L. G. Valiant, 'A bridging model for parallel computation', Commun. ACM, 33(8), 103–111 (1990).
9. D. B. Skillicorn, J. Hill and W. F. McColl, 'Questions and answers about BSP', Oxford University Computing Laboratory, Report PRG-TR-15-96, Nov. 1996.
10. J. Hill and S. Donaldson, 'Stability of communication performance in practice: from the Cray T3E to networks of workstations', Oxford University Computing Laboratory, PRG-TR-33-97, 1997.
11. IBM PVMe for AIX User's Guide and Subroutine Reference, Version 2, Release 1, Document number GC23-3884-00, IBM Corp., 1995.
12. A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek and V. Sunderam, PVM: Parallel Virtual Machine – A Users' Guide and Tutorial for Networked Parallel Computing, MIT Press, 1994.
13. J. Marín and A. Martínez, 'Testing PVM versus BSP programming', VIII Jornadas de Paralelismo, Sept. 1997, pp. 153–160.
14. J. M. Arruabarrena, A. Arruabarrena, R. Beivide and J. A. Gregorio, 'Assessing the performance of the new IBM-SP2 communication subsystem', IEEE Parallel Distrib. Technol., 4, 12–22 (1996).
15. J. Roda, C. Rodríguez, F. Almeida and D. Morales, 'Prediction of parallel algorithms performance on bus based networks using PVM', 6th EuroMicro Workshop on Parallel and Distributed Processing, Madrid, Spain, IEEE Computer Society, Jan. 1998.
16. X. Li, P. Lu, J. Schaeffer, J. Shillington, P. S. Wong and H. Shi, 'On the versatility of parallel sorting by regular sampling', Parallel Comput., 19, 1079–1103 (1993).
17. J. Hill, B. McColl, D. Stefanescu, M. W. Goudreau, K. Lang, S. Rao, T. Suel, T. Tsantilas and R. Bisseling, The BSP Programming Library, http://www.bsp-worldwide.org/, 1997.
18. J. Roda, C. Rodríguez, F. Almeida and D. Morales, 'Predicting the performance of injection communication patterns on PVM', 4th European PVM/MPI Users' Group Meeting, Cracow, Poland, Springer-Verlag, Nov. 1997.
19. M. Snir, S. Otto, S. Huss-Lederman, D. Walker and J. Dongarra, MPI: The Complete Reference, MIT Press, Cambridge, MA, 1996. http://www.netlib.org/utk/papers/mpi-book/mpi-book.html
