Predicting the execution time of message passing models

Concurrency: Pract. Exper., Vol. 11(9), 461–477 (1999)
Dpto. Estadı́stica, Investigación Operativa y Computación, Universidad de La Laguna, 38271, La Laguna,
Tenerife, Spain
Recent publications prove that runtime systems oriented to the Bulk Synchronous Parallel
Model usually achieve remarkable accuracy in their predictions. That accuracy can be seen
in the capacity of the software for packing the messages generated during the superstep and
their capability to find a rearrangement of the messages sent at the end of the superstep.
Unfortunately, barrier synchronisation imposes some limits both in the range of available
algorithms and in their performance. The asynchronous nature of many MPI/PVM programs
makes their expression difficult or infeasible using a BSP oriented library. Through the
generalisation of the concept of superstep we propose two extensions of the BSP model: the
BSP Without Barriers (BSPWB) and the Message Passing Machine (MPM) models. These new
models are oriented to MPI/PVM parallel programming. The parameters of the models and
their quality are evaluated on four standard parallel platforms. The use of these BSP extensions
is illustrated using the Fast Fourier Transform and the Parallel Sorting by Regular Sampling
algorithms. Copyright © 1999 John Wiley & Sons, Ltd.
A computing model defines the behaviour of a theoretical machine. The objective of a
model is to allow the design and analysis of parallel algorithms that can be executed
efficiently on a variety of architectures. This definition implies methodologies to design
algorithms and to compute the time spent on their executions. These methodologies
constrain the languages and lead to the implementation of their compilers for the
architectures conforming to the model. A fundamental obstacle to the widespread use of
parallel machines for general purpose computing has been the lack of a widely accepted
standard model of parallel computation. The development of a reasonable abstraction
of parallel machines is a formidable challenge. A simple and precise model of parallel
computation is necessary to guide the design and analysis of parallel algorithms. Among
the plethora of solutions proposed, the PRAM, Network, LogP and BSP models are the most
widely known. The PRAM model[1] has been widely used to represent the complexity of parallel
algorithms. The model is simple and useful for a gross classification of parallel algorithms
but is not accurate enough because all processors work synchronously and interprocessor
communication is free. The PRAM model assumes that interprocessor communication has
∗ Correspondence to: J. L. Roda, Dpto. Estadı́stica, Investigación Operativa y Computación, Universidad de La
Laguna, 38271, La Laguna, Tenerife, Spain.
CCC 1040–3108/99/090461–17$17.50
Received 28 April 1998
Revised 12 January 1999
infinite bandwidth, zero latency and zero overhead. To make it more practical, different
variants have been proposed[2].
In the Network model[3], communications are only allowed between directly connected
processors; other communications are explicitly forwarded through intermediate nodes.
Many algorithms have been created which are perfectly matched to the structure of a
particular network. However, these elegant algorithms usually do not map with equal
efficiency onto interconnection structures different from those for which they were designed.
Other models take as their starting point the fact that many current parallel
computers consist of a collection of complete computers connected through a network
interface to a multistage interconnection network. Many commercial massively parallel
computers follow this hardware organisation, and this seems to be a clear trend for the
near future. A parallel hardware/software platform, in the LogP model[4], is characterised
by four parameters: the latency (L), the communication overhead (o), the gap (g) and
the number of processors (P). The model also assumes that if a processor attempts to
transmit more than ⌈L/g⌉ unconsumed messages, it will stall until the message can be
sent without exceeding the limit. Although the model encourages the careful scheduling of
communication and overlapping of communications and computations, there is a concern
that a complete LogP analysis for non-trivial algorithms is often almost
infeasible. To be practical, one common simplification is to disregard the differentiation
between overhead o and gap g, reducing this model to a simple linear function T (n) =
L + gn, as in [5–7].
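For illustration, the simplified linear model can be sketched as follows (the parameter values below are hypothetical, chosen only to show the shape of the cost function, not measured on any of the machines discussed here):

```python
# Hypothetical parameter values for illustration only (not measured).
L = 5e-5   # start-up latency, in seconds
g = 1e-7   # gap per transmitted unit, in seconds

def transfer_time(n):
    """Simplified LogP cost of one n-unit message: T(n) = L + g*n."""
    return L + g * n
```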
The computational BSP model[8] proposes a parallel machine made of a set of
p processor-memory pairs, a global communication network and a mechanism for
synchronising the processors. A BSP calculation consists of a sequence of supersteps.
The cost of a BSP program can be calculated simply by summing up the cost of each
separated superstep executed by the program. For each superstep the cost can be divided
into: (i) local computation; (ii) global exchange of data; and (iii) barrier synchronisation.
The communication pattern performed in a given superstep is called an h-relation if the
maximum number of packets a processor communicates in the superstep is h:
h = max{in_i @ out_i : i ∈ {0, . . . , p − 1}}
where @ is a binary operator, usually the maximum or the sum. Values in_i and out_i
respectively denote the number of packets entering and leaving processor i. Both the
particular operation @ and the size of the packet depend on the particular architecture.
Operation @ is close to the maximum for machines allowing a complete parallel processing
of incoming/outgoing messages. The correct interpretation for each case depends on the
number of input/output ports of the network interface. The results presented in this paper
take operator @ as the sum.
The two basic BSP parameters that model a parallel machine are: the gap g, which
reflects per-processor network bandwidth, and the minimum duration of a superstep L,
which reflects the latency to send a packet through the network as well as the overhead to
perform a global synchronisation. The fundamental feature of the BSP model lies in the
h-relation hypothesis introduced by Valiant. It states that the communication time spent on
an h-relation is given by
Communication Time = gh
Let W denote the maximum time spent in local computation by any processor during
the superstep. The BSP model guesses that the running time of a superstep is bounded by
the formula:
Time Superstep = W + gh + L
In consequence, the design of algorithms under the BSP model tries to minimise the
number of supersteps, the maximum number of operations performed by any processor
W and the maximum number h of packets communicated. The complexity orders under the max
interpretation of @ differ from those under the sum by at most a factor of two.
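These definitions can be sketched directly; a minimal Python illustration of the superstep cost and of the two interpretations of the @ operator:

```python
def h_relation(ins, outs, op=sum):
    """h-relation of a superstep: max over processors of in_i @ out_i,
    where @ is either sum or max depending on the architecture."""
    return max(op((i, o)) for i, o in zip(ins, outs))

def superstep_time(W, h, g, L):
    """BSP bound for one superstep: local work + communication + synchronisation."""
    return W + g * h + L

# Example: two processors exchanging packets; sum and max differ by at most 2x.
ins, outs = [3, 1], [1, 3]
h_sum = h_relation(ins, outs, op=sum)   # 4
h_max = h_relation(ins, outs, op=max)   # 3
```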
Although BSP programs can be expressed using existing communication libraries such
as PVM and MPI, the counterpart is not always true. The asynchronous nature of many
MPI/PVM programs does not easily fit inside the BSP model. In the next Section we
propose two extensions to the BSP model for MPI/PVM parallel programs: the BSP
Without Barriers (BSPWB) and the Message Passing Machine (MPM) models.
It has been argued[9] that the BSP values of g and L depend not only on the hardware
performance of the target architecture but also on the software, and therefore systems not
designed with BSP in mind do not deliver good values of g and L. Runtime systems
oriented to the BSP model try to get actual machines closer to the BSP ideal machine by
packing individual messages generated during a superstep and optimising communication
time by rearranging the order in which messages are sent at the end of the superstep[10].
These design considerations strongly contribute to the remarkable accuracy of some BSP
runtime systems such as the Oxford BSP library. Beyond these statements, there are no rigorous
studies measuring the validity of the BSP model in current standard parallel platforms.
Section 3 studies the influence of the number of processors and communication patterns
in the h-relation hypothesis and estimates the gaps and start-up latencies of the BSP Without
Barriers and Message Passing Machine models on six representative platforms. Four of them
use PVM: an IBM SP2, a Silicon Graphics Origin 2000, a 10 Mbit coaxial Ethernet local area
network and a UTP Ethernet LAN. The other two use MPI: a Cray T3E and a Silicon Graphics
Origin 2000.
The IBM Scalable POWERparallel SP2 used in the experiments is a distributed-memory
parallel computer. Processors or nodes are interconnected through a High Performance
Switch (HPS). The HPS is a bi-directional multistage interconnection network. The
computing nodes are Thin2, each powered by a 66 MHz Power2 RISC System/6000
processor. The tests were implemented in PVMe[11], the improved version of the
message-passing software PVM[12]. The Origin 2000 nodes are R10000 processors (196
MHz). Origin systems use distributed shared memory (DSM). To a processor, main
memory appears as a single addressable space. The HUB ASIC four-port crossbar switch
interconnects the four interfaces of the node: processor, main memory and directory
memory, router board, and the I/O subsystem. The router board interfaces the node with the
CrayLink Interconnect fabric. Both the 10 Mbit/s coaxial LAN and the UTP LAN are made
of Sun Sparc workstations at 70 MHz running Solaris 2.5. The UTP LAN uses a FORE Systems
ES-3810 Ethernet Workgroup switch that interconnects all the computers in the
LAN. According to the data sheets, a theoretical point-to-point bandwidth of 10 Mbit/s is guaranteed. Version
3.3.11 of PVM was used for these last three platforms. The Cray T3E is a distributed-memory
multiprocessor system with 32 processors at 300 MHz connected in a 3D torus.
We have selected for our study the set Π = {E, PP, OA, AO, AA} of the five
communication patterns that most commonly appear in parallel algorithms: Exchange (E),
PingPong (PP), OneToAll (OA), AllToOne (AO) and AllToAll (AA). In an Exchange
pattern, E, a certain number p of processor pairs simultaneously send messages of length
m_E and immediately proceed to receive the messages. In the more asymmetric PingPong
pattern, PP, one of the processors in the pair sends a message of size m_PP and the other
receives the message. In the Personalised OneToAll, OA, a processor sends a different
message of size m_OA to each of the other processors. Reciprocally, in the AllToOne
pattern, AO, a single processor receives p − 1 messages of size m_AO sent by the
other processors. Under the AllToAll pattern, AA, all the p processors send their message
of size m_AA to all the other p − 1 processors. Under the @ = + model, all these five
patterns give rise to the same h-relation if the following sequence of equalities holds:
h = m_PP = 2 m_E = (p − 1) m_OA = (p − 1) m_AO = 2(p − 1) m_AA
According to the h-relation hypothesis, the communication time of these patterns for
these message sizes has to be the same. Results in Section 3 show how actual systems
deviate from this prediction.
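The message sizes m_ρ that equalise the h-relation across the five patterns (formula (4), under @ = +) can be computed as follows (a short Python sketch):

```python
def pattern_message_sizes(h, p):
    """Message sizes giving the same h-relation of size h on p processors,
    following h = m_PP = 2*m_E = (p-1)*m_OA = (p-1)*m_AO = 2*(p-1)*m_AA."""
    return {
        "PP": h,
        "E":  h // 2,
        "OA": h // (p - 1),
        "AO": h // (p - 1),
        "AA": h // (2 * (p - 1)),
    }
```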
The use of the BSPWB and MPM models proposed in the next Section will be illustrated
later in Section 4 using the Fast Fourier Transform and the Parallel Sorting by Regular
Sampling. Conclusions are presented in Section 5.
The barrier synchronisation after each step imposed by BSP does not completely
agree with the way many PVM and MPI programs are written. The execution of a
PVM or MPI program on any processor consists of phases of computation followed
by the communications necessary to provide and obtain the data for the next phase.
Communication in this context means a continuous stream of messages. We propose
a generalisation of the BSP concept of superstep to MPI/PVM programming that we
call Message steps or ‘M-steps’. In any M-step each processor performs some local
computation, sends the data needed by the other processors and receives the data it needs
for the next M-step. Processors may be in different M-steps at a given time, since no global
barrier synchronisation is used. The absence of barriers is the main difference from the
BSP programming style. However, as in pure BSP, we assume that the total number of
M-steps, R, performed by all the p processors, is the same, and communications always
occur among processors in adjacent steps k − 1 and k. Computation on any processor can
be arbitrarily divided by the designer to achieve this goal. The time ts,i when processor
i finishes its step s is bounded by what we will call the BSP Without Barriers (BSPWB)
time Ts , given by:
T_1 = max{w_{1,i}} + max{g h_{1,i} + L}, where i = 0, 1, . . . , p − 1
T_s = T_{s−1} + max{w_{s,i}} + max{g h_{s,i} + L}, where s = 2, . . . , R and i = 0, 1, . . . , p − 1
where w_{s,i} and h_{s,i} respectively denote the time spent in computing and the h-relation
established by processor i in step s, defined as:
h_{s,i} = in_{s,i} @ out_{s,i} for i = 0, 1, . . . , p − 1 and s = 1, . . . , R
and in_{s,i} and out_{s,i} respectively denote the number of packets incoming/outgoing from/to
processor i in M-step s; the @ operation is defined depending on the architecture.
Gap and latency values g and L can be computed as proposed in the next Section. Instead of
being associated with barrier synchronisation, as in BSP, parameter L here describes the
start-up time spent in the h-relation.
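The BSPWB recurrence can be evaluated mechanically. A minimal Python sketch, taking the per-step work times w[s][i] and h-relations h[s][i] as given inputs:

```python
def bspwb_times(w, h, g, L):
    """Global step bounds T_1..T_R of the BSPWB recurrence:
    T_s = T_{s-1} + max_i w_{s,i} + max_i (g*h_{s,i} + L)."""
    T, prev = [], 0.0
    for ws, hs in zip(w, h):
        prev = prev + max(ws) + max(g * hi + L for hi in hs)
        T.append(prev)
    return T
```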
A closer bound on the actual MPI/PVM time t_{s,i} when processor i finishes its sth
M-step is the value Φ_{s,i} given by the Message Passing Machine (MPM) model we propose
here. We define the set Ω_{s,i} for a given processor i and M-step s as the set
Ω_{s,i} = {j : processor j sends a message to processor i in step s} ∪ {i}
The Message Passing Machine (MPM) time of an MPI/PVM program is defined by the recurrence:
Φ_{1,i} = max{w_{1,j} : j ∈ Ω_{1,i}} + (g h_{1,i} + L), where i = 0, 1, . . . , p − 1
Φ_{s,i} = max{Φ_{s−1,j} + w_{s,j} : j ∈ Ω_{s,i}} + (g h_{s,i} + L),
where s = 2, . . . , R and i = 0, 1, . . . , p − 1
h_{s,i} = max{in_{s,j} @ out_{s,j} : j ∈ Ω_{s,i}}, where s = 1, . . . , R and i = 0, 1, . . . , p − 1
Processors in the set Ω_{s,i} are called ‘the incoming partners of processor i in step s’.
The total time of an MPI/PVM program in the MPM model is given by
Ψ = max{Φ_{R,j} : j ∈ {0, . . . , p − 1}}
where R is the total number of M-steps. Instead of bounding the time as if there were
a hypothetical global barrier synchronisation at the end of the M-step, the MPM model
assumes a processor synchronisation with its incoming partners. Formula (8) becomes the
BSPWB time of formula (5) when, for any processor i and any step s, the set Ω_{s,i}
is the whole set of processors {0, . . . , p − 1}. This is the case for the AllToAll pattern.
As depicted by the examples in Section 4, in a large number of parallel algorithms there
is a heavily loaded processor that reaches the maximum time in each M-step. In such cases
the BSPWB and the MPM models agree in their prediction.
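The MPM recurrence can be evaluated in the same way, synchronising each processor only with its incoming partners. A minimal Python sketch with illustrative numbers (chosen for the example, not taken from Figure 1): heavy work on processors 0 and 1 in step 1 is compensated in step 2, and every communication costs g h + L = 1.

```python
def mpm_times(w, h, partners, g, L):
    """Final MPM times Phi_{R,i}. partners[s][i] is the incoming-partner set
    Omega_{s,i} (it must contain i itself); w[s][i] and h[s][i] are given."""
    p = len(w[0])
    phi = [0.0] * p
    for ws, hs, Os in zip(w, h, partners):
        phi = [max(phi[j] + ws[j] for j in Os[i]) + g * hs[i] + L
               for i in range(p)]
    return phi  # the program time Psi is max(phi)

# 4 processors, 2 M-steps: step-1 partners are the pairs (0,1) and (2,3),
# step-2 partners are the pairs (0,2) and (1,3).
w = [[4, 4, 1, 1], [1, 1, 4, 4]]
h = [[1] * 4, [1] * 4]
partners = [[{0, 1}, {0, 1}, {2, 3}, {2, 3}],
            [{0, 2}, {1, 3}, {0, 2}, {1, 3}]]
phi = mpm_times(w, h, partners, g=0.0, L=1.0)
```

Here max(phi) = 7, whereas the BSPWB bound, which synchronises all processors at every step, gives T_2 = (4 + 1) + (4 + 1) = 10; the partner-wise synchronisation of the MPM model yields the tighter prediction.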
Figure 1 illustrates the differences between the two models through a bar diagram
example. The diagram corresponds to an application running on a 4-processor machine
in two steps. White areas correspond to computation while black areas stand for
communication. Bars labelled MPMi show the Message Passing Machine schedule of
processor i. Arrows indicate the source and target of messages. The chunk of bar inside
brackets pointing to label Φ_{s,i} corresponds to the MPM time of processor i during step s.
During the first step, processors 0 and 1 perform a task heavier than the one performed by
processors 2 and 3. After an exchange operation the situation is inverted and processors
0 and 1 do the lighter part compensating the former unbalance. Finally, there is another
exchange between processors 0 and 2 and processors 1 and 3. The time predicted by the
MPM model (10 s) is more accurate than the time predicted by the BSPWB model (12 s).
Figure 1. MPM and BSPWB times
The values of g and L have been computed using a linear fit on the average times T_average(h)
of the times Time_{ρ,i}(h) obtained for the five patterns ρ ∈ Π = {E, PP, OA, AO, AA}
and for the different numbers of processors i ∈ H_ρ:
T_average(h) = Σ_{ρ∈Π} (Σ_{i∈H_ρ} Time_{ρ,i}(h)/|H_ρ|)/|Π|
where H_ρ = {2, 4, 6, 8} for the Exchange and PingPong patterns, and H_ρ = {4, 6, 8}
for the OneToAll, AllToOne and AllToAll patterns. To produce the same h-relation size,
message sizes m_ρ were chosen according to formula (4).
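The linear fit itself reduces to ordinary least squares on the (h, T_average(h)) pairs; a self-contained sketch, checked here against synthetic data generated from T = 2h + 5:

```python
def fit_g_L(hs, times):
    """Least-squares fit of T(h) = g*h + L; returns (slope g, intercept L)."""
    n = len(hs)
    mh, mt = sum(hs) / n, sum(times) / n
    g = sum((h - mh) * (t - mt) for h, t in zip(hs, times)) / \
        sum((h - mh) ** 2 for h in hs)
    return g, mt - g * mh

# Synthetic check: data generated from T = 2.0*h + 5.0 is recovered exactly.
g, L = fit_g_L([1, 2, 3, 4], [7.0, 9.0, 11.0, 13.0])
```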
3.1. PVM values of g and L
Figure 2 shows the variation of g_{ρ,i}, expressed in seconds per byte, for the IBM SP2 and
the Origin 2000 with the number i of processors. The group of curves between 3 × 10−8
and 4.5 × 10−8 s/byte corresponds to the IBM SP2. Observe that all the Origin 2000
curves are under the IBM SP2 curves, and that the Origin 2000 is between two and
three times faster than the IBM SP2. For both machines, the times are less sensitive to the
number of processors than to the communication pattern. The slowest pattern on both
machines corresponds to the most sequential one: the PingPong pattern, where all the h
bytes are sent through the output port. The other patterns benefit from the existence of
different ports for input and output. Compare the BSPWB PVM g values for the IBM SP2
with the corresponding Oxford BSP library values[13] for the same machine at C4[14]:
g′ = 35 × 10−8 s/word = 8.75 × 10−8 s/byte. The Oxford BSP value is larger since
the definition of the h-relation is based on the operator @ = max and the estimation of g is
based on the AllToAll pattern[10].
Figure 2. g values for the IBM SP2 and Origin 2000 in PVM
Figure 3. g values for the UTP LAN in PVM
Figures 3 and 4 present the corresponding variation of g for the UTP LAN and Coaxial
LAN, respectively. The value of g for the coaxial LAN decreases from 2 × 10−5 for the
Exchange pattern to 5 × 10−6 for the OneToAll. The improvement of the value of g with the
number of processors in the OneToAll pattern, not only on the coaxial LAN but also on the
IBM SP2 and the Origin, is remarkable. The authors of previous work[15] have studied this
phenomenon. The poor performance of the Exchange and AllToAll patterns in the single
bus network can be explained by the increase in the number of collisions.
Table 1 presents the BSPWB and MPM values of g and L for the four architectures
Figure 4. g values for the coaxial LAN in PVM
Table 1. Values of g (s/word) and L (s) using PVM

Machine        L                G
Origin 2000    −1.53 × 10−5     4.88 × 10−8
IBM SP2        8.08 × 10−5      1.38 × 10−7
UTP LAN        3.82 × 10−3      6.04 × 10−6
Coaxial LAN    1.26 × 10−2      1.01 × 10−5
considered using PVM. The negative value of the Latency in the Origin 2000 is due to the
unsuitability of taking one byte as the unit packet size. The appropriate value of the unit
packet size depends on the specific architecture. For the Origin 2000, a 512-byte packet
size delivers a positive value. In order to compare the four architectures we have preferred
to express h-relation sizes in words.
3.2. MPI values of g and L
We have also computed the values for MPI and for a larger number of processors. In this
case, the Origin 2000 and the Cray T3E were used.
Figures 5 and 6 show the variation of g expressed in seconds per word for the Origin
2000 and the Cray T3E architectures. A perfect network holding the h-relation hypothesis
is one for which all the six gρ,i curves are overlapped in a single horizontal line.
The h-relation time is again less sensitive to the number of processors than to the
communication pattern. There are two differentiated clusters of curves. The permutation
patterns {E, PP} are slower than the collective patterns {OA, POA, AO, AA}. While the
time of permutation patterns grows with the number of processors, the time of collective
patterns takes advantage of the processors to parallelise the movement of the given
h-relation. The slowest pattern on the two machines is the PingPong, since all the h words
are sequentially sent through the output port. The other patterns benefit from the existence
Figure 5. g values for the Origin 2000 in MPI
Figure 6. g values for the Cray T3E in MPI
of different ports for input and output. The improvement of the OneToAll time compared
to the Personalised OneToAll time is remarkable. The OA and POA pattern curves
run in parallel. The improvement of OA over POA gives a measure of the
optimisation reached in the implementation of the software/hardware platform. Although
the OneToAll has been considered a (P − 1) × m h-relation in formula (4), the actual number
of packets injected by the root processor is surely smaller and depends on the particular
MPI_Bcast() implementation: the OneToAll curve cannot be considered a proper h-relation
as specified in formula (4). Thus, the variation ranges of Figures 5 and 6 narrow to
Table 2. Values of g (s/word) and L (s) using MPI

Machine        L                G
Cray T3E       −2.11 × 10−5     2.92 × 10−8
Origin 2000    −1.75 × 10−3     5.53 × 10−8
3 × 10−8 for the Cray T3E and 6 × 10−8 for the Origin 2000.
The BSP Without Barriers values of g and L in Table 2 have been computed using a
linear fit on the average times T_average(h) of the times Time_{ρ,i}(h) obtained for ρ ∈ Π =
{E, PP, OA, POA, AO, AA} and for the different numbers of processors i ∈ H_ρ:
T_average(h) = Σ_{ρ∈Π} (Σ_{i∈H_ρ} Time_{ρ,i}(h)/|H_ρ|)/|Π|
where H_ρ = {2, 4, 6, 8, 16, 24} for the Exchange and PingPong patterns and H_ρ =
{4, 6, 8, 16, 24} for the OneToAll, Personalised OneToAll, AllToOne and AllToAll
patterns. To produce the same h-relation size, message sizes m_ρ were chosen according
to formula (4). The h-relation sizes were in the range of 4330 to 4945920 32-bit words.
To simplify the expression of the BSPWB and MPM M-step partition associated with
an MPI program, we superimpose commented code into the example codes in Figures 7
and 8. When uncommented, the variable M_step contains at any time the value of the current
BSPWB and MPM M-step. This M_step variable may be used as an additional tag when
communicating messages, to check the validity of the BSPWB and MPM programs.
The use of this tag also solves the well-known library-user tag conflict appearing in the old
versions of PVM that gave rise to the concept of communicator.
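A sketch of this tagging discipline (Python, with a toy mailbox standing in for the real PVM/MPI send and receive; the adjacency check reflects the rule that messages travel only between adjacent M-steps):

```python
class Proc:
    """Toy stand-in for a PVM/MPI process: an M-step counter and a mailbox."""
    def __init__(self):
        self.m_step = 1
        self.inbox = []

def send(src, dst, payload):
    dst.inbox.append((src.m_step, payload))  # the sender's M-step travels as a tag

def recv(p):
    tag, payload = p.inbox.pop(0)
    # a valid BSPWB/MPM program only sees messages from adjacent M-steps
    assert abs(tag - p.m_step) <= 1, "message from a non-adjacent M-step"
    return payload
```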
4.1. The fast Fourier algorithm example
We will illustrate the BSPWB and MPM models using a parallel algorithm to compute the
fast Fourier transform. Consider a sequence A = (A[0], . . . , A[N − 1]) of length N. The
DFT of the sequence A is the sequence C = (C[0], . . . , C[N − 1]), given by
C[i] = Σ_{k=0...N−1} A[k] w^{ki}
where w = e^{2π√−1/N} is the primitive Nth root of unity in the complex plane. The
following division can be deduced from the definition:
C[i] = Σ_{k=0...N/2−1} A[2k] (w²)^{ki} + w^i Σ_{k=0...N/2−1} A[2k + 1] (w²)^{ki}
From this formula it follows that the DFT C of A can be obtained by combining B, the
DFT of the even components and D, the DFT of the odd components of A. Figure 7 shows
the pseudo-code executed by processor NAME. At the beginning of the computation, all
the processors hold a copy of vector A.
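The even/odd division above can be checked numerically; a short Python sketch (using the positive-exponent convention w = e^{2π√−1/N}, as in the definition):

```python
import cmath

def dft(a):
    """Direct DFT: C[i] = sum_k A[k] * w**(k*i) with w = exp(2*pi*j/N)."""
    n = len(a)
    return [sum(a[k] * cmath.exp(2j * cmath.pi * k * i / n) for k in range(n))
            for i in range(n)]

def dft_divide(a):
    """Recursive DFT via the even/odd division formula."""
    n = len(a)
    if n == 1:
        return list(a)
    b = dft_divide(a[0::2])          # DFT of the even components
    d = dft_divide(a[1::2])          # DFT of the odd components
    w = cmath.exp(2j * cmath.pi / n)
    return [b[i % (n // 2)] + w ** i * d[i % (n // 2)] for i in range(n)]
```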
Figure 7. FFT pseudo-code
In lines 1 to 4, processors compute the N/ p subvector of A resulting in the odd–even
pattern given by the bits of its NAME in reverse order. At this point, they proceed to
compute the sequential DFT (line 5). No communications are required until this point.
Lines 12 to 17 contain the combination phase. Processors with the same log(p) − 1 − i
bits share the same i ancestors in the division tree. Thus, an active processor with its bit i
equal to 1 sends its odd subvector to the processor that has computed the matching even
part, giving rise to an h = N × 2^i/p-relation (lines 7–11).
The time spent by the algorithm in the first M-step is broken down into the division time
(lines 1–4) of order O(N/p), the corresponding DFT time O((N/p) log(N/p)) and the first
communication (lines 6–11):
T_1 = Φ_{1,NAME} = t_{1,NAME} = D(N/p) + F(N/p) log(N/p) + L + g(N/p)
For s = 2, . . . , log(p) the time of the M-step s is composed of the combination of lines
12–17 with loop index i = s − 2 plus the time spent in the communication in lines 7–11
with loop index i = s − 1:
T_s = Φ_{s,NAME} = t_{s,NAME} = T_{s−1} + R N 2^{s−2}/p + (L + g(N 2^{s−1}/p))
and Ω_{s,NAME} = {NAME, NAME XOR 2^{s−1}} if processor NAME is active; otherwise we
assume that processor NAME makes an empty (ghost) step. The last M-step, s = log(p) + 1,
performed by processor NAME = 0, only includes a combination:
T_{log(p)+1} = Φ_{log(p)+1,0} = t_{log(p)+1,0} = T_{log(p)} + R N/2
Table 3 contains the values of the three computational constants D, F and R. Values for
the Coaxial and UTP platforms are the same, since the processors are the same.
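Putting the M-step times together, the predicted FFT time can be evaluated as follows (a Python sketch; the constants D, F, R and the parameters g, L are taken as inputs, and base-2 logarithms are assumed):

```python
import math

def fft_predicted_time(N, p, D, F, R, g, L):
    """Sum of the FFT M-step times: division plus local DFT, log(p)-1
    combination/communication steps, and the final combination on processor 0."""
    t = D * (N / p) + F * (N / p) * math.log2(N / p) + L + g * (N / p)
    for s in range(2, int(math.log2(p)) + 1):
        t += R * N * 2 ** (s - 2) / p + L + g * N * 2 ** (s - 1) / p
    return t + R * N / 2
```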
Figure 8. MPI code for the PSRS
Table 3. Values of the computational constants (s)

Machine             D              F              R
Origin 2000         2.39 × 10−7    5.52 × 10−7    2.84 × 10−7
IBM SP2             5.86 × 10−7    4.81 × 10−7    8.69 × 10−7
Coaxial/UTP LANs    2.65 × 10−6    6.04 × 10−7    1.44 × 10−6
The accuracy of the models for a 524,288-element complex instance is shown in Figures 9 and
10. The error percentage displayed in Figure 11 has been computed using the formula
Error = 100 × (Actual Time − Model Time)/Actual Time
Model and real time curves corresponding to the IBM SP2 appear overlapped. The
error grows with the number of processors. The curious exception is the Coaxial LAN.
The PingPong pattern occurs in the FFT algorithm with 1, 2 and 4 pairs of partners. The
time predicted in the coaxial LAN for the first two PingPong communications (i = 0 and
i = 1) is over the actual time, while it is under the actual time for the third communication
(i = 2). This produces a compensation that leads to a false accuracy. In fact, this is the
only architecture where the models fail their prediction of a decreasing behaviour in the
actual time. It could be considered a surprise that UTP times are worse than Coaxial times.
Figure 9. Model and actual times for the LANs
Figure 10. Model and actual times for the IBM SP2 and Origin 2000
This is a consequence of the fact that the Coaxial LAN is beaten by the UTP LAN only
in the first PingPong communication (i = 0), and that the quantity of data in this first
movement is small compared with the other two later communications (i = 1 and i = 2).
The UTP LAN and the Origin 2000 reach an error percentage of 34.30% and 34.10% for
eight processors.
4.2. A sorting example
The PSRS proposed by Li et al.[16] is a straightforward example of a BSP-like algorithm
where only collective communication operations are used (Figure 8).
The algorithm starts with the personalised broadcast (line 2), in which processor 0 sends a
different segment of size N/P of the array A to sort to each of the other P − 1 processors.
The BSPWB time T_1 of the first M-step is:
T_1 = g(P − 1)(N/P) + L
Figure 11. FFT error percentage
In the second M-step, each of the P processors proceeds to sequentially sort its segment
(line 4). A sample of size P² is selected in lines 5 and 6. The M-step ends with the
gathering of the different samples by processor 0 (line 7). It gives rise to an h-relation
with h = P(P − 1):
T_2 = B(N/P) log(N/P) + C P + g P(P − 1) + L + T_1
In the third M-step the P samples are merged (line 10), the P − 1 pivots are selected
(line 11) and they are sent to the other P − 1 processors (line 13),
T_3 = D P² + E(P − 1) + g(P − 1)² + L + T_2
Using a binary search for each of the P − 1 pivots in the array A, procedure
ComputeFragments computes the array MySizes containing the sizes of the segments of
A between consecutive pivots (line 15). The sizes of the fragments are exchanged through
the personalised all-to-all communication at line 16. The P elements of the array Ones are
initialised to 1 and the ith element of the array Linear is initialised to i. The h-relation is
h = 2(P − 1), since each processor sends its P − 1 sizes and receives P − 1 sizes from
the others:
T_4 = F(P − 1) log(N/P) + g 2(P − 1) + L + T_3
Once the processors have got all their corresponding sizes they compute the offsets
OffsetIn, where the incoming segments will be stored, the offsets OffsetOut, where the
segments to be sent start, and the final number of items, MySize (line 18). The theoretical
and experimental results given in [17] prove that an average size of N/P² for each segment
can be expected. At the end of the fifth M-step, each processor sends the P − 1 segments
of size N/P² and receives its P − 1 segments of size N/P² from the others (line 19):
T_5 = H(P − 1) + g 2(P − 1)(N/P²) + L + T_4
Figure 12. PSRS predicted vs. actual times: 1048576 integers
The P sorted segments are merged in line 21. In order to prepare the gathering of the
segments by the root processor, the P − 1 processors send to processor 0 the sizes MySize
of their arrays (line 22):
T_6 = I(N/P) + g(P − 1) + L + T_5
The root processor computes the displacements OffsetIn for the sorted segments at
line 24. The P − 1 intervals are gathered and directly placed in the array A in their final
position at line 25,
T_7 = J(P − 1) + g(P − 1)(N/P) + L + T_6
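The whole PSRS computation can be simulated sequentially to check the algorithmic steps described above (a Python sketch; the P 'processors' are just list slices, and the line numbers in the comments refer to the pseudo-code of Figure 8):

```python
import bisect

def psrs(a, P):
    """Sequential simulation of Parallel Sorting by Regular Sampling."""
    n = len(a)
    # lines 2-4: processor 0 deals out P segments; each one is sorted locally
    seg = [sorted(a[i * n // P:(i + 1) * n // P]) for i in range(P)]
    # lines 5-7: each processor takes P regularly spaced samples; the root gathers P*P
    sample = sorted(s[len(s) * j // P] for s in seg for j in range(P))
    # lines 10-13: the root merges the samples and broadcasts P-1 regular pivots
    pivots = sample[P::P]
    # lines 15-19: each segment is cut at the pivots; fragment i goes to processor i
    buckets = [[] for _ in range(P)]
    for s in seg:
        cuts = [0] + [bisect.bisect_left(s, pv) for pv in pivots] + [len(s)]
        for i in range(P):
            buckets[i].extend(s[cuts[i]:cuts[i + 1]])
    # lines 21-25: each processor merges its fragments; the root gathers the result
    return [x for b in buckets for x in sorted(b)]
```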
To avoid the influence of secondary factors, such as cache size, that may obscure the
accuracy of the model, the computational constants have been adjusted to the number of processors.
Figure 12 presents the BSPWB predicted time and the actual time for the two machines.
The accuracy is such that the model and real time curves for the Cray T3E appear
overlapped. The same occurs for the Origin 2000 up to 16 processors. The inferior
scalability of the Origin for 24 processors increases the distance between the model and
the actual Origin 2000 time.
The current standard message passing programming models are PVM and MPI. Although BSP
and LogP consider current parallel architectures, neither of them completely adapts to these
standard message passing libraries. To obtain an accurate prediction of the execution time,
it is necessary to use specific libraries such as BSPLib for BSP or Active Messages for LogP.
We have presented two new parallel computational models that work under standard
message passing libraries and current standard architectures. The first one, the BSP
Without Barriers model, works when only collective communications are used. The
second, the Message Passing Machine model, is more general and can be used to design
and analyse a wide range of PVM/MPI algorithms.
Based on the most commonly used communication patterns, we have proposed a new
methodology to obtain the parameters of the models. The quality and accuracy of the
parameters have been presented. We have also studied the validity of the h-relation
hypothesis and evaluated, using PVM and MPI, the parameters g and L in five different
representative platforms: a 10 Mbit coaxial Ethernet local area network, a UTP Ethernet
LAN, an IBM SP2, a Cray T3E and a Silicon Graphics Origin 2000. The use and the prediction
capacity of the models are illustrated through two examples, a fast Fourier transform
algorithm and the parallel sorting by regular sampling.
Part of this research has been carried out at the Centre de Computació i Comunicacions de
Catalunya (C4), the Centro de Investigaciones Energéticas, Medioambientales y Tecnológicas
(CIEMAT), and the Instituto de Astrofísica de Canarias (IAC).
1. S. Fortune and J. Wyllie, ‘Parallelism in random access machines’, Proceedings of STOC, 1978,
pp. 114–118.
2. P. B. Gibbons, ‘A more practical PRAM model’, Proceedings of the ACM Symposium on
Parallel Algorithms and Architectures, ACM, 1989, pp. 158–168.
3. T. Leighton, Introduction to Parallel Architectures; Arrays, Trees, Hypercubes, Morgan
Kaufmann, San Mateo, CA, 1992.
4. D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian and
T. von Eicken, ‘LogP: Towards a realistic model of parallel computation’, Proceedings of the
4th ACM SIGPLAN, Sym. Principles and Practice of Parallel Programming, May 1993.
5. J. Dongarra and T. Dunigan, Message-Passing Performance of Various Computers. 1995.
6. J. Roda, J. Herrera, J. C. González, C. Rodrı́guez, F. Almeida and D. González, ‘Practical
experiments to improve PVM algorithms’, 3rd EUROPVM Meeting Group, Munich, Oct.
7. B. Schmidt and V. Sunderam, ‘Empirical analysis of overheads in cluster environments’,
Concurrency: Pract. Exp., 6(1), 1–32 (1994).
8. L. G. Valiant, ‘A bridging model for parallel computation’, Commun. ACM, 33(8), 103–111 (1990).
9. D. B. Skillcorn, J. Hill and W. F. McColl, ‘Questions and answers about BSP’, Oxford
University Computing Laboratory, Report PRG-TR-15-96, Nov. 1996.
10. J. Hill and S. Donaldson, ‘Stability of communication performance in practice: from the Cray T3E to networks of workstations’, Oxford University Computing Laboratory, Report PRG-TR-33-97.
11. IBM PVMe for AIX User’s Guide and Subroutine, Reference Version 2, Release 1, Document
number GC23-3884-00, IBM Corp., 1995.
12. A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek and V. Sunderam, PVM: Parallel
Virtual Machine – A User’s Guide and Tutorial for Network Parallel Computing, MIT Press, 1994.
Paralelismo, Sept. 1997, pp. 153–160.
14. J. M. Arruabarrena, A. Arruabarrena, R. Beivide and J. A. Gregorio, ‘Assessing the performance
of the new IBM-SP2 communication subsystem’, IEEE Parallel Distrib. Technol., 4, 12–22
15. J. Roda, C. Rodrı́guez, F. Almeida and D. Morales, ‘Prediction of parallel algorithms
performance on bus based networks using PVM’, 6th EuroMicro Workshop on Parallel and
Distributed Processing, Madrid, Spain, IEEE Computer Society, Jan. 1998.
16. X. Li, P. Lu, J. Schaefer, J. Shillington, P. S. Wong and H. Shi, ‘On the versatility of parallel
sorting by regular sampling’, Parallel Comput., 19, 1079–1103 (1993).
17. J. Hill, B. McColl, D. Stefanescu, M. W. Goudreau, K. Lang, S. Rao, T. Suel, T. Tsantilas
and R. Bisseling, The BSP Programming Library, 1997.
18. J. Roda, C. Rodrı́guez, F. Almeida and D. Morales, ‘Predicting the performance of injection
communication patterns on PVM’, 4th European PVM/MPI Meeting Group, Cracow, Poland,
Springer-Verlag, Nov. 1997.
19. M. Snir, S. Otto, S. Huss-Lederman, D. Walker and J. Dongarra, MPI: The Complete Reference,
MIT Press, Cambridge, MA, 1996.