CONCURRENCY: PRACTICE AND EXPERIENCE
Concurrency: Pract. Exper., Vol. 11(15), 887–911 (1999)
Managing clusters of geographically distributed
high-performance computers
MATTHIAS BRUNE1∗ , JÖRN GEHRING2 , AXEL KELLER2 AND ALEXANDER REINEFELD1
1 Konrad-Zuse-Zentrum für Informationstechnik Berlin, Takustrasse 7, D-14195 Berlin-Dahlem, Germany
2 Paderborn Center for Parallel Computing, Fürstenallee 11, D-33102 Paderborn, Germany
(e-mail: [email protected], [email protected], [email protected], [email protected])
SUMMARY
We present a software system for the management of geographically distributed high-performance computers. It consists of three components: 1. The Computing Center Software (CCS) is a vendor-independent resource management software for local HPC systems. It controls the mapping and scheduling of interactive and batch jobs on massively parallel systems; 2. The Resource and Service Description (RSD) is used by CCS for specifying and mapping hardware and software components of (meta-)computing environments. It has a graphical user interface, a textual representation and an object-oriented API; 3. The Service Coordination Layer (SCL) co-ordinates the co-operative use of resources in autonomous computing sites. It negotiates between the applications’ requirements and the available system services. Copyright © 1999 John Wiley & Sons, Ltd.
1. INTRODUCTION
With the increasing availability of fast interconnection networks, high-performance computing (HPC) has undergone a metamorphosis from the use of local computing facilities towards network-centered computing. One key motivation for this trend is the aim to better utilize expensive hardware and software by linking supercomputers into one virtual metacomputer[1] or metasystem[2]. In such environments, users benefit from faster response times due to the use of resources that are better suited to their application’s demands than their own local computer. These days, not only academia but also industry[3] is attracted by the potential benefits of metasystems.
The great variety of metacomputer components (computers, networks, files, software services) and their dynamic behavior pose special problems for the resource management regime. It must be able to cope with unreliable networks and with heterogeneity at multiple levels: administrative domains, scheduling policies, operating systems, and protocols. The use of shared resources makes it necessary to constantly update the status of the system and, if necessary, to re-configure the application environment. From the user’s view, a metacomputer should be as easy to use as a personal workstation. This implies a vendor-independent access interface and a transparent mapping of applications onto a set of suitable target platforms.
∗ Correspondence to: Matthias Brune, Konrad-Zuse-Zentrum für Informationstechnik Berlin, Takustrasse 7,
D-14195 Berlin-Dahlem, Germany.
Contract/grant sponsor: European Union; Contract/grant number: MICA 20966; PHASE 23486; DYNAMITE 23499
Contract/grant sponsor: German Ministry of Research and Technology (BMBF), project UNICORE
CCC 1040–3108/99/150887–25$17.50
Copyright © 1999 John Wiley & Sons, Ltd.
Received 20 April 1999
Revised July 1999
In this paper we describe the architecture of a software system for access and
management of geographically distributed HPC systems. There are three key components:
1. The Computing Center Software (CCS) is a resource management software for the
access and system administration of HPC systems that are operated in a single site.
It gives a vendor-independent interface to parallel systems and it provides tools for
specifying, configuring and scheduling system components.
2. The Resource and Service Description (RSD) is used by CCS for specifying
hardware and software components of (meta-)computing environments. It provides a
graphical user interface, a descriptive language interface and an object oriented API.
3. The Service Coordination Layer (SCL) links autonomous computing sites to a
metasystem for registered applications. It provides a ‘brokerage’ between the
application requirements and the distributed services.
While the CCS software alone could be regarded as ‘just another resource management system’, the whole environment must be seen in a broader context. Taken together, the components CCS, RSD and SCL form the nucleus of an open metacomputer environment[4], analogous to the well-known GLOBUS[5] project. In fact, some of our components are similar to those implemented in the GLOBUS environment; for a comparison see Section 6.1.
The remainder of this paper is organized as follows. In the next section, we present the architecture
of CCS and its technical implementation. Thereafter, we introduce the modules required
for metacomputing (Center Resource Manager and Center Information Service) and show
how they work together. Section 4 describes the Service Coordination Layer (SCL). In
Section 5, we present the Resource and Service Description tool (RSD). Section 6 briefly
reviews some related projects and Section 7 summarizes the work presented here.
2. COMPUTING CENTER SOFTWARE
Back in 1992, we started the CCS project [6–8] because we were in need of a vendor-independent resource management system that
(a) supports very large parallel systems (≥1000 PEs)
(b) handles interactive and batch jobs concurrently
(c) performs dynamic system partitioning and scheduling
(d) provides a high degree of fault tolerance for remote user access.
CCS first became operational on a 1024-node transputer system and was later adapted
to Parsytec MPPs and Intel clusters. Aiming at portability, CCS was designed to run on a variety of UNIX derivatives: Linux, SunOS, Solaris, and AIX. The program code has been kept modular so that it can be easily ported to other operating systems, cf. Section 2.2.
2.1. CCS program architecture
2.1.1. The island concept
Initially, CCS was designed to manage all machines in a computing center and to provide
one single, coherent user interface. This caused performance bottlenecks at the central
scheduler and, of course, the system was not fault tolerant. Hence, in the current (4th) release, we decided to manage each machine with an autonomous ‘CCS island’.

[Figure: a CCS island consists of a user access layer (User Interfaces, UI), a management layer (Access Manager AM, Island Manager IM, Queue Manager QM, Machine Manager MM) and the hardware layer.]
Figure 1. Architecture of a CCS ‘island’
An island has six components (see Figure 1), which will be described in more detail later
in the text:
(i) The User Interface (UI) provides X- or ASCII-access to the system features. It wraps
the physical and technical system characteristics and provides a homogeneous access
to single or multiple heterogeneous systems.
(ii) The Access Manager (AM) manages user interfaces and is responsible for
authorization and accounting.
(iii) The Queue Manager (QM) schedules user requests.
(iv) The Machine Manager (MM) has full control over the parallel system.
(v) The Island Manager (IM) provides name services and watchdog functions for
reliability purposes.
(vi) The Operator Shell (OS) allows administrators to control the system.
With the island concept, scalability, reliability and error recovery have been improved by separating the management of different machines into different islands. Each machine uses a dedicated scheduling strategy and can therefore be operated in a different mode (e.g.,
interactive, batch, mixed). Additional user interfaces can be implemented to cope with
special system features.
2.1.2. Reliability
When a communicating process in a distributed system does not receive an answer from
its partner, it is unclear whether the network is down, whether it is congested or whether
the communication partner has died. Hence, there is a need for an instance that has global
and up-to-date information on the status of all system components. In CCS, this instance
is the Island Manager (IM).
When the system is started or shut down, all CCS daemons must register at the IM.
During runtime, the IM maintains a consistent view of the status of all processes in an
island. The IM is authorized to stop erroneous daemons or to restart crashed ones.
In addition, the IM also provides name services. It maintains an address translation table
that matches the symbolic names <center, island, process> to inet addresses.
This provides a level of indirection, thereby allowing the IM to migrate daemons to other
hosts in the case of overloads or system crashes. With the symbolic names, several CCS
islands can be run concurrently on a single host.
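To make the name service concrete, the following C fragment sketches a minimal translation table that maps symbolic <center, island, process> names to inet addresses. The table layout, the entries and the (documentation) addresses are invented for this illustration; the paper does not show the IM’s actual data structures.

#include <stdio.h>
#include <string.h>

struct name_entry {
    const char *center;        /* e.g. "PC2"  */
    const char *island;        /* e.g. "SCI"  */
    const char *process;       /* e.g. "QM"   */
    const char *inet_addr;     /* current host of the daemon */
    unsigned short port;
};

/* Static example table; the real IM updates its table whenever daemons
 * register, crash, or are migrated to another host.  The addresses below
 * are documentation addresses, not real ones. */
static const struct name_entry table[] = {
    { "PC2", "SCI", "QM", "192.0.2.10", 6001 },
    { "PC2", "SCI", "MM", "192.0.2.11", 6002 },
    { "ZIB", "T3E", "QM", "192.0.2.20", 6001 },
};

static const struct name_entry *
im_resolve(const char *center, const char *island, const char *process)
{
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        if (strcmp(table[i].center, center) == 0 &&
            strcmp(table[i].island, island) == 0 &&
            strcmp(table[i].process, process) == 0)
            return &table[i];
    return NULL;               /* unknown daemon */
}

int main(void)
{
    const struct name_entry *e = im_resolve("PC2", "SCI", "QM");
    if (e)
        printf("QM of island PC2/SCI is reachable at %s:%u\n",
               e->inet_addr, e->port);
    return 0;
}

Because clients address daemons only by their symbolic names, the IM can change the address stored behind a name when it migrates or restarts a daemon, and communication partners simply resolve the name again when they re-bind.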
2.1.3. User management
The User Interface (UI) runs in a standard UNIX shell environment (e.g., tcsh, bsh, ssh).
Common UNIX mechanisms for IO-redirection, piping and shell scripts can be used and all job control signals (ctrl-z, ctrl-c, ...) are supported. Five CCS commands are available:
(i) ccsalloc for allocating and/or reserving resources
(ii) ccsbind for re-connecting to a lost interactive application/session
(iii) ccsinfo for displaying information on the job schedule, users, job status etc.
(iv) ccsrun for starting jobs on previously reserved resources
(v) ccskill for resetting or killing jobs and/or for releasing resources.
The Access Manager (AM) analyzes the user requests and is responsible for authentication, authorization and accounting. CCS supports project-specific user management. Privileges can be granted to either a whole project or to specific project members, for example
• access rights per machine (batch, interactive, the right to reserve resources)
• access grants by time of day (day, night, weekend, ...)
• maximum number of concurrent resources
• accounting per machine (product of CPU time and #PEs).
User requests are sent to the Queue Manager (QM) which schedules the jobs according
to the current policy. CCS provides several scheduling modules[9] that can be plugged in
by the system administrator.
2.1.4. Job scheduling
CCS has been designed for managing exclusive resources – sometimes also called space-sharing (as opposed to time-sharing). However, with the adaptation to the management of clusters, we are currently upgrading CCS for time-shared resources. As in Condor, Codine, or LSF, the system administrator may specify a maximum load factor that is allowed on each node.

Figure 2. Scheduler GUI displaying the scheduled nodes (vertical axis) over the time axis
In their resource requests, users must specify the expected CPU time of their jobs. Based on this information, CCS determines a fair and deterministic schedule. Both batch and interactive requests are processed in the same scheduler queue. The request scheduling problem is modeled as an n-dimensional bin-packing problem, where one dimension corresponds to the continuous time flow, and the other n − 1 dimensions represent system characteristics, such as the number of processor elements. Currently, CCS uses an enhanced first-come-first-served (FCFS) scheduler with backfilling[9], which best fits the request profile at our center. Waiting times are reduced by first checking whether a new request fits into a gap of the current schedule, as sketched below.
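The gap test at the heart of backfilling can be illustrated with the following simplified C sketch: a single resource dimension (number of PEs) over continuous time, with candidate start times taken from the current time and the boundaries of existing allocations. The data layout and numbers are invented, and the production CCS scheduler is certainly more elaborate.

#include <stdio.h>

#define MACHINE_PES 1024
#define NRES 3

struct slot { double start, end; int pes; };   /* an existing allocation */

/* PEs still free during the whole interval [t, t+dur) */
static int free_pes(const struct slot *sched, int n, double t, double dur)
{
    int used = 0;
    for (int i = 0; i < n; i++)
        if (sched[i].start < t + dur && t < sched[i].end)   /* overlaps */
            used += sched[i].pes;
    return MACHINE_PES - used;
}

/* Earliest start >= now at which a (pes, dur) request fits into a gap.
 * The earliest feasible start is always 'now', the end of an existing
 * slot, or a slot start minus the requested duration. */
static double earliest_fit(const struct slot *sched, int n,
                           double now, int pes, double dur)
{
    double best = -1.0;
    for (int k = 0; k < 2 * n + 1; k++) {
        double t = (k == 0) ? now
                 : (k <= n) ? sched[k - 1].end
                            : sched[k - n - 1].start - dur;
        if (t < now)
            continue;
        if (free_pes(sched, n, t, dur) >= pes && (best < 0.0 || t < best))
            best = t;
    }
    return best;       /* -1.0 if the request exceeds the machine size */
}

int main(void)
{
    struct slot sched[NRES] = {
        { 0.0, 4.0,  512 },   /* running batch job                      */
        { 2.0, 6.0,  256 },   /* interactive job                        */
        { 8.0, 10.0, 1024 },  /* fixed reservation of the whole machine */
    };
    /* a 768-PE, 2-hour request fits into the gap starting at t = 4 */
    printf("earliest start: %.1f\n", earliest_fit(sched, NRES, 0.0, 768, 2.0));
    return 0;
}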
Figure 2 shows a graphical representation of the current schedule, which is displayed in
an X-window. The corresponding software has been implemented in Tcl/Tk.
With CCS, users may reserve resources for a given time in the future. This is a convenient feature when planning interactive sessions or online events. As an example, suppose we want to run an application on 32 nodes of an SCI cluster from 9 to 11 am on 13 February 1999. This is done with the command ccsalloc -m SCI -n 32 -s 9:13.2.99 -t 2h.
Deadline scheduling is another useful feature. Here, CCS guarantees the job to be
completed at (or before) the specified time. A typical scenario for this feature is an
overnight run that must be finished when the user comes back to the office the next
morning. Deadline scheduling gives CCS the flexibility to improve the system utilization
by scheduling batch jobs at the earliest convenient and at the latest possible time.
The CCS scheduler is able to handle two kinds of requests: those that are fixed in time and those that are variable. A resource that has been reserved for a given time frame is fixed: it cannot be shifted on the time axis (see the hatched rectangles in Figure 2). Interactive requests, in contrast, can be scheduled earlier but not later than asked for. Such shifts on the time axis may occur when resources are released before their estimated finishing time. An illustrative request representation is sketched below.

Figure 3. Administrator interface to the machine manager daemon
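As a small illustration of this distinction, the structure below models the two request kinds with an explicit ‘fixed’ flag; a variable request may only be moved to an earlier start. The field names are invented and not taken from CCS.

#include <stdio.h>
#include <stdbool.h>

struct request {
    int    pes;            /* number of processing elements            */
    double duration;       /* estimated run time (hours)               */
    double planned_start;  /* position in the current schedule         */
    bool   fixed;          /* true: reservation/deadline, may not move */
};

/* A variable request may be moved to an earlier start (e.g. when other
 * jobs finish early), but never to a later one; fixed requests stay put. */
static bool may_move_to(const struct request *r, double new_start)
{
    if (r->fixed)
        return new_start == r->planned_start;
    return new_start <= r->planned_start;
}

int main(void)
{
    struct request reservation = { 32, 2.0, 9.0, true  };
    struct request interactive = { 16, 1.0, 9.0, false };
    printf("move reservation to 8.0?  %s\n", may_move_to(&reservation, 8.0) ? "yes" : "no");
    printf("move interactive to 8.0?  %s\n", may_move_to(&interactive, 8.0) ? "yes" : "no");
    return 0;
}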
2.1.5. System partitioning
Optimal system utilization and a high degree of system independence are two conflicting requirements. To deal with them, we have split the scheduler into two parts.
One of them, the QM, is totally independent of the underlying hardware architecture.
With this separation, the scheduler daemon has no information on mapping constraints
such as minimum cluster size, or the amount/location of link entries. Machine dependent
tasks are performed by a separate instance, the Machine Manager (MM), see Figure 3. The
MM verifies whether a schedule given by the QM can be mapped onto the hardware at
the specified time, now also taking reservations by other applications into account. If the
schedule cannot be mapped onto the machine, the MM returns an alternative schedule to
the QM.
This separation between the hardware-independent QM and the system-specific MM
allows us to employ dedicated mapping heuristics that are implemented in separate
modules. Special requests for IO nodes, partition shapes, memory constraints, etc. can
be taken into consideration in the verification process. Moreover, with the machine-specific information encapsulated in the MM, CCS islands can be easily adapted to other architectures.
[Figure: the MM is split into a machine administration part (Dispatcher, Mapping Verifier MV, Master Session Manager MSM, Configuration Manager CM, with a link to the system configuration hardware) and a job execution part (Node Session Managers NSM and Execution Managers EM connected by physical links).]
Figure 4. Detailed view of the machine manager (MM)
2.1.6. Process creation and control
At configuration time, the QM sends a user request to the MM, which then allocates the
compute nodes, loads and starts the application code and releases the resources after the
run. Because the MM also verifies the schedule, which is a computationally hard problem, a single daemon might become a computational bottleneck. We have therefore split the
MM into two parts (Figure 4) – one for the machine administration and one for the job
execution. Each part contains a number of modules and/or daemons.
The machine administration part consists of three separate daemons (MV, MSM, CM)
that execute asynchronously. A small Dispatcher co-ordinates the low-level components.
The Mapping Verifier (MV) checks whether the schedule given by the QM can be
realized at the specified time with the specified resources. Based on its more detailed
information on machine structure (hardware and software) it executes specific scheduling
and partitioning algorithms. The resulting schedule is then returned to the QM.
The Configuration Manager (CM) provides a link to the hardware. It is responsible for
booting, partitioning, and shutting down the operating system software. Depending on the
system’s capabilities, the CM may gather several requests and re-organize or combine them
for improving the throughput – analogously to the work of a hard disk controller.
The Master Session Manager (MSM) controls execution of jobs. It sets up sessions,
including pre- or post-processing, and maintains information on the status of each
application.
With some help from the Node Session Manager (NSM), the MSM allocates and
synchronizes network links between the specified entry nodes. The NSM starts and stops
jobs and controls the processes. When receiving a command from the MSM, the NSM
starts an Execution Manager (EM) which establishes the user environment (UID, shell
settings, environment variables, etc.) and starts the user application.
In time-sharing mode, the NSM invokes as many EMs as needed. It also gathers dynamic
load data and sends it to the MM and QM where it is used for scheduling and mapping
purposes.
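The per-job steps of the EM (become the user, set up the environment and working directory, start the application) correspond to a standard POSIX fork/exec sequence. The following minimal sketch assumes exactly that; it is not CCS code, the environment variable is invented, and a real daemon running as root would also call initgroups() and do far more error handling.

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Start one user job: fork, become the user, set up the environment and
 * working directory, exec the application. */
static pid_t start_job(uid_t uid, gid_t gid, const char *workdir,
                       char *const argv[], char *const envp[])
{
    pid_t pid = fork();
    if (pid != 0)
        return pid;                      /* parent (or fork error)     */

    if (setgid(gid) != 0 || setuid(uid) != 0)
        _exit(126);                      /* could not become the user  */
    if (chdir(workdir) != 0)
        _exit(126);
    execve(argv[0], argv, envp);         /* replace image with the job */
    _exit(127);                          /* exec failed                */
}

int main(void)
{
    char *argv[] = { "/bin/echo", "hello from the user application", NULL };
    char *envp[] = { "PATH=/usr/bin:/bin", "CCS_REQID=42", NULL };  /* invented variable */
    pid_t pid = start_job(getuid(), getgid(), "/tmp", argv, envp);
    if (pid > 0) {
        int status;
        waitpid(pid, &status, 0);
        printf("job exited with status %d\n", WEXITSTATUS(status));
    }
    return 0;
}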
2.1.7. Virtual terminal concept
With the increasing use of supercomputers for interactive simulation and design, support
of remote access via WANs becomes more and more important. Temporary breakdowns of
the network should (ideally) be hidden from the user.
In CCS, this is done by the EM which buffers the standard IO streams (stdout, stderr) of
the user application. In case of a network breakdown, all open output streams are sent by e-mail to the user or they are written into a file (when specified by the user). Users can re-bind to interrupted sessions, provided that the application is still running. CCS guarantees that no data is lost in the meantime. Figure 5 gives an overview of the control and data
flow in a CCS island.
2.2. Implementation aspects
CCS consists of about 200k lines of ANSI/POSIX C code. The software has been kept
portable by splitting all modules into two parts – a generic and a system-specific one.
Daemons are triggered by events. A watchdog performs alive-tests at regular intervals.
When a faulty daemon has been detected by the IM, it is restarted and all communication
partners are informed so that they can re-bind to the newly started instance.
All daemons include a wrapped timer that creates clock ticks for debugging purposes
and for simulating incoming requests on a variable time scale.
Even though CCS is POSIX-conformant, we implemented a Runtime Environment (RTE)
layer that wraps system calls. This allows for easy porting to new operating systems.
Currently, the RTE provides interfaces for
• management of dynamic memory, including debugging and usage logging
• signal handling
• file I/O including filter routines for ASCII-files
• manipulation of the process environment
• terminal handling (e.g. pty)
• sending e-mails
• logging of warnings and error messages.
Note that the scheduler modules can be changed at runtime: the QM provides an API that allows other scheduler modules to be plugged in dynamically, as sketched below. This allows the system administrator to adjust the system to specific operating modes (e.g., time-sharing and/or space-sharing).
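On a UNIX system, such a plug-in mechanism can be realized with shared objects, as in the sketch below: a scheduler module is loaded with dlopen() and exports its operations under a well-known symbol. The interface (struct sched_ops, the symbol name ccs_sched_ops, the module file name) is entirely hypothetical; the paper does not document the QM’s real API.

/* Compile with -ldl. */
#include <stdio.h>
#include <dlfcn.h>

/* hypothetical interface every scheduler module must export */
struct sched_ops {
    const char *name;
    int (*schedule)(void *queue);   /* returns 0 on success */
};

static struct sched_ops *load_scheduler(const char *sofile)
{
    void *handle = dlopen(sofile, RTLD_NOW);
    if (!handle) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        return NULL;
    }
    /* the module exports its operations under a well-known symbol */
    struct sched_ops *ops = dlsym(handle, "ccs_sched_ops");
    if (!ops)
        fprintf(stderr, "dlsym: %s\n", dlerror());
    return ops;                      /* handle intentionally kept open */
}

int main(void)
{
    /* e.g. an FCFS-with-backfilling module installed by the administrator */
    struct sched_ops *ops = load_scheduler("./fcfs_backfill.so");
    if (ops) {
        printf("activating scheduler '%s'\n", ops->name);
        ops->schedule(NULL);
    }
    return 0;
}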
[Figure: the UI talks to the AM over the CCS communication protocol; the AM contacts the MM with its Dispatcher; the MSM starts NSMs via rcmd, which fork and exec EMs; each EM forks and execs the user application and spawned processes in their own process group, with pseudo-terminals (ptyM/ptyS) relaying the standard IO streams 0, 1 and 2.]
Figure 5. Control and data flow in CCS
2.2.1. Communication layer
The multi-threaded communication layer separates a daemon’s code from the
communication network, allowing us to change communication protocols without the
need to change the source code of the daemons. The communication layer performs the
following tasks:
• it provides a reliable and hardware-independent exchange of data
• it allows us to dynamically connect/disconnect to communication partners
• it checks the availability of communication partners (in co-operation with the IM)
• it translates symbolic module names (in co-operation with the IM).
In our current implementation the daemons communicate through asynchronous remote
procedure calls (RPC). The binding of incoming RPC calls (events) to the corresponding
callback functions is done during runtime. This allows us to dynamically add new events
by registering the corresponding event handler.
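The sketch below illustrates such runtime registration of event handlers with a simple dispatch table. It only mirrors the idea described above; the event names, the table and the handler signature are invented and do not reflect the actual CCS communication layer.

#include <stdio.h>
#include <string.h>

#define MAX_EVENTS 32

typedef void (*event_handler)(const char *payload);

static struct { const char *event; event_handler fn; } handlers[MAX_EVENTS];
static int nhandlers;

/* register a new event at runtime */
static int register_event(const char *event, event_handler fn)
{
    if (nhandlers == MAX_EVENTS)
        return -1;
    handlers[nhandlers].event = event;
    handlers[nhandlers].fn = fn;
    nhandlers++;
    return 0;
}

/* dispatch an incoming RPC event to its callback */
static void dispatch(const char *event, const char *payload)
{
    for (int i = 0; i < nhandlers; i++)
        if (strcmp(handlers[i].event, event) == 0) {
            handlers[i].fn(payload);
            return;
        }
    fprintf(stderr, "no handler registered for event '%s'\n", event);
}

static void on_job_finished(const char *payload)
{
    printf("job finished: %s\n", payload);
}

int main(void)
{
    register_event("JOB_FINISHED", on_job_finished);  /* added at runtime */
    dispatch("JOB_FINISHED", "reqid=42 exit=0");
    dispatch("NODE_DOWN", "node=17");                 /* unknown event */
    return 0;
}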
3. INTERFACE TO THE METACOMPUTER LEVEL
With the autonomous islands described in the last section, we have one important
component for building up a metacomputing environment. The other four components are:
[Figure: a CCS site consists of several islands (each with Access Manager, Island Manager, Queue Manager, Machine Manager and the HPC hardware); the passive Center Information Server (CIS) holds RSD data and information on services, the active Center Resource Manager (CRM) sits on top of the islands, and a link connects the site to other resource management systems.]
Figure 6. A site managed by CCS
(i) a passive instance that maintains up-to-date information on the system structure and
state
(ii) an active instance that is responsible for the location and allocation of resources
within a center. It also co-ordinates the concurrent use of several systems, which
may be administered by different local resource management systems
(iii) a co-ordinating instance that connects the participating centers and establishes the
virtual entity of the metacomputer
(iv) a user-friendly tool that allows system administrators and users to specify classes of
resources and services.
These four components are the Center Information Server (CIS), the Center Resource Manager (CRM), the Service Coordination Layer (SCL), and the Resource and Service Description (RSD) (see Figure 6).
3.1. Center Information Server (CIS)
The CIS is the ‘big brother’ of the island manager at the metacomputer level. Like
the UNIX Network Information Service NIS or the GLOBUS Metacomputing Directory
Service MDS, CIS provides up-to-date information on the resources in a site. Compared to
the active IM in the islands, CIS is a passive component.
At startup time, or when the configuration has been changed, an island signs on at the
CIS and informs it about the topology of its machines, the available system software, the
features of the user interfaces, the communication interfaces and so on. The CIS maintains
a database on the network protocols, the system software (e.g., programming models, and
libraries) and the time constraints (e.g. for specific connections). The CIS also plays the
role of a ‘docking station’ for mobile agents and external users.
3.2. The Center Resource Manager (CRM)
The CRM is an independent management tool on top of the CCS islands. It supports setup
and execution of multi-site applications that run concurrently on several platforms. Note
that the term ‘multi-site application’ can be interpreted in two ways: it can be just one
application that runs on several machines without explicitly being programmed for that
execution mode[10], or it can comprise different modules, each of them executing on a
machine that is best suited for running that specific piece of code. In the latter case the
modules can be implemented in different programming languages using different message
passing libraries (e.g., PVM, MPI, PARIX, and MPL). Multi-communication tools like
PLUS[11] are necessary to make this kind of multi-site application possible.
For executing multi-site applications three tasks need to be done:
• locating the resources
• allocating the resources
• starting and terminating the modules.
For locating the resources, the CRM maps the user request (given in RSD notation)
against the static and dynamic information on the available system resources.
The static information (e.g. topology of a single machine or the site) has been specified
by the system administrator, while the dynamic information (e.g. state of an individual
machine, network characteristics) is gathered at runtime. All this information is provided
by the CIS. Since resources are described as dependency graphs, users may additionally
specify the required communication bandwidth for their application. Data on the previous
runtime behavior can be gathered and condensed in an execution profile[12], thereby
providing up-to-date information for an improved process mapping and migration.
After the target resources have been determined, they must be allocated. This can be done in analogy to the two-phase-commit protocol in distributed database management systems: the CRM requests the allocation of all required resources at all involved islands. If not all resources are available, it either re-schedules the job or denies the user request. Otherwise the job can now be started in a synchronized way. At this point, machine-specific pre-processing tasks or inter-machine-specific initializations (e.g. starting of special daemons) must be performed. A sketch of the two-phase allocation idea is given below.
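The following sketch spells out the two-phase analogy in its simplest form: prepare (reserve) on every island, then either commit all reservations or roll all of them back. The island interface and the numbers are invented; the real CRM additionally has to deal with timeouts, accounting and the synchronized start described above.

#include <stdio.h>
#include <stdbool.h>

struct island {
    const char *name;
    int free_pes;
};

/* phase 1: tentatively reserve; returns false if the island cannot comply */
static bool prepare(struct island *is, int pes)
{
    if (is->free_pes < pes)
        return false;
    is->free_pes -= pes;
    return true;
}

/* phase 2a / 2b */
static void commit(struct island *is)            { printf("%s: committed\n", is->name); }
static void rollback(struct island *is, int pes) { is->free_pes += pes; printf("%s: rolled back\n", is->name); }

static bool allocate_multisite(struct island *isles, const int *pes, int n)
{
    int prepared = 0;
    for (; prepared < n; prepared++)
        if (!prepare(&isles[prepared], pes[prepared]))
            break;
    if (prepared == n) {                   /* all islands agreed */
        for (int i = 0; i < n; i++)
            commit(&isles[i]);
        return true;
    }
    for (int i = 0; i < prepared; i++)     /* undo partial reservations */
        rollback(&isles[i], pes[i]);
    return false;                          /* CRM re-schedules or denies */
}

int main(void)
{
    struct island isles[] = { { "SCI-cluster", 64 }, { "MPP", 512 } };
    int pes[] = { 32, 256 };
    printf("multi-site allocation %s\n",
           allocate_multisite(isles, pes, 2) ? "succeeded" : "failed");
    return 0;
}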
Analogously to the islands level, the CRM is able to migrate user resources between
machines to achieve better utilization. Accounting and authorization at the metacomputer
level can also be implemented at this layer.
4. CO-ORDINATING THE METACOMPUTER SERVICES WITH SCL
CCS is just one important building block in metacomputing environments. Its architectural
concept proved useful in two industrial metacomputing projects where distributed compute
servers have been implemented to execute computationally intensive jobs remotely over
the Internet. Rather than providing full access for program development and execution, our metacomputer environment supports only registered applications – possibly on a pay-per-use basis. The reasons are twofold: first, computationally intensive HPC applications often need a huge amount of application-specific data, which must be manually organized before starting the application; secondly, industrial users are typically not willing to learn vendor-specific HPC access methods just to run a code; they rather prefer to see the machine through their application code’s interface.
The Service Coordination Layer SCL supports job distribution and multi-site
applications at the metacomputer level. The SCL software layer is located one level above
the local resource management systems. It provides the following services:
(a) co-ordinated management of autonomous sites: SCL is organized as a network of
co-operating servers. All servers are equal and each represents exactly one center.
There is no global co-ordinating instance. This offers high availability of the service.
The centers determine which of their resources become available to the others and
therefore maintain full autonomy.
(b) distributed management of users and requests: The SCL provides a set of global
user accounts for the global service. On the metacomputing level, we support only
execution of registered applications. Hence, the global accounts can be mapped to
one single account at each center. It is not necessary to reserve a large block of user
IDs at each site.
(c) job distribution scheme: The acceptance of a metacomputing service depends much
on the metacomputer’s response time. It is the main task of the job distribution
scheme (JDS) to improve this parameter. The JDS is responsible for assigning
resources to user requests. For best results, it must consider the availability of
software installations and service providers, the speed of network connections and
compute nodes, as well as special preferences of users and HPC centers.
(d) performance comparison between heterogeneous systems: From a user’s point of view there is little interest in whether a center is able to offer 64 closely coupled Intel Pentiums while another has a CRAY-T90. A user merely wants to know at which site his application would run fastest. For this purpose, the raw hardware speed must be transformed into application-specific performance values. For every application A there must be a function f_A with f_A(M1) > f_A(M2) if application A performs faster on machine configuration M1 than on configuration M2. f_A can typically be derived from the analysis of benchmark results.
4.1. Supporting single-site applications
Figure 7 depicts the overall layout of the service co-ordination layer SCL. For improved scalability and fault tolerance there is no single central instance in SCL. The islands are managed by a network of co-operating servers. Each service center runs a Center Management Server which accepts incoming user requests, locates the best available resources on the network by means of the decentralized CISs, and delegates the job execution to the CRMs. Since the center management servers are all alike, each of them may equally accept a user request. When a request is initially received by a center, the server at this site becomes the request server for this job. It broadcasts the request to all other sites and collects incoming offers from the bidding servers, as shown in Figure 8.
[Figure: a network of co-operating Center Management Servers, one per HPC center; each center runs its own CRM and CIS, and the user submits requests to the network of servers.]
Figure 7. The service co-ordination layer SCL
[Figure: flow diagram. Requesting server: receive the incoming request, determine the set of suitable servers, request offers from the remote servers, collect the offers (taking network performance into account), choose the best offer and assign the request to one service center. Bidding server: on a request for an offer, determine the best suited local machine; if the request can be fulfilled, send an offer to the requesting server, otherwise send a rejection.]
Figure 8. Flow diagram of JDS for single-site applications
When all answers have been collected, or when a timeout has occurred, the server
compares the offers according to the user’s selection criteria (e.g., turnaround-time, costs,
application, preferred-center). The best bid is chosen and the request is handed over to
the corresponding center. When there are several offers of equal quality, one of them is
selected at random.
The quality of an offer depends mainly upon the expected turnaround time and the
estimated costs. The costs are provided together with the bid and are calculated by the
service centers. The turnaround time consists of the computing time at the remote center
(which is also contained in the bid) and the time needed for transferring the data over
the network. With this information on each offer, the requesting server chooses the fastest
offer which does not exceed the user’s cost limitations. Users may also override the default behavior. For example, it is possible to request the cheapest solution which does not exceed a certain time frame.

[Figure: flow diagram. Requesting server: receive the incoming request, determine the set of suitable servers, request offer-packages from the remote servers, determine the best combined offer (taking network performance into account) and assign the request to a set of service centres. Bidding server: collect the minimal and maximal available resources of each local machine and send all these offers to the requesting server.]
Figure 9. Flow diagram of JDS for multi-site applications
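A minimal sketch of the default policy described above (the fastest offer whose cost stays within the user’s limit) is given below. The offer fields and numbers are invented; in SCL the compute time and cost come from the bidding center, the transfer time is estimated from the network performance, and ties are broken at random.

#include <stdio.h>

struct offer {
    const char *center;
    double compute_hours;    /* quoted by the bidding center        */
    double transfer_hours;   /* estimated from network performance  */
    double cost;             /* quoted by the bidding center        */
};

static double turnaround(const struct offer *o)
{
    return o->compute_hours + o->transfer_hours;
}

/* fastest offer within the cost limit (ties: first one wins in this toy) */
static const struct offer *select_offer(const struct offer *offers, int n,
                                        double cost_limit)
{
    const struct offer *best = NULL;
    for (int i = 0; i < n; i++) {
        if (offers[i].cost > cost_limit)
            continue;
        if (!best || turnaround(&offers[i]) < turnaround(best))
            best = &offers[i];
    }
    return best;   /* NULL if no center can meet the cost limit */
}

int main(void)
{
    struct offer offers[] = {
        { "center A", 3.0, 0.5, 120.0 },
        { "center B", 2.0, 1.0, 200.0 },   /* faster machine, slower link */
        { "center C", 2.5, 0.5, 400.0 },   /* fastest, but too expensive  */
    };
    const struct offer *o = select_offer(offers, 3, 250.0);
    if (o)
        printf("chosen: %s (%.1f h, cost %.0f)\n",
               o->center, turnaround(o), o->cost);
    return 0;
}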
4.2. Supporting multi-site applications
The described strategy cannot be applied to multi-site applications, because the problem partitioning is typically not known beforehand, and bidding servers do not know how many resources they should include in their bids. We therefore introduce the concept of packages of offers. Such a package is a collection of two types of bids:
• regular offers as described above
• offers which contain a convex set of resources.
With these types of bids, requesting servers usually receive multiple offers from each
bidding server – some of them describing exact resources and some containing packages.
The remaining task is then to choose a suitable set of resources that best matches the user’s
request (see Figure 9).
For determining an optimal offer, the requesting server needs additional information on the communication behavior of the application. It has to know, for example, how sensitive the application is to slow communication links between the partitions. Unfortunately, even if exact information on the application’s behavior were available, an exact solution is often infeasible, because the problem is NP-hard. Therefore we can only try to find a reasonable solution instead of an optimal one. On the other hand, this reduces the amount of information needed on the communication patterns. The heuristic we have chosen is based on the following observations:
1. The communication performance of a parallel application is either determined by
the slowest link or by the average link speed.
2. The computation speed is either influenced by the slowest service node or by the
average node performance.
3. It is advisable to use as few partitions as possible.
Based on these observations we define the objective function:

$$\mathrm{Perf(Offer)} = \frac{1}{C_{\mathrm{Comp}} \cdot \frac{\mathrm{Perf_{Comp}(Ref)}}{\mathrm{Perf_{Comp}(Offer)}} + C_{\mathrm{Comm}} \cdot \frac{\mathrm{Perf_{Comm}(Ref)}}{\mathrm{Perf_{Comm}(Offer)}}}$$

where

Perf(Offer) describes the performance of a combined offer
Perf_Comm(Offer) describes the quality of the network connections
Perf_Comp(Offer) measures the quality of the compute nodes
Ref is a reference configuration to which all combined offers are compared
C_Comm, C_Comp specify the ratio between communication and computation of the application on the reference configuration (C_Comm + C_Comp = 1).

Let P_1, P_2, ..., P_N be the partitions of a combined offer, NodePerf(P_i) the speed of the machine where partition i is located, and LinkPerf(P_i, P_j) the speed of the network connection between P_i and P_j. We calculate Perf_Comp and Perf_Comm as

$$\mathrm{Perf_{Comp}(Offer)} = \begin{cases} \min_{i=1,\dots,N} \mathrm{NodePerf}(P_i) & \text{if determined by the slowest node} \\ \frac{1}{N} \sum_{i=1}^{N} \mathrm{NodePerf}(P_i) & \text{if determined by the average node} \end{cases}$$

and

$$\mathrm{Perf_{Comm}(Offer)} = \begin{cases} \min_{i,j \in N,\; i \neq j} W_{P_i,P_j} \cdot \mathrm{LinkPerf}(P_i, P_j) & \text{if determined by the slowest link} \\ \frac{1}{N} \sum_{i,j \in N,\; i \neq j} W_{P_i,P_j} \cdot \mathrm{LinkPerf}(P_i, P_j) & \text{if determined by the average link.} \end{cases}$$

The weight factors W_{P_i,P_j} allow an application to emphasize the importance of specific connections. Given the constraint that $\sum_{i,j \in N,\; i \neq j} W_{P_i,P_j} = 1$, we can model the popular master–slave approach by

$$W_{P_i,P_j} := \begin{cases} \frac{1}{N-1} & \text{if } i = 0 \text{ and } j \neq 0 \\ 0 & \text{else.} \end{cases}$$

Since the communication weight factors of an application are not always available in advance, we use

$$W_{P_i,P_j} := \begin{cases} \frac{1}{N^2 - N} & \text{if } i \neq j \\ 0 & \text{else} \end{cases}$$

as the default setting.
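For the ‘average node / average link’ case with the default weight factors, the objective function above reduces to the few lines of C below. The partition speeds, link speeds and reference values are made up for the example; only the formula itself is taken from the text.

#include <stdio.h>

#define N 3   /* number of partitions in the combined offer */

static const double node_perf[N] = { 1.0, 0.8, 0.5 };   /* NodePerf(P_i)      */
static const double link_perf[N][N] = {                 /* LinkPerf(P_i, P_j) */
    { 0.0, 0.6, 0.2 },
    { 0.6, 0.0, 0.3 },
    { 0.2, 0.3, 0.0 },
};

/* Perf_Comp, "average node" variant */
static double perf_comp(void)
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        s += node_perf[i];
    return s / N;
}

/* Perf_Comm, "average link" variant with the default weights
 * W = 1/(N*N - N) for i != j */
static double perf_comm(void)
{
    double w = 1.0 / (N * N - N), s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (i != j)
                s += w * link_perf[i][j];
    return s / N;
}

/* Perf(Offer) = 1 / ( C_Comp * Perf_Comp(Ref)/Perf_Comp(Offer)
 *                   + C_Comm * Perf_Comm(Ref)/Perf_Comm(Offer) ) */
static double perf(double c_comp, double c_comm,
                   double ref_comp, double ref_comm)
{
    return 1.0 / (c_comp * ref_comp / perf_comp() +
                  c_comm * ref_comm / perf_comm());
}

int main(void)
{
    /* assumed reference run: Perf_Comp(Ref) = 1.0, Perf_Comm(Ref) = 0.1,
     * and the application spends 70% of its time computing */
    printf("Perf(Offer) = %.3f\n", perf(0.7, 0.3, 1.0, 0.1));
    return 0;
}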
let B be the set of all bids
let R* := ∅
for all b ∈ B do
    let R_b := {b}
    while R_b contains fewer resources than required do
        choose b' ∈ B with b' ∉ R_b and Perf(R_b ∪ {b'}) = max
        let R_b := R_b ∪ {b'}
    end while
    let R* := R* ∪ {R_b}
end for
choose R ∈ R* with Perf(R) = max

Figure 10. Algorithm for constructing combined offers
[Figure: administrators and users feed RSD either through the language interface (NODE/PORT declarations with FOR loops) or through the graphical interface (machines and centres annotated with attributes such as Name, Type, CPU, Memory, Latency and Bandwidth); both representations are stored in a common object library with an API.]
Figure 11. General architecture of the RSD approach
We decided to use a greedy approach for constructing the combined offer. This heuristic is fast and usually leads to sufficiently good results for this kind of problem. The strategy is outlined in Figure 10. It starts with a single resource and adds new resources until the resulting configuration can fulfill the request. The new resource is chosen as the one that maximizes the overall performance of the resulting configuration. This procedure is performed with each bid as the starting point, and finally the best constructed configuration is chosen as the result.
5. RESOURCE AND SERVICE DESCRIPTION (RSD)
The Resource and Service Description (RSD) is the glue between the local resource
management CCS and the global metacomputer layer SCL. RSD provides a portable,
object-oriented tool. It is the common basis for describing resources and resource requests.
At the administrator level, RSD provides a means for describing the type and topology
of available resources. At the user level, RSD allows us to specify the required system
configuration for a given application.
An earlier variant of RSD, the Resource Description Language RDL[13], was purely text-based. It served for many years in CCS. Our users, however, found RDL too complex when just trying to run a parallel code on an MPP with a simple, regular topology. It seems that it was too early for the user community to appreciate the full descriptive power of a versatile description language. Hence, we hid the language interface behind easy-to-use command line options. But of course, RDL was still used behind the scenes.
With the current trend towards distributed computing and metacomputing, resource
description tools become important again. Based on our experiences with the outdated
RDL, we now provide a more generic approach that has three interfaces (see Figure 11):
• a graphical interface (GUI) for specifying simple topologies and attributes
• a language interface for specifying more complex and repetitive graphs (mainly
intended for system administrators)
• an application program interface (API) for access from within an application
program.
The graphical editor stores the graphical and textual data in an internal data
representation. This data is bundled with the API access methods and sent as an object
to the target systems, where it is matched against other hardware or software descriptions.
The internal data description can only be accessed and modified through the API[14].
Modifications are possible, because a description of the component’s graphical layout
is kept as part of the internal data representation. In the following, we describe the
components of RSD in more detail.
5.1. Graphical interface
The graphical editor provides a set of simple modules that can be edited and linked together
to build a graph of the requested resources or a system description.
At the administrator level, the graphical interface is used to describe the basic computing
and networking components in a (meta-)center. Figure 12 illustrates a typical administrator
session. In this example, the components of the center are specified in a top-down manner
with the interconnection topology as a starting point. With drag-and-drop, the administrator
specifies the available machines, their links and the interconnection to the outside world.
It is possible to specify new resources by using pre-defined objects and attributes via pull-down menus, radio buttons, and check boxes.
In the next step, the machines are specified in more detail. The GUI offers a set
of standard machine layouts like Cray T3E, IBM SP2, or Parsytec and some generic
topologies like grid or torus. The administrator defines the size of the machine and
the general attributes of the whole machine. When the machine has been specified, a
window with a graphical representation of the machine pops up, in which single nodes
can be selected. Attributes like network interface cards, main memory, disk capacity, I/O
throughput, CPU load, network traffic, disk space, or the automatic start of daemons can
be assigned.
Figure 12. Graphical RSD editor
For users, RSD provides a set of standard configuration files which contain descriptions of the most common systems. Additionally, it is possible to connect to a remote site to load its RSD dialect (i.e. attribute names and values). This allows us to perform online syntax checks at the interface, even when the user has specified remote resources. Likewise, it is possible to join multiple sites to a meta-site, using a different RSD dialect, without affecting the site-specific RSD dialects.
Analogously to the wide-area metacomputer manager WAMM[15], the user may click
on a center and the target machines. Interconnection topologies, node availability and the
current job schedule may be inspected. Partitions can be selected via drag and drop or in a
textual manner. For multi-site applications, the user may either specify the intended target
machines or a set of constraints in order to let the RMS choose a suitable set of systems.
5.2. Language interface
Graphical user interfaces are often not powerful enough for describing complex
metacomputing environments with a large number of services and resources. System
administrators need an additional tool for specifying irregularly interconnected, attributed
structures. Hence, we devised a language interface that is used to specify arbitrary
topologies. The hierarchical concept allows different graphs to be grouped for building
even more complex nodes, i.e. hypernodes.
In the RSD language, active nodes are indicated by the keyword NODE. Communication
interfaces (sockets) are declared by the keyword PORT. Depending on whether RSD is
used to describe hardware or software topologies, the keyword NODE is interpreted as
‘processor’ or ‘process’. A port node may be a physical socket, a process that behaves
passively within the parallel program, or a passive hardware entity like a crossbar.
A NODE definition consists of three parts:
[Figure: the metacomputer ‘Meta’ consists of an SCI workstation cluster (SCI_WSC, a 1 Gbps SCI ring of P6 nodes with two quad-processor gateways) and an MPP (a 6 x 4 cluster), connected via their ATM ports by a 622 Mbps ATM link (edge wsc_mpp_atm); edge_to_hypernode_port edges export the ATM ports to the hypernode level.]
Figure 13. RSD example for multi-site application
NODE Meta {
    // DEFINITIONS: define attributes, values and ranges
    CONST BANDWIDTH = (1..1200);
    // DECLARATIONS: include the two hyper nodes
    INCLUDE "SCI_WSC";
    INCLUDE "MPP";
    // CONNECTIONS: MPP with SCI workstation cluster
    EDGE wsc_mpp_atm {
        NODE SCI_WSC PORT ATM <=> NODE MPP PORT ATM; BANDWIDTH = 622 Mbps;};
};

Figure 14. RSD specification of Figure 13
1. In the (optional) DEFINITION section, identifiers and attributes may be introduced
by the command line Identifier [ = ( value,. . . ) ].
2. In the DECLARATION section, the nodes are declared with their corresponding
attributes. The notion of a ‘node’ in a graph is recursive. Nodes are described by
NODE NodeName {PORT PortName; attribute 1, ...}.
3. The (optional) CONNECTION section is used to define attributed edges between
the ports of the nodes: EDGE NameOfEdge {NODE w PORT x <=> NODE
y PORT z; attribute 1; ...}. A ‘virtual edge’ provides a link between
different levels of hierarchy in the graph. This allows us to establish a link from
the described module to the ‘outside world’ by ‘exporting’ a physical port to the
next higher level. Virtual edges are defined by ASSIGN NameOfVirtualEdge
{ NODE w PORT x <=> PORT a}. Note that NODE w and PORT a are the
only entities known to the outside world.
Figures 14 and 15 show an RSD example of an application Meta that should run on
two systems (Figure 13). The metacomputer comprises an SCI workstation cluster and a
massively parallel system, interconnected by a bi-directional ATM network.
The definition of Meta is straightforward; see Figure 14. The SCI cluster in Figure 15
consists of eight nodes, two of them with quad-processor systems. For each node, the CPU
type, the memory per node, the operating system, and the port of the SCI link is specified.
NODE SCI_WSC {
    // DEFINITIONS:
    CONST N = 8;                           // number of nodes
    CONST SHARED = TRUE;                   // allocate resources for shared use
    CONST CPU[] = ("PentiumII", "Alpha");  // predefined CPU-types
    // DECLARATIONS:
    // we have 2 SMP nodes (gateways), each with 4 processors
    // each gateway provides one SCI and one ATM port
    FOR i=0 TO 1 DO
        NODE i {
            PORT SCI; PORT ATM; CPU=Alpha; MEMORY=512 MByte; MULTI_PROC=4;};
    OD
    // the others are single processor nodes
    // each with one SCI port
    FOR i=2 TO N-1 DO
        NODE i {
            PORT SCI; CPU=Alpha; MEMORY=256 MByte; OS=SOLARIS;};
    OD
    // CONNECTIONS: build the 1.0 Gbps unidirectional ring
    FOR i=0 TO N-1 DO
        EDGE edge_$i_to_$((i+1) MOD N) {
            NODE i PORT SCI => NODE ((i+1) MOD N) PORT SCI; BANDWIDTH = 1.0 Gbps;};
    OD
    // establish a special virtual edge from node 0 to the
    // port of the hyper node SCI_WSC (= outside world)
    ASSIGN edge_to_hypernode_port {
        NODE 0 PORT ATM <=> PORT ATM;};
};

Figure 15. RSD specification of the SCI part
Figure 15. RSD specification of the SCI part
All nodes are connected by a uni-directional SCI ring with 1.0 Gbps. In the example, the
first node is the gateway to the workstation cluster. It presents its ATM port to the next
higher node level (see ASSIGN statement in Figure 15) to allow for remote connections.
5.3. Internal data representation
An abstract data type establishes the link between the graphical and the text based
representation. It is also used to store descriptions on disk and to exchange them across
networks. The internal data representation must be capable of describing the following
properties:
• arbitrary graph structures
• hierarchical systems or organizations
• nodes and edges with arbitrary sets of valued attributes.
Furthermore it should be possible to reconstruct the original representation, either graphically or text-based. This facilitates the maintenance of large descriptions (e.g. a complex HPC center) and allows visualization at remote sites.
In order to use RSD in a distributed environment, a common format for exchanging
RSD data structures is needed. The traditional approach is to use a data stream format.
However, this involves two additional transformation steps whenever RSD data are to
be exchanged (internal representation into data stream and back). Since the RSD internal representation has been defined in an object-oriented way, this overhead can be avoided when the complete object is sent across the network.
Today there exists a variety of standards for transmitting objects over the Internet, e.g. CORBA, Java, or the Component Object Model. Since we do not want to commit to any of these, we only define the interfaces of the RSD object class but not its private implementation. This allows others to choose an implementation that best fits their own data structures. Interoperability between different implementations can be improved by defining translating constructors, i.e. constructors that take an RSD object as an argument and create a copy of it using another internal representation.
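Since CCS itself is written in C, the sketch below phrases the translating-constructor idea in C terms: an RSD object is visible only through a small interface of function pointers, and a new internal representation is built by walking another object through exactly that interface. The interface shown is invented and far smaller than a real RSD description (no edges, no attributes, no graphical layout data).

#include <stdio.h>
#include <stdlib.h>

/* public interface of an RSD object (invented, heavily reduced) */
struct rsd_if {
    void *impl;                                      /* private representation */
    int         (*node_count)(const void *impl);
    const char *(*node_name) (const void *impl, int i);
};

/* one possible internal representation: a flat array of node names */
struct array_impl { int n; const char **names; };

static int arr_count(const void *p)               { return ((const struct array_impl *)p)->n; }
static const char *arr_name(const void *p, int i) { return ((const struct array_impl *)p)->names[i]; }

/* "translating constructor": build an array-backed object from any other
 * object by walking it through the public interface only */
static struct rsd_if rsd_from(const struct rsd_if *src)
{
    int n = src->node_count(src->impl);
    struct array_impl *a = malloc(sizeof *a);
    a->n = n;
    a->names = malloc(n * sizeof *a->names);
    for (int i = 0; i < n; i++)
        a->names[i] = src->node_name(src->impl, i);  /* shallow copy for brevity */
    return (struct rsd_if){ a, arr_count, arr_name };
}

int main(void)
{
    const char *names[] = { "SCI_WSC", "MPP" };
    struct array_impl src_impl = { 2, names };
    struct rsd_if src  = { &src_impl, arr_count, arr_name };

    struct rsd_if copy = rsd_from(&src);             /* translated copy */
    for (int i = 0; i < copy.node_count(copy.impl); i++)
        printf("node %d: %s\n", i, copy.node_name(copy.impl, i));
    return 0;   /* the toy copy is intentionally never freed */
}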
6. RELATED WORK
Resource management systems emerged from the need for better utilization of expensive
HPC systems. The Network Queuing System NQS[16], developed by NASA Ames for the
Cray2 and Cray Y-MP, might be regarded as the ancestor of many modern queuing systems
like the Cray Network Queuing Environment NQE and the Portable Batch System PBS[17].
Following another path in the line of ancestors, the IBM LoadLeveler is a direct
descendant of Condor[18], whereas Codine[19] has its roots in Condor and DQS.
They have been developed to support so-called high-throughput computing on UNIX
workstation clusters. In contrast to high-performance computing, the goal is here to run
a large number of (mostly sequential) batch jobs on workstation clusters without affecting
interactive use. The Load Sharing Facility LSF[20] is another popular software for
utilizing LAN-connected workstations for high-throughput computing. For more detailed
information on cluster managing software, the reader is referred to[21,22].
These systems have been extended to support the coordinated execution of parallel applications, mostly based on PVM. A multitude of schemes has been devised for high-throughput computing on a somewhat larger scale, including the Iowa State University’s Batrun[23], the CORBA-based Piranha[24], the Dutch Polder initiative[25], the Nimrod project[26], and the object-oriented Legion[27], which proved useful in a nation-wide cluster. While these schemes mostly emphasize application support on homogeneous systems, the AppLeS project[28] provides application-level scheduling agents on heterogeneous systems, taking into account their actual resource performance.
6.1. Comparing the CCS environment to GLOBUS
For the work presented in this paper, the already mentioned GLOBUS project[5] is most relevant. Based on the lessons learned in the I-WAY experiment[29], the National Computational Science Alliance[30] is currently implementing a framework for an adaptive wide-area metacomputer environment (AWARE), in which GLOBUS, along with Condor and Symbio[31] (for managing clustered WindowsNT systems), plays a key role in establishing a national distributed computing infrastructure.
In GLOBUS, a metacomputer is regarded as a networked virtual supercomputer
constructed dynamically from geographically distributed resources that are linked by
high-speed networks. GLOBUS aims at a vertically integrated treatment of application,
middleware and network. It provides a basic infrastructure of tools building on each other:
• resource (al)location: GLOBUS Resource Manager GRM[32]
• communication layer: Nexus[33]
• unified resource information service: Metacomputing Directory Service MDS[34]
• authentication interface: Generic Security System GSS[35]
• data access: Remote IO Facility RIO[36].
In contrast to our approach, GLOBUS applications are expected to configure themselves
to fit the execution environment delivered by the metacomputing system, and then adapt
their behavior to subsequent changes in the resource characteristics.
Our CCS software was originally designed for managing HPC systems in a single site.
Resources may be distributed, but they must be accessible in one NFS/NIS domain. CCS
provides an open interface so that several sites may be joined by higher-level tools. CCS has
a hierarchical structure with autonomous software layers that interact via message passing:
The lowest level is a self-sufficient CCS Island controlling a single machine or cluster
which can be operated stand-alone. The next higher level consists of the Center Resource
Manager (CRM) and the Center Information Server (CIS) which build the interfaces of a
site to the ‘outside world’. CCS’ open framework architecture allows us to integrate all
kinds of HPC systems.
The GLOBUS metacomputing directory service (MDS)[34] provides similar services as
our center information server (CIS). It addresses the need for efficient and scalable access
to diverse, dynamic, and distributed information. MDS is vendor-independent – just like
the RSD that is used in all CCS components. Moreover, MDS is intended to maintain and
utilize application specific information that has been found useful in previous program runs
(e.g. memory requirements, program structure, communication patterns). In CCS, this is
done by another tool (not described here), the Metacomputer Adaptive Runtime System
MARS[12].
Compared to the much more ambitious GLOBUS project, several tools are missing in our CCS environment. We do not (yet) support remote I/O, there is no dedicated authentication interface, and we do not support multiple (or even adaptive) communication protocols. However, our system provides several unique features such as, for example, reservations, deadline scheduling, and true support for multi-site applications.
7. SUMMARY
We have presented three key metacomputing components: the system management
software CCS, the resource description tool RSD, and the SCL meta-layer that makes use
of the other two.
The Computing Center Software (CCS) deals with the access and administration of
single HPC systems. Our current release has the following features:
• It is modular and autonomous on each layer. New machines, networks, protocols,
schedulers, system software, and meta-layers can be added at any point – some of
them even without the need to reboot the system.
• It is reliable. There is no single point of failure. Recovery is done at the machine layer. The Center Information Server (CIS) is passive and can be restarted or mirrored.
• It is scalable. No central instance exists. The hierarchical approach allows us to
connect to other centers’ resources. This concept has been found useful in several
industrial projects.
• It is extensible. Other resource management systems (e.g. Codine, LSF, Condor) can
be linked to CCS without the need to adjust their internal control regime.
The second component, the Resource and Service Description (RSD), is a tool
for specifying hardware and software components of a geographically distributed
metacomputer. Its graphical interface allows users to specify resource requests. Its textual
interface gives a service provider a powerful means for specifying computing nodes,
network topology, system properties and software attributes. Its internal object-oriented
resource representation is used to link different resource management systems and service
tools.
The third component, the Service Coordination Layer (SCL), links the autonomous
computing sites and provides a brokerage between the application requirements and the
distributed computation services. SCL was found useful in two metacomputing projects with industrial participation[3], where computationally intensive applications were run on WAN-connected HPC servers.
In the near future, we plan to integrate gang scheduling strategies into CCS and to support workstation cluster management software (Codine, Condor or LSF) at the metacomputer layer, i.e. in the Center Resource Manager.
ACKNOWLEDGEMENTS
This work was partially supported by the European Union projects MICA (#20966),
PHASE (#23486), and DYNAMITE (#23499), and also by the German Ministry of
Research and Technology (BMBF) project UNICORE.
Thanks to the members of the CCS team, who have put tremendous effort into development, implementation, and debugging since the project started in 1992: Bernard
Bauer, Matthias Brune, Harald Dunkel, Jörn Gehring, Oliver Geisser, Christian Hellmann,
Axel Keller, Achim Koberstein, Rainer Kottenhoff, Karim Kremers, Fru Ndenge,
Friedhelm Ramme, Thomas Römke, Helmut Salmen, Dirk Schirmer, Volker Schnecke,
Jörg Varnholt, Leonard Voos, and Anke Weber.
Also, special thanks to our collaborators at CNUCE, Pisa: Domenico Laforenza, Ranieri Baraglia, Mauro Michelotti, and Simone Nannetti. The CCS project benefited from many fruitful discussions with the CNUCE team, and our Italian friends did a magnificent job in implementing the graphical user interface.
REFERENCES
1. L. Smarr and C. E. Catlett, ‘Metacomputing’, Commun. ACM, 35(6), 45–52 (1992).
2. A. Grimshaw, A. Ferrari, G. Lindahl and K. Holcomb, ‘Metasystems’, Commun. ACM, 41(11),
43–55 (1998).
3. J. Gehring, A. Reinefeld and A. Weber, ‘PHASE and MICA: Application specific
metacomputing’, Europar’97, Passau, Germany, 1997, pp. 1321–1326.
4. A. Reinefeld, R. Baraglia, T. Decker, J. Gehring, D. Laforenza, F. Ramme, T. Römke and
J. Simon, ‘The MOL Project: An open extensible metacomputer’, Proceedings HCW’97,
Geneva, IEEE Computer Society Press, 1997, pp. 17–31.
5. I. Foster and C. Kesselman, ‘Globus: A metacomputing infrastructure toolkit’, J. Supercomput.
Appl., (1996).
6. A. Keller and A. Reinefeld, ‘CCS resource management in networked HPC systems’,
7th Heterogeneous Computing Workshop HCW’98 at IPPS/SPDP’98, Orlando, FL, IEEE
Computer Society Press, 1998, pp. 44–56.
7. F. Ramme, T. Römke and K. Kremer, ‘A distributed computing center software for the efficient
use of parallel computer systems’, HPCN Europe, LNCS 797, Vol. II, Springer, 1994, pp. 129–
136.
8. F. Ramme, ‘Transparente und effiziente Nutzung partitionierbarer Parallelrechner’, Ph.D.
Dissertation (in German), Paderborn Center for Parallel Computing, 1997.
9. J. Gehring and F. Ramme, ‘Architecture-independent request-scheduling with tight waiting-time estimations’, IPPS’96 Workshop on Scheduling Strategies for Parallel Processing, Hawaii,
Springer, LNCS 1162, 1996, pp. 41–54.
10. T. Beisel, E. Gabriel and M. Resch, ‘An extension to MPI for distributed computing on MPPs’,
in M. Bubak, J. Dongarra and J. Wasniewski (eds.), Recent Advances in Parallel Virtual
Machine and Message Passing Interface, Springer, LNCS, pp. 25–33.
11. M. Brune, J. Gehring and A. Reinefeld, ‘Heterogeneous message passing and a link to resource
management’, J. Supercomput., 11, 355–369 (1997).
12. J. Gehring and A. Reinefeld, ‘MARS – A framework for minimizing the job execution time in
a metacomputing environment’, Future Gener. Comput. Syst., 12, 87–99 (1996).
13. B. Bauer and F. Ramme, ‘A general purpose resource description language’, in Grebe and
Baumann (eds.), Parallele Datenverarbeitung mit dem Transputer (in German), Springer-Verlag, Berlin, 1991, pp. 68–75.
14. M. Brune, J. Gehring, A. Keller and A. Reinefeld, ‘RSD – resource and service description’,
12th Intl. Symp. on High-Performance Computing Systems and Applications HPCS’98,
Edmonton, Canada, Kluwer Academic Press, 1998, pp. 193–206.
15. R. Baraglia, G. Faieta, M. Formica and D. Laforenza, ‘Experiences with a wide area network
metacomputing management tool using IBM SP-2 parallel systems’, Concurrency: Pract. Exp.,
8, (1996).
16. C. Albing, ‘Cray NQS: Production batch for a distributed computing world’, 11th Sun User
Group Conference and Exhibition, Brookline, USA, December 1993, pp. 302–309.
17. A. Bayucan, R. L. Henderson, T. Proett, D. Tweten and B. Kelly, ‘Portable batch system:
External reference specification’, Release 1.1.7, NASA Ames Research Center, June 1996.
18. M. J. Litzkow and M. Livny, ‘Condor–A hunter of idle workstations’, Proc. 8th IEEE Int. Conf.
Distr. Computing Systems, June 1988, pp. 104–111.
19. GENIAS Software GmbH, ‘Codine: Computing in distributed networked environments’,
http://www.genias.de/products/codine/, June 1999.
20. LSF, ‘Product overview’, http://www.platform.com/products/, May 1999.
21. M. Baker, G. Fox and H. Yau, ‘Cluster computing review’, Northeast Parallel Architectures
Center, Syracuse University, New York, November 1995. http://www.npac.syr.edu/.
22. J. P. Jones and C. Brickell, ‘Second evaluation of job queuing/scheduling software: Phase 1
report’, NASA Ames Research Center, NAS Tech. Rep. NAS-97-013, June 1997.
23. F. Tandiary, S. C. Kothari, A. Dixit and E. W. Anderson, ‘Batrun: Utilizing idle workstations
for large-scale computing’, IEEE Parallel Distrib. Technol., 4, 41–48 (1996).
24. N. Carriero, E. Freeman, D. Gelernter and D. Kaminsky, ‘Adaptive parallelism and Piranha’,
IEEE Computer, 28(1), 40–49 (1995).
25. D. Epema, M. Livny, R. van Dantzig, X. Evers and J. Pruyne, ‘A worldwide flock of Condors:
Load sharing among workstation clusters’, Future Gener. Comput. Syst., 12, 53–66 (1996).
26. D. Abramson, R. Sosic, J. Giddy and B. Hall, ‘Nimrod: A tool for performing parameterized
simulations using distributed workstations’, 4th IEEE Symp. High-Performance and Distributed
Computing, August 1995.
27. A. Grimshaw, J. B. Weissman, E. A. West and E. C. Loyot, ‘Metasystems: An approach
combining parallel processing and heterogeneous distributed computing systems’, J. Parallel
Distrib. Comput., 21, 257–270 (1994).
28. F. Berman, R. Wolski, S. Figueira, J. Schopf and G. Shao, ‘Application-level scheduling on
distributed heterogeneous networks’, Supercomputing 96, November 1996.
29. I. Foster, J. Geisler, W. Nickless, W. Smith and S. Tuecke, ‘Software infrastructure for the
I-WAY high-performance distributed computing experiment’, Proc. 5th IEEE Symp. on High
Performance Distributed Computing, 1996, pp. 562–570.
30. R. Stevens, P. Woodward, T. DeFanti and C. Catlett, ‘From I-WAY to the national technology
grid’, Commun. ACM, 11, 51–60 (1997).
31. ‘The Symbio supercomputing system’, National Center for Supercomputing Applications.
http://symbio.ncsa.uiuc.edu/, August 1999.
32. GRM, ‘Resource manager specification v0.3’, http://www.globus.org/scheduler/grm_spec.html,
July 1999.
33. I. Foster, J. Geisler, C. Kesselman and S. Tuecke, ‘Managing multiple communication methods
in high-performance networked computing systems’, J. Parallel Distrib. Comput., 40, 35–48
(1997).
34. S. Fitzgerald, I. Foster, C. Kesselman, G. von Laszewski, W. Smith and S. Tuecke, ‘A directory
service for configuring high-performance distributed computations’, Proc. 6th IEEE Symp. on
High-Performance Distributed Computing, 1997.
35. J. Linn, ‘Generic security service application program interface’, Internet RFC 1508,
1993.
36. I. Foster, D. Kohr, R. Krishnaiyer and J. Mogill, ‘Remote I/O: Fast access to distant storage’,
ANL Technical Report, 1994.