Open Grid: A User-Centric Approach for Grid Computing

Walfredo Cirne
Universidade Federal da Paraíba
Departamento de Sistemas e Computação
http://walfredo.cirne.net/comp

Keith Marzullo
University of California San Diego
Computer Science and Engineering
http://www.cs.ucsd.edu/users/marzullo
1. Introduction
Given the massive number of computers that are
networked together, combining the power of thousands
of processors seems a natural way to tackle computationally intensive problems. Moreover, projects like
SETI@home [11] have shown that some applications
can effectively utilize an extremely large number of geographically dispersed processors. (As of February 2000, SETI@home encompassed 1.6 million participants in 224 countries and computed at an average rate of 10 Teraflops.)
However, there are a number of technical and administrative issues in turning independent networked
computers into a generic production platform for high-performance computing. Addressing these challenges
underlies the Grid Computing research area [8]. The
metaphor adopted by Grid Computing is the power grid:
just as electricity is available upon demand from a power
grid, computational power can be made transparently
available upon demand from a computational grid. Considerable research has been done in the last few years
towards the realization of this vision. Grid Computing infrastructures (which are essentially distributed operating systems that support high-performance applications) such as Globus [7], Condor [10], and Legion [9] have been publicly available for years, and we are now starting to see companies such as Entropia [6] whose goal is to provide grid infrastructure commercially.
Despite this progress, though, those at whom Grid Computing is aimed have been slow to adopt this new technology. Even those with very coarse-grain, embarrassingly parallel applications (which are quite amenable to running on a grid) rarely avail themselves of any of the available computational grid infrastructure.
We believe that a major reason for this state of affairs
has to do with the direction that Grid Computing design
has taken. One can argue that Grid Computing research
has fallen into two traps that had once waylaid earlier
research into distributed operating systems: adopting a
closed and inextensible architecture and not supporting
the workflow of developing a distributed application.
In this paper, we argue that the massive acceptance
of Grid Computing technology depends on building solutions that are open (do not require a particular infrastructure), extensible (ease the addition of refinements)
and complete (cover the whole production cycle). In particular, we maintain that a grid-wide notion of working
environment should be centered on the user, instead of
dependent on a particular infrastructure, as in current
designs. We then describe our system, called Open Grid,
that implements our viewpoint and that we hope is a step
towards a widely used grid.
2. The Problem
When considering computational grid environments,
it is useful to think of three kinds of participants. We'll give them names: Sarah, Bob, and David. Sarah is a programmer who wishes to run a coarse-grain parallel application (i.e., an application whose ratio of computation time to communication time is large). Bob is another programmer, who has a fine-grained application.
Both Sarah and Bob wish to run their applications on the
grid in an effort to compute their corresponding results
as quickly as possible. David is a systems administrator
who controls some of the machines in the grid. David’s
goal is to make all users happy, but oftentimes he wants
to give preference to particular users.
Experience has shown that it is hard to develop and
deploy infrastructure that simultaneously enables Sarah
and Bob to easily use the grid, as well as gives David
control over the resources he oversees. Attempts to do so have not been as widely adopted as originally hoped. In order to more effectively transfer Grid Computing technology from research labs to end users, we can first concentrate on targeting less demanding grid users like Sarah. Moreover, Sarah's problem is a very relevant one. Many of the people who do data mining, massive searches (such as key breaking), parameter sweeps, Monte Carlo simulations, fractal calculations (such as the Mandelbrot set), and image-manipulation applications (such as tomographic reconstruction) have the same requirements as Sarah.
The current Grid Computing infrastructures, on the other hand, have either ignored the differences between Sarah's and Bob's requirements or have gravitated towards Bob's, since his are more technically challenging.
Moreover, by committing to a particular Grid Computing
infrastructure (e.g. Condor pools [10], Globus resources
[7], Computational Co-ops [4], or Nile farms [1]), both
Sarah and Bob must restrict their application to utilize
only processors available through such an infrastructure.
Bob may be happy to do this, but Sarah might have access to many other processors elsewhere. And, since her
application is coarse grained, using whatever processors
she can get her hands on might be very fruitful.
We believe that a Grid Computing technology Sarah
will be willing to use has to be open, extensible and
complete. By open, we mean that a solution should not
preclude Sarah from using any computing resources she
can muster. Her grid may be different from Bob's, or from the grid of some other programmer with another coarse-grain application. This implies that the Grid Computing infrastructure should provide an open environment, much in
the way that RPC provides an open environment to
which it is easy to add servers and clients. Of course,
available Grid Computing infrastructures can be used,
but such infrastructures should not be mandatory.
Another way of stating the benefit of an open Grid Computing infrastructure is that it allows Sarah to bring her personal computing environment to the processors of her grid, rather than being forced to use an environment imposed on her by the grid.
Extensibility is paramount because the ever-changing characteristics of a grid (such as load and availability) make improving an application's performance a hard task. Research in the area points to application scheduling as crucial to achieving performance in the Grid environment [2] [11]. Application schedulers aim to improve the performance of the application by evaluating the current state of a grid to (i) select the resources to use, (ii) partition the work among the selected resources, and (iii) submit the partitioned work to the corresponding resources.
Unfortunately, successful application schedulers developed so far are closely coupled to the applications
they schedule. This represents a serious obstacle because
it frames simplicity and efficiency as mutually exclusive
features. A way to address this difficulty is to provide a
generic application scheduler as a default, and enable
Sarah to change or enhance it as desired. This way,
Sarah can start running without worrying about scheduling, and later improve her application's performance
when it proves worthwhile to do so.
Finally, Sarah’s interface to the grid should be complete. It should support the whole production cycle of the
problem it targets, from development to production to
maintenance, instead of just focusing on a particular aspect of the problem. In particular, a single set of abstractions should enable Sarah to develop, deploy, debug, and
execute her application.
3. Open Grid

There are two general issues that arise in enabling Sarah to use her grid as a platform for her (coarse-grain parallel) application. First, Sarah needs a grid-wide working environment, i.e., a set of abstractions that enable her to conveniently use her grid, in the same way that files and processes enable programmers to use a single computer. A working environment provides a common denominator that Sarah can rely upon when programming for her grid, despite the differences in the configuration of the multiple resources that comprise the grid. Moreover, a working environment is key to providing a complete solution for Sarah, one that eases managing input and output files, distributing application code, and otherwise carrying on daily computational activities, now based on a computational grid. In Open Grid, the User Agent provides a grid-wide working environment.

Second, Sarah needs to manage the tasks that compose her application in a way that promotes the performance of her application despite the heterogeneous and ever-changing nature of her grid (which may contain machines of different types, running under different loads, and connected through different networks). Moreover, some of these machines may be unavailable at times: they may fail or become unreachable. Coping manually with the resulting complexity would jeopardize the gains Sarah realizes from using the grid. In Open Grid, the Task Manager has the responsibility of coping with this complexity.

3.1. The User Agent

We call the machines that Sarah already uses for her everyday tasks her home machines. The other machines that Sarah uses through Open Grid to farm out tasks for her application are called grid machines. In general, Sarah has good access to her home machines and has set up a comfortable working environment on them. The User Agent provides abstractions that make it convenient for Sarah to use the grid machines.

The services provided by the User Agent for the grid machines are (i) remote execution, (ii) file transfer, and (iii) the playpen abstraction. Remote execution allows a process running on a home machine to start a process on a grid machine. File transfer allows for the movement of files between home machines and grid machines. Playpens allow Sarah to deal with files and storage over her grid.

Playpens can be either temporary or permanent. Temporary playpens provide Sarah with temporary disk space in a manner that is independent of the local file system conventions of a given machine. Temporary playpens are implemented by creating a directory in a file system that can hold the amount of data specified by Sarah. Permanent playpens enable Sarah to distribute permanent files (such as binaries) across the grid. Permanent playpens are implemented as directories rooted in Sarah's home directories. A grid machine loads files into permanent playpens from the home machines, using a simple version number comparison to first check whether locally cached versions are in fact the correct versions.
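To make the version check concrete, here is a minimal sketch of how the grid-machine side of a permanent playpen refresh could work. This is only an illustration under assumptions: the VERSION file, the playpen path, and the og-fetch-from-home helper are hypothetical names, since the paper does not prescribe this exact mechanism.

#!/bin/sh
# Hypothetical sketch: refresh a locally cached permanent playpen when its version
# number differs from the version advertised by the home machine.
# og-fetch-from-home is a made-up helper standing in for the transfer primitive.

PLAYPEN="$HOME/.opengrid/playpen/myapp"   # local cache of the permanent playpen
HOME_VERSION="$1"                         # version number sent by the home machine

mkdir -p "$PLAYPEN"
LOCAL_VERSION=$(cat "$PLAYPEN/VERSION" 2>/dev/null || echo "none")

if [ "$LOCAL_VERSION" != "$HOME_VERSION" ]; then
    og-fetch-from-home myapp "$PLAYPEN"        # re-fetch binaries and data files
    echo "$HOME_VERSION" > "$PLAYPEN/VERSION"  # record the now-current version
fi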
The User Agent services are implemented by the
User Agent Daemon, which runs on grid machines, and
the User Agent Server, which runs on home machines.
Since the User Agent provides security-sensitive services
(such as remote execution), the Daemon and the Server
rely upon public-key cryptography to authenticate each
other as being deployed by Sarah. The Daemon runs
with whatever permissions David was willing to grant
Sarah for that grid machine.
A bootstrapping problem occurs in Open Grid: the
Daemons and Servers themselves together comprise a
distributed application that needs to be installed and
monitored. This is the task of the User Agent Factory.
We have built a User Agent Factory as a set of scripts
that use crontab to start Servers and ssh to start Daemons. However, we cannot anticipate which mechanisms
will be available for Sarah to access the machines that
comprise her grid. We thus expect Sarah to customize
the User Agent Factory to meet the needs of her grid, making it feasible for her to accommodate whatever new
problems she faces in getting a connection to a new set
of machines.
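As an illustration, a minimal Factory along these lines might look as follows. This is a sketch under assumptions: the host names, installation paths, and the og-ua-server/og-ua-daemon binary names are hypothetical, and Sarah would replace the ssh invocations with whatever access mechanism each site actually offers.

#!/bin/sh
# Hypothetical User Agent Factory sketch: keep the Server alive on this home
# machine via crontab, and start a Daemon on each grid machine via ssh.
# Host names and paths are placeholders.

GRID_MACHINES="node1.example.edu node2.example.edu"
OG_BIN="$HOME/opengrid/bin"

# Arrange for cron to launch the User Agent Server periodically (assuming the
# Server exits immediately if another instance is already running).
( crontab -l 2>/dev/null; echo "*/10 * * * * $OG_BIN/og-ua-server" ) | crontab -

# Start a User Agent Daemon on each grid machine.
for m in $GRID_MACHINES; do
    ssh "$m" "nohup $OG_BIN/og-ua-daemon >/dev/null 2>&1 &"
done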
A somewhat unusual feature of the User Agent is
that it contains code provided by the user (namely, Sarah
is expected to customize/write the User Agent Factory).
That is necessary to make Open Grid an open environment. Since we cannot anticipate all that Sarah might
have to do in order to access a given resource, Sarah
contributes code that customizes Open Grid for her purposes. To keep things as simple as possible, Open Grid is designed to keep small the amount of code that Sarah has to provide. But, of course, Sarah can replace
or extend any component of the system as she pleases.
3.2. The Task Manager
The Task Manager, which runs on a home machine,
is the Open Grid part of Sarah's application. Recall that
Sarah’s application consists of a set of tasks that can be
executed independently of each other. Sarah informs the Task Manager which tasks compose her application by sending it og-add-task messages. The Task Manager also provides control commands that allow Sarah to pause, kill, and monitor applications and tasks.
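As a simple usage sketch, Sarah might register the tasks of a parameter sweep and then check on them roughly as follows. og-add-task is the message named in this paper; the argument syntax, the home-task.sh name (which stands for the home task script described next), and the og-status command are illustrative assumptions, since the paper does not fix them.

# Hypothetical usage sketch: register one task per parameter value with the
# Task Manager. The argument syntax of og-add-task is assumed for illustration.
for param in 1 2 3 4 5; do
    og-add-task ./home-task.sh "$param" "input-$param.dat"
done

# og-status is a made-up name for the monitoring command mentioned in the text.
og-status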
Each task is composed of two executables (or scripts): the home task and the grid task. The home task
runs on a home machine. The Task Manager invokes the
home task when a grid machine becomes ready to be
used. The name of the grid machine is passed to the
home task through the OG_PROC environment variable.
The home task performs all necessary set-up activity
(such as creating and mounting playpens, transferring
files, etc.), remotely executes the grid task on machine OG_PROC, performs any finishing-up activity (such as
collecting results and deleting temporary playpens), and
then informs the Task Manager that the task is completed by sending the og-task-done message to it. Figure
1 shows a simple home task written in shell script, in
which messages are sent to the Task Manager through
the invocation of Open Grid commands. Note that og-create-playpen, og-home2grid, og-grid2home, and og-remote-exec are supported by the User Agent.
# home task script
TaskParams="$1"
TaskInputs="$2"
# create a temporary playpen on the grid machine named by OG_PROC
PlayPen=`og-create-playpen $OG_PROC`
# send the input files and the grid task executable to the playpen
og-home2grid $TaskInputs gridtask $PlayPen
# run the grid task remotely with the task parameters
og-remote-exec $OG_PROC gridtask $TaskParams
# bring the result files back to the home machine
og-grid2home $PlayPen:result* ~/result-dir
# tell the Task Manager this task has completed
og-task-done $OG_PROC
Figure 1: A simple task script
The grid task runs on a grid machine and performs the task per se. Figure 2 depicts the actions that take place in executing a task with Open Grid. Circles indicate components that are supplied by Open Grid, and squares indicate components that Sarah is expected to write.
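The grid-side counterpart of Figure 1 can be as small as the computation itself. A hypothetical grid task for a simulation might look like the sketch below; the simulator name is a placeholder, and the output is written as result.out so that the og-grid2home line of Figure 1 (which collects $PlayPen:result*) picks it up.

# grid task script (runs on the grid machine, inside the playpen)
# "my-simulator" is a placeholder for Sarah's actual executable.
TaskParams="$1"
./my-simulator $TaskParams > result.out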
Two main concerns of the Task Manager are scheduling and fault recovery. In our context, scheduling consists of selecting which machine executes each task.
What makes this non-trivial is that the machines that compose Sarah's grid are not only heterogeneous but their availability also varies over time due to the load generated by other users. In particular, we want to avoid assigning the last tasks in an application to slow or loaded machines, as such an assignment greatly increases the execution time of the application as a whole.
There has been work addressing this problem, primarily based on monitoring resources throughout the network [2] [13]. Unfortunately, such monitoring depends upon grid infrastructure that, by design, we want to avoid requiring. We cope with this problem by replicating the last tasks of an application among many machines. This way, the unfortunate assignment of a task to a slow or loaded machine can be neutralized by the later assignment of the same task to another machine.

Note that this strategy is only possible because tasks are independent. Likewise, task independence offers us a very straightforward way to deal with fault recovery: failed tasks are simply restarted on the next available machine.
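The following toy script sketches the spirit of this default policy, assuming bash: dispatch every task once and then keep re-dispatching tasks that have not reported og-task-done, so that a copy stranded on a slow or failed machine cannot delay the application. All names are placeholders, and the dispatch function only simulates execution; in Open Grid the Task Manager would instead invoke the corresponding home task on an idle grid machine.

#!/bin/bash
# Toy sketch of the default scheduling policy: replicate unfinished tasks.

TASKS=(t1 t2 t3 t4)
declare -A DONE                      # DONE[task]=1 once its og-task-done arrives

dispatch() {
    local task=$1
    echo "dispatching $task"         # real code: run the home task with OG_PROC set
    # Simulate that a given copy may or may not finish (slow or failed machine):
    [ $((RANDOM % 2)) -eq 0 ] && DONE[$task]=1
}

while :; do
    pending=()
    for t in "${TASKS[@]}"; do
        [ -z "${DONE[$t]}" ] && pending+=("$t")
    done
    [ ${#pending[@]} -eq 0 ] && break                  # every task has reported done
    for t in "${pending[@]}"; do dispatch "$t"; done   # (re)dispatch unfinished tasks
done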
Figure 2: Sequence of events for running a task. (Diagram, described here in text: add-task (1) is sent to the Task Manager on the home machine; the Task Manager invokes the home task (2); the home task issues playpen, file transfer, and remote execution requests (3, 3a) through the User Agent Server, which interacts with the User Agent Daemon on the grid machine (3b) to run the grid task (3c); finally, task-done (4) is reported back to the Task Manager.)
4. The Open Grid Prototype
We have implemented a first prototype of Open Grid. The prototype is somewhat simpler than described in the previous section (in particular, the User Agent is implemented through ssh/scp and has less functionality). Yet, it allows us to investigate the efficacy of our approach.
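As a rough idea of what such an ssh/scp-based User Agent looks like, the bash sketch below gives plausible, simplified realizations of the primitives used in Figure 1. These are illustrative assumptions rather than the prototype's actual code, and they presume that password-less ssh access to the grid machines has already been set up.

# Hypothetical, simplified bash wrappers layering the User Agent primitives on
# ssh/scp. $OG_PROC names the grid machine, as in Figure 1.

og-create-playpen() {           # create a temporary playpen (directory) remotely
    ssh "$1" 'mktemp -d'
}

og-home2grid() {                # push a local file into a playpen on the grid machine
    scp "$1" "$OG_PROC:$2"
}

og-grid2home() {                # pull results from the grid machine back home
    scp "$OG_PROC:$1" "$2"
}

og-remote-exec() {              # run a command on a grid machine
    local host=$1; shift
    ssh "$host" "$@"
}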
We used the Open Grid prototype to run most of the
simulations discussed in [5]. In total, we conducted
around 600,000 simulations during a 40-day period, using 178 processors located in 6 different administrative
domains (4 at University of California San Diego, 1 at
San Diego Supercomputer Center and 1 at Northwestern
University). The processors were in normal production
(i.e., they were not dedicated to us at any point in time).
The processors were in either Intel machines running Linux or Sparc machines running Solaris. For our
application, the fastest processor was about 10 times
faster than the slowest one, but the delivered speed of
course varied with the load.
Using Open Grid, the 600,000 simulations took 16.7
days, distributed over a 40-day period (the remaining
time was used to analyze the latest results and plan the
next simulations). In contrast, our desktop machine (an
UltraSparc running at 200MHz) would have taken about
5.3 years to complete the 600,000 simulations (had it
been dedicated to only that task). Using solely the fastest machine available to us (again, in dedicated mode) would reduce this time to 2.2 years, still 48 times slower than using Open Grid.
Note that the machines we used shared no common
software except ubiquitous Unix utilities such as emacs,
ssh, and gcc. In particular, Grid Computing software
(more precisely, Globus [7]) was installed only in a single administrative domain. Moreover, access mechanisms varied from one administrative domain to another.
For example, we had to cross a firewall for the machines
in one of the domains. Also, we were required to run at
lower priority in four of the administrative domains.
This lack of deployed infrastructure and the corresponding diversity in access mechanisms reinforced our
impression that a solution for Sarah must be open and
complete, allowing her to use whatever resources she has
access to. Moreover, the grid-wide working environment provided by the User Agent freed us from worrying about whether software and data were up to date across the different administrative domains. We feel this was key to enabling us to focus on our simulations (our final goal at that moment), making the use of our grid a productive endeavor.
5. Related Work
There has been considerable activity over the last few years in creating infrastructure to enable grid computing. Projects like Globus [7] and Legion [9] aim to provide comprehensive support for grid computing, a goal that is starting to be pursued commercially by companies such as Entropia [6]. Other Grid Computing projects have more focused goals, targeting, for example, high-throughput applications (as Condor [10] does) or high-energy physics applications (as Nile [1] does). Yet other efforts have addressed specific aspects of the Grid Computing infrastructure, such as supporting the federation of independent sites into large-scale grids, as the Computational Co-op [4] does. All these projects view Grid Computing infrastructure as being deployed as universally available system services.
Open Grid, in contrast, adopts a user-centric approach for providing Grid Computing services. User-centric approaches are recognized as the best strategy for scheduling in grids, in what became known as application scheduling [2] [8] [11]. Open Grid takes the lessons
of application scheduling a step further by complementing them with (i) a working environment that enables
Sarah (our archetypical user) to conveniently use her
grid in all phases of the production cycle, and (ii) a default scheduler that makes it possible for Sarah to start
using her grid without investing time and effort to deploy a customized application scheduler.
Open Grid is similar in concept to systems like
SETI@home [11], Everyware [14], and APST [3]. These
systems deploy grid applications that carry their own
schedulers, create their own grid-wide abstractions, and
run over a variety of system-centric infrastructures.
Similarly, Open Grid enables Sarah to build a system
with these three characteristics. Open Grid differs from systems like SETI@home, Everyware, and APST in that such systems are tightly coupled with the applications they support, providing specific solutions for their applications. Open Grid, on the other hand, is designed as a framework that Sarah uses to build her grid application. Open Grid allows Sarah to have her application running with a minimum of effort spent on grid concerns, while not precluding Sarah from putting in such effort when she deems it worthwhile.
6. Conclusions

The Computational Grid is based on the ideas of distributed heterogeneous operating systems and on the needs of programmers writing large-scale, high-performance parallel applications. Up to now, though, most of these programmers have not benefited from Grid Computing. We feel that the current Grid Computing infrastructures have not served in this capacity because they have not been open, extensible, and complete solutions.

This paper introduces Open Grid, a Grid Computing solution designed to be open, extensible, and complete, and thus to make it possible to deploy Grid Computing to the users it is intended to serve. Open Grid is open in the sense that it does not require (although it can use) any infrastructure to be installed throughout the grid. There are promises that systems like Globus [7] and companies like Entropia (which commercializes SETI@home-like infrastructure) [6] will provide a grid of massive scale. But it will be some time (if ever) before one of these becomes dominant. Until then, an open user-centric approach is needed to allow a programmer to define her own grid with whatever computers she can gain access to.

Open Grid is extensible in the sense that it makes it possible for a user (who we name Sarah) to customize any part of the system, enabling Sarah to take advantage of specialized knowledge of her application in order to improve performance. Yet, Open Grid implements sensible defaults wherever possible, greatly reducing the grid-related effort Sarah has to expend before she can start running her application.

Open Grid is complete in the sense that it supports all activities Sarah has to perform to effectively use her grid, from development and debugging, through deployment and execution, to result collection and maintenance. This is accomplished by providing a working environment that enables Sarah to view and reason about her grid as a whole.

Open Grid is not a completely generic solution, though. Open Grid assumes Sarah has a coarse-grain parallel application. While this assumption makes Open Grid of no value to a user (who we name Bob) with a fine-grain, tightly coupled application, a large number of applications match Sarah's requirements. We focus on Sarah (instead of on Bob) because her requirements are simpler, which makes a comprehensive solution for Sarah simpler than one for Bob. We see solving Sarah's problem as the natural first step towards the mass deployment of Grid Computing technology.
References
[1] A. Amoroso, K. Marzullo, and A. Ricciardi. Wide-Area Nile: A Case Study of a Wide-Area Data-Parallel Application. ICDCS'98 - International Conference on Distributed Computing Systems, May 1998.
[2] F. Berman, R. Wolski, S. Figueira, J. Schopf, and G. Shao. Application-Level Scheduling on Distributed Heterogeneous Networks. Supercomputing'96.
[3] H. Casanova, G. Obertelli, F. Berman, and R. Wolski. The AppLeS Parameter Sweep Template: User-Level Middleware for the Grid. Supercomputing 2000, November 2000.
[4] W. Cirne and K. Marzullo. The Computational Co-op: Gathering Clusters into a Metacomputer. Proceedings of IPPS/SPDP'99, April 1999.
[5] W. Cirne. Using Moldability to Improve the Performance of Supercomputer Jobs. Ph.D. Thesis, Computer Science and Engineering, University of California San Diego, 2001.
[6] Entropia Web Page. http://www.entropia.com/
[7] I. Foster and C. Kesselman. The Globus Project: A Status Report. Proceedings of the IPPS/SPDP'98 Heterogeneous Computing Workshop, pages 4-18, 1998.
[8] I. Foster and C. Kesselman (editors). The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publishers, July 1998.
[9] A. Grimshaw, A. Ferrari, F. Knabe, and M. Humphrey. Legion: An Operating System for Wide-Area Computing. IEEE Computer, 32(5):29-37, May 1999.
[10] M. Litzkow, M. Livny, and M. Mutka. Condor - A Hunter of Idle Workstations. Proceedings of the 8th International Conference on Distributed Computing Systems, pages 104-111, June 1988.
[11] SETI@home Web Page. http://www.seti.org/science/setiathome.html
[12] J. Weissman. Gallop: The Benefits of Wide-Area Computing for Parallel Processing. Journal of Parallel and Distributed Computing, 54(2), November 1998.
[13] R. Wolski, N. Spring, and J. Hayes. The Network Weather Service: A Distributed Resource Performance Forecasting Service for Metacomputing. Journal of Future Generation Computing Systems, 1999.
[14] R. Wolski, J. Brevik, C. Krintz, G. Obertelli, N. Spring, and A. Su. Running EveryWare on the Computational Grid. Supercomputing'99, 1999.