Proposition Yves Robert
==============================================================

Thesis title: Robust scheduling algorithms for mapping application workflows onto dynamic grid platforms.
Research topic: Algorithms, scheduling, performance evaluation.
Project: GRAAL, Ecole Normale Supérieure de Lyon
Advisors: Anne Benoit and Yves Robert
Contact: {Anne.Benoit|Yves.Robert}@ens-lyon.fr

Thesis subject:

The thesis will be conducted within the GRAAL project at Ecole Normale Supérieure de Lyon, under the co-supervision of Anne Benoit and Yves Robert. The GRAAL project focuses on mapping workflow applications onto dynamic grid platforms. While static platforms are characterized by their limited heterogeneity (processor speeds, bandwidths of the communication links) and by the fact that their characteristics change only very slowly over time, dynamic platforms are characterized by their larger size, greater heterogeneity, and the strong variability of their resource parameters. Still, the resources of these platforms (topology, message routes, etc.) and their characteristics are assumed to be known by a centralized control mechanism, even though they change over time.

The thesis will be restricted to parallel applications (common in the domains of image processing, computer vision, query processing, etc.) that do not involve arbitrary interaction schemes. In many cases, the pattern is entirely predetermined, or may vary only within a limited range of patterns. The structured approach to parallelism proposes to abstract commonly used patterns of computation, so that application programmers can explicitly declare that their application follows one or more of these patterns. Developers then have the right conceptual tools to assist them in the implementation of the application. The issues that make correct and efficient parallel programming hard are directly addressed by the patterns library.
An algorithmic skeleton is a programming construct which abstracts such a pattern of processes and interactions, usually offered as a library. The programmer invokes one or more skeletons to describe the structure of the application, writing only the sequential code of each part of the application. Interactions between the different parts of the application are inherited implicitly from the chosen skeleton.

Pipeline parallelism is a very classic pattern of computation. A sequence of inputs traverses a number of stages to produce a sequence of outputs. Every input passes through each stage in order, but parallelism arises because different stages can process different data sets at the same time, when mapped onto different processors. Parallel applications expressed as skeletons typically operate on a very large number of data sets (this is the key motivation for decomposing the application into several pipeline stages). The application workflow is the abstracted operation on the successive data sets. Key metrics for a given workflow are the throughput (which measures the aggregate rate of processing of data) and the latency (which measures the response time of the system for a given data set). Detailed examples will be provided below.

The thesis will target the simultaneous execution of several application workflows that each operate on continuous streams of data sets. Each workflow has its own requirements in terms of computations and communications (for instance, one application may deal with files and another with matrices). All workflows will compete for CPU and network resources, and it is not easy to define a metric to optimize. To achieve a fair balance of resource allocations, one could execute the same number of data sets per application, and try to maximize this number.
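The pipeline skeleton described above can be sketched with threads and FIFO queues. A minimal illustration follows; the three stage functions, the `run_pipeline` helper, and the sentinel protocol are assumptions made for this example, not part of any particular skeleton library:

```python
# Minimal sketch of a pipeline skeleton: one thread per stage, FIFO
# queues between stages, so different stages process different data
# sets concurrently while each data set visits the stages in order.
import threading
import queue

SENTINEL = object()  # marks the end of the input stream

def stage_worker(func, q_in, q_out):
    """Run one pipeline stage: consume inputs, apply func, emit outputs."""
    while True:
        item = q_in.get()
        if item is SENTINEL:
            q_out.put(SENTINEL)  # propagate end-of-stream downstream
            return
        q_out.put(func(item))

def run_pipeline(stages, inputs):
    """Chain the stage functions with queues and run each stage in its
    own thread; results come out in input order (FIFO queues, one
    worker per stage)."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [threading.Thread(target=stage_worker,
                                args=(f, queues[i], queues[i + 1]))
               for i, f in enumerate(stages)]
    for t in threads:
        t.start()
    for x in inputs:
        queues[0].put(x)
    queues[0].put(SENTINEL)
    results = []
    while True:
        out = queues[-1].get()
        if out is SENTINEL:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results

# Example: a 3-stage pipeline over a stream of integers.
print(run_pipeline([lambda x: x + 1, lambda x: x * 2, lambda x: x - 3],
                   range(5)))  # [-1, 1, 3, 5, 7]
```

The programmer only supplies the per-stage sequential functions; the inter-stage communication is inherited from the skeleton, exactly as described above.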
However, some applications may have higher priorities than others, hence the possible introduction of priority factors in the objective function, which would amount to maximizing the weighted minimum throughput achieved over all workflows. This maximization corresponds to the well-known MAX-MIN fairness strategy between the different applications. However, as before, minimizing the maximum response time, or the weighted average response time, could be a very important metric too (especially to the user). In fact, with several application workflows of different sizes and characteristics, the problem gets more complex. Aiming at minimizing the response time might unduly privilege larger applications, so a better (fairer) metric could be to minimize the maximum stretch of any application. The stretch of an application is defined as the ratio of the latency of the application during the execution with all workflows over the latency that would have been achieved if the application had been executed alone on the platform. Either way, both latency and stretch are antagonistic to the throughput, and tradeoffs must be found between these criteria.

To illustrate the scheduling and mapping optimization problems, consider the simplest problem instance, that of mapping a linear workflow onto a fully-interconnected platform. The workflow is composed of n stages. Data sets are fed into the pipeline and processed from stage to stage, until they exit the pipeline after the last stage. The k-th stage receives an input from the previous stage, performs a number of computations, and outputs data to the next stage. The first stage receives its input from the outside world, while the last stage returns the result. We target a heterogeneous platform, with p processors of different speeds, fully interconnected as a (virtual) clique. Note that a physical link between every processor pair is not required.
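The two fairness objectives above reduce to one-line formulas. A small sketch, with entirely made-up throughput, priority, and latency figures:

```python
# Illustrative sketch of the two fairness metrics discussed above.
# All numeric inputs below are made-up examples.

def weighted_min_throughput(throughputs, priorities):
    """MAX-MIN fairness objective: the quantity to maximize is the
    minimum of throughput/priority over all applications."""
    return min(t / w for t, w in zip(throughputs, priorities))

def max_stretch(shared_latencies, alone_latencies):
    """Stretch = latency with all workflows running / latency if the
    application ran alone; the objective is to minimize the maximum."""
    return max(s / a for s, a in zip(shared_latencies, alone_latencies))

# Three applications, the second with priority factor 3:
print(weighted_min_throughput([4.0, 9.0, 2.0], [1, 3, 1]))  # 2.0
# Two applications whose latencies degrade differently under sharing:
print(max_stretch([10.0, 30.0], [4.0, 20.0]))               # 2.5
```

A scheduler would compare candidate resource allocations by these scores, maximizing the first or minimizing the second.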
Instead, we may have a switch, or even a path composed of several physical links, to interconnect any processor pair; in the latter case we would retain the bandwidth of the slowest link in the path as the value of the virtual link bandwidth. Communication contention is taken care of by enforcing the one-port model. In this model, a given processor can be involved in a single communication at any time-step, either a send or a receive. However, independent communications between distinct processor pairs can take place simultaneously. The one-port model seems to fit the performance of some current MPI implementations, which serialize asynchronous MPI sends as soon as message sizes exceed a few megabytes.

The general mapping problem consists in assigning application stages to platform processors. The mapping can be one-to-one (each stage of the application pipeline is mapped onto a distinct processor, which is possible only if n <= p), or interval-based (each participating processor is assigned an interval of consecutive stages). Intuitively, assigning several consecutive stages to the same processor increases its computational load, but may dramatically decrease communication requirements. In fact, the best interval mapping may turn out to be a one-to-one mapping, or instead may enroll only a very small number of fast computing processors interconnected by high-speed links.

A possible objective function is to maximize the throughput, or equivalently to minimize the inverse of the throughput, which is the period of the system. For each data set, a processor reads the input from its predecessor, executes the computations corresponding to all the stages assigned to it, and sends the output to its successor; the processor periodically repeats this operation cycle, and the longest time needed by any processor is defined as the period. Another important objective function is to minimize the response time, or latency, of the system.
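The period of an interval mapping can be made concrete under a deliberately simplified model, assumed here for illustration only: communication and computation do not overlap, and every virtual link has the same bandwidth. All numbers below are illustrative:

```python
# Hedged sketch of the period of an interval mapping, under a
# simplified no-overlap model (an assumption for illustration, not
# the thesis's definitive cost model).

def period(intervals, speeds, w, delta, bw):
    """intervals: list of (first, last) stage indices (inclusive), one
    per enrolled processor, in pipeline order; speeds: speed of each
    enrolled processor; w[k]: work of stage k; delta[k]: data volume
    entering stage k (delta[n] is the final output); bw: link bandwidth.
    A processor's cycle = receive its input + compute its interval of
    stages + send its output; the period is the longest such cycle."""
    cycle_times = []
    for (first, last), s in zip(intervals, speeds):
        t_in = delta[first] / bw
        t_comp = sum(w[first:last + 1]) / s
        t_out = delta[last + 1] / bw
        cycle_times.append(t_in + t_comp + t_out)
    return max(cycle_times)

# 4 stages split into two intervals on two processors of speeds 2 and 4:
w = [4.0, 2.0, 6.0, 2.0]           # stage workloads
delta = [1.0, 2.0, 1.0, 2.0, 1.0]  # data volumes between stages
print(period([(0, 1), (2, 3)], [2.0, 4.0], w, delta, 1.0))  # 5.0
```

Here the first processor's cycle (1 + 6/2 + 1 = 5) dominates the second's (1 + 8/4 + 1 = 4), so the period is 5 and the throughput is 1/5 data set per time unit.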
The latency is defined as the time elapsed between the moment a given data set enters the system and the moment it exits. This measures the responsiveness of the system, hence it is a very important quantity for the user to minimize. However, there is a tradeoff to be found between maximizing the throughput and minimizing the latency. Intuitively, assigning all stages to the fastest processor would suppress all communications and speed up computations, thereby minimizing the latency, but it would achieve a very bad throughput. Conversely, mapping each stage to a different processor is likely to decrease the period, hence to increase the throughput, but the resulting latency will be high, because all inter-stage communications must be accounted for in this latter mapping. Even for a single pipeline skeleton application as discussed here, determining a mapping that achieves both a given minimum throughput and a given maximum latency is an open problem (although solutions exist for simpler, fully homogeneous clusters).

The previous example illustrates the simplest problem instance on the simplest platform, which furthermore is assumed to be static. The first research direction is to address the very same problem on dynamic platforms. What if some processor speed suddenly decreases (or, worse, the processor stops responding)? The throughput and latency would be severely impacted by this sudden bottleneck. Obviously, the mapping must be robust and change dynamically as the execution progresses. Note that we do not envision a run-time system with process migration here. Instead, we want to benefit both from the simple structure of the application workflow and from the stochastic model that will help capture the average evolution of resources. The mapping should be prepared, so to speak, to react to resource variations, and should be capable of reserving more resources than needed so as to re-allocate computations as the execution progresses, based on the observed history of the current execution.
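The tradeoff just described can be demonstrated numerically under a simplified no-overlap model (a deliberate assumption for illustration: each processor's contribution is receive + compute + send, a data set traverses every enrolled processor in turn, and all links share one bandwidth). The mappings and numbers below are made up:

```python
# Illustrative throughput/latency tradeoff under a simplified
# no-overlap model (an assumption, not the thesis's cost model).

def cycle_times(intervals, speeds, w, delta, bw):
    """Per-processor cycle: receive input + compute interval + send output."""
    return [delta[first] / bw + sum(w[first:last + 1]) / s
            + delta[last + 1] / bw
            for (first, last), s in zip(intervals, speeds)]

def period(intervals, speeds, w, delta, bw):
    """Longest processor cycle: the inverse of the throughput."""
    return max(cycle_times(intervals, speeds, w, delta, bw))

def latency(intervals, speeds, w, delta, bw):
    """With no overlap, a data set pays every processor's cycle in turn."""
    return sum(cycle_times(intervals, speeds, w, delta, bw))

w = [10.0, 10.0, 10.0]          # three heavy stages
delta = [1.0, 1.0, 1.0, 1.0]    # light inter-stage data
one_to_one = [(0, 0), (1, 1), (2, 2)]   # three unit-speed processors
single = [(0, 2)]                        # all stages on one speed-2 processor
print(period(one_to_one, [1.0, 1.0, 1.0], w, delta, 1.0))   # 12.0
print(latency(one_to_one, [1.0, 1.0, 1.0], w, delta, 1.0))  # 36.0
print(period(single, [2.0], w, delta, 1.0))                 # 17.0
print(latency(single, [2.0], w, delta, 1.0))                # 17.0
```

The one-to-one mapping wins on period (12 vs 17: better throughput) while the single fast processor wins on latency (17 vs 36), exactly the antagonism discussed above.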
In other words, a robust mapping would incorporate tools to analyze the current execution, and would reserve more resources than would be needed to achieve a given throughput or latency on a static platform, in order to account for the dynamicity of the resources. There are several possible approaches to designing robust mappings. A naive solution would be to replicate the processing of key stages, mapping them onto distinct sets of resources, and gathering results in a data-flow operation mode. The question of determining which stages to replicate, and to what extent, is a very difficult but interesting axis of future work. Indeed, there clearly is a trade-off between the price to pay for robustness (or performance guarantees) and the efficient usage of resources at the system level, especially when several application workflows compete for resources. The metric becomes even more complex, with three antagonistic criteria to deal with: throughput, latency (or stretch), and robustness.

The thesis will target more complex application scenarios than the one illustrated above, even though it already poses open challenges. The first extension is to allow a stage to be mapped onto several processors to increase performance. Although very natural and important, this extension complicates mapping decisions, as a global load balance of resource usage must be achieved. Also, replicating for improved performance and replicating for greater robustness are similar but distinct objectives, and again tradeoffs must be found. The second extension is to consider workflows that are still regular but exhibit a more complex dependence pattern than linear pipelines. In particular, in-trees, out-trees and series-parallel dependence graphs are very common in applications, and it is our goal to define robust scheduling and mapping algorithms for such workflows.
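The performance side of replication (as opposed to replication for robustness) can be sketched under one common round-robin model, assumed here purely for illustration: consecutive data sets are dealt in turn to the k replicas of a stage, so each replica receives one data set every k periods and the slowest replica paces every round:

```python
# Hedged sketch of replicating one stage for throughput, under an
# assumed round-robin allocation model (illustrative, not the
# thesis's definitive model).

def replicated_period(w, replica_speeds):
    """Period of a stage of work w replicated on k processors:
    each replica gets one data set every k periods, and the slowest
    replica is the bottleneck, hence period = w / (k * min speed)."""
    k = len(replica_speeds)
    return w / (k * min(replica_speeds))

print(replicated_period(12.0, [2.0]))            # no replication: 6.0
print(replicated_period(12.0, [2.0, 2.0, 3.0]))  # three replicas: 2.0
```

Note that the speed-3 replica is partly wasted (the rate is set by the speed-2 ones), which illustrates why replication decisions interact with global resource efficiency when several workflows compete.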
We may succeed in dealing with fully general DAGs (Directed Acyclic Graphs), and this possibility may well be explored, depending upon the success with the above three categories. Finally, as already stated, the third extension is to deal with several application workflows simultaneously, aiming at mixing objectives and achieving good levels of throughput, stretch and robustness for each application, a very challenging algorithmic problem indeed.

The work will be conducted along three main lines:
- Synthesis of the relevant literature,
- Design and analysis of new multi-criteria (throughput, latency or stretch, robustness) scheduling techniques adapted to dynamic platforms,
- Integration of the results into the prototype software developed in the GRAAL project (SimGRID, DIET) and validation on the nation-wide Grid5000 architecture.

Bibliography:

General references on scheduling:
- J. Leung (editor), Handbook of Scheduling: Algorithms, Models, and Performance Analysis, CRC Press, 2004.
- A. Legrand and Y. Robert, Algorithmique Parallèle - Cours et exercices corrigés, Dunod, 2003.

References on skeleton workflows:
- Anne Benoit and Yves Robert, Mapping pipeline skeletons onto heterogeneous platforms, Research Report LIP 2007-05, ENS Lyon. To appear in ICCS'07. Available at graal.ens-lyon.fr/~yrobert/.
- M. Spencer, R. Ferreira, M. Beynon, T. Kurc, U. Catalyurek, A. Sussman and J. Saltz, Executing multiple pipelined data analysis operations in the grid, 2002 ACM/IEEE Supercomputing Conference, ACM Press, 2002.
- Jaspal Subhlok and Gary Vondran, Optimal latency-throughput tradeoffs for data parallel pipelines, ACM Symposium on Parallel Algorithms and Architectures (SPAA'96), pp. 62-71, ACM Press, 1996.

--
Yves Robert
GRAAL project, LIP laboratory
CNRS - ENS Lyon - INRIA - UCB Lyon
phone +33 4 72 72 85 86, fax +33 4 72 72 88 06
http://graal.ens-lyon.fr/~yrobert