Proposition Yves Robert
==============================================================

Thesis title: Robust scheduling algorithms for mapping application workflows onto dynamic grid platforms.
Research topic: Algorithms, scheduling, performance evaluation.
Project: GRAAL, Ecole Normale Supérieure de Lyon
Advisors: Anne Benoit and Yves Robert
Contact: {Anne.Benoit|Yves.Robert}@ens-lyon.fr

Thesis subject:

The thesis will be conducted within the GRAAL project at Ecole Normale Supérieure de Lyon, under the co-supervision of Anne Benoit and Yves Robert. The GRAAL project focuses on mapping workflow applications onto dynamic grid platforms. While static platforms are characterized by their limited heterogeneity (processor speeds, bandwidths of the communication links) and by the fact that their characteristics change only very slowly over time, dynamic platforms are characterized by their larger size, greater heterogeneity, and the strong variability of their resource parameters. Still, the resources of these platforms (topology, message routes, etc.) and their characteristics are assumed to be known by a centralized control mechanism, even though they change over time.

The thesis will be restricted to parallel applications (common in the domains of image processing, computer vision, query processing, etc.) that do not involve arbitrary interaction schemes. In many cases, the pattern is entirely predetermined, or may vary only within a limited range of patterns. The structured approach to parallelism proposes to abstract commonly used patterns of computation, so that application programmers can explicitly declare that their application follows one or more of these patterns. Developers then have the right conceptual tools to assist them in the implementation of the application. The issues that make correct and efficient parallel programming hard are directly addressed by the patterns library.
An algorithmic skeleton is a programming construct which abstracts such a pattern of processes and interactions, usually offered as a library. The programmer invokes one or more skeletons to describe the structure of the application, writing only the sequential code of each part of the application. Interactions between the different parts of the application are inherited implicitly from the chosen skeleton.

Pipeline parallelism is a very classic pattern of computation. A sequence of inputs traverses a number of stages to produce a sequence of outputs. Every input passes through each stage in order, but parallelism arises because different stages can process different data sets at the same time, when mapped onto different processors. Parallel applications expressed as skeletons typically operate on a very large number of data sets (this is the key motivation for decomposing the application into several pipeline stages). The application workflow is the abstracted operation on the successive data sets. Key metrics for a given workflow are the throughput (which measures the aggregate rate of processing of data) and the latency (which measures the response time of the system for a given data set). Detailed examples will be provided below.

The thesis will target the simultaneous execution of several application workflows that each operate on continuous streams of data sets. Each workflow has its own requirements in terms of computations and communications (for instance, one application may deal with files and another with matrices). All workflows will compete for CPU and network resources, and it is not easy to define a metric to optimize. To achieve a fair balance of resource allocations, one could execute the same number of data sets per application, and try to maximize this number.
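The pipeline skeleton described above can be sketched with threads and FIFO queues. A minimal illustration follows; the three stage functions, the `run_pipeline` helper, and the sentinel protocol are assumptions made for this example, not part of any particular skeleton library:

```python
# Minimal sketch of a pipeline skeleton: one thread per stage, FIFO
# queues between stages, so different stages process different data
# sets concurrently while each data set visits the stages in order.
import threading
import queue

SENTINEL = object()  # marks the end of the input stream

def stage_worker(func, q_in, q_out):
    """Run one pipeline stage: consume inputs, apply func, emit outputs."""
    while True:
        item = q_in.get()
        if item is SENTINEL:
            q_out.put(SENTINEL)  # propagate end-of-stream downstream
            return
        q_out.put(func(item))

def run_pipeline(stages, inputs):
    """Chain the stage functions with queues and run each stage in its
    own thread; results come out in input order (FIFO queues, one
    worker per stage)."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [threading.Thread(target=stage_worker,
                                args=(f, queues[i], queues[i + 1]))
               for i, f in enumerate(stages)]
    for t in threads:
        t.start()
    for x in inputs:
        queues[0].put(x)
    queues[0].put(SENTINEL)
    results = []
    while True:
        out = queues[-1].get()
        if out is SENTINEL:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results

# Example: a 3-stage pipeline over a stream of integers.
print(run_pipeline([lambda x: x + 1, lambda x: x * 2, lambda x: x - 3],
                   range(5)))  # [-1, 1, 3, 5, 7]
```

The programmer only supplies the per-stage sequential functions; the inter-stage communication is inherited from the skeleton, exactly as described above.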
However, some applications may have higher priorities than others, hence the possible introduction of priority factors in the objective function, which would amount to maximizing the weighted minimum throughput achieved over all workflows. This maximization corresponds to the well-known MAX-MIN fairness strategy between the different applications. However, as before, minimizing the maximum response time, or the weighted average response time, could be a very important metric too (especially to the user). In fact, with several application workflows of different sizes and characteristics, the problem gets more complex. Aiming at minimizing the response time might unduly privilege larger applications, so a better (fairer) metric could be to minimize the maximum stretch of any application. The stretch of an application is defined as the ratio of the latency of the application during the execution with all workflows over the latency that would have been achieved if the application had been executed alone on the platform. Either way, both latency and stretch are antagonistic to the throughput, and tradeoffs must be found between these criteria.

To illustrate the scheduling and mapping optimization problems, consider the simplest problem instance, that of mapping a linear workflow onto a fully-interconnected platform. The workflow is composed of n stages. Data sets are fed into the pipeline and processed from stage to stage, until they exit the pipeline after the last stage. The k-th stage receives an input from the previous stage, performs a number of computations, and outputs data to the next stage. The first stage receives its input from the outside world, while the last stage returns the result. We target a heterogeneous platform, with p processors of different speeds, fully interconnected as a (virtual) clique. Note that a physical link between every processor pair is not required.
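The two fairness objectives above reduce to one-line formulas. A small sketch, with entirely made-up throughput, priority, and latency figures:

```python
# Illustrative sketch of the two fairness metrics discussed above.
# All numeric inputs below are made-up examples.

def weighted_min_throughput(throughputs, priorities):
    """MAX-MIN fairness objective: the quantity to maximize is the
    minimum of throughput/priority over all applications."""
    return min(t / w for t, w in zip(throughputs, priorities))

def max_stretch(shared_latencies, alone_latencies):
    """Stretch = latency with all workflows running / latency if the
    application ran alone; the objective is to minimize the maximum."""
    return max(s / a for s, a in zip(shared_latencies, alone_latencies))

# Three applications, the second with priority factor 3:
print(weighted_min_throughput([4.0, 9.0, 2.0], [1, 3, 1]))  # 2.0
# Two applications whose latencies degrade differently under sharing:
print(max_stretch([10.0, 30.0], [4.0, 20.0]))               # 2.5
```

A scheduler would compare candidate resource allocations by these scores, maximizing the first or minimizing the second.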
Instead, we may have a switch, or even a path composed of several physical links, to interconnect any processor pair; in the latter case we would retain the bandwidth of the slowest link in the path as the value of the virtual link bandwidth. Communication contention is taken care of by enforcing the one-port model. In this model, a given processor can be involved in a single communication at any time-step, either a send or a receive. However, independent communications between distinct processor pairs can take place simultaneously. The one-port model seems to fit the performance of some current MPI implementations, which serialize asynchronous MPI sends as soon as message sizes exceed a few megabytes.

The general mapping problem consists in assigning application stages to platform processors. The mapping can be one-to-one (each stage of the application pipeline is mapped onto a distinct processor, which is possible only if n <= p), or interval-based (each participating processor is assigned an interval of consecutive stages). Intuitively, assigning several consecutive stages to the same processor increases its computational load, but may dramatically decrease communication requirements. In fact, the best interval mapping may turn out to be a one-to-one mapping, or instead may enroll only a very small number of fast computing processors interconnected by high-speed links.

A possible objective function is to maximize the throughput, or equivalently to minimize the inverse of the throughput, which is the period of the system. For each data set, a processor reads the input from its predecessor, executes the computations corresponding to all the stages assigned to it, and sends the output to its successor; the processor periodically repeats this operation cycle, and the longest time needed by any processor is defined as the period. Another important objective function is to minimize the response time, or latency, of the system.
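The period of an interval mapping can be made concrete under a deliberately simplified model, assumed here for illustration only: communication and computation do not overlap, and every virtual link has the same bandwidth. All numbers below are illustrative:

```python
# Hedged sketch of the period of an interval mapping, under a
# simplified no-overlap model (an assumption for illustration, not
# the thesis's definitive cost model).

def period(intervals, speeds, w, delta, bw):
    """intervals: list of (first, last) stage indices (inclusive), one
    per enrolled processor, in pipeline order; speeds: speed of each
    enrolled processor; w[k]: work of stage k; delta[k]: data volume
    entering stage k (delta[n] is the final output); bw: link bandwidth.
    A processor's cycle = receive its input + compute its interval of
    stages + send its output; the period is the longest such cycle."""
    cycle_times = []
    for (first, last), s in zip(intervals, speeds):
        t_in = delta[first] / bw
        t_comp = sum(w[first:last + 1]) / s
        t_out = delta[last + 1] / bw
        cycle_times.append(t_in + t_comp + t_out)
    return max(cycle_times)

# 4 stages split into two intervals on two processors of speeds 2 and 4:
w = [4.0, 2.0, 6.0, 2.0]           # stage workloads
delta = [1.0, 2.0, 1.0, 2.0, 1.0]  # data volumes between stages
print(period([(0, 1), (2, 3)], [2.0, 4.0], w, delta, 1.0))  # 5.0
```

Here the first processor's cycle (1 + 6/2 + 1 = 5) dominates the second's (1 + 8/4 + 1 = 4), so the period is 5 and the throughput is 1/5 data set per time unit.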
The latency is defined as the time elapsed between the moment a given data set enters the system and the moment it exits. This measures the responsiveness of the system, hence it is a very important quantity for the user to minimize. However, there is a tradeoff to be found between maximizing the throughput and minimizing the latency. Intuitively, assigning all stages to the fastest processor would suppress all communications and speed up computations, thereby minimizing the latency, but it would achieve a very bad throughput. Conversely, mapping each stage to a different processor is likely to decrease the period, hence to increase the throughput, but the resulting latency will be high, because all inter-stage communications must be accounted for in this latter mapping. Even for a single pipeline skeleton application as discussed here, determining a mapping that achieves both a given minimum throughput and a given maximum latency is an open problem (although solutions exist for simpler, fully homogeneous clusters).

The previous example illustrates the simplest problem instance on the simplest platform, which furthermore is assumed to be static. The first research direction is to address the very same problem on dynamic platforms. What if some processor speed suddenly decreases (or, worse, the processor stops responding)? The throughput and latency would be severely impacted by this sudden bottleneck. Obviously, the mapping must be robust and change dynamically as the execution progresses. Note that we do not envision a run-time system with process migration here. Instead, we want to benefit both from the simple structure of the application workflow and from the stochastic model that will help capture the average evolution of resources. The mapping should be prepared, so to speak, to react to resource variations, and should be capable of reserving more resources than needed so as to re-allocate computations as the execution progresses, based on the observed history of the current execution.
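The tradeoff just described can be demonstrated numerically under a simplified no-overlap model (a deliberate assumption for illustration: each processor's contribution is receive + compute + send, a data set traverses every enrolled processor in turn, and all links share one bandwidth). The mappings and numbers below are made up:

```python
# Illustrative throughput/latency tradeoff under a simplified
# no-overlap model (an assumption, not the thesis's cost model).

def cycle_times(intervals, speeds, w, delta, bw):
    """Per-processor cycle: receive input + compute interval + send output."""
    return [delta[first] / bw + sum(w[first:last + 1]) / s
            + delta[last + 1] / bw
            for (first, last), s in zip(intervals, speeds)]

def period(intervals, speeds, w, delta, bw):
    """Longest processor cycle: the inverse of the throughput."""
    return max(cycle_times(intervals, speeds, w, delta, bw))

def latency(intervals, speeds, w, delta, bw):
    """With no overlap, a data set pays every processor's cycle in turn."""
    return sum(cycle_times(intervals, speeds, w, delta, bw))

w = [10.0, 10.0, 10.0]          # three heavy stages
delta = [1.0, 1.0, 1.0, 1.0]    # light inter-stage data
one_to_one = [(0, 0), (1, 1), (2, 2)]   # three unit-speed processors
single = [(0, 2)]                        # all stages on one speed-2 processor
print(period(one_to_one, [1.0, 1.0, 1.0], w, delta, 1.0))   # 12.0
print(latency(one_to_one, [1.0, 1.0, 1.0], w, delta, 1.0))  # 36.0
print(period(single, [2.0], w, delta, 1.0))                 # 17.0
print(latency(single, [2.0], w, delta, 1.0))                # 17.0
```

The one-to-one mapping wins on period (12 vs 17: better throughput) while the single fast processor wins on latency (17 vs 36), exactly the antagonism discussed above.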
In other words, a robust mapping would incorporate tools to analyze the current execution, and would reserve more resources than would be needed to achieve a given throughput or latency on a static platform, in order to account for the dynamicity of the resources. There are several possible approaches to designing robust mappings. A naive solution would be to replicate the processing of key stages, mapping them onto distinct sets of resources, and gathering results in a data-flow operation mode. The question of determining which stages to replicate, and to what extent, is a very difficult but interesting axis of future work. Indeed, there clearly is a trade-off between the price to pay for robustness (or performance guarantees) and the efficient usage of resources at the system level, especially when several application workflows compete for resources. The metric becomes even more complex, with three antagonistic criteria to deal with: throughput, latency (or stretch), and robustness.

The thesis will target more complex application scenarios than the one illustrated above, even though it already poses open challenges. The first extension is to allow a stage to be mapped onto several processors to increase performance. Although very natural and important, this extension complicates mapping decisions, as a global load balance of resource usage must be achieved. Also, replicating for improved performance and replicating for greater robustness are similar but distinct objectives, and again tradeoffs must be found. The second extension is to consider workflows that are still regular but exhibit a more complex dependence pattern than linear pipelines. In particular, in-trees, out-trees and series-parallel dependence graphs are very common in applications, and it is our goal to define robust scheduling and mapping algorithms for such workflows.
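The performance side of replication (as opposed to replication for robustness) can be sketched under one common round-robin model, assumed here purely for illustration: consecutive data sets are dealt in turn to the k replicas of a stage, so each replica receives one data set every k periods and the slowest replica paces every round:

```python
# Hedged sketch of replicating one stage for throughput, under an
# assumed round-robin allocation model (illustrative, not the
# thesis's definitive model).

def replicated_period(w, replica_speeds):
    """Period of a stage of work w replicated on k processors:
    each replica gets one data set every k periods, and the slowest
    replica is the bottleneck, hence period = w / (k * min speed)."""
    k = len(replica_speeds)
    return w / (k * min(replica_speeds))

print(replicated_period(12.0, [2.0]))            # no replication: 6.0
print(replicated_period(12.0, [2.0, 2.0, 3.0]))  # three replicas: 2.0
```

Note that the speed-3 replica is partly wasted (the rate is set by the speed-2 ones), which illustrates why replication decisions interact with global resource efficiency when several workflows compete.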
We may succeed in dealing with fully general DAGs (Directed Acyclic Graphs), and this possibility may well be explored, depending upon the success with the above three categories. Finally, as already stated, the third extension is to deal with several application workflows simultaneously, aiming at mixing objectives and achieving good levels of throughput, stretch and robustness for each application, a very challenging algorithmic problem indeed.

The work will be conducted along three main lines:
- Synthesis of the relevant literature,
- Design and analysis of new multi-criteria (throughput, latency or stretch, robustness) scheduling techniques adapted to dynamic platforms,
- Integration of the results into the prototype software developed in the GRAAL project (SimGRID, DIET) and validation on the nation-wide Grid5000 architecture.

Bibliography:

General references on scheduling:
- J. Leung (editor), Handbook of Scheduling: Algorithms, Models, and Performance Analysis, CRC Press, 2004.
- A. Legrand and Y. Robert, Algorithmique Parallèle - Cours et exercices corrigés, Dunod, 2003.

References on skeleton workflows:
- Anne Benoit and Yves Robert, Mapping pipeline skeletons onto heterogeneous platforms, Research Report LIP 2007-05, ENS Lyon. To appear in ICCS'07. Available at graal.ens-lyon.fr/~yrobert/.
- M. Spencer, R. Ferreira, M. Beynon, T. Kurc, U. Catalyurek, A. Sussman and J. Saltz, Executing multiple pipelined data analysis operations in the grid, 2002 ACM/IEEE Supercomputing Conference, ACM Press, 2002.
- Jaspal Subhlok and Gary Vondran, Optimal latency-throughput tradeoffs for data parallel pipelines, ACM Symposium on Parallel Algorithms and Architectures (SPAA'96), pp. 62-71, ACM Press, 1996.

--
Yves Robert
GRAAL project, LIP laboratory
CNRS - ENS Lyon - INRIA - UCB Lyon
phone +33 4 72 72 85 86, fax +33 4 72 72 88 06
http://graal.ens-lyon.fr/~yrobert