Evaluating parallel job scheduling algorithms is a challenging task. Many factors affect the outcome 
of the evaluation, including workload and application choices, metrics, choice of scheduling 
algorithms and their parameters, and the hardware used or assumed. The large, non-continuous parameter 
space renders analytical evaluation extremely difficult, while simulation evaluations are very 
sensitive to the assumptions undertaken, sometimes resulting in contradicting results. Experimental 
evaluations are even rarer, due to the complexity of implementation, and the difficulty in obtaining a 
dedicated large machine for long periods of time.

This talk will describe our efforts to evaluate various scheduling strategies in a dynamic, realistic 
environment. Of the various parameters that affect job scheduling performance, workload and 
implementation play a pivotal role. Most studies either employ simulations and/or simplistic 
workloads, which contain many assumptions, including unknown ones. Instead, we developed a scheduling 
framework that implements several existing and novel algorithms on various cluster architectures of up 
to hundreds of nodes. This framework was used to produce the first experimental evaluation of several 
job scheduling strategies in a dynamic workload environment, using synthetic and scientific MPI 
applications.

This talk will discuss the challenges involved in evaluating job scheduling strategies, and the 
approaches we chose to address them. An analysis will be presented of three factors affecting 
scheduling systems running dynamic workloads: multiprogramming level, time quantum, and the use of 
backfilling for queue management -- and how they depend on offered load.

Joint work with Dror Feitelson (Hebrew U.), Fabrizio Petrini (LANL), and Juan Fernandez (Murcia U.)